Building a custom tiered storage and query layer is fascinating, but not for everyone.
I love deep technical posts explaining how smart engineers solve big problems. As with any solution, these are tailor made to the needs of their business and may or may not apply directly to you. However, there is so much knowledge in these posts that as I read through, I find my mind racing trying to comprehend it all. I have so many questions and no one to ask.
In this post I want to share my raw thoughts with you as I parse through Zendesk Engineering’s post explaining how they moved from DynamoDB to tiered storage with MySQL + S3. Let’s jump in…
TLDR
Zendesk needed a cheaper way to serve low latency queries from their application. Their existing solution, built on DynamoDB, was too expensive, and they wanted both low latency and low cost. To get the best of both worlds they designed a solution using MySQL, S3 and a few services that handle data movement between systems and cleanup tasks. Even more interesting is the serving layer they built to route and optimize user queries - it’s not every day that you see this.
The solution also implements a few interesting techniques to improve query performance, many of which already exist in popular analytics tools today.
Who is this for?
As an engineer, this is a great post to learn how to design a custom tiered storage system. Many tools on the market today already offer this, but it’s helpful to understand this pattern in a little more detail.
In addition, much of the optimization discussion stems from the need to create a custom serving layer for the Zendesk application. For analytics use cases we often use off-the-shelf tools, but for embedded analytics it may be more efficient to build your own serving/query service. As a data and software engineer, this is worth paying attention to.
This post is not for someone who’s looking for a turnkey solution or a set of products that come pre-built with tiered storage and low latency serving layers for the purpose of analytics or even embedded use cases.
This solution is home-grown and requires a deeper understanding of query engines, query optimization and data layout design, which is not well covered in the post. If you have a skilled engineering team that could build and support a solution like this and can’t seem to find an existing tool to fit your needs, then Zendesk’s post is a good place to start.
Or just read it to learn something new!
Understanding the use case
I’m a bit surprised that little detail is shared about the business use case or what customer-facing feature this solution is actually serving. However, the author, Shane Hender, gives us a hint by explaining that it’s designed to ingest a stream of application events that users can later query and analyze. Not much, but it will do.
Being a customer-facing capability, most likely exposed through the Zendesk app, round-trip latency and responsiveness are super important, which explains why they originally chose DynamoDB (DDB) - for its low latency and dynamic scaling.
Why the original design no longer worked
The original design includes two home-grown services: the first consumes events from Kafka and stores them in DDB, and the second enables the Zendesk app to issue queries against the data in DDB. This implies lots of concurrent reads and writes to DDB.
I would love to know what scale they’re working with. How many requests per second, or how much daily data volume? They mentioned saving over 50%, but without a baseline it’s hard to tell whether that’s a significant saving. You’d hate to rebuild a solution for minimal ROI.
Anyways, DDB charges users based on Read and Write Capacity Units consumed. It’s kind of like having to pay every time you speak a sentence and every time you hear one. Imagine you have lots of friends; that’s one expensive party💰
To mitigate some of the cost, Zendesk implemented DDB’s provisioned-capacity billing which basically puts a cap on costs. If the application tries to consume more RCU/WCUs than provisioned, it will get throttled.
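To make the billing model concrete, here is a back-of-the-envelope cost calculation. The per-unit prices and traffic numbers below are illustrative assumptions I’m making for the example, not Zendesk’s actual rates or volumes - always check current AWS pricing for your region.

```python
def monthly_ddb_cost(writes_per_sec: float, reads_per_sec: float,
                     wcu_price_hr: float = 0.00065,   # assumed $/WCU-hour
                     rcu_price_hr: float = 0.00013):  # assumed $/RCU-hour
    """Rough provisioned-capacity cost: 1 WCU covers one 1 KB write/sec,
    1 RCU covers one 4 KB strongly consistent read/sec."""
    hours_per_month = 730  # average hours in a month
    return (writes_per_sec * wcu_price_hr
            + reads_per_sec * rcu_price_hr) * hours_per_month

# Hypothetical sustained load: 5,000 writes/sec and 20,000 reads/sec.
cost = monthly_ddb_cost(5_000, 20_000)  # ≈ $4,270/month at the assumed rates
```

Under these assumed prices, capacity you provision is billed every hour whether or not it is used, which is exactly why under-provisioning (and throttling) becomes tempting as a cost lever.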
Throttling is important to prevent things from crashing and burning, but if not handled properly in your application, it could result in service degradation when you least expect it. I’ve been there more than I care to admit 😭
Zendesk also implemented Global Secondary Indexes (GSI) to, I assume, improve lookup performance for large tables that span multiple DDB partitions. GSI is charged separately and as the base table grows so does the GSI and its costs 💰💰
Long story short, DDB got too expensive and would only get worse as their usage increases. Time to find something else…
Alternatives that didn’t quite make the cut
Shane briefly lists a few alternative tools they evaluated, but didn’t share much detail about the requirements. Reading between the lines, and with my own understanding of the high-level needs of data-intensive, customer-facing apps (volume, flexibility, cost and latency), it’s easy to see why S3, Hudi (on S3), Elasticsearch and MySQL alone wouldn’t make the cut.
My first thought was, why not ClickHouse? ClickHouse is designed for this exact use case. It has built-in Kafka ingestion, low latency queries, indices (and other lookup optimizations), and it is cost effective - you can self-manage or deploy in ClickHouse Cloud.
A few of the comments on the Zendesk post suggested Amazon Athena as an alternative because it makes it easy to query data in S3. I’m a big fan of Athena 👸, but it’s not really designed for low latency use cases. Plus they’d have to build a solution to continuously optimize files in S3 and update metadata in the AWS Glue Data Catalog to get the best performance. Or they could easily automate the whole thing using a product like Upsolver.
Shane’s comment about Hudi delaying data by 24 hours wasn’t clear to me. I did some more digging but couldn’t find a technical reason for Hudi to cause this delay. Furthermore, there is no mention of them evaluating Hudi’s StreamSync utility, which is designed for this exact use case: ingesting streaming events from Kafka into a Hudi table. Zendesk even went the extra step of building a custom index and bloom filter to improve file lookup and query performance, something Hudi already provides out of the box. I’m not sure why they didn’t consider this from the beginning.
Understanding the final solution
They settled on a final solution that does three things:
Ingest new event data from Kafka into MySQL.
Every hour dump rows from MySQL to S3.
Every 4 hours delete old rows from MySQL.
This, in essence, is the tiered storage solution. It is a bit more complicated than it seems: a few additional steps are required to keep track of the tiering process.
New events are written to a buffer table
The hourly job reads 10K rows and writes them to S3
The job then updates a Metadata table, marking the last row processed
Periodically update bloom filter and count-min sketches in Metadata table
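The steps above can be sketched roughly as follows. This is my own minimal reconstruction, with sqlite3 standing in for MySQL and a dict standing in for S3; the table names, column names and batch size policy are hypothetical, not Zendesk’s actual schema.

```python
import json
import sqlite3

# sqlite stands in for MySQL; a dict stands in for S3.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE buffer_events (id INTEGER PRIMARY KEY, payload TEXT)")
db.execute("CREATE TABLE tiering_metadata (last_id INTEGER, s3_key TEXT)")
fake_s3 = {}

BATCH = 10_000  # the post mentions 10K-row batches

def hourly_dump():
    """Read the next batch past the last tiered row, 'upload' it, record progress."""
    last = db.execute(
        "SELECT COALESCE(MAX(last_id), 0) FROM tiering_metadata").fetchone()[0]
    rows = db.execute(
        "SELECT id, payload FROM buffer_events WHERE id > ? ORDER BY id LIMIT ?",
        (last, BATCH)).fetchall()
    if not rows:
        return None
    key = f"events/{rows[0][0]}-{rows[-1][0]}.json"
    fake_s3[key] = json.dumps([p for _, p in rows])  # the S3 upload step
    db.execute("INSERT INTO tiering_metadata VALUES (?, ?)", (rows[-1][0], key))
    return key

def cleanup(before_id):
    """The 4-hourly job: drop buffer rows already safely tiered to S3."""
    db.execute("DELETE FROM buffer_events WHERE id <= ?", (before_id,))
```

The key design point is that the metadata table is the source of truth for progress: if the dump job crashes after uploading but before recording `last_id`, the next run re-uploads the same batch, so the object naming must make that overwrite idempotent.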
Using a lookup table for tracking is a common pattern and can work well at scale. However, in this case, the lookup table is also used at query time to identify which files, in S3, contain the rows required to satisfy the query.
Generally this is ok, but over time I expect the Metadata table to grow, assuming Zendesk’s data volume also continues to grow. My concern is that this table will become a bottleneck and impact query performance. But maybe it’s not an immediate concern.
Another thought that came to mind is that they introduced an artificial delay of 1 hour before new data is available in S3. It’s not clear to me how the “Uploader” service works. Does it use JDBC to read the rows into memory and then write them to S3? Maybe it uses the AWS SDK, the RDS Data API, or Aurora’s SELECT INTO OUTFILE S3 command (seems unlikely)? I would think there are better ways to handle this without writing/managing custom code. I’d love to know more about this service.
Also, why not use CDC? Building a service that listens to CDC events on the “buffer” table, batches them into ideal-sized chunks (10K?) and writes them to S3 would be more efficient, and would make new data available in S3 almost as soon as it is ingested - so why wait an hour?
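The batching logic such a CDC consumer would need is small. Here is a sketch of just that piece, assuming some upstream source of change events (e.g. Debezium); the class name, thresholds and `upload` callable are my own invention.

```python
import time

class CdcBatcher:
    """Buffer CDC insert events and flush when the batch is full or too old.

    `upload` is any callable that ships a list of rows, e.g. a wrapper
    around boto3's put_object. Thresholds here are illustrative.
    """
    def __init__(self, upload, max_rows=10_000, max_age_sec=60.0):
        self.upload = upload
        self.max_rows = max_rows
        self.max_age = max_age_sec
        self.rows = []
        self.started = None  # when the current batch began

    def on_event(self, row):
        if not self.rows:
            self.started = time.monotonic()
        self.rows.append(row)
        # Flush on size or age, whichever comes first, so data lands in S3
        # within seconds-to-minutes instead of on a fixed hourly schedule.
        if (len(self.rows) >= self.max_rows
                or time.monotonic() - self.started >= self.max_age):
            self.flush()

    def flush(self):
        if self.rows:
            self.upload(self.rows)
            self.rows = []
            self.started = None
```

With the age bound in place you trade the fixed one-hour delay for a tunable freshness/file-size tradeoff: small `max_age_sec` means fresher data but more, smaller objects to compact later.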
Lots of questions about this design…
Querying the data in the new solution
The first part of the post covered the data ingestion and tiering to S3 (tracking, cleanup, etc.). In the second part, Shane shares more detail on how they query the data via a custom serving layer.
What is a serving layer you ask? In traditional analytics, the serving layer is the query engine (Athena, Snowflake, etc.). They expose a SQL interface that you query and, after some secret magic, results are returned. In the case of embedded analytics, which is the Zendesk use case, they built a custom service that listens to API requests from the client application and, using some home-grown magic, finds and returns the relevant rows.
The home-grown magic discussed in the post uses the Metadata table in MySQL to identify which files contain the rows that will satisfy the user query. Then using S3-Select, they execute a SQL SELECT query on the actual files to retrieve the rows.
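For the second step, here is the shape of an S3 Select request the serving layer might issue once the Metadata table has narrowed the query down to candidate files. The bucket, key and field names are hypothetical; the boto3 `select_object_content` call itself is the real API.

```python
def build_select_request(bucket: str, key: str,
                         start_ts: int, end_ts: int) -> dict:
    """Build the kwargs for a boto3 select_object_content call.

    Assumes the S3 objects are JSON Lines with a numeric `ts` field -
    an illustrative schema, not Zendesk's actual layout.
    """
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        # S3 Select SQL is deliberately limited: a single object as the
        # only "table", no joins, a restricted function set.
        "Expression": ("SELECT * FROM S3Object s "
                       f"WHERE s.ts BETWEEN {start_ts} AND {end_ts}"),
        "InputSerialization": {"JSON": {"Type": "LINES"}},
        "OutputSerialization": {"JSON": {}},
    }

params = build_select_request("events-archive", "events/100-10099.json",
                              1000, 2000)
# With real AWS credentials you would then run:
#   import boto3
#   resp = boto3.client("s3").select_object_content(**params)
#   for event in resp["Payload"]:
#       ...  # stream of Records / Stats events
```

Because each call scans a single object, the serving layer has to fan out one request per candidate file and merge the results - which is exactly why the file-pruning metadata matters so much.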
The first question that came to mind was, “wait, why aren’t you querying the MySQL Buffer table first?” Since the data is freshest in the Buffer table, I would expect them to query that table first. If the time window specified by the query is older than 1 hour, then query S3 directly. But the post has it the other way - query S3 first, and if there are no results, then query MySQL. Maybe users most often query historical data, so it’s more efficient to start in S3?
The second thing is, I’m super impressed that they built their own serving layer. This is a really powerful, but advanced, design pattern to unify access to data stored across multiple systems and storage tiers. I don’t see this discussed very often 🎩
There are a few common challenges with this approach though:
Query latency is not consistent across sources, potentially resulting in a bad user experience. Shane did include a latency graph that shows S3-Select consistently performing better than Aurora, which is very surprising 👀. That possibly makes this less of an issue for them.
Security and access control is different between source systems and may require duplicating some data, partitioning or single-tenant designs to ensure privacy and avoid exfiltration. There is no mention of any security requirements in the post, so I’m not sure how they are handling data privacy.
Supporting a rich enough query syntax to offer differentiated capabilities to users is hard. In this case, MySQL’s syntax is very rich and mature, but S3-Select is significantly limited. As a product owner, I would want parity and flexibility so I could quickly experiment with new capabilities.
Performance optimization is a full-time job, especially when implemented for multiple source systems. Expect duplication of effort when trying to solve problems already solved by OSS or commercial engines.
Overall the approach is technically sound, but I expect it will require a lot of hand-holding and constant tuning and tweaking. This is not a solution I would recommend for most people. Tools like Presto, Trino, Amazon Athena and Starburst offer federated querying with a consistent query syntax and tons of performance optimizations built in. However, they may not meet Zendesk’s end-to-end latency requirements. ClickHouse would be a better alternative, with its ability to query multiple tiers and serve requests with low latency.
The optimization game…
As an engineer, this is where things get exciting 🤩. Over my career I’ve spent an ungodly amount of time optimizing queries, jobs, SQL, etc., and there are lots of techniques and data structures that can make or break your day. So I pay close attention when others share their optimization war stories 🤔
The new solution was designed to optimize queries that filter data on user defined windows of time. However, the team quickly (I assume it was quick) realized that users wanted to also filter the data based on other fields like user ID, etc. Since the serving layer only optimized for time range queries, filtering on any other column would result in scanning significantly more data than needed. That’s slow and expensive.
This problem isn’t new. I see it often with Spark, Hive, Presto/Trino and others. To solve it, I typically partition or cluster/bucket the data based on the fields used in the WHERE clause. This is helpful, but static to those fields only. If a new field becomes popular for filtering, you’d need to manually repartition/recluster the data - quite expensive for large datasets. Zendesk initially attempted this, but found it quickly got complicated with the number of filter variations and the copies of raw logs and Metadata tables they needed to maintain.
To give more flexibility in how data is filtered, they introduced a Bloom filter, which is used to check whether an item does NOT exist in a set. This lets them check whether a specific row might exist in a file based on a variety of fields, not just timestamps. The filter can also be updated, which makes it friendly to product changes.
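To make the “definitely not / maybe” semantics concrete, here is a minimal Bloom filter. This is a generic textbook sketch, not Zendesk’s implementation; the size and hash-count values are arbitrary.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: answers 'definitely not present' or
    'maybe present', never a guaranteed yes."""
    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = 0  # Python ints make a cheap arbitrary-size bit array

    def _positions(self, item: str):
        # Derive k independent positions by salting one hash function.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: str):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item: str) -> bool:
        # All k bits set -> maybe present; any bit clear -> definitely not.
        return all(self.bits >> p & 1 for p in self._positions(item))
```

In the Zendesk design the per-file filter lets the serving layer skip an S3 file entirely whenever `might_contain` returns False for the queried value - a false positive only costs a wasted file scan, never a wrong answer.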
Next there was the problem of rows being scattered across lots of files. If you wanted to query the logs for user-id = 5, that user ID may have data spread across hundreds of files. To get all the rows, you’d need to scan every single file, which is too slow. To solve this, they introduced another data structure called a Count-Min Sketch. It works by consuming events one at a time and counting the frequency of distinct events - basically, a tally of how many times each event was seen. But because it’s a probabilistic structure, querying it gives you an estimate of the frequencies; there is no guarantee of exactness. That’s probably ok in this scenario, but it could become problematic as the number of files on S3 grows and the overcounting grows with it.
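A Count-Min Sketch fits in a few lines, which makes its one-sided error easy to see: estimates can only overcount, never undercount. Again, this is a generic sketch with arbitrary sizing, not Zendesk’s code.

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counter in fixed memory.
    Guarantees estimate(item) >= true count; hash collisions
    can only inflate the answer."""
    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _idx(self, row: int, item: str) -> int:
        # One salted hash per row gives `depth` independent positions.
        h = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(h[:8], "big") % self.width

    def add(self, item: str, count: int = 1):
        for r in range(self.depth):
            self.table[r][self._idx(r, item)] += count

    def estimate(self, item: str) -> int:
        # Every row overcounts by its own collisions; the minimum across
        # rows is the least-inflated (still one-sided) estimate.
        return min(self.table[r][self._idx(r, item)]
                   for r in range(self.depth))
```

Used per-file, an estimate of zero (or near zero) for `user:5` means the file can be skipped; a positive estimate means the file is worth scanning, even if the exact count is slightly inflated.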
Personally I’m a big fan of probabilistic functions/sketches when estimating for the purpose of reporting. But when they are used in query planning, it’s a bit scary and requires building some guardrails to protect against hairy edges.
The custom ingestion service Zendesk built is responsible for maintaining these data structures, updating them every time data is dumped from MySQL to S3.
When I read this section of the post I loved seeing how these data structures are used, but kept thinking, why not use Apache Parquet? The Apache Parquet file format includes a lot of goodies to optimize access to data, similar to what Zendesk built. You can configure a bloom filter to optimize predicate pushdowns, define column indices to skip reading file pages (groups of rows), and use min/max/count column statistics to skip reading entire files that do not contain the data you need.
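The min/max statistics idea is worth spelling out, since it is the same trick whether the stats live in Parquet footers or in Zendesk’s Metadata table. Here is a pure-Python sketch of stats-based file skipping; the file names, stats and column are made up for illustration.

```python
# Per-file min/max statistics let a planner skip files whose value range
# cannot overlap the query predicate - the same idea Parquet stores per
# row group. All names and numbers below are hypothetical.
files = [
    {"key": "events/0001.parquet", "user_id_min": 1,    "user_id_max": 999},
    {"key": "events/0002.parquet", "user_id_min": 1000, "user_id_max": 4999},
    {"key": "events/0003.parquet", "user_id_min": 5000, "user_id_max": 9999},
]

def files_to_scan(lo: int, hi: int) -> list:
    """Keep only files whose [min, max] range intersects the query's [lo, hi].

    Two ranges intersect iff each one's low end is <= the other's high end.
    """
    return [f["key"] for f in files
            if f["user_id_min"] <= hi and f["user_id_max"] >= lo]
```

With sorted or clustered data the ranges barely overlap and pruning is dramatic; with randomly scattered values every file’s range spans everything and nothing gets skipped - which is exactly the scattered-rows problem the Count-Min Sketch was brought in to mitigate.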
Conclusion
I really enjoyed the technical detail shared in this post. I wanted more detail about the use case, the kind of data and volume/rate of events to give me some idea of scale, but I realize not every company is comfortable sharing this information.
The solution they came up with aligns with many of the best practices I see in the industry, but I’m surprised they chose to build everything rather than leverage more native features and OSS components.
I wouldn’t recommend building something like this unless you have the people and skillsets to maintain it. There are a lot of moving parts, custom code and logic that would ultimately be your, or your successor’s tech debt.
Disclaimer: My review and unpacking of the original post is intended for the purpose of sharing my perspective, opinions and experiences as they relate to the technical solution, approaches and design decisions. It is not intended to put down, dismiss, or mislead. If you feel my assessment is incorrect or want to share more information, please reach out to me directly.