AWS S3 Tables and the race for managed storage
Brief intro to managed storage for analytics
Managed storage is not a new concept. Operational databases and popular data warehouses are responsible for organizing, storing and optimizing how data is stored in their attached, and more recently decoupled, storage layers. The MPP architecture on which most modern warehouse solutions are based depends heavily on the ability to manage - store, shuffle, sort, cluster and index - data objects across a large number of compute nodes as they scale up and down.
Traditional data lakes enabled warehouse vendors to retrofit their managed storage on top of cloud object stores, as Snowflake and Redshift did so well. But because object stores are barebones storage layers, vendors were forced to build a ton of management code and infrastructure to match the capabilities, performance and reliability of their coupled storage layers on a decoupled one. However, the cost savings and customer benefits were well worth the investment…and it paid off in dividends.
In parallel, data engineers at Netflix, Uber, Apple and Alibaba experienced a similar pain as they attempted to further decouple their data platforms, scale them to massive amounts of data and do it all in the most cost-efficient way possible. From that effort we got Apache Iceberg, Apache Hudi and Apache Paimon. Databricks' Delta Lake came out a little earlier.
How open table formats changed our view
With these new standard open table formats (Iceberg, Hudi, Delta), data engineering teams and vendors alike can build their own managed storage layers to help users minimize vendor lock-in and reduce their data storage and processing costs.
They provide the best of both worlds. Users can use whatever query engine they want, but leave the data storage and management to a standard that is interoperable, simple and open.
Although open table formats provide APIs and procedures to manage data - compacting small files, expiring old versions (snapshots) and deleting orphaned files - it remains the user's responsibility to know how to configure these procedures, when to run them and how to handle failures and conflicts. In a small lake deployment with 100+ tables it's manageable, but as lakes grow to 1,000s or even 100,000+ tables, it becomes a full-time job for a sizable engineering team.
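To give a sense of what that do-it-yourself maintenance looks like, here's a minimal sketch using Iceberg's built-in Spark procedures. The catalog and table names are placeholders, and it assumes a Spark environment with an Iceberg catalog named my_catalog already configured:

# Compact small files
spark-sql --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  -e "CALL my_catalog.system.rewrite_data_files(table => 'analytics.events')"
# Expire old snapshots
spark-sql --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  -e "CALL my_catalog.system.expire_snapshots(table => 'analytics.events', older_than => TIMESTAMP '2024-12-01 00:00:00')"
# Delete orphaned files
spark-sql --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  -e "CALL my_catalog.system.remove_orphan_files(table => 'analytics.events')"

Multiply that by every table, plus scheduling, retries and conflict handling, and you can see why it turns into a full-time job.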
You’re probably thinking “I can do this, it’s not that hard” (I heard this hundreds of times in the last year). But when you have tons of other requests piling up, is this really what you want to spend your time doing? I don’t.
Vendors recognized this impending challenge - scaling beyond a small lake - and began to offer managed storage that performs these tasks on behalf of the user: Iceberg Tables for Snowflake, Google BigLake tables and, more recently, BigQuery managed tables for Iceberg, and Databricks Delta Lake tables. Azure embeds Delta Lake into Fabric Lakehouse but doesn't necessarily treat it as "managed storage". Even Upsolver (the company I work for) provides a fully managed Iceberg service with zero-ETL ingestion, Live Tables and management (optimizations, etc.). Notably missing from the party, however, is AWS 🤷♂️
But wait, haven't we gone full circle? Kinda. The main difference is that instead of storing data in proprietary file formats with major vendor lock-in, we're now using open table and file formats that are interoperable and portable between engines and cloud vendors.
With all the noise around open table formats, let's pause to recognize that as an industry we've taken a net-positive step forward 🥳
OK, let's dive into AWS S3 Tables!
What are AWS S3 Tables
AWS S3 Tables is a special type of bucket designed to store and access tabular data using Apache Iceberg. S3 Tables comprise the following:
Built-in, managed data catalog with a new set of APIs and a catalog client library
A Table Bucket resource to decouple access and storage
Managed table optimizations and maintenance
Integrations with AWS Glue Data Catalog and Lake Formation
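To make that list concrete, here's roughly what creating a Table Bucket, a namespace and a table looks like with the AWS CLI (account ID, region and names are placeholders; double-check the exact flags against the CLI reference):

# Create a Table Bucket, then a namespace, then an Iceberg table inside it
aws s3tables create-table-bucket --name my-table-bucket --region us-east-1
aws s3tables create-namespace \
  --table-bucket-arn arn:aws:s3tables:us-east-1:111122223333:bucket/my-table-bucket \
  --namespace my_namespace
aws s3tables create-table \
  --table-bucket-arn arn:aws:s3tables:us-east-1:111122223333:bucket/my-table-bucket \
  --namespace my_namespace --name my_table --format ICEBERG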
Built-in, managed catalog
The built-in catalog provides four main functions:
First, it maps tables to backend storage buckets and paths - you can see the mapping with the s3tables get-table-metadata-location API.
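For example, a call along these lines (ARN and names are placeholders) returns the table's current metadata pointer:

aws s3tables get-table-metadata-location \
  --table-bucket-arn arn:aws:s3tables:us-east-1:111122223333:bucket/my-table-bucket \
  --namespace my_namespace --name my_table
# The response points at the table's current Iceberg metadata inside the hidden managed bucket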
Second, it provides APIs to manage S3 Table bucket resources, like creating namespaces and tables. It also provides a client library JAR for tools to use, which translates standard Iceberg catalog calls into S3 Tables API calls - this is what AWS means when it says S3 Tables "speaks Iceberg". It also makes using S3 Tables simple, with little to no initial integration work.
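To give a feel for that integration, here's a sketch of wiring the client library into Spark; the artifact versions and the warehouse ARN are illustrative and may have changed since I tried this:

spark-shell \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,software.amazon.s3tables:s3-tables-catalog-for-iceberg-runtime:0.1.3 \
  --conf spark.sql.catalog.s3tables=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.s3tables.catalog-impl=software.amazon.s3tables.iceberg.S3TablesCatalog \
  --conf spark.sql.catalog.s3tables.warehouse=arn:aws:s3tables:us-east-1:111122223333:bucket/my-table-bucket

From Spark's point of view it's just another Iceberg catalog; the library translates the catalog calls into S3 Tables APIs behind the scenes.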
Third, the catalog allows you to create namespaces. This is a concept introduced by the Iceberg catalog spec that provides flexibility in how tables are organized. Typically you'll have a catalog containing some number of databases or schemas, and under each a number of tables. In the Iceberg model, databases and schemas are replaced with namespaces, which gain the flexibility of additional nesting. However, the S3 Tables catalog currently supports only one level of namespace nesting, so you lose some of the flexibility introduced by the Iceberg spec.
Fourth, you can configure S3 resource policies on namespaces and tables to control a client's access to the underlying data and metadata. This follows the standard S3 resource policy language that you've grown to love over the years.
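As an illustration of those resource policies - and treat the action names, ARN format and CLI flags here as assumptions based on my reading of the s3tables IAM reference, not a copy-paste recipe - a bucket-level policy granting a role read-only access to tables might look like this:

# Write a policy allowing a role to read table data and metadata (illustrative actions/ARNs)
cat > policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "AWS": "arn:aws:iam::111122223333:role/analyst" },
    "Action": ["s3tables:GetTableData", "s3tables:GetTableMetadataLocation"],
    "Resource": "arn:aws:s3tables:us-east-1:111122223333:bucket/my-table-bucket/table/*"
  }]
}
EOF
aws s3tables put-table-bucket-policy \
  --table-bucket-arn arn:aws:s3tables:us-east-1:111122223333:bucket/my-table-bucket \
  --resource-policy file://policy.json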
There are three major drawbacks to this approach, though.
First, AWS defined their own unique set of APIs that are not compatible with the Iceberg REST catalog specification. This forces users to integrate the new client library rather than use the Iceberg catalog clients already integrated into lots of open source and commercial engines. This alone will limit initial adoption and make the integration story much more complex.
Second, embedding a catalog in a storage layer (object store) isn't ideal: it's not core to the S3 business and it sits outside the scope of what a scalable, resilient and fast object store needs to focus on. Now, I can accept the argument that this is simply a mapping service from a table construct to an object construct, which is reasonable to include in the storage layer. However, that's not how users see it.
Third, the S3 Tables catalog and the Glue Data Catalog are integrated through proprietary catalog federation that doesn't work with any other catalog. This is extremely limiting because, for starters, if I'm only using the Glue Catalog, why do I even need to configure and deal with the S3 Tables catalog? I should be able to just use Glue. Also, if I choose to use Apache Polaris, Nessie or Unity Catalog, which provide far more capabilities than the S3 Tables catalog and Glue Catalog, I'm out of luck 😔
A Table Bucket resource
Similar in a way to Directory Buckets (which map a hierarchical data layout onto S3's standard flat layout), Table Buckets store data files under Iceberg's /data/ path, including partitions, and manifest files under the /metadata/ path. Data is stored in a dynamically generated standard-class S3 bucket hidden from the user. Currently, it seems these buckets are hosted and managed by AWS and not in the customer's account. Users can use the S3 APIs to read and write to this bucket, but much of the other functionality is restricted at this time. Engines retrieve the metadata location from the S3 Tables catalog and the data paths by reading the Iceberg manifest files, then simply call standard S3 GET and PUT APIs to read and write files.
This new Table Bucket resource does not create or update manifest files. That is still the job of the engine that writes the Iceberg tables. This means engines like Spark, Trino and Snowflake need to know how to read and write Iceberg tables - creating and updating manifests, committing transactions, updating and deleting rows, etc. - S3 Tables does none of that for you.
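As a rough sketch of what an engine (or a curious user) does under the hood - the metadataLocation field name is my reading of the get-table-metadata-location response, so treat it as an assumption:

# Resolve the table's metadata pointer via the catalog...
METADATA_LOCATION=$(aws s3tables get-table-metadata-location \
  --table-bucket-arn arn:aws:s3tables:us-east-1:111122223333:bucket/my-table-bucket \
  --namespace my_namespace --name my_table \
  --query metadataLocation --output text)
# ...then fetch it with a plain S3 GET against the hidden managed bucket
aws s3 cp "$METADATA_LOCATION" ./current-metadata.json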
Interestingly, when a new table is created using S3 Tables, a bucket is created with a random UUID-like string as its name, for example 6ed3a8e7-7d85-4d65-7sg781dtb533egimgkqzkmti6htqnuse1b--table-s3. The random name, following S3 best practice, minimizes hotspots as S3 spreads read/write requests across more infrastructure - something that is difficult for users to implement on their own. It's also important to remember that not all of your tables will be colocated in the same managed S3 bucket. Tables will be distributed across different random buckets, sometimes colocated, sometimes not. This also aligns with the 10x TPS performance improvement that AWS claims for S3 Tables.
I don't see any major drawbacks to this design. I think it's useful because it can hide and automatically implement S3 best practices that would be difficult for users to apply themselves. Beyond that, however, I don't think it adds much more value. A standard Iceberg engine will get the table metadata from a catalog, as you would do with S3 Tables, and then use that information to find and access data in a standard S3 bucket. There is no complexity here that needs to be abstracted, even if you consider other formats like Apache Hudi or Delta Lake…is the added management overhead worth the benefits?
Lastly, this special Table Bucket is about 14% more expensive than a standard S3 bucket. The bucket resource itself - creating/managing buckets for data, etc. - doesn't, in my opinion, provide enough value to warrant a higher price, so maybe the premium is to cover the cost of managing another catalog? The Glue Data Catalog is free for the first 1MM objects and requests and $1 for every additional 100K objects, so I'm not sure a 14% premium to support another catalog is justified.
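Some back-of-the-envelope math on that premium: at roughly $0.023/GB-month for S3 Standard, a ~14% markup adds about $0.003/GB-month, so a 100 TB Table Bucket costs on the order of $330/month more than the same data in a standard bucket - before any request, monitoring or optimization charges.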
Managed optimizations
Optimizations and table maintenance are some of the most important capabilities of Iceberg and provide the levers users need to control the performance and cost of their tables. Iceberg exposes numerous table properties that users need to understand, configure and tune - on every single table. It also provides several built-in procedures that implement compaction, snapshot expiration, deletion of orphaned files and more.
S3 Tables provides a managed table optimization engine that performs some of the important functions needed to maintain Iceberg tables. Maintenance can be configured on the Table Bucket or Table resource itself. Currently, it only supports small file compaction, snapshot expiration and removal of orphaned files. It does not support rewriting manifest files or handling delete files.
Configuring maintenance is done via the put-table-bucket-maintenance-configuration and put-table-maintenance-configuration API calls.
Looking through the documentation and trying different configurations, it seems the current implementation is lacking in a few ways:
Requires explicit configuration via the API, rather than reading a table's properties. Also, the configuration property names don't align with Iceberg's naming, which makes it harder to know what's what.
Lacks granularity and control - e.g. snapshot expiration time is set in hours (not minutes, which high-volume ingest needs) and compaction only lets you set the output file size, with no control over sorting, compression or compaction frequency - hopefully these are on the roadmap.
Some maintenance tasks (deleting orphan files) are configured at the table-bucket level and others (compaction and snapshot expiration) at the table level. An example of how this affects you: with delete-orphan-files enabled at the bucket level, if you accidentally drop a table, you only have as long as the deletion frequency to try to recover your data before it is all deleted. But since the table is dropped, you don't know the S3 path to the data, and if there are no other tables in that bucket, it will be inaccessible to you.
Monitoring is sub-par, simply offering a time of execution and a success/fail indicator, with no actionable way to dive deeper or troubleshoot:
"icebergSnapshotManagement": { "status": "Failed", "lastRunTimestamp": "2024-12-09T18:44:19.390000+00:00" }
Of course, there is the cost aspect of managed optimization. The cost is broken into three parts:
Object monitoring at $0.025 / 1,000 objects
Objects processed at $0.004 / 1,000 objects
Data processed at $0.05 / GB
To be honest, I don't know why AWS chose this complicated model, but I assume it's to transparently charge back to all the components in the S3 backend: data-monitoring compute to track when to process files; objects processed, which probably covers the DynamoDB (I assume) usage needed to track pre/post-processed objects; and data processed for the actual compute needed to read, compact and write objects.
They should have just bundled these costs into a single line item.
Because there are no mechanisms for you to control when to run compaction or which files to compact, this pricing model can become expensive quickly, especially for tables fed by CDC or streaming sources.
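A rough, hypothetical illustration: a streaming table receiving one ~1 MB file per second accumulates about 2.6 million objects (~2.5 TB) a month. If compaction touches each object just once, that's roughly $10 in objects processed (2.6M x $0.004 per 1,000) plus about $125 in data processed (2,500 GB x $0.05), before monitoring charges - and files rewritten by compaction can be picked up again in later passes, so the real number trends higher.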
Integrations with AWS Analytics services
Integration between S3 Tables and the rest of the AWS analytics services is done via the Glue Data Catalog. The integration wizard automates a lot of the IAM permission creation and assignment, which has been a nightmare in the past, so it's nice to see they put some thought and care into this process. Once integrated, S3 Tables resources are mapped to Glue Data Catalog resources, as illustrated in the documentation.
The integration, aside from creating and assigning some IAM roles and permissions, creates a Lake Formation resource link that connects the S3 Table Bucket resource to a new catalog + bucket named entity, for example s3tablescatalog/royon-table-bucket-demo. Each Table Bucket you create will also automatically get a new resource link in Lake Formation.
OK, PAUSE! We flip-flop between the Glue Data Catalog (GDC) and Lake Formation (LF). It's important to understand that the metadata structure - catalogs, databases and tables - is maintained in the GDC and exposed via the LF console. Currently, the GDC console and APIs aren't able to return any information about the resources below the federated catalog we created; the only way to reach them is via the LF console. This is problematic in two ways: first, engines that use GDC as their catalog can't traverse this hierarchy, and second, it's confusing as hell to the user.
Maybe I’m missing something but I can’t explore beyond the catalog level using Glue APIs…
aws glue get-databases --catalog-id 123456789012:s3tablescatalog
An error occurred (EntityNotFoundException) when calling the GetDatabases operation: The specified bucket does not exist. (Service: S3Tables, Status Code: 404, Request ID: 7a6d32fa-3680-4102-b1af-698b9b6e89ae)
Carry on…this federated setup is fairly simple for small-scale deployments, but with lots of Table Buckets, namespaces and tables, the process and naming conventions can become difficult for users to manage and get right. Utilizing a proper Iceberg catalog with nested namespaces and proper federation would have made this much simpler.
Furthermore, since the integration is strictly with Lake Formation, engines that have not integrated with Lake Formation, like Snowflake, won't work by integrating with GDC alone. The integration bar is much higher now, with lots of additional complexity that I think overshadows the added value.
Lastly, once you've completed the integration, you need to assign permissions for users to access the tables in your federated catalog/namespace. For that you'll use Lake Formation, which is pretty straightforward and flexible, but unfortunately only works with AWS-native analytics services today.
Managing access to S3 Tables
This will require a dedicated post, but in short, securing access to S3 Tables is split between S3 resource policies and Lake Formation. When using the S3 Tables catalog, users need to configure S3 resource policies using IAM. This gives coarse-grained control over API usage and table-level access for IAM entities like roles and users. You can configure access at the Table Bucket or Table level.
If you completed the integration, you can also control access to tables via Lake Formation, which provides more fine-grained controls, tag-based access controls and decent auditing capabilities. Unlike with GDC, where you can effectively turn off permissions and pass control over to Lake Formation, with S3 Tables you don't have that option. This means you could have S3 resource permissions denying access to a table while Lake Formation allows access to the same table. In my testing, the S3 permissions take precedence.
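For completeness, the Lake Formation side of a grant looks roughly like this; the CatalogId format for the federated S3 Tables catalog is my assumption based on the resource-link naming above:

# Grant a role SELECT on a table in the federated s3tablescatalog (principal/ARN are placeholders)
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:role/analyst \
  --permissions SELECT \
  --resource '{"Table":{"CatalogId":"111122223333:s3tablescatalog/my-table-bucket","DatabaseName":"my_namespace","Name":"my_table"}}'

As noted above, an explicit S3-side deny can still override this grant.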
The permission surface area is large and complex and it will be challenging for users to keep it straight, troubleshoot issues and ensure proper access to data.
I think managing access controls will be one of the main Achilles heels for S3 Table adoption.
How are S3 Tables different than other managed Iceberg offerings
Existing managed Iceberg table offerings from Snowflake, BigQuery and to some degree Databricks combine catalog, optimizations and DML (ability to read and write to tables) into the managed offering. Storage remains unbundled. In addition, they support the ability to read (only) external tables, which are tables created and managed by other engines. Reading these tables can be done through integration with external catalogs like Apache Polaris, Unity Catalog, Glue Data Catalog and BigLake Metastore.
The value proposition for users is fully managed Iceberg tables with little to no additional effort on their part. However, this comes at a cost in both additional compute fees and vendor lock-in. For example, Databricks lets you read and write Delta Lake tables, then expose them as Iceberg (using UniForm) when you need to query them (read-only) from Iceberg-compatible engines. Optimizations are performed on the Delta table, for a fee, and secure access (catalog + RBAC) to the data is available only via Unity Catalog. Snowflake is similar, in that you create an Iceberg table that is stored in your S3 bucket but can only be written to and managed by Snowflake. You can sync the table with an external catalog to give other engines a read-only view of the data.
AWS took a different approach. Rather than bundle the compute, optimizations and catalog, they bundled the storage, optimizations and catalog. This means that you’re not tied to only AWS compute engines and that any integrated engine can provide full DML support for both S3 Tables (managed) and standard Iceberg tables (self-managed on S3 standard storage).
This is an important distinction💥. S3 Tables provide a way to store Iceberg tables without handicapping the engine. With Snowflake, for example, you need to choose “managed Iceberg tables” if you want DML and optimizations from Snowflake or “unmanaged, external Iceberg tables” if you want to do everything on your own and only query with Snowflake. There is nothing wrong with either approach, it all depends on what you need from the solution.
However, with S3 deciding to bundle the catalog and optimizations into the storage layer, we’re left to consider whether this approach adds enough value to be worth considering at all.
My argument boils down to:
Embedding a catalog into S3 is short-sighted because, although it's convenient for development, it does not address how organizations need to deploy and manage lakehouses at scale.
Requiring federation between S3 Tables’ catalog and Glue/Lake Formation creates an unnecessary lock-in with a ton of management overhead and complexity
Access controls are a nightmare because of this split-brain catalog implementation
Bundling table optimizations with the storage layer sounds good on the surface; however, it directly competes with Lake Formation's Iceberg optimization features and requires a completely different user experience to manage, monitor and troubleshoot than standard S3 functionality.
With all that, why not just use S3 standard storage for your Iceberg tables and your choice of data catalog? Lots of great options are coming onto the market.
Why S3 Tables now?
Now that you know what S3 Tables can and cannot do, and how they differ from the alternatives on the market, the question you most likely want answered is: why did AWS decide to release S3 Tables now? My take is that they fell behind the competition and needed to catch up, simple. Iceberg and lakehouses are a big deal for many customers. Iceberg is a key differentiator that AWS' primary competitors, Databricks and Snowflake, already offer in a managed and (for the most part) simple-to-use manner, while before S3 Tables, AWS could only offer a disconnected, yet open, alternative: tables on S3, the Glue Catalog and Lake Formation table optimizations.
So, S3 Tables to the rescue. However, the approach and implementation are far from ideal, and I honestly think it will set AWS back rather than spring them forward.
This is why good product managers are important 🤯
The race for storage optimization
In the data infrastructure business, vendors compete on functionality, performance, ease of use and cost. When it comes to open table formats, Iceberg solves for functionality, cost and ease of use. What's left is to solve for performance (and cost - you can always improve on cost).
The next race will be table optimization and maintenance. It doesn't sound very sexy, I know, but whoever does the best job will deliver the best query performance and the lowest costs (compute and storage), and make data a true utility for the business and its users.
This is where I, and Upsolver, have been investing heavily for the last 7+ years (even before I joined them). The Iceberg tables we generate and optimize are on par with Snowflake's native file format when queried from Snowflake and other engines. This is huge. If performance is your primary deciding factor, and for many it is, then with Iceberg and a good optimizer you have a lot of choice and control over your future.
It doesn't end with optimizations, though. Table management is a big ocean of needs: retention, data tiering and archiving, usage auditing, format portability, security and much more.
AWS (S3 Tables, Lake Formation), Upsolver, Snowflake, Dremio, Databricks, Starburst and others are investing heavily in table optimization and management to provide the best experience, and since Iceberg leveled the playing field for warehouse functionality on top of object stores, these table services are the new frontier.
What can you look forward to
With everything I said, I think S3 Tables are the kernel of something that could be amazing. The question is: will AWS stay focused on delivering real value together with the broader ecosystem, or create another dark corner only the bravest of engineers are willing to explore?
Here are a few things I’m excited about (if they come true 🤞):
The S3 Tables APIs give us a hint that Iceberg is only the first format to be supported. S3 Tables is structured in a way that could support Hudi, Delta Lake or other formats in the future. This is exciting because I don't think there will ever be a single winner; each format offers unique value and is more useful in different scenarios. Supporting them all is a good idea.
The catalog client library today only supports the S3 Tables catalog, but if executed right, it could expand to include the Glue Data Catalog and Iceberg REST-spec catalog APIs, giving us an uber-client that works with data across catalogs, platforms and storage systems, seamlessly.
Lake Formation is a powerful service for fine-grained access control, but unfortunately it hasn’t found its market fit within the AWS analytics portfolio. I expect that S3 Tables will push Lake Formation to the forefront and encourage more AWS services and 3rd party vendors to integrate with it. This would allow users to manage their access controls in a single place and have them enforced wherever they access data.
I hope this long post shed some light on the why, how and what of AWS S3 Tables. Good luck!