What's an Iceberg catalog & why it's the future
Iceberg REST, Glue Data Catalog, Apache Polaris or Databricks Unity - which should you use?
An Iceberg catalog is a metastore used to manage and track changes to a collection of Iceberg tables. Clients use a standard REST API interface to communicate with the catalog and to create, update and delete tables. Iceberg catalogs can use any backend store like Postgres or DynamoDB. Some popular Iceberg catalogs are Apache Polaris, AWS Glue Data Catalog and the new Rust-based Lakekeeper.
I previously wrote about the race to own open data, where I explained the importance of a data catalog, and more specifically an Iceberg catalog, in a future where open table formats are how we store, share and manage large datasets across engines and clouds.
In recent weeks I’ve been having conversations with folks where it seems it’s still unclear what exactly an Iceberg catalog is and why you need one for your lakehouse to be functional and scalable. So let’s dive right in…
A technical metastore with standard APIs
Often people conflate a catalog with a metastore. Although they have similar roles, their functionality is different. A metastore is primarily a database that stores technical metadata like the schema/db/table hierarchy, table schemas, partitions and stats. A metastore’s primary function is to enable query engines to find, understand and access data files. A catalog stores similar information to a metastore, plus some business-related metadata like tags, annotations and even taxonomies. A catalog’s function is to serve as a dataset discovery and exploration tool for business users.
The Iceberg REST catalog (IRC), in particular, is a metastore that exposes a set of standard REST-based APIs that are consistent across catalog implementations. This makes it easy for clients like Spark, Flink, DuckDB, StarRocks, ClickHouse and others to integrate once, enabling support for multiple IRC solutions.
Users benefit from this standard REST interface because it makes it simpler to secure, monitor and interoperate with different catalogs, clients and engines. In particular, the interoperability enables users to move from one catalog to another and bring their own engine with ease.
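To make this concrete, here is a minimal sketch of talking to an IRC from Python using PyIceberg. The catalog name, endpoint URI, credential and table identifier are placeholders; the point is that the same handful of calls works against any compliant IRC implementation.

```python
# Minimal sketch: connect to an Iceberg REST catalog with PyIceberg.
# The URI, credential and table identifier below are placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "lake",  # logical catalog name
    **{
        "type": "rest",
        "uri": "https://catalog.example.com",      # IRC endpoint (placeholder)
        "credential": "client-id:client-secret",   # OAuth2 client credentials (placeholder)
    },
)

print(catalog.list_namespaces())                  # discover namespaces
table = catalog.load_table("analytics.events")    # resolve a table by name, not by file path
print(table.schema())
```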
Iceberg catalog enables better compatibility
One of the major benefits of open table formats and lakehouses is compatibility between clients and engines. Although Iceberg clients allow you to query a table by pointing to the file path of its current snapshot, a catalog abstracts this away, making finding and querying a large number of tables super simple - by name.
This compatibility is enabled by using RESTful APIs that separate concerns and state between clients and servers. REST APIs are standard across the web and are how all modern Internet services communicate. Unfortunately, the metastores used by data tools tend to rely on JDBC or even Apache Thrift (Hive Metastore). These are much more complex to implement and use, hindering development and adoption by additional clients and engines.
Ultimately as more engines and tools support both the Iceberg table and catalog specifications, compatibility will inch closer to completion. This will lead to seamless, cross-platform and cross-cloud ubiquitous access to data.
Now, to be clear, you can use a non-Iceberg-REST catalog like Hive Metastore or Glue Data Catalog (without its new Iceberg endpoint). These catalogs do support creating, updating and deleting Iceberg tables. However, clients require special integration with these catalogs, typically via a custom (and sometimes less battle-tested) driver or connector. That’s less of an issue for big data tools like Spark and Flink, but far less ideal for emerging tools like DuckDB, Polars, ClickHouse and others.
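By contrast, wiring an engine to an IRC is mostly a matter of a few catalog properties. Below is a hedged PySpark sketch using the configuration keys from the Apache Iceberg Spark runtime; the catalog name, URI and warehouse location are placeholders, and the matching iceberg-spark-runtime jar is assumed to be on the classpath.

```python
# Sketch: point Spark at an Iceberg REST catalog via catalog properties.
# Assumes the iceberg-spark-runtime jar for your Spark version is available
# (e.g. via --packages). Catalog name, URI and warehouse are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("irc-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "rest")
    .config("spark.sql.catalog.lake.uri", "https://catalog.example.com")
    .config("spark.sql.catalog.lake.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# Tables are addressed by name through the catalog, not by file path.
spark.sql("SELECT * FROM lake.analytics.events LIMIT 10").show()
```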
Iceberg catalog enables consistent behavior
One of the key benefits of the Iceberg table format is ACID guarantees for create, update and delete transactions on data files stored in an object store. These guarantees can be provided in part due to the catalog.
A write operation on an Iceberg table executed by a client goes through the following steps (a minimal code sketch follows the list):
1. Write new data files (and, for row-level deletes, delete files) to object storage.
2. Create new manifest files and a new snapshot, recorded in a new table metadata file.
3. Atomically swap the table’s current pointer in the catalog to the location of the new metadata file.
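Here is roughly what that flow looks like from a client’s point of view with PyIceberg. The table name and columns are placeholders and the catalog is assumed to be the REST catalog configured earlier; the key point is that the client writes the files to object storage itself, and only the final pointer swap goes through the catalog.

```python
# Sketch of a write: data/manifest/snapshot files land in object storage,
# then the catalog atomically swaps the table's current metadata pointer.
# Table name and columns are placeholders.
import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog("lake", **{"type": "rest", "uri": "https://catalog.example.com"})
table = catalog.load_table("analytics.events")

batch = pa.table({"event_id": [1, 2, 3], "kind": ["view", "click", "view"]})

# append() writes new data files, manifests and a snapshot, then commits by
# asking the catalog to atomically update the table's metadata location.
table.append(batch)
```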
Because of this atomic swap, if a client queries a table that is actively being updated by another client, it only sees the table as of the last committed state it knows about. If the table is updated during query execution, the querying client is not affected. Only after rerunning the query will the client see the newly committed changes.
IRC provides a standard mechanism for clients to enable ACID guarantees without implementing custom or non-standard approaches that could ultimately violate these guarantees. Users of IRC can expect a consistent and standard behavior from compatible clients which makes onboarding new tools far less risky.
Furthermore, IRC assists in controlling concurrent writes to a single table, whether by one engine or by multiple engines. This is crucial as you scale your workloads beyond a single, serial write workflow. A common use case for concurrent writes is ingesting CDC events (change events from an RDBMS) in near real-time: the engine continuously commits changes to your Iceberg table while compaction operations run in the background to optimize the table’s data. Both operations execute as Iceberg transactions and rely on the catalog to ensure conflicts are handled correctly.
With high-volume ingestion, multi-table transactions, and table optimization and maintenance across hundreds or thousands of tables, a compliant IRC is critical to delivering consistent, predictable behavior to users reading and writing tables.
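The pattern this enables is optimistic concurrency: a writer that loses the race refreshes the table’s metadata and retries its commit instead of clobbering the other write. The sketch below is conceptual; the helper name is hypothetical and a real client library raises a specific commit-conflict error rather than a generic exception.

```python
# Conceptual sketch of optimistic concurrency on top of an IRC.
# The function name is hypothetical; real clients raise a specific
# commit-conflict exception rather than a generic Exception.
import time

def commit_with_retry(table, batch, attempts=5, backoff=0.5):
    for attempt in range(attempts):
        try:
            table.append(batch)       # the commit is routed through the catalog
            return
        except Exception:             # e.g. another writer (compaction) committed first
            table.refresh()           # pick up the snapshot the other writer committed
            time.sleep(backoff * (attempt + 1))
    raise RuntimeError("could not commit after retries")
```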
Iceberg catalog simplifies access controls
In data governance, specifically access controls, there are two primary systems at play: the policy decision point (PDP) and the policy enforcement point (PEP). Traditional warehouses like Snowflake, Redshift and BigQuery combine the PDP and PEP in the same platform, giving them total control over how data is protected.
As we move to the lakehouse architecture, the decoupling of storage, compute and catalog creates a situation in which control is no longer managed in one place but rather split across multiple systems.
The Iceberg REST catalog provides a single place to host the PDP and PEP functions allowing you to define and enforce coarse-grained access controls.
Fine-grained access controls, on the other hand, like column- and row-level permissions, are more complicated to implement in this model because they require trusted collaboration with query engines. To restrict access to a table, the catalog simply denies the client’s request to load the table, notifying it that permission is denied. To restrict access to specific columns, the catalog can return only the allowed columns to the query engine (excluding the restricted ones). However, once the engine has access to the data files, it can see all columns. It’s up to the engine to do the right thing and return data only for the allowed columns. If the engine isn’t secure, data could be exfiltrated. This is a risk data engineers and leadership need to be aware of when selecting the tools users may use to access sensitive information.
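A toy sketch of the column-filtering idea is below. The function and data shapes are illustrative, not a real catalog implementation; they only show that the catalog can trim what it advertises, while the data files themselves remain unredacted.

```python
# Illustrative only: a catalog trimming the schema it returns to the allowed columns.
def filter_schema(table_schema: dict, allowed_columns: set) -> dict:
    return {
        **table_schema,
        "fields": [f for f in table_schema["fields"] if f["name"] in allowed_columns],
    }

schema = {"type": "struct", "fields": [
    {"name": "email", "type": "string"},     # restricted
    {"name": "country", "type": "string"},   # allowed
]}
print(filter_schema(schema, {"country"}))    # the engine is never told about "email"
# ...but once the engine can read the data files, nothing physically hides that column.
```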
A different benefit of IRC is credential vending: the catalog provides an authenticated client with storage credentials, like S3 access and secret keys, to access the underlying data files of a table. The credentials are scoped to only the permissions the user is allowed. For example, if a user is only allowed to read from the table customers, then the storage credentials the catalog vends to the client will only include permissions for GetObject, not PutObject or DeleteObject. Admins no longer need to configure storage-level permissions on clients, making it super simple for users to bring their own engine and collaborate on data securely.
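A hedged sketch of consuming vended credentials is below. The property names mirror the S3 keys commonly returned by REST catalogs, but treat the payload shape as illustrative and the values as placeholders.

```python
# Sketch: use short-lived, scoped credentials vended by the catalog with boto3.
# The dict stands in for what a loadTable response might carry; values are placeholders.
import boto3

vended = {
    "s3.access-key-id": "ASIA_PLACEHOLDER",
    "s3.secret-access-key": "SECRET_PLACEHOLDER",
    "s3.session-token": "TOKEN_PLACEHOLDER",
}

s3 = boto3.client(
    "s3",
    aws_access_key_id=vended["s3.access-key-id"],
    aws_secret_access_key=vended["s3.secret-access-key"],
    aws_session_token=vended["s3.session-token"],
)

# With a read-only grant, GetObject succeeds while PutObject/DeleteObject
# are rejected by the storage layer, regardless of what the client attempts.
```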
One challenge not addressed by IRC is the access control definition language and data model. Apache Polaris uses a simple RBAC model following the one inside Snowflake. Databricks UnityCatalog expands on standard RBAC for tables with other resources like connections, functions, external locations and shares. Lakekeeper uses a relationship-based access control model built on OpenFGA. There is nothing wrong with any of these approaches; however, it makes it difficult for users to move datasets between systems since permissions can’t easily be carried over with the data. Pay attention when choosing a catalog and implementing access controls, as they are not easily portable.
Iceberg catalog enables advanced performance
Moving to more futuristic benefits, the IRC is well positioned to capture important information about query and access patterns from different engines and users.
IRC already exposes a REST endpoint for engines to send metrics reports. The engine includes information about the plan it decided to execute, which filters it applied to the query and metrics about the data it scanned. Today, that information is helpful to manually debug and optimize query performance. But in the future, this could enable the IRC to provide server-side query planning capabilities because it already knows a lot about the table, data, access patterns and permissions. Furthermore, these metrics can help table management systems, like Upsolver, to optimize data files to better fit query patterns.
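For a sense of what that looks like on the wire, here is a heavily simplified sketch of an engine posting a scan report to the catalog’s per-table metrics endpoint. The path, auth header and payload fields are illustrative placeholders rather than a complete, spec-accurate report.

```python
# Illustrative sketch only: an engine reporting scan metrics to the catalog.
# Endpoint path, auth and payload are simplified placeholders, not the full spec.
import requests

report = {
    "report-type": "scan-report",
    "table-name": "analytics.events",
    "snapshot-id": 123456789,
    "metrics": {"result-data-files": {"unit": "count", "value": 12}},
}

requests.post(
    "https://catalog.example.com/v1/namespaces/analytics/tables/events/metrics",
    json=report,
    headers={"Authorization": "Bearer <token>"},
    timeout=10,
)
```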
Server-side query planning would level the playing field across engines, enabling consistent performance and behavior whether you’re using Spark or Pandas.
The big question is where will performance tuning and query planning innovation come from? Traditionally, this was done by query engine vendors. Do we believe catalog vendors are better suited to provide this? Maybe, maybe not. Still TBD how this will work out, but exciting for sure.
Interoperability vs. compatibility
Multi-engine access and collaboration is a major benefit of Iceberg and IRC. It materializes in the ability to read and write tables using different types of engines. For example, ingesting data with Upsolver, transforming it using Databricks and querying it using Snowflake, all on top of shared storage based on Iceberg. Lots of organizations manage heterogeneous data platforms. IRC plays a critical role in enabling these platforms; however, vendors can be deceptive in how they market their solutions.
Compatibility is a state in which multiple systems (clients, catalogs, compute engines, etc.) can coexist and communicate with each other without problems or significant compromise. Interoperability is the ability of those same systems to work together.
There is a nuance here that’s important to understand. Compatibility implies two systems can work together without special adaptation or modification. Interoperability implies that adaptation or modification is allowed as long as the two systems are able to communicate and share information.
An example where this nuance manifests is Databricks UnityCatalog, which lets non-Databricks engines query DeltaLake tables as Iceberg tables. It accomplishes this by converting DeltaLake tables into Iceberg tables on the fly using Uniform. This does enable interoperability between Databricks and engines that don’t support Delta but do support Iceberg. However, it breaks compatibility because these Iceberg-only systems are handicapped: they cannot modify the Iceberg table, can’t leverage MoR tables (no support for deletes) and can’t control how data is optimized or cleaned up.
Don’t let the marketing FUD around interoperability blind you to what is actually happening; instead, focus on compatibility.
Fully complying with the IRC and Iceberg table specifications enables ecosystem compatibility, which will ultimately deliver resilient, scalable and cost-effective data architectures without costly vendor lock-in.
What should you do now?
If you already have a data lake and want to use Iceberg, you can do it today. Hive Metastore (HMS), AWS Glue Data Catalog, Google BigLake and Dataproc metastores all support Iceberg tables. They aren’t IRCs, or may not include all of IRC’s capabilities, but you can still use them with Iceberg tables. No need to migrate to an IRC just yet. Take your time to evaluate options and migrate gradually.
If you’re starting fresh with a lakehouse, choose an IRC. It will set you up for streamlined growth and enable new capabilities sooner. Apache Polaris, Snowflake Open Catalog (based on Polaris) and AWS Glue Data Catalog are solid options that can be used in production today. If you’re a Snowflake customer, choosing their Open Catalog is a good move because it will enable tighter integration and be more familiar to your engineers. If you’re primarily using Databricks, choosing UnityCatalog is a valid option, but be aware it isn’t very compatible with the rest of the Iceberg ecosystem yet. Expect to work primarily with DeltaLake.
With Azure Fabric, the entire platform is built on top of DeltaLake, given their relationship with Databricks. However, they offer a solution similar to Uniform, using Apache X-Table to convert Delta tables to Iceberg for read-only use cases. There is also a silver lining: Azure will begin to support native Iceberg without the need for conversion.
Google provides BigQuery Metastore, a non-IRC catalog that works with GCP analytics services. If you’re heavily invested in GCP and only use BigQuery, then it makes sense to enable managed Iceberg tables, but it isn’t compatible with the rest of the Iceberg ecosystem. BigLake Metastore is an Iceberg-only catalog based on Hive Metastore that exposes custom APIs over REST; it is not an IRC. If you’re planning to use open source big data tools, BigLake Metastore is an option, but again it is not compatible with the broader Iceberg ecosystem.
Conclusion
Iceberg REST Catalog is the future! When building new, choose an IRC. When expanding existing (Hive-based) lakes to include Iceberg tables, use your existing catalog and migrate to an IRC when you’re ready to take advantage of its capabilities. Don’t fall prey to partial or proprietary implementations that will lead to further lock-in and higher costs.