Metadata is quickly becoming more valuable than the actual data it represents. Data catalog solutions are springing up to collect and capture this metadata from every nook and cranny of the data stack. But collecting metadata is not the real problem.
We often talk about catalogs in generic terms, but it’s not always clear what type of catalog people are referring to. There are three types of catalogs: technical, business and operational.
Technical data catalog
The technical catalog is responsible for collecting and storing metadata from data systems: schemas, including column names, data types and primary key indicators. It also includes location information that tells query engines where to find the actual data files holding the table’s values.
We use this information to understand the structure of the table, what type of data it holds and how to access the data files.
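To make this concrete, here is a minimal sketch of the kind of entry a technical catalog keeps for a single table. The table name, fields and paths are made up for illustration; real catalogs model this differently, but the ingredients are the same: schema, types, keys and location.

```python
# Illustrative only: a made-up record showing what a technical catalog
# typically knows about one table.
table_entry = {
    "name": "sales.orders",  # hypothetical table
    "schema": [
        {"column": "order_id",    "type": "bigint",        "primary_key": True},
        {"column": "customer_id", "type": "bigint",        "primary_key": False},
        {"column": "order_ts",    "type": "timestamp",     "primary_key": False},
        {"column": "amount",      "type": "decimal(10,2)", "primary_key": False},
    ],
    # The location tells a query engine where the table's data files live.
    "location": "s3://example-lake/warehouse/sales/orders/",
    "format": "parquet",
}
```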
In data warehouses or databases, this information is well defined and maintained by the system. In data lakes and lakehouses it is maintained by an externally managed catalog. For many years that responsibility was delegated to the Hive Metastore (HMS).
What is a Hive Metastore?
A brief history is required before we continue. The Apache Hive project was released in 2010 by Meta (then Facebook) to provide warehouse capabilities for processing massive amounts of data using Hadoop, a distributed big data processing framework.
One of the key components of Hive is an external data catalog, the Hive Metastore (HMS), which allows users to do two main things:
Run multiple isolated compute clusters by reusing table metadata stored in a central catalog.
Run ephemeral clusters that would be spun up temporarily to perform arbitrarily long operations and then be terminated.
Over the years, the big data tools we know and love, like Spark and Trino, continued to build on the shoulders of these two well-established giants, Hadoop and Hive, making HMS a critical component in modern, distributed data processing architectures.
As a central repository for all technical information about tables in the lake, query and processing engines heavily depend on HMS to plan query execution, identify table partitions and pinpoint the location where data files required to satisfy the query are stored. If HMS fails, your data lake grinds to a halt.
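As a sketch of what that dependency looks like in practice, here is how a Spark cluster is typically pointed at a shared Hive Metastore. The metastore URI and table name are assumptions for illustration; any number of clusters, including short-lived ones, can use the same configuration and see the same tables.

```python
from pyspark.sql import SparkSession

# Point this (possibly ephemeral) cluster at the shared Hive Metastore.
# The metastore URI and table below are hypothetical.
spark = (
    SparkSession.builder
    .appName("shared-hms-example")
    .config("hive.metastore.uris", "thrift://metastore.internal:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# The engine asks HMS for the table's schema, partitions and file locations,
# then reads the data files directly from the lake to answer the query.
spark.sql("SELECT count(*) FROM sales.orders WHERE ds = '2024-06-01'").show()
```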
So what’s the issue? Well, the issue is that HMS has seen little to no innovation over the past 10+ years. The idea of a central catalog is sound, but HMS isn’t cutting it anymore: it struggles with the scale and complexity of today’s data, it’s hard to connect non-Hadoop, non-Java tools to it, and there are few reliable managed options to choose from.
Business data catalog
The business catalog is probably the most talked about type of catalog because it is often the most user facing and thus attractive as a startup idea. The business catalog is intended to collect and present metadata about available datasets in a way that makes it easy to understand what each table is about and whether it can help the user answer the business questions they need to answer. It’s Google for your enterprise metadata.
For a business catalog to be effective, it needs to understand the technical metadata first. This information is usually crawled and extracted from downstream systems like the technical catalog, databases and data warehouses. On top of this technical metadata, a business catalog layers taxonomies, tags, annotations and lots of other business information to enrich users’ understanding of datasets.
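As a rough sketch of what such a connector does under the hood, assuming a warehouse that exposes the standard information_schema views (the connection details below are placeholders):

```python
import psycopg2  # assuming a Postgres/Redshift-style warehouse

# Placeholder connection; a real connector would read credentials from config.
conn = psycopg2.connect(host="warehouse.internal", dbname="analytics",
                        user="catalog_crawler", password="...")

# Pull the technical metadata that the business catalog will later enrich
# with tags, owners, glossary terms and documentation.
with conn.cursor() as cur:
    cur.execute("""
        SELECT table_schema, table_name, column_name, data_type
        FROM information_schema.columns
        ORDER BY table_schema, table_name, ordinal_position
    """)
    for schema, table, column, dtype in cur.fetchall():
        print(f"{schema}.{table}.{column}: {dtype}")
```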
What’s the issue? The issue is that a business catalog operates independently of all other data systems. It relies on crawling and scraping information to stay in sync. Modern business catalogs expose plugins or connectors, but in essence these do the same thing as crawlers. This means data teams must keep business catalogs in sync with downstream systems, or users will see incorrect and stale information. Another architectural bolt-on.
Operational data catalog
Data observability has gone mainstream as a way to bring monitoring, alerting and troubleshooting to the physical data, not just the tools that operate on it. Data observability tools crawl data systems, extract table metadata and execute test queries against them to verify different aspects of a table’s health, like row count, field uniqueness and schema drift.
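Here is a hedged sketch of the kind of test queries such a tool might run, reusing Spark as the query engine; the table name and checks are assumptions for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("table-health-checks").getOrCreate()

# Hypothetical health checks against a hypothetical table.
checks = {
    # Row count: did the latest load produce a plausible volume?
    "row_count": "SELECT count(*) AS v FROM sales.orders",
    # Uniqueness: is the business key still unique?
    "duplicate_order_ids": """
        SELECT count(*) AS v FROM (
            SELECT order_id FROM sales.orders
            GROUP BY order_id HAVING count(*) > 1
        ) t
    """,
    # Completeness: how many rows are missing a customer?
    "null_customer_ids": "SELECT count(*) AS v FROM sales.orders WHERE customer_id IS NULL",
}

results = {name: spark.sql(sql).first()["v"] for name, sql in checks.items()}
print(results)
# Schema drift is usually detected by diffing the current schema against the
# previously crawled version rather than with a query.
```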
If you look at this from the perspective of monitoring, these tools provide dashboards, alerts and custom monitors. If you look at it from the perspective of metadata management, these tools are catalogs containing table metadata enriched with health and quality information. This is similar to business catalogs, only serving a different user: the data engineer.
Some modern business catalogs (such as Acryl Data) recognized this overlap and began integrating observability with technical and business metadata to provide a comprehensive view to every user.
So again, what’s the issue? The issue is that we are forced to manage yet another catalog that only serves a narrow set of use cases for a small group of people in the organization. Other catalogs and monitoring tools build dependencies on this operational catalog for health and quality metadata and alerts, which further embeds it into our already overwhelming data stack.
But wait, there is one more catalog…
In the traditional data lake, tables were divided into metadata (stored in HMS) and data (stored in physical files). In the modern data lake, or lakehouse, tables are divided into state (maintained in a catalog), metadata (stored in manifest files) and data (still stored in physical files).
This is a monumental shift in how metadata is managed for large, decoupled datasets in the lake. There are several key advantages:
Ubiquitous - everything you need to know to query a table is stored together and accessible to any tool.
Flexible - tables can be dynamically changed by evolving schemas, updating partitions and rewinding to different points in time.
Consistent - tables in the lake are offered the same reliability and consistency guarantees as those in traditional databases.
Shareable - stored as a package, tables can easily be shared across tools, vendors and clouds.
All of this goodness is a benefit of open table formats like Apache Iceberg.
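As a small sketch of the flexibility point above, here is what schema and partition evolution look like from Spark against an Iceberg table. The catalog name, warehouse path and table are assumptions, and the Iceberg runtime and SQL extensions need to be on the classpath.

```python
from pyspark.sql import SparkSession

# Hypothetical Iceberg catalog named "lake" backed by a warehouse path.
spark = (
    SparkSession.builder
    .appName("iceberg-evolution")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3://example-lake/warehouse")
    .getOrCreate()
)

# Evolve the schema in place; existing data files are not rewritten.
spark.sql("ALTER TABLE lake.sales.orders ADD COLUMN discount_pct double")

# Change the partition layout going forward, again without rewriting old files.
spark.sql("ALTER TABLE lake.sales.orders ADD PARTITION FIELD days(order_ts)")
```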
Iceberg, being primarily an open table format, also introduced a REST catalog. This isn’t some magical new innovation. The REST catalog is simply a technical catalog implementation with a REST API frontend that makes it easy for any application to use. The reason it’s exciting is that we finally have a community-accepted way to replace the old HMS in the data lake stack.
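To illustrate how approachable that is, here is a hedged sketch of calling an Iceberg REST catalog directly over HTTP. The host, port, namespace and table are assumptions, and authentication is omitted.

```python
import requests

CATALOG = "http://localhost:8181/v1"  # hypothetical REST catalog endpoint

# Any tool that can speak HTTP can list namespaces and tables...
print(requests.get(f"{CATALOG}/namespaces").json())
print(requests.get(f"{CATALOG}/namespaces/sales/tables").json())

# ...and load a table. The response points at the table's current metadata
# file (plus the metadata itself): everything an engine needs to plan a read.
load_result = requests.get(f"{CATALOG}/namespaces/sales/tables/orders").json()
print(load_result["metadata-location"])
```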
It’s actually more than just a modern replacement for HMS. The Iceberg catalog also enables some unique capabilities desperately needed for large, multi-engine lakes.
Earlier I mentioned that lakehouse tables (Iceberg) are divided into state, metadata and data. The state component of the table is what enables transactions, controls concurrency, resolves conflicts between multiple writers and allows traversing a versioned timeline of table changes.
Transactions and concurrency controls are things traditional databases and warehouses have long supported, but they were difficult to offer on object stores with open file formats like Apache Parquet. With Iceberg, lakehouse users get all of this for free.
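Here is a brief sketch of what that looks like to a user, reusing the `spark` session and the hypothetical `lake.sales.orders` table from the earlier snippet (a recent Spark plus the Iceberg runtime is assumed; the timestamp and snapshot id are made up).

```python
# Every write commits a new snapshot atomically; concurrent writers are
# resolved with optimistic concurrency at commit time.

# Query the table as of a point in time...
spark.sql(
    "SELECT count(*) FROM lake.sales.orders TIMESTAMP AS OF '2024-06-01 00:00:00'"
).show()

# ...or as of a specific snapshot id taken from the table's history.
spark.sql(
    "SELECT count(*) FROM lake.sales.orders VERSION AS OF 1234567890123456789"
).show()
```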
The real problem is the “open” gatekeepers
As an industry, we’ve entered the era of consolidation. Companies realize that there are many different aspects of running a data platform, but at a high level it’s no different from managing any other platform or infrastructure where multiple teams and stakeholders have varying dependencies and requirements. Developing, testing, deploying, managing and monitoring are similar and can use most of the same tools.
Catalogs are no different. Well, actually there is one important difference.
As mentioned earlier, a technical catalog stores metadata about tables, but it is also used by engines to plan queries and figure out where to find the files they need to read. This gatekeeper behavior doesn’t exist with business or operational catalogs. If a business catalog is unavailable, your Spark, Trino or Snowflake queries don’t stop working.
This single point of control, together with an Iceberg REST catalog reference implementation available to everyone, enables catalog vendors, query engines and data warehouses to become gatekeepers to your data stored in open formats.
We’re heading towards a world where gatekeepers force users and workloads into their own solutions for managing data in open formats or risk forfeiting key features and performance optimizations.
Let’s look at some examples…
Snowflake announced support for two types of Iceberg tables. The first uses an external catalog; this implementation allows Snowflake to only read tables, not write or update them. The second uses a Snowflake-managed internal catalog that allows full DML on Iceberg tables, but only from Snowflake. Users now have to make a decision: read/write from Snowflake and read-only from external tools, or read-only from Snowflake and read/write from external tools. It’s expected that Snowflake will incentivize users to let it manage their Iceberg tables using the internal catalog. Additionally, Snowflake released a catalog SDK to encourage other engines to support its internal catalog, furthering its control over open data.
Databricks Unity Catalog supports the Delta Lake table format (functionally similar to Iceberg). When using Unity Catalog, users get full read/write access and concurrency controls for multiple writers to the same table. Delta Lake also works with other catalogs, but with limited functionality. Databricks is encouraging engines and users to integrate with Unity Catalog to get the full set of capabilities, again keeping the full benefits of Delta Lake tightly coupled to Unity Catalog.
Dremio, Starburst and other modern lakehouse query engines are following suit, offering managed Iceberg tables through their own versions of the Iceberg REST catalog and limiting functionality for tables not managed by their catalog.
Iceberg simplified the technical catalog requirements and eliminated the heavy Hadoop and Java dependencies. It’s now easy for anyone to build and host their own catalog. Recognizing the critical control point in the architecture, data platform vendors are fighting for control of your seemingly open tables.
How can catalog vendors help?
Aside from legacy workloads, Hive Metastore’s reign as we know it is coming to an end, and the industry is converging on a standard, open and unbiased replacement: the Iceberg REST catalog.
Catalog vendors, in particular ones with technical catalog experience, have an opportunity to embed the Iceberg REST catalog into their platforms and provide a unified experience for technical, business and operational needs. Expanding further, these solutions can offer unified access controls, data sharing, collaboration and compliance assurance across thousands of tables created by different tools and services.
More importantly, these universal catalogs reduce the control and influence “data cloud” vendors have over our data, minimizing lock-in.
Where do we go from here?
I can’t predict the future, yet 😉 but I have a hunch…
First, data cloud and unified query vendors will continue to expand their support for fully managed Iceberg tables using their own catalog. They will push hard for partners and open source tools to support their catalog and build out their own microcosm.
Second, technical, business and operational catalog solutions will consolidate, offering all of the key capabilities in a unified user experience. Some single-purpose business and operational (observability) catalogs will remain, but the ones offering a unified experience will deliver more value to more users.
Finally, a new breed of catalogs will emerge, supporting open table formats and offering the gamut of features needed for multi-table, mutable, collaborative and performant lakehouses. Hopefully some will be offered by existing unified catalog vendors (I’m looking at you, Acryl Data).
More importantly, keeping an open, decoupled lakehouse architecture with a catalog that offers unfettered collaboration enables a healthy ecosystem that drives more analytics and AI/ML use cases faster. At Upsolver, where I’m the VP of Product, we offer data ingestion and table management for Iceberg lakes, and we are big believers in the open lakehouse future. There is nothing more important than keeping open data truly open.