AI is driving a massive shift towards lake-based data architectures
Not sure why your AI projects aren't going anywhere? This could be a reason.
As the pace of innovation around AI and ML accelerates, companies are eager to quickly adopt new tech and therefore are willing to invest in implementing new architectures and building new infrastructure to support them.
New design patterns and solutions quickly emerge to address pains and optimizations on the path to delivering the ideal AI-driven experience. One of these is RAG - an architecture that allows applications, aka agents, to look up additional information inside the organization to enrich LLM conversations. Along with RAG, a new workflow was introduced by which unstructured or semi-structured data is parsed, embeddings are generated from it, and the results are stored in a new type of database: the vector database.
RAG, vector databases and unstructured ETL may sound unique to LLMs, but they aren’t new ideas. They are simply new ways to refer to things data engineers have been doing for years.
The point here is this 👇
We’re being blinded by the big shiny object, GenAI, blissfully following the “best practice” design patterns in order to quickly take advantage of the promises of AI. In the end, we’re left with a bunch of tools and products, codebases, processes and a whole lot of operational overhead, not to mention unnecessary costs.
Overhead is what’s stopping your AI projects from becoming real products.
You need to think about integrating GenAI into your existing platforms in a way that is cost-effective and sustainable, and I guarantee you that your company will be able to start realizing real value from AI much sooner and for far less💰
The cost of RAG on engineers
Retrieval-Augmented Generation (RAG) is an approach for enriching LLM prompts with context-specific information to improve the model’s ability to reason on tasks it wasn’t explicitly trained on. I go into more detail about RAG and how data engineers should reason about it in A Guide to AI Agents for Data Engineers.
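To make this concrete, here’s a minimal sketch of the core RAG loop - retrieve context, stuff it into the prompt, call the model. The retrieve_context function and the prompt template are placeholders I made up; swap in whatever lookup your platform provides.

# Minimal RAG sketch: retrieve context, enrich the prompt, then call the LLM.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def retrieve_context(question: str) -> list[str]:
    # Placeholder: swap in your own lookup (vector search, warehouse query, API call, etc.)
    return ["<relevant internal documents would go here>"]

def answer_with_rag(question: str) -> str:
    context = "\n\n".join(retrieve_context(question))
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content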
RAG systems are commonly used in agentic workflows to look up supporting information for a user or agent prompt. In current data platform architectures, different types of data reside in dedicated stores that are best suited to them in terms of performance, scale and access patterns. Documents are stored in MongoDB, customer records and tabular data in the data warehouse, vector embeddings from unstructured and semi-structured data in a vector database, and so on.
There are several challenges with this approach:
Data engineers need to maintain multiple data systems, each with its own operational, scaling, performance and cost characteristics and overhead.
Data engineers need to build and maintain bespoke ingestion and processing pipelines to load different types of data into the appropriate data stores, while adhering to each system’s specific best practices.
Data engineers need to secure and manage access to data across all of these systems, each with its own unique access control model and language.
Software engineers need to understand each of the different data systems: which types of data each one stores, how to query it, and how to manage the latency and performance overhead.
Multi-modal stores are not a new problem; however, the incredible pace at which GenAI is blowing through every industry and organization, combined with the massive amounts of data it demands, is forcing us to pay attention and find scalable and cost-effective alternatives to single-use systems.
There is a better way, right there👇
A modern AI + Analytics architecture or MAnAA
Much of what RAG proposes has already been solved by data engineers. Parsing raw data, storing it in an efficient manner and providing APIs for clients and users to securely query it are concepts already built into modern data platforms. You can save yourself a lot of time and energy by not reinventing them.
We already know that a specialized database for each type of data is possible and in some cases desirable. However, for agentic workflows this approach is increasingly difficult to manage and scale, and overly expensive, making it challenging for organizations to implement and productionalize AI.
We also know that data warehouses, although very versatile, can be expensive when processing and accessing vast amounts of data with low latency. Users are also limited to the access patterns enabled by the warehouse vendor. If their preferred method isn’t available, users are forced to make a copy of the data, leading to a long list of issues.
If you follow my content, you know that I’m a big proponent of using object stores and open formats as a universal storage layer. Databricks built their empire on this idea. AWS, Snowflake, GCP and Azure are doubling down on it and slowly so is the rest of the data industry.
To name only a few of the benefits of using an object store:
Universal - everyone and everything works well with it
Super cheap (compared to alternatives)
Massive scale with minimal engineering overhead
Compatible with all types of data
At a high level, this is what I’m proposing:
Let’s get into the details…
Ingestion, with new unstructured data pipelines
Data ingestion is an underrated component of any data architecture; however, it’s one of the most critical to successful analytics and AI projects. Data ingestion must support the following sources:
SaaS - on average, organizations use 100+ SaaS services (BetterCloud via PRNewswire) in their day-to-day operations. There is a wealth of information stored in these systems that can be leveraged for both analytics and AI.
Operational - every organization has at least one database they use to store customer records and application interactions. Oftentimes there are multiple such databases used for inventory tracking, security and access logs, web traffic logs, etc. This data is key to analytics and AI, and often updates frequently, so supporting CDC-based ingestion is a critical part of the ingestion pipeline.
Streaming - while not always needed for analytics, streaming sources like user behavior interactions and IoT are especially important for AI, as RAG systems require current data to enrich prompts and immediately act on user feedback. Much of the recent adoption of streaming technology has been driven by AI projects needing access to real-time event data.
Enterprise documents - before GenAI this area was untapped and often ignored by data engineers because it was difficult to handle and didn’t work well with traditional tools. However, enterprises hold a treasure chest of valuable data inside PDFs, DOCs, Excels, PPTs and other types of documents that AI wants to gobble gobble 🦃 up.
Images, audio and video - although not often used by traditional enterprises, these sources are becoming more commonplace as we integrate multimedia into everything. Zoom meetings, call recordings, X-rays, assembly line cameras, dash-cams, Ring cameras, etc. For the purpose of AI agents, this data is converted into vector embeddings that are easier to store and search through, but ingesting it still poses a challenge due to its size.
Working with unstructured data is something data engineers need to become much more comfortable with. It requires a whole new set of skills few currently possess.
Each of these sources requires a different set of tools to ingest and process. Current data platforms already solve for SaaS, operational and streaming sources. Documents and unstructured data are newer, but more tools are coming to market to help address them.
Unfortunately, there isn’t much from existing big data tools like Spark in the unstructured space. You can find a few extensions, but your choices today are mostly Rust and Python libraries or commercial tools. This poses a challenge because distributed processing isn’t natively built into these tools, forcing engineers to productionalize and scale unstructured ingestion on their own. Not ideal, but things are slowly improving.
LangChain includes built-in document loaders, Pixeltable is a fantastic project with a ton of goodies, and many of the popular ML toolsets include methods to load images, videos, audio and documents. But as I mentioned, scaling these processes is still not well defined and is something data engineers will have to master themselves.
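As a rough sketch (the file name and chunking parameters are made up, and this assumes the langchain-community, pypdf and langchain-text-splitters packages), here’s what a minimal document-parsing step can look like with LangChain’s loaders:

# Sketch: parse a PDF and split it into chunks, ready for embedding generation.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

pages = PyPDFLoader("quarterly_report.pdf").load()   # one Document per page, with metadata

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(pages)

for chunk in chunks[:3]:
    print(chunk.metadata["page"], chunk.page_content[:80])

Scaling this beyond a single machine is exactly the part that’s still left to you.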
This is probably the newest part for data engineers, but in the end, you’re adding new ingestion pipelines, using the relevant tools, to your existing ingestion solution.
Now that ingestion is handled, let’s talk about storing all of this data.
Persistence, with open table formats
This is where things start to diverge from the traditional approaches. Commonly, you’d want to store data in a persistence layer that is best suited for the type of data and for your expected access patterns and requirements. For analytics we usually store in the DWH, for operational and application data we store in Postgres or similar, documents go in MongoDB and vector embeddings in a vector DB like Pinecone.
Both analytics and AI are very powerful trends driving consolidation around a common persistence layer.
Analytics is consolidating operational data from SaaS, DBs and streams into a warehouse and increasingly the lakehouse using open table formats like Apache Iceberg.
AI is consolidating all of the analytics data and unstructured data, its metadata and its vector representation into a persistence layer that is still to be fully defined.
Open table formats like Apache Iceberg present an exciting opportunity for us to consolidate AI and analytics data under a single, highly scalable, cost-effective and versatile persistence layer.
Open table formats like Iceberg, DeltaLake and Hudi provide an abstraction layer on top of physical data files that allows us to manage different representations of data using a consistent set of APIs and access patterns.
For example, using Iceberg we can save data for:
Operational - using Apache Avro data files we can store row-wise records into tables that are optimized for single-row operations and high volume reads/writes.
Analytical - using Apache Parquet data files we can store records in a columnar format in tables that are optimized for selecting subsets of records and full sequential reads of large blocks of data.
Vector embeddings - using Parquet or Avro we can efficiently and cost-effectively store large vector arrays that can be queried using similarity search algorithms (see the sketch right after this list).
Metadata - using Parquet we can store large amounts of metadata about unstructured data that can be used for analytics and AI use cases when trying to extract meaning from videos, images and documents.
Blobs - using the newly defined Puffin file spec in Iceberg we can store blob data. Today it’s used for sketches and delete vectors, but there are many possibilities, like indexes and even binary blobs such as images or audio 🤯
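To make the vector embeddings case concrete, here’s a rough sketch using PyIceberg (the catalog, namespace and table names are made up, and it assumes a catalog named "lake" is already configured for PyIceberg):

# Sketch: an Iceberg table holding articles and their vector embeddings.
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, LongType, StringType, ListType, FloatType

schema = Schema(
    NestedField(field_id=1, name="id", field_type=LongType(), required=True),
    NestedField(field_id=2, name="title", field_type=StringType(), required=False),
    NestedField(field_id=3, name="content_vector",
                field_type=ListType(element_id=4, element_type=FloatType(), element_required=True),
                required=False),
)

catalog = load_catalog("lake")          # catalog configured in .pyiceberg.yaml
catalog.create_namespace("demo")        # skip if the namespace already exists
catalog.create_table("demo.articles", schema=schema)

Embeddings become just another column, living next to the rest of your data.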
With only these few building blocks you are able to build a production-grade lakehouse that supports most of your AI and analytics use cases in a single place. No more storing data in different places, learning to scale and operate lots of different databases, and spending money left and right.
It’s not to say you shouldn’t put data in speciality stores, but do that only when it’s absolutely required. Also, when choosing a speciality database, pay close attention to whether it supports reading from an open table format (like DuckDB and Postgres do). This will reduce complexity, minimize data duplication and cut the need to build and manage new ingestion pipelines.
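For instance, here’s a rough sketch of DuckDB querying an Iceberg table in place via its iceberg extension (the table path is made up, and any S3 credentials/httpfs setup is assumed to be configured):

# Sketch: read an Iceberg table directly from DuckDB - no copy, no extra pipeline.
import duckdb

con = duckdb.connect()
con.execute("INSTALL iceberg")
con.execute("LOAD iceberg")

con.sql("""
    SELECT count(*) AS articles
    FROM iceberg_scan('s3://my-lake/demo/articles')   -- illustrative table location
""").show()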
I’m really excited about future file formats that will adopt Iceberg as their metadata management layer expanding the concept of a unified persistence layer.
As a side note about table formats: DeltaLake is a format heavily developed by Databricks and tends to lean more towards managing analytical data. There is little indication that it intends to, or can easily, support other types of data well. Hudi is a table format that is heading in this unified persistence direction 🎉 The team seems to be focusing on building all of the storage parts into the format, rather than enabling a more pluggable model. This is a smart move initially because it gives more control over features and quality, but it could be limiting as the ecosystem grows and users want to bring their own storage formats. Iceberg is taking a more open approach, focusing on the metadata layer to enable a bring-your-own-file-format model. Initially this requires more integration work, but it will eventually be more flexible and able to support lots of diverse use cases.
Cool, now we’ve got documents, analytics and vector data in our lakehouse managed by Iceberg. What’s next?
Metadata, with discovery and access controls
There are two big challenges with how RAG is integrated with data platforms:
Discovering available and relevant datasets: For LLMs, discovering available datasets is done via similarity searches on vector embeddings of the data, which means engineers need to generate embeddings from every piece of data they can get their hands on. That can become very complex and expensive quickly. Ideally, you would only generate embeddings from unstructured data, which doesn’t include sufficient metadata to give the full context on its own. For semi-structured and structured data, however, there is a lot of metadata already in business catalogs that could be searched using string similarity functions like Levenshtein distance or similar NLP techniques.
Ensuring access to data is properly controlled: For operational and analytics data we already use RBAC or ABAC to control access; however, for vector similarity searches and RAG access to sensitive information, there is no good way to control what is returned. An interesting new company, Paig.ai, is trying to solve this problem for LLMs.
As an example of this, consider an agentic workflow that needs to access analytics data from the warehouse and transactional data from a CRM, and then build aggregations and summaries of all the information. The end user asking a question about sales performance in the last quarter is only allowed to see aggregated numbers, not specifics. How do you control access to all of the data sources based on the user’s level of permission, and ensure the results returned from the AI agent don’t leak information the user isn’t allowed to see? RAG with multiple systems and loose controls makes this task extremely difficult. Unifying access to data under the same persistence and metadata layer makes this problem far simpler and auditable.
The Iceberg REST Catalog (IRC) presents an opportunity to resolve both of these issues in a single tool. One example is DataHub (by AcrylData), an open-source catalog that combines metadata collected from lots of different systems (via crawling), data quality measures, and user and automated annotations with an Iceberg REST Catalog endpoint. The combination of business, operational and technical metadata enables AI agents, and users, to discover relevant datasets to satisfy AI and analytics needs. Furthermore, as a central discovery and access control center, permissions can be layered on top to enforce entitlements on all data assets in a single place.
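Here’s a rough sketch of what discovery over an IRC could look like (the endpoint, catalog name and query string are made up, and auth configuration is left out), combining the catalog listing with simple string matching:

# Sketch: list tables through an Iceberg REST Catalog, then fuzzy-match names to a request.
import difflib
from pyiceberg.catalog import load_catalog

catalog = load_catalog("lake", **{"type": "rest", "uri": "https://catalog.example.com/api/catalog"})

tables = [t for ns in catalog.list_namespaces() for t in catalog.list_tables(ns)]
names = [".".join(t) for t in tables]

# A user or agent asks about "sales performance" - surface the closest dataset names.
print(difflib.get_close_matches("sales_performance", names, n=5, cutoff=0.3))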
As we move towards a unified multi-modal lake architecture, the catalog will play an increasingly important role in enabling discovery and access controls.
The race is on between Databricks Unity Catalog, AcrylData DataHub and Snowflake’s Apache Polaris (incubating). There will be others as this market is being reimagined, but these 3 have a major head start.
Processing, with an engine of your choosing
Up to this point, the reimagined stack has enabled us to ingest, store, catalog and secure access to different types of data used for both analytics and AI under a single, unified lake experience. The allure of bring-your-own-tool has been a major benefit of the lakehouse architecture, but it has traditionally been siloed to analytics query engines.
With the unified lake architecture, we can begin to utilize more tools that are designed to provide the best capabilities and performance for different tasks.
For example, Polars is a great framework and query engine that’s increasingly being used for local data engineering development. With support for Iceberg, users can leverage Polars to build, test and deploy data pipelines to ingest and process data for analytics and AI using a streamlined workflow. Similarly, Trino is a common engine for distributed OLAP queries over large datasets. It also includes lots of custom functions, one of them being the cosine_similarity function, which lets you do vector searches just like a vector database, but without the cost of another database (and vendor). StarRocks offers similar functions that are super fast on Iceberg data.
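On the Polars side, here’s a quick sketch (the table path and column names are made up) of lazily scanning an Iceberg table during local development:

# Sketch: scan an Iceberg table lazily with Polars and pull back only what's needed.
from datetime import date
import polars as pl

articles = pl.scan_iceberg("s3://my-lake/demo/articles/metadata/v3.metadata.json")  # illustrative path
recent = (
    articles
    .filter(pl.col("published_at") > date(2024, 1, 1))   # hypothetical column
    .select("id", "title")
    .collect()
)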
Here is a simple Python function that uses OpenAI and Trino client libraries to generate embeddings from a user prompt and execute a similarity search on a large vector dataset stored in the lake using Iceberg.
def lookup_content(conn, q):
    """Embed the user prompt and run a similarity search over vectors stored in the lake."""
    cur = conn.cursor()
    vector = get_embedding(q)  # list of floats for the prompt, generated with the OpenAI client (sketched below)
    cur.execute(f"""
        SELECT
            id,
            title,
            cosine_similarity(ARRAY{vector}, content_vector) AS score
        FROM polaris.demo.articles
        WHERE cosine_similarity(ARRAY{vector}, content_vector) >= 0.85
        ORDER BY score DESC  -- return the 10 most similar articles, not an arbitrary 10
        LIMIT 10
    """)
    return cur.fetchall()
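For completeness, here’s a rough sketch of the supporting pieces the function above assumes - the embedding helper and the Trino connection. The model name, host and credentials are illustrative:

# Sketch of the helpers lookup_content() relies on; endpoints and names are illustrative.
from openai import OpenAI
import trino

openai_client = OpenAI()  # assumes OPENAI_API_KEY is set

def get_embedding(text):
    # The embedding model must match the one used to populate content_vector.
    resp = openai_client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

conn = trino.dbapi.connect(host="trino.example.com", port=443, http_scheme="https",
                           user="analyst", catalog="polaris", schema="demo")

results = lookup_content(conn, "articles about lakehouse architectures")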
Building agentic workflows becomes simpler since you now have a single source of truth for all of your data that scales seamlessly and allows you to leverage any tool at your disposal for access. This is particularly helpful with GenAI tool use or function calling scenarios, where you include in your prompt a set of available tools - custom functions, external APIs, etc. - that the LLM can choose from to complete a task.
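To make the tool-use point concrete, here’s a rough sketch of exposing lookup_content as a tool the LLM can choose to call (the schema follows OpenAI’s function-calling format; the description text and model are mine, and openai_client comes from the sketch above):

# Sketch: register lookup_content as a callable tool for the LLM.
tools = [{
    "type": "function",
    "function": {
        "name": "lookup_content",
        "description": "Similarity search over articles stored in the Iceberg lakehouse.",
        "parameters": {
            "type": "object",
            "properties": {"q": {"type": "string", "description": "The search query."}},
            "required": ["q"],
        },
    },
}]

response = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What did we publish about Iceberg last quarter?"}],
    tools=tools,
)
# If the model decides to call the tool, run lookup_content() and feed the results back.
tool_calls = response.choices[0].message.tool_calls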
Let’s take this simple idea a step further.
Imagine now you have a tool marketplace, discoverable through the catalog (discussed previously), that is made available for users, machines and AI agents to use. This marketplace can be deployed on the Snowflake app marketplace, on AWS as Lambda functions, or self-hosted on K8s or similar cloud infra. You can split this out further by registering tools in the catalog and delegating code execution to another compute platform (serverless, fully-hosted or self-hosted).
The good thing is that you don’t need to imagine this scenario - it’s ready and working today. Slowly but surely, vendors are exposing these compute/tool marketplaces where users and vendors alike can build and deploy. However, today, these marketplaces only really work on the marketplace-hosting vendor’s data - Snowflake, Salesforce, SAP, and Databricks (yes Databricks, with the lock-in to DeltaLake and UnityCatalog).
With this proposed unified multi-modal lake architecture, these engines and marketplaces become interchangeable compute platforms that serve your needs.
Another trend that’s accelerating this is Bring Your Own Compute (BYOC). I’m seeing more and more vendors embracing BYOC as an option for users to leverage their own infra in the cloud, on-prem and across clouds. This further encourages the tool/app marketplace concept to succeed using a customer’s own compute, inside their secure network, fully integrated with their unified lake.
Without a unified lake architecture, this reality will be very costly!
Wrapping it up
Consolidation is inevitable, but don’t expect too much of it to happen at the tools and services level. With Iceberg leading as the table format of choice for many large-scale lakes, and the growing support from data and AI tools and frameworks, we’re going to see consolidation around data storage and management practices - powered by open standards.
The consolidation, or better yet unification, of data management is critical for companies to accelerate the adoption and success rate of AI initiatives that go beyond experimentation and gimmicky, one-hit-wonder features.
Whether you’re building a new lake, migrating from a warehouse to the lake, or just exploring a move to the lake, the future lies far beyond just analytics and BI.
The market, users and technology momentum are signaling a future where data volume, variety and ease of accessibility are increasingly driving platform decisions.
Good luck!