Data industry in 2025
At the end of 2022 I wrote a post with a few predictions as we entered 2023. To summarize, my main predictions were:
Increased focus on and adoption of data development platforms: platforms that provide end-to-end data ingestion and integration, ETL/ELT development, query execution and AI+ML training and inference. This trend strengthened through 2024 with what felt like a war, or a race, between platform providers. AWS joined the race with SageMaker Unified Studio, announced during re:Invent 2024, but seemed to have missed the mark. Furthermore, we’ve seen Dremio become a more full-fledged data development platform, Google expand BigQuery into more ELT and AI workloads, and Azure blow the doors off the industry with a complete rebuild of its data offering, releasing the Fabric data development platform. Snowflake and Databricks continue to add to their platforms and grow in key directions, offering a true end-to-end data solution.
Streaming technologies saw more mainstream adoption as the tools became easier to use and simpler to deploy. This trend has been a long time in the making, but in 2024 we saw significant interest in streaming and near real-time use of data, driven by AI/ML (often used with RAG) and the need to train and predict on current data. Snowflake, Databricks, Redshift and BigQuery have all added native stream ingestion and continuously updating materialized views to support real-time use cases. Confluent acquired Immerok to add managed Flink to its offering, and startups like Upsolver, RisingWave, Ververica, Estuary and others continue to accelerate streaming adoption across CDC, IoT, security and log analytics.
Cooling down of DWH hype and the re-emergence of data lakes and lakehouses took up much of our attention in 2023 and 2024 as workloads shifted to open, lower-cost storage. This trend really blew up in 2024 as customers began to feel the cost and lock-in pressures of modern DWH solutions. Looking to reduce costs without giving up the functionality and ease of use of the DWH, open table formats shot to the top of engineers’ minds as the next wave of the Lakehouse began. Every major player in the data space announced support for at least one format, the most popular being Apache Iceberg: Databricks, Snowflake, AWS, GCP, Azure and even Salesforce. Most of the smaller vendors in the data space are quickly adding support for Iceberg as well. Although Delta Lake, Hudi and Paimon continue to be relevant in some cases, the industry seems to be leaning towards Iceberg as the default open table format.
Specialization of data roles continues to enable companies to stay agile as the pace of innovation and competitive pressures accelerate. In 2024, we started to see central data teams move away from data engineers who do everything toward platform engineers for tools and infrastructure and analytics engineers for SQL modeling. This trend continued to accelerate, with 2,891 job openings for DEs as of Dec 2024, a 23% increase since 2023. Data platform engineering job postings jumped in 2024 to 45,853 (per a job search on Glassdoor) vs. only a few hundred postings in 2023. Analytics engineering roles remained low, but data analyst roles increased by around 28% YoY. There isn’t a lot of freely available data about these roles, so these are ballpark figures. However, as data production volumes increase and data becomes more critical to winning in competitive industries, companies need their engineers to specialize so they are hyper-focused on delivering value quickly.
The smaller trends I called out, like data contracts, data quality and data products, continue to be part of the narrative, driven mostly by social media hype and, for lack of a better term, fear mongering: bad data quality will destroy your AI projects, fix it now with data contracts. However, none of them has picked up serious momentum or brought impactful change to these issues. I expect we’ll continue to iterate in these areas and slowly settle on solutions, with a few commercial tools coming to market in 2025; I just don’t see a big enough revenue opportunity to sustain multiple vendors.
One small trend I called out that turned out to be massive is the infusion of AI into our data engineering and data consumption lives. LLMs and GenAI led to the creation of data engineering co-pilots, text-to-SQL, prompt-based analytics and much more. AI is now part of everything we do and will continue to grow and improve in the coming years.
Stay tuned for more on this topic in future posts…
Bring on 2025
Enough about 2023 and 2024, let’s talk about what’s coming in 2025.
The end of BI as we know it
Yes, you heard me right. 2025 will be the first year where traditional BI begins to decline. For years, BI amounted to engineers and business users creating dashboards and reports, both to answer the questions we know to ask and to surface insights for the questions we don’t quite know how to ask. With advancements in LLMs, GenAI and machine learning, the BI experience is turning from building dashboards into asking questions.
The prompt interface makes it easy for anyone to quickly get answers to their business questions. A typical BI experience requires users to figure out the correct SQL queries to run and the best way to visualize the results. Although it sounds simple, for most people it can take a significant amount of time, and it often requires a fair bit of iteration until the data returned seems to represent what you’re looking for. With advancements in LLMs and GenAI, you simply ask a question and the BI engine builds the SQL query and visualization for you. If you don’t know what specific question to ask, you can iterate directly with the engine, building up context that allows it to more “intelligently” arrive at a reasonable conclusion.
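To make the shift concrete, here is a minimal sketch of the prompt-to-SQL loop these products implement under the hood. It assumes an OpenAI-style chat client and a local DuckDB database; the schema, model name, database file and prompt are all placeholders, not any vendor’s actual implementation.

```python
# A minimal text-to-SQL sketch. The schema, model name and database file are
# placeholders, not any vendor's actual implementation.
import duckdb
from openai import OpenAI  # assumes the OpenAI Python SDK and an API key in the environment

SCHEMA = "orders(order_id BIGINT, customer_id BIGINT, amount DOUBLE, ordered_at TIMESTAMP)"
client = OpenAI()

def question_to_sql(question: str) -> str:
    """Ask the LLM to translate a business question into a single SQL statement."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": f"You translate questions into DuckDB SQL. Schema: {SCHEMA}. "
                        "Reply with one SQL statement and nothing else."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content.strip()

con = duckdb.connect("analytics.db")  # hypothetical local database containing the orders table
sql = question_to_sql("What was the average order amount per customer last month?")
print(sql)
print(con.sql(sql).df())  # a BI engine would render this result as a chart instead of printing it
```

The interesting work in the commercial products is everything around this loop: semantic models, guardrails and the iterative context building described above.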
Snowflake Copilot, the Amazon Q AI assistant, Databricks AI/BI, Qlik Answers, Power BI AI features in Microsoft Fabric, Salesforce’s Tableau Pulse and Agent, and many more embedded AI features and products will change how we do BI.
This is massively powerful, and I expect it to overtake the need for traditional BI dashboards. A much-needed relief for data teams who currently maintain lots of unused dashboards 🥳
Open table formats accelerate Lakehouse adoption
Through 2023, and 2024 in particular, we’ve seen an increase in innovation, discussion and adoption of open table formats like Apache Iceberg and Delta Lake. I called out this trend at the end of 2022 (see the 2023 recap above), and in 2024 the technology reached an initial maturity that allows companies to begin moving from testing to production. Although Databricks’ customers have been using Delta Lake in production for a few years now, Iceberg is quickly growing in popularity and maturity, with more deployments popping up every day. Upsolver recently helped a customer deploy their first 1PB+ Lakehouse using Iceberg. Pinterest, CrowdStrike, Netflix, Bloomberg and many more companies are running large-scale lakes using Iceberg.
Open table formats (OTFs) help companies minimize vendor lock-in, let more tools access the same high-quality data without duplication, and optimize storage to improve performance, while reducing costs by shifting ingestion, processing and query workloads to cost-efficient alternatives.
Adoption of OTFs will be driven by three main capabilities:
Managed storage: although working with OTFs in an object store like S3 is fairly simple, incumbents like AWS, Snowflake, Databricks and GCP are coming out with managed Iceberg offerings such as AWS S3 Tables, Snowflake managed Iceberg tables, Databricks managed Delta tables and BigQuery managed tables for Apache Iceberg. These managed storage offerings are intended to keep users and compute-heavy workloads on the platform. There are limitations to these managed tables, mostly around lock-in and ubiquitous access, but they are simpler to get started with. I expect in 2025 we’ll see the race for the best managed table offering heat up.
Efficient optimizations: optimizing OTFs is a key capability that delivers faster queries and efficient storage, which reduces costs. Although every OTF comes with its own built-in optimization capabilities, and engines can easily execute them on behalf of users, it still takes experience and expertise to get right and to scale (a rough maintenance sketch follows this list). I expect in 2025 we’ll see a new race emerge around providing the best and most impactful OTF optimization and maintenance offering.
Client interoperability: OTFs offer a lot of flexibility, but if the clients and engines users rely on every day don’t fully support them, much of their potential impact is lost. We’ve already seen client adoption expand beyond the popular big data tools like Spark and Flink into Polars, DuckDB, Pandas and more (see the client sketch below). I expect in 2025 more OSS and commercial tools, programming languages and frameworks will integrate OTFs to make it easy for users to leverage them.
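On the optimizations point, here is a rough sketch of what routine Iceberg maintenance looks like today using Spark’s built-in procedures. The catalog name, table name and bucket are placeholders, and the session configuration will differ depending on your catalog and object store.

```python
# A rough sketch of routine Iceberg table maintenance via Spark procedures.
# Catalog name, table name and S3 paths are placeholders; the Iceberg runtime
# package version must match your Spark/Scala versions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-maintenance")
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")  # placeholder; often a REST or Glue catalog in practice
    .config("spark.sql.catalog.lake.warehouse", "s3a://example-bucket/warehouse")  # hypothetical bucket
    .getOrCreate()
)

# Compact many small files into fewer, larger ones for faster scans
spark.sql("CALL lake.system.rewrite_data_files(table => 'db.events')")

# Expire snapshots older than the retention window to reclaim storage
spark.sql("CALL lake.system.expire_snapshots(table => 'db.events', older_than => TIMESTAMP '2025-01-01 00:00:00')")

# Remove files no longer referenced by any table metadata
spark.sql("CALL lake.system.remove_orphan_files(table => 'db.events')")
```

Each of these calls is exactly the kind of chore a managed optimization service can run on a schedule, which is the race described above.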
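And on client interoperability, the same table is already reachable from lightweight, single-node tools. The sketch below uses PyIceberg and DuckDB’s iceberg extension; the catalog endpoint, warehouse, table name and paths are hypothetical.

```python
# A sketch of reading one Iceberg table from two lightweight clients.
# Catalog URI, warehouse, table name and S3 paths are hypothetical.
import duckdb
from pyiceberg.catalog import load_catalog

# PyIceberg: load the table through an Iceberg REST catalog and scan it into pandas
catalog = load_catalog(
    "lake",
    **{
        "type": "rest",
        "uri": "https://catalog.example.com",          # placeholder endpoint
        "warehouse": "s3://example-bucket/warehouse",  # placeholder warehouse location
    },
)
events = catalog.load_table("db.events")
df = events.scan(row_filter="user_id > 0").to_pandas()  # predicates are pruned using table metadata

# DuckDB: scan the same table's files directly with the iceberg extension
# (S3 credential setup via the httpfs extension is omitted here)
con = duckdb.connect()
con.sql("INSTALL iceberg; LOAD iceberg;")
counts = con.sql(
    "SELECT count(*) AS row_count FROM iceberg_scan('s3://example-bucket/warehouse/db/events')"
).df()
```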
The race for open table supremacy will continue beyond 2025, but I believe it will be settled simply by engines supporting the top two or three popular formats natively. I’m a big believer in Iceberg, as many of you are as well, and I expect it to surpass the other formats to become the default choice for most lake deployments and tools (OSS and commercial). The ability to move data between formats isn’t all that complex, and the industry will continue to innovate and provide tools, like Apache XTable (incubating), to minimize CIO/CDO concerns about making an early choice.
Catalogs and governance are making a comeback
With the meteoric rise of open table formats, the metastore, or technical catalog, is experiencing a renaissance. Apache Iceberg introduced the Iceberg REST spec, a standard set of APIs for engines to interact with a modern technical catalog, which simplifies engine-catalog interoperability. Traditionally, this interface was convoluted and required lots of heavy Java dependencies (Apache Thrift or JDBC wrappers for the Hive Metastore), so few vendors and open source contributors were willing to invest in its future.
With OTFs, the catalog is crucial for enabling the following advanced features:
Support for multiple engines writing to the same table concurrently
Multi-table transactions
Server-side query planning
Access control enforcement (RBAC)
Although most folks also include table management and optimization as a catalog function, it doesn’t have to be done there; it’s simply convenient to handle in a single place.
Given the importance of catalogs in the OTF era, and with storage lock-in via proprietary formats gone, warehouse vendors are racing to build and own the catalog. The Iceberg REST spec gives them a standard way to do this, leveraging both open source and commercial go-to-market approaches.
Databricks offers its customers Unity Catalog and recently open sourced it. Snowflake open sourced the Apache Polaris catalog and built it into their platform as a managed offering. AWS added Iceberg REST support to its well-established Glue Data Catalog, and Dremio extended Nessie with complete Iceberg support and released its own commercial version.
But there are also a number of new alternatives coming to market, like Lakekeeper.
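From a client’s point of view, the REST spec is what keeps this race healthy: the same few lines of code work against Polaris, Unity, Glue, Nessie or Lakekeeper, with only the endpoint and credentials changing. Below is a minimal PyIceberg sketch; the URI, credentials, warehouse name and table are all placeholders.

```python
# A minimal sketch of catalog-level DDL over the Iceberg REST spec.
# The endpoint, credentials, warehouse name and table are all placeholders.
import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "lake",
    **{
        "type": "rest",
        "uri": "https://catalog.example.com/api/catalog",  # a Polaris, Unity, Glue, Nessie or Lakekeeper endpoint
        "credential": "<client-id>:<client-secret>",       # placeholder OAuth client credentials
        "warehouse": "analytics",
    },
)

# Namespace and table creation go through the catalog's REST API,
# so every engine that talks to the catalog sees the change immediately.
catalog.create_namespace("marketing")  # raises if the namespace already exists
catalog.create_table(
    "marketing.campaign_clicks",
    schema=pa.schema([
        ("campaign_id", pa.int64()),
        ("user_id", pa.int64()),
        ("clicked_at", pa.timestamp("us")),
    ]),
)
print(catalog.list_tables("marketing"))
```

The catalog, not the engine, then becomes the natural place to enforce access controls and record audit events, which is exactly why governance is the next battleground.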
One of the primary battleground capabilities of the new catalog era is governance. Features like access controls, auditing, lineage and data discoverability are critical for the adoption of these new catalogs. However, they also compete with incumbent business catalogs, leading to an unexpected disruption in this fairly rosy market.
I expect in 2025 we’ll see an increase in OTF catalog adoption and the beginning of disruption for existing business catalog solutions. Some vendors will scramble to integrate an OTF catalog with their business offerings, others will try to build their own, and some will simply ignore it and continue trying to differentiate on business features alone. However, a successful catalog product will include both, delivering cross-engine and cross-platform interoperability and discoverability, securely.
Consolidation is inevitable.
Object store as the new DB
For years, object stores have served as a low cost, highly scalable and fairly performant storage for distributed systems. More recently, performance, consistency and interoperability have improved to a point where more and more systems are leveraging these object stores to persist data in a structured manner.
Open Table Formats like Iceberg are used to store columnar data for analytics and AI.
Streaming engines like Kafka and Redpanda use object stores for tiered storage and to persist streams into queryable Iceberg tables.
Upsolver, Estuary and other data ingestion and processing engines use object stores to persist intermediate results and processing state for resilient scaling.
Traditional databases like Postgres and data warehouses like Snowflake, Redshift and ClickHouse have all built “native” tables backed by object stores, using either proprietary formats or OTFs. These tables deliver near-native performance at a lower cost (a small sketch of the pattern follows).
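As a tiny illustration of the pattern, lightweight engines can already treat the object store as their primary storage. The sketch below uses DuckDB’s httpfs extension against a hypothetical S3 bucket; region, credentials and paths are placeholders.

```python
# A tiny illustration of treating an object store as primary storage using
# DuckDB's httpfs extension. Bucket names, paths and region are placeholders.
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")
con.sql("SET s3_region = 'us-east-1';")  # access keys can be supplied via s3_access_key_id / s3_secret_access_key

# Read: query Parquet files sitting in S3 as if they were local tables
print(con.sql("""
    SELECT user_id, count(*) AS events
    FROM read_parquet('s3://example-bucket/events/*.parquet')
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 10
""").df())

# Write: persist results straight back to the object store, no local disk required
con.sql("""
    COPY (
        SELECT user_id, count(*) AS events
        FROM read_parquet('s3://example-bucket/events/*.parquet')
        GROUP BY user_id
    ) TO 's3://example-bucket/reports/user_event_counts.parquet' (FORMAT parquet)
""")
```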
I expect in 2025 we’ll see more data tools, databases and data warehouses leverage object stores to persist data that once required costly locally attached drives. Ease of use, openness, lower cost and ubiquitous access make this a very appealing proposition for users and vendors alike.
Final thoughts
2025 will be the year we see an acceleration in the adoption of open table formats and AI to make data more accessible and usable. We’ll revisit how data is discovered, shared and governed in the modern OTF era. We’ll bring together the small-data and big-data folks, and their tools of choice, to collaborate on the same data. We’ll help companies do more with data, faster and more cheaply than ever before.
The future is bright, and even though there are lots of competing technologies, methods and solutions, in the end, when the dust settles, users benefit. And as data engineers and leaders, we continue to learn and grow.
Good luck!

