There are no more hard problems in data
Why the data industry is stagnating while demand for data is growing
Ok, maybe that’s not completely true - there are lots of hard problems to solve in data. However, there aren’t enough hard problems to go around among all the open source projects, hyper-scalers, tool vendors and startups entering the space looking for the next $1B+ exit. We solved the big ones; now we’re polishing solutions year after year.
Let me explain, starting from a fairly narrow point of view…
A 1,000 ft view
I recently attended the 2025 Iceberg Summit together with 500+ data enthusiasts, listened to lots of great sessions (including mine on table optimizations 😉), mingled with some of the brightest and nicest people and chatted with vendors offering Iceberg features and integrations.
Although a narrow lens, one thing was clear - we’re all trying to solve the exact same problems in much the same ways. There were several sessions on Iceberg catalogs, several on OSS projects that automate table optimizations, several on managing Iceberg tables at scale, and more. Nothing jumped out at me as earth-shattering, unique or truly difficult.
Out of the 20+ vendors on display, I couldn’t make out a clear differentiation between their offerings. Everyone offers some form of Iceberg read/write, catalog integration and table optimization features. Cool, but expected.
This is trend-following behavior without clear differentiation on customer value - just a #TechMeToo.
To me, this is a clear indication that the problem isn’t big or hard enough. It’s enough for a few key vendors - namely Databricks, Snowflake and more recently Microsoft - to drive, with the rest following.
Don’t get me wrong, Lakehouses and Apache Iceberg in particular have a very bright future ahead and I’m very bullish on both - as features. I just don’t know if that’s enough to sustain an entire industry or feed this many hungry startups.
A 10,000 ft view
Take a step back from Iceberg, a key technology driving a major change in how we manage and share data at scale, and look at the broader data landscape.
Query engines compete purely on performance and cost - a tech pissing contest for the ages. Ok, some are easier to use than others, offer different deployment models, or include a few niceties, but those aren’t hard problems to solve… they’re mostly solved already.
Data integration and pipeline tools compete on the number of connectors, pipeline transformation features and cost. Some have 400 connectors and others have 1,000. Some let you write Jinja + SQL, others YAML or Python, and others are purely GUI-driven. Again, these are problems we solved 20 years ago and have been polishing ever since.
BI tools compete on visualizations, performance and cost. Many BI vendors have attempted to deepen their moat by moving more business logic into their product (data models, semantic layers, data annotations, etc.), but the challenges around dashboarding and reporting haven’t changed, worsened or become so insurmountable that they require new technical innovation to solve. Especially now with GenAI, BI is being wholesale disrupted, and I expect that in the next 2-3 years it will be completely different from what we know today.
In my 2023 predictions post I called out that streaming technologies would break out of their decades-long rut. In 2024 we’ve seen an increase in the number of companies bringing stream processing solutions to market: Confluent with Immerok (Flink), BigQuery/Snowflake/Redshift/Databricks stream ingestion + dynamic tables, Upsolver (now part of Qlik), RisingWave, Estuary, and of course we can’t forget our good friends Apache Flink and Spark and all of their vendor-managed variations that have seen increased adoption. Again, if you look at these tools, they aren’t solving any new or complex problems we’ve struggled to address. They mostly offer a simpler developer experience, hot storage decoupled from compute and some form of bring-your-own-compute. For example, some offer a familiar Postgres-like experience and others let developers build streaming materialized views. Cool. Neat. But so what?
Customers and users struggle to understand the material difference between the many solutions available on the market. Once they settle on one, they realize they aren’t seeing any major value that justifies the investment. Although many vendors claim to, they don’t actually accelerate user productivity or move the business forward in a material way. Users get features that solve incremental, albeit annoying, problems - features that maintain the status quo rather than change it.
The value is not the tool, it’s bigger than that.
A 30,000 ft view
Data, and our industry with it, is foundational to much of the innovation we’re enjoying today and will enjoy in the future, both in our personal and professional lives.
The data industry is a utility in service of insights and intelligent decision making.
Understanding that the data industry is a utility serving bigger, bolder and more disruptive initiatives can hopefully make it easier to find peace with the fact that the big groundbreaking problems are behind us. Today, we’re mostly optimizing, improving and incrementally innovating in ways that make data more accessible, performant and secure - increasingly to be consumed by AI agents rather than humans. This work is by no means simple or unworthy of our attention and shouldn’t discourage the community from pushing forward.
Because of this plateau, we’re seeing a broader shift towards building platforms and ecosystems rather than bespoke solutions.
The tech powerhouses of our industry like Microsoft, AWS, Google, Databricks and Snowflake are moving faster than ever, tackling challenges and industry trends head-on to enable ubiquitous and secure access to data that moves us closer to the next frontier - AI.
What used to take an incumbent 5 years to bring to market now takes 1 year. This cuts a startup’s nimbleness advantage to 4-5 months at most. So, if you’re following a trend a hyper-scaler or incumbent created, without clear differentiation you’re guaranteed to lose - no matter how cool your features are, how much faster you think you are, or how many meetups and conferences you attend.
The incumbents realized that the power is not in data gravity (customers hate the inherent lock-in); it’s in the platform, with its integrated capabilities, partner ecosystem and an excellent user (and developer) experience.
But wait, isn’t this vertical integration that leads to further lock-in?
Well, yes and no. Let me explain…
Yes - vendors, through native features, acquisitions and tech partnerships, are building vertically integrated solutions that make using data to drive AI and other advanced use cases as good as possible within the walled garden of their platform.
No - vendors, pushed by their customers, are embracing decoupled designs based on open standards and industry-adopted APIs that allow customers to plug and play components seamlessly. Of course, it’s not always that simple, but standard formats, APIs and vocal communities are forcing functions for vendors to agree and interoperate.
Object stores (with Amazon S3-compatible APIs) standardized how data should be stored, secured and managed.
Apache Iceberg table format standardized how engines write, read and collaborate on tabular data in object stores.
Apache Iceberg REST catalog spec standardized how engines find and securely access datasets in object stores using the Apache Iceberg table format (eventually supporting additional formats) - see the short sketch after this list for how this plug-and-play access looks in practice.
Apache DataFusion is a standard query engine implementation providing many of the building blocks needed to plan, distribute, accelerate and access data in multiple formats, allowing others to build on the shoulders of giants.
Apache Arrow provides a standard framework (data structures and APIs) for managing columnar data in memory and sharing it between applications and processes.
dbt Core ushered in a standard way, using Jinja templates and SQL, to manage data modeling, processing and analysis logic independent of the target engine executing the logic.
and so on…
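To make the interoperability point concrete, here’s a minimal sketch of what that decoupled stack can look like in practice, using PyIceberg as one possible client. The catalog URI, warehouse bucket and table name below are hypothetical placeholders - the point is that any engine implementing the same open specs could sit on either side of this exchange.

```python
# Minimal sketch: open standards let a client talk to any Iceberg REST catalog,
# read tables from any S3-compatible object store, and hand results to any
# downstream tool as Arrow. Endpoint, bucket and table names are hypothetical.
from pyiceberg.catalog import load_catalog

# Any catalog that implements the Iceberg REST spec works here -
# the spec is the contract, not the vendor.
catalog = load_catalog(
    "analytics",
    **{
        "type": "rest",
        "uri": "https://catalog.example.com",          # hypothetical REST catalog endpoint
        "warehouse": "s3://example-bucket/warehouse",  # any S3-compatible object store
    },
)

# The table could have been written by Spark, Flink, Trino or any other engine
# that speaks the Iceberg table format.
table = catalog.load_table("sales.orders")

# Scan results come back as Apache Arrow - the shared in-memory columnar format
# that query engines, dataframe libraries and BI tools can all consume.
arrow_table = table.scan(row_filter="order_date >= '2025-01-01'").to_arrow()
print(arrow_table.schema)
```

Swap the catalog, the object store or the engine that wrote the table, and the client code above stays the same - that interchangeability is the real product of these standards.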
Where do we go from here
There are few hard problems left in data, and the ones that remain are mostly incremental - every year query engines get 5% faster or storage gets 5% cheaper.
The real challenge is in building an open data platform and supporting ecosystem that are robust, flexible, unified, collaborative and cost-effective. Individual tools and services are features of the bigger platform, akin to Facebook cherry-picking successful marketplace apps and building them into its platform.
The shift to platform thinking is evident today as Databricks and Snowflake quickly build out their platforms leveraging open source building blocks (Iceberg tables, Iceberg REST catalog, object stores, SQL/Python code, etc.). Hyperscalers are following by bolstering their offerings with OSS components (Iceberg, catalogs, etc.), with GCP building around BigQuery and Microsoft around Fabric. AWS’s strategy still feels all over the place, but it seems they want to consolidate around Amazon SageMaker, with SageMaker Lakehouse and deep investment in Iceberg.
Startups are leaning in as well, leveraging the open-source approach. By building in the open they can create a community and ecosystem of tools that make their solution more attractive and better able to compete with incumbent platforms. For example, dbt was able to quickly build a massive community that allowed it to remain relevant even as other platforms added similar capabilities.
I hope that as an industry we focus our attention on new challenges - the ones that move us all forward and unlock new whitespace - rather than competing on whose engine is 5% faster or 2% cheaper. That whitespace will hold new and impactful problems to solve, problems that create meaningful differentiation and customer value.
In the end, success is derived from good data! Getting good data into more hands requires new ways of thinking.
Good luck!