Introduction
2024 has proven to be a big year for Apache Iceberg, with many cloud service providers such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Snowflake announcing new Iceberg features and support. Momentum has clearly been growing behind Iceberg over the past year, so now’s the perfect time for a quick recap. If you are considering your Data Lake options for 2025, this blog will provide an overview of Apache Iceberg and 2024’s biggest announcements.
What is Apache Iceberg?
Apache Iceberg was originally conceived at Netflix in 2017, in an effort to address shortcomings of Apache Hive (a pre-existing open-source data warehouse system). The foundational goals of the Iceberg project were as follows:
- Ensure the correctness of the data and support ACID transactions.
- Improve performance by enabling finer-grained operations at the individual file level, allowing more optimal writes.
- Simplify and abstract general operation and maintenance of tables.
Iceberg, like Apache Hudi (which came before it) and Delta Lake (which came after), can be categorised as an ‘open table’ format. These solutions provide an open-source metadata layer on top of a given Data Lake, which in turn enables Data Lake objects to be queried with the performance and reliability of Data Warehousing solutions, as the sketch below illustrates.
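To make this concrete, here’s a minimal sketch of the pattern, assuming PySpark with the Iceberg Spark runtime and DuckDB with its Iceberg extension are available. The catalog name, warehouse path, table name, and package version below are illustrative placeholders rather than anything prescribed by the projects:

```python
from pyspark.sql import SparkSession
import duckdb

# Illustrative local stand-in for an object store such as S3 or GCS.
WAREHOUSE = "/tmp/iceberg-warehouse"

# Engine 1 (Apache Spark): create and populate an Iceberg table.
# "lake" is a hypothetical catalog name; pin an iceberg-spark-runtime
# version that matches your Spark and Scala versions.
spark = (
    SparkSession.builder
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", WAREHOUSE)
    .getOrCreate()
)
spark.sql("CREATE TABLE IF NOT EXISTS lake.db.events (id BIGINT, msg STRING) USING iceberg")
spark.sql("INSERT INTO lake.db.events VALUES (1, 'hello iceberg')")

# Engine 2 (DuckDB): query the same table through Iceberg's open
# metadata files, with no Spark involved.
duckdb.install_extension("iceberg")
duckdb.load_extension("iceberg")
duckdb.sql(
    f"SELECT * FROM iceberg_scan('{WAREHOUSE}/db/events', allow_moved_paths = true)"
).show()
```

The second half is the point: because the table’s metadata lives in open files rather than inside a proprietary engine, a completely different engine can query the same data in place.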
As Lakehouse adoption has grown, the term ‘Open Lakehouse’ has gained popularity. While exact definitions can vary, a common value proposition is that by maintaining an open-source metadata layer (e.g. Apache Iceberg), data teams can avoid vendor lock-in while taking advantage of both open-source and proprietary engines. For reference, here’s a visualization:
2024 announcements
It’s been quite a significant year for Iceberg, with many announcements from major cloud service providers. Here’s a recap of some of the biggest:
April:
Starburst announces ‘Icehouse’ for Near Real-Time Analytics on the Open Data Lakehouse
- “With the Galaxy Icehouse, customers can benefit from the scalability, performance, and cost-effectiveness of a combined Trino and Iceberg architecture (Icehouse) without the burden and cost of building and maintaining a custom solution themselves.”
May:
Snowflake and Microsoft announce expansion of their partnership and Iceberg support
- “Microsoft Fabric OneLake will support Apache Iceberg and bi-directional data access between Snowflake and Fabric … ensuring more efficient and flexible data management”
June:
Snowflake announces General Availability of Apache Iceberg Tables
- “Iceberg tables for Snowflake combine the performance and query semantics of regular Snowflake tables with external cloud storage that you manage.”
Snowflake Unveils Polaris Catalog
- “a vendor-neutral, open catalog implementation for Apache Iceberg … Polaris Catalog will be open sourced in the next 90 days to provide enterprises and the entire Iceberg community with new levels of choice, flexibility, and control over their data, with full enterprise security and Apache Iceberg interoperability with Amazon Web Services (AWS), Confluent, Dremio, Google Cloud, Microsoft Azure, Salesforce, and more.”
Databricks Agrees to Acquire Tabular, the Company Founded by the Original Creators of Apache Iceberg
- “By bringing together the original creators of Apache Iceberg and Linux Foundation Delta Lake, the two leading open source lakehouse formats, Databricks will lead the way with data compatibility so that organizations are no longer limited by which of these formats their data is in.”
Databricks Open Sources Unity Catalog
- “Unity Catalog OSS offers a universal interface that supports any data format and compute engine, including the ability to read tables with Delta Lake, Apache Iceberg, and Apache Hudi clients via Delta Lake UniForm.”
October:
GCP announces BigQuery tables for Apache Iceberg
- “A fully managed, Apache Iceberg-compatible storage engine from BigQuery with features such as autonomous storage optimizations, clustering, and high-throughput streaming ingestion.”
Dremio Unveils Industry’s First Hybrid Data Catalog for Apache Iceberg
- “Dremio, the unified lakehouse platform for self-service analytics and AI, announced that its Data Catalog for Apache Iceberg now supports all deployment options—on-prem, cloud, and hybrid—making Dremio the only lakehouse provider to deliver full architecture flexibility.”
December:
AWS announces Amazon S3 Tables
- “Amazon S3 Tables optimize tabular data storage (like transactions and sensor readings) in Apache Iceberg, enabling high-performance, low-cost queries using Athena, EMR, and Spark.”
AWS announces Amazon SageMaker Lakehouse
- “Unifying data silos, Amazon SageMaker Lakehouse seamlessly integrates S3 data lakes and Redshift warehouses, enabling unified analytics and AI/ML on a single data copy through open Apache Iceberg APIs and fine-grained access controls.”
Comparing with other Open Lakehouse alternatives
Aside from Iceberg, the main lakehouse alternatives are Delta Lake (created by Databricks) and Apache Hudi. As these are all open-source projects, it’s an interesting exercise to compare their respective GitHub repositories. Here are the main repository metrics, as of 22 December 2024:
| Metric | apache/hudi | apache/iceberg | delta-io/delta |
| --- | --- | --- | --- |
| Stars | 5,549 | 6,466 | 7,730 |
| Commits | 9,219 | 6,598 | 3,756 |
| Issues | 3,225 | 3,486 | 1,471 |
| Forks | 2,438 | 2,289 | 1,733 |
| PR Creators | 633 | 709 | 341 |
| Repository Creation | 2016-12-14 | 2018-11-19 | 2019-04-22 |
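For anyone wanting to refresh these figures, the headline numbers are available from GitHub’s public REST API. Here’s a minimal sketch using the standard /repos endpoint (the requests library is assumed; commit and PR-creator counts are omitted as they require paginating additional endpoints):

```python
import requests

REPOS = ["apache/hudi", "apache/iceberg", "delta-io/delta"]

for repo in REPOS:
    # GET /repos/{owner}/{repo} returns the headline metrics in one call.
    resp = requests.get(f"https://api.github.com/repos/{repo}", timeout=10)
    resp.raise_for_status()
    data = resp.json()
    print(f"{repo}: stars={data['stargazers_count']:,}, "
          f"forks={data['forks_count']:,}, "
          f"open issues+PRs={data['open_issues_count']:,}, "
          f"created={data['created_at'][:10]}")
```

Note that the API’s open_issues_count bundles open issues and pull requests together, so it won’t match the table above exactly; totals like commits and unique PR creators take a little more work to assemble.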
The following plot reflects the level of community interest, as measured by cumulative GitHub stars for each repo:
Ultimately, when assessing the health of open-source projects, GitHub metrics should be viewed as one part of a bigger picture. One interesting insight from these metrics, though, is that the Iceberg project has more than twice as many unique PR (Pull Request) creators as Delta Lake (709 versus 341).
While Delta Lake was open sourced as a Linux Foundation project in 2019, its source code is supported by a smaller community of contributors, and Databricks employees still play a strong role (as of 22 December 2024, seven of the top twenty contributors to the GitHub repo are Databricks employee accounts).
The Hudi repository shows a level of community engagement comparable to Iceberg’s, but it has existed for two years longer, and it is Iceberg whose momentum has grown in recent years.
Conclusion
Given the significance of the year’s announcements, 2024 can be seen as a breakout year for Apache Iceberg. The level of engagement and effort from major cloud service providers represents an endorsement spanning much of the industry, and a compelling reason for data teams to assess Iceberg’s suitability for their own requirements.
In recent years, vendor lock-in (and avoiding it) has become a recurring theme in the marketing of new data products and solutions. In the case of Iceberg, the freedom to easily adopt and replace compute engines has been a central tenet of many value propositions.
As with any competitive market, there are of course layers to the Iceberg narrative. Beyond the announcements from the major cloud vendors (AWS, Azure, GCP), Snowflake’s unveiling of its Polaris Catalog and Databricks’ acquisition of Tabular were two of the year’s biggest events.
If you spend time reading online discussions and assessments of these events, you’ll likely encounter mixed opinions, ranging from enthusiastic celebration of open-source engagement to less-enthusiastic suggestions of ulterior motives. Regardless though, this level of activity presents a promising future for Apache Iceberg.
With the level of hype surrounding Iceberg, it’s easy to lose sight of the alternative options. Starburst (the company behind most of the Trino engine’s ongoing development) published a blog in June declaring that “Apache Iceberg emerged last week triumphant, having won the race to become king of the data lakehouse”.
Of course there’s more nuance to this, and in a separate blog published three months later, Starburst ranks Iceberg ahead of Delta Lake for ‘Multiple Engine Support’ but Delta Lake ahead for ‘Spark integration’. Ultimately there is no silver bullet, no one-size-fits-all solution.
Your organization’s decision should be dictated by your own circumstances: what you already have in place, and where your future plans may take you. If you are considering your data strategy for 2025 and beyond, please reach out to ProCogia for a free consultation.