Understanding Parquet, Iceberg and Data Lakehouses at a Broad Level
In the past few years, I've heard a lot about Avro, Parquet, ORC, Arrow and Feather, but I also keep hearing about Iceberg and Delta Lake. As a "database person", I’ve been struggling to understand all of these different things, and how they relate to Data Lakes and Data Lakehouses (and what exactly are these?). So, I’ve decided to study them, and consolidate my knowledge in writing.
To do this, I had to review quite a bit of content over the last few weeks. I’ll try to link to most of what I’ve read and watched at the bottom of this article. However, this blog post won’t be 100% comprehensive, or even the best starting point for most people. That’s because I’m writing this for myself. I find that the best way to learn new things is by "forcing myself" to re-explain them to others. Anyhow, I hope this is useful to some people!
How Data Gets Stored (in Files and in Memory)
Let’s start with understanding the different file formats for efficient data access and storage. While some systems mainly operate on closed data formats (e.g., SingleStoreDB, where I work, as well as Teradata, SQL Server and Snowflake), others operate on open data formats. And these days, most of them, like Snowflake, Redshift, Athena or Hive, support reading and writing both open and closed storage formats (although there are varying limitations around their read/write support for open formats).
These are some of the main open source file formats for storing data efficiently:
- Apache Avro — Binary, Rowstore, Files
- Apache Parquet — Binary, Columnstore, Files
- Apache ORC — Binary, Columnstore, Files
- Apache Arrow — Binary, Columnstore, In-Memory
- The best way to store Apache Arrow dataframes in files on disk is with Feather. However, it’s also possible to convert them to Apache Parquet and other formats (there’s a short sketch of both right after this list).
- Protocol Buffers (Protobuf) — Binary, N/A, In-Memory
- Protobuf is a language-neutral data structure definition language. Different implementations will store the data differently (it could be row-by-row, i.e., rowstore, or columnstore).
- (CSV — Text, Rowstore/Columnstore, Files)
- CSV is very different from the others on this list, given that its structure is extremely simple. There is a "CSV Schema" language definition, but it doesn’t seem to be very popular.
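To make the Arrow/Feather point above a bit more concrete, here is a minimal sketch using the pyarrow library (the file names are made up for illustration). It builds a columnar Arrow table in memory and then persists the same table both as a Feather file and as a Parquet file:

import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq

# Build an Arrow table (columnar, in-memory)
table = pa.table({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
})

# Persist the same in-memory table in two different file formats
feather.write_feather(table, "sample.feather")  # Feather on disk
pq.write_table(table, "sample.parquet")         # converted to Parquet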
But what do these formats actually imply? Each format is a specification for how the data should be arranged in its actual binary layout. There is a really good blog post by Vladimir Sivcevic about Avro vs Parquet vs ORC which explains the layout ("structure") of each file type; I highly recommend reading it. Afterwards, one can understand where the metadata and the actual data live within the file’s contents, and what the pros/cons of each format are. For instance, Parquet supports really good compression, whereas Avro is better suited for reading specific row blocks (it’s a rowstore). However, both support an evolving schema, i.e., the ability to modify the schema of new data without having to rewrite all existing data. In addition, both support file splitting, which is essential for parallel processing of data.
So now we’re starting to understand the most common Big Data file formats. These are really interesting, and you can spend a lot of time studying what makes them so powerful. The main thing to remember, however, is that because they’re open, you can work with these formats from a variety of programming languages and tools. As an example, Apache Parquet’s repository contains the specification for the actual file format, as well as a reference Java implementation. However, there are libraries for pretty much every language that allow you to read from and write to Parquet files.
"""
The following is a snippet that writes some sample data to a
Parquet file on local disk. It uses Pandas to achieve that.
"""
>>> import pandas as pd
>>>
>>> # Create a Pandas DataFrame with sample data
>>> data = {
... 'Name': ['Alice', 'Bob', 'Charlie'],
... 'Age': [25, 30, 35],
... 'City': ['New York', 'San Francisco', 'Los Angeles']
... }
>>>
>>> df = pd.DataFrame(data)
>>>
>>> # Write the data frame to a Parquet file
>>> file_path = './sample_table.parquet'
>>> df.to_parquet(file_path, index=False)
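And to close the loop, here is a small sketch of reading that same file back and peeking at the metadata Parquet stores alongside the data (it reuses the file name from the snippet above and assumes pyarrow is installed):

import pandas as pd
import pyarrow.parquet as pq

# Read the table back, selecting only some columns
# (column pruning is one of the benefits of a columnstore format)
df = pd.read_parquet('./sample_table.parquet', columns=['Name', 'Age'])

# Inspect the file-level metadata stored inside the Parquet file
parquet_file = pq.ParquetFile('./sample_table.parquet')
print(parquet_file.metadata)      # row count, row groups, sizes, ...
print(parquet_file.schema_arrow)  # the schema as Arrow types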
Presto/Trino, Spark, DuckDB, Hive, Dremio, Impala, AWS Athena and Apache Drill are just some of the engines that can work with Parquet files. I find this fascinating, and I can see how Parquet is very useful for data teams who want to ensure their data can be accessed by many different tools.
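As a quick illustration of that multi-engine access, the very same file written with pandas above can be queried in place with SQL. This is just a sketch using DuckDB’s Python API, assuming DuckDB is installed locally:

import duckdb

# DuckDB can query Parquet files directly, with no loading step
result = duckdb.sql(
    "SELECT City, AVG(Age) AS avg_age "
    "FROM './sample_table.parquet' "
    "GROUP BY City"
)
print(result)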
But the file formats are just one layer of the overall data stack. They are helpful for data scientists working on their own, or at a small scale. But by themselves, they don’t really give data engines all they need in order to manage large-scale, evolving datasets. And that’s what higher-level storage layers like the Hive Format, Iceberg and Delta Lake are for.
Data Management at Scale — Iceberg and Delta Lake
So, how can we store many different tables, and evolve the individual schemas of all of these tables, and do things like point-in-time reads, all while partitioning the data efficiently for queries? And how can external tools easily read these schemas? And how should statistics about the data be stored in order to build cost-based query optimizers for this data?
In order to do all this, we need a higher-level metadata layer around our data files. The file formats described above only specify the layout of one individual file. So we need a table format of sorts, as well as a schema registry that allows tools to introspect any metadata about our datasets. One example of a schema registry is the Confluent Schema Registry, which supports Avro and Protobuf; it is geared towards streaming data, which tends to be stored row by row (rowstore).
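For a sense of what such a schema looks like, here is a minimal sketch that defines a hypothetical Avro schema in Python and writes a few records with it, using the fastavro library. A schema registry would store and version this same kind of definition on behalf of producers and consumers:

from fastavro import parse_schema, writer

# A hypothetical Avro schema for "user" records
schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}

records = [
    {"name": "Alice", "age": 25},
    {"name": "Bob", "age": 30},
]

# Write the records one by one into an Avro file (Avro is a rowstore)
with open("users.avro", "wb") as out:
    writer(out, parse_schema(schema), records)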
When it comes to columnar data, in 2009, Facebook launched Hive with its own table metadata format (it now supports multiple formats, though). Then, over the years, Netflix developed Iceberg to overcome some of Hive’s limitations, specifically around performance and scalability (I highly recommend going through the "Sources" section below for the historical context here). And then Databricks developed, and eventually open sourced, Delta Lake, which is an alternative to Iceberg. Both Iceberg and Delta Lake use Parquet as their individual file format! Hive, Delta Lake and Iceberg all support some sort of schema registry or metastore. In Hive, the HMS (Hive Metastore) can be backed by pretty much any RDBMS. Of course, Iceberg and Delta Lake have similar concepts (Iceberg Catalogs, Unity Catalog), and these even allow organizations to manage governance for their data (i.e., which users or teams can access which tables).
I’m not going to go into the differences between Delta Lake and Iceberg. They essentially solve the same problems, in different ways. There’s also a lot of debate around how "open" Delta Lake is, given that its development is not as diverse in terms of contributors as Iceberg’s. Meanwhile, support for Iceberg across popular data warehouses and lakehouses is growing very quickly: in the last few years, a lot of different providers announced support for it (Redshift, BigQuery, Snowflake, Athena, Dremio, etc.).
The important thing to remember is that neither Delta Lake nor Iceberg is a query or storage engine in and of itself. Instead, they’re open specifications that allow query engines to do their job. And they enable a lot of features (a short PySpark/Iceberg sketch follows the list below), such as:
- Partitioning
- Specifically, Iceberg supports partition evolution. This means that you can change the partitioning scheme (or shard key) of a table without rewriting all the existing data. This was a huge pain point at Netflix, and it was one of the reasons they created Iceberg.
- Schema Evolution
- Data Compression
- ACID Transactions around schema changes
- Efficient Query Optimization (things like column pruning, predicate pushdown, and statistics collection to accelerate queries)
- Time travel (point in time queries)
- …
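To make a couple of these features concrete, here is a minimal, illustrative PySpark sketch. It assumes a Spark session (available as `spark`) that already has the Iceberg runtime and SQL extensions configured, a catalog named `demo`, and a made-up snapshot id; it shows the shape of the API rather than a copy-paste setup:

# Create an Iceberg table with hidden partitioning on a transform of `ts`
spark.sql("""
    CREATE TABLE demo.db.events (
        id BIGINT,
        ts TIMESTAMP,
        city STRING
    )
    USING iceberg
    PARTITIONED BY (days(ts))
""")

# Partition evolution: add a partition field without rewriting old data
# (requires the Iceberg SQL extensions to be enabled)
spark.sql("ALTER TABLE demo.db.events ADD PARTITION FIELD city")

# Time travel: read the table as of an older snapshot
old_snapshot = (
    spark.read.format("iceberg")
    .option("snapshot-id", 1234567890)  # hypothetical snapshot id
    .load("demo.db.events")
)
old_snapshot.show()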
I studied the Iceberg specification, and I am extremely impressed by its design. It is no wonder that it is becoming extremely popular as a supported format by so many different systems. This will certainly drive products which operate on closed formats to be as performant as possible on their own formats, and to somehow support Iceberg/Delta Lake. However, there's no going back now — Iceberg will become a "checkbox feature", and all database systems will have to support it sooner or later. And can Iceberg eventually replace the closed data formats? This is more complicated, and it probably won't happen due to performance reasons. The truth is that developing a data format for a single query engine, which again only has to operate on that one data format, gives database developers the ability to extract maximum efficiency and innovate more quickly. But who knows!
And so, what are Data Lakes and Data Lakehouses?
Both of these terms have always been a bit confusing to me, but I’m starting to understand them better now.
So, a Data Lake is where companies store their large amounts of data in some raw format like ORC or Parquet, or even CSV files (picture an S3 bucket with lots of these files). This is different from a Data Warehouse, where companies store this data in a more structured way (schematized SQL tables and database schemas). And Data Lakehouses are basically the combination of a Data Lake with the ability to do things like running SQL queries, running batch jobs, and setting up data governance schemes. These can now be built on top of a Data Lake with Iceberg, together with some query engine and other "add-ons", whereas before you’d need to adopt a Data Warehouse (or a more classic DBMS) to do these things.
Data Lakehouses leverage some sort of scalable storage as the location for all the data (HDFS or a cloud blob store like S3), and their query engines are optimized to be very fast on top of that storage. The idea, then, is that data lakehouses are basically the open version of a data warehouse. And finally, some examples of Data Lakehouse products are Databricks and Dremio.
However, I think the line between a Data Warehouse and a Data Lakehouse is becoming more blurry as certain warehouses such as Snowflake and BigQuery add support for open data formats like Iceberg. This will be very interesting to observe over the years to come!
That’s it!
Thank you for sticking around — hopefully the "Sources" section below is also interesting. I certainly learnt a lot while writing this blog post, as I wrote it for myself more than anything else. As such, I don’t expect others to be able to take too much away from this.
Feel free to reach out on Twitter!
Sources
- https://www.vladsiv.com/big-data-file-formats/
- "An Empirical Evaluation of Columnar Storage Formats", X Zeng, Y Hui, J Shen, A Pavlo, W McKinney, H Zhang, 2023
- This is a very good paper which analyzes the different columnar storage formats, and how they stack up against each other in terms of performance.
- https://towardsdatascience.com/demystifying-the-parquet-file-format-13adb0206705
- https://docs.confluent.io/platform/current/schema-registry/index.html
- https://www.oreilly.com/library/view/operationalizing-the-data/9781492049517/ch04.html
- https://www.youtube.com/playlist?list=PL-gIUf9e9CCskP6wP-NKRU9VhofMHYjcd
- I watched a few videos, and found the first two on this playlist to be especially helpful.
- https://www.databricks.com/discover/data-lakes
- https://www.databricks.com/glossary/data-lakehouse
Netflix and Iceberg
There are a few talks that, if watched in order, will give you a really good historical overview of how Apache Iceberg came to be:
- AWS re:Invent 2015 | (BDT303) Running Spark and Presto on the Netflix Big Data Platform
- This talk from 2015 goes over how Netflix used Spark and Presto at the time, with S3 as their "data warehouse" (the term lakehouse didn’t exist back then).
- https://www.slideshare.net/kbajda/presto-summit-2018-09-netflix-iceberg
- These are the slides from "Presto Summit 2018" introducing Iceberg. This is the earliest reference to Iceberg I could find.
- "AWS re:Invent 2023 - Netflix’s journey to an Apache Iceberg–only data lake (NFX306)"
- Excellent talk; I highly recommend watching it to learn about the history of Iceberg and what Netflix’s data ecosystem looks like today.