CMU 15-721 Advanced Database Systems 2024 — Part 2 (Final)
It's been almost a full year since I published Part 1 of this series. I have spent a lot of time going through the CMU 15-721 Advanced Database Systems course from 2024. This meant going through every lecture, every reading assignment, and every piece of homework as well. It was a lot of fun, and I learnt a lot about OLAP systems at scale. However, I put off writing this blog post due to lack of time. This past week, I decided to just get it out there, since it looks like I'm going to be working for one of these big OLAP companies again ☺️
In this Notion link, you can find all of my notes and paper reviews.
For the second half of the course, the focus was on real-life systems such as:
- Databricks
- Snowflake
- DuckDB
- Yellowbrick
- AWS Redshift
(You can find the lecture notes for every single one of these on Notion)
Databricks
Let's start by talking a bit about Databricks and Spark, because why not? ☺️ Databricks started, essentially, as a managed service around Spark. Spark itself is written in Scala and runs on the JVM, which brings a ton of limitations, such as the fact that the garbage collector is built into the runtime. For this reason, Databricks decided to create Photon in C++. It accelerates the execution of query plans over raw/uncurated files in a data lake. The key architectural features of Photon are:
- Shared-Disk / Disaggregated Storage
- Pull-based Vectorized Query Processing
- Precompiled primitives + Expression Fusion
- Shuffle-based distributed query execution
- Sort-merge and hash joins
- Unified query optimizer + adaptive optimizations
The Photon engine uses precompiled operator kernels (primitives), but it does not generate code for individual queries. This is because the software engineering complexity overhead of code generation is, in the opinion of Andy Pavlo and the people at Databricks, not worth it. The problem with systems that do codegen is that they are really hard to debug. With vectorization alone, you can achieve roughly the same performance as with code generation, while the system stays much easier to work on.
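To make that concrete, here is a minimal sketch of what a precompiled, vectorized primitive can look like. This is my own illustration, not Photon's actual code; the function name and the selection-vector convention are assumptions:

```cpp
#include <cstddef>
#include <cstdint>

// One tight, type-specialized loop: compiled ahead of time and reused by
// every query that needs an int64 addition, with no query-specific codegen.
// out[row] = a[row] + b[row] for each row picked by the selection vector,
// so upstream filters can skip rows without copying any data.
void add_int64(const int64_t* a, const int64_t* b, int64_t* out,
               const uint32_t* sel, size_t count) {
  for (size_t i = 0; i < count; i++) {
    const uint32_t row = sel ? sel[i] : static_cast<uint32_t>(i);
    out[row] = a[row] + b[row];
  }
}
```

The engine just picks the right precompiled kernel per operator and data type at runtime, which is what keeps the system debuggable.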
Another thing Photon does is expression fusion. When it sees two intertwined operators, it fuses the two precompiled primitives into one. And since Databricks is a managed service, they have actually looked at which operators are most commonly used together, and they already ship those combinations as fused primitives.
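A toy example of what fusion buys you (again my own illustration, not Databricks code): evaluating `x * 2 + 1` with separate primitives takes two passes over the data, while a fused primitive does it in one:

```cpp
#include <cstddef>
#include <cstdint>

// Unfused: two precompiled primitives, two passes over the data, plus an
// intermediate vector written and re-read between them.
void mul_const_int64(const int64_t* in, int64_t c, int64_t* out, size_t n) {
  for (size_t i = 0; i < n; i++) out[i] = in[i] * c;
}
void add_const_int64(const int64_t* in, int64_t c, int64_t* out, size_t n) {
  for (size_t i = 0; i < n; i++) out[i] = in[i] + c;
}

// Fused: one precompiled primitive, one pass, and the intermediate value
// never leaves a register.
void mul_add_const_int64(const int64_t* in, int64_t mul_c, int64_t add_c,
                         int64_t* out, size_t n) {
  for (size_t i = 0; i < n; i++) out[i] = in[i] * mul_c + add_c;
}
```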
Anyhow, Photon has a lot more optimization techniques than just expression fusion. By now, it is a fully-featured, highly-optimized query engine with tons of mechanisms for cutting down query times.
Snowflake
From my notes on Snowflake's lecture, you can see that they have also created a really advanced query engine with tons of QO (query optimization) techniques (and they are probably doing expression fusion by now as well). This includes things like:
- Precompiled primitives
- Separate table data from metadata
- (metadata in FoundationDB)
- No buffer pool
- PAX columnar storage
- Both for their proprietary storage format and for Iceberg, which they also support
- Sort-merge + hash joins
- Unified query optimizer + adaptive optimization
- Push-based vectorized query processing
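That last bullet is the main engine-level contrast with Photon, which pulls batches top-down. Here is a toy sketch of the two control flows (my own illustration; `Batch` and the operator shapes are stand-ins, not anyone's real API):

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

using Batch = std::vector<int64_t>;  // stand-in for a columnar vector/chunk

// Pull-based (Volcano-style, what Photon does): each operator asks its child
// for the next batch, so control flows top-down through the plan.
struct PullOperator {
  virtual std::optional<Batch> Next() = 0;  // nullopt when input is exhausted
  virtual ~PullOperator() = default;
};

// Push-based (what Snowflake does): each operator hands finished batches to
// its parent via a sink, so control flows bottom-up, which makes it easier
// for a scheduler to decide which pipeline runs next.
struct PushOperator {
  std::function<void(Batch)> sink;  // the parent operator's consume function
  void Emit(Batch batch) { sink(std::move(batch)); }
};
```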
Now, what really makes Snowflake stand out is Snowpark, Snowpipe, and everything else they have built on top of the warehouse. According to Andy Pavlo, the underlying query performance is becoming commoditized across Databricks, Redshift, Yellowbrick, etc.
This is really important! If query performance converges to the very same algorithms across all data lakehouses and data warehouses, the actual business competition is going to be around DevEx, UX, security, and the other features of these managed services.
I won't be going into Yellowbrick and Redshift, but you can read my notes on Notion.
DuckDB
From my lecture notes on DuckDB, you can quickly learn that:
- It is an embedded analytical database
- It was created in 2017, written in C++
- Push-based Vectorized Query Processing
- (used to be pull-based)
- Precompiled Primitives
- Morsel Parallelism + Scheduling (see the sketch after this list)
- PAX Columnar Storage (of course!)
- Sort-Merge + Hash Joins
- Stratified Query Optimizer
- (with support for unnesting arbitrary subqueries, which we studied earlier – the only other system that does this is Umbra)
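Morsel-driven parallelism (the Leis et al. technique) deserves a quick sketch. This is the general idea, not DuckDB's actual scheduler: worker threads repeatedly grab small fixed-size ranges of the input ("morsels") from a shared cursor, so load balances itself without static partitioning:

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

constexpr size_t kMorselSize = 100000;  // rows handed out per grab

// Each worker pulls morsels from a shared atomic cursor until none are left.
void parallel_sum(const std::vector<int64_t>& table, size_t num_threads) {
  std::atomic<size_t> cursor{0};
  std::vector<int64_t> partial(num_threads, 0);
  std::vector<std::thread> workers;
  for (size_t t = 0; t < num_threads; t++) {
    workers.emplace_back([&, t] {
      while (true) {
        const size_t begin = cursor.fetch_add(kMorselSize);
        if (begin >= table.size()) break;  // no morsels left
        const size_t end = std::min(begin + kMorselSize, table.size());
        for (size_t i = begin; i < end; i++) partial[t] += table[i];
      }
    });
  }
  for (auto& w : workers) w.join();
  int64_t total = 0;
  for (int64_t s : partial) total += s;
  std::printf("sum = %lld\n", static_cast<long long>(total));
}
```

A thread that finishes its morsel early just grabs another one, which is exactly what makes this scheme robust to skew.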
What makes DuckDB really, really cool are all the query optimizer techniques, as well as the fact that it can run as an embedded engine. This has made it kind of "take off" within the data science and data engineering community as a really easy-to-use columnar database.
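If you haven't tried the embedded part, here is what it looks like from C++, using DuckDB's documented C++ API (link against libduckdb; the SQL is just a toy example):

```cpp
#include "duckdb.hpp"

int main() {
  duckdb::DuckDB db(nullptr);   // nullptr = transient in-memory database
  duckdb::Connection con(db);   // the whole engine runs inside this process

  con.Query("CREATE TABLE t AS SELECT range AS x FROM range(1000000)");
  auto result = con.Query("SELECT sum(x) FROM t");
  result->Print();              // prints the materialized result as a table
  return 0;
}
```

The same engine is what you get behind `import duckdb` in Python, which is a big part of why it took off with data folks.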
Andy Pavlo also went into MotherDuck. He mentioned they have bridge operators that pass tuple streams between local and remote DuckDB instances. The idea is to use local DuckDB execution for quick things and remote DuckDB execution for things that need scale.
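For what it's worth, here is how I picture a bridge operator – purely my own toy illustration of the idea as described in the lecture, not MotherDuck's actual API:

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

using Batch = std::vector<int64_t>;  // stand-in for a tuple/vector chunk

// Hypothetical bridge: sits in the local plan and forwards its input batches
// to a remote DuckDB instance, where the rest of the plan keeps executing.
struct BridgeOperator {
  std::function<void(const std::string&)> send;  // hypothetical transport hook
  void Consume(const Batch& batch) { send(Serialize(batch)); }

  static std::string Serialize(const Batch& batch) {
    // A real system would use Arrow or some other wire format; this stub
    // just copies the raw bytes.
    return std::string(reinterpret_cast<const char*>(batch.data()),
                       batch.size() * sizeof(int64_t));
  }
};
```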
Summing it all up
To sum it all up, you really have to start by reading Part 1 of this series – there are a lot of important takeaways there. For the second part (more focused on real-life OLAP systems), I'm now too biased to have a proper "takeaway" on these (like I said, I'm about to be hired by one of these companies). Having said that, I'll just mention that:
- Iceberg and, in general, open-source data storage formats are going to take over
- Query engine and query optimization performance will become commoditized
- The real innovation, and the real competition, is in everything else – workflow engines, ingest pipelines, job frameworks, security, pricing, etc.
So, is the course worth it? Absolutely. As I already mentioned in the first part, Andy Pavlo is an extremely entertaining teacher, and I think that makes it much easier to go through the classes. Finally, the quality of the content is really high and you'll really learn a lot – especially if you "force yourself" to write down notes for all the lectures.
As for which of these "big companies" I'm about to join... I'll probably write another blog post on that soon, so stay tuned.
Feel free to follow me on Twitter/X if you want!