CMU 15-721 Advanced Database Systems 2024 — Part 2 (Final)
It's been almost a full year since I published Part 1 of this series. I have spent a lot of time going through the CMU 15-721 Advanced Database Systems course from 2024. This meant going through every lecture, every reading assignment, and every piece of homework as well. It was a lot of fun, and I learnt a lot about OLAP systems at scale. However, I put off writing this blog post due to lack of time. This past week, I decided to just get it out there, since it looks like I'm going to be working for one of these big OLAP companies again ☺️
In this Notion link, you can find all of my notes and paper reviews.
For the second half of the course, the focus was on real-life systems such as:
- Databricks
- Snowflake
- DuckDB
- Yellowbrick
- AWS Redshift
(You can find the lecture notes for every single one of these on Notion)
Databricks
Let's start by talking a bit about Databricks and Spark, because why not? ☺️ Databricks started, essentially, as a managed service around Spark. Spark itself is written in Scala and runs on the JVM, which brings a ton of limitations, such as the fact that the garbage collector is built into the runtime. For this reason, Databricks decided to create Photon in C++. It accelerates the execution of query plans over raw/uncurated files in a data lake. The key architectural features of Photon are:
- Shared-Disk / Disaggregated Storage
- Pull-based Vectorized Query Processing
- Precompiled primitives + Expression Fusion
- Shuffle-based distributed query execution
- Sort-merge and hash joins
- Unified query optimizer + adaptive optimizations
The Photon engine uses precompiled operator kernels (primitives), but it does not generate code for individual queries. This is because the software engineering complexity overhead of code generation is, in the opinion of Andy Pavlo and the people at Databricks, not worth it. The problem with systems that do codegen is that they are really hard to debug. With vectorization alone, you can achieve roughly the same performance as with code generation, while the system stays much easier to work on.
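To make that concrete, here is a minimal sketch of what a precompiled, vectorized primitive can look like. This is my own illustration, not Photon's actual code; the function name and the selection-vector convention are assumptions:

```cpp
#include <cstddef>
#include <cstdint>

// One tight, type-specialized loop: compiled ahead of time and reused by
// every query that needs an int64 addition, with no query-specific codegen.
// out[row] = a[row] + b[row] for each row picked by the selection vector,
// so upstream filters can skip rows without copying any data.
void add_int64(const int64_t* a, const int64_t* b, int64_t* out,
               const uint32_t* sel, size_t count) {
  for (size_t i = 0; i < count; i++) {
    const uint32_t row = sel ? sel[i] : static_cast<uint32_t>(i);
    out[row] = a[row] + b[row];
  }
}
```

The engine just picks the right precompiled kernel per operator and data type at runtime, which is what keeps the system debuggable.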
Another thing Photon does is expression fusion. When it sees two intertwined operators, it fuses the two precompiled primitives into one. And since Databricks is a managed service, they have actually looked at which operators are most commonly used together, and they already ship those combinations as fused primitives.
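A toy example of what fusion buys you (again my own illustration, not Databricks code): evaluating `x * 2 + 1` with separate primitives takes two passes over the data, while a fused primitive does it in one:

```cpp
#include <cstddef>
#include <cstdint>

// Unfused: two precompiled primitives, two passes over the data, plus an
// intermediate vector written and re-read between them.
void mul_const_int64(const int64_t* in, int64_t c, int64_t* out, size_t n) {
  for (size_t i = 0; i < n; i++) out[i] = in[i] * c;
}
void add_const_int64(const int64_t* in, int64_t c, int64_t* out, size_t n) {
  for (size_t i = 0; i < n; i++) out[i] = in[i] + c;
}

// Fused: one precompiled primitive, one pass, and the intermediate value
// never leaves a register.
void mul_add_const_int64(const int64_t* in, int64_t mul_c, int64_t add_c,
                         int64_t* out, size_t n) {
  for (size_t i = 0; i < n; i++) out[i] = in[i] * mul_c + add_c;
}
```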
Anyhow, Photon has a lot more optimization techniques than just expression fusion. By now, it is a fully-featured, highly-optimized query engine with tons of mechanisms for cutting down query times.
Snowflake
From my notes on Snowflake's lecture, you can see that they have also created a really advanced query engine with tons of QO (query optimization) techniques (and they are probably doing expression fusion by now as well). This includes things like:
- Precompiled primitives
- Separate table data from metadata
- (metadata in FoundationDB)
- No buffer pool
- PAX columnar storage
- Both for their proprietary storage format and for Iceberg, which they also support
- Sort-merge + hash joins
- Unified query optimizer + adaptive optimization
- Push-based vectorized query processing
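That last bullet is the main engine-level contrast with Photon, which pulls batches top-down. Here is a toy sketch of the two control flows (my own illustration; `Batch` and the operator shapes are stand-ins, not anyone's real API):

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

using Batch = std::vector<int64_t>;  // stand-in for a columnar vector/chunk

// Pull-based (Volcano-style, what Photon does): each operator asks its child
// for the next batch, so control flows top-down through the plan.
struct PullOperator {
  virtual std::optional<Batch> Next() = 0;  // nullopt when input is exhausted
  virtual ~PullOperator() = default;
};

// Push-based (what Snowflake does): each operator hands finished batches to
// its parent via a sink, so control flows bottom-up, which makes it easier
// for a scheduler to decide which pipeline runs next.
struct PushOperator {
  std::function<void(Batch)> sink;  // the parent operator's consume function
  void Emit(Batch batch) { sink(std::move(batch)); }
};
```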
Now, what really makes Snowflake stand out is Snowpark, Snowpipe, and everything else they have built on top of the warehouse. According to Andy Pavlo, the underlying query performance is becoming commoditized across Databricks, Redshift, Yellowbrick, etc.
This is really important! If query performance converges to the very same algorithms across all data lakehouses and data warehouses, the actual business competition is going to be around DevEx, UX, security, and the other features of these managed services.
I won't be going into Yellowbrick and Redshift, but you can read my notes on Notion.
DuckDB
From my lecture notes on DuckDB, you can quickly learn that:
- It is an embedded analytical database
- It was created in 2017, written in C++
- Push-based Vectorized Query Processing
- (used to be pull-based)
- Precompiled Primitives
- Morsel Parallelism + Scheduling (see the sketch after this list)
- PAX Columnar Storage (of course!)
- Sort-Merge + Hash Joins
- Stratified Query Optimizer
- (with support for unnesting arbitrary subqueries, which we studied earlier – the only other system that does this is Umbra)
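Morsel-driven parallelism (the Leis et al. technique) deserves a quick sketch. This is the general idea, not DuckDB's actual scheduler: worker threads repeatedly grab small fixed-size ranges of the input ("morsels") from a shared cursor, so load balances itself without static partitioning:

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

constexpr size_t kMorselSize = 100000;  // rows handed out per grab

// Each worker pulls morsels from a shared atomic cursor until none are left.
void parallel_sum(const std::vector<int64_t>& table, size_t num_threads) {
  std::atomic<size_t> cursor{0};
  std::vector<int64_t> partial(num_threads, 0);
  std::vector<std::thread> workers;
  for (size_t t = 0; t < num_threads; t++) {
    workers.emplace_back([&, t] {
      while (true) {
        const size_t begin = cursor.fetch_add(kMorselSize);
        if (begin >= table.size()) break;  // no morsels left
        const size_t end = std::min(begin + kMorselSize, table.size());
        for (size_t i = begin; i < end; i++) partial[t] += table[i];
      }
    });
  }
  for (auto& w : workers) w.join();
  int64_t total = 0;
  for (int64_t s : partial) total += s;
  std::printf("sum = %lld\n", static_cast<long long>(total));
}
```

A thread that finishes its morsel early just grabs another one, which is exactly what makes this scheme robust to skew.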
What makes DuckDB really, really cool are all the query optimizer techniques, as well as the fact that it can run as an embedded engine. This has made it kind of "take off" within the data science and data engineering community as a really easy-to-use columnar database.
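If you haven't tried the embedded part, here is what it looks like from C++, using DuckDB's documented C++ API (link against libduckdb; the SQL is just a toy example):

```cpp
#include "duckdb.hpp"

int main() {
  duckdb::DuckDB db(nullptr);   // nullptr = transient in-memory database
  duckdb::Connection con(db);   // the whole engine runs inside this process

  con.Query("CREATE TABLE t AS SELECT range AS x FROM range(1000000)");
  auto result = con.Query("SELECT sum(x) FROM t");
  result->Print();              // prints the materialized result as a table
  return 0;
}
```

The same engine is what you get behind `import duckdb` in Python, which is a big part of why it took off with data folks.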
Andy Pavlo also went into MotherDuck. He mentioned they have bridge operators that pass tuple streams between local and remote DuckDB instances. The idea is to use local DuckDB execution for quick things and remote DuckDB execution for things that need scale.
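For what it's worth, here is how I picture a bridge operator – purely my own toy illustration of the idea as described in the lecture, not MotherDuck's actual API:

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

using Batch = std::vector<int64_t>;  // stand-in for a tuple/vector chunk

// Hypothetical bridge: sits in the local plan and forwards its input batches
// to a remote DuckDB instance, where the rest of the plan keeps executing.
struct BridgeOperator {
  std::function<void(const std::string&)> send;  // hypothetical transport hook
  void Consume(const Batch& batch) { send(Serialize(batch)); }

  static std::string Serialize(const Batch& batch) {
    // A real system would use Arrow or some other wire format; this stub
    // just copies the raw bytes.
    return std::string(reinterpret_cast<const char*>(batch.data()),
                       batch.size() * sizeof(int64_t));
  }
};
```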
Summing it all up
To sum it all up, you really have to start by reading Part 1 of this series – there are a lot of important takeaways there. For the second part (more focused on real-life OLAP systems), I'm now too biased to have a proper "takeaway" on these (like I said, I'm about to be hired by one of these companies). Having said that, I'll just mention that:
- Iceberg and, in general, open-source data storage formats are going to take over
- Query engine and query optimization performance will become commoditized
- The real innovation, and the real competition, is in everything else – workflow engines, ingest pipelines, job frameworks, security, pricing, etc.
So, is the course worth it? Absolutely. As I already mentioned in the first part, Andy Pavlo is an extremely entertaining teacher, and I think that makes it much easier to go through the classes. Finally, the quality of the content is really high and you'll really learn a lot – especially if you "force yourself" to write down notes for all the lectures.
As for which of these "big companies" I'm about to join... I'll probably write another blog post on that soon, so stay tuned.
Feel free to follow me on Twitter/X if you want!