CMU 15-721 Advanced Database Systems 2024 — Part 1

The CMU Database Group offers two main courses:

  • Intro to Database Systems (fall and spring)
  • Advanced Database Systems (spring)

Both of these are quite familiar to me. The main reason for this is that they make all their lectures available on YouTube, and I’ve watched a few of them in the past. Furthermore, there have been a few study groups at SingleStore around these courses. Finally, Andy Pavlo (the main professor) is a very well-known person in the database space, and at SingleStore we’ve hired a lot of people who know him from CMU.

So, 2 months ago, I decided to properly follow along with the CMU Advanced Database Systems class for the spring of 2024. By “follow along”, I mean that I’m:

  • Watching every lecture
  • Reading all the mandatory papers (and some others as well)
  • Taking lecture notes and writing paper reviews

And I’ve been doing all of this in public! In this Notion link, you can find all of my notes and paper reviews.

As I’m writing this blog post, we’re only 10 lectures in, but I’ve already learnt an insane amount about how OLAP systems are built. Even though I’ve been working at SingleStore for a long time, I’ve been more focused on the implementation of our cloud managed service. Additionally, SingleStore is not a pure OLAP system, but rather an HTAP (OLTP+OLAP) database. So, there are some big differences between how SingleStoreDB does certain things and what we’ve been learning in the course.

In this blog post, I’ll write a little bit about my experience with the course so far. In particular, I want to highlight what I’ve been learning about, with the hope of inspiring others to take these courses as well.

How do large-scale OLAP systems actually store data?

The first few classes of the course are all about how OLAP systems store data (both on local disk and in remote storage such as S3). So, naturally, we learnt about rowstores and columnstores, but at a much deeper level than I was familiar with. Specifically, we learnt about the most popular columnar data formats (Parquet and ORC) and how they’re designed. If I had to highlight my key learnings here, they would be:

  • Columnar formats are actually a hybrid called PAX, where data is stored column-by-column but divided into row groups (or “stripes”). This is because most queries touch multiple columns, so spreading columns too far from each other isn’t beneficial either (see the sketch after this list).
    • PAX was actually introduced in 2002 in a paper that’s co-authored by a CMU person and some other folks.
  • Data formats today should be optimized for fast reads rather than maximum compression. This is because these systems increasingly rely on S3-like storage, which is very cheap.
    • Optimizing for fast reads means choosing different ways of encoding data and storing statistics, Bloom filters, and other metadata that let queries skip data they don’t need.
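To make the PAX idea a bit more concrete, here is a minimal, entirely hypothetical sketch in C++ of what such a layout could look like. The struct names, the statistics fields, and the row group size are illustrative; they are not taken from the Parquet or ORC specs:

    #include <cstdint>
    #include <string>
    #include <vector>

    // Hypothetical sketch of a PAX-style file layout: the file is split into
    // row groups, and within each row group every column is stored contiguously.
    struct ColumnChunk {
        std::string name;            // column name
        std::vector<uint8_t> data;   // encoded values for this column within this row group
        int64_t min_value;           // zone-map style statistics, used to skip chunks
        int64_t max_value;           //   that can't match a query's predicates
    };

    struct RowGroup {
        int64_t num_rows;                  // e.g. on the order of ~100K rows per group
        std::vector<ColumnChunk> columns;  // one chunk per column
    };

    struct ColumnarFile {
        std::vector<RowGroup> row_groups;
        // a footer with the schema and per-row-group statistics would live here
    };

A query that filters on one column can then read just that column’s chunks, and use the min/max statistics to skip entire row groups.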

How do these systems execute queries?

This is probably the area that I knew the least about so far. Fortunately, quite a large number of lectures were dedicated to this topic, along with some outstanding papers. The main seminal paper in this area is “MonetDB/X100: Hyper-Pipelining Query Execution”. It is from 2005, and it introduced a really important concept that is used today by the vast majority of analytical databases. Basically, OLTP systems (which were the first kind of databases) typically process query plans one row at a time (the “volcano” model). In this paper, the authors introduced the idea of processing a reasonably large batch of rows at a time for OLAP systems. This works extremely well because modern CPUs greatly benefit from pipelining in order to be as efficient as possible (in other words, the fewer interruptions, the better). This technique is now called “vectorization”.
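To illustrate the difference, here is a hypothetical sketch of the two operator interfaces in C++. The class names and the batch size are mine, not from the paper:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Tuple { /* one row's values */ };

    // Volcano model: every operator produces one row per call, so per-row
    // interpretation overhead (virtual calls, branches) dominates execution.
    class VolcanoOperator {
    public:
        virtual ~VolcanoOperator() = default;
        virtual bool Next(Tuple* out) = 0;  // produce a single row, or return false when done
    };

    // Vectorized model: every operator produces a batch (e.g. ~1K rows) of
    // column vectors, amortizing the interpretation overhead and keeping the
    // inner loops tight and CPU-friendly.
    struct Batch {
        std::vector<std::vector<int64_t>> columns;  // one vector per column
        size_t num_rows = 0;
    };

    class VectorizedOperator {
    public:
        virtual ~VectorizedOperator() = default;
        virtual bool Next(Batch* out) = 0;  // produce a batch of rows, or return false when done
    };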

What about SIMD? And code generation?

Before this course, I thought of SIMD as a set of special CPU instructions on Intel processors that made certain operations go faster. And while that is basically true, during this course I’ve learnt what SIMD actually is and how database systems can leverage it. It turns out that SIMD is really important for accelerating query execution.
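As a rough example of the kind of thing SIMD enables, here is a small sketch (assuming a CPU with AVX2, compiled with something like -mavx2 -mpopcnt) that counts how many values in a column pass a filter, eight values per instruction. A real engine would typically produce a selection vector or bitmap rather than just a count:

    #include <cstdint>
    #include <immintrin.h>  // AVX2 intrinsics (x86-64 only)

    int64_t CountGreaterThanAVX2(const int32_t* values, int64_t n, int32_t threshold) {
        const __m256i thresh = _mm256_set1_epi32(threshold);  // broadcast threshold into 8 lanes
        int64_t count = 0;
        int64_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256i v  = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(values + i));
            __m256i gt = _mm256_cmpgt_epi32(v, thresh);  // lane-wise compare: all-ones where v > threshold
            int mask = _mm256_movemask_ps(_mm256_castsi256_ps(gt));  // one bit per lane
            count += _mm_popcnt_u32(static_cast<unsigned>(mask));    // number of matching lanes
        }
        for (; i < n; ++i) {  // scalar tail for the remaining values
            count += (values[i] > threshold);
        }
        return count;
    }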

Besides SIMD, we also learnt about code generation (query compilation) for query execution. This is another topic that I knew a little bit about, but I didn’t really understand it in depth until I went through the literature and the lectures. Query compilation is another technique that can accelerate query execution significantly.
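Conceptually, what a query compiler emits for a simple query might look something like the following hand-written sketch for “SELECT SUM(price) FROM sales WHERE quantity > 10”. The column names are made up, and the actual generation mechanism (LLVM IR, generated C++ source, etc.) is omitted:

    #include <cstdint>
    #include <vector>

    // Instead of interpreting a tree of generic operators, the engine generates
    // one specialized loop in which the scan, filter, and aggregation are fused.
    int64_t CompiledQuery(const std::vector<int32_t>& quantity,
                          const std::vector<int64_t>& price) {
        int64_t sum = 0;
        for (size_t i = 0; i < quantity.size(); ++i) {
            if (quantity[i] > 10) {   // the filter is inlined, no virtual dispatch
                sum += price[i];      // the aggregation is inlined into the same loop
            }
        }
        return sum;
    }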

However, it is not easy to utilize both SIMD and code generation (though it is possible). So, most systems end up making a choice between relying more on code generation or just doing vectorization really well (with SIMD). This paper from 2018 compares the two approaches and concludes that the performance difference isn’t tremendous. Query compilation has the advantage that it makes it easier to implement server-side logic (stored procedures, user-defined functions, etc.), while skipping query compilation has the advantage of implementation simplicity. I would say, however, that systems looking to deliver maximum performance have to rely on all of these techniques (and that is actually what SingleStore does).

(Notice also that throughout this blog post, I’m simplifying everything quite a lot. The course material goes into the weeds of most of these things!)

How do these systems schedule queries to be executed?

There was also one class on scheduling and coordination. This was probably a bit too rushed, but it was enough for me to understand something quite fascinating…

I’ve always imagined that distributed databases made query scheduling much more complicated. However, and this may sound obvious to many, ever since CPUs became multi-core, the problem of scheduling across cores has existed. And solving scheduling for distributed systems isn’t that different from solving it for multi-core single-node systems (with some network-related caveats, of course).

We also learnt about multi-process vs. multi-threaded architectures. I had recently watched a talk about this from PGConf EU 2023, so I already knew a little bit about the tradeoffs here. It seems like systems such as Postgres, which launch a process for each connection, are a dying breed, and most modern database systems prefer to take over as much work as possible from the underlying OS (and this is the right thing to do!).
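To give an idea of what “taking over work from the OS” can look like, here is a hypothetical sketch of a worker pool where a fixed set of threads (typically one per core) pulls small units of query work from a shared queue, so the database itself decides what runs next rather than spawning a process per connection and leaving everything to the OS scheduler:

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    // Hypothetical sketch: workers pull tasks ("morsels" in some systems) from
    // a shared queue until the pool is shut down.
    class WorkerPool {
    public:
        explicit WorkerPool(unsigned num_workers) {
            for (unsigned i = 0; i < num_workers; ++i) {
                workers_.emplace_back([this] { RunWorker(); });
            }
        }

        ~WorkerPool() {
            {
                std::lock_guard<std::mutex> lock(mu_);
                done_ = true;
            }
            cv_.notify_all();
            for (auto& w : workers_) w.join();
        }

        // Enqueue one unit of query work (e.g. a pipeline fragment over a batch).
        void Submit(std::function<void()> task) {
            {
                std::lock_guard<std::mutex> lock(mu_);
                tasks_.push(std::move(task));
            }
            cv_.notify_one();
        }

    private:
        void RunWorker() {
            while (true) {
                std::function<void()> task;
                {
                    std::unique_lock<std::mutex> lock(mu_);
                    cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                    if (done_ && tasks_.empty()) return;
                    task = std::move(tasks_.front());
                    tasks_.pop();
                }
                task();  // run the work item outside the lock
            }
        }

        std::mutex mu_;
        std::condition_variable cv_;
        std::queue<std::function<void()>> tasks_;
        bool done_ = false;
        std::vector<std::thread> workers_;
    };

A real scheduler does much more (priorities, NUMA awareness, work stealing), but the core idea is the same: the database, not the OS, decides how query work maps onto cores.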

The modularization of database systems

During these lectures, I also learnt about Velox, Substrait, Arrow and others. These open source projects have the potential to revolutionize how data systems are built.

  • Velox is a database-system-agnostic implementation of multiple parts of a query engine
  • Substrait is a system-agnostic representation for query plans
  • Arrow is an in-memory columnar data format (see the sketch after this list)
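As a tiny taste of Arrow, here is a minimal sketch (assuming the Arrow C++ library is installed) that builds one in-memory columnar array and prints it:

    #include <arrow/api.h>

    #include <iostream>
    #include <memory>

    int main() {
        // Build an in-memory columnar array of 64-bit integers.
        arrow::Int64Builder builder;
        arrow::Status st = builder.AppendValues({1, 2, 3, 4});
        if (!st.ok()) { std::cerr << st.ToString() << std::endl; return 1; }

        std::shared_ptr<arrow::Array> array;
        st = builder.Finish(&array);
        if (!st.ok()) { std::cerr << st.ToString() << std::endl; return 1; }

        // The same memory layout can be handed to other Arrow-aware tools
        // (Velox, DuckDB, pandas, ...) without copying or re-serializing.
        std::cout << array->ToString() << std::endl;
        return 0;
    }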

Together with some other things like Parquet and Iceberg, these open source tools could be glued together to build modular, open-source data systems (this idea was discussed in the very first lecture of this class). In fact, this is already partly happening inside Meta, where they’ve taken Presto and Spark and rewritten their query engines in C++ using Velox (as per one of the papers from lecture 5). So, the future of databases will probably be more open than it is today, not just at the data storage layer (thanks to Iceberg) but perhaps at other layers of the stack as well.

What’s next?

The next few classes will focus heavily on building a good query optimizer (but we’ll also learn about building things like stored procedures and user-defined functions, and a little bit about networking protocols). Then, we’ll head into real system analysis, with one lecture for each of the following:

  • Dremel / BigQuery
  • Databricks / Spark
  • Snowflake
  • DuckDB
  • Yellowbrick
  • Redshift
  • Azure Synapse

I’ll continue to watch along, and I’ll hopefully be able to write more on this blog about what I’m learning. Andy is a fantastic teacher, and his informal style definitely makes the lectures easier to watch. So, I highly recommend that others join the upcoming fall 2024 edition of “Intro to Database Systems” or try to do what I’m doing with a course that has already started or taken place in the past (none of what I’m doing requires following along in real time; I just find it more fun that way).

In terms of the time investment, I’m spending around 3 hours every week watching the lectures, and then another 2 hours reading papers and writing about them. So, all in all, you should expect to need about 5 to 6 hours per week to “take” a course like this. Of course, if you want to try to implement the class project as well, I’m sure that will require a lot more time.

Finally, while I am writing in public, that content is really more for myself. If you’re curious, the official lecture notes in the class’s schedule are much better.

Feel free to reach out on Twitter!