I mentioned last time that a column store would be a great fit for my use case. Since then, I finally got frustrated with the speed of SQLite for large queries when I have a lot of traces loaded, so I searched for another solution.
I’ve now moved my pipeline to use DuckDB as my SQL engine. DuckDB can natively read Parquet files, so llama trace can now generate Parquet files directly using fraugster/parquet-go. I generate one Parquet file per trace, and then programmatically generate a view in DuckDB that looks something like this:
```sql
CREATE VIEW spans AS
SELECT trace_id, span_id, parent_id, start, name, path,
       duration_us/1000.0 AS duration_ms,
       to_microseconds(duration_us) AS duration,
       "global.build_id" AS build_id
FROM parquet_scan('/home/nelhage/code/linux/llama/j100-gsem48.parquet')
UNION ALL
SELECT trace_id, span_id, parent_id, start, name, path,
       duration_us/1000.0 AS duration_ms,
       to_microseconds(duration_us) AS duration,
       "global.build_id" AS build_id
FROM parquet_scan('/home/nelhage/code/linux/llama/j100-main.parquet')
UNION ALL
...
```
DuckDB supports projection pushdown into Parquet files: it only reads the columns a given query actually needs, so I can name arbitrarily many columns in the view without a performance cost. Listing the columns explicitly is necessary because the Parquet files may have different sets of columns; I automatically add a column for every metadata field found in the traces. The view also lets me do some renaming and type conversion for convenience.
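Since the view is generated programmatically, the UNION ALL boilerplate is easy to script. A minimal sketch in Python; the column list, paths, and helper name here are illustrative placeholders, not llama’s actual code:

```python
# Sketch: build the UNION ALL view over a set of per-trace Parquet files.
# COLUMNS and the paths below are illustrative, not llama's real schema.
COLUMNS = (
    "trace_id, span_id, parent_id, start, name, path, "
    "duration_us/1000.0 AS duration_ms, "
    '"global.build_id" AS build_id'
)

def spans_view_sql(parquet_paths):
    """Generate one SELECT per Parquet file, glued with UNION ALL."""
    selects = [
        f"SELECT {COLUMNS} FROM parquet_scan('{p}')" for p in parquet_paths
    ]
    return "CREATE VIEW spans AS\n" + "\nUNION ALL\n".join(selects)

sql = spans_view_sql(["j100-gsem48.parquet", "j100-main.parquet"])
# Executing it is then one call: duckdb.connect().execute(sql)
```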
For the queries I’m making, DuckDB ranges from slightly faster than SQLite to drastically faster. One monster self-join, used to produce some of the graphs I’ll show later, drops from about 14s to about 2s on a warm cache, with an even more drastic speedup cold.
I’m pretty excited about DuckDB. Column stores feel like magic to me — my parquet files are a fraction of the size of JSON or CSV, and yet I can query them directly at better performance than a SQLite database, for most of the queries I am currently using. Having a high-quality, open-source, local column store available in the ecosystem seems like a really useful primitive to have hanging around.
That said, DuckDB is still a bit raw; while working on this project, I ran into a number of bugs and rough edges. Most notably, their support for column groups (also known as “nested structs” or “nested messages”) inside of Parquet seems immature; I played with encoding some of my data into nested fields but ended up flattening everything because it made DuckDB work better. However, the maintainers are quite responsive; I reported five bugs shortly after running into them, and every one was fixed by the developers within a few days. Even with the rough edges, that experience leaves me feeling quite good about continuing to use the project.
It feels like every time I have a project that would benefit from some exploratory data visualization, I somehow end up picking a new tool to learn and use. It feels like I’d benefit from having stuck to one and accumulated experience, and yet that’s somehow not how it’s gone. I find myself missing our internal tools at Stripe, which were far more pleasant to work with than anything I’ve found in the open-source world.
For now, for Llama, I’ve found myself turning to Jupyter notebooks, Pandas, and matplotlib. I’ve long had a bit of an antagonistic relationship with notebooks (I sprained my neck nodding along to this talk making the case against notebooks), but I can’t deny they’ve been a decently pleasant experience here.
DuckDB even has native Pandas integration, making it easy (and efficient) to get a query result as a Pandas DataFrame, or even to run SQL over a DataFrame in place, which is frankly kind of magic. Combined with the DataFrame.plot method, which I hadn’t used before, it’s about as easy a plotting experience as I could hope for.
I’ve also been using Tailscale to run jupyter-lab on my big desktop but still connect to it from my laptop if I wanna hack on the couch or from the backyard. I find myself increasingly leaning into that pattern, and it’s really quite nice.
These tools have also contributed to the one major improvement in Llama performance since last time. The whole premise of Llama is that doing compiles in Lambda will be individually slower than doing them locally, but that we can make up for it by executing tons of builds in parallel. It occurred to me, then, to ask the question: how many jobs are we actually executing in Lambda concurrently? A slightly horrifying SQL query later, I was able to draw the plot, in this case for a -j100 build of the Linux kernel on my Ryzen 3900 desktop:
A few things immediately pop out:
Even with 100 jobs running, usually only 70-80 of the llamaccs are actually in Lambda at any given time.
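For the curious, the concurrency-over-time series itself doesn’t strictly require a horrifying query; computed in pandas rather than SQL, the classic sweep-line trick looks something like this (toy data; the column names are assumptions, not llama’s actual schema):

```python
import pandas as pd

# Toy spans: (start_us, duration_us) for each Lambda invocation.
spans = pd.DataFrame({
    "start_us":    [0, 10, 20, 100],
    "duration_us": [50, 50, 5, 30],
})

# Sweep-line: +1 at each span start, -1 at each span end; a cumulative
# sum over the time-sorted events gives the number of in-flight jobs.
starts = pd.DataFrame({"t": spans["start_us"], "delta": 1})
ends = pd.DataFrame({"t": spans["start_us"] + spans["duration_us"],
                     "delta": -1})
events = pd.concat([starts, ends]).sort_values("t", kind="mergesort")
events["concurrency"] = events["delta"].cumsum()
# events is now a step function of concurrency over time, ready to plot.
```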
With this plot in hand, I had a speculation: before calling into Lambda, llamacc has to do some local computation (about 30-50ms worth, on my machine) in order to discover all the dependencies for a .c file. Especially at the start of the build, all 100 llamacc processes will be competing for my 24 hardware threads, and that contention might slow them all down, meaning it takes longer for any of them to get to the point of offloading work. We’d be much better served by serializing them to some extent, so that the first jobs finish their local phase quickly and start compiling in Lambda, and then the next ones carry on.
I added a semaphore around the llamacc processes, limiting them to 2*ncpu concurrent jobs actually executing at once, and dropped the semaphore around all the AWS interaction (uploading files to S3 and calling Lambda). The results were remarkable, especially when overlaid on the previous plot:
The version with the semaphore finishes almost 15 seconds earlier, nearly a 25% improvement. Not only does the semaphore help llamacc keep load more consistently high, but it also speeds up the time-to-first-jobs significantly, as expected.
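The shape of that change can be sketched as follows (in Python for illustration; llamacc itself is Go, and discover_deps / compile_remotely are placeholder names for the local dependency scan and the S3/Lambda round trip):

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor

NCPU = os.cpu_count() or 1
# Cap how many jobs may run their *local* phase at once; remote
# (Lambda) work happens outside the semaphore so it can fan out freely.
local_slots = threading.BoundedSemaphore(2 * NCPU)

def discover_deps(source):
    # Placeholder for the ~30-50ms of local CPU-bound work.
    return [source + ".h"]

def compile_remotely(source, deps):
    # Placeholder for uploading to S3 and invoking Lambda.
    return (source, len(deps))

def build_one(source):
    with local_slots:                 # hold a slot only for local work
        deps = discover_deps(source)
    # Slot released before the remote call, so the next job's local
    # phase can start while this one waits on Lambda.
    return compile_remotely(source, deps)

with ThreadPoolExecutor(max_workers=100) as pool:
    results = list(pool.map(build_one, [f"f{i}.c" for i in range(100)]))
```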
The results are in some ways even more pronounced if we limit the local build to using 4 cores, to simulate building on a smaller machine:
In that case, time-to-first-build seems to be the dominant factor; while the semaphore keeps concurrency more consistent, it’s not actually clear that it keeps it higher, on average; more experimentation is indicated. It’s also interesting that in all of these graphs, the initial concurrency spikes to very near 100, and then drops down. I think this may be because of the initial upload of all the header files; early on, essentially every build gets stuck behind an upload of the slowest global header file, and then all release near-simultaneously, but I haven’t verified this.
You can tell llama is a side project because this post is as much about yakshaving my tools as it is about actually making progress. I’m comfortable with that; for me a large part of the point of side projects is to make excuses to explore and learn new tools to solve actual problems instead of just toy examples. I personally find that learning fun for its own sake, but it’s also very valuable professionally to keep an eye on new technologies and learn what problems they solve and what they’re like to actually use.
Llama continues to trudge gradually along; I think it’s getting good enough that it’s time for a “release” and some launch PR. If anyone reading this compiles a lot of C or C++ on Linux, especially on a slow machine or a laptop, and wants to give it a try, check out the README and let me know how it goes!