Musing in Computer Systems

Archive

Blog post: A cursed bug

Hey folks,

Just writing an update to let subscribers know of a new blog post. I posted a writeup of a delightfully cursed bug that we ran into and eventually ran down at work. Here it is: https://blog.nelhage.com/post/a-cursed-bug/

Ranger updates

As a newsletter subscriber bonus, here’s some photos of Ranger. He recently saw his first-ever snow, and LOVED it. And he had a very restful snoozle thereafter.

#28
February 23, 2022
Read more

Two reasons Kubernetes is so complex

Preface

Hello friends! It’s been a while. I’ve been finding it very hard to write while holding down a full-time job, and I’ve also been dealing with some very frustrating joint/ergo struggles that make using a computer kinda painful. I think I’m making progress on them and I’m getting better at figuring out how to manage time and energy while working, so hopefully I won’t go on quite as long a hiatus before the next post 🙂

Also! My team published our first paper, which I’m really excited about. It’s pretty in-the-weeds stuff so I don’t expect many people outside of ML to read it, but I do think it’s some really great (if really early!) progress towards understanding what the hell is going on inside GPT-3 and friends. Alongside that, I was able to publish a writeup on Garçon, one of my very first projects at Anthropic, which is the infrastructure tooling that powers most of our interpretability work.

With that out of the way, onward to the idle thoughts I wanted to share with y’all.

#27
January 27, 2022
Read more

Some thoughts on GitHub Copilot

A week or so ago, GitHub announced GitHub Copilot, their AI-powered code completion assistant, powered by a version of OpenAI’s GPT-3 model. I’ve spent a lot of time working on developer productivity tools and am also now working on language-generation models at Anthropic, so I’m very interested in Copilot and its implications. I haven’t been invited to the beta yet (probably because I don’t use VS Code) so I haven’t had a chance to play with it, but I wanted to jot down some initial thoughts and reactions.

I want to caveat up front that these are all my personal views, not those of Anthropic, and I’m pretty sure that most of these reactions would have been the same even had I not joined Anthropic earlier this year.

#26
July 12, 2021
Read more

Blog post: Distributed cloud builds for everyone

Ranger update! He turned 6 months old about a week ago. Here he is celebrating Memorial Day yesterday with his very first slice of watermelon, which he looooooved.

#25
June 1, 2021
Read more

Blog post: Building LLVM in 90 seconds using Lambda

I’ve been talking about my Llama project here for a while now. Last week, there were some blog posts about building LLVM quickly on large machines, so I decided to throw my hat in the ring with a build using Llama. I was able to build LLVM HEAD in under 90 seconds, at a cost of about 40¢.

#24
May 21, 2021
Read more

Some more Llama profiling

I wrote previously about profiling llama, and the challenges of understanding this distributed system. A few notes today about some of my progress since then.

Column stores

#23
April 28, 2021
Read more

Do the hard one second

Look at this perfect sleepy donut boy!

Technical migrations

#22
April 14, 2021
Read more

Profiling llama

I wrote a few months back about Llama, my experimental project for executing shell commands in Amazon Lambda. I briefly previewed its GCC-compatible wrapper, which allows for building C and C++ software using Lambda to outsource the compute.

#21
March 31, 2021
Read more

New blog post: Opinionated thoughts on SQL

Short email, to let you know I have a new blog post out, sharing my thoughts on and pet peeves with SQL databases.

I was going to send this as a newsletter post, but it got quite large, and I also think this is the kind of content I want to refer people back to in the future, so I decided to put it on the blog. I’m still figuring out which content goes where; I do think some of my past newsletter posts will get promoted to blog posts after a bit more editing, at some point.

#20
March 30, 2021
Read more

Notes on some PostgreSQL implementation details

Ranger update

Seriously friends he has gotten so large. He knows his name and comes when called, like, 80% of the time. He turned four months old today, just in time to celebrate National Puppy Day!

#19
March 23, 2021
Read more

What does a cache do?

I’ve recently had cause to work on scaling up a web application. It’s got a pretty traditional architecture: a CDN in front of a fast native-code web server (e.g. apache2 or nginx) in front of app code written in a slow interpreted language (e.g. Python or Ruby), which talks to a database (e.g. Postgres or MySQL), with an additional in-memory KV store (e.g. redis or memcached) as a cache. As you might imagine, scaling this application ends up involving a lot of “add more caching.”

As I’ve talked about before, I like to think about performance questions in terms of hardware utilization — what hardware resources are we consuming in order to accomplish a given unit of work? So, this seems like a good opportunity to write up a model that’s been floating around in my head: when we cache stuff in, say, memcached, what are we actually doing in terms of usage of the underlying physical resources?
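
To make the "cache in front of a database" shape concrete, here is a minimal sketch of the cache-aside pattern. It rests on assumptions that are not from the original post: a plain dict stands in for memcached or redis, `load_from_db` is a hypothetical stand-in for a real database query, and the `db_reads` counter exists only to show which requests reach the backing store.

```python
class CachedStore:
    """Cache-aside lookup: check the in-memory cache first, fall back to the database."""

    def __init__(self, load_from_db):
        # load_from_db is a hypothetical callable standing in for a slow database query.
        self._load = load_from_db
        # A dict stands in for an in-memory KV store such as memcached or redis.
        self._cache = {}
        # Counts how many lookups actually reached the "database".
        self.db_reads = 0

    def get(self, key):
        if key in self._cache:
            # Cache hit: served from memory, no database work at all.
            return self._cache[key]
        # Cache miss: pay for the database read once, then remember the result.
        self.db_reads += 1
        value = self._load(key)
        self._cache[key] = value
        return value


store = CachedStore(lambda key: key.upper())
store.get("a")
store.get("a")  # second lookup is served from the cache
print(store.db_reads)
```

In this toy version, the cache spends memory to hold results so that repeated requests skip the database's share of the work. Eviction and invalidation, which a real deployment needs, are left out.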

#18
March 5, 2021
Read more

HTTP Pipelining, S3, and gg

Ranger update

22lbs at last weigh-in! His favorite place continues to be “curled up at our feet chewing his favorite bone”:

#17
February 22, 2021
Read more

Tagged unions are overrated

Among engineers who have strong opinions about programming languages, one particularly widely held take concerns the value of tagged unions, as well as language support for pattern-matching over them. When we were deciding to write Sorbet in C++, for instance, we often heard shock and surprise that we would even consider using a language without native support for tagged unions and pattern-matching. Conversely, OCaml’s excellent support for both is often credited as one of the reasons it is great for writing compilers.

At this point I’ve worked on a number of compilers, both toy and production, in a wide range of languages, including Go, C++, Java, Rust, and Haskell. With that experience, my considered opinion is this: Tagged unions and pattern-matching on them are vastly overrated. They are nice to have, but they rarely make a substantial difference in the development of a compiler, typechecker, or similar.

#16
February 15, 2021
Read more

On Reasoning about Code

Ranger update

Ranger has now gotten enough vaccinations to go for walks! This is very exciting. He was 19.6lbs at last weighing, now crossing twice his weight as-of coming home!

#15
February 12, 2021
Read more

Alive2 and missed-optimization bug reports

First, a Ranger update. He is a growing healthy boy, up 50% by weight over the last two weeks.

I want to do a quick writeup of something from earlier this week, and explain why I think it’s really cool.

#14
February 1, 2021
Read more

Some notes on code review

A brief personal note

I missed a newsletter last week, due to a combination of procrastination and MIT Mystery Hunt, but also due to adopting this little guy into our home!!!!

#13
January 24, 2021
Read more

Tracing JITs and coverage-guided fuzzers

By happenstance, I am friends with a handful of engineers that happen to have spent substantial amounts of time working on both the PyPy project and on fuzzing a lot of C and C++ software. This post is an attempt to capture an observation we’ve made about a surprising similarity between those systems.

Coverage-guided fuzzers

Coverage-guided fuzzers, a lineage now widely branching out into reimplementations and variants, attempt to mostly-automatically generate “interesting” inputs to a program, most typically with the goal of finding potentially-exploitable bugs in memory-unsafe languages. Given zero or more “seed” files, they randomly mutate the files, and preserve inputs that generate new behaviors in the system under test, measuring “new behaviors” using something akin to branch coverage. This causes them to evolve towards inputs that exercise more and more code, increasing the likelihood of triggering bugs.
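
The mutate-and-keep loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any real fuzzer is implemented: `target` is a hypothetical system under test that reports the branches it took (real fuzzers get this from compiler or binary instrumentation), and `mutate` is a deliberately simple single-byte mutator.

```python
import random


def target(data: bytes) -> set:
    """Hypothetical system under test: returns the set of branch IDs it executed."""
    branches = {"start"}
    if len(data) > 0 and data[0] == ord("F"):
        branches.add("F")
        if len(data) > 1 and data[1] == ord("U"):
            branches.add("FU")
            if len(data) > 2 and data[2] == ord("Z"):
                branches.add("FUZ")
    return branches


def mutate(data: bytes, rng: random.Random) -> bytes:
    """Randomly overwrite, insert, or delete a single byte."""
    buf = bytearray(data)
    op = rng.choice(("overwrite", "insert", "delete"))
    if op == "overwrite" and buf:
        buf[rng.randrange(len(buf))] = rng.randrange(256)
    elif op == "insert":
        buf.insert(rng.randint(0, len(buf)), rng.randrange(256))
    elif op == "delete" and buf:
        del buf[rng.randrange(len(buf))]
    return bytes(buf)


def fuzz(seeds, iterations, rng):
    """Keep every mutated input that reaches a branch no earlier input reached."""
    corpus = list(seeds) or [b""]
    seen = set()
    for seed in corpus:
        seen |= target(seed)
    for _ in range(iterations):
        candidate = mutate(rng.choice(corpus), rng)
        coverage = target(candidate)
        if not coverage <= seen:
            # New behavior: remember the coverage and keep the input for further mutation.
            seen |= coverage
            corpus.append(candidate)
    return corpus, seen


corpus, seen = fuzz([b""], 20000, random.Random(0))
print(len(corpus), sorted(seen))
```

Because kept inputs become the starting point for later mutations, the corpus can reach the nested branches one byte at a time instead of needing to guess the whole prefix at once.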

#12
January 9, 2021
Read more

Situated Social Software

(Probably) no new letters for the rest of this year — going to take a holiday break. Thanks for subscribing and reading this newsletter, and you can find more of my writing at https://blog.nelhage.com/ if you’re feeling deprived :)

For this week, something a bit different — some thoughts on an essay from 2004(!) that I encountered for the first time this year and which really resonated with me.

Situated Software

In 2004, Clay Shirky wrote about situated software. He used the term to refer to “software designed in and for a particular social situation or context,” citing several examples from his students and other communities in and around NYU, where he was teaching at the time. Situated software, as he described it, was specialized to some pre-existing social structures and context, and bootstrapped on those structures to serve roles and accomplish purposes within that community that would be challenging to scale to “web scale.” Shirky made the prediction that over time we would see more of this situated software, built by specific communities for their community, instead of or alongside larger, mass-scale, social platforms.

#11
December 18, 2020
Read more

Notes on Amazon Lambda

First, a callback to an older post: Itamar Turner-Trauring did a neat writeup on using Cachegrind for deterministic performance analysis, inspired by my post on the challenges of stable benchmarking.


#10
December 11, 2020
Read more

Papers I love: gg

I’ve been playing with Amazon Lambda the last few weeks for a side project, and during the process I’ve gone from kinda infuriated with and baffled by Lambda to quite a fan. I hope to write more about that journey in a future letter, but I decided that today I first want to share the paper that got me thinking about Lambda in the first place. I really enjoy it, and want to share some of the ideas I love from it.

gg

The paper is “From Laptop to Lambda: Outsourcing Everyday Jobs to Thousands of Transient Functional Containers,” by a number of researchers primarily at Stanford (including Keith Winstein, who I worked with at MIT and at Ksplice). They present a tool called gg, which is designed to allow users to use cloud computation (including “function-as-a-service” systems like Amazon Lambda) to outsource everyday computation (such as software builds) from their laptops to the cloud, without requiring users to provision or manage a standing compute cluster.

#9
December 4, 2020
Read more

Determinism in software engineering

A few weeks back I wrote about nondeterministic performance and the problems it poses for benchmarking. That reminded me that I’ve got a bunch of thoughts about regular old determinism that I’ve been meaning to write up, and so that’s today’s topic.

Determinism

I describe a program’s behavior as “deterministic” if running it multiple times on the same inputs will reliably produce the same outputs. I expect this to be a non-controversial definition, although there is room for nuance in defining exactly what a program’s “inputs” are or what the “same outputs” are: in some contexts we might allow an output to contain a timestamp and still call that determinism, but in others that might be a problem.
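
As a small illustration of that nuance, here is a sketch in which the same report counts as deterministic or not depending on whether the timestamp is considered part of the output. All of the names here (`render_report`, `render_report_with_timestamp`, `same_output`, the `generated_at` field) are hypothetical and exist only for this example.

```python
import time


def render_report(values):
    """Deterministic: the output depends only on the input values."""
    return {"total": sum(values), "count": len(values)}


def render_report_with_timestamp(values, clock=time.time):
    """Same report, plus a field that depends on when it was generated."""
    report = render_report(values)
    report["generated_at"] = clock()
    return report


def same_output(a, b, ignore=()):
    """Compare two outputs, optionally treating some fields as not part of the output."""
    def strip(d):
        return {k: v for k, v in d.items() if k not in ignore}
    return strip(a) == strip(b)


# Two runs on the same input, with fake clocks standing in for different wall-clock times.
first = render_report_with_timestamp([1, 2, 3], clock=lambda: 1.0)
second = render_report_with_timestamp([1, 2, 3], clock=lambda: 2.0)

print(same_output(first, second))                            # strict comparison
print(same_output(first, second, ignore=("generated_at",)))  # timestamp excluded
```

Whether the second comparison is the right one depends on the context: a human-readable report may tolerate the timestamp, while a build artifact that is hashed or cached probably cannot.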

#8
November 27, 2020
Read more

Benchmarking and theories of performance

I just this week finished reading Kuhn’s The Structure of Scientific Revolutions. I’d encountered many or most of the ideas in the book by reference, but this was my first time reading through the original work. I’d recommend it — it’s fairly short and pretty readable, and — among other topics — full of fun anecdotes about early scientific development.

In any case, thinking about the ideas in that book definitely contributed to the thoughts in today’s newsletter.

#7
November 19, 2020
Read more

Performance engineering requires stable benchmarks

Reproducible benchmarks are essential to doing performance engineering. In order to know if a change had an impact on performance, you have to be able to measure performance, and that measurement has to be reasonably consistent across time, so that “before” and “after” measurements are comparable.

Unfortunately, modern software systems are incredibly difficult to benchmark, for two broad reasons:

  • Noise: I’m defining noise as effects external to the system being benchmarked which can influence its performance. This definition includes sources as obvious as other processes running on the same CPU (or different CPUs with shared caches or memory buses), but also more exotic processes like kernel timer interrupts, CPU throttling due to increased ambient temperatures, and explicit sources of randomness.

#6
November 12, 2020
Read more

Test size and scope

It’s been A Week

I have a 60%-written post on software performance I was hoping to send this week, but in retrospect it was pretty foolish to expect I would do any writing at all after about noon on Tuesday. So here’s a short piece from my prepared backlog about testing instead.

Describing different sizes of tests

I’ve been intermittently working through a book by several experienced Google engineers that attempts to summarize what they’ve learned about developing and maintaining software at scale and over extended periods of time. I find it at intervals both dry and annoyingly condescending in a way I’ve come to associate with some Google engineers, but it’s also full of a lot of good insight and hard-earned lessons. It’s been particularly interesting to compare their advice with my own experiences and lessons learned working on developer tooling and practices at smaller but still sizeable scale.

#5
November 5, 2020
Read more

Three approaches to edge cases in data models

Edge case poisoning

This post is somewhat a response to Hillel Wayne’s recent post on edge case poisoning. It should be understandable without reading his post, but I recommend starting there if you’ve got the time.

This newsletter is a bit of a first draft. I may try to polish it further into a full blog post, so I would love your feedback.

#4
October 29, 2020
Read more

Once more, with feeling…

Sorry for the second email.

It seems that I failed to actually disable link tracking on the previous post. I’m pretty sure that this time I’ve got it fixed for good.

You can find the previous posts online, with all links functioning:

#3
October 22, 2020
Read more

What's worth optimizing?

Broken links in last week’s email

Many people reported that all the links in last week’s post were dead. Thanks for letting me know, and I’m sorry about that. It turns out Buttondown’s link tracking doesn’t play well with custom domains that require HSTS on all subdomains (like my own nelhage.com). I’ve disabled link tracking, since I dislike that feature anyway.

You can find last week’s post, with functioning links, here: https://buttondown.email/nelhage/archive/welcome-to-my-newsletter-and-performance-as/

#2
October 22, 2020
Read more

Welcome to my newsletter! And: Performance as hardware utilization

Hello! Welcome to the newsletter.

I started this year with a goal of writing weekly posts on my blog, but ever since COVID-19 lockdowns started, I’ve been really struggling to get ideas into a form where I feel confident writing about them. I’m hoping that a switch to a newsletter format will help motivate me to get back into it, and also help me feel more comfortable getting out less-fully-developed ideas, and create a space to share things I’m thinking about without feeling like I need to have concrete answers or frameworks.

I’ve seeded the mailing list from my existing list. I’m inclined to deprecate that one and also start forwarding blog posts to this list, but if you feel strongly about wanting separate lists, please let me know.

#1
October 15, 2020
Read more