Napkin Math

Archive

Napkin Math #21: Index Merges vs Composite Indexes in Postgres and MySQL

While working with Readwise on optimizing their database, I found myself asking the question: How much faster is a composite index compared to letting the database do an index merge of multiple indexes? Consider this query:

SELECT count(*) /* matches ~100 rows out of 10M */
FROM table
WHERE int1000 = 1 AND int100 = 1
/* int100 values are 0..99 and int1000 values are 0..999 */

/* composite index */
CREATE INDEX ON table (int1000, int100)

/* versus single indexes for index merge */
CREATE INDEX ON table (int1000)
CREATE INDEX ON table (int100)

Composite indexes are about 10x faster than index merges. In Postgres, the gap is larger than in MySQL because Postgres doesn’t support index-only scans for queries that involve index merges.
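
Napkin-style, the gap is intuitive: an index merge has to visit every index entry that matches each predicate individually and then intersect them, while the composite index descends straight to the ~100 matching rows. A rough sketch of that arithmetic (my own numbers, assuming uniformly distributed values):

# Rough napkin arithmetic for the query above, assuming uniformly
# distributed values. Illustrative only, not the post's benchmark.
total_rows = 10_000_000

rows_int1000 = total_rows / 1_000     # ~10,000 entries where int1000 = 1
rows_int100  = total_rows / 100       # ~100,000 entries where int100 = 1

# Index merge: scan both sets of matching entries, then intersect them.
index_merge_entries = rows_int1000 + rows_int100    # ~110,000 entries

# Composite index (int1000, int100): descend directly to the matches.
composite_entries = total_rows / (1_000 * 100)      # ~100 entries

print(f"index merge touches ~{index_merge_entries:,.0f} index entries")
print(f"composite index touches ~{composite_entries:,.0f} index entries")

Touching roughly 1,000x more index entries doesn't translate into a 1,000x slowdown, since each entry is cheap to scan, but it makes the ~10x gap unsurprising.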

Read the full post here.

#21
November 27, 2022
Read more

Napkin Math #20: Scaling Causal's Spreadsheet Engine from Thousands to Billions of Cells: From Maps to Arrays

Earlier this year, I spent a few weeks working with Causal on their core spreadsheet engine. Causal is a spreadsheet built for the 21st century to help people work better with numbers. Behind Causal’s innocent web UI is a complex calculation engine — an interpreter that executes formulas on an in-memory, multidimensional database.

For the fellow nerds out there who like to use probability distributions in their spreadsheets, this is something Causal can do as well. You can type 10 to 100 in a cell, and it’ll run Monte Carlo simulations across all your cells to give you their distributions. Pretty cool.

I worked with them on how we could scale their engine from millions to billions of cells. The short of it is: by moving the core data structure from maps to arrays. 😅
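
The core of the change is easy to sketch (my own illustration, not Causal's code): cells keyed by coordinate tuples in a hash map, versus a flat, contiguous array addressed with index arithmetic. Once you're at billions of cells, the array's lower per-cell overhead and better cache locality dominate.

from array import array

# Illustrative sketch, not Causal's engine: two ways to store a dense
# 2-D grid of cells, e.g. scenarios x time steps.
N_SCENARIOS, N_STEPS = 1_000, 1_000

# Map-based: one hash entry per cell. Flexible, but every access hashes a
# tuple and chases pointers, and each entry carries tens of bytes of overhead.
cells_map = {}
def set_cell_map(scenario, step, value):
    cells_map[(scenario, step)] = value

# Array-based: one flat, contiguous block plus index arithmetic. Minimal
# per-cell overhead, and sequential scans stay cache-friendly.
cells_arr = array("d", [0.0]) * (N_SCENARIOS * N_STEPS)
def set_cell_arr(scenario, step, value):
    cells_arr[scenario * N_STEPS + step] = value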

#20
July 6, 2022
Read more

Napkin Math #19: Metrics For Your Web Application's Dashboards

At the beginning of the year I was helping readwise.io get their observability up to snuff. A few weeks later, “What should I monitor?” came up on another call, so I decided to list out the metrics I expect from a great dashboard (a small sketch of how a few of them roll up from raw samples follows the list):

  • Web Backend (e.g. Django, Node, Rails, Go, ..)
    • Response Time p50, p90, p99, sum, avg
    • Throughput by HTTP status
    • Worker Utilization
    • Request Queuing Time
    • Service calls
      • Database(s), caches, internal services, third-party APIs, ..
      • Enqueued jobs are important!
      • Circuit Breaker tripping /min
      • Errors, throughput, latency p50, p90, p99
    • Throttling
    • Cache hits and misses %
    • CPU and Memory Utilization
    • Exception counts /min
  • Job Backend (e.g. Sidekiq, Celery, Bull, ..)
    • Job Execution Time p50, p90, p99, sum, avg
    • Throughput by Job Status {error, success, retry}
    • Worker Utilization
    • Time in Queue
    • Queue Sizes
      • Don’t forget scheduled jobs and retries!
    • Service calls p50, p90, p99, count, by type
    • Throttling
    • CPU and Memory Utilization
    • Exception counts /min
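
If you're building these dashboards from raw samples, most of the items above reduce to a handful of aggregations. A minimal sketch of the percentile and worker-utilization ones (illustrative only, not tied to any particular monitoring stack):

# Illustrative aggregation of raw request samples into a few of the
# metrics above: latency percentiles and worker utilization.
def percentile(samples, p):
    """Nearest-rank percentile; good enough for a dashboard."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

durations_ms = [12, 15, 18, 22, 30, 45, 60, 120, 450, 900]  # made-up samples

p50, p90, p99 = (percentile(durations_ms, p) for p in (50, 90, 99))

# Worker utilization: busy time divided by (workers * wall-clock window).
workers, window_s = 8, 60
busy_s = sum(durations_ms) / 1000
utilization = busy_s / (workers * window_s)

print(p50, p90, p99, f"{utilization:.1%}")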

More details about what these all mean in the latest napkin post!

Any favourites of yours missing? Let me know.

#19
March 20, 2022
Read more

Napkin Math #18: Neural Net from Scratch

Happy new year everyone!

In this most recent post, we will establish a mental model for how a neural network works by building one from scratch! In a future issue we will do napkin math on its performance; establishing the first-principles understanding is plenty of ground to cover for today.

Neural nets are increasingly dominating the field of machine learning / artificial intelligence: the most sophisticated models for computer vision (e.g. CLIP), natural language processing (e.g. GPT-3), translation (e.g. Google Translate), and more are based on neural nets. When these artificial neural nets reach some arbitrary threshold of neurons, we call it deep learning.

A visceral example of deep learning’s unreasonable effectiveness comes from an interview with Jeff Dean, who leads AI at Google. He explains how ~500 lines of TensorFlow outperformed the previous ~500,000 lines of code behind Google Translate’s extremely complicated model. Blew my mind.
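
To make “from scratch” concrete: a neuron is just a weighted sum of its inputs plus a bias, squashed by a non-linearity, and a network is layers of these feeding into each other. A minimal single-neuron sketch (my own illustration, not the post's code):

import math, random

# A single artificial neuron: weighted sum of inputs plus a bias,
# passed through a sigmoid non-linearity.
def neuron(inputs, weights, bias):
    activation = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-activation))

random.seed(0)
inputs  = [0.5, -1.2, 3.0]
weights = [random.uniform(-1, 1) for _ in inputs]
bias    = random.uniform(-1, 1)

print(neuron(inputs, weights, bias))  # a value in (0, 1)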

#18
January 3, 2022
Read more

Careful Trading Complexity for Improvement

Whenever you find yourself arguing for improving infrastructure by yanking up complexity, you need to be very careful.

Published a new article this morning about balancing complexity versus infrastructure improvement.

Enjoy!

#17
November 30, 2021
Read more

Napkin Problem 16: When To Write a Simulator

My rule for when to write a simulator:

Simulate anything that involves more than one probability, probabilities over time, or queues.
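
As a hedged illustration of why: here is a tiny Monte Carlo sketch for a question that mixes two probabilities with retries, the kind of thing that is deceptively hard to reason about analytically. The scenario and numbers are made up.

import random

# Tiny Monte Carlo sketch (made-up scenario): what fraction of requests
# succeed within 3 attempts, when the dependency is degraded 10% of the
# time and the per-attempt failure rate depends on that state?
def fraction_succeeding(trials=100_000, attempts=3):
    successes = 0
    for _ in range(trials):
        degraded = random.random() < 0.10       # dependency health this request
        p_fail = 0.50 if degraded else 0.01     # per-attempt failure probability
        if any(random.random() >= p_fail for _ in range(attempts)):
            successes += 1
    return successes / trials

print(f"{fraction_succeeding():.2%} of requests succeed within 3 attempts")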

#16
September 16, 2021
Read more

Napkin Problem 15: Increase Performance by Fitting In the Initial TCP Slow Start Window

Did you know that if your site’s under ~12kb the first page will load significantly faster? Servers only send a few packets (typically 10) in the initial round-trip while TCP is warming up (referred to as TCP slow start). After sending the first set of packets, it needs to wait for the client to acknowledge it received all those packets.

Quick illustration of transferring ~15kb with an initial TCP slow start window (also referred to as initial congestion window or initcwnd) of 10 versus 30:
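
A rough sketch of the arithmetic behind that comparison, assuming ~1460-byte segments and a congestion window that roughly doubles every round trip during slow start (a simplification of mine; real stacks grow the window per ACK):

import math

MSS = 1460  # approximate payload bytes per TCP segment

def round_trips(transfer_bytes, initcwnd):
    """Round trips to deliver transfer_bytes during slow start, assuming
    the congestion window (in segments) roughly doubles per round trip."""
    segments_left = math.ceil(transfer_bytes / MSS)
    cwnd, rtts = initcwnd, 0
    while segments_left > 0:
        segments_left -= cwnd
        cwnd *= 2
        rtts += 1
    return rtts

print(round_trips(15_000, initcwnd=10))  # 2 round trips for ~15kb
print(round_trips(15_000, initcwnd=30))  # 1 round trip

On a 100 ms round trip, that is roughly the difference between ~100 ms and ~200 ms before the transfer completes.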

#15
July 29, 2021
Read more

Napkin Problem 14: Using checksums to verify syncing 100M database records

A common problem you’ve almost certainly faced is syncing two datastores. It comes up in numerous shapes and forms: receiving webhooks and writing them into your datastore, maintaining a materialized view, making sure a cache reflects reality, ensuring documents make it from your source of truth to a search index, or moving data from your transactional store to your data lake or column store.

If you’ve built such a system, you’ve almost certainly seen the two datastores drift out of sync. Building a completely reliable syncing mechanism is difficult, but perhaps we can build a checksumming mechanism that checks whether the two datastores are equal in a few seconds?
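
One shape such a checksumming mechanism could take (a sketch of my own, not necessarily the post's design): checksum fixed-size ranges of primary keys on both sides, compare the per-range digests, and only drill into the ranges that disagree.

import hashlib

# Illustrative range-based checksumming, not the post's design. Each side
# computes a digest per primary-key bucket from (id, updated_at); comparing
# digests narrows 100M rows down to the few buckets that actually differ.
BUCKET = 100_000

def bucket_checksums(rows):
    """rows: iterable of (id, updated_at) pairs, sorted by id."""
    digests = {}
    for id_, updated_at in rows:
        h = digests.setdefault(id_ // BUCKET, hashlib.sha256())
        h.update(f"{id_}:{updated_at}".encode())
    return {bucket: h.hexdigest() for bucket, h in digests.items()}

def mismatched_buckets(source_rows, destination_rows):
    a, b = bucket_checksums(source_rows), bucket_checksums(destination_rows)
    return [bucket for bucket in a.keys() | b.keys() if a.get(bucket) != b.get(bucket)]

In practice you would push the per-bucket aggregation down into SQL on each datastore rather than streaming every row to a client.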

#14
January 2, 2021
Read more

Napkin Problem 13: Filtering with Inverted Indexes

Database queries are all about filtering, whether you’re finding rows with a particular name, within a price range, or created within a time window. Trouble, however, ensues for most databases when you have many filters and none of them narrows down the results much.

This problem of filtering on many attributes efficiently has haunted me since Problem 3, and again in Problem 9. Queries that mass-filter are conceptually common in commerce merchandising/collections/discovery/discounts where you expect to narrow down products by many attributes. Devilish queries of the type below might be used to create a “Blue Training Sneaker Summer Mega-Sale” collection. The merchant might have tens of millions of products, and each attribute might be on millions of products. In SQL, it might look something like the following:
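
Conceptually, every such filter is an intersection of posting lists, one per attribute. A minimal sketch of that idea (my own illustration in Python, not the post's SQL; the ids are made up):

# Illustrative posting lists: attribute -> set of matching product ids.
index = {
    "color:blue":    {1, 3, 7, 9},
    "type:sneaker":  {3, 4, 7},
    "season:summer": {2, 3, 7, 8},
    "tag:training":  {3, 5, 7},
}

def filter_products(attributes):
    """Intersect the posting lists, smallest first to shrink the work."""
    lists = sorted((index[a] for a in attributes), key=len)
    result = lists[0].copy()
    for posting in lists[1:]:
        result &= posting
    return result

print(filter_products(["color:blue", "type:sneaker", "season:summer", "tag:training"]))
# {3, 7}

The pain the post digs into is that the work is proportional to the total length of the posting lists, so broad attributes that each match millions of products make even a very selective combined filter expensive.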

#13
November 8, 2020
Read more

Napkin Problem 12: Recommendations

Since last, I sat down with Adam and Jerod from The Changelog podcast to discuss Napkin Math! This ended up yielding quite a few new subscribers. Welcome, everyone!

For today’s edition: Have you ever wondered how recommendations work on a site like Amazon or Netflix?

#12
September 27, 2020
Read more

Napkin Problem 11: Circuit Breakers

You may have heard of a “circuit breaker” in the context of building resilient systems: the art of building reliable systems from unreliable components. But what is a circuit breaker?

Let’s set the scene for today’s napkin math post with a scenario. It’s pretty close to what our code looked like, conceptually, when we started working on resiliency at Shopify back in 2014.

Imagine a function like this (pseudo-Javascript-C-ish is a good common denominator) that’s part of rendering your commerce storefront:
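
In spirit, a circuit breaker wraps a call like that and, once it has failed too many times in a row, fails fast for a cooldown period instead of letting every request wait on a timeout. A minimal sketch of the idea (my own, in Python rather than the post's pseudo-Javascript; names and thresholds are made up):

import time, random

# Illustrative circuit breaker, not Shopify's or the post's implementation.
# After too many consecutive failures the breaker "opens" and fails fast
# for a cooldown, instead of stacking up slow timeouts.
class CircuitBreaker:
    def __init__(self, threshold=5, cooldown_s=30):
        self.threshold, self.cooldown_s = threshold, cooldown_s
        self.failures, self.opened_at = 0, None

    def call(self, fn, fallback):
        if self.opened_at and time.time() - self.opened_at < self.cooldown_s:
            return fallback()                            # open: fail fast
        try:
            result = fn()
            self.failures, self.opened_at = 0, None      # success closes it
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.time()
            return fallback()

# Hypothetical usage while rendering a storefront: fetch recommendations,
# but fall back to an empty list rather than hang when the service is down.
recommendations_breaker = CircuitBreaker()

def fetch_recommendations():
    if random.random() < 0.5:                            # stand-in for a flaky service
        raise TimeoutError("recommendation service timed out")
    return ["product-1", "product-2"]

print(recommendations_breaker.call(fetch_recommendations, fallback=lambda: []))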

#11
August 22, 2020
Read more

Napkin Problem 10: MySQL transactions per second vs fsyncs per second

Napkin friends, from near and far, it’s time for another napkin problem!

Since the beginning of this newsletter I’ve posed problems for you to try to answer. Then in the next month’s edition, you hear my answer. Talking with a few of you, it seems many of you read these as posts regardless of their problem-answer format.

That’s why I’ve decided to experiment with a simpler format: posts where I both present a problem and solution in one go. This one will be long, since it’ll include an answer to last month’s.

Hope you enjoy this format! As always, you are encouraged to reach out with feedback.

#10
July 17, 2020
Read more

Napkin Problem 9

Napkin friends, from near and far, it’s time for another napkin problem!

As always, consult sirupsen/napkin-math to solve today’s problem, which has all the resources you need. Keep in mind that with napkin problems you always have to make your own assumptions about the shape of the problem.

We hit an exciting milestone since last with a total of 500 subscribers! Share the newsletter (https://sirupsen.com/napkin/) with your friends and co-workers if you find it useful.

#9
June 7, 2020
Read more

Napkin Problem 8

Napkin friends, from near and far, it’s time for another napkin problem!

As always, consult sirupsen/napkin-math to solve today’s problem, which has all the resources you need. Keep in mind that with napkin problems you always have to make your own assumptions about the shape of the problem.

Since last time, I’ve added to the napkin math table. There’s plenty more I’d like to see; I’m happy to receive help from anyone eager to write some Rust!

#8
May 3, 2020
Read more

Napkin Problem 7

Napkin friends, from near and far, it’s time for another napkin problem!

As always, consult sirupsen/napkin-math to solve today’s problem, which has all the resources you need. Keep in mind that with napkin problems you always have to make your own assumptions about the shape of the problem.

I debated putting out a special edition of the newsletter with COVID-related napkin math problems. However, I ultimately decided to resist, as it’s exceedingly likely to encourage misinformation. Instead, I am attaching a brief reflection on napkin math in this context.

#7
April 11, 2020
Read more

Napkin Problem 6

Napkin friends, from near and far, it’s time for napkin problem number 6!

As always, consult sirupsen/napkin-math to solve today’s problem, which has all the resources you need. Keep in mind that with napkin problems you always have to make your own assumptions about the shape of the problem.

#6
March 7, 2020
Read more

Napkin Problem 5

Napkin friends, from near and far, it’s time for napkin problem number 5! If you are wondering why you’re receiving this email, you likely watched my talk on napkin math and decided to sign up for some monthly practice.

Since last, I’ve added system call overhead to the napkin-math repository. I’ve also been working on leveraging io_uring to queue I/O sys-calls (in more recent kernels, network is also supported; it’s under active development). This avoids system-call overhead and allows the kernel to order the calls as efficiently as it likes.

#5
February 3, 2020
Read more

Napkin Problem 4

Napkin friends, from near and far, it’s time for napkin problem number four! If you are wondering why you’re receiving this email, you likely watched my talk on napkin math and decided to sign up for some monthly training.

Since last, there have been some smaller updates to the napkin-math repository and the accompanying program. I’ve been brushing up on x86 to ensure that the base rates truly represent the upper bound, which will require some smaller changes. The numbers are unlikely to change by an order of magnitude, but I am dedicated to making sure they are optimal. If you’d like to help with some napkin calculations, I’d love contributions around serialization (JSON, YAML, …) and compression (Gzip, Snappy, …). I am also working on turning all my notes from the above talk into a long, long blog post.

#4
January 7, 2020
Read more

Napkin Problem 3

Napkin friends, from near and far, it's time for napkin problem number three! If you are wondering why you're receiving this email, you likely watched my talk on napkin math.

This week's problem is higher level, which is different from the past few. This makes it more difficult, but I hope you enjoy it!

You are considering how you might implement a set-membership service. Your use-case is to build a service to filter products by particular attributes, e.g. efficiently get, among all products for a merchant, the shoes that are: black, size 10, and brand X. Before getting fancy, you'd like to examine whether the simplest possible algorithm would be sufficiently fast: store, for each attribute, a list of all product ids for that attribute (see drawing below). Each query to your service will take the form: "shoe AND black AND size-10 AND brand-x". To serve the query, you find the intersection (i.e. the product ids that match in all terms) between all the attributes. This should return the product ids for all products that match that condition. In the case of the drawing below, only P3 (of those visible) matches those conditions.

The largest merchants have 1,000,000 different products. Each product will be represented in this naive data structure as a 64-bit integer. While simply shown as a list here, you can assume that we can perform the intersections between rows efficiently in O(n) operations. In other words, in the worst case you have to read all the integers for each attribute only once per term in the query. We could implement this in a variety of ways, but the point of the back-of-the-envelope calculation is to not get lost in the weeds of implementation too early.

What would you estimate the worst-case performance of an average query with 4 AND conditions to be? Based on this result and your own intuition, would you say this algorithm is sufficient, or would you investigate something more sophisticated? As always, you can find the resources you need in sirupsen/napkin-math. The talk linked there is the best introduction to the topic. Please reply with your answer!

50 * 0.8 = 40 of the disk reads come out of the memory cache. The remaining 10 SSD reads each require a random SSD seek, which will take about 100 us per the reference. The reference says 64 bits, but the OS will read a full page at a time from the SSD, so this will be roughly right. So call it a lower bound of 1 ms of SSD time. The page-cache reads will all take less than a microsecond, so we won't even factor them in; it's typically the case that we can ignore memory latency as soon as I/O is involved. Somewhere between 1 and 10 ms seems reasonable once you add in database overhead and consider that 1 ms of disk access is a lower bound.
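
For the set-membership question itself, one way to run the numbers (my own rough estimate, using round throughput figures in the spirit of the napkin-math reference; your assumptions may differ):

# Rough worked estimate for the set-membership problem -- my own numbers,
# not the newsletter's official answer. Worst case: read every id per term.
products     = 1_000_000      # ids per attribute, worst case
bytes_per_id = 8              # 64-bit integers
terms        = 4              # "shoe AND black AND size-10 AND brand-x"

bytes_scanned  = products * bytes_per_id * terms   # 32 MB
mem_throughput = 10e9                              # ~10 GB/s sequential memory read
ssd_throughput = 1e9                               # ~1 GB/s sequential SSD read

print(f"{bytes_scanned / 1e6:.0f} MB scanned")
print(f"~{bytes_scanned / mem_throughput * 1e3:.0f} ms if resident in memory")
print(f"~{bytes_scanned / ssd_throughput * 1e3:.0f} ms if read from SSD")

Whether single-digit to tens of milliseconds per query is acceptable depends on your latency budget, which is exactly the judgment call the problem asks you to make.
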
#3
December 15, 2019
Read more

Napkin Problem 2

Fellow computer-napkin-mathers, it's time for napkin problem #2. The last problem's solution you'll find at the end!

I've updated sirupsen/napkin-math with last week's tips and tricks; consult that repo if you need a refresher. My goal is for that repo to become a great resource for napkin calculations in the domain of computers. The video of my SRECON talk was published this week, you can see it here.

Reply to this email with your answer; I'm happy to provide you mine ahead of time if you're curious.

First I jotted down the basics and converted them to scientific notation for easy calculation. Then I multiplied these numbers into storage per day:
#2
November 2, 2019
Read more

Napkin Problem 1

Napkin friends around the world: it's time for your very first systems estimation problem! Confused why you're receiving this email? Likely you attended my talk at SRECON 19, where I said that I'd start a newsletter with occasional problems to practise your back-of-the-envelope computer calculation skills--if enough of you subscribed! Enough of you did, so here we are!

Reply to this email with your answer and how you arrived there. Then I'll send you mine. You can find many of the numbers you might need in sirupsen/napkin-math.

  • Don't overcomplicate the solution by including e.g. CDN logs, slow query logs, etc. Keep it simple.
  • Remember that you need less precision than you think: your goal is just to get the exponent right, the x in n * 10^x.
  • A calculator that handles units is fine to lean on the first few times--but over time the goal is for you to be able to do these calculations with no aids!
  • Consider using spaced repetition to remember the numbers you need for today's problem.
#1
October 19, 2019
Read more