tl;dr: I’ve started working in earnest on a new book, tentatively titled Practical Math for Programmers. Sign up for a mailing list to be notified when it’s published.
This week is a “perfect storm” for me. My wife and infant son are with the in-laws in a different city, and I took a week of vacation from work.
And so I’m taking the opportunity to pretend I’m a full-time math book author. I’m on track to spend about 65% of a full-time work week writing, with the remaining time spent between chores/errands, self-care, blogging, newsletter-ing, and general math reading not related to writing my book.
As mentioned previously, Practical Math is supposed to consist of short, self-contained implementations of neat math ideas that are enthusiastically used in production settings. Each section (I’m calling them “Tips,” like Drew Neil in Practical Vim) is 3–4 pages long, code included; it describes the general idea and high-level insights behind the algorithm, mentions any relevant guarantees, and points to lots of additional reading.
The hope is that someone who isn’t yet invested in learning math, but wants to take advantage of the nice solutions math provides, can copy/paste (or translate) the code in the book into their own setting and get from zero to a decent solution quickly. Due to space constraints, I will have to take some slight liberties with the implementations, so they will likely not be optimal or especially performant as written, though I think they should be good enough.
One of the simpler examples of a Practical Math tip is how to predict when an error budget for an SLO (as in Service Level Objectives) will be exhausted based on the recent error rate, which can be viewed through the lens of basic differential calculus.
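As a taste of what that tip boils down to, here’s a minimal sketch. The function name and numbers are made up for illustration, and this is only the constant-burn-rate special case: treat the recent error rate as the derivative of budget consumption and linearly extrapolate to zero.

```python
def days_until_budget_exhausted(budget_remaining: float,
                                burn_rate_per_day: float) -> float:
    """Linearly extrapolate when the error budget hits zero.

    budget_remaining:  fraction of the SLO error budget still unspent.
    burn_rate_per_day: fraction of the budget consumed per day, measured
                       over a recent window (the "derivative" of budget
                       consumption).
    """
    if burn_rate_per_day <= 0:
        return float("inf")  # the budget is not being consumed at all
    return budget_remaining / burn_rate_per_day

# With 40% of the budget left, burning 5% per day, the budget runs out
# in roughly 8 days.
days = days_until_budget_exhausted(0.40, 0.05)
```

The tip in the book treats a time-varying error rate, which is where the actual calculus comes in; the sketch above is the degenerate case of a constant derivative.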
A more advanced example is a Markov-chain method for attributing resource usage in a dependency graph of microservices. Imagine your company runs a bunch of services that use cloud storage and CPU, but some of the services only use them indirectly through other services. How can you determine which end-users are the top consumers of a particular raw resource? This tip answers that.
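The service names and traffic fractions below are invented for illustration, but a minimal version of the propagation step might look like this: each service’s total cost (direct plus inherited) flows to its callers in proportion to the fraction of its load each caller drives, until all cost is absorbed by end-users, the absorbing states of the Markov chain.

```python
# Hypothetical data: direct resource cost per service, and for each
# service, the fraction of its load driven by each caller.
direct_cost = {"storage": 100.0, "api": 20.0, "web": 5.0}
callers = {
    "storage": {"api": 0.7, "web": 0.3},
    "api": {"web": 0.5, "user_mobile": 0.5},
    "web": {"user_browser": 1.0},
}

def attribute(direct_cost, callers, max_rounds=100):
    """Push each node's cost up to its callers until only end-users
    (nodes with no entry in `callers`, the absorbing states) hold cost."""
    cost = dict(direct_cost)
    attributed = {}
    for _ in range(max_rounds):
        if not cost:
            break
        next_cost = {}
        for node, c in cost.items():
            if node not in callers:  # an end-user absorbs its share
                attributed[node] = attributed.get(node, 0.0) + c
            else:
                for caller, frac in callers[node].items():
                    next_cost[caller] = next_cost.get(caller, 0.0) + frac * c
        cost = next_cost
    return attributed

result = attribute(direct_cost, callers)
# On this toy graph, user_mobile ends up with ~45.0 and user_browser
# with ~80.0 of the 125.0 total, matching a hand computation.
```

The fixed round count is a crude way to handle cycles; on a DAG the loop terminates on its own once every unit of cost reaches an end-user.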
Also on my list are:
I’d love to hear suggestions if you have them. I only require that some math idea is included, and that the solution can be written in ~30 lines of Python without using too many fancy libraries.
I’m aiming for ~75 tips (est 225 page book), though I could go as low as 60 + some nice essays. Before this week I had written 6 tips. This week so far, I’ve written 3.5 (the 0.5 is one for which the code is finished, but the prose is not). Roughly half of that time was spent doing research, i.e., refreshing my memory, finding citations, and nailing down details. I will probably finish 4 tips by the end of the week. At that rate, if I were to write 65% of full-time, I would expect to finish the first draft of the book in about 4 months. Double that and I can comfortably write this book in a year, with enough time to take the Summer off. Not bad!
It’s been strangely relaxing to get back into reading research papers. I’ve fallen out of touch with recent research, missing HUGE breakthroughs, though I was preoccupied with a newborn when that bomb dropped. While it feels like seeing an old friend, it also feels enviously bittersweet, like I’ve been ignoring that old friend and they’ve moved on without me, for the better. What lasting work have I done in the meantime? Certainly nothing remotely of interest outside of my insular cost-cutting organization at Google. Meanwhile, a handful of my old friends and colleagues are high-profile professors, or are working in esteemed research labs on what seem like important and gratifying problems. In my imagination they’re having a blast, and I can only hope it’s a “grass-is-always-greener” situation. But hey, if I can make solo authorship work at 65% time, maybe I can carve out some time for research on the side.
Most of the tips I’ve written this week have been on the theme of sketching. Sketching is the name for a class of algorithms that compute some statistic on a data stream while using severely restricted space. In many cases, the space usage is independent of both the size of the stream and the size of the true quantity being estimated.
For example, the HyperLogLog algorithm, which estimates the number of unique items in a data stream, is configured in most applications to use only about 2kB of RAM no matter how large the data stream is, and no matter the true number of unique items.
This is extremely impressive. How could it possibly work!? The idea, as in so many sketching algorithms, is to use a hash to make the bits of each data item sufficiently random that statistics on the bits themselves can be used to infer information about the population. In the case of LogLog-style algorithms, you hash each item, count the trailing zeros in the binary representation of the hash, and keep track of the longest run you see. Then you can estimate the size of the total population as 2^max(num_zeros), because (with a random-enough hash) you expect to see an item with k trailing zeros once in every 2^k distinct items. The nature of the maximum function means it doesn’t matter how often a single item occurs in the stream, nor how big the stream is, since you only store one number at any time.
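To make the trailing-zeros trick concrete, here is a deliberately crude single-estimator version. This is not HyperLogLog itself: it skips the bucketing and bias corrections, and its estimate is a high-variance power of two. But it shows the core idea in a few lines.

```python
import hashlib

def trailing_zeros(n: int) -> int:
    """Number of trailing zero bits in n (treat 0 as the max we track)."""
    if n == 0:
        return 64
    return (n & -n).bit_length() - 1

def crude_cardinality_estimate(stream) -> float:
    """Single-estimator LogLog idea: 2^(longest trailing-zero run seen)."""
    max_zeros = 0
    for item in stream:
        # Use the first 8 bytes of a cryptographic hash as "random" bits.
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        max_zeros = max(max_zeros, trailing_zeros(h))
    return 2.0 ** max_zeros

# Duplicates don't change the estimate: only the running max is stored,
# so a stream with the same distinct items gives the same answer.
est1 = crude_cardinality_estimate(str(i) for i in range(1000))
est2 = crude_cardinality_estimate(str(i % 1000) for i in range(100_000))
assert est1 == est2
```

The single estimator is wildly noisy (one lucky hash doubles the estimate), which is exactly what the averaging machinery below is for.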
As an aside, this is the same idea that the Bitcoin protocol uses, in a converse way, to set its difficulty target. If you want it to take longer to mine a block, require that the hash-inversion step produce a number below a smaller target, i.e., one with a longer run of leading zeros. Sadly, in Bitcoin these interesting ideas are used to create waste, rather than to reduce waste as in HyperLogLog.
Of course, since there is only one hash function in HyperLogLog, you must account for hash collisions. But you can easily estimate how often those occur and correct for it. You also have to deal with the variance of the estimate. To reduce the variance, you can do a neat trick called stochastic averaging, in which you imitate having multiple hash functions by using the first few bits of an item’s hash as the index of a bucket, and then at the end you average the estimates from each bucket, being very careful to deal with any bias that might occur from your choice of averaging scheme. This essentially partitions the underlying population by a small hash key, and applies the HyperLogLog estimate to each part, combining at the end.
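A toy version of that bucketing scheme, again heavily hedged: this uses the simple LogLog-style combination (m · 2^mean of the per-bucket maxima) and omits the bias-correction constants and harmonic mean that real HyperLogLog needs, so its estimates run noticeably high.

```python
import hashlib

def bucketed_estimate(stream, bucket_bits: int = 4) -> float:
    """Stochastic averaging: the first few hash bits pick a bucket, the
    rest feed a max-trailing-zeros estimator per bucket."""
    m = 1 << bucket_bits
    max_zeros = [0] * m
    for item in stream:
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        bucket = h >> (64 - bucket_bits)          # top bits pick the bucket
        rest = h & ((1 << (64 - bucket_bits)) - 1)  # remaining "random" bits
        tz = (rest & -rest).bit_length() - 1 if rest else 64 - bucket_bits
        max_zeros[bucket] = max(max_zeros[bucket], tz)
    # Each bucket sees ~1/m of the distinct items, so scale back up by m.
    # Real HyperLogLog replaces this combination with a harmonic mean and
    # a correction constant; without them the estimate is biased upward.
    return m * 2.0 ** (sum(max_zeros) / m)

est = bucketed_estimate((str(i) for i in range(10_000)), bucket_bits=6)
# 'est' lands in the right ballpark for 10,000 distinct items, but
# overestimates without the missing correction constant.
```

More buckets means lower variance at the cost of a few more bytes of state, which is where the “about 2kB no matter what” figure comes from.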
Look forward to a more detailed description, and a complete implementation, in Practical Math. In the meantime, you can marvel at how popular HyperLogLog is by seeing its implementation in Redis (the most elegant and advanced implementation I could find based on the literature), Postgres (now supported in Amazon Aurora RDS), Elasticsearch, at Reddit, and in Google’s BigQuery as APPROX_COUNT_DISTINCT.
What’s in production, indeed!