Napkin Problem 12: Recommendations


                            September 27, 2020


Since the last edition, I sat down with Adam and Jerod from The Changelog
podcast to discuss Napkin Math! This ended up yielding quite a few new
subscribers. Welcome, everyone!
For today’s edition: Have you ever wondered how recommendations work on a site
like Amazon or Netflix?

First we need to define similarity/relatedness. There are many ways to do this. We
could figure out similarity by having a human label the data for what’s relevant
when the customer is looking at something else: If you’re buying black dress
shoes, you might be interested in black shoe polish. But if you’ve got millions
of products, that’s a lot of work!
Instead, most simple recommendation algorithms are based on what’s called
“collaborative filtering.” We find other users that seem to be similar to you.
If we know you’ve got a big overlap in watched TV shows to another user, perhaps
you might like something else that user liked that you haven’t watched yet?
This recommendation method is much less laborious than a human manually
labeling content (in reality, big companies do human labeling and
collaborative filtering and other dark magic).
In the example below, User 3 looks similar to User 1, so we can infer that they
might like Item D too. In reality, the more columns (items) we can use to
compare, the better the results.

Based on this, we can design a simple algorithm for powering our
recommendations! With N items and M users, we can store the M x N matrix
shown in the drawing as a two-dimensional array, representing check-marks
by 1 and empty cells by 0. We can loop through each user and compare them
with every other user, preferring recommendations from the users they have
the most check-marks in common with. This is a simplification of cosine
similarity, the simple vector math typically used to compare two vectors.
The ‘vector’ here is the row of 0s and 1s, one per product, for a user.
For the purpose of this article, it’s not terribly important to understand this
in detail.
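To make the loop concrete, here is a minimal sketch of the comparison described above, assuming a tiny in-memory matrix. The function names and the sample data are made up for illustration; a real system would use a library rather than plain Python lists.

```python
import math

def common_checkmarks(a, b):
    """Count items both users have checked (the dot product of two 0/1 vectors)."""
    return sum(x & y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by the vector lengths; 0.0 if either vector is all zeros."""
    dot = common_checkmarks(a, b)
    # For 0/1 vectors the squared length is just the number of 1s.
    norm = math.sqrt(sum(a)) * math.sqrt(sum(b))
    return dot / norm if norm else 0.0

def most_similar_user(matrix, user):
    """Index of the other user sharing the most check-marks with `user`."""
    best, best_score = None, -1
    for other, row in enumerate(matrix):
        if other == user:
            continue
        score = common_checkmarks(matrix[user], row)
        if score > best_score:
            best, best_score = other, score
    return best

def recommend(matrix, user):
    """Items the most similar user has checked that `user` has not."""
    other = most_similar_user(matrix, user)
    pairs = zip(matrix[user], matrix[other])
    return [i for i, (mine, theirs) in enumerate(pairs) if theirs and not mine]

# Hypothetical data: rows are users, columns are items 0..3.
matrix = [
    [1, 1, 0, 1],  # user 0
    [0, 0, 1, 0],  # user 1
    [1, 1, 0, 0],  # user 2
]
print(recommend(matrix, 2))  # user 2 overlaps most with user 0
```

Counting shared check-marks favors users with many check-marks overall; cosine similarity corrects for that by dividing by the vector lengths.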
How long would it take to run this algorithm to find similar users for a
million users and a million products?
Each user would have a million bits to represent the columns. That’s 10^6 bits
= 125 kB per user. For each user, we’d need to look at every other user: 125
kB/user * 1 million users = 125 GB. 125 GB is not completely unreasonable to
hold in memory, and since it’s sequential access, even if this was SSD-backed
and not all in memory, it’d still be fast. We can read memory at ~10 GB/s,
so that’s 12.5 seconds to find the most similar user for each user. That’s way
too slow to run as part of a web request!
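The arithmetic above, written out as a short script. The 10 GB/s figure is the rough sequential memory read rate assumed in the text, not a measurement.

```python
users = 1_000_000
items = 1_000_000

bytes_per_user = items / 8            # one bit per item -> 125,000 bytes = 125 kB
total_bytes = bytes_per_user * users  # 1.25e11 bytes = 125 GB
memory_bandwidth = 10e9               # assumed ~10 GB/s sequential read

# Time to scan every other user's row once, for a single user.
seconds_per_user = total_bytes / memory_bandwidth

print(f"{bytes_per_user / 1e3:.0f} kB per user")
print(f"{total_bytes / 1e9:.0f} GB to scan")
print(f"{seconds_per_user:.1f} s per user")
```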
If we precomputed this in the background on a single machine, it’d take
12.5 s/user * 1 million users = 12.5 million seconds ~= 145 days ~= 21 weeks.
That sounds frightening, but this is an ‘embarrassingly parallel problem.’ It
means we can process User A’s recommendations on one machine, User B’s on
another, and so on. This is what a batch compute job on e.g. Spark would do.
This is really 12.5 million CPU seconds. If we had 3000 cores it’d take us
about an hour and cost us 3000 cores * $0.02/core-hour * 1 hour = $60. Most likely these
recommendations would earn us way more than $60, so even this is not too bad!
When people talk about Big Data computations, these are the types of large jobs
they’re referring to.
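The batch-job estimate can be checked the same way. This sketch assumes perfect parallel scaling, with every core sustaining the same scan rate, which is optimistic. The core count and price are the figures used above.

```python
seconds_per_user = 12.5
users = 1_000_000
cores = 3000
price_per_core_hour = 0.02  # USD per core-hour, the figure used above

cpu_seconds = seconds_per_user * users         # 12.5 million CPU seconds
single_machine_days = cpu_seconds / 86_400     # about 145 days
wall_clock_hours = cpu_seconds / cores / 3600  # about 1.16 hours

cost_rounded = cores * price_per_core_hour * 1               # one full hour
cost_exact = cores * price_per_core_hour * wall_clock_hours  # unrounded

print(f"{single_machine_days:.0f} days on one core")
print(f"{wall_clock_hours:.2f} hours on {cores} cores")
print(f"${cost_rounded:.0f} rounded, ${cost_exact:.0f} unrounded")
```

Rounding "about an hour" to exactly one hour gives the $60 in the text. The unrounded figure is closer to $70, which does not change the conclusion.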
Even on this simple algorithm, there is plenty of room for optimizations.
There will be a lot of zeros in such a wide matrix (‘sparse’), so we could store
vectors of item ids instead. We could quickly skip users if they have fewer 1s
than the most similar user we’ve already matched with.  Additionally, matrix
operations like this one can be run efficiently on a GPU. If I knew more about
GPU-programming, I’d do the napkin math on that! On the list for future
editions. The good thing is that libraries used to do computations like this
usually do these types of optimizations for you.
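Here is a rough sketch of the first two optimizations, assuming each user is stored as a set of item ids instead of a million-bit row. The skip rule is one reading of the idea above: a user with fewer items than the best overlap found so far cannot beat it. The function name and the data are hypothetical.

```python
def most_similar_sparse(baskets, user):
    """baskets: dict of user id -> set of item ids.

    Returns (most similar user, number of shared items).
    """
    mine = baskets[user]
    best, best_overlap = None, 0
    for other, theirs in baskets.items():
        if other == user:
            continue
        # A user with n items can share at most n items with us, so skip
        # anyone who cannot beat the best overlap found so far.
        if len(theirs) <= best_overlap:
            continue
        overlap = len(mine & theirs)
        if overlap > best_overlap:
            best, best_overlap = other, overlap
    return best, best_overlap

# Hypothetical data: a handful of item ids per user.
baskets = {
    "u1": {1, 2, 3, 4},
    "u2": {1, 2, 3},
    "u3": {9},
    "u4": {2, 3, 4, 7},
}
print(most_similar_sparse(baskets, "u1"))
```

With sets, the work per comparison scales with the number of items a user actually has rather than with the total number of products.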
Cool, so this naive algorithm is feasible as a first iteration of our
recommendation system. We compute the recommendations periodically on
a large cluster and shove them into MySQL/Redis/whatever for quick access on our
site.
But there’s a problem… If I just added a spatula to the cart, don’t you want
to immediately recommend me other kitchen utensils? Our current algorithm is
great for general recommendations, but it fails to be real-time enough to assist
a shopping session. We can’t wait for the batch job to run again. By that time,
we’ll already have bought a shower curtain and forgotten to buy a curtain rod
since the recommendation didn’t surface. Bummer.
What if instead of a big offline computation to figure out user-to-user
similarity, we do a big offline computation to compute item-to-item similarity?
This is what Amazon did back in 2003 to solve this problem. Today, they
likely do something much more advanced.
We could devise a simple item-to-item similarity algorithm that counts for
each item the most popular items that customers who bought that item also
bought.
The output of this algorithm would be something like the matrix below. Each cell
is the count of customers that bought both items. For example, 17
people bought both item 4 and item 1, which in comparison to the other
counts means that it might be a great idea to suggest item 1 to people
buying item 4, or vice-versa!

This algorithm has complexity even worse than the previous one, because in the
worst case we have to look at each item for each item for each customer: O(N^2 * M).
In reality, however, most customers haven’t bought that many items, which makes
the complexity generally O(NM) like our previous algorithm. This means that,
ballpark, the running time is roughly the same (an hour for $60).
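A minimal sketch of the item-to-item counting described above, assuming purchases fit in memory as one set of item ids per customer. It is not a reconstruction of Amazon's algorithm. The function names and sample orders are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

def co_purchase_counts(orders):
    """orders: dict of customer id -> set of item ids bought.

    Returns item -> Counter of other items bought by the same customers.
    """
    counts = defaultdict(Counter)
    for items in orders.values():
        # Every pair of items in one customer's history counts once.
        # This is quadratic in the basket size, which is small in practice.
        for a, b in combinations(sorted(items), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts

def also_bought(counts, item, k=3):
    """The k items most often bought by customers who bought `item`."""
    return [other for other, _ in counts[item].most_common(k)]

# Hypothetical purchase histories.
orders = {
    "c1": {"spatula", "whisk"},
    "c2": {"spatula", "whisk", "ladle"},
    "c3": {"shower curtain", "curtain rod"},
}
counts = co_purchase_counts(orders)
print(also_bought(counts, "spatula", k=1))
print(also_bought(counts, "shower curtain"))
```

The inner loop only visits items a customer actually bought, which is why the cost stays far below the O(N^2 * M) worst case when baskets are small.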
Now we’ve got a much more versatile computation for recommendations. If
we store all these recommendations in a database, we can immediately as part of
serving the page tell the user which other products they might like based on the
item they’re currently viewing, their cart, past orders, and more. The two
recommendation algorithms might complement each other. The first is good for
home-page, broad recommendations, whereas the item-to-item similarity is good
for real-time discovery on e.g. product pages.
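At request time, serving a recommendation then becomes a lookup. The sketch below assumes the precomputed counts have been loaded into a plain dict standing in for MySQL/Redis. Summing the counts across cart items is one simple way to combine them, chosen here for illustration. The name `recommend_for_cart` and the numbers are made up.

```python
from collections import Counter

def recommend_for_cart(related, cart, k=3):
    """related: item -> {other item: co-purchase count}, precomputed offline.

    Returns up to k items not in the cart, ranked by summed counts.
    """
    scores = Counter()
    for item in cart:
        for other, count in related.get(item, {}).items():
            if other not in cart:
                scores[other] += count
    return [item for item, _ in scores.most_common(k)]

# Hypothetical precomputed table.
related = {
    "spatula": {"whisk": 17, "ladle": 5},
    "whisk": {"spatula": 17, "ladle": 9, "bowl": 4},
}
print(recommend_for_cart(related, {"spatula"}))
print(recommend_for_cart(related, {"spatula", "whisk"}))
```

Nothing in the request path depends on the number of users or products, only on the size of the cart and the length of each item's related list.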
My experience with recommendations is quite limited. If you work with these
systems and have any corrections, please let me know! A big part of my incentive
for writing these posts is to explore and learn for myself. Most articles that
talk about recommendations focus on the math involved, and you’ll easily be able
to find those. Here I wanted to focus more on the computational aspect and not
get lost in the weeds of linear algebra.
P.S. Do you have experience running Apache Beam/Dataflow at scale? Very
interested to talk to you.
