July 29, 2022, 12:34 p.m.

☕ Words cannot espresso how much coffee means to me

Late To The Party

It’s been a good week over here. There are now 600+ people reading these weekly emails, which is incredible. Hope you have a great start to the weekend. Now for some machine learning!

The Latest Fashion

  • Version 5 of the Practical Deep Learning for Coders course just dropped.
  • I guess today is a Jeremy Howard issue since his guide to matrix calculus for deep learning is really great.
  • This repo has PyTorch code, models, and pre-trained weights for all your favourite deep learning vision architectures.

Got this from a friend? Subscribe here!

My Current Obsession

I spent the weekend exploring the region around the Rhein and all the castles and fortresses. That was lovely. The Loreley area around Koblenz is fairly famous for its beautiful landscape, so I had a nice stroll along the Rhein listening to podcasts.

Castles-on-Rhein.jpg

Pythondeadlin.es got its first pull request! Very happy about the two new conferences that were added.

On LinkedIn, I cracked 10,000 followers. And I have been posting a bit more this week, which has been lovely. I got a bunch of new Python libraries to check out on both LinkedIn and Twitter, for example!

And finally, I booked my travel to Euroscipy in Basel. Maybe see you there?

Thing I Like

A whimsical one today. Did you know you can buy googly eyes and make everything funnier? My latest fave: big googly eyes on my vacuum robot. Definitely recommend.

Hot off the Press

I shared four books for ML Ops, one of which is the highly anticipated book by Chip Huyen.

Then I went on to publish this short intro to three books that will get you on the way towards machine learning research.

Machine Learning Insights

Last week I asked, “What is the Double Descent phenomenon?”, and here’s the gist of it:

The Double Descent hypothesis is an interesting quirk of statistics and deep learning.

It explains why smaller models aren’t always worse, and larger models aren’t always better.

Even worse… it shows that more data isn’t always better!

Bias-variance Trade-off

A common topic in statistics is the bias-variance trade-off.

One way to look at this trade-off is that linear models are very robust against overfitting, considering they only have two parameters. They often aren’t enough to capture the complexity of a data set, though. Once we increase the complexity, we are better able to capture the training data; without regularization, however, these models become highly attuned to the training data.

This also means that the model is less likely to generalize to unseen data. The model’s capacity to overfit increases.
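Here’s a minimal sketch of that gap, assuming NumPy and scikit-learn are installed (the sample sizes, noise level, and choice of a decision tree as the “complex, unregularized model” are illustrative choices of mine, not anything canonical): a two-parameter linear model versus an unpruned tree on the same noisy data.

```python
# Minimal sketch of the bias-variance trade-off: a two-parameter linear model
# vs. an unregularized decision tree on noisy data. All sizes are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=400)   # nonlinear signal + noise
X_train, y_train = X[:200], y[:200]
X_test, y_test = X[200:], y[200:]

for name, model in [("linear model", LinearRegression()),
                    ("unpruned tree", DecisionTreeRegressor(random_state=0))]:
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name:>13}: train MSE {train_mse:.3f} | test MSE {test_mse:.3f}")
# The linear model underfits, but its train and test errors stay close together.
# The unpruned tree drives its training error to ~0 yet does noticeably worse on
# unseen data than on the training set -- the overfitting gap described above.
```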

Classic Statistics

Classic statistics would expect that an overly simple model is just as bad as an over-parametrized model. Yet there’s a sweet spot in the middle of the parameter space.

Think of a data distribution for x³ with some noise. A linear model fits it pretty badly, a quadratic model can at least fit half of a parabola, a cubic model is ideal for the data, a fourth-order model gets worse again (similar to the parabola), and higher-order models get incrementally worse. Once the model’s order matches the number of samples, it starts fitting the noisy data points directly, swinging wildly in between them to hit each one exactly: the common “overfitting” image that gets used.

This is the first descent: plotted against the number of model parameters, the model error is minimized in that central sweet spot.

overfitting.png
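A rough sketch of that first descent, using only NumPy (the degrees, sample count, and noise level here are arbitrary choices of mine): fit polynomials of increasing degree to noisy samples of x³ and watch the test error trace the classic U-shape.

```python
# Sketch of the first descent: polynomial fits of increasing degree to noisy
# samples of x^3. Test error drops to a sweet spot around degree 3, then rises.
import numpy as np

rng = np.random.default_rng(42)

def make_data(n):
    x = rng.uniform(-2, 2, n)
    y = x**3 + rng.normal(scale=1.0, size=n)   # cubic signal plus noise
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(500)

for degree in [1, 2, 3, 5, 9, 15]:
    coeffs = np.polyfit(x_train, y_train, degree)            # least-squares fit
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: test MSE = {test_mse:8.2f}")
# Expected pattern: the error bottoms out near degree 3 (the sweet spot) and
# then climbs again as higher-degree polynomials start chasing the noise.
```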

Modern Machine Learning

Modern machine learning, however, presents an interesting contradiction to classic theory.

Highly over-parametrized models like neural networks… work. They work incredibly well at that. So well, in fact, that an infinitely wide multi-layer perceptron can be considered a universal function approximator. Large-scale models can pull off some incredible feats, from self-driving cars to language understanding, well, at least part of the way there.

This is the second descent that minimizes the model error.

The Double Descent

We have a first descent in the classical statistics parameter space and another descent where modern neural networks live.

Clearly, there has to be a little bump in between our two descents, or it’d just be good ol’ single descent and no hypothesis. That bump is arguably the most interesting part.

When we increase the size of a model to match the parameters with the samples of data, the model starts fitting exactly to any noise within the data.

Smaller models need to ignore much of the fluctuation in the data, whereas larger models can abstract away much of it. Only in that middle spot, where models benefit neither from the inherent regularization of classical statistics nor from over-parametrization, do we see the error increase.
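And here’s a hedged sketch of the full curve, in a random-features setup similar in spirit to the one Belkin et al. study (NumPy only; every size and noise level is an illustrative choice of mine, not taken from the papers): fit random ReLU features with the minimum-norm least-squares solution and sweep the feature count past the number of training samples.

```python
# Sketch of double descent with random ReLU features and a minimum-norm
# least-squares fit (np.linalg.pinv). Test error typically peaks near the
# interpolation threshold (features ~= training samples) and falls again
# in the heavily over-parametrized regime. All sizes are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test, d = 100, 2000, 5
w_true = rng.normal(size=d)

def sample(n):
    X = rng.normal(size=(n, d))
    y = np.sin(X @ w_true) + 0.1 * rng.normal(size=n)   # target + label noise
    return X, y

def relu_features(X, W):
    return np.maximum(X @ W, 0.0)                        # random ReLU features

X_train, y_train = sample(n_train)
X_test, y_test = sample(n_test)

for n_features in [20, 50, 90, 100, 110, 200, 500, 2000]:
    W = rng.normal(size=(d, n_features))                 # fresh random projection
    # pinv gives the minimum-norm solution; it interpolates the training set
    # once n_features >= n_train.
    beta = np.linalg.pinv(relu_features(X_train, W)) @ y_train
    test_mse = np.mean((relu_features(X_test, W) @ beta - y_test) ** 2)
    print(f"{n_features:5d} features: test MSE = {test_mse:.3f}")
# The bump around 100 features is the threshold between the two descents.
```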

The Implications

This has a very interesting practical implication. Because there is a deep connection between data complexity and model complexity, model scaling becomes very unintuitive from simple numbers alone.

It can happen that, due to the complexity of the data, we increase the model size and still end up in the classical under-parametrized regime, despite having built a pretty massive model. We then see the error increase, as we’re still on the first descent. In the accompanying blog post, Nakkiran puts it as:

The take-away from our work (and the prior works it builds on) is that neither the classical statisticians’ conventional wisdom that “too large models are worse” nor the modern ML paradigm that “bigger models are always better” always hold.

Furthermore, it follows that more data is also not always better, as increasing the complexity of the data might push us onto the local maximum between the two descents. Finishing out with:

These insights also allow us to generate natural settings in which even the age-old adage of “more data is always better” is violated!

Conclusion

Personally, I find this very insightful for some model debugging, where counter-intuitive behaviours of deep neural networks can be explained by their erratic behaviour right at this threshold between under- and over-parametrization.

The work on Double Descent was originally published in 2019 by Belkin et al. and then expanded into Deep Double Descent by Nakkiran et al. at OpenAI. It was published on the OpenAI blog, and Nakkiran goes into more depth on their blog. They’re great reads, check them out!

This article first appeared on the blog.

Data Stories

There’s something magical about old-timey scientific illustrations.

The amount of work and precision that goes into creating these illustrations and then making steel plates for them to be printable is mind-boggling.

The English translation of the encyclopedia consists of 18 topics and is worth a browse. The creator of the website, Nicholas Rougeux, has gone to great lengths to remove age marks and blemishes from the illustrations and created a beautiful interactive website that preserves a relic of its time: a piece of interactive science history.

Here you can read about the 1500 hours of labour that went into restoring the original scans of the plates to this glory.

This is a part of the plate from the Mathematics & Astronomy section:

iconographic-encyclopedia.jpg

Source: Iconographic Encyclopædia of Science, Literature, and Art

Question of the Week

  • What is your favourite machine learning algorithm and can you explain it in under 1 minute?

Post your answers on Twitter and tag me. I’d love to see what you come up with. Then I can include them in the next issue!

Tidbits from the Web

  • This read was fascinating: “I regret my $46k Website Redesign”.
  • I’ve been doing this for a while, but I liked this perspective on native content on social media.
  • Hire better with structured interviewing (to remove bias and all that).

You just read issue #89 of Late To The Party. You can also browse the full archives of this newsletter.

Find Late To The Party elsewhere: GitHub Twitter YouTube Linkedin Mastodon Instagram