July 22, 2022, 3:02 p.m.

👻 At a ghost’s birthday party, it’s easy to be the life of the party

Late To The Party

It’s my birthday today! 🎂 So, before I’m off to celebrate, let’s look at some awesome machine learning!

The Latest Fashion

  • Last week I shared the Dall-E 2 prompt book. Now you can sign up to the open Dall-E 2 beta.
  • Most practitioners know that tabular data is xgboost territory. This new paper by Gael Varoquaux et al. backs this up.
  • Meta, of all places, launches a tool that Wikipedia is using for fact-checking. It’s so weird that I lapped up the article about it.

Got this from a friend? Subscribe here!

My Current Obsession

It’s my birthday, so I have been planning nice stuff to do on the weekend. Finally, some time to explore the Rhein area, which has one of the most beautiful train routes in Germany.

I have been struggling with burnout, other health stuff and the scorching heatwave this week. So I hope going outside and neither thinking about work nor content creation will be good for me. We’ll see.

Pythondeadlin.es has been making the rounds too: it was featured in NotANumber and shared around on Twitter. It feels nice to have contributed a tool that seems to find such wide appeal.

I also had a small tweet about a neat pandas feature go viral. Should I expand on this?

Don’t use axis=0 & axis=1, use axis='columns' & axis='rows'.

— Jesper Dr.amsch (@JesperDramsch) July 20, 2022
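Since a few people asked what that actually looks like, here’s a minimal sketch. The DataFrame and column names are made up purely for illustration, but the string aliases are what the tweet is about, and they make the intent much easier to read than bare 0s and 1s:

```python
import pandas as pd

# A tiny made-up DataFrame, just for illustration
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# These pairs do the same thing, but the string version documents intent:
df.sum(axis=0)            # sums down each column
df.sum(axis="rows")       # same result, far more readable

df.drop("b", axis=1)          # drops the column "b"
df.drop("b", axis="columns")  # same result, no guessing what the 1 means
```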

Thing I Like

I haven’t really been doing much this week, but TikTok was pretty great at taking my mind off things. So yeah, follow me there?

Hot off the Press

After making YouTube Partner, I made a video with 100 machine learning tips and tricks to celebrate. I also wrote a blog post with a lot of extra information, links, and code snippets to go with it.

Machine Learning Insights

Last week I asked, “What is the advantage of mini-batch learning?” and here’s the answer:

Mini-batch learning is usually mentioned in reference to neural networks and gradient descent. Back in the day, gradient descent referred to the numerical optimisation of a problem (like a neural network) or any other objective function to find a (hopefully) global minimum and, therefore, the best solution. Gradient descent would take in every data point, calculate the error, and then calculate the gradient from that. Following that gradient reduces the error and therefore brings us closer to the best solution. Very vague, I know.

Why is this so vague?

Gradient descent can be used on a ton of different problems, not just neural networks. It’s very popular in physics because it turns out all our differential equations are exactly that: differentiable. Great for gradients. So we have our model, our objective function, and our observations. We throw gradient descent at it, update the model, throw GD at it again, optimise, and so on until we can’t find a better solution. We can really use gradient descent for a ton of numerical optimisation problems, hence the vagueness. It’s quite universal.

There are a bunch of problems with gradient descent. One of them is that the more data points we have, the longer each step takes, since we have to calculate the error on every point before we can update. On small data, we get a single optimisation step for every 100 calculations (assuming we have 100 data points), which is fair enough. On large datasets with orders of magnitude more data points, that means we only get an optimisation step every 1,000,000 calculations. That’s really slow and expensive.
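To make that concrete, here’s a minimal sketch of classic full-batch gradient descent on a one-parameter linear model. The data, learning rate, and number of passes are all made up for illustration; the point is that every single data point is touched before we take one step:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: y = 3 * x plus a little noise
X = rng.normal(size=100)
y = 3 * X + rng.normal(scale=0.1, size=100)

w = 0.0              # the single weight we want to learn
learning_rate = 0.1

for epoch in range(50):
    # Full-batch: the gradient uses *all* 100 points for a single update
    error = w * X - y                 # residuals on the whole dataset
    grad = 2 * np.mean(error * X)     # gradient of the mean squared error
    w -= learning_rate * grad         # one optimisation step per full pass

print(w)  # ends up close to 3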

Why not take a step after every calculation?

Well, … welcome to stochastic gradient descent! When we update our model after every single data point, we get around this slowdown on larger datasets. It took us a while to figure this out, because most real-world applications only recently have “big data”. Calculating the gradients, propagating them, and adjusting the model used to be the expensive part, and there was no drawback to doing it once per full pass over the data. There’s also a bit of habit and theory behind this. After all, if we don’t optimise on the full dataset, how do we have guarantees that we find the global optimum? We’re not looking at all the data at once. How could we find that best fit?!
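A minimal sketch of the stochastic flavour, with the same sort of made-up toy data: the model now gets nudged after every single data point, so we take 100 small steps per pass instead of one big one.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 3 * X + rng.normal(scale=0.1, size=100)

w = 0.0
for epoch in range(5):
    for xi, yi in zip(X, y):
        # Stochastic: the gradient comes from one (noisy) data point at a time
        grad = 2 * (w * xi - yi) * xi
        w -= 0.01 * grad   # smaller step, because single-point gradients jump around

print(w)  # close to 3, but the path there is a lot more jittery
```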

It also comes from our intuitive understanding of numerical optimisation techniques. These techniques are usually taught from a linear algebra perspective, where we start with matrix inversion: given data A and observations b, find x such that Ax = b. Then we simply invert A and have a perfect solution, and most courses build from there. That perspective is challenged when stochastic gradient descent takes each row of A individually to estimate x.
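For contrast, here’s the textbook direct-solve view that paragraph alludes to, written the conventional way as Ax = b (a tiny made-up system, purely for illustration):

```python
import numpy as np

# Tiny made-up system A x = b
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

# The "perfect solution" in one shot; conceptually an inversion,
# though np.linalg.solve is the numerically sensible way to do it
x = np.linalg.solve(A, b)
print(x)  # [0.8 1.4]
```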

The problem with SGD

Stochastic gradient descent sounds like the perfect solution until we have a single outlier in our dataset. That outlier creates a huge error and thus a huge gradient, and kicks our careful optimisation out of any reasonable neighbourhood of the global minimum. Some data points can also contradict each other, and we’re suddenly on an optimisation seesaw. It’s messy. This problem is also called “noisy gradients”, and especially in real-world heterogeneous data it comes back to bite us in SGD.

Enter mini-batches, which take a handful of data points each. We accumulate the gradients of these data points and then take the optimisation step on their average. That’s what we usually do today. These mini-batches give us more frequent gradient updates than classic GD and more stable gradients than pure SGD.
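And the mini-batch version, again with made-up data and a made-up batch size of 10: we average the gradients over each small batch before stepping, which is exactly the middle ground described above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 3 * X + rng.normal(scale=0.1, size=100)

w = 0.0
batch_size = 10

for epoch in range(20):
    order = rng.permutation(len(X))      # shuffle so batches mix the data
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        error = w * X[idx] - y[idx]
        grad = 2 * np.mean(error * X[idx])   # average gradient over the mini-batch
        w -= 0.05 * grad                     # one step per batch: 10 updates per pass

print(w)  # close to 3, with smoother steps than pure SGD
```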

In fact, it’s so successful that even classic physical inversion often uses mini-batch learning for more frequent optimisation steps, and the SGD optimiser in neural network libraries is actually mini-batch gradient descent (often with further modifications like momentum, which makes it more like a conjugate gradient method, but that’s a story for another time).

Data Stories

I have to be honest here. I’m not a space person. I’m not particularly fascinated by landing on the moon or the stars.

But I did get fascinated by the new James Webb Space Telescope. Hank Green sharing the first image, Kirsten Banks talking about the space origami: it’s a lovely science world on TikTok. It was neat. The technical solutions they came up with? Awesome.

But it took Webb compare for me to realize the step change between Hubble and the JWST. Look at how much more detailed and in-depth these images are!

Wonder why all the skyientists are excited about the new space telescope?

Resolution!

Look at the clarity compared to the Hubble telescope!

See for yourself with “Webb compare”. pic.twitter.com/NEaMYmh9Yj

— Jesper Dr.amsch (@JesperDramsch) July 19, 2022

Question of the Week

  • What is the Double Descent phenomenon?

Post your answers on Twitter and tag me. I’d love to see what you come up with. Then I can include them in the next issue!

Tidbits from the Web

  • I loved this interactive display of how mechanical watches work.
  • Fonts can be fascinating and you can make your own with Universal Sans.
  • This bouncing manhole cover is funny and terrifying.

You just read issue #88 of Late To The Party. You can also browse the full archives of this newsletter.

Find Late To The Party elsewhere: GitHub Twitter YouTube LinkedIn Mastodon Instagram