It’s been a rough week over here, but it feels like I’m recovering just in time to share some awesome machine learning with you. Let’s dive in!
Got this from a friend? Subscribe here!
Unfortunately, I got very sick this week. I don’t know if it was the heat or some elevated anxiety, but it made me non-functional for a couple of days, and I’m only just recovering.
I’m currently planning a vacation in Mozambique. If anyone has recommendations, I’d love to hear them. I’m thinking of diving in the South, but I’m definitely open to suggestions!
I haven’t been feeling well this week, so I spent most of it reading in my hammock. During the early pandemic, I splurged on a Kindle Paperwhite after a friend recommended it. I got the one with “adjustable warm light” and without ads, and I’m loving it. It’s such a delight to read on, and I can seamlessly switch between entertainment and informative reads. And with Calibre I can sideload ebooks without sacrificing my soul to Jeff Bezos.
This week I published a small blog post about 3 books for reinforcement learning.
The Software Sustainability Institute asked me to write a blog post for my fellowship. We titled it “Would I even fit in?”, as that is what I wondered the whole time I was applying.
Last week I asked, “What is boosting and can you name examples of its usage?”, and here’s the gist of it:
Boosting is an ensemble learning method that combines weak learners into a strong learner.
A boosting algorithm takes the errors made by the previous learners and trains the next model to correct those mistakes. What a neat concept: it builds on existing predictions to slowly improve the outcome!
That’s already a bunch of jargon, so let’s dissect that.
Ensemble learning is a form of machine learning where multiple models make a prediction together. These ensembles can consist of several models of the same type or a mix of different model types. They usually get the same input features, and the final prediction is averaged across the models.
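To make that concrete, here’s a minimal sketch of such an ensemble using scikit-learn’s VotingClassifier; the toy dataset and the particular models are just illustrative choices, not a recommendation:

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy data, just for illustration.
X, y = make_classification(n_samples=500, random_state=0)

# Three different model types vote on the same input features;
# "soft" voting averages their predicted probabilities.
ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=3)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="soft",
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))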
So what’s a weak and strong learner?
A weak learner is a fancy way to describe a machine learning model that consistently performs poorly. The agreement between the true and predicted classes is low, but still better than a random coin flip, so it is learning something. A classic example is a short decision tree, which rarely shows great predictive power on its own. Essentially, a weak learner is an algorithm or model without great predictive capacity.
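To get a feel for just how weak a weak learner is, here’s a quick sketch comparing a depth-1 decision tree (a decision stump, more on those below) with a fully grown tree; the dataset is made up and the exact numbers will vary:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_informative=10, random_state=0)

# A depth-1 tree is a classic weak learner: better than a coin flip,
# but far from a strong predictor on its own.
stump = DecisionTreeClassifier(max_depth=1)
deep_tree = DecisionTreeClassifier(max_depth=None)

print("stump:    ", cross_val_score(stump, X, y, cv=5).mean())
print("deep tree:", cross_val_score(deep_tree, X, y, cv=5).mean())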
When people talk about Boosting, they love to talk about two algorithms specifically: AdaBoost and Gradient Boosting.
Interestingly, they both rely on Decision Trees, but they take different approaches to how they boost!
AdaBoost splits the data with a Decision Tree of depth 1 (these are also called Decision Stumps, and I think that’s adorable). It then generates a weight matrix that assigns a higher weight to mislabeled points for the next stump. This looks a little bit like:
[a, a, b, a, a, b, b, b]
               ^ split

weight matrix:
[1, 1, 5, 1, 1, 1, 1, 1]

The first stump splits off the run of “b”s on the right, so the third element (a “b” sitting among the “a”s) is misclassified and gets a higher weight. With the new weight matrix, the second split would be closer to the second and third element.
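In practice you rarely implement that reweighting by hand. Here’s a minimal sketch with scikit-learn’s AdaBoostClassifier, whose default base estimator is exactly such a depth-1 stump; the dataset and settings are purely illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each boosting round fits one decision stump on re-weighted data,
# putting more weight on the points the previous stumps got wrong.
ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
ada.fit(X_train, y_train)
print("test accuracy:", ada.score(X_test, y_test))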
Gradient Boosting takes a different approach from AdaBoost. The algorithm trains a first model on the data and then trains additional models on the gradient of the misfit left by the previous models.
There’s a fancy explanation with “approximating gradient descent” and all that, but effectively we’re doing one thing: we take our loss function, for example Mean Squared Error, fit another model on its gradient with respect to the current predictions (for MSE that gradient is just the residual, y_true - y_predicted), and repeat this as much as we want. That looks a bit like this:
We have a base estimator F(X) ≈ y and a bunch of additional estimators H(X) that predict the residuals y - F(X), so the corrected prediction becomes F(X) + H(X).
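Here’s a toy, from-scratch sketch of that idea for regression with a squared-error loss. Each new tree is fit on the residuals the current ensemble leaves behind; all the settings are made up for illustration:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # base estimator F(X): just the mean
trees = []

for _ in range(100):
    residuals = y - prediction            # negative gradient of the MSE loss
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)                # H(X) learns to correct F(X)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("training MSE:", np.mean((y - prediction) ** 2))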
Classic Gradient Boosting is a nice idea, but it is also very slow. Extreme Gradient Boosting (XGBoost) is the logical next step. The core idea is the same as classic gradient boosting, but it implements some clever optimisations so that it can run in parallel.
The main difference is that Gradient Boosting uses gradient descent, whereas XGBoost implements a Newton-Raphson approach that also uses second-order information about the loss. Fancy words again, but essentially this leads to faster convergence.
XGBoost also implements some very neat features like automatic feature selection, clever penalization of trees and, of course, distributed and out-of-core computation, which makes it very scalable.
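If you want to try it, here’s a minimal sketch using the xgboost package’s scikit-learn-style wrapper (assuming xgboost is installed); the parameter values are illustrative, not tuned:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=4,
    reg_lambda=1.0,  # L2 penalty on leaf weights, the "penalization of trees"
    n_jobs=-1,       # use all cores for the parallel split search
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))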
In the 1860s, Florence Nightingale changed data visualization forever.
Nightingale was a nursing administrator in the Crimean War, and she returned with the knowledge that soldiers were dying from far more than enemy bullets. The British army had a hygiene problem.
In an incredible display of data storytelling, she was able to push through reforms that reduced deaths from communicable diseases.
Now this in itself was impressive enough. But the legislation and the precedent set by this work are directly related to the doubling of the average human lifespan over the following century.
Source: Scientific American
Post them on Twitter and tag me. I’d love to see what you come up with, and then I can include them in the next issue!