Hey folks 🎉
I’m currently unpacking boxes, but let’s unpack some great machine learning. I have also decided to answer last week’s Interview Question in this issue. Let me know what you think!
Last year, OpenAI delighted everyone with an avocado chair generated by DALL-E, but of course the model is huge. In comes the smaller minDALL-E for your experimenting pleasure!
It’s becoming ever more apparent that successful ML applications are contingent on clean, labelled data. Doubtlab provides several methods to flag noisy labels for inspection.
Data scientists always wonder why someone would record written-out numbers. The numerizer package can convert natural-language numbers to ints and floats.
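To give you a feel for what such a conversion involves, here’s a toy, hand-rolled sketch of the idea in plain Python — note this is my own illustration, not numerizer’s actual implementation or API:

```python
# Word-value lookup tables for a toy written-number parser.
UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * (i + 2) for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split())}

def words_to_int(text: str) -> int:
    """Parse a written-out English number like 'forty-two' into an int."""
    total, current = 0, 0
    for word in text.lower().replace("-", " ").split():
        if word == "and":
            continue  # "one hundred and five"
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current *= 100
        elif word == "thousand":
            total += current * 1000
            current = 0
    return total + current

print(words_to_int("forty-two"))                # 42
print(words_to_int("two thousand twenty one"))  # 2021
```

The real package handles far more (floats, ordinals, mixed text); as far as I can tell, you’d use it via `from numerizer import numerize`.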
I’m still writing 30 Twitter threads about machine learning in 30 days. It’s been a lot of fun, and I’m still going. If you’d like me to write about a topic, let me know!
Apart from that, I had just finished unpacking boxes when my things, which had been stored in Denmark for two years, arrived here in Germany. It’s truly a blast from the past to find all those things from pre-Covid PhD times. I got my board games back! So exciting.
Finally, I visited the ECMWF office for the first time. It was wonderful to finally meet my colleagues in person.
It’s the small things in life. Going through my old boxes, I found my old fidget cube. I missed that thing regularly in Edinburgh, and it’s in front of me right now, helping me focus on writing to you!
Anthony Ongaro has a nice video testing fidget toys, in case you’re in the market.
These blog posts started as Twitter threads. If you want to check those out, click the little birdy 🐦!
In last week’s issue, I asked, “How do you make a machine learning model robust to outliers?” 🐦
A robust machine learning model is one that is less likely to be affected by outliers.
Outliers are an important consideration in machine learning. An outlier is any data point that sits unusually far from the rest of the data. Outliers are not necessarily bad, but they can be problematic.
You may want to remove an outlier to avoid biasing your model. However, outliers can actually be interesting to study in their own right, and in some cases, the outlier may be critical to the success of your machine learning model.
If you have a difficult dataset, you can pick a model that is naturally robust to outliers. For example, a few extreme outliers may be enough to make a Logistic Regression break down. In that case, you can go back to the drawing board and try a completely different model, like a Random Forest, whose tree-based splits are far less sensitive to extreme values.
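Here’s a small sketch of that contrast using a regression analogue (synthetic data and model choices are my own, just for illustration): one extreme outlier drags an ordinary linear fit far from the true slope, while a Random Forest’s predictions on the clean part of the data stay sensible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 3 * X.ravel() + rng.normal(0, 0.5, 50)  # true slope is 3
y[-1] = 500.0  # one extreme outlier

lin = LinearRegression().fit(X, y)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# The single outlier pulls the linear fit's slope far above 3 ...
print(lin.coef_[0])
# ... while the forest still predicts roughly 15 at x = 5.
print(forest.predict(np.array([[5.0]]))[0])
```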
When using scikit-learn to preprocess the data, the RobustScaler is an outlier-resistant alternative for scaling. The StandardScaler subtracts the mean and scales to unit variance, both of which can be heavily skewed by outliers. The RobustScaler instead subtracts the median and divides each feature by the distance between the 1st and 3rd quartiles, commonly called the inter-quartile range (IQR).
Finally, some metrics are more appropriate for dealing with outliers than others. While the mean squared error explodes for outliers, since errors are squared, the mean absolute error handles them linearly. Alternatively, my all-time favourite, the Huber loss, might work on your problem: it is quadratic for small errors and linear for large ones.
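The Huber loss is simple enough to write out — here’s a minimal NumPy sketch (the threshold name `delta` follows the usual convention):

```python
import numpy as np

def huber(residuals, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond.

    For large residuals it grows like the absolute error, so a single
    outlier cannot dominate the total loss the way it does under MSE.
    """
    r = np.abs(residuals)
    return np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta))

# Small residual: behaves like the squared error.
print(huber(np.array([0.5])))    # [0.125]
# Huge residual: grows linearly -- 99.5 instead of the 5000 that
# a squared error (0.5 * 100**2) would give.
print(huber(np.array([100.0])))  # [99.5]
```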
What is a confusion matrix, and what are the benefits of using one?
Post them on Twitter and tag me. I’d love to see what you come up with. Then I can include them in the next issue!