I had a fairly frustrating week, dealing with things breaking and some migraines from the rapid weather changes. But that won’t keep us from some awesome machine learning, now does it? Here we go!
Also check out the new segment on data stories!
I spent most of my week fighting flat tires on my bike. Super frustrating. I ended up replacing both tubes after getting some new tools. Now it’s running smoothly again, but still, what a waste of time.
I’m also getting ready for some podcast recording. That should be very exciting.
Finally, the EuroSciPy call for proposals is open, and I think I’ll prepare a tutorial on making ML reproducible for researchers and scientists. It’s the first time I’ll propose a tutorial, so I hope it goes through. We’ll see.
A tool recommendation! And a recommendation for tool buying from Adam Savage as a bonus!
Tired of searching for the right wrench size in your stack of different wrenches? Get an adjustable wrench, it’s the most useful tool to hold on to things of varying sizes. It’s even more useful on my e-bike that has some non-standard nuts to secure the motor. (Non-standard for bikes, probably quite normal for motorcycles.)
So how do you buy a good set of tools? Adam Savage, host of Mythbusters and awesome maker and creator, has some sage advice. When you start out, get a set of relatively cheap tools. That way you have a set that covers all use-cases without having to spend thousands on quality tools. Over time, as you use these tools, some will break. That is unfortunate of course, but those tools broke because you used them. Now it’s time to replace those specific tools with quality ones. That way you avoid a huge upfront cost, have a range of tools for every eventuality, and end up (over time) with quality tools for the tasks you actually do repeatedly.
Last week I asked you about good ways to handle missing data, and here’s the gist of it:
Missing data is a common problem in data science; almost every real-world dataset has missing values. The reasons are numerous: capturing all the data is expensive, and people may not want to share certain information.
One way to think about missing data is that there are two classes: “missing at random” and “missing not at random”.
Missing at random basically means that the missing values in a feature are uncorrelated with other features or hidden predictors in the data. Essentially, the mechanism behind the missingness has to be completely stochastic.
Missing not at random, however, indicates some kind of system behind missing values. This is very common in real-world data. In medicine, for example, you often only order tests when specific symptoms are present, so missing test results indicate some non-random process.
There are several ways to handle missing data, but the first step is knowing when and how to handle it. You can use the missing values themselves as information, or replace them using one of several methods.
Missingness as a Feature
Many datasets contain missing values that aren’t random, and encoding that fact explicitly gives machine learning algorithms something to exploit. Commonly, an additional binary column indicates whether the value was missing in each sample.
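As a minimal sketch of this indicator-column idea (the feature names here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing entries
df = pd.DataFrame({
    "age": [25, np.nan, 40, np.nan],
    "income": [50, 60, np.nan, 80],
})

# Add a binary indicator column per feature that has missing values
for col in ["age", "income"]:
    df[f"{col}_missing"] = df[col].isna().astype(int)

print(df)
```

The model can now learn from the *fact* that a value was missing, independently of whatever value is later imputed in its place. Scikit-learn offers the same idea as `sklearn.impute.MissingIndicator`.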
Imputation
The process of determining how to fill in the missing data is known as imputation.
Averaging is the easiest and often surprisingly effective form of imputation: simply replace missing values with the average, usually the mean, of the feature. Scikit-learn implements the SimpleImputer class for this. Statistically, however, this is “disastrous”, as it completely distorts the distribution of the data.
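A quick sketch of mean imputation with scikit-learn, on a tiny made-up array:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [7.0, np.nan]])

# Replace each NaN with the mean of its column
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
# Column means: (1 + 7) / 2 = 4.0 and (2 + 4) / 2 = 3.0
```

Note how every formerly missing value in a column gets the *same* replacement, which is exactly why the feature’s distribution collapses toward the mean.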
A more sophisticated approach is conditional imputation, which works iteratively: the algorithm models each feature based on the other features to produce replacements for the missing values. Conditional iterative imputation can work well, but it is computationally very expensive and doesn’t scale well.
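Scikit-learn ships an (experimental) implementation of this idea as IterativeImputer; here’s a small sketch on toy data where the second column is roughly twice the first:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, np.nan],
              [np.nan, 8.0]])

# Each missing entry is predicted from the other features, iteratively
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
```

Because the replacement is conditioned on the other features, the imputed values follow the relationships in the data instead of collapsing to a single column mean.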
Another way to impute missing data is to run a machine learning algorithm over the data yourself. Linear regression, for example, is a good fit for this kind of problem: transform your data frame so that the feature with missing values is the response variable and the other features form the predictor matrix, run the regression, and check whether the fit is any good. If it isn’t, try a different algorithm (e.g. support-vector machine, random forest, KNN).
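A hand-rolled sketch of this regression-based imputation, with hypothetical "age" and "income" features (the variable names and data are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# "income" has missing values; "age" is fully observed
age = np.array([22.0, 30.0, 45.0, 50.0, 28.0, 41.0])
income = np.array([30.0, 45.0, np.nan, 80.0, np.nan, 65.0])

# Fit the regression only on rows where the response is observed
observed = ~np.isnan(income)
model = LinearRegression().fit(age[observed].reshape(-1, 1), income[observed])

# Predict the missing responses from the predictors
income_filled = income.copy()
income_filled[~observed] = model.predict(age[~observed].reshape(-1, 1))
```

Checking the in-sample fit (e.g. `model.score(...)` on the observed rows) tells you whether linear regression is adequate before you trust the imputed values.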
Imputation has its problems, like the aforementioned distortion of distributions and the computational cost. Additionally, imputation, especially sophisticated non-linear imputation, often does not perform as well as classifiers that handle missing data natively.
Algorithms that deal with Missing Values
There are machine learning algorithms that can deal with NaNs in the data directly and use that information if it improves the result. Tree-based methods like XGBoost can natively work with missing values. Classic numerical methods like linear regression and neural networks, on the other hand, cannot propagate NaNs, so encoding an indicator feature is preferred there.
Can’t get enough of missing values? Here’s a neat paper by Gaël Varoquaux. And I teach handling missing values in my Skillshare course.
I’m late to the party, but if you enjoy data visualizations, you should check out the Chartr newsletter, which is the source of today’s visualization.
How do you visualize the flow of resources? Sankey diagrams! (And yes, I think they should be called snakey diagrams.)
Specifically, if we look at Netflix, we can see the clean split of revenue at the data-driven streaming service. There seems to be a healthy balance between the total cost of producing and acquiring content and the gross profit that flows into marketing, tech, and data science divisions. This seems to be in stark contrast to the recent sentiment from shareholders, which might lead to adverse decisions like introducing ads for paying customers. It would be interesting to see how the economics of Netflix change after a decision like that, though.
This chart tells a beautiful story of rather drab economic data and how money flows through one of the original streaming services nowadays. Catch the original story on Chartr.
Post them on Twitter and tag me. I’d love to see what you come up with. Then I can include them in the next issue!