The data newsletter by @puntofisso.
This tweet from Chris Barnes, one of my favourite data gurus, was pretty spot on. He illustrates the Chief Data Office at his organisation, Highways England, as having to answer 4 questions:
This is pretty much a data strategy – and a good plan for data use – condensed into four lines, which I like a lot.
Doing good things with data need not be complicated. In particular, Information Governance need not be complicated. I once had an infuriating IG experience. My team needed access to a research database which did not contain any personal data. Access was regulated via username and password. The team had to go through a full 80-page-long Data Protection Impact Assessment because “username and passwords are personal data”. Which is… kinda correct, but also kinda wrong.
Things can be less complicated, can’t they?
A few months back, I created a few maps using OpenStreetMap data (shameless plug: there are a handful still available and you can get 20% off using the discount code NEWSLETTER20). Interestingly, this has got me thinking about derived data, licensing, and credits. Using data from OpenStreetMap is relatively unrestricted, and there are some simple credit rules that I thought I had followed. However, after a few people asked me about the way I had credited the work to OpenStreetMap, I decided to verify with the OSM Legal Team, just to be sure. And this became an interesting opportunity for reflection on open data projects, because it turned out to be really hard to speak with anyone on the legal team (an issue that was featured on the OSM weekly). After repeated attempts, I ended up having a conversation with a few members of the board.
Now, this might seem like a minor and uninteresting story to you, but I’m reflecting on what happens when you build a large and successful open database like OpenStreetMap. Should projects like OSM always have a legal team? On the one hand, I appreciate the burden this creates, organisationally and financially; on the other, wouldn’t it be a good way to both engage with the community (to give the peace of mind to people who want to use the data in good faith) and to protect the project from misuse?
In last week’s issue, we featured a link to the New York Times headline A/B testing analysis. Jeremy Singer-Vine (whose newsletter Data Is Plural I’ll never cease to recommend) pointed out that the data that powers the analysis comes from the NYT Tracker, a website that queries the New York Times’ API in order to build data about how articles end up on the front pages. Interesting concept.
Don’t forget that things can only get better but that perfect is the enemy of good.
Till next week,
New Data Expose Precisely How White and Male Some U.S. Companies Are
“As part of an initiative to track the corporate response to the Black Lives Matter movement, Bloomberg has obtained detailed breakdowns of U.S. employee counts by race and gender across job categories for 37 out of 100 of the nation’s biggest corporations, up from 25 last fall. An additional 30 companies pledged to provide the information in the near future.”
This is a follow-up to a study previously published in October 2020.
(via Soph’s Fair Warning)
Life expectancy in adulthood is falling for those without a BA degree, but as educational gaps have widened, racial gaps have narrowed
An academic paper looking at correlations between education and life expectancy. We learn that “For those with and without a BA, racial divides narrowed by 70% between 1990 and 2018, while educational divides more than doubled for both Black and White people.”
Explaining the Pandemic: 2020 Data and Visual Journalism Projects on COVID-19
Hassel Fallas has collected for GIJN her favourite COVID-related data/visual articles of 2020. She says: “The selected projects demonstrate that journalism should be useful, and that when it is done to explain something that affects people, the user’s response is to trust and recognize the value of such content.”
Who’s Next in Your State’s Vaccine Line?
The New York Times explores how closely US states are following CDC guidelines.
The Impact of COVID-19 on Black Communities
“D4BL has worked to consolidate state level data to explore the disproportionate impact of COVID-19 on Black people in the US.”
Interesting piece of research and resulting dataset by the Data4BlackLives movement.
The violin plot
How to use the violin plot, which is “great if you want to look at a set of data values for a category and analyse the highest, lowest and most probable value.”
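For the curious, here is a minimal sketch of the three numbers a violin plot lets you read off: the lowest value, the highest value, and the most probable one (the widest point of the violin). It is not taken from the linked article. The sample values, the bandwidth, and the helper names `gaussian_kde` and `violin_summary` are all made up for illustration, and the density is a plain Gaussian kernel estimate written with the standard library only.

```python
import math


def gaussian_kde(values, bandwidth):
    """Return a function that estimates the density at x (Gaussian kernel)."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density


def violin_summary(values, bandwidth=1.0, steps=200):
    """Lowest, highest and most probable value, as a violin plot would show them."""
    lowest, highest = min(values), max(values)
    density = gaussian_kde(values, bandwidth)
    # Evaluate the density on an evenly spaced grid between the extremes
    # and keep the grid point where the estimated density is largest.
    grid = [lowest + (highest - lowest) * i / steps for i in range(steps + 1)]
    most_probable = max(grid, key=density)
    return {"lowest": lowest, "highest": highest, "most_probable": most_probable}


# Hypothetical sample: most values cluster around 5, with one high outlier.
sample = [4.0, 4.5, 5.0, 5.0, 5.5, 6.0, 12.0]
summary = violin_summary(sample)
print(summary)
```

The outlier stretches the violin up to 12, but the widest part stays near 5, which is the kind of shape a box plot alone would hide.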
How we built our covid-19 risk estimator
The Economist presents this “foray into diagnostic codes, sample reweighting and gradient-boosted trees”.
And if you want to geek out on how they do things, they also have this fantastic video interview featuring Data Journalism editor Alex Selby-Boothroyd; you can also download their source code.
Keep an eye on this framework to create web charts by using simple HTML and CSS. It’s still under development, but very promising.
The Physical Life of Data
“Transforming data into physical experiences is one of the directions we are exploring at the Center for Design and the College of Arts Media and Design to tackle this inherent abstraction of data and visualization.”
(via Massimo Conte)
How and Why We Sketch When Visualizing Data
“If you’re like us, at some point in your early education you decided you couldn’t draw. Your doodles, like ours, didn’t look like you wanted them to. For many, this disappointment can persist into adult life. As researchers into how people learn data visualization, we’re here to tell you that it’s OK — stick figures are fine! You can learn to sketch your data stories; in fact, you’ll see that research tells us that sketching is critical for working in teams and for breaking through visualizers’ block.”
Here’s the Nightingale giving me hope!
AI Progress Measurement
This is pretty cool. The Electronic Frontier Foundation has created this page that “collects problems and metrics/datasets from the AI research literature, and tracks progress on them.”
It does so by using a Jupyter notebook that gets populated with users’ submissions. There are problems in several spaces, including game playing, vision, written and spoken language, and more.
How to break a model in 20 days. A tutorial on production model analytics.
Using the bike sharing demand dataset, the folks at Evidently.AI use their own tool in a standard data science pipeline to illustrate the problem of model decay.
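If model decay is a new idea to you, here is a toy sketch of it. It uses neither the Evidently.AI tool nor the real bike sharing dataset: the weekly numbers are synthetic, the drift rate is invented, and the helper names (`fit_line`, `mean_abs_error`, `synthetic_week`) are hypothetical. A straight line is fitted on the first week only, then scored on later weeks in which the relationship between temperature and rentals slowly changes.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(model, xs, ys):
    """Average absolute gap between the model's predictions and the data."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)


def synthetic_week(week):
    """Made-up data: rentals depend on temperature, and the effect drifts weekly."""
    temperatures = [10, 12, 14, 16, 18, 20, 22]
    effect = 5.0 + 0.5 * week  # assumed drift, purely illustrative
    rentals = [100 + effect * t for t in temperatures]
    return temperatures, rentals


# Train once on week 0, then keep using the same model on later weeks.
model = fit_line(*synthetic_week(0))
errors = [mean_abs_error(model, *synthetic_week(week)) for week in range(4)]
print(errors)
```

The error is essentially zero on the training week and grows every week after that, which is the pattern production monitoring is meant to catch before it hurts.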
Nothing Breaks Like A.I. Heart
“An essay about artificial intelligence, emotional intelligence, and finding an ending”, by The Pudding.
Who owns the Nile?
“The Nile River is a lifeline for Ethiopia, Egypt, and Sudan, but also an important source of water, food, and transport for the other countries where the river flows. Finding an equitable way to distribute it remains a challenge these countries will have to overcome.”
Deep dive into Nile politics in this nicely illustrated article by Edurne Morillo at Datawrapper.
Two states tax some drivers by the mile. Many more want to give it a try
“The approach is more complex than taxing gasoline usage and faces opposition from environmentalists who say it favors gas-guzzling SUVs and trucks”.
This brings back some discussions I had, when I was at the Department for Transport, with the two sides of the environmental fence about whether grants to bus operators (or BSOGs) should be paid by fuel use – as they are now – or by mileage.
Become a GitHub Sponsor. It costs about the price of a coffee per month, and you’ll get an Open Data Rottweiler sticker (and other stuff).
If you’re a supporter of this newsletter, thanks a lot for your support. Share this e-mail with a friend, or via social media.
quantum of sollazzo is supported by my GitHub Sponsors, and by ProofRed, who offer an excellent proofreading service. If you need high-quality copy editing or proofreading, head to http://proofred.co.uk. Oh, they also make really good explainer videos.