In the last issue I summarised the main discussion at the Executives at PyData session I co-ran at our PyDataLondon 2022 conference. I’m extending these topics into two free Zoom sessions: the first is this Monday (details below) on team structure; the next, in a month, covers topics including how to derisk and build a backlog. You’re welcome to sign up and I can send you a GCal+Zoom invite. Below I give some notes on Team Structure and link to some Python 3.11 benchmarks showing off the new speed-up work that we’ll enjoy in a few months.
At the conference there was a lot of continued discussion on Team Structure and some of the original participants will be joining for the Monday call. I’ll share a link to that recorded call afterwards. If you have questions you’ll want to attend; it is free and may help you unlock blockers for your team.
Looking back at my notes from the conference John Sandall (thanks John!) shared this short story on how not to grow a data science team which captures many of the situations I’ve found myself in during strategy sessions. There’s a bullet point list near the top and the last item - “no data leadership” - I’d argue is the root cause. You can’t confidently invest in fixing all the other problems if leadership don’t have good experience of what it means to be a data-driven company (i.e. - get the data right, get the questions right, build the simplest things that make change, do the right testing, then add complexity incrementally).
The author gives a lot of examples (some of which are painfully recognisable) noting “These are primarily organisational challenges. Teams don’t know how to work with the data team. … What I think makes most sense to push for is a centralisation of the reporting structure, but keeping the work management decentralised.” By decentralising the “data work” you get domain specialisation for the data scientists, so they can help ask better questions - I’ve felt for years that this domain knowledge acquisition is key yet underappreciated.
A couple of passages in the article describe data being used to find bugs which improve conversions when fixed, but don’t note that some kind of metric needs to be coupled to this. It is so easy for the DS team to miss out on the credit if the engineers fix a bug that recovers $1M in previously lost revenue; the DS team can “claim and share” that win if they’ve done the work, estimated what’s being missed with a dollar/time value attached, then helped define what a fix will look like. Are you helping your team members get the credit that they deserve for their discoveries?
The story ends with “You made it. You have transformed the organization to be truly data-native.” - which takes a year in their timeline. I don’t think this is a pessimistic guess - if anything it may be too optimistic for large corporates. If you’ve not gone through this journey before, or you’re on it now…definitely read the article (get a coffee and take your time). There’s a lot of insight to be gained and useful diagrams you might want to adapt.
Raphael (thanks!) also linked to this great O’Reilly free publication (57 pages) on the Care and Feeding of Data Scientists which covers finding and hiring, whether to work agile, and career ladders. It is a meaty read and worthy of an hour. There’s a quote on the “t-shaped” data scientist on page 36:
We feel that a defining feature of data scientists is the breadth of their skills—their ability to single-handedly do at least prototype-level versions of all the steps needed to derive new insights or build data products (Mason & Wiggins, 2010). We also feel that the most successful data scientists are those with substantial, deep expertise in at least one aspect of data science, be it statistics, big data, or business communication.
The top bar of the T is critical - if your colleague can’t actually query the DB with SQL, or doesn’t know how to generate a useful output on the back of an analysis, or has trouble writing a clear agenda for a meeting, they may be more of a cost than a boon to a team. Make sure everyone has enough breadth to be useful and not blocked by others, while each has a specialisation that’s genuinely useful to your work.
Do you have a different view? Or a good resource to share? Drop me a message by replying to this - I’d be happy to hear your thoughts.
If you find this useful please share this issue or archive on Twitter, LinkedIn and your communities - the more folk I get here sharing interesting tips, the more I can share back to you.
I’m going to run two virtual events focused on helping your data science team get to more success. Please reply if you’d like a calendar invite to either of these:
I’ll write up notes from Monday’s session for the next newsletter plus I’ll share that recording.
Please forward this email to a colleague if they’d benefit from attending this session; I’m happy to invite them along. Note that these sessions are for leaders in data science teams, not recruiters or solo consultants.
I’m pleased to say I’ve fixed the dates for my next online (Zoom) courses. I’ve been running these remotely with great success during the pandemic and I’m going to continue the virtual format. Early bird tickets are available and limited for each course:
I’m happy to answer questions about the above, just reply here. If you want a notification for future dates please fill in this form.
For those of you who have been waiting a while for me to get these listed - apologies, being father to an infant has eaten a lot of time this year and I’ve had to pace things sensibly before scheduling new courses.
The Python 3.11 speed-up process continues: some benchmarks have been released to show the relative speed-up between Python 3.10 and the new 3.11. These are by my friend Mark Shannon, the lead dev of this work - thanks Mark! If you scan those benchmarks you’ll see that at best they’re seeing a 2x speed-up, many are 1.5x, and right at the bottom a couple are currently a bit slower than they used to be. The median appears to be 1.2x at present. You wouldn’t expect your Pandas, sklearn or PyTorch to go faster (all the work is done outside of Python) but anything written in pure Python should see some level of speed-up. This is the first big step-up in performance we’ve had for years so that’s rather exciting.
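If you want to gauge the likely benefit for your own pure-Python hot spots, a quick `timeit` micro-benchmark run under both interpreters is a reasonable sanity check. Here’s a minimal sketch - the recursive `fib` is just a hypothetical stand-in for your own CPU-bound code; run the same script under 3.10 and 3.11 and compare the timings:

```python
import sys
import timeit

def fib(n):
    # Deliberately naive pure-Python recursion: the kind of
    # interpreter-bound work that the 3.11 speed-ups target.
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

# Time 10 repetitions so interpreter start-up noise is amortised.
elapsed = timeit.timeit("fib(25)", globals=globals(), number=10)
print(f"Python {sys.version_info.major}.{sys.version_info.minor}: "
      f"{elapsed:.3f}s for 10 runs of fib(25)")
```

Remember that NumPy/Pandas-heavy code spends most of its time in C, so a micro-benchmark like this only tells you about the pure-Python fraction of your workload.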
Along with details of the new speed-ups there’s a list of other improvements, including further-improved tracebacks for faster debugging, in the whatsnew notes.
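As a quick illustration of why the finer-grained tracebacks help: on 3.11 an uncaught error in a nested expression is reported with the exact failing subexpression underlined, not just the line number. This hypothetical nested-dict lookup shows the kind of ambiguity that used to require a debugging session - three subscripts on one line, only one of which fails:

```python
# Hypothetical data shape: which of the three lookups failed?
data = {"user": {"profile": None}}

try:
    # On 3.11 an uncaught version of this error underlines the
    # final ["name"] subscript, showing the None came from "profile".
    name = data["user"]["profile"]["name"]
except TypeError as exc:
    err = str(exc)
    print(err)  # 'NoneType' object is not subscriptable
```

On 3.10 the traceback only names the line, so you’d have to inspect each subscript yourself to find which value was `None`.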
The release schedule for 3.11 says October, with August for the first release candidate. If you’re using a lot of pure Python and you want some easy speed-ups that should come with no code changes, you may want to think about testing 3.11. If you develop code and find yourself frustrated by tracebacks, I’ve found that 3.10 gave me a boost in productivity and I believe that 3.11 will give a similar boost again.
As you’ve probably noticed I’m pretty reticent to include political mentions in my newsletter. Today however I wish to celebrate the end of Boris Johnson’s reign. Whilst other countries have had similarly shambolic political leadership, I was never enthused that we could play “yeah, but we can top that because our leader just did this … ” games. Leadership is by example and he’s set a very poor example for 70M people. That’ll be my only note on the subject, closing with this lovely little Cassetteboy video mix (this one’s safe for work). Cassetteboy’s got a lot of brilliant, if angry, videos, not all safe for work, but I’d recommend an exploration if you like that kind of thing.
See recent issues of this newsletter for a dive back in time. Subscribe via the NotANumber site.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I’m also on twitter, LinkedIn and GitHub.
Jobs are provided by readers, if you’re growing your team then reply to this and we can add a relevant job here. This list has 1,400+ subscribers. Your first job listing is free and it’ll go to all 1,400 subscribers 3 times over 6 weeks, subsequent posts are charged.
As a Senior Data Scientist at J2-Reliance, you will be responsible for developing Data Science solutions and Machine Learning models tailored to the needs of J2-Reliance’s clients. You will typically act as a Full-Stack Project Owner: you will be the main point of contact for a well-delimited client problem, and you will be responsible for the conception and implementation of the end-to-end solution to solve it. You will be supported by a program manager (your direct supervisor) and a Data Engineer helping with industrialisation. The specific nature and level of their involvement will depend on your areas of expertise and the specificities of the project.
The Royal Botanic Gardens, Kew (RBG Kew) is a leading plant science institute, UNESCO World Heritage Site, and major visitor attraction. Our mission is to understand and protect plants and fungi for the well-being of people and the future of all life on Earth.
Kew’s new Plants for Health initiative aims to build an enhanced resource for data about plants used in food supplements, allergens, cosmetics, and medicines to support novel research and the correct use and regulation of these plants.
We are looking for a Data Scientist with experience in developing data mining tools to support this. The successful candidate’s responsibilities will include developing semi-autonomous tools to mine published literature for key medicinal plant data that can be used by other members of the team and collaborators at partner institutes.
IndexLab is a new research and intelligence company specialising in measuring the use of AI and other emerging technologies. We’re setting out to build the world’s first index to publicly rank the largest companies in the world on their AI maturity, using advanced data gathering techniques across a wide range of unstructured data sources. We’re looking for an experienced Data Engineer to join our team to help set up our data infrastructure, put data gathering models into production and build ETL processes. As we’re a small team, this role comes with the benefit of being able to work on the full spectrum of data engineering tasks, right through to the web back-end if that’s what interests you! This is an exciting opportunity to join an early stage startup and help shape our tech stack.
We are looking for Staff, Senior Staff & Principal ML Engineers to design and build algorithmic and machine learning systems that power Deliveroo. Our MLEs work in cross-functional teams alongside engineers, data scientists and product managers, who develop systems that make automated decisions at a massive scale.
We have many problems available to solve across the company, including optimising our delivery network, optimising consumer and rider fees, building recommender systems and search and ranking algos, detecting fraud and abuse, time-series forecasting, building a ML platform, and more.
The Regulatory Genome Project (RGP), part of the Cambridge Centre for Alternative Finance, was set up in 2020 to promote innovation by unlocking information hidden on regulators’ websites and in PDFs. We’re a commercial spin-out from The University of Cambridge’s Judge Business School and our proposition is to make the world’s regulatory information machine-readable and thereby enable an active ecosystem of partners, law firms, standard-setting bodies and application providers to address the world’s regulatory challenges.
We’re looking for a data scientist to join our remote-friendly technical team of software engineers, machine learning experts, and data scientists who’ll work closely with skilled regulatory analysts to engineer features and guide the work of a dedicated annotation team. You’ll help develop, train, and evaluate information extraction and classification models against the regulatory taxonomies devised by the RGP as we scale our operations from 100 to over 600 publishers of regulation worldwide.
The Met is looking for an analyst and a lead analyst to join its Strategic Insight Unit (SIU). This is a small, multi-disciplinary team that combines advanced data analytics and social research skills with expertise in, and experience of, operational policing and the strategic landscape.
We’re looking for people able to work with large datasets in R or Python, who care about using empirical methods to answer the most critical public safety questions in London! We’re a small, agile team who work throughout the police service, so if you’re keen to do some really important work in an innovative, evidence-based but disruptive way, we’d love to chat.
An exciting opportunity has arisen for a Principal Population Health Analyst to join the Population Health and Care team at Lewisham and Greenwich Trust (LGT) where the post holder will be instrumental in leading the analytics function and team for Lewisham’s Population Health and Care system.
Lewisham is the only borough in South East London to have a population health management information system (Cerner HealtheIntent) that is capable of driving change, innovation and clinical effectiveness across the borough. The post-holder will therefore work closely with public health consultants, local stakeholders and third-party consultancies to explore epidemiology through the use of HealtheIntent, and design new models of transformative care that will deliver proactive and more sustainable health care services. LGT is therefore seeking an experienced Principal Population Health Analyst who is passionate about transforming and improving the lives and care of patients through data analytics and can draw key, actionable insights from our data. The successful candidate will be an experienced people manager with strong communication skills to lead a team of analysts and manage the provision of data analytics to a diverse range of stakeholders across Lewisham, with a particular focus on population health, bringing together best practice and innovative approaches.