I noted last issue that I’d dig further into building data science teams with a Zoom call - the notes and YouTube link for that are below. There were 12 of us on the call for an hour, representing a range of larger UK companies with mature data science outfits.
Further below are 6 job roles including Senior and Staff roles in DS and DEng at organisations like Deliveroo and Regulatory Genome.
On August 8th I’ll be running the next open call on Backlog, Derisking and Estimation - see details here and sign-up here or reply to this email and ask for an invite (note - this is open to leaders, but not to recruiters).
Please forward this email to a colleague if they’d benefit from attending this session - I’m happy to invite them along. Note that these sessions are for leaders in data science teams, not recruiters or solo consultants.
Early last week I ran a follow-up to the conference’s Executives at PyData discussion for leaders. This one was via Zoom with a focus on Team Structure. I took a vote on questions and we discussed the top set. We talked on:
The hour’s call is available on YouTube as DS Success - Teams and Growth. Many thanks to Anders, Sophia, Aymeric, Lorna, Nick, Marysia, Kevin, Raphael, Ryan, Daniel and Barnaby for your excellent points and observations. This group represented mature teams in many larger organisations plus a couple of smaller organisations.
One point repeated by a couple of people is needing to build a team that’s focused on “shipping stuff that works” - assuming you’re not a research team, you want team members who are happy to pitch in to a business need and “get the job done”.
This is opposed to focusing on a particular domain (e.g. ML) at the expense of ignoring the client’s overall needs. Anders noted “we don’t do DS projects, we work on value streams for the business that might just involve some data science”.
Martin’s Clean Code - A Handbook of Agile Software Craftsmanship was strongly recommended for teaching good software engineering principles to juniors to get them up to speed.
Investing in the team’s software engineering skills (e.g. sending them on AWS Certified Developer courses) and getting the team excited by all aspects of getting a solution to ship is key for larger teams.
Having the team stay in control of the code from development to shipping was also noted as being important, so you’re responsible (across the team) for the whole lifecycle. This only works if you’ve got a team that’s motivated to see change occur in the business, so you have to hire people who are motivated to make an impact - a point reiterated later across the call.
The big message here, and I heard this at the Executives at PyData discussion at PyDataLondon 2022, is the need to foster a culture of written communication. If things are written they can be consumed asynchronously and comments can build up.
Aymeric notes that LucidChart is great for online collaboration when making diagrams. Dropbox Paper was used by one team for ephemeral team discussion - it was noted that nothing gets “set in stone” but it was very useful for lightweight iteration. Miro gets a shout-out from a couple of teams for remote whiteboarding, and I like it too.
Microsoft’s VSCode with collaboration plugins was used by some. I’ve used it solo and it was very comfy (once I’d configured some vim completions). It feels “heavy weight” for lots of lightweight projects (starting in the right configuration is a pain I’ve found), but if you’ve got one project on the go for a while then it probably fits really well and there’s a huge base of plugins.
One team uses Witsmate (appears to be a web SaaS) for tracking team OKRs and company goals. They noted it helps keep everyone aligned by making clear the “why” for goals.
To keep remote teams engaged various new processes are used. Daniel discussed a Monday “mindset meet” - a 30 minute short chat on something relevant to the team to get everyone focused, which replaces “the watercooler chat”. Whilst it doesn’t fix “being in the same room” it helps everyone get to know each other remotely.
Lorna has “coffee roulette” to get you to meet people you know less well. Slack video was noted as being useful as you can use it to highlight portions of a screen (which other video tools don’t offer) which makes tasks like a code review flow easier.
Barnaby’s team share jokes (the worse - the better) on a Friday stand-up call. I’ve been on “Who Wants to be a Millionaire” weekly calls. Kevin’s team distribute points every week to folk who “did something good”, those points can later be converted to cash or prizes. Their company also encourages frequent anonymous feedback on all aspects of the company strategy, so “things that must be said” do get aired.
I’ve read that some people experiment with walking-and-talking on headsets for meetings - trying a new modality that gets you out of the house. You lose a whiteboard but you gain fresh air and a different outside context, and some teams say that this helps with creativity (and you can catch-up on a whiteboard or write-up notes after).
It was noted (again) that as hallway chats no longer occur, you need a firm culture of written communication.
My observation is that both a centre of excellence and embedded teams work, but in different situations. A centre of excellence is great if you’ve got overlapping problems on related datasets - team members can share knowledge about how they solved Problem1 for Team1 when solving similar-Problem1A for Team2 (and you lose this if embedded members aren’t talking). This is great when you’re replicating the same sort of solutions in different business verticals e.g. in corporates.
Embedded teams critically get to soak up all the necessary business knowledge very quickly. You get to learn how the client works, what they think of the value in the data, where the traps might exist and what they can actually action with a result.
This guidance is critical and often you’ll realise that a sophisticated solution is overkill and what you actually need is something simple, given the issues in the data and the limits of action that can be taken. The downside is that this team’s newly-won knowledge is probably silo’d from other DS members, unless you’re encouraging them to talk frequently.
Sophia noted that data science is just another tool a team can use to solve a problem. It works in a cross-functional team but can also work if dropped in for short-run projects from a centre of excellence.
Anders notes that “data science is just like UX or other skills” and the Guild System is good for knowledge sharing on weekly or monthly cadences - and you need to hire people who are ok with this mindset. Kevin observed that by embedding people and buddying-up, you get to share skills across team members for added value.
One tip was to hire folk with good software engineering skills who wanted to learn more DS as typically they’d use DS as just another skill to solve a client problem, versus data scientists who might baulk at “having to write clean code” when they’d rather focus on the statistics. I’ve come across this a lot (and I’ve been very guilty of this myself in the past). I think there’s a lot of value in getting a set of smart generalists who can ship working solutions, which might just use some DS, before you hire people with more specific skills. Making sure you set that tone when hiring is critical, and being clear with your DS hires that they’ll need to write code that others can read and support will be a close second.
One downside of “start with the engineers” is that perhaps the team doesn’t know what sort of limits might exist - how much signal there is (or isn’t) in a dataset and what the cost of bad data or poor data-collection processes might mean. Having someone who can guide on these subtler but key issues feels important.
Anders noted that during recruitment often folk who came out of a bootcamp were disappointed to learn that in a real organisation they probably wouldn’t be writing models all the time - giving them a real view on what “the work” looked like was critical to find those who’d thrive.
Aymeric notes that a refactoring task is great for filtering skill levels and identifying weaknesses as it forces someone to think through “what’s here, what’s missing, what have I seen in my past that could help”. He noted that hiring for stronger software engineering skills and training up on data science later was a sensible approach, an opinion that Anders and others shared. He balanced this with the observation that you need someone who has relevant deployment experience on the team, otherwise you’ll never ship.
Marysia’s team have ex-data science “analytics translators” who know how to help the business talk about their needs and manage what the team works on, whilst data science leads focus more on solving the immediate problem with the team.
Nobody had a good source of “secret leads” unfortunately.
You’re all welcome to a first post to my jobs list below gratis, just reply to this email if that’s useful - you’ll reach all 1,500+ readers over 3 issues.
If you’d like to join the next call on derisking and building a backlog, see the details further above. I’d certainly encourage you to watch the full video (link above) of the discussion to get further context.
If you find this useful please share this issue or archive on Twitter, LinkedIn and your communities - the more folk I get here sharing interesting tips, the more I can share back to you.
I’m pleased to say I’ve fixed the dates for my next on-line (Zoom) courses, I’ve been running these remotely with great success during our pandemic and I’m going to continue the virtual format. Early bird tickets are available and limited for each course:
I’m happy to answer questions about the above, just reply here. If you want a notification for future dates please fill in this form.
For those of you who have been waiting a while for me to get these listed - apologies, being the father to an infant has eaten a lot of time this year and I’ve had to take things sensibly before scheduling new courses.
If you’re not in the habit of contributing to open source, you might not realise that just by posting an issue you can leave a well-described problem waiting until someone else who hits something similar, and who has enough time, finishes a solution. I generally don’t have time to write better code (normally the issues I hit are drive-by, so once solved I’m pushing on to the next thing) as a client doesn’t get a clear benefit from me putting time in that way. However if I’ve learned something, I’m happy to spend a little bit of time documenting it to see where it might go.
Back in 2020 I hit a problem when using Dask with a custom Aggregation. It took a bit of debugging to realise I was being exposed to a Pandas SeriesGroupBy before I could figure out how to use the numbers hidden inside it. I posted a long descriptive example and a core dev asked me to push on - I declined.
Two weeks back a fix was added by dustinwerran2 which actually is wonderfully simple (in hindsight I totally should have spent a few minutes more just doing this myself) and you can see the new text now at the footer of the dataframe groupby page. A benefit of waiting to see if someone else picks up a well-described issue is you’re more likely to be spending time on a valuable activity if at least 1 other person agrees it is worth fixing. In this case the fix was created under a mentorship programme.
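For readers who haven’t met this pattern: Dask’s custom Aggregation is built from a per-partition chunk step, a cross-partition agg step and an optional finalize step, which is why the chunk step is handed a grouped object rather than plain numbers. The sketch below imitates those three stages in plain Python to compute a grouped mean. It is an illustration only, not Dask’s implementation and not the documentation fix mentioned above; the function names, the tuple layout and the sample data are all assumptions made for the example, and lists of (key, value) pairs stand in for partitions.

```python
from collections import defaultdict

# Hypothetical stand-ins for the three stages of a custom aggregation.
# In Dask the chunk step would receive a Pandas grouped object; here it
# receives the plain list of values for one group in one partition.

def chunk(values):
    """Per-partition, per-group step: reduce raw values to a partial result."""
    return (len(values), sum(values))

def agg(partials):
    """Combine one group's partial results from every partition."""
    counts, totals = zip(*partials)
    return (sum(counts), sum(totals))

def finalize(combined):
    """Turn the combined partial result into the final answer (the mean)."""
    count, total = combined
    return total / count

def grouped_mean(partitions):
    """Run chunk on each partition's groups, then agg and finalize per group."""
    partials_by_group = defaultdict(list)
    for partition in partitions:
        # Group the rows inside this partition only.
        groups = defaultdict(list)
        for key, value in partition:
            groups[key].append(value)
        for key, values in groups.items():
            partials_by_group[key].append(chunk(values))
    return {
        key: finalize(agg(partials))
        for key, partials in partials_by_group.items()
    }

# Two made-up partitions of (group key, value) rows.
partitions = [
    [("a", 1.0), ("b", 10.0), ("a", 3.0)],
    [("a", 5.0), ("b", 20.0)],
]
result = grouped_mean(partitions)
print(result)
```

The mean is a useful test case because it can’t be computed by averaging per-partition means; carrying (count, total) pairs through the chunk and agg steps is what makes the final division correct.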
If you’ve never made an open source contribution - do think about filing an issue on a project that describes a problem (and bonus - the fix you found), so that others can benefit. Even better is to build on one of these posts to eventually improve the code. You get a very public record of your contribution (which includes your ability to describe & communicate) which can only help in your career endeavours.
If you want to share your knowledge this is a great resource so you can find conferences you might otherwise have missed.
If you like hip-hop, check Grandmaster Flash Talks “The Theory” Of Being A HipHop DJ & The Beginnings Of Hip-Hop!! for an hour of the Grandmaster talking on the theory (and a bit of math) of hip-hop. It turns out he invented the slipmat and cue-points to line up samples on the decks and found the earliest turntables that spun up to speed appropriately quickly. If you like your analogue throwbacks, this is well worth a watch.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I’m also on twitter, LinkedIn and GitHub.
Jobs are provided by readers, if you’re growing your team then reply to this and we can add a relevant job here. This list has 1,400+ subscribers. Your first job listing is free and it’ll go to all 1,400 subscribers 3 times over 6 weeks, subsequent posts are charged.
As a Senior Data Scientist at J2-Reliance, you will be responsible for developing Data Science solutions and Machine Learning models tailored to the needs of J2-Reliance’s clients. You will typically act as a Full-Stack Project Owner: you will be the main referent for a well delimited client’s problem, and you will be responsible for the conception and the implementation of the end-to-end solution to solve it. You will be supported by a program manager (your direct supervisor) and a Data Engineer helping with industrialisation. The specific nature and level of their involvement will depend on your areas of expertise and the specificities of the project.
The Royal Botanic Gardens, Kew (RBG Kew) is a leading plant science institute, UNESCO World Heritage Site, and major visitor attraction. Our mission is to understand and protect plants and fungi for the well-being of people and the future of all life on Earth.
Kew’s new Plants for Health initiative aims to build an enhanced resource for data about plants used in food supplements, allergens, cosmetics, and medicines to support novel research and the correct use and regulation of these plants.
We are looking for a Data Scientist with experience in developing data mining tools to support this. The successful candidate’s responsibilities will include developing semi-autonomous tools to mine published literature for key medicinal plant data that can be used by other members of the team and collaborators at partner institutes.
IndexLab is a new research and intelligence company specialising in measuring the use of AI and other emerging technologies. We’re setting out to build the world’s first index to publicly rank the largest companies in the world on their AI maturity, using advanced data gathering techniques across a wide range of unstructured data sources. We’re looking for an experienced Data Engineer to join our team to help set up our data infrastructure, put data gathering models into production and build ETL processes. As we’re a small team, this role comes with the benefit of being able to work on the full spectrum of data engineering tasks, right through to the web back-end if that’s what interests you! This is an exciting opportunity to join an early stage startup and help shape our tech stack.
We are looking for Staff, Senior Staff & Principal ML Engineers to design and build algorithmic and machine learning systems that power Deliveroo. Our MLEs work in cross-functional teams alongside engineers, data scientists and product managers, who develop systems that make automated decisions at a massive scale.
We have many problems available to solve across the company, including optimising our delivery network, optimising consumer and rider fees, building recommender systems and search and ranking algos, detecting fraud and abuse, time-series forecasting, building a ML platform, and more.
The Regulatory Genome Project (RGP), part of the Cambridge Centre for Alternative Finance, was set up in 2020 to promote innovation by unlocking information hidden on regulators’ websites and in PDFs. We’re a commercial spin-out from The University of Cambridge’s Judge Business School and our proposition is to make the world’s regulatory information machine-readable and thereby enable an active ecosystem of partners, law firms, standard-setting bodies and application providers to address the world’s regulatory challenges.
We’re looking for a data scientist to join our remote-friendly technical team of software engineers, machine learning experts, and data scientists who’ll work closely with skilled regulatory analysts to engineer features and guide the work of a dedicated annotation team. You’ll help develop, train, and evaluate information extraction and classification models against the regulatory taxonomies devised by the RGP as we scale our operations from 100 to over 600 publishers of regulation worldwide.
The Met is looking for an analyst and a lead analyst to join its Strategic Insight Unit (SIU). This is a small, multi-disciplinary team that combines advanced data analytics and social research skills with expertise in, and experience of, operational policing and the strategic landscape.
We’re looking for people who are able to work with large datasets in R or Python, and who care about using empirical methods to answer the most critical public safety questions in London! We’re a small, agile team who work throughout the police service, so if you’re keen to do some really important work in an innovative, evidence-based but disruptive way, we’d love to chat.