Building a solid backlog and derisking your novel projects

Further below are 7 job roles including Senior and Staff roles in DS and DEng at organisations like Deliveroo and Regulatory Genome and Kew Gardens

                            August 18, 2022

                Building a solid backlog and derisking your novel projects

                        How do you build a solid backlog and derisk your novel projects?
Further below are 7 job roles including Senior and Staff roles in DS and DEng at organisations like Deliveroo and Regulatory Genome and Kew Gardens
In the last issue I talked on Building Successful Data Science Projects - my well-received talk from PyDataLondon 2022. 
Last week I ran a related call for DS leaders on “backlogs, derisking and estimation”. I’ve got a summary of the first half of that call below, the rest will follow in the next issue. Building a solid backlog is critical to success and there’s a ton of tips in the write-up below.
Along with my Success Zoom calls there’s another option you might want to consider - my colleague Douglas Squirrel does a similar thing on the CTO side. There’s some data science in some of his events, more often it is about the tech and customers. He’s got a bunch of online events coming up and one special event he’s hosting in London on September 8th on “saying Yes”. I’m hoping to attend that one in person. They’re free to attend, Squirrel is very well respected in CTO circles and typically he’s full of useful advice and ideas.
I’m shouting out a thanks to Raman Shah for very kindly sending me one of his data science t-shirt designs. At the end of my public talks I ask for a postcard to add to my growing collection on the wall (I figure a small physical act is a good way to find folk who really got a positive out of my talks). Raman went above and beyond and sent a t-shirt. Many thanks!
Success - Backlogs, derisking and estimating call a week back
Early last week I ran a follow-up to the conference’s Executives at PyData discussion for leaders, this one was via Zoom with a focus on building backlogs, derisking projects and estimating. 
We took a vote and talked on:

How do you estimate project value for novel projects with many unknowns? 
How do you discover projects for the backlog?
How do you estimate a time line for new projects? What’s critical in this time line? (in the next issue)
What are some effective ways to navigate projects designed to improve laggy metrics (e.g. retention, LTV)? (in the next issue)

The recording DS Success - Derisking, Backlogs and Estimation is available in full (1 hour) on YouTube. I cover the first two questions here, and the next two in the following issue.
Many thanks to Anders, Ben, Caroline, Isaac, Lauren, Lorna, Raphaël, Maxime, Ricardo and Thomas for your contributions. This group represented mature teams in many larger organisations plus a couple of smaller organisations.
How do you estimate project value for novel projects with many unknowns?
One of the big issues in getting projects signed off is in estimating “what do we get at the end?”. This is hardest when we don’t know if we have the data, or a route to deployment, or enough signal in the data, or if we have an appropriate solution. Each additional uncertainty makes the project riskier in the eyes of management in many organisations. How do we get past this?
Anders talks about the wonderful (and far less common!) approach taken at his organisation - “they don’t solve DS problems, they solve business problems that might involve DS and they self-organise everything”. The self organisation is really interesting - they get to self-form teams around problems, decide if they should be tackled and then proceed as necessary. No sign-off, no estimations. Obviously this involves a lot of trust. The project evolves sprint-to-sprint with the team deciding if they should keep proceeding. Anders noted that this management style wasn’t driven by the data scientists but preceded them, and one consequence is that it really unlocked the value of their DS team.
What do we do in other “more traditional” organisations? The key is to find sponsors who understand the balance of risks and potential payoff, who’ll back the project in its infancy. If the DS team has worked on a particular domain or problem then hopefully the risks are lower, as they should have more of the answers. Maybe they’ve worked on the same data before so they know that the signal might exist, or they understand the business unit’s challenges so they can propose first “easy wins” which will gain mutual confidence for iterative deepening for the project.
Caroline noted that identifying “how much human time is spent on this challenge” enables a time/value discussion with the bosses. If they’re reasonably certain they can save a useful-enough amount of time then a green-light can be achieved for a first iteration of a project.
Ben noted that some business colleagues are more “numbers focused” than others - if you can find them, they’ll probably show the potential Return on Investment for the project because they’ve already thought about the numbers that would justify it. If you can’t find someone who has worked the numbers already then the challenge is higher - but it is still likely to be critical to getting a project signed-off.
Personally I’m a huge fan of finding “the humans who know” (i.e. those who run the business unit) and investing time with them on questions like “how much time could we save? from your experience with the data, is the necessary signal in here? can you outline several variant projects that we might investigate if any one ended up being a failure?”. If this person has useful answers, we can probably derisk the project quickly without having to touch the data (yet) which is always a time-consuming process.
How do you estimate the value for novel projects? I’d love it if you could reply with a couple of notes on what you do that works, and with any questions for where it doesn’t work?
How do you discover projects for the backlog?
Another critical issue is having an appropriate backlog of new projects. If we don’t have projects for next year’s timeline, how do we know that we’ll be delivering value? There may be strategic choices to make around creating value-multipliers, connecting multiple projects together (e.g. taking a data exhaust from one and using it as an input to another). We can only prioritise sensibly if we have a good road-map for projects and value-delivery. But how does the DS team get those projects in the first place?
Having someone “sell DS” internally is a common route, but takes a lot of effort. Turning the challenge around and making product owners come to you is far more effective but it requires a different sort of “sell”. If your team isn’t selling in some form or other, eventually your leads will dry up. This really is critical for a team to be effective, especially in an organisation that is naive around its DS usage.
Tom noted that once informal drinks led to an idea being discussed at his organisation. The idea was something that was always “on the backburner” for that team and never got promoted to a formal project. This chat led to a project idea for an upcoming hackathon and that led to the project growing up. Tom further noted that if you’re separated from other business units, you just don’t know where the problems are - the business folk know their issues (but not what DS could do), the DS team have cool algorithms but no insight into needs. Building informal networks can bridge this gap.
Lorna talked on doing “DS PR” - selling their team’s successes by joining calls for other teams and giving them a brief outline of the cool and valuable thing they did and asking if there are other topics that might be worth discussing. Lorna also noted that PyDataLondon meetups are also great to attend to get new ideas on what’s worked well for others. 
Anders talked on Big Visual Information Radiators - think of eye-catching posters on the well (or in a slack channel!), which get folk interested in asking questions. I’ve used this with success in some teams and been very frustrated in some environments where you’re simply not allowed to stick anything unauthorised on the walls (!). If you can - print some graphs on an A1 glossy sheet, make sure they’re annotated or self-contained, then stick them up with some clue for how an interested reader can come and ask you questions. 
Anders also noted that value of “sitting in a team” - e.g. visiting the call centre to see how the staff there actually work, to get a deeper understanding of if & how their problems could be solved via data. I’ve used this approach in insurance to “sit in the Box” in the Lloyds of London building (a wonderfully archaic location - letting women not have to wear heeled-shoes was seen as a major step forward in their culture), watching how forms are written out for custom insurance, then tracing those forms all the way back to the database to learn just how much information was retained (and…lost) along that journey. 
Caroline had success using an open wiki page that was promoted for idea-submission. It did require some management (apparently it got a bit chaotic) but it worked to help people share and collaborate on new ideas. 
One thing I’m a fan of is making a slack channel for “results” and encouraging other data-literate teams to share their results. If someone did something cool with their data (e.g. PowerBI, R, Python, Excel, a napkin - whatever) and they can share a paragraph or a nice graphic - brilliant. Once others asynchronously dip into this channel, it’ll lead to new conversations. It also forces team members to create mini-assets which might be used in show-and-tell discussions or when jumping on a call with other teams.
Personally I’m a huge fan of “walking around, meeting folk, asking what they do and what’s hard” - a bit of random search (possibly facilitated by trips to the pub after work) can be very revealing, opening up interesting ideas that’d never usually get promoted in official meetings.
I’ve also seen a “remote poster session” work very well for a distributed DS team, to get them all up to speed on what each team across a large org is doing. Why not extend this idea and make a posters-day where a set of posters are promoted across the company (e.g. on the intranet or slack) for all to see, coupled with a scheduled Zoom call to “meet the authors for open discussion” on each topic. That’s pretty easy to organise, shares the DS results to all interested Product Owners and gathers those with questions into a nice call. 
Outside of the call I had a colleague say they’ve had success using Gather for remote teams, where they’ve setup a shared-office environment for the DS team. They’ve seen the DS team collaborate more as they can “see each other talking”, but they’ve also promoted this to other teams and had folk “stop by for a chat”. I’ve not come across teams using this (and our experiments using it for the PyDataGlobal conferences have had mixed success) - have you found value from Gather or similar tools?
What kind of new ideas might bubble up if you tried some of these ideas? Which might be an easy one to try soon as an experiment?
Training dates now available
I’m pleased to say I’ve fixed the dates for my next on-line (Zoom) courses, I’ve been running these remotely with great success during our pandemic and I’m going to continue the virtual format. Early bird tickets are available and limited for each course:

Higher Performance Python (aimed at anyone who needs faster NumPy, Pandas and scaling to Dask) - October 31 - November 2
Successful Data Science Projects (aimed at leaders and project owners) - December 6-7

I’m happy to answer questions about the above, just reply here. If you want a notification for future dates please fill in this form.
For those of you who have been waiting a while for me to get these listed - apologies, being the father to an infant has eaten a lot of time this year and I’ve had to take things sensibly before scheduling new courses. 
Open source - visualising large data with Jupyter (very pretty!) and using DALL-E-2 to make adventure game graphics (also pretty!)
I was pointed at PyDeck for “High-scale spatial rendering in Python”, which is powered by DeckGL a “WebGL-powered framework for visual exploratory data analysis of large datasets”.
The overview for PyDeck shows just a couple of lines to get a 3D render of a chart in a Jupyter Notebook. The gallery is really lovely and clearly builds on the underlying DeckGL showcase. Have you used it? I’d love to hear if this is useful.
There’s an interesting discussion on hn on Adventure game graphics with DALL-E 2. I’m an absolute sucker for the 8-bit and 16-bit point-and-click graphic adventure games from my youth. Here DALL-E-2 is used to build pixelated screens of imagined adventure game screens. They’re really pretty.
Footnotes
See recent issues of this newsletter for a dive back in time. Subscribe via the NotANumber site.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I’m also on twitter, LinkedIn and GitHub.
Now some jobs…
Jobs are provided by readers, if you’re growing your team then reply to this and we can add a relevant job here. This list has 1,400+ subscribers. Your first job listing is free and it’ll go to all 1,400 subscribers 3 times over 6 weeks, subsequent posts are charged.
Mid/Senior Python Software Engineer at an iGambling Startup (via recruiter: Difference Digital)
This role is for a software start-up, although is a part of a much larger established group, so they have solid finance behind them. You would be working on iGaming/online Gambling products.  As well as working on the product itself you would also work on improving the backend application architecture for performance, scalability and robustness, reducing complexity and making development easier.
Alongside Python, experience of one or more of the following would be useful: Flask, REST, APIs, OOP, TDD, databases (Datastore, MySQL, Postgres, MongoDB), Git, Microservices, Websocket, Go, Java, PHP, Javascript, GCP.

Rate: Up to £90k
Location: Hybrid - 2 days per week in office opposite Victoria station
Contact: davina@makeadifference.digital (please mention this list when you get in touch)

Data Scientist at Trust Power, Permanent
Trust Power is an energy data startup. Our app, “Loop”, connects to a home’s smart meters, collects half-hourly usage data and combines with contextual data to provide personalised advice on how to reduce costs and carbon emissions. We have a rapidly growing customer base and lots of interesting data challenges to overcome. You’ll be working in a highly skilled team, fully empowered to use your skills to help our customers through the current energy crisis and beyond; transforming UK homes into the low carbon homes of the future. We’re looking for a mid to senior level data scientist with a bias for action and great communication skills.

Rate: 
Location: Oxford on site or hybrid (~1 day/week in office minimum)
Contact: steve.buckley@trustpower.com 07986740195 (please mention this list when you get in touch)
Side reading: link

Senior Data Scientist, J2-Reliance, Permanent, London
As a Senior Data Scientist at J2-Reliance, you will be responsible for developing Data Science solutions and Machine Learning models tailored to the needs of J2-Reliance’s clients. You will typically act as a Full-Stack Project Owner: you will be the main referent for a well delimited client’s problem, and you will be responsible for the conception and the implementation of the end-to-end solution to solve it. You will be supported by a program manager (your direct supervisor) and a Data Engineer helping with industrialisation. The specific nature and level of their involvement will depend on your areas of expertise and the specificities of the project.

Rate: >60000 p.a. 
Location: Fleet Street, Central London
Contact: damien.arnol@j2reliance.co.uk (please mention this list when you get in touch)
Side reading: link

Data Scientist (Plants for Health) at Royal Botanic Gardens, Kew
The Royal Botanic Gardens, Kew (RBG Kew) is a leading plant science institute, UNESCO World Heritage Site, and major visitor attraction. Our mission is to understand and protect plants and fungi for the well-being of people and the future of all life on Earth.
Kew’s new Plants for Health initiative aims to build an enhanced resource for data about plants used in food supplements, allergens, cosmetics, and medicines to support novel research and the correct use and regulation of these plants.
We are looking for a Data Scientist with experience in developing data mining tools to support this. The successful candidate’s responsibilities will include developing semi-automonous tools to mine published literature for key medicinal plant data that can be used by other members of the team and collaborators at partner institutes.

Rate: £32,000
Location: Hybrid, Kew (London)
Contact: b.alkin@kew.org (please mention this list when you get in touch)
Side reading: link

Data Engineer at IndexLab
IndexLab is a new research and intelligence company specialising in measuring the use of AI and other emerging technologies. We’re setting out to build the world’s first index to publicly rank the largest companies in the world on their AI maturity, using advanced data gathering techniques across a wide range of unstructured data sources.
We’re looking for an experienced Data Engineer to join our team to help set up our data infrastructure, put data gathering models into production and build ETL processes. As we’re a small team, this role comes with the benefit of being able to work on the full spectrum of data engineering tasks, right through to the web back-end if that’s what interests you! This is an exciting opportunity to join an early stage startup and help shape our tech stack.

Rate: £50-70K
Location: London (Mostly Remote)
Contact: Send CV to careers@indexlab.com (please mention this list when you get in touch)
Side reading: link

Staff & Principal Machine Learning Engineer at Deliveroo
We are looking for Staff, Senior Staff & Principal ML Engineers to design and build algorithmic and machine learning systems that power Deliveroo. Our MLEs work in cross-functional teams alongside engineers, data scientists and product managers, who develop systems that make automated decisions at a massive scale. 
We have many problems available to solve across the company, including optimising our delivery network, optimising consumer and rider fees, building recommender systems and search and ranking algos, detecting fraud and abuse, time-series forecasting, building a ML platform, and more.

Rate: 
Location: London / Remote
Contact: james.dance@deliveroo.co.uk (please mention this list when you get in touch)
Side reading: link

Data Scientist, Regulatory Genome Development Ltd
The Regulatory Genome Project (RGP), part of the Cambridge Centre for Alternative Finance, was set up in 2020 to promote innovation by unlocking information hidden on regulator’s websites and in PDFs. We’re a commercial spin-out from The University of Cambridge’s Judge Business School and our proposition is to make the world’s regulatory information machine-readable and thereby enable an active ecosystem of partners, law firms, standard-setting bodies and application providers to address the world’s regulatory challenges.
We’re looking for a data scientist to join our remote-friendly technical team of software engineers, machine learning experts, and data scientists who’ll work closely with skilled regulatory analysts to engineer features and guide the work of a dedicated annotation team. You’ll help develop, train, and evaluate information extraction and classification models against the regulatory taxonomies devised by the RGP as we scale our operations from 100 to over 600 publishers of regulation worldwide.

Rate: £70,000-£85,000 plus benefits
Location: UK remote
Contact: Niall Caton careers@reg-genome.com (please mention this list when you get in touch)
Side reading: link, link

                            Don't miss what's next. Subscribe to NotANumber: