How do you build a solid backlog and derisk your novel projects (Part 2)?

Further below are 7 job roles, including Senior, Staff and Lead roles in DS and DEng, at organisations like Deliveroo, Trust Power and Kew Gardens

                            August 31, 2022

I’ve got the second part of the write-up from my recent Success call for DS leaders on “backlogs, derisking and estimation”. The first half of the write-up is in the last issue. I discuss the updated nbdev v2 further below for potentially faster Jupyter based R&D.
I’ve had to rejig my training dates: only the Higher Performance Python course (at the end of October) is on sale right now, due to increased client demand for strategic support (I help teams build and execute a plan when they’re under pressure to deliver). Sorry if that’s a pain. I plan to run my Successful Data Science Projects course a few months later, possibly in January. Please reply to this email if you’d like a notification about this.
The call for proposals for PyDataGlobal is open until September 12th and the conference will run in early December. I’ll be putting in a talk. The CfP is double-blind to reduce bias, so first-timers are especially welcome. Tickets for the conference are available and I hope to run an event around “Executives at PyData” (for leaders) as I’ve done before.
Finally we are starting the reboot of our monthly PyDataLondon meetups - next Tuesday we have a pub meet in central London near London Bridge if you’re around. I plan to be there and will wear a branded PyData t-shirt (as I hope will other long-time attendees and organisers so we’ll be visible!).
Success - Backlogs, derisking and estimating call a week back
A few weeks back I ran a follow-up to the conference’s Executives at PyData discussion for leaders. This one was via Zoom, with a focus on building backlogs, derisking projects and estimating.
We took a vote and talked on:

How do you estimate project value for novel projects with many unknowns? (previous newsletter)
How do you discover projects for the backlog? (previous newsletter)
How do you estimate a time line for new projects? What’s critical in this time line?
What are some effective ways to navigate projects designed to improve laggy metrics (e.g. retention, LTV)?

The recording is available in full (1 hour) on YouTube. I covered the first two questions in the last issue, and the next two in this issue.
Many thanks to Anders, Ben, Caroline, Isaac, Lauren, Lorna, Raphaël, Maxime, Ricardo and Thomas for your contributions. This group represented mature teams in many larger organisations plus a couple of smaller organisations.
How do you estimate a time line for new projects? What’s critical in this time line?
Raphaël noted that sprint-based planning was common (e.g. 4-week sprints), with an initial derisking phase for novel projects to get past a set of go/no-go decisions which would allow a project to advance. Typically it is really hard to schedule a novel project when unknowns are prevalent, so some amount of derisking is sensible to reduce the unknowns enough to make a guesstimated timeline for a project.
I’ve had to produce “solid timelines” for high-risk projects with many unknowns and I’ve found the process to be a nonsense. It gives a false sense of security (“we’ve got a Gantt chart so it must be good!”) and nobody wants to talk about the many areas of variability which will inevitably impact the timeline or potential for value to be delivered. Working with teams who accept the uncertainty at the outset is critical.
Tom follows a similar process looking for “early exits” - bail points to end a project if certain pre-designed conditions can’t be met. He noted that figuring out the lower-uncertainty options that are valuable-enough means the team can make some progress delivering something of value, giving them time to further derisk the more uncertain bigger-vision ideas. 
A link was shared for Joe Blitzstein and Hanspeter Pfister’s Data Science Process - it looks reasonably similar to CRISP-DM (which I teach in my Success course). The basic idea is to identify a time-boxed window, then walk up and down the ladder of steps within that time-boxed window to deliver something of value, before repeating. 
Almost certainly this means the early deliverables have to be super-simple, but still showing some value, to get sign-off to proceed. I’m a huge fan of iteratively derisking projects and making go/no-go decisions as you learn more!
Anders noted that delivering value early was critical (I quite agree!). Figuring out how to use the client’s data to show something useful or actionable early on, possibly during the derisking phase, helps build trust and keeps the project moving. I’ve learned (painfully, having avoided this step in the past) that showing why a client’s heuristic is “ok” but a tiny bit of ML is “better” really early on, or showing outliers or weird trends on graphs, can massively build trust and help make valuable decisions for the team to try, before you sink a lot of time into a project. What’s the simplest next valuable action you might take?
We then had a side-discussion driven by Lauren on “does AutoML save you time?”. AutoML is the act of using automated ML approaches to quickly build models. This technique is great if you’ve done all the necessary business and data preparation and you’re investing time building models manually and every last ounce of performance matters. Unfortunately at the start of a novel project it probably isn’t useful - you don’t yet have the data and understanding to let AutoML do its thing.
I’ve tried it in the past - it can helpfully identify some relationships in the data (which helps the client build trust) and Lauren noted that it can also be used to identify data leaks which means you derisk quicker. On the call, of those who had tried AutoML for novel projects nobody had found it valuable. Have you found a time when AutoML was genuinely useful? I’d love to hear about it!
We discussed how without data preparation, AutoML will have low value and at worst might mislead junior team members who are blindly following the AutoML without questioning “why is it giving this result?”. Raphaël noted that on tabular data typically XGBoost just gives a good enough answer, assuming the data is already in a good state, that you get enough early signal to derisk that part of a project. 
Can you use the above to help your team move faster? I’d love it if you replied and told me which idea was useful to you.
What are some effective ways to navigate projects designed to improve laggy metrics (e.g. retention, LTV)?
With laggy metrics - e.g. annual churn (i.e. renewals) with pre-renewal offers or figuring out if recommendations increase engagement over time, it is really hard to test if your shiny new ML solution is having a positive impact. Waiting months or a year to see what happens is too long. You lose buy-in from the team and external factors (e.g. recessions, seasonal effects, marketing actions by competitors) can swamp any small signal you might be looking for.
Lorna noted that she’s always looking for proxy fast-moving metrics. They might not correlate perfectly with the business metric, but they can show if you have a problem (or a clear winner), to help build confidence. We also talked about having shorter and longer windows (with commensurate larger and smaller errors) to help visualise what’s really changing (if anything!). 
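To make the point about shorter and longer windows concrete, here is a minimal sketch with made-up numbers. It assumes a hypothetical fast-moving proxy (daily logins out of 500 active customers, simulated at a 30% rate) and treats every customer-day as independent, which is a simplification. It only illustrates how the error shrinks as the window grows, it isn't a method anyone described on the call.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical proxy metric: daily logins out of 500 active customers,
# simulated with a true underlying rate of 30%.
DAILY_CUSTOMERS = 500
daily_logins = [
    sum(random.random() < 0.30 for _ in range(DAILY_CUSTOMERS))
    for _ in range(28)
]


def window_summary(logins, days):
    """Rate and normal-approximation standard error over the last `days` days."""
    recent = logins[-days:]
    n = days * DAILY_CUSTOMERS
    rate = sum(recent) / n
    std_err = math.sqrt(rate * (1 - rate) / n)
    return rate, std_err


for days in (7, 28):
    rate, std_err = window_summary(daily_logins, days)
    print(f"last {days:>2} days: {rate:.1%} +/- {1.96 * std_err:.1%}")
```

The 7-day window reacts faster to a real change but carries roughly twice the error of the 28-day window, so plotting both side by side helps show whether a movement is signal or noise.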
Caroline noted that sometimes you just have to keep reporting a result every month, noting the uncertainty and possible changes, until after many months the management buy in to the change with confidence.
Ben noted that slowly convincing management to run a 95/5% test, with 5% held out from the intervention, was another way to help show that change was occurring that wasn’t being affected by external events. You’ve got to deal with the wider error bars on a small held-out sample. Isaac talked of drawing distributions with a wide standard deviation to help non-expert humans understand that you have to wait for the uncertainty to reduce to get a useful answer.
It was noted that by helping the team visualise its own uncertainty (perhaps the team has no concept of the uncertainty in their current human-driven process), they could already start to make better decisions.
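As a rough illustration of why the 5% held-out group has wider error bars, the sketch below compares normal-approximation 95% intervals for a retention rate in each group. The customer counts and retention figures are invented for the example, and a real analysis might prefer a different interval (e.g. bootstrap or Bayesian).

```python
import math


def retention_interval(retained, total, z=1.96):
    """Retention rate with a normal-approximation 95% interval."""
    rate = retained / total
    std_err = math.sqrt(rate * (1 - rate) / total)
    return rate, rate - z * std_err, rate + z * std_err


# Hypothetical numbers: 20,000 customers, 5% held out from the intervention.
treated = retention_interval(retained=15_580, total=19_000)
held_out = retention_interval(retained=780, total=1_000)

for name, (rate, low, high) in [
    ("treated (95%)", treated),
    ("held out (5%)", held_out),
]:
    print(f"{name}: {rate:.1%} retained, 95% interval {low:.1%} to {high:.1%}")
```

With these made-up figures the held-out interval is several times wider than the treated one, which is the picture to show stakeholders when explaining why you have to wait (or hold out more customers) before calling a result.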
Have you used this idea in your own projects? I’d love an example of how you found a useful fast-moving metric that worked as a proxy for a better but slower-moving metric.
Training dates (sort of) now available
I have a date for one of my upcoming courses via Zoom. I’ve been running these remotely with great success during the pandemic and I’m going to continue the virtual format. Early bird tickets are available and limited for each course:

Higher Performance Python (aimed at anyone who needs faster NumPy, Pandas and scaling to Dask) - October 31 - November 2
Successful Data Science Projects (aimed at leaders and project owners) - date TBC as I’ve become too busy later in the year on strategic client engagements, probably this will slip to January - please reply and let me know if you’d like to hear about the updated date

I’m happy to answer questions about the above, just reply here. If you want a notification for future dates please fill in this form.
For those of you who have been waiting a while for me to get these listed - apologies, being the father to an infant has eaten a lot of time this year and I’ve had to take things sensibly before scheduling new courses. 
Open Source - nbdev v2 for faster Jupyter development
There was an interesting discussion on hn for Nbdev: Create delightful software with Jupyter Notebooks. The FastAI team have an update to nbdev, it includes a new rendering system with Quarto (which extends the usual Pandoc), can publish to PyPI and Conda, has a text-sync and has greater git friendliness.
On the hn link there’s a post by a core developer which links to a great write-up nbdev+Quarto: A new secret weapon for productivity:

Hamel here, one of the core developers on this project. I just want to say that we are really excited about this new release of nbdev and the added functionality it brings to users. The thing I’m most excited about is that you can use nbdev for more things than ever before, such as: Documenting existing codebases (even if they aren’t written in nbdev), blog posts, books, presentations, etc. We also have an amazing notebook runner, as well as many other quality of life improvements. We will be adding more tutorials, walkthroughs and examples in the coming days. If you are interested in using nbdev please get in touch!

When I teach software engineering or if I need a clean and collaborative way to share my Notebooks I tend to prefer jupytext (which handles the text-sync, similarly to nbdev), it renders a second Python-only file without output which is the one you put into source control or edit in your IDE if you need a richer text editor.
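For a feel of what that paired Python-only file looks like, here is a small sketch assuming jupytext’s “percent” format (one of several formats it supports). The notebook content is invented for illustration. Cells are delimited by `# %%` markers and markdown cells are stored as comments, so the file runs as plain Python and diffs cleanly in source control.

```python
# %% [markdown]
# # Retention exploration
# Markdown cells are stored as comments in the paired .py file.

# %%
# A code cell: plain Python, with no stored output to clutter diffs.
renewals = [1, 0, 1, 1, 0, 1]  # hypothetical renewal flags (1 = renewed)

# %%
renewal_rate = sum(renewals) / len(renewals)
print(f"Renewal rate: {renewal_rate:.0%}")
```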
The mention of execnb as a papermill-alternative is also interesting:

BTW we also released something today that’s particularly helpful for this workflow: https://fastai.github.io/execnb/ . Basically, it’s a parameterised notebook runner. It doesn’t rely on Jupyter or nbclient or nbconvert. It’s in the same general category as Papermill, but it’s much more lightweight and requires learning far fewer new concepts

I’m not entirely sold on what nbdev offers, whilst the strongly-held opinions make for a cohesive project I’m not sure it solves the problems that I face. I am however very happy to see a “competitor” approach to R&D-to-publishing for Python DS projects. Have you tried nbdev? Has it boosted your productivity? I’d love to hear your thoughts (positive or negative) if you’ve tried it, or related tools like jupytext.
Footnotes
See recent issues of this newsletter for a dive back in time. Subscribe via the NotANumber site.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I’m also on twitter, LinkedIn and GitHub.
Now some jobs…
Jobs are provided by readers. If you’re growing your team then reply to this and we can add a relevant job here. This list has 1,500+ subscribers. Your first job listing is free and it’ll go to all 1,500 subscribers 3 times over 6 weeks; subsequent posts are charged.
Data Engineering Lead - Purpose
This is an exciting opportunity to join a diverse team of strategists, campaigners and creatives to tackle some of the world’s most pressing challenges at an impressive scale. 

Rate: 
Location: London OR Remote
Contact: sarah@cultivateteam.org (please mention this list when you get in touch)
Side reading: link

Mid/Senior Python Software Engineer at an iGambling Startup (via recruiter: Difference Digital)
This role is for a software start-up, although is a part of a much larger established group, so they have solid finance behind them. You would be working on iGaming/online Gambling products.  As well as working on the product itself you would also work on improving the backend application architecture for performance, scalability and robustness, reducing complexity and making development easier.
Alongside Python, experience of one or more of the following would be useful: Flask, REST, APIs, OOP, TDD, databases (Datastore, MySQL, Postgres, MongoDB), Git, Microservices, Websocket, Go, Java, PHP, Javascript, GCP.

Rate: Up to £90k
Location: Hybrid - 2 days per week in office opposite Victoria station
Contact: davina@makeadifference.digital (please mention this list when you get in touch)

Data Scientist at Trust Power, Permanent
Trust Power is an energy data startup. Our app, “Loop”, connects to a home’s smart meters, collects half-hourly usage data and combines with contextual data to provide personalised advice on how to reduce costs and carbon emissions. We have a rapidly growing customer base and lots of interesting data challenges to overcome. You’ll be working in a highly skilled team, fully empowered to use your skills to help our customers through the current energy crisis and beyond; transforming UK homes into the low carbon homes of the future. We’re looking for a mid to senior level data scientist with a bias for action and great communication skills.

Rate: 
Location: Oxford on site or hybrid (~1 day/week in office minimum)
Contact: steve.buckley@trustpower.com 07986740195 (please mention this list when you get in touch)
Side reading: link

Senior Data Scientist, J2-Reliance, Permanent, London
As a Senior Data Scientist at J2-Reliance, you will be responsible for developing Data Science solutions and Machine Learning models tailored to the needs of J2-Reliance’s clients. You will typically act as a Full-Stack Project Owner: you will be the main referent for a well delimited client’s problem, and you will be responsible for the conception and the implementation of the end-to-end solution to solve it. You will be supported by a program manager (your direct supervisor) and a Data Engineer helping with industrialisation. The specific nature and level of their involvement will depend on your areas of expertise and the specificities of the project.

Rate: >60000 p.a. 
Location: Fleet Street, Central London
Contact: damien.arnol@j2reliance.co.uk (please mention this list when you get in touch)
Side reading: link

Data Scientist (Plants for Health) at Royal Botanic Gardens, Kew
The Royal Botanic Gardens, Kew (RBG Kew) is a leading plant science institute, UNESCO World Heritage Site, and major visitor attraction. Our mission is to understand and protect plants and fungi for the well-being of people and the future of all life on Earth.
Kew’s new Plants for Health initiative aims to build an enhanced resource for data about plants used in food supplements, allergens, cosmetics, and medicines to support novel research and the correct use and regulation of these plants.
We are looking for a Data Scientist with experience in developing data mining tools to support this. The successful candidate’s responsibilities will include developing semi-autonomous tools to mine published literature for key medicinal plant data that can be used by other members of the team and collaborators at partner institutes.

Rate: £32,000
Location: Hybrid, Kew (London)
Contact: b.alkin@kew.org (please mention this list when you get in touch)
Side reading: link

Data Engineer at IndexLab
IndexLab is a new research and intelligence company specialising in measuring the use of AI and other emerging technologies. We’re setting out to build the world’s first index to publicly rank the largest companies in the world on their AI maturity, using advanced data gathering techniques across a wide range of unstructured data sources.
We’re looking for an experienced Data Engineer to join our team to help set up our data infrastructure, put data gathering models into production and build ETL processes. As we’re a small team, this role comes with the benefit of being able to work on the full spectrum of data engineering tasks, right through to the web back-end if that’s what interests you! This is an exciting opportunity to join an early stage startup and help shape our tech stack.

Rate: £50-70K
Location: London (Mostly Remote)
Contact: Send CV to careers@indexlab.com (please mention this list when you get in touch)
Side reading: link

Staff & Principal Machine Learning Engineer at Deliveroo
We are looking for Staff, Senior Staff & Principal ML Engineers to design and build algorithmic and machine learning systems that power Deliveroo. Our MLEs work in cross-functional teams alongside engineers, data scientists and product managers, who develop systems that make automated decisions at a massive scale. 
We have many problems available to solve across the company, including optimising our delivery network, optimising consumer and rider fees, building recommender systems and search and ranking algos, detecting fraud and abuse, time-series forecasting, building a ML platform, and more.

Rate: 
Location: London / Remote
Contact: james.dance@deliveroo.co.uk (please mention this list when you get in touch)
Side reading: link
