Next week sees the return of our PyDataLondon conference after a 2-year Covid gap. The volunteers and NumFOCUS have put in a huge amount of work to bring this event back and I'm looking forward to attending, speaking and running some sessions. Will you be attending?
The schedule is packed with strong talks including a couple on "getting it right" (one by yours truly), ASR, processing bank data, fuzzy string matching at scale (beyond fuzzywuzzy!), a DataFrame showdown, faking data for testing, measurements and fairness, feature stores, NLP, plenty of Bayesian talks and talks on MLOps.
I'm very happy to give an updated run of my Building Successful Data Science Projects talk on Sunday, based on several years of my popular course (email me if you'd like a notification for the next course). We'll look at a mix of successful and "rather less successful" project outcomes from my portfolio and I'll draw out lessons which, if they fit the patterns you see, will help you succeed more frequently. The prior talk in the schedule, on Lessons in transitioning from academia to business, looks interesting to me: it focuses on looking beyond models and scores and considering the needs of the business.
On Saturday morning I'll run another Executives at PyData session. In previous years we've covered a set of burning questions from leaders with open discussion on solutions including hiring, retention, derisking projects and internal communications. We'll have a similar open format this year. Please let me know by replying if you're likely to attend this session and I'll ask you some questions to help refine my agenda.
Volunteering is a great way to give back to the community, build your network and get a free ticket to attend. In exchange for the free ticket you take on some duties such as sitting at the sign-in desk and introducing speakers. If you're at an earlier stage in your career this is a great way to build your network - you'll have easy access to the speakers and people behind the scenes of our community. Sign up to hear more about your options to volunteer.
Talking of volunteering - my designer friend Myles has made a great t-shirt design for the conference this year. You'll see it at the event. If you need graphic design work, talk to Myles.
On the morning of Monday 13th I’ll host an informal Zoom-based demo and discussion around Pandas performance tips I’ve developed from running recent versions of my course. If you’d like to watch along, or discuss your own Pandas annoyances and tips, I’d love to have you along. We’ll look a little at faster joins, groupbys and row-block access in this call. In particular I’ll be asking you about your pains using Pandas (and I’ll try to answer your questions) as prep for my Pandas course later this year. Hopefully between those of us on the call we can answer the questions you bring. Reply to this and I’ll add you to the calendar invite.
Two issues back I said I'd share some past project diagnostics. In that issue I spoke on building a data science solution without knowing who the customer was and how this led to a very difficult (but ultimately successful) final sale.
A couple of years back I got to work with a top-tier start-up investment firm to help them start their data science initiative. They wanted to use ML to identify "good prospects" that might meet their exacting criteria for making one of their few Series A-B investments. We had a large pile of past data, a good idea of what a "good target" looked like and seemingly a narrow but decent dataset of indicators to work from.
The first challenge (giving little external demonstration of progress to the client) was to clean the dataset. It had never been mined before so it contained the usual collection of duplicates, incorrect entries, spelling errors, bad dates and garbage. Since companies change names the duplicates also contained near-copies of the "same" company from different points in time (e.g. before and after a pivot), with external metadata from the single correct entity appended to both of these duplicates. Cleaning this all up took some time and really didn't make for an exciting demo but was obviously critical to any data product.
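As a rough illustration of the spelling-error and near-copy part of that clean-up, here is a minimal sketch using only Python's standard library. The company names, the suffix list and the 0.85 threshold are all made up for the example, and it is not the approach used on the project. Note that plain string similarity will not catch a company that was fully renamed after a pivot - those cases need shared metadata or a manual review.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical list of legal suffixes to ignore when comparing names.
SUFFIXES = {"ltd", "limited", "inc", "llc", "plc"}


def normalise(name):
    """Lower-case, drop punctuation and strip common company suffixes."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(word for word in cleaned.split() if word not in SUFFIXES)


def likely_duplicates(names, threshold=0.85):
    """Return pairs of names whose normalised forms are very similar."""
    pairs = []
    for a, b in combinations(names, 2):
        ratio = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
        if ratio >= threshold:
            pairs.append((a, b))
    return pairs


# Made-up examples: a suffix variant, a spelling error and an unrelated company.
companies = [
    "Acme Analytics Ltd",
    "ACME Analytics Limited",
    "Acme Analytcs",
    "Borealis Robotics",
]
duplicate_pairs = likely_duplicates(companies)
for a, b in duplicate_pairs:
    print(f"possible duplicate: {a!r} ~ {b!r}")
```

Comparing every pair is fine for a few thousand names but grows quadratically, so a larger dataset would need blocking (e.g. only comparing names that share a first token) before this step.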
Having cleaned the data, defined what a gold standard might look like and then built one by hand, we had a pretty strong learning dataset, with plans to integrate lots of external data. We also had several client-supplied heuristics which I could wrap as sklearn estimators, so we could evaluate their performance on the same time-series cross-validation dataset. Showing that an ML approach combining many features could outperform any of the client's heuristics wasn't too hard - the ML didn't do a lot more, but it obviously out-performed some of the in-house rules. So far so good.
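To make the idea of scoring a rule and an ML model on the same time-ordered folds concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the data is synthetic, `ThresholdHeuristic` is a hypothetical stand-in for a client rule (rank by a single feature), and average precision is just one reasonable metric for a ranking problem with few positives. It is not the code from the project.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import TimeSeriesSplit


class ThresholdHeuristic(ClassifierMixin, BaseEstimator):
    """Hypothetical client rule: rank companies by one feature column."""

    def __init__(self, column=0):
        self.column = column

    def fit(self, X, y=None):
        # A fixed rule has nothing to learn; record the class labels only.
        self.classes_ = np.array([0, 1])
        return self

    def decision_function(self, X):
        # Higher feature value means "more promising" under this rule.
        return np.asarray(X)[:, self.column]

    def predict(self, X):
        return (self.decision_function(X) > 0).astype(int)


def evaluate(model, X, y, n_splits=4):
    """Mean average precision over time-ordered folds (rows sorted oldest first)."""
    scores = []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        model.fit(X[train_idx], y[train_idx])
        ranking = model.decision_function(X[test_idx])
        scores.append(average_precision_score(y[test_idx], ranking))
    return float(np.mean(scores))


# Synthetic data: three weak signals that are only informative in combination.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
y = (X.sum(axis=1) + rng.normal(scale=0.5, size=2000) > 2.5).astype(int)

heuristic_ap = evaluate(ThresholdHeuristic(column=0), X, y)
ml_ap = evaluate(LogisticRegression(), X, y)
print(f"heuristic average precision: {heuristic_ap:.2f}")
print(f"ML average precision:        {ml_ap:.2f}")
```

Putting the rule behind the same `fit` / `decision_function` interface as the model means one evaluation function serves both, which keeps the comparison honest: same folds, same metric, no leakage from future rows into training.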
By now a couple of significant challenges had arisen. The primary challenge was that to validate the results, I needed the investment team to abandon their own ranking process and to use my ranked results. Since we were only looking for circa 10 good hits in a year's worth of thousands of new companies, the hit-rate would always be low. That's when I learned that career advancement opportunities for the people doing the research depended on their hit rate and boy, they really didn't want to risk ignoring their own method to use "Ian's crazy ML method" as it might risk them losing one of their rare great opportunities.
And there I got stuck - if nobody will validate your result, you're going to have a really hard time showing that you're providing more value than what's possible with existing processes. I later learned that "busy" parts of the year (e.g. November->Christmas) are an awful time to get them to focus on something risky and new and that's exactly when we had our first useful results. We live and learn.
The lack of a target-rich environment also meant that possibly good results were hidden in a pile of poor results, which hurt the client's confidence - even though they also went through many poor results to find occasional gems with their own method. Target-poor environments are poor places to start from.
The second challenge was learning that whilst we had "pretty good" data, the intuition and human-focused nature of networking meant that lots of the interesting signals didn't exist in the data - they came from people's interactions with other people. Since our dataset would never have these other important signals (nor could we see how to encode many of them - they really were "discussion focused" signals), our approach was necessarily limited. Some of the data could also be "hidden" - sometimes companies never release their funding events, or release them a year or more after they happened - so if that's one of your important signals, you're in trouble.
Ultimately I learned that one of my deliveries had highlighted a company and this was used as an additional confirmatory signal in one of their rare million-dollar investments - so we "got success" via this project, but it wasn't easy and wasn't going to scale automatically.
What did I take from this? Once we'd realised that our dataset didn't have the right signals, we had a fundamental problem. Because I lacked knowledge of this domain at the start, this wasn't obvious and it couldn't be patched. Prior domain knowledge would have fixed this - as I've found repeatedly, having good domain knowledge trumps having good mathematics skills. Focus on a vertical you like working in and build your domain knowledge.
The signals that did exist were useful and some straightforward ML on cleaned data did the job, but that got us to the issue of deploying a working system. If your target client will have low confidence in your system, they won't use it. A good way to build confidence is to iteratively deploy useful results (which is hard if you've got broken data and few clues to start with!). Now I always focus on deploying a minimal system - probably using heuristics rather than ML - and making sure I've got users who will actively participate in giving feedback. If I can't get to this feedback quickly, I move on.
If I were to tackle this again I'd start with a minimally cleaned dataset and the client's heuristics coupled with a dedicated business colleague who would give feedback in a timely manner, so we could figure out what wasn't working and improve. A solution might include ML but probably (as I learned here) should focus on more rules and a wider dataset, long before worrying about scaling with ML.
That leads to the third problem - I had no similar internal vertical to move to with this client so none of the lessons learned could be reused. Once we'd discovered our limits, all that was left was for the client to spend another year building out the data and their BI system with a plan to return to ML at some point in the future. I had a target-poor environment for internal projects and the one project we worked on had a target-poor outcome, making it doubly hard. Always look for target-rich environments with many valuable problems to solve for the business.
I mentioned recently "making $1M for my insurance client" - one of the reasons we did so well there is that we had many opportunities to pivot and to re-use our hard-won knowledge as we staked out good opportunities within the client's verticals. Always be looking for target-rich environments.
If you find this useful please share this issue or archive on Twitter, LinkedIn and your communities - the more folk I get here sharing interesting tips, the more I can share back to you.
Very soon I'm going to list new course dates for the coming months for my Higher Performance Python, Software Engineering for Data Scientists and Successful Data Science Projects courses. Reply to this if you'd like a notification - I hope to get the events listed soon.
If you like Billy Joel's We Didn't Start the Fire, check out the SUSE Linux parody We Didn't Start the Kernel. There are some lovely open source lyrics. PyData co-organiser John Sandall (thanks for all your work John!) notes that ipytone would have been a cooler inclusion - the linked video in a Jupyter Notebook looks brill.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I'm also on Twitter, LinkedIn and GitHub.
Jobs are provided by readers. If you’re growing your team then reply to this and we can add a relevant job here. This list has 1,400+ subscribers. Your first job listing is free and it'll go to all 1,400 subscribers 3 times over 6 weeks; subsequent posts are charged.
The Regulatory Genome Project (RGP), part of the Cambridge Centre for Alternative Finance, was set up in 2020 to promote innovation by unlocking information hidden on regulators’ websites and in PDFs. We're a commercial spin-out from The University of Cambridge’s Judge Business School and our proposition is to make the world’s regulatory information machine-readable and thereby enable an active ecosystem of partners, law firms, standard-setting bodies and application providers to address the world’s regulatory challenges.
We're looking for a data scientist to join our remote-friendly technical team of software engineers, machine learning experts, and data scientists who’ll work closely with skilled regulatory analysts to engineer features and guide the work of a dedicated annotation team. You’ll help develop, train, and evaluate information extraction and classification models against the regulatory taxonomies devised by the RGP as we scale our operations from 100 to over 600 publishers of regulation worldwide.
The Met is looking for an analyst and a lead analyst to join its Strategic Insight Unit (SIU). This is a small, multi-disciplinary team that combines advanced data analytics and social research skills with expertise in, and experience of, operational policing and the strategic landscape.
We're looking for people who can work with large datasets in R or Python and who care about using empirical methods to answer the most critical public safety questions in London! We're a small, agile team who work throughout the police service, so if you're keen to do some really important work in an innovative, evidence-based but disruptive way, we'd love to chat.
An exciting opportunity has arisen for a Principal Population Health Analyst to join the Population Health and Care team at Lewisham and Greenwich Trust (LGT) where the post holder will be instrumental in leading the analytics function and team for Lewisham's Population Health and Care system.
Lewisham is the only borough in South East London to have a population health management information system (Cerner HealtheIntent) that is capable of driving change, innovation and clinical effectiveness across the borough. The post-holder will therefore work closely with public health consultants, local stakeholders and third-party consultancies to explore epidemiology through the use of HealtheIntent, and design new models of transformative care that will deliver proactive and more sustainable health care services. LGT is therefore seeking an experienced Principal Population Health Analyst who is equally passionate about transforming and improving the lives and care of patients through data analytics and can draw key and actionable insights from our data. The successful candidate will be an experienced people manager with strong communication skills who can lead a team of analysts, manage the provision of data analytics to a diverse range of stakeholders across Lewisham with a particular focus on population health, and bring together best practice and innovative approaches.
We import mid/large-scale data simultaneously from multiple sources (large databases, proprietary data stores, gigabyte spreadsheets), and merge it into a single queryable data-store. We need someone with a DevOps, DataScience, or Back End Engineering background to impose order on the chaos. This role is a mix of data-science and engineering-for-scale, taking real-world data and inventing automated, scalable, systems to deal with it.
This is a chance to join a well-funded startup (with revenue and customers) at the beginning of a new growth phase. Working with our Lead Back-End Engineer and CTO, you’ll be designing the new systems and taking the lead on implementing and maintaining them. Ideally you have experience of implementing backends using a variety of frameworks, techs, languages - we’re agnostic on specific tech, in most cases using the best tool for each job.
We are looking for Data Research Engineers to join DeepMind’s newly formed Data Team. Data is playing an increasingly crucial role in the advancement of AI research, with improvements in data quality largely responsible for some of the most significant research breakthroughs in recent years. As a Data Research Engineer you will embed in research projects, focusing on improving the range and quality of data used in research across DeepMind, as well as exploring ways in which models can make better use of data.
This role encompasses aspects of both research and engineering, and may include any of the following: building scalable dataset generation pipelines; conducting deep exploratory analyses to inform new data collection and processing methods; designing and implementing performant data-loading code; running large-scale experiments with human annotators; researching ways to more effectively evaluate models; and developing new, scalable methods to extract, clean, and filter data. This role would suit a strong engineer with a curious, research-oriented mindset: when faced with ambiguity your instinct is to dig into the data, and not take performance metrics at face value.
Join us in our mission to help tackle climate change, one of the biggest systemic threats facing the planet today. We are a start-up providing analytics and software to assist companies in navigating climate uncertainty and transitioning to net zero. We apply research frameworks pioneered by the Centre for Risk Studies at the University of Cambridge Judge Business School and are already engaged by some of Europe’s biggest brands. The SaaS product that you will be working on uses cloud and Python technologies to store, analyse and visualize an organization’s climate risk and to define and monitor net-zero strategies. Your focus will be on full stack web development, delivering the work of our research teams through a scalable analytics platform and compelling data visualization. The main tech-stack is Python, Flask, Dash, Postgres and AWS. Experience of working with scientific data sets and test frameworks would be a plus. We are recruiting developers at both junior and senior levels.
We are looking for a Research Scientist who will help build, grow and promote the machine learning capabilities of Callsign's AI-driven identity and authentication solutions. The role will principally involve developing and improving machine learning models which analyse behavioural, biometric, and threat-related data. The role is centred around the research skill set--the ability to devise, implement and evaluate new machine learning models is a strong requirement. Because the role involves the entire research and development cycle from idea to production-ready code we require some experience around good software development practices, including unit testing. There is also opportunity to explore the research engineer pathway. Finally, because the role also entails writing technical documentation and whitepapers, strong writing skills are essential.
Data Scientists at Monzo are embedded into nearly every corner of the business, where we work on all things data: analyses and customer insights, A/B testing, metrics to help us track against our goals, and more. If you enjoy working within a cross-disciplinary team of engineers, designers, product managers (and more!) to help them understand their products, customers, and tools and how they can leverage data to achieve their goals, this role is for you!
We are currently hiring for Data Scientists across several areas of Monzo: from Monzo Flex through to Payments, Personal Banking, User Experience, and Marketing; we are additionally hiring for a Manager in our Personal Banking team and Head Of-level roles in Marketing. I’ve linked to some recent blog posts from the team that capture work they have done and the tools they use; if you have any questions, feel free to reach out!
Monzo is the UK’s fastest growing app-only bank. We recently raised over $500M, valuing the company at $4.5B, and we’re growing the entire Data Science discipline in the company over the next year! Machine Learning is a specific sub-discipline of data: people in ML work across the end-to-end process, from idea to production, and have recently been focusing on several real-time inference problems in financial crime and customer operations.
We’re currently hiring more than one Head of Machine Learning, as we migrate from operating as a single, centralised team into being deeply embedded across product engineering squads all over the company. In this role, you’ll be maximising the impact and effectiveness of machine learning in an entire area of the business, helping projects launch and land, and growing and developing a diverse team of talented ML people. Feel free to reach out to Neal if you have any questions!