Next week we have our first PyDataLondon meetup in a couple of years! It’ll comprise a set of lightning talks to get folk ready for the post-meetup discussion. The venue is the same as we used to have (thanks to the Man Group!). Sadly I can’t make it, but I hope to be along to the next.
A month later we have our PyDataLondon conference. I’ll be talking on “Successful Data Science Projects” (aka a list of some select past failures, with lessons learned, plus a happy ending) and running another “Executives at PyData” session aimed at leaders. If you’ve got a ticket for the conference you can attend both; I’d love to say hi in person.
I’ve been thinking a lot about past project failures in my consulting - so much of it comes down to human factors (not “algorithmic issues”) and I’m keen to hear more from attendees. Figuring out how to communicate these issues and to bake them into project plans to lessen their impact seems critical to getting more success for data teams (IMHO).
Eric Drass, artist and former keynote speaker at PyDataLondon, has a new video out using the Flickr Faces dataset and latent-space manipulation to create a mesmerising auto-morphed video for Frank&Beans. Skip 30s in for the morphing to really kick in. I follow hip-hop (anyone else attend the DMC world championships each year?) and this feels like a visual scratch - I wonder if we’ll see such things for turntablists down the line. Eric describes starting from the base FFHQ StyleGAN2 model and building up “[a] series of non-existent human faces from the mind of the GAN – interpolated to the tempo of the track.” Lovely stuff.
Very soon I’m going to list new course dates for the coming months for my Higher Performance Python, Software Engineering for Data Scientists and Successful Data Science Projects courses. Reply to this if you’d like a notification; I hope to get the events listed next week.
On the morning of Monday 13th I’ll host an informal Zoom-based demo and discussion around Pandas performance tips I’ve developed (from running recent versions of my course). If you’d like to watch along, or discuss your own Pandas annoyances and tips, I’d love to have you along - reply to this and I’ll add you to the calendar invite. We’ll look a little at faster groupbys and row-block accessing in this call. In particular I’ll be asking you about your pains using Pandas (and I’ll try to answer your questions) as prep for my Pandas course later this year; hopefully between those of us on the call we can answer the questions you bring.
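As a taste of the kind of groupby tip I mean - this is a hedged sketch with made-up data, not necessarily what we’ll cover on the call - converting a repeated key column to a categorical and skipping the sort can speed up a groupby while giving the same answer:

```python
import numpy as np
import pandas as pd

# hypothetical data: a repeated integer key and a value column
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "key": rng.integers(0, 1_000, 1_000_000),
    "val": rng.random(1_000_000),
})

# the straightforward groupby
m1 = df.groupby("key")["val"].mean()

# same result, often faster on heavily repeated keys:
# a categorical key with observed=True and no sorting of groups
df["key_cat"] = df["key"].astype("category")
m2 = df.groupby("key_cat", observed=True, sort=False)["val"].mean()
```

Timings vary with pandas version and key cardinality, so it’s worth measuring on your own data rather than taking this on trust.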
In part I’ve been surprised by the performance difference between `query` and making a mask for multi-item access, and by how sorting (or not) can have an impact.
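For concreteness, here’s a minimal sketch of the comparison (the data and the values being selected are made up): both routes return identical rows, but their timings can differ, and pre-sorting on the filtered column can shift them again.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"key": rng.integers(0, 100, 1_000_000),
                   "val": rng.random(1_000_000)})
wanted = [3, 7, 42]

r_query = df.query("key in @wanted")   # expression-based selection
r_mask = df[df["key"].isin(wanted)]    # explicit boolean mask

# sorting first changes memory layout and can change which approach wins
df_sorted = df.sort_values("key")
```

Worth timing both with `%timeit` on your own data before settling on a habit.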
During my last Higher Performance Python course we got into a discussion about how Numba can make numpy vectorised operations even faster by fusing intermediate results together. E.g. `a = np.arange(...); b = a+2; c = np.sqrt(b)` never requires `b` to be instantiated - instead you could build the array of numbers and take the sqrt as you go, removing the need to allocate the RAM for the `b` intermediate - this saves RAM and time. I love showing in the course that compiling well-vectorised numpy code can lead to further performance improvements.
One of my attendees (thanks!) shared some Numba links for this, you can read up on Numba typed IR rewriting. There’s also a case study on array expressions which gets quite dense. I’ve got code that shows that multiple use of an array (e.g. twice in the same expression) seems to foil the array detection and fusion process, but that’ll require a fresh brain to dig into.
I got into a tweet discussion with Harry and Nick about the RAM implications of using a list comprehension versus a generator expression (and then went down a rabbit hole with Numba). For this trivial example you end up using the least RAM with a regular `for` loop (which is normally my preference - explicit and easy to read), you gain speed by trading some RAM with a generator expression, and you trade even more RAM with a list comprehension. I was quite surprised by the list comprehension result - my ipython_memory_usage tool showed 1.2GB RAM usage during that cell, about +700MB over just doing a `for` loop for the same answer. Do you have any insights?
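A small sketch of the three variants (a made-up summing example, not the exact code from the thread) - all give the same answer, but the list comprehension materialises every result in RAM at once, which is what a per-cell tool like ipython_memory_usage makes visible:

```python
# same answer three ways; peak RAM differs with what gets materialised
N = 1_000_000

total_loop = 0
for i in range(N):          # for loop: one int live at a time
    total_loop += i * i

total_gen = sum(i * i for i in range(N))     # generator: also one at a time
total_list = sum([i * i for i in range(N)])  # list comp: all N results in RAM first
```

Running each variant in its own IPython cell under ipython_memory_usage shows the difference clearly.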
James Powell asks on Twitter “If you were to reïmplement pandas from scratch, would you preserve the `DataFrame` distinction? What about the `MultiIndex` distinction?” - you might want to see the thoughts from a few others.
If you use Dask and you dig into the `.divisions` attribute to understand whether you can get fast index-based accesses or not, be aware that you now have to set `calculate_partitions` by hand, whereas it used to be on by default. This cost me a couple of hours of debugging prior to running my Higher Performance Python course last week.
I see that a new release candidate for `scikit-learn` 1.1 was available a few weeks back - see the changelog for details; it looks like there are a lot of small speed improvements alongside the usual set of bug fixes.
If you find this useful please share this issue or archive on Twitter, LinkedIn and your communities - the more folk I get here sharing interesting tips, the more I can share back to you.
See recent issues of this newsletter for a dive back in time. Subscribe via the NotANumber site.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I’m also on twitter, LinkedIn and GitHub.
Jobs are provided by readers; if you’re growing your team then reply to this and we can add a relevant job here. This list has 1,400+ subscribers. Your first job listing is free and it’ll go to all 1,400 subscribers 3 times over 6 weeks; subsequent posts are charged.
The Met is looking for an analyst and a lead analyst to join its Strategic Insight Unit (SIU). This is a small, multi-disciplinary team that combines advanced data analytics and social research skills with expertise in, and experience of, operational policing and the strategic landscape.
We’re looking for people able to work with large datasets in R or Python, who care about using empirical methods to answer the most critical public safety questions in London! We’re a small, agile team who work throughout the police service, so if you’re keen to do some really important work in an innovative, evidence-based but disruptive way, we’d love to chat.
An exciting opportunity has arisen for a Principal Population Health Analyst to join the Population Health and Care team at Lewisham and Greenwich Trust (LGT) where the post holder will be instrumental in leading the analytics function and team for Lewisham’s Population Health and Care system.
Lewisham is the only borough in South East London to have a population health management information system (Cerner HealtheIntent) that is capable of driving change, innovation and clinical effectiveness across the borough. The post-holder will therefore work closely with public health consultants, local stakeholders and third-party consultancies to explore epidemiology through the use of HealtheIntent, and design new models of transformative care that will deliver proactive and more sustainable health care services. LGT is therefore seeking an experienced Principal Population Health Analyst who is passionate about transforming and improving the lives and care of patients through data analytics and can draw key, actionable insights from our data. The successful candidate will be an experienced people manager with strong communication skills, able to lead a team of analysts and manage the provision of data analytics to a diverse range of stakeholders across Lewisham, with a particular focus on population health, bringing together best practice and innovative approaches.
We import mid/large-scale data simultaneously from multiple sources (large databases, proprietary data stores, gigabyte spreadsheets), and merge it into a single queryable data-store. We need someone with a DevOps, Data Science, or Back End Engineering background to impose order on the chaos. This role is a mix of data science and engineering-for-scale, taking real-world data and inventing automated, scalable systems to deal with it.
This is a chance to join a well-funded startup (with revenue and customers) at the beginning of a new growth phase. Working with our Lead Back-End Engineer and CTO, you’ll be designing the new systems and taking the lead on implementing and maintaining them. Ideally you have experience of implementing backends using a variety of frameworks, techs, languages - we’re agnostic on specific tech, in most cases using the best tool for each job.
We are looking for Data Research Engineers to join DeepMind’s newly formed Data Team. Data is playing an increasingly crucial role in the advancement of AI research, with improvements in data quality largely responsible for some of the most significant research breakthroughs in recent years. As a Data Research Engineer you will embed in research projects, focusing on improving the range and quality of data used in research across DeepMind, as well as exploring ways in which models can make better use of data.
This role encompasses aspects of both research and engineering, and may include any of the following: building scalable dataset generation pipelines; conducting deep exploratory analyses to inform new data collection and processing methods; designing and implementing performant data-loading code; running large-scale experiments with human annotators; researching ways to more effectively evaluate models; and developing new, scalable methods to extract, clean, and filter data. This role would suit a strong engineer with a curious, research-oriented mindset: when faced with ambiguity your instinct is to dig into the data, and not take performance metrics at face value.
Join us in our mission to help tackle climate change, one of the biggest systemic threats facing the planet today. We are a start-up providing analytics and software to assist companies in navigating climate uncertainty and transitioning to net zero. We apply research frameworks pioneered by the Centre for Risk Studies at the University of Cambridge Judge Business School and are already engaged by some of Europe’s biggest brands. The SaaS product that you will be working on uses cloud and Python technologies to store, analyse and visualize an organization’s climate risk and to define and monitor net-zero strategies. Your focus will be on full stack web development, delivering the work of our research teams through a scalable analytics platform and compelling data visualization. The main tech stack is Python, Flask, Dash, Postgres and AWS. Experience of working with scientific data sets and test frameworks would be a plus. We are recruiting developers at both junior and senior levels.
We are looking for a Research Scientist who will help build, grow and promote the machine learning capabilities of Callsign’s AI-driven identity and authentication solutions. The role will principally involve developing and improving machine learning models which analyse behavioural, biometric, and threat-related data. The role is centred around the research skill set - the ability to devise, implement and evaluate new machine learning models is a strong requirement. Because the role involves the entire research and development cycle from idea to production-ready code, we require some experience of good software development practices, including unit testing. There is also the opportunity to explore the research engineer pathway. Finally, because the role also entails writing technical documentation and whitepapers, strong writing skills are essential.
Data Scientists at Monzo are embedded into nearly every corner of the business, where we work on all things data: analyses and customer insights, A/B testing, metrics to help us track against our goals, and more. If you enjoy working within a cross-disciplinary team of engineers, designers, product managers (and more!) to help them understand their products, customers, and tools and how they can leverage data to achieve their goals, this role is for you!
We are currently hiring for Data Scientists across several areas of Monzo: from Monzo Flex through to Payments, Personal Banking, User Experience, and Marketing; we are additionally hiring for a Manager in our Personal Banking team and Head Of-level roles in Marketing. I’ve linked to some recent blog posts from the team that capture work they have done and the tools they use; if you have any questions, feel free to reach out!
Monzo is the UK’s fastest growing app-only bank. We recently raised over $500M, valuing the company at $4.5B, and we’re growing the entire Data Science discipline in the company over the next year! Machine Learning is a specific sub-discipline of data: people in ML work across the end-to-end process, from idea to production, and have recently been focusing on several real-time inference problems in financial crime and customer operations.
We’re currently hiring more than one Head of Machine Learning, as we migrate from operating as a single, centralised team into being deeply embedded across product engineering squads all over the company. In this role, you’ll be maximising the impact and effectiveness of machine learning in an entire area of the business, helping projects launch and land, and grow and develop a diverse team of talented ML people. Feel free to reach out to Neal if you have any questions!
Caterpillar is the world’s leading manufacturer of construction and mining equipment, diesel and natural gas engines, industrial gas turbines and diesel-electric locomotives. Data is at the core of our business at Caterpillar, and there are many roles and opportunities in the Data Science field. The Industrial Power Systems Division of Caterpillar currently has an opportunity for a Senior Data Scientist to support power system product development engineers with data insights, and to develop digital solutions for our customers to maximise the value they get from their equipment through condition monitoring.
As a Senior Data Scientist, you will work across, and lead, project teams to implement analytical models and data insights on a variety of telemetry and test data sources, in a mechanical product development environment.