Further below are 7 jobs including: Scientist/Engineer for Machine Learning, Senior Data Scientist at CEFAS, Data Scientist at Arena.Online, Manager Data Science - Business Analytics at Catawiki, Senior Software Engineer at Qualis Flow Limited, Analytics & Digital Data Scientist at JLR, 3-6 month Applied AI Residency at Gradient Labs
I'll share some thoughts on Polars and Pandas 2 (with Dask to follow in the next newsletter) based on a talk I gave at a couple of conferences recently. Further below I share an update on the charity car drive (including the engine-melted Volvo) several of us are doing in September in aid of Parkinsons Research. At the end I've got a new Data Science/Eng Salary Guide, if you're looking for a job or hiring there's some useful detail there - that's right before this edition's 7 jobs.
I'm planning to run new editions of my Software Engineering for Data Scientists, Higher Performance Python and Successful Data Science Projects courses in November. If you'd like a heads-up on the dates, reply to this and I can let you know.
Course outlines can be found on my training page.
In June with Giles Weaver we spoke on Pandas 2, Dask or Polars? Quickly tackling larger data on a single machine, this talk gave us both a good chance to have a crack at learning the interesting Polars dataframe library whilst seeing how it stacked up against the venerable Dask. Behind the scenes it uses Arrow rather than NumPy for data storage and Pandas 2 now also has the option of using Arrow - so I really fancied catching up on what new opportunities these tools might give us.
All of my past talks are on speakerdeck along with the slides for Pandas 2, Polars and Dask (PyDataLondon 2023) (and the ODSC variant which has a couple of extra slides). If you find this useful, please check my fund-raising section below for the MotoScape Car Rally and consider donating to our charity event.
At the conference I also ran a high performance discussion session, I'll talk about that and Dask in the next edition.
The video isn't yet available. I'll summarise our slides.
Pandas 2 adds Arrow vectors in addition to NumPy vectors - these generally cost the same in speed and memory or are better. If you have strings then you definitely want to try the new Arrow string datatype.
Pandas 2 generally uses less RAM than earlier versions (e.g. this old bug of mine now uses much less RAM). It doesn't necessarily run faster and typically is still single-core, but by copying less RAM it both lowers the RAM footprint and wastes less time with RAM duplication. See slide 7 in the presentation for benchmarks.
For the most part if we're using Arrow in Pandas 2 you can continue to use Seaborn (and indeed Seaborn works fine with Polars too).
As Alex noted on LI:
Arrow whenever possible is a must for me nowadays - too tired of missing NaN in numpy. (along with a bunch of other great points)
Jesper 3 notes the following after our talk:
My conclusion: consider giving Polars a try if you prioritize performance or prefer its API. Otherwise, Pandas2 with the Arrow backend for strings is a solid choice. For handling huge datasets, Dask proves to be an excellent option.
We also had some other lovely feedback
Here's a high-level review - Polars is impressively fast, once you get past the different syntax (that's fine, it is just different to Pandas). You can do more in-RAM as it uses RAM more efficiently. Probably it is faster, sometimes 10x faster (e.g. slide 11), as it uses multi-cores natively.
On the flip side it uses Arrow internally so projects that want a NumPy 2D array (e.g.
sklearn see this) won't necessarily work. You can convert to a Pandas
DataFrame easily enough, but that probably doubles your RAM usage. For this use case maybe Polars isn't a great fit. I did have some success with
sklearn from Polars with 2D homogenous arrays, but any mixed-dtype DataFrames will likely need to be converted to NumPy arrays.
We did find some bugs when using Polars causing seg-faults, we found these very early on. The issues however were fixed quickly, I'm certainly impressed with author Ritchie's speed of response to the bugs we found - the main issues were fixed in 24 hours. The fact that we found such bugs (which could cause the machine to reset) early on makes me a little cautious in recommending Polars for anything other than research work at this stage.
I certainly think for R&D projects (not mission-critical work yet), where the underlying Arrow representation doesn't cause performance issues with other tools, Polars is a very interesting tool. It is the first time in years that I'm actively paying attention to a Pandas competitor.
One of the really impressive things with Polars is switching from the eager query API (each query is evaluated as required, just like in Pandas) to the lazy API where queries are squashed together and optimised. In addition this mode can use a streaming interface to larger-than-RAM files on disk.
If you check slide 16 in our slidedeck you'll see we swap from a single in-memory Parquet file to the full (much larger than RAM) parquet set and "it just works". I'll reflect on how much faster this was than with Dask in the next issue.
There are other differences between Polars and Pandas which might trip you up whilst on-boarding. I'll make some notes in the next issue.
So which to use? Pandas 2 is an easy upgrade from Pandas 1.5. If you switch to Arrow from NumPy datatypes then everything "probably works ok". Hopefully you've got appropriate tests to check that your migration is successful - right? I teach this in my Software Engineering class for good reason.
If you do switch to Arrow then you might find a few API calls in Pandas have reduced functionality, for the most part it should all be there and it is probably quicker with less RAM usage.
If you're doing R&D projects and speed is an issue then I'd strongly suggest you try Polars. Expect the occasional speed-bump (e.g. no direct
sklearn support if you have large arrays and no Numba support), but if you're doing data transformation and exploration then it does just "seem to work fine". And for the most part it is faster, with less RAM usage.
The star-usage chart for Polars from their funding announce gives you an idea of the growth behind the project:
Interestingly DuckDB (in yellow) also seems to have a similar hockey-stick growth pattern and that too can use SQL to query Parquet files. I've been told that in some quarters DuckDB is the dominant choice over Pandas, so I suspect we're seeing a convergence towards a common tool-set. Which tool will win is obviously still very open to debate.
I have some corrections to make from the last issue (thanks Marco Gorelli!):
Many thanks to those of you who have kindly donated to our JustGiving page for the charity car "banger drive" to Venice and back in a month, we drive at the start of September through France, Belgium, Luxembourg, Germany, Austria and end in Italy, then drive back - assuming the car is still running!
For those of you who read my newsletter and who haven't donated - if you find value in what I write, please do consider making a donation. All money raised goes to Parkinsons Research, we're doing this for charity (all car costs come out of our own pocket). If I'm helping you move forwards, and definitely if I've helped you find a job or a useful process or tool for your career - please do consider donating.
As noted in the last issue lockdown and becoming a new parent was pretty taxing - doing something silly is a nice way to change things up. Driving a 24 year old car (which hasn't blown up, I'll tell you about the one that did) for a 2k mile round-trip, is definitely a fun distraction.
You can see the current car on our Just Giving page, it is a 24 year old Volkswagen Passat petrol estate that is being dressed up as a surf car:
Making the car reliable is quite the challenge. We did a weekend's work on it recently and driver/engineer Ed has done a whole heck of a lot of work. Pulling the engine apart (see JustGiving for images) to do things like pressure tests on the cylinders has been an interesting "diagnostics challenge" - no
Prior to this we had a Volvo V50 turbo diesel estate. This was referenced in our Polars talk. At this point we'd owned it for 23 hours, at the 15 mile mark (admittedly on a 181k mileage engine), the turbo decided to consume its own oil causing a "turbo run-on event". That's pretty terrifying because if you stop the car (I did) and take the key out (I did), your engine is running at 7k revs (max to the limiter) and is trying to melt itself, with you still in the car. Pretty terrifying.
I'd stopped as the car went crazy whilst I was driving. The rev counter started picking up on its own, the hit maximum whilst my foot was off the accelerator. Did I mention that this was pretty terrifying? This is before the engine started to melt (whilst I was still slowing us down safely) and white smoke poured out of the engine obscuring the country lane I was on.
After I called the fire brigade (the road was obscured, the engine wouldn't stop, I'd run away down the road!) the engine finally popped. The liquids you see are a mix of diesel, oil and coolant - even the coolant tank ruptured. I'd have been more impressed with the mess had I not been driving at the time. That black smoke on the left? That's what was pumped out of the exhaust whilst the engine melted itself:
But heck - YOLO right? So we got this one scrapped and bought a second car and now we're very close to our initial fund-raising target. All funds raised go to charity (Parkinsons Research), all costs come out of our own pocket. If you haven't donated yet - might I please suggest that you consider doing A Good Thing and donating a bit of your money? At the very least go have a look at the pictures on our page and maybe share them on the socials or with petrol-head friends.
In one of my forums the question of salary ranges came up, a colleague shared the Harnham Global Salary Guide and it met the taste test. If you're recruiting, or looking to check what you're worth, you might want to take a look. If you come here you can get the employer's PDF report, it is pretty good. As an individual you can put in details here and it'll say for example that a Data Scientist in England in Greater London could be on £47k as a junior, £77k as a mid-level and £118k as a senior. There's more detail in the PDF so that's worth getting.
A couple of colleagues agreed that a DS in London would probably see £45-60 as a junior, £60-80 as a mid-level and £80-100 as a senior, with company leadership at £150k+. The PDF notes that most companies offer 2-3 days on-site, hardly any require fully on-site. What surprised me in the PDF was that most data-related roles, including BI, mentioned Python as a very important skill (pretty much all of them did), more frequently than SQL. Another surprise was that across job titles, in the same location, the salary bands looked pretty similar. Hopefully this helps you judge your market-worth.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I'm also on twitter, LinkedIn and GitHub.
Jobs are provided by readers, if you’re growing your team then reply to this and we can add a relevant job here. This list has 1,600+ subscribers. Your first job listing is free and it'll go to all 1,600 subscribers 3 times over 6 weeks, subsequent posts are charged.
Forecasting the weather accurately saves lives. At the ECMWF we have been predicting the weather 24/7 since 1975 with now 35 member and co-operating countries with a highly regarded physical system.
In this next chapter we're looking for 4 more colleagues to round off our team in creating a cutting-edge machine learning models to supplement our physics-based model and make our predictions faster, more accurate, and more energy efficient. Normally, this kind of impactful work comes at the expense of a decent salary... not in this case! So if you have the relevant deep learning experience, you might be able to make that impact in the world with us! (If you're part of an under-represented minority, please consider applying. The vacancy note is written to cover 4 positions, which means we don't expect everyone to cover every aspect!)
The UK Government's Centre for Environment, Fisheries and Aquaculture Science (CEFAS) is looking for data scientists and senior data scientists to work on computer vision and machine learning projects. We're tackling the serious global problems of climate change, biodiversity loss and food security to secure a sustainable blue future for all.
Projects include the detection, classification and quantification of benthic organisms in sea floor video, remote electronic monitoring of fishing vessels, beach litter in remotely piloted aircraft imagery, and work with innovative ship-based instruments such as plankton cameras and flow cytometers.
Closing date 28th August.
Arena.Online is the UK's leading ethical flower delivery service, specialising in personalised D2C and B2B fulfilment. The Insight & Data Science team is central in informing company decision-making and identifying growth opportunities. In an industry with sparse data, we're inventive and flexible in creating and maximising data sources.
Key responsibilities in this role include maintaining and improving our datasets (quality, functionality, and scope) and reporting capability. This already covers a wide range of sources and techniques with everything from formal order data to web scraping to customer review analysis. You will also be running projects independently and as part of our close-knit team which require creativity, adaptability, and determination. We aim to refine our data stack from data discovery to collection, preparation, reporting, dashboard building, and ultimately analysis and recommendations. We’re looking for somebody intelligent, curious, and full of exciting new ideas. As our team's third member, you'll face significant responsibility, but mega opportunity in our supportive, fun, and always learning-focused environment.
We’re looking for a Data Science Manager for our Commercial Data Insights team who will manage a team of Data Scientists / Analysts that support all the commercial departments of Catawiki (Marketing, Experts, Sales, Categories & Clusters, Finance) in using data to better understand our marketplace dynamics, to take the right decisions, and to identify opportunities to build a better Catawiki.
Your team and your role
We’re looking for someone that will be responsible for designing and developing the software that powers our products. You’ll need to collaborate with other teams, write high-quality code and ensure the codebase follows best practices.
You’ll be working in our Engineering team, working closely with Product and other technical teams and reporting to the team lead.
Proven experience in guiding and mentoring other engineers
This role sits within the new JLR Supply Chain Analytics Team – an embedded cross-disciplinary group of data professionals who move fast using the latest technologies, sharing best practices with colleagues in JLR central digital. In this delivery role you’ll be part of a fun, inclusive, learning, flexible, value focussed team backed by growing supply chain domain expertise and fearless in the face of a thorny technical problem or highly complex data. You will assist in the design and delivery of transformational Digital products to help streamline the way we work, enabling us to meet future vehicle sales volumes and customer delivery expectations in this unprecedented era of global supply chain disruption.
You will use your mathematical skills and scientific approach to help with the formulation and implementation of our Digital strategy for Planning, particularly the use of Optimisation and forecasting models to guide planning scenarios. You will use your technical knowledge to translate the business question into the correct mathematical formulation and technically implement it, assisting the business with interpreting and diagnosing the output and proactively suggesting improvements.
Gradient Labs is a new AI startup in London. We’re launching an AI Residency Program: we’re looking for an early-career AI researcher to join us for 3-6 months to experience what day 0 at an AI startup in London is like.
We’re building a suite of LLM-based agents that can safely automate customer support conversations in complex environments. You’ll be working directly with a small team— us (me, Danai Antoniou, Dimitri Masin) and any other early hires we make. Join us to spent your time researching, prototyping, optimising, and analysing what our agents can or should do. Help us build that functionality and evaluate how well it works.