Further below are 7 jobs including:
- Software Engineer at Qualis Flow Ltd
- Scientist/Engineer for Machine Learning (multiple vacancies)
- Senior Data Scientist at CEFAS, Permanent
- Data Scientist at Arena.Online, Permanent, Droitwich (Worcestershire) (Ian's note - Arena sent me a flower bouquet as a thank-you and they were lovely and safely delivered, cheers!)
- Manager Data Science - Business Analytics at Catawiki, Permanent, Amsterdam, The Netherlands
- Analytics & Digital Data Scientist at JLR, Permanent, Gaydon UK
- 3-6 month Applied AI Residency at Gradient Labs
Recently I ran a private Higher Performance Python course, which brought to mind my talk with Giles Weaver (LI) at PyDataLondon 2023 on Polars, Pandas 2 and Dask; I share some further observations on these below. I've also released new dates for my public courses for November - early bird tickets are available for each of these if you're quick.
In just 10 days I'm off on my charity car adventure to Venice and back - there's another photo below (and no, this car still hasn't tried to melt itself, unlike the last one - phew) - and I'd love it if you'd consider donating to our Parkinson's Research charity. Details at the end. Many thanks to those of you who have generously donated already - we're well over our initial target and that's down to you kind souls.
Martin Goodson of the London Machine Learning Meetup has asked me to share a rather fabulous-looking LLM talk on Aug 30th (online event) by the author of the ALiBi positional encoding method used in BLOOM and others. Ofir Press (Uni. of Washington) will be giving this NLP-focused talk; registration is required. Ofir will discuss "his current views on the field and what future directions excite him", and Martin says this talk is not to be missed.
As always at our monthly PyDataLondon meetups we're looking for new and interesting talks - submit your talk suggestion here.
I've listed dates for the following public courses, they'll each run as virtual courses via Zoom in the UK mornings. Early bird tickets are available for each:
The Successful Data Science Projects course is related to the private RebelAI leadership group (announced here) I'm putting together, I hope to share some reflections from those sessions from November onwards. If you're a Data Science leader and you could do with support - and a peer group of leaders - get in contact and we can have a chat.
During a recent Higher Performance Python course (privately run for a large hedge fund) we had a chat about the internal move towards Polars - some research groups were starting to try Polars with cautious success. Typically this worked in isolated places where Pandas had poor performance and where lots of RAM and many CPUs could more efficiently process the data in Polars.
The chats during my courses are always interesting and help me gauge where organisations are on their adoption curves. Python 3.11 still isn't widespread (Python 3.9 seems dominant when looking at many organisations) and yet it offers some nice pure-Python speed gains. I'm waiting to see if Python 3.12 will offer some additional improvements when it is released in the coming months.
So in the last issue I talked about Polars and Pandas 2. Here I'll give a few more thoughts on Polars and add some reflections on Dask for medium data. The popularity of Polars continues to grow at a sensible pace.
First - Ritchie, author of Polars, asked that I address the "Polars seg-faulted" comment from the last issue. He's right, it didn't cause a Segmentation Fault (I really shouldn't have said that); it did reach an Out of Memory state which in some cases left either Giles or me with a machine that reset, or one that locked up and needed a reset.
Using the latest version (0.18.15) I've re-run the data-load benchmarks that had involved problems on #9001 and noted that now only one consistently causes the OOM, and that condition is quite an edge case (a scan_parquet to load many rows, with a subsequent limit, where that limit is ignored); a second combination runs low on swap but ultimately still succeeds, and the other combinations are fine. If you're not doing anything too weird with your data loads you're unlikely to hit this issue. Having spoken to others about their use of Polars, nobody else noted issues like this, so perhaps we've just uncovered an interesting corner case.
Whilst writing this Ritchie updated the bug report to note "I will ensure the limit is pushed down the the n_rows parameter. It is a trivial fix.".
When working with Polars I came across some differences (nothing bad - just different) to my normal workflow with Pandas. Maybe noting these will help your initial experiments (and I do think it is worth experimenting with).
Personally I like the idea of composing verbs to add behaviour in Polars, rather than the Pandas "everything in each API call" approach, but just be aware that some things like default sort orders, argument naming (descending), output column naming and day-of-week indicator bases may be different.
How has your journey been whilst trying Polars? I'm keen to get your feedback, just reply to this email and let me know how smooth (or bumpy?) the process was.
If you try to do 2 or more aggregations in Polars on the same column you'll get an error, since that column name is used for the output (unless you use .alias to change it):
dfp.groupby(by='make').agg([pl.col('cylinder_capacity').count(),
                            pl.col('cylinder_capacity').median()])
# DuplicateError: column with name 'cylinder_capacity' has more than one occurrences

# use .alias to rename the columns instead
dfp.groupby(by='make').agg([pl.col('cylinder_capacity').count().alias('cyl_count'),
                            pl.col('cylinder_capacity').median().alias('cyl_median')])
In Pandas if you're lazy and just do multiple aggregations, you'll get output columns named by the aggregation:
df.groupby('make')['cylinder_capacity'].agg(['count', 'median']) # the generated DataFrame has `count` and `median` columns
Almost certainly we're better off in Pandas using NamedAgg (also see "Why is Nobody Talking about Pandas NamedAgg?") as that makes everything clearer, just as it is explicit in the Polars approach.
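For example, a NamedAgg version of the aggregation above might look like this (output column names invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"make": ["ford", "ford", "bmw"],
                   "cylinder_capacity": [1600, 1200, 2000]})

# NamedAgg makes the output column names explicit,
# much like the .alias calls in the Polars version
summary = df.groupby("make").agg(
    cyl_count=pd.NamedAgg(column="cylinder_capacity", aggfunc="count"),
    cyl_median=pd.NamedAgg(column="cylinder_capacity", aggfunc="median"),
)
print(summary)
```

The keyword names (`cyl_count`, `cyl_median`) become the output columns, so there's no ambiguity about what each aggregation produced.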
Polars counts Monday as 1 and Sunday as 7 in the weekday call; Pandas' dayofweek starts with Monday as 0. The PyArrow implementation allows either, but that's not used by Polars (the Rust equivalent is used instead) so it is rather moot; it just confused me when I was first digging around and might cause you a hiccup if you're using other Arrow layers to access datetime data.
If, say, you're porting Pandas plotting code that maps numeric day-of-week values to labels over to Polars, you might find you lose a day on the plot; if you use the numeric values in calculations or logic choices (e.g. ordinals to mask out weekends) then be careful to check as you migrate.
value_counts sorts the other way
This is minor - in Polars, value_counts only works on a Series on an eager (not lazy) dataframe. It takes a sort=True option to sort counts largest-to-smallest, which matches the order Pandas' value_counts uses by default (its ascending argument defaults to False); if you want the opposite order you'll need to chain .value_counts().sort(by='counts', descending=False). Again - there's nothing right or wrong here, just "a bit different".
The above are the ones I'd noted whilst preparing the talk for PyDataLondon. They're not huge, and since the error messages in Polars keep improving they're not too painful - it is more about being open-minded about Polars being similar-but-not-the-same as Pandas.
The counter-point to this is that I've found myself being happily surprised that Polars can be so much faster than Pandas, so I've then manually re-worked some of my Pandas expressions to make them faster (though generally not as fast as Polars achieves). I've put one of these examples into my Higher Performance Python class as an exercise for students.
During our talk we compared Polars and Dask on a medium-data problem, slide 20 shows similar(ish) code for a sub-minute execution of our query. Slide 22 however showed a more impressive result for Polars - with no tuning we ran a more complex query in 11s whilst in Dask we went from 3 minutes to about 1 minute after Giles spent the better part of a day experimenting his way to a faster solution.
Matt Rocklin reflected that Dask has trouble optimising queries at the moment but will soon have a new "Dask expressions" library which will enable Polars-like compound-operation optimisation. It'll be great to re-run that benchmark once this is available. For now though we concluded that if you like Pandas, Dask is great for scale and you can make things fast, but it can be a bit of a manual operation. Lots of large companies use Dask at scale and clearly "it works". Polars continues to look interesting for anything that fits on a single machine, for those who are happy to experiment with newer technologies.
Have you compared Dask with Polars? What are your thoughts? Please reply and let me know.
As a consequence of our talk we had a good chat with Matt Rocklin, the author of Dask. Matt shared a couple of upcoming features including a new diagnostic in Dask that shows GIL contention using gilknocker, as described here:
Code running on Dask runs in a threaded environment with other threads doing compression, disk I/O, network I/O, and other user code. Hopefully this code uses the GIL in a judicious manner. GIL monitoring through the dashboard and Prometheus metrics enables users to identify workflows that can be improved to be more GIL friendly. ... With GIL monitoring you’ll be able to view how your code behaves in this context and enable you to make more informed decisions on performance improvements.
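If you want to try this on your own cluster, I believe the metric is switched on through the Dask config - note the key name below is an assumption from my reading of the Dask docs, so do verify it against your installed version before relying on it:

```python
import dask

# Enable GIL contention monitoring on the workers
# (assumed config key - check your Dask version's documentation)
dask.config.set({"distributed.admin.system-monitor.gil.enabled": True})
```

Once enabled, the contention figures should show up on the distributed dashboard and in the Prometheus metrics mentioned above.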
Matt also pointed me at a useful write-up of how he sees clients solving bigger-data issues using Dask. He notes PyArrow strings (I noted in the last issue how well this improves RAM use and performance in Pandas already), the new "more intelligent" task scheduling (I've covered this before), better S3 bandwidth use and an experimental new shuffling service to come out this year.
Alexander Hendorf (hi!) shared a short write-up too on his observations via LI.
Many thanks to those of you who have kindly donated to our JustGiving page for the charity car "banger drive" to Venice and back, we drive at the start of September through France, Belgium, Luxembourg, Germany, Austria and end in Italy, then drive back - assuming the car is still running! We're off in about 10 days!
For those of you who read my newsletter and who haven't donated - if you find value in what I write, please do consider making a donation. All money raised goes to Parkinson's Research, we're doing this for charity (all car costs come out of our own pocket). If I'm helping you move forwards, and definitely if I've helped you find a job or a useful process or tool for your career - please do consider donating.
So recently we bled all the brakes and replaced the hydraulic fluid, replaced the cambelt and auxiliary timing belt, rebuilt most of the suspension (shiny orange springs will appear on the JustGiving page soon), changed all the fluids and filters, replaced many jubilee clips and hoses, changed a fan, added new spark plugs and a rocker cover gasket, replaced a gearbox mount, gave it new tyres and wheel bearings (and as a consequence all 4 wheels now point in the same, useful, direction) and generally did a whole lot of work.
We feel we get bonus points for recharging the air-con and having the coolant stay in the pipes. And by "we" I mean mostly my co-driver Ed (thanks Ed!), but I've managed to get myself reasonably greasy too. Here's the wheel-bearing-removal-tool:
You'll remember how in the last issue I talked about how our first car (a Volvo) tried to melt itself whilst I was driving it (see that issue for the fire brigade picture). Well - this car (the Passat) hasn't done that at all. It just seems to keep on sprouting new leaks, which on a 2,000 mile round-trip isn't really what you want. But, YOLO, right? I'm sure we'll get there and probably back too, maybe with the car if we're really lucky.
Of course you support this foolishness and just maybe you're open to giving to charity (all funds raised go to Parkinson's Research, all costs are from our own pocket) - if so please see our JustGiving page or just go and look at all the lovely pictures.
This is probably the last issue before we go, hopefully I'll have a positive update to report in a few weeks time :-) If you check the JustGiving link early next week it should show the car dressed in its finery including surf boards and many decals that we've had designed up.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I'm also on twitter, LinkedIn and GitHub.
Jobs are provided by readers, if you’re growing your team then reply to this and we can add a relevant job here. This list has 1,600+ subscribers. Your first job listing is free and it'll go to all 1,600 subscribers 3 times over 6 weeks, subsequent posts are charged.
We’re looking for someone who will be responsible for designing and developing the software that powers our products. You’ll need to collaborate with other teams, write high-quality code and ensure the codebase follows best practices. You are curious and enthusiastic, with a drive to constantly learn and acquire new knowledge.
You’ll be working in our Engineering team, working closely with Product and other technical teams and reporting to the team lead.
Always have an eye on the big picture to avoid getting lost in the weeds
Rate: £60,000 – £75,000
Forecasting the weather accurately saves lives. At ECMWF we have been predicting the weather 24/7 since 1975, now with 35 member and co-operating countries, using a highly regarded physical system.
In this next chapter we're looking for 4 more colleagues to round off our team, creating cutting-edge machine learning models to supplement our physics-based model and make our predictions faster, more accurate, and more energy efficient. Normally, this kind of impactful work comes at the expense of a decent salary... not in this case! So if you have the relevant deep learning experience, you might be able to make that impact in the world with us! (If you're part of an under-represented minority, please consider applying. The vacancy note is written to cover 4 positions, which means we don't expect everyone to cover every aspect!)
The UK Government's Centre for Environment, Fisheries and Aquaculture Science (CEFAS) is looking for data scientists and senior data scientists to work on computer vision and machine learning projects. We're tackling the serious global problems of climate change, biodiversity loss and food security to secure a sustainable blue future for all.
Projects include the detection, classification and quantification of benthic organisms in sea floor video, remote electronic monitoring of fishing vessels, beach litter in remotely piloted aircraft imagery, and work with innovative ship-based instruments such as plankton cameras and flow cytometers.
Closing date 28th August.
Arena.Online is the UK's leading ethical flower delivery service, specialising in personalised D2C and B2B fulfilment. The Insight & Data Science team is central in informing company decision-making and identifying growth opportunities. In an industry with sparse data, we're inventive and flexible in creating and maximising data sources.
Key responsibilities in this role include maintaining and improving our datasets (quality, functionality, and scope) and reporting capability. This already covers a wide range of sources and techniques with everything from formal order data to web scraping to customer review analysis. You will also be running projects independently and as part of our close-knit team which require creativity, adaptability, and determination. We aim to refine our data stack from data discovery to collection, preparation, reporting, dashboard building, and ultimately analysis and recommendations. We’re looking for somebody intelligent, curious, and full of exciting new ideas. As our team's third member, you'll face significant responsibility, but mega opportunity in our supportive, fun, and always learning-focused environment.
We’re looking for a Data Science Manager for our Commercial Data Insights team who will manage a team of Data Scientists / Analysts that support all the commercial departments of Catawiki (Marketing, Experts, Sales, Categories & Clusters, Finance) in using data to better understand our marketplace dynamics, to take the right decisions, and to identify opportunities to build a better Catawiki.
This role sits within the new JLR Supply Chain Analytics Team – an embedded cross-disciplinary group of data professionals who move fast using the latest technologies, sharing best practices with colleagues in JLR central digital. In this delivery role you’ll be part of a fun, inclusive, learning, flexible, value focussed team backed by growing supply chain domain expertise and fearless in the face of a thorny technical problem or highly complex data. You will assist in the design and delivery of transformational Digital products to help streamline the way we work, enabling us to meet future vehicle sales volumes and customer delivery expectations in this unprecedented era of global supply chain disruption.
You will use your mathematical skills and scientific approach to help with the formulation and implementation of our Digital strategy for Planning, particularly the use of Optimisation and forecasting models to guide planning scenarios. You will use your technical knowledge to translate the business question into the correct mathematical formulation and technically implement it, assisting the business with interpreting and diagnosing the output and proactively suggesting improvements.
Gradient Labs is a new AI startup in London. We’re launching an AI Residency Program: we’re looking for an early-career AI researcher to join us for 3-6 months to experience what day 0 at an AI startup in London is like.
We’re building a suite of LLM-based agents that can safely automate customer support conversations in complex environments. You’ll be working directly with a small team - us (me, Danai Antoniou, Dimitri Masin) and any other early hires we make. Join us to spend your time researching, prototyping, optimising, and analysing what our agents can or should do. Help us build that functionality and evaluate how well it works.