Further below are five job roles, including Senior and Lead roles in data science and data engineering, at organisations including Causaly, a gambling start-up and an eco power firm.
Below I talk about some significant speed-ups and parallelisation opportunities coming in new versions of Python, point to a nice faster-CSV solution for Pandas and share some new tools that let you embed Python in Excel and tidy up older-style Python code.
Python 3.10 is the current release of Python and the one I use in my courses. By the end of the year we’ll see the new Python 3.11 release, which will include the first round of significant performance enhancements inside the interpreter, paving the way for a just-in-time compiler to be introduced later. These speed-ups will impact any code written in pure Python, but much less so anything using C or Fortran libraries (i.e. we won’t see a change in numpy-heavy code).
News site Phoronix shows a comparison of Python 3.8 vs 3.10 (current) and the upcoming 3.11, suggesting that Python 3.11 should be circa 40% faster than the current 3.10 for pure-Python code. If most of your computation is in pandas or an ML library then you won’t see much of an improvement; the gains will be in core Python code doing things like data manipulation, loading and processing.
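To make the “pure Python” distinction concrete, here’s a minimal sketch of the sort of interpreter-bound loop that the 3.11 work speeds up (the data and names are my own illustration, not anything from Phoronix’s benchmarks):

```python
# Illustrative pure-Python loop of the kind the 3.11 interpreter
# speed-ups target - no numpy or pandas, just bytecode execution.
def total_delay(rows):
    """Sum the positive delay values across (flight, delay) tuples."""
    total = 0
    for _, delay in rows:
        if delay > 0:
            total += delay
    return total

rows = [("AA1", 5), ("AA2", -2), ("AA3", 12)]
print(total_delay(rows))  # 17
```

If the heavy lifting happens in a loop like this rather than inside a C-backed library call, the interpreter improvements apply directly.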
Python 3.11 vs PyPy and Pyston shows that there’s still room to make substantial gains in our usual CPython, as PyPy (a JIT-compiled Python implemented in RPython, a restricted subset of Python) and Pyston (a faster but less visible fork of CPython) can be a lot faster on these pure-Python benchmarks.
Neither alternative sees much uptake among pandas users due to integration issues (PyPy “works” with numpy but is very slow; Pyston states some compatibility but I’ve not seen anyone using it). Python 3.11 should still be your target if you’re doing data science in the Python world.
If your work stack does a lot of regular data processing then you might want to start benchmarking the changes that could result from upgrading to Python 3.11. Remember that the library ecosystem, pandas included, can take a good month or so to settle in with minor bug fixes after each major new release of Python.
For 3.12 next year we’ll see another interesting change, as discussed on hn. Python is traditionally bound by the Global Interpreter Lock, so multiple threads can only execute Python code one at a time regardless of the number of cores you have. This is a subject of my Higher Performance Python course and my book with O’Reilly. Whilst Python isn’t ditching the GIL, it is exposing the long-existing but hidden subinterpreter API (currently only reachable at the C level) as a new Python module. Each subinterpreter will have its own GIL, so this will be like using multiprocessing to spawn processes, but without the process setup time and memory cost.
Exactly how this will impact our DS workflows is uncertain. This PoC back in 2020 shows that a set of factorial calculations in pure Python runs just as fast with a hacked-together subinterpreter demo as with multi-core multiprocessing. Data communication will occur using a new “channel” concept (like pipes). I don’t yet see how this gives us a clear advantage with Pandas over using multiprocessing, as we still have to pass results around and those data copies are typically expensive. There does seem to be some buzz in the community around this, so I’m keeping my eye on developments.
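For comparison, here’s a minimal multiprocessing version of the kind of factorial workload the PoC measured. The function and inputs are my own illustration, not the PoC’s code; the pickling of results back to the parent process is the data-copy cost mentioned above:

```python
from multiprocessing import Pool

def slow_factorial(n):
    """Deliberately pure-Python factorial - the GIL-bound work we'd
    like to parallelise with threads, but currently can't."""
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result

if __name__ == "__main__":
    # Each worker is a separate process with its own GIL; results are
    # pickled and copied back to the parent - a transfer cost that
    # subinterpreter "channels" will also have to pay in some form.
    with Pool(4) as pool:
        results = pool.map(slow_factorial, [2000, 2100, 2200, 2300])
    print([len(str(r)) for r in results])  # digit counts of each result
```

Subinterpreters promise the same per-worker independence without spawning whole processes; whether the channel copies end up cheaper than pickling is the open question.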
What we might also see (at the tail-end of Mark Shannon’s checklist) is a short tracing optimizer in the interpreter. That’s a JIT (as used in PyPy), but baked inside CPython. Again this will only impact pure Python code, probably focusing on short loops of data-processing code, but it might mean we can do less in pandas and a bit more in pure Python with a big speed gain. How this works out is something we’ll learn next year.
One wider implication is that Python will be “taken more seriously” by some folk who default to other languages, because finally “that silly scripting language has a compiler like grown-up languages and maybe it won’t be so slow any more”. This isn’t a terribly well-informed opinion, but it’s one I come across on occasion from folk in quant orgs who default to C# or Java (and showing that pandas defaults to C/Fortran routines behind the scenes slowly convinces them of equivalency). Python’s going to remain a number-one language for quite some years to come.
Do you see yourself gaining by moving to a newer version of Python any time soon? Have you looked at subinterpreters to form an opinion of how or where they’ll benefit a data science pipeline? I’d love to hear from you if you have - just reply to this!
Core dev Marc Garcia has posted a nice blog post on Pandas with hundreds of millions of rows showing ways of reading in a CSV file much faster than we’re used to.
Marc looks at 120M rows of airline data in CSV files. First he reads them with pure CPython (no pandas) to calculate an average delay - using Python 3.10 (note the potential for a 40% speed-up in my notes above!) this takes 7 minutes. Next he takes the same, unmodified, code and runs it under PyPy (the different JIT’d Python implementation that sadly doesn’t work so well with pandas) and it completes in about 60% of the time. Then pandas is used, with a look at the 80% of the time spent loading the data versus the 20% spent computing the average, taking under 3 minutes overall.
pyarrow has a new CSV engine and is used to get to a 1-minute solution (nice!). Marc then moves to more specific solutions - using PyArrow directly (no pandas) and then optimising the algorithm to sum partial results before calculating the average, getting down to 30 seconds.
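As a sketch of the pure-Python baseline, here’s my own minimal version of the averaging task - the column name and schema are assumptions for illustration, not Marc’s actual code:

```python
import csv
import io

# Stream a CSV and average a delay column without pandas - the kind of
# code the pure-CPython (and PyPy) timings above are measuring.
# "ARR_DELAY" is an assumed column name, not necessarily Marc's schema.
def mean_delay(fileobj, column="ARR_DELAY"):
    total = 0.0
    count = 0
    for row in csv.DictReader(fileobj):
        value = row[column]
        if value:  # skip empty cells
            total += float(value)
            count += 1
    return total / count if count else float("nan")

sample = io.StringIO("FLIGHT,ARR_DELAY\nAA1,5\nAA2,-3\nAA3,10\n")
print(mean_delay(sample))  # 4.0
```

The pandas route to the pyarrow engine, for contrast, is a one-liner: `pd.read_csv(path, engine="pyarrow")`, available from pandas 1.4 onwards.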
My notes back to Marc were that whilst this is nice, I rarely see CSV reading as the limiting factor with clients (it happens, and hackathons and Kaggle use CSV, but I think it is an uncommon bottleneck).
My client engagements tend to focus on reading repeatedly from SQL (not CSV) or Parquet, or on being limited in subsequent processing stages (like apply). Still - if you’re dealing with a data format that PyArrow supports, especially CSV, then looking at this approach is likely to lead to some nice operational gains.
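To illustrate why apply so often becomes the bottleneck, here’s a tiny sketch (illustrative toy frame, assuming pandas is installed) - row-wise apply calls a Python function per row, while the vectorised form stays in compiled code:

```python
import pandas as pd

# The same computation two ways on a toy frame.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

slow = df.apply(lambda row: row["price"] * row["qty"], axis=1)  # Python call per row
fast = df["price"] * df["qty"]                                  # one vectorised op

assert (slow == fast).all()
print(fast.tolist())  # [10.0, 40.0, 90.0]
```

On three rows the difference is invisible; on millions, the per-row Python calls dominate the runtime.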
Please reply to this if you are limited by CSV file loading, I’m always curious to hear where real bottlenecks occur. I’ve got a similar post to share in a future issue soon which looks beyond Pandas for large quant data pipelines.
xlSlim, cleaner code with refurb and other tools
Reader Russel (hello!) shares a tool he’s written to embed Python, Pandas and more in Excel, called xlSlim. I asked “why might you want to use it?” and he noted:
Do your users love Excel, but you need to use Python libraries to deliver new functionality? Or perhaps you need to stream data - prices, orders, tweets, etc - into Excel? xlSlim can help. xlSlim makes it easy to use Python within Excel, often without any code changes. xlSlim can also be used to replace VBA with Python. The overriding goal of xlSlim is ease of use - numpy and pandas are natively supported, and objects are cached automatically. Give it a try, it is quick to install and includes a free trial. xlSlim is free forever if you only use the Python standard library, other use cases require a license.
I haven’t tried it (I’m on Linux most of the time, with no Excel!). It has a free pure-Python license, and needs a low-cost (too cheap, I’d argue, given what it offers) license if you want to embed Pandas and related tools. There are a couple of videos on the homepage to give you an idea of what’s available.
A while back a bunch of useful Python tools were shared in a Hacker News post (https://news.ycombinator.com/item?id=29582437), including the excellent progress bar tqdm (the author attends PyDataLondon!), FastAPI for easy web APIs and a whole bunch of others, with useful discussion. If you’re after new tools, take a look.
If you use Python f-strings (and you probably should - they simplify a lot of formatting) then this fstring help site is a great reference. It offers a great intro plus guidance on how to use them when debugging.
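A few of the patterns that site covers, including the `=` debugging form added in Python 3.8:

```python
value = 3.14159
name = "pi"

# Format specifiers control precision and alignment inline.
print(f"{name} is roughly {value:.2f}")  # pi is roughly 3.14
# The "=" form prints both the expression and its value - handy when debugging.
print(f"{value=}")                       # value=3.14159
# Right-align in a 10-character field with one decimal place.
print(f"{value:>10.1f}")
```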
Finally there’s a new “simplification” tool called refurb, with some useful discussion on hn covering the good and less-useful parts. The goal is to suggest more-modern alternatives to how you write your code, in an opinionated fashion akin to black’s code formatting opinions. It is a young project but looks useful.
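To give a flavour of the kind of rewrite such a tool suggests, here’s a hand-written before/after (illustrative only, not refurb’s actual output):

```python
from pathlib import Path
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "notes.txt"
    path.write_text("hello")

    # Older style: open/read/close by hand.
    f = open(path)
    text_old = f.read()
    f.close()

    # The "modernised" suggestion: a pathlib one-liner.
    text_new = path.read_text()

    assert text_old == text_new == "hello"
```

The behaviour is identical; the tool’s job is just to nudge you towards the shorter, safer idiom.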
Are you the author of a library that would like some new users? Do you have a tool you think the readers here would appreciate? Just reply to this email and I’ll take a look, I’d love to hear from you.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I’m also on twitter, LinkedIn and GitHub.
Jobs are provided by readers; if you’re growing your team, reply to this and we can add a relevant job here. This list has 1,500+ subscribers. Your first job listing is free and it’ll go to all 1,500 subscribers 3 times over 6 weeks; subsequent posts are charged.
In this role, NLP engineers will:
- Collaborate with a multicultural team of engineers whose focus is in building information extraction pipelines operating on various biomedical texts
- Leverage a wide variety of techniques ranging from linguistic rules to transformers and deep neural networks in their day-to-day work
- Research, experiment with and implement state-of-the-art approaches to named entity recognition, relationship extraction, entity linking and document classification
- Work with professionally curated biomedical text data to both evaluate and continuously iterate on NLP solutions
- Produce performant and production-quality code following best practices adopted by the team
- Improve (in performance, accuracy, scalability, security etc.) existing solutions to NLP problems
Successful candidates will have:
- Master’s degree in Computer Science, Mathematics or a related technical field
- 2+ years’ experience working as an NLP or ML Engineer solving problems related to text processing
- Excellent knowledge of Python and related libraries for working with data and training models (e.g. pandas, PyTorch)
- Solid understanding of modern software development practices (testing, version control, documentation, etc.)
- Excellent knowledge of modern natural language processing tools and techniques
- Excellent understanding of the fundamentals of machine learning
- A product- and user-centric mindset
We are looking for a Senior Data Engineer to join our Applied AI team.
- Gather and understand data based on business requirements
- Import big data (millions of records) from various formats (e.g. CSV, XML, SQL, JSON) to BigQuery
- Process data on BigQuery using SQL, i.e. sanitize fields, aggregate records, combine with external data sources
- Implement and maintain highly performant data pipelines with the industry’s best practices and technologies for scalability, fault tolerance and reliability
- Build the necessary tools for monitoring, auditing, exporting and gleaning insights from our data pipelines
- Work with multiple stakeholders including software, machine learning, NLP and knowledge engineers, data curation specialists, and product owners to ensure all teams have a good understanding of the data and are using them in the right way
Successful candidates will have:
- Master’s degree in Computer Science, Mathematics or a related technical field
- 5+ years’ experience in backend data processing and data pipelines
- Excellent knowledge of Python and related libraries for working with data (e.g. pandas, Airflow)
- Solid understanding of modern software development practices (testing, version control, documentation, etc.)
- Excellent knowledge of data processing principles
- A product- and user-centric mindset
- Proficiency in Git version control
This is an exciting opportunity to join a diverse team of strategists, campaigners and creatives to tackle some of the world’s most pressing challenges at an impressive scale.
This role is for a software start-up, although it is part of a much larger established group, so they have solid finances behind them. You would be working on iGaming/online gambling products. As well as working on the product itself, you would also work on improving the backend application architecture for performance, scalability and robustness, reducing complexity and making development easier.
Trust Power is an energy data startup. Our app, “Loop”, connects to a home’s smart meters, collects half-hourly usage data and combines it with contextual data to provide personalised advice on how to reduce costs and carbon emissions. We have a rapidly growing customer base and lots of interesting data challenges to overcome. You’ll be working in a highly skilled team, fully empowered to use your skills to help our customers through the current energy crisis and beyond, transforming UK homes into the low-carbon homes of the future. We’re looking for a mid- to senior-level data scientist with a bias for action and great communication skills.