Further below are 5 job roles including Senior roles in Data Science and Data Engineering at organisations like Causaly, Cultivate and MDSol.
Two issues back I wrote a little about the upcoming Python 3.11 release; that release is now out and I give a small benchmark below. The speed-ups, for no work on your behalf, are very nice.
Do you use scikit-learn with Pandas? If so, there's news below of a nice upcoming change where transformers (e.g. the preprocessing tools) will maintain DataFrames rather than converting them down to numpy array objects, which will ease debugging. I'm also experimenting with mamba in place of conda; notes further below.
Are you a data science leader? Would you like to raise leadership questions in a like-minded group to get answers and share your hard-won process solutions? I'm organising another of my Executives at PyData sessions for the upcoming PyData Global (virtual, worldwide) conference on December 1-3. On Thursday Dec 1st I'll run a session over a couple of hours focused on leaders; anyone who is approaching leadership or who runs a team is welcome to join.
I have a plan to make this more problem-solving focused than previous sessions, with a write-up to be shared after the conference so there’s something to take away. Attendance for these sessions is free if you have a Global ticket. This builds on the sessions I’ve volunteered to run in the past and the Success calls I’ve organised via this newsletter earlier this year.
Reply to this (or write to me - ian at ianozsvald com) if you’d like to be added to a reminder and a GCal calendar entry (there’s no obligation, these just remind you and set it in your calendar).
Python 3.11 has just been released and I've had a tiny play directly from Anaconda (see the demo of using mamba as a replacement for conda with Python 3.11 below). There's a lot of information out there about the new Faster CPython project spearheaded by Mark Shannon. The bottom line for this release is that pure Python code can be sped up "10-60%" (depending on what you're doing), but it is unlikely to impact any Pandas or NumPy code (as the slow work there is delegated to compiled C routines).
Normally I use the following code snippet as an introduction to how the Numba compiler makes math functions faster during my Higher Performance Python course. It estimates Pi really inefficiently.
# approximately guess at pi using slow pure Python
import random

def monte_carlo_pi(n_samples):
    acc = 0
    for i in range(n_samples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / n_samples

print(monte_carlo_pi(1_000_000))  # 3.1422 - approx!
%timeit monte_carlo_pi(1_000_000)
Using Python 3.10.6 this takes 443 ms; using Python 3.11.0 it takes 302 ms, so it runs in roughly 70% of the time on the latest version of Python, with no code changes, just by upgrading the CPython interpreter.
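If you want to reproduce the timing above outside of IPython (where the %timeit magic isn't available), here's a minimal sketch using the stdlib timeit module; the sample count is reduced to keep it quick:

```python
# Sketch: reproducing the %timeit measurement with the stdlib timeit
# module, for running the benchmark outside of IPython.
import random
import timeit

def monte_carlo_pi(n_samples):
    acc = 0
    for _ in range(n_samples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / n_samples

# Run a few repeats and report the best, as %timeit does
runs = timeit.repeat(lambda: monte_carlo_pi(100_000), number=1, repeat=3)
print(f"best of 3: {min(runs) * 1000:.1f} ms")
```

Running the same script under 3.10 and 3.11 gives a fair like-for-like comparison, since each interpreter does the full amount of work.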
I'm going to look more into the changes in a later issue. See the release notes for more details; I'd suggest holding off on any significant upgrades until at least the first bugfix release comes out.
Soledad (author of feature-engine) tweeted about the new Pandas DataFrame support for sklearn transformers, which is introduced in this 15-minute demo video. This is coming in the next version (i.e. it isn't available right now); the video is a feature preview for v1.2.
The short story is that whilst some parts of sklearn preserved a DataFrame if you had one (e.g. train_test_split), the transformers such as StandardScaler always turned your DataFrame into a numpy array, because sklearn originally only supported numpy and Pandas support only came later. You turn this on using the new set_output API, e.g. scaler.set_output(transform="pandas").
At 5:00 in the video we see the pretty Pipeline visual representation that you can interact with. You can see a longer demo via Binder here, and that link has a few clickable elements. Somehow I'd missed this when it was introduced.
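For context, that visual representation comes from sklearn's HTML display mode; a small sketch (the pipeline steps here are arbitrary examples):

```python
# Sketch: sklearn's "diagram" display mode renders a Pipeline as the
# interactive, clickable diagram seen in the video when shown in a
# Jupyter notebook; at a plain terminal you get a text repr instead.
from sklearn import set_config
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

set_config(display="diagram")  # the default in notebooks on recent versions

pipe = Pipeline([("scale", StandardScaler()),
                 ("model", LogisticRegression())])
print(pipe)  # text repr here; a collapsible diagram in a notebook
```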
What tricks in sklearn and Pandas have helped you recently?
You may have seen references to mamba, Quansight's open source replacement for the Anaconda installation tool conda. The big sell is that it is much faster at resolving the right set of versioned packages to install. I'd somewhat given up on using conda for anything other than base packages, as it could take 30-60 minutes for a complex new environment, and I make lots of new lightweight environments. mamba offers a much faster solver, presents as both a tool and a library, and is helping conda evolve to accept alternative solvers (as libraries). For you, it is much faster, so try it if conda is too slow.
Building a 3.11 environment looks just the same as with conda; being a simple environment, this didn't suffer from much dependency resolution, so mamba create -n tmp311_mamba python=3.11 ipython ran pretty quickly. It also shows some nice graphics in the terminal whilst it sets up the new environment. From what I've read it is pretty stable now and generally always faster than setting up or adding packages using conda. I installed it with conda install mamba -n base -c conda-forge from here.
Have you switched to mamba already? Have you been happy with the experience? I'm guessing there are some weird edge cases that might be worth knowing about?
See recent issues of this newsletter for a dive back in time. Subscribe via the NotANumber site.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I’m also on twitter, LinkedIn and GitHub.
Jobs are provided by readers; if you're growing your team then reply to this and we can add a relevant job here. This list has 1,500+ subscribers. Your first job listing is free and it'll go to all 1,500 subscribers 3 times over 6 weeks; subsequent posts are charged.
Our team at Medidata is hiring a Senior Cloud Platform Applications Engineer in the London office. Medidata is a massive software company for clinical trials, and our team focuses on developing the Sensor Cloud, a technology for ingesting, normalizing, and analyzing physiological data collected from wearable sensors and remote devices. We offer a good salary and great benefits!
In this role, NLP engineers will:
- Collaborate with a multicultural team of engineers whose focus is building information extraction pipelines operating on various biomedical texts
- Leverage a wide variety of techniques, ranging from linguistic rules to transformers and deep neural networks, in their day-to-day work
- Research, experiment with and implement state-of-the-art approaches to named entity recognition, relationship extraction, entity linking and document classification
- Work with professionally curated biomedical text data to both evaluate and continuously iterate on NLP solutions
- Produce performant, production-quality code following best practices adopted by the team
- Improve (in performance, accuracy, scalability, security etc.) existing solutions to NLP problems
Successful candidates will have:
- Master's degree in Computer Science, Mathematics or a related technical field
- 2+ years experience working as an NLP or ML Engineer solving problems related to text processing
- Excellent knowledge of Python and related libraries for working with data and training models (e.g. pandas, PyTorch)
- Solid understanding of modern software development practices (testing, version control, documentation, etc.)
- Excellent knowledge of modern natural language processing tools and techniques
- Excellent understanding of the fundamentals of machine learning
- A product and user-centric mindset
We are looking for a Senior Data Engineer to join our Applied AI team.
- Gather and understand data based on business requirements
- Import big data (millions of records) from various formats (e.g. CSV, XML, SQL, JSON) to BigQuery
- Process data on BigQuery using SQL, i.e. sanitize fields, aggregate records, combine with external data sources
- Implement and maintain highly performant data pipelines with the industry's best practices and technologies for scalability, fault tolerance and reliability
- Build the necessary tools for monitoring, auditing, exporting and gleaning insights from our data pipelines
- Work with multiple stakeholders, including software, machine learning, NLP and knowledge engineers, data curation specialists, and product owners, to ensure all teams have a good understanding of the data and are using it in the right way
Successful candidates will have:
- Master's degree in Computer Science, Mathematics or a related technical field
- 5+ years experience in backend data processing and data pipelines
- Excellent knowledge of Python and related libraries for working with data (e.g. pandas, Airflow)
- Solid understanding of modern software development practices (testing, version control, documentation, etc.)
- Excellent knowledge of data processing principles
- A product and user-centric mindset
- Proficiency in Git version control
This is an exciting opportunity to join a diverse team of strategists, campaigners and creatives to tackle some of the world’s most pressing challenges at an impressive scale.
This role is with a software start-up, although it is part of a much larger established group, so they have solid finances behind them. You would be working on iGaming/online gambling products. As well as working on the product itself, you would also work on improving the backend application architecture for performance, scalability and robustness, reducing complexity and making development easier.