Further below are 5 jobs including: Senior Data Engineer at Hertility Ltd, Permanent, Remote, Software Engineer at Qualis Flow Ltd, Scientist/Engineer for Machine Learning (multiple vacancies), Senior Data Scientist at CEFAS, Permanent, Manager Data Science - Business Analytics at Catawiki, Permanent, Amsterdam, The Netherlands
So I'm coming up to the end of my several-month break and shortly will be back onto some proper data science work. Taking a head-clearing break has been amazing. I've shared some notes and photos below on the charity fund-raising old-car rally I was on recently (we raised a ton of cash for charity - thanks for your support!), currently I'm surfing in Cornwall and then over the next couple of months I've got updated forms of my training courses to run privately and in public. The Rebel AI leadership group also kicks off in a month.
Further below I talk on a couple of LLM and open source topics that I've been keeping an eye on.
I've listed dates for the following public courses, they'll each run as virtual courses via Zoom in the UK mornings. You can also add your email for future date notifications. Early bird tickets are available for each:
The Successful Data Science Projects course is related to the private RebelAI leadership group (announced here) I'm putting together, I hope to share some reflections from those sessions from November onwards. If you're a Data Science leader and you could do with support - and a peer group of leaders - get in contact and we can have a chat. This course helps you avoid common mistakes which lead to failed projects.
The Software Engineering course starts with identifying "what's wrong" with a badly written Notebook and moves through refactoring, adding unit tests and assumption-tests, structuring code for maintenance and getting to a decent project structure that supports a setup.py for installation. This is great for anyone who feels that they don't understand topics like unit testing, Python's sys.path, modules and folder structures. It'll also help you gain seniority faster.
The Higher Performance Python class starts with profiling (critical if you're to work on the actual slow part of your code!), then moves through making numeric and Pandas code faster, scaling up with Dask and looking at the new Polars to start to understand why it is a compelling alternative to Pandas. If you're stuck with slow code and expensive bills then profiling and trying the right speed-up will make your team more productive.
If you've got queries - just reply to this newsletter, I'm happy to chat about the content.
We actually had an awful lot of fun! Getting the car finally dressed in its surf boards and sticker-set was great. The first drive down to the south coast for our Le Shuttle departure was a bit wobbly as we had odd engine issues (dropping revs coming onto roundabouts!) but that mostly cleared up later. Arriving in St Omer to meet the other teams was amazing, lots of other teams had dressed themselves (it was fancy-dress based) and we all had little set pieces to perform when we arrived. We've posted a pile of pictures onto our JustGiving page for Parkinson's Research (one of our co-drivers has Parkinson's in the family), I'd be happy to tell you silly stories if I see you in reality.
The Nürburgring is the only public "proper" racing circuit in Europe that one can arrange to drive on. Whilst my co-driver has been dead keen to drive the 'ring for years, we'd agreed that taking a 24 year old estate car onto the circuit, on day 2 of a 7 day excursion, was asking for trouble. Especially given "the revs issue", the paraphernalia on the car and our "mismatch" to the quality of the cars that might be on the circuit. So imagine my surprise when I tried to direct us to the parking and (I swear - not what I wanted) we found ourselves at the grid-gate onto the 'ring. That was enough for my buddy and a race official waved us forwards, so on we drove, filled with trepidation. Then followed a 21-minute, slightly-terrifying 40mph drive around the 'ring, with Porsches, Ferraris and more lapping us at >100mph.
Being slightly terrified wasn't silly - another team had a mishap and their wheel fell off (they shattered a wishbone hitting a corner too fast), after we were all clear of the track we ended up helping them knock components off of their suspension to try to bend their Mercedes back into a drivable shape. Thankfully a farmer the next day was able to weld their suspension back together (true story) and they carried on with the rally.
The trip over the Alps via the Timmelsjoch Pass was almost sedate in comparison, despite the sheer drops past a metal barrier and two small lanes, plus many motorbikes overtaking and various large trucks swinging through the hairpin turns. The one oddity was when Waze offered me a "better route down" and we suddenly found ourselves on a 2m wide track with no option to turn around or reverse back up, with the sheer drop off to one side. Meeting a grass-cutting tractor half way down was a surprise and required a bit of deft manoeuvring at a corner. When we got to the bottom and slid back onto a more normal-sized road there were some looks of confusion on the faces of passers-by - perhaps a surf car descending the Alps on a cattle track is a little unusual.
Oh yes - we also won the rally. You got points for activities every day (like "surfing the Rhine"), overall we managed to gain enough points to win several stages and then the overall event. We donated our prize money to our JustGiving page, as did another team (thanks Wreck to the Future!). We've ended up raising over £3,000 for Parkinson's Research which is a brilliant result. If you'd like to donate, it'll take you two minutes on JustGiving and all the money we've raised goes to our chosen charity.
Entries are open for next year.
There's a lot of talk about the wonderfulness of LLMs and less about the negative sides, particularly with the rush to generate new content regardless of "usefulness". Author Jane Friedman writes about how LLMs are being used by others to publish books under her name on Amazon and Goodreads. Given her profile she was able, over a couple of weeks, to get them removed from her Goodreads official profile first and then later from Amazon.
We've increasingly seen auto-generated books - in 2019 my buddy Shardcore used an LLM to auto-generate a book, written as if by his friend and author John Higgs and clearly labelled as such, called algohiggs, which was featured in a few art sets.
I think however the phenomenon of auto-generating "fairly convincing" content in volume and trying to trade it off on someone else's name is pretty new. If the likes of Amazon don't improve their author verification process, we're only going to end up trusting such sites less. Weird times.
On a similar topic, I see LLMs being wired into sites like Quora to help generate answers quickly. Great - if we trust it. This example popped up yesterday showing something rather nasty. The question "can you melt an egg?" was asked on Quora and their LLM generated the answer "yes" with an explanation (note - you can't melt an egg). Because Quora is often authoritative, Google allegedly picked up the answer and used it to answer the same question in a Google query (see photos from that tweet). I could verify yesterday that Quora still had the LLM-generated answer as their top answer, but Google was showing the human's top answer (i.e. "no") in their snippet when I checked.
Ohh, interesting - today when I run that query on Quora, ChatGPT says "no you can't melt an egg". It is a fast moving world. I wonder how many other answers are in conflict with a human expert? This is covered in this Ars Technica article, the big take-home being that an autogenerated lie self-perpetuated from Quora onto Google and then was ready to be scraped by other LLM training processes that trust Google and Quora to serve up truthy answers. In LLMs we trust, right?
I like to use the IPython shell for rapid prototyping, but copying each line of successful code is a PITA. I remembered in the past having an IPython magic (the entries that start with a %) that quickly copied a useful line of code to the clipboard, so I could paste it via IDE into my code. A very quick bit of googling revealed a stackoverflow post about a similar topic and that led to %clip on GitHub...and checking the comments I saw my Python 3 updates from 2014. It turns out these still apply and it all "just works".
I'd been writing a function that needed shutil to copy a file. Having found the right invocation in IPython (checking the outcome in the shell as I ran it) I just needed to up arrow and prepend %clip to the shutil.copyfile... line, and then that useful line was in my clipboard ready for pasting back into my script.
When I teach my courses I talk about contributing to open source as a way of encoding what you know now, and helping others (and yourself) in the future, this just felt like a rather nice example of exactly that :-)
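If you're curious how a magic like this hangs together, here's a minimal sketch. To be clear, this is not the code behind the %clip on GitHub; the module name clip_sketch, the helper names and the list of clipboard tools are my own assumptions for illustration. It shells out to whichever clipboard command it finds on the PATH and registers a line magic using IPython's extension hook.

```python
import shutil
import subprocess


def find_clipboard_command():
    """Return the argv for a clipboard tool found on PATH, or None.

    The candidate list is an assumption; adjust it for your platform.
    """
    candidates = [
        ["pbcopy"],                            # macOS
        ["xclip", "-selection", "clipboard"],  # Linux (X11)
        ["wl-copy"],                           # Linux (Wayland)
        ["clip"],                              # Windows
    ]
    for argv in candidates:
        if shutil.which(argv[0]):
            return argv
    return None


def copy_text(text):
    """Pipe text into the clipboard tool; return True if one was found."""
    argv = find_clipboard_command()
    if argv is None:
        return False
    subprocess.run(argv, input=text.encode("utf-8"), check=True)
    return True


def clip(line, copier=copy_text):
    """Line-magic body: copy everything typed after %clip to the clipboard."""
    text = line.strip()
    if copier(text):
        print(f"Copied to clipboard: {text}")
    else:
        print("No clipboard tool found, nothing copied")


def load_ipython_extension(ipython):
    """Called by IPython on %load_ext; registers clip as the %clip line magic."""
    ipython.register_magic_function(clip, magic_kind="line", magic_name="clip")
```

Saved as clip_sketch.py somewhere importable, you'd run %load_ext clip_sketch once per session and then %clip followed by the line you want. Note that this copies the text of the line rather than evaluating it, and the copier argument is only there so the function can be tested without touching a real clipboard.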
A little while back I got stuck, without numpy, wanting a binomial random draw in pure Python. I was surprised to see that the standard Python 3.11 library's random module lacked a binomial call. As Raymond Hettinger wrote, I ended up with something similar to (and not as clever as this faster alternative):
    from random import random

    def binomialvariate(n, p):
        '''Binomial distribution for the number of successes
        in *n* Bernoulli trials each with a probability *p* of success.

        Returns an integer in the range:  0 <= X <= n
        '''
        # Each uniform draw below p counts as one success
        return sum(random() < p for i in range(n))
This turned into a pull request which looks like it has been merged for the upcoming Python 3.12 release in the next month or so. It obviously raises the question of how much of numpy needs to be squashed back into the Python core library, but as convenience goes, it will be nice to see it there.
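If you need this to work across Python versions in the meantime, one option is a small wrapper along these lines. It's a sketch only: binomial_draw is a made-up name, and it assumes you're happy with the simple O(n) sum as the fallback when random.binomialvariate isn't available (i.e. before 3.12).

```python
import random


def binomial_draw(n, p):
    """Number of successes in n Bernoulli trials with success probability p."""
    # Prefer the standard library call where it exists (Python 3.12+)
    stdlib_call = getattr(random, "binomialvariate", None)
    if stdlib_call is not None:
        return stdlib_call(n, p)
    # Fallback for older Pythons: count uniform draws that fall below p
    return sum(random.random() < p for _ in range(n))


if __name__ == "__main__":
    random.seed(42)  # seeded only so repeated runs give the same draws
    draws = [binomial_draw(20, 0.3) for _ in range(1000)]
    # The sample mean should land near n * p
    print(sum(draws) / len(draws))
```

The fallback does n uniform draws per call, so it's fine for small n but you'd want the 3.12 version (or numpy) if n gets large.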
If you don't know how new code gets added to the Python standard library - scan the comments in the links above. It is all discussion based, just like it will be on your team (and very decentralised, with a view to a decade+ support plan!).
There was a time a decade back when it was useful to show clients evidence of Python's growth if they thought it "wasn't a serious language". I'd normally refer to the TIOBE Index but another useful bit of evidence has popped up. IEEE Spectrum has noted that Python & SQL are the top languages of 2023. IEEE is a well-regarded professional body and Spectrum, their publication, is 60 years old. If you need any evidence in client conversations about Python's suitability, maybe this helps.
Some of you know that I've got a 3 year old (Kai). He's growing well and is very interested in the wider world. I'm inevitably keen to show him a bit about "what daddy does" so I loved investing in the first Computer Engineering for Babies book via KickStarter. We learned to press the right switches to get the light to light up and the founder - Chase Roberts - shared his development diary. It was fun and clearly he'd put a crazy (really - crazy) amount of work into succeeding with his first hardware project, and he really cared about it.
A couple of years later I've just backed his Computer Engineering for Big Babies (and there's still time to join the project, if you're that way inclined). There's memory reads! Shift registers! And no doubt more. I'm stoked, and my little boy will be too when he figures out what's turned up for Christmas.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I'm also on twitter, LinkedIn and GitHub.
Jobs are provided by readers; if you’re growing your team then reply to this and we can add a relevant job here. This list has 1,600+ subscribers. Your first job listing is free and it'll go to all 1,600 subscribers 3 times over 6 weeks; subsequent posts are charged.
Velo is a (seed-funded) stealth start up working on code generation using LLMs. We're looking to hire an ML researcher to help build out and optimise our system. Right now we're just using Pytorch (but open to suggestions!).
We're looking for someone with the following skills:
- Experience in a research setting:
  - Discussing potential experiments and deciding which to run
  - Making any (software) changes required to run them
  - Analysing the results and communicating outcomes, taking successful findings forward (either into system improvements or further experiments)
- Proficient with Python (and able to follow our existing way of doing things)
- Knowledge of LLMs
- Comfortable with remote work
Hertility is a women’s health company built by women, for women. We’re shaping the future of reproductive healthcare by pioneering unique diagnostic testing that provides data-driven and advanced insights into reproductive health, fertility decline and the onset of menopause. We provide expert advice, education and access to care - all from the comfort of your home.
We’re looking for a Senior Data Engineer to help us build the world’s first data platform for women to manage their hormonal health. This is an exciting opportunity to work with a variety of data from multiple sources. You will be building out scalable data solutions for clinical services that are changing the lives of women everywhere.
Key responsibilities will include building out data infrastructure on AWS, developing ETL code for data cleaning/linking and collaborating with data scientists/machine learning experts to design cutting-edge AI tools that will revolutionise healthcare. We’re looking for someone with 5+ years of experience, a degree in Computer Science, IT, or similar field from a top university and a ‘can-do’ product mindset.
We’re looking for someone who will be responsible for designing and developing the software that powers our products. You’ll need to collaborate with other teams, write high-quality code and ensure the codebase follows best practices. You are curious and enthusiastic with a drive to constantly learn and acquire new knowledge.
You’ll be working in our Engineering team, working closely with Product and other technical teams and reporting to the team lead.
Always have an eye on the big picture to avoid getting lost in the weeds
Rate: £60,000 – £75,000,
Forecasting the weather accurately saves lives. At the ECMWF we have been predicting the weather 24/7 since 1975, now with 35 member and co-operating countries and a highly regarded physical system.
In this next chapter we're looking for 4 more colleagues to round off our team in creating cutting-edge machine learning models to supplement our physics-based model and make our predictions faster, more accurate, and more energy efficient. Normally, this kind of impactful work comes at the expense of a decent salary... not in this case! So if you have the relevant deep learning experience, you might be able to make that impact in the world with us! (If you're part of an under-represented minority, please consider applying. The vacancy note is written to cover 4 positions, which means we don't expect everyone to cover every aspect!)
The UK Government's Centre for Environment, Fisheries and Aquaculture Science (CEFAS) is looking for data scientists and senior data scientists to work on computer vision and machine learning projects. We're tackling the serious global problems of climate change, biodiversity loss and food security to secure a sustainable blue future for all.
Projects include the detection, classification and quantification of benthic organisms in sea floor video, remote electronic monitoring of fishing vessels, beach litter in remotely piloted aircraft imagery, and work with innovative ship-based instruments such as plankton cameras and flow cytometers.
Closing date 28th August.
We’re looking for a Data Science Manager for our Commercial Data Insights team who will manage a team of Data Scientists / Analysts that support all the commercial departments of Catawiki (Marketing, Experts, Sales, Categories & Clusters, Finance) in using data to better understand our marketplace dynamics, to take the right decisions, and to identify opportunities to build a better Catawiki.