Did you know that Netacea are hiring for a Head of DS + Data Engineer, Signal AI need a Senior Data Scientist, Aflorithmic need a Data Eng and 2iQresearch are hiring for a Senior Dev and a Quant Dev? VivacityLabs have a Special Projects role and Inawisdom need a Software Engineer.
Details for all of these and more are down below.
Further down is my next interview - this one with Ritchie Vink on his “faster than Pandas” Polars dataframe library. It is only 2 years old, beats Pandas (sometimes significantly) in many benchmarks and seems to be gaining traction. I think his development style is interesting and I dig into this in part of the interview, which will run over this issue and the next. My previous interview, with Kaggle Competition Expert Mani Sarkar, contained lots of great ML tips.
One of my realisations this year (in part as a new parent) is that I can let go of lots of unproductive things that have built up over time. For me one is the belief that “I’ve done this stuff for 20 years - I ought to know the answer by now!”. Instead I’m getting better with the Zen/Buddhist notion of “beginner mind” - maybe I can see everything afresh, just like my infant son does, without past associations clouding my view. This viewpoint lets me play more, which is peaceful. I continue to build up my observations and code snippets in my notes to self, and I’ve got a private GDoc where I try to make notes on important points from the books I read (how the heck did I ever remember anything before child-plus-sleep-deprivation, I’ll never know). Having these external long-term memories feels very powerful. Do you have any tips to share back on how you deal with our rapidly changing world?
Jesper, author of the Late to the Party AI newsletter, was kind enough to reference me last issue, noting that he’d had job interviews via my jobs list - w00t! I read Jesper’s emails; little insights like the reference to Matt Harrison’s Advent of Code error-log to improve one’s learning rate are very good. James Powell shares Python & Pandas tips (currently lots on the Advent of Code) in his newsletter. Lynn Cherry’s TITAA has long occasional newsletters on art, Python and NLP with random stuff like Christmas poop logs. All are worth signing up for!
My next Success course will give you the tools you need to derisk projects and increase the likelihood that you get to deliver on time and with happy clients. Send me an email back if you have questions. It runs on Feb 9th+10th virtually in UK hours.
With one of my clients we recently discussed how to deal with having built a large distributed team around the globe. Many hires were made during the pandemic and most have never met the wider team except via Zoom. Getting the new hires to critique internal process and offer suggestions was really hard.
Some of the suggestions we came up with include:
Mostly these ideas focus on “getting folk to talk to one another” which opens the door to trust and discussion around what might need fixing. What ideas might you suggest that you’ve seen work? I’d happily share more ideas here.
If you find this useful please share this issue or archive on Twitter, LinkedIn and your communities - the more folk I get here sharing interesting tips, the more I can share back to you.
We’ll talk about topics like these in the next iteration of my Success course, if brainstorming on problems in your workplace is something you’d like to discuss.
Ritchie Vink is the creator of Polars, a faster-than-Pandas dataframe library that he started during the first lockdown of 2020. His “I wrote one of the fastest dataframe libraries” blog post outlines a number of the technical design points that make Polars so fast for in-memory single machine dataframe manipulation and is a great read. Let’s learn some more…
Well it started mostly as a hobby project because I needed to join some tables in Rust, and found it quite some work as there was not yet a DataFrame library in that ecosystem. I was inspired by the Apache Arrow project with Arrow (as used in Parquet), and built a simple DataFrame structure on top of Apache Arrow. Incrementally that turned into a dataframe library which at first was slower than Pandas, then it became faster than Pandas and I added a Python frontend and here we are 2 years later!
Along the way I learned a lot about CPU caches, multi-core design, query planning and building a good composable API and that’s all reflected in the benchmarks for Polars.
I’ve had anecdotal results from users finding 5-70x speed-ups over the equivalent Pandas code, without having to use tricks like compiling with Numba. Any practical Pandas dataframe problem should run faster using Polars.
Polars was not really designed because of Pandas pains. Of course you learn from mistakes made in Pandas. Polars was written because I was enthusiastic about the Apache Arrow project (which was initiated because of Pandas pains). We can think that Arrow was designed by looking back at Pandas, and Pandas was designed by using NumPy, and we have great examples around us with Dask and Spark and databases for thinking about multiple cores and larger datasets.
Polars aims to do everything that Pandas does (without any built-in plotting - use your own plotting library for that!). It knows before we run code what datatypes we expect to see at the end, it has native parallelisation across all the CPUs on your machine, it can do both eager and lazy evaluation. With lazy evaluation it can combine operations with query planning to do less work. Finally it has a predictable, composable API with a much smaller code footprint than we see in Pandas.
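To make the “lazy evaluation plus query planning” idea concrete, here is a minimal pure-Python sketch. It is not Polars code and it does not claim to show how Polars is implemented; the `LazyQuery` class and its method names are hypothetical, invented only for this illustration. The sketch records operations instead of running them, then reorders the plan so that filters run before the sort, which means the expensive sort touches fewer rows while producing the same result.

```python
class LazyQuery:
    """Hypothetical toy lazy query: records steps, runs nothing until collect()."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = tuple(plan)

    def sort(self, key):
        # Only record the step; no sorting happens yet.
        return LazyQuery(self._rows, self._plan + (("sort", key),))

    def filter(self, predicate):
        # Only record the step; no filtering happens yet.
        return LazyQuery(self._rows, self._plan + (("filter", predicate),))

    def optimised_plan(self):
        # Toy "query planner": move every filter ahead of every sort.
        # This is safe here because dropping rows and then doing a stable
        # sort gives the same rows, in the same order, as sorting first.
        filters = [step for step in self._plan if step[0] == "filter"]
        others = [step for step in self._plan if step[0] != "filter"]
        return filters + others

    def collect(self):
        # Execute the optimised plan and return the resulting rows.
        rows = list(self._rows)
        for op, arg in self.optimised_plan():
            if op == "filter":
                rows = [row for row in rows if arg(row)]
            else:
                rows = sorted(rows, key=arg)
        return rows


# Made-up example data, purely for illustration.
rows = [
    {"city": "b", "n": 3},
    {"city": "a", "n": 1},
    {"city": "c", "n": 2},
]

# The user asks for sort-then-filter; the planner runs filter-then-sort.
query = LazyQuery(rows).sort(lambda r: r["n"]).filter(lambda r: r["n"] > 1)
planned_ops = [op for op, _ in query.optimised_plan()]
result = query.collect()
```

A real query planner has many more rewrite rules than this single reordering, but the shape is the same: because nothing runs until `collect()`, the library gets to see the whole pipeline and choose a cheaper way to execute it.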
As an example of composition - to implement a quantile function we can reuse both a sort and an indexed look-up to fetch any quantile, so there’s no need to write any additional code to add a quantile feature.
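The composition idea can be sketched in a few lines of plain Python. This is an illustration of the principle only, not Polars’ actual implementation: the helper names `sort_values`, `take` and `quantile` are hypothetical, and the sketch assumes the simplest “lower” quantile rule (pick the element at the truncated index, with no interpolation).

```python
def sort_values(values):
    """Building block 1: return a sorted copy of the values."""
    return sorted(values)


def take(values, index):
    """Building block 2: indexed look-up of a single element."""
    return values[index]


def quantile(values, q):
    """Quantile built purely from the two building blocks above.

    Assumption for this sketch: the "lower" rule, so the position
    q * (n - 1) is truncated to an integer index without interpolation.
    """
    if not values:
        raise ValueError("quantile of an empty sequence is undefined")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    ordered = sort_values(values)
    index = int(q * (len(ordered) - 1))
    return take(ordered, index)


# Made-up example data, purely for illustration.
values = [5, 1, 4, 2, 3]
median = quantile(values, 0.5)
smallest = quantile(values, 0.0)
largest = quantile(values, 1.0)
```

The only quantile-specific line is the index calculation, so any improvement to the sort or the look-up is inherited by the quantile for free.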
Most of the inspiration came from database systems, mostly how they define a query plan and optimize it. This gave rise to Polars expressions, which are the key selling point of our API. They are declarative, composable and fast.
Behind the scenes we have copy-on-write so generally copies, which are expensive in RAM and speed, don’t have to happen unless you modify the data - the data itself is immutable. All of this happens in the Rust layer, using Rust threads (which you don’t see from the Python frontend), so running low on RAM is much less of an issue compared to Pandas.
Notably the data is stored using Arrow in a columnar format (not in 2D arrays via a block manager as in Pandas), which opens up other opportunities for cache-friendly column based operations with parallel SIMD execution. We also don’t have indexes in Polars, so there’s no construction cost for the hashtable on each index creation (which you’d generally only need for joins).
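As a rough picture of the row-versus-column difference, here is a small sketch using only the Python standard library. It does not use Arrow or Polars and says nothing about their internals; the data and the function names are made up for the illustration. The point is that in the columnar layout each field sits in its own contiguous typed buffer, so an operation on one column never has to touch the others.

```python
from array import array

# Row-oriented layout: each record is a tuple, so reading one field
# means walking through every whole record.
rows = [(1, 9.5, "tea"), (2, 3.0, "milk"), (3, 12.25, "cake")]

# Column-oriented layout: each field lives in its own buffer.
# array("q") holds signed 64-bit integers, array("d") holds 64-bit floats.
columns = {
    "id": array("q", [row[0] for row in rows]),
    "price": array("d", [row[1] for row in rows]),
    "item": [row[2] for row in rows],
}


def total_price_from_rows(rows):
    """Sum the price field by visiting every record."""
    return sum(row[1] for row in rows)


def total_price_from_columns(columns):
    """Sum the price field by scanning one contiguous column."""
    return sum(columns["price"])


row_total = total_price_from_rows(rows)
column_total = total_price_from_columns(columns)
```

Both functions return the same total; the difference is in the memory they have to visit, which is what makes column-at-a-time operations friendlier to CPU caches and to SIMD instructions in a compiled implementation.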
It is becoming ever more stable. But there are still parts that we want to improve, so we don’t bend over backwards to prevent breaking changes. They are for the better and can often be resolved with small adaptations to your code.
I work at a company called CEMS and we have removed Pandas completely and replaced the production data ingestion and joining code with Polars for faster overall results. Xomnia also have a blog post on why they support the development of Polars.
I’ve also had reports that ETL tasks can be reduced from hours to minutes as described in this blog post.
Part 2 will follow in the next issue.
See recent issues of this newsletter for a dive back in time.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I’m also on twitter, LinkedIn and GitHub.
Jobs are provided by readers. If you’re growing your team then reply to this and we can add a relevant job here. This list has 1,400+ subscribers. Your first job listing is free and it’ll go to all 1,400 subscribers 3 times over 6 weeks; subsequent posts are charged.
Netacea is an industry-leading provider of bot detection & mitigation capabilities to businesses struggling with automated threats against their websites, apps and APIs. We ingest and predict on vast quantities of streamed real-time data, sometimes millions of messages per second. As a successful start-up that is now scaling up substantially, having robust and high quality data pipelines is more important than ever. We are looking for an experienced data engineer with a passion for technology and data to help us build a stable and scalable platform.
You will be part of a strong and established data science team, working with another data engineer and with our chief technical architect to research, explore and build our next generation pipelines & processes for handling vast quantities of data and applying our state-of-the-art bot detection capabilities. You will get the opportunity to explore new technologies, face unique challenges, and develop your own skills and experience through training opportunities and collaboration with our other highly skilled delivery teams.
We have open positions for two mid-level data scientists on our team at Netacea. You will be joining a strong and established team of data scientists and data engineers, working on unique problems at a vast scale. You will be building an industry-leading bot detection product, solving new emerging threats for our customers, and developing your own skills and experience through training opportunities and collaboration with our other highly skilled delivery teams.
We also have two Lead Data Scientist roles with one of these specialised towards supporting long-term technical customer relationships. Both Lead roles will be fundamental to the success and growth of the data science function at Netacea. You will be a technical leader, driving quality and innovation in our product, and supporting a highly competent team to deliver revolutionary data science for our customers.
Application links: Lead Data Scientist (Commercial) - https://apply.workable.com/netacea-1/j/4B7ACCC80D/?utm_medium=social_share_link Lead Data Scientist - https://apply.workable.com/j/F3A4E8F82F/?utm_medium=social_share_link Data Scientist - https://apply.workable.com/j/D58EA8DCE2/?utm_medium=social_share_link
Netacea is a Manchester-based business providing revolutionary products, including a website queuing system that holds back traffic which might otherwise cause a website to fail, and a bot management solution that protects websites, mobile apps and APIs from heavy traffic and malicious attacks such as scraping, credential stuffing and account takeover. Netacea was recently categorised by Forrester as a leader in this rapidly expanding market.
We are looking for an outstanding leader to spearhead the growth and development of our data science team. As Head of Data Science, you will lead a department of skilled engineers to deliver outstanding solutions to the most interesting problems in cybersecurity. You will feel comfortable working in an agile way, taking ownership of data science strategy, effectiveness, delivery, and quality. You will grow, nurture, and develop your team and encourage them to explore their full potential. This is a mainly hands-off role, but you should feel confident talking about data science technology with internal and external stakeholders and partners. You will be passionate about data, and understand how it can be used to deliver value to customers.
You will be a core player in the growth of our platform. You will work within one of our platform teams to innovate, collaborate, and iterate in developing solutions to difficult problems. Our teams are autonomous and cross-functional, encompassing every role required to build and improve on our products in whatever way we see best. You will be hands-on working on end-to-end product development cycles from discovery to deployment. This encompasses helping your team discover problems and explore the feasibility and value of potential ML-driven solutions; building prototype solutions and conducting offline and online experiments for validation; collaborating with engineers and product managers on bringing further iterations for those solutions into the products through integration, deployment and scaling.
This particular role will initially be within a team whose responsibilities include effectiveness and efficiency of our labelling processes and tool, training, monitoring and deployment of systems and models for entity linking, text classification and sentiment analysis, among others, across multiple data types. This team also works closely with the operation teams to ensure systems and models are properly maintained.
We’re an audio as a service startup, building an API first solution to add audio to applications. We have customers and we’re fast growing.
As an Audio-as-a-Service, API-first voice tech company, our aim is to democratise the way audio is produced. We use AI and “Deepfake for Good” to create beautiful Voice and Audio from simple Text-to-speech - making creating beautiful audio content (from simple text) as easy as writing a blog. Join a 23-person-strong international engineering, voice, R&D and business team made up of 13 nationalities (backgrounds include: Ex-University of Edinburgh, PhDs, European Space Agency, SAP, Amazon).
We are looking for a data engineer to work on the core data pipelines for our voice-as-a-service and to support our growing team. Our stack includes Kubernetes, Python and NodeJS, and we use a lot of Kubeflow and the serverless stack.
At Vivacity, we make cities smarter. Using Reinforcement Learning techniques at the forefront of academic and research thinking, our award winning teams optimise traffic lights to prioritise cyclists and improve air quality. Our work makes a real difference to real people using ‘privacy by design’ principles.
We’re looking for a confident developer / ML engineer, who is comfortable working in an adaptive setting: get familiar with complex concepts, implement accurately, and communicate your plans effectively with various stakeholders. We’d like to see 1-2 years of industry experience in a relevant field. Our software is in many modern programming languages (Python, Golang, C++ etc) so you will need a willingness to learn. We’d also like to see good capability with Python or Golang.
Zarr is a format for the storage of chunked, compressed, N-dimensional arrays. Built originally in Python for working with NumPy arrays, Zarr is now supported in more than half a dozen languages. With funding from the Chan Zuckerberg Initiative, we are looking to hire a full-time, open-source enthusiast for two years to work as our community manager.
NumFOCUS is seeking a Scientific Software Developer to support the SunPy project. SunPy is a Python-based open source scientific software package supporting solar physics data analysis. Contract is available for U.S. residents only. This is a 1-year contract but work may be completed in less time.
The primary role of the Project Jupyter Community Events Manager will be to manage two event programs: JupyterCon and Jupyter Community Workshops. In conjunction with NumFOCUS and Project Jupyter leadership, you will create and implement a strategy to connect the international Jupyter community through both online and in-person events.
Inawisdom are a Data Science & Machine Learning Consultancy, and AWS Premier Partner. We are looking for mid+ level Python developers with AWS experience (or OO Programmers with AWS who are willing to learn Python, or vice versa!) for a Permanent role. This is an exciting opportunity for someone to make an impact implementing and delivering cloud native solutions and serverless applications in a Data Science business. You will be required to develop software with the latest and greatest tech for high profile, enterprise clients.
• Knowledge of functional and object oriented programming.
• Knowledge of synchronous and asynchronous programming.
• 2 or more years developing in Python 2.6 or 3.x.
• Experience in using Python frameworks (e.g. Flask, Boto 3).
• Familiarity with Amazon Web Services (AWS) and REST APIs.
• Understanding of databases and SQL.
• Understanding of Non-SQL databases.
• Experience in unit testing and TDD.
Desirable requirements:
• Experience in AWS serverless services (Lambda, API GW, SNS, SQS, and Dynamo DB).
• Has developed solutions using AWS SAM or the Serverless Framework and defined APIs in Swagger.
We are looking for an experienced Python Developer with a strong background in Finance to join us as one of our first engineers in the core team.
You will play a key role in designing and maintaining analytics/predictions and visualizations for our new data platform, “Alpha Terminal.” It bundles 2iQ’s data and analytics into one easy-to-use product, offering fundamental investors a range of powerful insights.
Responsibilities: Working with the Quant and Product teams by designing, building and managing critical infrastructure while automating everything with code. Initially, this role will be based in our Lisbon office. However, there is the potential for flexible working arrangements in the future. The role may suit an individual that is looking for a change of scenery or better work-life balance.
Requirements:
• Experience in a DevOps or software engineering role
• Strong background with Linux, K8s and Docker (or other container)
• High proficiency in a language such as Python, Java, or Go
Nice to have:
• Cloud or Big Data experience (Elastic, Aerospike, ClickHouse, KDB+, …)
• Experience with message buses
• Spark and/or Dask knowledge
We are seeking a highly talented Quantitative Developer with a solid background in Python to join our platform analytics team. In this role, you will help implement, support, and run the hybrid compute infrastructure that manages all research and production workloads.
You will work closely with the Quant and Product teams to support and develop code that is running in our production systems. These systems are the building blocks of the “Alpha Terminal”, a tool for fundamental investors to explore the market. You will also build and optimise data analytics services as well as integrating the data to support the quantitative team. Adapting research prototypes of models to the production environment is also a key responsibility of this role. This role is based in our Lisbon office. However, flexible working arrangements as well as a hybrid model transition period are available for all candidates.
Requirements:
• Experience in numerical Python and SQL
• Working knowledge of Pandas / NumPy libraries
• Dask and/or Spark knowledge
• CI/CD knowledge
Nice to have:
• Docker (or other containerization) knowledge
• Cloud or Big Data experience (Parquet, PyArrow, Aerospike, ClickHouse, KDB+, …)
• Knowledge of AI/ML libraries (Tensorflow, PyTorch, SciKit, ..)
Over half a billion videos are watched across millions of websites on a JW Player video player every day. Our product teams leverage data coming from our player to measure success, prioritize our next steps, and envision new possibilities for the thousands of video publishers we serve daily across the web. We iterate quickly, conduct frequent experiments as part of product development, and seek to be data driven in everything we do.
As a Product Analyst on the JW Player Data Science & Product Analytics team, you will work closely with product managers, engineers, and data scientists to develop insights that inform product decisions and strategy. Your findings will impact the next generation of JW Player products, from our flagship video player and video platform to our video recommendations service and other data products. You’ll play a critical role in improving these products and guiding our future development efforts.
JW Player powers billions of video plays every week across a wide spanning web of broadcasters and video publishers with a diverse set of audiences and content types. Leveraging the vast stream of data sent by our flagship player, the Data Science team works in close collaboration with adjacent teams to improve our existing products, drive sound decision making, and develop new data products that bring value to our customers in both the video publishing and video advertising spaces. We iterate quickly, conduct frequent experiments, and seek to be data driven in everything we do.
As a Senior Data Scientist at JW Player, you will be joining a collaborative, creative, multidisciplinary team of scientists, engineers, and data analysts responsible for research and development, product analytics, and running production machine learning models that make tens of millions of predictions every day.