Did you know that Holiday Extras are hiring a Senior Data Engineer? See their ad and many more below.
Below is the first of a two-part interview with Kaggle Expert Mani Sarkar. I want to bring some deeper expertise into this newsletter, so I’m going to run more of these interviews. Please reply to let me know if this is useful.
Vincent shared a tweet about his tool that lets you “draw data” in a Jupyter Notebook and then read it back into Pandas - that looks great for playing with ideas.
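The read-back step is simple: tools like this typically let you copy the drawn points as CSV, which Pandas ingests directly. A minimal sketch of that step - the column names and CSV layout here are my assumptions, not the tool’s actual export format:

```python
import io

import pandas as pd

# Hypothetical CSV export from a "draw data" style widget
# (the x/y/label schema is an assumption for illustration).
csv_text = """x,y,label
1.0,2.0,a
1.5,2.5,a
3.0,0.5,b
"""

# Read the drawn points back into a DataFrame, ready for
# experimenting with classifiers or plots.
df = pd.read_csv(io.StringIO(csv_text))
```

In a live notebook you’d usually copy from the widget and call `pd.read_clipboard()` rather than paste a hard-coded string.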
Mani Sarkar is a Kaggle competitions expert; we met through PyDataLondon. I asked him some questions about how he approaches machine learning competitions and how similar this is to client engagements. He gave me such a rich set of answers that I’m splitting the interview over several newsletters - there’s a pile of gold in here for you if you take part in competitions or tackle novel ML tasks. Part 1 follows.
Tackling a Kaggle competition often means learning a whole new domain. How do you approach getting all the key knowledge quickly?
Learning a new domain on Kaggle is always fun, although I only pick competitions where I know the domain or I’m interested in exploring it further. I start with the competition landing page - almost all the details are there - then I peek into the Discussion area for comments and messages from the participants, and also look into the Notebooks (Code tab) they have shared. Between these three areas, all the details for the competition are present - we don’t need to look at anything else.
I record my findings in Dropbox Paper - it’s shareable with other team members, and better and lighter than Google Docs - and do this on a weekly basis, ranking the findings by importance. Note-taking is a skill that I’m still learning, and I try to apply the idea that less is more. In this document I keep track of the important details that I need to know all the time, including links to important Notebooks and discussions; everything else I try to read and remember (or forget with time). I also jot down any ideas I get.
Do you use the same process on client engagements?
My general approach is similar - I’m always taking notes and downloading my thoughts onto an exoskeleton (a notebook or a Dropbox Paper doc). The difference with client engagements is the lack of resources like the Discussions and Notebooks we find on Kaggle. If a client has such artifacts that I can use for the solution, I gather them in one place, and I’m also always collecting notes and links in a text pad or Dropbox Paper doc that I may reuse during the course of the project/engagement. I apply these links across the code, issue or PR descriptions of the codebase and version control system, to keep track of the source of that specific area of the work - so others who come after me (including myself) can refer to them, creating a better picture and reference point for many future questions or conclusions.
Which tools help you productively investigate the data?
There are a handful of tools that I used to use, and now it’s narrowed down to just one or two: pandas-profiling and Dataiku for columnar or numeric data - here are some getting-started tips. I used to also load the data into bamboolib, but the purpose of such a tool is different. For text data I have written my own profiler called nlp-profiler.
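For readers who haven’t tried a profiler, here is a hand-rolled mini version of the kind of per-column summary these tools automate (pandas-profiling reports far more - distributions, correlations, warnings). This sketch uses only Pandas, and the toy data is purely illustrative:

```python
import pandas as pd

def mini_profile(df: pd.DataFrame) -> pd.DataFrame:
    """A tiny per-column profile: dtype, missing count, distinct count -
    a small fraction of what pandas-profiling reports automatically."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),  # NaNs excluded by default
    })

# Toy data to demonstrate the shape of the output
df = pd.DataFrame({"age": [34, None, 29],
                   "city": ["Leeds", "Leeds", "York"]})
profile = mini_profile(df)
```

Running the real profilers on the same frame gives this information plus histograms and correlation warnings, which is why they’re worth reaching for first.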
But I also load up the data in plain old Python notebooks to peek into the different visualisations the data can produce. In the past I would extract features using hand-written functions, but most tools can already do this these days. The tools mentioned already give a lot of information, which some may find scientific or academic in nature. I have listed many of them here, here and here, and in my AI/ML/DL git repo.
It’s important to decompose the original raw data (one or more datasets) into a single dataset (a subjective choice) with the columns properly transformed (kept as simple as possible), such that it is representative of the original raw data. I strive to achieve this all the time, as it helps the ML libraries/tools create better models and explore all the pathways and options they can. If not done correctly, this step can lead to a bottleneck in training or inaccurate results. It also means I could end up with multiple candidate datasets, and multiple models built from them. In the end I pick the dataset/model pair(s) with the best score(s), using the metric that best suits the competition or problem statement.
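As a concrete (and entirely hypothetical) illustration of collapsing several raw tables into one representative dataset, here is a Pandas sketch that aggregates a child table and joins it onto its parent, yielding one flat, model-ready row per entity - the table and column names are my inventions, not from any particular competition:

```python
import pandas as pd

# Hypothetical raw tables - a parent table of entities and a
# child table of transactions (names are illustrative only).
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "region": ["north", "south"],
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "amount": [10.0, 20.0, 5.0],
})

# Aggregate the child table so it has one row per customer,
# then left-join onto the parent: one flat row per entity,
# with columns kept as simple as possible for the model.
per_customer = orders.groupby("customer_id", as_index=False)["amount"].sum()
flat = customers.merge(per_customer, on="customer_id", how="left")
```

The same pattern (aggregate, then join back to a single keyed table) scales to many child tables, and produces the kind of single dataset that most ML libraries expect.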
In the next newsletter Mani will talk about the tools he uses to gain trust, and he shares some more advice. Thanks Mani!
See recent issues of this newsletter for a dive back in time.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I’m also on twitter, LinkedIn and GitHub.
Jobs are provided by readers; if you’re growing your team then reply to this and we can add a relevant job here. This list has 1,400+ subscribers. Your first job listing is free and it’ll go to all 1,400 subscribers 3 times over 6 weeks; subsequent posts are charged.
Carbon Re is an AI research and development company dedicated to removing Gigatons of CO2 (equivalent) from humanity’s emissions each year. We aim to do so by optimizing production processes, redesigning manufacturing systems, developing new control processes, and accelerating the development of new climate-friendly materials and systems. Carbon Re is an equal opportunity employer. We are still a small team and are committed to growing in an inclusive manner.
As Principal Data Scientist, your key role will be to establish, define and implement data science solutions in order to deliver business value by making the optimal decisions to ensure efficient and cost-effective performance. You will build data science tools, providing business experts throughout Gas Transmission (GT) with the technology and expertise to unlock and exploit the information we hold to support the effective running of the business.
We’re looking for experienced, highly motivated Data Scientists to support the research and development of Ripjar’s analytics and data products. You will carry out data analysis tasks to develop Ripjar’s understanding of relevant data and will develop, train and evaluate machine learning models that can be integrated into Ripjar’s software products and data processing pipelines.
You will have a strong technical and theoretical background, with a strong understanding of statistics and statistical models. You will be proficient in at least one programming language, preferably Python. You will have a good understanding of machine learning and large-scale data analysis, and will be comfortable working with complex data at scale.
As the founding team data scientist, you’ll develop Good With’s intelligent data analysis and recommendation engines, supporting voice and natural language interaction with users.
Python and open source technologies are the overarching strategic choice for the data processing, analysis, machine learning and recommendation engines.
You’ll work at the heart of a dynamic multidisciplinary agile team to develop a platform and infrastructure connecting a voice-enabled intelligent mobile app, financial OpenBanking data sources, state of the art intelligent analytics and real-time recommendation engine to deliver personalised financial guidance to young and vulnerable adults.
As a founding member, you’ll get shares in an innovative business, supported by Innovate UK and Oxford Innovation, with ambitions and roadmap to scale internationally.
Supported by Advisors: Cambridge / FinHealthTech, Paypal/Venmo & Robinhood Brand Exec, Fintech4Good CTO & cxpartners CEO.
Working with: EPIC e-health programme for financial wellbeing & ICO Sandbox for ‘user always owns data’ approaches.
The Computational Optimisation Group has a two-year research opening (either pre- or post-doctoral) in surrogate-based optimisation. The role intersects computational optimisation, machine learning, and open-source software.
Senior Python Engineer - Knowledge Graph project for a major European Bank - Semantic Partners are seeking several skilled engineers with the following skillset - Python, Django, Flask etc, RESTful API’s, CI/CD, Containerisation, Docker, Kubernetes, NoSQL, BDD.
You’ll be joining a project team focusing on building a Knowledge Graph so an interest in Graph technologies and any experience of specific triple store systems would be a big plus, but more important is a desire to get into semantic engineering.
We believe that time is precious, so we create products, tech and services that make travel and holidays easy, simple and fun. Our purpose is clear: we offer customers less hassle so they can have more holiday. We are looking for Senior Data Engineers with big ideas. Problem solvers and collaborators who love a challenge and are always striving to improve and grow. You’ll bring everyone along on the journey with you - sharing your knowledge, inspiring others so they can improve too. The Data Team is a small but growing team meaning there’s lots of opportunity for you to get stuck in, help us progress and for you to learn and grow yourself. At an exciting time in our data journey, we’re working hard on clearing down the last of our legacy tech. We’re moving to a modern data stack; Airflow, Google BigQuery and Looker. The team’s work fuels our HEHA! App, enabling us to explode 7 data points into 1000, turning our customers’ trips into holiday experiences. Please visit the links below for further details on the opportunity.
Overton is looking for data scientists to join our small, dynamic team. Overton is a young company with big ambitions to help universities, think tanks and NGOs track how their research translates into real world policy, laws and regulations. Our platform allows users to search over 4.5m policy documents and understand how they link to each other, to academic papers and to individual authors. We use a wide range of techniques to clean and enhance our data, including entity recognition and linking, classifiers and document topic extraction as well as heuristic based approaches.
You’ll be helping with everything from developing new product updates to finding new data sources, experimenting with new ways to enrich the data and maintaining our existing pipelines. You will be fluent in Python and have experience with web scraping, machine learning pipelines and data analysis & reporting. Experience with data visualisation and front end development, familiarity with scholarly metadata, bibliometrics and/or knowledge of the academic, think tank or research impact space a bonus. See link for details on how to apply.
Sikoia is an ambitious new fintech building a unified data platform and API marketplace for global financial services. Our mission is to make it simpler for fintechs, lenders, and corporates to embed financial innovation and automate their decisioning, from customer onboarding through to risk underwriting.
Our founders are from Softbank, JPMorgan and Experian. With VC funding from EarlyBird and Seedcamp, plus support from top fintech CEO angel investors, we are now building a small, top quality tech/data engineering/data science team. Based remotely or at our office in central London, this is an opportunity to shape our product and technology from the very beginning.
Our tech stack is C# and Python, running in Azure. We are leveraging co-development projects with our first clients to build out our core platform. We have partnerships with UK credit bureaus and Open Banking providers, and are adding financial data vendors from many other countries. If you’re a mid/senior level Python developer or data engineer/data scientist, have fintech or SaaS experience, and are excited about fintech and financial innovation, then join us on this journey.