Nov. 1, 2021, 10:17 p.m.

ETL pipelines, what makes a data science project successful, Data Science for beginners by Microsoft

5 minutes of Data Science

🗯 This week

  • I’ve been wrapping up the series on “Building an ETL pipeline from scratch.” It’s a great opportunity to get started if you’re not used to building pipelines. The bonus is that it uses the newest version of Airflow. I hope to finish the blog post in a week or two.
  • After a week of tweeting stats about successful data science projects, I figured it would be best to compile them into a (future) blog post. Having seen so many failed projects, I found it interesting to understand how to tackle common data science problems. If you’re curious, I started tweeting about it around here. Here are some example stats: stats
  • Remember to check the most popular Reddit posts this week on data-related boards. 👇

🔮 Data Science

  • Data Science for beginners by Microsoft

👋 See you next time

Let’s keep in touch, Pedro.

website | twitter | medium | github | stackoverflow | linkedin


🔝 Most popular Reddit posts this week

r/DataScience

  • Data Science is 80% fighting with IT, 19% cleaning data and 1% of all the cool and sexy crap you hear about the field. Agree? (⬆️ 1115 ; 💬 183)
  • Where do Data Scientists go camping? (⬆️ 598 ; 💬 43)
  • I was hired as a data analyst 4 months ago by an AI company and my boss is expecting me to create a reasoning system (as part of our attempt at KRR)– I feel extremely overwhelmed and am convinced I’ll be fired for underperforming (⬆️ 318 ; 💬 98)
  • 80/20 rule: models that account for maybe 20% of your toolkit but solve 80% of your practical problems? (⬆️ 277 ; 💬 103)
  • What would you do if the upper management wants you to work with 30 excel files that are being used as database? (⬆️ 264 ; 💬 101)

r/DataEngineering

  • Let’s show some appreciation to data engineers (⬆️ 508 ; 💬 5)
  • I deleted data from production (⬆️ 154 ; 💬 25)
  • How do you test your pipelines? (⬆️ 86 ; 💬 24)
  • Is our coding challenge too hard? (⬆️ 85 ; 💬 120)
  • We wrote about how Postman’s data team operates! (⬆️ 67 ; 💬 11)

r/MachineLearning

  • [P] StyleGAN3 + Cosplay Dataset. Happy Halloween! 🎃 (⬆️ 784 ; 💬 20)
  • [D] New in-depth AI interview episode out! Yuval was featured on 2 minute papers for his incredible work on AI toonification. (⬆️ 253 ; 💬 3)
  • [D] How can companies like Facebook use Pytorch for commercial applications when BN and dropout are patented? (⬆️ 232 ; 💬 106)
  • 100Circles - Words to Paintings via NightCafe VQGAN+CLIP [Project] (⬆️ 254 ; 💬 16)
  • [D] What is a reasonable way to address a paper that was published and you consider to be dishonest or plain bogus? (⬆️ 217 ; 💬 60)

r/LearnMachineLearning

  • Should have read binary classifier, but ok… (⬆️ 982 ; 💬 21)
  • We Built IntelliBrush - An AI Labeller Using Neural Networks and CV (⬆️ 640 ; 💬 19)
  • How to read more research papers? (tips & tools given) (⬆️ 299 ; 💬 13)
  • These plants do not exist (⬆️ 155 ; 💬 5)
  • Tired of university, i need help on how to learn AI and ML by myself (⬆️ 114 ; 💬 44)

r/AskStatistics

  • Can a Statistician using only R get a DS job not having a strong CS background? (⬆️ 22 ; 💬 10)
  • What are the differences between linear models and linear regression? (⬆️ 12 ; 💬 10)
  • I am trying to get a random number based on the normal distribution (⬆️ 11 ; 💬 11)
  • Alternatives to Poisson distribution. (⬆️ 10 ; 💬 19)
  • Resource recommendation to relearn statistics (⬆️ 11 ; 💬 4)

r/LatestInML

  • How to read more research papers? (tips & tools given) (⬆️ 26 ; 💬 6)
  • Straight out of science fiction! Drones that can track and 3D reconstruct any person also while avoiding obstacles! (pose estimation) (⬆️ 21 ; 💬 1)
  • ADOP: Approximate Differentiable One-Pixel Point Rendering (Synthesize Smooth Videos from a Couple of Images) (⬆️ 14 ; 💬 2)
  • Multitask Prompted Training Enables Zero-shot Task Generalization (Explained) (⬆️ 7 ; 💬 0)
  • [D] State of the art in the document information extraction/parsing for resume parsing? (⬆️ 7 ; 💬 0)

r/MLQuestions

  • Early coding habits to pick up (⬆️ 21 ; 💬 7)
  • What is the point of pseudo-labeling for a semi-supervised learning task? (⬆️ 12 ; 💬 0)
  • Graduate Studies in Machine Learning (⬆️ 9 ; 💬 3)
  • ML Algorithm Suggestions (⬆️ 8 ; 💬 5)
  • Python practical time series materials (⬆️ 7 ; 💬 1)

You just read issue #10 of 5 minutes of Data Science. You can also browse the full archives of this newsletter.
