How do I become a data scientist?
January 15, 2016

This week, our founder and executive director Jake Porway did his first-ever Reddit AMA on the data science subreddit. There were tons of great questions, ranging from ethics in data science to how DataKind measures impact and more, but a clear common theme was how to get started in a career in data science. Jake wrote the reply below to help folks start their journeys toward becoming data scientists. If you too are wondering how to get started, read through below and feel free to post any resources you've found useful in the comments! Read through all the questions in the full AMA thread here.


Hey everyone! I’m seeing many questions from budding or new data scientists in the thread trying to figure out the best path ahead - How do I get started in a career in data science? What skills do I need? What should I major in?

As we all know, data science is becoming increasingly popular, yet the term is still hotly debated.

So to start us off, my view is that a data scientist is basically a statistician who can program. Data science is the art of using the latest computer science and statistical techniques to collect, analyze, visualize, and otherwise draw conclusions from data. Most of the thorny topics being discussed these days about bias, quality of data, modeling, learning, and data cleaning all come from the healthy body of statistics we've built over the last 100 years.

The novelty of data science comes from a technical need to handle the volume of data now available and to wrangle it from many disparate forms into a clean, usable format. Beyond that, all the other skills attributed to a data scientist - visual communication skills, good writing skills, subject matter expertise - hold true for anyone doing science, from biology to anthropology.

What’s interesting to note is that the skills needed by a true “data scientist” are exceedingly rare. Using Drew Conway’s data science Venn diagram (yes, I still reference this one), one needs to have:

  • Hacking skills: These are programming and scripting skills, which are often not taught in universities or even in industry.
  • Statistical experience: Not many people are trained in formal statistics beyond simple linear regression. A good data scientist should be an expert in questions of bias, advanced modeling, and causal inference.
  • Machine learning: I’m going to single out machine learning chops, as not every hacker + stats person has these. It takes a special skill set to build efficient neural networks and to understand how they do/don’t work.
  • Substantive expertise: The data scientist may not need to be an expert in the field themselves but, if they’re not, they had better learn enough of it from an expert to be able to interpret results or think creatively. At DataKind we solve this by teaming data scientists with non-profits, who bring the subject matter expertise.

The good news? With this diversity of skills needed, there are lots of pathways you can follow and no one way. For example, Drew himself was a computer science undergrad who went on to get a Ph.D. in political science. His graduate work drew him into the world of statistics and data, including machine learning concepts that inspired him to study the social networks of terrorists and do predictive analytics on voting behaviors. He also has great communication skills, picked up some basic visualization, and has strong business and management savvy from his time in government and intelligence.

Other people I know have come from mathematics backgrounds and then picked up programming and computer science to be able to build more advanced models. No matter how you get there, you’ll need to build up your programming and stats skills and not lose sight of your soft skills of communication and creativity.

To learn more about the paths of data scientists, I also recommend Data Scientists at Work, a great book put together by Sebastian Gutierrez, one of the moderators of /r/datascience.

The bad news? There are lots of pathways you can follow so it can feel overwhelming to figure out how to get started.

The Internet is now littered with online courses to teach you data science. Check out Coursera first and foremost. There are also fellowships through Insight and the Data Incubator that will round out your data science training over about 12 weeks. If you’re more of the self-learning type, I’m also a huge fan of John Foreman’s Data Smart for a good intro to data science algorithms and thinking. Of course, the best way to learn is to do: take part in machine learning competitions through Kaggle or DrivenData. Start small and look at questions you’re genuinely interested in.

Lastly, don’t underestimate the power of meeting people in person. Immerse yourself in the data science community as best you can. Attend local Meetups, check out webinars or local conferences, and of course keep posting questions on /r/datascience, and you’ll soon be well on your own data science path. When you’re ready to start the job search, don’t forget that we do a monthly jobs roundup over at DataKind to help you use your powers for good - check out our list for January!

No matter how you get there, enjoy the journey. Data science is a thrilling field, and knowing linear algebra backwards and forwards matters less than rolling up your sleeves and having fun digging in wherever you are. Good luck!
