• Big Data & Technology
  • Alec Smith
  • JUN 09, 2016

How to Become a Data Scientist (Part 2/3)

This is Part Two in a Three-Part Series
Part One: What is Data Science? | Part Two: Learning | Part Three: The Job Market


Having read Chapters One and Two (i.e. Part One), you should now have a good comprehension of what commercial data science entails, the different forms it takes, and what is required to be a success in the profession. And having thought deeply about your motivations, you should have a clear picture of your goals, and ultimately – the type of data scientist you want to become. So give yourself a pat on the back, because you are now ready to begin the real fun: learning.

In this chapter, we will explore the options at your disposal, but first we will discuss an important notion that concerns data science and learning.

Continual Learning

Just like a doctor has to stay abreast of medical developments, learning never stops for a data scientist. The field (and the technology) evolves so quickly; what you learn now might not be relevant in the years to come. Look at the rise of deep learning, to take just one example. This is what Sean McClure was alluding to in his post emphasising the importance of problem solving (highlighted in Chapter One).

Quite simply, if you are not passionate about the field and do not enjoy learning, then data science is not for you. Attending conferences and networking with the data science community are effective ways of keeping on top of the latest developments, and it is advisable to regularly read books and papers. On the latter: if you do not have a background in research, it is worth familiarising yourself with academic papers so you can get the most out of them (I haven’t specifically researched the best way to go about this, but after this post was featured on Hacker News, the user ‘Obi_Juan_Kenobi’ came up with an interesting answer to this question – if you have the patience to scroll through this thread: https://news.ycombinator.com/item?id=12243377).

Play. Build. Experiment.

Going back to the message we touched on in Chapter One, there is only one way to develop your capability as a data scientist: experience, experience, experience. I could launch into a lengthy discussion on this, but I happened to come across two excellent posts that cover the key points, so have a read of Brandon Rohrer: A One-Step Program for Becoming a Data Scientist and Rossella Blatt Vital: The Scary Rise of the 'Fake Data Scientists'.

This is what you should take from these: data science is an expert field, it takes a long time to master, and you will only do so through practical experience. As James Petterson summarised:

“Nothing beats experience. You can read as much as you want, you can do all the Coursera courses, but unless you get your hands dirty, you won’t learn”

The good news is there are some great avenues to gain practical experience, and we will turn our attention to these now.  

Kaggle / Open-Source / Freelancing

If you haven’t heard of Kaggle, Google it... NOW! Kaggle is an incredible platform where you can play around, develop your expertise and learn, of course. James put it this way:

“If I hadn’t competed in Kaggle competitions, I would have finished my PhD without knowing the tools that people use in industry. For example, a lot of the methods used in industry are based on ensembles or decision trees, like random forests. They are really powerful and are my first choice in both competitions and industry, but I wasn't exposed to them during my PhD”

There you have it: you can improve your skills while learning the techniques that are commonly applied in industry. And if you start doing well in the competitions, it provides evidence of your capability, as we will see in Chapter Four.

Outside of Kaggle, another option is to contribute to open-source projects. A simple search on GitHub should reveal some projects you can start to sink your teeth into, and gain practical experience while doing so.

Finally, if you can get freelancing work, this is a great way to build a track record and demonstrate that you can operate in a commercial environment. And rather conveniently, you could even utilise the Experfy platform for that purpose.

To PhD or not to PhD

Do you need a PhD to be a data scientist? Not necessarily, but there are many advantages, as Sean Farrell noted: 

“The process of obtaining a PhD is a filter for creative problem solving skills [and it] shows you can master a particular field in a short space of time and become a world expert, which proves you’ll be able to do it again and again”

And apart from anything, it provides you with the time to study and to develop your skills. Furthermore, if you are interested in specialising within a specific area like image processing or natural language processing, then PhD research is certainly worth considering.

But going down this path is not the only way to data science. James did a PhD in Machine Learning (focused on researching a very specific type of method) and he feels that a lot of PhD research is not always applicable to industry, i.e. if your job is to apply machine learning rather than research it, you don’t necessarily need a PhD. As such, I asked him whether he thinks people should choose a PhD based on its relevance to industry and he said:

“If possible, but that’s really hard because most of what we do in industry is not state of the art, we use methods that have been around for years and apply them to different problems. There are exceptions of course: you might work at Google in research, for example. But most of the knowledge I use day-to-day, I learnt working [at Commonwealth Bank] and by competing in Kaggle. Of course, doing a PhD, you learn about the whole process, spend a lot of time doing experiments and learning how to do them properly, and that is valuable. But I wonder if you could learn that from other means?”

Given the right motivations and armed with an informative guide on how to become a data scientist (where could you find one of those I wonder?), I have no doubt it is possible to learn by yourself. But it is worth making the point again: there are no shortcuts; it requires a lot of self-study and getting your hands dirty – whatever path you take.

There is also the employability aspect to consider: are you more employable as a PhD graduate than as someone who has spent the same time on self-study? I do not have sufficient evidence to comment, but either way, what matters more is whether you have truly spent the time building up expert capability (and how you can evidence this). PhDs are certainly valuable, but there are great data scientists with PhDs and great ones without.

Other University Degrees 

So a PhD is not for you – perhaps it is the cost, or perhaps you have not yet developed the expertise necessary for research of this nature. Whatever the reason, there is no need to panic, because many universities now offer Bachelor’s degrees, Master’s degrees and Diplomas specifically designed for data science, where both computer science and mathematics/statistics are on the curriculum (the attentive reader will remember we discussed this in Chapter Two).

Courses like these will certainly take you in the right direction, but take note: they won't be enough to convert you into a ready-made data scientist, because as we know – that takes experience. 

Learning Resources (Online Courses and Books)

In a similar sense, solely reading books or completing online courses will not make you an expert, and remember: this is an expert field. However, for argument’s sake, let’s say you come from another quantitative field, and hypothetically, this was all you needed to master a chosen subject. That’s great, but don’t forget: you will still face competition from people who – in all likelihood – will have far more practical and commercial experience in these areas. This is really important to be conscious of, and so we will return to this concept in Chapter Four.

All this being said, books and online courses are incredibly useful tools to help kick-start your journey, and begin learning new areas or technologies (e.g. deep learning and Spark, respectively). And this takes us back to Sean McClure, who has already been referenced several times in this series. After the release of Part One, we got speaking and he highlighted the following article, initially posted on Quora and since summarised on KDnuggets by Matthew Mayo: How to Learn Machine Learning.

With contributions from Sean and two other well-known machine learning personalities, it proved to be very popular and is an excellent resource for specific recommendations on books, online courses and general advice. If you don’t want to miss out on any valuable tips, I recommend reading the full post on Quora.

Outside of this, I also asked our resident panel of data scientists for their book recommendations, and they came back with:

  • Pattern Recognition and Machine Learning by Christopher Bishop
  • Machine Learning: A Probabilistic Perspective by Kevin P. Murphy
  • An Introduction to Statistical Learning by James, Witten, Hastie and Tibshirani, which, according to Dylan Hogg: “is a great introduction to statistical learning and is an accessible version of the more advanced classic” The Elements of Statistical Learning
  • Why: A Guide to Finding and Using Causes by Samantha Kleinberg (if you want to know why this is important, take a look at Yanir Seroussi’s blog post on: Why You Should Stop Worrying About Deep Learning and Deepen Your Understanding of Causality Instead)
  • For a different suggestion, Will Hanninger recommended The Pyramid Principle by Barbara Minto. It does not cover data science specifically, but is valuable for problem solving and presenting
  • And finally, for improving those all-important soft skills, Yanir recommended the classic book by Dale Carnegie: How to Win Friends and Influence People

Of relevance to this section, it is also worth noting that Experfy has launched a Data Science Certification Course out of Harvard Innovation Launch Lab.

Presenting / Communicating

Even if you have a natural disposition for communicating with different groups of people including the non-technical, this should not be taken for granted. Quite frankly, an absence of effective communication in a commercial environment can be a death sentence to your work, and harm your chances of gaining employment in the first place. In short: it is one of the most crucial aspects of data science, yet it is often overlooked.

All this being said, how can you actually develop and improve your communication, outside of reading the relevant books outlined in the previous section? Ultimately, it goes back to the notion of experience, experience, experience, so grasp every opportunity you can to gain practice, and obtain feedback from others – it really is the only way.

To add to this, I would like to direct you to a post that was written by yours truly in response to feedback received after this was originally published. Essentially, the first iteration lacked sufficient detail to do this topic justice, so I returned to our favourite data scientists and spun the answers into a short article dedicated solely to this fundamental skill: Data Science: The Art of Communication.


Next up is Part Three, in which ‘The Job Market’ is examined, and this has relevance not just for those aspiring to the field, but for any data scientist seeking a career move. 
