Practical Advice for Breaking into Data Science
Doyou want to break into data science? But just don’t know how? If so, then read on as this article is for you.
Late last year in December of 2019, I released a YouTube video by the same name called Strategies for Learning Data Science in 2020 for my recently started YouTube channel called the Data Professor where I shared my strategies for learning data science as an academic professor who had self-taught himself data science and has been active in the field for the past 16 years since 2004.
In this article, I’ll be discussing how you too can embark on a similar path of becoming a data scientist by sharing with you some of the learning strategies that I use to learn data science.
What background do you need to break into data science?
Should you have a computer science degree in order to become a data scientist? Or if you are coming from a non-technical background, could you also make a similar transition into data science?
Firstly, I am not a computer scientist and as you may recall from my very first article on Medium (How a Biologist Became a Data Scientist: How I Transitioned from a Non-Technical Background into Data Science), my undergraduate degree was in biological science. And because of my fascination with computers and data, I have been self-studying the necessary concepts and skills necessary for doing data science.
Data science is a multidisciplinary field that encompasses several disciplines such as informatics, computer science, statistics, science, mathematics, data visualization and most importantly problem-solving. If you look into job descriptions of data scientists on LinkedIn, you will notice that there are various first degrees that data scientists may have. As you can see, the background is quite diverse. In fact, many are not computer science graduates. However, if you have a computer science degree you may have an advantage. And if you don’t, that’s okay too (myself included as a biology major).
Having the passion to learn data science and apply it to whatever field you are in, I believe that is the key to becoming a data scientist. Owing to the abundance of data in biology, this has incentivized my transition of becoming a data scientist. Particularly, by applying machine learning to analyze the big biological data in an effort (1) to find and discover new drugs, (2) to understand the mechanism of action of drugs and also (3) to create diagnostic tools that will be able to assist clinicians and health professionals in diagnosing patients for a particular disease of interest.
Where to Start?
This is a tough question. Where do you actually start when you want to break into data science? The answer can get quite diverse. If you ask a hundred data scientists, you might get a hundred different answers!
If you ask me, I started with a domain knowledge of biology. I had some data problems in biology that I need answers for. In my journey of eventually becoming a data scientist, I’ve slowly learned the necessary skillsets one by one. This was not done in 3 months or 1 year but have started since 2004. Having been doing data science for 16 years before the term was coined, having published more than 100 research articles, having taught data science (data mining back then) since 2006, could I say that I have mastered data science. Absolutely not! I am still continuing to learn new concepts in data science on a daily basis. The landscape of data science is so vast that it could take a lifetime to master (and perhaps not even master the entire field but only a subset of it).
As you will see in the cartoon illustration below, the starting point of breaking into data science is quite flexible. From whichever background that you are coming from, you could indeed break into data science and with the necessary effort, time and practice invested you will be on your way of becoming a data scientist.
How Much Time is Needed to Become a Data Scientist?
Another popular question that you may be wondering about is the amount of time that is needed to become a data scientist. You might have heard from somewhere that you can learn data science in 3 months or less.
But really, is that at all possible? The answer is yes and no. Confused? You see, in 3 months you might have a taste or feel of what data science has to offer. You might be able to gain a preliminary introduction to data science. You might even gain data literacy or become a citizen data scientist.
Let’s say that if you have a computer science degree, the time it would take to become a data scientist would not be so long because you already have the fundamentals of computing, you already have the technical background, you already know how to program (if not, that’s okay too!). So these technical skills that you have will make your transition much quicker.
Let’s say that if you are a web developer and you’re going to learn R or Python then you will be at a better position to learn both languages or either one of these languages than a non-technical person coming from a field such as biology (yep, that’s me!).
For a biology major the time that it takes to learn R or Python might be longer. However, I believe that if you have the mindset and passion to learn then I think that is all that is necessary. I read somewhere that if you spend say 10,000 hours you will be able to master skills or knowledge.
Let’s say that you spend about two hours a day learning about the concepts of programming and data science so I believe that within a year or two you will be able to learn enough to towards becoming a data scientist. So given that you also practice. I would argue that a data scientist is a way of life. Becoming one requires dedication and life-long learning for acquiring new skills that would allow you to be in the loop of the field. It’s a use it or lose it philosophy. So aside from life-long learning, I also highly recommend to apply the newfound knowledge and skills on data science projects.
Personally, my day-time job as an Associate Professor of Bioinformatics requires me to keep up-to-date on the latest developments of the field so as to foster innovation and the development of novel bioinformatics tools as well as the application of existing machine learning algorithms to extract knowledge and provide insights on the underlying mechanism of diseases as well as in the quest of discovering a new drug for potential therapeutic applications.
Do I Need To Learn How to Code?
So the next question that you may have is whether you will need to learn how to code in order to become a data scientist? This really depends on the circumstances, so yes and no. Before I learn how to code, I would use this software called WEKA. It is a point-and-click graphical user interface (GUI) software for performing data mining that I relied on for analyzing the data that I had compiled during the course of my PhD study.
Over time, I began to notice that data analysis via the point-and-click interface was not an efficient process. The main reason is that it requires a lot of manual work where I will have to physically use the mouse to click and perform various tasks in the software: (1) to import the data, (2) to specify the input parameter, (3) to initiate the training of the model, (4) to collect the data, (5) to put it into Excel, (6) to combine it and therefore all of these were quite tedious.
There was this one time that I could vividly remember during the course of my PhD study. It is when I controlled the 40-50 computers at the computer lab of the university. So on each of the computers, I would run some calculations (model building using the WEKA software) after which I would then manually collect the data from each of the 40–50 computers and then pooling the results for subsequent analysis and plot making in Microsoft Excel.
During the course of my PhD studies, I never did learn to code while I only relied on the data mining capabilities of the WEKA software to do all my machine learning model building using algorithms such as linear regression, J48 (the Java implementation of the C4.5 algorithm), back-propagation neural network and support vector machine. Towards the end of my PhD studies, I came across another great GUI-based multivariate analysis software known as The Unscrambler, which allowed me to run 2 additional machine learning algorithms namely principal component analysis and partial least squares regression.
Thus, by the end of my 4 years of PhD studies, I was able to graduate with 13 publications (research articles) thanks to the use of the 2 GUI-based software: WEKA and The Unscrambler software. Did I mentioned that during the course of my PhD studies, I did try to learn C++ and Java with no success. Pretty much stuck at the Hello World example and did not really gained any further progress.
So when did I learn coding? You might be wondering. Before revealing the answer, let me give you some context. So after receiving my PhD, I was hired as a Lecturer at the same university (fast forward to today, I have just celebrated my 14 years in academia + 4 years of PhD studies = 18 years of research).
Okay, now comes the part when I learn how to code. It was about a years into my Lecturership where I met a research assistant (I will hereafter refer to as the RA) who had joined my research lab to help translate some of the existing workflow that I had been doing into a programmatic workflow namely in Python. So the RA showed me this book that he had been using to learn Python called the Python Power! This was my second (and successful) attempt at learning how to code. So why did this second attempt worked? Let me explain that in the next section.
So with all this said, is it possible to do data science without coding? It is possible to use GUI-based software such as WEKA, Rapid Miner, Orange or other similar software (which in modern times is referred to as no-code AI) to perform machine learning model building.
But the entire end-to-end workflow of the entire data mining project was not entirely possible via the use of only 1 software such as WEKA. So I ended up making use of the text editor called Ultra Edit, which allowed me to apply recorded macros for semi-automating some of the data pre-processing that was required.
Plot making was predominantly created using Microsoft Excel and Powerpoint (for adding in-figure panel labels and for designing the layout of a multi-plot figure. Another software I also used for making more sophisticated plots such as a contour plot was made possible using the Sigma Plot software. The challenge was to manually transform the output of 1 software as the input of the second software. This process is repeated iteratively whereby the output of the second software is used as the input of the third software, etc.
As you can see, it is indeed possible to perform the entire end-to-end workflow of the entire data mining model building via GUI-based software but at the cost of manual work.
So thinking in retrospect, if I could start my PhD again, I would definitely put in more effort on learning how to code (the thing is, I did tried to learn how to code but without success; the second attempt a couple of years later was successful because of a change in mindset which I will be sharing with you in the forthcoming section Self-taught coder?).
You see, if you know how to code, even if just only a little bit, it will help to greatly speed up your workflow. Imagine automating the entire end-to-end project in R or Python code instead of having to do everything manually. So what’s the difference? The answer is time and reliability. Particularly, the automated approach is less prone to human error while being significantly faster to compute and complete.
What Language to Learn? R or Python?
So now the very important question is what language should you learn how to code. If you’ve been googling or if you’ve been watching videos on YouTube you may come across two languages that are very popular for data science. So the first one is R and the second one is Python. A popular question that is often asked and debated is whether to learn R or Python.
Is there a compelling reason for R or Python?
So the answer really depends on whether there is a compelling reason to choose R or Python. For example, you might want to use the Bio3D R package or the shiny package, so if that’s the case then choose R. Or perhaps, you really want to use the Biopython library in Python, then, by all means, go with Python.
Do you have a study partner or mentor?
So personally I’ve learned Python first not because of any particular reason of the language itself. So the decision to learn Python was rather due to the fact that the RA whom I mentioned earlier in the previous section had already used Python to code the workflow for the project that we were working on. This more or less, limited the choice to use Python and so my learning journey into coding started here.
So I had asked the RA to explain what the Python source code meant, line by line. The first explanation did not make sense for me. So it took some cartoon doodling on a piece of paper on what actually was happening when each line is executed. Shortly after, he had left the group to pursue his PhD studies. So my learning journey into coding continued.
So I came to the conclusion that by following the coding examples in books did not really ignited my learning process. So what actually, propelled my learning? It was framing my own data problem and based on this problem (which serves as the ultimate incentive) I would seek out ways on how to programmatically solve the data problem and produce the necessary results. The completion of this small task would provide small bits of victory that would help to fuel my learning motivation. By seeking out ways to solve data problems, what did I actually do?
The answer is quite straight forward. I discovered Stack Overflow. It is a magical and powerful resource that allowed me to self-teach my way through learning coding by learning from the examples of how others solved the coding problem. Sometimes, some of the questions had multiple answers that ultimately led to the same results. Particularly, some answers received more upvote while others might receive fewer upvoted which really is a subjective issue on whether it helped others in their learning journey as well. By comparing and learning from these multiple answers, this helped me realised that in order to get the job done (solve the data problem) I would not need to remember the syntax (which I blindly attempted to do when I was trying to learn C++ or Java several years prior but at no success).
So this time around, I was guided by the desire to solve data/coding problems and the logic of what tasks are necessary to arrive at the end solution. So how did I come up with the logic of what tasks are needed to be done? I would simply look at the data/coding problem and try to list out all the steps onto paper on how I would do this, if I had to do it manually. I would then proceed to sequentially code each step, one by one, until all has been solved. Before you know it, the pie that has been broken down into bite-sized pieces are tackled and completed one by one.
Know Your Python Libraries and R Packages
So the next step of becoming a data scientist is you need to become familiar with the standard library of Python or the standard package and modules of R. Particularly, you will need to know what packages or libraries in R or Python that you can use to wrangle your data, to pre-process your data and also to create your prediction model.
For example, if you handle data frames in Python you would use pandas, therefore, you will have to learn pandas so that you can merge different data frames together. In R, you can use dplyr and also the data frame built-in function in R.
If you want to create graphs in Python you would use matplotlib, seaborn, altair, plotly and plotnine while in R you could use the R base plot function, ggplot2 and plotly.
After you have built your model and you would like to deploy it. You could essentially choose one of 2 approaches: (1) deploy as an application programming interface (API) or (2) deploy as a web application.
In Python, you could deploy your scikit-learn models using flask (see this tutorial Deploying a Machine Learning Model as a REST APIby Nguyen Ngo). Cloud ML Engine from Google is also another option (see this tutorial Deploying scikit-learn Models at Scale by Yufeng Gao). In R, you could use the plumber package for creating the web API (see this tutorial How to make your machine learning model available as an API with the plumber package by Shirin Elsinghorst and the tutorial R can API and So Can You! by Heather Nolis.
To deploy machine learning models as a web application, it is recommended that you first save the model into a file via the pickle function in Python and via the saveRDS function in R. The actual web application can then be provisioned using flask and Streamlitin Python while to do this in R you can use the shiny package.
I have also created a tutorial video series on how you can create data science web applications in Python using the Streamlit library.
If you would like to create data science web applications in R, then please check out the following tutorial video series in using the shiny package.
Own Your Data Science Journey
And the most important part of becoming a data scientist is that you have to persevere. You have to try hard. You have to own your data science journey. This journey won’t be easy so you will have to put in the extra effort to make it work.
My data science journey has been very phenomenal. I have learned a lot of things. I have met a lot of talented people. I have the opportunity to do and get paid for what I really like. So I’m never bored of this job and there’s always new data, to be collected, to be analyzed and we have ongoing collaborations there’s really a continuous inflow of new data and new opportunities to learn about biology so it is a very very fulfilling job. If I could turn back the clock and decide again what I would like to do I will stick to the same path and become a data scientist. I would believe that data science fits well in my own Ikigai (check out the awesome article Ikigai: The secret to living a meaningful life by Ketaki Vaidya).
Final Take-Home Message
And most importantly of all, you have to code. You have to do data science project. If you can take away 2 key messages from this article, please take 2: (1) learn to code and (2) apply your coding skills to work on your own data science projects.
“The Best Way to Learn Data Science is to Do Data Science.”
— Chanin Nantasenamat, Data Professor