Top Five Mistakes of Greenhorn Data Scientists

Jan Zawadzki
October 8, 2018 Big Data, Cloud & DevOps

You binged online courses and landed your first Data Science job. Avoid these mistakes to be successful right away.

You prepared well to finally become a Data Scientist. You participated in Kaggle competitions and binge-watched online lectures. You feel prepared, but working as a real-life Data Scientist will prove vastly different from what you might expect.

This article examines five common mistakes of early-career Data Scientists. The list was assembled together with Dr. Sébastien Foucaud, who has more than 20 years of experience mentoring and leading young Data Scientists in both academia and industry. This post aims to help you better prepare for real-life work.

Let’s get started. 


1. Enter “Generation Kaggle”

Source: kaggle.com, June 30, 2018.

You have participated in Kaggle challenges and practiced your Data Science skills. It’s nice that you can stack decision trees and neural networks. Truth be told, you won’t do nearly as much model stacking as a working Data Scientist. As a general rule, you will spend about 80% of your time preprocessing data and only the remaining 20% building your model.

Being part of “Generation Kaggle” is helpful in many ways, but the data there often comes perfectly cleaned so that you can spend your time tweaking your model. That’s rarely the case in your real-world job, where you have to assemble data from different sources with different formats and naming conventions.

Do the hard work and practice the skill you will use 80% of the time: data preprocessing. Scrape images or gather them from an API. Collect song lyrics from Genius. Prepare the data you need to solve a specific problem, then ingest it into your notebook and practice the full machine learning life cycle. Being proficient in data preprocessing will help you have an immediate impact as a Data Scientist at your company.
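To give a feel for what "different formats and naming conventions" means in practice, here is a minimal sketch that merges two made-up sources, a CSV export and a JSON API response, into one shared schema. All field names and values are hypothetical, and the sketch sticks to the Python standard library; in a real project a library such as pandas would likely do most of this work.

```python
import csv
import io
import json
from datetime import datetime

# Hypothetical source 1: a CSV export with its own naming convention.
CSV_EXPORT = """Customer ID,Signup Date,Revenue (EUR)
17,08.10.2018,120.50
42,30.06.2018,
"""

# Hypothetical source 2: a JSON API response with different names, date format and units.
API_RESPONSE = """[
  {"customerId": "42", "signupDate": "2018-06-30", "revenue_cents": 9900},
  {"customerId": "99", "signupDate": "2018-01-15", "revenue_cents": 4550}
]"""


def from_csv(text):
    """Map CSV rows onto the shared schema: customer_id, signup_date, revenue_eur."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        revenue = row["Revenue (EUR)"].strip()
        rows.append({
            "customer_id": int(row["Customer ID"]),
            "signup_date": datetime.strptime(row["Signup Date"], "%d.%m.%Y").date(),
            "revenue_eur": float(revenue) if revenue else None,  # keep gaps explicit
        })
    return rows


def from_api(text):
    """Map API records onto the same schema, converting cents to euros."""
    rows = []
    for item in json.loads(text):
        rows.append({
            "customer_id": int(item["customerId"]),
            "signup_date": datetime.strptime(item["signupDate"], "%Y-%m-%d").date(),
            "revenue_eur": item["revenue_cents"] / 100,
        })
    return rows


def merge(*sources):
    """Combine records by customer_id; later sources only fill in missing values."""
    merged = {}
    for rows in sources:
        for row in rows:
            record = merged.setdefault(row["customer_id"], dict(row))
            for key, value in row.items():
                if record.get(key) is None:
                    record[key] = value
    return [merged[key] for key in sorted(merged)]


customers = merge(from_csv(CSV_EXPORT), from_api(API_RESPONSE))
for customer in customers:
    print(customer)
```

Most of the code deals with renaming, parsing and gap filling rather than modeling, which is roughly the 80/20 split described above.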

2. Neural Networks are the cure to everything

Deep Learning models outperform other machine learning models in areas such as computer vision and natural language processing. But they also have distinct disadvantages.

Neural networks are very data hungry. With fewer samples, you often fare better with a decision tree or logistic regression model. Neural networks are also a black box: they are notoriously hard to interpret and explain. If product owners or managers start to question the output of the model, you have to be able to explain it. This is much easier for traditional models.

There are many useful statistical learning models out there, as explained in this great post by James Le. Educate yourself about them. Know their advantages and disadvantages, and choose a model according to the constraints of your use case. Unless you’re working in a specialized field such as computer vision or speech recognition, chances are that the most successful models will be traditional machine learning algorithms. You will soon discover that very often the simplest model, such as a logistic regression, is the best model.
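As a rough illustration of why a simple model is easier to explain, the sketch below fits a logistic regression from scratch on a tiny, invented dataset and prints one weight per feature. The data, the feature names and the training settings are assumptions made for this example; in practice you would more likely reach for a library implementation.

```python
import math

# Hypothetical toy data: each row is [hours_studied, shoe_size], both scaled to 0..1.
# The label says whether the student passed (1) or failed (0).
FEATURE_NAMES = ["hours_studied", "shoe_size"]
X = [[0.10, 0.90], [0.20, 0.30], [0.30, 0.60], [0.40, 0.10], [0.50, 0.40],
     [0.45, 0.80], [0.60, 0.70], [0.70, 0.20], [0.80, 0.50], [0.90, 0.35]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, row):
    """Predicted probability of the positive class for one row."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, row)))


def fit_logistic_regression(X, y, learning_rate=1.0, epochs=2000):
    """Plain batch gradient descent on the log loss, without regularization."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        # Prediction errors are computed once per epoch, before any update.
        errors = [predict_proba(weights, bias, row) - label for row, label in zip(X, y)]
        for j in range(len(weights)):
            gradient = sum(e * row[j] for e, row in zip(errors, X)) / n
            weights[j] -= learning_rate * gradient
        bias -= learning_rate * sum(errors) / n
    return weights, bias


weights, bias = fit_logistic_regression(X, y)
accuracy = sum(
    (predict_proba(weights, bias, row) >= 0.5) == bool(label) for row, label in zip(X, y)
) / len(y)

for name, weight in zip(FEATURE_NAMES, weights):
    print(f"{name}: {weight:+.2f}")
print(f"training accuracy: {accuracy:.0%}")
```

Each printed weight can be read directly: a larger positive weight means the feature pushes the prediction towards "passed". That is the kind of explanation a product owner can follow, and it is much harder to extract from a deep network.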

Source: Algorithm cheat-sheet from scikit-learn.org.

3. Machine Learning is the Product

Machine Learning has both enjoyed and suffered from tremendous hype in the past decade. Too many start-ups promise that Machine Learning is the cure for every problem that exists.

Source: Google Trends for Machine Learning of the past 5 years

Machine Learning itself should never be the product. Machine Learning is a powerful tool to create a product that meets customer demands. If the customer benefits from receiving accurate item recommendations, machine learning can help. If a customer needs to accurately identify objects in an image, machine learning can help. If the business benefits from presenting valuable ads to its users, machine learning can help.

As a Data Scientist, you need to plan a project with the customer’s goal as your main priority. Only then should you evaluate whether machine learning can help.

4. Confuse Causation with Correlation

About 90% of the world’s data has been produced in the past few years. With the emergence of Big Data, data has become widely available to Machine Learning practitioners. With so much data to evaluate, the chances increase that learning models discover random correlations.

Source: http://www.tylervigen.com/spurious-correlations

The image above shows the age of Miss America and the total number of murders by steam, hot vapours and hot objects. Given that data, a learning algorithm would learn the pattern that the age of Miss America influences the number of murders by certain objects, and vice versa. However, the two variables are virtually unrelated, and neither has any real predictive power over the other.

When discovering patterns in data, apply your domain knowledge. Is the pattern likely to be mere correlation, or causation? Answering this question is key to deriving actions from data.
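A pattern like the Miss America one is easy to reproduce with pure noise. The sketch below, an illustration under made-up settings rather than an analysis of the original chart, generates one random "target" series and many unrelated random series, then searches for the strongest correlation. The series length, the number of candidates and the 0.7 threshold are arbitrary choices for this example.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the run is repeatable
N_YEARS = 10            # few observations per series, as in many yearly statistics
N_CANDIDATES = 2000     # many unrelated series to search through

# Every series is independent noise, so any correlation is there purely by chance.
target = [rng.gauss(0, 1) for _ in range(N_YEARS)]
candidates = [[rng.gauss(0, 1) for _ in range(N_YEARS)] for _ in range(N_CANDIDATES)]

correlations = [pearson(series, target) for series in candidates]
best = max(correlations, key=abs)
strong = sum(abs(r) > 0.7 for r in correlations)

print(f"strongest correlation found by chance: {best:+.2f}")
print(f"series with |r| > 0.7: {strong} of {N_CANDIDATES}")
```

With only a handful of observations and thousands of candidate variables, strong correlations show up even though nothing is related to anything. The more variables you search, the more domain knowledge you need to separate real signals from chance.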

5. Optimize the wrong metrics

Developing a Machine Learning model follows an agile life cycle. First, you define the idea and key metrics. Second, you prototype a result. Third, you improve continually until you satisfy the key metric.

When building a Machine Learning model, remember to do a manual error analysis. While the process is tedious and requires effort, it will help you improve the model efficiently in the following iterations.
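One possible way to organize a manual error analysis is to look at every misclassified validation example, write down a short note on why it probably failed, and then count the notes. The sketch below assumes a hypothetical image classifier; the labels, predictions and notes are invented for illustration.

```python
from collections import Counter

# Hypothetical validation results. The "note" field is filled in by hand
# while looking at each misclassified example.
correct = [{"label": "cat", "predicted": "cat", "note": ""}] * 15
misclassified = [
    {"label": "cat", "predicted": "dog", "note": "blurry image"},
    {"label": "dog", "predicted": "cat", "note": "blurry image"},
    {"label": "cat", "predicted": "dog", "note": "mislabeled"},
    {"label": "cat", "predicted": "dog", "note": "blurry image"},
    {"label": "dog", "predicted": "cat", "note": "dark background"},
]
validation_results = correct + misclassified

# Select the errors from the data itself instead of trusting the list above.
errors = [r for r in validation_results if r["label"] != r["predicted"]]
accuracy = 1 - len(errors) / len(validation_results)
error_causes = Counter(r["note"] for r in errors)

print(f"accuracy: {accuracy:.0%}")
for cause, count in error_causes.most_common():
    print(f"{cause}: {count} of {len(errors)} errors")
```

The tally shows where the next iteration is likely to pay off most. In this invented example, handling blurry images would address more errors than any other single fix.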


Young Data Scientists provide tremendous value to companies. They’re fresh off taking online courses and can provide immediate help. They’re often self-taught, as few universities offer Data Science degrees, and thus show tremendous commitment and curiosity. They’re enthusiastic about the field they’ve chosen and are eager to learn more. Watch out for the pitfalls described above to succeed in your first Data Science job.


Key takeaways:

  • Practice data curation
  • Study pros and cons of different models
  • Keep the model as simple as possible
  • Check your conclusion against causation vs. correlation
  • Optimize the most promising metrics