{"id":921,"date":"2018-10-08T05:47:25","date_gmt":"2018-10-08T02:47:25","guid":{"rendered":"http:\/\/kusuaks7\/?p=526"},"modified":"2021-05-11T14:00:11","modified_gmt":"2021-05-11T14:00:11","slug":"top-five-mistakes-of-greenhorn-data-scientists","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/top-five-mistakes-of-greenhorn-data-scientists\/","title":{"rendered":"Top Five Mistakes of Greenhorn Data Scientists"},"content":{"rendered":"<p><strong><em>Ready to learn Data Science? Browse courses like <a href=\"https:\/\/www.experfy.com\/training\/courses\/effective-data-visualization\">Effective Data Visualization<\/a> developed by industry thought leaders and Experfy in Harvard Innovation Lab.<\/em><\/strong><\/p>\n<p>&nbsp;<\/p>\n<section name=\"5f84\">\n<p id=\"9185\" name=\"9185\">You binged online courses and landed your first Data Science job. Avoid these mistakes to be successful right&nbsp;away.<\/p>\n<p id=\"2e02\" name=\"2e02\">You prepared well to finally become a Data Scientist. You participated in Kaggle competitions and you binge watched online lectures. You feel prepared, but the work as a real-life Data Scientist will prove vastly different from what you might expect.<\/p>\n<p id=\"a0f1\" name=\"a0f1\">This article examines 5 common mistakes of early Data Scientists. The list was assembled together with&nbsp;<a data-href=\"https:\/\/www.linkedin.com\/in\/sfoucaud\/\" href=\"https:\/\/www.linkedin.com\/in\/sfoucaud\/\" rel=\"noopener noreferrer\" target=\"_blank\">Dr. S&eacute;bastien Foucaud<\/a>, who has &gt;20 years of experience in mentoring and leading young Data Scientists in both academia and industry. This post aims to help you better prepare for your work in real-life.<\/p>\n<p id=\"f88c\" name=\"f88c\">Let&rsquo;s get started.&nbsp;<\/p>\n<\/section>\n<section name=\"f1db\">\n<hr \/>\n<h3 id=\"14e7\" name=\"14e7\">1. 
Enter &ldquo;Generation Kaggle&rdquo;<\/h3>\n<figure id=\"0b64\" name=\"0b64\">\n<p><canvas height=\"37\" width=\"75\"><\/canvas><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*AGIFyMIBRHIrjapuhC5nQg.png\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*AGIFyMIBRHIrjapuhC5nQg.png\" \/><\/p>\n<\/figure>\n<p name=\"7e6c\" style=\"text-align: center;\">Source:&nbsp;<a data-href=\"http:\/\/www.kaggle.com\" href=\"http:\/\/www.kaggle.com\/\" rel=\"noopener noreferrer\" target=\"_blank\">kaggle.com<\/a>&nbsp;on June 30,&nbsp;2018.<\/p>\n<p id=\"7e6c\" name=\"7e6c\">You have participated in Kaggle challenges and practiced your Data Science skills. It&rsquo;s nice that you can stack decision trees and neural networks. Truth be told, you won&rsquo;t do quite as much model stacking as a Data Scientist. As a general rule, you will spend about 80% of your time preprocessing data and the remaining 20% building your model.<\/p>\n<figure id=\"4c7f\" name=\"4c7f\">\n<p><canvas height=\"56\" width=\"75\"><\/canvas><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*33KsF3BhPXtmuiDZ-CHq8A.jpeg\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*33KsF3BhPXtmuiDZ-CHq8A.jpeg\" \/><\/p>\n<\/figure>\n<p id=\"a93b\" name=\"a93b\">Being part of &ldquo;Generation Kaggle&rdquo; is helpful in many ways. The data often comes perfectly cleaned so that you can spend time tweaking your model. 
But that&rsquo;s rarely the case in your real-world job, where you have to assemble data from different sources with different formats and naming conventions.<\/p>\n<p id=\"d035\" name=\"d035\">Do the hard work and practice the skill you will use 80% of your time: data preprocessing.&nbsp;<a data-href=\"https:\/\/towardsdatascience.com\/https-medium-com-janzawadzki-sweet-or-cheat-build-a-sneaker-rater-after-finishing-andrew-ngs-2nd-course-49475fc75429\" href=\"https:\/\/towardsdatascience.com\/https-medium-com-janzawadzki-sweet-or-cheat-build-a-sneaker-rater-after-finishing-andrew-ngs-2nd-course-49475fc75429\" rel=\"noopener noreferrer\" target=\"_blank\">Scrape images<\/a>&nbsp;or gather them from an API. Collect song lyrics from&nbsp;<a data-href=\"https:\/\/github.com\/johnwmillr\/LyricsGenius\" href=\"https:\/\/github.com\/johnwmillr\/LyricsGenius\" rel=\"noopener noreferrer\" target=\"_blank\">Genius<\/a>. Prepare the data you need to solve a specific problem, then ingest it into your notebook and practice the machine learning life cycle. Being proficient in data preprocessing will undoubtedly make you a Data Scientist with immediate impact at your company.<\/p>\n<h3 id=\"0d06\" name=\"0d06\">2. Neural Networks are the cure to everything<\/h3>\n<p id=\"785a\" name=\"785a\">Deep Learning models are superior to other machine learning models in the areas of computer vision and natural language processing. But they also have distinct disadvantages.<\/p>\n<figure id=\"e850\" name=\"e850\">\n<p><canvas height=\"43\" width=\"75\"><\/canvas><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*BQ0SxdqC9Pl_3ZQtd3e45A.png\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*BQ0SxdqC9Pl_3ZQtd3e45A.png\" \/><\/p>\n<\/figure>\n<p id=\"4da2\" name=\"4da2\">Neural networks are very data hungry. With fewer samples, you often fare better with a decision tree or logistic regression model. 
Neural Networks are also a&nbsp;<a data-href=\"https:\/\/www.quantamagazine.org\/new-theory-cracks-open-the-black-box-of-deep-learning-20170921\/\" href=\"https:\/\/www.quantamagazine.org\/new-theory-cracks-open-the-black-box-of-deep-learning-20170921\/\" rel=\"noopener noreferrer\" target=\"_blank\">black box<\/a>. They are notoriously hard to interpret and to explain. If product owners or managers start to question the output of the model, you have to be able to explain the&nbsp;model. This is much easier for traditional models.<\/p>\n<figure id=\"8cfc\" name=\"8cfc\">\n<p><canvas height=\"56\" width=\"75\"><\/canvas><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*-8QzrLV5z1-mfv8bFhNPbw.jpeg\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*-8QzrLV5z1-mfv8bFhNPbw.jpeg\" \/><\/p>\n<\/figure>\n<p id=\"de6c\" name=\"de6c\">There are many great statistical learning models out there, as explained in this great&nbsp;<a data-href=\"https:\/\/towardsdatascience.com\/a-tour-of-the-top-10-algorithms-for-machine-learning-newbies-dde4edffae11\" href=\"https:\/\/towardsdatascience.com\/a-tour-of-the-top-10-algorithms-for-machine-learning-newbies-dde4edffae11\" target=\"_blank\" rel=\"noopener noreferrer\">post<\/a>&nbsp;by&nbsp;<a data-action=\"show-user-card\" data-action-type=\"hover\" data-action-value=\"52aa38cb8e25\" data-anchor-type=\"2\" data-href=\"https:\/\/medium.com\/@james_aka_yale\" data-user-id=\"52aa38cb8e25\" href=\"https:\/\/medium.com\/@james_aka_yale\" target=\"_blank\" rel=\"noopener noreferrer\">James Le<\/a>. Educate yourself about them. Know their advantages and disadvantages and apply a model according to the constraints of your use-case. Unless you&rsquo;re working in the specialized field of computer vision or natural speech recognition, chances are that the most successful models will be traditional machine learning algorithms. 
You will soon discover that very often the simplest model, like a Logistic Regression, is the best model.<\/p>\n<figure id=\"db16\" name=\"db16\">\n<p><canvas height=\"43\" width=\"75\"><\/canvas><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*2NR51X0FDjLB13u4WdYc4g.jpeg\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*2NR51X0FDjLB13u4WdYc4g.jpeg\" \/><\/p>\n<\/figure>\n<p name=\"f25c\" style=\"text-align: center;\">Source: Algorithm cheat-sheet from&nbsp;<a data-href=\"https:\/\/bit.ly\/1IxDsim\" href=\"https:\/\/bit.ly\/1IxDsim\" rel=\"noopener noreferrer\" target=\"_blank\">scikit-learn.org<\/a>.<\/p>\n<h3 id=\"f25c\" name=\"f25c\">3. Machine Learning is the&nbsp;Product<\/h3>\n<p id=\"a6f7\" name=\"a6f7\">Machine Learning has both enjoyed and suffered from tremendous hype over the past decade. Too many start-ups promise that Machine Learning is the cure for every problem.<\/p>\n<figure id=\"2aae\" name=\"2aae\">\n<p><canvas height=\"25\" width=\"75\"><\/canvas><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*yUQBX2V8XxXbvIZeYCze4g.png\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*yUQBX2V8XxXbvIZeYCze4g.png\" \/><\/p>\n<\/figure>\n<p name=\"6bb0\" style=\"text-align: center;\">Source: Google Trends for Machine Learning over the past 5&nbsp;years<\/p>\n<p id=\"6bb0\" name=\"6bb0\">Machine Learning itself should never be the product. Machine Learning is a powerful tool to create a product that meets customer demands.&nbsp;If the customer benefits from receiving accurate item recommendations, machine learning can help. If a customer needs to accurately identify objects in an image, machine learning can help. If the business benefits from presenting valuable ads to its users, machine learning can help.<\/p>\n<p id=\"1c64\" name=\"1c64\">As a Data Scientist, you need to plan a project with the goal of the customer as your main priority. 
Only then do you evaluate whether machine learning can help.<\/p>\n<h3 id=\"5839\" name=\"5839\">4. Confuse Causation with Correlation<\/h3>\n<p id=\"0103\" name=\"0103\">About 90% of data has been produced in the&nbsp;<a data-href=\"https:\/\/www.sintef.no\/en\/latest-news\/big-data-for-better-or-worse\/\" href=\"https:\/\/www.sintef.no\/en\/latest-news\/big-data-for-better-or-worse\/\" rel=\"noopener noreferrer\" target=\"_blank\">past years<\/a>. With the emergence of Big Data, data has become vastly available for Machine Learning practitioners. With so much data to evaluate, the chances increase that random correlations are discovered by learning models.<\/p>\n<figure id=\"eb21\" name=\"eb21\">\n<p><canvas height=\"34\" width=\"75\"><\/canvas><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*mxZkHl07npXV_Sz-UUaAKw.png\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*mxZkHl07npXV_Sz-UUaAKw.png\" \/><\/p>\n<\/figure>\n<p name=\"6f4c\" style=\"text-align: center;\">Source:&nbsp;<a data-href=\"http:\/\/www.tylervigen.com\/spurious-correlations\" href=\"http:\/\/www.tylervigen.com\/spurious-correlations\" rel=\"nofollow noopener noreferrer\" target=\"_blank\">http:\/\/www.tylervigen.com\/spurious-correlations<\/a><\/p>\n<p id=\"6f4c\" name=\"6f4c\">The image above shows the age of Miss America and the total number of murders by steam, hot vapours and hot objects. Given that data, a learning algorithm will learn the pattern that the age of Miss America influences the number of murders by certain objects, and vice versa. However, the two variables are unrelated, and neither has any real predictive power over the other.<\/p>\n<p id=\"e406\" name=\"e406\">When discovering patterns in data, apply your domain knowledge. Is it likely to be a correlation or causation? Answering this question is key to deriving actions from data.<\/p>\n<h3 id=\"fd03\" name=\"fd03\">5. 
Optimize the wrong&nbsp;metrics<\/h3>\n<p id=\"db28\" name=\"db28\">Developing a Machine Learning model follows the agile life-cycle. First, you define the idea and key metrics. Second, you prototype a result.&nbsp;Third, you continually improve until you satisfy the key metric.<\/p>\n<figure id=\"9ee6\" name=\"9ee6\">\n<p><canvas height=\"46\" width=\"75\"><\/canvas><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*LpHVolSuHlObAIXXecBJiA.png\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*LpHVolSuHlObAIXXecBJiA.png\" \/><\/p>\n<\/figure>\n<p id=\"b160\" name=\"b160\">When building a Machine Learning model, remember to do a manual error analysis. While the process is tedious and requires effort, it will help you improve the model efficiently in the following iterations.<\/p>\n<\/section>\n<section name=\"4eef\">\n<hr \/>\n<p id=\"069a\" name=\"069a\">Young Data Scientists provide tremendous value to companies. They&rsquo;re fresh off taking online courses and can provide immediate help. They&rsquo;re often self-taught, as few universities offer Data Science degrees, and thus show tremendous commitment and curiosity. They&rsquo;re enthusiastic about the field they&rsquo;ve chosen and are eager to learn more. Beware of the mentioned pitfalls to succeed in your first Data Science job.<\/p>\n<\/section>\n<section name=\"c8d9\">\n<hr \/>\n<h4 id=\"4304\" name=\"4304\">Key takeaways:<\/h4>\n<ul>\n<li id=\"ec04\" name=\"ec04\">Practice data curation<\/li>\n<li id=\"c123\" name=\"c123\">Study pros and cons of different models<\/li>\n<li id=\"012a\" name=\"012a\">Keep the model as simple as possible<\/li>\n<li id=\"ab83\" name=\"ab83\">Check your conclusion against causation vs. correlation<\/li>\n<li id=\"357a\" name=\"357a\">Optimize the most promising metrics<\/li>\n<\/ul>\n<\/section>\n","protected":false},"excerpt":{"rendered":"<p>Young Data Scientists provide tremendous value to companies. 
They&rsquo;re fresh off taking online courses and can provide immediate help. They&rsquo;re often self-taught, as few universities offer Data Science degrees, and thus show tremendous commitment and curiosity. They&rsquo;re enthusiastic about the field they&rsquo;ve chosen and are eager to learn more. Beware of the mentioned pitfalls to succeed in your first Data Science job. This article examines 5 common mistakes of early Data Scientists. This post aims to help you better prepare for your work in real-life.<\/p>\n","protected":false},"author":344,"featured_media":3162,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[187],"tags":[94],"ppma_author":[2067],"class_list":["post-921","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-data-science"],"authors":[{"term_id":2067,"user_id":344,"is_guest":0,"slug":"jan-zawadzki","display_name":"Jan Zawadzki","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","user_url":"","last_name":"Zawadzki","first_name":"Jan","job_title":"","description":"Jan Zawadzki is a Data Scientist at Volkswagen Group Services with 4 years of global experience in machine learning and management 
consulting."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/921","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/344"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=921"}],"version-history":[{"count":1,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/921\/revisions"}],"predecessor-version":[{"id":6260,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/921\/revisions\/6260"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/3162"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=921"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=921"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=921"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=921"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}