{"id":1583,"date":"2019-03-19T03:11:55","date_gmt":"2019-03-19T03:11:55","guid":{"rendered":"http:\/\/kusuaks7\/?p=1188"},"modified":"2023-07-17T13:39:27","modified_gmt":"2023-07-17T13:39:27","slug":"why-you-shouldnt-be-a-data-science-generalist","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/why-you-shouldnt-be-a-data-science-generalist\/","title":{"rendered":"Why you shouldn\u2019t be a data science generalist"},"content":{"rendered":"<p id=\"6b38\">I work at a\u00a0<a href=\"http:\/\/sharpestminds.com\/\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"http:\/\/sharpestminds.com\">data science mentorship startup<\/a>, and I\u2019ve found there\u2019s a single piece of advice that I catch myself giving over and over again to aspiring mentees. And it\u2019s really not what I would have expected it to be.<\/p>\n<p id=\"597c\">Rather than suggesting a new library or tool, or some resume hack, I find myself recommending that they first think about\u00a0<em>what kind of data scientist they want to be<\/em>.<\/p>\n<p id=\"635b\">The reason this is crucial is that data science isn\u2019t a single, well-defined field, and companies don\u2019t hire generic, jack-of-all-trades \u201cdata scientists\u201d, but rather individuals with very specialized skill sets.<\/p>\n<p id=\"0a87\">To see why, just imagine that you\u2019re a company trying to hire a data scientist. You almost certainly have a fairly well-defined problem in mind that you need help with, and that problem is going to require some fairly specific technical know-how and subject matter expertise. 
For example, some companies apply simple models to large datasets, some apply complex models to small ones, some need to train their models on the fly, and some don\u2019t use (conventional) models at all.<\/p>\n<p id=\"e7c3\">Each of these calls for a completely different skill set, so it\u2019s especially odd that the advice that aspiring data scientists receive tends to be so generic: \u201clearn how to use Python, build some classification\/regression\/clustering projects, and start applying for jobs.\u201d<\/p>\n<p id=\"d7ef\">Those of us who work in the industry bear a lot of the blame for this. We tend to lump an excessive number of things into the \u201cdata science\u201d bucket in casual conversations, blog posts and presentations. Building a robust data pipeline for production? That\u2019s a \u201cdata science problem.\u201d Inventing a new kind of neural network? That\u2019s a \u201cdata science problem.\u201d<\/p>\n<p id=\"4302\">That\u2019s not good, because it tends to cause aspiring data scientists to lose focus on specific problem classes, and instead become jacks of all trades\u200a\u2014\u200asomething that can make it harder to get noticed or break through, in a market that\u2019s already saturated with generalists.<\/p>\n<p id=\"0e13\">But it\u2019s hard to avoid becoming a generalist if you don\u2019t know which common problem classes you could specialize in in the first place. That\u2019s why I put together a list of the five problem classes that are often lumped together under the \u201cdata science\u201d heading:<\/p>\n<h4 id=\"8642\">1. Data\u00a0engineer<\/h4>\n<p id=\"c355\"><strong>Job description:\u00a0<\/strong>You\u2019ll be managing data pipelines for companies that deal with large volumes of data. 
That means making sure that your data is being efficiently collected and retrieved from its source when needed, cleaned and preprocessed.<\/p>\n<p id=\"25e0\"><strong>Why it\u2019s important:\u00a0<\/strong>If you\u2019ve only ever worked with relatively small (&lt;5 GB) datasets stored in\u00a0.csv or\u00a0.txt files, it might be hard to understand why there would exist people whose full-time job it is to build and maintain data pipelines. Here are a couple of reasons: 1) A 50 GB dataset usually won\u2019t fit in your computer\u2019s RAM, so you generally need other ways to feed it into your model, and 2) that much data can take a ridiculous amount of time to process, and often has to be stored redundantly. Managing that storage takes specialized technical know-how.<\/p>\n<p id=\"7d47\"><strong>Requirements:\u00a0<\/strong>The technologies you\u2019ll be working with include Apache Spark, Hadoop and\/or Hive, as well as Kafka. You\u2019ll most likely need to have a solid foundation in SQL.<\/p>\n<p id=\"c98c\"><strong>The questions you\u2019ll be dealing with sound like:<\/strong><\/p>\n<p id=\"4ba3\">\u2192 \u201cHow do I build a pipeline that can handle 10 000 requests per minute?\u201d<\/p>\n<p id=\"1e8e\">\u2192 \u201cHow can I clean this dataset without loading it all in RAM?\u201d<\/p>\n<h4 id=\"a7a9\">2. Data\u00a0analyst<\/h4>\n<p id=\"d4ad\"><strong>Job description:\u00a0<\/strong>Your job will be to translate data into actionable business insights. You\u2019ll often be the go-between for technical teams and business strategy, sales or marketing teams. Data visualization is going to be a big part of your day-to-day.<\/p>\n<p id=\"be92\"><strong>Why it\u2019s important:\u00a0<\/strong>Highly technical people often have a hard time understanding why data analysts are so important, but they really are. Someone needs to convert a trained and tested model and mounds of user data into a digestible format so that business strategies can be designed around them. 
Data analysts help to make sure that data science teams don\u2019t waste their time solving problems that don\u2019t deliver business value.<\/p>\n<p id=\"ac90\"><strong>Requirements:\u00a0<\/strong>The technologies you\u2019ll be working with include Python, SQL, Tableau and Excel. You\u2019ll also need to be a good communicator.<\/p>\n<p id=\"0394\"><strong>The questions you\u2019ll be dealing with sound like:<\/strong><\/p>\n<p id=\"12d4\">\u2192 \u201cWhat\u2019s driving our user growth numbers?\u201d<\/p>\n<p id=\"96f2\">\u2192 \u201cHow can we explain to management that the recent increase in user fees is turning people away?\u201d<\/p>\n<h4 id=\"9052\">3. Data scientist<\/h4>\n<p id=\"fefd\"><strong>Job description:\u00a0<\/strong>Your job will be to clean and explore datasets, and make predictions that deliver business value. Your day-to-day will involve training and optimizing models, and often deploying them to production.<\/p>\n<p id=\"a056\"><strong>Why it\u2019s important:<\/strong>\u00a0When you have a pile of data that\u2019s too big for a human to parse, and too valuable to be ignored, you need some way of pulling digestible insights from it. That\u2019s the basic job of a data scientist: to convert datasets into digestible conclusions.<\/p>\n<p id=\"57c7\"><strong>Requirements:\u00a0<\/strong>The technologies you\u2019ll be working with include Python, scikit-learn, Pandas, SQL, and possibly Flask, Spark and\/or TensorFlow\/PyTorch. Some data science positions are purely technical, but the majority will require you to have some business sense, so that you don\u2019t end up solving problems that no one has.<\/p>\n<p id=\"163b\"><strong>The questions you\u2019ll be dealing with sound like:<\/strong><\/p>\n<p id=\"3273\">\u2192 \u201cHow many different user types do we really have?\u201d<\/p>\n<p id=\"d765\">\u2192 \u201cCan we build a model to predict which products will sell to which users?\u201d<\/p>\n<h4 id=\"5005\">4. 
Machine learning\u00a0engineer<\/h4>\n<p id=\"054f\"><strong>Job description:\u00a0<\/strong>Your job will be to build, optimize and deploy machine learning models to production. You\u2019ll generally be treating machine learning models as APIs or components, which you\u2019ll be plugging into a full-stack app or hardware of some kind, but you may also be called upon to design models yourself.<\/p>\n<p id=\"796e\"><strong>Requirements:\u00a0<\/strong>The technologies you\u2019ll be working with include Python, JavaScript, scikit-learn, TensorFlow\/PyTorch (and\/or enterprise deep learning frameworks), and SQL or MongoDB (typically used for app DBs).<\/p>\n<p id=\"1b40\"><strong>The questions you\u2019ll be dealing with sound like:<\/strong><\/p>\n<p id=\"8141\">\u2192 \u201cHow do I integrate this Keras model into our JavaScript app?\u201d<\/p>\n<p id=\"5dea\">\u2192 \u201cHow can I reduce the prediction time and prediction cost of our recommender system?\u201d<\/p>\n<h4 id=\"8732\">5. Machine learning researcher<\/h4>\n<p id=\"f51a\"><strong>Job description:\u00a0<\/strong>Your job will be to find new ways to solve challenging problems in data science and deep learning. You won\u2019t be working with out-of-the-box solutions, but rather will be making your own.<\/p>\n<p id=\"6e98\"><strong>Requirements:\u00a0<\/strong>The technologies you\u2019ll be working with include Python, TensorFlow\/PyTorch (and\/or enterprise deep learning frameworks), and SQL.<\/p>\n<p id=\"feab\"><strong>The questions you\u2019ll be dealing with sound like:<\/strong><\/p>\n<p id=\"81fe\">\u2192 \u201cHow do I improve the accuracy of our model to something closer to the state of the art?\u201d<\/p>\n<p id=\"0aab\">\u2192 \u201cWould a custom optimizer help decrease training time?\u201d<\/p>\n<p id=\"cf80\">The five job descriptions I\u2019ve laid out here definitely don\u2019t stand alone in all cases. 
At an early-stage startup, for instance, a data scientist might have to be a data engineer and\/or a data analyst, too. But most jobs will fall more neatly into one of these categories than the others\u200a\u2014\u200aand the larger the company, the more these categories will tend to apply.<\/p>\n<p id=\"57fc\">Overall, the thing to remember is that in order to get hired, you\u2019ll usually be better off building a more focused skillset: don\u2019t learn TensorFlow if you want to become a data analyst, and don\u2019t prioritize learning PySpark if you want to become a machine learning researcher.<\/p>\n<p id=\"bc67\">Think instead about the kind of value you want to help companies build, and get good at delivering that value. That, more than anything else, is the best way to get in the door.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>First think about&nbsp;what kind of data scientist they want to be. The reason this is crucial is that data science isn&rsquo;t a single, well-defined field, and companies don&rsquo;t hire generic, jack-of-all-trades &ldquo;data scientists&rdquo;, but rather individuals with very specialized skill sets. Think instead about the kind of value you want to help companies build, and get good at delivering that value. 
That, more than anything else, is the best way to get in the door.<\/p>\n","protected":false},"author":251,"featured_media":4182,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[187],"tags":[94],"ppma_author":[2882],"class_list":["post-1583","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-data-science"],"authors":[{"term_id":2882,"user_id":251,"is_guest":0,"slug":"jeremie-harris","display_name":"Jeremie Harris","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","user_url":"","last_name":"Harris","first_name":"Jeremie","job_title":"","description":"Jeremie Harris is Co-Founder at <a href=\"https:\/\/www.sharpestminds.com\/\">SharpestMinds<\/a> that finds new grads their first jobs in machine learning and data science. He has many publications to his credit."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1583","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/251"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=1583"}],"version-history":[{"count":2,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1583\/revisions"}],"predecessor-version":[{"id":29264,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1583\/revisions\/29264"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/4182"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=1583"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-
json\/wp\/v2\/categories?post=1583"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=1583"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=1583"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}