{"id":1800,"date":"2019-07-04T05:48:44","date_gmt":"2019-07-04T05:48:44","guid":{"rendered":"http:\/\/kusuaks7\/?p=1405"},"modified":"2023-07-17T14:27:32","modified_gmt":"2023-07-17T14:27:32","slug":"why-youre-not-a-job-ready-data-scientist-yet","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/why-youre-not-a-job-ready-data-scientist-yet\/","title":{"rendered":"Why you\u2019re not a job-ready data scientist yet"},"content":{"rendered":"<p id=\"6450\">If there\u2019s one thing I\u2019ve learned from the data science mentorship\u00a0I work at, it\u2019s this: getting feedback on your data science job application or interview is\u00a0<em>virtually impossible<\/em>.<\/p>\n<p id=\"f423\">There are <a href=\"https:\/\/www.forbes.com\/sites\/lizryan\/2017\/03\/19\/the-real-reason-employers-wont-tell-you-why-they-hired-someone-else\/#68e378d2d0f0\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/www.forbes.com\/sites\/lizryan\/2017\/03\/19\/the-real-reason-employers-wont-tell-you-why-they-hired-someone-else\/#68e378d2d0f0\" data->good reasons<\/a>\u00a0that companies are cagey about giving feedback. For one, every piece of feedback a company gives to a rejected applicant is a potential lawsuit. Plus, there\u2019s the fact that many people don\u2019t respond well to negative feedback, and some get downright combative.<\/p>\n<p id=\"a54d\">And just imagine the time it would take for a recruiter to send a thoughtful feedback email to you\u2014and to the dozens (or hundreds) of other applicants they also have to consider. And there\u2019s the fact that, at the end of the day, they get\u00a0<em>absolutely nothing<\/em>\u00a0out of issuing any kind of feedback, no matter how helpful or obvious it may be.<\/p>\n<p id=\"6ea4\">The tragic end result of all this is a huge number of confused, directionless aspiring data scientists. 
But here\u2019s some good news: there aren\u2019t actually that many reasons why applicants get turned down from data science roles, and there\u2019s a lot you can do to cover those bases.<\/p>\n<p id=\"a3fe\">And those reasons\u200a\u2014\u200athe technical and nontechnical skills that most applicants don\u2019t have but that companies most badly want\u200a\u2014\u200aare what this post is all about.<\/p>\n<h3 id=\"0479\">Reason 1: Python-for-data-science skills<\/h3>\n<p id=\"447e\">The vast majority of data science roles are Python-based, so that\u2019s what I\u2019ll focus on here. A few tools distinguish novices from job-ready pros when it comes to Python for DS. They\u2019re great differentiators if you want to build outstanding projects that get noticed by employers.<\/p>\n<p id=\"b2a8\">To force yourself to improve your data science theory and implementation game, use these in a few projects, if you haven\u2019t already:<\/p>\n<ul>\n<li id=\"6884\"><strong>Data exploration.<\/strong>\u00a0You should have\u00a0<code>pandas<\/code>\u00a0functions like\u00a0<code>.corr()<\/code>,\u00a0<code>scatter_matrix()<\/code>,\u00a0<code>.hist()<\/code>\u00a0and\u00a0<code>.plot.bar()<\/code>\u00a0on the tip of your tongue. You should always be looking for opportunities to visualize your data using PCA or t-SNE, using\u00a0<code>sklearn<\/code>\u2019s\u00a0<code>PCA<\/code>\u00a0and\u00a0<code>TSNE<\/code>\u00a0classes.<\/li>\n<li id=\"94c4\"><strong>Feature selection.\u00a0<\/strong>90% of the time, your dataset will have way more features than you need (which leads to excessive training time, and a heightened risk of overfitting). 
Get familiar with basic filter methods (look up scikit-learn\u2019s\u00a0<code>VarianceThreshold<\/code>\u00a0and\u00a0<code>SelectKBest<\/code>\u00a0classes), and more sophisticated model-based feature selection methods (look up\u00a0<code>SelectFromModel<\/code>).<\/li>\n<li id=\"fd8d\"><strong>Hyperparameter search for model optimization.\u00a0<\/strong>You definitely should know what\u00a0<code>GridSearchCV<\/code>\u00a0does and how it works. Likewise for\u00a0<code>RandomizedSearchCV<\/code>. To really stand out, try experimenting with\u00a0<code>skopt<\/code>\u2019s\u00a0<code>BayesSearchCV<\/code>\u00a0to learn how you can apply Bayesian optimization to your hyperparameter search.<\/li>\n<li id=\"76e0\"><strong>Pipelines.<\/strong>\u00a0Use\u00a0<code>sklearn<\/code>\u2019s\u00a0<code>pipeline<\/code>\u00a0module to wrap your preprocessing, feature selection and modeling steps together. Discomfort with\u00a0<code>pipeline<\/code>\u00a0is a huge tell that a data scientist needs to get more familiar with their modeling toolkit.<\/li>\n<\/ul>\n<h3 id=\"15d2\">Reason 2: probability and statistics knowledge<\/h3>\n<p id=\"edde\">Probability and statistics don\u2019t always come up explicitly on the job, but they\u2019re foundational to all data science work. As a result, it\u2019s easy to bomb an interview if you haven\u2019t read up on:<\/p>\n<ul>\n<li id=\"547a\"><strong>Bayes\u2019s theorem.<\/strong>\u00a0It\u2019s a foundational pillar of probability theory, and it comes up all the time in interviews. 
You should practice doing some basic Bayes\u2019s theorem whiteboarding problems, and read the first chapter of\u00a0<a href=\"http:\/\/www.med.mcgill.ca\/epidemiology\/hanley\/bios601\/GaussianModel\/JaynesProbabilityTheory.pdf\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"http:\/\/www.med.mcgill.ca\/epidemiology\/hanley\/bios601\/GaussianModel\/JaynesProbabilityTheory.pdf\" data->this famous book<\/a>\u00a0to get a rock-solid understanding of the origin and meaning of the rule (bonus: it\u2019s actually a fun read!).<\/li>\n<li id=\"a3d6\"><strong>Basic probability.\u00a0<\/strong>You should be able to answer questions\u00a0<a href=\"https:\/\/github.com\/kojino\/120-Data-Science-Interview-Questions\/blob\/master\/probability.md\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/github.com\/kojino\/120-Data-Science-Interview-Questions\/blob\/master\/probability.md\" data->like these<\/a>.<\/li>\n<li id=\"ab67\"><strong>Model evaluation.<\/strong>\u00a0In classification problems, for example, most n00bs default to using model accuracy as their metric, which is usually\u00a0<a href=\"https:\/\/stats.stackexchange.com\/questions\/312780\/why-is-accuracy-not-the-best-measure-for-assessing-classification-models\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/stats.stackexchange.com\/questions\/312780\/why-is-accuracy-not-the-best-measure-for-assessing-classification-models\" data->a terrible choice<\/a>. Get comfortable with\u00a0<code>sklearn<\/code>\u2019s\u00a0<code>precision_score<\/code>,\u00a0<code>recall_score<\/code>,\u00a0<code>f1_score<\/code>, and\u00a0<code>roc_auc_score<\/code>\u00a0functions, and the theory behind them. 
For regression tasks,\u00a0<a href=\"https:\/\/towardsdatascience.com\/how-to-select-the-right-evaluation-metric-for-machine-learning-models-part-1-regrression-metrics-3606e25beae0\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/towardsdatascience.com\/how-to-select-the-right-evaluation-metric-for-machine-learning-models-part-1-regrression-metrics-3606e25beae0\" data->understanding why you would use<\/a>\u00a0<code>mean_squared_error<\/code>\u00a0rather than\u00a0<code>mean_absolute_error<\/code>\u00a0(and vice-versa) is also crucial. It\u2019s really worth taking the time to check out all the model evaluation metrics listed in\u00a0<code>sklearn<\/code>\u2019s\u00a0<a href=\"https:\/\/scikit-learn.org\/stable\/modules\/model_evaluation.html\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/scikit-learn.org\/stable\/modules\/model_evaluation.html\" data->official documentation<\/a>.<\/li>\n<\/ul>\n<h3 id=\"427b\">Reason 3: software engineering know-how<\/h3>\n<p id=\"bf78\">Increasingly, data scientists are required to take on software engineering work. Many employers insist that applicants understand how to manage their code and keep clean notebooks and scripts. In particular:<\/p>\n<ul>\n<li id=\"7a2f\"><strong>Version control.<\/strong>\u00a0You should know how to use\u00a0<code>git<\/code>\u00a0and interact with your remote GitHub repos using the command line. If you don\u2019t, I suggest starting with\u00a0<a href=\"https:\/\/product.hubspot.com\/blog\/git-and-github-tutorial-for-beginners\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/product.hubspot.com\/blog\/git-and-github-tutorial-for-beginners\" data->this tutorial<\/a>.<\/li>\n<li id=\"92e7\"><strong>Web development.<\/strong>\u00a0Some companies like their data scientists to be comfortable accessing data that\u2019s stored on their web app, or via an API. 
Getting comfortable with the basics of web development is important, and the best way to do that is to\u00a0<a href=\"https:\/\/www.freecodecamp.org\/news\/how-to-build-a-web-application-using-flask-and-deploy-it-to-the-cloud-3551c985e492\/\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/www.freecodecamp.org\/news\/how-to-build-a-web-application-using-flask-and-deploy-it-to-the-cloud-3551c985e492\/\" data->learn a bit of Flask<\/a>.<\/li>\n<li id=\"f275\"><strong>Web scraping.<\/strong>\u00a0Sort of related to web development: sometimes, you\u2019ll need to automate data collection by scraping data from live websites. Two great tools to consider for this are\u00a0<code>BeautifulSoup<\/code>\u00a0and\u00a0<code>scrapy<\/code>.<\/li>\n<li id=\"4f74\"><strong>Clean code.<\/strong>\u00a0Learn how to use docstrings. Don\u2019t overuse inline comments. Break your functions up into smaller functions. Way smaller. There shouldn\u2019t be functions in your code longer than 10 lines of code. Give your functions good, descriptive names (<code>function_1<\/code>\u00a0is not a good name). Follow Pythonic convention and name your variables with underscores\u00a0<code>like_this<\/code>\u00a0and not\u00a0<code>LikeThis<\/code>\u00a0or\u00a0<code>likeThis<\/code>. Don\u2019t write Python modules (<code>.py<\/code>\u00a0files) with more than 400 lines of code. Each module should have a clear purpose (e.g.\u00a0<code>data_processing.py<\/code>,\u00a0<code>predict.py<\/code>). Learn what an\u00a0<code>if __name__ == '__main__':<\/code>\u00a0code block does and\u00a0<a href=\"https:\/\/stackoverflow.com\/questions\/419163\/what-does-if-name-main-do\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/stackoverflow.com\/questions\/419163\/what-does-if-name-main-do\" data->why it\u2019s important<\/a>. 
Use list comprehensions.\u00a0<a href=\"https:\/\/medium.com\/python-pandemonium\/never-write-for-loops-again-91a5a4c84baf\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/medium.com\/python-pandemonium\/never-write-for-loops-again-91a5a4c84baf\" data->Don\u2019t overuse<\/a>\u00a0<code>for<\/code>\u00a0loops. Add a\u00a0<code>README<\/code>\u00a0file to your project.<\/li>\n<\/ul>\n<h3 id=\"6398\">Reason 4: business\u00a0instinct<\/h3>\n<p id=\"678b\">An alarming number of people seem to think that getting hired is about showing that you\u2019re the most technically competent applicant for a role. It\u2019s not. In reality, companies want to hire people who can help them make more money, faster.<\/p>\n<p id=\"e06f\">In general that means moving beyond just technical ability, and building a number of additional skills:<\/p>\n<ul>\n<li id=\"41b0\"><strong>Making something people want.<\/strong>\u00a0When most people are in \u201cdata science learning mode\u201d, they follow a very predictable series of steps: import data, explore data, clean data, visualize data, model data, evaluate model. And that\u2019s fine when you\u2019re focused on learning a new library or technique, but going on autopilot is a really bad habit in a business environment, where everything you do costs the company time (money). You\u2019ll want to get good at thinking like a business, and making good guesses as to how you can best leverage your time to make meaningful contributions to your team and company. A great way to do this is to decide on some questions that you want your data science projects to answer before you begin them (so that you don\u2019t get carried away with irrelevant tasks that form part of the otherwise \u201cstandard\u201d DS workflow). 
Make these questions as practical as possible, and after you\u2019ve completed your project, reflect on how well you were able to answer them.<\/li>\n<li id=\"c8f3\"><strong>Asking the right questions.<\/strong>\u00a0Companies want to hire people who are able to keep the big picture in mind while they tune their models, and ask themselves questions like, \u201cam I building this because it\u2019s going to be legitimately helpful to my team and company, or because it\u2019s a cool use case for an algorithm I really like?\u201d and \u201cwhat key business metric am I trying to optimize, and is there a better way to do that?\u201d.<\/li>\n<li id=\"62b3\"><strong>Explaining your results.<\/strong>\u00a0Management needs you to tell them what products are selling well, or which users are leaving for a competitor and why, but they have no idea (and don\u2019t care) what a precision\/recall curve is, or how hard it was for you to avoid overfitting your model. For that reason, a key skill is the ability to convey your results and their implications to nontechnical audiences. Try building a project and explaining it to a friend who hasn\u2019t taken math since high school (hint: your explanation shouldn\u2019t involve any algorithm names, or refer to hyperparameter tuning. Simple words are better words.).<\/li>\n<\/ul>\n<p id=\"4adb\">Of course no list like this one can be exhaustive, but from what I\u2019ve seen coaching hundreds of early-career data scientists through the job application and interview process (and from talking to our hiring partners themselves), it probably accounts for 70% of the rejections people get.<\/p>\n<p id=\"46c1\">Keep in mind that other, less well-defined things like personality fit can often be a factor, too. If you didn\u2019t get along with your interviewer, or if the conversation felt strained or awkward, it\u2019s always possible that your technical qualifications are solid, but that you didn\u2019t check the culture-fit box. 
Companies regularly turn down applicants who would have been amazing technical performers for exactly this reason, so don\u2019t take a rejection or two too much to heart!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Companies regularly turn down applicants who would have been amazing technical performers. This leads to a huge number of confused, directionless aspiring data scientists. But here&rsquo;s some good news: there aren&rsquo;t actually that many reasons why applicants get turned down from data science roles, and there&rsquo;s a lot you can do to cover those bases. And those reasons\u200a&mdash;\u200athe technical and nontechnical skills that most applicants don&rsquo;t have but that companies most badly want\u200a&mdash;\u200aare what this post is all about.<\/p>\n","protected":false},"author":251,"featured_media":3203,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[187],"tags":[94],"ppma_author":[2882],"class_list":["post-1800","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-data-science"],"authors":[{"term_id":2882,"user_id":251,"is_guest":0,"slug":"jeremie-harris","display_name":"Jeremie Harris","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","author_category":"","user_url":"","last_name":"Harris","first_name":"Jeremie","job_title":"","description":"Jeremie Harris is Co-Founder at <a href=\"https:\/\/www.sharpestminds.com\/\">SharpestMinds<\/a> that finds new grads their first jobs in machine learning and data science. 
He has many publications to his credit."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1800","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/251"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=1800"}],"version-history":[{"count":0,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1800\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/3203"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=1800"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=1800"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=1800"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=1800"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}