{"id":22481,"date":"2020-12-03T11:45:00","date_gmt":"2020-12-03T11:45:00","guid":{"rendered":"https:\/\/www.experfy.com\/blog\/a-laymans-guide-to-data-science-part-3-data-science-workflow\/"},"modified":"2021-05-21T03:29:50","modified_gmt":"2021-05-21T03:29:50","slug":"a-laymans-guide-to-data-science-part-3-data-science-workflow","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/a-laymans-guide-to-data-science-part-3-data-science-workflow\/","title":{"rendered":"A Layman\u2019s Guide to Data Science Part 3: Data Science Workflow"},"content":{"rendered":"\n<p>This is Part 3 of this series. Here is Part 1:&nbsp;<a href=\"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/a-laymans-guide-to-data-science-how-to-become-a-good-data-scientist\/\" target=\"_blank\" rel=\"noreferrer noopener\">How to Become a (Good) Data Scientist \u2013 Beginner Guide<\/a>&nbsp;and Part 2:&nbsp;<a href=\"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/a-laymans-guide-to-data-science-part-2-how-to-build-a-data-project\/\" target=\"_blank\" rel=\"noreferrer noopener\">A Layman\u2019s Guide to Data Science. How to Build a Data Project<\/a>.<\/p>\n\n\n\n<p>By now, you have already gained enough&nbsp;<a href=\"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/a-laymans-guide-to-data-science-how-to-become-a-good-data-scientist\/\" target=\"_blank\" rel=\"noreferrer noopener\">knowledge and skills about Data Science<\/a>&nbsp;and&nbsp;<a href=\"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/a-laymans-guide-to-data-science-part-2-how-to-build-a-data-project\/\" target=\"_blank\" rel=\"noreferrer noopener\">have built your first (or even your second and third) project<\/a>. At this point, it is time to improve your workflow to facilitate the further development process.<\/p>\n\n\n\n<p id=\"a9cf\">There is no specific template for solving any data science problem (otherwise you\u2019d see it in the first textbook you come across). 
Each new dataset and each new problem will lead to a different roadmap. However, there are similar high-level steps in many different projects.<\/p>\n\n\n\n<p id=\"91d6\">In this post, we offer a clean workflow that can be used as a basis for data science projects. Every stage and step in it, of course, can be addressed on its own and can even be implemented by different specialists in larger-scale projects.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img decoding=\"async\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0vPCngxWkeuoo0hGB.png\" alt=\"A Layman\u2019s Guide to Data Science Part 3: Data Science Workflow\"\/><figcaption>Data science workflow<\/figcaption><\/figure><\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"3d18\"><strong>Framing the problem and the goals<\/strong><\/h2>\n\n\n\n<p id=\"aa83\">As you already know, at the starting point, you\u2019re asking questions and trying to get a handle on what data you need. Therefore, think of the problem you are trying to solve. What do you want to learn more about? For now, forget about modeling, evaluation metrics, and other data science concerns. Clearly stating your problem and defining your goals is the first step toward a good solution. Without it, you could lose your way in the data-science forest.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"fb6e\">Data Preparation Phase<\/h2>\n\n\n\n<p id=\"86c0\">In any Data Science project, getting the right kind of data is critical. 
Before any analysis can be done, you must acquire the relevant data, reformat it into a form amenable to computation, and clean it.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img decoding=\"async\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/09ERHYodgaPiEUBre.png\" alt=\"A Layman\u2019s Guide to Data Science Part 3: Data Science Workflow\"\/><figcaption><strong>Acquire data<\/strong><\/figcaption><\/figure><\/div>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"39ef\">The first step in any data science workflow is to acquire the data to analyze. Data can come from a variety of sources:<\/h3>\n\n\n\n<ul class=\"wp-block-list\"><li>imported from CSV files on your local machine;<\/li><li>queried from SQL servers;<\/li><li>scraped from online repositories such as public websites;<\/li><li>streamed on demand from online sources via an API;<\/li><li>automatically generated by physical apparatus, such as scientific lab equipment attached to computers;<\/li><li>generated by computer software, such as logs from a web server.<\/li><\/ul>\n\n\n\n<p id=\"f0cc\">In many cases, collecting data can become messy, especially if the data isn\u2019t something people have been collecting in an organized fashion. You\u2019ll have to work with different sources and apply a variety of tools and methods to collect a dataset.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"6de3\">There are several key points to remember while collecting data:<\/h3>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img decoding=\"async\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0vdgF6Syk_7pr6s5U.png\" alt=\"Image for post\"\/><\/figure><\/div>\n\n\n\n<p id=\"2383\"><strong><em>Data provenance<\/em><\/strong>: It is important to accurately track provenance, i.e., where each piece of data comes from and whether it is still up to date, since data often needs to be re-acquired later to run new experiments. 
Re-acquisition can be helpful if the original data sources get updated or if researchers want to test alternate hypotheses. Moreover, provenance can be used to trace downstream analysis errors back to the original data sources.<\/p>\n\n\n\n<p id=\"fbdd\"><strong><em>Data management<\/em><\/strong>: To avoid data duplication and confusion between different versions, it is critical to assign proper names to the data files that you create or download and then organize those files into directories. When new versions of those files are created, corresponding names should be assigned to all versions so that you can keep track of their differences. For instance, scientific lab equipment can generate hundreds or thousands of data files that scientists must name and organize before running computational analyses on them.<\/p>\n\n\n\n<p id=\"6080\"><strong><em>Data storage<\/em><\/strong>: With modern, almost limitless access to data, it often happens that there is so much data that it cannot fit on a hard drive, so it must be stored on remote servers. While cloud services are gaining popularity, a significant amount of data analysis is still done on desktop machines with data sets that fit on modern hard drives (i.e., less than a terabyte).<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img decoding=\"async\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0g_R2UiA9_nuBynsA.png\" alt=\"A Layman\u2019s Guide to Data Science Part 3: Data Science Workflow\"\/><figcaption><strong>Reformat and clean data<\/strong><\/figcaption><\/figure><\/div>\n\n\n\n<p id=\"c95d\">Raw data is usually not in a convenient format to run an analysis, since it was formatted by somebody else without that analysis in mind. 
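To make this concrete, here is a small pandas sketch of the kind of reformatting and cleaning involved. Every column name and value below is invented purely for illustration; this is one possible approach, not a prescribed recipe.

```python
import pandas as pd

# Hypothetical raw data with inconsistent formatting and missing entries,
# as often found in files formatted without the analysis in mind.
raw = pd.DataFrame({
    "price": ["$100", "$250", None, "$75"],
    "quantity": ["3", "5", "2", None],
})

# Reformat: strip characters from strings and convert text to numbers.
clean = pd.DataFrame({
    "price": raw["price"].str.lstrip("$").astype(float),
    "quantity": pd.to_numeric(raw["quantity"]),
})

# A simple form of missing data imputation: fill gaps with column means.
clean = clean.fillna(clean.mean())
print(clean)
```

In a real project the stripping, conversion, and imputation rules would of course depend on what each column actually means.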
Moreover, raw data often contains semantic errors, missing entries, or inconsistent formatting, so it needs to be \u201ccleaned\u201d prior to analysis.<\/p>\n\n\n\n<p id=\"e191\"><strong><em>Data wrangling (munging)<\/em><\/strong>&nbsp;is the process of cleaning data, putting everything together into one workspace, and making sure the data has no faults in it. It is possible to reformat and clean the data either manually or by writing scripts. Getting all of the values into the correct format can involve stripping characters from strings, converting integers to floats, or many other things. Afterwards, it is necessary to deal with the missing and null values that are common in sparse matrices. The process of handling them is called&nbsp;<strong><em>missing data imputation<\/em><\/strong>, in which missing entries are replaced with substituted values.<\/p>\n\n\n\n<p id=\"e01d\"><strong><em>Data integration<\/em><\/strong>&nbsp;is a related challenge: data from all sources needs to be integrated into a central repository, such as a MySQL relational database, which serves as the master data source for your analyses.<\/p>\n\n\n\n<p id=\"d86e\">Data preparation usually consumes a lot of time and cannot be fully automated, but at the same time, it can provide insights into the data\u2019s structure and quality, as well as into the models and analyses that might be optimal to apply.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img decoding=\"async\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/06yOPCxhAou-IQqYY.png\" alt=\"Explore the data\"\/><figcaption><strong>Explore the data<\/strong><\/figcaption><\/figure><\/div>\n\n\n\n<p id=\"3f2f\">Here\u2019s where you\u2019ll start getting summary-level insights into what you\u2019re looking at and extracting the large trends. At this step, there are three dimensions to explore: does the data call for supervised or unsupervised learning? Is this a classification problem or a regression problem? 
Is this a prediction problem or an inference problem? These three sets of questions can offer a lot of guidance when solving your data science problem.<\/p>\n\n\n\n<p id=\"fdc5\">There are many tools that help you understand your data quickly. You can start by checking out the first few rows of the data frame to get an initial impression of the data organization. Automatic tools incorporated in multiple libraries, such as Pandas\u2019 .describe(), can quickly give you the count, mean, and standard deviation of each column, and you might already see things worth diving deeper into. With this information, you\u2019ll be able to determine which variable is your target and which features appear important.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"854e\">Analysis Phase<\/h2>\n\n\n\n<p id=\"bbc6\">Analysis is the core phase of data science: it includes writing, executing, and refining computer programs to analyze and obtain insights from the data prepared in the previous phase. Though there are many programming languages for data science projects, ranging from interpreted \u201cscripting\u201d languages such as Python, Perl, R, and MATLAB to compiled ones such as Java, C, C++, or even Fortran, the workflow for writing analysis software is similar across the languages.<\/p>\n\n\n\n<p id=\"f1a1\">As you can see, analysis is a repeated&nbsp;<em>iteration cycle<\/em>&nbsp;of editing scripts or programs, executing them to produce output files, inspecting the output files to gain insights and discover mistakes, debugging, and re-editing.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img decoding=\"async\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/04cOAR0F_lfuwEkjC.png\" alt=\"Baseline Modeling\"\/><figcaption><strong>Baseline Modeling<\/strong><\/figcaption><\/figure><\/div>\n\n\n\n<p id=\"d23f\">As a data scientist, you will build a lot of models with a variety of algorithms to perform different tasks. 
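As a rough sketch of what such a baseline can look like, the snippet below fits a logistic regression on synthetic scikit-learn data and evaluates it with the metrics mentioned in this guide. The dataset and all numbers are illustrative stand-ins, not part of any real project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic classification data standing in for a real, cleaned dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Hold out a test set so the model is judged on data it never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Logistic regression as a simple, well-understood classification baseline.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate with task-appropriate metrics.
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```

For a regression problem, the same skeleton would swap in LinearRegression and R-squared as the metric.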
On a first approach to the task, it is worthwhile to avoid advanced, complicated models and to stick to simpler, more traditional ones:&nbsp;<strong><em>linear regression<\/em><\/strong>&nbsp;for regression problems and&nbsp;<strong><em>logistic regression<\/em><\/strong>&nbsp;for classification problems, as a baseline upon which you can improve.<\/p>\n\n\n\n<p id=\"7cc9\">At the model preprocessing stage, you can separate the features from the dependent variable, scale the data, and use a train-test split or cross-validation to prevent overfitting, the problem of a model tracking the training data too closely and failing to perform well on new data.<\/p>\n\n\n\n<p id=\"4f98\">With the model ready, it can be fitted on the training data and tested by having it predict\u00a0<em>y\u00a0<\/em>values for the\u00a0<em>X_test<\/em>\u00a0data. Finally, the model is evaluated with the help of\u00a0<a href=\"https:\/\/www.saedsayad.com\/model_evaluation_r.htm\" target=\"_blank\" rel=\"noreferrer noopener\">metrics<\/a>\u00a0that are appropriate for the task, such as\u00a0<a href=\"https:\/\/blog.minitab.com\/blog\/adventures-in-statistics-2\/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit\" target=\"_blank\" rel=\"noreferrer noopener\">R-squared<\/a>\u00a0for regression problems and\u00a0<a href=\"https:\/\/blog.exsilio.com\/all\/accuracy-precision-recall-f1-score-interpretation-of-performance-measures\/\" target=\"_blank\" rel=\"noreferrer noopener\">accuracy<\/a>\u00a0or\u00a0<a href=\"https:\/\/developers.google.com\/machine-learning\/crash-course\/classification\/roc-and-auc\" target=\"_blank\" rel=\"noreferrer noopener\">ROC-AUC<\/a>\u00a0scores for classification tasks.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img decoding=\"async\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0-YCl3CQbg8UCxoeZ.png\" alt=\"A Layman\u2019s Guide to Data Science 
Workflow\"\/><\/figure><\/div>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img decoding=\"async\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0FrWcwJEVh4dAGBN2.png\" alt=\"Secondary Modeling\"\/><figcaption><strong>Secondary Modeling<\/strong><\/figcaption><\/figure><\/div>\n\n\n\n<p id=\"14b7\">Now it is time to go into a deeper analysis and, if necessary, use more advanced models, such as&nbsp;<strong>neural networks<\/strong>,&nbsp;<strong>XGBoost<\/strong>, or Random Forests. It is important to remember that such models can initially render worse results than simple, easy-to-understand models, either because a small dataset cannot provide enough training examples or because of collinearity between features that provide similar information.<\/p>\n\n\n\n<p id=\"f4ba\">Therefore, the key task of the secondary modeling step is parameter tuning. Each algorithm has a set of parameters you can optimize. Parameters are the variables that a machine learning technique uses to adjust to the data. Hyperparameters, the variables that govern the training process itself, such as the number of nodes or hidden layers in a neural network, are tuned by running the whole training job, looking at the aggregate accuracy, and adjusting.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"39e2\">Reflection Phase<\/h2>\n\n\n\n<p id=\"882e\">Data scientists frequently alternate between the&nbsp;<em>analysis<\/em>&nbsp;and&nbsp;<em>reflection<\/em>&nbsp;phases: whereas the analysis phase focuses on programming, the reflection phase involves thinking and communicating about the outputs of analyses. After inspecting a set of output files, a data scientist or a group of data scientists can make comparisons between output variants and explore alternative paths by adjusting script code and\/or execution parameters. 
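One common way to automate the tune-train-compare loop described above is a cross-validated grid search. The sketch below uses scikit-learn with a deliberately tiny, arbitrary parameter grid; the dataset and grid values are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for a real, prepared dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Candidate hyperparameter values: every combination is trained and
# scored with cross-validation, and the best one is kept.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

For large grids or expensive models, randomized search over the same parameter space is a cheaper alternative to exhaustive enumeration.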
Much of the data analysis process is trial and error: a scientist runs tests, graphs the output, reruns the tests, graphs the output, and so on. Therefore, graphs are the central comparison tool: they can be displayed side by side on monitors to compare and contrast their characteristics visually. A supplementary tool is taking notes, both physical and digital, to keep track of the line of thought and of the experiments.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"c0e3\"><strong>Communication Phase<\/strong><\/h2>\n\n\n\n<p id=\"6890\">The final phase of data science is disseminating results, either in the form of a data science product or as written reports such as internal memos, slideshow presentations, business\/policy white papers, or academic research publications.<\/p>\n\n\n\n<p id=\"6ee9\">A&nbsp;<strong><em>data science product<\/em><\/strong>&nbsp;implies getting your model into production. In most companies, data scientists will be working with the software engineering team to write the production code. The software can be used both to reproduce the experiments or play with the prototype systems, and as an independent solution to tackle a known issue on the market, such as assessing the risk of financial fraud.<\/p>\n\n\n\n<p id=\"ade5\">As an alternative to the data product, you can create a&nbsp;<strong>data science report<\/strong>. You can showcase your results with a presentation and offer a technical overview of the process. Remember to keep your audience in mind: go into more detail if presenting to fellow data scientists, or focus on the findings if you address the sales team or executives. If your company allows publishing the results, it is also a good opportunity to get feedback from other specialists. Additionally, you can write a blog post and push your code to GitHub so the data science community can learn from your success. 
Communicating your results is an important part of the scientific process, so this phase should not be overlooked.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>By now, you have already gained enough knowledge and skills about Data Science and have built your first (or even your second and third) project. At this point, it is time to improve your workflow to facilitate further development process.<\/p>\n","protected":false},"author":570,"featured_media":16922,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[183],"tags":[97,125,1081,94,92],"ppma_author":[3261],"class_list":["post-22481","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml","tag-artificial-intelligence","tag-business-intelligence","tag-data-analysis","tag-data-science","tag-machine-learning"],"authors":[{"term_id":3261,"user_id":570,"is_guest":0,"slug":"max-ved","display_name":"Max Ved","avatar_url":"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/04\/medium_cbaf23d5-a78a-4ceb-8f6e-343134811364-150x150.jpg","user_url":"https:\/\/sciforce.solutions\/","last_name":"Ved","first_name":"Max","job_title":"","description":"Max Ved, a Scientist Entrepreneur, is Co-Founder &amp; CTO at SciForce, an IT company specialized in the development of software 
solutions.\u00a0"}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/22481","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/570"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=22481"}],"version-history":[{"count":1,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/22481\/revisions"}],"predecessor-version":[{"id":23176,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/22481\/revisions\/23176"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/16922"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=22481"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=22481"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=22481"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=22481"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}