{"id":700,"date":"2018-05-24T02:42:54","date_gmt":"2018-05-23T23:42:54","guid":{"rendered":"http:\/\/kusuaks7\/?p=305"},"modified":"2021-05-11T13:55:23","modified_gmt":"2021-05-11T13:55:23","slug":"data-science-framework","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/data-science-framework\/","title":{"rendered":"Data Science Framework"},"content":{"rendered":"<p><strong><em>Ready to learn Data Science? <a href=\"https:\/\/www.experfy.com\/training\/courses\">Browse courses<\/a>&nbsp;like&nbsp;<a href=\"https:\/\/www.experfy.com\/training\/tracks\/data-science-training-certification\">Data Science Training and Certification<\/a> developed by industry thought leaders and Experfy in Harvard Innovation Lab.<\/em><\/strong><\/p>\n<p id=\"e76c\" name=\"e76c\" style=\"text-align: center;\"><canvas height=\"40\" width=\"75\"><\/canvas><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/0*OOZpGGVL8g-7_GAh.jpg\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/0*OOZpGGVL8g-7_GAh.jpg\" style=\"width: 650px; height: 350px;\" \/><\/p>\n<p id=\"b1ad\" name=\"b1ad\">A lot of material is available on &lsquo;how to learn machine learning (ML)\/data science (DS)?&rsquo;, but when we work on an actual ML\/DS project, we realize that the core aspects we learnt (modeling &amp; evaluation) are just a small part of the overall solution. When working as a data scientist, nobody tells us what ML\/DS problem we need to solve or what prediction we need to make. We need to understand the business process first, identify the problem, and qualify it as suitable for an ML\/DS solution.<\/p>\n<p id=\"1fb3\" name=\"1fb3\">Then we need to collect the underlying data being used by the business and assess whether it is sufficient &amp; useful enough to convert this business problem into an ML\/DS problem. 
Next, we explore the data, prepare it to be consumed by prediction algorithms\/models, &amp; evaluate model performance before deploying the model in production. In between, we also need to identify a suitable evaluation methodology &amp; agree on monitoring &amp; support activities with the business.<\/p>\n<p id=\"1ac9\" name=\"1ac9\">In this article, I will cover these aspects to give you a holistic view of a Data Science Framework built on the CRISP-DM methodology:<\/p>\n<ul>\n<li id=\"5ba3\" name=\"5ba3\">Business understanding<\/li>\n<li id=\"11f9\" name=\"11f9\">Data understanding<\/li>\n<li id=\"44be\" name=\"44be\">Data preparation<\/li>\n<li id=\"bbd0\" name=\"bbd0\">Modeling<\/li>\n<li id=\"bc2f\" name=\"bc2f\">Evaluation<\/li>\n<li id=\"eee6\" name=\"eee6\">Deployment<\/li>\n<\/ul>\n<p id=\"9500\" name=\"9500\"><em>The following three activities are performed iteratively to reach the most optimized &amp; generalized model, avoiding under-fitting or over-fitting:<\/em><\/p>\n<p id=\"ec49\" name=\"ec49\"><em>Data preparation &lt;-&gt; Modeling &lt;-&gt; Evaluation<\/em><\/p>\n<h3 id=\"ea0e\" name=\"ea0e\"><strong>Business understanding<\/strong><\/h3>\n<p id=\"78ed\" name=\"78ed\">This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data science problem definition and a preliminary plan designed to achieve the objectives.<\/p>\n<div id=\"1a59\" name=\"1a59\">-Define the business problem<\/div>\n<div name=\"f585\">-Set the criteria for success<\/div>\n<div name=\"30ab\">-Convert the business problem to a DS problem<\/div>\n<div name=\"2ea3\">-Categorize the DS problem (Classification\/Regression\/Anomaly Detection, etc.)<\/div>\n<div name=\"e6a3\">-Prepare a high-level plan to achieve results<\/div>\n<div name=\"9b97\">-Visualize the DS pipeline in the context of the objective (Evaluation Criteria\/Algorithms\/Transformations)<\/div>\n<h3 name=\"dc55\"><strong>Data 
understanding<\/strong><\/h3>\n<p id=\"ea74\" name=\"ea74\">The data understanding phase starts with initial data collection and proceeds with activities that enable you to become familiar with the data, identify data quality problems, discover first insights into the data, and\/or detect interesting subsets to form hypotheses regarding hidden information.<\/p>\n<div id=\"ef9f\" name=\"ef9f\">-Collect &amp; integrate initial data<\/div>\n<div name=\"fe31\">-Understand the attributes &amp; their relationships<\/div>\n<div name=\"05c6\">-Identify data quality issues<\/div>\n<div name=\"ad38\">-Perform EDA (Exploratory Data Analysis)<\/div>\n<h3 name=\"de7a\"><strong>Data preparation<\/strong><\/h3>\n<p id=\"ad32\" name=\"ad32\">The data preparation phase covers all activities needed to construct the final dataset from the initial raw data. Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include table, record, and attribute selection, as well as transformation and cleaning of data for modeling tools.<\/p>\n<div id=\"b5e1\" name=\"b5e1\">-Integrate data sources<\/div>\n<div name=\"d875\">-Handle missing values, outliers<\/div>\n<div name=\"ce9a\">-Apply Feature Engineering<\/div>\n<h3 name=\"5b45\"><strong>Modeling<\/strong><\/h3>\n<p id=\"c7b5\" name=\"c7b5\">In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. 
Therefore, going back to the data preparation phase is often necessary.<\/p>\n<div id=\"079d\" name=\"079d\">-Apply multiple models<\/div>\n<div name=\"66d1\">-Choose the optimal model<\/div>\n<div name=\"ecbd\">-Create a feedback pipeline<\/div>\n<div name=\"70c4\">-Ensemble\/Stack different models<\/div>\n<h3 name=\"6b2c\"><strong>Evaluation<\/strong><\/h3>\n<p id=\"357e\" name=\"357e\">Before proceeding to final deployment of the model, it is important to thoroughly evaluate it and review the steps executed to create it, to be certain the model properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data science results should be reached.<\/p>\n<div id=\"a3cc\" name=\"a3cc\">-Refine evaluation criteria<\/div>\n<div name=\"fbc3\">-Evaluate the models<\/div>\n<div name=\"0e95\">-Handle Overfitting\/Underfitting<\/div>\n<h3 name=\"ac44\"><strong>Deployment<\/strong><\/h3>\n<p id=\"6ef4\" name=\"6ef4\">Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable DS process across the enterprise. In many cases, it is the customer, not the data analyst, who carries out the deployment steps. 
However, even if the analyst will carry out the deployment effort, it is important for the customer to understand up front what actions need to be carried out in order to actually make use of the created models.<\/p>\n<div id=\"8773\" name=\"8773\">-Prepare a detailed deployment plan<\/div>\n<div name=\"8675\">-Agree on post-deployment monitoring &amp; support<\/div>\n<div name=\"70a1\">-Monitor &amp; support<\/div>\n<h3 name=\"985f\"><strong>Reference<\/strong><\/h3>\n<div id=\"8f9c\" name=\"8f9c\"><a data-href=\"https:\/\/www.the-modeling-agency.com\/crisp-dm.pdf\" href=\"https:\/\/www.the-modeling-agency.com\/crisp-dm.pdf\" rel=\"nofollow noopener noreferrer\" target=\"_blank\">CRISP-DM Guide<\/a><\/div>\n<div name=\"fd6f\"><a data-href=\"http:\/\/datascienceguide.github.io\/data-science-framework\" href=\"http:\/\/datascienceguide.github.io\/data-science-framework\" rel=\"nofollow noopener noreferrer\" target=\"_blank\">Data science framework overview<\/a><\/div>\n<div name=\"9c8d\"><a data-href=\"http:\/\/www.kdnuggets.com\/2016\/10\/edison-data-science-framework.html\" href=\"http:\/\/www.kdnuggets.com\/2016\/10\/edison-data-science-framework.html\" rel=\"nofollow noopener noreferrer\" target=\"_blank\">EDISON Data Science Framework to define the Data Science Profession<\/a><\/div>\n","protected":false},"excerpt":{"rendered":"<p>When working as a data scientist, nobody tells us what ML\/DS problem we need to solve or what prediction we need to make. We need to understand the business process first, identify the problem, and qualify it as suitable for an ML\/DS solution. Then we need to collect the underlying data being used by the business and assess whether it is sufficient &amp; useful enough to convert this business problem into an ML\/DS problem. 
This article covers these aspects to give you a holistic view of a Data Science Framework built on the CRISP-DM methodology.<\/p>\n","protected":false},"author":280,"featured_media":3751,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[187],"tags":[94],"ppma_author":[1811],"class_list":["post-700","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-data-science"],"authors":[{"term_id":1811,"user_id":280,"is_guest":0,"slug":"ankit-rathi","display_name":"Ankit Rathi","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","user_url":"","last_name":"Rathi","first_name":"Ankit","job_title":"","description":"Ankit Rathi is Lead Architect at SITA, the leading &amp; innovative IT organization in ATI, delivering end-to-end analytics platforms using Data Science, Big Data &amp; Cloud. He is a Data Science Architect with extensive experience in designing &amp; developing data-intensive technology solutions including data architecture, data science, big data &amp; 
cloud."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/700","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/280"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=700"}],"version-history":[{"count":1,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/700\/revisions"}],"predecessor-version":[{"id":6313,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/700\/revisions\/6313"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/3751"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=700"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=700"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=700"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=700"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}