{"id":22556,"date":"2021-01-12T10:01:58","date_gmt":"2021-01-12T10:01:58","guid":{"rendered":"https:\/\/www.experfy.com\/blog\/avoid-mistakes-training-machine-learning-model\/"},"modified":"2023-09-06T13:08:54","modified_gmt":"2023-09-06T13:08:54","slug":"avoid-mistakes-training-machine-learning-model","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/avoid-mistakes-training-machine-learning-model\/","title":{"rendered":"Avoid These 8 Mistakes Before Training A Machine Learning Model"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"22556\" class=\"elementor elementor-22556\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-14ec266 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"14ec266\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-b1fb491\" data-id=\"b1fb491\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-3af32b1 elementor-widget elementor-widget-text-editor\" data-id=\"3af32b1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"630f\">One of the most common misconceptions in Machine Learning is that ML Engineers get a CSV dataset and they spend the majority of the time optimizing the hyperparameters of a model.<\/p>\n\n<p id=\"64bc\">If you work in the industry, you know that\u2019s far from the truth. 
ML Engineers spend most of their time planning&nbsp;<strong>how to construct a training set that resembles the real-world data distribution for a certain problem.<\/strong><\/p>\n\n<p id=\"fed8\">Once you\u2019ve managed to construct such a training set, just add a few well-crafted features and the Machine Learning model won\u2019t have a hard time finding the decision boundary.<\/p>\n\n<p id=\"f02d\">In this article, we\u2019re going to go through 8 Machine Learning tips that will help you train a model with fewer screw-ups. These tips are most useful when you need to construct the training set yourself, e.g. when you didn\u2019t get it from Kaggle.<\/p>\n\n<p id=\"427d\">At the end of the article, I also share a link to the Jupyter Notebook template, which you can incorporate into your Machine Learning workflow.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9375f1f elementor-widget elementor-widget-heading\" data-id=\"9375f1f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Dataset sample<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c8145f3 elementor-widget elementor-widget-image\" data-id=\"c8145f3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img fetchpriority=\"high\" decoding=\"async\" width=\"1024\" height=\"682\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0YSHJ443HMwyCdSpI-1024x682.jpeg\" class=\"attachment-large size-large wp-image-18424\" alt=\"Avoid These 8 Mistakes Before Training A Machine Learning Model\" 
srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0YSHJ443HMwyCdSpI-1024x682.jpeg 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0YSHJ443HMwyCdSpI-300x200.jpeg 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0YSHJ443HMwyCdSpI-768x512.jpeg 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0YSHJ443HMwyCdSpI-1536x1024.jpeg 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0YSHJ443HMwyCdSpI-2048x1365.jpeg 2048w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0YSHJ443HMwyCdSpI-610x407.jpeg 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0YSHJ443HMwyCdSpI-750x500.jpeg 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0YSHJ443HMwyCdSpI-1140x760.jpeg 1140w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Photo by\u00a0<a href=\"https:\/\/unsplash.com\/@roadtripwithraj?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener\">Road Trip with Raj<\/a>\u00a0on\u00a0<a href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener\">Unsplash<\/a><\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-185a760 elementor-widget elementor-widget-text-editor\" data-id=\"185a760\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"677f\">It\u2019s easier to learn with examples. Let\u2019s create a sample dataset with random features.<\/p>\n\n<p id=\"2f6e\">One row represents a customer with his features and a binary target variable. 
customer_id is the index of the DataFrame.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-bfe9c43 elementor-widget elementor-widget-text-editor\" data-id=\"bfe9c43\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\"><strong>import<\/strong> numpy <strong>as<\/strong> np<br><strong>import<\/strong> pandas <strong>as<\/strong> pd<br><br>np<strong>.<\/strong>random<strong>.<\/strong>seed(42)<br>n <strong>=<\/strong> 1000<br>df <strong>=<\/strong> pd<strong>.<\/strong>DataFrame(<br>    {<br>        \"customer_id\": [\"customer_%d\" <strong>%<\/strong> i <strong>for<\/strong> i <strong>in<\/strong> range(n)],<br>        \"product_a_ratio\": np<strong>.<\/strong>random<strong>.<\/strong>random_sample(n),<br>        \"std_price_product_a\": np<strong>.<\/strong>random<strong>.<\/strong>normal(0, 1, n),<br>        \"n_purchases_product_a\": np<strong>.<\/strong>random<strong>.<\/strong>randint(0, 10, n),<br>    }<br>)<br># inject a single missing value<br>df<strong>.<\/strong>loc[100, \"std_price_product_a\"] <strong>=<\/strong> pd<strong>.<\/strong>NA<br># exact copy of product_a_ratio<br>df[\"product_b_ratio\"] <strong>=<\/strong> df[\"product_a_ratio\"]<br># random binary target<br>df[\"y\"] <strong>=<\/strong> np<strong>.<\/strong>random<strong>.<\/strong>randint(0, 2, n)<br>df<strong>.<\/strong>set_index(\"customer_id\", inplace<strong>=<\/strong>True)<br>df<strong>.<\/strong>head()<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d47165f elementor-widget elementor-widget-image\" data-id=\"d47165f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" width=\"1024\" height=\"186\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1fy9UXM_-t_wPSWZR4dFATQ-1024x186.png\" class=\"attachment-large size-large wp-image-18425\" 
alt=\"Avoid These 8 Mistakes Before Training A Machine Learning Model\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1fy9UXM_-t_wPSWZR4dFATQ-1024x186.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1fy9UXM_-t_wPSWZR4dFATQ-300x54.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1fy9UXM_-t_wPSWZR4dFATQ-768x139.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1fy9UXM_-t_wPSWZR4dFATQ-1536x279.png 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1fy9UXM_-t_wPSWZR4dFATQ-2048x372.png 2048w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1fy9UXM_-t_wPSWZR4dFATQ-610x111.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1fy9UXM_-t_wPSWZR4dFATQ-750x136.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1fy9UXM_-t_wPSWZR4dFATQ-1140x207.png 1140w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Sample dataset with features and target<\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-85f0bee elementor-widget elementor-widget-heading\" data-id=\"85f0bee\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">1. 
Check the target<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e69ef2c elementor-widget elementor-widget-image\" data-id=\"e69ef2c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" width=\"1024\" height=\"707\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0K5AFzCehxxDG3hIy-1024x707.jpeg\" class=\"attachment-large size-large wp-image-18426\" alt=\"Avoid These 8 Mistakes Before Training A Machine Learning Model\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0K5AFzCehxxDG3hIy-1024x707.jpeg 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0K5AFzCehxxDG3hIy-300x207.jpeg 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0K5AFzCehxxDG3hIy-768x530.jpeg 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0K5AFzCehxxDG3hIy-1536x1060.jpeg 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0K5AFzCehxxDG3hIy-2048x1414.jpeg 2048w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0K5AFzCehxxDG3hIy-610x421.jpeg 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0K5AFzCehxxDG3hIy-750x518.jpeg 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0K5AFzCehxxDG3hIy-1140x787.jpeg 1140w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Photo by\u00a0<a href=\"https:\/\/unsplash.com\/@jrarce?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener\">Ricardo Arce<\/a>\u00a0on\u00a0<a href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" 
rel=\"noopener\">Unsplash<\/a><\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9345ca3 elementor-widget elementor-widget-text-editor\" data-id=\"9345ca3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"1af2\">It may seem obvious that no customer should be marked as both positive and negative, but it does happen in the real world, so it\u2019s worthwhile to check.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9a0f1da elementor-widget elementor-widget-text-editor\" data-id=\"9a0f1da\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\"><strong>assert<\/strong> (<br>    len(set(df[df<strong>.<\/strong>y <strong>==<\/strong> 0]<strong>.<\/strong>index)<strong>.<\/strong>intersection(df[df<strong>.<\/strong>y <strong>==<\/strong> 1]<strong>.<\/strong>index)) <strong>==<\/strong> 0<br>), \"Positive customers have intersection with negative customers\"<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7b88c2e elementor-widget elementor-widget-text-editor\" data-id=\"7b88c2e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"be09\">I use the assert statement above in a Jupyter Notebook, where it breaks execution if there is a mistake in the training set. 
So when I construct a new training set and use the Jupyter command \u201cRestart kernel and run all cells\u201d, I can be sure that the training set has the required properties.<\/p>\n\n<p id=\"fe68\">When doing Exploratory Data Analysis (EDA), we need to be aware that real-world datasets have mistakes in unexpected places. One of the goals of EDA is to discover them.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e38719e elementor-widget elementor-widget-heading\" data-id=\"e38719e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">2. Check duplicates<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7093255 elementor-widget elementor-widget-image\" data-id=\"7093255\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"682\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0guVwH_QldOSMOAc6-1024x682.jpeg\" class=\"attachment-large size-large wp-image-18427\" alt=\"Avoid These 8 Mistakes Before Training A Machine Learning Model\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0guVwH_QldOSMOAc6-1024x682.jpeg 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0guVwH_QldOSMOAc6-300x200.jpeg 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0guVwH_QldOSMOAc6-768x512.jpeg 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0guVwH_QldOSMOAc6-1536x1024.jpeg 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0guVwH_QldOSMOAc6-2048x1365.jpeg 2048w, 
https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0guVwH_QldOSMOAc6-610x407.jpeg 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0guVwH_QldOSMOAc6-750x500.jpeg 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0guVwH_QldOSMOAc6-1140x760.jpeg 1140w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Photo by\u00a0<a href=\"https:\/\/unsplash.com\/@ferrez?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener\">Stefano Ferretti<\/a>\u00a0on\u00a0<a href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener\">Unsplash<\/a><\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-584d94a elementor-widget elementor-widget-text-editor\" data-id=\"584d94a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"249a\">How do the duplicates come into the training set?<\/p>\n\n<p id=\"1964\">Many times with joins in SQL databases!<\/p>\n\n<p id=\"6450\">E.g. you join an SQL table with another table by customer_id key. 
If either of those tables has multiple entries for a customer_id, the join will create duplicates.<\/p>\n\n<p id=\"bad4\">How can we make sure there aren\u2019t any duplicates in our training set?<\/p>\n\n<p id=\"7841\">We can use an assert statement that will break execution in case a duplicated customer_id appears:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-283870a elementor-widget elementor-widget-text-editor\" data-id=\"283870a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\"><strong>assert<\/strong> len(df[df<strong>.<\/strong>index<strong>.<\/strong>duplicated()]) <strong>==<\/strong> 0, \"There are duplicates in trainset\"<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a9bab8c elementor-widget elementor-widget-heading\" data-id=\"a9bab8c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">3. 
Check missing values<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9ce8646 elementor-widget elementor-widget-image\" data-id=\"9ce8646\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"805\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0xyUMUZXAQDMArDpt-1024x805.jpeg\" class=\"attachment-large size-large wp-image-18428\" alt=\"Avoid These 8 Mistakes Before Training A Machine Learning Model\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0xyUMUZXAQDMArDpt-1024x805.jpeg 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0xyUMUZXAQDMArDpt-300x236.jpeg 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0xyUMUZXAQDMArDpt-768x603.jpeg 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0xyUMUZXAQDMArDpt-1536x1207.jpeg 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0xyUMUZXAQDMArDpt-2048x1609.jpeg 2048w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0xyUMUZXAQDMArDpt-610x479.jpeg 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0xyUMUZXAQDMArDpt-750x589.jpeg 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0xyUMUZXAQDMArDpt-1140x896.jpeg 1140w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Photo by\u00a0<a href=\"https:\/\/unsplash.com\/@lazycreekimages?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener\">Michael Dziedzic<\/a>\u00a0on\u00a0<a href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" 
rel=\"noopener\">Unsplash<\/a><\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-734eee7 elementor-widget elementor-widget-text-editor\" data-id=\"734eee7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"4f1f\"><strong>In my experience, missing values appear for two reasons:<\/strong><\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2a63150 elementor-widget elementor-widget-text-editor\" data-id=\"2a63150\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul><li>real missing values \u2014 the customer doesn\u2019t have an entry for a certain feature,<\/li><li>mistakes in a dataset \u2014 we didn\u2019t map the NULL to the default value when constructing the training set, because we didn\u2019t expect it.<\/li><\/ul>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ba9d6c5 elementor-widget elementor-widget-text-editor\" data-id=\"ba9d6c5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"61c3\">In the latter case, we can simply fix the query and map the NULL value. 
For real <a href=\"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/working-with-missing-data-in-machine-learning\/\" target=\"_blank\" rel=\"noreferrer noopener\">missing values<\/a>, we need to know how our model handles them.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4a2d6f7 elementor-widget elementor-widget-text-editor\" data-id=\"4a2d6f7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<blockquote class=\"wp-block-quote\"><p>For example,\u00a0<a href=\"https:\/\/lightgbm.readthedocs.io\/en\/latest\/Advanced-Topics.html#missing-value-handle\" target=\"_blank\" rel=\"noreferrer noopener\">LightGBM<\/a>\u00a0supports missing values by default and we can set the desired behavior.<\/p><p>LightGBM uses NA (NaN) to represent missing values by default. Change it to use zero by setting zero_as_missing=true. When zero_as_missing=false (default), the unshown values in sparse matrices (and LightSVM) are treated as zeros. 
When zero_as_missing=true, NA and zeros (including unshown values in sparse matrices (and LightSVM)) are treated as missing.<\/p><\/blockquote>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ec55aac elementor-widget elementor-widget-text-editor\" data-id=\"ec55aac\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"a7f6\">Let\u2019s check if our dataset contains missing values.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-29345bd elementor-widget elementor-widget-text-editor\" data-id=\"29345bd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\"><strong>for<\/strong> col <strong>in<\/strong> df<strong>.<\/strong>columns:<br>    <strong>assert<\/strong> df[df[col]<strong>.<\/strong>isnull()]<strong>.<\/strong>shape[0] <strong>==<\/strong> 0, \"%s col has %d missing values\" <strong>%<\/strong> (col, df[df[col]<strong>.<\/strong>isnull()]<strong>.<\/strong>shape[0])<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5059f00 elementor-widget elementor-widget-image\" data-id=\"5059f00\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"154\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1W7L9rgD51DTYN7eesfVqPQ-1024x154.png\" class=\"attachment-large size-large wp-image-18429\" alt=\"Avoid These 8 Mistakes Before Training A Machine Learning Model\" 
srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1W7L9rgD51DTYN7eesfVqPQ-1024x154.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1W7L9rgD51DTYN7eesfVqPQ-300x45.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1W7L9rgD51DTYN7eesfVqPQ-768x115.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1W7L9rgD51DTYN7eesfVqPQ-1536x231.png 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1W7L9rgD51DTYN7eesfVqPQ-2048x307.png 2048w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1W7L9rgD51DTYN7eesfVqPQ-610x92.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1W7L9rgD51DTYN7eesfVqPQ-750x113.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1W7L9rgD51DTYN7eesfVqPQ-1140x171.png 1140w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a4517d8 elementor-widget elementor-widget-text-editor\" data-id=\"a4517d8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"3a54\">std_price_product_a column has a single missing value. 
Let\u2019s remove the entry and rerun the check.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5ac1824 elementor-widget elementor-widget-text-editor\" data-id=\"5ac1824\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">df <strong>=<\/strong> df[df['std_price_product_a']<strong>.<\/strong>notnull()]<strong>.<\/strong>copy()<br><strong>for<\/strong> col <strong>in<\/strong> df<strong>.<\/strong>columns:<br>    <strong>assert<\/strong> df[df[col]<strong>.<\/strong>isnull()]<strong>.<\/strong>shape[0] <strong>==<\/strong> 0, \"%s col has %d missing values\" <strong>%<\/strong> (col, df[df[col]<strong>.<\/strong>isnull()]<strong>.<\/strong>shape[0])<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-932bf98 elementor-widget elementor-widget-text-editor\" data-id=\"932bf98\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"cfb7\">By now, we see that these checks can be very useful. We didn\u2019t spend a second on debugging! These checks notify us right away about the unexpected values in the dataset.<\/p>\n\n<p id=\"cdac\">When a missing value is expected for a certain feature, we can whitelist it so the check won\u2019t break execution next time.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-86dc024 elementor-widget elementor-widget-heading\" data-id=\"86dc024\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">4. 
Check feature scales<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5f1bd3e elementor-widget elementor-widget-image\" data-id=\"5f1bd3e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"682\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0c-X_m-MDvPpqGo8L-1024x682.jpeg\" class=\"attachment-large size-large wp-image-18430\" alt=\"Avoid These 8 Mistakes Before Training A Machine Learning Model\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0c-X_m-MDvPpqGo8L-1024x682.jpeg 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0c-X_m-MDvPpqGo8L-300x200.jpeg 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0c-X_m-MDvPpqGo8L-768x512.jpeg 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0c-X_m-MDvPpqGo8L-1536x1024.jpeg 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0c-X_m-MDvPpqGo8L-2048x1365.jpeg 2048w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0c-X_m-MDvPpqGo8L-610x407.jpeg 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0c-X_m-MDvPpqGo8L-750x500.jpeg 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0c-X_m-MDvPpqGo8L-1140x760.jpeg 1140w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Photo by\u00a0<a href=\"https:\/\/unsplash.com\/@jdent?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener\">Jason Dent<\/a>\u00a0on\u00a0<a href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" 
rel=\"noopener\">Unsplash<\/a><\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-69a7e76 elementor-widget elementor-widget-text-editor\" data-id=\"69a7e76\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"092c\">When working on feature engineering, we define certain features on a 0\u20131 scale or some other scale. It\u2019s worthwhile to check whether each feature lies within the desired boundaries.<\/p>\n\n<p id=\"835f\">In this example, we only check if features are on a scale between 0 and 1, but I would suggest you add more checks that are appropriate for your dataset.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7e13e8d elementor-widget elementor-widget-text-editor\" data-id=\"7e13e8d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">features_on_0_1_scale <strong>=<\/strong> [<br>    'product_a_ratio',<br>    'product_b_ratio',<br>    'y',<br>]<br><strong>for<\/strong> col <strong>in<\/strong> features_on_0_1_scale:<br>    <strong>assert<\/strong> df[col]<strong>.<\/strong>min() <strong>&gt;=<\/strong> 0 <strong>and<\/strong> df[col]<strong>.<\/strong>max() <strong>&lt;=<\/strong> 1, \"%s is not on 0 - 1 scale\" <strong>%<\/strong> col<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9f2e329 elementor-widget elementor-widget-heading\" data-id=\"9f2e329\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">5. 
Check feature types<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-58557ee elementor-widget elementor-widget-image\" data-id=\"58557ee\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"682\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/049B11RXsS3o0CCst-1024x682.jpeg\" class=\"attachment-large size-large wp-image-18431\" alt=\"Avoid These 8 Mistakes Before Training A Machine Learning Model\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/049B11RXsS3o0CCst-1024x682.jpeg 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/049B11RXsS3o0CCst-300x200.jpeg 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/049B11RXsS3o0CCst-768x512.jpeg 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/049B11RXsS3o0CCst-1536x1024.jpeg 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/049B11RXsS3o0CCst-2048x1365.jpeg 2048w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/049B11RXsS3o0CCst-610x407.jpeg 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/049B11RXsS3o0CCst-750x500.jpeg 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/049B11RXsS3o0CCst-1140x760.jpeg 1140w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Photo by\u00a0<a href=\"https:\/\/unsplash.com\/@nhillier?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener\">Nick Hillier<\/a>\u00a0on\u00a0<a href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" 
rel=\"noopener\">Unsplash<\/a><\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0dd3c6f elementor-widget elementor-widget-text-editor\" data-id=\"0dd3c6f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"33d2\">Before you start training the model, I suggest you explicitly set the data type for every feature. At first, it might feel redundant, but you will thank me later.<\/p>\n\n<p id=\"6d1b\">We can set the feature types in a for loop:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e66bc0d elementor-widget elementor-widget-text-editor\" data-id=\"e66bc0d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">feature_types <strong>=<\/strong> {<br>    \"product_a_ratio\": \"float64\",<br>    \"std_price_product_a\": \"float64\",<br>    \"n_purchases_product_a\": \"int64\",<br>    \"product_b_ratio\": \"float64\",<br>    \"y\": \"int64\",<br>}<br><br><strong>for<\/strong> feature, dtype <strong>in<\/strong> feature_types<strong>.<\/strong>items():<br>    df[feature] <strong>=<\/strong> df[feature]<strong>.<\/strong>astype(dtype)<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-08c185d elementor-widget elementor-widget-text-editor\" data-id=\"08c185d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"a1b0\">Why is this useful? 
What happens to the integer data type when we add a missing value?<\/p>\n\n<p id=\"1246\">The column data type changes from integer to the object data type. When we convert it to a NumPy array, it holds floats instead of integers. The classifier could misinterpret ordinal or categorical features as continuous features.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-456806b elementor-widget elementor-widget-heading\" data-id=\"456806b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\">Let\u2019s look at an example below:<\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0fa3846 elementor-widget elementor-widget-text-editor\" data-id=\"0fa3846\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">df<strong>.<\/strong>n_purchases_product_a<strong>.<\/strong>values[:10]<br><br># output<br>array([0, 8, 0, 2, 7, 2, 3, 7, 0, 5])<br><br># add NA to first row<br>df<strong>.<\/strong>loc[0, \"n_purchases_product_a\"] <strong>=<\/strong> pd<strong>.<\/strong>NA<br>df<strong>.<\/strong>n_purchases_product_a[:10]<strong>.<\/strong>values<br><br># output<br>array([0.0, 8.0, 0.0, 2.0, 7.0, 2.0, 3.0, 7.0, 0.0, 5.0], dtype=object)<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d17a7d9 elementor-widget elementor-widget-heading\" data-id=\"d17a7d9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">6. 
Unique features<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-691501a elementor-widget elementor-widget-image\" data-id=\"691501a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"682\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Q9zZ5EWHcPYzXs_f-1024x682.jpeg\" class=\"attachment-large size-large wp-image-18432\" alt=\"Avoid These 8 Mistakes Before Training A Machine Learning Model\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Q9zZ5EWHcPYzXs_f-1024x682.jpeg 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Q9zZ5EWHcPYzXs_f-300x200.jpeg 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Q9zZ5EWHcPYzXs_f-768x512.jpeg 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Q9zZ5EWHcPYzXs_f-1536x1024.jpeg 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Q9zZ5EWHcPYzXs_f-2048x1365.jpeg 2048w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Q9zZ5EWHcPYzXs_f-610x407.jpeg 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Q9zZ5EWHcPYzXs_f-750x500.jpeg 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Q9zZ5EWHcPYzXs_f-1140x760.jpeg 1140w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Photo by\u00a0<a href=\"https:\/\/unsplash.com\/@noahdavis?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener\">Noah N\u00e4f<\/a>\u00a0on\u00a0<a href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\" 
rel=\"noopener\">Unsplash<\/a><\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-427a3b5 elementor-widget elementor-widget-text-editor\" data-id=\"427a3b5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"740b\">It\u2019s a well-established practice in Machine Learning to define features in a list that we use in the model.<\/p>\n\n<p id=\"ac50\">It\u2019s worthwhile to check whether a certain feature goes into the model more than once. It seems trivial, but you can mistakenly duplicate features when coding and rerunning the Jupyter Notebook for the X-th time.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4a5ac51 elementor-widget elementor-widget-text-editor\" data-id=\"4a5ac51\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">features <strong>=<\/strong> [<br>    \"product_a_ratio\",<br>    \"std_price_product_a\",<br>    \"n_purchases_product_a\",<br>    \"product_b_ratio\",<br>]<br><br><strong>assert<\/strong> len(set(features)) <strong>==<\/strong> len(features), \"Feature names are not unique\"<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-cfdbb55 elementor-widget elementor-widget-text-editor\" data-id=\"cfdbb55\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"6c90\">I would suggest you also list the features that you don\u2019t use in the model. 
That way you can spot a feature that should be in the model but is not.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a9d333d elementor-widget elementor-widget-text-editor\" data-id=\"a9d333d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">set(df<strong>.<\/strong>columns) <strong>-<\/strong> set(features)<br><br># Output<br>{'y'}<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-16c8877 elementor-widget elementor-widget-heading\" data-id=\"16c8877\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">7. Check correlations<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d41a079 elementor-widget elementor-widget-image\" data-id=\"d41a079\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"682\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Aa8GoTNt2MRefIg1-1024x682.jpeg\" class=\"attachment-large size-large wp-image-18433\" alt=\"Avoid These 8 Mistakes Before Training A Machine Learning Model\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Aa8GoTNt2MRefIg1-1024x682.jpeg 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Aa8GoTNt2MRefIg1-300x200.jpeg 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Aa8GoTNt2MRefIg1-768x512.jpeg 768w, 
https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Aa8GoTNt2MRefIg1-1536x1024.jpeg 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Aa8GoTNt2MRefIg1-2048x1365.jpeg 2048w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Aa8GoTNt2MRefIg1-610x407.jpeg 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Aa8GoTNt2MRefIg1-750x500.jpeg 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Aa8GoTNt2MRefIg1-1140x760.jpeg 1140w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Photo by\u00a0<a href=\"https:\/\/unsplash.com\/@sharonmccutcheon?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener\">Sharon McCutcheon<\/a>\u00a0on\u00a0<a href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener\">Unsplash<\/a><\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a9344f6 elementor-widget elementor-widget-text-editor\" data-id=\"a9344f6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"7722\">Checking the correlation between the features (and target) is essential when modeling.<\/p>\n\n<p id=\"58e3\">Linear Regression is a well-known algorithm that has problems with multicollinearity, which occurs when your model includes multiple features that are highly correlated with each other.<\/p>\n\n<p id=\"ec55\">Highly correlated features are also problematic with models that don\u2019t have a problem with multicollinearity, like Random Forest or Boosting. E.g., 
the model divides feature importance between correlated feature A and feature B, which makes feature importance misleading.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d237825 elementor-widget elementor-widget-heading\" data-id=\"d237825\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\">Let\u2019s plot the correlation matrix and try to spot highly correlated features:<\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2b51ec8 elementor-widget elementor-widget-text-editor\" data-id=\"2b51ec8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">corr <strong>=<\/strong> df[features]<strong>.<\/strong>corr()<br><br>fig, ax <strong>=<\/strong> plt<strong>.<\/strong>subplots()<br>ax <strong>=<\/strong> sns<strong>.<\/strong>heatmap(corr, vmin<strong>=-<\/strong>1, vmax<strong>=<\/strong>1, center<strong>=<\/strong>0, square<strong>=<\/strong>True)<br>ax<strong>.<\/strong>set_xticklabels(ax<strong>.<\/strong>get_xticklabels(), rotation<strong>=<\/strong>45, horizontalalignment<strong>=<\/strong>\"right\");<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ae2fbe0 elementor-widget elementor-widget-image\" data-id=\"ae2fbe0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"484\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1Q_IbXpNdynQ_QyaqXm69Ug-1024x484.png\" class=\"attachment-large size-large 
wp-image-18434\" alt=\"plot the correlation matrix and try to spot highly correlated features\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1Q_IbXpNdynQ_QyaqXm69Ug-1024x484.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1Q_IbXpNdynQ_QyaqXm69Ug-300x142.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1Q_IbXpNdynQ_QyaqXm69Ug-768x363.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1Q_IbXpNdynQ_QyaqXm69Ug-610x288.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1Q_IbXpNdynQ_QyaqXm69Ug-750x354.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1Q_IbXpNdynQ_QyaqXm69Ug-1140x538.png 1140w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1Q_IbXpNdynQ_QyaqXm69Ug.png 1372w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-91ce574 elementor-widget elementor-widget-text-editor\" data-id=\"91ce574\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"27ec\">In the correlation matrix above, we can observe that product_a_ratio and product_b_ratio are highly correlated.<\/p>\n\n<p id=\"edb0\">We need to be careful when removing correlated features as they can still add information despite high correlation.<\/p>\n\n<p id=\"c7e8\">In our example, features have a Pearson Correlation (PC) equal to 1.0, so we can safely remove one of them. 
But if the PC were 0.9, removing such a feature could reduce the overall accuracy of the model.<\/p>\n\n<p id=\"8521\">It is also good practice to add a comment explaining why we excluded the feature.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8ab21da elementor-widget elementor-widget-text-editor\" data-id=\"8ab21da\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">features <strong>=<\/strong> [<br>    \"product_a_ratio\",<br>    \"std_price_product_a\",<br>    \"n_purchases_product_a\",<br>    <em>#    \"product_b_ratio\", # feature has pearson correlation 1.0 with product_a_ratio<\/em><br>]<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-790877a elementor-widget elementor-widget-heading\" data-id=\"790877a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">8. 
Write notes<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-91a2327 elementor-widget elementor-widget-image\" data-id=\"91a2327\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"681\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0I3982I5k9Gie7hHR-1024x681.jpeg\" class=\"attachment-large size-large wp-image-18435\" alt=\"Write notes\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0I3982I5k9Gie7hHR-1024x681.jpeg 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0I3982I5k9Gie7hHR-300x199.jpeg 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0I3982I5k9Gie7hHR-768x511.jpeg 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0I3982I5k9Gie7hHR-1536x1021.jpeg 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0I3982I5k9Gie7hHR-2048x1361.jpeg 2048w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0I3982I5k9Gie7hHR-610x405.jpeg 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0I3982I5k9Gie7hHR-750x499.jpeg 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0I3982I5k9Gie7hHR-1140x758.jpeg 1140w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Photo by&nbsp;<a href=\"https:\/\/unsplash.com\/@laurensauderstudio?utm_source=medium&amp;utm_medium=referral\" rel=\"noopener\">Lauren Sauder<\/a>&nbsp;on&nbsp;<a href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\" rel=\"noopener\">Unsplash<\/a><\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element 
elementor-element-1158f9f elementor-widget elementor-widget-text-editor\" data-id=\"1158f9f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"dc98\">After you\u2019ve trained the model and checked the metrics, the next step is to do a sanity check with a few samples that the model classified with confidence (customers classified with probability 0 or 1).<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-098f977 elementor-widget elementor-widget-heading\" data-id=\"098f977\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\">I usually review:<\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4fc66f1 elementor-widget elementor-widget-text-editor\" data-id=\"4fc66f1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul><li>top 5 positive predictions that are marked as positives in the training set,<\/li><li>top 5 negative predictions that are marked as positives in the training set,<\/li><li>top 5 positive predictions that are marked as negatives in the training set.<\/li><\/ul>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-bad53f6 elementor-widget elementor-widget-text-editor\" data-id=\"bad53f6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"2b85\">This requires some manual work. When retraining the model, many times those top 5 predictions change. 
After the X-th change, it is no longer clear whether you\u2019ve already reviewed a sample.<\/p>\n\n<p id=\"2fdf\">To help you remember that you\u2019ve already reviewed a customer, add a notes column to your DataFrame and write a short note for each sample that you review:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3a898a2 elementor-widget elementor-widget-text-editor\" data-id=\"3a898a2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">df<strong>.<\/strong>loc['customer_0', \"notes\"] <strong>=<\/strong> \"Positive in training set, but should be negative\"<br>df<strong>.<\/strong>loc['customer_1', \"notes\"] <strong>=<\/strong> \"good prediction as positive\"<br>df<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f352756 elementor-widget elementor-widget-image\" data-id=\"f352756\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"353\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1HRwHOtzaSgVPKwgIIJ3Pnw-1024x353.png\" class=\"attachment-large size-large wp-image-18436\" alt=\"add a notes column to your DataFrame\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1HRwHOtzaSgVPKwgIIJ3Pnw-1024x353.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1HRwHOtzaSgVPKwgIIJ3Pnw-300x103.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1HRwHOtzaSgVPKwgIIJ3Pnw-768x265.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1HRwHOtzaSgVPKwgIIJ3Pnw-1536x529.png 1536w, 
https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1HRwHOtzaSgVPKwgIIJ3Pnw-2048x706.png 2048w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1HRwHOtzaSgVPKwgIIJ3Pnw-610x210.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1HRwHOtzaSgVPKwgIIJ3Pnw-750x259.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1HRwHOtzaSgVPKwgIIJ3Pnw-1140x393.png 1140w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Write a short comment to each entry that you\u2019ve already reviewed<\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-dbda8df elementor-widget elementor-widget-heading\" data-id=\"dbda8df\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Conclusion<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2ddd739 elementor-widget elementor-widget-image\" data-id=\"2ddd739\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"655\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Kf5-ENL1yByZuVwn-1024x655.jpeg\" class=\"attachment-large size-large wp-image-18437\" alt=\"Avoid These 8 Mistakes Before Training A Machine Learning Model\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Kf5-ENL1yByZuVwn-1024x655.jpeg 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Kf5-ENL1yByZuVwn-300x192.jpeg 300w, 
https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Kf5-ENL1yByZuVwn-768x492.jpeg 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Kf5-ENL1yByZuVwn-1536x983.jpeg 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Kf5-ENL1yByZuVwn-2048x1311.jpeg 2048w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Kf5-ENL1yByZuVwn-610x390.jpeg 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Kf5-ENL1yByZuVwn-750x480.jpeg 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Kf5-ENL1yByZuVwn-1140x730.jpeg 1140w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Photo by\u00a0<a href=\"https:\/\/unsplash.com\/@aarsoph?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener\">Kristijan Arsov<\/a>\u00a0on\u00a0<a href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener\">Unsplash<\/a><\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2d1cee6 elementor-widget elementor-widget-text-editor\" data-id=\"2d1cee6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"1cfe\">While I was working as a Software Engineer, I found tests essential to quality software development. Tests (when well written) guarantee that the software works with given input arguments.<\/p>\n\n<p id=\"5e81\">The tips that I\u2019m sharing here guarantee that the dataset has the desired properties. This template can be thought of as a sequence of tests before training the model.<\/p>\n\n<p id=\"b27c\">The template drastically reduces redundant sanity checks. 
There are fewer \u201cWhat did I screw up again\u201d moments.<\/p>\n\n<p id=\"5a9e\">You can download the\u00a0<a href=\"https:\/\/romanorac.github.io\/assets\/notebooks\/2020-03-23-machine-learning-tips.ipynb\" target=\"_blank\" rel=\"noreferrer noopener\">Jupyter notebook<\/a>\u00a0template here.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>Here are 8 Machine Learning tips that will help you to train a model with fewer screw-ups. These tips are most useful when you need to construct the training set.<\/p>\n","protected":false},"author":784,"featured_media":18438,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[183],"tags":[1231,991,852],"ppma_author":[3778],"class_list":["post-22556","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml","tag-bugs","tag-machine-learning-model","tag-training"],"authors":[{"term_id":3778,"user_id":784,"is_guest":0,"slug":"roman-orac","display_name":"Roman Orac","avatar_url":"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/04\/medium_b7d17fbf-b990-4540-aa64-0ff5333f3943-150x150.jpg","user_url":"https:\/\/www.sportradar.com\/","last_name":"Orac","first_name":"Roman","job_title":"","description":"Roman Orac is Senior Data Scientist at <a href=\"http:\/\/www.sportradar.com\/\">Sportradar<\/a>, a global leader in understanding and leveraging the power of sports data and digital content for its clients around the 
world."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/22556","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/784"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=22556"}],"version-history":[{"count":4,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/22556\/revisions"}],"predecessor-version":[{"id":32535,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/22556\/revisions\/32535"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/18438"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=22556"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=22556"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=22556"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=22556"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}