{"id":1198,"date":"2019-02-15T10:32:01","date_gmt":"2019-02-15T10:32:01","guid":{"rendered":"http:\/\/kusuaks7\/?p=803"},"modified":"2023-07-17T16:27:02","modified_gmt":"2023-07-17T16:27:02","slug":"working-with-missing-data-in-machine-learning","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/working-with-missing-data-in-machine-learning\/","title":{"rendered":"Working with Missing Data in Machine Learning"},"content":{"rendered":"<p><strong><em>Ready to learn Data Science? <a href=\"https:\/\/www.experfy.com\/training\/courses\">Browse courses<\/a>&nbsp;like&nbsp;<a href=\"https:\/\/www.experfy.com\/training\/tracks\/data-science-training-certification\">Data Science Training and Certification<\/a> developed by industry thought leaders and Experfy in Harvard Innovation Lab.<\/em><\/strong><\/p>\n<p>Missing values are representative of the messiness of real world data. There can be a multitude of reasons why they occur &mdash; from human error during data entry and incorrect sensor readings to software bugs in the data processing pipeline.<\/p>\n<p>The normal reaction is frustration. Missing data are probably the most widespread source of errors in your code, and the reason for much of your exception handling. If you try to remove them, you might dramatically reduce the amount of data you have available &mdash; probably the worst thing that can happen in machine learning.<\/p>\n<p>Still, often there are hidden patterns in missing data points. 
Those patterns can provide additional insight into the problem you&rsquo;re trying to solve.<\/p>\n<p><em>We can treat missing values in data the same way as silence in music &mdash; on the surface they might be considered negative (not contributing any information), but inside lies a lot of potential.<\/em><\/p>\n<h2 style=\"margin-left: -1.2pt;\"><strong>Methods<\/strong><\/h2>\n<p><em>Note: we will be using Python and a&nbsp;<\/em><a href=\"https:\/\/archive.ics.uci.edu\/ml\/datasets\/adult\" target=\"_blank\" rel=\"noopener noreferrer\"><em>census data set<\/em><\/a><em>&nbsp;(modified for the purposes of this tutorial)<\/em><\/p>\n<p>You might be surprised to find out how many methods for dealing with missing data exist. This is a testament both to how important the issue is and to how much room it leaves for creative problem solving.<\/p>\n<p>The first thing you should do is count how many missing values you have and try to visualize their distributions. For this step to work properly you should manually inspect the data (or at least a subset of it) to determine how missing values are designated. Possible variations are: &lsquo;NaN&rsquo;, &lsquo;NA&rsquo;, &lsquo;None&rsquo;, &lsquo; &rsquo;, &lsquo;?&rsquo; and others. If you find anything other than &lsquo;NaN&rsquo;, standardize it to np.nan. To construct our visualizations we will use the handy&nbsp;<a href=\"https:\/\/github.com\/ResidentMario\/missingno\" target=\"_blank\" rel=\"noopener noreferrer\">missingno<\/a>&nbsp;package.<\/p>\n<p>import missingno as msno<br \/>\nmsno.matrix(census_data)<\/p>\n<p><img decoding=\"async\" alt=\"experfy-blog\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*0c0gaVECFgj8pnqaDOwHkQ.png\" style=\"width: 1029px; height: 455px;\" \/>&nbsp;<\/p>\n<p style=\"text-align: center;\">Missing data visualisation. 
White fields indicate&nbsp;NA&rsquo;s<\/p>\n<p>import pandas as pd<br \/>\ncensus_data.isnull().sum()<\/p>\n<p>age&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 325<br \/>\nworkclass&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;2143<br \/>\nfnlwgt&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;325<br \/>\neducation&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 325<br \/>\neducation.num&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 325<br \/>\nmarital.status&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;325<br \/>\noccupation&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;2151<br \/>\nrelationship&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp; 326<br \/>\nrace&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;326<br \/>\nsex&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 326<br \/>\ncapital.gain&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;326<br \/>\ncapital.loss&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;326<br \/>\nhours.per.week&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 326<br \/>\nnative.country&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 906<br \/>\nincome&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;326<br \/>\ndtype: int64<\/p>\n<p>Let&rsquo;s start with the simplest thing you can do: removal. As mentioned before, while this is a quick solution that might work when the proportion of missing values is relatively low (&lt;10%), most of the time it will make you lose a ton of data. 
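To make the cost of removal concrete, here is a minimal sketch with pandas on a toy frame (the column names and values here are hypothetical, not the census data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, np.nan, 41.0, 33.0],
    "income": [50_000.0, 62_000.0, np.nan, 58_000.0],
})

# dropna() discards every row containing at least one NaN --
# here half of the (otherwise informative) rows disappear
df_clean = df.dropna()
print(len(df), "->", len(df_clean))  # 4 -> 2
```

Each dropped row still carried a perfectly valid value in its other column, which is exactly the waste described above.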
Imagine that just because of missing values in one of your features you have to drop the whole observation, even if the rest of the features are perfectly filled and informative!<\/p>\n<p>import numpy as np<br \/>\ncensus_data = census_data.replace(np.nan, 0)<\/p>\n<p>The second-worst method is replacement with 0 (or -1). While this lets your models run, it can be extremely dangerous, because the substituted value can be misleading. Imagine a regression problem where zero or negative values naturally occur (such as predicting temperature) &mdash; in that case 0 becomes an actual data point, indistinguishable from a real measurement.<\/p>\n<p>Now that we have those out of the way, let&rsquo;s become more creative. We can split missing values by the datatype of the column they belong to:<\/p>\n<p style=\"margin-left:-1.2pt;\"><strong>Numerical NaNs<\/strong><\/p>\n<p>A standard and often very good approach is to replace the missing values with the mean, median or mode. For numerical values the mean is a sensible default, and if there are outliers try the median (since it is much less sensitive to them). Note that in modern versions of scikit-learn the old Imputer class has been replaced by SimpleImputer in sklearn.impute:<\/p>\n<p><strong>from<\/strong> sklearn.impute <strong>import<\/strong> SimpleImputer<br \/>\nimputer = SimpleImputer(missing_values=np.nan, strategy=&#39;median&#39;)<br \/>\ncensus_data[[&#39;fnlwgt&#39;]] = imputer.fit_transform(census_data[[&#39;fnlwgt&#39;]])<\/p>\n<p style=\"margin-left:-1.2pt;\"><strong>Categorical NaNs<\/strong><\/p>\n<p>Categorical values can be a bit trickier, so you should definitely pay attention to your model performance metrics after editing (compare before and after). 
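One way to make that before-and-after comparison concrete is to score the same model under several imputation strategies. A minimal sketch with scikit-learn on synthetic data (the data, model choice and strategies are illustrative assumptions, not part of the original tutorial):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 3 features, then ~10% of cells set missing
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan

# Same model, same CV folds -- only the imputation strategy changes
for strategy in ("mean", "median", "most_frequent"):
    pipe = make_pipeline(SimpleImputer(strategy=strategy), LogisticRegression())
    score = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"{strategy}: {score:.3f}")
```

Putting the imputer inside the pipeline keeps the comparison honest: each cross-validation fold fits the imputer only on its own training split.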
The standard thing to do is to replace the missing entry with the most frequent one:<\/p>\n<p>census_data[&#39;marital.status&#39;].value_counts()<\/p>\n<p>Married-civ-spouse&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 14808<br \/>\nNever-married&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;10590<br \/>\nDivorced&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 4406<br \/>\nSeparated&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;1017<br \/>\nWidowed&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;979<br \/>\nMarried-spouse-absent&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 413<br \/>\nMarried-AF-spouse&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 23<br \/>\nName: marital.status, dtype: int64<\/p>\n<p>most_common = census_data[&#39;marital.status&#39;].value_counts().index[0]<\/p>\n<p>def replace_most_common(x):<br \/>\n&nbsp;&nbsp;&nbsp; if pd.isnull(x):<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return most_common<br \/>\n&nbsp;&nbsp;&nbsp; else:<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return x<\/p>\n<p>census_data[&#39;marital.status&#39;] = census_data[&#39;marital.status&#39;].map(replace_most_common)<\/p>\n<h2 style=\"margin-left: -1.2pt;\"><strong>Conclusion<\/strong><\/h2>\n<p>The take-home message is that you should be aware of the different methods available to get more out of missing data, and more importantly start regarding it as a source of possible insight instead of an annoyance!<\/p>\n<p>Happy coding&nbsp;\ud83d\ude42<\/p>\n<h2 style=\"margin-left: -1.2pt;\"><strong>Bonus &mdash; advanced methods and visualizations<\/strong><\/h2>\n<p>You can theoretically impute missing values by fitting a regression model, such as linear regression or k nearest neighbors. 
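As a starting point for that, scikit-learn provides KNNImputer in sklearn.impute (available since version 0.22); a minimal sketch on a toy numeric matrix (illustrative data, not the census set):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix: the second row is missing its second feature
X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 4.0]])

# Each NaN is replaced by the mean of that feature among
# the k nearest neighbors that have the value present
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled[1, 1])  # mean of 2.0 and 4.0 -> 3.0
```

Distances between rows are computed only over the features both rows have present, so partially missing observations still count as neighbors.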
A full implementation is left as an exercise for the reader.<\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" alt=\"experfy-blog\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/0*K0YiK-wEEWT9N7MG.JPG\" style=\"width: 215px; height: 197px;\" \/><\/p>\n<p style=\"text-align: center;\">A visual example of&nbsp;kNN.<\/p>\n<p>Here are some visualisations that are also available from the wonderful&nbsp;<a href=\"https:\/\/github.com\/ResidentMario\/missingno\" target=\"_blank\" rel=\"noopener noreferrer\">missingno<\/a>&nbsp;package, which can help you uncover relationships in the form of a correlation matrix or a dendrogram:<\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" alt=\"experfy-blog\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*ZXQgVJwFCTwcLBSfnXG-Tg.png\" style=\"width: 841px; height: 485px;\" \/>&nbsp;<\/p>\n<p style=\"text-align: center;\">Correlation matrix of missing values. Values that are often missing together can point you toward the source of the&nbsp;problem.<\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" alt=\"experfy-blog\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*Qws3Gr44rvleTN3CH859dw.png\" style=\"width: 685px; height: 391px;\" \/><\/p>\n<p style=\"text-align: center;\">Dendrogram of missing&nbsp;values<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Ready to learn Data Science? Browse courses&nbsp;like&nbsp;Data Science Training and Certification developed by industry thought leaders and Experfy in Harvard Innovation Lab. Missing values are representative of the messiness of real world data. 
There can be a multitude of reasons why they occur &mdash; ranging from human errors during data entry, incorrect sensor readings, to<\/p>\n","protected":false},"author":100,"featured_media":4374,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[187],"tags":[94],"ppma_author":[2637],"class_list":["post-1198","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-data-science"],"authors":[{"term_id":2637,"user_id":100,"is_guest":0,"slug":"boyan-angelov","display_name":"Boyan Angelov","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","user_url":"","last_name":"Angelov","first_name":"Boyan","job_title":"","description":"Boyan Angelov leads the machine learning efforts at a Berlin startup, building an AI to help tech companies get better candidates. He started using machine learning during his work on microbial metagenomes at the Max Planck Institute for Marine Microbiology. The discoveries he made there were in applications of dimensionality reduction methods. Later he worked in the clinical trials space, focusing on information retrieval and natural language processing. 
&nbsp;"}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1198","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/100"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=1198"}],"version-history":[{"count":2,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1198\/revisions"}],"predecessor-version":[{"id":29280,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1198\/revisions\/29280"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/4374"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=1198"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=1198"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=1198"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=1198"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}