Working with Missing Data in Machine Learning

Boyan Angelov Boyan Angelov
February 15, 2019 Big Data, Cloud & DevOps

Ready to learn Data Science? Browse courses like Data Science Training and Certification developed by industry thought leaders and Experfy in Harvard Innovation Lab.

Missing values are representative of the messiness of real world data. There can be a multitude of reasons why they occur — ranging from human errors during data entry, incorrect sensor readings, to software bugs in the data processing pipeline.

The normal reaction is frustration. Missing data are probably the most widespread source of errors in your code, and the reason for most of the exception-handling. If you try to remove them, you might reduce the amount of data you have available dramatically — probably the worst that can happen in machine learning.

Still, often there are hidden patterns in missing data points. Those patterns can provide additional insight in the problem you’re trying to solve.

We can treat missing values in data the same way as silence in music — on the surface they might be considered negative (not contributing any information), but inside lies a lot of potential.

Methods

Note: we will be using Python and a census data set (modified for the purposes of this tutorial)

You might be surprised to find out how many methods for dealing missing data exist. This is a testament to both how important this issue is, and also that there is a lot of potential for creative problem solving.

The first thing you should do is count how many you have and try to visualize their distributions. For this step to work properly you should manually inspect the data (or at least a subset of it) to try to determine how they are designated. Possible variations are: ‘NaN’, ‘NA’, ‘None’, ‘ ’, ‘?’ and others. If you have something different than ‘NaN’ you should standardize them by using np.nan. To construct our visualizations we will use the handy missingnopackage.

import missingno as msno
msno.matrix(census_data)

experfy-blog 

Missing data visualisation. White fields indicate NA’s

import pandas as pd
census_data.isnull().sum()

age                                  325
workclass                       2143
fnlwgt                               325
education                        325
education.num                325
marital.status                   325
occupation                     2151
relationship                     326
race                                 326
sex                                  326
capital.gain                     326
capital.loss                     326
hours.per.week              326
native.country                906
income                           326
dtype: int64

Let’s start with the most simple thing you can do: removal. As mentioned before, while this is a quick solution, and might work in some cases when the proportion of missing values is relatively low (<10%), most of the time it will make you lose a ton of data. Imagine that just because of missing values in one of your features you have to drop the whole observation, even if the rest of the features are perfectly filled and informative!

import numpy as np
census_data = census_data.replace('np.nan', 0)

The second-worst method of doing this is replacement with 0 (or -1). While this would help you run your models, it can be extremely dangerous. The reason for this is that sometimes this value can be misleading. Imagine a regression problem where negative values occur (such as predicting temperature) — well in that case this becomes an actual data point.

Now that we have those out of the way, let’s become more creative. We can split the type of missing values by their parent datatype:

Numerical NaNs

A standard and often very good approach is to replace the missing values with mean, median or mode. For numerical values you should go with mean, and if there are some outliers try median (since it is much less sensitive to them).

from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values=np.nan, strategy='median', axis=0)
census_data[['fnlwgt']] = imputer.fit_transform(census_data[['fnlwgt']])

Categorical NaNs

Categorical values can be a bit trickier, so you should definitely pay attention to your model performance metrics after editing (compare before and after). The standard thing to do is to replace the missing entry with the most frequent one:

census_data['marital.status'].value_counts()

Married-civ-spouse                14808
Never-married                       10590
Divorced                                  4406
Separated                                 1017
Widowed                                   979
Married-spouse-absent            413
Married-AF-spouse                    23
Name: marital.status, dtype: int64

def replace_most_common(x):
    if pd.isnull(x):
        return most_common
    else:
        return x

census_data = census_data['marital.status'].map(replace_most_common)

Conclusion

The take-home message is that you should be aware of the different methods available to get more out of missing data, and more importantly start regarding it as a source of possible insight instead of annoyance!

Happy coding 🙂

Bonus — advanced methods and visualizations

You can theoretically impute missing values by fitting a regression model, such as linear regression or k nearest neighbors. The implementation of this is left as an example to the reader.

experfy-blog

A visual example of kNN.

Here are some visualisations that are also available from the wonderful missingno package, which can help you uncover relationships, in the form of a correlation matrix or a dendrogram:

experfy-blog 

Correlation matrix of missing values. Values which are often missing together can help you solve the problem.

experfy-blog

Dendrogram of missing values

 

  • Experfy Insights

    Top articles, research, podcasts, webinars and more delivered to you monthly.

  • Boyan Angelov

    Tags
    Data Science
    © 2021, Experfy Inc. All rights reserved.
    Leave a Comment
    Next Post
    Why Advanced Persistent Threats Are Targeting the Internet of Things

    Why Advanced Persistent Threats Are Targeting the Internet of Things

    Leave a Reply Cancel reply

    Your email address will not be published. Required fields are marked *

    More in Big Data, Cloud & DevOps
    Big Data, Cloud & DevOps
    Cognitive Load Of Being On Call: 6 Tips To Address It

    If you’ve ever been on call, you’ve probably experienced the pain of being woken up at 4 a.m., unactionable alerts, alerts going to the wrong team, and other unfortunate events. But, there’s an aspect of being on call that is less talked about, but even more ubiquitous – the cognitive load. “Cognitive load” has perhaps

    5 MINUTES READ Continue Reading »
    Big Data, Cloud & DevOps
    How To Refine 360 Customer View With Next Generation Data Matching

    Knowing your customer in the digital age Want to know more about your customers? About their demographics, personal choices, and preferable buying journey? Who do you think is the best source for such insights? You’re right. The customer. But, in a fast-paced world, it is almost impossible to extract all relevant information about a customer

    4 MINUTES READ Continue Reading »
    Big Data, Cloud & DevOps
    3 Ways Businesses Can Use Cloud Computing To The Fullest

    Cloud computing is the anytime, anywhere delivery of IT services like compute, storage, networking, and application software over the internet to end-users. The underlying physical resources, as well as processes, are masked to the end-user, who accesses only the files and apps they want. Companies (usually) pay for only the cloud computing services they use,

    7 MINUTES READ Continue Reading »

    About Us

    Incubated in Harvard Innovation Lab, Experfy specializes in pipelining and deploying the world's best AI and engineering talent at breakneck speed, with exceptional focus on quality and compliance. Enterprises and governments also leverage our award-winning SaaS platform to build their own customized future of work solutions such as talent clouds.

    Join Us At

    Contact Us

    1700 West Park Drive, Suite 190
    Westborough, MA 01581

    Email: [email protected]

    Toll Free: (844) EXPERFY or
    (844) 397-3739

    © 2025, Experfy Inc. All rights reserved.