Avoid These 8 Mistakes Before Training A Machine Learning Model

Roman Orac
January 12, 2021

One of the most common misconceptions in Machine Learning is that ML Engineers get a CSV dataset and spend the majority of their time optimizing a model’s hyperparameters.

If you work in the industry, you know that’s far from the truth. ML Engineers spend most of their time planning how to construct a training set that resembles the real-world data distribution for a given problem.

Once you’ve managed to construct such a training set, add a few well-crafted features and the Machine Learning model won’t have a hard time finding the decision boundary.

In this article, we’re going to go through 8 Machine Learning tips that will help you train a model with fewer screw-ups. These tips are most useful when you need to construct the training set yourself, e.g. when you didn’t get it from Kaggle.

At the end of the article, I also share a link to the Jupyter Notebook template, which you can incorporate into your Machine Learning workflow.

Dataset sample


It’s easier to learn with examples. Let’s create a sample dataset with random features.

One row represents a customer with its features and a binary target variable. customer_id is the index of the DataFrame.

import numpy as np
import pandas as pd

np.random.seed(42)
n = 1000

df = pd.DataFrame(
    {
        "customer_id": ["customer_%d" % i for i in range(n)],
        "product_a_ratio": np.random.random_sample(n),
        "std_price_product_a": np.random.normal(0, 1, n),
        "n_purchases_product_a": np.random.randint(0, 10, n),
    }
)
df.loc[100, "std_price_product_a"] = pd.NA
df["product_b_ratio"] = df["product_a_ratio"]
df["y"] = np.random.randint(0, 2, n)
df.set_index("customer_id", inplace=True)

df.head()
Sample dataset with features and target

1. Check the target


It seems obvious that no positive customer should also be marked as a negative, but it does happen in the real world, so it’s worthwhile to check.

assert (
    len(set(df[df.y == 0].index).intersection(df[df.y == 1].index)) == 0
), "Positive customers have intersection with negative customers"

I use the assert statement above in a Jupyter Notebook, which breaks execution if there is a mistake in the training set. So when I construct a new training set and use the Jupyter command “Restart kernel and run all cells”, I can be sure that the training set has the required properties.

When doing Exploratory Data Analysis (EDA), we need to be aware that real-world datasets have mistakes in unexpected places. One of the goals of EDA is to discover them.

2. Check duplicates


How do duplicates get into the training set?

Many times, through joins in SQL databases!

E.g. you join an SQL table with another table on the customer_id key. If either of those tables has multiple entries for a customer_id, the join will create duplicates.
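A minimal pandas sketch of the same effect, using a hypothetical purchases table with two entries for one customer:

import pandas as pd

customers = pd.DataFrame({"customer_id": ["customer_0", "customer_1"], "y": [1, 0]})

# hypothetical table with two entries for customer_0
purchases = pd.DataFrame(
    {
        "customer_id": ["customer_0", "customer_0", "customer_1"],
        "n_purchases_product_a": [3, 5, 2],
    }
)

# one-to-many join: customer_0 now appears twice
joined = customers.merge(purchases, on="customer_id", how="left")
len(joined)  # output: 3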

How can we make sure there aren’t any duplicates in our training set?

We can use an assert statement that will break execution in case a duplicated customer_id appears:

assert len(df[df.index.duplicated()]) == 0, "There are duplicates in the training set"

3. Check missing values


In my experience, missing values appear for two reasons:

  • real missing values — the customer doesn’t have an entry for a certain feature,
  • mistakes in a dataset — we didn’t map the NULL to the default value when constructing the training set, because we didn’t expect it.

In the latter case, we can simply fix the query and map the NULL value. For real missing values, we need to know how our model handles them.

For example, LightGBM supports missing values by default and we can set the desired behavior.

LightGBM uses NA (NaN) to represent missing values by default. Change it to use zero by setting zero_as_missing=true. When zero_as_missing=false (default), the unshown values in sparse matrices (and LightSVM) are treated as zeros. When zero_as_missing=true, NA and zeros (including unshown values in sparse matrices (and LightSVM)) are treated as missing.
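A minimal sketch of passing these flags, assuming X_train and y_train are prepared elsewhere:

import lightgbm as lgb

params = {
    "objective": "binary",
    "use_missing": True,       # default: enable missing value handling
    "zero_as_missing": False,  # default: only NA/NaN are treated as missing
}

train_set = lgb.Dataset(X_train, label=y_train)
model = lgb.train(params, train_set, num_boost_round=100)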

Let’s check if our dataset contains missing values.

for col in df.columns:
    n_missing = df[col].isnull().sum()
    assert n_missing == 0, "%s col has %d missing values" % (col, n_missing)
The assertion fails for the std_price_product_a column

The std_price_product_a column has a single missing value. Let’s remove the entry and rerun the check.

df = df[df["std_price_product_a"].notnull()].copy()

for col in df.columns:
    n_missing = df[col].isnull().sum()
    assert n_missing == 0, "%s col has %d missing values" % (col, n_missing)

By now, we can see how useful these checks are. We didn’t spend a second on debugging! The checks notify us right away about unexpected values in the dataset.

When a missing value is expected for a certain feature, we can whitelist it so the check won’t break execution next time.
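One way to do this, as a sketch, is a whitelist set that the loop skips:

# columns where missing values are expected and allowed
missing_whitelist = {"std_price_product_a"}

for col in df.columns:
    if col in missing_whitelist:
        continue
    n_missing = df[col].isnull().sum()
    assert n_missing == 0, "%s col has %d missing values" % (col, n_missing)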

4. Check feature scales


When working on feature engineering, we define certain features on a 0–1 scale or some other scale. It’s worthwhile to check that each feature stays within its intended boundaries.

In this example, we only check if features are on a scale between 0 and 1, but I would suggest you add more checks that are appropriate for your dataset.

features_on_0_1_scale = [
    "product_a_ratio",
    "product_b_ratio",
    "y",
]

for col in features_on_0_1_scale:
    assert df[col].min() >= 0 and df[col].max() <= 1, "%s is not on 0 - 1 scale" % col

5. Check feature types


Before you start training the model, I suggest you manually set the data type of every feature. At first, it might feel redundant, but you will thank me later.

We can set the feature types in a for loop:

feature_types = {
    "product_a_ratio": "float64",
    "std_price_product_a": "float64",
    "n_purchases_product_a": "int64",
    "product_b_ratio": "float64",
    "y": "int64",
}

for feature, dtype in feature_types.items():
    df[feature] = df[feature].astype(dtype)

Why is this useful? What happens to the integer data type when we add a missing value?

The column data type changes from integer to the object data type. When we convert it to a NumPy array, it holds floats instead of integers, and the classifier could misinterpret ordinal or categorical features as continuous ones.

Let’s look at an example below:

df.n_purchases_product_a.values[:10]
# output: array([0, 8, 0, 2, 7, 2, 3, 7, 0, 5])

# add NA to the first row
df.loc["customer_0", "n_purchases_product_a"] = pd.NA

df.n_purchases_product_a.values[:10]
# output: array([0.0, 8.0, 0.0, 2.0, 7.0, 2.0, 3.0, 7.0, 0.0, 5.0], dtype=object)
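If you need integer semantics together with missing values, one option (my suggestion, not part of the original workflow) is pandas’ nullable Int64 dtype:

# the nullable integer dtype keeps integers alongside pd.NA
df["n_purchases_product_a"] = df["n_purchases_product_a"].astype("Int64")
df["n_purchases_product_a"].dtype  # output: Int64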

6. Unique features


It’s a well-established practice in Machine Learning to define features in a list that we use in the model.

It’s worthwhile to check whether a certain feature goes into the model more than once. It seems trivial, but you can mistakenly duplicate features when coding and rerunning the Jupyter Notebook for the X-th time.

features = [
    "product_a_ratio",
    "std_price_product_a",
    "n_purchases_product_a",
    "product_b_ratio",
]

assert len(set(features)) == len(features), "Feature names are not unique"

I would suggest you also list the features that you don’t use in the model. That way you can spot a feature that should be in the model but isn’t.

set(df.columns) - set(features)
# output: {'y'}

7. Check correlations


Checking the correlation between the features (and target) is essential when modeling.

Linear Regression is a well-known algorithm that has problems with multicollinearity — when your model includes multiple features that are similar to each other.

Highly correlated features are also problematic for models that don’t suffer from multicollinearity, like Random Forests or Boosting. E.g. the model divides feature importance between correlated features A and B, which makes the feature importance misleading.

Let’s plot the correlation matrix and try to spot highly correlated features:

import matplotlib.pyplot as plt
import seaborn as sns

corr = df[features].corr()

fig, ax = plt.subplots()
ax = sns.heatmap(corr, vmin=-1, vmax=1, center=0, square=True)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment="right");
Correlation matrix of the features

In the correlation matrix above, we can observe that product_a_ratio and product_b_ratio are highly correlated.

We need to be careful when removing correlated features as they can still add information despite high correlation.

In our example, the features have a Pearson Correlation (PC) of 1.0, so we can safely remove one of them. But if the PC were 0.9, removing such a feature could reduce the overall accuracy of the model.
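A short sketch that flags suspicious pairs programmatically (the 0.95 threshold is my assumption; tune it for your dataset):

# list feature pairs whose absolute Pearson correlation exceeds a threshold
threshold = 0.95
corr = df[features].corr().abs()

for i, col_a in enumerate(features):
    for col_b in features[i + 1:]:
        if corr.loc[col_a, col_b] > threshold:
            print(col_a, col_b, corr.loc[col_a, col_b])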

A good practice is also to add a comment with the reason why we excluded the feature.

features = [
    "product_a_ratio",
    "std_price_product_a",
    "n_purchases_product_a",
    # "product_b_ratio",  # feature has a Pearson correlation of 1.0 with product_a_ratio
]

8. Write notes


After you’ve trained the model and checked the metrics, the next step is to do a sanity check on a few samples that the model classified confidently (customers classified with a probability close to 0 or 1).

I usually review the following (see the sketch after this list):

  • top 5 positive predictions that are marked as positives in the training set,
  • top 5 negative predictions that are marked as positives in the training set,
  • top 5 positive predictions that are marked as negatives in the training set.
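A rough sketch of how these groups could be selected, assuming the model’s predicted probabilities were stored in a y_pred_proba column (my naming, not the article’s):

# top 5 positive predictions that are marked as positive
df[df.y == 1].sort_values("y_pred_proba", ascending=False).head(5)

# top 5 negative predictions that are marked as positive
df[df.y == 1].sort_values("y_pred_proba").head(5)

# top 5 positive predictions that are marked as negative
df[df.y == 0].sort_values("y_pred_proba", ascending=False).head(5)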

This requires some manual work. When you retrain the model, those top 5 predictions often change, and after the X-th change it is no longer clear whether you’ve already reviewed a sample or not.

To help you remember that you’ve already reviewed a customer, add a notes column to your DataFrame and write a short note for each sample that you review:

df.loc['customer_0', "notes"] = "Positive in training set, but should be negative"
df.loc['customer_1', "notes"] = "good prediction as positive"
df
Write a short note for each entry that you’ve already reviewed

Conclusion


While I was working as a Software Engineer, I found tests essential to quality software development. Tests (when well written) guarantee that the software works with given input arguments.

The checks that I’m sharing here ensure that the dataset has the desired properties. The template can be thought of as a sequence of tests that run before training the model.

The template drastically reduces repetitive manual sanity checks. There are fewer “What did I screw up again?” moments.

You can download the Jupyter notebook template here.
