Machine Learning for Product Managers Part III — Caveats

Uzma Barlaskar
May 11, 2018


This is a continuation of the three-part series on machine learning for product managers.

The first note focused on which problems are best suited to machine learning techniques. The second covered the additional skill sets a PM needs when building machine learning products. This note focuses on common mistakes made when building ML products.

The goal of this note is to give someone with limited ML understanding a general sense of the common pitfalls, so that you can discuss them with your data scientists and engineers. I won't go into depth on these issues here, but if you have questions, do comment. I may write a separate note if there's more interest in one of these areas.

Data issues

No data: This seems obvious (and maybe even hilarious that I'm mentioning it on a blog about ML). However, when I was running my company PatternEQ, selling ML solutions to companies, I was surprised by how many companies wanted to use ML and had built up 'smart software' strategies but didn't have any data. You cannot use machine learning if you have no data. You either need data that your company collects, or you can acquire public data or accumulate data through partnerships with companies that have it. Without data, there's no ML. Period. (This is also a good filter when evaluating the work claimed by many AI startups: they claim to have cool AI technology but have no data to run it on.)

Small data: Most of the ML techniques published today focus on big data and tend to work well on large datasets. You can apply ML to small datasets too, but you have to be careful that the model is not skewed by outliers and that you are not relying on overcomplicated models. Applying ML to small data may come with more overhead. Thus, it's perfectly fine to apply classical statistical techniques instead of ML to analyze small datasets. For example, most clinical trials have small sample sizes and are analyzed with statistical techniques, and the insights are perfectly valid.
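For instance, here's a minimal sketch (with entirely made-up numbers) of how a small, clinical-trial-sized dataset might be analyzed with a classical statistical test rather than an ML model:

```python
# A minimal sketch: a small, hypothetical two-arm study analyzed with a
# classical statistical test instead of an ML model.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
treatment = rng.normal(loc=5.2, scale=1.0, size=30)  # hypothetical outcomes
control = rng.normal(loc=4.8, scale=1.0, size=30)

# Welch's t-test: no large-sample ML machinery needed, and the result
# is a valid inference even with n=30 per arm.
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```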

Sparse data: Sometimes even when you have a lot of data, your actual usable data may still be sparse. Say, for example, on Amazon, you have hundreds of millions of shoppers and tens of millions of products to buy. Each shopper buys only a few hundred of the millions of products available, so you have no feedback on most of the products. This makes it harder to recommend a product that no one, or very few users, have bought. When using sparse datasets, you have to be deliberate about the models and tools you use, as off-the-shelf techniques will produce sub-par results. Also, naive computations on sparse datasets are inefficient, since most of the cells are empty.
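To make the storage point concrete, here's a small sketch, with made-up sizes, of how a shopper-by-product purchase matrix might be held in a sparse representation rather than a dense one:

```python
# A minimal sketch of why sparse representations matter for data like a
# shopper x product purchase matrix (all sizes here are made up).
import numpy as np
from scipy.sparse import csr_matrix

n_shoppers, n_products = 100_000, 50_000
# Suppose each shopper bought ~20 products: ~2M nonzeros out of 5B cells.
rows = np.random.randint(0, n_shoppers, size=2_000_000)
cols = np.random.randint(0, n_products, size=2_000_000)
vals = np.ones(2_000_000, dtype=np.float32)

purchases = csr_matrix((vals, (rows, cols)), shape=(n_shoppers, n_products))

# Dense storage would need ~20 GB (5e9 float32 cells); the sparse matrix
# stores only the nonzero entries plus their indices.
print(f"nonzero cells: {purchases.nnz} of {n_shoppers * n_products}")
```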

High dimensional data: If your data has a lot of attributes, it can be hard for models to consume and takes up more computational and storage resources. High-dimensional data usually needs to be projected into a lower-dimensional space before most ML models can use it. You also need to be careful when throwing away dimensions, to ensure that you are discarding only redundant dimensions and not signal. This is where feature selection matters a lot.
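As an illustration, here's a minimal sketch (on synthetic data) of reducing dimensionality with PCA while checking how much variance, and therefore potential signal, is retained:

```python
# A minimal sketch: projecting high-dimensional data into a lower-dimensional
# space with PCA, while checking how much variance (signal) is retained.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(1_000, 500)  # hypothetical: 1,000 rows, 500 attributes

pca = PCA(n_components=50)
X_reduced = pca.fit_transform(X)

# If the retained variance is too low, we are throwing away signal,
# not just redundant dimensions.
retained = pca.explained_variance_ratio_.sum()
print(f"reduced {X.shape[1]} -> {X_reduced.shape[1]} dims, "
      f"variance retained: {retained:.1%}")
```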

Knowing which dimensions really matter for the outcome you care about is a matter of both intuition and statistics. PMs should involve themselves in feature-selection discussions with engineers and data scientists; this is where product intuition helps. For example, say we are trying to predict how good a video is. You could use how much of the video the user watched as an engagement metric, but suppose UX studies show that users often leave a tab open and switch to another tab while a video is playing. Watch time, then, does not correlate perfectly with quality, so we might want to include other features, such as whether there was any activity in the tab while the video was playing, to truly understand the video's quality.
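To make this concrete, here's a hypothetical sketch of scoring candidate features against an outcome; the feature names (watch_fraction, tab_active_fraction) and the label are invented to echo the video example above:

```python
# A minimal sketch of scoring candidate features against an outcome.
# All feature names and data below are hypothetical, echoing the
# video-quality example in the text.
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

n = 5_000
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "watch_fraction": rng.uniform(0, 1, n),       # share of video watched
    "tab_active_fraction": rng.uniform(0, 1, n),  # activity while playing
    "video_length_sec": rng.integers(30, 3600, n),
})
# Hypothetical label: "good" videos are ones actively watched most of the way.
y = ((X["watch_fraction"] * X["tab_active_fraction"]) > 0.4).astype(int)

scores = mutual_info_classif(X, y, random_state=0)
for name, score in sorted(zip(X.columns, scores), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```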

Data cleaning: You can't just take data off the shelf and apply a model to it. A large part of ML's success depends on the quality of the data, and by quality I don't just mean how feature-rich it is but how well it's cleaned up. Have you removed outliers? Have you normalized all fields? Are there bad or corrupted fields in your data? All of these can make or break your model. As they say: garbage in, garbage out.
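Here's a minimal sketch of what such a cleaning pass might look like in pandas; the column names and thresholds are illustrative only:

```python
# A minimal sketch of a cleaning pass with pandas; column names and
# thresholds are illustrative, not from the article.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [9.99, 12.50, -1.0, 11.00, 9500.0],  # -1 and 9500 look corrupt
    "quantity": [1, 2, 3, np.nan, 2],
})

# Drop rows with clearly corrupted fields.
df = df[df["price"] > 0]

# Clip extreme outliers to a sane percentile range.
low, high = df["price"].quantile([0.01, 0.99])
df["price"] = df["price"].clip(low, high)

# Fill missing values and normalize to zero mean / unit variance.
df["quantity"] = df["quantity"].fillna(df["quantity"].median())
df["price_norm"] = (df["price"] - df["price"].mean()) / df["price"].std()
print(df)
```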

Fit issues

Overfitting

To explain overfitting, I have an interesting story to tell. During the financial crisis of 2007, there was a quant meltdown. Events that were not supposed to be correlated ended up being correlated, and many assumptions that were considered inviolable were violated. The algorithms went haywire, and within three days quant funds amassed enormous losses.

I was working as an engineer at the quant hedge fund D.E. Shaw during the 2007 financial meltdown. D.E. Shaw suffered smaller losses than other quant funds at the time. Why? The other quant funds were newer, and their algorithms had been trained on data from the years immediately preceding 2007, which had never seen a downturn. When prices started crashing, their models didn't know how to react. D.E. Shaw, on the other hand, had faced a similar crash: the Russian rouble crisis of 1998. D.E. Shaw suffered losses then too, but its algorithms had since been calibrated to expect scenarios like this, and hence they did not crash as badly as some of the other firms'.

This was an extreme case of overfitting. In layman's terms, the models optimized for hindsight rather than foresight. The competitors' models were trained on assumptions that held true only while the stock markets were doing well; when the crash happened, they couldn't predict the right outcomes and ended up making wrong decisions, leading to more losses.

How do you avoid this? Ensure that you can test your models on a wide variety of data. Also, take a hard look at your assumptions: will they still hold true if the economy shifts or user behavior changes? Here's an article I wrote a while back that talks more about managing assumptions.
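One concrete way to test on a wide variety of data for time-ordered problems is to validate across time rather than on a single random split. Here's a minimal sketch on synthetic data:

```python
# A minimal sketch: validating a model across time rather than on a single
# random split, so it is tested on regimes it was not trained on.
# Data here is synthetic; a real pipeline would plug in its own features.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(1)
X = rng.normal(size=(1_000, 10))
y = X[:, 0] * 2 + rng.normal(size=1_000)

scores = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    scores.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

# A large spread across folds warns that the model's assumptions only hold
# in some regimes (the quant-fund failure mode described above).
print(f"MSE per fold: {[round(s, 2) for s in scores]}")
```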

Underfitting

Underfitting occurs when your model is too simple for the data it's trying to learn from. Say, for example, you are trying to predict whether shoppers at Safeway will purchase cake mix. Cake mix is a discretionary purchase, so factors such as disposable income, the price of cake mix, and competitors in the vicinity will affect the prediction. But if you ignore economic factors such as employment and inflation rates, as well as the growth of other grocery stores, and focus only on shopping behavior inside Safeway, your model will not predict sales accurately.

How do you avoid this? This is where product/customer intuition and market understanding come in handy. If your model isn't performing well, ask whether you have gathered all the data needed to accurately understand the problem. Can you add data from other sources that may paint a better picture of the underlying behavior you are trying to model?
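Here's a minimal sketch of that diagnosis: compare a model that only sees in-store behavior with one that also sees a hypothetical external economic feature. All data below is synthetic:

```python
# A minimal sketch: compare a model built only on in-store behavior with one
# that also sees a (hypothetical, synthetic) external economic signal.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n = 5_000
store_visits = rng.poisson(4, n)                   # in-store behavior only
disposable_income = rng.normal(50_000, 15_000, n)  # external economic signal
# Hypothetical ground truth: purchases depend on both.
buys_cake_mix = ((store_visits > 3) & (disposable_income > 45_000)).astype(int)

X_narrow = store_visits.reshape(-1, 1)
X_wide = np.column_stack([store_visits, disposable_income])

model = make_pipeline(StandardScaler(), LogisticRegression())
for name, X in [("store-only", X_narrow), ("store + economy", X_wide)]:
    acc = cross_val_score(model, X, buys_cake_mix, cv=5).mean()
    print(f"{name}: accuracy {acc:.2f}")
```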

Computation Cost

One overlooked area when building ML products is how computationally expensive machine learning is. With services such as AWS and Azure, you can bootstrap and build machine learning capabilities. However, at scale you will need to do the hard math of how much computational cost you are willing to incur to provide machine learning features to your users. Based on that cost, you may need to trade off the quality of the predictions. For example, you may not be able to store all the data about your products, or you may not be able to serve the freshest recommendations and instead have to pre-compute them ahead of time. Knowing how your engineering team is trading off computation cost against ML precision/recall will help you understand whether the quality of the product is being compromised.
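For example, here's a minimal sketch of the pre-computation trade-off mentioned above: running an expensive scoring step offline in batch so that serving is a cheap lookup. The scoring function and catalog here are hypothetical placeholders:

```python
# A minimal sketch of the pre-computation trade-off: compute top-N
# recommendations offline, trading freshness for cheap serving.
import heapq

def score(user_id: int, item_id: int) -> float:
    """Hypothetical placeholder for an expensive model call."""
    return ((user_id * 31 + item_id * 17) % 100) / 100.0

def precompute_top_n(user_ids, item_ids, n=10):
    # Run offline (e.g., nightly): the recommendations are stale by up to
    # a day, but each request avoids a costly model invocation.
    return {
        u: heapq.nlargest(n, item_ids, key=lambda i: score(u, i))
        for u in user_ids
    }

cache = precompute_top_n(range(100), range(1_000))

# At request time, serving is a dictionary lookup instead of a model call.
print(cache[42][:3])
```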
