This is a continuation of the three-part series on machine learning for product managers.
The first note focused on which problems are best suited for the application of machine learning techniques. The second note talked about the additional skill-sets a PM needs when building machine learning products. This note focuses on the common mistakes made in building ML products.
The goal of this note is to give someone with limited ML understanding a general sense of the common pitfalls, so that you can have a conversation with your data scientists and engineers about them. I will not go into depth on these issues here. However, if you have questions, do comment; I may write a separate note if there's more interest in one of these areas.
No data: This seems obvious (and maybe even hilarious that I am mentioning it on a blog about ML). However, when I was running my company PatternEQ, where we were selling ML solutions to companies, I was surprised by how many companies wanted to use ML and had built up 'smart software' strategies but didn't have any data. You cannot use machine learning if you have no data. You either need data that your company collects, or you can acquire public data or accumulate data through partnerships with other companies that have it. Without data, there's no ML. Period. (This is also a good filter to use when evaluating the work claimed by many AI startups. These startups claim they have a cool AI technology but have no data to run these techniques on.)
Small data: Most of the ML techniques published today focus on Big Data and tend to work well on large datasets. You can apply ML to small datasets too, but you have to be very careful that the model is not skewed by outliers and that you are not relying on overcomplicated models. Applying ML to small data may come with more overhead. Thus, it's perfectly fine to apply statistical techniques instead of ML to analyze small datasets. For example, most clinical trials tend to have small sample sizes and are analyzed through statistical techniques, and the insights are perfectly valid.
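To make the clinical-trial point concrete, here is a minimal sketch of a classical statistical analysis on a small sample, using SciPy. The numbers are entirely made up for illustration; they are not from any real trial.

```python
from scipy import stats

# Hypothetical outcomes for a control group and a treatment group.
# n = 8 per arm is far too small for most ML models, but a classical
# significance test handles it fine.
control = [4.1, 3.8, 4.5, 4.0, 3.9, 4.2, 4.4, 3.7]
treatment = [4.9, 5.1, 4.7, 5.3, 4.8, 5.0, 4.6, 5.2]

# Welch's t-test: compares the two means without assuming equal variances.
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

With 16 data points there is nothing to "train"; a well-chosen test statistic gives a defensible answer where a complex model would only memorize noise.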
Sparse data: Sometimes, even when you have a lot of data, your actual usable data may still be sparse. On Amazon, for example, there are hundreds of millions of shoppers and tens of millions of products. Each shopper buys only a few hundred of those millions of products, so you have no feedback on most of them. This makes it harder to recommend a product that no one, or very few users, have bought. When using sparse datasets, you have to be deliberate about the models and tools you use, as off-the-shelf techniques will produce sub-par results. Also, computations with sparse datasets are less efficient if handled naively, as most of the dataset is empty.
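One concrete consequence of sparsity is storage: a naive dense matrix wastes memory on the empty cells. Here is a small sketch with a made-up shoppers-by-products matrix (at roughly 0.1% density) showing how a sparse representation stores only the non-zero entries:

```python
import numpy as np
from scipy import sparse

# Hypothetical toy purchase matrix: rows are shoppers, columns are
# products, 1.0 means "bought". Real matrices are millions x millions;
# this just illustrates the shape of the idea.
rng = np.random.default_rng(0)
dense = (rng.random((1000, 2000)) < 0.001).astype(np.float32)  # ~0.1% non-zero

csr = sparse.csr_matrix(dense)
density = csr.nnz / (dense.shape[0] * dense.shape[1])
print(f"non-zeros: {csr.nnz}, density: {density:.4%}")

# CSR keeps only the non-zero values plus index arrays, so memory
# scales with the number of purchases, not shoppers x products.
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
print(f"dense: {dense.nbytes} bytes, sparse: {sparse_bytes} bytes")
```

The same logic applies to computation: algorithms designed for sparse inputs (as most recommender libraries are) iterate over the non-zeros instead of the whole grid.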
High dimensional data: If your data has many attributes, it can be hard for models to consume and takes up more computational and storage resources. High-dimensional data usually needs to be projected into a lower-dimensional space before it can be used by most ML models. You also need to be careful when throwing away dimensions, to ensure that you are discarding only redundant dimensions, not signal. This is where feature selection matters a lot.
Knowing which dimensions really matter for the outcome you desire is a matter of both intuition and statistics. PMs should involve themselves in feature selection discussions with engineers and data scientists; this is where product intuition helps. For example, say we are trying to predict how good a video is. You could use how much of the video a user watched as a metric for engagement, but suppose UX studies show that users sometimes leave a tab open and switch to another tab while a video is playing. Watch time, then, does not correlate perfectly with quality. We might want to include other features, such as whether there was any activity in the tab while the video was playing, to truly understand the quality of the video.
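As a sketch of the statistical side of this, here is a common way to shrink a redundant feature space: principal component analysis (PCA), which keeps the directions carrying most of the variance and drops the rest. The data below is synthetic, constructed so that 50 observed columns are really mixtures of 5 underlying signals plus noise:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 200 samples, 50 observed features, but only 5 real
# underlying factors spread (redundantly) across those 50 columns.
rng = np.random.default_rng(42)
signal = rng.normal(size=(200, 5))       # the true low-dimensional signal
mixing = rng.normal(size=(5, 50))        # how it leaks into 50 features
X = signal @ mixing + 0.05 * rng.normal(size=(200, 50))  # plus a little noise

# Keep just enough components to explain 95% of the variance.
pca = PCA(n_components=0.95).fit(X)
print(f"reduced 50 dimensions to {pca.n_components_}")
```

Note that PCA only finds redundant *directions*; it knows nothing about your product. Deciding whether a low-variance feature (like tab activity) is nonetheless predictive is exactly where the PM's intuition complements the statistics.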
Data cleaning: You can't just take data off the shelf and apply a model to it. A large part of the success of ML depends on the quality of the data. And by quality, I don't just mean how feature-rich it is, but how well it's cleaned up. Have you removed outliers? Have you normalized all fields? Are there missing or corrupted fields in your data? All of these can make or break your model. As they say: Garbage In, Garbage Out.
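Here is a minimal sketch of what that cleanup looks like in practice, on a made-up log table with the usual problems: a missing value, inconsistent labels, and an obvious outlier. The column names and the 600-minute cut-off are illustrative domain rules, not from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical raw log data with typical quality problems.
df = pd.DataFrame({
    "watch_minutes": [12.0, 15.0, np.nan, 14.0, 13.0, 9999.0],
    "country": ["us", "US ", "us", "uk", "UK", "us"],
})

# Normalize inconsistent labels ("US " and "uk" become "us" / "uk").
df["country"] = df["country"].str.strip().str.lower()
# Drop rows with missing values in the field we care about.
df = df.dropna(subset=["watch_minutes"])
# Remove outliers using a domain rule: no session lasts 600+ minutes.
df = df[df["watch_minutes"] < 600]
# Normalize the numeric field to zero mean and unit variance.
df["watch_minutes_z"] = (
    df["watch_minutes"] - df["watch_minutes"].mean()
) / df["watch_minutes"].std()

print(df)
```

Notice that the outlier rule comes from domain knowledge, not statistics; a purely statistical filter can be fooled when the outlier itself inflates the mean and standard deviation.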
To explain overfitting, I have an interesting story to tell. During the financial crisis of 2007, there was a quant meltdown. Events that were not supposed to be correlated ended up being correlated, and many assumptions that were considered inviolable were violated. The algorithms went haywire, and within three days quant funds amassed enormous losses.
I was working as an engineer at a quant hedge fund, D. E. Shaw, during the 2007 financial meltdown. D. E. Shaw suffered relatively smaller losses than other quant funds at the time. Why? The other quant funds were newer, and their algorithms had been trained on data from the years immediately preceding 2007, which had never seen a downturn. So when prices started crashing, their models didn't know how to react. D. E. Shaw, on the other hand, had faced a similar crash of the Russian rouble in 1998. It too suffered losses, but its algorithms had since been calibrated to expect scenarios like this, and hence they did not crash as badly as some of the other firms'.
This was an extreme case of overfitting. In layman's terms, the models optimized for hindsight and not for foresight. The competitor quant models were trained on assumptions that held true only while the stock markets were doing well; when the crash happened, they couldn't predict the right outcomes and ended up making wrong decisions, leading to further losses.
How do you avoid this? Ensure that you can test your models on a wide variety of data. Also, take a hard look at your assumptions: will they still hold if the economy shifts or user behavior changes? Here's an article I wrote a while back that talks more about managing assumptions.
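The standard engineering defense against overfitting is to evaluate on data the model never trained on. Here is a small synthetic sketch: a very flexible model (a degree-15 polynomial) looks better than a simple one on its training data, and the held-out set exposes the illusion. All numbers are synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical noisy samples from a simple linear trend.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(60, 1))
y = 2.0 * X.ravel() + rng.normal(scale=0.3, size=60)

# Hold out half the data: the model never sees it during training.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

results = {}
for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    results[degree] = (
        mean_squared_error(y_tr, model.predict(X_tr)),  # error on seen data
        mean_squared_error(y_te, model.predict(X_te)),  # error on unseen data
    )
    print(f"degree {degree}: train MSE {results[degree][0]:.3f}, "
          f"test MSE {results[degree][1]:.3f}")
```

The quant-fund analogue of the held-out set is data from a different market regime; a model that has only ever "seen" a bull market is, in effect, being tested on its own training data.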
Underfitting results when your model is too simple for the data it's trying to learn from. Say, for example, you are trying to predict whether shoppers at Safeway will purchase cake mix. Cake mix is a discretionary purchase, so factors such as disposable income, the price of cake mix, and competitors in the vicinity will affect the prediction. But if you do not take into account economic factors such as employment and inflation rates, as well as the growth of other grocery stores, and only focus on shopping behavior inside Safeway, your model will not predict sales accurately.
How do you avoid this? This is where product/customer intuition and market understanding come in handy. If your model isn't performing well, ask whether you have gathered all the data needed to accurately understand the problem. Can you add data from other sources that would paint a better picture of the underlying behavior you are trying to model?
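To illustrate with the Safeway example, here is a synthetic sketch in which sales depend on both an in-store signal (foot traffic) and an external one (disposable income). A model given only the in-store feature underfits; adding the external feature recovers the behavior. All data here is fabricated for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic Safeway-style data: cake-mix sales driven by both an
# in-store signal and an external economic one, plus noise.
rng = np.random.default_rng(1)
foot_traffic = rng.normal(size=(500, 1))
income = rng.normal(size=(500, 1))
sales = (1.0 * foot_traffic + 2.0 * income).ravel() + rng.normal(scale=0.3, size=500)

# Model A: in-store behavior only. It cannot see the income effect,
# so most of the variance in sales goes unexplained.
r2_store = LinearRegression().fit(foot_traffic, sales).score(foot_traffic, sales)

# Model B: add the external economic feature.
X_full = np.hstack([foot_traffic, income])
r2_full = LinearRegression().fit(X_full, sales).score(X_full, sales)

print(f"R^2 store-only: {r2_store:.2f}, with income: {r2_full:.2f}")
```

The point is not the specific model but the diagnosis: when a reasonable model explains little of the outcome, the missing ingredient is often a data source, not a fancier algorithm.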
One overlooked area when building ML products is how computationally expensive machine learning is. With services such as AWS and Azure, you can bootstrap and build machine learning capabilities. At scale, however, you will need to do the hard math of how much computational cost you are willing to incur to provide machine learning features to your users. Based on that cost, you may need to trade off the quality of the predictions. For example, you may not be able to store all the data about your products, or you may not be able to provide the freshest recommendations and instead have to pre-compute them ahead of time. Knowing how your engineering team trades off computation cost against ML precision/recall will help you understand whether the quality of the product is being compromised.
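To make the pre-computation trade-off concrete, here is a toy sketch of an offline batch job that keeps only each user's top-10 recommendations, so that serving is a cheap table lookup instead of scoring every item per request. The scores are random stand-ins for what a trained model would produce.

```python
import numpy as np

# Hypothetical model scores: rows are users, columns are items.
# In production these would come from a trained model; here they
# are random numbers for illustration.
rng = np.random.default_rng(7)
scores = rng.random((1000, 5000))

# Offline batch job: keep only each user's top-10 items, in descending
# order of score. Serving then costs one lookup instead of scoring
# 5,000 items per request -- at the price of freshness, since the
# table is only as current as the last batch run.
top_k = np.argsort(scores, axis=1)[:, -10:][:, ::-1]
serving_table = {user: items.tolist() for user, items in enumerate(top_k)}

print(serving_table[0][:3])  # what a request-time lookup returns
```

This is exactly the kind of design choice worth asking your engineering team about: how stale can the table get before users notice, and what would real-time scoring cost instead?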