Underspecification: The Dangerously Underdiscussed Problem Facing Machine Learning

Andre Ye
January 6, 2021 | AI & Machine Learning

good model loss ≠ good learning

All machine learning models originate in the computer lab. They’re initialized, trained, tested, redesigned, trained again, fine-tuned, and tested yet again before they are deployed.

Afterwards, they fulfill their duty in epidemiological modelling, stock trading, shopping item recommendation, and cyber attack detection, among many other purposes. Unfortunately, success in the lab may not always mean success in the real world — even if the model does well on the test data.

This big problem, that machine learning models developed on a computer to serve a purpose in the real world can often fail once deployed, has received little research or discussion. Usually, the failure is written off by assuming that the data given to the model was not representative, or something similar.

Valid as that explanation may be, in November 2020 a group of researchers proposed a broader explanation for this phenomenon: underspecification. While underspecification certainly has been explored deeply, the researchers write, “its implications for the gap between iid (model training ‘in the lab’) and application-specific generalization are neglected.”

Take this example: we’re trying to use machine learning to predict the maximum number of birds a zoo exhibit will ever hold.

The first step is to fit a model of the number of birds in the exhibit x years after it opened. The model our algorithm derives is a quartic one, with three extrema: the number of birds steadily rises through reproduction, then suddenly plummets (perhaps a disease).

[Figure: the quartic model fit to the bird population over time. Created by author.]

The next step is to find the maximum value of the model we’ve created. We’ll create an optimizer, which initializes at a random point and uses derivatives to decide in which direction to move.

For instance, in this case the initialized optimizer can move left or right: if it moves left, the number of birds goes up; if it moves right, the number of birds goes down. Since the optimizer’s goal is to find the maximum number of birds, it’ll move to the left.

[Figure: the optimizer moving from its random initialization toward a maximum. This oversimplifies the actual process of optimization, but the point stands: random initialization can determine a lot. Created by author.]
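To make this concrete, here’s a minimal sketch in Python (the quartic’s coefficients are invented for illustration): plain gradient ascent simply climbs whichever peak is nearest its starting point.

# Invented quartic fit to the bird-count data; its derivative has roots at
# x ≈ -1.948, 1, and 3, so there are local maxima at x ≈ -1.948 and x = 3.
def f(x):        # modelled bird count, x years after the exhibit opened
    return -x**4 + 2.736 * x**3 + 9.584 * x**2 - 23.376 * x + 40

def f_prime(x):  # analytic derivative of f
    return -4 * x**3 + 8.208 * x**2 + 19.168 * x - 23.376

def find_maximum(x0, lr=1e-3, steps=20_000):
    """Plain gradient ascent: follow the derivative uphill from x0."""
    x = x0
    for _ in range(steps):
        x += lr * f_prime(x)
    return x

for x0 in (-3.0, 2.0):  # two arbitrary starting points
    print(f"start {x0:+.1f} -> maximum at x = {find_maximum(x0):.3f}")
# start -3.0 -> maximum at x = -1.948   (before the exhibit existed!)
# start +2.0 -> maximum at x = 3.000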

Voilà! Our crafty model has found that the maximum number of birds in the zoo exhibit will be reached 1.948 years before the exhibit opened.

Obviously, the model has come to a false conclusion. What went wrong?

This is an easy question to answer. We didn’t specify the actual domain of the phenomenon we’re trying to model. We should have set some sort of constraint in the code so that the optimizer searches only valid years.

optimizer.find_maximum(search_range="0 <= x")  # hypothetical API: search only years after the exhibit opened

Using this constraint would have made our model better reflect the phenomenon of bird population growth (and sudden decline). By specifying these limits, we eliminate answers that are artifacts of the representation, like an object existing before its creation.
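Concretely, since the model here is just a polynomial, one way to honor the constraint is to enumerate its extrema and keep only those in the valid domain. A sketch, reusing the invented quartic from above:

import numpy as np

# The same invented quartic as in the earlier sketch
f = lambda x: -x**4 + 2.736 * x**3 + 9.584 * x**2 - 23.376 * x + 40

# Coefficients of its derivative (highest power first); the derivative's
# roots are the candidate extrema
critical_points = np.roots([-4.0, 8.208, 19.168, -23.376]).real

valid = critical_points[critical_points >= 0]  # discard years before opening
best = valid[np.argmax(f(valid))]              # the in-domain maximum
print(best)                                    # ≈ 3.0, not -1.948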

In this case, it’s fairly easy to see that the answer the model spits out is patently wrong, and to slap a limitation on it so it behaves. However, this ease of fixing the model is accompanied, not by coincidence, by the fact that the model can’t really do much: it takes in one input, the year, and predicts the population from a polynomial.

What’s harder, a lot harder, is to do the same thing with massive neural networks, which are the most prone to underspecification. Furthermore, these massive networks are often used to model phenomena far more complex than birds in a zoo, like the behavior patterns of millions of users or the distinguishing features of thousands of faces.

So, we have a problem (usually, finding the best parameters for a neural network) and multiple solutions (combinations of parameters for a neural network). That is, many different sets of parameters achieve similarly good loss, both in training and validation.

Underspecification becomes an issue when the problem has only one solution, but the method through which we find solutions yields multiple.

There’s one true solution that models the desired phenomenon, but we can’t put specifications on the solution-finding process (because it’s too complicated), so we find multiple solutions that all look great by our metric.

[Figure: “exists before creation” refers to solutions that technically work in the modelling landscape, but not in the actual phenomenon the model is intended for. Created by author.]

Let’s consider a neural network trained to decide which video to recommend to a user. It takes in inputs about the user’s behavior and outputs, for tens of thousands of videos, a probability from zero to one that the user would enjoy watching each one. There’s likely only one set of weights that truly captures the thought process of a user as they watch a video, but many weights that can artificially connect input to target quite well.
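As a rough sketch of the kind of model being described (the feature count and layer sizes here are assumptions for illustration, not from the article):

import torch.nn as nn

# Hypothetical sizes: 128 behavioral features in, 10,000 candidate videos out
N_FEATURES, N_VIDEOS = 128, 10_000

recommender = nn.Sequential(
    nn.Linear(N_FEATURES, 256),
    nn.ReLU(),
    nn.Linear(256, N_VIDEOS),
    nn.Sigmoid(),  # one enjoyment probability in [0, 1] per video
)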

Think of it as answering 1 + 1 = ?. The natural way is to understand that the notation x + y means “add the two numbers” and to answer 2. An artificial way to answer is to say that x + y means (2x)/y. Indeed, this definition satisfies 1 + 1 = 2, but it doesn’t capture the intended phenomenon, and it doesn’t hold up across other situations or variants of the phenomenon.

(With something as complex as a massive neural network, it’s not infeasible at all to expect it to dream up a solution as contrived as (2x)/y, even with measures like regularization.)
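A two-line sketch makes the point: both definitions agree on the lone training example, and only deployment reveals the difference.

true_add = lambda x, y: x + y         # the intended phenomenon
shortcut = lambda x, y: (2 * x) / y   # a contrived fit to the training data

print(true_add(1, 1), shortcut(1, 1))  # 2 2.0 -> indistinguishable in the lab
print(true_add(3, 4), shortcut(3, 4))  # 7 1.5 -> wildly different in deployment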

Other solutions may look good in the lab because they’re artificial solutions, like the maximum number of birds occurring 1.948 years before they were even put in the exhibit. However, when they’re deployed in the real world, they won’t perform as well, because they haven’t actually learned the mechanics of the phenomenon as completely as another solution might have. They might still be computing (2x)/y instead of simply adding the two numbers.

Building upon this: any specification is a condition on, or a fixing of, parameters. A model with many fixed parameters will suffer little underspecification. However, large models with many trainable parameters, like deep neural networks, are especially prone to it.

Further, random initialization plays a role in underspecification. Because there are so many solutions that look good to the optimizer, and the optimizer can’t tell which ones actually capture the phenomenon, we can’t really tell whether a trained model truly understands the data. Hence, something as arbitrary as a random seed can be the difference between success and failure in deployment.

This is why multiple models can be trained and evaluated on the same data, yield the same loss, and still perform very differently in deployment.
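Here is a minimal sketch of that effect (the toy data and architecture are assumptions for illustration, not from the paper): two networks identical except for their random seed fit the training distribution equally well, then disagree once they must extrapolate.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(200, 1))                    # training inputs
y = np.sin(3 * X).ravel() + 0.05 * rng.normal(size=200)  # noisy targets

# Two identical networks, differing only in their random seed
nets = [MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=5000,
                     random_state=seed).fit(X, y)
        for seed in (0, 1)]

X_test = rng.uniform(-1, 1, size=(100, 1))  # in-distribution test inputs
X_far = np.array([[2.0], [3.0]])            # out-of-distribution inputs
for net in nets:
    print(round(net.score(X_test, np.sin(3 * X_test).ravel()), 3),  # similar R^2
          net.predict(X_far))                                       # divergent guesses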

Underspecification in ML pipelines undermines our ability to produce models that reliably behave as expected in deployment.

Underspecification occurs in many facets of machine learning. The researchers who introduced underspecification in this context ran “stress tests”, probing the output on carefully designed inputs to measure a model’s biases, across a variety of models and contexts.

From genomics to epidemiology, and from small neural networks to deep convolutional networks to natural language processing, underspecification applies. This has big implications for how we view learning.

Are natural language processing systems actually learning general linguistic principles like grammar, or are they taking artificial shortcuts? Results of stress tests on these systems, along with a large body of research, cast doubt on the idea that our deepest NLP models are actually learning the language.

There is an interest in creating models that understand human language, given that their purpose is to serve humans. An NLP model that relies on linguistic shortcuts rather than linguistic understanding might work nicely in the lab, but it will run into trouble as soon as it is deployed.
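Here’s a toy analogue of such a stress test (the setup is invented for illustration): the training data contains a spurious shortcut feature, the model leans on it, and a stress set that breaks the shortcut exposes that dependence.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
y = rng.integers(0, 2, 500)
X = np.column_stack([
    rng.normal(size=500) + y,  # genuine but noisy signal
    y.astype(float),           # spurious shortcut: copies the label exactly
])

clf = LogisticRegression().fit(X, y)
print(clf.score(X, y))         # near-perfect accuracy in the lab

X_stress = X.copy()
X_stress[:, 1] = rng.integers(0, 2, 500).astype(float)  # break the shortcut
print(clf.score(X_stress, y))  # accuracy drops sharply without the shortcut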


We’ve always known that machine learning algorithms were lazy, but we haven’t considered that they might essentially be lying to us. Since phenomena inside neural networks are so often written off with “it’s a black box”, we haven’t explored this much at all.

Underspecification in this respect is relatively new, and very important. How to solve it matters not only for the practical problem of making model deployment more stable, but also for making models actually learn the mechanics of the data instead of taking shortcuts or generating nonsensical artificial solutions.

Going forward, this will become an important idea to investigate and even — perhaps — to solve.

Read the paper that introduced underspecification in relation to model deployment here.
