Why Do We Teach Under-Parameterized Models?

Kevin Zatloukal
11 min read · Aug 31, 2023

When I first studied machine learning, over 20 years ago, the course started with classical statistical techniques (so-called “maximum likelihood”) and used that as a jumping-off point for machine learning (ML) approaches to prediction. I used the same approach in an article I wrote for OSAM about applying ML to investing. Even though the classical statistical approach is clearly inferior to ML in actual prediction tasks, it is useful as a stepping stone toward understanding ML.

In addition to the critical distinction between classical statistics and ML that has been around for decades, there is now a similarly important distinction between conventional (classical) ML and deep-learning approaches or, perhaps more to the point, between “under-parameterized” and “over-parameterized” models. Just as with statistics versus ML, these two approaches require thinking about the prediction problem in distinct ways, and the newer approach has clearly superior prediction accuracy. The winning solutions to Kaggle competitions, I am told, almost always use approaches that build over-parameterized models.

That similarity leads me to think that, while under-parameterized models may be useful as a step to understanding over-parameterized ones, our ML courses should make clear to students that over-parameterized models are what they will almost always want to use in practice.

From Statistics to Conventional ML

The goal in any supervised learning problem is to use training data (examples) to build a model that can make accurate predictions on new data. The universe of possible models is split up into “classes”, each containing models that are identical except for some number of parameters (a.k.a. “weights”). An ML approach comes with an algorithm for finding the choice of parameters — the model within that class — that will do the best job of accurately predicting unseen examples.
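
To make the terminology concrete, here is a minimal sketch of one model class, lines y = w·x + b, where the two parameters w and b are the “weights” and the learning algorithm’s job is to pick them from training data. The tiny dataset is made up for illustration.

```python
import numpy as np

# One "class" of models: lines y = w*x + b. Every model in the class has the
# same form and differs only in the two parameters (weights) w and b.
x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = np.array([0.1, 1.9, 4.2, 5.8])

# The learning algorithm picks the parameters from the training data;
# np.polyfit with degree 1 chooses w and b by least squares.
w, b = np.polyfit(x_train, y_train, deg=1)

# The fitted model can now make predictions on new (unseen) inputs.
x_new = np.array([4.0, 5.0])
print(w * x_new + b)
```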

In classical statistics, we are taught to choose the model that best fits the training data, the “maximum likelihood” model for those data. Conventional ML starts with the empirical observation that the maximum likelihood model is usually not the most accurate on unseen examples.

Almost always, you can improve accuracy by choosing the weights, not by naively minimizing errors on the training data (maximizing the likelihood of the predictions), but by minimizing a function that also includes a penalty on models that are too complex, essentially Occam’s Razor.
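
As a sketch of what that penalized objective looks like in code, the snippet below fits a degree-9 polynomial to noisy data twice: once by minimizing training error alone (the maximum-likelihood fit under Gaussian noise) and once by minimizing training error plus an L2 penalty on the weights (ridge regression). The data and the penalty strength `lam` are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a simple underlying function.
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(x.size)

# Degree-9 polynomial features: columns x^0, x^1, ..., x^9.
X = np.vander(x, 10, increasing=True)

# Maximum likelihood: minimize squared training error only.
w_mle = np.linalg.lstsq(X, y, rcond=None)[0]

# Penalized fit: minimize squared error + lam * ||w||^2 (closed-form ridge).
lam = 1e-3
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# The penalized weights are much smaller, which typically means a smoother,
# better-generalizing curve.
print(np.linalg.norm(w_mle), np.linalg.norm(w_ridge))
```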

In practice, the results look something like this:

from arXiv:1812.11118

Here, the dashed curve shows the training error and the solid curve shows the error on unseen (“test”) examples, as we consider more and more complex models. Initially, test error decreases along with error on the training data, but beyond some point (the “sweet spot” above), a model that is complex enough to improve accuracy further on training data actually gives worse predictions on test data.

Models past the sweet spot are said to “over-fit” the training data. Usually, we understand this by considering that the training data itself is rarely perfect but instead contains some noise. Once you give your model too much complexity — too much capacity to learn — it will learn the random noise as well as the signal present in the training data. Learning the noise reduces errors on the training data but can (and usually does) cause more errors on the test data because the latter has different random noise.

To find a model at the sweet spot, we introduce a complexity penalty into our calculations that causes models capable of fitting the random noise to score worse than models that do not fit the noise, even though they have lower error on the training data.
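
A minimal sketch of the sweet-spot search, using synthetic data and polynomial degree as the complexity knob (the exact location of the sweet spot depends on the problem):

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

# Sweep model complexity (polynomial degree) and record train/test error.
for degree in [1, 3, 5, 9, 12]:
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train {train_err:.3f}  test {test_err:.3f}")

# Training error keeps falling as the degree grows, while test error typically
# falls and then rises again; the degree with the lowest test error is the sweet spot.
```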

For classical statistical problems, the maximum likelihood approach works well because it is used with models that are so simple that their capacity to learn is never past the sweet spot. If we want to improve accuracy further, though, we need to consider more complex models. Since we don’t start out knowing where the sweet spot is, we need to consider models that are sufficiently complex to over-fit and then add in a complexity penalty in order to push back to the sweet spot.

From Conventional to Over-Parameterized Models

The incredible observation made in the last decade is that this is only half of the picture. Something miraculous often happens if you keep going, letting your models be more and more complex:

from arXiv:1812.11118

Continuing to make the models more complex can eventually improve performance again! For these problems, the picture ends up with two different dips in error rate, a so-called “double descent” curve.

Once again, there is a particular model complexity where the error rate changes direction. Unlike the “sweet spot”, whose location is initially unknown, we actually know where this second location is: it is the point where the error rate on the training data becomes zero! This second point is called the “interpolation threshold” because the model now interpolates the training data, hitting every data point exactly without error.
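
As a rough illustration of where that threshold sits, the sketch below fits minimum-norm least-squares models on random ReLU features of increasing width: training error hits (numerically) zero once the number of features reaches the number of training examples, and on many synthetic problems the test error peaks near that point before falling again. The data and feature map here are invented for illustration, and the exact shape of the curve depends on the problem.

```python
import numpy as np

rng = np.random.default_rng(2)
n_train, n_test, d = 50, 500, 5
w_true = rng.standard_normal(d)

def make_data(n):
    X = rng.standard_normal((n, d))
    y = np.sin(X @ w_true) + 0.1 * rng.standard_normal(n)
    return X, y

X_train, y_train = make_data(n_train)
X_test, y_test = make_data(n_test)

# Sweep the number of random ReLU features (the model's capacity).
for width in [10, 25, 50, 100, 400, 1600]:
    W = rng.standard_normal((d, width)) / np.sqrt(d)
    F_train = np.maximum(X_train @ W, 0.0)
    F_test = np.maximum(X_test @ W, 0.0)
    # lstsq returns the minimum-norm solution once the system is underdetermined.
    w = np.linalg.lstsq(F_train, y_train, rcond=None)[0]
    train_err = np.mean((F_train @ w - y_train) ** 2)
    test_err = np.mean((F_test @ w - y_test) ** 2)
    print(f"width {width:5d}: train {train_err:.4f}  test {test_err:.4f}")

# Training error reaches (numerically) zero at width >= n_train, the interpolation
# threshold; past it, test error often starts improving again.
```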

Prior to the last decade, it was widely known that zero training error was a bad sign: “A model with zero training error is overfit to the training data and will typically generalize poorly” (The Elements of Statistical Learning, 2009). For various reasons (e.g., computational limitations), no one tried continuing to increase the complexity anyway. Those who finally did try made the incredible discovery shown in the picture above: overfit models can be even more accurate.

This new picture gives us two reasons why over-parameterized models are preferable to under-parameterized ones in practice. First, they usually (though not always) have better performance: in the picture above, the second dip drops further down than the first. Second, they are much easier to tune. Rather than searching for a sweet spot, we simply keep increasing the model complexity until the training error reaches zero and then continue beyond that as long as the resulting gains in test error are worth it. (Bigger models have higher computational costs, so not all improvements in accuracy are worthwhile.)

Two Distinct Approaches

One problem with the picture above is that it makes the regime of over-parameterized models seem like a straightforward extension of the under-parameterized regime: just do the same thing, but let the models get even more complex. In reality, working with over-parameterized models requires a change in approach, just as one was needed to move from statistical models to ML.

One way to think about the under-parameterized regime is that we find our model by limiting ourselves to models of a certain complexity and then, within that space, find the one that reduces the training error as much as possible. Once the models have enough capacity to achieve zero training error, however, this approach no longer makes sense. There are many models (infinitely many) that can achieve zero error, so which do we pick?

What we do instead, then, is limit ourselves to models with zero training error and then, within that space, find the one that is least complex. We will consider models with as many parameters as our machines can handle, and our model-search algorithm will find the one that uses those parameters the least while still achieving zero error.
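
Here is a minimal sketch of that idea on a made-up underdetermined linear problem: infinitely many weight vectors fit the training data exactly, and among them we pick the one with the smallest norm, which `np.linalg.pinv` computes directly.

```python
import numpy as np

rng = np.random.default_rng(3)

# More parameters (20) than training examples (8): an over-parameterized linear model.
X = rng.standard_normal((8, 20))
y = rng.standard_normal(8)

# Any w with X @ w == y has zero training error; there are infinitely many of them.
# The pseudoinverse picks the one with the smallest L2 norm (the "least complex" one).
w_min_norm = np.linalg.pinv(X) @ y

# Another zero-error solution: add any vector from the null space of X.
null_component = rng.standard_normal(20)
null_component -= np.linalg.pinv(X) @ (X @ null_component)  # keep only the null-space part
w_other = w_min_norm + null_component

print(np.allclose(X @ w_min_norm, y), np.linalg.norm(w_min_norm))
print(np.allclose(X @ w_other, y), np.linalg.norm(w_other))  # also zero error, but larger norm
```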

If I can get slightly more technical for a moment, we can see further distinctions between the two regimes in the mathematics. With under-parameterized models, it usually works best to measure model complexity with the L1 norm because that tends toward models that don’t use some of the parameters at all (setting them to zero). With over-parameterized models, it usually works best to measure model complexity with the L2 norm, which uses all of the parameters but simply keeps their values small.
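
A quick sketch of that difference, using scikit-learn on synthetic data: an L1-penalized fit (Lasso) tends to set many coefficients exactly to zero, while an L2-penalized fit (Ridge) keeps them all but shrinks them. The penalty strengths below are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(4)

# 100 examples, 30 features, but only 3 features actually matter.
X = rng.standard_normal((100, 30))
w_true = np.zeros(30)
w_true[:3] = [3.0, -2.0, 1.5]
y = X @ w_true + 0.1 * rng.standard_normal(100)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

print("nonzero L1 coefficients:", np.sum(lasso.coef_ != 0))   # typically around 3
print("nonzero L2 coefficients:", np.sum(ridge.coef_ != 0))   # all 30, just small
```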

Indeed, another one of the remarkable discoveries of the last decade is that it is often unnecessary to explicitly minimize model complexity. It appears that this is the case because the standard model fitting algorithm, gradient descent, naturally finds the solution with the smallest L2 norm. The algorithm itself is implicitly finding the “simplest” model in that sense.
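
A small sketch of that implicit bias for the simple case of linear least squares: gradient descent started from zero weights on an over-parameterized linear model converges to (essentially) the same minimum-L2-norm solution that the pseudoinverse computes, even though no penalty appears anywhere in the loss. (For deep networks the full story is more subtle and still being worked out.)

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((8, 20))   # 8 examples, 20 parameters
y = rng.standard_normal(8)

# Plain gradient descent on squared error, starting from w = 0, with no penalty term.
w = np.zeros(20)
lr = 0.01
for _ in range(20000):
    grad = X.T @ (X @ w - y)
    w -= lr * grad

w_min_norm = np.linalg.pinv(X) @ y
print(np.allclose(w, w_min_norm, atol=1e-4))  # gradient descent found the min-norm solution
```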

Why Do Over-Parameterized Models Work?

Understanding when and why over-parameterization works is an area of active research, but let me mention a couple of likely suspects.

The first suspect is averaging models. Even with under-parameterized models, it was known that averaging together many complex models can actually produce a simpler model. In the following picture, think of a more complex model as one that allows more bend in the curve dividing blue dots from orange dots (a classification problem). Averaging together many wavy curves could actually produce a curve that is fairly straight. (Credit to Pedro Domingos, who drew a picture like this in one of his lectures.)

The average of complex models can be a simpler model.
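
A sketch of that intuition, using a made-up regression problem rather than the classification picture above: each high-degree polynomial fit to a bootstrap resample is wiggly on its own, but averaging many of them produces a much smoother, more accurate curve.

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.linspace(0, 1, 40)
y = 2 * x + 0.3 * rng.standard_normal(x.size)   # the true relationship is a straight line
x_grid = np.linspace(0.1, 0.9, 100)             # evaluate away from the edges

# Fit many wiggly degree-8 polynomials, each on a bootstrap resample of the data.
preds = []
for _ in range(200):
    idx = rng.integers(0, x.size, x.size)
    coeffs = np.polyfit(x[idx], y[idx], deg=8)
    preds.append(np.polyval(coeffs, x_grid))
preds = np.array(preds)

truth = 2 * x_grid
err_single = np.mean(np.abs(preds - truth), axis=1)        # error of each wiggly curve
err_average = np.mean(np.abs(preds.mean(axis=0) - truth))  # error of the averaged curve

# The averaged curve is typically much closer to the true line than a single curve is.
print(err_single.mean(), err_average)
```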

The most important class of over-parameterized models is deep neural networks. The large language models powering ChatGPT, for example, are said to have more parameters than the number of examples used to train them! That is enough space to memorize the input, so there is little question that they are working in the over-parameterized regime.

Deep neural networks are often trained using “dropout”, which implicitly causes the algorithm to build an average of many different models. Each iteration, some parameters are left out, requiring the algorithm to train a model using just the remaining parameters. The final result is an average of all of the models built in different iterations.
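
Here is a minimal sketch of the dropout idea itself (pure NumPy, not any particular framework’s training code): during each training pass a random subset of units is zeroed out, so each pass effectively trains a different sub-model; at prediction time all units are used, which behaves like an average over those sub-models.

```python
import numpy as np

rng = np.random.default_rng(7)

def dropout(activations, p_drop, training):
    """Inverted dropout: zero out a random fraction p_drop of units during training."""
    if not training:
        return activations                      # at prediction time, use every unit
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)  # rescale so the expected value is unchanged

h = rng.standard_normal((4, 8))                 # a batch of hidden-layer activations
print(dropout(h, p_drop=0.5, training=True))    # a different sub-network on each call
print(dropout(h, p_drop=0.5, training=False))   # the full network at prediction time
```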

The other important class of over-parameterized models is ensembles of trees such as Random Forests. The latter are explicitly an average of trees. Each individual tree is usually large enough to perfectly fit the training data, meaning they are almost certainly over-fit, but the forest combines them all together by averaging. Adding each new tree actually makes the model more complex, in terms of parameter count, but on almost every problem, more complexity combined with more averaging produces a more accurate model.
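
A sketch of that behavior with scikit-learn on synthetic data: each tree is grown to full depth (the default) and fits its bootstrap sample exactly, and with enough trees the forest’s training accuracy is typically 1.0 as well, yet adding more of these over-fit trees usually improves, not hurts, test accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Trees are grown to full depth, so each one fits its bootstrap sample perfectly.
for n_trees in [1, 10, 100, 500]:
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    forest.fit(X_train, y_train)
    print(f"{n_trees:3d} trees: train {forest.score(X_train, y_train):.3f}"
          f"  test {forest.score(X_test, y_test):.3f}")
```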

The other suspect I want to mention is related to the fact that models with zero training error are definitely learning the noise in the training data. Specifically, this suspect is the idea that more complex models are able to fit the noise with less damage to the rest of the model.

For example, suppose that we are trying to learn this curve that separates the blue dots from the orange ones but the data has been corrupted in such a way that one blue dot was turned orange. Here, the dark blue / orange dots are training data and light blue / orange dots are test data.

If we restrict ourselves to models of medium complexity — only a small amount of “bend” — then we can pick off this one orange point without affecting the rest of the training data, but the “bulge” we introduced is so large that, unbeknownst to us, it includes a couple of the test data points.

In contrast, if we use even more complex models that allow even more bend, we can pick off that one point without taking any test data points with it, giving us less error on the test data as well as the training data. The resulting model looks like a line nearly everywhere, except for the small spot where it has to fit the noisy data point.

As I said above, researchers are still trying to figure out which of these suspects (or other ones) best explains the success of over-parameterized models, but both of these make sense to me and give me enough confidence to believe what the data is already telling us: that over-parameterized models are almost always best. As noted above, the winners of Kaggle competitions are almost always over-parameterized, specifically, deep neural networks or ensembles of trees.

Revisiting the Wide Receiver Data Model

The OSAM article I cited above looked at the problem of predicting the success of college wide receivers in the NFL using very simple ensembles of trees. Recently, I revisited this problem to see if I could demonstrate the same “double descent” behavior that others had seen with trees. Sure enough, you can see it here:

The vertical axis is the error rate (labelled “accuracy”) on test data. The horizontal axis is a measure of complexity: the label N/M means an ensemble of M trees, each containing N nodes.

The first part of the curve shows the behavior for ensembles of 10 trees as we increase the complexity of the individual trees from 5 nodes to 50. There, we see the traditional first descent of the under-parameterized regime. Moving from 1 to 5 to 10 nodes in the trees makes them more accurate, but moving from 10 nodes up to 50 makes the results worse.

However, using 50 nodes per tree works fine if we average enough of them together. The second part of the curve shows how the error rate drops as we average together 30, 50, up to 250 trees, each containing 50 nodes. As we can see, the final result is much better than what we would get if we stopped at the initial “sweet spot” of 10 trees containing 10 nodes each.

You might wonder, since 10 nodes was best when we averaged together only 10 trees, would it work better to stick to 10 nodes as we build ensembles of 30, 50, up to 250 trees? The answer is no.

The orange curve in this picture shows the results with trees containing only 10 or fewer nodes, rather than 50 nodes. The results are clearly worse. As we can see, we want more nodes in the trees as well (50 or more nodes, not just 10) as we do more averaging.
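
For readers who want to try the same kind of sweep, here is a sketch of how it might be set up with scikit-learn. The wide receiver data set itself is not reproduced here, so this uses a synthetic stand-in (`make_classification`), and `max_leaf_nodes` stands in for the “N nodes per tree” above; the exact numbers will differ from the figures shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the (small) wide receiver data set.
X, y = make_classification(n_samples=400, n_features=15, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# First sweep tree size with a fixed, small number of trees (the under-parameterized part)...
for n_nodes in [5, 10, 25, 50]:
    model = RandomForestClassifier(n_estimators=10, max_leaf_nodes=n_nodes, random_state=0)
    model.fit(X_train, y_train)
    print(f"{n_nodes}/10 -> test accuracy {model.score(X_test, y_test):.3f}")

# ...then keep the large trees and sweep the amount of averaging (the over-parameterized part).
for n_trees in [30, 50, 100, 250]:
    model = RandomForestClassifier(n_estimators=n_trees, max_leaf_nodes=50, random_state=0)
    model.fit(X_train, y_train)
    print(f"50/{n_trees} -> test accuracy {model.score(X_test, y_test):.3f}")
```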

It is worth pointing out that my wide receiver data set is extremely small: only several hundred examples. This problem thus shows that over-parameterization is beneficial even on “small data” learning problems.

Conclusion

The results of Kaggle competitions and even my own experiments (not just the wide receiver data but also factor models) suggest that, in practice, over-parameterized models give more accurate predictions on most problems. They are also easier to tune, since there is no need to search for a “sweet spot”. While under-parameterized models remain useful for explaining the core ideas of machine learning, they seem hard to recommend for practical use. Like statistical models, they seem best as a stepping stone, this time to the better models of the over-parameterized regime.
