Regression – Why Overfitted Models Tend to Have Large Coefficients

Tags: bias, linear model, regression, regularization, variance

I imagine that the larger a coefficient on a variable is, the more ability the model has to "swing" in that dimension, providing an increased opportunity to fit noise. Although I think I have a reasonable sense of the relationship between model variance and large coefficients, I don't have as good a sense of why they occur in overfitted models. Is it incorrect to say that they are a symptom of overfitting, and that coefficient shrinkage is more a technique for reducing model variance? Regularization via coefficient shrinkage seems to operate on the principle that large coefficients are the result of an overfitted model, but perhaps I'm misinterpreting the motivation behind the technique.

My intuition that large coefficients are generally a symptom of overfitting comes from the following example:

Let's say we wanted to fit $n$ points that all sit on the x-axis. We can easily construct a polynomial whose roots are these points: $f(x) = (x-x_1)(x-x_2)\cdots(x-x_{n-1})(x-x_n)$. Let's say our points are at $x=1,2,3,4$. Expanding this product gives coefficients whose magnitudes are all at least $10$, except for the leading coefficient. As we add more points (and thereby increase the degree of the polynomial), the magnitudes of these coefficients grow quickly.
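
To make this concrete, here is a small sketch that expands $(x-1)(x-2)(x-3)(x-4)$ by multiplying out one root at a time; mult_root is just a helper defined here, not a function from any package:

mult_root <- function(p, r) c(0, p) - r*c(p, 0)   # multiply polynomial p (coefficients from lowest to highest degree) by (x - r)

coefs <- Reduce(mult_root, 1:4, init=1)           # expand (x-1)(x-2)(x-3)(x-4), starting from the constant polynomial 1
setNames(coefs, paste0("x^", 0:4))
#  x^0  x^1  x^2  x^3  x^4 
#   24  -50   35  -10    1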

This example is how I'm currently connecting the size of the model coefficients with the "complexity" of the generated models, but I'm concerned that this case is too sterile to really be indicative of real-world behavior. I deliberately built an overfitted model (a 10th-degree polynomial OLS fit on data generated from a quadratic sampling model) and was surprised to see mostly small coefficients in my model:

set.seed(123)
xv <- seq(-5, 15, length.out=1e4)
x <- sample(xv, 20)                              # 20 random x values in [-5, 15]
gen <- function(v) v^2 + 7*rnorm(length(v))      # quadratic signal plus Gaussian noise
y <- gen(x)
df <- data.frame(x, y)

model <- lm(y ~ poly(x, 10, raw=T), data=df)     # deliberately overfitted 10th-degree OLS fit
summary(abs(model$coefficients))
#     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
# 0.000001 0.003666 0.172400 1.469000 1.776000 5.957000


data.frame(sort(abs(model$coefficients)))
#                                   model.coefficients
# poly(x, 10, raw = T)10                  7.118668e-07
# poly(x, 10, raw = T)9                   3.816941e-05
# poly(x, 10, raw = T)8                   7.675023e-04
# poly(x, 10, raw = T)7                   6.565424e-03
# poly(x, 10, raw = T)6                   1.070573e-02
# poly(x, 10, raw = T)5                   1.723969e-01
# poly(x, 10, raw = T)3                   6.341401e-01
# poly(x, 10, raw = T)4                   8.007111e-01
# poly(x, 10, raw = T)1                   2.751109e+00
# poly(x, 10, raw = T)2                   5.830923e+00
# (Intercept)                             5.956870e+00

Maybe the takeaway from this example is that eight of the eleven coefficients are less than 1 in magnitude, while the three that stand out as unusually large belong to the intercept, the linear term and the quadratic term, i.e. the terms most closely related to the true sampling model.

Is $L_2$ regularization just a mechanism to diminish the variance in a model and thereby "smooth" the curve to better fit future data, or is it taking advantage of a heuristic derived from the observation that overfitted models tend to exhibit large coefficients? Is it accurate to say that overfitted models tend to exhibit large coefficients? If so, can anyone explain the mechanism behind the phenomenon and/or point me to some relevant literature?

Best Answer

In the regularisation context, a "large" coefficient means that the estimate's magnitude is larger than it would have been if a fixed model specification had been used. It's the impact of obtaining not just the estimates, but also the model specification, from the data.

Consider what a procedure like stepwise regression will do for a given variable. If the estimate of its coefficient is small relative to the standard error, it will get dropped from the model. This could be because the true value really is small, or simply because of random error (or a combination of the two). If it's dropped, then we no longer pay it any attention. On the other hand, if the estimate is large relative to its standard error, it will be retained. Notice the imbalance: our final model will reject a variable when the coefficient estimate is small, but we will keep it when the estimate is large. Thus we are likely to overestimate its value.

Put another way, what overfitting means is that you're overstating the impact of a given set of predictors on the response. But the only way you can overstate that impact is if the estimated coefficients are too big (and, conversely, the implicit estimates of zero for your excluded predictors are too small).

What you should do is incorporate a variable selection procedure into your experiment, e.g. stepwise regression via step. Then repeat the experiment many times, on different random samples, and save the estimates. You should find that the estimates of the coefficients $\beta_3$ to $\beta_{10}$ are systematically too large in magnitude compared to not using variable selection. Regularisation procedures aim to fix or mitigate this problem.

Here's an example of what I'm talking about.

repeat.exp <- function(M)
{
    x <- seq(-2, 2, len=25)
    px <- poly(x, 10)                   # orthogonal polynomial basis of degree 10
    colnames(px) <- paste0("x", 1:10)
    out <- setNames(rep(NA, 11), c("(Intercept)", colnames(px)))
    sapply(1:M, function(...) {
        y <- x^2 + rnorm(length(x), sd=2)   # quadratic signal plus noise
        d <- data.frame(px, y)
        # stepwise selection, starting from y ~ x1 with x1..x10 as the upper scope
        b <- coef(step(lm(y ~ x1, data=d), y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10, trace=0))
        out[names(b)] <- b              # terms dropped by step() remain NA for this run
        out
    })
}

set.seed(53520)
z <- repeat.exp(M=1000)

# some time later...
rowMeans(abs(z), na.rm=TRUE)

(Intercept)          x1          x2          x3          x4          x5          x6          x7          x8          x9         x10 
   1.453553    3.162100    6.533642    3.108974    3.204341    3.131208    3.118276    3.217231    3.293691    3.149520    3.073062 

Contrast this to what happens when you don't use variable selection, and just fit everything blindly. While there is still some error in the estimates of $\beta_3$ to $\beta_{10}$, the average deviation is much smaller.

repeat.exp.base <- function(M)
{
    x <- seq(-2, 2, len=25)
    px <- poly(x, 10)
    colnames(px) <- paste0("x", 1:10)
    out <- setNames(rep(NA, 11), c("(Intercept)", colnames(px)))
    sapply(1:M, function(...) {
        y <- x^2 + rnorm(length(x), sd=2)   # same data-generating process as above
        d <- data.frame(px, y)
        b <- coef(lm(y ~ ., data=d))        # fit the full model, no variable selection
        out[names(b)] <- b
        out
    })
}

set.seed(53520)
z2 <- repeat.exp.base(M=1000)

rowMeans(abs(z2))
(Intercept)          x1          x2          x3          x4          x5          x6          x7          x8          x9         x10 
   1.453553    1.676066    6.400629    1.589061    1.648441    1.584861    1.611819    1.607720    1.656267    1.583362    1.556168 

Also, both $L_1$ and $L_2$ regularisation implicitly assume that all your variables, and hence coefficients, are in the same units of measurement, i.e. a unit change in $\beta_1$ is equivalent to a unit change in $\beta_2$. Hence the usual step of standardising your variables before applying either of these techniques.
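
For instance, here is a minimal sketch of that standardisation step. It assumes the glmnet package, which is not used elsewhere in this answer; note that glmnet also standardises predictors internally by default via its standardize argument:

library(glmnet)

set.seed(53520)
x <- seq(-2, 2, len=25)
px <- poly(x, 10)
colnames(px) <- paste0("x", 1:10)
y <- x^2 + rnorm(length(x), sd=2)

X <- scale(as.matrix(px))            # centre and scale each predictor column
fit.ridge <- glmnet(X, y, alpha=0)   # alpha=0 gives the L2 (ridge) penalty
fit.lasso <- glmnet(X, y, alpha=1)   # alpha=1 gives the L1 (lasso) penalty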
