In the regularisation context, a "large" coefficient means that the estimate's magnitude is larger than it would have been if a fixed model specification had been used. It is the impact of obtaining not just the estimates, but also the model specification, from the data.
Consider what a procedure like stepwise regression will do for a given variable. If the estimate of its coefficient is small relative to its standard error, the variable will be dropped from the model. This could be because the true value really is small, or simply because of random error (or a combination of the two). If it's dropped, then we no longer pay it any attention. On the other hand, if the estimate is large relative to its standard error, it will be retained. Notice the imbalance: our final model rejects a variable when its coefficient estimate is small, but keeps it when the estimate is large. Thus we are likely to overestimate the magnitudes of the coefficients we retain.
Put another way, overfitting means overstating the impact of a given set of predictors on the response, and the only way to overstate that impact is for the estimated coefficients to be too big (and conversely, the estimates for the excluded predictors to be too small).
What you should do is incorporate into your experiment a variable selection procedure, eg stepwise regression via `step`. Then repeat your experiment multiple times on different random samples and save the estimates. You should find that all the estimates of the coefficients $\beta_3$ to $\beta_{10}$ are systematically too large compared to not using variable selection. Regularisation procedures aim to fix or mitigate this problem.
Here's an example of what I'm talking about.
repeat.exp <- function(M)
{
    # fixed design: 25 points on [-2, 2], expanded into 10 orthogonal polynomial terms
    x <- seq(-2, 2, len=25)
    px <- poly(x, 10)
    colnames(px) <- paste0("x", 1:10)
    out <- setNames(rep(NA, 11), c("(Intercept)", colnames(px)))
    sapply(1:M, function(...) {
        # the true model is quadratic, so beta_3 to beta_10 are all zero
        y <- x^2 + rnorm(length(x), sd=2)
        d <- data.frame(px, y)
        # stepwise selection starting from y ~ x1, with all 10 terms in scope
        b <- coef(step(lm(y ~ x1, data=d), y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10, trace=0))
        # terms dropped by step stay NA for this replication
        out[names(b)] <- b
        out
    })
}
set.seed(53520)
z <- repeat.exp(M=1000)
# some time later...
rowMeans(abs(z), na.rm=TRUE)
(Intercept)       x1       x2       x3       x4       x5       x6       x7       x8       x9      x10
   1.453553 3.162100 6.533642 3.108974 3.204341 3.131208 3.118276 3.217231 3.293691 3.149520 3.073062
Contrast this with what happens when you don't use variable selection and just fit all the terms blindly. While there is still some error in the estimates of $\beta_3$ to $\beta_{10}$, the average deviation from zero is much smaller.
repeat.exp.base <- function(M)
{
    # same design as above, but with no variable selection
    x <- seq(-2, 2, len=25)
    px <- poly(x, 10)
    colnames(px) <- paste0("x", 1:10)
    out <- setNames(rep(NA, 11), c("(Intercept)", colnames(px)))
    sapply(1:M, function(...) {
        y <- x^2 + rnorm(length(x), sd=2)
        d <- data.frame(px, y)
        # always fit the full 10-term model
        b <- coef(lm(y ~ ., data=d))
        out[names(b)] <- b
        out
    })
}
set.seed(53520)
z2 <- repeat.exp.base(M=1000)
rowMeans(abs(z2))
(Intercept)       x1       x2       x3       x4       x5       x6       x7       x8       x9      x10
   1.453553 1.676066 6.400629 1.589061 1.648441 1.584861 1.611819 1.607720 1.656267 1.583362 1.556168
Also, both L1 and L2 regularisation implicitly assume that all your variables, and hence their coefficients, are in the same units of measurement, ie that a unit change in $\beta_1$ is comparable to a unit change in $\beta_2$. Hence the usual step of standardising your variables before applying either of these techniques.
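For example, here is a minimal sketch of that standardisation step with ridge regression via the glmnet package; the toy data and the choice of scales are made up purely for illustration:

library(glmnet)

set.seed(1)
n  <- 100
x1 <- rnorm(n)              # eg a length measured in metres
x2 <- rnorm(n, sd=1000)     # the same kind of quantity in millimetres
y  <- 2*x1 + 0.002*x2 + rnorm(n)
X  <- cbind(x1, x2)

# glmnet standardises the columns internally by default (standardize=TRUE),
# so the penalty treats both coefficients on a comparable scale
fit <- glmnet(X, y, alpha=0)   # alpha=0 gives the ridge penalty

# the equivalent manual route: scale first, then skip the internal standardisation
fit2 <- glmnet(scale(X), y, alpha=0, standardize=FALSE)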
Best Answer
This is a well-known problem with high-order polynomials, called Runge's phenomenon. Numerically it is associated with ill-conditioning of the Vandermonde matrix, which makes the coefficients very sensitive to small variations in the data and/or roundoff in the computations (i.e. the model is not stably identifiable). See also this answer on the SciComp SE.
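You can see the conditioning directly in R; this small sketch (on the same grid as the experiments above) compares the raw Vandermonde matrix with the orthogonal basis produced by poly:

x <- seq(-2, 2, len=25)
V <- outer(x, 0:10, "^")   # raw Vandermonde matrix: columns 1, x, x^2, ..., x^10
P <- poly(x, 10)           # orthogonal polynomial basis on the same grid

kappa(V)   # enormous: tiny perturbations in y produce large swings in the coefficients
kappa(P)   # about 1: the orthonormal columns make the fit numerically stable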
There are many solutions to this problem, for example Chebyshev approximation, smoothing splines, and Tikhonov regularization. Tikhonov regularization is a generalization of ridge regression, penalizing a norm $\|\Lambda \theta\|$ of the coefficient vector $\theta$, where for smoothing the weight matrix $\Lambda$ is some derivative operator. To penalize oscillations, you might use $\Lambda \theta = p^{\prime\prime}[x]$, where $p[x]$ is the polynomial evaluated at the data.
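For concreteness, here is a minimal sketch of that second-derivative penalty in R; the raw polynomial basis and the value of lambda are illustrative assumptions, not recommendations:

x <- seq(-2, 2, len=25)
y <- x^2 + rnorm(length(x), sd=2)

# design matrix for a raw degree-10 polynomial: column j+1 holds x^j
X <- outer(x, 0:10, "^")

# L evaluates the second derivative p''(x) at the data points,
# using d^2/dx^2 x^j = j*(j-1)*x^(j-2)
L <- outer(x, 0:10, function(x, j) j*(j-1)*x^pmax(j-2, 0))

# Tikhonov solution: minimise ||y - X theta||^2 + lambda*||L theta||^2
lambda <- 1   # illustrative value; in practice chosen by eg cross-validation
theta <- solve(crossprod(X) + lambda * crossprod(L), crossprod(X, y))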
EDIT: The answer by user hxd1011 notes that some of the numerical ill-conditioning problems can be addressed using orthogonal polynomials, which is a good point. I would note, however, that the identifiability issues with high-order polynomials still remain. That is, numerical ill-conditioning is associated with sensitivity to "infinitesimal" perturbations (e.g. roundoff), while "statistical" ill-conditioning concerns sensitivity to "finite" perturbations (e.g. outliers; the inverse problem is ill-posed).
The methods mentioned in my second paragraph are concerned with this outlier sensitivity. You can think of this sensitivity as a violation of the standard linear regression model, which by using an $L_2$ misfit implicitly assumes the errors are Gaussian. Splines and Tikhonov regularization deal with this outlier sensitivity by imposing a smoothness prior on the fit. Chebyshev approximation deals with it by using an $L_{\infty}$ misfit applied over the continuous domain, i.e. not just at the data points. Though Chebyshev polynomials are orthogonal (w.r.t. a certain weighted inner product), I believe that if used with an $L_2$ misfit over the data they would still have outlier sensitivity.
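To make the Chebyshev idea concrete, here is a small sketch comparing degree-10 interpolation of Runge's classic example at equispaced versus Chebyshev nodes. Interpolation at Chebyshev nodes is only near-minimax rather than a true $L_{\infty}$ fit, so treat this as an illustration of the principle:

f <- function(x) 1/(1 + 25*x^2)     # Runge's classic example on [-1, 1]

fit.interp <- function(nodes)       # degree-10 interpolant through 11 nodes
    lm(f(nodes) ~ poly(nodes, 10))

xx <- seq(-1, 1, len=1001)          # fine grid to approximate the sup-norm error
max.err <- function(fit) max(abs(f(xx) - predict(fit, data.frame(nodes=xx))))

equi <- seq(-1, 1, len=11)              # equispaced nodes
cheb <- cos((2*(1:11) - 1)/22 * pi)     # Chebyshev nodes

max.err(fit.interp(equi))   # large: the interpolant oscillates near the ends
max.err(fit.interp(cheb))   # much smaller: close to the minimax error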