Solved – Why is MSE of cross validation higher compared to a linear model

cross-validation, elastic net, glmnet, mse, r

I am tuning the lambda parameter of an elastic net with the glmnet package. The package provides cross-validation for this purpose through the function cv.glmnet. With this function I can print the MSE for each tested lambda, and I can also select a set of variables according to the tuned lambda. These variables can then be put into a linear model. However, when I compare the best MSE from the cross-validation with the MSE of a linear model fitted on the selected variables, the results are quite different.

Question: Why is the MSE of the cross-validation higher than the MSE of the linear model?

Below, you can find some reproducible code, which shows what I have done so far.



library(glmnet)
set.seed(123) # any seed works; fixed here only for reproducibility

# Some example data
N <- 1000
y <- rnorm(N, 5, 10)
x1 <- y + rnorm(N, 2, 10)
x2 <- y + rnorm(N, - 5, 20)
x3 <- y + rnorm(N, 10, 200)
x4 <- rnorm(N, 20, 50)
x5 <- rnorm(N, - 7, 200)
x6 <- rbinom(N, 1, exp(x1) / (exp(x1) + 1))
x7 <- rbinom(N, 1, exp(x2) / (exp(x2) + 1))
x8 <- rbinom(N, 1, exp(x3) / (exp(x3) + 1))
x9 <- rbinom(N, 1, exp(x4) / (exp(x4) + 1))
x10 <- rbinom(N, 1, exp(x5) / (exp(x5) + 1))

data <- data.frame(y, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10)

# Cross validation
x <- as.matrix(data[ , colnames(data) != "y"])
cv <- cv.glmnet(x = x, y = y, alpha = 0.5, family = "gaussian")

# Variable selection: refit at the lambda with the lowest cross-validated MSE
cv_mod <- glmnet(x = x, y = y, alpha = 0.5, family = "gaussian",
                 lambda = cv$lambda.min)
vars_mod <- names(cv_mod$beta[ , 1])[as.numeric(cv_mod$beta[ , 1]) != 0]

# Linear model on the selected variables; drop = FALSE keeps a data frame
# even if only one variable was selected
md_lm <- lm(y ~ ., data = data[ , c("y", vars_mod), drop = FALSE])

# Comparison of MSE
cv$cvm[cv$lambda == cv$lambda.min] # MSE of cross validation with best lambda
mean(md_lm$residuals^2) # MSE of linear model

I ran this code several times with different seeds and most of the time the MSE of the cross validation is higher.

Best Answer

I'd say this is a good example of why validating a model built on one specific sample (which is to say, any dataset) is extremely important: the sample you have the (bad) luck of drawing influences the mean squared error and the model fit. Cross-validation is a way of studying this, and it provides a less sample-specific estimate of the MSE. The reason it is (often) higher is that the original linear regression is probably somewhat overfit: its MSE is computed from the residuals on the same observations the model was fitted to. This leads to smaller errors, and that level of performance is less likely to occur in "other" datasets (or in subsamples of the entire set).
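
To make the overfitting point concrete, here is a minimal sketch in base R. It is not the asker's exact setup: it uses a reduced version of the simulated data (only x1 to x3), fits a plain lm, and compares the in-sample MSE with a 10-fold cross-validated MSE of the same model. The names dat, fold and sq_err, the seed, and the choice of K = 10 are illustrative assumptions.

```r
set.seed(1)

# Simulated data in the spirit of the question (reduced to three predictors)
N <- 1000
y <- rnorm(N, 5, 10)
dat <- data.frame(
  y  = y,
  x1 = y + rnorm(N, 2, 10),
  x2 = y + rnorm(N, -5, 20),
  x3 = y + rnorm(N, 10, 200)
)

# In-sample MSE: the model is scored on the rows it was fitted to
fit_all <- lm(y ~ ., data = dat)
mse_in <- mean(residuals(fit_all)^2)

# 10-fold cross-validated MSE: each row is predicted by a model
# that was fitted without that row's fold
K <- 10
fold <- sample(rep(seq_len(K), length.out = N))
sq_err <- numeric(N)
for (k in seq_len(K)) {
  test <- fold == k
  fit_k <- lm(y ~ ., data = dat[!test, ])
  sq_err[test] <- (dat$y[test] - predict(fit_k, newdata = dat[test, ]))^2
}
mse_cv <- mean(sq_err)

c(in_sample = mse_in, cross_validated = mse_cv)
```

For ordinary least squares, the held-out errors should come out at least as large as the in-sample residuals, so cross_validated should not fall below in_sample here. The gap is likely to be small because N is large relative to the number of predictors.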

Further, the reason the cross-validated MSE is not always higher than the original MSE is the inherent randomness of the procedure: cv.glmnet assigns observations to folds at random, so its MSE estimate varies from run to run.
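
The run-to-run variation can be illustrated with a small base R sketch, again using lm rather than glmnet to keep it dependency-free. The helper cv_mse is hypothetical (it is not part of any package), and the data, seed, and number of repetitions are assumptions chosen for illustration.

```r
# Hypothetical helper: K-fold cross-validated MSE of an lm fit of y on all
# other columns, using a fresh random fold assignment on every call
cv_mse <- function(dat, K = 10) {
  fold <- sample(rep(seq_len(K), length.out = nrow(dat)))
  sq_err <- numeric(nrow(dat))
  for (k in seq_len(K)) {
    test <- fold == k
    fit_k <- lm(y ~ ., data = dat[!test, ])
    sq_err[test] <- (dat$y[test] - predict(fit_k, newdata = dat[test, ]))^2
  }
  mean(sq_err)
}

set.seed(42)
N <- 200
y <- rnorm(N, 5, 10)
dat <- data.frame(y = y,
                  x1 = y + rnorm(N, 2, 10),
                  x2 = y + rnorm(N, -5, 20))

# Same data, 20 different random fold assignments
cv_runs <- replicate(20, cv_mse(dat))
range(cv_runs)
```

The data never change between the 20 runs, so any spread in cv_runs comes from the fold assignment alone. In cv.glmnet, passing a fixed foldid vector removes this source of variation when comparing runs.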