Solved – Why is the MSE from cross-validation higher than that of a linear model?

cross-validation, elastic-net, glmnet, mse, r

I am tuning the lambda parameter of an elastic net with the glmnet package. For this purpose the package provides cross-validation via the function cv.glmnet, which reports the MSE for each tested lambda. I can then select the set of variables with non-zero coefficients at the tuned lambda and put those variables into a linear model. However, when I compare the best MSE from the cross-validation with the MSE of a linear model fitted on the selected variables, the results are quite different.

Question: Why is the MSE from cross-validation higher than the MSE of the linear model?

Below is some reproducible code that shows what I have done so far.

library("glmnet")

set.seed(1234)

# Some example data
N <- 1000
y <- rnorm(N, 5, 10)
x1 <- y + rnorm(N, 2, 10)
x2 <- y + rnorm(N, -5, 20)
x3 <- y + rnorm(N, 10, 200)
x4 <- rnorm(N, 20, 50)
x5 <- rnorm(N, -7, 200)
x6 <- rbinom(N, 1, plogis(x1))  # plogis(x) = exp(x) / (1 + exp(x)), but avoids overflow for large x
x7 <- rbinom(N, 1, plogis(x2))
x8 <- rbinom(N, 1, plogis(x3))
x9 <- rbinom(N, 1, plogis(x4))
x10 <- rbinom(N, 1, plogis(x5))

data <- data.frame(y, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10)

# Cross-validation over the lambda path
X <- as.matrix(data[ , colnames(data) != "y"])
cv <- cv.glmnet(x = X, y = y, alpha = 0.5, family = "gaussian")

# Variable selection: predictors with non-zero coefficients at the tuned lambda
cv_mod <- glmnet(x = X, y = y, alpha = 0.5, family = "gaussian",
                 lambda = cv$lambda.min)
vars_mod <- names(cv_mod$beta[ , 1])[as.numeric(cv_mod$beta[ , 1]) != 0]

# Linear model (keep y in the subset so lm() does not fall back on the global environment)
md_lm <- lm(y ~ ., data = data[ , c("y", vars_mod)])

# Comparison of MSE
cv$cvm[cv$lambda == cv$lambda.min] # MSE of cross validation with best lambda
mean(md_lm$residuals^2) # MSE of linear model

I ran this code several times with different seeds, and most of the time the cross-validation MSE was higher.
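
As a quick check of that claim, here is a minimal sketch (not from the original post) that wraps the comparison in a function and repeats it over several seeds; the helper name compare_mse, the reduced three-predictor setup, and the choice of 20 seeds are my own.

library("glmnet")

# Repeat the CV-vs-in-sample comparison over many seeds.
# For brevity this skips the variable-selection step.
compare_mse <- function(seed, N = 1000) {
  set.seed(seed)
  y  <- rnorm(N, 5, 10)
  X  <- cbind(x1 = y + rnorm(N, 2, 10),
              x2 = y + rnorm(N, -5, 20),
              x3 = rnorm(N, 20, 50))
  cv <- cv.glmnet(x = X, y = y, alpha = 0.5, family = "gaussian")
  md <- lm(y ~ X)
  c(cv_mse = cv$cvm[cv$lambda == cv$lambda.min],
    lm_mse = mean(md$residuals^2))
}

res <- t(sapply(1:20, compare_mse))
mean(res[ , "cv_mse"] > res[ , "lm_mse"])  # fraction of runs where the CV MSE is higher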

Best Answer

I'd say this is a good example of why validating a model built on one specific sampled dataset (== any dataset) is extremely important: the sample you have the (bad) luck of drawing influences the model fit and hence the mean squared error. Cross-validation is a way of studying this and provides a less sample-specific estimate of the MSE. The key point is that the two numbers do not measure the same thing: the linear model's MSE is computed on the same data the model was fitted to (in-sample error), while the cross-validated MSE is computed on held-out folds (out-of-sample error). Because the original linear regression is probably somewhat overfit, its in-sample errors are smaller, while that level of performance is less likely to occur in "other" datasets (or subsamples of the entire set).
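
To make that concrete, here is a minimal sketch (continuing from the question's code, with data, vars_mod, and md_lm already in the workspace) that estimates the linear model's out-of-sample MSE with a hand-rolled 10-fold cross-validation; the fold construction is my own choice. The result should land much closer to cv.glmnet's estimate than the in-sample MSE does.

set.seed(1234)
folds <- sample(rep(1:10, length.out = nrow(data)))  # random fold assignment

cv_lm_mse <- mean(sapply(1:10, function(k) {
  train <- data[folds != k, c("y", vars_mod)]
  test  <- data[folds == k, c("y", vars_mod)]
  fit   <- lm(y ~ ., data = train)
  mean((test$y - predict(fit, newdata = test))^2)  # held-out MSE of fold k
}))

cv_lm_mse                # out-of-sample MSE of the linear model
mean(md_lm$residuals^2)  # in-sample MSE, typically smaller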

Further, the reason the cross-validated MSE is not always higher than the in-sample MSE is the inherent randomness of the procedure: the folds are assigned at random, so the estimate itself varies from run to run.
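
A minimal sketch of that randomness (reusing y and data from the question): calling cv.glmnet repeatedly without resetting the seed reshuffles the folds, so the reported MSE at lambda.min differs on every run.

X <- as.matrix(data[ , colnames(data) != "y"])  # rebuilt here so the snippet stands alone
replicate(5, {
  cv <- cv.glmnet(x = X, y = y, alpha = 0.5, family = "gaussian")
  cv$cvm[cv$lambda == cv$lambda.min]  # CV MSE at lambda.min, different each run
})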