Elastic Net Regression – Why Do Regression and Elastic Net Provide Different Results?

elastic-net, regression

I fit an Elastic Net model on 20-50 variables. Elastic Net selects 10 of them (though in practice I could choose any model along the solution path for the next step).

Next, I take these 10 variables and fit a standard regression with them. The estimated parameters differ, and usually one variable appears to explain most of the variance.

The question is: why? Is it because regression fits the model in one step, getting all parameter estimates at once, whereas Elastic Net estimates them step by step, variable by variable (I do not know the algorithm)?

Does shrinkage in the Elastic Net influence the parameter estimates in such a way that their interpretation becomes, let's try this wording, misleading? If so, I would use Elastic Net for the best forecasting, but take the estimates from the plain regression for interpretation.

Best Answer

There is no free lunch in statistics. Elastic Net reduces overfitting (lowers variance) at the cost of increasing bias. With OLS, you could fit a model with all 50 variables. This OLS model would have very low bias (under certain assumptions, the coefficient estimates may be unbiased) but suffer from high variance (overfitting).
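
To see the tradeoff concretely, recall the standard decomposition of an estimator's mean squared error into squared bias plus variance:

$$\operatorname{MSE}(\hat{\beta}) = \mathbb{E}\big[(\hat{\beta} - \beta)^2\big] = \operatorname{Bias}(\hat{\beta})^2 + \operatorname{Var}(\hat{\beta}).$$

Shrinkage methods like Elastic Net accept some bias in exchange for a larger drop in variance, which can reduce the overall MSE.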

In your case, you mentioned that the OLS coefficients look very different from the Elastic Net coefficients, even though both models use the same 10 variables. The difference may be due to the bias introduced by the fact that Elastic Net does not compute the coefficients by minimizing the residual sum of squares (which is how OLS computes them); it minimizes a "penalized" residual sum of squares instead.
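
Concretely, for a Gaussian response, glmnet (the engine used below) fits the elastic net by minimizing

$$\min_{\beta_0,\,\beta}\;\frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^\top\beta\right)^2 + \lambda\left(\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right),$$

where $\lambda \ge 0$ sets the overall penalty strength and $\alpha \in [0,1]$ mixes the ridge ($\alpha=0$) and lasso ($\alpha=1$) penalties. OLS is the special case $\lambda = 0$; any $\lambda > 0$ shrinks the coefficients toward zero, which is exactly why the two sets of estimates differ.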

Alternatively, the coefficient estimates may differ between OLS and Elastic Net because of sample size. With small sample sizes, p-values from OLS may not be reliable, and the bias from Elastic Net may also be high.

Here's a simulated example using $n=25$. The "true model" contains only two variables, $x1$ and $x2$, with "true coefficients" of 2 and 3. Due to the small sample size and high irreducible error, the p-value for $x1$ is high (>22%), and the coefficient differences between the two models are also large.

set.seed(1983)

### simulate data: only x1 and x2 enter the true model
nobs <- 25
x1 <- rnorm(nobs, 10, 10)
x2 <- rnorm(nobs, 20, 20)
x3 <- rnorm(nobs, 30, 30)
x4 <- rnorm(nobs, 40, 40)
y  <- 100 + 2*x1 + 3*x2 + rnorm(nobs, 0, 100)
df <- data.frame(y=y, x1=x1, x2=x2, x3=x3, x4=x4)

### fit a linear model
lm.mod <- lm(y ~ ., data=df)
summary(lm.mod)

### fit an elastic net model using 5-fold CV
library(caret)
set.seed(1984)
enet.mod <- train(y ~ ., data=df, method="glmnet", tuneLength=5,
                  trControl=trainControl(method="cv", number=5))
coef(enet.mod$finalModel, enet.mod$bestTune$lambda)

### compute diffs between coefs
lm.mod$coefficients - t(coef(enet.mod$finalModel, enet.mod$bestTune$lambda))[1,]
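
As an aside, the two-step workflow from the question (select variables with Elastic Net, then refit OLS on the survivors) can be mimicked as below. This sketch is not part of the original answer; the object names kept and refit.mod are purely illustrative.

### refit OLS on the variables Elastic Net kept (illustrative sketch;
### `kept` and `refit.mod` are hypothetical names)
enet.coefs <- as.matrix(coef(enet.mod$finalModel, enet.mod$bestTune$lambda))
kept <- setdiff(rownames(enet.coefs)[enet.coefs[, 1] != 0], "(Intercept)")
refit.mod <- lm(reformulate(kept, response = "y"), data = df)
summary(refit.mod)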

When the sample size is increased to $n = 1000$, the p-value for $x1$ is very low and the coefficient differences between the two models are small.

set.seed(1983)

### simulate data as before, but with a larger sample
nobs <- 1000
x1 <- rnorm(nobs, 10, 10)
x2 <- rnorm(nobs, 20, 20)
x3 <- rnorm(nobs, 30, 30)
x4 <- rnorm(nobs, 40, 40)
y  <- 100 + 2*x1 + 3*x2 + rnorm(nobs, 0, 100)
df <- data.frame(y=y, x1=x1, x2=x2, x3=x3, x4=x4)

### fit a linear model
lm.mod <- lm(y ~ ., data=df)
summary(lm.mod)

### fit an elastic net model using 5-fold CV
library(caret)
set.seed(1984)
enet.mod <- train(y ~ ., data=df, method="glmnet", tuneLength=5,
                  trControl=trainControl(method="cv", number=5))
coef(enet.mod$finalModel, enet.mod$bestTune$lambda)

### compute diffs between coefs
lm.mod$coefficients - t(coef(enet.mod$finalModel, enet.mod$bestTune$lambda))[1,]