Solved – Why are Lasso and Ridge not giving better results than OLS

elastic net · lasso · least squares · ridge regression

I am trying to find an example in which Lasso and Ridge regression do better than simple OLS.

I am running the Boston example from the MASS package in R. The dependent variable is medv (the median house value).

I'll jump to the end: Ridge, Lasso, and Elastic Net all give results very similar to OLS, and I can't understand why. Moreover, I found an example here:

http://rstudio-pubs-static.s3.amazonaws.com/257535_4218def5fb0945a7a5c09126f385aa59.html

in which the MSE is much smaller than mine! Can you kindly help me find what is wrong with my code?

set.seed(1)
# Note: supplying lambda overrides cv.glmnet's own (automatically generated) lambda sequence
cv.out = cv.glmnet(X.train, y.train, alpha = 0, nfolds = 10, lambda = seq(0, .1, length = 15))
plot(cv.out)
best.lambda = cv.out$lambda.min
best.lambda
log(best.lambda)

ridge.model = glmnet(X.train, y.train, alpha = 0, nlambda = 100)
ridge.pred = predict(ridge.model, s = best.lambda, newx = X.test)
ridge.testmse = mean((y.test - ridge.pred)^2)
predict(ridge.model, type = "coefficients", s = best.lambda)
ridge.testmse

My code for Lasso and elastic net is very similar. Thank you.
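For reference, a sketch of what "very similar" code for the lasso and elastic-net fits would look like under the same setup (only `alpha` changes; `X.train`, `y.train`, `X.test`, and `y.test` are assumed to exist as in the ridge code above, and `alpha = 0.5` for the elastic net is an arbitrary illustrative choice):

```r
library(glmnet)

# Lasso: alpha = 1 (pure L1 penalty)
set.seed(1)
cv.lasso <- cv.glmnet(X.train, y.train, alpha = 1, nfolds = 10)
lasso.pred <- predict(cv.lasso, newx = X.test, s = cv.lasso$lambda.min)
lasso.testmse <- mean((y.test - lasso.pred)^2)

# Elastic net: 0 < alpha < 1 mixes the L1 and L2 penalties
set.seed(1)
cv.enet <- cv.glmnet(X.train, y.train, alpha = 0.5, nfolds = 10)
enet.pred <- predict(cv.enet, newx = X.test, s = cv.enet$lambda.min)
enet.testmse <- mean((y.test - enet.pred)^2)
```

Here cv.glmnet is allowed to build its own lambda sequence rather than being handed `seq(0, .1, length = 15)`, which is generally the safer default.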

Best Answer

I think a quote from the accuracy section in the example you linked can be very informative (emphasis added):

OLS is ideal when the underlying relationship is Linear and we have n>>p. But if n is not much larger than p or p>n (unfeasible for OLS), there can be a lot of variability in the fit which can result in either overfitting and very poor predictive ability.

The Boston dataset has 506 observations of 14 variables, so we are squarely in the n ≫ p case where OLS is said to be ideal. In other words, you don't have the problem that Lasso is designed to solve.
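You can confirm the dimensions directly (assuming MASS is installed):

```r
library(MASS)
dim(Boston)  # 506 14
```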

Furthermore, variable selection (as in Lasso) is useful when some predictors are not significant or when predictors are highly correlated. In the Boston dataset most predictors are highly significant, the correlations between them are moderate, and the variance inflation factors (VIFs) are not extreme. Thus the other circumstances that could make variable selection useful don't hold here either.
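To see the regime where Lasso actually wins, here is a hypothetical simulation (not from the original post): n barely exceeds p and most true coefficients are zero, so OLS overfits while the lasso's shrinkage and selection pay off. The sizes, sparsity pattern, and noise level are all arbitrary choices for illustration:

```r
library(glmnet)

set.seed(42)
n <- 60; p <- 50                      # n barely larger than p
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(2, 5), rep(0, p - 5))   # only 5 of 50 predictors matter
y <- X %*% beta + rnorm(n, sd = 3)

# Large independent test set to estimate out-of-sample MSE
X.new <- matrix(rnorm(1000 * p), 1000, p)
y.new <- X.new %*% beta + rnorm(1000, sd = 3)

# OLS fit with all 50 predictors
ols <- lm(y ~ X)
ols.pred <- cbind(1, X.new) %*% coef(ols)

# Lasso with cross-validated lambda
cv <- cv.glmnet(X, y, alpha = 1)
lasso.pred <- predict(cv, newx = X.new, s = "lambda.min")

mean((y.new - ols.pred)^2)    # OLS test MSE
mean((y.new - lasso.pred)^2)  # lasso test MSE: typically much lower here
```

In this sparse, nearly p = n setting the lasso's test MSE is usually well below the OLS one, which is exactly the variability-in-the-fit problem described in the quote above.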