Solved – comparing OLS, ridge and lasso

lassoleast squaresrridge regression

I am trying to compare OLR, ridge and lasso in my situation. I could calculate SE for OLR and lasso but not for ridge. The following is Prostrate data from lasso2 package.

require(lasso2)
require(MASS)
data(Prostate)

fit.lm = lm(lpsa~.,data=Prostate)
summary(fit.lm)$coefficients
               Estimate  Std. Error    t value     Pr(>|t|)
(Intercept)  0.669399027 1.296381277  0.5163597 6.068984e-01
lcavol       0.587022881 0.087920374  6.6767560 2.110634e-09
lweight      0.454460641 0.170012071  2.6731081 8.956206e-03
age         -0.019637208 0.011172743 -1.7575995 8.229321e-02
lbph         0.107054351 0.058449332  1.8315753 7.039819e-02
svi          0.766155885 0.244309492  3.1360054 2.328823e-03
lcp         -0.105473570 0.091013484 -1.1588785 2.496408e-01
gleason      0.045135964 0.157464467  0.2866422 7.750601e-01
pgg45        0.004525324 0.004421185  1.0235545 3.088513e-01

fit.rd = lm.ridge(lpsa~.,data=Prostate, lamda = 6.0012)
#summary(fit.rd)$coefficients, doesnot provide SE 

lfit = l1ce(lpsa~.,data=Prostate,bound=(1:500)/500)
summary(lfit[[10]])$coefficients
                 Value  Std. Error   Z score  Pr(>|Z|)
(Intercept) 2.43614448 2.130515543 1.1434530 0.2528505
lcavol      0.03129045 0.125288320 0.2497475 0.8027826
lweight     0.00000000 0.274549270 0.0000000 1.0000000
age         0.00000000 0.018287840 0.0000000 1.0000000
lbph        0.00000000 0.095587974 0.0000000 1.0000000
svi         0.00000000 0.390936045 0.0000000 1.0000000
lcp         0.00000000 0.149824868 0.0000000 1.0000000
gleason     0.00000000 0.260274039 0.0000000 1.0000000
pgg45       0.00000000 0.007285054 0.0000000 1.0000000

I have a couple of questions:

(1) How can we calculate Std. Error in case of ridge regression ?

(2) Is it valid to compare Std. Error for deciding which (ridge, lasso or OLS) method to use ? Or there are other methods ? If so how can I get them ?

Best Answer

Since the goal is prediction you can choose some distance measure and then calculate the distance between your predictions $\hat{Y}$ and the true values $Y$, choosing the method with the minimum distance. The most common distance measurement for this setting is the mean squared error: MSE = $\sum_{i=1}^n \left(\hat{Y_i} - Y_i\right)^2$. The hardest part is appropriately choosing your training and testing sets and then performing model selection procedure required by the Lasso and Ridge regularization.

First, split your data into two parts: training $X_{training}$ and testing $X_{testing}$. A common choice is 66% training, 34% testing but the choice of proportion is influenced by the size of your data set. Now forget about the testing set for a while. Train, or fit, the three models using just the training data. Model selection for the regularization parameter $\lambda$ should be chosen via cross-validation since prediction is your goal here. Finally, using the best $\lambda$, perform prediction on the testing data to obtain values for $\hat{Y}$. I haven't used the Lasso2 or MASS implementations of regularized regression but the glmnet package makes the process very straightforward because it provides a cross-validatiion function. Here's some R code to do this on the Prostate data:

require(glmnet)
data(Prostate, package = "lasso2")
## Split into training and test
n_obs = dim(Prostate)[1]
proportion_split = 0.66
train_index = sample(1:n_obs, round(n_obs * proportion_split))
y = Prostate$lpsa
X = as.matrix(Prostate[setdiff(colnames(Prostate), "lpsa")])
Xtr = X[train_index,]
Xte = X[-train_index,]
ytr = y[train_index]
yte = y[-train_index]
## Train models
ols = lm(ytr ~ Xtr)
lasso = cv.glmnet(Xtr, ytr, alpha = 1)
ridge = cv.glmnet(Xtr, ytr, alpha = 0)
## Test models
y_hat_ols = cbind(rep(1, n_obs - length(train_index)), Xte) %*% coef(ols)
y_hat_lasso = predict(lasso, Xte)
y_hat_ridge = predict(ridge, Xte)
## compare
sum((yte - y_hat_ols)^2)
sum((yte - y_hat_lasso)^2)
sum((yte - y_hat_ridge)^2)

Note that the sample() function randomly chooses the rows for training and testing so the MSE will change every time you run it. And since the Prostate data is really just meant as a demonstration dataset, no clear winner is likely to emerge.