Estimating R-squared and statistical significance from penalized regression model

lasso, regression, ridge regression, stepwise regression

I am using the R package penalized to obtain shrunken estimates of coefficients for a dataset where I have lots of predictors and little knowledge of which ones are important. After I've picked the L1 and L2 tuning parameters and I'm satisfied with my coefficients, is there a statistically sound way to summarize the model fit with something like R-squared?

Furthermore, I'm interested in testing the overall significance of the model (i.e., is $R^2 = 0$, or equivalently, are all the $\beta_j = 0$?).
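For concreteness, here is a minimal sketch of the kind of fit I mean. The data `X` and `y` are stand-ins, and the lambda values are placeholders rather than my actual tuned penalties:

```r
library(penalized)

## Stand-in data: 50 observations, 20 predictors (placeholders only)
set.seed(1)
X <- matrix(rnorm(50 * 20), nrow = 50, ncol = 20)
y <- rnorm(50)

## Fit with both an L1 and an L2 penalty; lambda1/lambda2 here are
## placeholders standing in for whatever values tuning produced
fit <- penalized(response = y, penalized = X, lambda1 = 1, lambda2 = 1)
coefficients(fit, "all")
```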

I've read through the answers on a similar question asked here, but they didn't quite answer my question. There's an excellent tutorial on the R package that I'm using here, and its author, Jelle Goeman, includes the following note at the end of the tutorial regarding confidence intervals from penalized regression models:

It is a very natural question to ask for standard errors of regression coefficients or other estimated quantities. In principle such standard errors can easily be calculated, e.g. using the bootstrap.

Still, this package deliberately does not provide them. The reason for this is that standard errors are not very meaningful for strongly biased estimates such as arise from penalized estimation methods. Penalized estimation is a procedure that reduces the variance of estimators by introducing substantial bias. The bias of each estimator is therefore a major component of its mean squared error, whereas its variance may contribute only a small part.

Unfortunately, in most applications of penalized regression it is impossible to obtain a sufficiently precise estimate of the bias. Any bootstrap-based calculations can only give an assessment of the variance of the estimates. Reliable estimates of the bias are only available if reliable unbiased estimates are available, which is typically not the case in situations in which penalized estimates are used.

Reporting a standard error of a penalized estimate therefore tells only part of the story. It can give a mistaken impression of great precision, completely ignoring the inaccuracy caused by the bias. It is certainly a mistake to make confidence statements that are only based on an assessment of the variance of the estimates, such as bootstrap-based confidence intervals do.

Best Answer

My first reaction to Jelle's comments is "bias-schmias". You have to be careful about what you mean by a "large number of predictors". This could be "large" with respect to:

  1. The number of data points ("big p small n")
  2. The amount of time you have to investigate the variables
  3. The computational cost of inverting a giant matrix

My reaction was based on "large" with respect to point 1, because in that case it is usually worth trading bias for the reduction in variance you get. Bias only matters "in the long run", so if you have a small sample, who cares about the long run?

Having said all that, $R^2$ is probably not a particularly good quantity to calculate, especially when you have lots of variables (because that's pretty much all $R^2$ tells you: that you have lots of variables). I would calculate something more like a "prediction error" using cross-validation.

Ideally this "prediction error" should be based on the context of your modeling situation. You basically want to answer the question "How well does my model reproduce the data?". The context of your situation should be able to tell you what "how well" means in the real world. You then need to translate this into some sort of mathematical equation.

However, I have no obvious context to go on from the question, so a "default" would be something like PRESS:

$$PRESS=\sum_{i=1}^{N} (Y_{i}-\hat{Y}_{i,-i})^2$$

where $\hat{Y}_{i,-i}$ is the predicted value of $Y_{i}$ from a model fitted without the $i$th data point (so $Y_i$ does not influence the model parameters). The terms in the summation are also known as "deletion residuals".

If $N$ model re-fits are too computationally expensive (although most programs give you something like this in their standard output), I would suggest grouping the data. Set the amount of time you are prepared to wait to $T$ (preferably not 0 ^_^), and divide it by the time $M$ it takes to fit your model once. This gives a total of $G=\frac{T}{M}$ re-fits, each holding out a group of size $N_{g}=\frac{N\times M}{T}$:

$$PRESS=\sum_{g=1}^{G}\sum_{i=1}^{N_{g}} (Y_{ig}-\hat{Y}_{ig,-g})^2$$

where $\hat{Y}_{ig,-g}$ is the predicted value of the $i$th observation in group $g$ from a model fitted without group $g$.

To get an idea of how important each variable is, you can re-fit an ordinary (unconstrained) regression with the variables in the same order, and then check proportionately how much each estimate has been shrunk towards zero: $\frac{\beta_{LASSO}}{\beta_{UNCONSTRAINED}}$. The lasso and other constrained regressions can be seen as "smooth variable selection": rather than adopting a binary "in-or-out" approach, each estimate is brought closer to zero depending on how important it is for the model (as measured by the errors). Sketches of both the leave-one-out PRESS computation and this shrinkage comparison follow.
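Here is a minimal R sketch of the leave-one-out PRESS computation, assuming the penalized package and the same kind of stand-in `X`, `y`, and placeholder lambda values as in the question; `predict()` is assumed to return numeric fitted values for a linear model:

```r
library(penalized)

## Leave-one-out PRESS: refit the model N times, each time predicting
## the held-out observation from a fit that never saw it
press <- 0
for (i in seq_len(nrow(X))) {
  fit_i <- penalized(response = y[-i], penalized = X[-i, , drop = FALSE],
                     lambda1 = 1, lambda2 = 1, trace = FALSE)
  ## assumed: predict() returns a numeric prediction for linear models
  yhat_i <- as.numeric(predict(fit_i, penalized = X[i, , drop = FALSE]))
  press <- press + (y[i] - yhat_i)^2
}
press
```

For the grouped version, the same loop runs over $G$ held-out groups of indices instead of single observations.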
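And a sketch of the shrinkage comparison just described, taking the ratio of each penalized coefficient to its unpenalized least-squares counterpart (same assumed data and placeholder penalties; this requires $N > p$ so that the ordinary regression is fittable):

```r
## Compare penalized slopes to their ordinary least squares counterparts
fit_pen <- penalized(response = y, penalized = X, lambda1 = 1, lambda2 = 1)
fit_ols <- lm(y ~ X)

beta_pen <- coefficients(fit_pen, "penalized")  # penalized slopes only
beta_ols <- coef(fit_ols)[-1]                   # drop the OLS intercept

## Ratios near 1: barely shrunk (important); near 0: heavily shrunk
round(beta_pen / beta_ols, 3)
```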
