Estimating R-squared and statistical significance from penalized regression model

lasso, regression, ridge regression, stepwise regression

I am using the R package penalized to obtain shrunken estimates of coefficients for a dataset where I have lots of predictors and little knowledge of which ones are important. After I've picked the L1 and L2 tuning parameters and I'm satisfied with my coefficients, is there a statistically sound way to summarize the model fit with something like R-squared?

Furthermore, I'm interested in testing the overall significance of the model (i.e., is $R^2 = 0$, or equivalently, are all the $\beta_j = 0$?).
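For concreteness, here is a minimal sketch of the kind of fit I mean. The data `X` and `y` are stand-ins, and the lambda values are placeholders rather than my actual tuned penalties:

```r
library(penalized)

## Stand-in data: 50 observations, 20 predictors (placeholders only)
set.seed(1)
X <- matrix(rnorm(50 * 20), nrow = 50, ncol = 20)
y <- rnorm(50)

## Fit with both an L1 and an L2 penalty; lambda1/lambda2 here are
## placeholders standing in for whatever values tuning produced
fit <- penalized(response = y, penalized = X, lambda1 = 1, lambda2 = 1)
coefficients(fit, "all")
```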

I've read through the answers on a similar question asked here, but they didn't quite answer my question. There's an excellent tutorial on the R package that I'm using here, and its author, Jelle Goeman, includes the following note at the end of the tutorial regarding confidence intervals from penalized regression models:

It is a very natural question to ask for standard errors of regression coefficients or other estimated quantities. In principle such standard errors can easily be calculated, e.g. using the bootstrap.

Still, this package deliberately does not provide them. The reason for this is that standard errors are not very meaningful for strongly biased estimates such as arise from penalized estimation methods. Penalized estimation is a procedure that reduces the variance of estimators by introducing substantial bias. The bias of each estimator is therefore a major component of its mean squared error, whereas its variance may contribute only a small part.

Unfortunately, in most applications of penalized regression it is impossible to obtain a sufficiently precise estimate of the bias. Any bootstrap-based calculations can only give an assessment of the variance of the estimates. Reliable estimates of the bias are only available if reliable unbiased estimates are available, which is typically not the case in situations in which penalized estimates are used.

Reporting a standard error of a penalized estimate therefore tells only part of the story. It can give a mistaken impression of great precision, completely ignoring the inaccuracy caused by the bias. It is certainly a mistake to make confidence statements that are only based on an assessment of the variance of the estimates, such as bootstrap-based confidence intervals do.

Best Answer

My first reaction to Jelle's comments is "bias-schmias". You have to be careful about what you mean by a "large number of predictors". This could be "large" with respect to:

  1. The number of data points ("big p small n")
  2. The amount of time you have to investigate the variables
  3. The computational cost of inverting a giant matrix

My reaction was based on "large" with respect to point 1, because in that case it is usually worth trading bias for the reduction in variance you get. Bias only matters "in the long run", so if you have a small sample, who cares about the long run?

Having said all that, $R^2$ is probably not a particularly good quantity to calculate, especially when you have lots of variables (because that's pretty much all $R^2$ tells you: that you have lots of variables). I would calculate something more like a "prediction error" using cross-validation.

Ideally this "prediction error" should be based on the context of your modeling situation. You basically want to answer the question "How well does my model reproduce the data?". The context of your situation should be able to tell you what "how well" means in the real world. You then need to translate this into some sort of mathematical equation.

However, I have no obvious context to go on from the question, so a "default" would be something like PRESS:

$$PRESS=\sum_{i=1}^{N} (Y_{i}-\hat{Y}_{i,-i})^2$$

where $\hat{Y}_{i,-i}$ is the predicted value of $Y_{i}$ from a model fitted without the $i$th data point (so $Y_i$ does not influence the model parameters). The terms in the summation are also known as "deletion residuals".

If $N$ model re-fits are too computationally expensive (although most programs give you something like this in their standard output), I would suggest grouping the data. Set the amount of time you are prepared to wait to $T$ (preferably not 0 ^_^), and divide it by the time $M$ it takes to fit your model once. This gives a total of $G=\frac{T}{M}$ re-fits, each holding out a group of size $N_{g}=\frac{N\times M}{T}$:

$$PRESS=\sum_{g=1}^{G}\sum_{i=1}^{N_{g}} (Y_{ig}-\hat{Y}_{ig,-g})^2$$

where $\hat{Y}_{ig,-g}$ is the predicted value of the $i$th observation in group $g$ from a model fitted without group $g$.

To get an idea of how important each variable is, you can re-fit an ordinary (unconstrained) regression with the variables in the same order, and then check proportionately how much each estimate has been shrunk towards zero: $\frac{\beta_{LASSO}}{\beta_{UNCONSTRAINED}}$. The lasso and other constrained regressions can be seen as "smooth variable selection": rather than adopting a binary "in-or-out" approach, each estimate is brought closer to zero depending on how important it is for the model (as measured by the errors). Sketches of both the leave-one-out PRESS computation and this shrinkage comparison follow.
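Here is a minimal R sketch of the leave-one-out PRESS computation, assuming the penalized package and the same kind of stand-in `X`, `y`, and placeholder lambda values as in the question; `predict()` is assumed to return numeric fitted values for a linear model:

```r
library(penalized)

## Leave-one-out PRESS: refit the model N times, each time predicting
## the held-out observation from a fit that never saw it
press <- 0
for (i in seq_len(nrow(X))) {
  fit_i <- penalized(response = y[-i], penalized = X[-i, , drop = FALSE],
                     lambda1 = 1, lambda2 = 1, trace = FALSE)
  ## assumed: predict() returns a numeric prediction for linear models
  yhat_i <- as.numeric(predict(fit_i, penalized = X[i, , drop = FALSE]))
  press <- press + (y[i] - yhat_i)^2
}
press
```

For the grouped version, the same loop runs over $G$ held-out groups of indices instead of single observations.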
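And a sketch of the shrinkage comparison just described, taking the ratio of each penalized coefficient to its unpenalized least-squares counterpart (same assumed data and placeholder penalties; this requires $N > p$ so that the ordinary regression is fittable):

```r
## Compare penalized slopes to their ordinary least squares counterparts
fit_pen <- penalized(response = y, penalized = X, lambda1 = 1, lambda2 = 1)
fit_ols <- lm(y ~ X)

beta_pen <- coefficients(fit_pen, "penalized")  # penalized slopes only
beta_ols <- coef(fit_ols)[-1]                   # drop the OLS intercept

## Ratios near 1: barely shrunk (important); near 0: heavily shrunk
round(beta_pen / beta_ols, 3)
```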
