My first reaction to Jelle's comments is "bias-schmias". You have to be careful about what you mean by a "large number of predictors". This could be "large" with respect to:
- The number of data points ("big p small n")
- The amount of time you have to investigate the variables
- The computational cost of inverting a giant matrix
My reaction was based on "large" in the sense of point 1, because in that case the trade-off of bias for a reduction in variance is usually worth it. Bias only matters "in the long run". So if you have a small sample, who cares about "the long run"?
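To make that concrete, here is a small simulation sketch. The use of glmnet for the ridge fit and the fixed penalty $\lambda = 1$ are my own illustrative choices, not anything from the discussion above: with $p$ close to $n$, the biased ridge estimates typically beat the unbiased OLS estimates on total squared error.

```r
## Bias-variance trade-off when p is large relative to n: compare the total
## squared estimation error of OLS (unbiased, high variance) against ridge
## (biased, low variance). All constants here are arbitrary illustrations.
library(glmnet)

set.seed(1)
n <- 50; p <- 40
beta <- rep(0.5, p)

sq_err <- replicate(200, {
  X <- matrix(rnorm(n * p), n, p)
  y <- drop(X %*% beta + rnorm(n))
  b_ols   <- coef(lm(y ~ X))[-1]
  b_ridge <- as.numeric(coef(glmnet(X, y, alpha = 0, lambda = 1)))[-1]
  c(ols = sum((b_ols - beta)^2), ridge = sum((b_ridge - beta)^2))
})

rowMeans(sq_err)  ## ridge usually wins on total squared error here
```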
Having said all that, $R^2$ is probably not a particularly good quantity to calculate, especially when you have lots of variables (because that is pretty much all $R^2$ tells you: that you have lots of variables). I would calculate something more like a "prediction error" using cross-validation.
Ideally this "prediction error" should be based on the context of your modeling situation. You basically want to answer the question "How well does my model reproduce the data?". The context of your situation should be able to tell you what "how well" means in the real world. You then need to translate this into some sort of mathematical equation.
However, the question gives me no obvious context to go on. So a "default" would be something like PRESS:
$$PRESS=\sum_{i=1}^{N} (Y_{i}-\hat{Y}_{i,-i})^2$$
Where $\hat{Y}_{i,-i}$ is the predicted value of $Y_{i}$ from a model fitted without the $i$th data point (so $Y_i$ does not influence the model parameters). The terms in the summation are also known as "deletion residuals". If doing $N$ model fits is too computationally expensive (although most programs give you something like this in their standard output), then I would suggest grouping the data: set the amount of time you are prepared to wait, $T$ (preferably not 0 ^_^), and divide it by the time $M$ it takes to fit your model once. This gives a total of $G=\frac{T}{M}$ re-fits, each leaving out a group of size $N_{g}=\frac{N}{G}=\frac{N\times M}{T}$.
$$PRESS=\sum_{g=1}^{G}\sum_{i=1}^{N_{g}} (Y_{ig}-\hat{Y}_{ig,-g})^2$$
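As a sketch of both versions (plain R, written for `lm` fits only to keep it short; the function names are mine): the leave-one-out PRESS uses the standard hat-matrix identity $e_{i,-i} = e_i/(1-h_{ii})$ instead of $N$ literal re-fits, while the grouped version actually re-fits $G$ times, once per held-out group.

```r
## Leave-one-out PRESS for a linear model, via the identity
## (Y_i - Yhat_{i,-i}) = e_i / (1 - h_ii), so no re-fitting is needed.
press_loo <- function(fit) {
  sum((residuals(fit) / (1 - hatvalues(fit)))^2)
}

## Grouped PRESS: G re-fits, each leaving out one randomly assigned group.
press_grouped <- function(formula, data, G = 10) {
  y <- model.response(model.frame(formula, data))
  folds <- sample(rep(seq_len(G), length.out = nrow(data)))
  sum(sapply(seq_len(G), function(g) {
    fit_g <- lm(formula, data = data[folds != g, ])
    sum((y[folds == g] - predict(fit_g, newdata = data[folds == g, ]))^2)
  }))
}

fit <- lm(mpg ~ ., data = mtcars)
press_loo(fit)
press_grouped(mpg ~ ., mtcars, G = 8)
```

For most penalized fits there is no exact hat-matrix shortcut, so the grouped, re-fitting version is the one to adapt.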
One way to get an idea of how important each variable is: re-fit an ordinary (unconstrained) regression with the variables in the same order, then check proportionately how much each estimate has been shrunk towards zero, $\frac{\beta_{LASSO}}{\beta_{UNCONSTRAINED}}$. The lasso, and constrained regression more generally, can be seen as "smooth variable selection": rather than adopting a binary "in-or-out" approach, each estimate is pulled towards zero by an amount that depends on how important it is to the model (as measured by the errors).
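A sketch of that comparison (using `glmnet` for the lasso and `mtcars` as a stand-in dataset, neither of which is fixed by anything above): a ratio near 1 means little shrinkage, a ratio of 0 means the variable was dropped entirely.

```r
## Proportional shrinkage of each lasso estimate relative to the
## unconstrained OLS fit on the same variables.
library(glmnet)

X <- model.matrix(mpg ~ . - 1, data = mtcars)
y <- mtcars$mpg

b_ols   <- coef(lm(y ~ X))[-1]
b_lasso <- as.numeric(coef(cv.glmnet(X, y), s = "lambda.min"))[-1]

## Near 1: barely shrunk (important); near 0: heavily shrunk or dropped.
setNames(round(b_lasso / b_ols, 3), colnames(X))
```

(The ratio can behave oddly for variables whose unconstrained estimate is already near zero, so read it as a rough importance measure, not a precise one.)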
The fact that `firth=FALSE` doesn't give similar results to `glm` is puzzling to me -- hopefully someone else can answer. As far as `pl` goes, though, you're almost always better off with profile confidence intervals. Wald confidence intervals assume that the (implicit) log-likelihood surface is locally quadratic, which is often a bad approximation. Other than being more computationally intensive, profile confidence intervals are always (? I would welcome counterexamples ?) more accurate. The "improved" p-values you get from the Wald estimates are likely overoptimistic.
Generate data:
## 22 observations with X="yes" (all of which have Y="no") and
## 363 with X="no" (7 of which have Y="yes"):
dd <- data.frame(X = rep(c("yes", "no"), c(22, 363)),
                 Y = rep(c("no", "yes", "no"), c(22, 7, 356)))
with(dd, table(X, Y))
Replicate:
m_glm <- glm(Y ~ X, family = binomial, data = dd)
library("logistf")
m_fp <- logistf(Y ~ X, data = dd, pl = TRUE,  firth = TRUE)   ## Firth, profile CIs
m_mp <- logistf(Y ~ X, data = dd, pl = TRUE,  firth = FALSE)  ## ML, profile CIs
m_fw <- logistf(Y ~ X, data = dd, pl = FALSE, firth = TRUE)   ## Firth, Wald CIs
m_mw <- logistf(Y ~ X, data = dd, pl = FALSE, firth = FALSE)  ## ML, Wald CIs
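To put the four fits side by side, the coefficient and its interval can be pulled out of each object; the `ci.lower`/`ci.upper` components are my reading of the logistf object structure (see `?logistf`), so double-check against `summary()`:

```r
## Collect the X coefficient and its CI (profile or Wald, depending on pl=)
## from each logistf fit; element [2] is the Xyes coefficient.
fits <- list(firth_profile = m_fp, ml_profile = m_mp,
             firth_wald = m_fw, ml_wald = m_mw)
t(sapply(fits, function(m)
  c(est = coef(m)[2], lower = m$ci.lower[2], upper = m$ci.upper[2])))
```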
Compare Wald CIs (`confint.default`) with profile CIs for `glm` (in this case the profile intervals are actually narrower):
confint.default(m_glm) ## {-2740, 2710}
confint(m_glm) ## {NA, 118}
Comparing with the `glm2` package (just to make sure that `glm` isn't doing something wonky):
library("glm2")
glm2(Y ~ X, family = binomial, data = dd)
## similar results to glm(...)
Best Answer
A little late to the party, but in case anyone stumbles across this question in the future...
Best answer: have a look at section 6 of the vignette for the penalized R package ("L1 and L2 Penalized Regression Models" Jelle Goeman, Rosa Meijer, Nimisha Chaturvedi, Package version 0.9-47), https://cran.r-project.org/web/packages/penalized/vignettes/penalized.pdf.
We don't get CIs or standard errors on the coefficients when we use penalized regression because they aren't meaningful. Ordinary linear regression, or logistic regression, or whatever, provides unbiased (or at least asymptotically unbiased) estimates of the coefficients. A CI around such a point estimate can then give some indication of how point estimates will be distributed around the true value of the coefficient. Penalized regression, though, exploits the bias-variance tradeoff to give us coefficient estimates with lower variance, but with bias. Reporting a CI around a biased estimate gives an unrealistically optimistic indication of how close the true value of the coefficient may be to the point estimate.
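A quick way to see the problem in action (a simulation sketch; glmnet and every constant below are my illustrative choices, not anything from the vignette): the lasso estimate of a truly nonzero coefficient is systematically pulled towards zero, so an interval centered on that estimate is centered in the wrong place.

```r
## The lasso estimate of a nonzero coefficient is biased towards zero,
## so a CI centered on it would be centered on the wrong value.
library(glmnet)

set.seed(2)
n <- 100; p <- 20
beta <- c(1, rep(0, p - 1))  ## a single true signal

est <- replicate(100, {
  X <- matrix(rnorm(n * p), n, p)
  y <- drop(X %*% beta + rnorm(n))
  as.numeric(coef(cv.glmnet(X, y), s = "lambda.1se"))[2]  ## estimate of beta_1
})

mean(est)  ## typically clearly below the true value of 1
```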
("Penalized Regression, Standard Errors, and Bayesian Lassos" Minjung Kyung, Jeff Gill, Malay Ghosh, and George Casella, Bayesian Analysis (2010) pages 369 - 411, discusses non-parametric (bootstrapped) estimates of p values for penalized regression and, if I understand correctly, they are not impressed. https://doi.org/10.1214/10-BA607 (Wayback machine link))