Does it make sense to calculate the AUC if I do not want to use my multiple logistic regression model for predictions? I only want to calculate some odds ratios, test whether the variables in my model have a significant influence, and adjust for some covariates.
Solved – Does AUC for multiple logistic regression make sense if prediction is not the goal
Tags: auc, logistic, multiple-regression
Related Solutions
There are several reasons, none of which is specific to logistic regression; they can occur in any regression.
- Loss of degrees of freedom: when you estimate more parameters from the same dataset, you are asking more of it. That costs precision, which means larger standard errors, hence lower t-statistics, hence higher p-values.
- Correlation of regressors: Your regressors may be related to each other, effectively measuring something similar. Say your logit model explains labor market status (working/not working) as a function of experience and age. Individually, both variables are positively related to working, as more experienced/older employees (ruling out very old employees for the sake of the argument) find it easier to find jobs than recent graduates. Now, obviously, the two variables are strongly related, as you need to be older to have more experience. Hence, the two variables basically "compete" to explain the status, which may, especially in small samples, result in both variables "losing": neither effect may be strong enough, or sufficiently precisely estimated when controlling for the other, to yield significant estimates. Essentially, you are asking: what is the effect of another year of experience when holding age constant? There may be very few to no employees in your dataset who can answer that question, so the effect will be imprecisely estimated, leading to large p-values.
- Misspecified models: The underlying theory for t-statistics/p-values requires that you estimate a correctly specified model. Now, if you regress on only one predictor, chances are quite high that that univariate model suffers from omitted variable bias. Hence, all bets are off as to how p-values behave. Basically, you must be careful about trusting them when your model is not correctly specified.
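The "competing regressors" point above can be quantified with the variance inflation factor: if a predictor has an $R^2$ of $r^2$ when regressed on the other predictors, the sampling variance of its coefficient is inflated by $1/(1 - r^2)$ relative to the uncorrelated case (the standard error by the square root of that). A minimal sketch, with illustrative correlation values for the age/experience example:

```python
# Variance inflation factor (VIF): VIF = 1 / (1 - R^2), where R^2 comes
# from regressing one predictor on the others. With exactly two
# predictors correlated at r, R^2 = r^2.

def vif(r_squared):
    """Factor by which a coefficient's sampling variance is inflated."""
    return 1.0 / (1.0 - r_squared)

# Illustrative correlations between age and experience (made-up values):
for r in (0.0, 0.5, 0.9, 0.99):
    print(f"r = {r:5}: VIF = {vif(r ** 2):8.2f}")
```

At r = 0.9 the coefficient variance is inflated more than fivefold, which is exactly the "imprecisely estimated, large p-values" effect described above.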
There are at least three regularization strategies to address this multi-collinearity/separation problem.
1) Build a Bayesian regression model with a prior distribution over the regression coefficients that shrinks estimates toward zero, while leaving enough prior probability for the posterior distribution to move toward a signal in the data if it is strong enough. There are several types of priors for this, including but not limited to the Laplace, spike-and-slab, and horseshoe priors. Gelman et al. have a nice paper describing a default prior distribution on coefficients in logistic regression, which pairs well with the `bayesglm` function they developed in the `arm` package in R; it lets you easily build and summarize logistic and other generalized linear models. You can read their paper, "A weakly informative default prior distribution for logistic and other regression models," for the details.
2) Penalized regression with the L1 norm (LASSO regression), the L2 norm (ridge regression), or some combination of the two (the elastic net). Tibshirani, Hastie, and colleagues have developed an R package called `glmnet`, which implements elastic net regression (and therefore L1 and L2 regression, since they are special cases of the elastic net). This package includes the logit model. There is an excellent vignette for the package, at the end of which you can find useful references on regularization in general as well as on the ridge/LASSO/elastic-net framework. If you want to watch a video version of the vignette, and learn a lot of other stuff too, I recommend taking their Stanford online course as well.
3) Another way to deal with multi-collinearity problems in logistic and other generalized linear models is boosted regression. In boosted regression models, you iteratively aggregate the inferences from many simple models called "base learners". By aggregating the estimates of many simple models, you avoid the curse of dimensionality, and you can also compute variable-importance measures. If you set up your base learners properly, multi-collinearity is no longer an issue. There is a great R package called `mboost`, which implements boosted generalized linear models and multilevel generalized linear models. Another reason `mboost` is great is the variety of base learners available, including non-parametric smoothing splines and random fields. Amazing stuff. Even better is a related package called `gamboostLSS`, which lets you build boosted regression models for each parameter of your likelihood, not just the mean or some other location parameter.
In your situation, I'd say the best among these methods is either the Gelman et al. recipe or the elastic net option. Of these, I'd prefer the Gelman et al. recipe, because it yields not only point estimates but full posterior distributions of the coefficients as well.
Side note: The beauty of the elastic net and boosting methods, and of fully Bayesian inference under some priors, is that by regularizing your model you can also build models with lots of features, even models with more features than observations. The regularization procedure in some sense selects the features that matter most while avoiding, or at least mitigating, the curse of dimensionality.
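To make the penalized-regression idea in option 2 concrete, here is a minimal sketch of ridge (L2-penalized) logistic regression fit by gradient descent in pure Python. This is not what `glmnet` does internally (it uses coordinate descent and handles the full elastic-net path); it is only meant to show how the penalty stabilizes the fit. The simulated near-collinear predictors are an assumption for illustration:

```python
import math
import random

def ridge_logistic(X, y, lam=1.0, lr=0.1, iters=2000):
    """Fit logistic regression with an L2 (ridge) penalty on the slopes,
    the L2 special case of the elastic net, via plain gradient descent.
    The intercept is left unpenalized, as is conventional."""
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)                    # beta[0] is the intercept
    for _ in range(iters):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            z = beta[0] + sum(b * x for b, x in zip(beta[1:], xi))
            mu = 1.0 / (1.0 + math.exp(-z))   # predicted probability
            err = mu - yi
            grad[0] += err
            for j in range(p):
                grad[j + 1] += err * xi[j]
        for j in range(1, p + 1):
            grad[j] += lam * beta[j]          # penalty shrinks slopes to 0
        beta = [b - lr * g / n for b, g in zip(beta, grad)]
    return beta

# Two nearly collinear predictors: unpenalized maximum likelihood would
# split their joint effect unstably, but the ridge penalty keeps both
# coefficients finite and shares the effect between them.
random.seed(0)
X, y = [], []
for _ in range(200):
    x1 = random.gauss(0, 1)
    x2 = x1 + random.gauss(0, 0.05)           # x2 ~ x1: near-collinear
    p_true = 1.0 / (1.0 + math.exp(-(x1 + x2)))
    X.append([x1, x2])
    y.append(1 if random.random() < p_true else 0)

beta = ridge_logistic(X, y, lam=5.0)
print([round(b, 3) for b in beta])
```

The same competing-regressors situation that produced large p-values above now produces two shrunken, similar, stable coefficients: the penalty resolves the "competition" by splitting the shared effect.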
Best Answer
The AUC doesn't actually tell you how well your model will predict out of sample. If you want that, you need to cross-validate and get the mean out-of-sample AUC.
More basically, the AUC tells you how well ordered your predicted probabilities are. That is, if you compare the predicted probabilities for two units, $i$ and $i'$, and $p(y_i=1) > p(y_{i'}=1)$, then you would want to find $y_i = 1$ and $y_{i'} = 0$. The AUC is the proportion of such pairs for which that is the case.
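That ordering definition can be computed directly as a pairwise concordance count, which is the usual "by hand" route to the c-statistic. A minimal sketch; the scores and labels are made-up numbers for illustration:

```python
def auc_by_hand(scores, labels):
    """c-statistic: among all (positive, negative) pairs, the proportion
    in which the positive case received the higher predicted probability.
    Tied scores count as half a concordant pair."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    concordant = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return concordant / (len(pos) * len(neg))

# Illustrative predicted probabilities and observed outcomes:
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   0,    1,   0,   0,   1,   0]
print(auc_by_hand(scores, labels))  # → 0.72
```

Here 18 of the 25 positive/negative pairs are ordered correctly, giving an AUC of 0.72; a perfectly ordered model would score 1.0 and a coin flip 0.5.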
So computing the AUC for your model in-sample can provide one kind of information about the model's performance / goodness of fit. You certainly don't have to want to know that when you use the model to compute odds ratios, but it doesn't hurt as one more piece of information about whether your model is decent.
To get a fuller sense of how the AUC works, it may help you to read this excellent CV thread: How to calculate Area Under the Curve (AUC), or the c-statistic, by hand.