I would suggest that you use Frank Harrell's excellent rms package. It contains many useful functions to validate and calibrate your model. As far as I know, you cannot assess predictive performance solely from the coefficients. Furthermore, I would suggest that you use the bootstrap to validate the model. The AUC or concordance index (c-index) is a useful measure of predictive performance. A c-index of $0.8$ is quite high, but, as in many predictive models, the fit of your model is likely overoptimistic (overfitting). This overoptimism can be assessed using the bootstrap. Let me give an example:
#-----------------------------------------------------------------------------
# Load packages
#-----------------------------------------------------------------------------
library(rms)
#-----------------------------------------------------------------------------
# Load data
#-----------------------------------------------------------------------------
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
mydata$rank <- factor(mydata$rank)
#-----------------------------------------------------------------------------
# Fit logistic regression model
#-----------------------------------------------------------------------------
mylogit <- lrm(admit ~ gre + gpa + rank, x=TRUE, y=TRUE, data = mydata)
mylogit
                      Model Likelihood     Discrimination    Rank Discrim.
                         Ratio Test           Indexes           Indexes
Obs            400    LR chi2     41.46    R2       0.138    C       0.693
 0             273    d.f.            5    g        0.838    Dxy     0.386
 1             127    Pr(> chi2) <0.0001   gr       2.311    gamma   0.387
max |deriv| 2e-06                          gp       0.167    tau-a   0.168
                                           Brier    0.195

            Coef    S.E.   Wald Z Pr(>|Z|)
Intercept  -3.9900 1.1400  -3.50  0.0005
gre         0.0023 0.0011   2.07  0.0385
gpa         0.8040 0.3318   2.42  0.0154
rank=2     -0.6754 0.3165  -2.13  0.0328
rank=3     -1.3402 0.3453  -3.88  0.0001
rank=4     -1.5515 0.4178  -3.71  0.0002
On the bottom you see the usual regression coefficients with their corresponding $p$-values. On the top right, you see several discrimination indices. The C
denotes the c-index (AUC): a c-index of $0.5$ denotes random prediction, whereas a c-index of $1$ denotes perfect prediction. Dxy
is Somers' $D_{xy}$, the rank correlation between the predicted probabilities and the observed responses. $D_{xy}$ has a simple relationship with the c-index: $D_{xy}=2(c-0.5)$. A $D_{xy}$ of $0$ occurs when the model's predictions are random, and when $D_{xy}=1$ the model discriminates perfectly. In this case, the c-index is $0.693$, which is somewhat better than chance but not good enough for predicting the outcomes of individuals; a c-index of $>0.8$ is usually required for that.
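If you want to verify this relationship on the fitted model itself, you can compute the c-index and $D_{xy}$ directly from the predicted probabilities; somers2() comes from the Hmisc package, which is loaded together with rms:
pred.prob <- predict(mylogit, type="fitted")   # predicted probabilities of admit=1
somers2(pred.prob, mydata$admit)               # returns C and Dxy; should match the lrm output above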
As noted above, the model is likely overoptimistic. We now use the bootstrap to quantify the optimism:
#-----------------------------------------------------------------------------
# Validate model using bootstrap
#-----------------------------------------------------------------------------
my.valid <- validate(mylogit, method="boot", B=1000)
my.valid
          index.orig training    test optimism index.corrected    n
Dxy           0.3857   0.4033  0.3674   0.0358          0.3498 1000
R2            0.1380   0.1554  0.1264   0.0290          0.1090 1000
Intercept     0.0000   0.0000 -0.0629   0.0629         -0.0629 1000
Slope         1.0000   1.0000  0.9034   0.0966          0.9034 1000
Emax          0.0000   0.0000  0.0334   0.0334          0.0334 1000
D             0.1011   0.1154  0.0920   0.0234          0.0778 1000
U            -0.0050  -0.0050  0.0015  -0.0065          0.0015 1000
Q             0.1061   0.1204  0.0905   0.0299          0.0762 1000
B             0.1947   0.1915  0.1977  -0.0062          0.2009 1000
g             0.8378   0.9011  0.7963   0.1048          0.7331 1000
gp            0.1673   0.1757  0.1596   0.0161          0.1511 1000
Let's concentrate on $D_{xy}$, which is at the top. The first column gives the original index, which was $0.3857$. The column called optimism
gives the estimated amount by which the apparent performance overstates the true performance. The column index.corrected
is the original estimate minus the optimism. In this case, the bias-corrected $D_{xy}$ is a bit smaller than the original. The bias-corrected c-index (AUC) is $c=\frac{1+D_{xy}}{2}=\frac{1+0.3498}{2}=0.6749$.
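You don't have to do this arithmetic by hand; since the object returned by validate() is essentially a matrix, something like the following should give the bias-corrected c-index directly:
dxy.corrected <- my.valid["Dxy", "index.corrected"]
(1 + dxy.corrected) / 2   # approximately 0.675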
We can also calculate a calibration curve using resampling:
#-----------------------------------------------------------------------------
# Calibration curve using bootstrap
#-----------------------------------------------------------------------------
my.calib <- calibrate(mylogit, method="boot", B=1000)
par(bg="white", las=1)
plot(my.calib, las=1)
n=400 Mean absolute error=0.016 Mean squared error=0.00034
0.9 Quantile of absolute error=0.025
The plot provides some evidence that our model is overfitting: it underestimates low probabilities and overestimates high probabilities. There is also systematic overestimation around $0.3$.
Predictive model building is a big topic and I suggest reading Frank Harrell's course notes.
Your approach gives up one of the advantages of multiple regression: accounting for the combined influences of all the predictors at once. It's thus effectively throwing away information, which is seldom useful.
One way to deal with too many predictors is to use subject-matter knowledge or the observed relations among the predictors (not considering the outcomes) to combine some related predictors into a combined individual predictor.
Another way was suggested in the comment by @user777: use LASSO, ridge regression, or elastic net, which impose a penalty on regression coefficients that guards against overfitting. (The rule of thumb of 10 events per variable was based on non-penalized analyses.) These methods provide principled ways to build models even if you have more predictor variables than cases.
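As a rough sketch of how such a penalized fit might look (this is not from the question; it assumes the glmnet package and that your predictors sit in a hypothetical data frame pred.df with a binary outcome y):
library(glmnet)

x <- model.matrix(~ ., data=pred.df)[, -1]   # numeric predictor matrix, intercept column dropped

# alpha=0 is ridge, alpha=1 is the LASSO, values in between give the elastic net;
# cv.glmnet chooses the penalty strength (lambda) by cross-validation
cv.fit <- cv.glmnet(x, y, family="binomial", alpha=0)

coef(cv.fit, s="lambda.min")                                      # shrunken coefficients
pred <- predict(cv.fit, newx=x, s="lambda.min", type="response")  # predicted probabilities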
Note that the best-subset suggestion in another comment doesn't get around the overfitting issue, and the variables selected would be highly dependent on your particular data sample. Try repeating best-subset analysis on multiple bootstrap samples to see the problems.
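To see that instability for yourself, you could rerun a selection procedure on bootstrap resamples and tabulate how often each predictor survives. A minimal sketch, using backward stepwise selection as a stand-in for best subsets and the same hypothetical pred.df and y as above:
set.seed(1)
dat  <- data.frame(y=y, pred.df)
B    <- 200
kept <- matrix(0, nrow=B, ncol=ncol(pred.df), dimnames=list(NULL, colnames(pred.df)))

for (b in seq_len(B)) {
  boot.dat <- dat[sample(nrow(dat), replace=TRUE), ]
  sel      <- step(glm(y ~ ., data=boot.dat, family=binomial), direction="backward", trace=0)
  kept[b, attr(terms(sel), "term.labels")] <- 1   # mark the predictors retained in this resample
}

colMeans(kept)   # proportion of resamples in which each predictor was selected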
If your interest is prediction and you only have 8 predictor variables, ridge regression will probably work well on your data.
Added in response to comments:
The paper linked from a comment, on assessing multivariable logistic regression models, rightly emphasizes the proper selection of predictor variables as a major criterion. Using subject-matter knowledge for selecting or combining variables should be a top priority. You might, for example, be able to combine categories in your categorical variable, or omit other predictors that have been shown in related studies not to be closely related to outcome.
That paper's sole focus on 10 events per variable to prevent overfitting, however, is inadequate in two ways. First, as a rule of thumb, 15 events per variable may be a better choice than 10. Second, not noted in that paper, methods like LASSO and ridge regression provide another well-established way to prevent overfitting, by shrinking the magnitudes of the coefficients to less than those that would be provided by standard logistic regression. See for example An Introduction to Statistical Learning for background on these and other approaches.
The idea to break your analysis into 2 parts (5-level categorical variable and then all other variables separately) doesn't really accomplish much. What you think you might gain from having about 15-20 events per variable in each of the 2 separate analyses would be lost by your need to correct for multiple hypothesis testing and your inability to take into account the levels of the categorical variable when evaluating the other variables (and vice-versa). And you would still need to evaluate overfitting.
With respect to "investigating associations" versus predictive modeling, consider what Frank Harrell has to say in Regression Modeling Strategies, second edition, page 3:
Thus when one develops a reasonable multivariable predictive model, hypothesis testing and estimation of effects are byproducts of the fitted model. So predictive modeling is often desirable even when prediction is not the main goal.
Harrell's rms
package in R provides the tools you need to build, calibrate, and validate logistic models. His book linked above and associated class notes provide examples of ways to deal with too many variables. Try them out on your dataset.
Best Answer
Although the initial symptom was a type of problem seen in logistic regression, the underlying issue is that there are many predictor variables and only a comparatively small number of cases. That underlying issue needs to be addressed.
So first, if the outcome variable is binary you should not abandon logistic regression. The underlying issue will not go away by trying another type of analysis, even if it appears in a different form. For example, an ordinary least-squares model would tend to be highly over-fit (even if it were appropriate for binary outcomes) and thus highly unreliable. You said: "when I run OLS regression on the data I get results that make more sense (or at least appear to)" (emphasis added). Yes, the result of a regression on your data set might fit quite well, but in this situation your model would probably not apply beyond your initial data set.
Second, you can consider reducing the number of predictor variables based on prior knowledge of the subject matter. Likert items are often designed to be multiple questions aimed at a single opinion or personality trait, which are then combined to form a Likert scale as a better gauge of the opinion or trait. If prior knowledge of the subject matter allows combination of the 100 Likert items into 5 or 10 Likert scales as predictors, then the problem with the predictor/case ratio would be greatly diminished. The combination of multiple items into a smaller number of scales might also diminish problems resulting from a potentially incorrect assumption of equally-spaced influences of each of the 4 steps along each 5-point Likert item.
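For example, if theory says that certain blocks of items measure the same trait, forming scale scores can be as simple as averaging the items in each block. A hypothetical sketch (the data frame survey, its outcome y, and the item groupings are made up for illustration):
items <- paste0("q", sprintf("%02d", 1:20))   # hypothetical Likert item columns q01...q20

# average the items belonging to each (theory-based) scale
survey$satisfaction <- rowMeans(survey[, items[1:10]],  na.rm=TRUE)
survey$trust        <- rowMeans(survey[, items[11:20]], na.rm=TRUE)

# fit the model on the few scale scores rather than the many raw items
fit <- glm(y ~ satisfaction + trust, data=survey, family=binomial)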
Third, although you say that you can't use PCA (for some unspecified reason; it's just a linear transformation of the original predictors) in this situation, note that the analysis of the correlation structure provided by PCA on the predictors, or clustering approaches, could well identify sets of items that are highly related, essentially measuring the same thing, and thus could be combined into a single predictor for analysis. It would seem that you would want to know these relations among the individual items in any event, so it's a bit concerning that you can't take the next obvious step into a principal-components regression (PCR).
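A sketch of what that data-driven route could look like, again with the hypothetical survey data from above; varclus() is in the Hmisc package and prcomp() is base R:
library(Hmisc)

X <- as.matrix(survey[, items])   # predictors only; the outcome plays no role here

# cluster the items by their correlations to find groups measuring the same thing
plot(varclus(X))

# principal-components regression: replace the items by their first few components
pc      <- prcomp(X, scale.=TRUE)
pcs     <- as.data.frame(pc$x[, 1:5])   # how many components to keep is a judgment call
fit.pcr <- glm(survey$y ~ ., data=pcs, family=binomial)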
Fourth, you can employ shrinkage methods to minimize the overfitting inevitable with a high ratio of predictors to cases. Ridge regression (unlike LASSO) would keep information from all your predictors, just weighting them differentially. If your objection to PCR is that you don't want to throw out any information from your predictors, then this might be a solution. (It's essentially a weighted principal-components regression, rather than the all-or-none selection of components in PCR.)