I would suggest that you use Frank Harrell's excellent rms package. It contains many useful functions to validate and calibrate your model. As far as I know, you cannot assess predictive performance solely from the coefficients. Furthermore, I would suggest that you use the bootstrap to validate the model. The AUC or concordance index (c-index) is a useful measure of predictive performance. A c-index of $0.8$ is quite high, but, as in many predictive models, the fit of your model is likely overoptimistic (overfitting). This overoptimism can be assessed using the bootstrap. Let me give an example:
#-----------------------------------------------------------------------------
# Load packages
#-----------------------------------------------------------------------------
library(rms)
#-----------------------------------------------------------------------------
# Load data
#-----------------------------------------------------------------------------
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
mydata$rank <- factor(mydata$rank)
#-----------------------------------------------------------------------------
# Fit logistic regression model
#-----------------------------------------------------------------------------
mylogit <- lrm(admit ~ gre + gpa + rank, x=TRUE, y=TRUE, data = mydata)
mylogit
                      Model Likelihood     Discrimination    Rank Discrim.
                         Ratio Test           Indexes           Indexes
Obs            400    LR chi2     41.46    R2       0.138    C       0.693
 0             273    d.f.            5    g        0.838    Dxy     0.386
 1             127    Pr(> chi2) <0.0001   gr       2.311    gamma   0.387
max |deriv| 2e-06                          gp       0.167    tau-a   0.168
                                           Brier    0.195

            Coef    S.E.   Wald Z Pr(>|Z|)
Intercept  -3.9900 1.1400  -3.50  0.0005
gre         0.0023 0.0011   2.07  0.0385
gpa         0.8040 0.3318   2.42  0.0154
rank=2     -0.6754 0.3165  -2.13  0.0328
rank=3     -1.3402 0.3453  -3.88  0.0001
rank=4     -1.5515 0.4178  -3.71  0.0002
On the bottom you see the usual regression coefficients with their corresponding $p$-values. On the top right, you see several discrimination indices. The C
denotes the c-index (AUC): a c-index of $0.5$ denotes random prediction, whereas a c-index of $1$ denotes perfect prediction. Dxy
is Somers' $D_{xy}$, the rank correlation between the predicted probabilities and the observed responses. $D_{xy}$ has a simple relationship with the c-index: $D_{xy}=2(c-0.5)$. A $D_{xy}$ of $0$ occurs when the model's predictions are random, and when $D_{xy}=1$ the model discriminates perfectly. In this case, the c-index is $0.693$, which is somewhat better than chance but not good enough for predicting the outcomes of individuals; a c-index of $>0.8$ is usually required for that.
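If you want to verify this relationship on the fitted model itself, you can compute the c-index and $D_{xy}$ directly from the predicted probabilities; somers2() comes from the Hmisc package, which is loaded together with rms:
pred.prob <- predict(mylogit, type="fitted")   # predicted probabilities of admit=1
somers2(pred.prob, mydata$admit)               # returns C and Dxy; should match the lrm output above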
As noted above, the model is likely overoptimistic. We now use the bootstrap to quantify the optimism:
#-----------------------------------------------------------------------------
# Validate model using bootstrap
#-----------------------------------------------------------------------------
my.valid <- validate(mylogit, method="boot", B=1000)
my.valid
          index.orig training    test optimism index.corrected    n
Dxy           0.3857   0.4033  0.3674   0.0358          0.3498 1000
R2            0.1380   0.1554  0.1264   0.0290          0.1090 1000
Intercept     0.0000   0.0000 -0.0629   0.0629         -0.0629 1000
Slope         1.0000   1.0000  0.9034   0.0966          0.9034 1000
Emax          0.0000   0.0000  0.0334   0.0334          0.0334 1000
D             0.1011   0.1154  0.0920   0.0234          0.0778 1000
U            -0.0050  -0.0050  0.0015  -0.0065          0.0015 1000
Q             0.1061   0.1204  0.0905   0.0299          0.0762 1000
B             0.1947   0.1915  0.1977  -0.0062          0.2009 1000
g             0.8378   0.9011  0.7963   0.1048          0.7331 1000
gp            0.1673   0.1757  0.1596   0.0161          0.1511 1000
Let's concentrate on $D_{xy}$, which is at the top. The first column gives the original index, which was $0.3857$. The column called optimism
gives the estimated amount by which the apparent performance overstates the true performance. The column index.corrected
is the original estimate minus the optimism. In this case, the bias-corrected $D_{xy}$ is a bit smaller than the original. The bias-corrected c-index (AUC) is $c=\frac{1+D_{xy}}{2}=\frac{1+0.3498}{2}=0.6749$.
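You don't have to do this arithmetic by hand; since the object returned by validate() is essentially a matrix, something like the following should give the bias-corrected c-index directly:
dxy.corrected <- my.valid["Dxy", "index.corrected"]
(1 + dxy.corrected) / 2   # approximately 0.675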
We can also calculate a calibration curve using resampling:
#-----------------------------------------------------------------------------
# Calibration curve using bootstrap
#-----------------------------------------------------------------------------
my.calib <- calibrate(mylogit, method="boot", B=1000)
par(bg="white", las=1)
plot(my.calib, las=1)
n=400 Mean absolute error=0.016 Mean squared error=0.00034
0.9 Quantile of absolute error=0.025
The plot provides some evidence that our model is overfitting: it underestimates low probabilities and overestimates high probabilities. There is also systematic overestimation around $0.3$.
Predictive model building is a big topic and I suggest reading Frank Harrell's course notes.
Your approach gives up one of the advantages of multiple regression: accounting for the combined influences of all the predictors at once. It's thus effectively throwing away information, which is seldom useful.
One way to deal with too many predictors is to use subject-matter knowledge or the observed relations among the predictors (not considering the outcomes) to combine some related predictors into a combined individual predictor.
Another way was suggested in the comment by @user777: use LASSO, ridge regression, or elastic net, which impose a penalty on regression coefficients that guards against overfitting. (The rule of thumb of 10 events per variable was based on non-penalized analyses.) These methods provide principled ways to build models even if you have more predictor variables than cases.
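As a rough sketch of how such a penalized fit might look (this is not from the question; it assumes the glmnet package and that your predictors sit in a hypothetical data frame pred.df with a binary outcome y):
library(glmnet)

x <- model.matrix(~ ., data=pred.df)[, -1]   # numeric predictor matrix, intercept column dropped

# alpha=0 is ridge, alpha=1 is the LASSO, values in between give the elastic net;
# cv.glmnet chooses the penalty strength (lambda) by cross-validation
cv.fit <- cv.glmnet(x, y, family="binomial", alpha=0)

coef(cv.fit, s="lambda.min")                                      # shrunken coefficients
pred <- predict(cv.fit, newx=x, s="lambda.min", type="response")  # predicted probabilities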
Note that the best-subset suggestion in another comment doesn't get around the overfitting issue, and the variables selected would be highly dependent on your particular data sample. Try repeating best-subset analysis on multiple bootstrap samples to see the problems.
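To see that instability for yourself, you could rerun a selection procedure on bootstrap resamples and tabulate how often each predictor survives. A minimal sketch, using backward stepwise selection as a stand-in for best subsets and the same hypothetical pred.df and y as above:
set.seed(1)
dat  <- data.frame(y=y, pred.df)
B    <- 200
kept <- matrix(0, nrow=B, ncol=ncol(pred.df), dimnames=list(NULL, colnames(pred.df)))

for (b in seq_len(B)) {
  boot.dat <- dat[sample(nrow(dat), replace=TRUE), ]
  sel      <- step(glm(y ~ ., data=boot.dat, family=binomial), direction="backward", trace=0)
  kept[b, attr(terms(sel), "term.labels")] <- 1   # mark the predictors retained in this resample
}

colMeans(kept)   # proportion of resamples in which each predictor was selected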
If your interest is prediction and you only have 8 predictor variables, ridge regression will probably work well on your data.
Added in response to comments:
The paper linked from a comment, on assessing multivariable logistic regression models, rightly emphasizes the proper selection of predictor variables as a major criterion. Using subject-matter knowledge for selecting or combining variables should be a top priority. You might, for example, be able to combine categories in your categorical variable, or omit other predictors that have been shown in related studies not to be closely related to outcome.
That paper's sole focus on 10 events per variable to prevent overfitting, however, is inadequate in two ways. First, as a rule of thumb, 15 events per variable may be a better choice than 10. Second, not noted in that paper, methods like LASSO and ridge regression provide another well-established way to prevent overfitting, by shrinking the magnitudes of the coefficients to less than those that would be provided by standard logistic regression. See for example An Introduction to Statistical Learning for background on these and other approaches.
The idea to break your analysis into 2 parts (5-level categorical variable and then all other variables separately) doesn't really accomplish much. What you think you might gain from having about 15-20 events per variable in each of the 2 separate analyses would be lost by your need to correct for multiple hypothesis testing and your inability to take into account the levels of the categorical variable when evaluating the other variables (and vice-versa). And you would still need to evaluate overfitting.
With respect to "investigating associations" versus predictive modeling, consider what Frank Harrell has to say in Regression Modeling Strategies, second edition, page 3:
Thus when one develops a reasonable multivariable predictive model, hypothesis testing and estimation of effects are byproducts of the fitted model. So predictive modeling is often desirable even when prediction is not the main goal.
Harrell's rms
package in R provides the tools you need to build, calibrate, and validate logistic models. His book linked above and associated class notes provide examples of ways to deal with too many variables. Try them out on your dataset.
Best Answer
Although the initial symptom was a type of problem seen in logistic regression, the underlying issue is that there are many predictor variables and only a comparatively small number of cases. That underlying issue needs to be addressed.
So first, if the outcome variable is binary you should not abandon logistic regression. The underlying issue will not go away by trying another type of analysis, even if it appears in a different form. For example, an ordinary least-squares model would tend to be highly over-fit (even if it were appropriate for binary outcomes) and thus highly unreliable. You said: "when I run OLS regression on the data I get results that make more sense (or at least appear to)" (emphasis added). Yes, the result of a regression on your data set might fit quite well, but in this situation your model would probably not apply beyond your initial data set.
Second, you can consider reducing the number of predictor variables based on prior knowledge of the subject matter. Likert items are often designed to be multiple questions aimed at a single opinion or personality trait, which are then combined to form a Likert scale as a better gauge of the opinion or trait. If prior knowledge of the subject matter allows combination of the 100 Likert items into 5 or 10 Likert scales as predictors, then the problem with the predictor/case ratio would be greatly diminished. The combination of multiple items into a smaller number of scales might also diminish problems resulting from a potentially incorrect assumption of equally-spaced influences of each of the 4 steps along each 5-point Likert item.
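For example, if theory says that certain blocks of items measure the same trait, forming scale scores can be as simple as averaging the items in each block. A hypothetical sketch (the data frame survey, its outcome y, and the item groupings are made up for illustration):
items <- paste0("q", sprintf("%02d", 1:20))   # hypothetical Likert item columns q01...q20

# average the items belonging to each (theory-based) scale
survey$satisfaction <- rowMeans(survey[, items[1:10]],  na.rm=TRUE)
survey$trust        <- rowMeans(survey[, items[11:20]], na.rm=TRUE)

# fit the model on the few scale scores rather than the many raw items
fit <- glm(y ~ satisfaction + trust, data=survey, family=binomial)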
Third, although you say that you can't use PCA (for some unspecified reason; it's just a linear transformation of the original predictors) in this situation, note that the analysis of the correlation structure provided by PCA on the predictors, or clustering approaches, could well identify sets of items that are highly related, essentially measuring the same thing, and thus could be combined into a single predictor for analysis. It would seem that you would want to know these relations among the individual items in any event, so it's a bit concerning that you can't take the next obvious step into a principal-components regression (PCR).
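A sketch of what that data-driven route could look like, again with the hypothetical survey data from above; varclus() is in the Hmisc package and prcomp() is base R:
library(Hmisc)

X <- as.matrix(survey[, items])   # predictors only; the outcome plays no role here

# cluster the items by their correlations to find groups measuring the same thing
plot(varclus(X))

# principal-components regression: replace the items by their first few components
pc      <- prcomp(X, scale.=TRUE)
pcs     <- as.data.frame(pc$x[, 1:5])   # how many components to keep is a judgment call
fit.pcr <- glm(survey$y ~ ., data=pcs, family=binomial)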
Fourth, you can employ shrinkage methods to minimize the overfitting inevitable with a high ratio of predictors to cases. Ridge regression (unlike LASSO) would keep information from all your predictors, just weighting them differentially. If your objection to PCR is that you don't want to throw out any information from your predictors, then this might be a solution. (It's essentially a weighted principal-components regression, rather than the all-or-none selection of components in PCR.)