I think we should let Venables and Ripley have the word here (MASS, p. 198):
There is one fairly common circumstance in which both convergence
problems and the Hauck-Donner phenomenon can occur. This is when the
fitted probabilities are extremely close to zero or one. Consider a
medical diagnosis problem with thousands of cases and around fifty
binary explanatory variables (which may arise from coding fewer
categorical factors); one of these indicators is rarely true but
always indicates that the disease is present. Then the fitted
probabilities of cases with that indicator should be one, which can
only be achieved by taking $\hat\beta_i = \infty$. The result from glm will be warnings and an estimated coefficient of around +/- 10.
Apart from potential numerical difficulties, there is no formal problem with fitted probabilities that are numerically 0 or 1. However, the Wald test of the hypothesis $\beta_i = 0$, which is based on a quadratic approximation to the log-likelihood, can become a poor approximation of the likelihood ratio test: the Wald statistic may appear insignificant even though the hypothesis is clearly wrong (the Hauck-Donner phenomenon). As I understand it, this is what the warning is about.
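A small simulated sketch of the phenomenon (hypothetical data, not from the question): a rarely-true indicator that always implies the outcome produces a huge coefficient with an even larger standard error, so the Wald test looks insignificant while the likelihood ratio test is clearly significant.

```r
## Hypothetical illustration of quasi-complete separation and the
## Hauck-Donner effect (simulated data)
set.seed(1)
n <- 1000
x <- rbinom(n, 1, 0.02)                    # rarely-true binary indicator
y <- ifelse(x == 1, 1, rbinom(n, 1, 0.3))  # outcome is always 1 when x is 1

fit <- glm(y ~ x, family = binomial)
## glm typically warns: "fitted probabilities numerically 0 or 1 occurred"

summary(fit)$coefficients["x", ]
## enormous estimate and standard error; Wald p-value close to 1

anova(fit, test = "Chisq")
## the likelihood ratio test for x, in contrast, is highly significant
```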
With many predictors, a situation like the one Venables and Ripley describe can easily occur: a predictor is uninformative for most observations, but for a few cases it is a very strong predictor.
I would suggest using Frank Harrell's excellent rms package. It contains many useful functions to validate and calibrate your model. As far as I know, you cannot assess predictive performance solely from the coefficients. Further, I would suggest using the bootstrap to validate the model. The AUC or concordance index (c-index) is a useful measure of predictive performance. A c-index of $0.8$ is quite high, but as with many predictive models, the apparent fit of your model is likely overoptimistic (overfitting). This overoptimism can be assessed with the bootstrap. Let me give an example:
#-----------------------------------------------------------------------------
# Load packages
#-----------------------------------------------------------------------------
library(rms)
#-----------------------------------------------------------------------------
# Load data
#-----------------------------------------------------------------------------
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
mydata$rank <- factor(mydata$rank)
#-----------------------------------------------------------------------------
# Fit logistic regression model
#-----------------------------------------------------------------------------
mylogit <- lrm(admit ~ gre + gpa + rank, x=TRUE, y=TRUE, data = mydata)
mylogit
                     Model Likelihood     Discrimination    Rank Discrim.
                        Ratio Test           Indexes           Indexes
Obs          400    LR chi2     41.46    R2       0.138    C       0.693
 0           273    d.f.            5    g        0.838    Dxy     0.386
 1           127    Pr(> chi2) <0.0001   gr       2.311    gamma   0.387
max |deriv| 2e-06                        gp       0.167    tau-a   0.168
                                         Brier    0.195

          Coef    S.E.   Wald Z Pr(>|Z|)
Intercept -3.9900 1.1400 -3.50  0.0005
gre        0.0023 0.0011  2.07  0.0385
gpa        0.8040 0.3318  2.42  0.0154
rank=2    -0.6754 0.3165 -2.13  0.0328
rank=3    -1.3402 0.3453 -3.88  0.0001
rank=4    -1.5515 0.4178 -3.71  0.0002
At the bottom you see the usual regression coefficients with their $p$-values. At the top right, you see several discrimination indices. C denotes the c-index (equivalent to the AUC): a c-index of $0.5$ corresponds to random splitting, whereas a c-index of $1$ denotes perfect prediction. Dxy is Somers' $D_{xy}$, the rank correlation between the predicted probabilities and the observed responses. $D_{xy}$ has a simple relationship with the c-index: $D_{xy}=2(c-0.5)$. A $D_{xy}$ of $0$ occurs when the model's predictions are random, and when $D_{xy}=1$ the model discriminates perfectly. Here, the c-index is $0.693$, which is better than chance but probably not high enough for predicting the outcomes of individuals; for that, a c-index of $>0.8$ is usually considered necessary.
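The relationship between the two indices is easy to check against the printout above:

```r
## Check Dxy = 2(c - 0.5) against the lrm printout (C = 0.693, Dxy = 0.386)
c_index <- 0.693
Dxy <- 2 * (c_index - 0.5)
Dxy   # 0.386, matching the Rank Discrim. column
```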
As noted above, the model is likely overoptimistic. We now use the bootstrap to quantify the optimism:
#-----------------------------------------------------------------------------
# Validate model using bootstrap
#-----------------------------------------------------------------------------
my.valid <- validate(mylogit, method="boot", B=1000)
my.valid
index.orig training test optimism index.corrected n
Dxy 0.3857 0.4033 0.3674 0.0358 0.3498 1000
R2 0.1380 0.1554 0.1264 0.0290 0.1090 1000
Intercept 0.0000 0.0000 -0.0629 0.0629 -0.0629 1000
Slope 1.0000 1.0000 0.9034 0.0966 0.9034 1000
Emax 0.0000 0.0000 0.0334 0.0334 0.0334 1000
D 0.1011 0.1154 0.0920 0.0234 0.0778 1000
U -0.0050 -0.0050 0.0015 -0.0065 0.0015 1000
Q 0.1061 0.1204 0.0905 0.0299 0.0762 1000
B 0.1947 0.1915 0.1977 -0.0062 0.2009 1000
g 0.8378 0.9011 0.7963 0.1048 0.7331 1000
gp 0.1673 0.1757 0.1596 0.0161 0.1511 1000
Let's concentrate on the $D_{xy}$ at the top. The first column gives the original index, $0.3857$. The column optimism gives the estimated amount of overestimation due to overfitting. The column index.corrected is the original estimate minus the optimism. Here, the bias-corrected $D_{xy}$ ($0.3498$) is somewhat smaller than the original. The corresponding bias-corrected c-index (AUC) is $c=\frac{1+D_{xy}}{2}=0.6749$.
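This computation can be sketched directly from the validate() result above (assuming the my.valid object from the earlier call):

```r
## Bias-corrected c-index from the validated Somers' Dxy
Dxy_corrected <- my.valid["Dxy", "index.corrected"]  # 0.3498 in the run above
c_corrected   <- (1 + Dxy_corrected) / 2             # 0.6749
c_corrected
```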
We can also calculate a calibration curve using resampling:
#-----------------------------------------------------------------------------
# Calibration curve using bootstrap
#-----------------------------------------------------------------------------
my.calib <- calibrate(mylogit, method="boot", B=1000)
par(bg="white", las=1)
plot(my.calib, las=1)
n=400 Mean absolute error=0.016 Mean squared error=0.00034
0.9 Quantile of absolute error=0.025
The plot provides some evidence that our model is overfitted: it underestimates low probabilities and overestimates high probabilities. There is also a systematic overestimation around a predicted probability of $0.3$.
Predictive model building is a big topic and I suggest reading Frank Harrell's course notes.
Best Answer
Yes. The general rule of thumb is that you want 10 cases in the smaller outcome group for each predictor variable. So, with 10 IVs, you would want at least 100 buyers and 100 non-buyers.
Usually a table is presented, although what goes into it varies with the style of the journal. The American Psychological Association's style is frequently used. I would include the coefficient, its SE, and the odds ratio for each IV. Another nice thing to do is to present the predicted proportions for various combinations of the IVs, though this can get tricky with many IVs. R also has a plot() method for glm objects that gives useful default diagnostic plots.
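As a sketch of the predicted-proportions idea, reusing the mylogit fit from the admissions example above (the variable names and value grid are from that example; adapt them to your own data):

```r
## Predicted probabilities for selected combinations of the predictors
newdat <- expand.grid(gre  = c(500, 700),
                      gpa  = c(3.0, 3.7),
                      rank = factor(1:4))
## predict.lrm with type = "fitted" returns predicted probabilities
newdat$prob <- predict(mylogit, newdata = newdat, type = "fitted")
newdat
```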