Dealing with a large number of predictors in Logistic Regression

logistic, r

Let's say I have a logistic regression model which predicts whether a consumer will buy an item based on about 10 consumer characteristics.

$$\begin{array}{rcl}\operatorname{logit} P(Buy = 1) &=& B_0 + B_1\times Gender + B_2\times CreditType + B_3\times Education + B_4\times OwnsHome \\ && + B_5\times CarMake + B_6\times CarYear + B_7\times State + B_8\times Income + B_9\times Insurance \\ && + B_{10}\times CarAccidents\end{array} $$

  1. Is there ever an issue with including too many predictors in a logistic regression model? I'm not talking about insignificant variables or ones that may be related, but just the sheer number of variables included in a model.

  2. With a larger number of predictors, how should one present the regression results in a meaningful manner? Is it just a matter of plotting the probability curve for $Y=1$, or are there "better" ways of doing this? I'd be doing this in R, so any help on that end would be appreciated.

Best Answer

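  1. Yes. The general rule of thumb is that you want at least 10 cases in the smaller outcome group for each variable. So, with 10 IVs (independent variables), you'd want at least 100 buyers and 100 non-buyers. Keep in mind that a categorical IV with $k$ levels is entered as $k-1$ dummy variables, so predictors such as State or CarMake can count as many variables each.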

  2. Usually a table is presented, although what goes into that table varies with the style of the journal or other venue. The American Psychological Association's style is frequently used. I would include the coefficient, its standard error (SE), and the odds ratio for each IV. Another nice thing to do is to report the predicted proportion for various combinations of the IVs, but this can be tricky with lots of IVs. In R, calling plot() on a fitted glm object gives a set of default diagnostic plots.
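
For point 1, here is a minimal sketch of how the rule of thumb could be checked in R. The data frame `d`, its columns, and the sample size are hypothetical and simulated only so the example runs; the point is that `model.matrix()` counts the coefficients actually estimated, including the extra dummy variables created by factors.

```r
# Hypothetical data: simulated only to make the example runnable.
set.seed(1)
n <- 500
d <- data.frame(
  Gender   = factor(sample(c("F", "M"), n, replace = TRUE)),
  OwnsHome = factor(sample(c("no", "yes"), n, replace = TRUE)),
  State    = factor(sample(c("CA", "NY", "TX", "WA"), n, replace = TRUE)),
  Income   = round(rnorm(n, mean = 60, sd = 15)),  # in thousands
  Buy      = rbinom(n, size = 1, prob = 0.3)
)

# Number of estimated coefficients, excluding the intercept.
# A factor with k levels contributes k - 1 of them.
X <- model.matrix(Buy ~ Gender + OwnsHome + State + Income, data = d)
n_params <- ncol(X) - 1

# Size of the smaller outcome group (buyers or non-buyers).
n_smaller <- min(table(d$Buy))

# Cases in the smaller group per coefficient; the rule of thumb
# asks for roughly 10 or more.
epv <- n_smaller / n_params
cat("coefficients:", n_params,
    " smaller group:", n_smaller,
    " cases per coefficient:", round(epv, 1), "\n")
```

Here four predictors turn into six coefficients because State has four levels, so the sample size needed under the rule of thumb grows accordingly.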
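
For point 2, the table and the predicted proportions could be produced along these lines. This is only a sketch using base R: the data are simulated, the variable names are hypothetical, and the Wald intervals from `confint.default()` are one reasonable choice among several.

```r
# Hypothetical data: simulated only to make the example runnable.
set.seed(2)
n <- 400
d <- data.frame(
  Gender   = factor(sample(c("F", "M"), n, replace = TRUE)),
  OwnsHome = factor(sample(c("no", "yes"), n, replace = TRUE)),
  Income   = rnorm(n, mean = 60, sd = 15)  # in thousands
)
# Assumed "true" log-odds, used only to simulate the outcome.
eta <- -2 + 0.5 * (d$OwnsHome == "yes") + 0.02 * d$Income
d$Buy <- rbinom(n, size = 1, prob = plogis(eta))

fit <- glm(Buy ~ Gender + OwnsHome + Income, data = d, family = binomial)

# Table of coefficients, standard errors, and odds ratios.
# For the intercept row, exp(Estimate) is the baseline odds,
# not an odds ratio.
est <- summary(fit)$coefficients
ci  <- confint.default(fit)  # Wald intervals on the log-odds scale
tab <- data.frame(
  Estimate = est[, "Estimate"],
  SE       = est[, "Std. Error"],
  OR       = exp(est[, "Estimate"]),
  OR_lower = exp(ci[, 1]),
  OR_upper = exp(ci[, 2])
)
print(round(tab, 3))

# Predicted probability of buying for chosen combinations of the IVs.
grid <- expand.grid(
  Gender   = levels(d$Gender),
  OwnsHome = levels(d$OwnsHome),
  Income   = c(40, 60, 80)
)
grid$p_buy <- predict(fit, newdata = grid, type = "response")
print(grid)
```

With many IVs the full grid grows quickly, so it usually helps to vary only a few predictors of interest and hold the others at typical values.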