Sample size for logistic regression

assumptions, logistic, sample-size, statistical-power, unbalanced-classes

I want to fit a logistic model to my survey data. It is a small survey of four residential colonies in which only 154 respondents were interviewed. My dependent variable is "satisfactory transition to work". Of the 154 respondents, 73 said that they had transitioned to work satisfactorily, while the remaining 81 did not. The dependent variable is therefore binary, so I decided to use logistic regression. I have seven independent variables (three continuous and four nominal). One guideline suggests that there should be 10 cases for each predictor / independent variable (Agresti, 2007). Based on this guideline, I feel that it is OK to run logistic regression.

Am I right? If not, please let me know how to decide on the number of independent variables.

Best Answer

There are several issues here.

Typically, we want to determine a minimum sample size so as to achieve a minimally acceptable level of statistical power. The sample size required is a function of several factors, primarily the magnitude of the effect you want to be able to differentiate from 0 (or whatever null you are using, but 0 is most common) and the minimum probability you want to have of detecting that effect. Working from this perspective, sample size is determined by a power analysis.
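One way to carry out such a power analysis is by simulation. The sketch below is only an illustration under stated assumptions: a single standard-normal predictor, a hypothetical true slope of 0.5 on the log-odds scale, a two-sided Wald test at the 5% level, and a hand-rolled Newton-Raphson fit so that nothing beyond the Python standard library is needed. The names fit_logistic and simulate_power are made up for this example. With seven correlated predictors you would want to simulate from a model that resembles your own data instead.

```python
import math
import random


def fit_logistic(x, y, max_iter=25, tol=1e-8):
    """Fit logit P(y=1) = b0 + b1*x by Newton-Raphson.

    Returns (b0, b1, se_b1), or None if the information matrix is
    numerically singular (for example under separation).
    """
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            z = max(-30.0, min(30.0, b0 + b1 * xi))  # clamp to avoid overflow
            p = 1.0 / (1.0 + math.exp(-z))
            w = p * (1.0 - p)
            g0 += yi - p           # score for the intercept
            g1 += (yi - p) * xi    # score for the slope
            h00 += w               # entries of the 2x2 information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det < 1e-10:
            return None
        d0 = (h11 * g0 - h01 * g1) / det   # Newton step: inverse(information) * score
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    # Var(b1) is the (slope, slope) entry of the inverse information matrix.
    return b0, b1, math.sqrt(h00 / det)


def simulate_power(n, beta1, beta0=0.0, n_sims=200, seed=1):
    """Share of simulated data sets in which the Wald test rejects beta1 = 0."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [1 if rng.random() < 1.0 / (1.0 + math.exp(-(beta0 + beta1 * xi))) else 0
             for xi in x]
        fit = fit_logistic(x, y)
        if fit is None:
            continue  # a failed fit is counted as a non-rejection
        _, b1_hat, se_b1 = fit
        if abs(b1_hat / se_b1) > 1.96:  # two-sided test at the 5% level
            rejections += 1
    return rejections / n_sims


# Hypothetical effect size; 154 is the sample size from the question.
for n in (40, 154):
    print(n, simulate_power(n=n, beta1=0.5))
```

Rerunning this over a grid of sample sizes and plausible effect sizes gives a rough power curve, from which you can read off the smallest N that reaches the power you want.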

Another consideration is the stability of your model (as @cbeleites notes). Basically, as the ratio of the number of parameters estimated to the number of observations gets close to 1, your model becomes saturated and will necessarily be overfit (unless there is, in fact, no randomness in the system). The 1 to 10 rule of thumb comes from this perspective. Note that having adequate power will generally cover this concern for you, but not vice versa.
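To make the arithmetic behind that ratio concrete, here is a small sketch using the figures from the question. It rests on two assumptions that are not in the original post: for a binary outcome the rule is commonly applied to the rarer outcome class rather than to the total N, and each nominal IV costs one parameter per non-reference level. The level counts used below (three per nominal IV) are purely hypothetical, as are the function names.

```python
def n_parameters(n_continuous, nominal_levels):
    """Slopes to estimate: one per continuous IV, (levels - 1) per nominal IV."""
    return n_continuous + sum(k - 1 for k in nominal_levels)


def events_per_parameter(n_events, n_nonevents, n_params):
    """Count of the rarer outcome divided by the number of slope parameters."""
    return min(n_events, n_nonevents) / n_params


# Figures from the question: 73 of 154 respondents transitioned satisfactorily.
n_events, n_nonevents = 73, 154 - 73

# Naive count: seven IVs treated as seven parameters.
naive = events_per_parameter(n_events, n_nonevents, 7)

# Hypothetical: each of the four nominal IVs has three levels.
params = n_parameters(3, [3, 3, 3, 3])
adjusted = events_per_parameter(n_events, n_nonevents, params)

print(f"naive ratio:    {naive:.1f} per parameter")
print(f"adjusted ratio: {adjusted:.1f} per parameter ({params} parameters)")
```

The point of the comparison is that the ratio depends on how the parameters are counted, so a model that looks acceptable with seven "variables" may fall below 10 once the dummy variables are counted.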

The 1 to 10 rule comes from the linear regression world, however, and it's important to recognize that logistic regression has additional complexities. One issue is that logistic regression works best when the percentages of 1's and 0's are approximately 50% / 50% (as @andrea and @psj discuss in the comments above). Another issue to be concerned with is separation. That is, you don't want to have all of your 1's gathered at one extreme of an independent variable (or some combination of them) and all of the 0's at the other extreme. Although this would seem like a good situation, because it would make perfect prediction easy, it actually makes the parameter estimation process blow up. (@Scortchi has an excellent discussion of how to handle this in the thread "How to deal with perfect separation in logistic regression?") With more IV's, separation becomes more likely, even if the true magnitudes of the effects are held constant, and especially if your responses are unbalanced. Thus, you can easily need more than 10 observations per IV.
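A toy illustration of why estimation blows up under separation: with the made-up data below (every 0 has a negative x, every 1 a positive x), the log-likelihood keeps increasing as the slope grows, so no finite maximum likelihood estimate exists. The data, the no-intercept model and the function names are assumptions chosen to keep the sketch short.

```python
import math


def log1pexp(z):
    """Numerically stable log(1 + exp(z))."""
    return z if z > 30.0 else math.log1p(math.exp(z))


def log_likelihood(slope, x, y):
    """Log-likelihood of the no-intercept model logit P(y=1) = slope * x."""
    total = 0.0
    for xi, yi in zip(x, y):
        z = slope * xi
        # log P(y=1) = -log(1 + exp(-z));  log P(y=0) = -log(1 + exp(z))
        total += -log1pexp(-z) if yi == 1 else -log1pexp(z)
    return total


# Perfectly separated toy data: the sign of x determines y.
x = [-3, -2, -1, 1, 2, 3]
y = [0, 0, 0, 1, 1, 1]

for slope in (1, 5, 25):
    print(slope, log_likelihood(slope, x, y))
```

Each larger slope fits better than the last, which is why an iterative fitting routine either fails to converge or reports a huge coefficient with an enormous standard error.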

One last issue with that rule of thumb is that it assumes your IV's are orthogonal. This is reasonable for designed experiments, but with observational studies such as yours, your IV's will almost never be even roughly orthogonal. There are strategies for dealing with this situation (e.g., combining or dropping IV's, conducting a principal components analysis first, etc.), but if it isn't addressed (which is common), you will need more data.
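A quick way to see how far two IV's are from orthogonal is to look at their correlation. The sketch below uses invented numbers for two hypothetical continuous IV's (age and years of experience) and the two-predictor variance inflation factor, 1 / (1 - r^2). That inflation interpretation is exact for linear regression and only a rough guide for logistic regression, so treat the output as a diagnostic rather than a verdict.

```python
import math


def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


def vif_two_predictors(r):
    """Variance inflation factor when the model has exactly two predictors."""
    return 1.0 / (1.0 - r ** 2)


# Invented values for two hypothetical IV's, for illustration only.
age = [22, 25, 29, 31, 36, 40, 44, 51]
experience = [1, 2, 6, 5, 12, 15, 18, 27]

r = pearson_r(age, experience)
print(f"r = {r:.2f}, VIF = {vif_two_predictors(r):.1f}")
```

A VIF near 1 means the two IV's are close to orthogonal. Larger values mean the standard errors are inflated, which is the sense in which correlated IV's require more data.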

A reasonable question, then, is what your minimum N should be and whether your sample size is sufficient. To address this, I suggest you use the methods @cbeleites discusses; relying on the 1 to 10 rule will be insufficient.
