Logistic Regression – Understanding Perfect Separation in Logistic Regression and Possible Alternatives

logistic, r, separation

I have some data on patients presenting to emergency departments after sustaining self-inflicted gunshot injuries, stored in a data frame ("SIGSW," which is ~16,000 observations of 47 variables) in R. I want to create a model that helps a physician predict, using several objective covariates, the "pretest probability" that the self-shooting was a suicide attempt rather than a negligent discharge. The covariates are largely categorical variables, but a few are continuous or binary. My outcome, suicide attempt or not, is coded as a binary/indicator variable, "SI," so I believe a binary logistic regression to be the appropriate tool.

In order to construct my model, I intended to individually regress SI on each covariate, and use the p-value from the likelihood ratio test for each model to inform which covariates should be considered for the backward model selection.

For each model (SI ~ SEX, SI ~ AGE, etc.), I receive the following warnings:

>glm(SI ~ SEX, family = binomial, data=SIGSW)
Warning messages:
1: glm.fit: algorithm did not converge 
2: glm.fit: algorithm did not converge

A little Googling revealed that I perhaps need to increase the number of iterations to allow convergence. I did this with the following:

>glm(SI ~ SEX, family = binomial, data=SIGSW, control = list(maxit = 50))

Call:  glm(formula = SI ~ SEX, family = binomial, data = SIGSW, control = list(maxit = 50))

Coefficients:
(Intercept)          SEX  
 -3.157e+01   -2.249e-13  

Degrees of Freedom: 15986 Total (i.e. Null);  15985 Residual
Null Deviance:      0 
Residual Deviance: 7.1e-12  AIC: 4
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred 

This warning message, after a little Googling, suggests a "perfect separation," which, as I understand it, means that my predictor is "too good." Seeing as how this happens with all of the predictors, I'm somewhat skeptical that they're all "too good." Am I doing something wrong?

Edit: In light of the answers, here is a sample of the data (I only selected a few of the variables for space concerns):

   SIGSW.AGENYR_C SIGSW.SEX SIGSW.RACE_C SIGSW.SI
1              19      Male        White        0
2              13      Male        Other        0
3              18      Male   Not Stated        0
4              15      Male        White        0
5              23      Male        White        0
6              11      Male        Black        0
7              16      Male   Not Stated        1
8              21      Male   Not Stated        0
9              14      Male        White        0
10             41      Male        White        0

And here is the crosstabulation of SEX and SI, showing that SI is coded as an indicator variable, and that there are both men and women with SI, so sex is not a perfect predictor.

  >table(SIGSW$SEX, SIGSW$SI)        
              0     1
  Unknown     1     3
  Male    11729  2121
  Female   1676   457

Does the small cell size represent a problem?

Best Answer

Looking at this

Coefficients:
(Intercept)          SEX  
 -3.157e+01   -2.249e-13

I see that your model is returning a numeric zero for the coefficient of SEX ($-2.2 \times 10^{-13}$ may as well be $0$), and is driving the intercept to $-31.57$. Plugging that value into the logistic function $1/(1 + e^{-x})$ in my R interpreter I get

> signif(1/(1 + exp(31.57)), 3)
[1] 1.95e-14

So you don't really have perfect separation except in a degenerate sense; your model is saying that the probability of a suicide attempt is essentially zero for every record. The null deviance of 0 in your output points the same way: the response the model actually sees takes a single value.
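A minimal sketch of this degenerate case, on simulated data rather than yours: if the response passed to glm() takes only one value, the fit should show the same pattern as your output (a large negative intercept, a slope near zero, and a null deviance of 0). The names `sex` and `si_all_zero` are made up for illustration.

```r
# Simulated illustration; `sex` and `si_all_zero` are hypothetical names.
set.seed(42)
n <- 2000
sex <- factor(sample(c("Male", "Female"), n, replace = TRUE))
si_all_zero <- rep(0, n)  # every response is 0: nothing to separate

# maxit is raised as in the question; warnings are suppressed for readability
fit <- suppressWarnings(
  glm(si_all_zero ~ sex, family = binomial, control = list(maxit = 50))
)

coef(fit)              # intercept far below zero, slope near zero
fit$null.deviance      # 0, because the response never varies
range(fitted(fit))     # all fitted probabilities are essentially 0

# plogis() is R's logistic function, so this is the fitted probability
plogis(coef(fit)[["(Intercept)"]])
```

The intercept keeps drifting toward minus infinity until the convergence tolerance is met, which is why the default iteration limit may run out first.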

I can't say why this is so without seeing your data, but I would hypothesize it is an encoding error in how you are passing the response to the model. Make sure that your response column is coded as an indicator variable, $0$ for no suicide, $1$ for a suicide.
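One way to check the coding is to look at the response exactly as the model formula resolves it, rather than at the column you believe is being used. The sketch below uses a toy data frame; `check_binary_response` is a hypothetical helper, not a function from any package. With your data you would substitute your own formula and `data = SIGSW`.

```r
# Hypothetical helper: summarise a response vector as glm() would receive it
check_binary_response <- function(y) {
  list(
    class      = class(y),
    counts     = table(y, useNA = "ifany"),
    n_distinct = length(unique(y[!is.na(y)]))
  )
}

# Toy data standing in for the real data frame
toy <- data.frame(
  si  = c(0, 1, 0, 0, 1, 0),
  sex = factor(c("Male", "Male", "Female", "Male", "Female", "Female"))
)

# model.frame() resolves the formula the same way glm() does, so this is
# the response the model would actually be fitted to
mf <- model.frame(si ~ sex, data = toy)
resp_check <- check_binary_response(model.response(mf))
resp_check$counts

# A usable binary response has exactly two distinct non-missing values
resp_check$n_distinct

# A degenerate response, for comparison
bad_check <- check_binary_response(rep(0, 6))
bad_check$n_distinct
```

If the count of distinct values is 1, or the table does not match the crosstabulation you expect, the problem is in what is being passed as the response rather than in the predictor.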

You wrote: "In order to construct my model, I intended to individually regress SI on each covariate, and use the p-value from the likelihood ratio test for each model to inform which covariates should be considered for the backward model selection."

I can't help but comment that this is a poor procedure. Regressing a response on individual predictors tells you next to nothing about the structure of a multivariate model. Backwards selection also has its own host of problems, as you will find if you search this site for the term.
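A small simulated sketch of why one-predictor-at-a-time screening can mislead. Here the outcome depends on the difference between two correlated predictors, and `x1` is constructed to be unrelated to the outcome on its own, so a univariable screen will usually discard it even though it matters a great deal in the joint model. All names and effect sizes are invented, and Wald p-values from summary() are used for brevity instead of likelihood ratio tests.

```r
# Simulated illustration; variable names and effect sizes are made up.
set.seed(123)
n  <- 2000
x1 <- rnorm(n)
d  <- rnorm(n, sd = 0.3)
x2 <- x1 + d                       # x2 is strongly correlated with x1
y  <- rbinom(n, 1, plogis(5 * (x1 - x2)))  # outcome depends only on x1 - x2

# Univariable screen: x1 alone carries no information about y
uni_x1   <- glm(y ~ x1, family = binomial)
p_uni_x1 <- coef(summary(uni_x1))["x1", "Pr(>|z|)"]

# Joint model: both predictors are needed
joint   <- glm(y ~ x1 + x2, family = binomial)
p_joint <- coef(summary(joint))[c("x1", "x2"), "Pr(>|z|)"]

p_uni_x1
p_joint
```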

If you want to do variable selection, please consider a more principled method such as penalized regression (the lasso or elastic net), as implemented in the glmnet package.
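A minimal sketch of what that might look like, on simulated data with made-up variable names, assuming the glmnet package is installed. It fits a cross-validated lasso logistic regression; coefficients shrunk exactly to zero correspond to variables the penalty has dropped. Using `lambda.1se` rather than `lambda.min` is a choice made here for illustration, not a recommendation from the answer.

```r
# Simulated illustration; all variable names are hypothetical.
set.seed(1)
n <- 500
toy <- data.frame(
  age   = round(runif(n, 10, 80)),
  sex   = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  noise = rnorm(n)
)
# The simulated outcome depends on age only
toy$si <- rbinom(n, 1, plogis(-3 + 0.05 * toy$age))

if (requireNamespace("glmnet", quietly = TRUE)) {
  # glmnet needs a numeric matrix; model.matrix() expands factors into
  # dummy columns, and [, -1] drops the intercept column
  x <- model.matrix(si ~ age + sex + noise, data = toy)[, -1]

  # alpha = 1 is the lasso penalty; lambda is chosen by cross-validation
  cvfit <- glmnet::cv.glmnet(x, toy$si, family = "binomial", alpha = 1)

  # Coefficients at the more heavily penalised lambda.1se
  lasso_coef <- as.matrix(coef(cvfit, s = "lambda.1se"))
  print(lasso_coef)
}
```

All candidate predictors enter the model together, so the selection accounts for their joint structure instead of screening them one at a time.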