I have some data on patients presenting to emergency departments after sustaining self-inflicted gunshot injuries, stored in a data frame ("SIGSW," which is ~16,000 observations of 47 variables) in R. I want to create a model that helps a physician predict, using several objective covariates, the "pretest probability" of the self-shooting being a suicide attempt, or a negligent discharge. The covariates are largely categorical variables, but a few are continuous or binary. My outcome, suicide attempt or not, is coded as a binary/indicator variable, "SI," so I believe a binary logistic regression to be the appropriate tool.
In order to construct my model, I intended to individually regress SI on each covariate, and use the p-value from the likelihood ratio test for each model to inform which covariates should be considered for the backward model selection.
For each model (SI ~ SEX, SI ~ AGE, etc.), I receive the following warning:
>glm(SI ~ SEX, family = binomial, data=SIGSW)
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: algorithm did not converge
A little Googling revealed that I perhaps need to increase the number of iterations to allow convergence. I did this with the following:
>glm(SI ~ SEX, family = binomial, data=SIGSW, control = list(maxit = 50))
Call: glm(formula = SI ~ SEX, family = binomial, data = SIGSW, control = list(maxit = 50))
Coefficients:
(Intercept) SEX
-3.157e+01 -2.249e-13
Degrees of Freedom: 15986 Total (i.e. Null); 15985 Residual
Null Deviance: 0
Residual Deviance: 7.1e-12 AIC: 4
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
This warning message, after a little Googling, suggests a "perfect separation," which, as I understand it, means that my predictor is "too good." Seeing as how this happens with all of the predictors, I'm somewhat skeptical that they're all "too good." Am I doing something wrong?
Edit: In light of the answers, here is a sample of the data (I only selected a few of the variables for space concerns):
SIGSW.AGENYR_C SIGSW.SEX SIGSW.RACE_C SIGSW.SI
1 19 Male White 0
2 13 Male Other 0
3 18 Male Not Stated 0
4 15 Male White 0
5 23 Male White 0
6 11 Male Black 0
7 16 Male Not Stated 1
8 21 Male Not Stated 0
9 14 Male White 0
10 41 Male White 0
And here is the crosstabulation of SEX and SI, showing that SI is coded as an indicator variable, and that there are both men and women with SI, so sex is not a perfect predictor.
>table(SIGSW$SEX, SIGSW$SI)
0 1
Unknown 1 3
Male 11729 2121
Female 1676 457
Does the small cell size represent a problem?
Best Answer
Looking at the coefficients in your output, I see that your model is returning a numeric zero for the coefficient of SEX ($-2.2 \times 10^{-13}$ may as well be $0$) and is driving the intercept to $-31.57$. Plugging that value into the logistic function gives $1 / (1 + e^{31.57}) \approx 2 \times 10^{-14}$, which is numerically zero. So you don't really have perfect separation except in a degenerate sense; your model is saying that the probability of a suicide attempt is essentially zero for every record.
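The arithmetic is easy to check in base R, where plogis() is the logistic function. This is only a minimal sketch using the intercept reported in the question; the variable names are mine.

```r
# Fitted intercept reported in the question's glm output
intercept <- -31.57

# plogis() is the logistic (inverse logit) function: 1 / (1 + exp(-eta))
p_hat <- plogis(intercept)

# The same quantity written out by hand, as a cross-check
p_by_hand <- 1 / (1 + exp(-intercept))

p_hat      # on the order of 1e-14, i.e. numerically zero
p_by_hand
```

Because the SEX coefficient is numerically zero, this same fitted probability applies to every row, whatever the value of SEX.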
I can't say why this is so without seeing your data, but I would hypothesize it is an encoding error in how you are passing the response to the model. Make sure that your response column is coded as an indicator variable, $0$ for no suicide, $1$ for a suicide.
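A rough sketch of how I would check the encoding is below. It runs on a made-up toy data frame, not your SIGSW data, and the miscoding it shows (a response stored as character) is just one hypothetical way things can go wrong; `toy` and its values are purely illustrative.

```r
# Toy stand-in for the real data; values are invented for illustration
toy <- data.frame(
  SEX = factor(c("Male", "Female", "Male", "Male", "Female", "Male")),
  SI  = c("0", "0", "1", "0", "1", "0"),  # response accidentally stored as text
  stringsAsFactors = FALSE
)

# Inspect the storage type and the distinct values of the response
str(toy$SI)
table(toy$SI, useNA = "ifany")

# Recode to a 0/1 integer indicator; as.character() first makes this
# safe for factors as well as character vectors
toy$SI <- as.integer(as.character(toy$SI))
stopifnot(all(toy$SI %in% c(0L, 1L)))

# Refit: with a proper 0/1 response and both outcomes present in each
# group, the fit converges without any maxit adjustment
fit <- glm(SI ~ SEX, family = binomial, data = toy)
coef(fit)
```

On the real data, the same two inspection calls on SIGSW$SI would show whether the column glm() receives is the indicator you think it is.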
I can't help but comment that this is a poor procedure. Regressing a response on individual predictors tells you next to nothing about the structure of a multivariate model. Backwards selection also has its own host of problems, as you will find if you search this site for the term.
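To illustrate the point about one-at-a-time screening, here is a small simulation sketch. The data are entirely synthetic and constructed by me so that the outcome depends on the difference between two correlated predictors; nothing here comes from your data set.

```r
set.seed(1)
n  <- 2000
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)              # x2 is strongly correlated with x1
y  <- rbinom(n, 1, plogis(3 * (x1 - x2)))  # outcome depends only on x1 - x2

# Univariate screen: likelihood ratio test for x1 on its own
fit_x1     <- glm(y ~ x1, family = binomial)
p_marginal <- anova(fit_x1, test = "Chisq")[2, "Pr(>Chi)"]

# Joint model: Wald p-value for x1 after adjusting for x2
fit_both <- glm(y ~ x1 + x2, family = binomial)
p_joint  <- summary(fit_both)$coefficients["x1", "Pr(>|z|)"]

p_marginal
p_joint
```

By construction x1 is independent of the outcome on its own, yet it carries a large coefficient once x2 is in the model, so a screen based on the univariate p-value can easily discard a variable the joint model needs.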
If you want to do variable selection, please consider a more principled method like glmnet.
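As a starting point, a penalized logistic regression might look roughly like the sketch below. It assumes the glmnet package is installed, uses simulated data in place of yours (the data frame `sim`, its columns, and the coefficients used to generate SI are all invented), and picks the lasso penalty (alpha = 1) purely as an example.

```r
library(glmnet)  # external package; assumed installed

set.seed(42)
n <- 500

# Simulated stand-in data with a mix of continuous and categorical covariates
sim <- data.frame(
  AGE  = round(runif(n, 10, 80)),
  SEX  = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  RACE = factor(sample(c("White", "Black", "Other"), n, replace = TRUE))
)
sim$SI <- rbinom(n, 1, plogis(-2 + 0.03 * sim$AGE + 0.5 * (sim$SEX == "Female")))

# glmnet needs a numeric matrix, so expand factors into dummy columns
# and drop the intercept column (glmnet fits its own intercept)
x <- model.matrix(SI ~ AGE + SEX + RACE, data = sim)[, -1]

# Cross-validated lasso-penalized logistic regression
cv_fit <- cv.glmnet(x, sim$SI, family = "binomial", alpha = 1)

# Coefficients at the penalty that minimizes cross-validated deviance;
# entries shrunk exactly to zero are effectively dropped from the model
coef(cv_fit, s = "lambda.min")
```

All candidate covariates enter the fit together, and the penalty strength is chosen by cross-validation rather than by a sequence of p-value tests.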