No. The response variable $y_i$ is a Bernoulli random variable, taking the value $1$ with probability $p_i$ (computed as pr in the code below).
> set.seed(666)
> x1 = rnorm(1000) # some continuous variables
> x2 = rnorm(1000)
> z = 1 + 2*x1 + 3*x2 # linear combination with a bias
> pr = 1/(1+exp(-z)) # pass through an inv-logit function
> y = rbinom(1000,1,pr) # Bernoulli response variable
>
> #now feed it to glm:
> df = data.frame(y=y,x1=x1,x2=x2)
> glm( y~x1+x2,data=df,family="binomial")
Call: glm(formula = y ~ x1 + x2, family = "binomial", data = df)
Coefficients:
(Intercept) x1 x2
0.9915 2.2731 3.1853
Degrees of Freedom: 999 Total (i.e. Null); 997 Residual
Null Deviance: 1355
Residual Deviance: 582.9 AIC: 588.9
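As a quick check (reusing the df from above), the fitted coefficients land close to the values used in the simulation:

# Compare the fitted coefficients with the true values (1, 2, 3)
fit = glm(y ~ x1 + x2, data = df, family = "binomial")
cbind(true = c(1, 2, 3), estimated = coef(fit))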
R is giving you two different warnings because these really are two distinct issues.
Very loosely, the algorithm that fits a logistic regression model (typically some version of Newton-Raphson) looks around for the coefficient estimates that will maximize the log likelihood. It evaluates the model at a given point in the parameter space, sees which direction is 'uphill', and then moves some distance in that direction. The potential problem is that when perfect separation exists, the log likelihood has no finite maximum: it keeps increasing as the slope estimate grows without bound. Because a search algorithm has to be designed to stop at some point, it never reaches a maximum and fails to converge.
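To see that first warning in isolation, here is a minimal sketch with toy, perfectly separated data (x_sep and y_sep are made up for illustration):

x_sep = c(1, 2, 3, 4, 5, 6)
y_sep = c(0, 0, 0, 1, 1, 1) # every 0 sits below every 1 on x: perfect separation
glm(y_sep ~ x_sep, family = binomial)
# Typically produces both warnings:
# glm.fit: algorithm did not converge
# glm.fit: fitted probabilities numerically 0 or 1 occurred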
On the other hand, wherever the algorithm stops, and whether it converged or not, it is still (theoretically) possible to calculate the model's predicted values for the data. However, because computers use finite-precision arithmetic, extremely small quantities eventually get rounded off or dropped in the calculations. Thus, if the arithmetically correct value is sufficiently close to 0 or 1, it can round to 0 or 1 exactly. A fitted probability can end up that extreme for an x-value within the normal range of the data when complete separation has driven the slope estimate to be huge in absolute value, or an x-value can simply be so far out that even a modest slope produces the same phenomenon.
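As a concrete illustration of that rounding (a self-contained arithmetic sketch, not part of the model above): exp(-40) is already far smaller than the spacing of double-precision numbers around 1, so adding it to 1 changes nothing.

.Machine$double.eps   # ~2.220446e-16, the relative spacing of doubles near 1
exp(-40)              # ~4.25e-18, far below that spacing
1/(1 + exp(-40))      # arithmetically just under 1, but rounds to exactly 1
1/(1 + exp(-40)) == 1 # TRUE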
# I'll use this function to convert log odds to probabilities (same as plogis)
lo2p = function(lo){ exp(lo) / (1+exp(lo)) }
set.seed(163) # this makes the example exactly reproducible
x = c(-500, runif(100, min=-3, max=3), 500) # the x-values; 2 are extreme
lo = 0 + 1*x
p = lo2p(lo)
y = rbinom(102, size=1, prob=p)
m = glm(y~x, family=binomial)
# Warning message:
# glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(m)
# ...
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 0.3532 0.3304 1.069 0.285
# x 1.3686 0.2372 5.770 7.95e-09 ***
# ...
#
# Null deviance: 140.420 on 101 degrees of freedom
# Residual deviance: 63.017 on 100 degrees of freedom
# AIC: 67.017
#
# Number of Fisher Scoring iterations: 9
Here we see that we got only the second warning, and yet the algorithm converged. The betas are reasonably close to the true values, the standard errors aren't huge, and the number of Fisher scoring iterations is moderate. Nonetheless, the two extreme x-values yield predicted log odds that are perfectly calculable, but that become numerically 0 and 1 once converted to probabilities (notice below that 2.220446e-16 is exactly double-precision machine epsilon).
predict(m, type="link")[c(1, 102)] # these are the predicted log odds
# 1 102
# -683.9379 684.6444
predict(m, type="response")[c(1, 102)] # these are the predicted probabilities
# 1 102
# 2.220446e-16 1.000000e+00
Best Answer
A solution to this is to use a form of penalized regression. In fact, this is the original reason some of the penalized regression forms were developed (although they turned out to have other interesting properties).
Install and load the glmnet package in R and you're mostly ready to go. One of the less user-friendly aspects of glmnet is that you can only feed it matrices, not formulas as we're used to. However, you can use model.matrix and the like to construct such a matrix from a data.frame and a formula.
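For instance, a minimal sketch (assuming the df from the first code block above; the ridge penalty via alpha = 0 and the cross-validated lambda are just one reasonable default):

library(glmnet)

# glmnet takes a numeric matrix, so build one from the formula
X = model.matrix(y ~ x1 + x2, data = df)[, -1] # drop the intercept column

# alpha = 0 requests a ridge penalty; cv.glmnet picks lambda by cross-validation
cvfit = cv.glmnet(X, df$y, family = "binomial", alpha = 0)
coef(cvfit, s = "lambda.min") # coefficients at the CV-chosen penalty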
Now, when you expect that this perfect separation is not just a byproduct of your sample but could be true in the population, you specifically don't want to 'handle' it: instead, use the separating variable directly as the sole predictor of your outcome, without employing a model of any kind.
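In that case a plain decision rule suffices (a hypothetical sketch; the cutoff would come from substantive knowledge, not from estimation):

# Hypothetical rule: if x separates the classes at a known cutoff,
# predict directly from the cutoff; no model fitting is needed or wanted
predict_y = function(x, cutoff = 0) as.integer(x > cutoff)
predict_y(c(-2, -0.5, 0.3, 4)) # 0 0 1 1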