Solved – Why does logistic regression become unstable when classes are well-separated

Tags: logistic-regression, separation

Why is it that logistic regression becomes unstable when classes are well-separated? What does "well-separated classes" mean? I would really appreciate it if someone could explain with an example.

Best Answer

It isn't correct that logistic regression in itself becomes unstable when there is separation. Separation means that there are some variables which are very good predictors, which is good, or it may be an artifact of too few observations/too many variables. If that is the case, the solution might be to get more data. Separation itself, then, is only a symptom, and not a problem in itself.

So there are really different cases to be treated. First, what is the goal of the analysis? If the final result of the analysis is some classification of cases, separation is no problem at all; it really means that there are very good variables giving very good classification. But if the goal is risk estimation, we need the parameter estimates, and with separation the usual mle (maximum likelihood) estimates do not exist. So maybe we must change the estimation method. There are several proposals in the literature, which I will come back to.

Then there are (as said above) two different possible causes for separation: there might be separation in the full population, or separation might be caused by too few observed cases/too many variables.
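To make the terminology concrete, here is a minimal sketch in plain Python (my own illustration, not part of the answer's R code) of what separation means with one predictor and an intercept: the classes are separated when some threshold puts every y=0 on one side and every y=1 on the other. The toy datasets are the same ones used with safeBinaryRegression below.

```python
def separation_1d(x, y):
    """Classify a one-predictor dataset as 'complete', 'quasi' or 'none'."""
    x0 = [xi for xi, yi in zip(x, y) if yi == 0]
    x1 = [xi for xi, yi in zip(x, y) if yi == 1]
    # complete separation: the two classes do not even touch
    if max(x0) < min(x1) or max(x1) < min(x0):
        return "complete"
    # quasi-complete separation: they touch only at the boundary
    if max(x0) <= min(x1) or max(x1) <= min(x0):
        return "quasi"
    return "none"

print(separation_1d([-2, -1, 1, 2], [0, 0, 1, 1]))  # complete
print(separation_1d([-2, 0, 0, 2], [0, 0, 1, 1]))   # quasi
print(separation_1d([-2, 1, -1, 2], [0, 0, 1, 1]))  # none
```

With more than one predictor the same idea applies to a separating hyperplane rather than a threshold, which is why a real check needs linear programming.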

What breaks down with separation is the maximum likelihood estimation procedure: the mle parameter estimates (or at least some of them) become infinite. I said in the first version of this answer that this could be solved easily, maybe with bootstrapping, but that does not work, since there will be separation in each bootstrap resample, at least with the usual case-resampling bootstrap. Logistic regression is still a valid model, but we need some other estimation procedure. Some proposals have been:

  1. regularization, like ridge or lasso, maybe combined with the bootstrap.
  2. exact conditional logistic regression
  3. permutation tests, see https://www.ncbi.nlm.nih.gov/pubmed/15515134
  4. Firth's bias-reduced estimation procedure, see Logistic regression model does not converge
  5. surely others ...
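To see numerically why the mle escapes to infinity, here is a short Python sketch (my own, hypothetical toy code) using the completely separated dataset from the example below: the log-likelihood keeps improving as the slope b grows, so no finite maximizer exists, while a ridge penalty (proposal 1 above) restores one.

```python
import math

x = [-2, -1, 1, 2]
y = [0, 0, 1, 1]          # completely separated: y = 1 exactly when x > 0

def loglik(b):
    """Bernoulli log-likelihood of y ~ logistic(b*x), intercept fixed at 0."""
    total = 0.0
    for xi, yi in zip(x, y):
        z = b * xi
        # log P(y_i) written stably: log sigma(z) = -log(1 + exp(-z))
        total += -math.log1p(math.exp(-z if yi == 1 else z))
    return total

for b in (1, 5, 10, 20):
    print(b, loglik(b))    # strictly increasing toward 0: no finite maximizer

# Ridge-penalized objective loglik(b) - lam*b^2, maximized by a crude grid search:
lam = 0.1
grid = [i / 100 for i in range(1, 1001)]
best = max(grid, key=lambda b: loglik(b) - lam * b * b)
print("penalized maximizer near b =", best)   # finite, well inside the grid
```

The likelihood approaches but never attains its supremum of 0, which is exactly the non-existence of the mle; the penalized objective trades a little likelihood for a finite coefficient.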

If you use R, there is a package on CRAN, safeBinaryRegression, which helps with diagnosing problems with separation, using mathematical optimization methods to check for sure whether there is separation or quasi-separation. In the following I will give a simulated example using this package, and the elrm package for approximate conditional logistic regression.

First, a simple example with the safeBinaryRegression package. This package redefines the glm function, overloading it with a test for separation based on linear programming methods. If it detects separation, it exits with an error condition, declaring that the mle does not exist; otherwise it just runs the ordinary glm function from stats. The example is from its help pages:

library(safeBinaryRegression)   # some testing of that package,
                                # based on its examples
# complete separation:
x  <-  c(-2, -1, 1, 2)
y  <-  c(0, 0, 1, 1)
glm(y ~ x, family=binomial)
glm(y ~ x,  family=binomial,  separation="test")
stats::glm(y ~ x, family=binomial)
# Quasicomplete separation:
x  <-  c(-2, 0, 0, 2)
y  <-  c(0, 0, 1, 1)
glm(y ~ x, family=binomial)
glm(y ~ x,  family=binomial,  separation="test")
stats::glm(y ~ x, family=binomial)

The output from running it:

> # complete separation:
> x  <-  c(-2, -1, 1, 2)
> y  <-  c(0, 0, 1, 1)
> glm(y ~ x, family=binomial)
Error in glm(y ~ x, family = binomial) : 
  The following terms are causing separation among the sample points: (Intercept), x
> glm(y ~ x,  family=binomial,  separation="test")
Error in glm(y ~ x, family = binomial, separation = "test") : 
  Separation exists among the sample points.
    This model cannot be fit by maximum likelihood.
> stats::glm(y ~ x, family=binomial)

Call:  stats::glm(formula = y ~ x, family = binomial)

Coefficients:
(Intercept)            x  
 -9.031e-08    2.314e+01  

Degrees of Freedom: 3 Total (i.e. Null);  2 Residual
Null Deviance:      5.545 
Residual Deviance: 3.567e-10    AIC: 4
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred 
> # Quasicomplete separation:
> x  <-  c(-2, 0, 0, 2)
> y  <-  c(0, 0, 1, 1)
> glm(y ~ x, family=binomial)
Error in glm(y ~ x, family = binomial) : 
  The following terms are causing separation among the sample points: x
> glm(y ~ x,  family=binomial,  separation="test")
Error in glm(y ~ x, family = binomial, separation = "test") : 
  Separation exists among the sample points.
    This model cannot be fit by maximum likelihood.
> stats::glm(y ~ x, family=binomial)

Call:  stats::glm(formula = y ~ x, family = binomial)

Coefficients:
(Intercept)            x  
  5.009e-17    9.783e+00  

Degrees of Freedom: 3 Total (i.e. Null);  2 Residual
Null Deviance:      5.545 
Residual Deviance: 2.773    AIC: 6.773
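The linear-programming idea behind that separation test can be sketched in a few lines. This is my own reconstruction in Python with scipy, not safeBinaryRegression's actual code: look for coefficients beta with (2*y_i - 1) * x_i'beta >= s_i, where each slack s_i lies in [0, 1], and maximize sum(s_i). An optimum of n means complete separation, an optimum strictly between 0 and n means quasi-complete separation, and 0 means no separation.

```python
import numpy as np
from scipy.optimize import linprog

def separation_lp(x, y):
    """Detect separation for one predictor plus intercept via an LP."""
    X = np.column_stack([np.ones(len(x)), x])
    sign = 2 * np.asarray(y) - 1                # +1 for y=1, -1 for y=0
    n, p = X.shape
    # variables: beta (p, free) followed by s (n, each in [0, 1])
    c = np.concatenate([np.zeros(p), -np.ones(n)])      # maximize sum(s)
    A_ub = np.hstack([-sign[:, None] * X, np.eye(n)])   # s_i <= sign_i * x_i'beta
    b_ub = np.zeros(n)
    bounds = [(None, None)] * p + [(0, 1)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    opt = -res.fun
    if opt > n - 1e-7:
        return "complete"
    if opt > 1e-7:
        return "quasi"
    return "none"

print(separation_lp([-2, -1, 1, 2], [0, 0, 1, 1]))  # complete
print(separation_lp([-2, 0, 0, 2], [0, 0, 1, 1]))   # quasi
print(separation_lp([-2, 1, -1, 2], [0, 0, 1, 1]))  # none
```

On the two datasets above this reproduces the package's diagnoses: complete separation for the first, quasi-complete for the second.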

Now we simulate from a model which can be closely approximated by a logistic model, except that above a certain cutoff the event probability is exactly 1.0. Think about a bioassay problem, but above the cutoff the poison always kills:

pl  <-  function(a, b, x) 1/(1+exp(-a-b*x))
a  <-  0
b  <-  1.5
x_cutoff  <-  uniroot(function(x) pl(0, 1.5, x) - 0.98, lower=1, upper=3.5)$root
### circa 2.6
pltrue  <-  function(a, b, x) ifelse(x < x_cutoff, pl(a, b, x), 1.0)

x <- -3:3

### Let us simulate many times from this model, and try to estimate it
### with safeBinaryRegression::glm. That way we can estimate the
### probability of separation from this model.

set.seed(31415926)  ### May I have a large container of coffee
replications  <-  1000
p  <-  pltrue(a, b, x)
err  <-  0
good  <- 0

for (i in 1:replications) {
    y  <-  rbinom(length(x), 1, p)
    res  <-  try(glm(y ~ x, family=binomial), silent=TRUE)
    if (inherits(res, "try-error")) err  <-  err + 1 else good  <-  good + 1
}
P_separation  <-  err/replications
P_separation

When running this code, we estimate the probability of separation as 0.759. Run the code yourself, it is fast!
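As a cross-check, that probability can also be computed exactly. The following Python sketch (my own reformulation, not part of the original R code) uses the fact that with a single predictor taking distinct values, separation occurs exactly when the simulated responses, read in increasing order of x, form a run of 0s followed by a run of 1s; 2.6 stands in for the approximate uniroot cutoff found above, which lies between x = 2 and x = 3.

```python
import math
import random

x = list(range(-3, 4))
x_cutoff = 2.6                     # approximate root found above
p = [1.0 if xi >= x_cutoff else 1.0 / (1.0 + math.exp(-1.5 * xi)) for xi in x]

# Exact: sum over the position k of the first 1 (y is 0 before k, 1 from k on).
# Since p = 1 at x = 3, the all-zero and decreasing patterns are impossible.
P_exact = 0.0
for k in range(len(x)):
    prob = 1.0
    for i, pi in enumerate(p):
        prob *= pi if i >= k else 1.0 - pi
    P_exact += prob
print("exact P(separation) =", round(P_exact, 4))

# Monte Carlo confirmation of the same quantity
random.seed(1)
reps = 100_000
hits = 0
for _ in range(reps):
    y = [1 if random.random() < pi else 0 for pi in p]
    if all(y[i] <= y[i + 1] for i in range(len(y) - 1)):   # pattern 0...01...1
        hits += 1
print("simulated P(separation) =", hits / reps)
```

The exact probability comes out at about 0.74, consistent with the 0.759 Monte Carlo estimate above given its simulation error at 1000 replications.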

Then we extend this code to try different estimation procedures: mle and approximate conditional logistic regression from elrm. Running this simulation takes around 40 minutes on my computer.

library(elrm)  # from CRAN
set.seed(31415926)  ### May I have a large container of coffee
replications  <-  1000
GOOD  <-  numeric(length=replications)  ### will be set to one when the MLE exists
COEFS  <-  matrix(as.numeric(NA), replications, 2)
COEFS.elrm  <-  matrix(as.numeric(NA), replications, 2)  # but we will only use
                                                         # the second column, for x
p  <-  pltrue(a, b, x)
err   <-  0
good  <-  0

for (i in 1:replications) {
    y  <-  rbinom(length(x), 1, p)
    res  <-  try(glm(y ~ x, family=binomial), silent=TRUE)
    if (inherits(res, "try-error")) err  <-  err + 1 else {
        good  <-  good + 1
        GOOD[i]  <-  1
    }
    # Using stats::glm:
    mod  <-  stats::glm(y ~ x, family=binomial)
    COEFS[i, ]  <-  coef(mod)
    # Using elrm:
    DATASET  <-  data.frame(x=x, y=y, n=1)
    mod.elrm  <-  elrm(y/n ~ x, interest= ~ x - 1, r=4,
                       iter=10000, burnIn=1000, dataset=DATASET)
    COEFS.elrm[i, 2]  <-  mod.elrm$coeffs
}
### Now we can compare the coefficient estimates of x,
### when there is separation, and when not:

non  <-  which(GOOD==1)
cof.mle.non  <-  COEFS[non, 2, drop=TRUE]
cof.mle.sep  <-  COEFS[-non, 2, drop=TRUE]
cof.elrm.non  <-  COEFS.elrm[non, 2, drop=TRUE]
cof.elrm.sep  <-  COEFS.elrm[-non, 2, drop=TRUE]

Now we want to plot the results, but before that, note that ALL the conditional estimates are equal! That is really strange and needs an explanation ... The common value is 0.9523975. But at least we obtained finite estimates, with confidence intervals which contain the true value (not shown here). So I will only show a histogram of the mle estimates in the cases without separation:

hist(cof.mle.non, prob=TRUE)

[histogram of simulated parameter estimates]

What is remarkable is that all the estimates are less than the true value 1.5. That may have to do with the fact that we simulated from a modified model, and needs investigation.
