Logistic Regression – Issue with Complete Separation in R

logisticrseparation

I am trying to fit a logistic regression model for business defaults. Apart from the dichotomous variable default, the data set includes some performance ratios.
When estimating the model in R, the following warning appears:

glm.fit<-glm(Default~ROS+ROI+debt_ratio,data=ratios,family=binomial)

Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred

However, when I'm using the brglm2 package to detect separation, no separation is detected:

glm.det<-glm(Default~ROS+ROI+debt_ratio,data=ratios,family=binomial("logit"),method="detect_separation")

Separation: FALSE 
Existence of maximum likelihood estimates
(Intercept)         ROS         ROI  debt_ratio 
          0           0           0           0 
0: finite value, Inf: infinity, -Inf: -infinity

And when I'm using the logistf function to penalize complete separation, the variables stop being significant:

glm.adj<-logistf(Default~ROS+ROI+debt_ratio,data=ratios,family=binomial)
logistf(formula = Default ~ ROS + ROI + debt_ratio, data = ratios, 
family = binomial)

Model fitted by Penalized ML
Confidence intervals and p-values by Profile Likelihood Profile Likelihood 
Profile Likelihood Profile Likelihood 

                   coef    se(coef)  lower 0.95   upper 0.95    Chisq          p
(Intercept) -2.10087016 0.134332976 -2.36696659 -1.833575742      Inf 
0.00000000
ROS         -0.01252600 0.009949910 -5.07141750 -0.001546061 5.054069 
0.02456818
ROI         -0.02086535 0.004830609 -0.03171022 -0.012274125 0.000000 
1.00000000
debt_ratio  -0.45693051 0.107951265 -0.69899271 -0.265132585 0.000000 
1.00000000

Likelihood ratio test=53.38183 on 3 df, p=1.519984e-11, n=997
Wald test = 24.06556 on 3 df, p = 2.420494e-05

In case it helps here is the complete set of raw data.

Why doesn't the detect_separation method detect any separation?
Is it possible that the variables, even though yielding complete separation, are not at all significant with logistf?
How should I proceed?

Update: Added two boxplots (the first one includes all data, while in the second one two outliers are excluded to make the plot slightly clearer), in order to show the dispersion of the data depending on the Default variable.

Boxplot 1
Boxplot 2

Best Answer

TL;DR: The warning is not occurring because of complete separation.

library("tidyverse")
library("broom")
# semicolon delimited but period for decimal
ratios <- read_delim("data/W0krtTYM.txt", delim=";")
# filter out the ones with missing values to make it easier to see what's going on
ratios.complete <- filter(ratios, !is.na(ROS), !is.na(ROI), !is.na(debt_ratio))
glm0<-glm(Default~ROS+ROI+debt_ratio,data=ratios.complete,family=binomial)
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(glm0)
#> 
#> Call:
#> glm(formula = Default ~ ROS + ROI + debt_ratio, family = binomial, 
#>     data = ratios.complete)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.8773  -0.3133  -0.2868  -0.2355   3.6160  
#> 
#> Coefficients:
#>              Estimate Std. Error z value Pr(>|z|)    
#> (Intercept) -3.759154   0.306226 -12.276  < 2e-16 ***
#> ROS         -0.919294   0.245712  -3.741 0.000183 ***
#> ROI         -0.044447   0.008981  -4.949 7.45e-07 ***
#> debt_ratio   0.868707   0.291368   2.981 0.002869 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 604.89  on 998  degrees of freedom
#> Residual deviance: 372.43  on 995  degrees of freedom
#> AIC: 380.43
#> 
#> Number of Fisher Scoring iterations: 8

When does that warning occur? Looking at the source code for glm.fit() we find

eps <- 10 * .Machine$double.eps
if (family$family == "binomial") {
  if (any(mu > 1 - eps) || any(mu < eps)) 
    warning("glm.fit: fitted probabilities numerically 0 or 1 occurred", 
            call. = FALSE)
}

The warning will arise whenever a predicted probability is effectively indistinguishable from 1. The problem is on the top end:

glm0.resids <- augment(glm0) %>%
  mutate(p = 1 / (1 + exp(-.fitted)),
         warning = p > 1-eps)

arrange(glm0.resids, desc(.fitted)) %>%  
  select(2:5, p, warning) %>% 
  slice(1:10)
#> # A tibble: 10 x 6
#>         ROS      ROI debt_ratio .fitted     p warning
#>       <dbl>    <dbl>      <dbl>   <dbl> <dbl> <lgl>  
#>  1 - 25.0   -10071     452       860    1.00  T      
#>  2 -292     -   17.9     0.0896  266    1.00  T      
#>  3 - 96.0   -  176       0.0219   92.3  1.00  T      
#>  4 - 25.4   -  548       6.43     49.5  1.00  T      
#>  5 -  1.80  -  238      21.2      26.9  1.000 F      
#>  6 -  5.65  -  344      11.3      26.6  1.000 F      
#>  7 -  0.597 -  345       4.43     16.0  1.000 F      
#>  8 -  2.62  -  359       0.444    15.0  1.000 F      
#>  9 -  0.470 -  193       9.87     13.8  1.000 F      
#> 10 -  2.46  -  176       3.64      9.50 1.000 F

So there are four observations that are causing the issue. They all have extreme values of one or more covariates. But there are lots of other observations that are similarly close to 1. There are some observations with high leverage -- what do they look like?

arrange(glm0.resids, desc(.hat)) %>%  
  select(2:4, .hat, p, warning) %>% 
  slice(1:10)
#> # A tibble: 10 x 6
#>        ROS     ROI debt_ratio   .hat     p warning
#>      <dbl>   <dbl>      <dbl>  <dbl> <dbl> <lgl>  
#>  1  0.995  - 2.46      4.96   0.358  0.437 F      
#>  2 -3.01   - 0.633     1.36   0.138  0.555 F      
#>  3 -3.08   -14.6       0.0686 0.136  0.444 F      
#>  4 -2.64   - 0.113     1.90   0.126  0.579 F      
#>  5 -2.95   -13.9       0.773  0.112  0.561 F      
#>  6 -0.0132 -14.9       3.12   0.0936 0.407 F      
#>  7 -2.60   -10.9       0.856  0.0881 0.464 F      
#>  8 -3.41   -26.4       1.12   0.0846 0.821 F      
#>  9 -1.63   - 1.02      2.14   0.0746 0.413 F      
#> 10 -0.146  -17.6       8.02   0.0644 0.984 F

None of those are problematic. Eliminate the four observations that trigger the warning; does the answer change?

ratios2 <- filter(ratios.complete, !glm0.resids$warning)

glm1<-glm(Default~ROS+ROI+debt_ratio,data=ratios2,family=binomial)

summary(glm1)
#> 
#> Call:
#> glm(formula = Default ~ ROS + ROI + debt_ratio, family = binomial, 
#>     data = ratios2)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.8773  -0.3133  -0.2872  -0.2363   3.6160  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept) -3.75915    0.30621 -12.277  < 2e-16 ***
#> ROS         -0.91929    0.24571  -3.741 0.000183 ***
#> ROI         -0.04445    0.00898  -4.949 7.45e-07 ***
#> debt_ratio   0.86871    0.29135   2.982 0.002867 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 585.47  on 994  degrees of freedom
#> Residual deviance: 372.43  on 991  degrees of freedom
#> AIC: 380.43
#> 
#> Number of Fisher Scoring iterations: 6

tidy(glm1)[,2] - tidy(glm0)[,2] 
#> [1]  2.058958e-08  4.158585e-09 -1.119948e-11 -2.013056e-08

None of the coefficients changed by more than 10^-8! So essentially unchanged results. I'll go out on a limb here, but I think that's a "false positive" warning, nothing to worry about. This warning arises with complete separation, but in that case I would expect to see the coefficient for one or more covariates get very large, with a standard error that is even larger. That's not occurring here, and from your plots you can see that the defaults occur across overlapping ranges of all covariates. So the warning occurs because a few observations have very extreme values of the covariates. That could be a problem if those observations were also highly influential. But they're not.

In the comments you asked "Why does standardization blow up the standard errors?". Standardizing your covariates changes the scale. The coefficients and standard errors refer to a one unit change in the covariate, always. So if the variance of your covariate is larger than 1, then standardizing is going to shrink the scale. A one unit change on the standardized scale is the same as a much larger change on the unstandardized scale. So the coefficients and standard errors will get larger. Look at the z values -- they should not change even if you standardize. The z value of the intercept changes if you also center the covariates, because now it is estimating a different point (at the mean of the covariates, instead of at 0)

ratios.complete2 <- mutate(ratios.complete,
                           scROS =  (ROS - mean(ROS))/sd(ROS),
                           scROI =  (ROI - mean(ROI))/sd(ROI),
                           scdebt_ratio = (debt_ratio - mean(debt_ratio))/sd(debt_ratio))
glm2<-glm(Default~scROS+scROI+scdebt_ratio,data=ratios.complete2,family=binomial)
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# compare z values
tidy(glm2)[,4] - tidy(glm0)[,4]
#> [1]  4.203563e+00  8.881784e-16  1.776357e-15 -6.217249e-15

Created on 2018-03-25 by the reprex package (v0.2.0).

good news

you can probably more or less disregard the warnings. Where stepwise regression is recommended at all (see below ...), backward regression is probably better than forward regression anyway.

you can do forward and backward stepwise regression with MASS::stepAIC() (instead of step).

bad news

step probably isn't doing what you think it's doing anyway. Rather than refitting the negative binomial dispersion parameter, it's re-fitting with a fixed overdispersion parameter, which is probably not what you want (there's a classically snarky e-mail from Prof. Brian Ripley from 2006 here that discusses this issue in passing). As mentioned above, stepAIC() works better.
if you are only interested in predictive accuracy, and not in anything about confidence intervals or hypothesis tests or measuring variable importance ... then stepwise regression might be OK (Murtaugh 2009) ...
but if you care at all about being able to make any inferences about the effects of the parameters, you have too many variables and not enough data. A rule of thumb is that (1) you need at least 10 times as many data points as predictor variables to do reliable inference and (2) doing any inference after selecting variables (via stepwise selection or otherwise) is very wrong [unless you do super-cutting-edge stuff that only works with huge data sets and very strong assumptions].

The big question here is: why do you want to do variable selection in the first place?

you're only interested in prediction: OK, but something like penalized regression (Dahlgren 2010) will probably work better
you're interested in inference: this is going to be tough; you almost certainly don't have enough data to tell the effects of correlated variables apart. In your situation I would probably compute the principal components (PCA) of the predictor variables and use only the first 5 (which fall within the $n/10$ rule, and explain 99.5% of the variance in the predictors ...)

Murtaugh, Paul A. “Performance of Several Variable-Selection Methods Applied to Real Ecological Data.” Ecology Letters 12, no. 10 (October 2009): 1061–68. https://doi.org/10.1111/j.1461-0248.2009.01361.x.

Dahlgren, Johan P. “Alternative Regression Methods Are Not Considered in Murtaugh (2009) or by Ecologists in General.” Ecology Letters 13, no. 5 (May 1, 2010): E7–9. https://doi.org/10.1111/j.1461-0248.2010.01460.x.

Proportion Data Analysis – Choosing Between GLM or Beta Regression

The data you have are really a classic binomial setting and you can use binomial GLMs to model the data, either using the standard maximum likelihood (ML) estimator or using a bias-reduced (BR) estimator (Firth's penalized estimator). I wouldn't use beta regression here.

For treatment b you have 49 successes and 0 failures.

subset(dataset, treatment == "b")
##    treatment success fail  n
## 6          b      10    0 10
## 7          b      10    0 10
## 8          b       9    0  9
## 9          b      10    0 10
## 10         b      10    0 10

Hence there is quasi-complete separation leading to an infinite ML estimator while the BR estimator is guaranteed to be finite. Note that the BR estimator provides a principled solution for the separation while the trimming of the proportion data is rather ad hoc.

For fitting the binomial GLM with ML and BR you can use the glm() function and combine it with the brglm2 package (Kosmidis & Firth, 2020, Biometrika, doi:10.1093/biomet/asaa052). An easier way to specify the binomial response is to use a matrix where the first column contains the successes and the second column the failures:

ml <- glm(cbind(success, fail) ~ treatment, data = dataset, family = binomial)
summary(ml)
## Call:
## glm(formula = cbind(success, fail) ~ treatment, family = binomial, 
##     data = dataset)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.81457  -0.33811   0.00012   0.54942   1.86737  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)    0.6325     0.3001   2.108   0.0351 *
## treatmentb    20.4676  3308.1100   0.006   0.9951  
## treatmentc     1.0257     0.4888   2.099   0.0359 *
## treatmentd     0.3119     0.4351   0.717   0.4734  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 43.36  on 19  degrees of freedom
## Residual deviance: 13.40  on 16  degrees of freedom
## AIC: 56.022
## 
## Number of Fisher Scoring iterations: 18

The corresponding BR estimator can be obtained by using the "brglmFit" method (provided in brglm2) instead of the default "glm.fit" method:

library("brglm2")
br <- glm(cbind(success, fail) ~ treatment, data = dataset, family = binomial,
  method = "brglmFit")
summary(br)
## Call:
## glm(formula = cbind(success, fail) ~ treatment, family = binomial, 
##     data = dataset, method = "brglmFit")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7498  -0.2890   0.4368   0.6030   1.9096  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   0.6190     0.2995   2.067  0.03875 * 
## treatmentb    3.9761     1.4667   2.711  0.00671 **
## treatmentc    0.9904     0.4834   2.049  0.04049 * 
## treatmentd    0.3041     0.4336   0.701  0.48304   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 43.362  on 19  degrees of freedom
## Residual deviance: 14.407  on 16  degrees of freedom
## AIC:  57.029
## 
## Type of estimator: AS_mixed (mixed bias-reducing adjusted score equations)
## Number of Fisher Scoring iterations: 1

Note that the coefficient estimates, standard errors, and t statistics for the non-separated treatments only change very slightly. The main difference is the finite estimate for treatment b.

Because the BR adjustment is so slight, you can still employ the usual inference (Wald, likelihood ratio, etc.) and information criteria etc. In R with brglm2 this is particularly easy because the br model fitted above still inherits from glm. As an illustration we can assess the overall significance of the treatment factor:

br0 <- update(br, . ~ 1)
AIC(br0, br)
##     df      AIC
## br0  1 79.98449
## br   4 57.02927
library("lmtest")
lrtest(br0, br)
## Likelihood ratio test
## 
## Model 1: cbind(success, fail) ~ 1
## Model 2: cbind(success, fail) ~ treatment
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   1 -38.992                         
## 2   4 -24.515  3 28.955  2.288e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Or we can look at all pairwise contrasts (aka Tukey contrasts) of the four treatments:

library("multomp")
summary(glht(br, linfct = mcp(treatment = "Tukey")))
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## Fit: glm(formula = cbind(success, fail) ~ treatment, family = ## binomial, 
##     data = dataset, method = "brglmFit")
## 
## Linear Hypotheses:
##            Estimate Std. Error z value Pr(>|z|)  
## b - a == 0   3.9761     1.4667   2.711   0.0286 *
## c - a == 0   0.9904     0.4834   2.049   0.1513  
## d - a == 0   0.3041     0.4336   0.701   0.8864  
## c - b == 0  -2.9857     1.4851  -2.010   0.1641  
## d - b == 0  -3.6720     1.4696  -2.499   0.0517 .
## d - c == 0  -0.6863     0.4922  -1.394   0.4743  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

Finally, for the standard ML estimator you can also use the quasibinomial family as you did in your post, although in this data set you don't have overdispersion (rather a little bit of underdispersion). However, the brglm2 package does not support this at the moment. But given that there is no evidence for overdispersion I would not be concerned about this here.

Best Answer

Related Solutions

R Regression – Forward and Backward Stepwise Regression (AIC) for Negative Binomial Regression

good news

bad news

Proportion Data Analysis – Choosing Between GLM or Beta Regression

Related Question