Solved – Logistic regression model does not converge

logisticrseparation

I've got some data about airline flights (in a data frame called flights) and I would like to see if the flight time has any effect on the probability of a significantly delayed arrival (meaning 10 or more minutes). I figured I'd use logistic regression, with the flight time as the predictor and whether or not each flight was significantly delayed (a bunch of Bernoullis) as the response. I used the following code…

flights$BigDelay <- flights$ArrDelay >= 10
delay.model <- glm(BigDelay ~ ArrDelay, data=flights, family=binomial(link="logit"))
summary(delay.model)

…but got the following output.

> flights$BigDelay <- flights$ArrDelay >= 10
> delay.model <- glm(BigDelay ~ ArrDelay, data=flights, family=binomial(link="logit"))
Warning messages:
1: In glm.fit(x = X, y = Y, weights = weights, start = start, etastart = etastart,  :
  algorithm did not converge
2: In glm.fit(x = X, y = Y, weights = weights, start = start, etastart = etastart,  :
  fitted probabilities numerically 0 or 1 occurred
> summary(delay.model)

Call:
glm(formula = BigDelay ~ ArrDelay, family = binomial(link = "logit"),
    data = flights)

Deviance Residuals:
       Min          1Q      Median          3Q         Max
-3.843e-04  -2.107e-08  -2.107e-08   2.107e-08   3.814e-04

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -312.14     170.26  -1.833   0.0668 .
ArrDelay       32.86      17.92   1.833   0.0668 .
---
Signif. codes:  0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2.8375e+06  on 2291292  degrees of freedom
Residual deviance: 9.1675e-03  on 2291291  degrees of freedom
AIC: 4.0092

Number of Fisher Scoring iterations: 25

What does it mean that the algorithm did not converge? I thought it be because the BigDelay values were TRUE and FALSE instead of 0 and 1, but I got the same error after I converted everything. Any ideas?

Best Answer

glm() uses an iterative re-weighted least squares algorithm. The algorithm hit the maximum number of allowed iterations before signalling convergence. The default, documented in ?glm.control is 25. You pass control parameters as a list in the glm call:

delay.model <- glm(BigDelay ~ ArrDelay, data=flights, family=binomial,
                   control = list(maxit = 50))

As @Conjugate Prior says, you seem to be predicting the response with the data used to generate it. You have complete separation as any ArrDelay < 10 will predict FALSE and any ArrDelay >= 10 will predict TRUE. The other warning message tells you that the fitted probabilities for some observations were effectively 0 or 1 and that is a good indicator you have something wrong with the model.

The two warnings can go hand in hand. The likelihood function can be quite flat when some $\hat{\beta}_i$ get large, as in your example. If you allow more iterations, the model coefficients will diverge further if you have a separation issue.

Related Solutions

Solved – I think the logistic model is overfitted even with Lasso? R gives me a perfect separation warning message

Getting fitted values that are 0 or 1 is not itself a problem, nor it is necessarily a sign of over-fitting. Other things being equal, getting fitted probabilities near to 0 or 1 is good rather than bad, suggesting that the predictor variables are correlated with the response. The lack of convergence is the same -- it is just a consequence of the 0 or 1 fitted values.

Neither warning is an "error message". These warnings do however alert you that you will not be able to use the coefficient standard errors and p-values that are produced by the summary table:

summary(model)

Looking at the summary table you give, it is evident that most of the coefficients and standard errors in your table are actually infinite. It is well known that logistic regression does not yield usable z-statistics in this situation. You need to use likelihood ratio tests (LRTs) instead. The variables may well be highly significant by LRT even if the Wald p-values in the summary table were all near 1. See p-value from a binomial model glm for a binomial predictor for an example of this.

To see LRTs for your fit, you could try

anova(model, test="Chi")

but beware that the p-values you see are order dependent. This is a "sequential analysis of deviance table". Each variable is added to the model one at a time, in the same order you included them in the model formula. Each variable is adjusted for the variables above it in the table, so each p-value tests whether that variable adds something useful over the variables already in the model. If you change the order of the variables, then the p-values will change as well.

It is also evident that you do have over-fitting in the sense that you are including too many predictor variables that are collinear with one another and therefore mutually redundant. You cannot possibly interpret the logistic regression with 25 variables, and it is likely to be pretty useless for prediction as well.

Rather than examining individual p-values, you need to test the overall significance of the regression model. You can do this by comparing the full and null models:

model <- glm(Y~., family=binomial(link='logit'), data=...)
null.model <- glm(Y~1, family=binomial(link='logit'), data=...)
anova(null.model, model, test="Chi")

If the overall model is not significant, then there is nothing to be done. In that case, trying to do any model selection would be purposeless.

If the overall model is significant, then you have the problem of which variables to keep. I don't agree that the LASSO is useful here, because it has treated all the columns of the design matrix as continuous covariates. It has not taken into account the fact that columns are grouped by factor.

There are lots of ways to proceed, but I would be tempted to just try logistic regression with one factor or variable at a time, and seeing if any of the individual variables give you good prediction.

It might also be useful to examine collinearity of your variables. For example

 table(race, college)

would tell you if you have representatives of all races at all college levels. If race and college are highly correlated, then they might be mutually redundant in your model. Same for other variables such as gender and sexor.

A bit of common sense might be required, instead of use of automatic procedures.

Solved – Forward and backward stepwise regression (AIC) for negative binomial regression (with real data)

I have good news and bad news.

good news

you can probably more or less disregard the warnings. Where stepwise regression is recommended at all (see below ...), backward regression is probably better than forward regression anyway.

you can do forward and backward stepwise regression with MASS::stepAIC() (instead of step).

bad news

step probably isn't doing what you think it's doing anyway. Rather than refitting the negative binomial dispersion parameter, it's re-fitting with a fixed overdispersion parameter, which is probably not what you want (there's a classically snarky e-mail from Prof. Brian Ripley from 2006 here that discusses this issue in passing). As mentioned above, stepAIC() works better.
if you are only interested in predictive accuracy, and not in anything about confidence intervals or hypothesis tests or measuring variable importance ... then stepwise regression might be OK (Murtaugh 2009) ...
but if you care at all about being able to make any inferences about the effects of the parameters, you have too many variables and not enough data. A rule of thumb is that (1) you need at least 10 times as many data points as predictor variables to do reliable inference and (2) doing any inference after selecting variables (via stepwise selection or otherwise) is very wrong [unless you do super-cutting-edge stuff that only works with huge data sets and very strong assumptions].

The big question here is: why do you want to do variable selection in the first place?

you're only interested in prediction: OK, but something like penalized regression (Dahlgren 2010) will probably work better
you're interested in inference: this is going to be tough; you almost certainly don't have enough data to tell the effects of correlated variables apart. In your situation I would probably compute the principal components (PCA) of the predictor variables and use only the first 5 (which fall within the $n/10$ rule, and explain 99.5% of the variance in the predictors ...)

Murtaugh, Paul A. “Performance of Several Variable-Selection Methods Applied to Real Ecological Data.” Ecology Letters 12, no. 10 (October 2009): 1061–68. https://doi.org/10.1111/j.1461-0248.2009.01361.x.

Dahlgren, Johan P. “Alternative Regression Methods Are Not Considered in Murtaugh (2009) or by Ecologists in General.” Ecology Letters 13, no. 5 (May 1, 2010): E7–9. https://doi.org/10.1111/j.1461-0248.2010.01460.x.