statsmodels logistic regression with binned variables has large coefficients and standard errors for some variables

binning, convergence, logistic, python, regression-coefficients

I'm fitting a binary logistic regression using Python's statsmodels, and here is a snippet of the model summary:

[Image: snippet of the statsmodels model summary; not reproduced here]

The large coefficients occur on only two variables, and they seem to be caused by the optimizer not converging, even though I set the maximum number of iterations to 500:

Warning: Maximum number of iterations has been exceeded.
Current function value: 0.094121
Iterations: 500

What is the reason behind this, and what are some possible ways to fix it?

As extra information, I have already:

  1. dropped one of the levels from the binning
  2. added a constant to the design matrix

Any help is appreciated! And please let me know what other information might be useful to identify the problem.

Best Answer

This is not a good application for binning. To adequately fit an underlying smooth relationship that is steep in places, binning requires a large number of bins. Each bin adds a parameter, so the variance becomes high and you end up losing the bias-variance trade-off. For continuous variables, you can use fewer parameters and still get a better fit with restricted cubic splines or other cubic spline bases.
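
As a rough illustration of the spline approach, the sketch below fits a logit model with a cubic B-spline basis through the statsmodels formula interface, using patsy's `bs()` transform. This is one cubic spline basis, not necessarily the restricted cubic spline the answer has in mind. The simulated data, the column names `x`, `y`, and `bin`, the choice of `df=4`, and the ten-bin comparison are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in data (hypothetical): a smooth relationship, steep near x = 0.
rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(-3, 3, size=n)
p = 1.0 / (1.0 + np.exp(-2.0 * np.tanh(2.0 * x)))
df = pd.DataFrame({"x": x, "y": rng.binomial(1, p)})

# Spline model: a cubic B-spline basis with 4 columns, plus the intercept.
spline_fit = smf.logit("y ~ bs(x, df=4, degree=3)", data=df).fit(disp=False)

# Binned model for comparison: 10 quantile bins give 9 dummy columns.
df["bin"] = pd.qcut(df["x"], q=10, labels=False)
binned_fit = smf.logit("y ~ C(bin)", data=df).fit(disp=False)

print("spline parameters (excl. intercept):", spline_fit.df_model)
print("binned parameters (excl. intercept):", binned_fit.df_model)
print("spline AIC:", spline_fit.aic, "binned AIC:", binned_fit.aic)

# Predict on new values; these must lie inside the range of the training x,
# because the spline basis is only defined between its boundary knots.
new = pd.DataFrame({"x": np.linspace(-2.5, 2.5, 5)})
pred = spline_fit.predict(new)
print(pred)
```

The spline model describes the whole curve with four slope parameters, while the binned model needs one parameter per bin beyond the reference level and still produces a step function.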
