The OLS estimator in the linear regression model is quite rare in having the property that it can be represented in closed form, that is, without needing to be expressed as the optimizer of a function. It is, however, still the optimizer of a function -- the residual sum of squares function -- and can be computed as such.
The MLE in the logistic regression model is also the optimizer of a suitably defined log-likelihood function, but since it is not available in a closed-form expression, it must be computed as an optimizer.
Most statistical estimators are expressible only as optimizers of appropriately constructed functions of the data, called criterion functions. Computing them requires suitable numerical optimization algorithms.
Optimizers of functions can be computed in R using the optim() function, which provides some general-purpose optimization algorithms, or with one of the more specialized packages such as optimx. Knowing which optimization algorithm to use for different types of models and statistical criterion functions is key.
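As a minimal illustration of the interface (the toy function and starting values here are purely illustrative and not part of the analysis below), optim() takes a vector of starting values, the criterion function to be minimised and, optionally, the algorithm to use:
# toy criterion function: minimised at (1, -2)
fnToy = function(vX) {
  return((vX[1] - 1)^2 + (vX[2] + 2)^2)
}
# minimise it with BFGS starting from the origin
optimToy = optim(c(0, 0), fnToy, method = 'BFGS')
optimToy$par  # approximately 1 and -2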
Linear regression residual sum of squares
The OLS estimator is defined as the optimizer of the well-known residual sum of
squares function:
$$
\begin{align}
\hat{\boldsymbol{\beta}} &= \arg\min_{\boldsymbol{\beta}}\left(\boldsymbol{Y} - \mathbf{X}\boldsymbol{\beta}\right)'\left(\boldsymbol{Y} - \mathbf{X}\boldsymbol{\beta}\right) \\
&= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{Y}
\end{align}
$$
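Before turning to numerical optimization, note that this closed-form expression can be evaluated directly. A minimal sketch, assuming the design matrix mX and response vY constructed in the code below:
# closed-form OLS estimate (X'X)^{-1} X'Y, computed by solving the
# normal equations rather than forming an explicit matrix inverse
vBetaClosedForm = solve(crossprod(mX), crossprod(mX, vY))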
In the case of a twice-differentiable, convex function like the residual sum of squares, most gradient-based optimizers do a good job. Here I use the BFGS algorithm.
#================================================
# reading in the data & pre-processing
#================================================
urlSheatherData = "http://www.stat.tamu.edu/~sheather/book/docs/datasets/MichelinNY.csv"
dfSheather = as.data.frame(read.csv(urlSheatherData, header = TRUE))
# create the design matrices
vY = as.matrix(dfSheather['InMichelin'])
mX = as.matrix(dfSheather[c('Service','Decor', 'Food', 'Price')])
# add an intercept to the predictor variables
mX = cbind(1, mX)
# the number of variables and observations
iK = ncol(mX)
iN = nrow(mX)
#================================================
# compute the linear regression parameters as
# an optimal value
#================================================
# the residual sum of squares criterion function
fnRSS = function(vBeta, vY, mX) {
return(sum((vY - mX %*% vBeta)^2))
}
# arbitrary starting values
vBeta0 = rep(0, ncol(mX))
# minimise the RSS function to get the parameter estimates
optimLinReg = optim(vBeta0, fnRSS,
mX = mX, vY = vY, method = 'BFGS',
hessian=TRUE)
#================================================
# compare to the LM function
#================================================
linregSheather = lm(InMichelin ~ Service + Decor + Food + Price,
data = dfSheather)
This yields:
> print(cbind(coef(linregSheather), optimLinReg$par))
[,1] [,2]
(Intercept) -1.492092490 -1.492093965
Service -0.011176619 -0.011176583
Decor 0.044193000 0.044193023
Food 0.057733737 0.057733770
Price 0.001797941 0.001797934
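Since BFGS is a gradient-based algorithm, convergence can typically be made faster and more reliable by supplying the analytic gradient of the criterion function, here $-2\mathbf{X}'(\boldsymbol{Y} - \mathbf{X}\boldsymbol{\beta})$, through the gr argument of optim(). A sketch reusing the objects defined above (this step is optional and was not needed for the comparison):
# analytic gradient of the residual sum of squares criterion function
fnRSSGradient = function(vBeta, vY, mX) {
  return(as.vector(-2 * t(mX) %*% (vY - mX %*% vBeta)))
}
# the same minimisation as before, now with the gradient supplied
optimLinRegGrad = optim(vBeta0, fnRSS, gr = fnRSSGradient,
                        mX = mX, vY = vY, method = 'BFGS',
                        hessian = TRUE)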
Logistic regression log-likelihood
The criterion function corresponding to the MLE in the logistic regression model is
the log-likelihood function.
$$
\begin{align}
\log L_n(\boldsymbol{\beta}) &= \sum_{i=1}^n \left(Y_i \log \Lambda(\boldsymbol{X}_i'\boldsymbol{\beta}) +
(1-Y_i)\log(1 - \Lambda(\boldsymbol{X}_i'\boldsymbol{\beta}))\right)
\end{align}
$$
where $\Lambda(k) = 1/(1+ \exp(-k))$ is the logistic function. The parameter estimates are the optimizers of this function:
$$
\hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}}\log L_n(\boldsymbol{\beta})
$$
Below I show how to construct and optimize this criterion function using the optim() function, once again employing the BFGS algorithm.
#================================================
# compute the logistic regression parameters as
# an optimal value
#================================================
# define the logistic transformation
logit = function(mX, vBeta) {
return(exp(mX %*% vBeta)/(1+ exp(mX %*% vBeta)) )
}
# stable parametrisation of the log-likelihood function
# Note: The negative of the log-likelihood is being returned, since we will be
# /minimising/ the function.
logLikelihoodLogitStable = function(vBeta, mX, vY) {
return(-sum(
vY*(mX %*% vBeta - log(1+exp(mX %*% vBeta)))
+ (1-vY)*(-log(1 + exp(mX %*% vBeta)))
)
)
}
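# An alternative, purely illustrative parametrisation using plogis(),
# which evaluates log(Lambda(x)) = -log(1 + exp(-x)) directly on the log
# scale and is less prone to overflow in exp() for large linear predictors.
logLikelihoodLogitPlogis = function(vBeta, mX, vY) {
  vEta = mX %*% vBeta
  return(-sum(vY * plogis(vEta, log.p = TRUE)
              + (1 - vY) * plogis(-vEta, log.p = TRUE)))
}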
# initial set of parameters
vBeta0 = c(10, -0.1, -0.3, 0.001, 0.01) # arbitrary starting parameters
# minimise the (negative) log-likelihood to get the logit fit
optimLogit = optim(vBeta0, logLikelihoodLogitStable,
mX = mX, vY = vY, method = 'BFGS',
hessian=TRUE)
#================================================
# test against the implementation in R
# NOTE glm uses IRWLS:
# http://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares
# rather than the BFGS algorithm that we have used above
#================================================
logitSheather = glm(InMichelin ~ Service + Decor + Food + Price,
data = dfSheather,
family = binomial, x = TRUE)
This yields:
> print(cbind(coef(logitSheather), optimLogit$par))
[,1] [,2]
(Intercept) -11.19745057 -11.19661798
Service -0.19242411 -0.19249119
Decor 0.09997273 0.09992445
Food 0.40484706 0.40483753
Price 0.09171953 0.09175369
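Since the negative log-likelihood was minimised with hessian=TRUE, approximate standard errors can also be recovered from the inverse of the numerically evaluated Hessian and compared against those reported by glm(). A sketch of that check, not part of the comparison above:
# approximate standard errors of the MLE from the inverse Hessian of the
# negative log-likelihood, alongside the standard errors reported by glm()
vSEOptim = sqrt(diag(solve(optimLogit$hessian)))
print(cbind(summary(logitSheather)$coefficients[, "Std. Error"], vSEOptim))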
As a caveat, note that numerical optimization algorithms require careful use, or you can end up with all sorts of pathological solutions. Until you understand them well, it is best to use the available packaged options, which allow you to concentrate on specifying the model rather than worrying about how to compute the estimates numerically.
Best Answer
This is not a good application for binning. To fit an underlying smooth relationship that is steep in places adequately, binning requires a large number of bins, resulting in a losing battle in the bias-variance trade-off because of the high variance. For continuous variables you can use fewer parameters, and still get a better fit, with things like restricted cubic splines and other cubic spline bases.
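Purely as an illustration (the data frame and variable names here are hypothetical and not from the original answer), such a spline fit can be obtained in R with the natural cubic spline basis from the splines package; rcs() from the rms package is a common alternative:
library(splines)
# model a smooth, possibly steep relationship between y and x with a
# natural (restricted) cubic spline basis instead of binning x
fitSpline = lm(y ~ ns(x, df = 4), data = dfExample)
summary(fitSpline)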