Solved – Logistic regression with LBFGS solver


Is there any open source library or code which implements Logistic Regression using L-BFGS solver?

I would prefer Python, but other languages are welcome, too.

Best Answer

Here is an example of logistic regression estimation using the limited-memory BFGS (L-BFGS) optimization algorithm. I will be using the optimx function from the optimx package in R, and SciPy's scipy.optimize.fmin_l_bfgs_b in Python.

Python

The example that I am using is from Sheather (2009, p. 264). The following Python code estimates the logistic regression, first with the plain BFGS algorithm:

# load required libraries
import numpy as np
import scipy as sp
import scipy.optimize  # makes sp.optimize available
import pandas as pd

# hyperlink to data location
urlSheatherData = "http://www.stat.tamu.edu/~sheather/book/docs/datasets/MichelinNY.csv"

# read in the data to a NumPy array
arrSheatherData = np.asarray(pd.read_csv(urlSheatherData))

# slice the data to get the dependent variable
vY = arrSheatherData[:, 0].astype('float64')

# slice the data to get the matrix of predictor variables
mX = np.asarray(arrSheatherData[:, 2:]).astype('float64')

# add an intercept to the predictor variables
intercept = np.ones(mX.shape[0]).reshape(mX.shape[0], 1)
mX = np.concatenate((intercept, mX), axis = 1)

# the number of variables and observations
iK = mX.shape[1]
iN = mX.shape[0]

# logistic (inverse-logit) transformation: maps the linear predictor to a probability
def logit(mX, vBeta):
    return((np.exp(np.dot(mX, vBeta))/(1.0 + np.exp(np.dot(mX, vBeta)))))

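# stable parametrisation of the cost function
# Note: the negative of the log-likelihood is returned, since we will be
#       minimising the function.
def logLikelihoodLogitStable(vBeta, mX, vY):
    vXB = np.dot(mX, vBeta)                 # linear predictor
    vLog1pExp = np.log(1.0 + np.exp(vXB))   # log(1 + exp(X*beta))
    return(-(np.sum(vY*(vXB - vLog1pExp) + (1-vY)*(-vLog1pExp))))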

# score function: gradient of the negative log-likelihood above
def likelihoodScore(vBeta, mX, vY):
    return(np.dot(mX.T,
                  (logit(mX, vBeta) - vY)))

#====================================================================
# optimize to get the MLE using the BFGS optimizer (numerical derivatives)
#====================================================================
optimLogitBFGS = sp.optimize.minimize(logLikelihoodLogitStable,
                                      x0 = np.array([10, 0.5, 0.1, -0.3, 0.1]),
                                      args = (mX, vY), method = 'BFGS',
                                      options = {'gtol': 1e-3, 'disp': True})

print(optimLogitBFGS) # print the results of the optimisation
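
Before handing an analytical gradient to an optimizer, it can be worth confirming that it agrees with finite differences. The following is a minimal sketch of such a check using scipy.optimize.check_grad. The synthetic data, the function names neg_log_likelihood and neg_log_likelihood_grad, and the use of np.logaddexp are assumptions made for illustration; they are not part of the Sheather example.

```python
import numpy as np
from scipy.optimize import check_grad

# Hypothetical stand-ins for the cost and score functions above.
def neg_log_likelihood(beta, X, y):
    xb = X @ beta
    # np.logaddexp(0, xb) evaluates log(1 + exp(xb)) without overflow
    return -np.sum(y * xb - np.logaddexp(0.0, xb))

def neg_log_likelihood_grad(beta, X, y):
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))  # fitted probabilities
    return X.T @ (p - y)

# Synthetic data (assumption): intercept plus two random predictors.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = (rng.random(50) < 0.5).astype(float)
beta = np.array([0.1, -0.2, 0.3])

# check_grad returns the 2-norm of (analytical - finite-difference) gradient.
grad_error = check_grad(neg_log_likelihood, neg_log_likelihood_grad, beta, X, y)
print(grad_error)
```

A value that is tiny relative to the size of the gradient suggests the analytical derivative is consistent with the cost function.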

And this can easily be adapted to the scipy.optimize.fmin_l_bfgs_b function:

#====================================================================
# optimize to get the MLE using the L-BFGS optimizer (analytical derivatives)
#====================================================================
optimLogitLBFGS = sp.optimize.fmin_l_bfgs_b(logLikelihoodLogitStable,
                                            x0 = np.array([10, 0.5, 0.1, -0.3, 0.1]),
                                            args = (mX, vY), fprime = likelihoodScore,
                                            pgtol = 1e-3, disp = True)

print(optimLogitLBFGS) # print the results of the optimisation
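
The same optimizer is also reachable through the scipy.optimize.minimize interface with method='L-BFGS-B' and the gradient passed as jac. Below is a rough sketch on synthetic data, so that it runs without downloading the Michelin file. The data-generating coefficients and the helper names are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(beta, X, y):
    xb = X @ beta
    return -np.sum(y * xb - np.logaddexp(0.0, xb))  # overflow-safe log(1 + exp(xb))

def neg_log_likelihood_grad(beta, X, y):
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (p - y)

# Synthetic data (assumption): known coefficients, Bernoulli responses.
rng = np.random.default_rng(42)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_beta = np.array([0.5, -1.0, 2.0])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(float)

# L-BFGS-B with the analytical gradient.
fit_lbfgs = minimize(neg_log_likelihood, x0=np.zeros(3), args=(X, y),
                     jac=neg_log_likelihood_grad, method="L-BFGS-B")

# Plain BFGS with the same gradient, for comparison.
fit_bfgs = minimize(neg_log_likelihood, x0=np.zeros(3), args=(X, y),
                    jac=neg_log_likelihood_grad, method="BFGS")

print(fit_lbfgs.x)
print(fit_bfgs.x)
```

Both fits should land on essentially the same estimates. The main practical difference is that L-BFGS-B keeps only a few recent gradient pairs instead of a full inverse-Hessian approximation, which matters when the number of coefficients is large.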

R

Using the L-BFGS-B optimizer in R is just as simple. First the version with the BFGS algorithm:

library(optimx)

# read in the data
urlSheatherData = "http://www.stat.tamu.edu/~sheather/book/docs/datasets/MichelinNY.csv"
dfSheatherData = read.csv(urlSheatherData, header = TRUE)

# create the design matrices
vY = as.matrix(dfSheatherData['InMichelin'])
mX = as.matrix(dfSheatherData[c('Service','Decor', 'Food', 'Price')])

# add an intercept to the predictor variables
mX = cbind(rep(1, nrow(mX)), mX)

# the number of variables and observations
iK = ncol(mX)
iN = nrow(mX)

# define the logistic transformation
logit = function(mX, vBeta) {
  return(exp(mX %*% vBeta)/(1+ exp(mX %*% vBeta)) )
}

# stable parametrisation of the log-likelihood function
# Note: The negative of the log-likelihood is being returned, since we will be
#       /minimising/ the function.
logLikelihoodLogitStable = function(vBeta, mX, vY) {
  return(-sum(
    vY*(mX %*% vBeta - log(1+exp(mX %*% vBeta)))
    + (1-vY)*(-log(1 + exp(mX %*% vBeta)))
  )  # sum
  )  # return
}

# score function: gradient of the negative log-likelihood above
likelihoodScore = function(vBeta, mX, vY) {
  return(t(mX) %*% (logit(mX, vBeta) - vY) )
}

# initial set of parameters
vBeta0 = c(10, -0.1, -0.3, 0.001, 0.01)  # arbitrary starting parameters

#====================================================================
# optimize to get the MLE using the BFGS optimizer (numerical derivatives)
#====================================================================
optimLogitBFGS = optim(vBeta0, logLikelihoodLogitStable,
                    mX = mX, vY = vY, method = 'BFGS', hessian=TRUE)
optimLogitBFGS # get the results of the optimisation

and then the version with the L-BFGS-B from the optimx package:

#====================================================================
# optimize to get the MLE using the L-BFGS optimizer (analytical derivatives)
#====================================================================
optimLogitLBFGS = optimx(vBeta0, logLikelihoodLogitStable, method = 'L-BFGS-B',
                            gr = likelihoodScore, mX = mX, vY = vY, hessian=TRUE)

summary(optimLogitLBFGS)

Related Solutions

Bayesian Logistic Regression – Regularized Bayesian Logistic Regression in JAGS

Since L1 regularization is equivalent to a Laplace (double exponential) prior on the relevant coefficients, you can do it as follows. Here I have three independent variables x1, x2, and x3, and y is the binary target variable. Selection of the regularization parameter $\lambda$ is done here by putting a hyperprior on it, in this case just uniform over a good-sized range.

model {
  # Likelihood
  for (i in 1:N) {
    y[i] ~ dbern(p[i])

    logit(p[i]) <- b0 + b[1]*x1[i] + b[2]*x2[i] + b[3]*x3[i]
  }

  # Prior on constant term
  b0 ~ dnorm(0,0.1)

  # L1 regularization == a Laplace (double exponential) prior 
  for (j in 1:3) {
    b[j] ~ ddexp(0, lambda)  
  }

  lambda ~ dunif(0.001,10)
  # Alternatively, specify lambda via lambda <- 1 or some such
}
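
To spell out the equivalence the model relies on, here is a short sketch of the negative log posterior for a fixed $\lambda$. It assumes the JAGS parametrisations, where ddexp(0, lambda) has density $\frac{\lambda}{2} e^{-\lambda |b_j|}$ and dnorm(0, 0.1) has precision 0.1; $p_i$ is the success probability defined by the logit line in the model.

$$
-\log p(b_0, b \mid y, \lambda)
= -\sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]
+ \lambda \sum_{j=1}^{3} |b_j|
+ \frac{0.1}{2}\, b_0^2
+ \text{const}
$$

The first term is the usual logistic negative log-likelihood and the second is the L1 penalty with weight $\lambda$, so the posterior mode for fixed $\lambda$ coincides with an L1-penalised fit (apart from the weak quadratic term on the intercept).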

Let's try it out using the dclone package in R!

library(dclone)

x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- rnorm(100)

prob <- exp(x1+x2+x3) / (1+exp(x1+x2+x3))
y <- rbinom(100, 1, prob)

data.list <- list(
  y = y,
  x1 = x1, x2 = x2, x3 = x3,
  N = length(y)
)

params = c("b0", "b", "lambda")

temp <- jags.fit(data.list, 
                 params=params, 
                 model="modela.jags",
                 n.chains=3, 
                 n.adapt=1000, 
                 n.update=1000, 
                 thin=10, 
                 n.iter=10000)

And here are the results, compared to an unregularized logistic regression:

> summary(temp)

<< blah, blah, blah >> 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

          Mean     SD Naive SE Time-series SE
b[1]   1.21064 0.3279 0.005987       0.005641
b[2]   0.64730 0.3192 0.005827       0.006014
b[3]   1.25340 0.3217 0.005873       0.006357
b0     0.03313 0.2497 0.004558       0.005580
lambda 1.34334 0.7851 0.014333       0.014999

2. Quantiles for each variable: << deleted to save space >>

> summary(glm(y~x1+x2+x3, family="binomial"))

  << blah, blah, blah >>

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.02784    0.25832   0.108   0.9142    
x1           1.34955    0.32845   4.109 3.98e-05 ***
x2           0.78031    0.32191   2.424   0.0154 *  
x3           1.39065    0.32863   4.232 2.32e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

<< more stuff deleted to save space >>

And we can see that the three b parameters have indeed been shrunk towards zero.

I don't know much about priors for the hyperparameter of the Laplace distribution (the regularization parameter), I'm sorry to say. I tend to use uniform distributions and look at the posterior to see if it looks reasonably well-behaved, e.g., not piled up near an endpoint and pretty much peaked in the middle without horrible skewness problems. So far, that's typically been the case. Treating it as a variance parameter and following the recommendations in Gelman's "Prior distributions for variance parameters in hierarchical models" works for me, too.

Solved – Goldfarb Idnani quadratic solver

If hard constraints of the optimization problem are violated in the solution, there is most definitely a problem in your implementation. A solution is required to satisfy all constraints. Note that this does not mean the solution must lie on the boundary of the feasible region (in contrast to linear programming).
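
One way to test a solver's output is to measure the largest constraint violation directly. The sketch below assumes inequality constraints written as A x >= b; the helper name max_constraint_violation and the toy problem are hypothetical and only meant to illustrate the check.

```python
import numpy as np

def max_constraint_violation(x, A_ineq, b_ineq):
    """Largest violation of A_ineq @ x >= b_ineq (0.0 if every constraint holds)."""
    slack = A_ineq @ x - b_ineq
    return float(max(0.0, -slack.min()))

# Toy problem (assumption): minimise (x0 - 1)^2 + (x1 - 1)^2
# subject to x0 >= 0, x1 >= 0, x0 + x1 <= 4.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])
b = np.array([0.0, 0.0, -4.0])

x_opt = np.array([1.0, 1.0])   # the minimiser; feasible and strictly interior
x_bad = np.array([-0.2, 1.0])  # a candidate that breaks x0 >= 0

print(max_constraint_violation(x_opt, A, b))
print(max_constraint_violation(x_bad, A, b))
```

Here the optimum has no active constraints at all, which illustrates that a quadratic program's solution can sit in the interior of the feasible region. In practice the comparison against zero would use a small tolerance to allow for floating-point error.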
