Since L1 regularization is equivalent to a Laplace (double exponential) prior on the relevant coefficients, you can do it as follows. Here I have three independent variables x1, x2, and x3, and y is the binary target variable. Selection of the regularization parameter $\lambda$ is done here by putting a hyperprior on it, in this case just uniform over a good-sized range.
model {
# Likelihood
for (i in 1:N) {
y[i] ~ dbern(p[i])
logit(p[i]) <- b0 + b[1]*x1[i] + b[2]*x2[i] + b[3]*x3[i]
}
# Prior on constant term
b0 ~ dnorm(0,0.1)
# L1 regularization == a Laplace (double exponential) prior
for (j in 1:3) {
b[j] ~ ddexp(0, lambda)
}
lambda ~ dunif(0.001,10)
# Alternatively, specify lambda via lambda <- 1 or some such
}
Let's try it out using the dclone
package in R!
library(dclone)
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- rnorm(100)
prob <- exp(x1+x2+x3) / (1+exp(x1+x2+x3))
y <- rbinom(100, 1, prob)
data.list <- list(
y = y,
x1 = x1, x2 = x2, x3 = x3,
N = length(y)
)
params = c("b0", "b", "lambda")
temp <- jags.fit(data.list,
params=params,
model="modela.jags",
n.chains=3,
n.adapt=1000,
n.update=1000,
thin=10,
n.iter=10000)
And here are the results, compared to an unregularized logistic regression:
> summary(temp)
<< blah, blah, blah >>
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
b[1] 1.21064 0.3279 0.005987 0.005641
b[2] 0.64730 0.3192 0.005827 0.006014
b[3] 1.25340 0.3217 0.005873 0.006357
b0 0.03313 0.2497 0.004558 0.005580
lambda 1.34334 0.7851 0.014333 0.014999
2. Quantiles for each variable: << deleted to save space >>
> summary(glm(y~x1+x2+x3, family="binomial"))
<< blah, blah, blah >>
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.02784 0.25832 0.108 0.9142
x1 1.34955 0.32845 4.109 3.98e-05 ***
x2 0.78031 0.32191 2.424 0.0154 *
x3 1.39065 0.32863 4.232 2.32e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
<< more stuff deleted to save space >>
And we can see that the three b
parameters have indeed been shrunk towards zero.
I don't know much about priors for the hyperparameter of the Laplace distribution / the regularization parameter, I'm sorry to say. I tend to use uniform distributions and look at the posterior to see if it looks reasonably well-behaved, e.g., not piled up near an endpoint and pretty much peaked in the middle w/o horrible skewness problems. So far, that's typically been the case. Treating it as a variance parameter and using the recommendation(s) by Gelman Prior distributions for variance parameters in hierarchical models works for me, too.
Best Answer
Here is an example of logistic regression estimation using the limited memory BFGS [L-BFGS] optimization algorithm. I will be using the
optimx
function from theoptimx
library in R, and SciPy'sscipy.optimize.fmin_l_bfgs_b
in Python.Python
The example that I am using is from Sheather (2009, pg. 264). The following Python code shows estimation of the logistic regression using the BFGS algorithm:
And this can easily be adapted to the
scipy.optimize.fmin_l_bfgs_b
function:R
Using the L-BFGS-B optimizer in R is just as simple. First the version with the BFGS algorithm:
and then the version with the L-BFGS-B from the
optimx
package: