Solved – Regularization methods for logistic regression


Regularization using methods such as Ridge, Lasso, and Elastic Net is quite common for linear regression. I wanted to know the following:
Are these methods applicable to logistic regression? If so, are there any differences in the way they need to be used for logistic regression? If these methods are not applicable, how does one regularize a logistic regression?

Best Answer

Yes, regularization can be used with all linear methods, for both regression and classification. I would like to show that there is not much difference between regression and classification: the only difference is the loss function.

Specifically, a linear method has three major components: the loss function, the regularization, and the algorithm. The loss function plus the regularization gives the objective function of the optimization problem, and the algorithm is the way to solve it (the objective function is convex, so we will not discuss algorithms in this post).

For the loss function, we can choose different losses in both the regression and the classification case. For example, the least-squares and least-absolute-deviation losses can be used for regression. Their mathematical representations are $L(\hat y,y)=(\hat y -y)^2$ and $L(\hat y,y)=|\hat y -y|$. (The function $L(\cdot)$ is defined on two scalars: $y$ is the ground-truth value and $\hat y$ is the predicted value.)

On the other hand, the logistic loss and the hinge loss can be used for classification. Their mathematical representations are $L(\hat y, y)=\log (1+ \exp(-\hat y y))$ and $L(\hat y, y)= (1- \hat y y)_+$. (Here, $y$ is the ground-truth label in $\{-1,1\}$ and $\hat y$ is the predicted "score". The definition of $\hat y$ is a little unusual; please see the notation comments at the end.)
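As a small illustration (a sketch only, not code from this answer), these four losses can be written directly as functions of the real-valued score $\hat y = w^\top x$ and the target $y$:

```python
import numpy as np

# Regression losses: y is a real number, y_hat = w @ x is the prediction.
def squared_loss(y_hat, y):
    return (y_hat - y) ** 2

def absolute_loss(y_hat, y):
    return np.abs(y_hat - y)

# Classification losses: y is in {-1, +1}, y_hat = w @ x is a real-valued score.
def logistic_loss(y_hat, y):
    # np.logaddexp(0, z) computes log(1 + exp(z)) in a numerically stable way.
    return np.logaddexp(0.0, -y_hat * y)

def hinge_loss(y_hat, y):
    return np.maximum(0.0, 1.0 - y_hat * y)
```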

For the regularization, you mentioned L1 and L2 regularization; there are also other forms, which will not be discussed in this post.

Therefore, at a high level, a linear method is

$$\underset{w}{\text{minimize}}~~~ \sum_{x,y} L(w^{\top} x,y)+\lambda h(w)$$

If you replace the loss function of the regression setting with the logistic loss, you get logistic regression with regularization.

For example, in ridge regression, the optimization problem is

$$\underset{w}{\text{minimize}}~~~ \sum_{x,y} (w^{\top} x-y)^2+\lambda w^\top w$$
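As an aside (a sketch under the assumption that the data are stacked in a matrix `X` with rows $x^\top$), this ridge objective has a closed-form minimizer, obtained by setting the gradient to zero:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize sum_i (w^T x_i - y_i)^2 + lam * w^T w.

    Setting the gradient to zero gives (X^T X + lam * I) w = X^T y.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```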

If you replace the loss function with logistic loss, the problem becomes

$$\underset{w}{\text{minimize}}~~~ \sum_{x,y} \log(1+\exp{(-w^{\top}x \cdot y)})+\lambda w^\top w$$

This is logistic regression with L2 regularization.
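As a minimal sketch (not the code behind the figures below), this L2-regularized logistic objective can be handed directly to a general-purpose optimizer; here `lam` plays the role of $\lambda$:

```python
import numpy as np
from scipy.optimize import minimize

def fit_logreg_l2(X, y, lam):
    """Minimize sum_i log(1 + exp(-y_i * w^T x_i)) + lam * w^T w.

    X has shape (n, d) and y takes values in {-1, +1}.
    """
    def objective(w):
        scores = X @ w
        # np.logaddexp(0, z) is a numerically stable log(1 + exp(z)).
        return np.sum(np.logaddexp(0.0, -y * scores)) + lam * (w @ w)
    w0 = np.zeros(X.shape[1])
    return minimize(objective, w0, method="BFGS").x
```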


This is how it looks on a toy synthetic binary data set. The left figure shows the data with the linear model (decision boundary). The right figure shows the contour of the objective function (the x and y axes represent the values of the two parameters). The data set was generated from two Gaussians, and we fit the logistic regression model without an intercept, so there are only two parameters, which we can visualize in the right sub-figure.

The blue lines correspond to logistic regression without regularization and the black lines to logistic regression with L2 regularization. The blue and black points in the right figure are the optimal parameters of the respective objective functions.

In this experiment, we set a large $\lambda$, so you can see that the two coefficients are close to $0$. In addition, from the contour we can observe that the regularization term dominates and the whole objective looks like a quadratic bowl.

[Figure: data with decision boundaries (left) and objective-function contours with optima (right), with and without L2 regularization]

Here is another example with L1 regularization.

[Figure: data with decision boundaries (left) and objective-function contours with optima (right), with and without L1 regularization]

Note that the purpose of this experiment is to show how regularization works in logistic regression, not to argue that the regularized model is better.
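The original code is linked at the end of this answer; the following is only a rough sketch of this kind of experiment (toy two-Gaussian data, no intercept, my own choice of seed and regularization strength) using scikit-learn, where `C` plays the role of $1/\lambda$:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Two Gaussian classes in 2D, labels in {-1, +1}.
n = 100
X = np.vstack([rng.normal(loc=[1.5, 1.5], size=(n, 2)),
               rng.normal(loc=[-1.5, -1.5], size=(n, 2))])
y = np.concatenate([np.ones(n), -np.ones(n)])

# No intercept, so the model has exactly two parameters, as in the figures.
# In scikit-learn, C ~ 1/lambda: a small C means strong regularization.
weak_reg = LogisticRegression(C=1e6, fit_intercept=False).fit(X, y)  # ~ unregularized
l2_model = LogisticRegression(penalty="l2", C=0.01, fit_intercept=False).fit(X, y)
l1_model = LogisticRegression(penalty="l1", C=0.01, solver="liblinear",
                              fit_intercept=False).fit(X, y)

print("(almost) no regularization:", weak_reg.coef_)
print("L2, large lambda:          ", l2_model.coef_)
print("L1, large lambda:          ", l1_model.coef_)
```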


Here are some animations of L1 and L2 regularization and how they affect the logistic-loss objective. In each frame, the title gives the regularization type and $\lambda$, and the plot shows the contour of the objective function (logistic loss + regularization). We increase the regularization parameter $\lambda$ from frame to frame, and the optimal solution shrinks toward $0$ frame by frame.

[Animations: objective-function contours as $\lambda$ increases, for L1 and L2 regularization]
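The animations are not reproduced here, but the shrinking effect is easy to check numerically: sweep $\lambda$ and record the minimizer of the regularized objective. A small sketch (my own toy data, not the data used for the animations):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[1.5, 1.5], size=(50, 2)),
               rng.normal(loc=[-1.5, -1.5], size=(50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])

def optimum(lam, penalty):
    """Minimizer of sum_i log(1 + exp(-y_i w^T x_i)) + lam * (L1 or L2 penalty)."""
    def objective(w):
        fit = np.sum(np.logaddexp(0.0, -y * (X @ w)))
        reg = np.sum(np.abs(w)) if penalty == "l1" else w @ w
        return fit + lam * reg
    # Nelder-Mead handles the non-smooth L1 term in this 2-parameter problem.
    return minimize(objective, np.zeros(2), method="Nelder-Mead").x

# As lambda grows, the optimal coefficients shrink toward 0.
for lam in [0.1, 1.0, 10.0, 100.0, 1000.0]:
    print(f"lambda={lam:7.1f}  L2 optimum={optimum(lam, 'l2')}  L1 optimum={optimum(lam, 'l1')}")
```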


Some comments on notation: $w$ and $x$ are column vectors and $y$ is a scalar, so the linear model is $\hat y = f(x)=w^\top x$. If we want to include an intercept term, we can append a column of $1$s to the data.

In the regression setting $y$ is a real number, and in the classification setting $y \in \{-1,1\}$.

Note that the definition $\hat y=w^{\top} x$ is a little strange in the classification setting, since most people use $\hat y$ to denote a predicted value of $y$, and in our case $\hat y = w^{\top} x$ is a real number, not a value in $\{-1,1\}$. We use this definition of $\hat y$ because it simplifies the notation of the logistic loss and the hinge loss.

Also note that in other notation systems, where $y \in \{0,1\}$, the logistic loss function takes a different form.
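For concreteness (this equivalence is standard, not something stated in the original answer): with $y \in \{0,1\}$ and $\hat p = \sigma(w^\top x) = 1/(1+\exp(-w^\top x))$, the logistic loss is usually written as the cross-entropy

$$L(\hat p, y) = -\big[\, y\log \hat p + (1-y)\log(1-\hat p) \,\big],$$

which agrees with the form used above once the label is mapped to $y' = 2y-1 \in \{-1,1\}$, because $-\log\sigma(z) = \log(1+\exp(-z))$ and $-\log(1-\sigma(z)) = \log(1+\exp(z))$.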

The code can be found in my other answer here.

Is there any intuitive explanation of why logistic regression will not work in the perfectly separable case, and why adding regularization fixes it?