Never mind my question; I did the derivation once again and found that the scikit-learn equation is correct: it is minimizing the negative log-likelihood.
Here are the steps:
Let $(X_i,y_i), i=1,\dots,m$ be pairs of (features, class), where $X_i$ is a column-vector of $N$ features. The class $y_i\in\{1,-1\}$ will be limited to these two values (instead of 0 and 1), which will be useful later.
We are trying to model the probability that a feature vector $X$ belongs to class $y=1$ as:
$$p(y=1|X;w,c) = g(X^Tw+c) = \frac{1}{1+\exp(-(X^Tw+c))}\,,$$
where $w,c$ are the weights and intercept of the logistic regression model.
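As an aside, $g$ is easy to make concrete in code; here is a small sketch (a hypothetical helper, not part of scikit-learn) that evaluates it in a numerically stable way by only exponentiating non-positive arguments:

import numpy as np

def g(z):
    # Logistic (sigmoid) function g(z) = 1 / (1 + exp(-z)), evaluated stably.
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))                # exp(-z) <= 1 here
    out[~pos] = np.exp(z[~pos]) / (1.0 + np.exp(z[~pos]))   # exp(z) <= 1 here
    return out

print(g([-5.0, 0.0, 5.0]))  # ~ [0.0067, 0.5, 0.9933]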
To obtain the optimal $w,c$, we maximize the likelihood of the labeled data. The optimization problem is:
$$\begin{align} \mathop{argmax}_{w,c}\quad& \mathcal{L}(w,c;X_1,\dots,X_m) \\
&= \prod_{i,y_i=1} p(y=1|X_i;w,c) \prod_{i,y_i=-1} p(y=-1|X_i;w,c) \\
&\langle \text{There are only two classes, so } p(y=-1|\dots) = 1-p(y=1|\dots)\rangle\\
&= \prod_{i,y_i=1} p(y=1|X_i;w,c) \prod_{i,y_i=-1} (1-p(y=1|X_i;w,c)) \\
&\langle \text{Definition of } p\rangle\\
&= \prod_{i,y_i=1} g(X_i^Tw +c) \prod_{i,y_i=-1} (1-g(X_i^Tw +c)) \\
&\langle \text{Useful property: } 1-g(z) = g(-z) \rangle\\
&= \prod_{i,y_i=1} g(X_i^Tw +c) \prod_{i,y_i=-1} g(-(X_i^Tw +c)) \\
&\langle \text{Handy trick of using +1/-1 classes: multiply by } y_i \text{ to have a common product}\rangle\\
&= \prod_{i=1}^m g(y_i (X_i^Tw +c)) \\
\end{align}$$
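As a quick sanity check of the last two steps (the identity $1-g(z)=g(-z)$ and the $\pm 1$ trick), the class-split product and the single product over $g(y_i(X_i^Tw+c))$ can be compared numerically; this is only an illustrative sketch with made-up random data:

import numpy as np

rng = np.random.default_rng(0)
m, N = 20, 3
X = rng.normal(size=(m, N))       # row i is X_i^T
y = rng.choice([-1, 1], size=m)   # labels in {-1, +1}
w, c = rng.normal(size=N), 0.5

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

z = X @ w + c
split = np.prod(g(z[y == 1])) * np.prod(1.0 - g(z[y == -1]))   # product split by class
joint = np.prod(g(y * z))                                      # single product via the +1/-1 trick
print(np.isclose(split, joint))   # True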
At this point I apply the logarithm (since it is monotonically increasing) and flip the maximization into a minimization by multiplying by $-1$:
$$\begin{align}
\mathop{argmin}_{w,c}\quad & -\log \left(\prod_{i=1}^m g(y_i (X_i^Tw +c))\right) \\
&\langle \log (a\cdot b) = \log a + \log b \rangle\\
&= -\sum_{i=1}^{m} \log g(y_i (X_i^Tw +c)) \\
&\langle \text{definition of } g \rangle\\
&= -\sum_{i=1}^{m} \log \frac{1}{1+\exp(-y_i (X_i^Tw +c))} \\
&\langle \log (a/b) = \log a - \log b \rangle\\
&= -\sum_{i=1}^{m} \left(\log 1 - \log (1+\exp(-y_i (X_i^Tw +c)))\right) \\
&\langle \log 1 = 0 \rangle\\
&= -\sum_{i=1}^{m} - \log (1+\exp(-y_i (X_i^Tw +c))) \\
&= \sum_{i=1}^{m} \log (\exp(-y_i (X_i^Tw +c))+1)\,,
\end{align}$$
which is exactly the equation minimized by scikit-learn (without the L1 regularization term, and with $C=1$).
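To double-check this numerically, here is a sketch (toy data, made-up names) that minimizes this sum directly with scipy and compares the result against an unpenalized scikit-learn fit; it assumes a scikit-learn version recent enough to accept penalty=None (older versions use penalty='none'):

import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y01 = make_classification(n_samples=200, n_features=5, random_state=0)
y = 2 * y01 - 1   # map {0, 1} labels to {-1, +1}

def neg_log_likelihood(params):
    w, c = params[:-1], params[-1]
    z = y * (X @ w + c)
    return np.logaddexp(0.0, -z).sum()   # sum of log(1 + exp(-z)), computed stably

res = minimize(neg_log_likelihood, np.zeros(X.shape[1] + 1))
clf = LogisticRegression(penalty=None, max_iter=1000).fit(X, y01)

print(res.x[:-1], res.x[-1])              # weights and intercept from direct minimization
print(clf.coef_.ravel(), clf.intercept_)  # should be very close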
Here is an example of LassoCV's effect on the MSE with varying eps and tol (using the diabetes dataset), for various $\alpha$'s. Note that this is the average MSE (each CV run will have a different MSE).
It appears that eps (which controls the length of the regularization path, i.e. the ratio alpha_min/alpha_max) has a significant impact for some penalty parameters, but with a large enough penalty it doesn't matter. tol (the optimization tolerance) doesn't seem to play a large role (at least as far as scikit-learn's implementation of LassoCV is concerned).
See below for code.
import matplotlib.pyplot as plt
from matplotlib.pyplot import cm
%matplotlib inline
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LassoCV

# load data
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target

ymin, ymax = 2300, 3800

# Plot of epsilons
epss = [0.0001, 0.001, 0.01, 0.1]
plt.figure(figsize=(10, 6))
colors = cm.rainbow(np.linspace(0, 1, len(epss)))
for eps, c in zip(epss, colors):
    model = LassoCV(eps=eps).fit(X, y)
    m_log_alphas = -np.log10(model.alphas_)  # x-axis: -log10 of the alpha path
    plt.plot(m_log_alphas, model.mse_path_.mean(axis=-1), color=c,
             label='eps = {}'.format(eps), linewidth=2)
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean square error')
plt.axis('tight')
plt.ylim(ymin, ymax)

# Plot of tols
plt.figure(figsize=(10, 6))
tols = [0.0001, 0.001, 0.01, 0.1, 1]
colors = cm.rainbow(np.linspace(0, 1, len(tols)))
for tol, c in zip(tols, colors):
    model = LassoCV(tol=tol).fit(X, y)
    m_log_alphas = -np.log10(model.alphas_)
    plt.plot(m_log_alphas, model.mse_path_.mean(axis=-1), color=c,
             label='tol = {}'.format(tol), linewidth=2)
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean square error')
plt.axis('tight')
plt.ylim(ymin, ymax)
Best Answer
I am going to explain the case of the Lasso; you can apply the same logic to ElasticNet.
The duality gap is the difference between the value of the primal objective and the value of the dual objective at the current iterate, as explained here. The primal problem is the following: $$ \min_{w \in \mathbb{R}^{n}} \frac{1}{2} ||Xw - y||_{2}^{2} + \alpha ||w||_{1}$$
where $n$ is the number of features in your dataset, $X$ holds your samples, $w$ is the weight vector, and $\alpha$ controls the tradeoff between data fit and sparsity of the weights. I don't think there is much interest in looking at the dual formulation, but if you want, you can have a look here.
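For concreteness, here is a small sketch of that primal objective in code (toy data, hypothetical helper name). Note that scikit-learn's own Lasso objective divides the quadratic term by the number of samples, so its $\alpha$ lives on a different scale than in the formula above:

import numpy as np

def lasso_primal(X, y, w, alpha):
    # Primal Lasso objective: 0.5 * ||Xw - y||_2^2 + alpha * ||w||_1
    residual = X @ w - y
    return 0.5 * residual @ residual + alpha * np.abs(w).sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = rng.normal(size=50)
w = rng.normal(size=10)
print(lasso_primal(X, y, w, alpha=0.1))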
However, it is more useful to know that the duality gap is always nonnegative, and that it equals zero when strong duality holds. The Lasso problem is convex and satisfies the conditions for strong duality, so the gap vanishes at the optimum. This is why a shrinking duality gap tells us that we are getting closer to an optimal solution: the gap bounds how suboptimal the current iterate is.
The reason is that the algorithm has not converged after max_iter passes of coordinate updates; that is why the warning asks you to increase the number of iterations. The last update of a weight coordinate changed that coordinate by 5.71, which is greater than 0.0001. Meanwhile, with the current weights, the primal objective minus the dual objective equals 8.058.
For the algorithm to be declared converged, the largest coordinate update in a pass has to drop below tol. Only then does sklearn compute the dual gap, and it stops only if that value is also lower than tol.
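You can see this in practice by fitting a Lasso with too few iterations and then with enough of them; the fitted estimator exposes n_iter_ and dual_gap_, which is what the warning reports (a sketch with arbitrary synthetic data):

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=50, noise=1.0, random_state=0)

# Deliberately too few coordinate-descent passes: typically triggers a
# ConvergenceWarning and leaves a large dual gap
under = Lasso(alpha=1.0, max_iter=3).fit(X, y)
print(under.n_iter_, under.dual_gap_)

# Enough iterations: coordinate updates drop below tol, the dual gap is
# then checked, and the fit is declared converged
ok = Lasso(alpha=1.0, max_iter=10000, tol=1e-4).fit(X, y)
print(ok.n_iter_, ok.dual_gap_)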