Solved – Why does glmnet use coordinate descent for Ridge regression

elastic net, glmnet, lasso, regularization, ridge regression

If I understand it correctly, glmnet uses cyclical coordinate descent not only for the lasso and elastic nets, but also for ridge regression.

Why does it use this algorithm, which sometimes gives slightly inaccurate results, when an easy closed-form solution is available?

Thank you very much in advance!

Best Answer

I think this is due to speed. Cyclical coordinate descent does not find the exact solution in finite time, but it is faster, not only for a grid of $\lambda$'s but also for a single $\lambda$.

Consider the task of solving ridge regression for a single $\lambda$, with a data matrix of size $n \times p$. I believe the optimal runtime for exact ridge regression is $O(n^2p)$ if $n < p$ and $O(np^2)$ if $n > p$. See Murphy, Machine Learning, section 7.5.2 for a reference.
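As a sketch of where the two cases come from (a standard identity, written here with no intercept and with the penalty as $\lambda I$, which is my notation rather than Murphy's), the exact solution can be computed from either a $p \times p$ or an $n \times n$ system:

$$\hat{\beta} = (X^\top X + \lambda I_p)^{-1} X^\top y = X^\top (X X^\top + \lambda I_n)^{-1} y$$

Forming and factoring the $p \times p$ system costs $O(np^2 + p^3)$, which is $O(np^2)$ when $n > p$. Forming and factoring the $n \times n$ system costs $O(n^2p + n^3)$, which is $O(n^2p)$ when $n < p$. Picking the smaller system gives the bounds above.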

With the cyclical coordinate descent algorithm, "a complete cycle through all $p$ variables costs $O(pN)$ operations" (p. 6, Friedman et al. 2010, https://www.jstatsoft.org/article/view/v033i01), where $N$ is the number of observations, $n$ here. Running $c$ cycles therefore costs $O(cnp)$, so choosing $c \ll \min(n, p)$ gives a faster big-O runtime for a single $\lambda$ than the exact solution. For solving over many $\lambda$'s, the glmnet method should yield further improvement using warm starts.
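To make the per-cycle cost concrete, here is a minimal pure-Python sketch of cyclical coordinate descent for ridge. It is not glmnet's actual implementation: it assumes the objective $\frac{1}{2n}\|y - X\beta\|^2 + \frac{\lambda}{2}\|\beta\|^2$ with no intercept and no standardization. The function names and the toy data are made up for illustration.

```python
def ridge_coordinate_descent(X, y, lam, n_cycles):
    """Cyclical coordinate descent for (1/(2n))*||y - X b||^2 + (lam/2)*||b||^2."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta because beta starts at zero
    # Mean square of each column, (1/n) * sum_i x_ij^2, computed once up front.
    col_ms = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_cycles):
        for j in range(p):  # one full cycle over j touches each entry of X a constant number of times: O(n * p)
            # Inner product of column j with the partial residual that leaves out feature j.
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n + col_ms[j] * beta[j]
            new_beta_j = rho / (col_ms[j] + lam)  # exact minimizer along coordinate j
            delta = new_beta_j - beta[j]
            for i in range(n):  # keep the residual in sync with the updated coefficient
                residual[i] -= X[i][j] * delta
            beta[j] = new_beta_j
    return beta


def ridge_gradient(X, y, beta, lam):
    """Gradient of the objective above; it is zero exactly at the closed-form solution."""
    n, p = len(X), len(X[0])
    residual = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
    return [-sum(X[i][j] * residual[i] for i in range(n)) / n + lam * beta[j]
            for j in range(p)]


# Hypothetical toy data, only to exercise the functions.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
y = [1.0, 2.0, 3.0, 5.0]
lam = 0.1

beta_few = ridge_coordinate_descent(X, y, lam, n_cycles=3)     # cheap, approximate
beta_many = ridge_coordinate_descent(X, y, lam, n_cycles=200)  # close to the exact solution
```

The gradient check stands in for a comparison with the closed form: the exact ridge solution is the unique point where this gradient vanishes, so a small gradient means the iterate is close to it. A few cycles leave a visibly nonzero gradient, which is the "slightly inaccurate" behaviour the question mentions.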

Related Solutions

Solved – How is the intercept computed in GLMnet

I found that the intercept in GLMnet is computed after the coefficient updates have converged. The intercept is computed from the mean of the $y_i$'s and the means of the $x_{ij}$'s. The formula is similar to the previous one I gave, but with the $\beta_j$'s taken after the update loop: $\beta_0=\bar{y}-\sum_{j=1}^{p} \hat{\beta}_j \bar{x}_j$.

In Python this gives something like:

        self.intercept_ = ymean - np.dot(Xmean, self.coef_.T)

which I found on the scikit-learn page.

EDIT: the coefficients have to be rescaled by X_std first:

        self.coef_ = self.coef_ / X_std

$\beta_0=\bar{y}-\sum_{j=1}^{p} \frac{\hat{\beta}_j \bar{x}_j}{\sum_{i=1}^{n} x_{ij}^2}$.
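The order of operations (fit on centered and scaled data, rescale the coefficient, then recover the intercept) can be sketched for a single feature in plain Python. This is only an illustration, not glmnet or scikit-learn code: it uses unpenalized least squares, `fit_with_intercept` is a made-up name, and taking the scale to be the L2 norm of the centered column is an assumption standing in for `X_std`.

```python
import math


def fit_with_intercept(x, y):
    """One-feature least squares fitted on centered, scaled data, then mapped back."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    x_centered = [xi - x_mean for xi in x]
    y_centered = [yi - y_mean for yi in y]
    # Assumed scale: L2 norm of the centered column (stands in for X_std).
    x_scale = math.sqrt(sum(xc ** 2 for xc in x_centered))
    x_scaled = [xc / x_scale for xc in x_centered]
    # Slope on the scaled feature; no intercept is needed because both sides are centered.
    coef_scaled = (sum(xs * yc for xs, yc in zip(x_scaled, y_centered))
                   / sum(xs ** 2 for xs in x_scaled))
    coef = coef_scaled / x_scale        # rescale back to the original units of x
    intercept = y_mean - coef * x_mean  # intercept from the means, computed last
    return coef, intercept


# Toy data lying exactly on the line y = 2x + 1.
coef, intercept = fit_with_intercept([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

On this toy data the recovered slope and intercept match the generating line up to floating-point error, which only works because the coefficient is rescaled before the intercept is computed.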

Solved – Ridge Regression with R

You need to standardize $X$ before applying the penalty $\lambda$, then transform the coefficients back to the scale of the original $X$. The results will then match those of lm.ridge.

Something like:

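# Ridge system on the standardized scale: Xs'Xs / (n - 1) + lambda * I
r.01 <- crossprod(Xs) / (nrow(X) - 1) + diag(ncol(X)) * lambda
# Solve for the standardized coefficients, then divide by sd_X to return to the scale of X
as.numeric(tcrossprod(chol2inv(chol(r.01)), Xs / (nrow(X) - 1)) %*% y) / sd_X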

where X is the original model matrix excluding the intercept, Xs is X standardized to have unit variance, and sd_X is the vector of standard deviations of the variables in X.
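Written out, the two lines of R appear to compute the following, where $X_s$ is Xs, $n$ is nrow(X), and $D = \mathrm{diag}(\text{sd\_X})$ (these symbols are introduced here only for illustration):

$$\hat{\beta}_s = \left(\frac{X_s^\top X_s}{n-1} + \lambda I\right)^{-1} \frac{X_s^\top y}{n-1}, \qquad \hat{\beta} = D^{-1} \hat{\beta}_s$$

If Xs is also centered, $X_s^\top X_s / (n-1)$ is the sample correlation matrix of X, so the penalty acts on the correlation scale and $D^{-1}$ maps the coefficients back to the original units.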
