I am going through the lab section §6.6 on Ridge Regression/Lasso in the book 'An Introduction to Statistical Learning with Applications in R' by James, Witten, Hastie, Tibshirani (2013).
More specifically, I am trying to apply the scikit-learn Ridge
model to the 'Hitters' dataset from the R package 'ISLR'. I have created the same set of features as shown in the R code. However, I cannot get close to the results from the glmnet()
model. I have selected one L2 tuning parameter to compare (the 'alpha' argument in scikit-learn).
Python:
from sklearn.linear_model import Ridge

# X, y: the feature matrix and response built from the Hitters data
regr = Ridge(alpha=11498)
regr.fit(X, y)
http://nbviewer.ipython.org/github/JWarmenhoven/ISL-python/blob/master/Notebooks/Chapter%206.ipynb
R:
Note that the argument alpha=0
in glmnet()
means that an L2 penalty should be applied (Ridge regression). The documentation warns against supplying a single value for lambda
, but the result is the same as in ISL, where a vector is used.
ridge.mod <- glmnet(x,y,alpha=0,lambda=11498)
What causes the differences?
Edit:
When using penalized()
from the penalized package in R, the coefficients are the same as with scikit-learn.
ridge.mod2 <- penalized(y,x,lambda2=11498)
Maybe the question could then also be: 'What is the difference between glmnet()
and penalized()
when doing Ridge regression?'
There is also a new Python wrapper for the actual Fortran code used in the R package glmnet:
https://github.com/civisanalytics/python-glmnet
Best Answer
My answer is missing a factor of $\frac{1}{N}$; please see @visitor's answer below for the correct comparison.
Here are two references that should clarify the relationship.
The sklearn documentation says that
linear_model.Ridge
optimizes the following objective function
$$ \left| X \beta - y \right|_2^2 + \alpha \left| \beta \right|_2^2 $$
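This stated objective can be checked directly: for Ridge without an intercept, the minimizer satisfies the normal equations $(X^T X + \alpha I)\beta = X^T y$. A quick sketch with hypothetical synthetic data (the variable names are my own):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
a = 10.0

# Ridge with fit_intercept=False minimizes ||X b - y||^2 + a ||b||^2,
# whose minimizer solves (X'X + a I) b = X'y.
ridge = Ridge(alpha=a, fit_intercept=False).fit(X, y)
closed_form = np.linalg.solve(X.T @ X + a * np.eye(3), X.T @ y)
print(np.allclose(ridge.coef_, closed_form))  # True
```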
The glmnet paper says that the elastic net optimizes the following objective function
$$ \left| X \beta - y \right|_2^2 + \lambda \left( \frac{1}{2} (1 - \alpha) \left| \beta \right|_2^2 + \alpha \left| \beta \right|_1 \right) $$
Notice that the two implementations use $\alpha$ in totally different ways, sklearn uses $\alpha$ for the overall level of regularization while glmnet uses $\lambda$ for that purpose, reserving $\alpha$ for trading between ridge and lasso regularization.
Comparing the formulas, it looks like setting $\alpha = 0$ and $\lambda = 2 \alpha_{\text{sklearn}}$ in glmnet should recover the solution from
linear_model.Ridge
.
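Accounting for the $\frac{1}{N}$ factor noted in the edit at the top (the glmnet paper puts a $\frac{1}{2N}$ in front of the squared-error term), the algebra can be verified numerically. This is only a sketch with hypothetical synthetic data, assuming no intercept and no standardization (glmnet standardizes x by default): minimizing $\frac{1}{2N} \left| y - X\beta \right|_2^2 + \frac{\lambda}{2} \left| \beta \right|_2^2$ is the same as sklearn Ridge with $\alpha_{\text{sklearn}} = N \lambda$.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
N, p = 80, 4
X = rng.normal(size=(N, p))
y = rng.normal(size=N)
lam = 0.5  # a glmnet-style lambda (illustrative value)

# Setting the gradient of (1/(2N))||y - Xb||^2 + (lam/2)||b||^2 to zero
# gives the normal equations (X'X/N + lam*I) b = X'y/N.
b_glmnet_style = np.linalg.solve(X.T @ X / N + lam * np.eye(p), X.T @ y / N)

# Multiplying that objective by 2N gives ||y - Xb||^2 + N*lam*||b||^2,
# i.e. sklearn's Ridge objective with alpha = N * lam.
b_sklearn = Ridge(alpha=N * lam, fit_intercept=False).fit(X, y).coef_

print(np.allclose(b_glmnet_style, b_sklearn))  # True
```

This only checks the objective-function algebra; coefficients from an actual glmnet fit will still differ unless standardize=FALSE is set, because glmnet rescales the columns of x before fitting.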