I am going through the lab section §6.6 on Ridge Regression/Lasso in the book 'An Introduction to Statistical Learning with Applications in R' by James, Witten, Hastie, Tibshirani (2013).
More specifically, I am trying to apply the scikit-learn Ridge
model to the 'Hitters' dataset from the R package 'ISLR'. I have created the same set of features as shown in the R code. However, I cannot get close to the results from the glmnet()
model. I have selected one L2 tuning parameter to compare (the 'alpha' argument in scikit-learn).
Python:
from sklearn.linear_model import Ridge

# X, y: the feature matrix and response built from the Hitters data
regr = Ridge(alpha=11498)
regr.fit(X, y)
http://nbviewer.ipython.org/github/JWarmenhoven/ISL-python/blob/master/Notebooks/Chapter%206.ipynb
R:
Note that the argument alpha=0
in glmnet()
means that an L2 penalty should be applied (Ridge regression). The documentation warns against supplying a single value for lambda
, but the result is the same as in ISL, where a vector is used.
ridge.mod <- glmnet(x,y,alpha=0,lambda=11498)
What causes the differences?
Edit:
When using penalized()
from the penalized package in R, the coefficients are the same as with scikit-learn.
ridge.mod2 <- penalized(y,x,lambda2=11498)
Maybe the question could then also be: 'What is the difference between glmnet()
and penalized()
when doing Ridge regression?'
There is also a new Python wrapper for the actual Fortran code used in the R package glmnet:
https://github.com/civisanalytics/python-glmnet
Best Answer
My answer is missing a factor of $\frac{1}{N}$; please see @visitor's answer below for the correct comparison.
Here are two references that should clarify the relationship.
The sklearn documentation says that
linear_model.Ridge
optimizes the following objective function
$$ \left| X \beta - y \right|_2^2 + \alpha \left| \beta \right|_2^2 $$
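This stated objective can be checked directly: for Ridge without an intercept, the minimizer satisfies the normal equations $(X^T X + \alpha I)\beta = X^T y$. A quick sketch with hypothetical synthetic data (the variable names are my own):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
a = 10.0

# Ridge with fit_intercept=False minimizes ||X b - y||^2 + a ||b||^2,
# whose minimizer solves (X'X + a I) b = X'y.
ridge = Ridge(alpha=a, fit_intercept=False).fit(X, y)
closed_form = np.linalg.solve(X.T @ X + a * np.eye(3), X.T @ y)
print(np.allclose(ridge.coef_, closed_form))  # True
```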
The glmnet paper says that the elastic net optimizes the following objective function
$$ \left| X \beta - y \right|_2^2 + \lambda \left( \frac{1}{2} (1 - \alpha) \left| \beta \right|_2^2 + \alpha \left| \beta \right|_1 \right) $$
Notice that the two implementations use $\alpha$ in totally different ways, sklearn uses $\alpha$ for the overall level of regularization while glmnet uses $\lambda$ for that purpose, reserving $\alpha$ for trading between ridge and lasso regularization.
Comparing the formulas, it looks like setting $\alpha = 0$ and $\lambda = 2 \alpha_{\text{sklearn}}$ in glmnet should recover the solution from
linear_model.Ridge
.
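Accounting for the $\frac{1}{N}$ factor noted in the edit at the top (the glmnet paper puts a $\frac{1}{2N}$ in front of the squared-error term), the algebra can be verified numerically. This is only a sketch with hypothetical synthetic data, assuming no intercept and no standardization (glmnet standardizes x by default): minimizing $\frac{1}{2N} \left| y - X\beta \right|_2^2 + \frac{\lambda}{2} \left| \beta \right|_2^2$ is the same as sklearn Ridge with $\alpha_{\text{sklearn}} = N \lambda$.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
N, p = 80, 4
X = rng.normal(size=(N, p))
y = rng.normal(size=N)
lam = 0.5  # a glmnet-style lambda (illustrative value)

# Setting the gradient of (1/(2N))||y - Xb||^2 + (lam/2)||b||^2 to zero
# gives the normal equations (X'X/N + lam*I) b = X'y/N.
b_glmnet_style = np.linalg.solve(X.T @ X / N + lam * np.eye(p), X.T @ y / N)

# Multiplying that objective by 2N gives ||y - Xb||^2 + N*lam*||b||^2,
# i.e. sklearn's Ridge objective with alpha = N * lam.
b_sklearn = Ridge(alpha=N * lam, fit_intercept=False).fit(X, y).coef_

print(np.allclose(b_glmnet_style, b_sklearn))  # True
```

This only checks the objective-function algebra; coefficients from an actual glmnet fit will still differ unless standardize=FALSE is set, because glmnet rescales the columns of x before fitting.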