Is Ridge more robust than Lasso for feature selection?

Tags: feature selection, lasso, linear model, regularization, ridge regression

My goal is to identify the best n-feature linear model, i.e. pick the model that uses only n of the N available features (n < N) and has the lowest mean squared error (MSE). The experiment uses the Lasso and Ridge regressions from sklearn.linear_model. I thought Lasso would do a better job, since it tends to eliminate redundant variables, but it turns out to be the opposite in the following experiment.

Here is what I did:

  1. Generated some data where the dependent variable is a linear combination of some of the other variables, i.e. $Y = 0.1X_5 + 0.2X_6 + 0.3X_7 + 0.4X_8 + 0.5X_9$, while $X_1$ to $X_4$ are constructed to be correlated with $Y$.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, Ridge
# Y is an exact (noise-free) linear combination of X5..X9
df = pd.DataFrame(np.random.randint(0, 100, size = [100, 10]))
df.columns = ['X' + str(i) for i in range(1, df.shape[1])] + ['Y'] 
df['Y'] = (df[['X' + str(i) for i in [5, 6, 7, 8, 9]]] * [0.1, 0.2, 0.3, 0.4, 0.5]).sum(axis = 1)
# X1..X4 are noisy proxies of Y, hence highly correlated with it
df['X1'] = 0.5 * df.Y + np.random.randn(df.shape[0])
df['X2'] = 0.6 * df.Y + np.random.randn(df.shape[0])
df['X3'] = 0.7 * df.Y + np.random.randn(df.shape[0])
df['X4'] = 0.8 * df.Y + np.random.randn(df.shape[0])
X, y = df.drop(columns='Y'), df.Y
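As a quick sanity check on this setup (my own addition, not in the original question), one can look at the marginal correlation of each feature with Y:

# Sanity check: X1..X4 are almost perfectly correlated with Y because they
# are built from Y plus only unit-variance noise; X5..X9 are individually
# much less correlated.
print(X.corrwith(y).round(3))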
  2. Used Lasso and Ridge models to select the best 5 features based on importance weights. For Lasso and Ridge, I believe the SelectFromModel function uses the absolute values of the coefficients as importance weights.
# Lasso model selection
las_model = SelectFromModel(estimator = Lasso(), max_features = 5).fit(X, y)
print("Lasso Model")
print(pd.DataFrame({'feature':df.columns[:-1],
                    'coef'   :abs(las_model.estimator_.coef_)}).sort_values('coef',ascending=False)[:5].sort_index())
print("R-squared = " + str(sm.OLS(df.Y, sm.add_constant(las_model.transform(X))).fit().rsquared.round(2)))

Lasso missed X5, picked up X4 instead, and the coefficients are off by a wide margin:

Lasso Model
  feature      coef
3      X4  0.327149
5      X6  0.129456
6      X7  0.197254
7      X8  0.262760
8      X9  0.328964
R-squared = 1.0
# Ridge model selection
rid_model = SelectFromModel(estimator = Ridge(), max_features = 5).fit(X, y)
print("Ridge Model")
print(pd.DataFrame({'feature':df.columns[:-1],
                    'coef'   :abs(rid_model.estimator_.coef_)}).sort_values('coef',ascending=False)[:5].sort_index())
print("R-squared = " + str(sm.OLS(df.Y, sm.add_constant(rid_model.transform(X))).fit().rsquared.round(2)))

The Ridge model shows decent results:

Ridge Model
  feature      coef
4      X5  0.098881
5      X6  0.197748
6      X7  0.296673
7      X8  0.395516
8      X9  0.494395
R-squared = 0.98

Note that the default alpha for both Lasso and Ridge in sklearn is 1. Lowering alpha for Lasso doesn't seem to help; in the extreme case of alpha = 0, the result is still not correct.
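A sketch of how that extreme case might be reproduced (my own reconstruction, assuming the same X and y as above; note that sklearn warns that alpha = 0 with Lasso is numerically ill-advised and suggests plain LinearRegression instead):

# Sketch only: selection with the L1 penalty effectively switched off.
las_model0 = SelectFromModel(estimator = Lasso(alpha = 0.0), max_features = 5).fit(X, y)
print("Lasso Model")
print(pd.DataFrame({'feature':X.columns,
                    'coef'   :abs(las_model0.estimator_.coef_)}).sort_values('coef',ascending=False)[:5].sort_index())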

Lasso Model
  feature      coef
3      X4  0.199773
5      X6  0.160934
6      X7  0.243271
7      X8  0.323027
8      X9  0.404394

On the other hand, the Ridge results are surprisingly robust across a range of alphas, from 0 to 10, as well as across data of different scales.
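A sketch of the kind of check behind this claim (the loop and the alpha grid are my own, not the original code): rank the features by |coef| under Ridge for several alphas and see which five come out on top.

# Sketch only: which 5 features get the largest |coef| under Ridge
# across a range of alphas (the grid below is illustrative).
for a in [0.01, 0.1, 1, 10]:
    coefs = pd.Series(abs(Ridge(alpha = a).fit(X, y).coef_), index = X.columns)
    print(f"alpha={a}: {coefs.nlargest(5).index.tolist()}")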

My questions are:

  1. Why is Ridge regression so robust at selecting the best n-feature model, while Lasso isn't?
  2. Is Ridge regression always more robust for this kind of task? If not, under what conditions will Lasso outperform it?

Best Answer

Greedy selection

I am not sure about the details of the Python functions, but there is probably a difference in the algorithms:

  • Lasso uses a greedy selection method that builds the model up from zero (like Least Angle Regression), and it selects $X_4$ first because it has the highest correlation with $Y$ and therefore initially explains $Y$ best (the path sketch after this list illustrates the entry order).

  • Ridge regression will fit all the variables and then select the best ones according to some criterion.

    I believe that SelectFromModel picks out the variables with the highest coefficients. In that case Ridge regression is not sensitive to $X_4$, which will have a smaller coefficient (this happens especially because your model has no noise).

If you were to select fewer than 4 features, then Lasso may arguably do better. E.g. if you select only 1 feature, then Lasso picks $X_4$, which is noisy, while Ridge picks $X_9$, which by itself explains only part of the variance in $Y$.
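To illustrate the greedy-entry idea, one can look at the order in which features become active along the lasso path. This sketch is my own (the answer contains no such code) and assumes the X, y from the question; it standardizes the columns, since lasso_path fits no intercept, and with standardized features the first variable to enter is the one with the largest absolute correlation with $Y$, which here is one of the noisy proxies of $Y$ (usually $X_4$, the proxy with the largest coefficient on $Y$).

import numpy as np
from sklearn.linear_model import lasso_path

# Standardize X and center y; lasso_path does not fit an intercept.
Xs = (X - X.mean()) / X.std()
yc = y - y.mean()
alphas, coefs, _ = lasso_path(Xs, yc)   # coefs has shape (n_features, n_alphas)

# Entry point of a feature = the largest alpha at which its coefficient is nonzero.
entry = {}
for i, col in enumerate(X.columns):
    nz = np.flatnonzero(coefs[i])
    if nz.size:
        entry[col] = alphas[nz].max()
print(sorted(entry, key=entry.get, reverse=True))   # features in order of entry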


Alpha level and bad convergence

Note that the default alpha for both Lasso and Ridge in sklearn is 1. Lowering alpha for Lasso doesn't seem to help. In an extreme case, when setting alpha = 0, the result is still not correct.

The result is still not correct because the model does not converge well. If you increase the number of iterations, then the selection with lasso gives the same result as the selection with ridge.

Apparently SelectFromModel does feature selection with lasso in the same way: it uses the coefficient values rather than a separate feature importance. When we check, we see that, indeed, the Lasso estimator has no feature_importances_ attribute, and the problem is not the greedy selection but the default alpha = 1 being a bad choice.

# Lasso model selection: with the penalty switched off (alpha = 0) and enough
# iterations to converge, the lasso-based selection matches the ridge-based one
las_model = SelectFromModel(estimator = Lasso(alpha = 0.0, max_iter=10000), max_features = 5).fit(X, y)
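To verify the claim about how SelectFromModel ranks features, one can check directly (a small sketch of my own, not from the answer): Lasso exposes no feature_importances_ attribute, so SelectFromModel falls back to the absolute values of coef_.

# Sketch only: Lasso has no feature_importances_, so |coef_| is what gets ranked.
fitted = Lasso().fit(X, y)
print(hasattr(fitted, "feature_importances_"))   # False
print(abs(fitted.coef_))                         # the importances SelectFromModel uses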

So part of the problem is that SelectFromModel is used here with the default alpha = 1, which is arbitrary and may introduce too much bias. The bias is that large coefficient values are penalized, so it becomes beneficial to use $X_4$ instead of the variables $X_5, X_6, X_7, X_8, X_9$.

We can make this bias appear with Ridge as well: with alpha = 100 you get something like

Ridge Model
  feature      coef
2      X3  0.147403
3      X4  0.144151
6      X7  0.200929
7      X8  0.268979
8      X9  0.335139

The 0.14 values for $X_3$ and $X_4$ (and also the nonzero values for $X_1$ and $X_2$, which are fitted but not selected) make the L2 norm of the coefficient vector smaller.
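A toy calculation (my own numbers, not taken from the fitted model) makes the mechanism concrete: if two predictors are perfectly correlated, replacing a single coefficient of 0.5 by 0.25 on each leaves the fit unchanged but shrinks the ridge penalty, since $0.25^2 + 0.25^2 = 0.125 < 0.25 = 0.5^2$. At large alpha the same logic makes it attractive to shift weight from $X_5, \dots, X_9$ onto the correlated proxies $X_1, \dots, X_4$.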

Comparing cross-validation methods

On the other hand, the Ridge results are surprisingly robust across a range of alphas, from 0 to 10, as well as across data of different scales.

If you do not have an objective method to choose alpha, then the choice of the range of alpha is arbitrary. The penalty terms for ridge regression and lasso regression are different, so you cannot really compare the alpha values directly. What should matter is how these models perform when you have some algorithm that optimizes the penalty term, e.g. when you use cross-validation to select the value of alpha.

You can use LassoCV and RidgeCV to let the models choose the penalty themselves (a sketch of such a setup follows after the output below). Now lasso performs less badly. You will get something like

Lasso Model
  feature      coef
0      X1  0.000000
1      X2  0.000000
2      X3  0.054414
3      X4  0.125836
4      X5  0.085088
5      X6  0.171051
6      X7  0.258965
7      X8  0.342983
8      X9  0.429946
Ridge Model
  feature      coef
0      X1  0.000275
1      X2  0.000284
2      X3  0.000373
3      X4  0.000504
4      X5  0.099901
5      X6  0.199803
6      X7  0.299712
7      X8  0.399611
8      X9  0.499513

The ridge regression is still doing better because it adds the $X_1, X_2, X_3, X_4$ variables differently than the lasso regression does. With ridge all of these variables increase together, while with lasso only a few of them are increased. This has an additional regularizing effect: the four features in ridge are less prone to correlating with noise than the two features in lasso.
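For reference, a sketch of how such a CV-based selection could be set up (my own reconstruction, assuming the same X and y as in the question; the answer does not show its exact code):

# Sketch only: let cross-validation choose the penalty, then inspect the coefficients.
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV, RidgeCV

las_cv = SelectFromModel(estimator = LassoCV(max_iter=10000), max_features = 5).fit(X, y)
print("Lasso Model")
print(pd.DataFrame({'feature':X.columns, 'coef':abs(las_cv.estimator_.coef_)}))

rid_cv = SelectFromModel(estimator = RidgeCV(), max_features = 5).fit(X, y)
print("Ridge Model")
print(pd.DataFrame({'feature':X.columns, 'coef':abs(rid_cv.estimator_.coef_)}))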


Example based on comment by Sycorax.

Even if you normalize the features, you can still run into this problem. An example is when the observed $Y$ depends on some contrast

$Y = X_1 - X_2$

and this contrast equals some other variable, up to a tiny bit of noise,

$X_3 = X_1 - X_2 + \epsilon$

So this $X_3$ is noisy and not the exact model, but... it will be able to capture the relationship with a lower penalty.

import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
from sklearn.linear_model import RidgeCV

### dataframe
df = pd.DataFrame(np.random.randint(0, 100, size = [100, 4]))
df.columns = ['X1', 'X2', 'X3', 'Y'] 


### model with no noise
df['Y'] = df['X1'] - df['X2']   
### a single variable that captures the model, but with noise
df['X3'] = df['Y'] + 0.1 * np.random.randn(df.shape[0])  
### normalize columns
df=(df-df.mean())/df.std()

X, y = df.drop(columns='Y'), df.Y

# Lasso model selection
las_model = SelectFromModel(estimator = LassoCV(max_iter=10000)).fit(X, y)
print("Lasso Model")
print(pd.DataFrame({'feature':df.columns[:-1],
                    'coef'   :abs(las_model.estimator_.coef_)}).sort_values('coef',ascending=False)[:9].sort_index())

# Ridge model selection
rid_model = SelectFromModel(estimator = RidgeCV()).fit(X, y)
print("Ridge Model")
print(pd.DataFrame({'feature':df.columns[:-1],
                    'coef'   :abs(rid_model.estimator_.coef_)}).sort_values('coef',ascending=False)[:5].sort_index())

Output:

Lasso Model
  feature      coef
0      X1  0.000000
1      X2  0.000000
2      X3  0.998997
Ridge Model
  feature      coef
0      X1  0.356942
1      X2  0.349096
2      X3  0.545204