Ridge Regression Lambda – Why Doesn't $\lambda=1$ in Ridge Regression?


Take traditional Ridge regression,

$$ Y_i = \sum_{j=0}^m \beta_{j} X_{i,j} + \epsilon_i $$

we minimize

$$ L_{ridge} = \lambda||\beta||_2^2 + ||\epsilon||_2^2 $$

where $\lambda$ is the regularization penalty.
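
For concreteness, here is a minimal numpy sketch (toy data, illustrative names only) of this objective and its closed-form minimizer $\hat\beta = (X^\top X + \lambda I)^{-1}X^\top Y$:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form minimizer of lam*||beta||^2 + ||y - X @ beta||^2."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Toy standardized data, purely for illustration
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(100)

for lam in [0.0, 1.0, 10.0]:
    print(lam, ridge_fit(X, y, lam))  # larger lam shrinks the coefficients toward 0
```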

Suppose instead we wrote our model as

$$ Y_i = \sum_{j=0}^m \beta_j X_{i,j} + \sum_{j=1}^n \beta_{m+j} I_{i,j} $$

where $I_{i,j}=1$ if $j=i$ and $0$ otherwise. In other words, the errors become additional parameters, one for each observation. Now we minimize

$$ L_{ridge} = \lambda||\beta_{1..m}||_2^2 + ||\beta_{m+1..m+n}||_2^2 $$

In this case, assuming standardized coefficients, shouldn't these "error"/residual parameters be treated like any other parameter, i.e. $\lambda = 1$? That would give just

$$ L_{ridge} = ||\beta_{1..m+n}||_2^2 $$
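
As a quick numpy check (toy data; I write $\gamma$ for the residual parameters $\beta_{m+1},\dots,\beta_{m+n}$): minimizing $\lambda||\beta||_2^2 + ||\gamma||_2^2$ subject to the exact-fit constraint $Y = X\beta + \gamma$ is a weighted minimum-norm problem whose $\beta$-part matches the closed-form ridge estimate for any $\lambda$, not just $\lambda=1$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 50, 4
X = rng.standard_normal((n, m))
y = X @ rng.standard_normal(m) + rng.standard_normal(n)

lam = 2.5  # any penalty, not just 1

# Standard ridge: minimize lam*||beta||^2 + ||y - X beta||^2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

# Reparametrized problem: theta = [beta, gamma] with y = X beta + gamma exactly,
# minimizing lam*||beta||^2 + ||gamma||^2.  Weighted minimum-norm solution:
#   theta = W^{-1} A^T (A W^{-1} A^T)^{-1} y,  A = [X | I],  W = diag(lam*I_m, I_n)
A = np.hstack([X, np.eye(n)])
W_inv = np.diag(np.concatenate([np.full(m, 1.0 / lam), np.ones(n)]))
theta = W_inv @ A.T @ np.linalg.solve(A @ W_inv @ A.T, y)
beta_reparam, gamma = theta[:m], theta[m:]

print(np.allclose(beta_reparam, beta_ridge))     # True: same coefficients as ridge
print(np.allclose(gamma, y - X @ beta_reparam))  # True: gamma are exactly the residuals
```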

I have seen this answer here, but if the data are standardized, I don't see why these error parameters should get a different weight. (Maybe they need to be standardized too.)

(Equivalent question for Lasso with Least Absolute Deviations.)

Best Answer

$\lambda$ doesn't equal $1$ because it is defined as $\sigma^2/\tau^2$, where $\sigma$ is the standard deviation of the observation noise and $\tau$ is the standard deviation of the prior on the coefficients $\beta_j$. Remember, the point of ridge regression is to penalize coefficients that become too large in magnitude, i.e. a prior $p(\beta)=\prod_j\mathcal N(\beta_j\mid 0,\tau^2)$ is placed on the coefficients. (Maximizing the resulting posterior is the same as minimizing $||Y-X\beta||^2/(2\sigma^2)+||\beta||^2/(2\tau^2)$; multiplying through by $2\sigma^2$ gives the ridge objective with $\lambda=\sigma^2/\tau^2$.) In general, placing a Gaussian prior on a model's parameters to encourage them to be small is called $\ell_2$ regularization or weight decay.

$\lambda$ is greater than or equal to $0$, with larger values corresponding to larger prior precision $1/\tau^2$. Since the prior on $\beta_j$ has mean $0$, this pulls the coefficients toward $0$, giving them smaller magnitudes.
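
To illustrate the correspondence, here is a small numpy/scipy sketch (the values of $\sigma$ and $\tau$ are made up): the minimizer of the negative log posterior under the Gaussian prior coincides with the ridge estimate at $\lambda=\sigma^2/\tau^2$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, m = 200, 3
sigma, tau = 2.0, 0.5                      # noise sd and prior sd (illustrative)
X = rng.standard_normal((n, m))
beta_true = rng.normal(0.0, tau, size=m)   # coefficients drawn from the prior
y = X @ beta_true + rng.normal(0.0, sigma, size=n)

# Negative log posterior, up to additive constants:
#   ||y - X b||^2 / (2 sigma^2) + ||b||^2 / (2 tau^2)
def neg_log_post(b):
    return np.sum((y - X @ b) ** 2) / (2 * sigma**2) + np.sum(b**2) / (2 * tau**2)

beta_map = minimize(neg_log_post, np.zeros(m)).x

# Ridge with lambda = sigma^2 / tau^2 recovers the same estimate
lam = sigma**2 / tau**2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

print(np.allclose(beta_map, beta_ridge, atol=1e-5))  # True
```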