Take traditional Ridge regression,
$$ Y_i = \sum_{j=0}^m \beta_{j} X_{i,j} + \epsilon_i $$
we estimate
$$ \hat{\beta}_{ridge} = \arg\min_{\beta}\left(\lambda\|\beta\|_2^2 + \|\epsilon\|^2\right) $$
where $\lambda$ is the regularization penalty.
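(For reference, setting the gradient of the penalized loss to zero gives the familiar closed form, writing $X$ for the design matrix:
$$ \nabla_\beta\left(\lambda\|\beta\|_2^2 + \|Y - X\beta\|^2\right) = 2\lambda\beta - 2X^\top(Y - X\beta) = 0 \;\Longrightarrow\; \hat{\beta} = (X^\top X + \lambda I)^{-1}X^\top Y. $$)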
Suppose instead we wrote our model as
$$ Y_i = \sum_{j=0}^m \beta_j X_{i,j} + \sum_{j=1}^n \beta_{m+j} I_j $$
where $I_j=1$ if $j=i$ and 0 otherwise. In other words, the errors become additional parameters, one for each observation. Now we minimize
$$ \hat{\beta}_{ridge} = \arg\min_{\beta}\left(\lambda\|\beta_{1..m}\|_2^2 + \|\beta_{m+1..m+n}\|_2^2\right) $$
In this case, assuming standardized coefficients, shouldn't these "error"/residual parameters be treated like any other, i.e. $\lambda = 1$? Then we would simply have
$$ \hat{\beta}_{ridge} = \arg\min_{\beta}\|\beta_{1..m+n}\|_2^2 $$
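The augmented formulation can be checked numerically. The design matrix becomes $Z = [X, I_n]$, which has full row rank, so the fit is exact; minimizing the weighted penalty subject to $Z\gamma = Y$ recovers the ordinary ridge solution, and the extra parameters come out as the ridge residuals. A minimal NumPy sketch (the variable names here are illustrative, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, lam = 50, 3, 2.0          # observations, predictors, ridge penalty
X = rng.normal(size=(n, m))
y = rng.normal(size=n)

# Standard ridge closed form: (X'X + lam*I)^{-1} X'y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

# Augmented model: Z = [X, I_n], one extra "error" parameter per observation.
# Minimize lam*||beta||^2 + ||beta_err||^2 subject to the exact fit Z @ gamma = y.
# The minimum-weighted-norm solution is gamma = D^{-1} Z' (Z D^{-1} Z')^{-1} y,
# where D holds the per-parameter penalty weights (lam, ..., lam, 1, ..., 1).
Z = np.hstack([X, np.eye(n)])
D_inv = np.diag(np.r_[np.full(m, 1.0 / lam), np.ones(n)])
gamma = D_inv @ Z.T @ np.linalg.solve(Z @ D_inv @ Z.T, y)

beta_aug, err_aug = gamma[:m], gamma[m:]
print(np.allclose(beta_aug, beta_ridge))         # coefficients match ridge
print(np.allclose(err_aug, y - X @ beta_ridge))  # "error" params are the residuals
```

The agreement follows from the push-through identity $(X^\top X + \lambda I)^{-1}X^\top = X^\top(XX^\top + \lambda I)^{-1}$.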
I have seen this answer here, but if the data is standardized, I don't see why these error parameters should be weighted differently. (Maybe they need to be standardized too.)
(The equivalent question applies to the Lasso with Least Absolute Deviations.)
Best Answer
$\lambda$ doesn't equal $1$ because it is defined as $\sigma^2/\tau^2$, where $\sigma^2$ is the observation noise variance and $\tau^2$ is the variance of the prior on the coefficients $\beta_j$. Remember that the point of ridge regression is to penalize the coefficients for becoming too large in magnitude, i.e. a prior $p(\beta)=\prod_j\mathcal N(\beta_j \mid 0,\tau^2)$ is placed on the coefficients. In general, placing a Gaussian prior on the parameters of a model to encourage them to be small is called $\ell_2$ regularization or weight decay.
$\lambda$ is greater than or equal to $0$, with larger values corresponding to larger prior precision $1/\tau^2$. Since the prior on $\beta_j$ has mean $0$, this pulls the coefficients toward $0$, shrinking their magnitudes.
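To spell out where $\lambda = \sigma^2/\tau^2$ comes from: with likelihood $Y_i \sim \mathcal N\!\left(\sum_j \beta_j X_{i,j},\, \sigma^2\right)$ and prior $\beta_j \sim \mathcal N(0, \tau^2)$, the negative log posterior is, up to constants,
$$ -\log p(\beta \mid Y) = \frac{1}{2\sigma^2}\|Y - X\beta\|^2 + \frac{1}{2\tau^2}\|\beta\|_2^2 + \text{const}. $$
Multiplying through by $2\sigma^2$ gives the ridge objective $\|Y - X\beta\|^2 + \frac{\sigma^2}{\tau^2}\|\beta\|_2^2$, so treating the error parameters exactly like the coefficients ($\lambda = 1$) is only correct in the special case $\tau = \sigma$.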