Machine Learning – Does the L2 Regularization in Ridge Regression Penalize the Intercept? If Not, How to Solve Its Derivative?

machine learning, ridge regression

I am new to ML. I was told that the L2 regularization in ridge regression does not penalize the intercept $\theta_{0}$, as in the cost function:
$$
J(\theta)=\frac{1}{2}\sum_{i=1}^{m}(h_{\vec \theta}(x^{(i)})-y^{(i)})^2+\frac{\lambda}{2}\sum_{j=1}^{n}{\theta_{j}^{2}}
$$
The L2 penalty term $\frac{\lambda}{2}\sum_{j=1}^{n}{\theta_{j}^{2}}$ only sums from $j=1$ to $n$, not from $j=0$ to $n$. I also read that:

in most cases (all cases?), you're better off not regularizing $\theta_{0}$,
since it's unlikely to reduce overfitting and shrinks the space of
representable functions

which comes from user48956's answer to Why is a zero-intercept linear regression model predicts better than a model with an intercept?

I am confused about how to take the derivative of the cost function, which in matrix form is:
$$
J(\theta)=\frac{1}{2}(X\theta-Y)^{T}(X\theta-Y)+\frac{\lambda}{2}(\theta^{'})^{T}\theta^{'},
$$
where $\theta^{'}=\left[
\begin{matrix}
\theta_{1} \\
\theta_{2} \\
\vdots \\
\theta_{n}
\end{matrix}
\right]$, $\theta=\left[
\begin{matrix}
\theta_{0} \\
\theta_{1} \\
\vdots \\
\theta_{n}
\end{matrix}
\right]$, and $X=\left[
\begin{matrix}
1 & X_{1}^{(1)} & X_{2}^{(1)} & \cdots & X_{n}^{(1)} \\
1 & X_{1}^{(2)} & X_{2}^{(2)} & \cdots & X_{n}^{(2)} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & X_{1}^{(m)} & X_{2}^{(m)} & \cdots & X_{n}^{(m)}
\end{matrix}
\right]$.

$\theta^{'}$ and $\theta$ are different vectors, so from my point of view they cannot be mixed freely, and the derivative is taken with respect to $\theta$, which contains $\theta^{'}$. After googling and browsing the questions on this forum, I still cannot see how to arrive at the solution:
$$
\theta=(X^TX+\lambda I)^{-1}X^TY
$$
Can anybody give me a clue? Thanks in advance for your help!
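For what it's worth, here is a small numeric sketch of exactly the mismatch I mean (my own throwaway illustration on random data, not from any of the sources above). It computes the candidate $\theta=(X^TX+\lambda I)^{-1}X^TY$ with the all-ones column included, then evaluates the gradient of the cost that leaves $\theta_0$ out of the penalty; the gradient turns out to be nonzero only in the $\theta_0$ component, so that candidate does not minimize the intercept-unpenalized cost:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])   # design matrix with the all-ones column
y = 4.0 + rng.normal(size=m)                                 # targets with a clearly nonzero mean
lam = 2.0

# candidate from the well-known formula, which penalizes ALL coefficients (theta_0 included)
theta_hat = np.linalg.solve(X.T @ X + lam * np.eye(n + 1), X.T @ y)

# gradient of the cost that leaves theta_0 out of the penalty:
#   J(theta) = (1/2)*||X @ theta - y||^2 + (lam/2) * sum_{j>=1} theta_j^2
D = np.eye(n + 1)
D[0, 0] = 0.0                                                # no penalty on theta_0
grad = X.T @ (X @ theta_hat - y) + lam * D @ theta_hat

print(grad)   # only the theta_0 entry is noticeably nonzero, so theta_hat does not minimize this cost
```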

However, I think there are two quick fixes to this problem:

First, we do not add the all-ones column to $X$ at all. Namely $X=\left[
\begin{matrix}
X_{1}^{(1)} & X_{2}^{(1)} & \cdots & X_{n}^{(1)} \\
X_{1}^{(2)} & X_{2}^{(2)} & \cdots & X_{n}^{(2)} \\
\vdots & \vdots & \ddots & \vdots \\
X_{1}^{(m)} & X_{2}^{(m)} & \cdots & X_{n}^{(m)}
\end{matrix}
\right]$. That is to say, we do not include an intercept in the model at all:
$$
y=\theta_{1}X_{1}+\theta_{2}X_{2}+\cdots+\theta_{n}X_{n}.
$$
I believe this method is adopted in the classic book Machine Learning in Action by Peter Harrington, which I am currently reading. In its implementation of ridge regression (pp. 166 and 177, if you also have the book), none of the $X$ matrices passed to ridge regression contain the all-ones column, so no intercept is fitted at all.
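For concreteness, here is a minimal NumPy sketch of that first quick fix (my own code, not the book's; `ridge_no_intercept` is just a name I made up, and the data are synthetic): ridge regression on an $X$ with no all-ones column, using the closed form above, so no intercept is fitted.

```python
import numpy as np

def ridge_no_intercept(X, y, lam):
    """Closed-form ridge solution for a model with no intercept term.

    Minimizes (1/2)*||X @ theta - y||^2 + (lam/2)*||theta||^2,
    where X deliberately has NO all-ones column (quick fix #1).
    """
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# toy usage on random data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=50)
print(ridge_no_intercept(X, y, lam=1.0))
```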

Secondly, in practice the intercept sometimes does get penalized anyway. For example:

scikit's logistic regression regularizes the intercept by default.

which once again comes from user48956's answer to Why is a zero-intercept linear regression model predicts better than a model with an intercept?

Both quick fixes lead to the solution
$$
\theta=(X^TX+\lambda I)^{-1}X^TY.
$$
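And here is a small sketch of the second quick fix (again my own illustration on synthetic data, not scikit-learn's code): keep the all-ones column in $X$ and apply the same formula, so $\theta_0$ gets shrunk along with everything else. As $\lambda$ grows, the fitted intercept is pulled toward zero even though the true intercept here is 5.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3
X = rng.normal(size=(m, n))
y = 5.0 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=m)

Xb = np.hstack([np.ones((m, 1)), X])          # quick fix #2: keep the all-ones column

for lam in [0.0, 10.0, 1000.0]:
    theta = np.linalg.solve(Xb.T @ Xb + lam * np.eye(n + 1), Xb.T @ y)
    print(lam, theta[0])                      # the fitted intercept theta[0] shrinks toward 0
```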

So can the derivative of the ridge regression cost with an unpenalized intercept actually be solved, or does it have to be handled by quick fixes like these?

Best Answer

The Elements of Statistical Learning by Hastie et al. points out on p. 63 that:

the intercept $\theta_{0}$ has been left out of the penalty term

Furthermore, it says:

The ridge solutions are not equivariant under scaling of the inputs, and so one normally standardizes the inputs before solving (3.41) (3.41 is the cost function). It can be shown (Exercise 3.5) that the solution to (3.41) can be separated into two parts, after reparametrization using centered inputs: each $X_{j}^{(i)}$ gets replaced by $X_{j}^{(i)}-\overline{X_{j}}$. We estimate $\theta_{0}$ by $\overline{y}=\frac{1}{m}\sum_{i=1}^{m}y^{(i)}$. The remaining coefficients get estimated by a ridge regression without intercept, using the centered $X_{j}^{(i)}$. Henceforth we assume that this centering has been done, so that the input matrix $X$ has $n$ (rather than $n + 1$) columns.

I do wonder, though, why The Elements of Statistical Learning first suggests standardizing the features but then only centers them; perhaps this is to agree with Exercise 3.5, which uses centering only.

Anyway, I believe it is right to apply z-score standardization to the features. So I will now try to work out the derivative of the ridge regression cost function, following the suggestion of the commenter amoeba above. Many thanks to them!

First, the cost function:
$$
J(\theta)=\frac{1}{2}\sum_{i=1}^{m}\Big(y^{(i)}-\theta_{0}-\frac{X_{1}^{(i)}-\overline{X_1}}{\sigma_1}\theta_1-\frac{X_{2}^{(i)}-\overline{X_2}}{\sigma_2}\theta_2-\cdots-\frac{X_{n}^{(i)}-\overline{X_n}}{\sigma_n}\theta_n\Big)^2+\frac{\lambda}{2}\sum_{j=1}^{n}{\theta_{j}^{2}},
$$
where $\overline{X_j}$ is the mean of attribute $X_{j}$ and $\sigma_j$ is its standard deviation. More compactly:
$$
J(\theta)=\frac{1}{2}\sum_{i=1}^{m}\Big(y^{(i)}-\theta_{0}-\sum_{j=1}^{n}\frac{X_j^{(i)}-\overline{X_j}}{\sigma_{j}}\theta_j\Big)^2+\frac{\lambda}{2}\sum_{j=1}^{n}{\theta_{j}^{2}}.
$$
Now we first find $\theta_0$ by setting the derivative with respect to $\theta_0$ equal to zero. Since the penalty $\frac{\lambda}{2}\sum_{j=1}^{n}{\theta_{j}^{2}}$ does not involve $\theta_{0}$, we get:
$$
\nabla_{\theta_0}J(\theta)=-\sum_{i=1}^{m}\Big(y^{(i)}-\theta_{0}-\sum_{j=1}^{n}\frac{X_j^{(i)}-\overline{X_j}}{\sigma_{j}}\theta_j\Big)=0.
$$
That is:
$$
\sum_{i=1}^{m}(y^{(i)}-\theta_{0})-\sum_{i=1}^{m}\sum_{j=1}^{n}\frac{X_j^{(i)}-\overline{X_j}}{\sigma_{j}}\theta_j=0.
$$
Because $\overline{X_j}$ is the mean of attribute $X_{j}$, we have $\sum_{i=1}^{m}(X_j^{(i)}-\overline{X_j})=0$ for every $j$, so
$$
\sum_{i=1}^{m}\sum_{j=1}^{n}\frac{X_j^{(i)}-\overline{X_j}}{\sigma_{j}}\theta_j=\sum_{j=1}^{n}\frac{\theta_j}{\sigma_j}\sum_{i=1}^{m}(X_j^{(i)}-\overline{X_j})=0.
$$
We are left with
$$
\sum_{i=1}^{m}(y^{(i)}-\theta_{0})=0,
$$
and therefore
$$
\theta_0=\overline{y}=\frac{1}{m}\sum_{i=1}^{m}y^{(i)}.
$$
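As a quick sanity check of that last step (a throwaway numeric experiment on random data; the variable names are my own), the derivative with respect to $\theta_0$ vanishes at $\theta_0=\overline{y}$ no matter what $\theta_1,\dots,\theta_n$ are, precisely because the standardized columns sum to zero:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 200, 4
X = rng.normal(loc=3.0, scale=2.0, size=(m, n))   # raw features, deliberately not centered
y = rng.normal(size=m)

Xs = (X - X.mean(axis=0)) / X.std(axis=0)         # z-score standardized features

theta = rng.normal(size=n)                        # arbitrary theta_1..theta_n
theta0 = y.mean()                                 # candidate intercept: the mean of y

# dJ/dtheta_0 = -sum_i ( y_i - theta_0 - Xs[i] @ theta ); should be ~0 at theta_0 = mean(y)
grad_theta0 = -np.sum(y - theta0 - Xs @ theta)
print(grad_theta0)                                # ~0 up to floating-point error
```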

So the intercept of feature-standardized ridge regression is always $\overline{y}$. Hence, if we first center $Y$ by subtracting its mean (getting $(y^{(i)})^{'}$ for data example $i$), do not include the all-ones column in $X$, and standardize the features of $X$ (getting $(X_j^{(i)})^{'}$ for $X_{j}$ of data example $i$), the cost function becomes
$$
J(\theta)=\frac{1}{2}\sum_{i=1}^{m}\Big((y^{(i)})^{'}-\sum_{j=1}^{n}(X_j^{(i)})^{'}\theta_j\Big)^2+\frac{\lambda}{2}\sum_{j=1}^{n}{\theta_{j}^{2}},
$$
or, in matrix form,
$$
J(\theta)=\frac{1}{2}(X^{'}\theta-Y^{'})^{T}(X^{'}\theta-Y^{'})+\frac{\lambda}{2}\theta^{T}\theta,
$$
where $\theta=\left[
\begin{matrix}
\theta_1 \\
\theta_2 \\
\vdots \\
\theta_n
\end{matrix}
\right]$, $X^{'}$ is the standardized version of $X$ without the all-ones column, and $Y^{'}$ is the centered version of $Y$. Now $\theta$ (without $\theta_0$) can be solved with
$$
\theta=((X^{'})^TX^{'}+\lambda I)^{-1}(X^{'})^TY^{'}.
$$
For standardized features, the linear model is
$$
y=\overline{y}+\theta{_1}X_1^{'}+\theta{_2}X_2^{'}+\cdots+\theta{_n}X_n^{'},\tag{1}
$$
where
$$
X_j^{'}=\frac{X_{j}-\overline{X_j}}{\sigma_j}.\tag{2}
$$
Substituting (2) into (1), as suggested in the answer of Plasty Grove, gives the linear model for the original input data:
$$
y=\overline{y}+\frac{X_{1}-\overline{X_1}}{\sigma_1}\theta_1+\frac{X_{2}-\overline{X_2}}{\sigma_2}\theta_2+\cdots+\frac{X_{n}-\overline{X_n}}{\sigma_n}\theta_n,
$$
that is,
$$
y=\frac{\theta_1}{\sigma_1}X_1+\frac{\theta_2}{\sigma_2}X_2+\cdots+\frac{\theta_n}{\sigma_n}X_n+\overline{y}-\frac{\overline{X_1}}{\sigma_1}\theta_1-\frac{\overline{X_2}}{\sigma_2}\theta_2-\cdots-\frac{\overline{X_n}}{\sigma_n}\theta_n.
$$
That is why, after solving for the coefficients of the standardized features, we must report $\theta_j/\sigma_j$ as the coefficient of the original (unstandardized) feature $X_j$, and $\overline{y}-\sum_{j=1}^{n}\frac{\overline{X_j}}{\sigma_j}\theta_j$ as the intercept.
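Putting the whole recipe together, here is a short NumPy sketch (my own, on synthetic data) that standardizes $X$, centers $Y$, solves the closed form above, and then converts the result back to coefficients and an intercept for the original, unstandardized features. The final check confirms that the two ways of writing the model give identical predictions.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 150, 3
X = rng.normal(loc=10.0, scale=4.0, size=(m, n))
y = 2.0 + X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.2, size=m)
lam = 5.0

mu, sigma = X.mean(axis=0), X.std(axis=0)
Xs = (X - mu) / sigma                           # standardized features, no all-ones column
yc = y - y.mean()                               # centered targets

# ridge on standardized X and centered y; the intercept is handled separately as mean(y)
theta = np.linalg.solve(Xs.T @ Xs + lam * np.eye(n), Xs.T @ yc)

# back-transform to the original (unstandardized) feature scale
coef_orig = theta / sigma
intercept_orig = y.mean() - np.sum(mu / sigma * theta)

pred_std = y.mean() + Xs @ theta                # model (1) on standardized inputs
pred_orig = intercept_orig + X @ coef_orig      # the same model written for raw inputs
print(np.allclose(pred_std, pred_orig))         # True: both forms agree
```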
