Solved – Need for centering and standardizing data in regression

Tags: lasso, regression, regularization, standardization

Consider linear regression with some regularization:
E.g. find $x$ that minimizes $||Ax - b||^2+\lambda||x||_1$

Usually, the columns of $A$ are standardized to have zero mean and unit norm, while $b$ is centered to have zero mean. I want to check whether my understanding of the reasons for standardizing and centering is correct.
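For concreteness, the preprocessing described above can be sketched in base R as follows. The helper name `standardize_columns` and the simulated `A` and `b` are assumptions made for illustration only. Note that "unit norm" here means unit Euclidean norm; base R's `scale()` divides by the standard deviation instead, which differs by a factor of $\sqrt{n-1}$ for centered columns.

```r
# Hypothetical helper: center each column, then rescale it to unit Euclidean norm.
standardize_columns <- function(A) {
  A_centered <- sweep(A, 2, colMeans(A), "-")
  col_norms <- sqrt(colSums(A_centered^2))
  sweep(A_centered, 2, col_norms, "/")
}

# Simulated data with columns on very different scales (illustrative only).
set.seed(1)
A <- cbind(rnorm(20, mean = 5, sd = 100), rnorm(20, mean = -2, sd = 0.01))
b <- rnorm(20, mean = 3)

A_std <- standardize_columns(A)
b_centered <- b - mean(b)

colMeans(A_std)          # numerically zero
sqrt(colSums(A_std^2))   # each column has norm 1
mean(b_centered)         # numerically zero
```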

By making the means of the columns of $A$ and of $b$ zero, we no longer need an intercept term. Otherwise, the objective would have been $||Ax-x_0\mathbf{1}-b||^2+\lambda||x||_1$, where $\mathbf{1}$ is the all-ones vector. By making the norms of the columns of $A$ equal to 1, we rule out the case where a column of $A$ gets a small coefficient in $x$ merely because it has a very large norm, which might lead us to conclude incorrectly that that column of $A$ doesn't "explain" $x$ well.
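The intercept claim can be checked numerically. The sketch below uses ordinary least squares on simulated data (the data and variable names are assumptions for illustration): a fit with an intercept on raw data and a fit without an intercept on centered data give the same slopes. The same argument carries over to the regularized objective as long as the intercept itself is not penalized, which is assumed here.

```r
set.seed(2)
n <- 50
A <- cbind(rnorm(n, mean = 4), rnorm(n, mean = -1))
b <- as.vector(2 + A %*% c(1.5, -0.5) + rnorm(n, sd = 0.1))

# Fit on raw data with an explicit intercept.
fit_raw <- lm(b ~ A)

# Fit on centered data with the intercept removed ("- 1").
A_c <- scale(A, center = TRUE, scale = FALSE)
b_c <- b - mean(b)
fit_centered <- lm(b_c ~ A_c - 1)

slopes_raw <- unname(coef(fit_raw)[-1])
slopes_centered <- unname(coef(fit_centered))
all.equal(slopes_raw, slopes_centered)

# The dropped intercept can be recovered from the means.
intercept_recovered <- mean(b) - sum(colMeans(A) * slopes_centered)
all.equal(unname(coef(fit_raw)[1]), intercept_recovered)
```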

This reasoning is not exactly rigorous, but intuitively, is that the right way to think?

Best Answer

You are correct about zeroing the means of the columns of $A$ and $b$.

However, as for adjusting the norms of the columns of $A$, consider what would happen if you started out with a column-normalized $A$, and all the elements of $x$ were of roughly the same magnitude. Then let us multiply one column by, say, $10^{-6}$. The corresponding element of $x$ would, in an unregularized regression, be increased by a factor of $10^6$. Now consider what happens to the regularization term: that one coefficient dominates $||x||_1$, so the regularization would, for all practical purposes, apply only to that one coefficient.
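The rescaling effect is easy to reproduce. The following is a minimal sketch on simulated data (all names and numbers are illustrative assumptions): one column is multiplied by $10^{-6}$, the unregularized fit is recomputed, and the share of the $L_1$ norm carried by that one coefficient is measured.

```r
set.seed(3)
n <- 100
A <- scale(matrix(rnorm(n * 3), n, 3))   # three columns on the same scale
b <- as.vector(A %*% c(1, 1, 1) + rnorm(n, sd = 0.1))

# Unregularized least squares, no intercept.
x_before <- unname(coef(lm(b ~ A - 1)))

A_scaled <- A
A_scaled[, 1] <- A_scaled[, 1] * 1e-6    # shrink the first column
x_after <- unname(coef(lm(b ~ A_scaled - 1)))

# First entry is about 1e6, the others about 1.
ratio <- x_after / x_before

# Fraction of the L1 norm of x that comes from the first coefficient.
l1_share <- abs(x_after[1]) / sum(abs(x_after))
```

With the rescaled column, `l1_share` is close to 1, so an $L_1$ penalty on `x_after` is almost entirely a penalty on the first coefficient.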

By norming the columns of $A$, we put them, intuitively speaking, all on the same scale. Consequently, differences in the magnitudes of the elements of $x$ are directly related to the "wiggliness" of the explanatory function ($Ax$), which is, loosely speaking, what the regularization tries to control. Without this normalization, a coefficient value of, e.g., 0.1 vs. another of 10.0 would tell you, in the absence of knowledge about $A$, nothing about which coefficient was contributing the most to the "wiggliness" of $Ax$. (For a linear function, like $Ax$, "wiggliness" is related to deviation from 0.)
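One way to make the "same scale" point concrete: write $a_i$ for the $i$-th column of $A$ (notation introduced here for illustration only). The contribution of that column to $Ax$ is $a_i x_i$, and its size is

$$\|a_i x_i\|_2 = \|a_i\|_2\,|x_i|, \qquad\text{so that}\qquad \|a_i\|_2 = 1 \text{ for all } i \;\Longrightarrow\; \lambda\|x\|_1 = \lambda\sum_i \|a_i x_i\|_2 .$$

With unit-norm columns, the penalty is therefore the sum of the sizes of the individual contributions to $Ax$. Without normalization, $|x_i|$ alone understates or overstates the contribution by the factor $\|a_i\|_2$.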

To return to your explanation, if one column of $A$ has a very high norm, and for some reason gets a low coefficient in $x$, we would not conclude that the column of $A$ doesn't "explain" $x$ well. $A$ doesn't "explain" $x$ at all; the columns of $A$ explain $b$, and $x$ holds their weights.

Related Solutions

Linear Model – Effect of Standardization on Y-Intercept

Your model is:

$$y_j = \text{X}_{j} b + \epsilon_j = \sum_{i=0}^px_{ij}b_i + \epsilon_j$$

Let $b_0$ be the intercept, so that $x_{0j} = 1$ for every $j$:

$$y_j = b_0 + \sum_{i=1}^px_{ij}b_i + \epsilon_j$$

The sample mean of $y$ then becomes:

$$\begin{aligned} \bar y = \frac{1}{n}\sum_{j=1}^n y_j &= \frac{1}{n}\sum_{j=1}^n \left( b_0 + \sum_{i=1}^p x_{ij}b_i + \epsilon_j\right)\\ &= \frac{1}{n}\, n \cdot b_0 + \frac{1}{n}\sum_{j=1}^n \left(\sum_{i=1}^p x_{ij}b_i\right) + \frac{1}{n}\sum_{j=1}^n\epsilon_j\\ &= b_0 + \sum_{i=1}^p \left(\frac{1}{n}\sum_{j=1}^n x_{ij}\right)b_i \end{aligned}$$

The last step uses the fact that least-squares residuals sum to zero when the model includes an intercept, so $\frac{1}{n}\sum_{j=1}^n\epsilon_j = 0$.

As we made the average of each column of $\mathbf X$ equal to $0$, we get:

$$\bar y = b_0$$

If you center $y$ as well (standardizing it does this), then:

$$\bar y = b_0 = 0$$

QED.

Note that this depends only on the centering, not on the scale of the variables.

This can easily be shown in R. Compare the three fits, and in particular compare the intercept of fit2 with mean(y):

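x <- iris$Petal.Width
y <- iris$Petal.Length
fit1 <- lm(y ~ x)                        # raw predictor
fit2 <- lm(y ~ I(scale(x)))              # standardized predictor
fit3 <- lm(I(scale(y)) ~ I(scale(x)))    # standardized response and predictor
mean(y)                                  # compare with the intercept of fit2
coef(fit1)
coef(fit2)
coef(fit3)                               # intercept is numerically zero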

Solved – Non negative least squares with minimal colinearity

I think minimizing $\| Ax -y \|^2 + \lambda x ^\top A^\top Ax$ does make sense. Think of it as just a particular Gaussian prior with a non-diagonal covariance. Then you can vary $\lambda$ (and cross-validate) to achieve different error/feature support tradeoffs.
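As a rough sketch of what this objective does, the R code below solves the unconstrained version on simulated data (the data, `lambda`, and all variable names are assumptions for illustration; the non-negativity constraint from the question title is not imposed here). Setting the gradient to zero gives $(1+\lambda)A^\top A x = A^\top y$, so in this unconstrained case the minimizer is the least-squares solution divided by $1+\lambda$. A generic optimizer is used as a cross-check.

```r
set.seed(4)
n <- 40
A <- matrix(rnorm(n * 3), n, 3)
y <- as.vector(A %*% c(2, 0.5, 1) + rnorm(n, sd = 0.2))
lambda <- 0.5

# Gradient condition: 2 A'(Ax - y) + 2 lambda A'A x = 0,
# i.e. (1 + lambda) A'A x = A'y.
AtA <- crossprod(A)
Aty <- crossprod(A, y)
x_pen <- as.vector(solve((1 + lambda) * AtA, Aty))

# Plain least squares for comparison.
x_ols <- as.vector(solve(AtA, Aty))
all.equal(x_pen, x_ols / (1 + lambda))

# Cross-check by minimizing the objective numerically.
objective <- function(x) sum((A %*% x - y)^2) + lambda * sum((A %*% x)^2)
x_opt <- optim(rep(0, 3), objective, method = "BFGS")$par
max(abs(x_opt - x_pen))   # small
```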
