Solved – Need for centering and standardizing data in regression

lasso, regression, regularization, standardization

Consider linear regression with some regularization:
For example, find $x$ that minimizes $||Ax - b||^2+\lambda||x||_1$.

Usually, the columns of $A$ are standardized to have zero mean and unit norm, while $b$ is centered to have zero mean. I want to check whether my understanding of the reasons for standardizing and centering is correct.

By making the means of the columns of $A$ and of $b$ zero, we no longer need an intercept term. Otherwise, the objective would have been $||Ax-x_0\mathbf{1}-b||^2+\lambda||x||_1$, where $x_0$ is the intercept and $\mathbf{1}$ is the vector of ones. By making the norms of the columns of $A$ equal to 1, we rule out the case where a column of $A$ gets a low coefficient in $x$ merely because it has a very high norm, which might lead us to conclude incorrectly that that column of $A$ doesn't "explain" $x$ well.

This reasoning is not rigorous, but intuitively, is this the right way to think about it?

Best Answer

You are correct about zeroing the means of the columns of $A$ and $b$.
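As a quick numerical check of the centering claim, here is a minimal NumPy sketch. It uses plain least squares rather than the lasso, because that has a direct solver, and the data, sizes, and variable names are made up for illustration. The same substitution works whenever the intercept itself is not penalized. The intercept is written here with a plus sign, which differs from the question's $-x_0\mathbf{1}$ only in the sign of $x_0$.

```python
import numpy as np

# Synthetic data (illustrative only): columns with nonzero means and a true intercept of 4.
rng = np.random.default_rng(0)
n, p = 50, 3
A = rng.normal(loc=5.0, scale=2.0, size=(n, p))
x_true = np.array([1.5, -2.0, 0.5])
b = A @ x_true + 4.0 + 0.1 * rng.normal(size=n)

# Fit 1: raw data with an explicit intercept column of ones.
A_aug = np.column_stack([np.ones(n), A])
coef_aug, *_ = np.linalg.lstsq(A_aug, b, rcond=None)
intercept, slopes_raw = coef_aug[0], coef_aug[1:]

# Fit 2: centered data, no intercept column.
A_c = A - A.mean(axis=0)
b_c = b - b.mean()
slopes_centered, *_ = np.linalg.lstsq(A_c, b_c, rcond=None)

# The slopes agree, and the intercept can be recovered from the means.
same_slopes = np.allclose(slopes_raw, slopes_centered)
recovered_intercept = b.mean() - A.mean(axis=0) @ slopes_centered
same_intercept = np.isclose(intercept, recovered_intercept)
print(same_slopes, same_intercept)
```

Centering therefore loses nothing: the intercept is recovered afterwards as the mean of $b$ minus the column means of $A$ times the fitted coefficients.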

However, as for adjusting the norms of the columns of $A$, consider what would happen if you started out with an $A$ whose columns all have unit norm, and all the elements of $x$ were of roughly the same magnitude. Now multiply one column by, say, $10^{-6}$. In an unregularized regression, the corresponding element of $x$ would increase by a factor of $10^6$. Consider what this does to the regularization term: that one coefficient would dominate $||x||_1$, so the regularization would, for all practical purposes, apply only to that one coefficient.
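The rescaling argument can be reproduced with a small NumPy sketch. The data are synthetic and noise-free, and the names (`A_scaled`, `share`) are chosen only for this illustration. It fits an unregularized least-squares problem before and after shrinking one column by $10^{-6}$, then measures how much of the $\ell_1$ norm that single coefficient accounts for.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
A = rng.normal(size=(n, p))
A /= np.linalg.norm(A, axis=0)      # give every column unit norm
x_true = np.ones(p)                 # coefficients of equal magnitude
b = A @ x_true                      # noise-free response, for clarity

# Unregularized fit on the unit-norm columns.
x_ols, *_ = np.linalg.lstsq(A, b, rcond=None)

# Shrink the first column by 1e-6 and refit.
A_scaled = A.copy()
A_scaled[:, 0] *= 1e-6
x_scaled, *_ = np.linalg.lstsq(A_scaled, b, rcond=None)

# How much did the first coefficient grow, and what share of the
# l1 penalty ||x||_1 does it now account for?
ratio = x_scaled[0] / x_ols[0]
share = abs(x_scaled[0]) / np.abs(x_scaled).sum()
print(ratio, share)
```

The ratio should come out close to $10^6$ and the share close to 1, so a penalty on $||x||_1$ would act almost entirely on the rescaled column's coefficient.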

By normalizing the columns of $A$, we put them all, intuitively speaking, on the same scale. Consequently, differences in the magnitudes of the elements of $x$ are directly related to the "wiggliness" of the explanatory function ($Ax$), which is, loosely speaking, what the regularization tries to control. Without normalization, a coefficient value of, e.g., 0.1 vs. another of 10.0 would tell you, in the absence of knowledge about $A$, nothing about which coefficient was contributing the most to the "wiggliness" of $Ax$. (For a linear function, like $Ax$, "wiggliness" is related to deviation from 0.)
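For concreteness, one way to carry out the centering and normalization is sketched below. The helper name `standardize_columns` and the test data are hypothetical and not taken from any particular library.

```python
import numpy as np

def standardize_columns(A):
    """Center each column of A and scale it to unit Euclidean norm.

    Returns the standardized matrix plus the column means and norms,
    which are needed to map coefficients back to the original scale.
    """
    means = A.mean(axis=0)
    A_c = A - means
    norms = np.linalg.norm(A_c, axis=0)
    norms[norms == 0] = 1.0          # constant columns stay all-zero
    return A_c / norms, means, norms

# Columns on wildly different scales (illustrative data).
rng = np.random.default_rng(1)
A = rng.normal(size=(40, 3)) * np.array([1e-3, 1.0, 1e3])

A_std, means, norms = standardize_columns(A)
col_means = A_std.mean(axis=0)
col_norms = np.linalg.norm(A_std, axis=0)
print(col_means, col_norms)
```

If `x_std` denotes the coefficients fitted on `A_std`, the coefficients for the centered but unscaled columns are `x_std / norms`, so the original units can always be recovered after fitting.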

To return to your explanation: if one column of $A$ has a very high norm, and for some reason gets a low coefficient in $x$, we would not conclude that the column of $A$ doesn't "explain" $x$ well. $A$ doesn't "explain" $x$ at all; the columns of $A$ are used to explain $b$, and $x$ only holds the coefficients.