regression – Understanding High Variance of Gradients in Regression Analysis

gradient, pca, regression, ridge regression, variance

I was trying to understand Ridge Regression and came across the following excerpt from Hastie et al. in The Elements of Statistical Learning (section 3.4.1, Page 67):

If we consider fitting a linear surface over this domain (the Y-axis is sticking out of the page), the configuration of the data allow us to determine its gradient more accurately in the long direction than the short. Ridge regression protects against the potentially high variance of gradients estimated in the short directions. The implicit assumption is that the response will tend to vary most in the directions of high variance of the inputs.

Could someone please help me understand why there is high variance of gradients estimated in the short directions?

My understanding is that along the short directions, there is less variance in the data. So shouldn't the variance of the gradients estimated be lower along the short directions?

Best Answer

In ordinary least squares regression, the variance of the data in the regressors ends up in the denominator of the expression for the error of the parameter estimates:

$$\text{Cov} (\hat\beta) = \hat{\sigma}^2 (X^TX)^{-1} $$

If the columns of the regressor matrix $X$ are orthogonal (and they are if you use principal components), then the standard error can be expressed as

$$s.e.(\beta_i) \approx \sqrt{\frac{\sigma^2} {n \text{Var}(X_i)} }$$
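As a quick numerical check of these two expressions, here is a minimal sketch (assuming NumPy; the correlation 0.9, sample size, and noise level are arbitrary illustrative choices). It rotates correlated inputs onto their principal components so the columns of $X$ are orthogonal, then compares the exact standard errors from $\hat\sigma^2 (X^TX)^{-1}$ with the approximation above; the short (low-variance) direction gets the larger standard error.

```python
# Minimal sketch (illustrative values): orthogonal regressors via PCA, then
# compare the exact standard errors from sigma^2 (X^T X)^{-1} with the
# approximation sqrt(sigma^2 / (n * Var(X_i))).
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 200, 1.0

# Correlated 2-D inputs, centered, then rotated onto their principal components,
# so column 0 is the "long" direction and column 1 the "short" one.
Z = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.9], [0.9, 1.0]], size=n)
Z -= Z.mean(axis=0)
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
X = Z @ Vt.T                      # orthogonal, centered columns (PC scores)

exact_se = sigma * np.sqrt(np.diag(np.linalg.inv(X.T @ X)))
approx_se = np.sqrt(sigma**2 / (n * X.var(axis=0)))   # Var with ddof=0

print("exact  s.e.:", exact_se)   # long direction first, short direction second
print("approx s.e.:", approx_se)  # matches; the short direction has the larger s.e.
```

The two agree exactly here because the centered principal-component columns make $X^TX$ diagonal with entries $n \, \text{Var}(X_i)$.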

So a larger variance in the regressor $X_i$ means a smaller variance/error in the estimate of the coefficient/slope/gradient.

See also the following image. It shows the same correlated data but with a different variance for the $x$ variable. This changes the slope: if the scale of the $x$ variable is smaller, then the slope becomes larger (and so does the error of the slope).

[Figure: the same correlated data plotted with a larger and a smaller scale for the $x$ variable; the smaller-scale plot shows a steeper fitted slope and a larger slope error.]
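A rough simulation of what the image illustrates (again assuming NumPy; the true slope 2, the scale factor 0.2, and the other settings are made-up illustrative values): the same data-generating process is used each time, but $x$ is rescaled before fitting, so a smaller scale for $x$ gives both a larger fitted slope and a larger spread of the slope estimate across repetitions.

```python
# Sketch of the figure's point (illustrative values): shrinking the scale of x
# inflates both the fitted slope and the spread of the slope estimates.
import numpy as np

rng = np.random.default_rng(1)
n, reps, sigma = 50, 2000, 1.0

def slope_estimates(x_scale):
    slopes = []
    for _ in range(reps):
        x = rng.normal(size=n)
        y = 2.0 * x + sigma * rng.normal(size=n)         # same relation and noise each time
        slopes.append(np.polyfit(x_scale * x, y, 1)[0])  # only the scale of x changes
    return np.array(slopes)

for scale in (1.0, 0.2):
    s = slope_estimates(scale)
    print(f"x scale {scale}: mean slope {s.mean():.2f}, s.d. of slope {s.std():.2f}")
```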
