Solved – Least angle regression keeps the correlations monotonically decreasing and tied

correlation, machine-learning, regression, self-study

I'm trying to solve a problem for least angle regression (LAR). This is problem 3.23 on page 97 of Hastie et al., Elements of Statistical Learning, 2nd ed. (5th printing).

Consider a regression problem with all variables and response having mean zero and standard deviation one. Suppose also that each variable has identical absolute correlation with the response:

$$
\frac{1}{N} \left| \left\langle \mathbf{x}_j, \mathbf{y} \right\rangle \right| = \lambda, \quad j = 1, \ldots, p
$$

Let $\hat{\beta}$ be the least squares coefficient of $\mathbf{y}$ on $\mathbf{X}$ and let $\mathbf{u}(\alpha) = \alpha \mathbf{X} \hat{\beta}$ for $\alpha \in [0,1]$.

I am asked to show that
$$
\frac{1}{N} \left| \left\langle \mathbf{x}_j, \mathbf{y} - \mathbf{u}(\alpha) \right\rangle \right| = (1 - \alpha) \lambda, \quad j = 1, \ldots, p
$$
and I am having problems with that. Note that this basically says that the correlation of each $\mathbf{x}_j$ with the residuals remains equal in magnitude as we progress toward $\mathbf{u}$.

I also do not know how to show that the correlations are equal to:

$$
\lambda(\alpha) = \frac{(1-\alpha)}{\sqrt{(1-\alpha)^2 + \frac{\alpha (2-\alpha)}{N} \cdot \mathrm{RSS}}} \cdot \lambda
$$

Any pointers would be greatly appreciated!

Best Answer

This is problem 3.23 on page 97 of Hastie et al., Elements of Statistical Learning, 2nd ed. (5th printing).

The key to this problem is a good understanding of ordinary least squares (i.e., linear regression), particularly the orthogonality of the fitted values and the residuals.

Orthogonality lemma: Let $X$ be the $N \times p$ design matrix, $y$ the response vector and $\beta$ the (true) parameters. Assuming $X$ is full rank (as we will throughout), the OLS estimates of $\beta$ are $\hat{\beta} = (X^T X)^{-1} X^T y$. The fitted values are $\hat{y} = X (X^T X)^{-1} X^T y$. Then $\langle \hat{y}, y-\hat{y} \rangle = \hat{y}^T (y - \hat{y}) = 0$. That is, the fitted values are orthogonal to the residuals. This follows since $X^T (y - \hat{y}) = X^T y - X^T X (X^T X)^{-1} X^T y = X^T y - X^T y = 0$.
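
As a quick sanity check (not part of the original derivation), here is a minimal NumPy sketch that verifies the lemma on randomly generated data; the names (`X`, `y`, `beta_hat`) are purely illustrative.

```python
import numpy as np

# Minimal numerical check of the orthogonality lemma on random data.
rng = np.random.default_rng(0)
N, p = 100, 5

X = rng.standard_normal((N, p))   # full-rank design matrix (with probability one)
y = rng.standard_normal(N)        # response vector

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # OLS estimate (X^T X)^{-1} X^T y
y_hat = X @ beta_hat                           # fitted values
resid = y - y_hat                              # residuals

# X^T (y - y_hat) should be numerically zero, i.e. <x_j, y - y_hat> = 0 for every j,
# and consequently <y_hat, y - y_hat> = 0 as well.
print(X.T @ resid)    # ~ zeros (up to floating-point error)
print(y_hat @ resid)  # ~ 0
```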

Now, let $x_j$ be a column vector such that $x_j$ is the $j$th column of $X$. The assumed conditions are:

  • $\frac{1}{N} \langle x_j, x_j \rangle = 1$ for each $j$, $\frac{1}{N} \langle y, y \rangle = 1$,
  • $\frac{1}{N} \langle x_j, 1_N \rangle = \frac{1}{N} \langle y, 1_N \rangle = 0$ where $1_N$ denotes a vector of ones of length $N$, and
  • $\frac{1}{N} | \langle x_j, y \rangle | = \lambda$ for all $j$.

Note that in particular, the last statement of the orthogonality lemma is identical to $\langle x_j, y - \hat{y} \rangle = 0$ for all $j$.


The correlations are tied

Now, $u(\alpha) = \alpha X \hat{\beta} = \alpha \hat{y}$. So,
$$
\langle x_j, y - u(\alpha) \rangle = \langle x_j, (1-\alpha) y + \alpha y - \alpha \hat{y} \rangle = (1-\alpha) \langle x_j, y \rangle + \alpha \langle x_j, y - \hat{y} \rangle ,
$$
and the second term on the right-hand side is zero by the orthogonality lemma, so
$$
\frac{1}{N} | \langle x_j, y - u(\alpha) \rangle | = (1-\alpha) \lambda ,
$$
as desired. The absolute values of the correlations are then
$$
\hat{\rho}_j(\alpha) = \frac{\frac{1}{N} | \langle x_j, y - u(\alpha) \rangle |}{\sqrt{\frac{1}{N} \langle x_j, x_j \rangle }\sqrt{\frac{1}{N} \langle y - u(\alpha), y - u(\alpha) \rangle }} = \frac{(1-\alpha)\lambda}{\sqrt{\frac{1}{N} \langle y - u(\alpha), y - u(\alpha) \rangle }} .
$$

Note: The right-hand side above is independent of $j$ and the numerator is just the same as the covariance since we've assumed that all the $x_j$'s and $y$ are centered (so, in particular, no subtraction of the mean is necessary).
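
If you want to see the $(1-\alpha)$ shrinkage numerically, here is an illustrative sketch (my own, not from the book). With centered, standardized random data the initial absolute correlations are generally not all equal to a common $\lambda$, but since the derivation above only uses the orthogonality lemma, each variable's value of $\frac{1}{N} | \langle x_j, y - u(\alpha) \rangle |$ still shrinks to exactly $(1-\alpha)$ times its starting value:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 200, 4

# Center and standardize the columns and the response, matching the assumed conditions.
X = rng.standard_normal((N, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = rng.standard_normal(N)
y = (y - y.mean()) / y.std()

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

lam = np.abs(X.T @ y) / N   # per-variable "lambda_j" (not all equal for random data)

for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    u = alpha * y_hat                                    # u(alpha) = alpha * X * beta_hat
    lhs = np.abs(X.T @ (y - u)) / N                      # (1/N) |<x_j, y - u(alpha)>|
    print(alpha, np.allclose(lhs, (1 - alpha) * lam))    # True for every alpha
```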

What's the point? As $\alpha$ increases, the response vector is modified so that it inches its way toward that of the (restricted!) least-squares solution obtained from incorporating only the first $p$ parameters in the model. This simultaneously modifies the estimated parameters since they are simple inner products of the predictors with the (modified) response vector. The modification takes a special form, though: it keeps the magnitude of the correlations between the predictors and the modified response the same throughout the process (even though the value of the correlation is changing). Think about what this is doing geometrically and you'll understand the name of the procedure!


Explicit form of the (absolute) correlation

Let's focus on the term in the denominator, since the numerator is already in the required form. We have $$ \langle y - u(\alpha), y - u(\alpha) \rangle = \langle (1-\alpha) y + \alpha y - u(\alpha), (1-\alpha) y + \alpha y - u(\alpha) \rangle . $$

Substituting in $u(\alpha) = \alpha \hat{y}$ and using the linearity of the inner product, we get

$$ \langle y - u(\alpha), y - u(\alpha) \rangle = (1-\alpha)^2 \langle y, y \rangle + 2\alpha(1-\alpha) \langle y, y - \hat{y} \rangle + \alpha^2 \langle y-\hat{y}, y-\hat{y} \rangle . $$

Observe that

  • $\langle y, y \rangle = N$ by assumption,
  • $\langle y, y - \hat{y} \rangle = \langle y - \hat{y}, y - \hat{y} \rangle + \langle \hat{y}, y - \hat{y} \rangle = \langle y - \hat{y}, y - \hat{y}\rangle$, by applying the orthogonality lemma (yet again) to the second term in the middle; and,
  • $\langle y - \hat{y}, y - \hat{y} \rangle = \mathrm{RSS}$ by definition.
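
Spelling out the substitution of these three facts into the expansion:
$$ \langle y - u(\alpha), y - u(\alpha) \rangle = (1-\alpha)^2 N + \bigl( 2\alpha(1-\alpha) + \alpha^2 \bigr) \mathrm{RSS} = (1-\alpha)^2 N + \alpha(2-\alpha) \, \mathrm{RSS} . $$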

Dividing by $N$ and plugging this back into the expression for $\hat{\rho}_j(\alpha)$ above, we get

$$ \hat{\rho}_j(\alpha) = \frac{(1-\alpha) \lambda}{\sqrt{ (1-\alpha)^2 + \frac{\alpha(2-\alpha)}{N} \mathrm{RSS}}} = \frac{(1-\alpha) \lambda}{\sqrt{ (1-\alpha)^2 (1 - \frac{\mathrm{RSS}}{N}) + \frac{1}{N} \mathrm{RSS}}} $$

To wrap things up, $1 - \frac{\mathrm{RSS}}{N} = \frac{1}{N} (\langle y, y \rangle - \langle y - \hat{y}, y - \hat{y} \rangle ) \geq 0$ (by the orthogonality lemma and Pythagoras, $\langle y, y \rangle = \langle \hat{y}, \hat{y} \rangle + \langle y - \hat{y}, y - \hat{y} \rangle$), and so it's clear that $\hat{\rho}_j(\alpha)$ is monotonically decreasing in $\alpha$ and $\hat{\rho}_j(\alpha) \downarrow 0$ as $\alpha \uparrow 1$.
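
To close the loop numerically, here is one more illustrative sketch (same caveat as before: with random data each variable has its own $\lambda_j$ rather than a shared $\lambda$). It checks that the closed form above matches the directly computed absolute correlation between $x_j$ and $y - u(\alpha)$, and that this correlation decreases monotonically toward $0$ as $\alpha \uparrow 1$:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 200, 4

X = rng.standard_normal((N, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # centered, unit-variance columns
y = rng.standard_normal(N)
y = (y - y.mean()) / y.std()               # centered, unit-variance response

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
RSS = np.sum((y - y_hat) ** 2)
lam = np.abs(X.T @ y) / N                  # per-variable lambda_j

prev = np.full(p, np.inf)
for alpha in np.linspace(0.0, 1.0, 26):
    r = y - alpha * y_hat                                          # y - u(alpha)
    direct = np.abs(X.T @ r) / (N * np.sqrt(np.mean(r ** 2)))      # |corr(x_j, y - u(alpha))|
    closed = (1 - alpha) * lam / np.sqrt((1 - alpha) ** 2 + alpha * (2 - alpha) * RSS / N)
    assert np.allclose(direct, closed)      # closed form matches the direct computation
    assert np.all(direct <= prev + 1e-12)   # monotonically decreasing in alpha
    prev = direct
print("closed form verified; correlations decrease to 0 as alpha -> 1")
```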


Epilogue: Concentrate on the ideas here. There is really only one. The orthogonality lemma does almost all the work for us. The rest is just algebra, notation, and the ability to put these last two to work.
