This is problem **3.23** on page **97** of Hastie et al., *Elements of Statistical Learning*, 2nd ed. (5th printing).

The key to this problem is a good understanding of ordinary least squares (i.e., linear regression), particularly the orthogonality of the fitted values and the residuals.

**Orthogonality lemma**: Let $X$ be the $n \times p$ design matrix, $y$ the response vector and $\beta$ the (true) parameters. Assuming $X$ is full-rank (which we will throughout), the OLS estimates of $\beta$ are $\hat{\beta} = (X^T X)^{-1} X^T y$. The fitted values are $\hat{y} = X (X^T X)^{-1} X^T y$. Then $\langle \hat{y}, y-\hat{y} \rangle = \hat{y}^T (y - \hat{y}) = 0$. That is, the fitted values are *orthogonal* to the residuals. This follows since $X^T (y - \hat{y}) = X^T y - X^T X (X^T X)^{-1} X^T y = X^T y - X^T y = 0$.
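A quick numerical sketch of the lemma (NumPy, with random data, so the design is full-rank with probability one):

```python
import numpy as np

# numerical check of the orthogonality lemma
rng = np.random.default_rng(0)
N, p = 50, 3
X = rng.normal(size=(N, p))     # design matrix
y = rng.normal(size=N)          # response

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # OLS estimate
y_hat = X @ beta_hat                            # fitted values

print(np.allclose(X.T @ (y - y_hat), 0.0))      # True: X^T (y - y_hat) = 0
print(np.isclose(y_hat @ (y - y_hat), 0.0))     # True: fitted values are orthogonal to residuals
```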

Now, let $x_j$ be a column vector such that $x_j$ is the $j$th column of $X$. The assumed conditions are:

- $\frac{1}{N} \langle x_j, x_j \rangle = 1$ for each $j$, $\frac{1}{N} \langle y, y \rangle = 1$,
- $\frac{1}{N} \langle x_j, 1_N \rangle = \frac{1}{N} \langle y, 1_N \rangle = 0$ where $1_N$ denotes a vector of ones of length $N$ (i.e., the predictors and the response are centered), and
- $\frac{1}{N} | \langle x_j, y \rangle | = \lambda$ for all $j$.

Note that, **in particular**, the last display in the **orthogonality lemma** says componentwise that $\langle x_j, y - \hat{y} \rangle = 0$ for all $j$.

**The correlations are tied**

Now, $u(\alpha) = \alpha X \hat{\beta} = \alpha \hat{y}$. So,
$$
\langle x_j, y - u(\alpha) \rangle = \langle x_j, (1-\alpha) y + \alpha (y - \hat{y}) \rangle = (1-\alpha) \langle x_j, y \rangle + \alpha \langle x_j, y - \hat{y} \rangle ,
$$
and the second term on the right-hand side is zero by the **orthogonality lemma**, so
$$
\frac{1}{N} | \langle x_j, y - u(\alpha) \rangle | = (1-\alpha) \lambda ,
$$
as desired. The absolute values of the correlations are then
$$
\hat{\rho}_j(\alpha) = \frac{\frac{1}{N} | \langle x_j, y - u(\alpha) \rangle |}{\sqrt{\frac{1}{N} \langle x_j, x_j \rangle }\sqrt{\frac{1}{N} \langle y - u(\alpha), y - u(\alpha) \rangle }} = \frac{(1-\alpha)\lambda}{\sqrt{\frac{1}{N} \langle y - u(\alpha), y - u(\alpha) \rangle }}
$$

**Note**: The right-hand side above is independent of $j$ and the numerator is just the same as the covariance since we've assumed that all the $x_j$'s and $y$ are centered (so, in particular, no subtraction of the mean is necessary).
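The key identity $\langle x_j, y - u(\alpha) \rangle = (1-\alpha) \langle x_j, y \rangle$ holds column by column even without the standardization assumptions, so it is easy to check numerically (a minimal NumPy sketch with random data):

```python
import numpy as np

# check <x_j, y - u(alpha)> = (1 - alpha) <x_j, y> for every column j
rng = np.random.default_rng(1)
N, p = 40, 4
X = rng.normal(size=(N, p))
y = rng.normal(size=N)
y_hat = X @ np.linalg.solve(X.T @ X, X.T @ y)   # OLS fitted values

alpha = 0.3
u = alpha * y_hat
print(np.allclose(X.T @ (y - u), (1 - alpha) * (X.T @ y)))   # True
```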

**What's the point?** As $\alpha$ increases the response vector is modified so that it inches its way toward that of the (*restricted!*) least-squares solution obtained from incorporating only the first $p$ parameters in the model. This simultaneously modifies the estimated parameters since they are simple inner products of the predictors with the (modified) response vector. The modification takes a special form though. It keeps the (magnitude of) the correlations between the predictors and the modified response the same throughout the process (even though the *value* of the correlation is changing). Think about what this is doing geometrically and you'll understand the name of the procedure!

**Explicit form of the (absolute) correlation**

Let's focus on the term in the denominator, since the numerator is already in the required form. We have
$$
\langle y - u(\alpha), y - u(\alpha) \rangle = \langle (1-\alpha) y + \alpha y - u(\alpha), (1-\alpha) y + \alpha y - u(\alpha) \rangle .
$$

Substituting in $u(\alpha) = \alpha \hat{y}$ and using the linearity of the inner product, we get

$$
\langle y - u(\alpha), y - u(\alpha) \rangle = (1-\alpha)^2 \langle y, y \rangle + 2\alpha(1-\alpha) \langle y, y - \hat{y} \rangle + \alpha^2 \langle y-\hat{y}, y-\hat{y} \rangle .
$$

Observe that

- $\langle y, y \rangle = N$ by assumption,
- $\langle y, y - \hat{y} \rangle = \langle y - \hat{y}, y - \hat{y} \rangle + \langle \hat{y}, y - \hat{y} \rangle = \langle y - \hat{y}, y - \hat{y}\rangle$, by applying the
**orthogonality lemma** (yet again) to the second term in the middle; and,
- $\langle y - \hat{y}, y - \hat{y} \rangle = \mathrm{RSS}$ by definition.

Putting this all together, you'll notice that we get

$$
\hat{\rho}_j(\alpha) = \frac{(1-\alpha) \lambda}{\sqrt{ (1-\alpha)^2 + \frac{\alpha(2-\alpha)}{N} \mathrm{RSS}}} = \frac{(1-\alpha) \lambda}{\sqrt{ (1-\alpha)^2 (1 - \frac{\mathrm{RSS}}{N}) + \frac{1}{N} \mathrm{RSS}}}
$$

To wrap things up, $1 - \frac{\mathrm{RSS}}{N} = \frac{1}{N} (\langle y, y \rangle - \langle y - \hat{y}, y - \hat{y} \rangle ) \geq 0$, since $\langle y, y \rangle = \langle \hat{y}, \hat{y} \rangle + \langle y - \hat{y}, y - \hat{y} \rangle$ by the **orthogonality lemma**. So it's clear that $\hat{\rho}_j(\alpha)$ is monotonically decreasing in $\alpha$, with $\hat{\rho}_j(\alpha) \downarrow 0$ as $\alpha \uparrow 1$.
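Under the exercise's standardization the closed form can be checked against a direct computation of the correlations (a sketch; here $\lambda_j = \frac{1}{N} | \langle x_j, y \rangle |$ is computed per column rather than assumed equal across $j$, since the per-$j$ identity holds regardless):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 100, 3
X = rng.normal(size=(N, p))
y = rng.normal(size=N)

# standardize as in the exercise: centered, <x_j, x_j> = <y, y> = N
X -= X.mean(axis=0)
X *= np.sqrt(N) / np.linalg.norm(X, axis=0)
y -= y.mean()
y *= np.sqrt(N) / np.linalg.norm(y)

y_hat = X @ np.linalg.solve(X.T @ X, X.T @ y)
rss = (y - y_hat) @ (y - y_hat)
lam = np.abs(X.T @ y) / N   # lambda_j; equal across j under the exercise's assumptions

def rho_direct(alpha):
    # |correlation| of each x_j with the current residual, computed directly
    r = y - alpha * y_hat
    return np.abs(X.T @ r) / N / np.sqrt(r @ r / N)

def rho_closed(alpha):
    # the closed form derived above
    return (1 - alpha) * lam / np.sqrt((1 - alpha) ** 2 + alpha * (2 - alpha) * rss / N)

for alpha in (0.0, 0.25, 0.5, 0.75, 0.99):
    print(alpha, np.allclose(rho_direct(alpha), rho_closed(alpha)))  # all True
```

The printed values also decrease monotonically toward zero as `alpha` approaches 1, as the argument above predicts.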

**Epilogue**: Concentrate on the ideas here. There is really only one. The **orthogonality lemma** does almost all the work for us. The rest is just algebra, notation, and the ability to put these last two to work.

You could combine the three datasets to make one (very) long dataset. For a given subject_id you would then have, e.g., 26 records or observations: 12 for method 1, 12 for method 2, and 2 for method 3. You should make a factor variable, "method", valued 1, 2, or 3, depending on the method to which each record (row) in your data frame belongs. So for this subject the factor "method" would be 12 times 1, 12 times 2, and 2 times 3. Your dependent variable is "ts" and contains the 26 values for the 3 methods. Finally, the variable "interaction_num" takes values 1 to 12. Instead of the three models you proposed in your post, you could then use one single model:

`model1 <- lmer(ts ~ method*interaction_num + (0 + method || subject_id))`
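For concreteness, the long data layout this model expects can be sketched like this (Python for illustration, with the same row counts as the example; the two subjects and the `ts` placeholder values are hypothetical):

```python
# build the long format: one row per observation, with a method factor
rows = []
for subject_id in (1, 2):                               # hypothetical subjects
    for method, n_obs in ((1, 12), (2, 12), (3, 2)):    # 12 + 12 + 2 = 26 records each
        for interaction_num in range(1, n_obs + 1):
            rows.append({"subject_id": subject_id,
                         "method": method,              # the factor variable
                         "interaction_num": interaction_num,
                         "ts": 0.0})                    # placeholder response
print(len(rows))            # 52 rows: 26 per subject
```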

This would give you (almost) the same results as running the separate models you proposed. In the fixed effects results you will see the regression coefficient for interaction_num, which holds for method 1 (the reference of the factor method). The term method2:interaction_num shows you how much for method 2 the effect of interaction_num deviates from the effect for method 1. This is probably what you are most interested in. Same for method3:interaction_num. The effect of method2, shows you the ts difference between method2 vs. method1 for interaction_num being 0. This is not meaningful, but if you would code interaction_num from 0-11 instead of 1-12 then it would have a meaningful interpretation: difference between method 2 and 1 for the first interaction.

There will be three different random intercept variances, one for each method. The `||` in `(0 + method || subject_id)` means that these three random intercepts are uncorrelated. That is done above only to make the results more similar to the ones you would get from your separate models. However, you will probably get a better model fit (deviance) if you use `(0 + method | subject_id)`, so that the correlations between the three random intercepts are allowed and estimated.

This model1 will estimate only a single residual variance! This is different from the three models you proposed, because those will give you three different residual variances. So, model1 comes with the assumption that the within-subject variance is the same for each method. Hence, the results of model1 will be almost (but not exactly) the same as those of your separate models if these residual variances are indeed close. As far as I know, with `lme` from package nlme you can specify that you also want three different residual variances.

Model1 assumes, for a given method, the same total variance in your data at each interaction number and also the same covariance (or correlation) between any pair of interaction numbers. This is called a compound symmetry covariance structure, and it may not be plausible for your data. If you let interaction_num have a random effect for each person/method combination, you would get a more flexible covariance structure. That would be possible for method 1 and method 2, but not for method 3, because you only have two observations there.
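What compound symmetry means concretely can be illustrated with the implied covariance matrix of a random-intercept model (a sketch; the variance components and the number of interaction numbers are hypothetical):

```python
import numpy as np

# Compound symmetry implied by a subject-level random intercept: for the n_t
# interaction numbers of one subject/method cell,
#   Var(y) = sigma_b^2 * J + sigma_e^2 * I   (J = all-ones matrix)
sigma_b2, sigma_e2, n_t = 2.0, 1.0, 4        # hypothetical variance components
V = sigma_b2 * np.ones((n_t, n_t)) + sigma_e2 * np.eye(n_t)

print(np.diag(V))        # same total variance at every interaction number
print(V[0, 1], V[2, 3])  # same covariance for every pair of interaction numbers
```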

## Best Answer

Most multiple regression problems can be reduced to simple regressions involving one response variable and one explanatory variable (without an intercept), amounting to a plane geometry problem. The solution strategy presented here amounts to finding a particularly nice way to represent these variables. The three variables in this question generate a Euclidean space of at most three dimensions. We will solve the problem directly by selecting a simple but fully general way to express all the variables: that is, by choosing a suitable orthonormal basis for this space. One basic computational fact, the

Normal equations (for the simplest possible regression), is that the coefficient from regressing a variable $Y$ on a single variable $X$ (without an intercept) is $$\beta = \frac{Y \cdot X}{X \cdot X}$$ where $\cdot$ is the Euclidean inner product.

Choosing units in which the length of $X_2$ (presumably nonzero) is $1,$ begin creating an orthonormal basis of which the first element is $X_2,$ so that in this basis $X_2 = (1,0,0).$

Because $\{X_2, X_3\}$ generate a subspace of at most two dimensions, we may select a second basis element to represent the second dimension, so that $X_3 = (u,v,0),$ say.

The regression of $X_1$ on $X_2$ and $X_3$ states

$$X_1 = \beta_2^* X_2 + \beta_3^* X_3 + \text{residual} = (\beta_2^* + \beta_3^*u,\ \beta_3^*v,\ w)$$

where the third basis element is chosen to be parallel to the residual $(0,0,w).$

Consider the first two regressions. Regressing $X_1$ on $X_2=(1,0,0)$ simply picks out the first coefficient of $X_1,$ leaving the residual $$E = (0, \beta_3^*v, w).$$

Finally we come to the only step that requires any computation: regressing $E$ against $X_3$ gives the coefficient $$\beta_3 = \frac{E \cdot X_3}{X_3\cdot X_3} = \frac{(0,\beta_3^*v, w)\cdot(u,v,0)}{(u,v,0)\cdot(u,v,0)} = \beta_3^* \frac{v^2}{u^2+v^2}.$$

(We must assume $u^2+v^2\ne 0,$ which means $X_3$ is nonzero. Otherwise, $X_3$ plays no role and trivially $\beta_3 = \beta_3^*.$)

Because both $u^2$ and $v^2$ are nonnegative, taking absolute values produces

$$|\beta_3| = \frac{v^2}{u^2+v^2}\,|\beta_3^*| \le |\beta_3^*|,$$

with equality if and only if $u = 0.$

In terms of the original variables, $u^2=0$ means $X_3 = (0,v,0)$ is orthogonal to $X_2.$
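The basis computation above can be verified numerically (a sketch; the coordinates $u, v, w$ and the starred coefficients are arbitrary, hypothetical values):

```python
import numpy as np

# coordinates in the orthonormal basis described above (arbitrary choices)
u, v, w = 0.6, 0.8, 0.5
b2s, b3s = 1.3, -0.7                            # beta_2^*, beta_3^*
X2 = np.array([1.0, 0.0, 0.0])
X3 = np.array([u, v, 0.0])
X1 = b2s * X2 + b3s * X3 + np.array([0.0, 0.0, w])

E = X1 - (X1 @ X2) / (X2 @ X2) * X2             # residual of X1 regressed on X2
b3 = (E @ X3) / (X3 @ X3)                       # regress E on X3

print(np.isclose(b3, b3s * v**2 / (u**2 + v**2)))   # True
```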