Let's say $x$ is correlated with both $y_1$ and $y_2$. Why are the residuals of the nested regression of $x$ against $y_1$ and then $y_2$ not equal to the residuals of the simultaneous (multiple) regression of $x$ against $y_1$ and $y_2$? To clarify:
I take the residuals of the regression of $x$ against $y_1$ to get residuals $r_1$.
I regress $r_1$ against $y_2$ to get residuals $r_2$. Why are these residuals $r_2$ not equal to the residuals of the multivariate regression of $x$ against both $y_1$ and $y_2$?
Written in R code, we would say that
lm(lm(x ~ y1)$residuals ~ y2)$residuals
is not equal to:
lm(x ~ y1 + y2)$residuals
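As a minimal sketch of the discrepancy (simulated data; all names and values here are invented for illustration, not taken from the original post):

```r
set.seed(1)
n  <- 200
y1 <- rnorm(n)
y2 <- 0.6 * y1 + rnorm(n)          # y2 deliberately correlated with y1
x  <- 1 + 2 * y1 + 3 * y2 + rnorm(n)

# Nested (sequential) residuals
r1 <- lm(x ~ y1)$residuals
r2 <- lm(r1 ~ y2)$residuals

# One-shot multiple regression residuals
r_mult <- lm(x ~ y1 + y2)$residuals

# The two sets of residuals differ whenever y1 and y2 are correlated:
max(abs(r2 - r_mult))              # clearly nonzero
```

The reason is that the slope in lm(x ~ y1) absorbs part of y2's effect (because y2 is itself correlated with y1), so r1 has already had "too much" removed along the y1 direction.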
I would like to understand this because I want to progressively extract the influence of explanatory variables from a dependent variable, so that I can progressively "magnify" the dependent variable's correlation with each subsequent factor. I am doing this in the context of PCA regression, so specifically:
it30 = the 30-year point on the Italian yield curve
itpc1 = the first principal component of the Italian yield curve, calculated from maturity points 1y, 2y, 3y, …, 30y
itpc2 = the second principal component of the Italian yield curve
I expect it30 independently to have a relationship to itpc1 (yield curve level) and itpc2 (yield curve slope). Another fact is that, due to the PCA, itpc1 and itpc2 are orthogonal, but I do not think that is important for this question.
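A sketch of how such components could be produced with prcomp (the matrix here is random stand-in data, not the actual Italian yield curve; all names are illustrative):

```r
set.seed(2)
# Hypothetical stand-in: a T x 30 matrix of yields, one column per maturity
curve <- matrix(rnorm(100 * 30), nrow = 100)

pca   <- prcomp(curve)     # principal components of the yield curve
itpc1 <- pca$x[, 1]        # scores on PC1 ("level")
itpc2 <- pca$x[, 2]        # scores on PC2 ("slope")
it30  <- curve[, 30]       # the 30-year point

# The score vectors are orthogonal by construction:
sum(itpc1 * itpc2)         # zero up to floating-point noise
```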
Indeed, regressing it30 against itpc1 shows a clear relationship, and so does regressing it30 against itpc2, so the 30y yield has a relationship to both itpc1 and itpc2.
Now if I take the residuals of the first regression and regress them against the second variable, itpc2, I would expect there to be a relationship, and there does seem to be. So it appears that my residuals from the first regression are linked to the second variable, as I would expect: after accounting for the first correlation, that is, extracting itpc1 from the data, there is still information related to the correlation with itpc2. Interesting so far.
Now I want to extract both itpc1 and itpc2 from it30, but I am wondering which approach to take, because of the following, which I do not understand…
My question is: why is a plot of the nested residuals against the one-shot multiple-regression residuals not a perfect straight line?
That is, if I progressively extract correlated variables from the dependent variable in a nested way, why are the residuals not equal to the regression which extracts them all in one shot?
My objective is to understand to what extent each principal component affects a series. Yes, I know I can do this using the eigenvector matrix, but I am interested in the above behaviour of regressions.
Any intuitive explanation accompanying formulas would be appreciated.
Best Answer
As per Bill Huber's comments and answer elsewhere, the trick is to remove the influence of the independent variables on each other when producing each sequential regression. In other words, instead of:

lm(lm(x ~ y1)$residuals ~ y2)$residuals

we want:

lm(lm(x ~ y1)$residuals ~ lm(y2 ~ y1)$residuals)$residuals

In this case, we DO get back the residuals of the multiple regression lm(x ~ y1 + y2).
Moreover, we can show that the coefficient on the orthogonalized regressor lm(y2 ~ y1)$residuals is the same as the coefficient on y2 in the multiple regression.
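A minimal sketch of this corrected sequence, again with simulated data (all names and values invented here), which is exactly the Frisch–Waugh–Lovell construction:

```r
set.seed(3)
n  <- 200
y1 <- rnorm(n)
y2 <- 0.5 * y1 + rnorm(n)
x  <- 2 * y1 - 1.5 * y2 + rnorm(n)

# Orthogonalize y2 against y1 first, then regress the x-residuals
# on the orthogonalized regressor.
r1     <- lm(x ~ y1)$residuals
y2_res <- lm(y2 ~ y1)$residuals
fit_sequential <- lm(r1 ~ y2_res)
fit_multiple   <- lm(x ~ y1 + y2)

# The residuals now agree (up to floating-point noise) ...
max(abs(fit_sequential$residuals - fit_multiple$residuals))   # ~ 0

# ... and so do the coefficients on y2
c(coef(fit_sequential)["y2_res"], coef(fit_multiple)["y2"])
```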
Interestingly, and as expected, if the independent variables are orthogonal, as in PCA regression, then we do not need to take out the influence of the regressors on each other. In this case it is true that:

lm(lm(it30 ~ itpc1)$residuals ~ itpc2)$residuals

is perfectly correlated with:

lm(it30 ~ itpc1 + itpc2)$residuals

as a plot of one set of residuals against the other confirms.
This is because regressing one orthogonal principal component against another gives a zero-slope regression line, and thus the residuals of lm(itpc2 ~ itpc1) are equal to the dependent variable itpc2 itself (up to a vertical translation to mean 0, which is already the case for PC scores).
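The orthogonal case can be sketched as follows (again with random stand-in scores, not the actual Italian data; all names are illustrative):

```r
set.seed(4)
# Orthogonal, mean-zero "principal component" scores from random data
z     <- prcomp(matrix(rnorm(100 * 5), nrow = 100))$x
itpc1 <- z[, 1]
itpc2 <- z[, 2]
it30  <- 1.2 * itpc1 - 0.8 * itpc2 + rnorm(100)

# Regressing one score on the other gives a (numerically) zero slope,
# so the residuals are just itpc2 again:
coef(lm(itpc2 ~ itpc1))["itpc1"]                  # ~ 0
max(abs(lm(itpc2 ~ itpc1)$residuals - itpc2))     # ~ 0

# Hence the naive nested residuals already match the multiple regression:
r_nested <- lm(lm(it30 ~ itpc1)$residuals ~ itpc2)$residuals
r_mult   <- lm(it30 ~ itpc1 + itpc2)$residuals
max(abs(r_nested - r_mult))                       # ~ 0
```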