Let's say $x$ is correlated with both $y_1$ and $y_2$. Why are the residuals of the nested regression of $x$ against $y_1$ and then $y_2$ not equal to the residuals of the simultaneous (multiple) regression of $x$ against $y_1$ and $y_2$? To clarify:
I take the residuals of the regression of $x$ against $y_1$ to get residuals $r_1$.
I regress $r_1$ against $y_2$ to get residuals $r_2$. Why are these residuals $r_2$ not equal to the residuals of the multivariate regression of $x$ against both $y_1$ and $y_2$?
Written in R code, we would say that
lm(lm(x ~ y1)$residuals ~ y2)$residuals
is not equal to:
lm(x ~ y1 + y2)$residuals
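As a minimal sketch of the discrepancy (simulated data; all names and values here are invented for illustration, not taken from the original post):

```r
set.seed(1)
n  <- 200
y1 <- rnorm(n)
y2 <- 0.6 * y1 + rnorm(n)          # y2 deliberately correlated with y1
x  <- 1 + 2 * y1 + 3 * y2 + rnorm(n)

# Nested (sequential) residuals
r1 <- lm(x ~ y1)$residuals
r2 <- lm(r1 ~ y2)$residuals

# One-shot multiple regression residuals
r_mult <- lm(x ~ y1 + y2)$residuals

# The two sets of residuals differ whenever y1 and y2 are correlated:
max(abs(r2 - r_mult))              # clearly nonzero
```

The reason is that the slope in lm(x ~ y1) absorbs part of y2's effect (because y2 is itself correlated with y1), so r1 has already had "too much" removed along the y1 direction.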
I would like to understand this because I want to progressively extract the influence of explanatory variables from a dependent variable, so that I can progressively "magnify" the dependent variable's correlation with each subsequent factor. I am doing this in the context of PCA regression, so specifically:
it30 = the 30-year point on the Italian yield curve
itpc1 = the first principal component of the Italian yield curve, calculated from maturity points 1y, 2y, 3y, …, 30y
itpc2 = the second principal component of the Italian yield curve
I expect it30 independently to have a relationship to itpc1 (yield curve level) and itpc2 (yield curve slope). Another fact is that, due to the PCA, itpc1 and itpc2 are orthogonal, but I do not think that is important for this question.
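A sketch of how such components could be produced with prcomp (the matrix here is random stand-in data, not the actual Italian yield curve; all names are illustrative):

```r
set.seed(2)
# Hypothetical stand-in: a T x 30 matrix of yields, one column per maturity
curve <- matrix(rnorm(100 * 30), nrow = 100)

pca   <- prcomp(curve)     # principal components of the yield curve
itpc1 <- pca$x[, 1]        # scores on PC1 ("level")
itpc2 <- pca$x[, 2]        # scores on PC2 ("slope")
it30  <- curve[, 30]       # the 30-year point

# The score vectors are orthogonal by construction:
sum(itpc1 * itpc2)         # zero up to floating-point noise
```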
Indeed, regressing it30 against itpc1 shows a clear relationship, and so does regressing it30 against itpc2, so the 30y yield has a relationship to both itpc1 and itpc2.
Now if I take the residuals of the first regression and regress them against the second variable, itpc2, I would expect there to be a relationship, and there does seem to be. So it appears that my residuals from the first regression are linked to the second variable, as I would expect: after accounting for the first correlation, that is, extracting itpc1 from the data, there is still information related to the correlation with itpc2. Interesting so far.
Now I want to extract both itpc1 and itpc2 from it30, but I am wondering which approach to take, because of the following, which I do not understand…
My question is: why is a plot of the nested residuals against the one-shot multiple-regression residuals not a perfect straight line?
That is, if I progressively extract correlated variables from the dependent variable in a nested way, why are the residuals not equal to the regression which extracts them all in one shot?
My objective is to understand to what extent each principal component affects a series. Yes, I know I can do this using the eigenvector matrix, but I am interested in the above behaviour of regressions.
Any intuitive explanation accompanying formulas would be appreciated.
Best Answer
As per Bill Huber's comments and answer elsewhere, the trick is to remove the influence of the independent variables on each other when producing each sequential regression. In other words, instead of:

lm(lm(x ~ y1)$residuals ~ y2)$residuals

we want:

lm(lm(x ~ y1)$residuals ~ lm(y2 ~ y1)$residuals)$residuals

In this case, we DO get back the residuals of the multiple regression lm(x ~ y1 + y2).
Moreover, we can show that the coefficient on the orthogonalized regressor lm(y2 ~ y1)$residuals is the same as the coefficient on y2 in the multiple regression.
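A minimal sketch of this corrected sequence, again with simulated data (all names and values invented here), which is exactly the Frisch–Waugh–Lovell construction:

```r
set.seed(3)
n  <- 200
y1 <- rnorm(n)
y2 <- 0.5 * y1 + rnorm(n)
x  <- 2 * y1 - 1.5 * y2 + rnorm(n)

# Orthogonalize y2 against y1 first, then regress the x-residuals
# on the orthogonalized regressor.
r1     <- lm(x ~ y1)$residuals
y2_res <- lm(y2 ~ y1)$residuals
fit_sequential <- lm(r1 ~ y2_res)
fit_multiple   <- lm(x ~ y1 + y2)

# The residuals now agree (up to floating-point noise) ...
max(abs(fit_sequential$residuals - fit_multiple$residuals))   # ~ 0

# ... and so do the coefficients on y2
c(coef(fit_sequential)["y2_res"], coef(fit_multiple)["y2"])
```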
Interestingly, and as expected, if the independent variables are orthogonal, as in PCA regression, then we do not need to take out the influence of the regressors on each other. In this case it is true that:

lm(lm(it30 ~ itpc1)$residuals ~ itpc2)$residuals

is perfectly correlated with:

lm(it30 ~ itpc1 + itpc2)$residuals

as a plot of one set of residuals against the other confirms.
This is because regressing one orthogonal principal component against another gives a zero-slope regression line, and thus the residuals of lm(itpc2 ~ itpc1) are equal to the dependent variable itpc2 itself (up to a vertical translation to mean 0, which is already the case for PC scores).
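The orthogonal case can be sketched as follows (again with random stand-in scores, not the actual Italian data; all names are illustrative):

```r
set.seed(4)
# Orthogonal, mean-zero "principal component" scores from random data
z     <- prcomp(matrix(rnorm(100 * 5), nrow = 100))$x
itpc1 <- z[, 1]
itpc2 <- z[, 2]
it30  <- 1.2 * itpc1 - 0.8 * itpc2 + rnorm(100)

# Regressing one score on the other gives a (numerically) zero slope,
# so the residuals are just itpc2 again:
coef(lm(itpc2 ~ itpc1))["itpc1"]                  # ~ 0
max(abs(lm(itpc2 ~ itpc1)$residuals - itpc2))     # ~ 0

# Hence the naive nested residuals already match the multiple regression:
r_nested <- lm(lm(it30 ~ itpc1)$residuals ~ itpc2)$residuals
r_mult   <- lm(it30 ~ itpc1 + itpc2)$residuals
max(abs(r_nested - r_mult))                       # ~ 0
```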