Solved – Multilinear regression with multicollinearity: residual regression

multicollinearity, multiple regression, regression

I am trying to build a multilinear regression with predictor variables that are likely correlated. I understand that this is a problem, due to overlapping explanations of data. I think I have a method that may get around this, but would like to see if it is valid.

The idea, simply put, is to do the following:

  1. Perform a linear regression of $y$ on the first predictor and find the residuals
    $$\hat{y}=m_1 x_1+b_1$$ $$r_1=y-\hat{y}$$
  2. Calculate a least squares regression between the residuals of the prior regression and the next predictor variable. For example, $$\hat{r}_1=m_2 x_2+b_2$$ $$r_2=r_1-\hat{r}_1$$
  3. Repeat the idea behind step 2 for all $M$ variables. This can be written as $$\hat{r}_{i-1}=m_i x_i+b_i$$ $$r_i=r_{i-1}-\hat{r}_{i-1}$$

    The idea behind this is that the multilinear regression could then be written as $$\hat{y}_{MLR}=\sum\limits_{j=1}^M \left(m_j x_j+b_j\right)$$ (a rough R sketch of the procedure is given just below).
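For concreteness, here is one way the recipe above might be coded in R. This is only a sketch: the data frame `dat`, the predictor names in `predictors`, and the helper objects `coefs` and `y_hat` are all placeholders introduced for illustration.

predictors <- c("x1", "x2", "x3")     # hypothetical predictor column names in dat
r <- dat$y                            # start from the raw response
coefs <- list()

for (v in predictors) {
  fit <- lm(r ~ dat[[v]])             # regress the current residual on the next predictor
  coefs[[v]] <- coef(fit)             # store (b_i, m_i)
  r <- resid(fit)                     # carry the new residual forward
}

# Composed prediction, matching the summation formula above
y_hat <- Reduce(`+`, lapply(predictors, function(v) {
  coefs[[v]][1] + coefs[[v]][2] * dat[[v]]
}))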

My questions about this are:

  1. Will it work?
  2. Does it truly bypass the problem of multicollinearity?
  3. Are there any unforeseen problems with this that are being overlooked?

Best Answer

I understand that this is a problem, due to overlapping explanations of data.

It depends on what the ultimate goal of your regression is.

  • If you want inference, multicollinearity is problematic.
  • But if you want prediction accuracy, there is no problem with overlapping explanation of the data as long as the correlation between the predictors doesn't change. The problem with correlated predictors shows up instead in the expected test error (MSE), which can be decomposed into three components:

$$E\left[(y_0 - \hat{y}_0)^2\right] = Var(\hat{y}_0) + [Bias(\hat{y}_0)]^2 + Var(\epsilon)$$

  1. The bias of the estimates
  2. The variance of the estimates
  3. The irreducible error $Var(\epsilon)$

When there are two strongly correlated predictors in the data, the matrix $X^T X$ has a determinant close to zero, whereas the covariance matrix of the estimated coefficients, under homoskedasticity, is $\hat{\sigma}^2 (X^T X)^{-1}$. So, if the determinant gets close to zero, the variances get uncomfortably big, and the expected test error gets big too. And that is not good.
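As a quick numerical sketch of this effect (the simulated designs `X_indep` and `X_corr` below are purely illustrative), one can compare the diagonal of $(X^T X)^{-1}$ for uncorrelated versus strongly correlated predictors:

set.seed(1)
n <- 100
z <- rnorm(n)

X_indep <- cbind(1, z, rnorm(n))                 # two uncorrelated predictors
X_corr  <- cbind(1, z, z + rnorm(n, sd = 0.01))  # two strongly correlated predictors

# Under homoskedasticity, Var(beta_hat) = sigma^2 * (X'X)^{-1},
# so the diagonal of (X'X)^{-1} drives the coefficient variances
diag(solve(crossprod(X_indep)))
diag(solve(crossprod(X_corr)))   # much larger entries for the correlated design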


Concerning your three questions:

  1. Will it work? Yes, you can always fit such a model; but lots of models that "work" are useless (see question 3).
  2. Does it truly bypass the problem of multicollinearity? It will bypass the multicollinearity problem in one sense, but there will be other problems with this approach.
  3. Are there any unforeseen problems with this that are being overlooked? Yes. When you fit one variable at a time, it is like a one-way analysis: all the estimated coefficients will be highly biased. Furthermore, this approach neglects multi-variable effects; for example, a variable that is not significant on its own can become useful for the model when included together with other variables.

Let's take a simple example in which I simulated two independent random variables $X_1$ and $X_2$, and my response variable is $Y$ such that the true relationship is:

$$Y = 1 + 2X_1 + 7X_2 + \epsilon$$

# Simulate two independent predictors and the response y = 1 + 2*x1 + 7*x2 + eps
set.seed(3)
x1 = runif(100,0,100)
x2 = runif(100,0,100)
epsi = rnorm(100)
y = 1+2*x1 + 7*x2 + epsi

dat=data.frame(y,x1,x2,epsi)

Proceeding in the "residual regression" way, we first fit $Y$ on $X_1$ alone:

mod1 = lm(y~x1,data=dat)
summary(mod1)

Results:

   Call:
lm(formula = y ~ x1, data = dat)

Residuals:
   Min     1Q Median     3Q    Max 
-333.5 -138.0  -14.4  176.1  354.0 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 371.8279    36.8386  10.093   <2e-16 ***
x1            1.3336     0.6558   2.033   0.0447 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 186.9 on 98 degrees of freedom
Multiple R-squared:  0.04048,   Adjusted R-squared:  0.03069 
F-statistic: 4.135 on 1 and 98 DF,  p-value: 0.04471

Leaving the intercept aside, the estimated coefficient associated with the variable $X_1$ is 1.3, whereas it should have been 2. This coefficient stays unchanged when you fit the residuals of this first model on the variable $X_2$. Why is the intercept so large, and why is $\beta_1$ biased? This is to compensate for the absence of $x_2$ in the model. And if we plot $Y$ against $X_1$, we can see that the trend in $X_1$ is hard to make out because of the large influence of $X_2$:

[Scatter plot of $Y$ against $X_1$]
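For completeness, the second step of the residual-regression recipe can be run on the same data (a sketch; `mod_r` is just an illustrative name). The residuals of `mod1` are regressed on $X_2$, and since the composed prediction simply adds the two fits, the biased slope 1.3 for $X_1$ is carried over unchanged:

# Step 2 of the residual-regression recipe: fit the residuals of mod1 on x2
mod_r = lm(resid(mod1) ~ x2, data = dat)
summary(mod_r)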

And if we fit $Y$ on $X_1$ and $X_2$ together, all the estimated coefficients take the values that we expect, i.e. $\beta_0 \approx 1$, $\beta_1 \approx 2$, $\beta_2 \approx 7$:

mod2 = lm(y~x1+x2,data=dat)
summary(mod2)

Call:
lm(formula = y ~ x1 + x2, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.37635 -0.80796  0.09194  0.69199  2.64910 

Coefficients:
            Estimate Std. Error  t value Pr(>|t|)    
(Intercept) 0.942346   0.311126    3.029  0.00315 ** 
x1          2.002710   0.003904  512.988  < 2e-16 ***
x2          6.998871   0.004186 1671.797  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.106 on 97 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 1.456e+06 on 2 and 97 DF,  p-value: < 2.2e-16
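To make the contrast explicit, one could, as a rough sketch, reuse `mod1`, `mod2`, and the illustrative `mod_r` from above to put the slopes and the in-sample fit of the two approaches side by side (`y_hat_rr` is another name introduced only for this comparison):

# Slopes carried into the composed residual-regression prediction
c(coef(mod1)["x1"], coef(mod_r)["x2"])
# Slopes from the joint fit
coef(mod2)[c("x1", "x2")]

# In-sample residual sum of squares of each approach
y_hat_rr <- fitted(mod1) + fitted(mod_r)   # composed prediction
sum((dat$y - y_hat_rr)^2)
sum(resid(mod2)^2)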