Theoretical reason for multiple linear regression predictions being the same when adding and subtracting predictors

linear, multiple regression, regression

Say I have two variables $x_1$ and $x_2$, and I build a linear regression model as below

$$\hat{Y} = n_1 x_1 + n_2 x_2.$$

Then I build another model as below

$$\hat{Z} = m_1 (x_1 + x_2) + m_2 (x_1 - x_2).$$

Intuitively, $\hat{Y}$ should be equal to $\hat{Z}$. Below is my R code to demonstrate the equivalence

set.seed(1)
num = 10
X1 = runif(num)
X2 = runif(num)
Y = runif(num)

# First model: regress Y on X1 and X2
mydata <- data.frame(X1, X2, Y)
fit1 = lm(Y ~ X1 + X2, data = mydata)
summary(fit1)

# Second model: regress Y on (X1 + X2) and (X1 - X2)
mydata <- data.frame(X1 + X2, X1 - X2, Y)
names(mydata)[1] <- 'new_X1'
names(mydata)[2] <- 'new_X2'

fit2 = lm(Y ~ new_X1 + new_X2, data = mydata)
summary(fit2)

My question is: how can I prove the equivalence conceptually?

Best Answer

Hi: You can prove the equivalence by re-writing your second regression model as

$Z = (m_1 + m_2) \times x_1 + (m_1 - m_2) \times x_2 + \omega$
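This re-write is nothing more than expanding the products and collecting the terms in $x_1$ and $x_2$:

$$m_1 (x_1 + x_2) + m_2 (x_1 - x_2) = (m_1 + m_2)\, x_1 + (m_1 - m_2)\, x_2.$$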

#==================================================================

EDITED ON 11/01/2021 IN ORDER TO PROVIDE CLARIFICATION BASED ON COMMENT FROM OP THAT AN EXTRA ASSUMPTION IS BEING MADE.

#==================================================================

This way, it looks a lot more like the first regression model:

$Y = n_1 \times x_1 + n_2 \times x_2 + \epsilon $

Now, when both models are estimated, the respective responses $Z$ and $Y$ of the two regressions are identical. Also, $n_1$ corresponds to $m_1 + m_2$ and $n_2$ corresponds to $m_1 - m_2$.

So, when the regression using $Z$ is carried out, the coefficients $(\widehat{m_1 + m_2})$ and $(\widehat{m_1 - m_2})$ are estimated such that the sum of squared deviations of $Z$ from $\hat{Z}$ is minimized.

Similarly, from a least squares standpoint, in the first regression model, one is minimizing the sum of squared deviations of $Y$ from $\hat{Y}$ by finding the coefficient estimates, $\hat{n_1}$ and $\hat{n_2}$.

Therefore, from a system-of-equations perspective (taking the derivatives, setting them to zero, and all of that), one has two equations and two unknowns in both cases. Therefore, the results of the minimization procedures have to be identical, in that $\hat{n_1}$ has to correspond to $(\widehat{m_1 + m_2})$ and $\hat{n_2}$ has to correspond to $(\widehat{m_1 - m_2})$.

Does that clarify why the two models are identical? If not, then maybe someone else can give a clearer explanation. In practical terms, the fact that, in the second regression model, the first coefficient is $(m_1 + m_2)$ and the second coefficient is $(m_1 - m_2)$ makes no difference to the minimization algorithm: it views them simply as variables to be estimated. Since the second regression model faces the same minimization problem as the first, the coefficient estimates have to correspond, in that the sum of the coefficients in the second model corresponds to $n_1$ and their difference corresponds to $n_2$.
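To make the "same minimization problem" point concrete, here is a minimal sketch (not from the original post) that builds both design matrices explicitly and solves the normal equations directly. It assumes the X1, X2 and Y vectors from the question's code are still in the workspace, and it includes an intercept column because lm adds one by default:

# Sketch of the "same least-squares problem" argument, assuming X1, X2, Y
# from the question's code are in the workspace.
X_old <- cbind(1, X1, X2)                  # design matrix of the first model (with intercept)
A     <- matrix(c(1, 0, 0,
                  0, 1, 1,
                  0, 1, -1), nrow = 3)     # invertible change of parameterization
X_new <- X_old %*% A                       # columns: 1, X1 + X2, X1 - X2
b_old <- solve(t(X_old) %*% X_old, t(X_old) %*% Y)   # normal equations, first model
b_new <- solve(t(X_new) %*% X_new, t(X_new) %*% Y)   # normal equations, second model
all.equal(X_old %*% b_old, X_new %*% b_new,
          check.attributes = FALSE)        # identical fitted values
all.equal(b_old, A %*% b_new,
          check.attributes = FALSE)        # (n1, n2) = (m1 + m2, m1 - m2)

Because the second design matrix is just the first one multiplied by an invertible matrix A, both fits project Y onto the same column space, which is why the fitted values coincide and the coefficient vectors are related by A.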

#=================================================================

EDIT ON 11-02-2021 IN ORDER TO COMMENT ON NICO'S QUESTION REGARDING HOW $m_1$ AND $m_2$ ARE ACTUALLY OBTAINED.

#====================================================================

Nico: Notice that when the R code is run (output shown below), the coefficients estimated in the second regression are $m_1$ and $m_2$ themselves. So, a system of equations for dealing with the dependence is not necessary, because the relation I used to show the equivalence of the regression models is not used in the R code itself. My explanation of using a 2 by 2 system of equations to solve for $m_1$ and $m_2$ AFTERWARDS is only conceptual. The lm call in the second regression does not need to do anything fancy because there is no dependence between the two coefficients $m_1$ and $m_2$; I only introduced the dependence to show the equivalence of the two models. I hope that helps.
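For completeness (this is not stated explicitly above, but it follows directly from $n_1 = m_1 + m_2$ and $n_2 = m_1 - m_2$), the conceptual 2 by 2 system has the closed-form solution

$$m_1 = \frac{n_1 + n_2}{2}, \qquad m_2 = \frac{n_1 - n_2}{2}.$$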

Note that if you take the two coefficients in the second output, their sum gives the first coefficient of the first regression model, and their difference (the first minus the second) gives the second coefficient.

#====================================================================

Call:
lm(formula = Y ~ X1 + X2, data = mydata)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.3904 -0.2223 -0.0482  0.2495  0.4115 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   0.7706     0.3004   2.566   0.0373 *
X1           -0.2867     0.3326  -0.862   0.4172  
X2           -0.3475     0.3881  -0.895   0.4004  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3149 on 7 degrees of freedom
Multiple R-squared:  0.1815,    Adjusted R-squared:  -0.05238 
F-statistic: 0.776 on 2 and 7 DF,  p-value: 0.4961

mydata <- data.frame(X1 + X2, X1 - X2, Y)
names(mydata)[1] <- 'new_X1'
names(mydata)[2] <- 'new_X2'

fit2 = lm(Y ~ new_X1 + new_X2, data = mydata)
summary(fit2)
Call:
lm(formula = Y ~ new_X1 + new_X2, data = mydata)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.3904 -0.2223 -0.0482  0.2495  0.4115 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   0.7706     0.3004   2.566   0.0373 *
new_X1       -0.3171     0.2550  -1.244   0.2536  
new_X2        0.0304     0.2562   0.119   0.9089  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3149 on 7 degrees of freedom
Multiple R-squared:  0.1815,    Adjusted R-squared:  -0.05238 
F-statistic: 0.776 on 2 and 7 DF,  p-value: 0.4961
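As a quick numerical check of the correspondence described above, one could run the following lines (a small sketch reusing the fit1 and fit2 objects from the code above; not part of the original session):

# Sum and difference of the second model's coefficients recover the first model's.
coef(fit2)["new_X1"] + coef(fit2)["new_X2"]   # equals coef(fit1)["X1"], about -0.2867
coef(fit2)["new_X1"] - coef(fit2)["new_X2"]   # equals coef(fit1)["X2"], about -0.3475
all.equal(fitted(fit1), fitted(fit2))         # the fitted values are identical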


