Regression Analysis – Discrepancy Between Multiple and Simple Linear Regression Results: Which to Report?

regression

I have a dependent variable and 3 independent variables, and I am trying to fit the best model (using R). Examples of my models are below:

```
#Multiple linear regression
mod <- lm(y ~ x1 + x2 + x3, data = data)

#Simple linear regression
mod2 <- lm(y ~ x1, data = data)
mod3 <- lm(y ~ x2, data = data)
mod4 <- lm(y ~ x3, data = data)
```

The output of the multiple regression model shows that x1 and x3 are both significant whereas x2 is not, but when each simple linear regression model is run separately, x2 is significant as well. The output of each model is listed below.

My question is this: If the multiple linear regression model is significant, should that one be used in a report/paper rather than the simple linear regression models, even though one of the variables is not significant? I know that in ANOVA you're not supposed to fall back to the main-effects model when there is a significant interaction, but is there a similar rule in regression?

Also, it should be noted that the R² and AIC values all favor the multiple regression model.
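
A minimal sketch of such a comparison in base R, assuming the four models above are fitted:

```
# AIC for all four candidate models (lower is better)
AIC(mod, mod2, mod3, mod4)

# Adjusted R-squared for each model, pulled from its summary
sapply(list(mod, mod2, mod3, mod4),
       function(m) summary(m)$adj.r.squared)
```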

```
#Multiple linear regression output (mod):
Call:
lm(formula = y ~ x1 + x2 + x3, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.38347 -0.07256  0.01893  0.09067  0.25851 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.9149320  0.0175039  52.270  < 2e-16 ***
x1          -0.0008528  0.0001137  -7.503 2.16e-12 ***
x2          -0.0008818  0.0013214  -0.667    0.505    
x3          -0.0024993  0.0005614  -4.452 1.43e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1142 on 195 degrees of freedom
Multiple R-squared:  0.3791,    Adjusted R-squared:  0.3696 
F-statistic: 39.69 on 3 and 195 DF,  p-value: < 2.2e-16
#Simple linear regression - mod2
Call:
lm(formula = y ~ x1, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.33985 -0.09797  0.01622  0.11713  0.21119 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.8384434  0.0103638  80.901  < 2e-16 ***
x1          -0.0008201  0.0001287  -6.373 1.29e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1312 on 197 degrees of freedom
Multiple R-squared:  0.1709,    Adjusted R-squared:  0.1667 
F-statistic: 40.61 on 1 and 197 DF,  p-value: 1.287e-09
#Simple linear regression - mod3
Call:
lm(formula = y ~ x2, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.44224 -0.08695  0.01411  0.11709  0.26569 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.8886699  0.0189938   46.79  < 2e-16 ***
x2          -0.0046871  0.0009664   -4.85  2.5e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1362 on 197 degrees of freedom
Multiple R-squared:  0.1067,    Adjusted R-squared:  0.1021 
F-statistic: 23.52 on 1 and 197 DF,  p-value: 2.5e-06
#Simple linear regression - mod4
Call:
lm(formula = y ~ x3, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.43780 -0.08383  0.02178  0.10805  0.21189 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.8755046  0.0131656  66.499  < 2e-16 ***
x3          -0.0027375  0.0003918  -6.987 4.23e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.129 on 197 degrees of freedom
Multiple R-squared:  0.1986,    Adjusted R-squared:  0.1945 
F-statistic: 48.82 on 1 and 197 DF,  p-value: 4.229e-11
```

Best Answer

If the multiple linear regression model is significant, then should that one be used in a report/paper rather than the simple linear regression models?

Yes. With correlated predictor variables (as you seem to have), each of which is also correlated with the outcome (as your individual models show), removing any of them from a regression model can lead to omitted-variable bias. Even though x2 doesn't have a "statistically significant" association with the outcome in the multiple-regression model, keeping it in the model is probably the best way to get good estimates of the other coefficients.*
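
A toy simulation (with made-up numbers, not your data) can make the mechanism concrete: when x1 and x2 are correlated and both affect y, dropping x2 biases the coefficient on x1.

```
# Hypothetical illustration of omitted-variable bias
set.seed(1)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n)                # x2 correlated with x1
y  <- 1 - 0.5 * x1 - 0.5 * x2 + rnorm(n)

coef(lm(y ~ x1 + x2))  # both slopes close to the true -0.5
coef(lm(y ~ x1))       # x1's slope absorbs part of x2's effect
```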

There's no problem with reporting a model like that. You might even go on to show that the apparent association of x2 with the outcome in its individual model is spurious, perhaps explained by its correlations with the other two predictors.
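
As a sketch of how one might check that, assuming the predictors sit in columns x1, x2, and x3 of data (vif() from the car package is one common diagnostic; the call assumes that package is installed):

```
# Pairwise correlations among the three predictors
cor(data[, c("x1", "x2", "x3")])

# Variance inflation factors for the multiple-regression model;
# values well above 1 indicate correlated predictors
library(car)
vif(mod)
```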


*Models to be used for prediction often benefit from including as many variables as possible without overfitting, as even "insignificant" variables can improve predictive ability.
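
As an illustrative sketch of that point: for linear models, the leave-one-out cross-validation error (the PRESS statistic) has a closed form in the residuals and hat values, so the full model and a reduced model without x2 can be compared on predictive grounds without refitting in a loop.

```
# PRESS: sum of squared leave-one-out prediction errors,
# computed in closed form as e_i / (1 - h_ii)
press <- function(m) sum((residuals(m) / (1 - hatvalues(m)))^2)

press(mod)                      # full model
press(update(mod, . ~ . - x2))  # the same model with x2 dropped
```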
