I have a dependent variable and 3 independent variables, and I am trying to find the best-fitting model (using R). Examples of my models are below:
#Multiple linear regression
mod <- lm(y ~ x1 + x2 + x3, data = data)
#Simple linear regression
mod2 <- lm(y ~ x1, data = data)
mod3 <- lm(y ~ x2, data = data)
mod4 <- lm(y ~ x3, data = data)
The output of the multiple regression model shows that x1 and x3 are both significant whereas x2 is not, but when each of the simple linear regression models is run, x2 is significant as well. Some of the outputs of each are listed below.
My question is this: If the multiple linear regression model is significant, should that one be used in a report/paper rather than the simple linear regression models, even though one of the variables is not significant? I know that you're not supposed to fall back to the main-effects model when there is a significant interaction between main effects in an ANOVA, but is there a similar rule in regression?
Also, it should be noted that the R² and AIC values all favor the multiple regression model.
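For reference, this is a minimal sketch of how the comparison can be done in one step, assuming the four models have been fit as above (`AIC()` accepts multiple fitted models, and `summary.lm` exposes the adjusted R²):

```r
# Compare AIC across all four models at once; lower AIC is better
AIC(mod, mod2, mod3, mod4)

# Pull the adjusted R-squared from each model's summary for a side-by-side look
sapply(list(mod, mod2, mod3, mod4),
       function(m) summary(m)$adj.r.squared)
```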
#Multiple linear regression output (mod):
Call:
lm(formula = y ~ x1 + x2 + x3, data = data)
Residuals:
Min 1Q Median 3Q Max
-0.38347 -0.07256 0.01893 0.09067 0.25851
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.9149320 0.0175039 52.270 < 2e-16 ***
x1 -0.0008528 0.0001137 -7.503 2.16e-12 ***
x2 -0.0008818 0.0013214 -0.667 0.505
x3 -0.0024993 0.0005614 -4.452 1.43e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1142 on 195 degrees of freedom
Multiple R-squared: 0.3791, Adjusted R-squared: 0.3696
F-statistic: 39.69 on 3 and 195 DF, p-value: < 2.2e-16
#Simple linear regression - mod2
Call:
lm(formula = y ~ x1, data = data)
Residuals:
Min 1Q Median 3Q Max
-0.33985 -0.09797 0.01622 0.11713 0.21119
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.8384434 0.0103638 80.901 < 2e-16 ***
x1 -0.0008201 0.0001287 -6.373 1.29e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1312 on 197 degrees of freedom
Multiple R-squared: 0.1709, Adjusted R-squared: 0.1667
F-statistic: 40.61 on 1 and 197 DF, p-value: 1.287e-09
#Simple linear regression - mod3
Call:
lm(formula = y ~ x2, data = data)
Residuals:
Min 1Q Median 3Q Max
-0.44224 -0.08695 0.01411 0.11709 0.26569
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.8886699 0.0189938 46.79 < 2e-16 ***
x2 -0.0046871 0.0009664 -4.85 2.5e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1362 on 197 degrees of freedom
Multiple R-squared: 0.1067, Adjusted R-squared: 0.1021
F-statistic: 23.52 on 1 and 197 DF, p-value: 2.5e-06
#Simple linear regression - mod4
Call:
lm(formula = y ~ x3, data = data)
Residuals:
Min 1Q Median 3Q Max
-0.43780 -0.08383 0.02178 0.10805 0.21189
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.8755046 0.0131656 66.499 < 2e-16 ***
x3 -0.0027375 0.0003918 -6.987 4.23e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.129 on 197 degrees of freedom
Multiple R-squared: 0.1986, Adjusted R-squared: 0.1945
F-statistic: 48.82 on 1 and 197 DF, p-value: 4.229e-11
Best Answer
Yes. With correlated predictor variables (as you seem to have), each of which is correlated with the outcome (as your individual models show), removing any of them from a regression model can lead to omitted-variable bias. Even though `x2` doesn't have a "statistically significant" association with the outcome in the multiple-regression model, keeping it in the model is probably the best way to get good estimates of the other coefficients. There's no problem with reporting a model like that. You might even go on to show that the apparent association of `x2` with the outcome individually is spurious, perhaps explained by its correlations with the other two predictors.

Models intended for prediction often benefit from including as many variables as possible without overfitting, as even "insignificant" variables can improve predictive ability.
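To check whether the predictors are in fact correlated with each other, one quick sketch (assuming the data frame has columns named `x1`, `x2`, `x3` as in the question, and using `vif()` from the `car` package for variance inflation factors):

```r
# Pairwise correlations among the predictors; strong correlation of x2 with
# x1 or x3 would explain why x2 looks significant on its own but not in the
# multiple regression
cor(data[, c("x1", "x2", "x3")])

# Variance inflation factors quantify how much each coefficient's variance
# is inflated by collinearity (VIF near 1 = little collinearity)
# install.packages("car")  # if not already installed
car::vif(mod)
```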