Solved – Quadratic terms in multiple linear regression

least-squares, multiple-regression, quadratic-form, r, regression

I am trying to predict chronological age from a set of eight DNA methylation markers, expressed as percentages. I'm working with a dataset of 181 samples. When I put the markers into a multiple OLS regression model, all of them are significant. I knew I should check whether quadratic terms were needed for some of the markers, because the relationship between a marker and chronological age is sometimes nonlinear. So I checked the residual plot for every marker individually and found that two of them (X4 and X7) showed a quadratic trend. Here are the plots for X4:
[Figure: nonlinear relationship between Age and X4]
[Figure: residual plot of Age ~ X4]

I proceeded to add quadratic terms for these markers to my model (X4sq and X7sq). When I did this for each marker separately, the linear term was suddenly no longer significant, while the quadratic term was. Furthermore, when I added the quadratic terms for both markers to one model, X7 became insignificant for both the linear and the quadratic term. Here is the output for that model:

Call:
lm(formula = Age ~ X1 + X2 + X3 + X4 + X4sq + X5 + X6 + X7 + 
    X7sq + X8)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.5754  -2.4720   0.1919   2.9345  15.7697 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 47.197343   7.609607   6.202 4.09e-09 ***
X1          -0.243983   0.050604  -4.821 3.15e-06 ***
X2          -0.111650   0.047949  -2.329 0.021062 *  
X3          -0.192721   0.048514  -3.972 0.000105 ***
X4           0.102764   0.192075   0.535 0.593334    
X4sq         0.005619   0.002043   2.751 0.006586 ** 
X5           0.355343   0.107318   3.311 0.001135 ** 
X6           0.293344   0.065174   4.501 1.25e-05 ***
X7          -0.166154   0.451214  -0.368 0.713155    
X7sq         0.019879   0.012136   1.638 0.103254    
X8          -0.144696   0.033079  -4.374 2.12e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.895 on 170 degrees of freedom
Multiple R-squared:  0.9504,    Adjusted R-squared:  0.9475 
F-statistic:   326 on 10 and 170 DF,  p-value: < 2.2e-16
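For reference, this kind of model can also be fit without creating the squared columns by hand, using `I()` or `poly()` inside the formula. Here is a self-contained sketch with simulated stand-in data (the numbers are illustrative, not my dataset):

```r
# Simulate data shaped like the question's: 181 samples, age driven
# by a marker with a quadratic effect (illustrative values only).
set.seed(42)
n   <- 181
X4  <- runif(n, 0, 100)                          # methylation percentage
Age <- 30 + 0.002 * (X4 - 50)^2 + rnorm(n, sd = 3)
dat <- data.frame(Age, X4)

# I() adds the quadratic term inline, without a hand-made X4sq column.
fit_raw <- lm(Age ~ X4 + I(X4^2), data = dat)

# poly(X4, 2) fits the same curve with orthogonal polynomials, which
# removes the collinearity between the linear and quadratic columns.
fit_orth <- lm(Age ~ poly(X4, 2), data = dat)

# Both parameterizations span the same column space, so the fitted
# values (and R^2) are identical; only the coefficients differ.
all.equal(fitted(fit_raw), fitted(fit_orth))
```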

Judging by the residual standard error and the adjusted R², the fit of my model improved when I added either quadratic term, and improved further when I added both at the same time.

But given that some terms become insignificant when I do this, should I take that as a sign that I should not include the quadratic terms? Or is it okay to keep insignificant predictors when the overall fit of the model improves?

Best Answer

This happens partly because your other variables "take credit" for X4sq or X7sq when those terms are not yet in the model. Compare it to the classic confounding example: the number of shark attacks appears to increase significantly with ice-cream sales at the beach. But once we add temperature to the model, which is what actually sends more people to the beach, the ice-cream effect disappears. In your case, the linear and quadratic terms of the same marker are highly correlated, so they split the same signal between them and each one's individual p-value suffers.
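This "credit sharing" between a term and its square is easy to see directly, because x and x² are strongly correlated whenever x does not straddle zero (a simulated sketch, not the asker's data):

```r
set.seed(1)
x <- runif(200, 0, 100)     # e.g. a methylation percentage, all positive

# Raw linear and quadratic terms carry almost the same information:
cor(x, x^2)                 # close to 1, so the terms compete for signal

# Centering x before squaring removes most of this collinearity, which
# often makes the linear term's estimate and p-value interpretable again.
xc <- x - mean(x)
cor(xc, xc^2)               # much closer to 0
```

The fitted curve is the same either way; centering only changes how the explained variance is attributed between the two terms.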

It would be worth checking the correlations among your variables, for example with pairs(data).
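As a numeric companion to the scatterplot matrix (simulated stand-in data, since the original data frame isn't shown):

```r
# Stand-in data frame mimicking one marker plus its square and age.
set.seed(7)
data      <- data.frame(X4 = runif(181, 0, 100))
data$X4sq <- data$X4^2
data$Age  <- 30 + 0.002 * (data$X4 - 50)^2 + rnorm(181, sd = 3)

pairs(data)           # scatterplot matrix, as suggested above
round(cor(data), 2)   # numeric correlations; note how high X4 vs X4sq is
```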

A further improvement: only some of the variables are significant predictors of age, so you could consider removing some of them and comparing the resulting models using the Bayesian Information Criterion (BIC).
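In R, step() performs this kind of selection when you set its penalty k to log(n), which turns the default AIC criterion into BIC (a sketch with simulated stand-in data; the variable names are placeholders, not the asker's markers):

```r
# Simulated data: X1 and X2 truly predict Age, X3 is pure noise.
set.seed(3)
n   <- 181
dat <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n))
dat$Age <- 5 - 0.5 * dat$X1 + 0.8 * dat$X2 + rnorm(n)

full_model <- lm(Age ~ X1 + X2 + X3, data = dat)

# step() with k = log(n) penalizes model size by BIC instead of AIC,
# dropping terms whose removal lowers the criterion.
bic_model <- step(full_model, k = log(n), trace = 0)
formula(bic_model)
```

Note that stepwise selection invalidates the p-values of the final model, so treat the result as exploratory rather than confirmatory.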