Solved – Linearity between predictors and dependent variable in a linear model

assumptions, multiple-regression, predictive-models, r, regression

I ran the following linear model in R:

lm(formula = NA. ~ PC + I(1/SPCI), data = DSET)

The p-value for each predictor is significant, and the model holds up fairly well against most of the usual linear regression assumptions, such as:

  • Normal distribution of errors.
  • High correlation between the predicted and observed values.
  • Homoscedasticity.
  • Non-collinearity between the predictors: PC and (1/SPCI) are not correlated at all! (A quick check of this is sketched just below.)
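
A quick way to verify that claim (a sketch using the variable names from the model call above; mod.na is just an illustrative name for the fitted model, and vif() comes from the car package):

# Correlation between the two predictors as they enter the model
cor(DSET$PC, 1 / DSET$SPCI)

# Variance inflation factors; values near 1 indicate little collinearity
library(car)
mod.na <- lm(NA. ~ PC + I(1/SPCI), data = DSET)
vif(mod.na)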

But, digging deeper into the topic, there's one assumption that my model fails:

Linearity of the relationship between dependent and independent variables.

This seems to contradict the non-collinearity assumption: if NA. and PC are highly correlated, and NA. and (1/SPCI) are too, then PC and (1/SPCI) should be correlated with each other, which would violate the assumption of non-collinearity.

I think I have misunderstood the assumption, or there is an explanation for this.

Best Answer

To add to AdamO's answer, I was taught to base my decisions about model assumptions more on whether failing to address a violation would cause me to misrepresent my data. For a concrete example of what I mean, I simulated some data in R, created some plots, and ran some diagnostics on those data.

# lmSupport contains the lm.modelAssumptions function that I use below
require(lmSupport)
set.seed(12234)

# Create some data with a strong quadratic component
x <- rnorm(200, sd = 1)
y <- x + .75 * x^2 + rnorm(200, sd = 1)

# There is a significant linear trend
mod <- lm(y ~ x)
summary(mod)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7972 -0.9511 -0.1312  0.6659  5.8659 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.77981    0.10463   7.453 2.77e-12 ***
x            1.19417    0.09795  12.191  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.477 on 198 degrees of freedom
Multiple R-squared: 0.4288, Adjusted R-squared: 0.4259 
F-statistic: 148.6 on 1 and 198 DF,  p-value: < 2.2e-16 

However, when plotting the data, it's clear that the curvilinear component is an important aspect of the relationship between x and y.

pX <- seq(min(x), max(x), by = .1)
pY <- predict(mod, data.frame(x = pX))
plot(x, y, frame = F)
lines(pX, pY, col = "red")

[Scatterplot of y against x with the fitted linear regression line in red]

A diagnostic test of linearity also supports our argument that the quadratic component is an important aspect of the relationship between x and y for these data.

lm.modelAssumptions(mod, "linear")

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
     0.7798       1.1942  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
gvlma(x = model) 

                       Value   p-value                   Decision
Global Stat        180.04567 0.000e+00 Assumptions NOT satisfied!
Skewness            32.67166 1.091e-08 Assumptions NOT satisfied!
Kurtosis            23.99022 9.683e-07 Assumptions NOT satisfied!
Link Function      123.35831 0.000e+00 Assumptions NOT satisfied!
Heteroscedasticity   0.02547 8.732e-01    Assumptions acceptable.

# We should probably add the quadratic component to this model
mod <- lm(y ~ x + I(x^2))
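
To make that case explicit (a sketch; this comparison and its output were not part of the original answer, and mod.linear / mod.quad are just illustrative names), an F-test comparing the two fits shows whether the quadratic term is worth keeping:

# Compare the purely linear fit against the fit with the quadratic term
mod.linear <- lm(y ~ x)
mod.quad   <- lm(y ~ x + I(x^2))
anova(mod.linear, mod.quad)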

Let's see what happens when we simulate data with a smaller (but still significant) nonlinear trend.

y <- x + .25 * x^2 + rnorm(200, sd = 1)

mod <- lm(y ~ x)
summary(mod)

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.59701 -0.77446  0.03546  0.80261  2.75938 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.30500    0.07907   3.858 0.000155 ***
x            0.99934    0.07402  13.500  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.116 on 198 degrees of freedom
Multiple R-squared: 0.4793, Adjusted R-squared: 0.4767 
F-statistic: 182.3 on 1 and 198 DF,  p-value: < 2.2e-16 

If we examine a plot of these new data, it's pretty clear that they are well-represented by just the linear trend.

pX <- seq(min(x), max(x), by = .1)
pY <- predict(mod, data.frame(x = pX))
plot(x, y, frame = F)
lines(pX, pY, col = "red")

[Scatterplot of the new y against x with the fitted linear regression line in red]

This is in spite of the fact that this model fails a diagnostic test of linearity.

lm.modelAssumptions(mod, "linear")

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
     0.3050       0.9993  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
gvlma(x = model) 

                     Value   p-value                   Decision
Global Stat        34.6428 5.500e-07 Assumptions NOT satisfied!
Skewness            0.3355 5.624e-01    Assumptions acceptable.
Kurtosis            2.0094 1.563e-01    Assumptions acceptable.
Link Function      32.1379 1.436e-08 Assumptions NOT satisfied!
Heteroscedasticity  0.1600 6.892e-01    Assumptions acceptable.
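
To put a rough number on how much the curvature adds for these data (a sketch; these lines were not part of the original answer, and mod.quad is just an illustrative name), we can compare the variance explained with and without the quadratic term:

# Change in R-squared when the quadratic term is added
mod.quad <- lm(y ~ x + I(x^2))
summary(mod)$r.squared       # linear fit only
summary(mod.quad)$r.squared  # linear + quadratic fit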

My point is that diagnostic tests should not be a substitute for thinking on the part of the analyst; they are tools to help you understand whether your substantive conclusions follow from your analyses. For this reason, I prefer to look at different types of plots rather than rely on global tests when I'm making these sorts of decisions.
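
For instance (a sketch, not part of the original answer), the standard plot method for fitted lm objects in base R produces the usual residual diagnostics in a single panel:

# Residuals vs fitted, normal Q-Q, scale-location, and residuals vs leverage
par(mfrow = c(2, 2))
plot(mod)
par(mfrow = c(1, 1))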