Solved – How to check the linearity assumption

hypothesis testing, linear model, regression

At the moment I am trying to make a list of the different approaches that could be used to verify the linearity of an effect. In a model (Y = b0 + b1.X + etc.), I want to know whether it is acceptable to assume linearity for (X).

What I've been doing so far is to estimate a second model with a quadratic specification (Y = b0 + b1.X + b2.X**2) and (1) look at the significance of the quadratic term (b2), and (2) possibly perform a log-likelihood ratio test against the linear model.
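A minimal sketch of this comparison in R. The built-in mtcars data is used purely as an illustrative assumption (mpg plays the role of Y and wt the role of X; neither comes from the question), and the likelihood ratio statistic is computed by hand from `logLik()` so that no extra package is needed:

```r
# Illustrative data only: mpg stands in for Y, wt stands in for X
data(mtcars)

# Linear specification: Y = b0 + b1*X
m_lin <- lm(mpg ~ wt, data = mtcars)
# Quadratic specification: Y = b0 + b1*X + b2*X^2
m_quad <- lm(mpg ~ wt + I(wt^2), data = mtcars)

# (1) t-test on the quadratic coefficient b2
p_wald <- summary(m_quad)$coefficients["I(wt^2)", "Pr(>|t|)"]

# (2) Likelihood ratio test of the two nested models (1 extra parameter)
lr_stat <- as.numeric(2 * (logLik(m_quad) - logLik(m_lin)))
p_lr <- pchisq(lr_stat, df = 1, lower.tail = FALSE)

c(wald = p_wald, lr = p_lr)
```

Both p-values address the same null hypothesis (b2 = 0), so they will usually agree closely; a small value is evidence against linearity in the direction of a quadratic alternative only.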

However, I fear that this relatively simple approach could be misleading in some circumstances, especially if the pattern of non-linearity does not follow a quadratic shape. Indeed, it fails to reject the assumption of linearity when I simulate data that follow an S-shaped curve.
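A hedged sketch of such a simulation in R. The data-generating process (a `tanh` curve with Gaussian noise), the sample size, and the seed are all assumptions chosen for illustration, not the setup used in the question. Because the curve is odd-symmetric about the centre of x, the quadratic term is expected to be close to zero, so a test of it has little power; a cubic term is added only to show that the non-linearity is detectable by a different specification. The comparison uses the F-test from `anova()` rather than the LR test.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Simulated S-shaped relationship, odd-symmetric about x = 0 (assumed DGP)
x <- seq(-2, 2, length.out = 200)
y <- tanh(2 * x) + rnorm(length(x), sd = 0.2)

m_lin   <- lm(y ~ x)
m_quad  <- lm(y ~ x + I(x^2))
m_cubic <- lm(y ~ x + I(x^2) + I(x^3))

# F-tests of each polynomial specification against the linear model
p_quad  <- anova(m_lin, m_quad)[2, "Pr(>F)"]
p_cubic <- anova(m_lin, m_cubic)[2, "Pr(>F)"]

c(quadratic = p_quad, cubic = p_cubic)
```

Comparing the two p-values shows how much the conclusion depends on the alternative chosen, which is the weakness of testing linearity against a single parametric shape.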

What approaches (other than polynomial specification + log-likelihood ratio test) would you recommend? Ideally a formal test rather than a simulation-based approach, and one that also works for non-nested models (unlike the LR test).

I came across the Vuong test (https://en.wikipedia.org/wiki/Vuong%27s_closeness_test), but I am sure there is more to know on this issue. Thanks for your help!

Best Answer

If you want to see if the relationship between (the conditional expectation of) $y$ and $x_0$ is linear, after adjusting for control variables $x_1, x_2, \dots, x_p$, a simple graphical approach is to create an added-variable plot using the following procedure.

First, regress $y$ on $x_1, x_2, \dots, x_p$ and obtain the residuals from that regression, $\hat{\epsilon}_y$. Then, regress $x_0$ on $x_1, x_2, \dots, x_p$ and obtain the residuals from that regression, $\hat{\epsilon}_{x_0}$.

Then, create a scatter plot of $\hat{\epsilon}_y$ against $\hat{\epsilon}_{x_0}$ and overlay a nonparametric curve (e.g. loess) along with the linear regression line. By the Frisch-Waugh theorem, the slope of this regression line is exactly the coefficient on $x_0$ in the "long" regression that includes all variables $x_0, x_1, \dots, x_p$. The nonparametric curve shows how well the relationship between $y$ and $x_0$, net of the controls, can be approximated as linear: the closer it stays to the straight line, the more reasonable the linearity assumption.
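For reference, the Frisch-Waugh result can be written compactly. The notation here is an assumption introduced for illustration: $X_c$ is the matrix of controls $x_1, \dots, x_p$ together with an intercept column, and $M$ is the corresponding residual-maker matrix.

$$
M = I - X_c (X_c^\top X_c)^{-1} X_c^\top, \qquad
\hat{\epsilon}_y = M y, \qquad
\hat{\epsilon}_{x_0} = M x_0,
$$

$$
\hat{\beta}_{x_0} = \left( \hat{\epsilon}_{x_0}^\top \hat{\epsilon}_{x_0} \right)^{-1} \hat{\epsilon}_{x_0}^\top \hat{\epsilon}_y ,
$$

where $\hat{\beta}_{x_0}$ is the OLS coefficient on $x_0$ in the long regression of $y$ on $x_0, x_1, \dots, x_p$.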

Some simple R code to demonstrate:

data(mtcars)

# full model, with all control variables 
fullmod = lm(mpg ~ wt + vs + gear + am, mtcars)
coef(fullmod)[2]
>     wt 
> -3.786

# regress y on controls and x on controls, extract residuals
eps_y = lm(mpg ~ vs + gear + am, mtcars)$residuals
eps_x = lm(wt ~ vs + gear + am, mtcars)$residuals

# regress epsilon_y on epsilon_x, see the coef is the same as above
coef(lm(eps_y ~ eps_x))[2]
>  eps_x 
> -3.786

# make added variable plot
library(ggplot2)
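# qplot() is deprecated in recent ggplot2 releases, so build the plot with ggplot()
ggplot(data.frame(eps_x, eps_y), aes(x = eps_x, y = eps_y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, colour = "black", se = FALSE) +
  geom_smooth(method = "loess", formula = y ~ x, colour = "red", se = FALSE)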

[Figure: added-variable plot]
