To add to AdamO's answer, I was taught to base my decisions about model assumptions less on formal diagnostics and more on whether failing to address a violated assumption would cause me to misrepresent my data. For a concrete example of what I mean, I simulated some data in R, created some plots, and ran some diagnostics on those data.
# lmSupport contains the lm.modelAssumptions function that I use below
require(lmSupport)
set.seed(12234)
# Create some data with a strong quadratic component
x <- rnorm(200, sd = 1)
y <- x + .75 * x^2 + rnorm(200, sd = 1)
# There is a significant linear trend
mod <- lm(y ~ x)
summary(mod)
Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7972 -0.9511 -0.1312  0.6659  5.8659 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.77981    0.10463   7.453 2.77e-12 ***
x            1.19417    0.09795  12.191  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.477 on 198 degrees of freedom
Multiple R-squared:  0.4288,  Adjusted R-squared:  0.4259 
F-statistic: 148.6 on 1 and 198 DF,  p-value: < 2.2e-16
However, when plotting the data, it's clear that the curvilinear component is an important aspect of the relationship between x and y.
pX <- seq(min(x), max(x), by = .1)
pY <- predict(mod, data.frame(x = pX))
plot(x, y, frame = F)
lines(pX, pY, col = "red")
A diagnostic test of linearity also supports our argument that the quadratic component is an important aspect of the relationship between x and y for these data.
lm.modelAssumptions(mod, "linear")
Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
     0.7798       1.1942  

ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma(x = model) 

                       Value   p-value                   Decision
Global Stat        180.04567 0.000e+00 Assumptions NOT satisfied!
Skewness            32.67166 1.091e-08 Assumptions NOT satisfied!
Kurtosis            23.99022 9.683e-07 Assumptions NOT satisfied!
Link Function      123.35831 0.000e+00 Assumptions NOT satisfied!
Heteroscedasticity   0.02547 8.732e-01    Assumptions acceptable.
# We should probably add the quadratic component to this model
mod <- lm(y ~ x + I(x^2))
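To double-check that the quadratic term is worth keeping, we could also compare the two fits directly. Here is a quick sketch (modLinear and modQuadratic are just illustrative names):

# Refit both models under separate names and compare them with a nested F-test
modLinear    <- lm(y ~ x)
modQuadratic <- lm(y ~ x + I(x^2))
anova(modLinear, modQuadratic)  # a significant F supports keeping the quadratic term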
Let's see what happens when we simulate data with a smaller (but still significant) nonlinear trend.
y <- x + .25 * x^2 + rnorm(200, sd = 1)
mod <- lm(y ~ x)
summary(mod)
Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.59701 -0.77446  0.03546  0.80261  2.75938 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.30500    0.07907   3.858 0.000155 ***
x            0.99934    0.07402  13.500  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.116 on 198 degrees of freedom
Multiple R-squared:  0.4793,  Adjusted R-squared:  0.4767 
F-statistic: 182.3 on 1 and 198 DF,  p-value: < 2.2e-16
If we examine a plot of these new data, it's pretty clear that they are well-represented by just the linear trend.
pX <- seq(min(x), max(x), by = .1)
pY <- predict(mod, data.frame(x = pX))
plot(x, y, frame = F)
lines(pX, pY, col = "red")
This is in spite of the fact that this model fails a diagnostic test of linearity.
lm.modelAssumptions(mod, "linear")
Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
     0.3050       0.9993  

ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma(x = model) 

                     Value   p-value                   Decision
Global Stat        34.6428 5.500e-07 Assumptions NOT satisfied!
Skewness            0.3355 5.624e-01    Assumptions acceptable.
Kurtosis            2.0094 1.563e-01    Assumptions acceptable.
Link Function      32.1379 1.436e-08 Assumptions NOT satisfied!
Heteroscedasticity  0.1600 6.892e-01    Assumptions acceptable.
My point is that diagnostic tests should not be a substitute for thinking on the part of the analyst; they are tools to help you understand whether your substantive conclusions follow from your analyses. For this reason, I prefer to look at different types of plots rather than rely on global tests when I'm making these sorts of decisions.
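For example, a residuals-versus-fitted plot (one sketch of such a plot, using base R's plot method for lm objects) will usually show a missed nonlinear trend as a curve in the smoother:

# Residuals vs. fitted values for the current model; a curved
# smoother suggests a nonlinear trend the model has missed
plot(mod, which = 1)
# A component + residual (partial residual) plot is another option:
# require(car); crPlots(mod)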
Best Answer
Let me explain what linearity means with nominal/dummy variables. In essence, it means that you have not left out an interaction term between your independent variables.†
Suppose we have two nominal variables $x_1$ and $x_2$, each taking values 0 or 1, and a response variable $y$. (The general case is similar.)
If we model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$:
$\beta_0$ is the expected response when $x_1 = x_2 = 0$
$\beta_0 + \beta_1$ is the expected response when $x_1 = 1, x_2 = 0$
$\beta_0 + \beta_2$ is the expected response when $x_1 = 0, x_2 = 1$
$\beta_0 + \beta_1 + \beta_2$ is the expected response when $x_1 = x_2 = 1$
There's a constraint here, since we have three coefficients but four cases: the last expected response minus the first equals the sum of the second minus the first and the third minus the first.
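Written out with the expected responses above, this constraint is
$$\operatorname{E}(Y \mid 1,1) - \operatorname{E}(Y \mid 0,0) = \bigl[\operatorname{E}(Y \mid 1,0) - \operatorname{E}(Y \mid 0,0)\bigr] + \bigl[\operatorname{E}(Y \mid 0,1) - \operatorname{E}(Y \mid 0,0)\bigr],$$
i.e., $\beta_1 + \beta_2 = \beta_1 + \beta_2$, which the three-coefficient model forces to hold.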
If this relationship actually holds in your situation between the expected responses, then this linear model can be a good one. If not, then the failure of this relationship is a type of nonlinearity.
If we include an interaction term, then linearity is automatically satisfied, because we have four coefficients to fit the four cases. That is, with the model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \epsilon$ there is no restriction on the relationship between the expected responses in the four cases above. (However, the distributions of $y$ in these four cases may still differ, which would violate the model as written.)
How do you test whether you can leave out the interaction term? One way would be to try including it and test whether the coefficient $\beta_3$ is distinct from zero. For example, in the case of normal error $\epsilon$, this would be a $t$-test for a slope coefficient in a regression.
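As a minimal sketch in R (the data and the names x1 and x2 here are made up purely for illustration):

# Simulate two dummy predictors with no true interaction
set.seed(1)
x1 <- rbinom(100, 1, 0.5)
x2 <- rbinom(100, 1, 0.5)
y  <- 1 + 0.5 * x1 + 0.8 * x2 + rnorm(100)
# y ~ x1 * x2 expands to y ~ x1 + x2 + x1:x2; the t-test on the x1:x2
# row of the summary tests whether the interaction coefficient is zero
summary(lm(y ~ x1 * x2))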
† An interaction between $x_1$ and $x_2$ is a type of (multi-dimensional) nonlinearity: there's no possibility of a nonlinear relationship between $\operatorname{E}Y$ and $x_1$ when $x_1$ is a dummy variable, but there is between $\operatorname{E}Y$ and $(x_1,x_2)$. That is, there may be no plane passing through the four points $(0,0,\operatorname{E}(Y \mid 0,0))$, $(1,0,\operatorname{E}(Y \mid 1,0))$, $(0,1,\operatorname{E}(Y \mid 0,1))$, $(1,1,\operatorname{E}(Y \mid 1,1))$.
For dummy variables, these interaction terms are the only potential source of nonlinearity of the expected responses.