Solved – How to prove linearity assumption in regression analysis for a continuous dependent and nominal independent variable

assumptions, correlation, regression

I want to check the assumptions for applying linear regression analysis. Among other things, I check the linear relationship between my dependent variable (which is continuous) and my independent (nominal or dummy) variables.

Since scatterplots and Pearson or Spearman correlations are not the right tools for checking the linearity assumption in my case, what is another useful approach for a continuous dependent variable and nominal or dummy independent variables?

Thank you for your help!

Best Answer

Let me explain what linearity means with nominal/dummy variables. In essence, it means that no interaction term between your independent variables has been left out of the model.

Suppose we have two nominal variables $x_1$ and $x_2$, each taking values 0 or 1, and a response variable $y$. (The general case is similar.)

If we model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$:

$\beta_0$ is the expected response when $x_1 = x_2 = 0$

$\beta_0 + \beta_1$ is the expected response when $x_1 = 1, x_2 = 0$

$\beta_0 + \beta_2$ is the expected response when $x_1 = 0, x_2 = 1$

$\beta_0 + \beta_1 + \beta_2$ is the expected response when $x_1 = x_2 = 1$

There's a constraint here, since we have three coefficients but four cases: the last expected response minus the first equals the sum of (the second minus the first) and (the third minus the first).

If this relationship actually holds in your situation between the expected responses, then this linear model can be a good one. If not, then the failure of this relationship is a type of nonlinearity.
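To make the constraint concrete, here is a minimal numeric sketch; the coefficient values are made up for illustration:

```python
# Hypothetical fitted coefficients from the additive model
# y = b0 + b1*x1 + b2*x2 + eps.
b0, b1, b2 = 10.0, 3.0, 5.0

# Expected responses in the four cells implied by the model.
mu_00 = b0            # x1 = 0, x2 = 0
mu_10 = b0 + b1       # x1 = 1, x2 = 0
mu_01 = b0 + b2       # x1 = 0, x2 = 1
mu_11 = b0 + b1 + b2  # x1 = 1, x2 = 1

# The constraint: last minus first equals the sum of the other two differences.
print(mu_11 - mu_00 == (mu_10 - mu_00) + (mu_01 - mu_00))  # True
```

Whatever values $\beta_0, \beta_1, \beta_2$ take, the four fitted cell means always satisfy this identity; the question is whether the *actual* expected responses do.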


If we include an interaction term, then linearity is automatically satisfied, because we have four coefficients to fit the four cases. That is, with the model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \epsilon$ there is no restriction on the relationship between the expected responses in the four cases above. (However, the distributions of $y$ in these four cases may still differ in ways that violate the model as written.)
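A quick sketch of why the saturated model is unrestricted: given *any* four expected cell means (the values below are made up, and deliberately non-additive), we can solve exactly for the four coefficients, with $\beta_3$ absorbing the departure from additivity:

```python
import numpy as np

# Arbitrary hypothetical expected responses for the four (x1, x2) cells.
mu = {(0, 0): 10.0, (1, 0): 13.0, (0, 1): 15.0, (1, 1): 25.0}

# Saturated design: one row [1, x1, x2, x1*x2] per cell.
X = np.array([[1, x1, x2, x1 * x2] for (x1, x2) in mu], dtype=float)
betas = np.linalg.solve(X, np.array(list(mu.values())))

# beta_3 equals mu_11 - mu_10 - mu_01 + mu_00 = 25 - 13 - 15 + 10 = 7.
print(betas)  # [10.  3.  5.  7.]
```

With only three coefficients this linear system would generally have no exact solution, which is precisely the additivity restriction.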


How do you test whether you can leave out the interaction term? One way is to include it and test whether the coefficient $\beta_3$ is significantly different from zero. For example, in the case of normal errors $\epsilon$, this is the usual $t$-test for a slope coefficient in a regression.
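The test above can be sketched as follows; this simulates data with a true interaction (all effect sizes are made up) and runs the $t$-test for $\beta_3$ by hand with NumPy and SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 400
x1 = rng.integers(0, 2, n)
x2 = rng.integers(0, 2, n)
# Hypothetical effect sizes; beta_3 = 2 builds in a real interaction.
y = 1.0 + 2.0 * x1 + 3.0 * x2 + 2.0 * x1 * x2 + rng.normal(0.0, 1.0, n)

# Design matrix: intercept, main effects, interaction.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta, res_ss, _, _ = np.linalg.lstsq(X, y, rcond=None)
df_resid = n - X.shape[1]
sigma2 = res_ss[0] / df_resid
se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X).diagonal())

# t-test for the interaction coefficient beta_3 (last column).
t_stat = beta[3] / se[3]
p_value = 2 * stats.t.sf(abs(t_stat), df_resid)
print(f"beta_3 = {beta[3]:.2f}, p = {p_value:.2g}")
```

A small $p$-value is evidence that the interaction term is needed, i.e. that the additive model would suffer from this type of nonlinearity. In practice a regression package (e.g. `statsmodels` or R's `lm`) reports this $t$-test directly.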


† An interaction between $x_1$ & $x_2$ is a type of (multi-dimensional) nonlinearity: there's no possibility of a nonlinear relationship between $\operatorname{E}Y$ and $x_1$ when $x_1$ is a dummy variable, but there is between $\operatorname{E}Y$ and $(x_1,x_2)$. That is, there may be no plane passing through the four points $(0,0,\operatorname{E}(Y|\,0,0))$, $(1,0,\operatorname{E}(Y|\,1,0))$, $(0,1,\operatorname{E}(Y|\,0,1))$, $(1,1,\operatorname{E}(Y|\,1,1))$.

For dummy variables, these interaction terms are the only potential source of nonlinearity of the expected responses.
