Solved – How to test assumptions with a categorical predictor (planned contrasts) in linear regression

assumptions, categorical data, contrasts, r, regression

I am using regression with planned contrasts and would like to test statistical assumptions. Assumptions are normally tested on the residuals of the regression model, but in this case, I don't know if it makes sense because the predictor variable is categorical (i.e., group) and contrasts are only tested later (one contrast at a time, meaning two groups at a time).

For example, using the amazing performance package (which needs the see package for plotting), we can see that in the normality, homoscedasticity, and homogeneity-of-variance plots the observations cluster at only three points:

library(performance)
library(see)
mod <- lm(mpg ~ factor(cyl), data=mtcars)
check_model(mod)

Normally, with a continuous predictor, the observations would be spread more evenly, like this:

mod2 <- lm(mpg ~ disp, data=mtcars)
check_model(mod2)

Question

Which option below is the correct or best one for this situation? (Feel free to suggest another.) (Edit: I use the normality assumption here for simplicity and conciseness, but I am interested in all assumptions for this situation.)

(a) Assess normality of the dependent variable

shapiro.test(mtcars$mpg)

    Shapiro-Wilk normality test

data:  mtcars$mpg
W = 0.94756, p-value = 0.1229

(b) Assess normality of the residuals of the whole model

shapiro.test(mod$residuals)

    Shapiro-Wilk normality test

data:  mod$residuals
W = 0.97065, p-value = 0.5177

(c) Assess normality of model residuals for each group contrast (combination of two groups) by excluding the third group manually before respecifying the regression

mod1 <- lm(mpg ~ factor(cyl), data=mtcars[which(mtcars$cyl!=4),])
mod2 <- lm(mpg ~ factor(cyl), data=mtcars[which(mtcars$cyl!=6),])
mod3 <- lm(mpg ~ factor(cyl), data=mtcars[which(mtcars$cyl!=8),])

shapiro.test(mod1$residuals)

    Shapiro-Wilk normality test

data:  mod1$residuals
W = 0.9515, p-value = 0.3636

shapiro.test(mod2$residuals)

    Shapiro-Wilk normality test

data:  mod2$residuals
W = 0.95956, p-value = 0.4058

shapiro.test(mod3$residuals)

    Shapiro-Wilk normality test

data:  mod3$residuals
W = 0.96698, p-value = 0.7393

Note: With option (c), I believe the assumptions would apply not to the model as a whole but to each comparison separately. So if you have 2 models with 3 contrasts each, instead of assumption checks for the two models, you would have assumption checks for the 6 contrasts.

Best Answer

The distributional assumptions are the same with categorical predictors as with numerical predictors; namely, the conditional distributions are all assumed to be normal with common variance. However, with categorical predictors you do not have to resort to looking at residuals, because you can investigate the conditional distributions directly by subsetting the data for the different levels of the categorical predictor.
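As a minimal sketch of that subsetting, assuming the same mtcars example as in the question, the response can be split by the levels of the predictor and each group summarised. The names `groups` and `group_summary` are only illustrative; comparing the per-group standard deviations gives a first, informal look at the common-variance assumption.

```r
# Split the response (mpg) by the levels of the categorical predictor (cyl)
groups <- split(mtcars$mpg, factor(mtcars$cyl))

# One row per group: sample size, mean, and standard deviation
group_summary <- data.frame(
  cyl  = names(groups),
  n    = sapply(groups, length),
  mean = sapply(groups, mean),
  sd   = sapply(groups, sd)
)

print(group_summary)
```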

With numerical predictors, there is typically too little data at any single predictor value to allow a meaningful investigation of the conditional distributions. That is why people usually resort to looking at the residuals in such cases.
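To illustrate the point with the question's own data, the sketch below counts how many observations share each distinct value of the numerical predictor disp and compares that with the group sizes for cyl. The variable names are illustrative only.

```r
# Number of observations at each distinct value of the numerical predictor
obs_per_disp <- table(mtcars$disp)

# Number of observations at each level of the categorical predictor
obs_per_cyl <- table(mtcars$cyl)

# Largest subset available for any single disp value
print(max(obs_per_disp))

# Smallest subset available for any cyl level
print(min(obs_per_cyl))
```

With so few observations at any one disp value, a histogram or q-q plot per value is not informative, whereas each cyl group is at least large enough to plot.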

So your option (c) looks OK, except that (i) there is no need to look at residuals; just look at the raw data within each group, and (ii) the Shapiro-Wilk test should be used only as an afterthought. Instead, look mainly at the histograms and q-q plots, and use your subject-matter knowledge.
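The advice above could be sketched as follows in base R, again assuming the mtcars example from the question. This is one possible layout rather than a prescribed procedure: histograms in the top row, normal q-q plots in the bottom row, and per-group Shapiro-Wilk p-values computed last as a secondary check. The names `groups` and `sw_p` are illustrative.

```r
# Raw response values for each level of the categorical predictor
groups <- split(mtcars$mpg, factor(mtcars$cyl))

# Grid of plots: one column per group, histograms above q-q plots
op <- par(mfrow = c(2, length(groups)))

for (lvl in names(groups)) {
  hist(groups[[lvl]], main = paste("cyl =", lvl), xlab = "mpg")
}

for (lvl in names(groups)) {
  qqnorm(groups[[lvl]], main = paste("Normal q-q, cyl =", lvl))
  qqline(groups[[lvl]])
}

par(op)  # restore the previous graphics settings

# Secondary check only: Shapiro-Wilk p-value for each group
sw_p <- sapply(groups, function(x) shapiro.test(x)$p.value)
print(sw_p)
```

With groups this small, the plots and the test both have limited power, so they are best read alongside what is known about the variable itself.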
