Solved – Interpreting results of a multiple linear regression (categorical independent variables)

categorical-data, inference, multiple-regression, r, statistical-significance

Example Data:

GENDER <- c("m", "f", "f", "m", "f", "f")
YEAR <- c("1", "2", "3", "2", "1", "3")
SCORE <- c(23, 25, 26, 23, 19, 29)
data.frame(GENDER, YEAR, SCORE) # data.frame() keeps SCORE numeric; cbind() would coerce it to character

Multiple Linear regression with interaction:

result <- lm(SCORE~GENDER*YEAR)

summary(result)

I want to carry out a multiple linear regression with interaction to find out whether there is a significant difference between the scores of males and females, between years 1, 2, and 3, and whether there is an interaction between the two. After a lot of searching and studying, I'm still at a loss as to how to get the information I want.

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)     19.000      2.121   8.957   0.0708 .
GENDERm          4.000      3.000   1.333   0.4097  
YEAR2            6.000      3.000   2.000   0.2952  
YEAR3            8.500      2.598   3.272   0.1888  
GENDERm:YEAR2   -6.000      4.243  -1.414   0.3918  
GENDERm:YEAR3       NA         NA      NA       NA 

Ignore the data itself, as it is made up and not significant, and I understand the p-values etc.

I simply want to know: what does GENDERm mean in this case? If this result were significant (p < 0.05), what could I conclude from it? From what I understand, it means that the group "male, year 1" scores, on average, 4 points higher than "female, year 1".

Another case, just to be clear: what does GENDERm:YEAR2 mean? If this result were significant, what could I conclude from it? I understand it as this: the group "male, year 2" scores (relatively) 6 points lower than "female, year 1".

If I am understanding this correctly, which I highly doubt, then how can I get some meaningful information from this result?

I fully understand the result when conducting this test on continuous independent variables, I just have a problem with the categoricals!

Thanks in advance 🙂

Best Answer

@juod provides a great explanation of the interpretation of the regression coefficients. I want to add that for models with categorical predictors with more than two levels, you may find an ANOVA table more informative than typical regression output.
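To make the coefficient interpretation concrete, here is a quick sketch using the toy data from the question: each dummy coefficient is a difference between fitted cell means, and the interaction term is a difference of differences.

```r
# Sketch using the toy data from the question: each coefficient is a
# difference between cell means of SCORE.
GENDER <- c("m", "f", "f", "m", "f", "f")
YEAR <- c("1", "2", "3", "2", "1", "3")
SCORE <- c(23, 25, 26, 23, 19, 29)
result <- lm(SCORE ~ GENDER * YEAR)

tapply(SCORE, list(GENDER, YEAR), mean) # cell means by GENDER x YEAR
coef(result)
# (Intercept)   = mean(f, year 1)                    = 19
# GENDERm       = mean(m, 1) - mean(f, 1)            = 23 - 19 = 4
# YEAR2         = mean(f, 2) - mean(f, 1)            = 25 - 19 = 6
# GENDERm:YEAR2 = [mean(m, 2) - mean(f, 2)] - GENDERm = -2 - 4 = -6
# GENDERm:YEAR3 is NA because the (m, 3) cell is empty in these data.
```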

The ANOVA-style output will give you an F test for each effect, whereas the regression output gives you tests for each regression coefficient; a categorical variable with k levels will have k-1 coefficients (from k-1 dummy codes), so a single variable will be represented across multiple lines of output. Any interactions with those categorical variables will also be represented across multiple lines of output. This can make it difficult to tell at a glance whether, for example, there is a significant interaction between GENDER and YEAR. For factors with two levels, the F-test in the ANOVA output will be equivalent to the t-test in the standard regression output ($F=t^2$).
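For instance (a minimal sketch with simulated data), for a two-level factor the F statistic in the ANOVA table is exactly the square of the t statistic for its single coefficient:

```r
# Minimal sketch with simulated data: F = t^2 for a two-level factor.
set.seed(1)
g <- gl(2, 10, labels = c("a", "b")) # two-level factor, 10 cases each
y <- rnorm(20) + as.numeric(g)       # outcome with a group difference
fit <- lm(y ~ g)
t_val <- summary(fit)$coefficients["gb", "t value"]
f_val <- anova(fit)["g", "F value"]
all.equal(t_val^2, f_val)            # TRUE: the two tests are equivalent
```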

To get ANOVA style output, you can use aov in base R, or Anova in the car package --- I recommend the latter. Note that aov will give you Type 1 sums of squares, which may not make sense unless you have a balanced design. Anova lets you select the type of sums of squares you want to calculate. See this previous answer for relevant discussion.

In addition, you'll note in the help documentation for aov and Anova that they recommend you use orthogonal contrast codes for your categorical predictors. By default, R uses traditional dummy coding, which sets the first level of a factor as the reference group and then tests each other level against that --- those are not orthogonal comparisons. If you want to use an ANOVA output summary, first make sure you're using orthogonal contrasts when you estimate the model:
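To see the difference (just illustrating R's built-in contrast functions for a three-level factor), compare the two coding schemes directly:

```r
# Default dummy (treatment) coding vs. Helmert coding for a 3-level factor.
contr.treatment(3)
colSums(contr.treatment(3)) # columns sum to 1, not 0, so the dummy codes
                            # are not orthogonal to the intercept
contr.helmert(3)
colSums(contr.helmert(3))   # each Helmert column sums to zero
crossprod(contr.helmert(3)) # off-diagonals are 0: columns are mutually orthogonal
```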

(The dataset you provided is actually too small to test the model you use, so I'm creating a new dataset here with more cases)

set.seed(24601)
SCORE <- sample(15:25, 30, replace = TRUE)
GENDER <- gl(n=2,k=1, length=30, labels=c("m", "f"))
YEAR <- gl(n=3, k=1, length = 30, labels=c("1", "2", "3"))
result <- lm(SCORE~GENDER*YEAR, contrasts = list(GENDER = contr.helmert, YEAR = contr.helmert))
library(car)
Anova(result, type=2) # type 2 sums of squares (in this case it's a balanced design, so the type of SS won't make a difference)

Here's the output:

Anova Table (Type II tests)

Response: SCORE
             Sum Sq Df F value Pr(>F)
GENDER        8.533  1  0.9143 0.3485
YEAR          0.067  2  0.0036 0.9964
GENDER:YEAR  37.267  2  1.9964 0.1577
Residuals   224.000 24   
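As a quick check of the balanced-design point above (a sketch continuing the same simulated data), the sequential Type I sums of squares from aov agree with the Type II table here, precisely because the design is balanced:

```r
# With the balanced design above, Type I (sequential) SS from aov()
# match the Type II table from car::Anova().
summary(aov(SCORE ~ GENDER * YEAR))
```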