Example Data:
GENDER <- c("m", "f", "f", "m", "f", "f")
YEAR <- c("1", "2", "3", "2", "1", "3")
SCORE <- c(23, 25, 26, 23, 19, 29)
data.frame(GENDER, YEAR, SCORE)  # data.frame() keeps SCORE numeric; as.data.frame(cbind(...)) would coerce everything to character
Multiple linear regression with interaction:
result <- lm(SCORE~GENDER*YEAR)
summary(result)
I want to carry out a multiple linear regression with interaction to find out whether there is a significant difference between the scores of males and females, between years 1, 2, and 3, and whether there is an interaction. After a lot of searching and studying, I'm still at a loss as to how to get the information I want.
Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)
(Intercept)     19.000      2.121   8.957   0.0708 .
GENDERm          4.000      3.000   1.333   0.4097
YEAR2            6.000      3.000   2.000   0.2952
YEAR3            8.500      2.598   3.272   0.1888
GENDERm:YEAR2   -6.000      4.243  -1.414   0.3918
GENDERm:YEAR3       NA         NA      NA       NA
Ignore the data as it is made up and insignificant, and I understand the P values etc.
I simply want to know: what does GENDERm mean in this case? If this result were significant (p < 0.05), what could I interpret from it? From what I understand, it means that the group "male, year 1" scores, relatively, 4 points higher than "female, year 1".
Another case, just to be clear: what does GENDERm:YEAR2 mean? If this result were significant, what could I interpret from it? I understand it as this: the group "male, year 2" is (relatively) 6 score points lower than "female, year 1".
If I am understanding this correctly, which I highly doubt, then how can I get some meaningful information from this result?
I fully understand the result when conducting this test on continuous independent variables, I just have a problem with the categoricals!
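Here is a small sanity check I can run against the raw cell means (a sketch using the made-up data above); the comments spell out which cell-mean contrasts the dummy-coded coefficients correspond to:

```r
# Made-up data from the question, assembled so the coefficients
# can be compared against the raw GENDER x YEAR cell means
GENDER <- factor(c("m", "f", "f", "m", "f", "f"))
YEAR   <- factor(c("1", "2", "3", "2", "1", "3"))
SCORE  <- c(23, 25, 26, 23, 19, 29)

# Mean score in each GENDER x YEAR cell
cell_means <- tapply(SCORE, list(GENDER, YEAR), mean)

fit <- lm(SCORE ~ GENDER * YEAR)

# With default dummy coding, "female, year 1" is the reference cell:
# (Intercept)   = mean of f/1                       = 19
# GENDERm       = (m/1) - (f/1)                     = 23 - 19 = 4
# YEAR2         = (f/2) - (f/1)                     = 25 - 19 = 6
# GENDERm:YEAR2 = [(m/2) - (m/1)] - [(f/2) - (f/1)] = 0 - 6   = -6
# GENDERm:YEAR3 is NA because there is no "male, year 3" observation
coef(fit)
```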
Thanks in advance 🙂
Best Answer
@juod provides a great explanation of the interpretation of the regression coefficients. I want to add that for models with categorical predictors with more than two levels, you may find an ANOVA table more informative than typical regression output.
The ANOVA-style output will give you an F test for each effect, whereas the regression output gives you tests for each regression coefficient; a categorical variable with $k$ levels will have $k-1$ coefficients (from $k-1$ dummy codes), so a single variable will be represented across multiple lines of output. Any interactions with those categorical variables will also be represented across multiple lines of output. This can make it difficult to tell at a glance whether, for example, there is a significant interaction between GENDER and YEAR. For factors with two levels, the F-test in the ANOVA output will be equivalent to the t-test in the standard regression output ($F=t^2$).

To get ANOVA-style output, you can use `aov` in base R, or `Anova` in the `car` package --- I recommend the latter. Note that `aov` will give you Type I sums of squares, which may not make sense unless you have a balanced design. `Anova` lets you select the type of sums of squares you want to calculate. See this previous answer for relevant discussion.

In addition, you'll note in the help documentation for `aov` and `Anova` that they recommend you use orthogonal contrast codes for your categorical predictors. By default, R uses traditional dummy coding, which sets the first level of a factor as the reference group and then tests each other level against that --- those are not orthogonal comparisons. If you want to use an ANOVA output summary, first make sure you're using orthogonal contrasts when you estimate the model. (The dataset you provided is actually too small to test the model you use, so I'm creating a new dataset here with more cases.)
Here's the output: