Solved – interpretation of dummy coded linear regression

interpretation, r, regression

I'm struggling with the interpretation of a regression model in which a categorical variable (5 levels) is dummy coded. Here is the output from R:

Call:
lm(formula = DV ~ Age + Gender + factor(Categorial) + 
Continuous 1 + Continuous 2 + Continuous 3, 
data = dat)

Residuals:
 Min       1Q   Median       3Q      Max 
-1.30058 -0.25326  0.00349  0.28123  1.49877 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)   
(Intercept)           -0.42367    0.30694  -1.380  0.16842   
Age                   -0.05949    0.02026  -2.936  0.00356 **
Gender                -0.01800    0.04828  -0.373  0.70952   
factor(Categorial)2   -0.30625    0.12645  -2.422  0.01596 * 
factor(Categorial)3   -0.03441    0.07752  -0.444  0.65736   
factor(Categorial)4   -0.12603    0.09914  -1.271  0.20453   
factor(Categorial)5   -0.08417    0.13269  -0.634  0.52630    
Continuous 1           0.12080    0.04346   2.779  0.00575 **
Continuous 2          -0.06592    0.04383  -1.504  0.13354   
Continuous 3          -0.06230    0.03475  -1.793  0.07392 . 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4259 on 336 degrees of freedom
  (6 observations deleted due to missingness)
Multiple R-squared:  0.1315,    Adjusted R-squared:  0.1057 
F-statistic: 5.089 on 10 and 336 DF,  p-value: 6.353e-07

OK. Age, level 2 of the categorical variable and the first continuous variable are significant predictors of the dependent variable. So far so good.

What I'm not understanding is:

  1. The reference category of the dummy coded categorical variable is its first category, and it is represented by the intercept, right? How do I interpret this?

  2. When I run an ANOVA with the categorical variable as an independent variable, this factor is a significant predictor. From the results of the linear model, one could conclude that this is only due to category 2, right?

  3. Can I test contrasts with this linear regression model (e.g. Category1 vs. Category2)?

  4. Should I include interactions?

I'd be glad for any help 🙂

Best Answer

I'll give you some examples, but try to find some more info about regression output interpretation.

  • The reference category of the dummy variable IS NOT the intercept, but the information about the reference category is included in the intercept (maybe that's what you had in mind): the intercept is the expected value of the dependent variable for a case in the reference category when all other predictors are 0. Each dummy coefficient is then a difference from the reference category. So, holding all the other variables in the model fixed (that's always the case when interpreting), the difference between level 2 and level 1 (the reference category) is -0.30625, and that difference is statistically significant. In other words, when you compare two cases whose only difference is level 2 vs. level 1, you expect the one in level 1 to be 0.30625 higher on average.

  • An ANOVA that includes ONLY that variable is not the same as a regression model with that variable plus some other variables. The ANOVA takes only this variable into account, while the regression estimates its effect adjusted for all the other variables in the model. If you add or remove other variables, you may not get the same results.

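  • Gender should probably be coded as a dummy variable as well, e.g. with level 1 = Male and level 2 = Female. As it stands, the literal interpretation is that when Gender increases by 1 unit the dependent variable decreases by 0.018, which sounds wrong. With only two codes the estimate is the same either way, but as a factor it reads directly as the difference between the two groups.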

  • The best way to find out whether you need interactions is to include them and check what happens to your predictive accuracy and which p-values you obtain. Keep in mind that many different interactions could be included here, so it helps to be selective.

  • Try to simulate some simple data, run a regression model and check the results. Try to use 2 dummy variables, try to include interactions, and remove some variables to see how the p-values change.
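
As a rough illustration of the points above, here is a minimal simulation sketch in R. Everything in it is made up for the example: the data are simulated, and the names (`sim`, `group`, `x`, `y`, `group_ref2`) and effect sizes are assumptions, not the asker's variables. It fits a dummy coded model, tests the factor as a whole while adjusting for a covariate, changes the reference level, and checks an interaction.

```r
set.seed(42)

# Simulated data (hypothetical names and effect sizes, not the asker's data)
n     <- 500
group <- factor(sample(1:5, n, replace = TRUE))  # 5-level categorical predictor
x     <- rnorm(n)                                # one continuous covariate

# True shift of each level relative to level 1; only level 2 differs
shift <- c(0, -0.3, 0, 0, 0)
y     <- 1 + shift[as.integer(group)] + 0.5 * x + rnorm(n, sd = 0.4)
sim   <- data.frame(y, x, group)

# Dummy coded model: level 1 is the reference category.
# "group2" estimates the difference level 2 minus level 1 at a fixed x;
# the intercept is the expected y for level 1 when x = 0.
fit <- lm(y ~ x + group, data = sim)
print(summary(fit)$coefficients)

# Overall test of the factor, adjusted for x:
# compare the models without and with 'group' using an F test.
fit0 <- lm(y ~ x, data = sim)
print(anova(fit0, fit))

# Changing the reference level changes which differences are reported,
# but not the fitted model. With level 2 as reference, the coefficient
# for level 1 is the negative of "group2" from the first fit.
sim$group_ref2 <- relevel(sim$group, ref = "2")
fit_ref2 <- lm(y ~ x + group_ref2, data = sim)
print(coef(fit_ref2)["group_ref21"])
print(-coef(fit)["group2"])

# One possible interaction: does the slope of x differ between levels?
fit_int <- lm(y ~ x * group, data = sim)
print(anova(fit, fit_int))
```

Releveling the factor is one simple way to read off a specific pairwise difference (for example level 3 vs. level 2) from the coefficient table. The individual t-tests always compare against the current reference level, while the `anova()` comparison asks whether the factor as a whole improves the model.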