R Linear Regression – Categorical Variable “Hidden” Value in Linear Regression

Tags: categorical data, categorical-encoding, r, regression, regression coefficients

This is just an example that I have come across several times, so I don't have any sample data. Running a linear regression model in R:

a.lm = lm(Y ~ x1 + x2)

x1 is a continuous variable. x2 is categorical and has three values, e.g. "Low", "Medium" and "High". However, the output given by R would be something like:

summary(a.lm)
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.521     0.20       1.446   0.19        
x1            -0.61     0.11       1.451   0.17
x2Low         -0.78     0.22       -2.34   0.005
x2Medium      -0.56     0.45       -2.34   0.005

I understand that R introduces some sort of dummy coding on such factors (x2 being a factor). I'm just wondering, how do I interpret the x2 value "High"? For example, what effect does "High" x2s have on the response variable in the example given here?

I've seen examples of this elsewhere (e.g. here) but haven't found an explanation I could understand.

Best Answer

Q: "... how do I interpret the x2 value "High"? For example, what effect does "High" x2s have on the response variable in the example given here?"

A: You have no doubt noticed that there is no mention of x2="High" in the output. That is because x2="High" was chosen as the "base case" (the reference level). You supplied a factor with the default ordering of its levels, which is alphabetical, even though Low/Medium/High would be the more natural order to the human mind. Since "H" comes before both "L" and "M" in the alphabet, R chose "High" as the base case.
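A minimal sketch of that default, using a made-up character vector (the six values below are hypothetical, not the asker's data). It only illustrates how factor() orders levels and which level the treatment contrasts treat as the reference:

```r
# Hypothetical values standing in for the asker's x2
x2 <- c("Low", "Medium", "High", "Medium", "Low", "High")

f <- factor(x2)   # no 'levels' argument, so levels are sorted alphabetically
levels(f)         # "High" "Low" "Medium"

# Default (treatment) contrast matrix: one row per level, one column per
# reported coefficient. The all-zero row is the reference level.
contrasts(f)
```

The row for "High" contains only zeros, which is why no x2High coefficient appears in the summary.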

Since 'x2' was not ordered, each of the reported contrasts is relative to x2="High", so x2="Low" was estimated at -0.78 relative to x2="High". In other words, "High" is estimated to be 0.78 above "Low" and 0.56 above "Medium" at the same value of x1. At the moment the Intercept is the estimated value of "Y" when x2="High" and x1=0. You probably want to re-run your regression after changing the ordering of the levels (but not making the factor ordered):

x2a = factor(x2, levels=c("Low", "Medium", "High"))

Then refit the model with x2a in place of x2. Your 'Medium' and 'High' estimates will then be reported relative to 'Low', which is more in line with what you expect.
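To make this concrete, here is a sketch on simulated data. The sample size, the seed and the "true" effects are all assumptions chosen for illustration, since the question has no sample data. It fits the model once with the default level order and once with the reordered factor, and checks that the two fits are the same model written differently:

```r
set.seed(1)                      # simulated data, purely illustrative
n  <- 300
x1 <- rnorm(n)
x2 <- sample(c("Low", "Medium", "High"), n, replace = TRUE)

# Assumed true level effects: Low = 0, Medium = 1, High = 2
effects <- c(Low = 0, Medium = 1, High = 2)
Y <- 0.5 - 0.6 * x1 + unname(effects[x2]) + rnorm(n, sd = 0.5)

fit_high <- lm(Y ~ x1 + x2)      # default: "High" is the reference level

x2a     <- factor(x2, levels = c("Low", "Medium", "High"))
fit_low <- lm(Y ~ x1 + x2a)      # now "Low" is the reference level

coef(fit_high)   # (Intercept), x1, x2Low, x2Medium
coef(fit_low)    # (Intercept), x1, x2aMedium, x2aHigh

# "High relative to Low" is just "Low relative to High" with the sign flipped
high_vs_low <- unname(coef(fit_low)["x2aHigh"])
low_vs_high <- unname(coef(fit_high)["x2Low"])
all.equal(high_vs_low, -low_vs_high)
```

The fitted values of the two models are identical; only the level that is absorbed into the intercept changes. relevel(factor(x2), ref = "Low") is another way to pick the reference level without spelling out every level.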

Edit: There are alternative coding arrangements (or, more accurately, arrangements of the model matrix). The default choice for contrasts in R is "treatment contrasts", which designates one factor level (or one particular combination of factor levels) as the reference level and reports estimated mean differences for the other levels or combinations. You can, however, get a separate estimate for every level by forcing the Intercept to be 0 (not recommended), or have the estimates reported relative to the mean across levels by using one of the other contrast choices, such as sum contrasts:

?contrasts
?C   # which also means you should _not_ use either "c" or "C" as variable names.

You can choose different contrasts for different factors, although doing so would seem to impose an additional interpretive burden. S-Plus uses Helmert contrasts by default, and SAS uses treatment contrasts but chooses the last factor level rather than the first as the reference level.
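As a rough illustration of those alternatives, the sketch below fits the same model on simulated data (seed, sample size and effects are assumptions) under treatment, sum and SAS-style contrasts, using the `contrasts` argument of lm() with the built-in contr.sum and contr.SAS functions:

```r
set.seed(2)                      # simulated data, purely illustrative
n  <- 300
x1 <- rnorm(n)
x2 <- factor(sample(c("Low", "Medium", "High"), n, replace = TRUE),
             levels = c("Low", "Medium", "High"))
Y  <- 1 + 0.5 * x1 + c(0, 1, 2)[as.integer(x2)] + rnorm(n)

fit_trt <- lm(Y ~ x1 + x2)                                       # treatment, first level as reference
fit_sum <- lm(Y ~ x1 + x2, contrasts = list(x2 = "contr.sum"))   # deviations from the mean of the levels
fit_sas <- lm(Y ~ x1 + x2, contrasts = list(x2 = "contr.SAS"))   # treatment, last level as reference

# Sum contrasts report deviations for the first two levels (x21, x22).
# The last level's deviation is not printed: it is minus their sum.
dev_high <- -sum(coef(fit_sum)[c("x21", "x22")])

# The same quantity rebuilt from the treatment-contrast fit:
# High's effect minus the average effect over Low (0), Medium and High.
b <- coef(fit_trt)
dev_high_from_trt <- unname(b["x2High"] - (0 + b["x2Medium"] + b["x2High"]) / 3)
all.equal(dev_high, dev_high_from_trt)
```

Whatever the coding, one level's estimate is always left implicit and has to be recovered from the others; the codings differ only in what the printed coefficients are measured against.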
