Interpreting coefficients in a linear regression model with categorical variables

Tags: interpretation, multiple regression, r, regression coefficients

I will give my examples with R calls. First a simple example of a linear regression with a dependent variable 'lifespan', and two continuous explanatory variables.

data.frame(height=runif(4000,160,200))->human.life
human.life$weight=runif(4000,50,120)
human.life$lifespan=sample(45:90,4000,replace=TRUE)
summary(lm(lifespan~1+height+weight,data=human.life))

Call:
lm(formula = lifespan ~ 1 + height + weight, data = human.life)

Residuals:
     Min       1Q   Median       3Q      Max 
-23.0257 -11.9124  -0.0565  11.3755  23.8591 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 63.635709   3.486426  18.252   <2e-16 ***
height       0.007485   0.018665   0.401   0.6884    
weight       0.024544   0.010428   2.354   0.0186 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 13.41 on 3997 degrees of freedom
Multiple R-squared: 0.001425,   Adjusted R-squared: 0.0009257 
F-statistic: 2.853 on 2 and 3997 DF,  p-value: 0.05781

To find the estimate of 'lifespan' when the value of 'height' is 1 (and 'weight' is 0), I add (Intercept)+height=63.64319
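A fitted value from this model is (Intercept) + b_height × height + b_weight × weight, so every predictor needs an explicit value; adding only the intercept and the 'height' coefficient gives the fitted value at height = 1 and weight = 0. The following is a minimal sketch of that arithmetic, checked against predict(). The seed and the value height = 180 are assumptions added for illustration, so the numbers will differ from the output above.

```r
# Simulated data as in the question; the seed is an assumption for reproducibility
set.seed(1)
human.life <- data.frame(height = runif(4000, 160, 200))
human.life$weight <- runif(4000, 50, 120)
human.life$lifespan <- sample(45:90, 4000, replace = TRUE)

fit <- lm(lifespan ~ 1 + height + weight, data = human.life)
b <- coef(fit)

# Fitted value by hand for an illustrative height of 180 and weight of 1
by.hand <- b[["(Intercept)"]] + b[["height"]] * 180 + b[["weight"]] * 1

# The same fitted value from predict()
by.predict <- predict(fit, newdata = data.frame(height = 180, weight = 1))
all.equal(by.hand, unname(by.predict))

# (Intercept) + height coefficient is the fitted value at height = 1, weight = 0
at.height.1 <- predict(fit, newdata = data.frame(height = 1, weight = 0))
all.equal(b[["(Intercept)"]] + b[["height"]], unname(at.height.1))
```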

Now what if I have a similar data frame, but one where one of the explanatory variables is categorical?

data.frame(animal=rep(c("dog","fox","pig","wolf"),1000))->animal.life
animal.life$weight=runif(4000,8,50)
animal.life$lifespan=sample(1:10,replace=TRUE)
summary(lm(lifespan~1+animal+weight,data=animal.life))

Call:
lm(formula = lifespan ~ 1 + animal + weight, data = animal.life)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7677 -2.7796 -0.1025  3.1972  4.3691 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 5.565556   0.145851  38.159  < 2e-16 ***
animalfox   0.806634   0.131198   6.148  8.6e-10 ***
animalpig   0.010635   0.131259   0.081   0.9354    
animalwolf  0.806650   0.131198   6.148  8.6e-10 ***
weight      0.007946   0.003815   2.083   0.0373 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 2.933 on 3995 degrees of freedom
Multiple R-squared: 0.01933,    Adjusted R-squared: 0.01835 
F-statistic: 19.69 on 4 and 3995 DF,  p-value: 4.625e-16

In this case, to find the estimate of 'lifespan' when the value of 'weight' is 1, should I add each of the coefficients for 'animal' to the intercept: (Intercept)+animalfox+animalpig+animalwolf? Or what is the proper way to do this?

Thanks
Sverre

Best Answer

No, you shouldn't add all of the coefficients together. You essentially have the model

$$ {\rm lifespan} = \beta_{0} + \beta_{1} \cdot {\rm fox} + \beta_{2} \cdot {\rm pig} + \beta_{3} \cdot {\rm wolf} + \beta_{4} \cdot {\rm weight} + \varepsilon $$

where, for example, ${\rm pig} = 1$ if the animal is a pig and 0 otherwise. So calculating $\beta_{0} + \beta_{1} + \beta_{2} + \beta_{3} + \beta_{4}$, as you've suggested, to get the overall average when ${\rm weight}=1$ is like asking "if you were a pig, a wolf, and a fox, and your weight was 1, what is your expected lifespan?". Since each animal is only one of those things, that doesn't make much sense.
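The indicator coding can be inspected directly. This is a small sketch using model.matrix() on a hypothetical four-row data frame (one row per animal, weight fixed at 1); the data frame name `animals` is an assumption for illustration.

```r
# One row per animal, each with weight = 1 (hypothetical illustration data)
animals <- data.frame(animal = c("dog", "fox", "pig", "wolf"), weight = 1)

# Design matrix R builds for the formula used in the question
X <- model.matrix(~ 1 + animal + weight, data = animals)
print(X)

# Each row has at most one animal indicator switched on;
# the dog row has none, because dog is the reference level
indicator.counts <- rowSums(X[, c("animalfox", "animalpig", "animalwolf")])
print(indicator.counts)
```

No row ever has more than one of the animal columns equal to 1, which is why summing all the animal coefficients does not correspond to any real observation.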

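You will have to do this separately for each animal. For example, $\beta_{0} + \beta_{2} + \beta_{4}$ is the expected lifespan for a pig when its weight is 1. Dog has no coefficient of its own because it is the reference level, so $\beta_{0} + \beta_{4}$ is the expected lifespan for a dog when its weight is 1.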
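The per-animal sums can be checked against predict(). The sketch below makes two assumptions that are not in the question: a seed for reproducibility, and an explicit sample size of 4000 for 'lifespan' (the question's call omits the size, so it draws only 10 values, which R recycles down the column). The estimates will therefore differ from the output shown in the question.

```r
# Simulated data; the seed and the explicit size of 4000 are assumptions
set.seed(1)
animal.life <- data.frame(animal = rep(c("dog", "fox", "pig", "wolf"), 1000))
animal.life$weight <- runif(4000, 8, 50)
animal.life$lifespan <- sample(1:10, 4000, replace = TRUE)

fit <- lm(lifespan ~ 1 + animal + weight, data = animal.life)
b <- coef(fit)

# Expected lifespan at weight = 1, one animal at a time, by hand
by.hand <- c(
  dog  = b[["(Intercept)"]] + b[["weight"]],                     # reference level
  fox  = b[["(Intercept)"]] + b[["animalfox"]]  + b[["weight"]],
  pig  = b[["(Intercept)"]] + b[["animalpig"]]  + b[["weight"]],
  wolf = b[["(Intercept)"]] + b[["animalwolf"]] + b[["weight"]]
)

# The same four values from predict()
new.animals <- data.frame(animal = c("dog", "fox", "pig", "wolf"), weight = 1)
by.predict <- predict(fit, newdata = new.animals)
names(by.predict) <- new.animals$animal

all.equal(by.hand, by.predict)
```

In practice predict() with a newdata frame is the less error-prone route, since it applies the indicator coding for you.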
