Solved – Does dummy coding a variable affect the intercept in a linear regression model

categorical-encoding, intercept, regression, regression-coefficients

My colleague and I both used R to fit a linear regression with the same dataset and the same variables. The outcome variable is test grade, while the independent variables are gender, age, and times of homework submission. My colleague threw everything into the model as is, while I dummy coded gender first. It turned out that our model coefficients are almost the same except for the intercept. The difference in the intercept is small (.08). I am curious whether dummy coding the gender variable caused that discrepancy.

With a categorical variable like gender, the two levels (0, 1) could mean the same thing (?) to the regression, even though, strictly speaking, one version is categorical while the other is continuous. But I am not sure whether that is the reason. Or is there always a certain degree of difference in the intercept when we fit a regression model? Can anyone help? Thanks!

Best Answer

R uses dummy coding by default for categorical predictor variables which are declared as factors. It treats the first level of the categorical variable as the "reference level" and creates dummy variables that enable the comparison of each subsequent level against that reference level. To see which level R treats as "first" for a factor, simply use the levels() command on that factor. You can also reorder the levels of a factor to make your comparisons more meaningful if needed.
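As a quick illustration of checking and changing the reference level, here is a minimal sketch. The gender vector and the object names are made up for the example, since the data from the question is not available.

```r
# Hypothetical gender variable; character values are sorted alphabetically
# when converted to a factor, so "female" becomes the first level.
gender <- factor(c("male", "female", "female", "male"))
levels(gender)            # the first element is the reference level

# relevel() moves the chosen level to the front; the data values are unchanged.
gender_ref_male <- relevel(gender, ref = "male")
levels(gender_ref_male)
```

Specifying levels = ... in factor(), as done further down, is an equivalent way to set the order when the factor is first created.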

In the example below, am is a categorical predictor variable from the mtcars dataset which stands for type of transmission for a car (0 = automatic, 1 = manual). In this dataset, am is declared as numeric (num) but we can convert it to a factor named am1vs0:

str(mtcars$am)
mtcars$am1vs0 <- factor(mtcars$am, levels = c(0,1))

By listing the levels of this factor in the order seen above (i.e., 0, 1), we are forcing R to treat 0 as the reference level and compare the remaining level, 1, against 0.

We can then fit the model below, where mpg stands for miles per gallon and wt stands for weight of car:

M1 <- lm(mpg ~ wt + am1vs0, data = mtcars)
summary(M1) 

The summary of the model fit will include this portion of output:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 37.32155    3.05464  12.218 5.84e-13 ***
wt          -5.35281    0.78824  -6.791 1.87e-07 ***
am1vs01     -0.02362    1.54565  -0.015    0.988    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

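In this model, the intercept denotes the expected value of mpg for cars for which am1vs0 is equal to the reference level 0 (i.e., cars with automatic transmission) and for which wt is equal to 0. The coefficient of am1vs01 is the difference in expected mpg between manual and automatic cars which have the same weight.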

Now, let's say that you want 1 to be the reference level for am:

mtcars$am0vs1 <- factor(mtcars$am, levels = c(1,0))

By listing the levels of the factor am0vs1 in the order seen above (i.e., 1, 0) we are forcing R to treat 1 as the reference level and compare the remaining level, 0, against 1. The corresponding model would be:

M2 <- lm(mpg ~ wt + am0vs1, data = mtcars)
summary(M2) 

The output for this second model is:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 37.29794    2.08566  17.883  < 2e-16 ***
wt          -5.35281    0.78824  -6.791 1.87e-07 ***
am0vs10      0.02362    1.54565   0.015    0.988    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

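The intercept for this second model represents the expected value of mpg for cars for which am0vs1 is equal to the reference level 1 (i.e., cars with manual transmission) and for which wt is equal to 0. The coefficient of am0vs10 is the difference in expected mpg between automatic and manual cars which have the same weight.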

So the two models have intercepts with different interpretations, because each model uses a different reference level for the categorical predictor am.
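The two fits describe the same model in different parameterizations, which can be checked numerically. The sketch below refits both models on a copy of mtcars (the name dat is only for this illustration) and compares them: the second intercept should equal the first intercept plus the first dummy coefficient (37.32155 - 0.02362 in the output above), the dummy coefficient should flip sign, and the fitted values should agree.

```r
dat <- mtcars
dat$am1vs0 <- factor(dat$am, levels = c(0, 1))   # reference level 0
dat$am0vs1 <- factor(dat$am, levels = c(1, 0))   # reference level 1

M1 <- lm(mpg ~ wt + am1vs0, data = dat)
M2 <- lm(mpg ~ wt + am0vs1, data = dat)
b1 <- coef(M1)
b2 <- coef(M2)

# Intercept of M2 = intercept of M1 + dummy coefficient of M1
same_intercept <- all.equal(unname(b2["(Intercept)"]),
                            unname(b1["(Intercept)"] + b1["am1vs01"]))

# The dummy coefficient only changes sign when the reference level is swapped
flipped_sign <- all.equal(unname(b2["am0vs10"]), -unname(b1["am1vs01"]))

# Both parameterizations give the same fitted values
same_fit <- all.equal(fitted(M1), fitted(M2))

c(same_intercept, flipped_sign, same_fit)
```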

Now, we don't know from your post whether you and your colleague used the same reference level for your categorical predictor gender. If you are seeing different intercepts for your models, chances are that you used different reference levels for gender. Of course, you used a numeric binary variable rather than a factor variable in your model. But that should still give you the same result as if you had coded gender as a factor and treated the level denoted by 0 as the reference level.
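To check that last point on data we can all run, here is a small sketch using mtcars again as a stand-in for your dataset, with am playing the role of gender. The object names fit_num and fit_fac are only for this illustration.

```r
# am entered as a numeric 0/1 variable
fit_num <- lm(mpg ~ wt + am, data = mtcars)

# am entered as a factor; levels are sorted, so 0 is the reference level
fit_fac <- lm(mpg ~ wt + factor(am), data = mtcars)

# Coefficient names differ ("am" vs "factor(am)1"), so compare values only
same_coefs <- all.equal(unname(coef(fit_num)), unname(coef(fit_fac)))
same_coefs
```

If the two versions of your own model still differ after you confirm that both use the same reference level, it is worth comparing the coding of the variable itself between the two fits, for example with table() on each version of gender.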
