Regression in R – How to Do Regression with Effect Coding Instead of Dummy Coding

categorical-data, categorical-encoding, r, regression

I am currently working on a regression model in which all of the independent variables are categorical (factor) variables. The dependent variable is a logit-transformed ratio.

It is fairly easy to run an ordinary regression in R, since R automatically knows how to create dummy variables as soon as a predictor is of type "factor". However, this coding also means that one category of each variable serves as the baseline, which makes the coefficients hard to interpret.
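For illustration (not part of my model), this is what R's default treatment (dummy) coding looks like for a small hypothetical factor; the first level becomes the baseline:

f <- factor(c("A", "B", "C"))   # hypothetical example factor
contrasts(f)                    # default contr.treatment: "A" is the baseline
  B C
A 0 0
B 1 0
C 0 1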

My professor has told me to use effect coding instead (-1 or 1), since that makes the intercept estimate the grand mean.
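As far as I understand, that is the coding produced by R's contr.sum, where the last level is coded -1 on every contrast column:

contr.sum(3)
  [,1] [,2]
1    1    0
2    0    1
3   -1   -1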

Does anyone know how to handle that?

So far I have tried:

gm <- mean(tapply(ds$ln.crea, ds$month,  mean))
model <- lm(ln.crea ~ month + month*month + year + year*year, data = ds, contrasts = list(gm = contr.sum))

Call:
lm(formula = ln.crea ~ month + month * month + year + year * 
    year, data = ds, contrasts = list(gm = contr.sum))

Residuals:
     Min       1Q   Median       3Q      Max 
-0.89483 -0.19239 -0.03651  0.14955  0.89671 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.244493   0.204502 -15.865   <2e-16 ***
monthFeb    -0.124035   0.144604  -0.858   0.3928    
monthMar    -0.365223   0.144604  -2.526   0.0129 *  
monthApr    -0.240314   0.144604  -1.662   0.0993 .  
monthMay    -0.109138   0.144604  -0.755   0.4520    
monthJun    -0.350185   0.144604  -2.422   0.0170 *  
monthJul     0.050518   0.144604   0.349   0.7275    
monthAug    -0.206436   0.144604  -1.428   0.1562    
monthSep    -0.134197   0.142327  -0.943   0.3478    
monthOct    -0.178182   0.142327  -1.252   0.2132    
monthNov    -0.119126   0.142327  -0.837   0.4044    
monthDec    -0.147681   0.142327  -1.038   0.3017    
year1999     0.482988   0.200196   2.413   0.0174 *  
year2000    -0.018540   0.200196  -0.093   0.9264    
year2001    -0.166511   0.200196  -0.832   0.4073    
year2002    -0.056698   0.200196  -0.283   0.7775    
year2003    -0.173219   0.200196  -0.865   0.3887    
year2004     0.013831   0.200196   0.069   0.9450    
year2005     0.007362   0.200196   0.037   0.9707    
year2006    -0.281472   0.200196  -1.406   0.1625    
year2007    -0.266659   0.200196  -1.332   0.1855    
year2008    -0.248883   0.200196  -1.243   0.2164    
year2009    -0.153083   0.200196  -0.765   0.4461    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.3391 on 113 degrees of freedom
Multiple R-squared: 0.3626, Adjusted R-squared: 0.2385 
F-statistic: 2.922 on 22 and 113 DF,  p-value: 0.0001131 

Best Answer

There are two common types of contrast coding with which the intercept estimates the Grand Mean: sum contrasts and repeated contrasts (sliding differences).

Here's an example data set:

set.seed(42)
x <- data.frame(a = c(rnorm(100, 2), rnorm(100, 1), rnorm(100, 0)),  # three groups with true means 2, 1, 0
                b = rep(c("A", "B", "C"), each = 100))               # grouping variable with levels A, B, C

The conditions' means:

tapply(x$a, x$b, mean)
         A           B           C 
2.03251482  0.91251629 -0.01036817 

The Grand Mean:

mean(tapply(x$a, x$b, mean))
[1] 0.978221

You can specify the type of contrast coding with the contrasts parameter in lm.
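Alternatively (a minimal sketch, base R only), you can attach the coding to the factor itself or change the session default:

x$b <- factor(x$b)            # ensure b is a factor (needed in R >= 4.0, where data.frame() no longer auto-converts strings)
contrasts(x$b) <- contr.sum   # attach sum coding to the factor itself
lm(a ~ b, x)

# or change the session default for all unordered factors
options(contrasts = c("contr.sum", "contr.poly"))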

Sum contrasts

lm(a ~ b, x, contrasts = list(b = contr.sum))

Coefficients:
(Intercept)           b1           b2  
     0.9782       1.0543      -0.0657 

The intercept is the Grand Mean. The first slope is the difference between the first factor level and the Grand Mean. The second slope is the difference between the second factor level and the Grand Mean.
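You can verify this against the cell means computed above; the effect of the last level is not printed, but it is implied because the effects sum to zero:

m <- lm(a ~ b, x, contrasts = list(b = contr.sum))
coef(m)["(Intercept)"]          # Grand Mean: 0.9782
coef(m)["b1"]                   # mean(A) - Grand Mean: 2.0325 - 0.9782 = 1.0543
-sum(coef(m)[c("b1", "b2")])    # implied effect of C: mean(C) - Grand Mean = -0.9886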

Repeated contrasts

The function for creating repeated contrasts is part of the MASS package.

lm(a ~ b, x, contrasts = list(b = MASS::contr.sdif))

Coefficients:
(Intercept)         b2-1         b3-2  
     0.9782      -1.1200      -0.9229 

The intercept is the Grand Mean. The slopes indicate the differences between consecutive factor levels (2 vs. 1, 3 vs. 2).
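Again, this matches the consecutive differences of the cell means shown above:

diff(tapply(x$a, x$b, mean))    # B - A = -1.1200, C - B = -0.9229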
