multiple-regression – When is Deviation Coding Useful in Multiple Regression Analysis?

categorical datacategorical-encodingcontrastsmultiple regression

After many years of learning about contrasts in linear models I am curious about the relative usefulness of deviation coding, as it is defined by this website. I would appreciate someone filling me in on a few things. Here is a data simulation exercise to highlight the source of my confusion. I want to simulate data where there is one two-level categorical predictor (group) and one continuous predictor (pred). I have programmed in between-group differences in the relationship of both predictors to the outcome variable (outcome).

# toy data

set.seed(0001)

# two variables with a .7 correlation, where mean of outcome variable is 0
mu1 <- c(0,0)
Sigma1 <- matrix(.7, nrow=2, ncol=2) + diag(2)*.3 
rawvars1 <- as.data.frame(MASS::mvrnorm(n=100, mu=mu1, Sigma=Sigma1))
names(rawvars1) <- c("pred", "outcome")

# two variables with a .3 correlation where mean of outcome variable is 4
mu2 <- c(0,4)
Sigma2 <- matrix(.3, nrow = 2, ncol = 2) + diag(2)*.7
rawvars2 <- as.data.frame(MASS::mvrnorm(n=100, mu=mu2, Sigma=Sigma2))
names(rawvars2) <- c("pred", "outcome")

# bind dataframes and add a grouping variable 
df <- rbind(rawvars1, rawvars2)
df$group <- factor(rep(letters[1:2], each = 100))

Let's check that we did what we intended to by getting the pred-outcome correlations at each level of group

by(df, df$group, function (i) cor.test(i$outcome, i$pred))

Good. Correlations of r = .666 in group a and r = .3 in group b. Very close to the correlations we programmed into the two datasets before we bound them. What about the means of the outcomes?

tapply(df$outcome, df$group, mean)

Means of ~0 for the outcome in group a and ~4 in group b. So our dataframe behaves like we told it to. Now let's try regression with different coding scheme

Treatment coding

lm(outcome ~ pred*group, df, contrasts = list(group = c(0,1)))

# Coefficients:
#   (Intercept)         pred       group1  pred:group1  
#      0.009227     0.665201     4.047715    -0.288114 

Ok so here the intercept is the mean of the outcome in group a, which is about 0. Next the pred coefficient is the amount thatpred predicts the outcome in group a only, about the same as the correlation we extracted earlier, so good. Next, the group1 coefficient is the difference in mean outcome value (i.e. intercept) between group b compared to group a. Once again this is right, a predicted difference of ~4. Lastly the pred:group1 coefficient is the difference in slope between group b and group a = .665 + -.288 = .377, which, once again is about what the correlation was between the predictor and outcome in group b. This coding scheme has retrieved the parameters accurately. Next simple coding

Simple coding

lm(outcome ~ pred*group, df, contrasts = list(group = c(-.5,.5)))

# Coefficients:
#   (Intercept)         pred       group1  pred:group1  
#        2.0331       0.5211       4.0477      -0.2881  

Here the intercept should be the grand mean

mean(df$outcome) 
[1] 2.070099

…close enough. And the pred coefficient should be the slope/correlation averaged across groups

mean(c(.665, .377))
[1] 0.521 

The group coefficient is also correct, a mean difference of four, and the pred:group1 coefficient once again is the difference in slope between groups: -2.88. Like treatment coding, simple coding gives us accurate information. Its incremental benefit over treatment coding is that it yields something like a 'main effect' contrast, the pred coefficient, which is the amount that the pred predictor predicts the outcome, averaged across groups.

Deviation coding

lm(outcome ~ pred*group, df, contrasts = list(group = c(-1,1)))

# Coefficients:
#   (Intercept)         pred       group1  pred:group1  
#        2.0331       0.5211       2.0239      -0.1441 

I am confused about what the coefficients are giving us here. Here the intercept is once again the grand mean, the pred coefficient is the slope averaged across groups, but the coefficients for the group1 and pred:group1 are exactly half what they are in under the simple coding scheme

So is the group coefficient simply half of the real difference between groups, half of the coefficient obtained by the simple coding scheme, i.e. incorrect?

2 * 2.0239
[1] 4.0478

Or is it the difference between the group b mean outcome and the grand mean?

2.0331 + 2.0239
[1] 4.057 # not quite right but close

And what is the pred:group1 coefficient, if not mistaken?

And how does any of this relate to type-1 and type-3 sums of squares? The deviation coding is what we get if we specify options(contrasts = c("contr.sum", "contr.poly)), which is supposed to be type-3 (a la SPSS default coding). But it doesn't retrieve the correct parameters, unless I am mistaken and interpreting the regression output incorrectly.

Best Answer

@llewmills: This week, I encountered a project where the deviation coding you inquired about came in handy, so I thought I would share here what I learned on this topic.

First, I think it will be easier if we start with a simpler model:

m0.dev <- lm(outcome ~ group, df, 
             contrasts = list(group = c(-1,1)))
summary(m0.dev)

whose output is given by:

Call:
lm(formula = outcome ~ group, data = df, 
   contrasts = list(group = c(-1, 1)))

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3119 -0.6728  0.1027  0.6748  2.8539 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.07010    0.07019   29.49   <2e-16 ***
group1       1.98435    0.07019   28.27   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9926 on 198 degrees of freedom
Multiple R-squared:  0.8015,    Adjusted R-squared:  0.8005 
F-statistic: 799.3 on 1 and 198 DF,  p-value: < 2.2e-16

In this model, which uses deviation coding for group, the intercept represents the overall mean of the outcome value y across the groups a and b, whereas the coefficient of group1 represents the difference between this overall mean and the mean of the outcome value y for group a. The overall mean is none other than the mean of these two means: (i) the mean of y for group a and (ii) the mean of y for group b.

In other words:

2.07010 = overall mean of y across groups a and b

1.98435 = (overall mean of y across groups a and b) - mean of y for group a

Recall that, for your simulated data, the mean of y for group a is 0.08574619 and the mean of y for group b is 4.05445164. Indeed:

means <- tapply(df$outcome, df$group, mean)
means

> means    
     a          b 
0.08574619 4.05445164 

Here is how we recover these means from the model summary reported above:

# group a: 
mean of group a = overall mean  - (overall mean - mean of group a) 
            = intercept in deviation coded model m0.dev - coef of group a (aka, group 1) in model m0.dev 
            = 2.07010 - 1.98435 = 0.08575

# group b: 
mean of group b = overall mean  - (overall mean - mean of   group b) 
            = overall mean + (overall mean - mean of group a) 
            = intercept in deviation coded model m0.dev + coef of group a (aka, group 1) in model m0.dev 
            = 2.07010 + 1.98435 = 4.05445

In the above, we used the fact that the sum of the deviations of each group's mean from the overall mean is zero:

(overall mean - mean of group a) + (overall mean - mean of group b) = 0.  

This implies that:

(overall mean - mean of group b) = -(overall mean - mean of group a).      

Now, we can plot the group-specific means and the overall means derived from the summary of model m0.dev using this code:

plot(1:2, as.numeric(means), type="n", xlim=c(0.5,2.5), ylim=c(0,5), xlab="group", ylab="y", xaxt="n", las=1)
segments(x0=1-0.2, y0=means[1], x1 = 1+0.2, y1=means[1], lwd=3, col="dodgerblue") # group a mean
segments(x0=2-0.2, y0=means[2], x1 = 2+0.2, y1=means[2], lwd=3, col="orange")# group b mean  
segments(x0=1.5-0.2, y0=coef(m0.dev)[1], x1 = 1.5+0.2, y1=coef(m0.dev)[1], lwd=3, col="red3", lty=2) # overall mean
text(1, means[1]+0.3, "Mean value of y \nfor group a\n(0.0857)", cex=0.8)
text(2, means[2]+0.3, "Mean value of y \nfor group b\n(4.0545)", cex=0.8)
text(1.5, coef(m0.dev)[1]+0.3, "Overall mean value of y \n across groups a and b\n(2.0701)", cex=0.8)
axis(1, at = c(1,2), labels=c("a","b"))

enter image description here

We are now ready to move to the more complicated model below:

m1.dev <- lm(outcome ~ pred*group, df, contrasts = list(group = c(-1,1)))

summary(m1.dev)

whose output is given by:

Call:
lm(formula = outcome ~ pred * group, data = df, contrasts = list(group = c(-1, 
1)))

Residuals:
     Min       1Q   Median       3Q      Max 
-2.78611 -0.60587 -0.05853  0.58578  2.42940 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.03308    0.06125  33.191  < 2e-16 ***
pred         0.52114    0.06557   7.947 1.47e-13 ***
group1       2.02386    0.06125  33.041  < 2e-16 ***
pred:group1 -0.14406    0.06557  -2.197   0.0292 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8628 on 196 degrees of freedom
Multiple R-squared:  0.8515,    Adjusted    R-squared:  0.8493 
F-statistic: 374.7 on 3 and 196 DF,  p-value: < 2.2e-16

This model is essentially fitting two different lines - one per each of the groups a and b - which describe how the outcome value y tends to vary with the predictor pred. The intercepts of the two lines are expressed in relation to the intercept of the overall line across the two groups. Furthermore, the slopes of the two lines are expressed in relation to the slopes of the overall regression line across the two groups. The model reports the following quantities:

2.03308 = estimated intercept of the overall regression line across the two groups a and b (i.e., intercept reported in the summary for model m1.dev)

0.52114 = estimated slope of the overall regression line (i.e., the coefficient of pred reported in the summary for model m1.dev)

2.02386 = estimated difference between the intercept of the overall regression line and the intercept of the regression line for group a (aka, group 1) (i.e., the coefficient of group1 reported in the summary for model m1.dev)

-0.14406 = estimated difference between the slope of the overall regression line and the slope of the regression line for group a (aka, group 1) (i.e., the coefficient of pred:group1 reported in the summary for model m1.dev)

Here is the R code which allows you to recover the intercept and slope for the group-specific regression lines:

# Note: group a is re-labelled as group 1

intercept of group a line = intercept of overall line  - (intercept of overall line - intercept of group a line)  
            = intercept in deviation coded model m1.dev - coef of group 1 in model m1.dev 
            = 2.03308 - (2.02386) = 0.00922

slope of group a line = slope of overall line  - (slope of overall line - slope of group a line) = 
            = slope of pred in deviation coded model m1.dev - slope of pred:group1 in model m1.dev 
            = 0.52114 - (-0.14406) =  0.6652

# Now, use that: 
# (intercept of overall line - intercept of group a line)  + (intercept of overall line - intercept of group b line) = 0 
# which means that: 
# (intercept of overall line - intercept of group b line) = - (intercept of overall line - intercept of group a line)  

# Similarly: 
# (slope of overall line - slope of group a line)  + (slope of overall line - slope of group b line) = 0 
# which means that: 
# (slope of overall line - slope of group b line) = - (slope of overall line - slope of group a line)  


intercept of group b line = intercept of overall line  - (intercept of overall line - intercept of group b line)
                      = intercept of overall line  + (intercept of overall line - intercept of group a line)  
                      = intercept in deviation coded model m1.dev + coef of group 1 in model m1.dev 
                      = 2.03308 + (2.02386) = 4.05694

slope of group b line = slope of overall line  - (slope of overall line - slope of group b line) = 
                  = slope of overall line  + (slope of overall line - slope of group a line) = 
                  = slope of pred in deviation coded model m1.dev + slope of pred:group1 in model m1.dev 
                  = 0.52114 + (-0.14406) = 0.37708

Using this information, you can plot the group-specific and overall regression lines with this code:

plot(outcome ~ pred, data = df, type="n")
# overall regression line 
abline(a = coef(m1.dev)["(Intercept)"], 
       b = coef(m1.dev)["pred"], 
       col="red3", lwd=2, lty=2)
# regression line for group a 
abline(a = coef(m1.dev)["(Intercept)"] - coef(m1.dev)["group1"], 
      b = coef(m1.dev)["pred"] - coef(m1.dev)["pred:group1"], 
      col="dodgerblue", lwd=2)
 # regression line for group b 
 abline(a = coef(m1.dev)["(Intercept)"] + coef(m1.dev)["group1"], 
        b = coef(m1.dev)["pred"] + coef(m1.dev)["pred:group1"], col="orange", lwd=2)
 points(outcome ~ pred, data = subset(df, group=="a"), col="dodgerblue")
 points(outcome ~ pred, data = subset(df, group=="b"), col="orange")

enter image description here

Of course, the summary of model m1.dev allows you to test the significance of the intercept and slope of the overall regression line by judging the significance of the p-values reported for the Intercept and pred portions of the output. It also allows you to test the significance of the:

  • Deviation of the intercept of the group a regression line from the overall intercept (via the p-value reported for group1);
  • Deviation of the slope of the group b regression line from the overall slope (via the p-value reported for pred:group1).

When would you want to use the deviation coding situation captured by the model m1.dev? If:

  • The outcome variable y represents some water quality parameter;
  • The predictor variable pred represents a (centered) year variable;
  • The group factor represents a season variable (with levels a = Winter and b = Summer);

then you can envision wanting to report not just a winter-specific and a summer-specific (linear) trend over time in the values of the water quality parameter, but also an overall trend across the two seasons. However, for the overall trend to be interpretable, you might want to have some overlap in the data for the two seasons. (In your generated data, you have essentially no overlap - so an overall trend may not be informative.)

I hope this answers your question, which was very interesting.

Related Question