After many years of learning about contrasts in linear models I am curious about the relative usefulness of deviation coding, as it is defined by this website. I would appreciate someone filling me in on a few things. Here is a data simulation exercise to highlight the source of my confusion. I want to simulate data where there is one two-level categorical predictor (group
) and one continuous predictor (pred
). I have programmed in between-group differences in the relationship of both predictors to the outcome variable (outcome
).
# toy data
set.seed(0001)
# two variables with a .7 correlation, where mean of outcome variable is 0
mu1 <- c(0,0)
Sigma1 <- matrix(.7, nrow=2, ncol=2) + diag(2)*.3
rawvars1 <- as.data.frame(MASS::mvrnorm(n=100, mu=mu1, Sigma=Sigma1))
names(rawvars1) <- c("pred", "outcome")
# two variables with a .3 correlation where mean of outcome variable is 4
mu2 <- c(0,4)
Sigma2 <- matrix(.3, nrow = 2, ncol = 2) + diag(2)*.7
rawvars2 <- as.data.frame(MASS::mvrnorm(n=100, mu=mu2, Sigma=Sigma2))
names(rawvars2) <- c("pred", "outcome")
# bind dataframes and add a grouping variable
df <- rbind(rawvars1, rawvars2)
df$group <- factor(rep(letters[1:2], each = 100))
Let's check that we did what we intended to by getting the pred-outcome
correlations at each level of group
by(df, df$group, function (i) cor.test(i$outcome, i$pred))
Good. Correlations of r = .666 in group a
and r = .3 in group b
. Very close to the correlations we programmed into the two datasets before we bound them. What about the means of the outcomes?
tapply(df$outcome, df$group, mean)
Means of ~0 for the outcome in group a
and ~4 in group b
. So our dataframe behaves like we told it to. Now let's try regression with different coding scheme
Treatment coding
lm(outcome ~ pred*group, df, contrasts = list(group = c(0,1)))
# Coefficients:
# (Intercept) pred group1 pred:group1
# 0.009227 0.665201 4.047715 -0.288114
Ok so here the intercept is the mean of the outcome in group a
, which is about 0. Next the pred
coefficient is the amount thatpred
predicts the outcome in group a
only, about the same as the correlation we extracted earlier, so good. Next, the group1
coefficient is the difference in mean outcome value (i.e. intercept) between group b
compared to group a
. Once again this is right, a predicted difference of ~4. Lastly the pred:group1
coefficient is the difference in slope between group b
and group a
= .665 + -.288 = .377, which, once again is about what the correlation was between the predictor and outcome in group b
. This coding scheme has retrieved the parameters accurately. Next simple coding
Simple coding
lm(outcome ~ pred*group, df, contrasts = list(group = c(-.5,.5)))
# Coefficients:
# (Intercept) pred group1 pred:group1
# 2.0331 0.5211 4.0477 -0.2881
Here the intercept should be the grand mean
mean(df$outcome)
[1] 2.070099
…close enough. And the pred
coefficient should be the slope/correlation averaged across groups
mean(c(.665, .377))
[1] 0.521
The group
coefficient is also correct, a mean difference of four, and the pred:group1
coefficient once again is the difference in slope between groups: -2.88. Like treatment coding, simple coding gives us accurate information. Its incremental benefit over treatment coding is that it yields something like a 'main effect' contrast, the pred
coefficient, which is the amount that the pred
predictor predicts the outcome, averaged across groups.
Deviation coding
lm(outcome ~ pred*group, df, contrasts = list(group = c(-1,1)))
# Coefficients:
# (Intercept) pred group1 pred:group1
# 2.0331 0.5211 2.0239 -0.1441
I am confused about what the coefficients are giving us here. Here the intercept is once again the grand mean, the pred
coefficient is the slope averaged across groups, but the coefficients for the group1
and pred:group1
are exactly half what they are in under the simple coding scheme
So is the group
coefficient simply half of the real difference between groups, half of the coefficient obtained by the simple coding scheme, i.e. incorrect?
2 * 2.0239
[1] 4.0478
Or is it the difference between the group b
mean outcome and the grand mean?
2.0331 + 2.0239
[1] 4.057 # not quite right but close
And what is the pred:group1
coefficient, if not mistaken?
And how does any of this relate to type-1 and type-3 sums of squares? The deviation coding is what we get if we specify options(contrasts = c("contr.sum", "contr.poly))
, which is supposed to be type-3 (a la SPSS default coding). But it doesn't retrieve the correct parameters, unless I am mistaken and interpreting the regression output incorrectly.
Best Answer
@llewmills: This week, I encountered a project where the deviation coding you inquired about came in handy, so I thought I would share here what I learned on this topic.
First, I think it will be easier if we start with a simpler model:
whose output is given by:
In this model, which uses deviation coding for group, the intercept represents the overall mean of the outcome value y across the groups a and b, whereas the coefficient of group1 represents the difference between this overall mean and the mean of the outcome value y for group a. The overall mean is none other than the mean of these two means: (i) the mean of y for group a and (ii) the mean of y for group b.
In other words:
2.07010 = overall mean of y across groups a and b
1.98435 = (overall mean of y across groups a and b) - mean of y for group a
Recall that, for your simulated data, the mean of y for group a is 0.08574619 and the mean of y for group b is 4.05445164. Indeed:
Here is how we recover these means from the model summary reported above:
In the above, we used the fact that the sum of the deviations of each group's mean from the overall mean is zero:
This implies that:
Now, we can plot the group-specific means and the overall means derived from the summary of model m0.dev using this code:
We are now ready to move to the more complicated model below:
whose output is given by:
This model is essentially fitting two different lines - one per each of the groups a and b - which describe how the outcome value y tends to vary with the predictor pred. The intercepts of the two lines are expressed in relation to the intercept of the overall line across the two groups. Furthermore, the slopes of the two lines are expressed in relation to the slopes of the overall regression line across the two groups. The model reports the following quantities:
2.03308 = estimated intercept of the overall regression line across the two groups a and b (i.e., intercept reported in the summary for model m1.dev)
0.52114 = estimated slope of the overall regression line (i.e., the coefficient of pred reported in the summary for model m1.dev)
2.02386 = estimated difference between the intercept of the overall regression line and the intercept of the regression line for group a (aka, group 1) (i.e., the coefficient of group1 reported in the summary for model m1.dev)
-0.14406 = estimated difference between the slope of the overall regression line and the slope of the regression line for group a (aka, group 1) (i.e., the coefficient of pred:group1 reported in the summary for model m1.dev)
Here is the R code which allows you to recover the intercept and slope for the group-specific regression lines:
Using this information, you can plot the group-specific and overall regression lines with this code:
Of course, the summary of model m1.dev allows you to test the significance of the intercept and slope of the overall regression line by judging the significance of the p-values reported for the Intercept and pred portions of the output. It also allows you to test the significance of the:
When would you want to use the deviation coding situation captured by the model m1.dev? If:
then you can envision wanting to report not just a winter-specific and a summer-specific (linear) trend over time in the values of the water quality parameter, but also an overall trend across the two seasons. However, for the overall trend to be interpretable, you might want to have some overlap in the data for the two seasons. (In your generated data, you have essentially no overlap - so an overall trend may not be informative.)
I hope this answers your question, which was very interesting.