MATLAB: Dumthe variable coding in mixed models (LME)

anovacoeftestcontrastsfitlmefixed effectsfixed factorslinear mixed modelslinear modelslinear regressionlmemixed modelsregression

Hi all,
I've been a little perplexed by the different ways to code dummy variables when fitting a linear mixed model (using fitlme). Specifically I have a model with two categorical fixed factors. I'd like to do contrasts between the different levels of each factor. Now several online sources tell me I should use 'effects' coding for this, but it isn't clear to me why this is, nor is it clear to me how I should code my contrast matrix when using 'effects' coding rather than reference coding.
For example, if I have a model with an intercept and one categorical fixed factor with three levels, such that:
T = table(y,var1); % y is my response variable, var1 a categorical variable with three levels
formula = 'y ~ var1';
m1 = fitlme(T,formula,'DummyVarCoding', 'reference');
me1 = fitlme(T,formula,'DummyVarCoding', 'effects');
Now I'd like to test whether there's a difference between the first and second categories. In the case of the m1 (reference coding) and me1 (effects coding) I would do this as follows:
[p,F,df1,df2] = coefTest(m1,[0 1 0]); % coefTest(lme,H)

[pe,Fe,df1e,df2e] = coefTest(me1,[1 1 0]);
This gives me very different results even though in both cases multiplying each contrast matrix with the associated fixed effects matrix gives me the same value (according to the documentation for coefTest: "It tests the null hypothesis that H0: Hβ = 0, where β is the fixed-effects vector."
When I try both options with some simulated data, where the first two levels of the fixed factor differ significantly, I only find this significant difference when using reference coding. Any insight would be greatly appreciated.
Here's the code for some simulated data:
d1 = zeros(100,1)+randn(100,1);
d2 = ones(100,1)+randn(100,1);
d3 = ones(100,1).*2+randn(100,1);
d = [d1;d2;d3];
cov1 = [zeros(100,1);ones(100,1);ones(100,1).*2];
T = table(d,categorical(cov1),'VariableNames',{'d','cov1'});
f = 'd~cov1';
m1 = fitlme(T,f);
me1 = fitlme(T,f,'DummyVarCoding','effects');
[p,F,df1,df2] = coefTest(m1,[0 1 0]); % coefTest(lme,H)
[pe,Fe,df1e,df2e] = coefTest(me1,[1 1 0]);

Best Answer

Hi! Yes I solved it.
The problem was that, in the example above, I calculated the H vector incorrectly. This is the vector [0 1 0] which is input to the coefTest() function and which tells the function which contrast to perform.
To compute correct H vectors we need to first look at the model coefficients. For the dummy coded model these are:
(intercept)
Cov_1
Cov_2
And for the effects coded model these are:
(intercept)
Cov_0
Cov_1
To compute the H vector for a contrast, you need to make a vector for each of the two conditions you want to contrast and subtract those vectors from each other. In this case we want to compare condition 0 with condition 1.
For the dummy coded model condition 0 is represented by only the intercept, because in dummy coding the intercept is set equal to the estimate of the first condition, in other words the intercept = Cov_0, so this vector = [1 0 0]. Condition 1 is represented by the intercept + the coefficient Cov_1, so the vector = [1 1 0]. The difference between these vectors: H= [0 -1 0]. You could flip the order of the subtraction around: H = [0 1 0] would be equivalent.
For the effects coded model the intercept does NOT represent condition 0. Instead, condition 0 is now represented by the intercept + the coefficient Cov_0, so this vector = [1 1 0]. The second condition is equal to the intercept + Cov_1, just like in the dummy coded model, but now this vector = [1 0 1]. The difference between these two is H = [0 1 -1] or [0 -1 1].
Using the new vectors both coefTest(m1, [0 1 0]) and coefTest(me1, [0 1 -1]) return the same output.
I personally prefer using dummy coding for doing contrasts, because I find it a bit easier to intuit the H vector, but in principle they are entirely equivalent. When using the anova() function though, you NEED to use the effects coded model to get accurate statistics.
However, you might wonder how to compute contrasts with the third condition, represented by Cov_2, when using effects coding. Cov_2 is not explicitly mentioned in the effects coded model, so how does this work? Well condition 2 is represented by the vector [1 -1 -1], so a contrast between condition 1 and 2 would give H = [1 0 1] - [1 -1 -1] = [0 1 2] when using effects coding. Whereas using dummy coding it would give H = [1 1 0] - [1 0 1] = [0 1 -1].
Another thing I found hard to figure out because the Matlab documentation doesn't discuss it, is how to compare groups of conditions, for instance when you have nested categories. I thought it might be helpful to mention here as well. For example imagine you want to test the efficacy of two diets on weight loss and you have data from both men and women. You would have four categories: weight loss with diet 1 and weight loss with diet 2 for both men and women. You want to study the interaction between gender and type of diet. Your model would be: 'Weight loss ~ DietType * Gender'.
Your dummy coded model would have the following coefficients:
(intercept) < which represents the category Diet_0 + Gender_0
Diet_1
Gender_1
Diet_1*Gender_1
If you want to see whether there's a difference between the first and second diet, the procedure is to add the vectors for each individual category within a group, and then subtract the two group vectors, as follows:
(v1 + v2) - (v3 + v4) = H
Here v1 represents Gender_0 on Diet_0 (just the intercept), v2 represents Gender_1 on Diet_0, v3 represents Gender_0 on Diet_1 and v4 represents Gender_1 on Diet_1. The calculation would be:
([1 0 0 0] + [1 0 1 0]) - ([1 1 0 0]+[1 1 1 1]) = [0 -2 0 -1] = H
If you want to compare efficacy of dieting between men and women, you would say H =
([1 0 0 0] + [1 1 0 0]) - ([1 0 1 0] + [1 1 1 1]) = [0 0 -2 -1]