Solved – How to compare nested factor levels to mean, not to first level, and how to test the last nested level in R lm()

Tags: categorical data, contrasts, lm, nested data, r

I have two factors j (with 3 levels A, B, C) and k (with 3 levels M, N, O), with k nested within j. Level A of j is the reference level and it has only one k level, M, in it.

What I want to test with the model is: (1) are the means for each non-control j level significantly different from the mean for j="A"? (2) is the mean of each j:k nested group significantly different from the mean of the j level that contains it?

Here I generate a synthetic data set in which every j:k group mean differs significantly but every j level mean does not (according to the above criteria), then fit the model "x ~ j/k" (k nested within j) and look at the summary:

> set.seed(0)
> j = gl(3, 30, labels=LETTERS[1:3])
> k = gl(3, 10, length=90, labels=LETTERS[13:15])
> k[1:30] = "M" # Reference j="A" has k="M" only
> x = 100*(2^as.numeric(k)) # 200,400,800 for k=M,N,O
> x[1:30] = mean(c(200,400,800)) # j="A" gets the same overall mean as "B" and "C"
> x = x + rnorm(90, 0, 4) # A little randomness
> x = x + rep(c(0, 1, 1.2), each=30) # A slight non-significant deviation by j.
> df = data.frame(x=x, j=j, k=k)
> f = lm(x ~ j/k, data=df)
> summary(f)

Call:
lm(formula = x ~ j/k, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.3711 -2.7546 -0.0538  2.2673 10.2899 

Coefficients: (2 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 100.0878     0.6635  150.84   <2e-16 ***
jB          101.5278     1.3270   76.51   <2e-16 ***
jC          100.6014     1.3270   75.81   <2e-16 ***
jA:kN             NA         NA      NA       NA    
jB:kN       198.9841     1.6253  122.43   <2e-16 ***
jC:kN       200.3907     1.6253  123.30   <2e-16 ***
jA:kO             NA         NA      NA       NA    
jB:kO       598.8600     1.6253  368.46   <2e-16 ***
jC:kO       601.9267     1.6253  370.35   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.634 on 83 degrees of freedom
Multiple R-squared:  0.9998,    Adjusted R-squared:  0.9998 
F-statistic: 8.132e+04 on 6 and 83 DF,  p-value: < 2.2e-16

At first glance it appears that part of what I wanted to test is present in the table. I should see six significant entries besides the intercept, and I do. But closer inspection shows that none of the entries in the summary table, which test the 7 estimable coefficients of the model, is actually testing what I want to test!

Rows jB and jC are significant. I expected these to test the jB and jC means against the jA mean, in which case they would not be significant. But that is not what they are testing! Row jB:kN and the other three interaction rows are significant. I expected these to test the jB:kN (etc.) means against the jB (etc.) mean, but they do not!

Notice that the jB and jC coefficients in the "Estimate" column are significantly different from 0. This means the summary table's jB and jC entries cannot be testing a coefficient equal to the jB and jC means minus the jA mean, since those differences were constructed to be non-significant. In fact, these coefficients compare the jB:kM and jC:kM nested group means to the control group jA mean (the intercept), as verified by computing those differences directly:

> mean(df$x[df$j == "B" & df$k == "M"]) - mean(df$x[df$j == "A"])
[1] 101.5278
> mean(df$x[df$j == "C" & df$k == "M"]) - mean(df$x[df$j == "A"])
[1] 100.6014

These are exactly the jB and jC coefficient values. Thus coefficient jB is significant if the jB:kM group mean differs significantly from the jA control group mean. That is not the coefficient or test I expected, nor one in which I am interested. Instead, I want a coefficient that is the entire j level's mean minus the control mean, so the test tells whether the j level differs from the control. I also want coefficients that are the jB:kM and jC:kM nested groups' means minus their j level means, so those tests tell whether each nested level differs from the mean of the level it is nested within. How can I respecify the model (or its contrast matrix?) to make this happen?
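
To make the targets concrete, these are the quantities whose significance I want tested, written directly as group-mean differences on the df created above:

> # Desired test 1: a non-control j level's mean vs. the control mean
> mean(df$x[df$j == "B"]) - mean(df$x[df$j == "A"])
> # Desired test 2: a nested j:k group's mean vs. its containing j level's mean
> mean(df$x[df$j == "B" & df$k == "N"]) - mean(df$x[df$j == "B"])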

Is the jB:kN coefficient (along with the other three interaction coefficients) equal to that nested group's mean minus the containing j level's mean? It is not:

> mean(df$x[df$j == "B" & df$k == "N"]) - mean(df$x[df$j == "B" & df$k == "M"])
[1] 198.9841

This value is the jB:kN coefficient, so it is the difference between the jB:kN group mean and the jB:kM group mean! Again, that is not the coefficient or test I expected, nor one in which I am interested. Instead, I want a coefficient that is the jB:kN nested group's mean minus the jB level mean, so the test tells whether the nested level differs from the mean of the level it is nested within.

In summary, I have three problems:

  1. I want the mean of each outer j level compared to the reference level mean, NOT the mean of the first nested k level within each j level.

  2. I want my nested levels compared to the mean of the j level that contains them, NOT to the reference level mean or to the "pseudo-reference" that is the first nested level within each j level.

  3. I want additional tests to appear in the summary table, comparing the last nested k level means to the mean of the j level that contains them.

How can I do this?

Best Answer

The answer in short is "contrasts". I had heard of them before but hadn't learned what they are; in fact, they serve exactly the purpose I was looking for. Contrasts can be used to alter the coefficients: a given model always has the same number of coefficients, but contrasts turn them into different linear combinations of the default ones. That is, the model is reparameterized: same model, different coefficient meanings and values.

Also, comparisons of other "virtual parameters" (linear combinations of the actual parameters) that are not tested in the model summary can be made after the model is fit, using the multcomp package, although this is not necessarily simple. (The "contrasts" package may also do this, though I couldn't figure out how to use it.)

The default contrast in R is "contr.treatment", and it leads to the coefficients in the problem statement, with each level compared to the base level of its factor. Another contrast, "contr.sum", generates coefficients that compare each level to the mean of all the levels of that factor.
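
To see the difference concretely, the two contrast matrices for a 3-level factor can be printed directly (both functions are in base R, so this output is deterministic):

> contr.treatment(3)
  2 3
1 0 0
2 1 0
3 0 1
> contr.sum(3)
  [,1] [,2]
1    1    0
2    0    1
3   -1   -1

Note that contr.treatment names its columns after the non-reference levels, while contr.sum leaves its columns unnamed; that detail will matter for the coefficient names below.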

In my case, I want the outer factor to keep using "contr.treatment", because I do want the outer factor compared to the control level jA. For the inner factor, though, I want "contr.sum", so that the k levels get compared to the mean of all k levels at that j level. This is done after creating the k factor, with the assignment statement: contrasts(k) = "contr.sum"

The full code now looks like this:

> j = gl(3, 30, labels=LETTERS[1:3])
> k = gl(3, 10, length=90, labels=LETTERS[13:15])
> k[1:30] = "M" # Reference j="A" has k="M" only
> contrasts(k) = "contr.sum"
> x = 100*(2^as.numeric(k)) # 200,400,800 for k=M,N,O
> x[1:30] = mean(c(200,400,800)) # j="A" gets the same overall mean as "B" and "C"
> x = x + rnorm(90, 0, 4) # A little randomness
> x = x + rep(c(0, 1, 1.2), each=30) # A slight non-significant deviation by j.
> df = data.frame(x=x, j=j, k=k)
> f = lm(x ~ j/k, data=df)
> summary(f)

Call:
lm(formula = x ~ j/k, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.6628  -2.9090  -0.4446   2.9412  10.0984 

Coefficients: (2 not defined because of singularities)
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  467.3444     0.7878  593.259   <2e-16 ***
jB             0.6631     1.1141    0.595    0.553    
jC             0.4530     1.1141    0.407    0.685    
jA:k1              NA         NA       NA       NA    
jB:k1       -266.9330     1.1141 -239.604   <2e-16 ***
jC:k1       -267.5542     1.1141 -240.162   <2e-16 ***
jA:k2              NA         NA       NA       NA    
jB:k2        -66.0202     1.1141  -59.261   <2e-16 ***
jC:k2        -65.7784     1.1141  -59.044   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.315 on 83 degrees of freedom
Multiple R-squared:  0.9996,    Adjusted R-squared:  0.9996 
F-statistic: 3.344e+04 on 6 and 83 DF,  p-value: < 2.2e-16

The coefficients are now named with "k1" and "k2" instead of "kM" and "kN" because the contr.sum contrast matrix has numbered rather than level-named columns (demonstrated right after the verification below). The results are now what was expected, as verified here:

> mean(df$x[df$j == "A"])
[1] 467.3444
> mean(df$x[df$j == "B"])-mean(df$x[df$j == "A"])
[1] 0.6631321
> mean(df$x[df$j == "C"])-mean(df$x[df$j == "A"])
[1] 0.4530243
> mean(df$x[df$j == "B" & df$k == "M"])-mean(df$x[df$j == "B"])
[1] -266.933
> mean(df$x[df$j == "C" & df$k == "M"])-mean(df$x[df$j == "C"])
[1] -267.5542
> mean(df$x[df$j == "B" & df$k == "N"])-mean(df$x[df$j == "B"])
[1] -66.02023
> mean(df$x[df$j == "C" & df$k == "N"])-mean(df$x[df$j == "C"])
[1] -65.77839
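
As noted above, the numbered coefficient names come from the contrast matrix itself. Printing the matrix assigned to k shows that its columns are unnamed, so R numbers the coefficients:

> contrasts(df$k)
  [,1] [,2]
M    1    0
N    0    1
O   -1   -1

Column [,1] becomes coefficient k1 (level M's deviation) and column [,2] becomes k2 (level N's deviation); the last level, O, gets no column of its own, which is why there is no k3 coefficient.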

The comparisons of jB:kO and jC:kO (aka jB:k3 and jC:k3) are still missing. These do not appear as coefficients because they are simple linear combinations of the existing coefficients: under "contr.sum" the k deviations within each (balanced) j level sum to zero, so jB:k3 = -(jB:k1+jB:k2) and jC:k3 = -(jC:k1+jC:k2). This can be verified with a little algebra, remembering that coefficient jB is the mean of level jB minus the intercept, and jB:k1 is the mean of the jB:kM group minus the mean of level jB.
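
That identity can also be checked against the raw group means: the deviation of the jB:kO group from the jB mean should equal the negated sum of the two reported coefficients (compare with the glht() estimate below):

> mean(df$x[df$j == "B" & df$k == "O"]) - mean(df$x[df$j == "B"])
[1] 332.9532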

To use that knowledge to do the other two tests that I want, the glht() function in the multcomp package can be used. However, it chokes on the NA coefficients that resulted from level jA containing only the single k level M, so the model matrix must be modified to remove the columns corresponding to the NA coefficients, as well as the intercept column:

> # With contr.sum contrasts, the jB:k3 term is the negative of jB:k1+jB:k2
> summary(glht(f, linfct="0 - jB:k1 - jB:k2 = 0"))
Error in summary(glht(f, linfct = "0 - jB:k1 - jB:k2 = 0")) : 
  error in evaluating the argument 'object' in selecting a method for function 'summary': Error in modelparm.default(model, ...) : 
  dimensions of coefficients and covariance matrix don't match
>
> # It fails because of NA coefficients.  They can be removed by removing their
> # columns in the model matrix and then updating the model.
> M = model.matrix(f)
> # Save the "assign" and "contrasts" attributes of the model matrix.
> attr.assign = attr(M, "assign")
> attr.contrasts = attr(M, "contrasts")
> rmvCols = c(1, which(summary(f)$aliased))
> M = M[,-rmvCols] # Remove intercept column (update() adds its own) and NA columns
> # Reassign the "assign" attribute, removing the corresponding elements from it.
> attr(M, "assign") = attr.assign[-rmvCols]
> # Reassign the "contrasts" attribute to its original value
> attr(M, "contrasts") = attr.contrasts

Now we update the model with the new model matrix and try glht() again.

> f1 = update(f, ~M)
> f1

Call:
lm(formula = x ~ M, data = df)

Coefficients:
(Intercept)          MjB          MjC       MjB:k1       MjC:k1       MjB:k2       MjC:k2  
   466.9725       0.2974      -0.4174    -266.9183    -267.7272     -65.2277     -66.5216  

> # Now try glht() using the updated model fit.  An "M" is prepended to each
> # coefficient name because the model matrix variable was named M.
> summary(glht(f1, linfct="0 - MjB:k1 - MjB:k2 = 0"))

     Simultaneous Tests for General Linear Hypotheses

Fit: lm(formula = x ~ M, data = df)

Linear Hypotheses:
                         Estimate Std. Error t value Pr(>|t|)    
0 - MjB:k1 - MjB:k2 == 0  332.953      1.114   298.9   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported -- single-step method)

Excellent! This is one of the two missing comparisons I wanted!!!

To confirm that the estimate is correct:

> coef(f)
 (Intercept)           jB           jC        jA:k1        jB:k1        jC:k1        jA:k2        jB:k2        jC:k2 
 467.3444162    0.6631321    0.4530243           NA -266.9329982 -267.5542422           NA  -66.0202315  -65.7783877 
> -(coef(f)[5] + coef(f)[8])
   jB:k1 
332.9532 

Yes, it is correct!

Finally, do both missing comparisons, jB:k3 and jC:k3, at the same time:

> summary(glht(f1, linfct=c("0 - MjB:k1 - MjB:k2 = 0", "0 - MjC:k1 - MjC:k2 = 0")))

     Simultaneous Tests for General Linear Hypotheses

Fit: lm(formula = x ~ M, data = df)

Linear Hypotheses:
                         Estimate Std. Error t value Pr(>|t|)    
0 - MjB:k1 - MjB:k2 == 0  332.953      1.114   298.9   <1e-10 ***
0 - MjC:k1 - MjC:k2 == 0  333.333      1.114   299.2   <1e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported -- single-step method)

Sweet!
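
As a side note, glht() also accepts a numeric matrix for linfct instead of hypothesis strings. Here is a sketch of the same two tests in matrix form, with one row per hypothesis and columns matching the seven estimable coefficients of f1 shown above:

> # Columns: (Intercept), MjB, MjC, MjB:k1, MjC:k1, MjB:k2, MjC:k2
> K = rbind("jB:k3" = c(0, 0, 0, -1,  0, -1,  0),
+           "jC:k3" = c(0, 0, 0,  0, -1,  0, -1))
> colnames(K) = names(coef(f1))
> summary(glht(f1, linfct=K))

This should reproduce the same two-row table as the string version above.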
