Solved – How to compare nested factor levels to mean, not to first level, and how to test the last nested level in R lm()

Tags: categorical data, contrasts, lm, nested data, r

I have two factors j (with 3 levels A, B, C) and k (with 3 levels M, N, O), with k nested within j. Level A of j is the reference level and it has only one k level, M, in it.

What I want to test with the model is: (1) are the means for each non-control j level significantly different from the mean for j="A"? (2) is the mean of each j:k nested group significantly different from the mean of the j level that contains it?

Here I generate a synthetic data set in which every j:k group mean differs significantly but every j level mean does not (according to the above criteria), then fit the model "x ~ j/k" (k nested within j) and look at the summary:

> set.seed(0)
> j = gl(3, 30, labels=LETTERS[1:3])
> k = gl(3, 10, length=90, labels=LETTERS[13:15])
> k[1:30] = "M" # Reference j="A" has k="M" only
> x = 100*(2^as.numeric(k)) # 200,400,800 for k=M,N,O
> x[1:30] = mean(c(200,400,800)) # j="A" gets the same overall mean as "B" and "C"
> x = x + rnorm(90, 0, 4) # A little randomness
> x = x + rep(c(0, 1, 1.2), each=30) # A slight non-significant deviation by j.
> df = data.frame(x=x, j=j, k=k)
> f = lm(x ~ j/k, data=df)
> summary(f)

Call:
lm(formula = x ~ j/k, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.3711 -2.7546 -0.0538  2.2673 10.2899 

Coefficients: (2 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 100.0878     0.6635  150.84   <2e-16 ***
jB          101.5278     1.3270   76.51   <2e-16 ***
jC          100.6014     1.3270   75.81   <2e-16 ***
jA:kN             NA         NA      NA       NA    
jB:kN       198.9841     1.6253  122.43   <2e-16 ***
jC:kN       200.3907     1.6253  123.30   <2e-16 ***
jA:kO             NA         NA      NA       NA    
jB:kO       598.8600     1.6253  368.46   <2e-16 ***
jC:kO       601.9267     1.6253  370.35   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.634 on 83 degrees of freedom
Multiple R-squared:  0.9998,    Adjusted R-squared:  0.9998 
F-statistic: 8.132e+04 on 6 and 83 DF,  p-value: < 2.2e-16

At first glance it appears that part of what I wanted to test is present in the table. I should see six significant entries besides the intercept, and I do. But closer inspection shows that none of the entries in the summary table, which test the 7 estimable coefficients of the model, is actually testing what I want to test!

Rows jB and jC are significant. I expected these to test the jB and jC means against the jA mean, in which case they would not be significant. But that is not what they are testing! Row jB:kN and the other three interaction rows are significant. I expected these to test the jB:kN (etc.) means against the jB (etc.) mean, but they do not!

Notice that the jB and jC coefficients in the "Estimate" column are significantly different from 0. This means the summary table's jB and jC entries cannot be testing a coefficient equal to the jB and jC means minus the jA mean, since those differences were constructed to be non-significant. In fact, these coefficients compare the jB:kM and jC:kM nested group means to the control group jA mean (the intercept), as verified by computing those differences directly:

> mean(df$x[df$j == "B" & df$k == "M"]) - mean(df$x[df$j == "A"])
[1] 101.5278
> mean(df$x[df$j == "C" & df$k == "M"]) - mean(df$x[df$j == "A"])
[1] 100.6014

These are exactly the jB and jC coefficient values. Thus coefficient jB is significant if the jB:kM group mean differs significantly from the jA control group mean. That is not the coefficient or test I expected, nor one in which I am interested. Instead, I want a coefficient that is the entire j level's mean minus the control mean, so the test tells whether the j level differs from the control. I also want coefficients that are the jB:kM and jC:kM nested groups' means minus their j level means, so those tests tell whether each nested level differs from the mean of the level it is nested within. How can I respecify the model (or its contrast matrix?) to make this happen?
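
To make the targets concrete, these are the quantities whose significance I want tested, written directly as group-mean differences on the df created above:

> # Desired test 1: a non-control j level's mean vs. the control mean
> mean(df$x[df$j == "B"]) - mean(df$x[df$j == "A"])
> # Desired test 2: a nested j:k group's mean vs. its containing j level's mean
> mean(df$x[df$j == "B" & df$k == "N"]) - mean(df$x[df$j == "B"])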

Is the jB:kN coefficient (along with the other three interaction coefficients) equal to that nested group's mean minus the containing j level's mean? It is not:

> mean(df$x[df$j == "B" & df$k == "N"]) - mean(df$x[df$j == "B" & df$k == "M"])
[1] 198.9841

This value is the jB:kN coefficient, so it is the difference between the jB:kN group mean and the jB:kM group mean! Again, that is not the coefficient or test I expected, nor one in which I am interested. Instead, I want a coefficient that is the jB:kN nested group's mean minus the jB level mean, so the test tells whether the nested level differs from the mean of the level it is nested within.

In summary, I have three problems:

  1. I want the mean of each outer j level compared to the reference level mean, NOT the mean of the first nested k level within each j level.

  2. I want my nested levels compared to the mean of the j level that contains them, NOT to the reference level mean or to the "pseudo-reference" that is the first nested level within each j level.

  3. I want additional tests to appear in the summary table, comparing the last nested k level means to the mean of the j level that contains them.

How can I do this?

Best Answer

The answer in short is "contrasts". I had heard of them before but hadn't learned what they are; in fact, they serve exactly the purpose I was looking for. Contrasts can be used to alter the coefficients: a given model always has the same number of coefficients, but contrasts turn them into different linear combinations of the default ones. That is, the model is reparameterized: same model, different coefficient meanings and values.

Also, comparisons of other "virtual parameters" (linear combinations of the actual parameters) that are not tested in the model summary can be made after the model is fit, using the multcomp package, although this is not necessarily simple. (The "contrasts" package may also do this, though I couldn't figure out how to use it.)

The default contrast in R is "contr.treatment", and it leads to the coefficients in the problem statement, with each level compared to the base level of its factor. Another contrast, "contr.sum", generates coefficients that compare each level to the mean of all the levels of that factor.
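
To see the difference concretely, the two contrast matrices for a 3-level factor can be printed directly (both functions are in base R, so this output is deterministic):

> contr.treatment(3)
  2 3
1 0 0
2 1 0
3 0 1
> contr.sum(3)
  [,1] [,2]
1    1    0
2    0    1
3   -1   -1

Note that contr.treatment names its columns after the non-reference levels, while contr.sum leaves its columns unnamed; that detail will matter for the coefficient names below.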

In my case, I want the outer factor to keep using "contr.treatment", because I do want the outer factor compared to the control level jA. For the inner factor, though, I want "contr.sum", so that the k levels get compared to the mean of all k levels at that j level. This is done after creating the k factor, with the assignment statement: contrasts(k) = "contr.sum"

The full code now looks like this:

> j = gl(3, 30, labels=LETTERS[1:3])
> k = gl(3, 10, length=90, labels=LETTERS[13:15])
> k[1:30] = "M" # Reference j="A" has k="M" only
> contrasts(k) = "contr.sum"
> x = 100*(2^as.numeric(k)) # 200,400,800 for k=M,N,O
> x[1:30] = mean(c(200,400,800)) # j="A" gets the same overall mean as "B" and "C"
> x = x + rnorm(90, 0, 4) # A little randomness
> x = x + rep(c(0, 1, 1.2), each=30) # A slight non-significant deviation by j.
> df = data.frame(x=x, j=j, k=k)
> f = lm(x ~ j/k, data=df)
> summary(f)

Call:
lm(formula = x ~ j/k, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.6628  -2.9090  -0.4446   2.9412  10.0984 

Coefficients: (2 not defined because of singularities)
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  467.3444     0.7878  593.259   <2e-16 ***
jB             0.6631     1.1141    0.595    0.553    
jC             0.4530     1.1141    0.407    0.685    
jA:k1              NA         NA       NA       NA    
jB:k1       -266.9330     1.1141 -239.604   <2e-16 ***
jC:k1       -267.5542     1.1141 -240.162   <2e-16 ***
jA:k2              NA         NA       NA       NA    
jB:k2        -66.0202     1.1141  -59.261   <2e-16 ***
jC:k2        -65.7784     1.1141  -59.044   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.315 on 83 degrees of freedom
Multiple R-squared:  0.9996,    Adjusted R-squared:  0.9996 
F-statistic: 3.344e+04 on 6 and 83 DF,  p-value: < 2.2e-16

The coefficients are now named with "k1" and "k2" instead of "kM" and "kN" because the contr.sum contrast matrix has numbered rather than level-named columns (demonstrated right after the verification below). The results are now what was expected, as verified here:

> mean(df$x[df$j == "A"])
[1] 467.3444
> mean(df$x[df$j == "B"])-mean(df$x[df$j == "A"])
[1] 0.6631321
> mean(df$x[df$j == "C"])-mean(df$x[df$j == "A"])
[1] 0.4530243
> mean(df$x[df$j == "B" & df$k == "M"])-mean(df$x[df$j == "B"])
[1] -266.933
> mean(df$x[df$j == "C" & df$k == "M"])-mean(df$x[df$j == "C"])
[1] -267.5542
> mean(df$x[df$j == "B" & df$k == "N"])-mean(df$x[df$j == "B"])
[1] -66.02023
> mean(df$x[df$j == "C" & df$k == "N"])-mean(df$x[df$j == "C"])
[1] -65.77839
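
As noted above, the numbered coefficient names come from the contrast matrix itself. Printing the matrix assigned to k shows that its columns are unnamed, so R numbers the coefficients:

> contrasts(df$k)
  [,1] [,2]
M    1    0
N    0    1
O   -1   -1

Column [,1] becomes coefficient k1 (level M's deviation) and column [,2] becomes k2 (level N's deviation); the last level, O, gets no column of its own, which is why there is no k3 coefficient.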

The comparisons of jB:kO and jC:kO (aka jB:k3 and jC:k3) are still missing. These do not appear as coefficients because they are simple linear combinations of the existing coefficients: under "contr.sum" the k deviations within each (balanced) j level sum to zero, so jB:k3 = -(jB:k1+jB:k2) and jC:k3 = -(jC:k1+jC:k2). This can be verified with a little algebra, remembering that coefficient jB is the mean of level jB minus the intercept, and jB:k1 is the mean of the jB:kM group minus the mean of level jB.
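
That identity can also be checked against the raw group means: the deviation of the jB:kO group from the jB mean should equal the negated sum of the two reported coefficients (compare with the glht() estimate below):

> mean(df$x[df$j == "B" & df$k == "O"]) - mean(df$x[df$j == "B"])
[1] 332.9532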

To use that knowledge to do the other two tests that I want, the glht() function in the multcomp package can be used. However, it chokes on the NA coefficients that resulted from level jA containing only the single k level M, so the model matrix must be modified to remove the columns corresponding to the NA coefficients, as well as the intercept column:

> # With contr.sum contrasts, the jB:k3 term is the negative of jB:k1+jB:k2
> summary(glht(f, linfct="0 - jB:k1 - jB:k2 = 0"))
Error in summary(glht(f, linfct = "0 - jB:k1 - jB:k2 = 0")) : 
  error in evaluating the argument 'object' in selecting a method for function 'summary': Error in modelparm.default(model, ...) : 
  dimensions of coefficients and covariance matrix don't match
>
> # It fails because of NA coefficients.  They can be removed by removing their
> # columns in the model matrix and then updating the model.
> M = model.matrix(f)
> # Save the "assign" and "contrasts" attributes of the model matrix.
> attr.assign = attr(M, "assign")
> attr.contrasts = attr(M, "contrasts")
> rmvCols = c(1, which(summary(f)$aliased))
> M = M[,-rmvCols] # Remove intercept column (update() adds its own) and NA columns
> # Reassign the "assign" attribute, removing the corresponding elements from it.
> attr(M, "assign") = attr.assign[-rmvCols]
> # Reassign the "contrasts" attribute to its original value
> attr(M, "contrasts") = attr.contrasts

Now we update the model with the new model matrix and try glht() again.

> f1 = update(f, ~M)
> f1

Call:
lm(formula = x ~ M, data = df)

Coefficients:
(Intercept)          MjB          MjC       MjB:k1       MjC:k1       MjB:k2       MjC:k2  
   466.9725       0.2974      -0.4174    -266.9183    -267.7272     -65.2277     -66.5216  

> # Now try glht() using the updated model fit.  An "M" is prepended to each
> # coefficient name because the model matrix variable was named M.
> summary(glht(f1, linfct="0 - MjB:k1 - MjB:k2 = 0"))

     Simultaneous Tests for General Linear Hypotheses

Fit: lm(formula = x ~ M, data = df)

Linear Hypotheses:
                         Estimate Std. Error t value Pr(>|t|)    
0 - MjB:k1 - MjB:k2 == 0  332.953      1.114   298.9   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported -- single-step method)

Excellent! This is one of the two missing comparisons I wanted!!!

To confirm that the estimate is correct:

> coef(f)
 (Intercept)           jB           jC        jA:k1        jB:k1        jC:k1        jA:k2        jB:k2        jC:k2 
 467.3444162    0.6631321    0.4530243           NA -266.9329982 -267.5542422           NA  -66.0202315  -65.7783877 
> -(coef(f)[5] + coef(f)[8])
   jB:k1 
332.9532 

Yes, it is correct!

Finally, do both missing comparisons, jB:k3 and jC:k3, at the same time:

> summary(glht(f1, linfct=c("0 - MjB:k1 - MjB:k2 = 0", "0 - MjC:k1 - MjC:k2 = 0")))

     Simultaneous Tests for General Linear Hypotheses

Fit: lm(formula = x ~ M, data = df)

Linear Hypotheses:
                         Estimate Std. Error t value Pr(>|t|)    
0 - MjB:k1 - MjB:k2 == 0  332.953      1.114   298.9   <1e-10 ***
0 - MjC:k1 - MjC:k2 == 0  333.333      1.114   299.2   <1e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported -- single-step method)

Sweet!
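
As a side note, glht() also accepts a numeric matrix for linfct instead of hypothesis strings. Here is a sketch of the same two tests in matrix form, with one row per hypothesis and columns matching the seven estimable coefficients of f1 shown above:

> # Columns: (Intercept), MjB, MjC, MjB:k1, MjC:k1, MjB:k2, MjC:k2
> K = rbind("jB:k3" = c(0, 0, 0, -1,  0, -1,  0),
+           "jC:k3" = c(0, 0, 0,  0, -1,  0, -1))
> colnames(K) = names(coef(f1))
> summary(glht(f1, linfct=K))

This should reproduce the same two-row table as the string version above.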
