This is an interesting question, and actually quite subtle. It gets at the core of the definition of interactions, as well as the assumptions underlying dummy coding of categorical variables.
OP's question is: when we test for interactions between the levels as given above, why do we (seemingly!) not bother to take into account interactions between one variable and the other variable's reference level?
The short answer is that we actually are taking these into account, as I'll explain here. First, to simplify subsequent notation, let $Y_{AE}$ denote the random variable $Y|\{X_1=A\} \cap \{X_2=E\}$, that is, $Y$ conditional on observing levels $A$ and $E$ in variables $X_1$ and $X_2$, respectively.
Now, recall that if we claim that there is no interaction between $X_1$ and $X_2$, this is equivalent to asserting that
$$
E[Y_{BE}]-E[Y_{AE}] = E[Y_{BF}]-E[Y_{AF}] = E[Y_{BG}]-E[Y_{AG}] \qquad \text{(Eq. 1)}
$$
that is, the expected difference between $Y|X_1=B$ and $Y|X_1=A$, computed with $X_2$ held fixed, does not depend on the level at which $X_2$ is held. Similarly, no interaction means that
$$ E[Y_{AE}]-E[Y_{AF}] = \cdots = E[Y_{DE}]-E[Y_{DF}]. $$
Now, OP correctly states that given our choice of reference levels $A$ and $E$, the definition of $\beta_1$ is the expected difference between $Y$ when $X_1=B$ versus $X_1=A$, given that $X_2$ is held constant at $E$. In other words,
$$ \beta_1 = E[Y_{BE}] - E[Y_{AE}], $$
but if there is no interaction between $X_1$ and $X_2$, it follows immediately from equation 1 above that
$$
\beta_1 = E[Y_{BE}]-E[Y_{AE}] = E[Y_{BF}]-E[Y_{AF}] = E[Y_{BG}]-E[Y_{AG}].
$$
In other words, if there is no interaction, then $\beta_1$ is sufficient to explain the differences in $Y$ based on levels of $X_1$ when $X_2$ is held constant. This is why the test for an interaction involves leaving $\beta_1$ in the model and testing whether the interaction terms are zero.
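As a concrete illustration, here is a minimal R sketch (the level labels match the notation above, but the data and effect sizes are invented): fit the additive model and the interaction model, then compare them with a partial F-test of the interaction terms.

set.seed(1)
# Two hypothetical factors with levels matching the notation above
x1 <- factor(rep(c("A", "B", "C", "D"), each = 30))
x2 <- factor(rep(c("E", "F", "G"), times = 40))
y  <- 1 + 2*(x1 == "B") + 3*(x2 == "F") + rnorm(120)  # no true interaction

fit_add <- lm(y ~ x1 + x2)  # main effects only: beta_1, ..., kept in both models
fit_int <- lm(y ~ x1 * x2)  # adds the (4-1)*(3-1) = 6 interaction terms
anova(fit_add, fit_int)     # F-test that all interaction coefficients are zero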
The full answer to your question involves a simultaneous derivation for $\beta_1$ through $\beta_5$, but hopefully this example conveys the idea of what's going on.
The answer in short is "contrasts". I had heard of them before but had never learned what they are; in fact, they serve exactly the purpose I was looking for. Contrasts alter what the coefficients mean: a given model always has the same number of coefficients, but contrasts change them into different linear combinations of the default coefficients. That is, the model is reparameterized: same model, different coefficient meanings and values.
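For instance, here is a minimal sketch (with a made-up factor f and response y) showing that refitting the same model with a different contrast changes the coefficients but leaves the fit untouched:

set.seed(1)
f <- gl(3, 10, labels = c("a", "b", "c"))  # hypothetical 3-level factor
y <- as.numeric(f) + rnorm(30)

m1 <- lm(y ~ f)                                     # default: contr.treatment
m2 <- lm(y ~ f, contrasts = list(f = "contr.sum"))  # reparameterized: contr.sum
coef(m1); coef(m2)                 # different coefficients and meanings...
all.equal(fitted(m1), fitted(m2))  # ...but identical fitted values: TRUE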
Also, additional comparisons of other "virtual parameters" (linear combinations of the actual parameters) that do not appear in the model summary can be made after the model is fit, using the multcomp package, although this is not necessarily simple. (The "contrasts" package may also do this, though I couldn't figure out how to use it.)
The default contrast in R is "contr.treatment", and it leads to the coefficients in the problem statement, with each level being compared to the base level within its factor. Another contrast, "contr.sum", generates coefficients that compare each level to the mean of all the levels of that factor.
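For a three-level factor, the two coding matrices look like this:

> contr.treatment(3)   # each column: one level versus the first (reference) level
  2 3
1 0 0
2 1 0
3 0 1
> contr.sum(3)         # each column sums to zero; compares levels to the grand mean.
  [,1] [,2]            # Note the unnamed columns.
1    1    0
2    0    1
3   -1   -1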
In my case, I want the outer factor to keep using the "contr.treatment" contrast, because I do want the outer factor to compare to the control level, jA. For the inner factor, though, I want to use "contr.sum", so that each k level is compared to the mean of all k levels within that j level. This is done after creating the k factor, with the assignment statement: contrasts(k) = "contr.sum"
The full code now looks like this:
> j = gl(3, 30, labels=LETTERS[1:3])
> k = gl(3, 10, length=90, labels=LETTERS[13:15])
> k[1:30] = "M" # Reference j="A" has k="M" only
> contrasts(k) = "contr.sum"
> x = 100*(2^as.numeric(k)) # 200,400,800 for k=M,N,O
> x[1:30] = mean(c(200,400,800)) # Reference j="A" has x=same mean as "B" and "C"
> x = x + rnorm(90, 0, 4) # A little randomness
> x = x + rep(c(0, 1, 1.2), each=30) # A slight non-significant deviation by j.
> df = data.frame(x=x, j=j, k=k)
> f = lm(x ~ j/k, data=df, drop.unused.levels=TRUE)
> summary(f)
Call:
lm(formula = x ~ j/k, data = df, drop.unused.levels = TRUE)
Residuals:
Min 1Q Median 3Q Max
-10.6628 -2.9090 -0.4446 2.9412 10.0984
Coefficients: (2 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 467.3444 0.7878 593.259 <2e-16 ***
jB 0.6631 1.1141 0.595 0.553
jC 0.4530 1.1141 0.407 0.685
jA:k1 NA NA NA NA
jB:k1 -266.9330 1.1141 -239.604 <2e-16 ***
jC:k1 -267.5542 1.1141 -240.162 <2e-16 ***
jA:k2 NA NA NA NA
jB:k2 -66.0202 1.1141 -59.261 <2e-16 ***
jC:k2 -65.7784 1.1141 -59.044 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.315 on 83 degrees of freedom
Multiple R-squared: 0.9996, Adjusted R-squared: 0.9996
F-statistic: 3.344e+04 on 6 and 83 DF, p-value: < 2.2e-16
The coefficients are named "k1" and "k2" instead of "kM" and "kN" because contr.sum, unlike contr.treatment, does not label its contrast columns with the level names, so R simply numbers them. The results are now what is expected, as verified here:
> mean(df$x[df$j == "A"])
[1] 467.3444
> mean(df$x[df$j == "B"])-mean(df$x[df$j == "A"])
[1] 0.6631321
> mean(df$x[df$j == "C"])-mean(df$x[df$j == "A"])
[1] 0.4530243
> mean(df$x[df$j == "B" & df$k == "M"])-mean(df$x[df$j == "B"])
[1] -266.933
> mean(df$x[df$j == "C" & df$k == "M"])-mean(df$x[df$j == "C"])
[1] -267.5542
> mean(df$x[df$j == "B" & df$k == "N"])-mean(df$x[df$j == "B"])
[1] -66.02023
> mean(df$x[df$j == "C" & df$k == "N"])-mean(df$x[df$j == "C"])
[1] -65.77839
The comparisons of jB:kO and jC:kO (aka jB:k3 and jC:k3) are still missing. These do not appear as coefficients because they are simple linear combinations of the existing coefficients. In fact, with the "contr.sum" contrast, the value these coefficients would have is jB:k3 = -(jB:k1+jB:k2) and jC:k3 = -(jC:k1+jC:k2), as can be verified with a little algebra, remembering that coefficient jB is the mean of level j=B minus the intercept, and jB:k1 is the mean of the (j=B, k=M) cell minus the mean of level j=B.
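In symbols, the sum-to-zero coding constrains the within-$j$ effects of $k$ to satisfy
$$ \beta_{jB:k1} + \beta_{jB:k2} + \beta_{jB:k3} = 0, $$
so the omitted third effect is recoverable as $\beta_{jB:k3} = -(\beta_{jB:k1} + \beta_{jB:k2})$, and likewise for jC.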
To use that knowledge to do the other two tests that I want, the glht() function in the multcomp package can be used. However, it chokes on the NA coefficient values that resulted from level j="A" containing only k="M", so we must modify the model matrix to remove the columns corresponding to the NA coefficients, as well as the intercept column:
> # With contr.sum contrasts, the jB:k3 term is the negative of jB:k1+jB:k2
> summary(glht(f, linfct="0 - jB:k1 - jB:k2 = 0"))
Error in summary(glht(f, linfct = "0 - jB:k1 - jB:k2 = 0")) :
error in evaluating the argument 'object' in selecting a method for function 'summary': Error in modelparm.default(model, ...) :
dimensions of coefficients and covariance matrix don't match
>
> # It fails because of NA coefficients. They can be removed by removing their
> # columns in the model matrix and then updating the model.
> M = model.matrix(f)
> # Save the "assign" and "contrasts" attributes of the model matrix.
> attr.assign = attr(M, "assign")
> attr.contrasts = attr(M, "contrasts")
> rmvCols = c(1, which(summary(f)$aliased))
> M = M[,-rmvCols] # Remove intercept column (required) and NA columns
> # Reassign the "assign" attribute, removing the corresponding elements from it.
> attr(M, "assign") = attr.assign[-rmvCols]
> # Reassign the "contrasts" attribute to its original value
> attr(M, "contrasts") = attr.contrasts
Now we update the model with the new model matrix and try glht() again.
> f1 = update(f, ~M)
> f1
Call:
lm(formula = x ~ M, data = df, drop.unused.levels = TRUE)
Coefficients:
(Intercept) MjB MjC MjB:k1 MjC:k1 MjB:k2 MjC:k2
466.9725 0.2974 -0.4174 -266.9183 -267.7272 -65.2277 -66.5216
> # Now try glht using the updated model fit. An "M" must be prepended to the
> # coefficient names because the model matrix variable was named M.
> summary(glht(f1, linfct="0 - MjB:k1 - MjB:k2 = 0"))
Simultaneous Tests for General Linear Hypotheses
Fit: lm(formula = x ~ M, data = df, drop.unused.levels = TRUE)
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
0 - MjB:k1 - MjB:k2 == 0 332.953 1.114 298.9 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported -- single-step method)
Excellent! This is one of the two missing comparisons I wanted!
To confirm that the estimate is correct:
> coef(f)
(Intercept) jB jC jA:k1 jB:k1 jC:k1 jA:k2 jB:k2 jC:k2
467.3444162 0.6631321 0.4530243 NA -266.9329982 -267.5542422 NA -66.0202315 -65.7783877
> -(coef(f)[5] + coef(f)[8])
jB:k1
332.9532
Yes, it is correct!
Finally, do both missing comparisons, jB:k3 and jC:k3, at the same time:
> summary(glht(f1, linfct=c("0 - MjB:k1 - MjB:k2 = 0", "0 - MjC:k1 - MjC:k2 = 0")))
Simultaneous Tests for General Linear Hypotheses
Fit: lm(formula = x ~ M, data = df, drop.unused.levels = TRUE)
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
0 - MjB:k1 - MjB:k2 == 0 332.953 1.114 298.9 <1e-10 ***
0 - MjC:k1 - MjC:k2 == 0 333.333 1.114 299.2 <1e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported -- single-step method)
Sweet!
Best Answer
That comparison of one level with the mean of all later levels is (aside from scale) called Helmert coding, or Helmert contrasts. The one you give is the first contrast; the other would be a scaled version of $(0, 1, -1)^\top$.
What R calls Helmert coding, other references call "reverse Helmert"; the two schemes are equivalent up to reversing the order of the levels.
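To make the difference concrete, here is a small sketch: R's built-in contr.helmert compares each level with the mean of the preceding levels, while a forward Helmert matrix, built by hand with one possible scaling, compares each level with the mean of the later levels.

> contr.helmert(3)   # R's ("reverse") Helmert: each level vs the mean of earlier levels
  [,1] [,2]
1   -1   -1
2    1   -1
3    0    2

# Forward Helmert: each level vs the mean of all later levels; the columns are
# scaled versions of (2, -1, -1) and (0, 1, -1), as described above.
fwd = cbind(c(2, -1, -1)/3, c(0, 1, -1)/2)
contrasts(k) = fwd   # assuming k is a 3-level factor, as in the question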