I'm doing a biology course and am very inexperienced with stats. My supervisor recommended that I try an ANOVA with a block design, but I hadn't heard of this. Having looked it up, I can't tell the difference between that and a nested ANOVA. Could someone please explain? I will be running it in R and have only ever done regular one-way ANOVAs, GLMs and ANCOVAs before, so I'd like to keep my analysis as simple as possible.
Solved – Difference between one-way ANOVA with randomised blocks and nested ANOVA
anova, blocking, mixed model, nested data
Related Solutions
The crossed, non-nested model is incorrect, as you suspect, and seems to be over-specified for your posted data. It's important to get the nesting correct, which can be confusing. This page is a good guide. With multiple individuals per site and each individual restricted to one site, the individuals are nested within site. If you were to use your `anova(lm())` approach you should be writing:
total_conc ~ tissue/site/indiv
which is expanded to:
total_conc ~ 1 + tissue + tissue:site + tissue:site:indiv
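The expansion of the nesting operator can be checked without any data, because `terms()` only parses the formula. A minimal sketch, using the variable names from the formula above (the object names `nested` and `term_labels` are just for illustration):

```r
# Expand the nested formula; no data frame is needed because
# terms() works on the formula symbols alone.
nested <- total_conc ~ tissue/site/indiv
term_labels <- attr(terms(nested), "term.labels")
print(term_labels)
```

The labels list the main effect of the outermost factor followed by the successively nested interaction terms.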
So this model examines the main effect of `tissue` plus its interactions with `site` and with `site:indiv`, without main effects for `site` or `indiv`. The result of your `anova(lm())` on your posted data* would then be:
Analysis of Variance Table

Response: total_conc
                 Df     Sum Sq   Mean Sq  F value  Pr(>F)
tissue            4 3023661191 755915298 1288.243 0.02089 *
tissue:site      12 1280847591 106737299  181.903 0.05788 .
tissue:site:indv 95 1325836886  13956178   23.784 0.16203
Residuals         1     586780    586780
You only have 1 residual degree of freedom because most cases have only one measurement on each tissue from an individual. So the three-way interaction is fairly meaningless. Although results from the Type I sums of squares used by `anova()`, when applied to unbalanced data, can depend on the order of entry of variables into the model formula, in this case there really is only one order of entry that makes sense, so that's not such an issue.
These data are highly unbalanced, with observations on `fruit` restricted to 2 `indiv` at one particular `site`. If that's characteristic of your real data, then you probably should consider removing such tissues with low numbers of observations from your analyses. Would you (or a skeptical reviewer) really trust analyses based on 2 observations from a single `site`? Removing the `fruit` observations from these data, at least, would leave a reasonably balanced design that would probably be good enough for standard nested ANOVA.
Note that both the nested ANOVA approach and mixed models with intercept-only terms like `(1|indiv)` are making an implicit assumption that all effects of individuals and sites on `total_conc` are additive without regard to the `tissue` being evaluated. You must use your knowledge of the subject matter as to the validity of that assumption. With the raw means of `total_conc` ranging from 2804 (`pollen`) to 16844 (`fruit`), I would be a bit worried about that assumption. In principle the formalism of a mixed model could incorporate random effects that differ among `tissue` values, but you don't seem to have enough data for that. The suggestion from @MarkWhite to try a Bayesian multilevel model might help, but I have no experience with such models.
If there is some irreducible imbalance in the data then I think that many would argue for a mixed-model approach. My guess is that the errors that you discover with mixed models due to small numbers of observations would also be seen in the `anova(lm())` approach if you dug more deeply. For example, that approach on the crossed model applied to your data provides what seems to be a perfectly reasonable ANOVA table.
> anova(lm(total_conc ~ tissue+site+indv, data=data))
Analysis of Variance Table

Response: total_conc
          Df     Sum Sq   Mean Sq F value    Pr(>F)
tissue     4 3023661191 755915298  47.967 < 2.2e-16 ***
site       3  746832764 248944255  15.797 3.552e-08 ***
indv      24  583953341  24331389   1.544   0.07733 .
Residuals 81 1276485152  15759076
Nevertheless, the underlying crossed model can't define 3 regression coefficients because of singularities:
> summary(lm(total_conc ~ tissue+site+indv, data=data))

Call:
lm(formula = total_conc ~ tissue + site + indv, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-9199.3 -2226.1   -84.8  1890.3  9891.5

Coefficients: (3 not defined because of singularities)
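A rough illustration of where such singularities come from, on made-up data rather than the posted data (the `toy` data frame, its values, and the object names are all hypothetical). When each individual occurs at only one site, the individual dummy columns add up to the site column, so a crossed model cannot estimate every coefficient:

```r
# Made-up data: individuals 1-2 occur only at site A, 3-4 only at site B.
toy <- data.frame(
  site  = factor(rep(c("A", "B"), each = 4)),
  indiv = factor(rep(1:4, each = 2)),
  y     = c(5.1, 4.8, 6.0, 6.3, 8.2, 7.9, 9.1, 9.4)
)

# Crossed (non-nested) specification, as in the summary above
crossed <- lm(y ~ site + indiv, data = toy)

# Coefficients that lm() could not estimate are reported as NA
n_undefined <- sum(is.na(coef(crossed)))
print(coef(crossed))
print(n_undefined)
```

In this toy case the dummy columns for individuals 3 and 4 sum to the site B column, which is the kind of redundancy that `summary()` reports as "not defined because of singularities".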
*Note that the `indiv` values in the posted data need to be converted to factors first.
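Why the conversion matters can be seen in a small sketch on made-up data (the `ids` data frame and its values are hypothetical, not the posted data). A numeric ID is fitted as a single slope, whereas a factor gets one level per individual:

```r
# Made-up data: 4 individuals with numeric IDs, 3 observations each
ids <- data.frame(
  indv = rep(1:4, each = 3),
  y    = c(5.2, 4.9, 5.5, 7.1, 6.8, 7.4, 3.9, 4.2, 4.0, 6.1, 6.5, 5.8)
)

# Numeric ID: lm() fits one regression slope, so 1 degree of freedom
df_numeric <- anova(lm(y ~ indv, data = ids))$Df[1]

# Factor ID: one parameter per individual beyond the first, so 3 df
ids$indv <- factor(ids$indv)
df_factor <- anova(lm(y ~ indv, data = ids))$Df[1]

print(c(numeric = df_numeric, factor = df_factor))
```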
Borrowing notation from Wikipedia:
Full model: $y_{i,j}=\mu_j+\varepsilon_{i,j}$
Reduced model: $y_{i,j}=\mu+\varepsilon_{i,j}$
Consider two groups, A and B. Then $j = 1,2$. You have a total of 10 observations, so $i = 1,2,\ldots,10$. The observations in A and in B have their own means, $\mu_1$ and $\mu_2$. You can also pool them together and obtain a grand mean $\mu$. Your estimates of these parameters depend on which model you choose because that choice defines the optimization problem.
In the full model we assume that each observation is a function of a specific group mean, so there are more parameters (one for each group's mean, hence "full"). In the reduced model we assume that each observation is a function of the grand mean, so there is only one mean parameter, $\mu$ (hence "reduced").
If the observations are in fact drawn from the reduced model (they are all generated by the same mean), then there should be no real difference between the error achieved with the estimates from the full model and the error achieved with the estimates from the reduced model.
This concept is generally introduced using linear regression instead of ANOVA (https://onlinecourses.science.psu.edu/stat501/node/295).
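The full-versus-reduced comparison can be sketched in R on made-up data (the values of `y`, the group labels, and the object names are hypothetical, chosen only for illustration). Passing both fits to `anova()` gives the F-test of whether the extra group-mean parameter reduces the error:

```r
# Made-up data: two groups (A, B) with five observations each
y   <- c(4.1, 5.0, 4.6, 5.3, 4.8, 6.2, 6.9, 6.4, 7.1, 6.7)
grp <- factor(rep(c("A", "B"), each = 5))

reduced <- lm(y ~ 1)    # reduced model: one grand mean
full    <- lm(y ~ grp)  # full model: one mean per group

# Compare the residual sums of squares of the two models
cmp <- anova(reduced, full)
print(cmp)
```

With two groups this is the same F-test that a one-way ANOVA on `grp` reports.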
Best Answer
Here is my understanding of the difference. With a randomized block design, you have a characteristic of the units of analysis that you stratify on (block), and you then randomize units to your treatment conditions within each block. For example, you could block on sex (male and female) and then randomly assign to a treatment and control condition separately for males and females, ensuring balance across the blocks in the number assigned to each group. In this design, you have one factor (treatment/control) and one block (male/female). This design also controls for any variance associated with the block (you would only want to use a block that you have good reason to believe is associated with the dependent variable).
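A minimal sketch of that example in R, on simulated data (the `subjects` data frame, the group sizes, the seed, and the simulated response are all assumptions made for illustration):

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical units: 10 males and 10 females
subjects <- data.frame(
  id  = 1:20,
  sex = factor(rep(c("male", "female"), each = 10))
)

# Randomise to control/treatment separately within each block (sex)
subjects$arm <- NA_character_
for (s in levels(subjects$sex)) {
  rows <- which(subjects$sex == s)
  subjects$arm[rows] <- sample(rep(c("control", "treatment"), each = 5))
}
subjects$arm <- factor(subjects$arm)

# Simulated response, purely for illustration
subjects$y <- rnorm(20, mean = 10) + ifelse(subjects$arm == "treatment", 2, 0)

# One treatment factor plus one additive block term
fit_block <- aov(y ~ arm + sex, data = subjects)
alloc <- table(subjects$sex, subjects$arm)
print(alloc)
print(summary(fit_block))
```

The allocation table shows the balance that block randomisation guarantees: the same number of units in each arm within each sex.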
In a nested design, you have two factors but rather than the factors being fully crossed, they are nested. An example from Snedecor and Cochran (1989) of a nested design has multiple samples taken from individual leaves, with three leaves per plant and four plants. In this design, you have two factors: plants and leaves, and the leaves are nested within the plants.
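A comparable sketch for the nested layout, again on simulated data (two samples per leaf, the seed, and the response values are assumptions, not taken from Snedecor and Cochran). Leaf labels 1 to 3 are reused in every plant, so the nesting operator `/` is what tells R that leaf 1 of plant 1 is not the same leaf as leaf 1 of plant 2:

```r
set.seed(2)  # arbitrary seed so the sketch is reproducible

# 4 plants x 3 leaves per plant x 2 samples per leaf (sample count assumed)
leaves <- expand.grid(
  sample = 1:2,
  leaf   = factor(1:3),
  plant  = factor(1:4)
)
leaves$y <- rnorm(nrow(leaves), mean = 3)  # simulated response

# plant/leaf expands to plant + plant:leaf (leaves nested within plants)
nested_tab <- anova(lm(y ~ plant/leaf, data = leaves))
print(nested_tab)
```

The degrees of freedom reflect the nesting: 3 for plants, 8 for leaves within plants (2 per plant), and 12 for the residual.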
Statistically, the analysis of these designs is the same. The distinction, as far as I can tell, is whether you are ensuring balance of a potential nuisance variable through block randomization or simply have two nested factors.
The bottom line is that I don't think the distinction matters much. I'd be interested in what others think, however. The classic book by Kirk on ANOVA addresses both block-randomized and nested ANOVA designs, if I recall, but my copy is at the office and I am not.