Solved – How is the standard error used in calculating a Tukey HSD defined and calculated

standard error, statistical significance, treatment-effect, tukey-hsd-test

I am looking at someone else’s published data where they report treatment means and then a single standard error value across treatment means.

I want to know if certain treatments are significantly different from each other. For example, I’ve highlighted in green a comparison I am interested in.

I think I might be able to use the Tukey HSD test, but I am unsure because I do not know how the standard error is defined and used in the Tukey HSD.

I am looking at an example from the following textbook by Robert O. Kuehl: “Design of Experiments: Statistical Principles of Research Design and Analysis. 2nd Edition.”

In the example (pp. 107-109), you calculate the HSD by multiplying the Studentized range statistic (q) by the standard error.

Here’s my confusion. On page 107 it says the standard error is “the standard error of a treatment mean” which, to me, evokes the idea of a unique standard error for each treatment mean calculated from that treatment's individual replications.

That understanding doesn’t reconcile well with the idea that the HSD is going to be the same across treatment comparisons. On page 109 there’s an actual numeric example, and the standard error used is constant no matter which means are being compared. That suggests that the standard error is calculated from all the treatments, but I am not sure how.

What does the standard error used in the Tukey HSD calculation refer to and how is it calculated?

Also, is it safe to assume that the standard error reported in the table above (highlighted in blue) is the standard error you would use in a Tukey HSD calculation? There's no additional information in the text of the article further defining what is meant by SE in the table.

Best Answer

Here’s my confusion. On page 107 it says the standard error is “the standard error of a treatment mean” which, to me, evokes the idea of a unique standard error for each treatment mean calculated from that treatment's individual replications.

Statistical analysis often involves pooling information among observations to get more reliable estimates of parameter values. In an analysis of variance like what you show in your question, you do not use calculations "from that treatment's individual replications" to get standard errors. Rather, based on an assumption that the underlying variance of observations is the same among all treatments, you use all the replications on all treatments to get an overall estimate of the variance. For each treatment you estimate the variance around that treatment's mean value, and then pool those estimates from all treatments.
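The pooling step can be sketched in R. The data below are invented purely for illustration (three hypothetical treatments with four replicates each) and have nothing to do with the published table. With equal replication, the pooled variance is the average of the within-treatment variances and matches the residual mean square from the ANOVA.

```r
# Hypothetical data: 3 treatments, 4 replicates each (values invented for illustration)
dat <- data.frame(
  trt = factor(rep(c("A", "B", "C"), each = 4)),
  y   = c(10, 12, 11, 13,  14, 15, 13, 16,  9, 8, 11, 10)
)

# Variance around each treatment's own mean, and replicates per treatment
v_i <- tapply(dat$y, dat$trt, var)
n_i <- tapply(dat$y, dat$trt, length)

# Pool the per-treatment variances, weighting each by its degrees of freedom
pooled_var <- sum((n_i - 1) * v_i) / sum(n_i - 1)

# The same quantity is the residual mean square (MSE) of the one-way ANOVA
fit <- aov(y ~ trt, data = dat)
mse <- sum(residuals(fit)^2) / df.residual(fit)

c(pooled = pooled_var, mse = mse)
```

The pooled estimate rests on the residual degrees of freedom from all treatments combined, which is why it is more stable than any single treatment's variance.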

If all treatments had the same number of replicates, then the standard errors of the mean values are the same for all treatments. The use of "SE" by the authors of that table suggests that the values reported are for those standard errors of the mean values, taking the number of replicates into account.
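In symbols, assuming a balanced one-way layout with $t$ treatments and $n$ replicates per treatment (the symbols $s_i^2$, $t$, $n$ and MSE are introduced here for illustration and are not taken from the table), the pooled variance and the common standard error of a treatment mean would be:

$$\mathrm{MSE} = \frac{1}{t}\sum_{i=1}^{t} s_i^2, \qquad \mathrm{SE}(\bar y_i) = \sqrt{\frac{\mathrm{MSE}}{n}}$$

Because MSE and $n$ are the same for every treatment, so is the standard error, which is consistent with a table that reports one SE for all the means.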

For t-tests in general when testing the difference of a value from 0, the statistic you calculate is the ratio of that value to the standard error of that value. For Tukey's range test, instead of evaluating the statistic against a t distribution with the appropriate number of degrees of freedom you evaluate it against a studentized range distribution that also takes into account the number of values that are under consideration for testing. This provides a correction for multiple comparisons.
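The effect of switching reference distributions can be seen by comparing critical values. This is a minimal sketch: the number of means `k` and the error degrees of freedom `df` are hypothetical, and the division by `sqrt(2)` assumes a balanced design, where the studentized range is scaled by the standard error of a single mean.

```r
k  <- 5    # hypothetical number of treatment means being compared
df <- 20   # hypothetical error degrees of freedom

# Unadjusted two-sided 5% critical value for a single t-test
t_crit <- qt(0.975, df)

# Tukey critical value rescaled to the t scale: the studentized range is
# defined relative to the SE of one mean, which in a balanced design is
# sqrt(2) times smaller than the SE of a difference
tukey_crit <- qtukey(0.95, nmeans = k, df = df) / sqrt(2)

# With only two means there is nothing to correct for, so the two agree
tukey_crit_2 <- qtukey(0.95, nmeans = 2, df = df) / sqrt(2)

c(t = t_crit, tukey_k5 = tukey_crit, tukey_k2 = tukey_crit_2)
```

The Tukey threshold grows with the number of means under consideration, which is how the multiple-comparison correction enters.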

So Tukey's test uses a t-test type of statistic: the difference between two mean values divided by the standard error of the difference of those two means. Assuming independence and the same individual standard errors, that standard error is $\sqrt 2$ times the standard error of an individual mean.* You would have to check the software you are using to see whether it expects the standard error of an individual mean or the standard error of a difference of two means.
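Under the convention described in the question (HSD = q times the standard error of a treatment mean), the calculation can be sketched as follows. The data are invented and balanced. The comparison with `TukeyHSD()` is included only to show that, in the balanced case, this convention reproduces the half-width of the intervals base R reports.

```r
# Hypothetical balanced data: 3 treatments, 4 replicates each
dat <- data.frame(
  trt = factor(rep(c("A", "B", "C"), each = 4)),
  y   = c(10, 12, 11, 13,  14, 15, 13, 16,  9, 8, 11, 10)
)
fit <- aov(y ~ trt, data = dat)

k   <- nlevels(dat$trt)                 # number of treatment means
n   <- 4                                # replicates per treatment
dfe <- df.residual(fit)                 # error degrees of freedom
mse <- sum(residuals(fit)^2) / dfe      # pooled variance estimate

se_mean <- sqrt(mse / n)                # SE of one treatment mean
se_diff <- sqrt(2) * se_mean            # SE of a difference of two means

# HSD written both ways; the two expressions are algebraically identical
hsd_from_se_mean <- qtukey(0.95, k, dfe) * se_mean
hsd_from_se_diff <- qtukey(0.95, k, dfe) / sqrt(2) * se_diff

# Half-width of the 95% Tukey intervals reported by base R
tk <- TukeyHSD(fit, "trt")$trt
half_width <- (tk[, "upr"] - tk[, "lwr"]) / 2

c(hsd = hsd_from_se_mean, half_width = unname(half_width[1]))
```

Two means differ at the 5% family-wise level when their absolute difference exceeds the HSD. Whether a given tool wants `se_mean` or `se_diff` as input only changes whether the $\sqrt 2$ factor is applied by you or by the tool.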

That said, the Tukey test might not be the best way to accomplish what you want. It's appropriate if you have several treatments (with a common estimated standard error) that you want to compare in a way that corrects for multiple comparisons. If you have a single pre-specified comparison in mind, not developed based on looking at the data, then you don't have to correct for multiple comparisons. Note in particular that if a difference isn't significant without that correction, then it certainly won't be significant after correction. In the particular comparison you highlight, the difference of 567 in treatment means is only 1.55 standard errors and thus would not pass the standard test of statistical significance at p < 0.05.
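For a single pre-specified comparison, the unadjusted test can be sketched as below. The error degrees of freedom are not given in the question, so `df_error` is a placeholder, and both readings of "standard error" (of the difference, or of a single mean) are shown because the table does not say which one it reports.

```r
ratio    <- 1.55   # difference / standard error, as quoted in the answer
df_error <- 20     # placeholder: the real error df is not stated in the question

# If 1.55 is relative to the SE of the difference, it is the t statistic itself
p_if_se_diff <- 2 * pt(-abs(ratio), df_error)

# If 1.55 is relative to the SE of a single mean, the t statistic is smaller
p_if_se_mean <- 2 * pt(-abs(ratio / sqrt(2)), df_error)

# Even with unlimited degrees of freedom (normal limit) p stays above 0.05
p_normal_limit <- 2 * pnorm(-abs(ratio))

c(se_diff = p_if_se_diff, se_mean = p_if_se_mean, normal = p_normal_limit)
```

Under either reading the two-sided p-value exceeds 0.05, so the conclusion does not depend on the placeholder degrees of freedom.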

*The Wikipedia web page I linked says it's the standard error of the sum of the values, but for uncorrelated variables that's the same as the standard error of their difference. One situation in which you do have to deal with treatments individually is if there are different numbers of replicates for the two treatments being compared, not addressed directly in the Wikipedia page.
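For unequal replication, a commonly used form is the Tukey-Kramer adjustment. The symbols $n_i$, $n_j$, $k$, $\nu$ and MSE are introduced here for illustration and do not come from the question:

$$\mathrm{SE}(\bar y_i - \bar y_j) = \sqrt{\mathrm{MSE}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}, \qquad \mathrm{HSD}_{ij} = \frac{q_{\alpha,k,\nu}}{\sqrt 2}\,\mathrm{SE}(\bar y_i - \bar y_j)$$

With $n_i = n_j = n$ this reduces to $q_{\alpha,k,\nu}\sqrt{\mathrm{MSE}/n}$, a single HSD shared by all pairs.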

Related Solutions

Solved – Anova and Tukey HSD vs Linear Regression

Is there any adjustment for multiple comparisons built into the p-values for individual regression coefficients? If not, is it prudent to apply an adjustment (Bonferroni, FDR etc) to the coefficient p-values?

In short, yes, there can be. I usually use linear regression for everything, including designs that could be estimated with an ANOVA. I use R, so I estimate models using the lm function, and then I estimate specific pairwise comparisons using the emmeans package. There is a lot of discussion about adjusting p-values for multiple comparisons in this package. See the vignette section here: https://cran.r-project.org/web/packages/emmeans/vignettes/confidence-intervals.html#adjust
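A minimal sketch of that workflow, assuming the `emmeans` package is installed and using the `PlantGrowth` data set that ships with R as a stand-in (it is not the data from the question):

```r
library(emmeans)

# PlantGrowth ships with R: plant weight for a control and two treatments
fit <- lm(weight ~ group, data = PlantGrowth)

# Estimated marginal means for each group
emm <- emmeans(fit, ~ group)

# All pairwise comparisons with Tukey-adjusted p-values
tukey_pairs <- summary(pairs(emm, adjust = "tukey"))

# The same comparisons with no multiplicity adjustment, for contrast
raw_pairs <- summary(pairs(emm, adjust = "none"))

tukey_pairs
```

The estimates and standard errors are identical in the two summaries; only the p-values change with the `adjust` argument.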

I usually don't see discussion of multiple comparisons adjustment for regression coefficients

This lack of discussion is probably due to your field using ANOVA for these problems more than regression models. It could also be an artifact of the software your field uses.

I wasn't sure if this is already accounted for or if many just don't consider it a needed adjustment.

It is not already accounted for in the estimation of the model. Whether or not one considers an adjustment needed is a theoretical discussion. But, since you note that the two are equivalent, if you/your field thinks multiple comparisons should be adjusted for in ANOVA, then that applies to linear regression, as well.
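If an adjustment is wanted for the coefficient p-values themselves, base R's `p.adjust()` can be applied to them after fitting. This is a sketch on the built-in `PlantGrowth` data, chosen only as a stand-in. Which coefficients belong to the family being adjusted is a judgment call that the code cannot make for you.

```r
fit <- lm(weight ~ group, data = PlantGrowth)

# Unadjusted p-values for the non-intercept coefficients
# (each compares one treatment group with the control)
p_raw <- summary(fit)$coefficients[-1, "Pr(>|t|)"]

# Standard adjustments available in base R
p_bonf <- p.adjust(p_raw, method = "bonferroni")
p_fdr  <- p.adjust(p_raw, method = "BH")

cbind(raw = p_raw, bonferroni = p_bonf, fdr = p_fdr)
```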

ANOVA – Effect Size Calculation for One-Way ANOVA and Tukey-HSD Tests

I was not able to reproduce the results you got from WebPower using the pilot data you supplied. I was, however, able to reproduce the results of your R code.

You are correct that you can't use $\eta^2$ directly as Cohen's f, but you can convert it: $f^2 = \frac{\eta^2}{1-\eta^2}$.

"However, how should I compute the effect size from the pilot study" - use the $\eta^2$ from the pilot study.
"Why are there interaction effect sizes, i.e, the effect size for group x vs group y?" Those are the effect sizes for the pair-wise comparisons (if you were using a t-test or a TukeyHSD)

library(dplyr)
library(reshape2)

pilot <- data.frame(option1 = c(6.3, 2.8, 7.8, 7.9, 4.9),
                    option2 = c(9.9, 4.1, 3.9, 6.3, 6.9),
                    option3 = c(5.1, 2.9, 3.6, 5.7, 4.5),
                    option4 = c(1.0, 2.8, 4.8, 3.9, 1.6))
pilot2 <- pilot %>% 
  reshape2::melt(value.name = "y") %>%
  dplyr::rename("option" = "variable")

lm1 <- lm(y ~ option, data = pilot2)
aov1 <- aov(lm1)

means <- apply(pilot, 2, mean)
vs <- apply(pilot, 2, var)

# Cohen's f for the overall ANOVA
# eta^2 = SS_effect / SS_total. In anova(lm1), row 1 is the `option` effect
# and row 2 is the residuals, so the effect sum of squares is element 1.
ss <- anova(lm1)$`Sum Sq`
eta.sq <- ss[1] / sum(ss)
f <- sqrt(eta.sq / (1 - eta.sq))   # about 0.81 for the pilot data

# Cohen's d for each pairwise comparison (5 observations per group)
# The pooled variance here divides by n1 + n2, as in the original code;
# the more common pooled SD divides by n1 + n2 - 2.
i <- c(1, 1, 1, 2, 2, 3)
j <- c(2, 3, 4, 3, 4, 4)
d <- abs(means[i] - means[j]) / sqrt(((5 - 1) * vs[i] + (5 - 1) * vs[j]) / (5 + 5))
names(d) <- c("1-2", "1-3", "1-4", "2-3", "2-4", "3-4")

library(pwr)

# with 5 samples per group, we have 80% power to detect an effect size of f = 0.835
#  i.e. with only 5 samples, we need a large effect to detect

pwr::pwr.anova.test(k = 4, n = 5, sig.level = 0.05, power = 0.80)
#> 
#>      Balanced one-way analysis of variance power calculation 
#> 
#>               k = 4
#>               n = 5
#>               f = 0.8352722
#>       sig.level = 0.05
#>           power = 0.8
#> 
#> NOTE: n is number in each group

# the pilot effect (f of about 0.81) is slightly below that threshold,
#   so more than 5 per group are needed to detect it with 80% power

pwr::pwr.anova.test(k = 4, f = f, sig.level = 0.05, power = 0.80)