Equivalence Tests – Intuitive Explanation of Differences Between TOST and UMP Tests


Hypothesis tests for equivalence differ from the more common hypothesis tests for difference.

In tests for difference, the null hypothesis is some form of "separate quantities are the same", and extreme enough evidence prompts rejection in favor of a conclusion that "separate quantities are different."

In tests for equivalence, the null hypothesis is some form of "separate quantities differ by at least $\Delta$" and extreme enough evidence prompts rejection in favor of a conclusion that "separate quantities are equivalent within an interval defined by $\Delta$."

Pro-tip: Combining inference from tests for difference with tests for equivalence rocks, because it places power and relevant effect size within the testing framework. Following Reagle & Vinod (2003), I adopt the nomenclature $\text{H}^{+}_{0}$ to refer to the positivist null hypothesis associated with a test for difference, and $\text{H}^{-}_{0}$ to refer to the negativist null hypothesis associated with a test for equivalence:

[Figure: combined inference from difference and equivalence tests]
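One common way to read the four possible outcomes of the two tests can be sketched as a small lookup. The function name and the outcome labels below are assumptions chosen for illustration; they are not taken from the text above.

```python
def combined_inference(reject_difference_null: bool,
                       reject_equivalence_null: bool) -> str:
    """Combine a test for difference (H0+) with a test for equivalence (H0-).

    The labels are illustrative assumptions, not official terminology.
    """
    if reject_difference_null and reject_equivalence_null:
        # A difference is detected, but it lies inside (-Delta, Delta).
        return "trivial difference"
    if reject_difference_null:
        # A difference is detected and equivalence cannot be claimed.
        return "relevant difference"
    if reject_equivalence_null:
        # No difference detected and the effect is inside (-Delta, Delta).
        return "equivalence"
    # Neither null is rejected: the data are too imprecise to say either way.
    return "indeterminate"


if __name__ == "__main__":
    for plus in (True, False):
        for minus in (True, False):
            print(plus, minus, combined_inference(plus, minus))
```

The last case is the one that a test for difference alone cannot separate from equivalence, which is where power and the relevant effect size enter.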

I am comfortable enough articulating and calculating the two one-sided tests (TOST; see, for example, Hauck and Anderson, 1984 or Schuirmann, 1987) approach to tests for equivalence (i.e. $\text{H}^{-}_{0}\text{: }\left|\theta\right| \ge \Delta$ translates to the two one-sided tests $\text{H}^{-}_{01}\text{: }\theta \ge \Delta$ or $\text{H}^{-}_{02}\text{: }\theta \le -\Delta$, and rejecting both of these implies $\text{H}_{\text{A}}\text{: }-\Delta < \theta <\Delta$). However, I am still climbing the steep learning curve for uniformly most powerful (UMP) tests for equivalence.
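For concreteness, the TOST decomposition can be sketched in a few lines. This is a minimal sketch under a simplifying assumption: $\theta$ is a normal mean with known standard deviation, so each one-sided test is a $z$ test (the usual $t$ version replaces the normal CDF with a Student $t$ CDF). The function name `tost_z` and its parameters are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def tost_z(xbar, sigma, n, delta, alpha=0.05):
    """Two one-sided z tests for H0-: |theta| >= delta (sigma assumed known)."""
    se = sigma / sqrt(n)
    std_normal = NormalDist()
    # H01: theta >= delta. Evidence against it is xbar far below +delta.
    p_upper = std_normal.cdf((xbar - delta) / se)
    # H02: theta <= -delta. Evidence against it is xbar far above -delta.
    p_lower = 1.0 - std_normal.cdf((xbar + delta) / se)
    # Equivalence is concluded only if BOTH one-sided nulls are rejected,
    # so the overall p-value is the larger of the two.
    p_value = max(p_upper, p_lower)
    return p_value, p_value < alpha


if __name__ == "__main__":
    # Hypothetical numbers: mean 0.1 from n = 100, sigma = 1, margin 0.5.
    p, equivalent = tost_z(xbar=0.1, sigma=1.0, n=100, delta=0.5)
    print(p, equivalent)
    # The same margin in units of the test statistic is delta / se.
    print(0.5 / (1.0 / sqrt(100)))
```

The last line illustrates the back-and-forth translation of the equivalence term: $\Delta$ on the measured scale corresponds to $\Delta / (\sigma/\sqrt{n})$ on the scale of the $z$ statistic.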

In intuitive terms:

1. What is the motivation for the UMP equivalence tests? I gather that the interval hypothesis $\text{H}^{-}_{0}$ alters rejection probabilities by way of noncentral distributions. But I don't understand how that works in a general sense.

2. Aside from regulatory preference for TOST, what considerations would lead to a preference for TOST over UMP equivalence tests, or vice versa? One thing I like about TOST is that the equivalence term can be expressed and communicated easily either in units of the measured variable or in units of the test statistic's distribution, and these quantities are readily translated back and forth. Less clear to me are the units of the equivalence term in UMP equivalence tests.

References

Reagle, D. P. and Vinod, H. D. (2003). Inference for negativist theory using numerically computed rejection regions. Computational Statistics & Data Analysis, 42(3):491–512.

Hauck, W. W. and Anderson, S. (1984). A new statistical procedure for testing equivalence in two-group comparative bioavailability trials. Journal of Pharmacokinetics and Pharmacodynamics, 12(1):83–91.

Schuirmann, D. J. (1987). A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of Pharmacokinetics and Biopharmaceutics, 15(6):657–680.

Best Answer

First question: The UMP test is, nomen est omen, the most powerful. If both the sample size and the equivalence region are small, the TOST confidence interval will hardly ever fit inside the equivalence region, which results in nearly zero power. The TOST is also generally conservative (even with a $1-2\alpha$ confidence interval). Whenever a UMP test exists, it will always have power $> \alpha$.

Second question: Sometimes a UMP test doesn't exist. The density has to be strictly totally positive of order 3; see the appendix of Wellek's textbook on equivalence and noninferiority tests. Intuitively, this condition guarantees that the power curve of the respective point hypothesis test has exactly one maximum. The critical values are then the points where this power curve has level $\alpha$. That's why they are found with the $F_{1,n-1,\psi^2}$ distribution in this question: Obtaining $p$-values for UMP $t$ tests for equivalence.
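The idea that the critical value is the point where the rejection probability at the boundary equals $\alpha$ can be sketched numerically. The sketch assumes a normal mean with known $\sigma$ and a symmetric margin, where the optimal test rejects for $|\bar{x}| < c$; this is a known-variance stand-in for the noncentral $F$ computation, not that computation itself. The names `ump_critical_value` and `ump_power` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def ump_power(theta, c, se):
    """P(|xbar| < c) when xbar ~ Normal(theta, se^2)."""
    return STD_NORMAL.cdf((c - theta) / se) - STD_NORMAL.cdf((-c - theta) / se)


def ump_critical_value(sigma, n, delta, alpha=0.05):
    """Find c such that the rejection probability at theta = delta equals alpha."""
    se = sigma / sqrt(n)
    lo, hi = 0.0, delta + 10.0 * se  # size is 0 at lo and about 1 at hi
    for _ in range(200):  # bisection; the size increases with c
        mid = 0.5 * (lo + hi)
        if ump_power(delta, mid, se) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    sigma, n, delta, alpha = 1.0, 10, 0.5, 0.05
    se = sigma / sqrt(n)
    c = ump_critical_value(sigma, n, delta, alpha)
    print("critical value:", c)
    print("size at the boundary:", ump_power(delta, c, se))
    print("power at theta = 0:", ump_power(0.0, c, se))
```

With these hypothetical numbers the TOST has no power at all, while this test keeps the boundary rejection probability at $\alpha$ and has power above $\alpha$ at $\theta = 0$.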

Also, if your equivalence hypothesis is not standardized, i.e. $\mu \in ]-\epsilon, \epsilon[$ instead of $\mu \in ]-\frac{\epsilon}{\sigma}, \frac{\epsilon}{\sigma}[$, then even for normally distributed data a UMP test has a strange rejection region in the $(\hat{\mu},\hat{\sigma}^2)$-space. See Brown, Hwang and Munk (1997) for an example.

Most important, as you mentioned, is that confidence intervals on the observed scale are more instructive than $p$-values, so the ICH guidelines require confidence intervals. This leads automatically to the TOST, because if you report a confidence interval alongside the $p$-value of the UMP test, the two may contradict each other: the UMP test may be significant while the confidence interval still touches the null hypothesis space. This is not desired.
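How such a contradiction arises can be sketched in the simplest setting, assumed here purely for illustration: a normal mean with known $\sigma$, standard error $\sigma_{\bar{x}} = \sigma/\sqrt{n}$, and a symmetric margin $\Delta$. The symbol $c$ for the UMP critical value is introduced only for this sketch.

$$
\begin{aligned}
\text{TOST rejects } \text{H}^{-}_{0} &\iff \left[\bar{x} - z_{1-\alpha}\sigma_{\bar{x}},\ \bar{x} + z_{1-\alpha}\sigma_{\bar{x}}\right] \subset (-\Delta, \Delta) \iff |\bar{x}| < \Delta - z_{1-\alpha}\sigma_{\bar{x}}, \\
\text{UMP rejects } \text{H}^{-}_{0} &\iff |\bar{x}| < c, \quad \text{where } \Phi\!\left(\frac{c-\Delta}{\sigma_{\bar{x}}}\right) - \Phi\!\left(\frac{-c-\Delta}{\sigma_{\bar{x}}}\right) = \alpha .
\end{aligned}
$$

Plugging the TOST threshold into the left-hand side of the condition on $c$ gives $\alpha - \Phi\!\left(z_{1-\alpha} - 2\Delta/\sigma_{\bar{x}}\right) < \alpha$, and that left-hand side increases with $c$, so $c > \Delta - z_{1-\alpha}\sigma_{\bar{x}}$. Any sample with $\Delta - z_{1-\alpha}\sigma_{\bar{x}} \le |\bar{x}| < c$ is therefore significant under the UMP test although its $1-2\alpha$ confidence interval reaches or crosses $\pm\Delta$.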

In conclusion, if you use the equivalence test "internally", i.e. not for direct scientific reporting but only as part of, e.g., some data mining algorithm, the UMP test may be preferable if it exists. Otherwise, take the TOST.
