Solved – Why the null hypothesis should always be written as an equality

equivalence, hypothesis-testing

It is often stated that the null hypothesis should be written as an equality, for example $\mu_A-\mu_B=0$, while the alternative uses an inequality (for example, $\mu_B\gt\mu_A$). I think this kind of null hypothesis is called "simple". My question is: why should we always prefer simple null hypotheses?

Suppose for example that I have an existing machine learning code A, and I develop a new code B based on a different paradigm (maybe A is a neural network, while B is a random forest). I test A and B on some data. I can define an accuracy metric for the codes, so for each test I can say whether A or B was more accurate. It seems extremely unlikely that A and B would have exactly the same average accuracy on the population from which the data are drawn; after all, they are completely different algorithms. Thus a null hypothesis such as $H_0:p_A=p_B$ seems quite "unnatural" to me. Rather, since I want to know whether the new code is more accurate than the old one, I would try to falsify the null hypothesis that B is no more accurate than A, i.e., $H_0: p_B\le p_A$, so that the alternative would be $H_a:p_B\gt p_A$. Why is this kind of null not appropriate for testing?
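To make this concrete, here is a minimal sketch of the one-sided test I have in mind, using statsmodels' `proportions_ztest` with made-up counts (all numbers below are hypothetical):

```python
from statsmodels.stats.proportion import proportions_ztest

# Made-up counts: number of correct predictions out of n test cases.
x_B, n_B = 170, 200   # hypothetical results for the new code B
x_A, n_A = 160, 200   # hypothetical results for the old code A

# alternative='larger' tests Ha: p_B > p_A (first proportion larger),
# i.e. the one-sided alternative described above.
z_stat, p_value = proportions_ztest(count=[x_B, x_A], nobs=[n_B, n_A],
                                    alternative='larger')
print(f"z = {z_stat:.3f}, one-sided p = {p_value:.4f}")
```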

I guess the issue is that with the "compound" null (is this the correct term?) I cannot easily find the distribution of a test statistic, and thus I may not be able to perform NHST. For example, in the two-sample t-test, the null hypothesis that the two population means are equal lets me easily derive the distribution of the test statistic (its mean is zero, and its standard deviation follows from the sample standard deviations and sample sizes). If instead I only assume that one mean is no larger than the other, then I don't know the parameters of the test statistic's distribution. Is this correct?
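For reference, here is the simple-null setup with simulated data (a rough sketch using scipy's Welch `ttest_ind`; the samples are made up). Note that scipy exposes the one-sided alternative as just a flag on the same statistic, which is part of what puzzles me:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=40)  # simulated scores for code A
b = rng.normal(loc=0.2, scale=1.0, size=40)  # simulated scores for code B

# Under the simple null mu_A = mu_B, the statistic's distribution is fully
# specified: approximately t with Welch-Satterthwaite degrees of freedom.
t_stat, p_two = stats.ttest_ind(b, a, equal_var=False)

# One-sided alternative mu_B > mu_A: same statistic, same null
# distribution; only the upper tail counts toward the p-value.
_, p_one = stats.ttest_ind(b, a, equal_var=False, alternative='greater')

print(f"t = {t_stat:.3f}, two-sided p = {p_two:.4f}, one-sided p = {p_one:.4f}")
```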

Best Answer

There is nothing wrong with your proposed test. It is possible to carry out a test with a composite null (the usual term for what you call a "compound" null). What we do, in essence, is use the sampling distribution of the test statistic under the simple null at the boundary, $p_B=p_A$; if the truth were that $p_B\ll p_A$, the rate of type I errors would simply be less than your stated alpha, making the test conservative.
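A quick Monte Carlo sketch of that claim, with made-up numbers: run the usual one-sided two-proportion z-test at $\alpha = 0.05$, built from the boundary null $p_B = p_A$, while the truth is deep inside the null ($p_B \ll p_A$). The observed rejection rate lands far below $\alpha$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, n, n_sims = 0.05, 200, 5000
p_A, p_B = 0.80, 0.60   # the truth: B is much worse than A (made up)

rejections = 0
for _ in range(n_sims):
    x_A = rng.binomial(n, p_A)
    x_B = rng.binomial(n, p_B)
    # One-sided two-proportion z-test using the boundary (simple) null p_A = p_B
    p_pool = (x_A + x_B) / (2 * n)
    se = np.sqrt(p_pool * (1 - p_pool) * (2 / n))
    z = (x_B - x_A) / (n * se)    # (p_hat_B - p_hat_A) / se
    if stats.norm.sf(z) < alpha:  # reject H0: p_B <= p_A in favor of p_B > p_A
        rejections += 1

print(f"empirical type I error rate: {rejections / n_sims:.4f} (alpha = {alpha})")
```

At the boundary $p_B = p_A$ the rejection rate would be approximately $\alpha$; that boundary case is what fixes the size of the test.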

What you are getting at is called a non-inferiority test; non-gated, practitioner-friendly introductions to it are not hard to find. It may also help to read some of the existing threads on CV that are related to this topic (I have written somewhat related answers), as well as the threads categorized under the equivalence tag.
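As a rough illustration of the idea, not a prescription (the margin, counts, and Wald standard error below are all made up): a non-inferiority test shifts the null by a margin $\delta$, testing $H_0: p_B \le p_A - \delta$ against $H_a: p_B \gt p_A - \delta$, i.e. that B is at most $\delta$ worse than A.

```python
import numpy as np
from scipy import stats

delta = 0.05           # non-inferiority margin, chosen by the analyst (made up)
x_A, n_A = 162, 200    # A correct on 162 of 200 cases (hypothetical)
x_B, n_B = 168, 200    # B correct on 168 of 200 cases (hypothetical)

p_hat_A, p_hat_B = x_A / n_A, x_B / n_B
# Unpooled (Wald) standard error of the difference in proportions
se = np.sqrt(p_hat_A * (1 - p_hat_A) / n_A + p_hat_B * (1 - p_hat_B) / n_B)
z = (p_hat_B - p_hat_A + delta) / se  # N(0, 1) at the boundary p_B = p_A - delta
p_value = stats.norm.sf(z)            # reject H0 for large z (upper tail)

# A small p-value is evidence that B is not worse than A by more than delta.
print(f"z = {z:.3f}, one-sided p = {p_value:.4f}")
```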


On an unrelated note, you should not use the t-test to compare the accuracy of two classifiers. Since individual predictions are either correct or incorrect, you would use a method appropriate for binary data, such as the z-test for two proportions. Moreover, since the two classifiers will almost certainly be assessed on the same data, their results are paired, and McNemar's test should be used (see my answer here: Compare classification performance of two heuristics).
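A minimal sketch of that paired comparison with simulated outcomes (statsmodels' `mcnemar`): score both classifiers on the same test cases, cross-tabulate correct/incorrect, and test the discordant cells.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Made-up paired outcomes: whether each classifier got each of the SAME
# 200 test cases right. (In reality these come from your evaluation run.)
rng = np.random.default_rng(1)
a_correct = rng.random(200) < 0.80   # pretend A is correct ~80% of the time
b_correct = rng.random(200) < 0.85   # pretend B is correct ~85% of the time

# 2x2 table of (A correct?, B correct?) counts; McNemar's test uses only
# the discordant cells (A right / B wrong and A wrong / B right).
table = [
    [np.sum(a_correct & b_correct),  np.sum(a_correct & ~b_correct)],
    [np.sum(~a_correct & b_correct), np.sum(~a_correct & ~b_correct)],
]
result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
print(f"statistic = {result.statistic}, p = {result.pvalue:.4f}")
```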
