Solved – Why does failing to reject the null in goodness-of-fit tests not imply accepting the null

hypothesis testing

From All of Statistics by Wasserman:

Goodness-of-fit testing has some serious limitations. If we reject $H_0$ then we conclude we should not use the model. But if we do not reject $H_0$ we cannot conclude that the model is correct. We may have failed to reject simply because the test did not have enough power. This is why it is better to use nonparametric methods whenever possible rather than relying on parametric assumptions.

My questions are:

  1. Are goodness-of-fit tests parametric or nonparametric?

  2. Why "if we do not reject $H_0$ we cannot conclude that the model is correct"?

    If it is because "We may have failed to reject simply because the test did not have enough power", why "it is better to use nonparametric methods whenever possible rather than relying on parametric assumptions"? Doesn't the same reason and conclusion apply to nonparametric methods?

  3. Is it correct that "goodness-of-fit testing" here means testing whether the distribution of a sample is a specific distribution?

    Does the same conclusion "if we do not reject $H_0$ we cannot conclude that the model is correct" also apply to testing whether two samples come from the same distribution, such as the z-test for two normally distributed groups of samples, or the two-sample Kolmogorov–Smirnov test?

Thanks and regards!

Best Answer

Question 3: That depends on the goodness-of-fit test, so it is always a good idea to read up on the specific goodness-of-fit test you want to apply and figure out exactly which null hypothesis it tests.
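For instance, here is a minimal sketch, assuming Python with numpy and scipy (the distribution and its parameters are arbitrary illustrative choices), showing two common goodness-of-fit tests whose null hypotheses are not the same:

```python
# Two goodness-of-fit tests on the same data, with different null hypotheses.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=200)

# One-sample Kolmogorov-Smirnov test: H0 is "the sample comes from this fully
# specified distribution" -- here N(5, 2), with both parameters fixed in advance.
print(stats.kstest(x, "norm", args=(5.0, 2.0)))

# Chi-square goodness-of-fit test: H0 only concerns the expected frequencies in
# a chosen set of bins, a coarser statement than the KS null hypothesis.
observed, edges = np.histogram(x, bins=8)
expected = len(x) * np.diff(stats.norm.cdf(edges, loc=5.0, scale=2.0))
expected *= observed.sum() / expected.sum()  # rescale so both totals match
print(stats.chisquare(observed, expected))
```

The KS test is also sensitive to deviations that leave the binned frequencies untouched, which is exactly why you need to know which null hypothesis your chosen test actually examines.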

Question 2: To understand this, you need to see that a goodness-of-fit test is just like any other statistical test, and to understand the logic behind statistical tests. The outcome of a statistical test is a $p$-value: the probability of finding data that deviates from $H_0$ at least as much as the data you have observed, when $H_0$ is true. So it is a thought experiment with the following steps:

  1. Assume a population in which $H_0$ is true, that is, a population in which your model is correct in some specific sense depending on the goodness-of-fit test.
  2. Draw many samples at random from this population, fit the model, and compute the goodness-of-fit test in each of these samples.
  3. Since the samples are drawn at random, some of them will be "weird", i.e. deviate from $H_0$.
  4. The $p$-value is the expected proportion of samples that are "as weird or weirder" than the data you have observed. (The simulation sketch after this list makes these steps concrete.)
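Here is a minimal simulation of that thought experiment, assuming Python with numpy and scipy; the one-sample Kolmogorov–Smirnov statistic stands in for "weirdness", and the distribution and sample sizes are arbitrary illustrative choices:

```python
# Simulating the p-value thought experiment: draw many samples from a population
# where H0 is true and count how many are "weirder" than the observed data.
# (Illustrative only; real tests use the exact or asymptotic null distribution.)
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 50

# Steps 1-2: assume H0 (data ~ N(0, 1)) and draw many samples from that
# population, computing the KS statistic -- our "weirdness" measure -- in each.
null_stats = np.array([
    stats.kstest(rng.normal(size=n), "norm").statistic
    for _ in range(10_000)
])

# The data we actually observed:
observed = rng.normal(size=n)
obs_stat = stats.kstest(observed, "norm").statistic

# Steps 3-4: the p-value is the proportion of null samples at least as "weird".
p_value = np.mean(null_stats >= obs_stat)
print(f"simulated p-value: {p_value:.3f}")
print(f"scipy's p-value:   {stats.kstest(observed, 'norm').pvalue:.3f}")
```

Because the observed data here really do come from the $H_0$ population, the simulated $p$-value will typically be unremarkable; the point is only that the $p$-value is, by construction, a proportion of "weirder" null samples.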

If you find data with a small $p$-value, then that data is unlikely to have come from a population in which $H_0$ is true, and the fact that you have observed such data is considered evidence against $H_0$. If the $p$-value is below some pre-defined but arbitrary cutoff $\alpha$ (common values are 5% or 1%), then we call the result "significant" and reject $H_0$.
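In code, the decision rule is nothing more than a comparison against the chosen cutoff; a sketch with the conventional $\alpha = 0.05$ (numpy and scipy assumed):

```python
# The significance decision is just "p below the pre-chosen cutoff alpha".
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
alpha = 0.05  # conventional, but ultimately an arbitrary choice
p = stats.kstest(rng.normal(size=100), "norm").pvalue
print("significant: reject H0" if p < alpha else "not significant: do not reject H0")
```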

Notice what the opposite, a non-significant result, means: we have not found enough evidence to reject $H_0$. This is a case of "absence of evidence", which is not the same thing as "evidence of absence". So "not rejecting $H_0$" is not the same thing as "accepting $H_0$".
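A small simulation, again assuming numpy and scipy, makes the power problem concrete: below, $H_0$ ("the data are standard normal") is false by construction, since the data are drawn from a $t(5)$ distribution, yet at small sample sizes the test usually fails to reject it:

```python
# Failing to reject H0 is not evidence that H0 is true: here H0 is false by
# construction (the data are t-distributed, not normal), but at small n the
# Kolmogorov-Smirnov test rarely has the power to detect that.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def rejection_rate(n, n_sims=2_000, alpha=0.05):
    """Fraction of simulated t(5) samples where the KS normality test rejects."""
    return np.mean([
        stats.kstest(rng.standard_t(df=5, size=n), "norm").pvalue < alpha
        for _ in range(n_sims)
    ])

for n in (20, 200, 2000):
    print(f"n = {n:5d}: rejects the (false) H0 in {rejection_rate(n):.0%} of samples")
```

At the smallest sample size, $H_0$ survives most of the time, not because it is true but because the test lacks the power to reject it.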

Another way to answer your question would be to ask: could it be that $H_0$ is true? The answer is simply no. In a goodness-of-fit test, $H_0$ is that the model is in some sense true. But a model is by definition a simplification of reality, and "simplification" is just another word for "wrong in some useful way". So models are by definition wrong, and thus $H_0$ cannot be true.

This has consequences for the statement you quoted: "If we reject $H_0$ then we conclude we should not use the model." This is incorrect: all that a significant goodness-of-fit test tells you is that your model is likely to be wrong, but you already knew that. The interesting question is whether it is so wrong that it is no longer useful. That is a judgement call. Statistical tests can help you differentiate between patterns that could just be the result of sampling randomness and "real" patterns. A significant result tells you that the latter is likely, but that is not enough to conclude that the model is not a useful simplification of reality. You now need to investigate what exactly the deviation is, how large it is, and what the consequences are for the performance of your model.
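The flip side of the power problem completes the picture: with enough data, a goodness-of-fit test will flag even a practically negligible deviation as significant. A sketch, assuming numpy and scipy (the 0.02 mean shift is an arbitrary illustrative choice):

```python
# With enough data, even a tiny deviation from the model becomes "significant".
# H0 is N(0, 1); the true population is N(0.02, 1) -- wrong, like all models,
# but arguably still a perfectly useful approximation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

for n in (100, 10_000, 1_000_000):
    x = rng.normal(loc=0.02, scale=1.0, size=n)  # tiny misspecification
    res = stats.kstest(x, "norm")
    print(f"n = {n:>9,}: KS statistic = {res.statistic:.4f}, p = {res.pvalue:.3g}")
```

The $p$-value shrinks toward zero as $n$ grows while the size of the deviation (the KS statistic) stays small, which is why significance alone cannot tell you whether the model has stopped being a useful simplification.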