Solved – Why does failing to reject the null in goodness-of-fit tests not imply accepting the null

hypothesis testing

From All of Statistics by Wasserman:

Goodness-of-fit testing has some serious limitations. If we reject $H_0$ then we conclude we should not use the model. But if we do not reject $H_0$ we cannot conclude that the model is correct. We may have failed to reject simply because the test did not have enough power. This is why it is better to use nonparametric methods whenever possible rather than relying on parametric assumptions.

My questions are:

  1. Are goodness-of-fit tests parametric or nonparametric?

  2. Why "if we do not reject $H_0$ we cannot conclude that the model is correct"?

    If it is because "We may have failed to reject simply because the test did not have enough power", why "it is better to use nonparametric methods whenever possible rather than relying on parametric assumptions"? Doesn't the same reason and conclusion apply to nonparametric methods?

  3. Is it correct that "goodness-of-fit testing" here means testing whether the distribution of a sample is a specific distribution?

    Does the same conclusion "if we do not reject $H_0$ we cannot conclude that the model is correct" also apply to testing whether two samples come from the same distribution, such as the z-test for two normally distributed groups of samples, or the two-sample Kolmogorov–Smirnov test?

Thanks and regards!

Best Answer

Question 3: That depends on the goodness-of-fit test, so it is always a good idea to read up on the specific goodness-of-fit test you want to apply and figure out exactly which null hypothesis it tests.
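For instance, here is a minimal sketch, assuming Python with numpy and scipy (the distribution and its parameters are arbitrary illustrative choices), showing two common goodness-of-fit tests whose null hypotheses are not the same:

```python
# Two goodness-of-fit tests on the same data, with different null hypotheses.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=200)

# One-sample Kolmogorov-Smirnov test: H0 is "the sample comes from this fully
# specified distribution" -- here N(5, 2), with both parameters fixed in advance.
print(stats.kstest(x, "norm", args=(5.0, 2.0)))

# Chi-square goodness-of-fit test: H0 only concerns the expected frequencies in
# a chosen set of bins, a coarser statement than the KS null hypothesis.
observed, edges = np.histogram(x, bins=8)
expected = len(x) * np.diff(stats.norm.cdf(edges, loc=5.0, scale=2.0))
expected *= observed.sum() / expected.sum()  # rescale so both totals match
print(stats.chisquare(observed, expected))
```

The KS test is also sensitive to deviations that leave the binned frequencies untouched, which is exactly why you need to know which null hypothesis your chosen test actually examines.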

Question 2: To understand this, you need to see that a goodness-of-fit test is just like any other statistical test, and to understand the logic behind statistical tests. The outcome of a statistical test is a $p$-value: the probability of finding data that deviates from $H_0$ at least as much as the data you have observed, when $H_0$ is true. So it is a thought experiment with the following steps:

  1. Assume a population in which $H_0$ is true, that is, a population in which your model is correct in some specific sense depending on the goodness-of-fit test.
  2. Draw many samples at random from this population, fit the model, and compute the goodness-of-fit test in each of these samples.
  3. Since the samples are drawn at random, some of them will be "weird", i.e. deviate from $H_0$.
  4. The $p$-value is the expected proportion of samples that are "as weird or weirder" than the data you have observed. (The simulation sketch after this list makes these steps concrete.)
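Here is a minimal simulation of that thought experiment, assuming Python with numpy and scipy; the one-sample Kolmogorov–Smirnov statistic stands in for "weirdness", and the distribution and sample sizes are arbitrary illustrative choices:

```python
# Simulating the p-value thought experiment: draw many samples from a population
# where H0 is true and count how many are "weirder" than the observed data.
# (Illustrative only; real tests use the exact or asymptotic null distribution.)
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 50

# Steps 1-2: assume H0 (data ~ N(0, 1)) and draw many samples from that
# population, computing the KS statistic -- our "weirdness" measure -- in each.
null_stats = np.array([
    stats.kstest(rng.normal(size=n), "norm").statistic
    for _ in range(10_000)
])

# The data we actually observed:
observed = rng.normal(size=n)
obs_stat = stats.kstest(observed, "norm").statistic

# Steps 3-4: the p-value is the proportion of null samples at least as "weird".
p_value = np.mean(null_stats >= obs_stat)
print(f"simulated p-value: {p_value:.3f}")
print(f"scipy's p-value:   {stats.kstest(observed, 'norm').pvalue:.3f}")
```

Because the observed data here really do come from the $H_0$ population, the simulated $p$-value will typically be unremarkable; the point is only that the $p$-value is, by construction, a proportion of "weirder" null samples.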

If you find data with a small $p$-value, then that data is unlikely to have come from a population in which $H_0$ is true, and the fact that you have observed such data is considered evidence against $H_0$. If the $p$-value is below some pre-defined but arbitrary cutoff $\alpha$ (common values are 5% or 1%), then we call the result "significant" and reject $H_0$.
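In code, the decision rule is nothing more than a comparison against the chosen cutoff; a sketch with the conventional $\alpha = 0.05$ (numpy and scipy assumed):

```python
# The significance decision is just "p below the pre-chosen cutoff alpha".
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
alpha = 0.05  # conventional, but ultimately an arbitrary choice
p = stats.kstest(rng.normal(size=100), "norm").pvalue
print("significant: reject H0" if p < alpha else "not significant: do not reject H0")
```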

Notice what the opposite, a non-significant result, means: we have not found enough evidence to reject $H_0$. This is a case of "absence of evidence", which is not the same thing as "evidence of absence". So "not rejecting $H_0$" is not the same thing as "accepting $H_0$".
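A small simulation, again assuming numpy and scipy, makes the power problem concrete: below, $H_0$ ("the data are standard normal") is false by construction, since the data are drawn from a $t(5)$ distribution, yet at small sample sizes the test usually fails to reject it:

```python
# Failing to reject H0 is not evidence that H0 is true: here H0 is false by
# construction (the data are t-distributed, not normal), but at small n the
# Kolmogorov-Smirnov test rarely has the power to detect that.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def rejection_rate(n, n_sims=2_000, alpha=0.05):
    """Fraction of simulated t(5) samples where the KS normality test rejects."""
    return np.mean([
        stats.kstest(rng.standard_t(df=5, size=n), "norm").pvalue < alpha
        for _ in range(n_sims)
    ])

for n in (20, 200, 2000):
    print(f"n = {n:5d}: rejects the (false) H0 in {rejection_rate(n):.0%} of samples")
```

At the smallest sample size, $H_0$ survives most of the time, not because it is true but because the test lacks the power to reject it.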

Another way to answer your question would be to ask: could it be that $H_0$ is true? The answer is simply no. In a goodness-of-fit test, $H_0$ is that the model is in some sense true. But a model is by definition a simplification of reality, and "simplification" is just another word for "wrong in some useful way". So models are by definition wrong, and thus $H_0$ cannot be true.

This has consequences for the statement you quoted: "If we reject $H_0$ then we conclude we should not use the model." This is incorrect: all that a significant goodness-of-fit test tells you is that your model is likely to be wrong, but you already knew that. The interesting question is whether it is so wrong that it is no longer useful. That is a judgement call. Statistical tests can help you differentiate between patterns that could just be the result of sampling randomness and "real" patterns. A significant result tells you that the latter is likely, but that is not enough to conclude that the model is not a useful simplification of reality. You now need to investigate what exactly the deviation is, how large it is, and what the consequences are for the performance of your model.
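The flip side of the power problem completes the picture: with enough data, a goodness-of-fit test will flag even a practically negligible deviation as significant. A sketch, assuming numpy and scipy (the 0.02 mean shift is an arbitrary illustrative choice):

```python
# With enough data, even a tiny deviation from the model becomes "significant".
# H0 is N(0, 1); the true population is N(0.02, 1) -- wrong, like all models,
# but arguably still a perfectly useful approximation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

for n in (100, 10_000, 1_000_000):
    x = rng.normal(loc=0.02, scale=1.0, size=n)  # tiny misspecification
    res = stats.kstest(x, "norm")
    print(f"n = {n:>9,}: KS statistic = {res.statistic:.4f}, p = {res.pvalue:.3g}")
```

The $p$-value shrinks toward zero as $n$ grows while the size of the deviation (the KS statistic) stays small, which is why significance alone cannot tell you whether the model has stopped being a useful simplification.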