What are the reasons to use a dedicated normality test (e.g., Shapiro-Wilk, Jarque-Bera) instead of a generic goodness-of-fit test, such as $\chi^2$ or Kolmogorov-Smirnov, which can assess fit to any distribution (the normal included), when we want to check some data for normality?
Hypothesis Testing – Why Use Normality Tests if Goodness-of-Fit Tests Are Available
Tags: goodness-of-fit, hypothesis-testing, normal-distribution, normality-assumption
Related Solutions
It appears that your data can only take on positive values. In that case the hypothesis of normality is often rejected: a normally distributed random variable ranges over the entire real line, so a variable restricted to positive values cannot be exactly normal. You could try taking the log of the observations and checking whether the logs are normally distributed.
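A minimal sketch of that suggestion in Python (using scipy; the lognormal draws here are hypothetical stand-ins for your positive-valued data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical positive-only data: lognormal draws, i.e. skewed raw
# values whose logarithms are exactly normal.
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)

raw_p = stats.shapiro(x).pvalue          # normality strongly rejected
log_p = stats.shapiro(np.log(x)).pvalue  # the logs should look far more normal
```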
If your data follow a normal distribution, then the points in your QQ-plot should lie on a 45-degree line through the origin. Your plots do not look like that at all.
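If you want to check the QQ-plot numerically rather than by eye, `scipy.stats.probplot` also returns the least-squares line through the plot; this sketch uses simulated standard-normal data, for which the line should be close to the 45-degree line through the origin:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
z = rng.normal(size=1000)  # simulated data that really is standard normal

# probplot returns the ordered data vs. theoretical normal quantiles,
# plus the fitted line (slope, intercept) and its correlation r.
(osm, osr), (slope, intercept, r) = stats.probplot(z, dist="norm")
# For standard-normal data: slope near 1, intercept near 0, r near 1.
```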
The KS test is giving an error because the distributions being tested are presumed to be continuous. In this case, the probability of witnessing two observations with the exact same value is 0. Your data set contains ties, invalidating this assumption. When there are ties, an asymptotic approximation is used (you can read about this in the help file). The error that you are receiving has nothing to do with data sets with different sizes.
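A quick sketch in Python (scipy's `ks_2samp`, standing in for R's `ks.test`) illustrating both points: unequal sample sizes are unproblematic, while ties arise as soon as the data are discretized:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(size=80)
y = rng.normal(size=200)        # different sample sizes: no problem
res = stats.ks_2samp(x, y)

# Rounding creates ties, which violates the continuity assumption;
# the statistic is still computed, but treat the p-value with caution.
res_tied = stats.ks_2samp(np.round(x), np.round(y))
```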
In your post, you never specified the question that you are trying to answer--with sufficient precision, anyway. Do you really want to test that the distributions are the same? Would it be sufficient to test that the means are the same?
Unless you are willing to assume that the variables follow some distribution, there isn't much of an alternative to the KS test if you want to test for the distributions being the same. But there are several ways to test for differences in means.
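For the means question, two common choices are Welch's $t$-test (no equal-variance assumption) and a rank-based alternative; a sketch with simulated data whose means differ by half a standard deviation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(loc=0.0, scale=1.0, size=200)
b = rng.normal(loc=0.5, scale=1.0, size=200)

welch = stats.ttest_ind(a, b, equal_var=False)  # Welch's t-test
mwu = stats.mannwhitneyu(a, b)                  # Mann-Whitney U, rank-based
```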
Question 3: that depends on the goodness of fit test. So it is always a good idea to read up on the specific goodness of fit test you want to apply, to figure out exactly which null hypothesis is being tested.
Question 2: To understand this you need to see that a goodness of fit test is just like any other statistical test, and understand exactly what the logic is behind statistical tests. The outcome of a statistical test is a $p$-value, which is the probability of finding data that deviates from $H_0$ at least as much as the data you have observed when $H_0$ is true. So it is a thought experiment with the following steps:
- Assume a population in which $H_0$ is true, that is, your model is correct in some specific sense depending on the goodness of fit test.
- We draw many samples at random from this population, fit the model, and compute the goodness of fit test in each of these samples.
- Since the samples are drawn at random, some of them will be "weird", i.e. deviate from $H_0$.
- The $p$-value is the expected proportion of samples that are "as weird or weirder" than the data you have observed.
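The thought experiment above can be run literally as a Monte Carlo simulation. Here the test statistic is the Shapiro-Wilk $W$, where a smaller $W$ means a "weirder" sample (the observed data are hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
observed = rng.standard_t(df=3, size=30)   # hypothetical observed data
obs_w = stats.shapiro(observed).statistic

# Steps 1-2: assume a population in which H0 (normality) is true and
# draw many samples from it, computing the test statistic each time.
sim_w = np.array([stats.shapiro(rng.normal(size=30)).statistic
                  for _ in range(2000)])

# Steps 3-4: the Monte Carlo p-value is the proportion of H0 samples
# at least as "weird" (W at least as small) as the observed sample.
mc_p = np.mean(sim_w <= obs_w)
```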
If you find data with a small $p$-value, then that data is unlikely to have come from a population in which $H_0$ is true, and the fact that you have observed such data is considered evidence against $H_0$. If the $p$-value is below some pre-defined but arbitrary cutoff $\alpha$ (common values are 5% or 1%), then we call the result "significant" and reject $H_0$.
Notice what the opposite, not-significant, means: we have not found enough information to reject $H_0$. This is a case of "absence of evidence", which is not the same thing as "evidence of absence". So, "not rejecting $H_0$" is not the same thing as "accepting $H_0$".
Another way to answer your question would be to ask: "could it be that $H_0$ is true?" The answer is simply no. In a goodness of fit test, $H_0$ says that the model is in some sense true. But a model is by definition a simplification of reality, and "simplification" is just another word for "wrong in some useful way". So models are by definition wrong, and thus $H_0$ cannot be true.
This has consequences for the statement you quoted: "If we reject $H_0$ then we conclude we should not use the model." That is incorrect: all a significant goodness of fit test tells you is that your model is likely to be wrong, but you already knew that. The interesting question is whether it is so wrong that it is no longer useful. That is a judgement call. Statistical tests can help you distinguish patterns that could just be the result of sampling randomness from "real" patterns. A significant result tells you the latter is likely, but that alone is not enough to conclude that the model is not a useful simplification of reality. You now need to investigate what exactly the deviation is, how large it is, and what the consequences are for the performance of your model.
Best Answer
First, it's worth noting that testing for normality is a basically useless activity (cf., Is normality testing 'essentially useless'?). No dataset in the real world is exactly normally distributed, so we already know the null hypothesis behind these tests is false. What's left is that the test can correctly reject the null, if the sample size is large enough relative to the way the data deviate from true normality, or it can yield a type II error, if the sample is too small. However, what really matters isn't how many data you have, but the size and nature of the deviation from normality, and that is something the tests can't tell you.
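A sketch of that point: the same mild deviation (a $t$-distribution with 10 degrees of freedom, which is nearly normal) typically goes undetected in a small sample but gets flagged once $n$ is large:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# t(10) is only mildly heavier-tailed than the normal.
small = rng.standard_t(df=10, size=50)
large = rng.standard_t(df=10, size=4000)

p_small = stats.shapiro(small).pvalue  # often well above 0.05
p_large = stats.shapiro(large).pvalue  # the tiny deviation becomes "significant"
```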
That having been said, the reason specialized tests like the Shapiro-Wilk are used instead of generic goodness of fit tests is that we primarily care about specific types of deviations from normality. Data can deviate from normality in innumerable ways. For simplicity, imagine a distribution that has the same kurtosis (fat-tailedness) as the normal but is skewed, and another that differs in kurtosis but is perfectly symmetrical. A test aimed at one of those features would miss the other. Of course, a general test will in some sense cover everything, but not with equal power: it will be more sensitive to some deviations than others, and which deviation is most detectable differs by test. Thus, you might as well use the test that is maximally sensitive to the deviations you care about. Those are typically deviations in the tails, and the Shapiro-Wilk is weighted to preferentially detect them.
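A rough power comparison under a heavy-tailed alternative (a $t$-distribution with 4 df) illustrates this. Note that the KS test below plugs in parameters estimated from the data, a common but strictly-speaking invalid usage that makes its nominal p-values conservative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

def rejection_rate(pvalue_fn, reps=500, n=50):
    """Share of heavy-tailed t(4) samples a test rejects at alpha = 0.05."""
    hits = 0
    for _ in range(reps):
        x = rng.standard_t(df=4, size=n)
        hits += pvalue_fn(x) < 0.05
    return hits / reps

sw_power = rejection_rate(lambda x: stats.shapiro(x).pvalue)
ks_power = rejection_rate(
    lambda x: stats.kstest(x, "norm", args=(x.mean(), x.std())).pvalue)
# Shapiro-Wilk should detect the heavy tails far more often than the
# generic KS test in this setting.
```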