The answer is No!! I have several comments.
[1] WHAT VERSION OF DIXON'S RATIO TEST ARE YOU USING?
There are different versions of Dixon's ratio test.
The simplest form of the test is really designed for a single outlier. It uses the test statistic
$[X_{(n)} - X_{(n-1)}]/[X_{(n)} - X_{(1)}]$ for a very large value and
$[X_{(2)} - X_{(1)}]/[X_{(n)} - X_{(1)}]$ for a very small value.
So you do one test if you suspect an outlier in one direction, and both tests if you are looking in both directions. You should always keep multiplicity in mind.
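To make the two ratios concrete, here is a minimal R sketch (my own illustration; the function name dixon_r10 is made up). If you want the actual test with proper critical values, the CRAN package outliers provides dixon.test.

```r
# Sketch of Dixon's single-outlier ratio statistics (illustration only).
dixon_r10 <- function(x) {
  x <- sort(x)
  n <- length(x)
  rng <- x[n] - x[1]                    # full sample range
  c(high = (x[n] - x[n - 1]) / rng,     # suspect largest value
    low  = (x[2] - x[1]) / rng)         # suspect smallest value
}

dixon_r10(c(2.1, 2.3, 2.4, 2.5, 2.6, 5.9))  # large 'high' ratio flags 5.9
```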
However, if you expect two or more outliers in one direction, Dixon's ratio test suffers from the masking effect (as much as, if not more than, other outlier tests).
So the late Will Dixon, in his wisdom, devised a variant to overcome the masking effect that the second-largest observation has in hiding the largest observation as an outlier; the test statistic for large values is
$[X_{(n)} - X_{(n-2)}]/[X_{(n)} - X_{(1)}]$
The same idea can be used for masking at the lower extreme of the empirical distribution.
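A sketch of this masking-resistant version (again my own illustration), skipping the nearest order statistic at each extreme:

```r
# Skip the nearest neighbour so a second outlier cannot hide the first
# (illustration only).
dixon_r20 <- function(x) {
  x <- sort(x)
  n <- length(x)
  rng <- x[n] - x[1]
  c(high = (x[n] - x[n - 2]) / rng,   # two suspect large values
    low  = (x[3] - x[1]) / rng)       # two suspect small values
}
```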
This idea can be extended to more than two outliers, but it would be wrong to take it much further. The point of these tests is to find a few isolated outliers. If you have a lot of them, you can detect them informally by clustering. There are other methods that deal with several outliers via formal testing; if that is what you are looking for, you should find it in Barnett and Lewis' book or the monograph by Douglas Hawkins.
[2] ONE TEST OR SEVERAL?
You are talking about doing Dixon's test in groups of three. This is a common procedure for screening for outliers. But if you want to interpret a p-value correctly and do formal inference, multiplicity becomes an issue: without taking it into account, you may identify too many outliers. Since you only did three tests, multiplicity is less of a problem than if you had done 50 or 100 tests. But didn't you say you had 100,000 rows? If the real problem is going to involve many more than 3 tests, watch out for this pitfall.
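As an illustration of one way to handle the multiplicity (the three p-values below are made up, not from your data), base R's p.adjust can apply a Holm or Bonferroni correction:

```r
# Hypothetical p-values from three per-row Dixon tests (made-up numbers).
p_raw <- c(0.030, 0.012, 0.200)
p.adjust(p_raw, method = "holm")        # Holm step-down adjustment
p.adjust(p_raw, method = "bonferroni")  # cruder Bonferroni bound
```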
Now, I don't know why you test the rows separately. If the rows are really poolable, you can do one test instead of three, and n will be 24 instead of 8. So you need to decide whether to do one test or several, and whether you are looking for (1) only extremely large outliers, (2) only extremely small outliers, or (3) both types.
[3] ROBUSTNESS TO NON-NORMALITY.
Grubbs' test has optimality properties when the "good" data can be considered to come from a single normal distribution. However, that makes it sensitive to departures from normality when the "good" data are not normal. "Good" is a subjective term, but some qualification of this kind is necessary: a big issue with outliers is whether a large observation indicates something out of the ordinary (a recording error being one possibility), or merely that, contrary to your preconceived belief about the data, the actual distribution is heavy-tailed, so that observations which would be extreme for the normal distribution are not extreme for the true underlying distribution.
On the other hand, although Dixon's test involves an assumption of normality, it is both good at detecting outliers and robust to departures from normality. My paper in The American Statistician in 1982 was titled "On the Robustness of Dixon's Ratio Test in Small Samples." The paper shows that when the sample size is 3 to 5, the test retains its significance level even when the distribution is very non-normal (e.g. uniform, exponential). Because the statistic is a ratio of spacings between order statistics, I think it should not be surprising that it is robust, and I believe the robustness would hold up in larger samples as well. But I have not investigated that, and I don't know what, if anything, is in the literature regarding my conjecture.
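If you want to probe that conjecture yourself, here is a rough simulation sketch of my own (arbitrary seed and replication count): estimate the upper 5% critical value of the ratio under normal samples, then see what level the test attains under very non-normal data.

```r
set.seed(1)
# Upper-tail Dixon ratio for the largest observation.
r10_high <- function(x) {
  x <- sort(x)
  n <- length(x)
  (x[n] - x[n - 1]) / (x[n] - x[1])
}
n <- 5; B <- 1e5
# Critical value under the normal null.
crit <- quantile(replicate(B, r10_high(rnorm(n))), 0.95)
# Attained level under non-normal nulls; should be near 0.05 if the
# significance level really is robust.
mean(replicate(B, r10_high(runif(n))) > crit)   # uniform data
mean(replicate(B, r10_high(rexp(n))) > crit)    # exponential data
```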
[4] CONCLUSION.
Given all the other important issues you should be concerned with when using Dixon's test for outliers, I think it is wrong to focus on the normality of the data. That, I think, is the least important issue (although it should not be ignored).
As has been said many times on this site, goodness-of-fit tests for normality such as Shapiro-Wilk and Anderson-Darling are among the most powerful. But the problem with these, or any other goodness-of-fit test, is that in large samples you will be able to reject normality for distributions that depart only slightly from normal. Most distributions are not exactly normal. So the real issue for you regarding normality is whether or not the population distribution is close enough to normal for Dixon's test to be valid.
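A quick illustration of the large-sample point (my own example, with an arbitrary mildly heavy-tailed alternative):

```r
set.seed(1)
x <- rt(5000, df = 10)   # close to normal, but not exactly normal
shapiro.test(x)          # with n = 5000 this will usually reject
```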
A: You have the robustness property of Dixon's test.
B: The Shapiro-Wilk test will detect slight departures from normality in large to very large samples.
C: How does testing normality even make sense when you are looking for outliers? You really cannot check the normality assumption, because what does rejecting normality tell you? (1) It could be that the data are non-normal in a way that makes the outlier test invalid, because the test assumes normality; or (2) the underlying distribution is normal but the outliers themselves are causing the rejection, in which case the test may still be valid. If you have good reason to believe that your data should be approximately normal, rejecting normality may only mean that there are several outliers (enough to make the full sample fail the test); see the sketch below.
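A minimal sketch of point C, assuming clean normal data contaminated by a few gross outliers:

```r
set.seed(1)
x <- c(rnorm(100), 8, 9, 10)   # normal "good" data plus three gross outliers
shapiro.test(x)                # rejects, but only because of the outliers
```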
I have a set of measurements coming from a manufacturing process. I want to test if the measurements come from a normal distribution. If I understand correctly, it's wrong to use the K-S test or A-D test with mean and variance estimated from the sample.
You're correct -- at least with the usual tables.
BTW, that's what everybody else in my team is doing. Of course, that doesn't make it right! It's just that I cannot rely on my coworkers for guidance on this topic.
Indeed.
However, intuitively I suppose that, having fitted the parameters, the test now has much less power to reject the null.
Correct.
Now, even with estimated mean and variance, I usually get absurdly low p-values (stuff like 0.0001 or less!). Thus, I think I'm safe to say that my data are definitely not normal. Is that correct?
In effect, yes (either that or the null is true but an event with very low probability occurred).
However, you don't need the test to know that your data aren't drawn from a normal distribution. [I bet I could prove to you that none of the data you're testing can actually have come from a normal population, but I'd need to know what you're measuring to give you the right argument.]
In essence then, one could only fail to reject by gathering too small a sample. (So the wonder is then why you'd bother to test it. It's a question you already know the answer to, and in any case the answer is of no value to you. It doesn't matter if the distribution is truly normal or not. An answer to a slightly different question is much more useful.)
I know that the right way to approach this issue would be to follow the procedure in
Testing whether data follows T-Distribution
But that's a bit complicated for me, and there are quite a few steps I don't understand. For example, what's the R function random?
No wonder -- that's not R code!!!
While I could easily explain how to automate an Anderson-Darling test in R, (or even easier, point you to the package that already does it for you*), I see no reason why any of this would answer a question you should care about.
The critical question here is: Why are you testing normality?
* If you must test normality, for all that it makes no sense to do so that I can see, the package nortest implements unspecified-parameter (i.e. composite) normality tests, including one based on the Anderson-Darling statistic ... but why on earth would you not use Shapiro-Wilk? It's in vanilla R, and it's nearly always more powerful even than the Anderson-Darling for the alternatives people tend to care about.
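For completeness, the calls look like this (my own toy data; nortest's ad.test and base R's shapiro.test are the functions mentioned above):

```r
# install.packages("nortest")   # one-time install
library(nortest)
set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)
ad.test(x)        # composite Anderson-Darling test from nortest
shapiro.test(x)   # Shapiro-Wilk, built into base R's stats package
```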
Best Answer
A statistical test has to do with replications of the experiment and a null hypothesis that is not "discovered" through the incidental finding of an outlying data point. For that reason, it doesn't make sense to use a statistical test on individual data points, but you can use critical values or other criteria to flag observations as possible outliers, and then proceed accordingly to verify the data's accuracy.
Because of Chebyshev's inequality, you can always probabilistically quantify the distance of an observation from the mean in terms of a z-score. Tukey's famous rule flags as outliers observations below Q1 - 1.5 IQR or above Q3 + 1.5 IQR. To give you a sense of scale, in a normal distribution the upper fence corresponds to a z-score of about 2.70, which in a sample of 6,000 would flag about 21 observations above it regardless of whether they are actually outliers.
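Those two numbers are easy to reproduce (a quick check, assuming exactly normal data):

```r
q1 <- qnorm(0.25); q3 <- qnorm(0.75)
upper <- q3 + 1.5 * (q3 - q1)   # Tukey upper fence in z-score units: ~2.698
upper
(1 - pnorm(upper)) * 6000       # expected count above the fence: ~21
```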
Along those lines, it is fair to consider any rule that suits the problem to rank and classify outlying observations. Some ad hoc examples below:
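One such ad hoc rule, sketched in R (my own example; the robust z-score cutoff of 3.5 is a conventional but arbitrary choice):

```r
# Flag observations whose median/MAD-based robust z-score exceeds a cutoff.
flag_outliers <- function(x, cutoff = 3.5) {
  rz <- (x - median(x)) / mad(x)   # mad() is scaled to match sd under normality
  which(abs(rz) > cutoff)
}

set.seed(1)
flag_outliers(c(rnorm(20), 12))    # flags the planted value 12
```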