Hypothesis Testing – Anderson-Darling Test for Normality with Estimated Parameters

anderson-darling-test, hypothesis-testing, kolmogorov-smirnov-test, normality-assumption

I have a set of measurements coming from a manufacturing process. I want to test whether the measurements come from a normal distribution. If I understand correctly, it's wrong to use the K-S test or A-D test with mean and variance estimated from the sample. BTW, that's what everybody else in my team is doing. Of course, that doesn't make it right! It's just that I cannot rely on my coworkers for guidance on this topic. However, intuitively I suppose that, having fitted the parameters, the test now has much less power to reject the null. Yet even with estimated mean and variance, I usually get absurdly low p-values (stuff like 0.0001 or less!). Thus, I think I'm safe to say that my data are definitely not normal. Is that correct?

I know that the right way to approach this issue would be to follow the procedure in

Testing whether data follows T-Distribution

but I don't understand that procedure. I will write another question about it.

EDIT: People asked why I'm doing the normality test. Usually in my company they're done because, when you investigate sources of variation in a MFG process, you use different tests depending on whether the subgroups are normally distributed or not. For example: normally distributed subgroups => use the F-test to assess homogeneity of variance; non-normally distributed => use Levene's test. I guess that's related to the assumptions of each test.
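For what it's worth, both variance-homogeneity tests mentioned above are available in Python's scipy (a sketch with made-up toy data; `bartlett` is the k-sample analogue of the two-sample F-test and, like it, assumes normal subgroups):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
g1 = rng.normal(0.0, 1.0, size=50)   # toy subgroup, sd = 1.0
g2 = rng.normal(0.0, 1.5, size=50)   # toy subgroup, sd = 1.5

# Bartlett's test: sensitive to non-normality, like the F-test.
stat_b, p_b = stats.bartlett(g1, g2)

# Levene's test: robust to departures from normality.
stat_l, p_l = stats.levene(g1, g2)

print(p_b, p_l)
```

The point of having both is exactly the one made above: Bartlett/F inherits a normality assumption, Levene does not.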

My case is different. I need to perform an uncertainty quantification study of the performance of a machine: I want to propagate the manufacturing process variability across the chain of codes used to compute the performance of the machine, in order to estimate the uncertainty in the machine performance. I thus need to perform a Monte Carlo analysis (or Polynomial Chaos, or Gaussian Process – pick your favourite tool). In any case, I need to be able to sample from a distribution, and management demanded that I "keep things simple!". Because of this common "test for normality" mentality, in people's minds "keep things simple!" means "just use normal distributions for inputs, and don't waste any precious minutes of work on mumbo-jumbo stuff!".

Now, of course MFG data can't have perfectly normal deviations – a diameter must be positive, an angle has a bounded range, etc. But that's not the real issue here: for example, my diameter data have such a small coefficient of variation that a Gaussian distribution and a truncated Gaussian with lower limit 0 would look basically the same: P(D<0 | normal distribution) = 1e-100 or something like that. Thus, if the only issue were that MFG data are bounded, a Gaussian pdf could still be a reasonable proxy. However, I see much bigger deviations from normality: p = 0.0001 or less. Thus I need the normality test as an objective measure to prove to my managers that, by using normal distributions for inputs, my uncertainty quantification study would be completely messed up.
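The "truncation barely matters" claim is easy to check numerically. A sketch with hypothetical diameter statistics (mean 10, sd 0.5, i.e. CV = 5%; these numbers are made up for illustration): the mass a fitted normal puts below zero is P(D<0) = Φ(−1/CV), which is astronomically small.

```python
from scipy.stats import norm

mu, sigma = 10.0, 0.5                # hypothetical diameter stats, CV = 5%
p_below_zero = norm.cdf(0.0, loc=mu, scale=sigma)   # = Phi(-1/CV) = Phi(-20)
print(p_below_zero)                  # on the order of 1e-89: effectively zero
```

So truncating the Gaussian at 0 changes essentially nothing here, which is why boundedness alone is not the problem.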

Best Answer

I have a set of measurements coming from a manufacturing processes. I want to test if the measurements come from a normal distribution. If I understand correctly, it's wrong to use K-S test or A-D test with mean and variance estimated by the sample.

You're correct -- at least with the usual tables.

BTW, that's what everybody else in my team is doing. Of course, that doesn't make it right! It's just that I cannot rely on my coworkers for guidance on this topic.

Indeed.

However, intuitively I suppose that, having fitted the parameters, now the test has much less power to reject the null.

Correct.

Now, even with estimated mean and variance, I usually get absurdly low p-values (stuff like 0.0001 or less!). Thus, I think I'm safe to say that my data are definitely not normal. Is that correct?

In effect, yes (either that or the null is true but an event with very low probability occurred).

However, you don't need the test to know that your data aren't drawn from a normal distribution. [I bet I could prove to you that none of the data you're testing can actually have come from a normal population, but I'd need to know what you're measuring to give you the right argument.]

In essence then, one could only fail to reject by gathering too small a sample. (So the wonder is then why you'd bother to test it. It's a question you already know the answer to, and in any case the answer is of no value to you. It doesn't matter if the distribution is truly normal or not. An answer to a slightly different question is much more useful.)
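The point above can be illustrated with a small simulation (a sketch, not from the answer; the distribution and sample sizes are arbitrary): draw from a mildly heavy-tailed, clearly non-normal population and run Shapiro-Wilk at two sample sizes. Whether the test rejects is mostly a function of sample size, not of how far from normal the data are.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
population = stats.t(df=3)           # heavy-tailed: definitely not normal

small = population.rvs(size=20, random_state=rng)
large = population.rvs(size=4000, random_state=rng)

_, p_small = stats.shapiro(small)    # may well fail to reject
_, p_large = stats.shapiro(large)    # rejects decisively

print(p_small, p_large)
```

Same population, opposite conclusions: only the sample size changed, which is why the test answers a question you already know the answer to.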

I know that the right way to approach this issue would be to follow the procedure in
Testing whether data follows T-Distribution
But that's a bit complicated for me, and there are quite a few steps I don't understand. For example, what's the R function random?

  1. No wonder -- that's not R code!!!

  2. While I could easily explain how to automate an Anderson-Darling test in R, (or even easier, point you to the package that already does it for you*), I see no reason why any of this would answer a question you should care about.

The critical question here is: Why are you testing normality?

* If you must test normality (for all that I can see no reason to do so), the nortest package implements unspecified-parameter (i.e. composite) normality tests, including one based on the Anderson-Darling statistic ... but why on earth would you not use Shapiro-Wilk? It's in vanilla R, and it's nearly always more powerful than the Anderson-Darling for the alternatives people tend to care about.
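For reference, both composite tests are also available outside R, in Python's scipy (a sketch on toy data, not the R code the footnote refers to): `scipy.stats.anderson` uses the adjusted critical values for the estimated-parameter (composite) case, and `scipy.stats.shapiro` is the Shapiro-Wilk test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=0.5, size=200)   # toy "measurement" data

# Anderson-Darling with mean and sd estimated from the sample:
# scipy compares the statistic against composite-null critical values.
ad = stats.anderson(x, dist='norm')
print(ad.statistic, ad.critical_values, ad.significance_level)

# Shapiro-Wilk (reports an actual p-value).
w, p = stats.shapiro(x)
print(w, p)
```

Note that `anderson` reports the statistic against a table of critical values rather than a p-value, mirroring how these composite tests are usually tabulated.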
