Normal Distribution – How to Conduct Normality Test for Discrete Values of a Continuous Variable

I have a dataset with several hundred pH measurements from a factory line. This dataset will be used to infer process capability using Minitab.

On a physical basis, pH is a continuous scale (being the negative logarithm of the concentration of free hydrogen ions in solution). However, due to the resolution of the measuring instrument (reads out to 0.01) and relatively narrow range of values (min: 3.34, max: 3.74), there is a limited number of discrete values the measurement can take.

Looking at the data it indeed appears to be normal, however the Anderson-Darling test gives a p-value of <0.05, indicating non-normal data:

If I dither the data using "small-compared-to-process-variation" normally distributed noise ($\mu$ = 0, $\sigma$ = 0.005) the distribution does not change in any meaningful way. The relevant population parameters and/or visually). However, the A-D test gives a much higher p-value, indicating normality of the data:

Coming from a Six Sigma Green Belt background, where normality is king & molesting the data is strictly verboten, this feels like a conundrum. I would like to use that dataset to estimate process capability, however Minitab warns me about the non-normality (and so would my former 6σ coach).

My question therefore is two-fold:
1) Can I use the raw (non-dithered) data to infer process capability?
2) Is dithering needed and/or a valid way to pre-process the data prior to capability analysis?

Best Answer

When you have many observations like the hundreds that you have, a goodness-of-fit test is going to pick up on subtle deviations that are unlikely to interest you.

You’re right: because of the discreteness of the measurements,$^{\dagger}$ your data cannot be normal, and your test is confirming that your data are not from a normal distribution.

But you already knew that.

The general sentiment on Cross Validated is that this type of testing is not that helpful. Either you lack the sample size to have adequate power to detect an interesting difference, or your test is overpowered and will detect differences that aren’t interesting.

Your plots, especially the quantile-quantile plot, are evidence to me that your data are normal enough for pretty much any purpose.

$^{\dagger}$ There may be other reasons. I was going to say that your values are bounded, but pH doesn’t have to fall between 0 and 14 like they told me in middle school, I’ve learned.

Best Answer

Related Solutions

Solved – What’s a good normality test for time-series / sleep data

Solved – Estimate range of values of continuous variable corresponding to every level of discrete variable

Related Question