[Math] Normality test vs. Fitting a Gaussian curve

data analysis, least squares, MATLAB, normal distribution, statistics

I have a set of one-dimensional data, and I suspect that the data is normally distributed. Before embarking on a normality test, I decided to fit a Gaussian curve to the histogram of relative frequencies and see how well it fits. The result can be seen in the following image:
[Image: Gaussian fit to the data set]
In my opinion it fits quite well, but I am a novice in Statistics (and just about everything else). So my questions would be:

  1. Based on this graphical evidence, would you venture to hypothesize that the data is normally distributed? If so, do we still need to proceed with a chi-square normality test?

  2. More generally, would it be possible to replace a normality test by, say, a least-squares Gaussian fit? In that case, a goodness-of-fit statistic such as $R^2$ (or the sum of squared residuals) would serve as some kind of measure of the normality of the data.

By the way, I obtained this fit with MATLAB's cftool. I don't know exactly which method it uses. I suppose it uses least squares, but I'm not sure. If anybody can confirm that, I'd appreciate it.

Best Answer

You say nothing about the sample size. It seems the dotted line in your plot is essentially a histogram with dots from the middle of the tops of the bars. To get such a smooth result the sample size must be large.

Small samples. For small samples, it can be very difficult to judge normality. A formal test, such as a Shapiro-Wilk or Anderson-Darling test, has very poor power for small samples. A p-value above 0.05 can be interpreted as 'consistent with normal', but a small sample might also be consistent with lots of other distributional models.

As an example, I generated a random sample of size $n = 20$ from $\mathsf{Unif}(0,1)$ and did a Shapiro-Wilk test of normality. The p-value was about $0.29 > 0.05,$ so this sample, known to be from a uniform population, is judged 'consistent with normal'. [Code for this experiment in the R statistical software follows.]

x = runif(20); shapiro.test(x)  

        Shapiro-Wilk normality test

data:  x
W = 0.94402, p-value = 0.2852

There is not much use making a histogram or a normal probability plot for such a small sample (except perhaps as a drill problem for homework in an elementary statistics course). Here is a stripchart of the 20 observations tested above.

stripchart(x, pch=19)

[Stripchart of the 20 uniform observations]

This was the first uniform sample of size 20 I tried. Was 'consistency' with normal a 'lucky' result that just happened to make my point? The answer is No: A simulation with 10,000 such samples of size 20 showed that about 80% 'pass' as 'consistent with normal'.
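The simulation just described can be sketched in R as follows. (The seed here is an arbitrary choice of mine, not the one behind the original run, so the exact proportion will vary slightly around 80%.)

```r
# Proportion of Unif(0,1) samples of size 20 that 'pass' a
# Shapiro-Wilk test at the 5% level, out of 10,000 samples.
set.seed(2023)   # arbitrary seed for reproducibility
pv = replicate(10^4, shapiro.test(runif(20))$p.value)
mean(pv > .05)   # roughly 0.8
```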

Why do we care whether the population from which we are sampling is normal? Often because we wonder whether it is OK to use normal-based inferential procedures, such as a t test or t interval. Unless there is marked evidence that a small sample is pretty clearly not normal (such as remarkable outliers or obvious skewness), most texts say it is OK to use t procedures.

A 95% t confidence interval for the mean of the data above is $(0.39, 0.65),$ which is hardly a 'sharp' interval, but it does include the true population mean $\mu = 0.5.$ Of course, if these were real data (not simulated using known parameters), we would never know for sure that the CI contains the true value of $\mu.$ [A nonparametric Wilcoxon signed-rank 95% CI for the population median is $(0.38, 0.66).$]
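Intervals of this kind are easy to compute in R. The sketch below re-simulates a uniform sample of size 20 under an arbitrary seed of my choosing, so the endpoints will differ somewhat from the ones quoted above:

```r
set.seed(111)   # arbitrary seed; not the sample behind the quoted intervals
x = runif(20)
t.test(x)$conf.int                       # 95% t CI for the population mean
wilcox.test(x, conf.int=TRUE)$conf.int   # 95% Wilcoxon signed-rank CI for the median
```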

Large samples. For large samples, histograms and Q-Q plots are often useful. However, normality tests such as Shapiro-Wilk may too often reject a large sample, which we believe must be normal, as not 'consistent with normal'.

For example, here are results for a sample of size $n = 1000$ known to have been drawn from a normal population.

y = rnorm(1000, 100, 15);  shapiro.test(y)

        Shapiro-Wilk normality test

data:  y
W = 0.99716, p-value = 0.07436

The p-value 0.07 is still above 0.05, but small enough to make one wonder whether the data might not be normal (had we not just simulated the sample to be normal). Here is a histogram with the best-fitting normal density curve (not exactly an ideal fit) and a normal probability plot (points not in quite as straight a line as one might prefer).
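Plots of this kind can be made in base R roughly as follows. (The sample is re-simulated under an arbitrary seed, so it will not reproduce the particular 'imperfect' sample shown in the figure.)

```r
set.seed(2)   # arbitrary seed; not the sample in the original figure
y = rnorm(1000, 100, 15)
par(mfrow = c(1,2))
hist(y, prob = TRUE, col = "skyblue2", main = "Histogram of y")
curve(dnorm(x, mean(y), sd(y)), add = TRUE, lwd = 2, col = "red")  # best-fitting normal density
qqnorm(y); qqline(y, col = "red")                                  # normal probability plot
par(mfrow = c(1,1))
```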

[Histogram of y with best-fitting normal density curve; normal probability plot]

Why do we care about normality? If we are doing t procedures, the sample is large enough to expect very good results. For example, a 95% t confidence interval is $(99.6, 101.5),$ which is relatively short and contains the true mean $\mu = 100.$ However, if these are IQ scores of 1000 students, it may be worth noting that there are a few more students just below 100 than we might expect, and a few fewer just above 100.

Usually, samples of size 1000 generated to be normal are better behaved than the one in the example just above. I discarded three simulated samples in order to show, as my fourth example, one that is not 'textbook perfect'.

Addendum per Comments: Consider a sample of size $n = 10,000$ from $\mathsf{Norm}(0,1).$

z = rnorm(10^4);  shapiro.test(z[1:5000])

        Shapiro-Wilk normality test

data:  z[1:5000]
W = 0.99962, p-value = 0.4764

In R, the Shapiro-Wilk test is limited to 5000 observations; here, the first half of the data is judged consistent with normal. The Shapiro-Wilk test uses some approximations; even so, in 10,000 tests on normal samples of size $n = 5000,$ the false rejection rate was about 4.3% (close to the nominal 5%).

pv = replicate(10^4, shapiro.test(rnorm(5000))$p.value)
mean(pv < .05)
## 0.0432

A 95% t confidence interval for the mean is $(-.003, .036),$ which is very short and contains the population mean $\mu = 0.$

In the figure below: The left panel shows a histogram of the sample along with the standard normal density (dashed red) and a kernel density estimate, or KDE (solid dark green). For such a large sample, the population density and the KDE almost match. [Very roughly, you can think of the KDE as a way to 'smooth' a histogram. You may want to google KDE and/or read Silverman's excellent book.] The center panel shows the empirical CDF (ECDF) of the sample along with the CDF of the standard normal distribution. Information is lost in reducing data to histogram bins, but not in making the ECDF, so the ECDF is generally a better match to the population CDF than the histogram is to the population density. [The ECDF sorts the data and jumps up by $1/n$ at each data value.] The right panel shows an (essentially linear) normal probability plot (Q-Q plot). Roughly speaking, a Q-Q plot is an ECDF with the 'theoretical quantile' scale distorted to give a (theoretically) linear plot for normal data.
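The three panels just described can be sketched in base R as follows. (The sample z is re-simulated under an arbitrary seed, so the picture will differ in detail from the original figure.)

```r
set.seed(3)   # arbitrary seed; not the sample in the original figure
z = rnorm(10^4)
par(mfrow = c(1,3))
hist(z, prob = TRUE, main = "Histogram")
curve(dnorm(x), add = TRUE, lty = 2, lwd = 2, col = "red")  # population density
lines(density(z), lwd = 2, col = "darkgreen")               # kernel density estimate
plot(ecdf(z), main = "ECDF")
curve(pnorm(x), add = TRUE, col = "red")                    # population CDF
qqnorm(z); qqline(z, col = "red")                           # normal Q-Q plot
par(mfrow = c(1,1))
```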

[Three panels: histogram with normal density and KDE; ECDF with normal CDF; normal Q-Q plot]