Hypothesis Testing – How to Test Fit of Discrete Data Distribution to Normal Distribution

Tags: chi-squared-test, discrete-data, hypothesis-testing, normal-distribution, r

Here are some grades. (They aren't test scores. They aren't a sample.) I want to test whether a normal distribution derived from the mean and variance of the grades provides a good fit for the grades. What other distribution might fit is, for now, an orthogonal issue. The question is whether the normal distribution provides a valid model.

> grades
  [1] NA 75 70 70 80 75 85 85 75 85 75 80 85 80 75 70 70 85 90 90 NA 85 NA 75 80
 [26] 80 85 80 85 85 75 NA 75 70 80 65 80 75 70 80 85 65 85 75 75 85 75 60 85 80
 [51] 85 85 70 70 80 75 80 80 NA 80 75 85 80 80 55 75 60 90 80 75 80 70 85 75 80
 [76] 75 80 75 75 70 85 85 80 80 75 70 NA 75 70 75 70 65 85 70 70 80 80 85 85 70
[101] 75 70 75 75 65 70 80 75 65 65
> t <- table(grades)
> t
grades
55 60 65 70 75 80 85 90
 1  2  6 18 27 25 22  3

The grades come from a discrete set of possible values from 5 to 100 in increments of 5. Because of that, a chi-squared test might be appropriate. First, the normal-distribution probabilities that any of the discrete values will occur:

> p <- dnorm(seq(55,90,5), mean(grades, na.rm=TRUE), sd(grades, na.rm=TRUE))
> p
[1] 0.0004433536 0.0031786310 0.0136901824 0.0354207364 0.0550535301
[6] 0.0514034173 0.0288322176 0.0097150119

Why are the values of p so small? Why do they not add up to one?

> tf <- as.data.frame(t)
> tf
  grades Freq
1     55    1
2     60    2
3     65    6
4     70   18
5     75   27
6     80   25
7     85   22
8     90    3
> chisq.test(tf$Freq, p=p, rescale.p=TRUE)

    Chi-squared test for given probabilities

data:  tf$Freq
X-squared = 7.0452, df = 7, p-value = 0.4242

Warning message:
In chisq.test(tf$Freq, p = p, rescale.p = TRUE) :
  Chi-squared approximation may be incorrect

Why does the chisq.test function produce the warning?

I need numbers, because I have a lot of these data sets and want fit results for a large sample of them. Visual inspection isn't proof. The closest relevant question I've found is https://stats.stackexchange.com/a/140576/59460 ("Testing Whether a Binomial Distribution Fits Data").

Best Answer

Data. You have $n = 103$ 'grades' with values $v = (60, 65, 70, \dots, 90)$ occurring with respective frequencies $f = (2, 6, 18, 27, 25, 22, 3).$ (The single grade of 55 is set aside here.) Treating this as a sample, we use the sample mean $A = \bar X = \frac 1n \sum_{i=1}^7 f_iv_i,$ where $n = \sum_{i=1}^7 f_i = 103,$ so that $A = \bar X = 77.039$ estimates the mean $\mu$ of a population.

v = c(60, 65, 70, 75, 80, 85, 90)
f = c( 2,  6, 18, 27, 25, 22,  3)
a = sum(v*f)/sum(f);  a
[1] 77.03883

Similarly, the sample variance is $S^2 = \frac{1}{n-1}\sum_{i=1}^7 f_i(v_i - a)^2 = 44.822$ so that the population variance $\sigma^2$ is estimated by $44.822.$ Also, the population standard deviation is estimated by $S = \sqrt{44.822} = 6.695.$

n = sum(f); sum(f*(v-a)^2)/(n-1)
[1] 44.82201

Shapiro-Wilk test of normality. The 103 observations are not normally distributed, because they have been rounded to only seven different values. For data not so severely rounded, an excellent test of normality would be the Shapiro-Wilk test. But your 103 observations fail this test with a P-value $0.00012$ far below $0.05.$

x = rep(v, times=f)
shapiro.test(x)

        Shapiro-Wilk normality test

data:  x
W = 0.9384, p-value = 0.0001217

Chi-squared GOF test. However, one can still ask whether the unrounded values might be consistent with the normal distribution $\mathsf{Norm}(\mu =77.039, \sigma=6.695).$ Here is a histogram where the bins are chosen to put the observed values at the bin centers. The fit isn't excellent, but a chi-squared goodness-of-fit (GOF) test may not reject this normal distribution as a possible fit. [See Addendum for R code.]

[Figure: histogram of the grades with the fitted normal density curve overlaid]

A difficulty with the GOF test is that it requires an expected count larger than $5$ in each of the intervals, and this normal distribution doesn't assign enough probability to the first and last intervals. Thus, after combining, we are left with five intervals: $(-\infty, 67.5],$ $(67.5, 72.5],$ $(72.5, 77.5],$ $(77.5, 82.5],$ $(82.5, \infty).$

We need to know the probability that $\mathsf{Norm}(77.039, 6.695)$ assigns to each of the five intervals. That can be found using the normal CDF, called pnorm in R. (This is where you had trouble in your Question.)

int.end = c(-Inf, 67.5, 72.5, 77.5, 82.5, Inf)
int.prb = diff(pnorm(int.end, 77.04, 6.695))
sum(int.prb)
[1] 1

Then, according to the normal model we can find the expected counts $E_i$ in each interval by multiplying by $n = 103.$ We combine observed counts $O_i$ for the first and last intervals.

int.exp = n*int.prb
int.obs = c(8, 18, 27, 25, 25)
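Rather than typing the pooled observed counts by hand, they can be tabulated from the raw values at the same cut points (a small sketch, redefining the vectors from above so it runs on its own):

```r
# Tabulate the pooled observed counts directly from the raw values, using the
# same interval endpoints as the expected counts.
v = c(60, 65, 70, 75, 80, 85, 90)
f = c( 2,  6, 18, 27, 25, 22,  3)
x = rep(v, times = f)                         # expand frequency table to raw data
int.end = c(-Inf, 67.5, 72.5, 77.5, 82.5, Inf)
int.obs = as.vector(table(cut(x, breaks = int.end)))
int.obs                                       # 8 18 27 25 25, matching the counts above
```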

Here is a table of the observed and expected counts, which are used to find the chi-squared GOF statistic.

cbind(int.obs, int.exp)
     int.obs  int.exp
[1,]       8  7.93993
[2,]      18 17.69146
[3,]      27 28.68967
[4,]      25 27.31845
[5,]      25 21.36049

The chi-squared GOF statistic

$$Q = \sum_{i=1}^5 \frac{(O_i - E_i)^2}{E_i} = 0.922$$

has approximately a chi-squared distribution with $\nu = 5 - 3 = 2$ degrees of freedom (one constraint from the fixed total count, two from the estimated parameters $\mu$ and $\sigma$). The P-value $0.631$ is far above 5%, so we do not reject the normal model. Rejection would have required $Q$ to exceed the critical value $c = 5.99.$

q = sum((int.obs - int.exp)^2/int.exp);  q
[1] 0.9222296
1 - pchisq(q, 2)
[1] 0.6305803
qchisq(.95,2)
[1] 5.991465

If you have many of these GOF tests to do, you will want to find convenient R code for doing the whole process at once. Some of the code suggested in @Joe's Answer (+1) may be helpful. Just remember that you need to check that expected counts exceed 5, and combine bins if they don't.
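As a sketch of such automation (a hypothetical helper, not part of base R): the function below bins a vector of rounded values at the half-way cut points, pools bins in from each end until the extreme expected counts reach 5, and returns the statistic, degrees of freedom, and P-value. It assumes equally spaced values (step 5 here) and estimates $\mu$ and $\sigma$ from the data, hence the two extra degrees of freedom subtracted.

```r
# Hypothetical helper: chi-squared GOF test of rounded data against a fitted
# normal distribution, pooling extreme bins whose expected counts fall below 5.
norm.gof = function(x, step = 5) {
  x = x[!is.na(x)]
  v = sort(unique(x))
  cutp = c(-Inf, head(v, -1) + step/2, Inf)       # half-way cut points
  obs = as.vector(table(cut(x, cutp)))
  exp = length(x) * diff(pnorm(cutp, mean(x), sd(x)))
  while (length(obs) > 4 && exp[1] < 5) {         # pool at the low end
    obs = c(obs[1] + obs[2], obs[-(1:2)])
    exp = c(exp[1] + exp[2], exp[-(1:2)])
  }
  while (length(obs) > 4 && exp[length(exp)] < 5) {  # pool at the high end
    k = length(obs)
    obs = c(obs[1:(k - 2)], obs[k - 1] + obs[k])
    exp = c(exp[1:(k - 2)], exp[k - 1] + exp[k])
  }
  q = sum((obs - exp)^2 / exp)
  df = length(obs) - 3                            # minus 1, minus 2 fitted parameters
  c(statistic = q, df = df, p.value = 1 - pchisq(q, df))
}

v = c(60, 65, 70, 75, 80, 85, 90)
f = c( 2,  6, 18, 27, 25, 22,  3)
res = norm.gof(rep(v, times = f))
```

Note that this automatic pooling keeps the last bin $(87.5, \infty)$, whose expected count is just above 5, so the statistic differs somewhat from the hand computation above; the conclusion (no rejection) is the same.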

Approximate methods. Finally, here is an approximate method for using the Shapiro-Wilk test. As we saw at the beginning of this Answer, this test rejected your normal model, likely because the observations are rounded to only a few integers.

If you add random noise to these observations, roughly speaking, to 'unround' them, then they pass the Shapiro-Wilk test. In your case random noise from the distribution $\mathsf{Unif}(-2.5, 2.5)$ should suffice. Then we have approximate P-value 0.318 and the normal model is not rejected. (You will get a somewhat different answer each time you try this random approximation: subsequent runs gave P-values 0.1856 and 0.4348.)

set.seed(1234)
shapiro.test(x + runif(n, -2.5, 2.5))

        Shapiro-Wilk normality test

data:  x + runif(n, -2.5, 2.5)
W = 0.98538, p-value = 0.318

If you have many of these tests to do, you might try this approximate procedure as a screening device, and then check cases where normality is rejected with the chi-squared GOF test.
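Such a screening pass might look like the sketch below (grade.list is a hypothetical stand-in for your collection of data sets; the jitter half-width 2.5 again assumes values rounded to multiples of 5):

```r
# Screening sketch: jittered Shapiro-Wilk P-value for each data set, flagging
# the ones below 0.05 for a follow-up chi-squared GOF test.
set.seed(1234)
grade.list = list(                      # toy stand-ins for your many data sets
  g1 = rep(c(60, 65, 70, 75, 80, 85, 90), c(2, 6, 18, 27, 25, 22, 3)),
  g2 = sample(seq(30, 100, 5), 120, replace = TRUE)
)
p.screen = sapply(grade.list, function(g) {
  g = g[!is.na(g)]
  shapiro.test(g + runif(length(g), -2.5, 2.5))$p.value
})
flagged = names(p.screen)[p.screen < 0.05]   # candidates for the GOF follow-up
```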

Looking at histograms and normal probability plots can also give a quick visual check. Here is a normal probability plot of your data.

 qqnorm(x); qqline(x, col="red")

[Figure: normal probability (Q-Q) plot of the grades]

If the 'clusters' touch the red line (except possibly for the smaller first and last ones), you might take that as some evidence of a fit to normal.

Addendum per Comments: Here is R code for the figure above with histogram of data and density curve. The first few lines make sure the data is in memory.

v = c(60, 65, 70, 75, 80, 85, 90)
f = c( 2,  6, 18, 27, 25, 22,  3)
x = rep(v, times=f)
cutp = seq(60-2.5, 90+2.5, by = 5)
hist(x, prob=T, br=cutp, col="skyblue2", xlim=c(40,110), ylim=c(0,.06))
 curve(dnorm(x, 77.03, 6.695), add=T, lwd=2, col="red")