Hypothesis Testing – How to Test Fit of Discrete Data Distribution to Normal Distribution

Tags: chi-squared-test, discrete-data, hypothesis-testing, normal-distribution, r

Here are some grades. (They aren't test scores. They aren't a sample.) I want to test whether a normal distribution derived from the mean and variance of the grades provides a good fit for the grades. What other distribution might fit is, for now, an orthogonal issue. The question is whether the normal distribution provides a valid model.

> grades
  [1] NA 75 70 70 80 75 85 85 75 85 75 80 85 80 75 70 70 85 90 90 NA 85 NA 75 80
 [26] 80 85 80 85 85 75 NA 75 70 80 65 80 75 70 80 85 65 85 75 75 85 75 60 85 80
 [51] 85 85 70 70 80 75 80 80 NA 80 75 85 80 80 55 75 60 90 80 75 80 70 85 75 80
 [76] 75 80 75 75 70 85 85 80 80 75 70 NA 75 70 75 70 65 85 70 70 80 80 85 85 70
[101] 75 70 75 75 65 70 80 75 65 65
> t <- table(grades)
> t
grades
55 60 65 70 75 80 85 90
 1  2  6 18 27 25 22  3

The grades come from a discrete set of possible values from 5 to 100 in increments of 5. Because of that, a chi-squared test might be appropriate. First, the normal-distribution probabilities that any of the discrete values will occur:

> p <- dnorm(seq(55,90,5), mean(grades, na.rm=TRUE), sd(grades, na.rm=TRUE))
> p
[1] 0.0004433536 0.0031786310 0.0136901824 0.0354207364 0.0550535301
[6] 0.0514034173 0.0288322176 0.0097150119

Why are the values of p so small? Why do they not add up to one?

> tf <- as.data.frame(t)
> tf
  grades Freq
1     55    1
2     60    2
3     65    6
4     70   18
5     75   27
6     80   25
7     85   22
8     90    3
> chisq.test(tf$Freq, p=p, rescale.p=TRUE)

    Chi-squared test for given probabilities

data:  tf$Freq
X-squared = 7.0452, df = 7, p-value = 0.4242

Warning message:
In chisq.test(tf$Freq, p = p, rescale.p = TRUE) :
  Chi-squared approximation may be incorrect

Why does the chisq.test function produce the warning?

I need numbers, because I have a lot of these data sets and want fit results for a large sample of them. Visual inspection isn't proof. The closest relevant question I've found is https://stats.stackexchange.com/a/140576/59460 ("Testing Whether a Binomial Distribution Fits Data").

Best Answer

Data. You have $n = 103$ 'grades' with values $v = (60, 65, 70, \dots, 90)$ occurring with respective frequencies $f = (2, 6, 18, 27, 25, 22, 3).$ (The single grade of 55 is set aside here.) Treating this as a sample, we use the sample mean $A = \bar X = \frac 1n \sum_{i=1}^7 f_iv_i,$ where $n = \sum_{i=1}^7 f_i = 103,$ so that $A = \bar X = 77.039$ estimates the mean $\mu$ of a population.

v = c(60, 65, 70, 75, 80, 85, 90)
f = c( 2,  6, 18, 27, 25, 22,  3)
a = sum(v*f)/sum(f);  a
[1] 77.03883

Similarly, the sample variance is $S^2 = \frac{1}{n-1}\sum_{i=1}^7 f_i(v_i - a)^2 = 44.822$ so that the population variance $\sigma^2$ is estimated by $44.822.$ Also, the population standard deviation is estimated by $S = \sqrt{44.822} = 6.695.$

n = sum(f); sum(f*(v-a)^2)/(n-1)
[1] 44.82201

Shapiro-Wilk test of normality. The 103 observations are not normally distributed, because they have been rounded to only seven different values. For data not so severely rounded, an excellent test of normality would be the Shapiro-Wilk test. But your 103 observations fail this test with a P-value $0.00012$ far below $0.05.$

x = rep(v, times=f)
shapiro.test(x)

        Shapiro-Wilk normality test

data:  x
W = 0.9384, p-value = 0.0001217

Chi-squared GOF test. However, one can still ask whether the unrounded values might be consistent with the normal distribution $\mathsf{Norm}(\mu =77.039, \sigma=6.695).$ Here is a histogram where the bins are chosen to put the observed values at the bin centers. The fit isn't excellent, but a chi-squared goodness-of-fit (GOF) test may not reject this normal distribution as a possible fit. [See Addendum for R code.]

[Figure: histogram of the grades with the fitted normal density curve overlaid]

A difficulty with the GOF test is that it requires an expected count larger than $5$ in each of the intervals, and this normal distribution doesn't assign enough probability to the first and last intervals. Thus, after combining, we are left with five intervals: $(-\infty, 67.5],$ $(67.5, 72.5],$ $(72.5, 77.5],$ $(77.5, 82.5],$ $(82.5, \infty).$

We need to know the probability that $\mathsf{Norm}(77.039, 6.695)$ assigns to each of the five intervals. That can be found using the normal CDF, called pnorm in R. (This is where you had trouble in your Question.)

int.end = c(-Inf, 67.5, 72.5, 77.5, 82.5, Inf)
int.prb = diff(pnorm(int.end, 77.04, 6.695))
sum(int.prb)
[1] 1

Then, according to the normal model we can find the expected counts $E_i$ in each interval by multiplying by $n = 103.$ We combine observed counts $O_i$ for the first and last intervals.

int.exp = n*int.prb
int.obs = c(8, 18, 27, 25, 25)
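Rather than typing the pooled observed counts by hand, they can be tabulated from the raw values at the same cut points (a small sketch, redefining the vectors from above so it runs on its own):

```r
# Tabulate the pooled observed counts directly from the raw values, using the
# same interval endpoints as the expected counts.
v = c(60, 65, 70, 75, 80, 85, 90)
f = c( 2,  6, 18, 27, 25, 22,  3)
x = rep(v, times = f)                         # expand frequency table to raw data
int.end = c(-Inf, 67.5, 72.5, 77.5, 82.5, Inf)
int.obs = as.vector(table(cut(x, breaks = int.end)))
int.obs                                       # 8 18 27 25 25, matching the counts above
```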

Here is a table of the observed and expected counts, which are used to find the chi-squared GOF statistic.

cbind(int.obs, int.exp)
     int.obs  int.exp
[1,]       8  7.93993
[2,]      18 17.69146
[3,]      27 28.68967
[4,]      25 27.31845
[5,]      25 21.36049

The chi-squared GOF statistic

$$Q = \sum_{i=1}^5 \frac{(O_i - E_i)^2}{E_i} = 0.922$$

has approximately a chi-squared distribution with $\nu = 5 - 3 = 2$ degrees of freedom (one constraint from the fixed total count, two from the estimated parameters $\mu$ and $\sigma$). The P-value $0.631$ is far above 5%, so we do not reject the normal model. Rejection would have required $Q$ to exceed the critical value $c = 5.99.$

q = sum((int.obs - int.exp)^2/int.exp);  q
[1] 0.9222296
1 - pchisq(q, 2)
[1] 0.6305803
qchisq(.95,2)
[1] 5.991465

If you have many of these GOF tests to do, you will want to find convenient R code for doing the whole process at once. Some of the code suggested in @Joe's Answer (+1) may be helpful. Just remember that you need to check that expected counts exceed 5, and combine bins if they don't.
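As a sketch of such automation (a hypothetical helper, not part of base R): the function below bins a vector of rounded values at the half-way cut points, pools bins in from each end until the extreme expected counts reach 5, and returns the statistic, degrees of freedom, and P-value. It assumes equally spaced values (step 5 here) and estimates $\mu$ and $\sigma$ from the data, hence the two extra degrees of freedom subtracted.

```r
# Hypothetical helper: chi-squared GOF test of rounded data against a fitted
# normal distribution, pooling extreme bins whose expected counts fall below 5.
norm.gof = function(x, step = 5) {
  x = x[!is.na(x)]
  v = sort(unique(x))
  cutp = c(-Inf, head(v, -1) + step/2, Inf)       # half-way cut points
  obs = as.vector(table(cut(x, cutp)))
  exp = length(x) * diff(pnorm(cutp, mean(x), sd(x)))
  while (length(obs) > 4 && exp[1] < 5) {         # pool at the low end
    obs = c(obs[1] + obs[2], obs[-(1:2)])
    exp = c(exp[1] + exp[2], exp[-(1:2)])
  }
  while (length(obs) > 4 && exp[length(exp)] < 5) {  # pool at the high end
    k = length(obs)
    obs = c(obs[1:(k - 2)], obs[k - 1] + obs[k])
    exp = c(exp[1:(k - 2)], exp[k - 1] + exp[k])
  }
  q = sum((obs - exp)^2 / exp)
  df = length(obs) - 3                            # minus 1, minus 2 fitted parameters
  c(statistic = q, df = df, p.value = 1 - pchisq(q, df))
}

v = c(60, 65, 70, 75, 80, 85, 90)
f = c( 2,  6, 18, 27, 25, 22,  3)
res = norm.gof(rep(v, times = f))
```

Note that this automatic pooling keeps the last bin $(87.5, \infty)$, whose expected count is just above 5, so the statistic differs somewhat from the hand computation above; the conclusion (no rejection) is the same.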

Approximate methods. Finally, here is an approximate method for using the Shapiro-Wilk test. As we saw at the beginning of this Answer, this test rejected your normal model, likely because the observations are rounded to only a few integers.

If you add random noise to these observations, roughly speaking, to 'unround' them, then they pass the Shapiro-Wilk test. In your case random noise from the distribution $\mathsf{Unif}(-2.5, 2.5)$ should suffice. Then we have approximate P-value 0.318 and the normal model is not rejected. (You will get a somewhat different answer each time you try this random approximation: subsequent runs gave P-values 0.1856 and 0.4348.)

set.seed(1234)
shapiro.test(x + runif(n, -2.5, 2.5))

        Shapiro-Wilk normality test

data:  x + runif(n, -2.5, 2.5)
W = 0.98538, p-value = 0.318

If you have many of these tests to do, you might try this approximate procedure as a screening device, and then check cases where normality is rejected with the chi-squared GOF test.
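Such a screening pass might look like the sketch below (grade.list is a hypothetical stand-in for your collection of data sets; the jitter half-width 2.5 again assumes values rounded to multiples of 5):

```r
# Screening sketch: jittered Shapiro-Wilk P-value for each data set, flagging
# the ones below 0.05 for a follow-up chi-squared GOF test.
set.seed(1234)
grade.list = list(                      # toy stand-ins for your many data sets
  g1 = rep(c(60, 65, 70, 75, 80, 85, 90), c(2, 6, 18, 27, 25, 22, 3)),
  g2 = sample(seq(30, 100, 5), 120, replace = TRUE)
)
p.screen = sapply(grade.list, function(g) {
  g = g[!is.na(g)]
  shapiro.test(g + runif(length(g), -2.5, 2.5))$p.value
})
flagged = names(p.screen)[p.screen < 0.05]   # candidates for the GOF follow-up
```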

Looking at histograms and normal probability plots can also give a quick visual check. Here is a normal probability plot of your data.

 qqnorm(x); qqline(x, col="red")

[Figure: normal probability (Q-Q) plot of the grades]

If the 'clusters' touch the red line (except possibly for the smaller first and last ones), you might take that as some evidence of a fit to normal.

Addendum per Comments: Here is R code for the figure above with histogram of data and density curve. The first few lines make sure the data is in memory.

v = c(60, 65, 70, 75, 80, 85, 90)
f = c( 2,  6, 18, 27, 25, 22,  3)
x = rep(v, times=f)
cutp = seq(60-2.5, 90+2.5, by = 5)
hist(x, prob=T, br=cutp, col="skyblue2", xlim=c(40,110), ylim=c(0,.06))
 curve(dnorm(x, 77.03, 6.695), add=T, lwd=2, col="red")