Solved – Relationship between chi-squared and the normal distribution


I am trying to understand the logic and application of the $\chi^2$ distribution. As far as I understand it, if we take a random variable $X$ that is normally distributed, then the random variable $X^2$ follows a $\chi^2$ distribution. So my intuitive understanding of the $\chi^2$ distribution is that it shows the probability of obtaining some value $X$ from a normal distribution given a number of $k$ trials. From this understanding it follows that the $\chi^2$ test should only be applicable if the random variable is normally distributed. However, my textbook and various other examples use the distribution to estimate the probability of obtaining the random variable $X$ from other distributions, the most common example being a test of whether a die is loaded. The values obtained by rolling a die, however, would follow a uniform distribution (given that the die is fair). Can someone please explain why using the $\chi^2$ test in this context is still valid, even though the data are not normally distributed?

Best Answer

Suppose we have a die that we think might not be fair. We roll it 600 times and get the following table.

Face   1   2   3   4   5   6 
Freq  44  97 102  99 105 153 

So we have observed frequencies $X: 44,\, 97,\, 102,\, 99,\, 105,\, 153$ for the respective faces. If the die is fair, we'd expect frequency $E = 100$ for each face.

If the die is fair, then the statistic $$Q = \sum_{i = 1}^6 \frac{(X_i - E)^2}{E} \stackrel{aprx}{\sim} \mathsf{Chisq}(\text{DF} = 5).$$

Very roughly, the rationale for the approximate chi-squared distribution is that we could look at the $X_i$ as being Poisson counts, each with mean $\mu = \lambda = 100$ and variance $\sigma^2 = \lambda = 100.$ Standardizing, we have $Z_i = \frac{X_i - \mu}{\sigma} \stackrel{aprx}{\sim} \mathsf{Norm}(0,1).$ If the $Z_i$ were independent, then $Q = \sum_{i=1}^6 Z_i^2$ would be approximately chi-squared with $6$ degrees of freedom.
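To see this numerically (a minimal sketch, not part of the original answer): simulate six independent Poisson(100) counts, standardize each, and square-sum. With no constraint on the total, the resulting statistic behaves approximately like $\mathsf{Chisq}(6)$, whose mean is 6.

```r
# Sketch: 6 *independent* Poisson(100) counts, standardized and square-summed.
# Without the fixed-total constraint, the sum is approximately Chisq(6).
set.seed(101)
z2 = replicate(10^5, sum(((rpois(6, 100) - 100)/10)^2))
mean(z2)    # close to 6, the mean of Chisq(6)
```

Each squared standardized count has expectation 1, so the sum of six of them has expectation 6; the simulated mean should be very near that.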

But the $Z_i$ aren't independent because the $X_i$ are constrained to add to $600$ rolls of the die. With some hand-waving we 'correct' for this by reducing the degrees of freedom for $Q$ from $6$ to $5.$ The language of the hand-waving is that we have 'lost' a degree of freedom due to a linear constraint. [Hand-waving aside, many simulation experiments have shown that, for a fair die, such values $Q$ are very nearly distributed as chi-squared with 5 degrees of freedom, provided that $E > 5.$ Because our $E = 100$ the approximation is quite good. One such simulation is shown in the Addendum.]
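The effect of the linear constraint can also be seen directly in a small simulation (a sketch, not from the original answer; it tallies the faces with tabulate rather than the rle method used below). When the six counts are forced to sum to 600, the average of the simulated $Q$ values is near 5, the mean of $\mathsf{Chisq}(5)$, rather than 6.

```r
# Sketch: counts from 600 rolls of a fair die are constrained to sum to 600.
# The mean of the resulting Q is near 5 (Chisq(5)), not 6 -- one DF is 'lost'.
set.seed(202)
Q = replicate(10^4, {
  counts = tabulate(sample(1:6, 600, replace=TRUE), nbins=6)
  sum((counts - 100)^2/100)
})
mean(Q)    # near 5
```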

For the data above, one can show that $Q = 59.84.$ However, if we actually have $Q \sim \mathsf{Chisq}(5),$ then this observed value of $Q$ seems very unlikely, because only 5% of values from $\mathsf{Chisq}(5)$ should exceed the critical value $c = 11.07.$ Put another way, the probability that a value from this distribution exceeds $59.84$ is the P-value of the chi-squared test, which is much smaller than $0.0001.$

x = c(44, 97, 102, 99, 105, 153)
q = sum((x - 100)^2/100);  q
[1] 59.84

qchisq(.95, 5)
[1] 11.0705          # critical value
1-pchisq(59.84, 5)
[1] 1.311595e-11     # P-value

The conclusion is that the data provide strong evidence that our die is unfair. [In fact, the values $X_i$ were simulated using probabilities $(\frac 1{12}, \frac 1 6, \frac 1 6, \frac 1 6,\frac 1 6,\frac 1 4),$ respectively, for the faces, instead of $\frac 1 6$ for each face, as for a truly fair die. So the chi-squared test has been able to detect that the die is unfair.]
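For comparison (not part of the original answer), R's built-in chisq.test performs the same goodness-of-fit computation; with equal null probabilities it reproduces the statistic, degrees of freedom, and P-value found above.

```r
# Same test via R's built-in goodness-of-fit chi-squared test.
# p gives the null probabilities (a fair die: 1/6 for each face).
x = c(44, 97, 102, 99, 105, 153)
chisq.test(x, p = rep(1/6, 6))    # X-squared = 59.84, df = 5
```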


Addendum: Shown below is a simulation of 100,000 values of $Q,$ each based on $600$ rolls of a fair die. Their histogram is plotted along with the density of $\mathsf{Chisq}(5)$ in order to illustrate that it is the approximate distribution of such values of $Q.$

By way of explaining the code, one experiment with $600$ rolls of a fair die is simulated and tallied using rle in the first three lines below.

set.seed(413)
rle(sort(sample(1:6, 600, rep=T)))$len
[1]  83 103 114  96 106  98

set.seed(2019);  E = 100
q = replicate(10^5,
      sum((rle(sort(sample(1:6,600,rep=T)))$len - E)^2/E))
hdr = "Simulated Values of Q with Density of CHISQ(5)"
hist(q, prob=T, br=30, col="skyblue2", main=hdr)
  curve(dchisq(x, 5), add=T, lwd=2, col="red")

[Figure: histogram of the 100,000 simulated values of Q with the Chisq(5) density curve overlaid]