I am trying to understand the logic and application of the $\chi^2$ distribution. As far as I understand it, if we take a random variable $X$ that is normally distributed, then the random variable $X^2$ follows a $\chi^2$ distribution. So my intuitive understanding of the $\chi^2$ distribution is that it shows the probability of obtaining some value $X$ from a normal distribution given a number of $k$ trials.

From this understanding it follows that the $\chi^2$ test is only applicable if the random variable is normally distributed. However, my textbook and various other examples use the distribution to estimate the probability of obtaining the random variable $X$ from other distributions, the most common example being a test of whether a die is loaded. The values obtained by rolling a die, however, would follow a uniform distribution (given that the die is fair).

Can someone please explain why using the $\chi^2$ test in this context is still valid, even though the data are not normally distributed?
Solved – Relationship between chi-squared and the normal distribution
chi-squared-distribution, normal distribution
Best Answer
Suppose we have a die that we think might not be fair. We roll it 600 times and tally how often each face appears.
The observed frequencies are $X: 44,\, 97,\, 102,\, 99,\, 105,\, 153$ for faces 1 through 6, respectively. If the die is fair, we'd expect frequency $E = 100$ for each face.
If the die is fair, then the statistic $$Q = \sum_{i = 1}^6 \frac{(X_i - E)^2}{E} \stackrel{aprx}{\sim} \mathsf{Chisq}(\text{DF} = 5).$$
Very roughly, the rationale for the approximate chi-squared distribution is that we could regard each $X_i$ as a Poisson count with mean $\mu = \lambda = 100$ and variance $\sigma^2 = \lambda = 100.$ Standardizing, we have $Z_i = \frac{X_i - \mu}{\sigma} \stackrel{aprx}{\sim} \mathsf{Norm}(0,1),$ because a Poisson distribution with a large mean is approximately normal. Since $\mu = \sigma^2 = E,$ each term of $Q$ is $\frac{(X_i - E)^2}{E} = Z_i^2.$ If the $Z_i$ were independent, then $Q = \sum_{i=1}^6 Z_i^2$ would be approximately chi-squared with $6$ degrees of freedom.
But the $Z_i$ aren't independent because the $X_i$ are constrained to add to $600$ rolls of the die. With some hand-waving we 'correct' for this by reducing the degrees of freedom for $Q$ from $6$ to $5.$ The language of the hand-waving is that we have 'lost' a degree of freedom due to a linear constraint. [Hand-waving aside, many simulation experiments have shown that, for a fair die, such values $Q$ are very nearly distributed as chi-squared with 5 degrees of freedom, provided that $E > 5.$ Because our $E = 100$ the approximation is quite good. One such simulation is shown in the Addendum.]
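The claim about 5 degrees of freedom can be checked independently with a small simulation. The following is a minimal sketch in Python using only the standard library; it is not the code this answer refers to, and the function names, the seed, and the choice of 2000 replications are assumptions made for illustration. It compares the simulated mean of $Q$ with 5 (the mean of $\mathsf{Chisq}(5)$) and the fraction of values above $11.07$ with 5%.

```python
import random
from collections import Counter


def q_statistic(counts, expected):
    """Pearson statistic: sum over faces of (observed - expected)^2 / expected."""
    return sum((x - expected) ** 2 / expected for x in counts)


def simulate_q(n_reps=2000, n_rolls=600, seed=2024):
    """Return n_reps values of Q, each from n_rolls rolls of a fair six-sided die."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    expected = n_rolls / 6     # expected count per face for a fair die
    faces = range(1, 7)
    q_values = []
    for _ in range(n_reps):
        tally = Counter(rng.choices(faces, k=n_rolls))
        counts = [tally[face] for face in faces]  # a face never rolled counts as 0
        q_values.append(q_statistic(counts, expected))
    return q_values


q_values = simulate_q()
mean_q = sum(q_values) / len(q_values)
frac_above = sum(q > 11.07 for q in q_values) / len(q_values)

print(f"mean of simulated Q: {mean_q:.2f}  (Chisq(5) has mean 5)")
print(f"fraction of Q above 11.07: {frac_above:.3f}  (Chisq(5) gives 0.05)")
```

With more replications the agreement should tighten; if the six $Z_i$ were truly independent, the mean would sit near 6 rather than 5.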
For the data above, one can show that $Q = 59.84.$ If we actually had $Q \sim \mathsf{Chisq}(5),$ this observed value of $Q$ would be very unlikely, because only 5% of values from $\mathsf{Chisq}(5)$ exceed the critical value $c = 11.07.$ Put another way, the probability that a value from this distribution exceeds $59.84$ is the P-value of the chi-squared test, which is much smaller than $0.0001.$
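The arithmetic behind these numbers is easy to reproduce. The sketch below is in plain Python; the helper `chisq5_sf` is a name introduced here for illustration. It relies on the fact that, for exactly 5 degrees of freedom, the upper-tail probability has the closed form $\operatorname{erfc}\!\big(\sqrt{x/2}\big) + \sqrt{2x/\pi}\, e^{-x/2}\,(1 + x/3)$, so no statistics library is needed. In practice one would more likely call a library routine for the chi-squared tail.

```python
import math

observed = [44, 97, 102, 99, 105, 153]
expected = sum(observed) / len(observed)  # 600 / 6 = 100 per face under a fair die

# Pearson statistic for the observed table
q = sum((x - expected) ** 2 / expected for x in observed)


def chisq5_sf(x):
    """P(Chisq(5) > x); this closed form is specific to 5 degrees of freedom."""
    half = x / 2
    tail = math.erfc(math.sqrt(half))
    correction = math.sqrt(2 * x / math.pi) * math.exp(-half) * (1 + x / 3)
    return tail + correction


print(f"Q = {q:.2f}")
print(f"P(Chisq(5) > 11.07) = {chisq5_sf(11.07):.3f}")
print(f"P-value for observed Q = {chisq5_sf(q):.1e}")
```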
The conclusion is that the data provide strong evidence that our die is unfair. [In fact, the values $X_i$ were simulated using probabilities $(\frac 1{12}, \frac 1 6, \frac 1 6, \frac 1 6,\frac 1 6,\frac 1 4),$ respectively, for the faces, instead of $\frac 1 6$ for each face, as for a truly fair die. So the chi-squared test has been able to detect that the die is unfair.]
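To see how reliably the test picks up this particular kind of unfairness, one can repeat the experiment many times with the loaded probabilities and count how often $Q$ exceeds $11.07.$ The following is a rough Python sketch under assumed settings (1000 replications, an arbitrary seed, and a hypothetical helper `rejection_rate`); it is meant as an illustration, not a formal power analysis.

```python
import random
from collections import Counter

LOADED = [1 / 12, 1 / 6, 1 / 6, 1 / 6, 1 / 6, 1 / 4]  # face probabilities quoted above
FAIR = [1 / 6] * 6
CRITICAL = 11.07  # 5% critical value of Chisq(5), as quoted in the answer


def rejection_rate(probs, n_reps=1000, n_rolls=600, seed=7):
    """Fraction of simulated experiments whose Q exceeds the critical value."""
    rng = random.Random(seed)  # fixed seed only for repeatability
    expected = n_rolls / 6     # the null hypothesis is always 'the die is fair'
    faces = range(1, 7)
    rejections = 0
    for _ in range(n_reps):
        tally = Counter(rng.choices(faces, weights=probs, k=n_rolls))
        q = sum((tally[face] - expected) ** 2 / expected for face in faces)
        rejections += q > CRITICAL
    return rejections / n_reps


loaded_rate = rejection_rate(LOADED)
fair_rate = rejection_rate(FAIR)

print(f"rejection rate, loaded die: {loaded_rate:.3f}")
print(f"rejection rate, fair die:   {fair_rate:.3f}  (should be near 0.05)")
```

For a fair die the rejection rate should stay close to the nominal 5%, while for a die this heavily loaded nearly every experiment of 600 rolls should reject.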
Addendum: Shown below is a simulation of 100,000 values of $Q,$ each based on $600$ rolls of a fair die. Their histogram is plotted along with the density of $\mathsf{Chisq}(5)$ in order to illustrate that this is the approximate distribution of such values of $Q.$
By way of explaining the code, one experiment with $600$ rolls of a fair die is simulated and tallied using `rle` in the first three lines below.