[Math] Confused on the relationship between Chi-square, its CDF, and p-value.

chi squaredstatistics

I have two questions on the $\chi^2$ statistic and $\chi^2$ distributions.

I think I understand the $\chi^2$ test, in that for a given p-value, one can find a cut-off $\chi^2$ value for a given k degrees of freedom. This cutoff concludes whether an observed hypothesis is statistically significant (by the magnitude of the p-value) from a null hypothesis. In general, the cutoff $\chi^2$ value is looked up in a table for a given p-value and k. However, this just comes from the CDF for the $\chi^2$ distribution, correct? Can I just use the CDF and calculate the p-value directly, based on my observed $\chi^2$ ? Just by using Mathematica, or something similar, and using CDF[x, k]?

In textbooks, I've only seen the $\chi^2$ test version method with a table lookup, but is this due to the difficulty of calculating the CDF directly in an intro course, or is there something wrong about assigning a p-value to an observed $\chi^2$ , instead of a pass-fail cutoff and binary significance?

Secondly, assuming some of the ideas above, how does the binning of observed data affect the $\chi^2$ and its p-value? For example, if I have 10 data points, each an independent observation. Say, cars observed driving past, per day. Also, assume the observed number is large enough to approximate Gaussian distribution instead of Poisson for each point. For simplicity, my null hypothesis is a constant fit over all 10 days. Number of degrees of freedom = k = 10 data points – 1 parameter in fit = 9. Fit to the constant, minimized $\chi^2$ can be compared to a cutoff $\chi^2$ for k = 9 and some p-value (or p-value calculated directly if I wasn't wrong in the first paragraph). But with the same data, I could bin by 12 hours. So k = 20 – 1 = 19, and a different p-value is found. Or bin by 48 hours, so k = 5 – 1 = 4? All with the same data and arbitrary binning.

Is this type of data incompatible with a $\chi^2$ test? Does binning just affect the fit in that way? Or am I missing a fundamental concept?

Best Answer

First question. You are right about being able to use software instead of tables of the chi-squared distribution. For example, if df = 9 and the chi-squared statistic is 20.16, you could look at a chi-squared table to see that $20.16 > 19.02,$ where 19.02 cuts area 0.025 from the upper tail of $Chisq(df = 9)$. You you would reject at that 2.5% level.

If you wanted a P-value, you could use software to find the probability of the chi-squared statistic being greater than 20.16. In R software this is computed as follows, where pchisq stands for the CDF of a chi-squared distribution:

 1 - pchisq(20.16, 9)
 ## 0.01695026

Thus the P-value (probability of a value more extreme than 20.16) is about 0.017. Some software will give you the P-value automatically.

Second question. As far as binning is concerned, you are right that in some instances there are alternate possible ways of binning. You do not want so many bins that the expected counts in each bin get less than about 5, otherwise the approximation of the chi-squared statistic to the chi-squared distribution is not good. Given that restriction, it is usually better to use more bins rather than fewer.

Also notice that the df of the chi-squared distribution depends directly on the number of $bins$ used, not on the overall number of $events$ counted. (I do not understand what you say about 'approximately Gaussian' in this context.)

Examples: Here is an example in which we simulate 60 rolls of a fair die, so that we expect 10 instances of each face. The observed numbers of each face are tabulated. Finally, a chi-squared test that the die is fair has a chi-squared goodness-of-fit statistic of 3.0, and a P-value of 70% (consistent with a fair die).

  face = sample(1:6, 60, rep=T)  # simulate 60 rolls of fair die
  table(face)
  ## face
  ## 1  2  3  4  5  6 
  ## 9  6 12 10 10 13 
  chisq.test(table(face))

  ##  Chi-squared test for given probabilities   # default is equal probabilities

  ## data:  table(face) 
  ## X-squared = 3, df = 5, p-value = 0.7

In the test, the default is that faces have equal probabilities unless some other probability vector is specified. The test procedure chisq.test finds the P-value as follows (and rounds):

 1 - pchisq(3, 5)
 ## 0.6999858

In our second example, we simulate 600 rolls of a die that is heavily biased in favor of faces 4, 5, and 6 (see prob vector). Here the null hypothesis that the die is fair is soundly rejected with an extremely small P-value.

  face = sample(1:6, 600, repl=T, prob=c(1,1,1,2,2,2)/9 )
  table(face)
  ## face
  ##  1   2   3   4   5   6 
  ## 59  67  80 123 135 136 
  chisq.test(table(face))

  ##  Chi-squared test for given probabilities  # default is test for 'fair' die

  ## data:  table(face) 
  ## X-squared = 62.2, df = 5, p-value = 4.263e-12

Related Solutions

[Math] Simple Q – How tonterpret Significance Levels in a Chi-Square Test

Chi-squared g00dness-of-fit (GOF) tests are widely used and often misinterpreted. Here are two examples that involve testing to judge whether a die is fair.

Example 1: Suppose we roll a die 60 times, and get the following summary table of results.

face:   1  2  3  4  5  6 
freq:  12  8 11 15  6  8

If the die is fair, then we say we would 'expect' each face to occur 10 times. Of course, that would be an 'average' result. In view of random variation, it would be a very rare outcome to see a frequency of exactly 10 for each of the six faces.

The question is how much different from the 'expected' results $E_i = 10$ can the actual results $X_i$ be before we reject the null hypothesis that each face has probability $p_i = 1/6?$

The usual way to measure departure from the idealized outcome is to compute the GOF statistic

$$Q = \sum_{i=1}^6 \frac{(X_i - E_i)^2}{E_i}.$$

For the data shown above, we have $Q = 5.4.$ Notice that if all six observed frequencies were 10's, we would have $Q = 0,$ so large values of $Q$ correspond to poor fit to the null hypothesis that the die is fair.

If the null hypothesis is true, $Q \stackrel{aprx}{\sim} \mathsf{Chisq}(\nu = 5),$ the chi-squared distribution with $\nu = 6 - 1 = 5$ degrees of freedom. This is an approximation, but with all expected values $E_i > 5,$ some theory and some simulation studies show that the approximation is good enough to use in testing the null hypothesis.

If we are testing the null hypothesis at the 5% level of significance, the 'critical value' above which we reject the null hypothesis is $c = 11.0705.$ Because $Q < c$ we do not reject the null hypothesis. We say that the data are consistent with behavior of a fair die. The value $c$ cuts 5% of the area from the upper tail of $\mathsf{Chisq}(5).$

In R statistical software, the test procedure looks like this, where face is the vector of the 60 outcomes tabled above. [Unless a vector of probabilities other than $p = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6)$ is specified, the program assumes the 'given probabilities' are equally likely.]

chisq.test(table(face))

        Chi-squared test for given probabilities

data:  table(face)
X-squared = 5.4, df = 5, p-value = 0.369

The P-value is the probability a fair die would give a $Q$-value greater than our result $Q = 5.4.$ [Another way to test at the 5% level is to reject the null hypothesis if the P-value is smaller than 5%.]

The figure below shows the density curve of $\mathsf{Chisq}(5).$ The vertical dotted red line is at the critical value $c = 11.0705,$ the vertical solid black line is at the observed value $Q = 5.4,$ and the area beneath the curve to the right of the black line is the P-value.

Example 2: By placing a lead weight beneath the corner of a die where faces 4, 5, and 6 meet it would be possible to make an unfair die with probabilities $$p = (7/36, 7/36, 7/36, 5/36, 5/36, 5/36).$$ With $n = 60$ rolls of such an altered die, the expected counts would be $$E = \left(11\frac23, 11\frac23, 11\frac23, 8\frac13, 8\frac13, 8\frac13 \right).$$

Now we ask whether our data are also consistent with 60 rolls of such an unfair die. Again the 'null distribution' of $Q$ is $\mathsf{Chisq}(5)$ and the critical value is $c=11.0705.$ However, we must use the new expected values $E_i$ in the formula for the GOF statistic, so that $Q = 7.2 < c$ and the null hypothesis is (once again) not rejected.

chisq.test(table(face), p=c(7,7,7,5,5,5)/36)

        Chi-squared test for given probabilities

data:  table(face)
X-squared = 7.2, df = 5, p-value = 0.2062

So we cannot say in Example 1 that we have "proved" the die is fair. The data are also consistent with a die that is biased as described in the current example. With only $n = 60$ rolls of the die, we do not have enough information to distinguish between a fair die and a somewhat biased one.

If the die were truly biased as described and the number of rolls had been greater (perhaps 600 instead of 60), then we would very likely get data that are clearly not consistent with a fair die.

Note: The data for these examples resulted from 60 rolls of a die that I suppose is fair. (Transparent plastic and no signs of tampering.)

[Math] How to interpret a too small chi-square $\chi^2$ value

Usually chi-squared tests of goodness-of-fit are one-sided (because the squaring involved in computing the test statistic gives both negative and positive differences the effect of increasing the statistic). Thus one rejects the null hypothesis (data fit the model) if the test statistic is larger than some critical value. (P-value is small.) The test statistic is never negative.

However, these rules do not apply when vetting a pseudorandom number generator to be used in probability simulation because a fit that is "too good to be true" (test statistic near 0, P-value near $1)$ indicates the generator is giving nonrandom values as much as does a large value of the test statistic.

There are also cases, such as the famous one in @Henry's Comment, in which data fit a model "too closely" and the procedure of data collection or tabulation comes into doubt. If you ask someone to check whether a die is fair by rolling it 600 times, and the answer comes back that each of the six faces showed exactly 100 times, you would be entitled to wonder whether the 600-roll experiment was done faithfully.

"When the P-value is very small, doubt the null hypothesis; when the P-value is very near $1$, doubt the model or the data collection."

Note: Interpretation of goodness-of-fit (GOF) tests is often incorrect. In a two-sample test whether Drug A is better than a placebo, the experimenter may be wishing for a rejection of $H_0$ (drug has no effect). However, in GOF tests the experimenter is often hoping not to reject $H_0.$ This means that it is especially important to assess the power of a GOF test.

Best Answer

Related Solutions

[Math] Simple Q – How tonterpret Significance Levels in a Chi-Square Test

[Math] How to interpret a too small chi-square $\chi^2$ value

Related Question