First question. You are right that you can use software instead of tables
of the chi-squared distribution. For example, if df = 9 and the
chi-squared statistic is 20.16, you could look at a chi-squared
table to see that $20.16 > 19.02,$ where 19.02 cuts area 0.025
from the upper tail of $\chi^2(df = 9)$.
You would then reject at
the 2.5% level.
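If you want to check that critical value yourself, the quantile function qchisq recovers it (a quick sketch, assuming you have R at hand):

```r
# upper-tail critical value cutting area 0.025 from chi-squared with df = 9
qchisq(0.975, df = 9)
## 19.02277
```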
If you wanted a P-value, you could use software
to find the probability of the chi-squared statistic being
greater than 20.16. In R software this is computed as follows,
where pchisq
stands for the CDF of a chi-squared distribution:
1 - pchisq(20.16, 9)
## 0.01695026
Thus the P-value (probability of a value more extreme than 20.16)
is about 0.017. Some software will give you the P-value automatically.
Second question. As far as binning is concerned, you are right that in some
instances there are alternative ways to bin the data. You do not
want so many bins that the expected count in each bin falls below
about 5; otherwise the chi-squared statistic is not well approximated
by the chi-squared distribution. Given that restriction,
it is usually better to use more bins rather than fewer.
Also notice
that the df of the chi-squared distribution depends directly on
the number of *bins* used, not on the overall number of *events* counted.
(I do not understand what you say about 'approximately Gaussian'
in this context.)
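To illustrate that point with a small sketch (the counts here are made up): a goodness-of-fit test on 4 bins has df = 4 - 1 = 3, regardless of how many events those bins contain.

```r
# 4 bins, 40 total events: df = 4 - 1 = 3
chisq.test(c(10, 12, 8, 10))$parameter
## df
##  3
# same 4 bins with ten times as many events: df is still 3
chisq.test(c(100, 120, 80, 100))$parameter
## df
##  3
```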
Examples: Here is an example in which we simulate 60 rolls of a fair die, so that we expect 10 instances of each face. The observed numbers
of each face are tabulated. Finally, a chi-squared test that the
die is fair gives a chi-squared goodness-of-fit statistic of 3.0
and a P-value of 70% (consistent with a fair die).
face = sample(1:6, 60, replace=TRUE)  # simulate 60 rolls of a fair die
table(face)
## face
## 1 2 3 4 5 6
## 9 6 12 10 10 13
chisq.test(table(face))
## Chi-squared test for given probabilities # default is equal probabilities
## data: table(face)
## X-squared = 3, df = 5, p-value = 0.7
In the test, the default is that faces have equal probabilities
unless some other probability vector is specified. The test procedure
chisq.test
finds the P-value as follows (and rounds):
1 - pchisq(3, 5)
## 0.6999858
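Equivalently, you can ask pchisq for the upper tail directly, which avoids subtracting from 1 (and is numerically preferable when P-values are very small):

```r
# upper-tail probability beyond 3 for chi-squared with df = 5
pchisq(3, 5, lower.tail = FALSE)
## 0.6999858
```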
In our second example, we simulate 600 rolls of a die that
is heavily biased in favor of faces 4, 5, and 6 (see prob
vector). Here
the null hypothesis that the die is fair is soundly rejected
with an extremely small P-value.
face = sample(1:6, 600, replace=TRUE, prob=c(1,1,1,2,2,2)/9)  # biased toward faces 4, 5, 6
table(face)
## face
## 1 2 3 4 5 6
## 59 67 80 123 135 136
chisq.test(table(face))
## Chi-squared test for given probabilities # default is test for 'fair' die
## data: table(face)
## X-squared = 62.2, df = 5, p-value = 4.263e-12
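As before, the reported P-value is just the upper-tail probability of the statistic (here X-squared is exactly 62.2 for these counts):

```r
# upper-tail probability beyond 62.2 for chi-squared with df = 5
1 - pchisq(62.2, 5)
```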
Best Answer
Since you're presumably performing a chi-squared goodness-of-fit test, the test statistic $$ X^2=\sum_{i=1}^n \frac{(O_i-E_i)^2}{E_i} $$ follows a $\chi^2(p)$ distribution with $p=n-1$ degrees of freedom. Since large values are critical, the corresponding P-value is given by $$ P(\chi^2(p)\geq X^2)=1-F_{\chi^2(p)}(X^2), $$ i.e. the probability of a $\chi^2(p)$-variable being larger than what we have observed. Here $F_{\chi^2(p)}$ is the distribution function (CDF) of the $\chi^2(p)$ distribution.
This is easily calculated in R with the command
1 - pchisq(x, p)
where x denotes the test statistic $X^2$ and p
is the number of degrees of freedom.