Solved – Chi Squared Results in R and Python


Based on this answer, Python requires expected values in a chi square test to be absolute frequencies. Consider the following in Python:

import numpy
import scipy.stats
# chisquare function requires (observed, expected)
scipy.stats.chisquare(numpy.array([0,0,23,0]), numpy.array([1,1,1794,1]))
(1751.2948717948718, 0.0)

results in a p-value of 0 (whatever that means).

The same calculation in R, which requires that the expected values be proportions:

chisq.test(c(0, 0, 23, 0), p=c(1/1797,1/1797,1794/1797, 1/1797))

        Chi-squared test for given probabilities

data:  c(0, 0, 23, 0)
X-squared = 0.0385, df = 3, p-value = 0.998

resulting in a p-value of 0.998.

Which is correct?

Best Answer

These two seem to be testing different things. The Python code looks like it is a two way chi square test (but a p value of 0 makes no sense here), while the R code is one way. I am not sure which you want.

To do the two way test in R use

x1 <- c(0, 0, 23, 0)
x2 <- c(1, 1, 1794, 1)
chisq.test(x1, x2)

Which gives a p value of 0.5.

However, since a lot of the expected values are less than 5, R correctly gives a warning. You can simulate using

chisq.test(x1, x2, simulate.p.value = TRUE)

which gives a p of 0.25.

Your code also gives a warning, but this

chisq.test(c(0, 0, 23, 0),
           p = c(1/1797, 1/1797, 1794/1797, 1/1797),
           simulate.p.value = TRUE)

gives a p of 1.

This certainly makes sense.

I don't have Python so I can't say for sure what is going on there.

A two-way chi-square test checks whether two categorical variables are associated with each other; a one-way test checks whether one categorical variable follows a given set of proportions.
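
For what it's worth, both figures in the question can be reproduced by hand from the Pearson statistic, the sum of (O - E)^2 / E. The sketch below is plain Python rather than a call into SciPy. The helper names are my own, and the p-value helper uses a closed form that only holds for 3 degrees of freedom. It suggests the Python number comes from using 1, 1, 1794, 1 directly as expected counts (total 1797) against 23 observations, while R first rescales the proportions to the observed total. As far as I know, recent SciPy releases refuse inputs whose observed and expected totals disagree.

```python
import math


def pearson_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-squared variable with 3 degrees of freedom.

    Closed form valid for df = 3 only (hypothetical helper, not a SciPy function).
    """
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


observed = [0, 0, 23, 0]
weights = [1, 1, 1794, 1]  # the values passed as "expected" in the question

# Case 1: weights used directly as expected counts (they sum to 1797, not 23).
raw_stat = pearson_statistic(observed, weights)

# Case 2: weights converted to proportions and scaled to the observed total.
n = sum(observed)
total = sum(weights)
scaled = [n * w / total for w in weights]
scaled_stat = pearson_statistic(observed, scaled)
scaled_p = chi2_sf_df3(scaled_stat)

print(raw_stat)               # about 1751.29, the figure from the Python call
print(scaled_stat, scaled_p)  # about 0.0385 and 0.998, the figures from R
```

So if the goal is the one-way test that R ran, the expected counts on the Python side need to add up to the same total as the observed counts.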

Related Solutions

R – Using Pearson’s Chi-Square (N-1) in R Programming

According to this page the N-1 correction is very simple; just multiply $\chi^2$ by (N-1)/N. You could then use the pchisq function in R to get the right p value. The exact code would be, I believe, something like

newchisq <- ((N - 1) / N) * oldchisq              # apply the N-1 correction
newp <- pchisq(newchisq, df, lower.tail = FALSE)  # upper-tail p value
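
The same idea can be sketched in Python. This assumes a 2x2 table (so 1 degree of freedom) and uses made-up counts; the function name is hypothetical. The p value uses the fact that, with 1 degree of freedom, the upper-tail probability is erfc(sqrt(x / 2)).

```python
import math


def n_minus_1_chi_square(table):
    """Pearson chi-squared for a 2x2 table, then scaled by (N - 1) / N."""
    (a, b), (c, d) = table
    n = a + b + c + d
    # Uncorrected Pearson statistic for a 2x2 table.
    pearson = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # The N-1 correction: multiply by (N - 1) / N.
    corrected = (n - 1) / n * pearson
    # Upper-tail probability for 1 degree of freedom.
    p_value = math.erfc(math.sqrt(corrected / 2))
    return pearson, corrected, p_value


counts = [[12, 8], [5, 15]]  # hypothetical counts, for illustration only
pearson, corrected, p_value = n_minus_1_chi_square(counts)
print(pearson, corrected, p_value)
```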

Solved – Chi-squared test with scipy: what’s the difference between chi2_contingency and chisquare

You have probably solved it by now, but I'll leave this here to help anyone who is lost, like I was. The difference is the null hypothesis.

scipy.stats.chi2_contingency, from Scipy:

"Chi-square test of independence of variables in a contingency table"

In this test you are testing whether there is a relationship between two or more variables. This is called the chi-square test of independence, also known as Pearson's chi-square test or the chi-square test of association. The null hypothesis, in your example, is "there is no effect of group in choosing the equipment to use".

scipy.stats.chisquare, from Scipy:

"The chi square test tests the null hypothesis that the categorical data has the given frequencies."

Here you are checking whether an observation differs from an expected frequency, so the null hypothesis is that "there isn't any difference between the observed and the expected". The test compares the observed sample distribution with the expected probability distribution. This is called the chi-square goodness-of-fit test.
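
To make the difference concrete, here is a small sketch of the two textbook computations in plain Python. It is not SciPy's implementation, the function names are my own, and the data are invented. The point is where the expected counts come from: in the goodness-of-fit test you supply them, while in the independence test they are derived from the table's own row and column totals.

```python
def goodness_of_fit(observed, proportions):
    """One-way test: expected counts come from proportions you supply."""
    n = sum(observed)
    expected = [n * p for p in proportions]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1  # statistic, degrees of freedom


def independence(table):
    """Two-way test: expected counts come from the row and column totals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected under independence
            stat += (o - e) ** 2 / e
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical data, for illustration only.
gof_stat, gof_dof = goodness_of_fit([18, 22, 20], [1 / 3, 1 / 3, 1 / 3])
ind_stat, ind_dof = independence([[10, 20, 30], [20, 20, 20]])
print(gof_stat, gof_dof)
print(ind_stat, ind_dof)
```

The first function answers "does this one variable follow the distribution I specified?", the second "are these two variables related?", which is the distinction between chisquare and chi2_contingency described above.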
