Based on this answer, Python requires expected values in a chi square test to be absolute frequencies. Consider the following in Python:
import numpy
import scipy.stats
# chisquare function requires (observed, expected)
scipy.stats.chisquare(numpy.array([0,0,23,0]), numpy.array([1,1,1794,1]))
(1751.2948717948718, 0.0)
results in a p-value of 0 (whatever that means).
The same calculation in R, which requires that the expected values be proportions:
chisq.test(c(0, 0, 23, 0), p=c(1/1797,1/1797,1794/1797, 1/1797))
Chi-squared test for given probabilities
data: c(0, 0, 23, 0)
X-squared = 0.0385, df = 3, p-value = 0.998
resulting in a p-value of 0.998.
Which is correct?
Best Answer
These two seem to be testing different things. The Python code looks like it is a two-way chi-square test (but a p value of 0 makes no sense here), while the R code is one-way. I am not sure which you want.
To do the two-way test in R use
Which gives a p value of 0.5.
However, since a lot of the expected values are less than 5, R correctly gives a warning. You can simulate using
which gives a p of 0.25.
Your code also gives a warning, but this
gives a p of 1.
This certainly makes sense.
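To illustrate why a simulated p of 1 is plausible for the one-way test, here is a minimal Monte Carlo sketch in Python. It is not R's exact algorithm; the seed, the number of simulations, and the helper name `pearson_stat` are assumptions made for this illustration. The observed counts [0, 0, 23, 0] are the single most likely outcome under the given proportions, so every simulated sample has a statistic at least as large as the observed one.

```python
import numpy as np

observed = np.array([0, 0, 23, 0])
probs = np.array([1, 1, 1794, 1]) / 1797   # proportions from the question
n = observed.sum()
expected = n * probs                        # expected counts for n = 23

def pearson_stat(counts, expected):
    """Pearson chi-square statistic, computed along the last axis."""
    return ((counts - expected) ** 2 / expected).sum(axis=-1)

rng = np.random.default_rng(0)              # arbitrary fixed seed (assumption)
n_sim = 2000                                # arbitrary number of simulations
simulated = rng.multinomial(n, probs, size=n_sim)

obs_stat = pearson_stat(observed, expected)
sim_stats = pearson_stat(simulated, expected)

# A small tolerance guards against floating-point ties with the observed value.
n_extreme = np.sum(sim_stats >= obs_stat - 1e-9)
p_sim = (n_extreme + 1) / (n_sim + 1)
print(obs_stat, p_sim)
```

The observed statistic is about 0.0385, the smallest value any sample of size 23 can produce here, which is why the simulated p-value comes out as 1.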
I don't have Python so I can't say for sure what is going on there.
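For readers who do have Python, the following sketch suggests what is likely going on, assuming the documented behaviour of `scipy.stats.chisquare`: it takes expected frequencies on the same scale as the observed ones, so the expected counts should add up to the observed total (23 here, not 1797). Rescaling them that way reproduces the R result. As far as I know, recent SciPy versions reject mismatched totals with an error rather than returning a statistic.

```python
import numpy as np
from scipy import stats

observed = np.array([0, 0, 23, 0])
reference = np.array([1, 1, 1794, 1])

# Turn the reference counts into proportions, then scale them to the
# observed sample size so that expected.sum() == observed.sum().
expected = reference / reference.sum() * observed.sum()

stat, p = stats.chisquare(observed, f_exp=expected)
print(round(stat, 4), round(p, 3))
```

With this rescaling the statistic is about 0.0385 and the p-value about 0.998, in line with the `chisq.test` output quoted in the question.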
A two-way chi-square test checks whether two categorical variables are associated with each other; a one-way test checks whether one categorical variable is distributed according to a specified set of proportions.
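As a rough sketch of the two-way case in Python, `scipy.stats.chi2_contingency` takes a whole contingency table rather than a sample plus fixed proportions. Stacking the two vectors from the question into a 2x4 table is an assumption made for illustration; how the table in the R code was laid out is not shown, so the numbers need not match those quoted above.

```python
import numpy as np
from scipy import stats

sample = np.array([0, 0, 23, 0])
reference = np.array([1, 1, 1794, 1])

# Two-way test: both rows are treated as observed counts, and the expected
# counts are derived from the row and column totals of the table itself.
table = np.vstack([sample, reference])
chi2, p, dof, expected_table = stats.chi2_contingency(table)

print(dof)                   # (rows - 1) * (columns - 1)
print(expected_table.shape)  # same shape as the input table
```

The design difference is that the one-way test treats the proportions as known in advance, while the two-way test estimates them from the margins of the table.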