Solved – Chi Squared Results in R and Python

chi-squared-test, python, r

Based on this answer, Python requires expected values in a chi square test to be absolute frequencies. Consider the following in Python:

import numpy
import scipy.stats
# chisquare function requires (observed, expected)
scipy.stats.chisquare(numpy.array([0,0,23,0]), numpy.array([1,1,1794,1]))
(1751.2948717948718, 0.0)

results in a p-value of 0 (whatever that means).

The same calculation in R, which requires that the expected values be proportions:

chisq.test(c(0, 0, 23, 0), p=c(1/1797,1/1797,1794/1797, 1/1797))

        Chi-squared test for given probabilities

data:  c(0, 0, 23, 0)
X-squared = 0.0385, df = 3, p-value = 0.998

resulting in a p-value of 0.998.

Which is correct?

Best Answer

These two seem to be testing different things. The Python code looks like it is a two-way chi-square test (although a p-value of 0 makes no sense here), while the R code is a one-way test. I am not sure which you want.

To do the two-way test in R, use

x1 <- c(0, 0, 23, 0)
x2 <- c(1, 1, 1794, 1)
# With two vectors, chisq.test() coerces both to factors and cross-tabulates them
chisq.test(x1, x2)

which gives a p-value of 0.5.

However, since a lot of the expected values are less than 5, R correctly gives a warning. You can simulate the p-value instead using

chisq.test(x1, x2, simulate.p.value = TRUE)

which gives a p-value of about 0.25.

Your code also gives a warning, but this

chisq.test(c(0, 0, 23, 0),
           p = c(1/1797, 1/1797, 1794/1797, 1/1797),
           simulate.p.value = TRUE)

gives a p of 1.

This certainly makes sense.

I don't have Python so I can't say for sure what is going on there.
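For anyone who wants to check the arithmetic by hand, here is a minimal sketch in plain Python (standard library only, no SciPy). It assumes the Python call in the question used the raw counts (1, 1, 1794, 1) as expected frequencies without rescaling them to the observed total of 23, which is consistent with the statistic of 1751.29 shown there. The helper names pearson_statistic and chi2_sf_df3 are made up for this illustration, and the tail formula is the closed form for 3 degrees of freedom only.

```python
import math

observed = [0, 0, 23, 0]
reference_counts = [1, 1, 1794, 1]  # what the question passed as "expected"


def pearson_statistic(obs, exp):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-squared variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Case 1: expected counts used as given (they sum to 1797, not 23).
raw_stat = pearson_statistic(observed, reference_counts)
raw_p = chi2_sf_df3(raw_stat)

# Case 2: expected counts rescaled so they sum to the observed total,
# which is what passing proportions via p= does in R.
n = sum(observed)
total = sum(reference_counts)
scaled_expected = [n * c / total for c in reference_counts]
scaled_stat = pearson_statistic(observed, scaled_expected)
scaled_p = chi2_sf_df3(scaled_stat)

print(raw_stat, raw_p)        # statistic near 1751.29; p underflows to 0.0
print(scaled_stat, scaled_p)  # statistic near 0.0385; p near 0.998
```

Under that assumption, the two results differ only in whether the expected counts were rescaled to the observed total. The p-value of 0 in the first case is a tail probability too small to represent in floating point.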

A two-way chi-square test checks whether two categorical variables are associated with each other; a one-way test checks whether a single categorical variable follows a specified set of proportions.
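To make the two-way case concrete, here is a rough sketch in plain Python of the Pearson statistic for a table of counts. It assumes the two vectors from the question are laid out as the two rows of a 2 x 4 contingency table, which is only one possible reading of the data. The function name independence_statistic is hypothetical, and no continuity correction is applied.

```python
def independence_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: row total * column total / grand total
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df


# Assumed layout: each vector from the question is one row of counts.
table = [[0, 0, 23, 0], [1, 1, 1794, 1]]
stat, df = independence_statistic(table)
print(stat, df)
```

Here the expected counts come from the table's own margins rather than from proportions supplied by the user, which is the practical difference between the two-way and one-way tests.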