Solved – Rules to apply Monte Carlo simulation of p-values for chi-squared test

chi-squared-test, monte-carlo, r

I'd like to understand the use of Monte Carlo simulation in the chisq.test() function in R.

I have a qualitative variable with 127 levels (classes). My sample size is 26 (I was not able to sample more "individuals"), so many levels necessarily have 0 individuals. In fact, only a very small number of the 127 possible classes are represented. I have heard that the chi-squared test requires at least 5 individuals in each level (I do not completely understand the reason for that), so I thought I had to set the simulate.p.value option to estimate the null distribution by Monte Carlo simulation and compute a p-value from it. Without Monte Carlo simulation, R gives me a p-value < 1e-16. With Monte Carlo simulation, it gives me a p-value of 4e-5.
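For concreteness, the two calls being compared can be sketched as follows. The counts are made up for illustration (the real data are not shown in the question); only the sample size of 26 and the 127 classes implied by the uniform probability 1/127 are taken from it. The value of B is also an arbitrary choice.

```r
# Hypothetical data: 127 possible classes, 26 observations
# concentrated in 4 classes (invented counts, not the asker's data).
set.seed(1)
k <- 127
counts <- integer(k)
counts[c(3, 40, 41, 100)] <- c(12, 8, 4, 2)   # sums to 26

# Expected count per class under the uniform null hypothesis:
# about 0.205, far below the usual guideline of 5.
expected <- sum(counts) / k

# Asymptotic p-value from the chi-squared distribution.
# R warns that the approximation may be incorrect, so the
# warning is suppressed here only to keep the output clean.
asym <- suppressWarnings(chisq.test(counts))

# Monte Carlo p-value based on B samples simulated under the null.
B <- 9999
mc <- chisq.test(counts, simulate.p.value = TRUE, B = B)

asym$p.value
mc$p.value   # can never be smaller than 1 / (B + 1)
```

Note that the simulated p-value has a floor of 1 / (B + 1), so with the default B = 2000 it cannot go below roughly 5e-4. Very small Monte Carlo p-values therefore require a larger B.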

I also tried computing the p-value for a vector of 26 ones and 101 zeros; with Monte Carlo simulation, I get a p-value of 1.
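This check is easy to reproduce. A sample in which all 26 observations fall in different classes is the flattest sample possible, so it gives the smallest chi-squared statistic attainable for n = 26 and 127 classes. Every simulated sample is then at least as extreme, which is why the p-value comes out as 1. A minimal sketch (B = 2000 is simply R's default, written out explicitly):

```r
# 26 classes observed once each, 101 classes never observed
x <- c(rep(1, 26), rep(0, 101))

# Goodness-of-fit test against equal probabilities (1/127 each),
# with the p-value estimated by Monte Carlo simulation
res <- chisq.test(x, simulate.p.value = TRUE, B = 2000)

res$statistic  # 101: sum(x^2) / (26 / 127) - 26 = 127 - 26
res$p.value    # 1
```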

Is it OK to state that, even though my sample size is small compared with the number of possible classes, the observed distribution makes it very unlikely that all possible classes occur with the same probability (1/127) in the real population?

Best Answer

From what I have found, the point of Monte Carlo simulation is to produce a reference distribution from randomly generated samples of the same size as the tested sample, drawn in accordance with the hypothesis being tested, so that p-values can be computed when the conditions for the usual test are not satisfied.
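To make the idea concrete, here is a minimal hand-rolled sketch of such a procedure for a goodness-of-fit test against equal class probabilities. It is an illustration of the principle, not R's internal implementation; the function name mc_gof_pvalue and the small numerical tolerance are my own assumptions.

```r
# Monte Carlo p-value for a chi-squared goodness-of-fit test
# against a uniform distribution over the classes (illustrative only).
mc_gof_pvalue <- function(observed, B = 2000) {
  n <- sum(observed)            # sample size
  k <- length(observed)         # number of classes
  expected <- rep(n / k, k)     # expected counts under the null

  chisq_stat <- function(o) sum((o - expected)^2 / expected)
  obs_stat <- chisq_stat(observed)

  # B samples of size n drawn under the null; one sample per column
  sims <- rmultinom(B, size = n, prob = rep(1 / k, k))
  sim_stats <- apply(sims, 2, chisq_stat)

  # Share of statistics at least as extreme as the observed one,
  # counting the observed sample itself (hence the "1 +").
  # The 1e-9 tolerance guards against floating-point ties.
  (1 + sum(sim_stats >= obs_stat - 1e-9)) / (B + 1)
}

set.seed(123)
# Flattest possible sample: every simulated sample is at least as extreme
mc_gof_pvalue(c(rep(1, 26), rep(0, 101)))
# Hypothetical concentrated sample: 26 observations in only 3 classes
mc_gof_pvalue(c(14, 8, 4, rep(0, 124)))
```

The reference distribution here is simply the collection of simulated statistics, so no assumption about minimum expected counts is needed.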

This is explained in Hope, A. C. A. (1968), "A simplified Monte Carlo significance test procedure", Journal of the Royal Statistical Society, Series B, 30, 582-598, which can be found on JSTOR.

Here is a relevant quote from the Hope paper:

Monte-Carlo significance test procedures consist of the comparison of the observed data with random samples generated in accordance with the hypothesis being tested. ... It is preferable to use a known test of good efficiency instead of a Monte-Carlo test procedure assuming that the alternative statistical hypothesis can be completely specified. However, it is not always possible to use such a test because the necessary conditions for applying the test may not be satisfied, or the underlying distribution may be unknown or it may be difficult to decide on an appropriate test criterion.