I'd like to understand the use of Monte Carlo simulation in the chisq.test() function in R.
I have a qualitative variable which has 128 levels / classes. My sample size is 26 (I was not able to sample more "individuals"), so obviously some levels will have 0 "individuals". In fact, only a very small number of the 127 possible classes are represented. I have heard that the chi-squared test requires at least 5 individuals in each level (I do not completely understand the reason for that), so I thought I had to use the simulate.p.value option, which uses Monte Carlo simulation to estimate the distribution and compute a p-value. Without Monte Carlo simulation, R gives me a p-value < 1e-16. With Monte Carlo simulation, it gives me a p-value of 4e-5.
I also tried computing the p-value for a vector of 26 ones and 101 zeros; with Monte Carlo simulation, I get a p-value of 1.
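The check described above can be reproduced with a short sketch. It assumes the vector is passed directly to chisq.test(), so that R runs a goodness-of-fit test against equal class probabilities (the default for a vector). The seed and the value of B are arbitrary choices, not taken from the question.

```r
# 127 classes, 26 observations, each falling in a different class
x <- c(rep(1, 26), rep(0, 101))

# Asymptotic test: R warns because the expected counts (26/127) are far below 5
res_asym <- suppressWarnings(chisq.test(x))

# Monte Carlo test: the reference distribution is simulated under the null
set.seed(1)  # arbitrary seed, for reproducibility only
res_mc <- chisq.test(x, simulate.p.value = TRUE, B = 2000)

res_asym$statistic  # same statistic in both cases; only the p-value differs
res_mc$p.value
```

With equal expected counts E = n/k, the statistic equals sum(x^2)/E - n, which here is 127 - 26 = 101. This is the smallest value the statistic can take for n = 26, because it is reached whenever no class holds more than one observation. Every simulated sample therefore has a statistic at least as large as the observed one, which is consistent with the simulated p-value of 1.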
Is it OK to state that, even though my sample size is small compared with the number of possible classes, the observed distribution makes it very unlikely that all possible classes occur with the same probability (1/127) in the real population?
Best Answer
From what I have found, the point of Monte Carlo simulation is to produce a reference distribution from randomly generated samples of the same size as the tested sample, so that a p-value can be computed when the conditions of the test are not satisfied.
This is explained in Hope, A. C. A. (1968), "A simplified Monte Carlo significance test procedure", Journal of the Royal Statistical Society, Series B, which can be found on JSTOR.
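To make the idea of a simulated reference distribution concrete, here is a minimal sketch done by hand. The observed counts are hypothetical (they are not the asker's data), the helper chisq_stat is introduced only for illustration, and the seed and B are arbitrary. It is meant to illustrate the principle, not to reproduce the internals of chisq.test().

```r
set.seed(42)            # arbitrary seed, for reproducibility only
k  <- 127               # number of classes
n  <- 26                # sample size
p0 <- rep(1 / k, k)     # null hypothesis: all classes equally likely

# Hypothetical observed counts: four classes hold all 26 observations
obs <- c(10, 8, 5, 3, rep(0, k - 4))

# Pearson chi-squared statistic for a vector of counts (illustrative helper)
chisq_stat <- function(counts, p) {
  e <- sum(counts) * p
  sum((counts - e)^2 / e)
}

B <- 2000
# Draw B samples of size n under the null; each column is one simulated sample
sims      <- rmultinom(B, size = n, prob = p0)
sim_stats <- apply(sims, 2, chisq_stat, p = p0)

stat_obs <- chisq_stat(obs, p0)
# Proportion of simulated statistics at least as extreme as the observed one,
# counting the observed sample itself so the p-value is never exactly 0
p_mc <- (1 + sum(sim_stats >= stat_obs)) / (B + 1)
p_mc
```

Because the p-value is a proportion over B + 1 values, it cannot be smaller than 1 / (B + 1). A simulated p-value of about 4e-5 would therefore need B of roughly 25,000 or more, or a different choice of B than the default of 2000.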
Here is a relevant quote from the Hope paper: