Solved – Test if two samples follow the same distribution with Chi Squared in R

chi-squared-test, r, self-study

For an assignment I am supposed to check whether two samples follow the same distribution. The task is basically "Generate two samples with n=40 from the same distribution (1000 times). Use KS and Chi Squared to test if they both follow the same distribution. Calculate alpha and beta."
The KS test is straightforward: I simply feed both samples to ks.test and check the p-values against the significance level. I have problems with the chi-squared test, though. We can either use chisq.test or implement our own function using information from the lecture.
To my understanding, both approaches require me to split the range of possible values into intervals (bins, to use histogram terminology) and calculate the probabilities of values hitting a certain bin. I then use those probabilities for the formula from the lecture or pass them to chisq.test together with the "counts" of the other sample.

I can't seem to figure out how to automatically bin both samples using identical intervals (which at least the lecture states I have to do).
Also, I am not 100% sure I understood everything right, although the formula given in the lecture for comparing two samples with chi squared does not seem too complicated and makes sense to me in a chi squared context (compare expected frequency to actual frequency, but for two samples).

So I would like to know:

  1. Does my explanation of the concept of the test reveal any misunderstandings concerning the chi squared test?
  2. How do I go about binning two samples using the same intervals? The suggested approach seems to be using histograms, but this usually leaves me with different intervals for each sample. The following does not work either:

    h1<-hist(sample1)
    h2<-hist(sample2,breaks=h1$breaks)
    

since at some point I am confronted with values which do not fit into the intervals specified by h1.

Since the lecture is in Russian, which is not my native language, something might very well simply have gone past me. Please let me know if you have the impression I missed a crucial point.

P.S.
The code here tries to accomplish the same thing. When I run it, I immediately have the same problem as before.

  'x' and 'p' must have the same number of elements

Best Answer

Given some set of cutpoints, the two-sample case becomes a chi-squared test of homogeneity of proportions (and this in turn is functionally identical to a test of independence in a $2\times k$ table).
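
For reference, the statistic behind this test can be written out as follows. The symbols are notation introduced here for illustration, not necessarily the lecture's: $O_{ij}$ is the count of sample $i$ in bin $j$, $n_i$ is the size of sample $i$, $c_j$ is the pooled count in bin $j$, and $N = n_1 + n_2$.

$$
E_{ij} = \frac{n_i\, c_j}{N}, \qquad
X^2 = \sum_{i=1}^{2} \sum_{j=1}^{k} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
$$

If both samples come from the same distribution, $X^2$ is approximately chi-squared distributed with $(2-1)(k-1) = k-1$ degrees of freedom, provided the expected counts $E_{ij}$ are not too small.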

> How do I go about binning two samples using the same intervals?

  • choose some set of bins (if possible without reference to the data, though in practice that may be difficult to accomplish unless you know beforehand what the distribution is roughly going to be)

  • for each sample count the data in those bins

    (In R you could use the cut function to set up the bins and the table function to count, though that is far from the only choice. If you really want hist to choose your bins, I'd combine the two samples into one to identify the cut-offs; you still have to go back and count the individual samples separately. This may leave you with some small expected counts, but if you work only with the marginal (pooled) distribution you can at least combine bins without looking at how the individual counts would have split up.)
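
The hist route mentioned in the bullets above might look roughly like the sketch below. The samples `s1` and `s2` are hypothetical placeholders (drawn here from a standard normal), and `bks`, `c1`, `c2` are names chosen only for illustration. Because the breaks come from the pooled data, they cover every value in both samples, which avoids the problem of values falling outside the intervals.

```r
# Hypothetical samples; any two numeric vectors would do
set.seed(1)
s1 <- rnorm(40)
s2 <- rnorm(40)

# Let hist() choose breaks from the pooled data (no plot needed),
# so the breaks span the range of both samples
bks <- hist(c(s1, s2), plot = FALSE)$breaks

# Count each sample separately in the shared bins;
# include.lowest = TRUE matches the default behaviour of hist()
c1 <- table(cut(s1, breaks = bks, include.lowest = TRUE))
c2 <- table(cut(s2, breaks = bks, include.lowest = TRUE))

rbind(c1, c2)   # 2 x k table of counts on identical intervals
```

The default breaks will often leave sparse bins in the tails, so it is usually worth merging bins based on the pooled counts before testing.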


A worked example:

    set.seed(7687120)                # make sure we look at the same numbers
    x=rgamma(40,6,1/6)               # generate some x,y data 
    y=rgamma(30,9,1/5)               # from different distributions
    xy=c(x,y)                        # combine into one sample
    hist(xy)                         # default hist bins not really suitable 
    summary(xy)
    hist(xy,breaks=seq(15,105,15))   # some small category counts at the top end
    bks=c(15,30,45,60,105)           # -- push together everything above 60

    table(cut(xy,breaks=bks))        # marginal totals look reasonable to me 
    xc=table(cut(x,breaks=bks))      # calc. individual counts in table for x
    yc=table(cut(y,breaks=bks))      # corresponding counts for y
    rbind(xc,yc)                     # what the table looks like
    chisq.test(rbind(xc,yc))         # testing the result
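
For the simulation described in the question (1000 repetitions, n=40 per sample), a minimal sketch of estimating the type I error rate could look like the following. It assumes both samples are standard normal, which the question does not specify, and it fixes the bins in advance at the quartiles of that known distribution so that they do not depend on the data. The names `n_rep`, `level`, `reject` and `alpha_hat` are illustrative.

```r
set.seed(2024)                           # arbitrary seed for reproducibility
n_rep <- 1000                            # number of repetitions
n     <- 40                              # size of each sample
level <- 0.05                            # nominal significance level

# Assumption: both samples are N(0,1), so the bins can be fixed beforehand
# at the theoretical quartiles: -Inf, -0.674, 0, 0.674, Inf
bks <- qnorm(c(0, 0.25, 0.5, 0.75, 1))

reject <- logical(n_rep)
for (i in seq_len(n_rep)) {
  s1 <- rnorm(n)
  s2 <- rnorm(n)
  # 2 x 4 table of counts on the same intervals for both samples
  tab <- rbind(table(cut(s1, breaks = bks)),
               table(cut(s2, breaks = bks)))
  reject[i] <- chisq.test(tab)$p.value < level
}

alpha_hat <- mean(reject)                # estimated type I error rate
alpha_hat
```

Drawing `s2` from a different distribution in the same loop gives the rejection rate under an alternative, i.e. an estimate of the power, from which beta follows as one minus that value.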