Solved – chi-squared to test if two variables have the same frequency distribution

chi-squared-test, distributions, frequency, normal distribution, r

I want to test whether x and y have the same frequency distribution using a chi-squared test. In my code below, the p-value of the chi-squared test is > 0.05, so I concluded that there is no evidence that x and y have different frequency distributions. Is my conclusion correct?

set.seed(1)

x <- rnorm(100, 3, 2)
y <- rnorm(100, 3, 2)

x_counts <- with(hist(x, plot = FALSE), data.frame(breaks = breaks[-1], counts = counts))
y_counts <- with(hist(y, plot = FALSE), data.frame(breaks = breaks[-1], counts = counts))

y_counts <- rbind(data.frame(breaks = -1, counts = 0), y_counts)

x_probs <- x_counts$counts/sum(x_counts$counts)

chisq.test(x=y_counts$counts, p=c(x_probs), simulate.p.value = TRUE)

#   Chi-squared test for given probabilities with simulated p-value (based on 2000 replicates)

# data:  y_counts$counts
# X-squared = 3.3808e-31, df = NA, p-value = 1

Best Answer

Something clearly went wrong with your experiment. Most likely, the supports of the two binned distributions you generated are not the same, which is why the chi-square test gives such a strange result. Another issue is that the chi-square goodness-of-fit test is meant to compare a sample to a known set of frequencies; when you compare two samples this way, the variability due to estimating the frequencies of $X$ is not accounted for.

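In general, whenever you get a p-value of exactly 1 together with a test statistic that is essentially zero, you should be pretty certain that something went wrong. (Degrees of freedom of NA are expected on their own here: chisq.test does not report them when simulate.p.value = TRUE.)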
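As a minimal sketch of the support problem, assuming the same simulated x and y as in the question: hist() picks its breaks for each sample separately, so the two sets of bins need not line up. The two-sample Kolmogorov-Smirnov test (ks.test in base R) works on the raw values and avoids binning altogether. The names bx, by and ks are only illustrative.

```r
set.seed(1)
x <- rnorm(100, 3, 2)
y <- rnorm(100, 3, 2)

# hist() chooses breaks per sample, so the bins may cover different ranges
bx <- hist(x, plot = FALSE)$breaks
by <- hist(y, plot = FALSE)$breaks
print(range(bx))
print(range(by))
print(identical(bx, by))   # FALSE would mean the two sets of bins do not line up

# Two-sample Kolmogorov-Smirnov test on the raw, unbinned values
ks <- ks.test(x, y)
print(ks)
```

If the bins do not line up, the counts from one sample cannot be used directly as the probabilities for the other, which is why the question's code had to pad y_counts with an extra row.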

If your two random variables are indeed continuous, then the tests proposed in the comments (Kolmogorov-Smirnov, etc.) are best. If the distributions of interest are discrete, then a generalized likelihood ratio test might work well. Assume that we observe two random variables, each taking one of $p$ possible values. Denote by $x_i$, $i\in\{1,\dots,p\}$, the number of observations of type $i$ from $x$, by $y_i$ the number of observations of type $i$ from $y$, and by $n_x$ and $n_y$ the total numbers of observations. Then the likelihood ratio statistic, which is approximately $\chi^2_{p-1}$-distributed when both samples come from the same distribution, is given by:

$$ 2\log \frac{\prod_{i=1}^{p} \left(\frac{x_i}{n_x}\right)^{x_i} \left(\frac{y_i}{n_y}\right)^{y_i}} {\prod_{i=1}^{p} \left(\frac{y_i + x_i}{n_y + n_x}\right)^{y_i + x_i}} \sim \chi^{2}_{p-1}. $$

This test statistic essentially compares the fit of the model assuming two different distributions to the model assuming a single distribution for the two sets of observations. The proposed test is implemented in the following code:

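discreteLR <- function(x, y) {
  # x and y are vectors of category counts over the same categories
  if (length(x) != length(y)) stop("Length of x and y must be equal!")

  nx <- sum(x)
  ny <- sum(y)

  # k * log(k / n), with the convention 0 * log(0) = 0 so empty cells do not give NaN
  xlogx <- function(k, n) ifelse(k > 0, k * log(k / n), 0)

  loglikX <- sum(xlogx(x, nx))
  loglikY <- sum(xlogx(y, ny))
  joint <- sum(xlogx(x + y, nx + ny))

  # Likelihood ratio statistic, referred to a chi-squared with p - 1 degrees of freedom
  chisq <- 2 * (loglikX + loglikY - joint)
  pval <- pchisq(chisq, length(x) - 1, lower.tail = FALSE)

  return(c(chisqStat = chisq, pval = pval))
}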

# Simulate both samples from the same multinomial distribution (the null is true)
set.seed(1)
p <- runif(5)
p <- p / sum(p)
nx <- 100
ny <- 150
reps <- 10^3
x <- rmultinom(reps, nx, p)   # one column of counts per replicate
y <- rmultinom(reps, ny, p)

pvalues <- numeric(reps)
for (i in seq_len(reps)) {
  pvalues[i] <- discreteLR(x[, i], y[, i])["pval"]
}

# Under the null, the p-values should be roughly uniform on (0, 1)
qqplot(qunif(ppoints(reps)), pvalues,
       xlab = "uniform quantiles",
       ylab = "simulated p-values")
abline(a = 0, b = 1)
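As a cross-check that is not part of the original answer, the likelihood-ratio statistic can be compared with Pearson's chi-squared statistic, which chisq.test computes when given the 2 x k table of counts. Both are referred to a chi-squared distribution with k - 1 degrees of freedom. The cell probabilities and the names xlogx, gStat, gPval, tab and pearson below are assumptions made for illustration.

```r
set.seed(1)
p <- c(0.1, 0.2, 0.3, 0.4)            # illustrative cell probabilities (assumed)
x <- as.vector(rmultinom(1, 100, p))  # counts for the first sample
y <- as.vector(rmultinom(1, 150, p))  # counts for the second sample

# Likelihood-ratio (G) statistic; 0 * log(0) is treated as 0
xlogx <- function(k, n) ifelse(k > 0, k * log(k / n), 0)
gStat <- 2 * (sum(xlogx(x, sum(x))) + sum(xlogx(y, sum(y))) -
              sum(xlogx(x + y, sum(x) + sum(y))))
gPval <- pchisq(gStat, df = length(x) - 1, lower.tail = FALSE)

# Pearson chi-squared test of homogeneity on the same 2 x k table
tab <- rbind(x, y)
pearson <- chisq.test(tab)

print(c(G = gStat, pval = gPval))
print(pearson)
```

With reasonably large counts in every cell the two statistics are usually close; they tend to diverge when some cells are sparse.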

Related Solutions

Probability – Understanding the Chi-Squared Test and Distribution

We could as well use a binomial distribution but it is not the point of the question…

Nevertheless, it is our starting point even for your actual question. I'll cover it somewhat informally.

Let's consider the binomial case more generally:

$Y\sim \text{Bin}(n,p)$

Assume $n$ and $p$ are such that $Y$ is well approximated by a normal with the same mean and variance (some typical requirements are that $\min(np,n(1-p))$ is not small, or that $np(1-p)$ is not small).

Then $(Y-E(Y))^2/\text{Var}(Y)$ will be approximately $\sim\chi^2_1$. Here $Y$ is the number of successes.

We have $E(Y) = np$ and $\text{Var}(Y)=np(1-p)$.

(In the testing case, $n$ is known and $p$ is specified under $H_0$. We don't do any estimation.)

So if $H_0$ is true, $(Y-np)^2/[np(1-p)]$ will be approximately $\sim\chi^2_1$.

Note that $(Y-np)^2 = [(n-Y)-n(1-p)]^2$. Also note that $\frac{1}{p} + \frac{1}{1-p} = \frac{1}{p(1-p)}$.

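Hence

$$\begin{aligned}
\frac{(Y-np)^2}{np(1-p)} &= \frac{(Y-np)^2}{np}+\frac{(Y-np)^2}{n(1-p)}\\
&= \frac{(Y-np)^2}{np}+\frac{[(n-Y)-n(1-p)]^2}{n(1-p)}\\
&= \frac{(O_S-E_S)^2}{E_S}+\frac{(O_F-E_F)^2}{E_F},
\end{aligned}$$

where $O_S = Y$ and $O_F = n-Y$ are the observed numbers of successes and failures, and $E_S = np$ and $E_F = n(1-p)$ are their expected values under $H_0$. This is just the chi-square statistic for the binomial case.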

So in that case the chi-square statistic should have the distribution of the square of an (approximately) standard-normal random variable.
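A small simulation can illustrate both the algebraic identity and the approximate $\chi^2_1$ distribution. This is only a sketch: the values of n, p and reps, and the variable names, are assumptions chosen for illustration.

```r
set.seed(42)
n <- 200; p <- 0.3; reps <- 1e5       # illustrative values (assumed)
Y <- rbinom(reps, n, p)               # number of successes in each replicate

# Normal-approximation form: (Y - np)^2 / (np(1-p))
z2 <- (Y - n * p)^2 / (n * p * (1 - p))

# Chi-square form: sum over successes and failures of (O - E)^2 / E
obsS <- Y;      expS <- n * p
obsF <- n - Y;  expF <- n * (1 - p)
x2 <- (obsS - expS)^2 / expS + (obsF - expF)^2 / expF

print(all.equal(z2, x2))              # the two forms agree up to rounding
print(mean(z2))                       # a chi-squared(1) variable has mean 1
print(mean(z2 > qchisq(0.95, 1)))     # compare with the nominal 0.05
```

The rejection rate will not match 0.05 exactly, because $Y$ is discrete and the normal approximation is only approximate.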

Solved – Test if two samples follow the same distribution with Chi Squared in R

Given some set of cutpoints, the two-sample case becomes a chi-squared test of homogeneity of proportions (and this in turn is functionally identical to a test of independence in a $2\times k$ table).

How do I go about binning two samples using the same intervals?

1. Choose some set of bins (if possible without reference to the data, though in practice that may be difficult unless you know beforehand roughly what the distribution is going to be).
2. For each sample, count the data in those bins.

(In R you could use the cut function to set up the bins and the table function to do the counting, but that is far from the only choice. If you really wanted hist to choose your bins, I'd combine the two samples into one to identify the cut-offs; you then still have to go back and do the counts for the individual samples. This may also leave you with some small expected counts, but if you work with just the marginal distribution you can at least combine bins without looking at how the individual counts would have split up.)

A worked example:

set.seed(7687120)                # make sure we look at the same numbers
x=rgamma(40,6,1/6)               # generate some x,y data 
y=rgamma(30,9,1/5)               # from different distributions
xy=c(x,y)                        # combine into one sample
hist(xy)                         # default hist bins not really suitable 
summary(xy)
hist(xy,breaks=seq(15,105,15))   # some small category counts at the top end
bks=c(15,30,45,60,105)           # -- push together everything above 60

table(cut(xy,breaks=bks))        # marginal totals look reasonable to me 
xc=table(cut(x,breaks=bks))      # calc. individual counts in table for x
yc=table(cut(y,breaks=bks))      # corresponding counts for y
rbind(xc,yc)                     # what the table looks like
chisq.test(rbind(xc,yc))         # testing the result
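Since small expected counts were flagged as a concern above, one possible follow-up is to inspect the expected counts directly and fall back to a Monte Carlo p-value when some are small. This is a sketch under assumptions: the open-ended outer bins, the threshold of 5 and the names tab, expected and res are illustrative choices, not part of the original example.

```r
set.seed(7687120)
x <- rgamma(40, 6, 1/6)
y <- rgamma(30, 9, 1/5)

# Open-ended outer bins (an assumption here) so every value lands in a bin
bks <- c(0, 30, 45, 60, Inf)
tab <- rbind(x = table(cut(x, breaks = bks)),
             y = table(cut(y, breaks = bks)))

# Expected counts under homogeneity: row total * column total / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
print(round(expected, 1))

# If some expected counts are small, a Monte Carlo p-value is one option
if (any(expected < 5)) {
  res <- chisq.test(tab, simulate.p.value = TRUE, B = 10000)
} else {
  res <- chisq.test(tab)
}
print(res)
```

The simulated p-value conditions on the row and column totals, so it does not rely on the chi-squared approximation.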
