Solved – chi-squared to test if two variables have the same frequency distribution

chi-squared-testdistributionsfrequencynormal distributionr

I want to test if x and y have the same frequency distributions using chi-squared. In my code below, I've concluded that because the P-value of the chi-squared is >0.05, then I found no evidence that x and y have different frequency distributions. Is my conclusion correct?

set.seed(1)

x <- rnorm(100, 3, 2)
y <- rnorm(100, 3, 2)

x_counts <- with(hist(x, plot = FALSE), data.frame(breaks = breaks[-1], counts = counts))
y_counts <- with(hist(y, plot = FALSE), data.frame(breaks = breaks[-1], counts = counts))

y_counts <- rbind(data.frame(breaks = -1, counts = 0), y_counts)

x_probs <- x_counts$counts/sum(x_counts$counts)

chisq.test(x=y_counts$counts, p=c(x_probs), simulate.p.value = TRUE)

#   Chi-squared test for given probabilities with simulated p-value (based on 2000 replicates)

# data:  y_counts$counts
# X-squared = 3.3808e-31, df = NA, p-value = 1

Best Answer

It is quite clear that something went wrong with your experiment. Specifically, it is likely that what went wrong was that the support of the two distributions you generated are not the same, and therefore, you are getting a pretty weird result from the chi-square test. Another issue is that the chi-square test is meant to compare a distribution to a unknown set of frequencies, when comparing two distributions the variability due to the estimation of the frequencies of $X$ are not accounted for.

In general, whenever you get degrees of freedom of NA and a p-value of 1 you should be pretty certain that something went wrong.

If your two random variables are indeed continuous, then the tests proposed in the comments (Kolmogorov-Smirnoff etc...) are best. If the distributions of interest are discrete then a generalized-likelihood ratio test might work well. Assume that we observe two random variables taking $p$ values each. Denote by $x_i$, $i\in\{1,...,p\}$ the number of observation of type $i$ observed from $x$, $y_i$ the number of observation of type $i$ from $y$ and by $n_x$ and $n_y$ the total number of observations. Then the likelihood ratio statistic is given by: $$ 2\log \frac{\prod_{i=1}^{p} \left(\frac{x_i}{n_x}\right)^{x_i} \left(\frac{y_i}{n_y}\right)^{y_i}} {\prod_{i=1}^{p} \left(\frac{y_i + x_i}{n_y + n_x}\right)^{y_i + x_i}} \sim \chi^{2}_{p-1}. $$

This test statistic essentially compares the fit of the model assuming two different distributions to the model assuming a single distribution for the two sets of observations. The proposed test is implemented in the following code:

discreteLR <- function(x, y, exact = FALSE) {
  if(length(x) != length(y)) stop("Length of x and y must be equal!")

  nx <- sum(x)
  ny <- sum(y)

  loglikX <- sum(x * log(x / nx))
  loglikY <- sum(y * log(y / ny))
  joint <- sum((y + x) * log((y + x)/ (nx + ny)))

  chisq <- 2 * (loglikX + loglikY - joint)
  pval <- pchisq(chisq, length(x) - 1, lower.tail = FALSE)

  return(c(chisqStat = chisq, pval = pval))
}

set.seed(1)
p <- runif(5)
p <- p / sum(p)
nx <- 100
ny <- 150
reps <- 10^3
x <- rmultinom(reps, nx, p)
y <- rmultinom(reps, ny, p)

pvalues <- numeric(reps)
exact <- numeric(reps)
for(i in 1:reps) {
  pvalues[i] <- discreteLR(x[, i], y[, i])[2]
}

qqplot(qunif(ppoints(reps)), pvalues,
       xlab = "uniform quantiles",
       ylab = "simulated p-values")
abline(a = 0, b = 1)

enter image description here

Related Question