Solved – Confidence interval for chi-square

Tags: chi-squared-test, confidence interval, r

I am trying to find a way to compare two "goodness-of-fit chi-square" tests.

More precisely, I want to compare the results of two independent experiments. In these experiments the authors used a goodness-of-fit chi-square test to compare random guessing (expected frequencies) with the observed frequencies. The two experiments had the same number of participants and identical procedures; only the stimuli changed. Both experiments yielded a significant chi-square (exp. 1: X²(18) = 45, p < .0005; exp. 2: X²(18) = 79, p < .0001).

Now, what I want to do is test whether there is a difference between these two results. I think one solution could be to use confidence intervals, but I don't know how to calculate them from these results alone. Or maybe a test comparing effect sizes (Cohen's w)?

Anyone have a solution?

Thanks a lot!

F.D.

Best Answer

The very limited information you have is certainly a severe constraint! However, things aren't entirely hopeless.

Under the same assumptions that lead to the asymptotic $\chi^2$ distribution for the test statistic of the goodness-of-fit test of the same name, the test statistic under the alternative hypothesis has, asymptotically, a noncentral $\chi^2$ distribution. If we assume the two stimuli are a) significant, and b) have the same effect, the associated test statistics will have the same asymptotic noncentral $\chi^2$ distribution. We can use this to construct a test - basically, by estimating the noncentrality parameter $\lambda$ and seeing whether the test statistics are far in the tails of the noncentral $\chi^2(18, \hat{\lambda}) $ distribution. (That's not to say this test will have much power, though.)
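(A quick aside, just to motivate the estimator below: a noncentral $\chi^2$ with $k$ degrees of freedom and noncentrality parameter $\lambda$ has mean $k + \lambda$. A one-line simulation in R, with an arbitrary illustrative $\lambda$, bears this out:)

# mean of a noncentral chi-square is df + ncp; lambda = 40 is an arbitrary value
set.seed(1)
mean(rchisq(1e5, df = 18, ncp = 40))   # comes out close to 18 + 40 = 58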

We can estimate the noncentrality parameter given the two test statistics either by taking their average and subtracting the degrees of freedom (a method of moments estimator), giving an estimate of 44, or by maximum likelihood:

x <- c(45, 79)   # the two observed chi-square statistics
n <- 18          # degrees of freedom

# log-likelihood of a common noncentrality parameter given the two statistics
ll <- function(ncp, n, x) sum(dchisq(x, n, ncp, log=TRUE))
foo <- optimize(ll, c(30,60), n=n, x=x, maximum=TRUE)
> foo$maximum
[1] 43.67619
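For comparison, the method of moments estimate mentioned above is just the average of the two statistics minus the degrees of freedom:

> mean(x) - n
[1] 44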

There is good agreement between our two estimates, which is not all that surprising given only two data points and 18 degrees of freedom. Now to calculate a p-value:

> pchisq(x, n, foo$maximum)
[1] 0.1190264 0.8798421

So our p-value is 0.12 (the lower-tail probability of the smaller statistic under the fitted noncentral distribution: if the two statistics really did share the same noncentrality parameter, the smaller one shouldn't land too far down in the lower tail), not sufficient to reject the null hypothesis that the two stimuli are the same.
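If you want to reuse this, one way to package the steps above is a small helper function (the name and the search interval are just my own choices here, nothing standard):

compare_chisq_stats <- function(stats, df) {
  # log-likelihood of a single common noncentrality parameter
  ll <- function(ncp) sum(dchisq(stats, df, ncp, log=TRUE))
  ncp_hat <- optimize(ll, c(0, 2 * max(stats)), maximum=TRUE)$maximum
  # lower-tail probability of the smaller statistic under the fitted distribution
  list(ncp = ncp_hat, p.value = pchisq(min(stats), df, ncp_hat))
}
compare_chisq_stats(c(45, 79), 18)   # ncp around 43.7, p-value around 0.12, as above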

Does this test actually have (roughly) a 5% reject rate when the noncentrality parameters are the same? Does it have any power? We'll attempt to answer these questions by constructing a power curve as follows. First, we fix the average $\lambda$ at the estimated value of 43.68. The alternative distributions for the two test statistics will be noncentral $\chi^2$ with 18 degrees of freedom and noncentrality parameters $(\lambda-\delta, \lambda+\delta)$ for $\delta = 0, 1, \dots, 15$ (with $\delta = 0$ corresponding to the null of equal noncentrality parameters). We'll simulate 10,000 pairs of draws from these two distributions for each $\delta$ and see how often our test rejects at, say, the 90% and 95% levels of confidence.

nreject05 <- nreject10 <- rep(0, 16)   # rejection counts for alpha = 0.05 and 0.10
delta <- 0:15                          # half the difference between the two NCPs
lambda <- foo$maximum                  # common (average) noncentrality parameter
for (d in delta)
{
  for (i in 1:10000)
  {
    # draw the two test statistics with NCPs lambda + d and lambda - d
    x <- rchisq(2, n, ncp=c(lambda+d, lambda-d))
    # re-estimate a common NCP by maximum likelihood, then test as above
    lhat <- optimize(ll, c(5,95), n=n, x=x, maximum=TRUE)$maximum
    pval <- pchisq(min(x), n, lhat)
    nreject05[d+1] <- nreject05[d+1] + (pval < 0.05)
    nreject10[d+1] <- nreject10[d+1] + (pval < 0.10)
  }
}
preject05 <- nreject05 / 10000   # simulated rejection rates
preject10 <- nreject10 / 10000

plot(preject05~delta, type='l', lty=1, lwd=2,
     ylim = c(0, 0.4),
     xlab = "1/2 difference between NCPs",
     ylab = "Simulated rejection rates",
     main = "")
lines(preject10~delta, type='l', lty=2, lwd=2)
legend("topleft",legend=c(expression(paste(alpha, " = 0.05")),
                          expression(paste(alpha, " = 0.10"))),
       lty=c(1,2), lwd=2)

which gives the following:

[Plot: simulated rejection rates vs. half the difference between the NCPs, for α = 0.05 and α = 0.10]
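(If you prefer to read the numbers off directly rather than from the plot, the rejection rates under the true null are simply the first entries of the simulated vectors, i.e. the $\delta = 0$ case:)

preject05[1]   # simulated rejection rate at delta = 0, alpha = 0.05
preject10[1]   # simulated rejection rate at delta = 0, alpha = 0.10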

Looking at the true null hypothesis points (x-axis value = 0), we see that the test is conservative, in that it doesn't appear to reject as often as the level would indicate, but not overwhelmingly so. As we expected, it doesn't have much power, but it's better than nothing. I wonder if there are better tests out there, given the very limited amount of information you have available.
