First, let's see if there are differences in the proportion working across
the four groups A, B, C, D. (Data similar to yours.)
w = c(90, 32, 9, 3)   # 'working' counts for groups A, B, C, D
nw = c(46, 7, 8, 5)   # 'not working' counts
TBL = rbind(w, nw)
chisq.test(TBL)
Pearson's Chi-squared test
data: TBL
X-squared = 8.7062, df = 3, p-value = 0.03346
Warning message:
In chisq.test(TBL) :
Chi-squared approximation may be incorrect
The low cell counts in groups C and D trigger a warning message, putting
the validity of the P-value in doubt. The version of `chisq.test` implemented in R allows a more accurate P-value to be simulated, and the simulated P-value still shows a significant effect at the 5% level.
chisq.test(TBL, sim=T)$p.val
[1] 0.03098451
Significance barely below the 5% level does not invite extensive ad hoc
follow-up tests; to guard against false discovery, such tests should have to show significance at lower levels.
Furthermore, it is not clear just which confidence intervals would be of interest. A look at the Pearson residuals, to see whether any groups
are strikingly different, possibly suggests comparing groups A and B. However, the level of significance there is unimpressive, especially if we
protect against false discovery.
chisq.test(TBL)$resi
[,1] [,2] [,3] [,4]
w -0.1173306 1.148334 -0.7081676 -1.019365
nw 0.1671828 -1.636247 1.0090588 1.452480
chisq.test(TBL[,c(1,2)], cor=F)
Pearson's Chi-squared test
data: TBL[, c(1, 2)]
X-squared = 3.6176, df = 1, p-value = 0.05717
You have already said you know how to use `prop.test` to get a 95%
confidence interval for the difference of proportions in A and B.
I don't see a point in looking at other pairs of groups, especially in view of the low
counts there. Maybe you would like to compare group A with the other three groups combined; `prop.test` can handle that too.
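For concreteness, here is a minimal sketch of those two `prop.test` calls, using the counts from `TBL` above:

```r
w  <- c(90, 32, 9, 3)   # working, groups A-D
nw <- c(46, 7, 8, 5)    # not working
n  <- w + nw            # group totals

# 95% CI for the difference of proportions, A vs B
prop.test(w[1:2], n[1:2])$conf.int

# Group A vs groups B, C, D combined
prop.test(c(w[1], sum(w[2:4])), c(n[1], sum(n[2:4])))$conf.int
```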
If you have additional kinds of analyses in mind using confidence intervals, please be more specific, and maybe one of us can help.
Best Answer
The very limited information you have is certainly a severe constraint! However, things aren't entirely hopeless.
Under the same assumptions that lead to the asymptotic $\chi^2$ distribution for the test statistic of the goodness-of-fit test of the same name, the test statistic under the alternative hypothesis asymptotically has a noncentral $\chi^2$ distribution. If we assume the two stimuli (a) are significant and (b) have the same effect, the associated test statistics will have the same asymptotic noncentral $\chi^2$ distribution. We can use this to construct a test: estimate the noncentrality parameter $\lambda$ and see whether the test statistics are far in the tails of the noncentral $\chi^2(18, \hat{\lambda})$ distribution. (That's not to say this test will have much power, though.)
We can estimate the noncentrality parameter given the two test statistics either by taking their average and subtracting the degrees of freedom (a method of moments estimator), giving an estimate of 44, or by maximum likelihood:
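The code for these estimates isn't shown in this excerpt, so here is one possible sketch. The two observed test statistics aren't given either; the values 40 and 84 below are hypothetical placeholders, chosen only so that the method-of-moments estimate comes out at the reported 44:

```r
x  <- c(40, 84)   # hypothetical placeholders for the two observed statistics
df <- 18

# Method of moments: E[chi^2(df, lambda)] = df + lambda,
# so estimate lambda as the average statistic minus df
lam.mom <- mean(x) - df   # = 44 with these placeholders

# Maximum likelihood: minimize the negative joint log-likelihood in lambda
nll <- function(lam) -sum(dchisq(x, df, ncp = lam, log = TRUE))
lam.mle <- optimize(nll, interval = c(0, 200))$minimum
```

With real data the two estimates should, as noted above, come out close to each other.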
There is good agreement between our two estimates, which is not surprising given only two data points and 18 degrees of freedom. Now to calculate a p-value:
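The p-value computation is also missing from this excerpt. One plausible reconstruction, assuming "far in the tails" means adding the lower-tail probability of the smaller statistic to the upper-tail probability of the larger one (again using hypothetical placeholder values for the two statistics, so your actual number will differ):

```r
df  <- 18
x   <- c(40, 84)        # hypothetical placeholders for the two statistics
lam <- mean(x) - df     # method-of-moments estimate of the noncentrality

# Lower tail for the smaller statistic plus upper tail for the larger one
p <- pchisq(min(x), df, ncp = lam) +
     pchisq(max(x), df, ncp = lam, lower.tail = FALSE)
p
```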
So our p-value is 0.12, not sufficient to reject the null hypothesis that the two stimuli are the same.
Does this test actually have (roughly) a 5% rejection rate when the noncentrality parameters are the same? Does it have any power? We'll attempt to answer these questions by constructing a power curve as follows. First, we fix the average $\lambda$ at the estimated value of 43.68. The alternative distributions for the two test statistics will be noncentral $\chi^2$ with 18 degrees of freedom and noncentrality parameters $(\lambda-\delta, \lambda+\delta)$ for $\delta = 0, 1, 2, \dots, 15$. We'll simulate 10000 draws from these two distributions for each $\delta$ and see how often our test rejects at, say, the 90% and 95% levels of confidence.
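A sketch of that simulation, assuming the test re-estimates $\lambda$ by method of moments from each simulated pair and computes a two-tail p-value (lower tail of the smaller statistic plus upper tail of the larger); $\delta = 0$ is included to check behaviour under the null:

```r
set.seed(17)
df <- 18; lam.bar <- 43.68; n.sim <- 10000
deltas <- 0:15

rates <- sapply(deltas, function(d) {
  x1 <- rchisq(n.sim, df, ncp = lam.bar - d)
  x2 <- rchisq(n.sim, df, ncp = lam.bar + d)
  lam.hat <- pmax((x1 + x2) / 2 - df, 0)   # per-pair MoM estimate
  lo <- pmin(x1, x2); hi <- pmax(x1, x2)
  p <- pchisq(lo, df, ncp = lam.hat) +
       pchisq(hi, df, ncp = lam.hat, lower.tail = FALSE)
  c(alpha.10 = mean(p < 0.10), alpha.05 = mean(p < 0.05))
})

matplot(deltas, t(rates), type = "l", lty = 1:2, col = 1,
        xlab = expression(delta), ylab = "Rejection rate")
legend("topleft", c("alpha = 0.10", "alpha = 0.05"), lty = 1:2)
```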
which gives the following power curve:
Looking at the true null hypothesis points (x-axis value = 0), we see that the test is conservative, in that it doesn't appear to reject as often as the level would indicate, but not overwhelmingly so. As we expected, it doesn't have much power, but it's better than nothing. I wonder if there are better tests out there, given the very limited amount of information you have available.