Chi-Squared Test – Understanding Relationship with Test of Equal Proportions

chi-squared-test, contingency tables, proportion, z-test

Suppose that I have three populations and four mutually exclusive characteristics. I take a random sample from each population and construct a crosstab (frequency table) of the characteristics that I am measuring. Am I correct in saying that:

1. If I wanted to test whether there is any relationship between the populations and the characteristics (e.g. whether one population has a higher frequency of one of the characteristics), I should run a chi-squared test and see whether the result is significant.
2. If the chi-squared test is significant, it only shows me that there is some relationship between the populations and characteristics, but not how they are related.
3. Furthermore, not all of the characteristics need to be related to the population. For example, if the different populations have significantly different distributions of characteristics A and B, but not of C and D, then the chi-squared test may still come back as significant.
4. If I wanted to measure whether or not a specific characteristic is affected by the population, then I can run a test for equal proportions (I have seen this called a z-test, or as prop.test() in R) on just that characteristic.

In other words, is it appropriate to use the prop.test() to more accurately determine the nature of a relationship between two sets of categories when the chi-squared test says that there is a significant relationship?

Best Answer

Very short answer:

The chi-squared test (chisq.test() in R) compares the observed frequencies in each cell of a contingency table with the expected frequencies under independence (computed as the product of the row and column totals divided by the grand total). It is used to determine whether the deviations between the observed and the expected counts are too large to be attributed to chance. Where the table departs from independence is easily checked by inspecting residuals (try ?mosaicplot or ?assocplot, but also look at the vcd package). Use fisher.test() for an exact test (relying on the hypergeometric distribution).
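To make the comparison of observed and expected counts concrete, here is a minimal sketch on a made-up 3×4 table. The counts and the names `tab3`, `fit` and `expected_by_hand` are purely illustrative and are not data from the question. It rebuilds the expected counts from the margins and then extracts the residuals, which indicate which cells contribute most to a large statistic.

```r
# Hypothetical counts: 3 populations (rows) by 4 characteristics (columns).
tab3 <- matrix(c(30, 20, 25, 25,
                 45, 10, 22, 23,
                 28, 22, 26, 24),
               nrow = 3, byrow = TRUE,
               dimnames = list(population = c("P1", "P2", "P3"),
                               characteristic = c("A", "B", "C", "D")))

fit <- chisq.test(tab3)

# Expected counts under independence: row total * column total / grand total.
expected_by_hand <- outer(rowSums(tab3), colSums(tab3)) / sum(tab3)
all.equal(unname(expected_by_hand), unname(fit$expected))

# Pearson residuals, (observed - expected) / sqrt(expected), and
# standardized residuals; large absolute values flag the cells that
# depart most from independence.
round(fit$residuals, 2)
round(fit$stdres, 2)
```

Reading the residuals cell by cell is one way to see how the rows and columns are related once the overall test has been run.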

The prop.test() function in R tests whether proportions are equal across groups, or whether they differ from specified theoretical probabilities. It is referred to as a $z$-test because the test statistic looks like this:

$$ z=\frac{(f_1-f_2)}{\sqrt{\hat p \left(1-\hat p \right) \left(\frac{1}{n_1}+\frac{1}{n_2}\right)}} $$

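where $f_i = x_i/n_i$ is the observed proportion in row $i$ ($x_i$ successes out of $n_i$ observations), $\hat p=(x_1+x_2)/(n_1+n_2)$ is the pooled proportion, and the indices $(1,2)$ refer to the first and second rows of your table. In a $2\times 2$ contingency table where $H_0:\; p_1=p_2$, this should yield comparable results to the ordinary $\chi^2$ test: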

> tab <- matrix(c(100, 80, 20, 10), ncol = 2)
> chisq.test(tab)

    Pearson's Chi-squared test with Yates' continuity correction

data:  tab 
X-squared = 0.8823, df = 1, p-value = 0.3476

> prop.test(tab)

    2-sample test for equality of proportions with continuity correction

data:  tab 
X-squared = 0.8823, df = 1, p-value = 0.3476
alternative hypothesis: two.sided 
95 percent confidence interval:
 -0.15834617  0.04723506 
sample estimates:
   prop 1    prop 2 
0.8333333 0.8888889
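The link between the $z$ statistic and the $\chi^2$ statistic can be checked by hand. The sketch below assumes the first column of `tab` holds the "successes" (which is how prop.test() reads a two-column matrix) and switches off the continuity correction, since the plain $z$ formula has none. The names `x`, `n`, `f`, `p_hat` and `uncorrected` are introduced here for illustration.

```r
tab <- matrix(c(100, 80, 20, 10), ncol = 2)

x <- tab[, 1]               # successes in each row (first column)
n <- rowSums(tab)           # row totals n_1 and n_2
f <- x / n                  # observed proportions f_1 and f_2
p_hat <- sum(x) / sum(n)    # pooled proportion under H0

z <- (f[1] - f[2]) / sqrt(p_hat * (1 - p_hat) * sum(1 / n))

# Without the continuity correction, z^2 equals the X-squared statistic.
uncorrected <- prop.test(tab, correct = FALSE)
c(z_squared = unname(z)^2, X_squared = unname(uncorrected$statistic))
```

The two printed values should agree; the output shown earlier differs from them only because both functions apply Yates' continuity correction by default.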

For analysis of discrete data with R, I highly recommend R (and S-PLUS) Manual to Accompany Agresti’s Categorical Data Analysis (2002), by Laura Thompson.

The Chi-squared Test

In these data the total count is $45$ for $5\times 2=10$ table entries, producing a mean count of $4.5$ spread through two columns of roughly equal totals ($23$ and $22$). This is starting to get into the range where rules of thumb suggest the $\chi^2$ statistic--which is just a number measuring a discrepancy between the two ethnicity distributions--may have an approximate $\chi^2$ distribution. Let us therefore begin by computing the statistic and its associated p-value. (I am using R for these calculations.)

x <- cbind(A=c(1,3,1,3,15), B=c(2,0,0,8,12))
chisq.test(x)

The output is

X-squared = 6.9206, df = 4, p-value = 0.1401

along with a warning that "Chi-squared approximation may be incorrect." Fair enough. But since the reported p-value is not extreme--so we're not reaching far into the tails of the distribution of the statistic--we can expect this p-value to be fairly accurate. Let's see.
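R issues this warning when any expected count is below 5. A quick check, reusing the same matrix `x`, shows how many cells are affected (the name `expected` is introduced here for illustration):

```r
x <- cbind(A = c(1, 3, 1, 3, 15), B = c(2, 0, 0, 8, 12))

# suppressWarnings() only hides the approximation warning already discussed.
expected <- suppressWarnings(chisq.test(x))$expected
round(expected, 2)

# Number of cells with an expected count below 5.
sum(expected < 5)
```

The cells in the first three rows all have small expected counts, which is why the approximation is open to question here.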

Simulating the Chi-squared P-value

One way to check is to simulate the true distribution of the $\chi^2$ statistic. R offers a "Monte Carlo" test.

chisq.test(x, simulate.p.value=TRUE, B=1e5)

Using $100,000$ iterations (and repeating that several times), this test reports a p-value consistently near $0.130$: reasonably close to the original p-value of $0.1401$.

(If I am reading the R source code for chisq.test correctly, in each Monte-Carlo iteration it computes a $\chi^2$ statistic comparing the simulated data to the estimates obtained from the original data (rather than to estimates obtained from the marginals of the simulated data, as is performed in a true $\chi^2$ test). It is difficult to see how this is directly applicable to the original hypothesis. The R manual refers us to Hope, A. C. A. (1968) A simplified Monte Carlo significance test procedure. J. Roy. Statist. Soc. B 30, 582–598. I cannot find in that paper any justification for what R is doing; in particular, the paper uses independent tests of each simulated sample to assess goodness of fit for continuous distributions, whereas the R software conducts a series of dependent tests to assess independence among samples involving discrete distributions.)

Going Deeper

Another approach is to bootstrap the test. This procedure uses the data to estimate the parameters under the null hypothesis (that the two samples are from the same population), then repeatedly replicates the data-collection process by drawing new values according to that distribution. By studying the distribution of $\chi^2$ statistics that arise, we can see where the actual $\chi^2$ statistic fits--and decide whether it is sufficiently extreme to warrant rejection of the null hypothesis.

The row marginals let us estimate the relative proportions of each ethnicity under the null hypothesis: Ethnicity_1 was observed $(2+1)/45$ of the time, etc. Each bootstrap iteration draws two independent samples from this hypothesized distribution, one of size $23$ and another of size $22$, and computes the $\chi^2$ statistic for these two samples.

When you try that, you will stumble upon a very interesting phenomenon: because ethnicities 2 and 3 were observed rarely, in many simulated samples they are not observed at all. This makes it impossible to calculate a $\chi^2$ statistic based on all five ethnicities! (It would require you to divide by zero.) What to do?

1. You could just compute the $\chi^2$ statistic based on the ethnicities actually observed, even when only three or four different ones appear among the two samples. When I do this with $10,000$ iterations, I obtain a p-value of $0.086$.
2. You could compute the $\chi^2$ statistic only in those simulations where all five ethnicities were observed. This time I compute a p-value of $0.108$. (Less than $60\%$ of all simulations included all five ethnicities.)
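The bootstrap and its two variants can be sketched as follows. This is one possible reading of the procedure described above, not the author's actual code: the helper `chisq_stat`, the seed, and the use of rmultinom() to draw each sample are assumptions, so it is not guaranteed to reproduce the quoted p-values exactly.

```r
x <- cbind(A = c(1, 3, 1, 3, 15), B = c(2, 0, 0, 8, 12))

# Pearson chi-squared statistic after dropping categories with no observations.
chisq_stat <- function(m) {
  m <- m[rowSums(m) > 0, , drop = FALSE]
  e <- outer(rowSums(m), colSums(m)) / sum(m)
  sum((m - e)^2 / e)
}

set.seed(17)                 # arbitrary seed, for reproducibility only
p0 <- rowSums(x) / sum(x)    # category proportions under the null
n <- colSums(x)              # sample sizes 23 and 22
n_sim <- 10000

sims <- replicate(n_sim, {
  # Two independent samples drawn from the same hypothesized distribution.
  m <- cbind(rmultinom(1, n[1], p0), rmultinom(1, n[2], p0))
  c(stat = chisq_stat(m), all_seen = all(rowSums(m) > 0))
})

observed <- chisq_stat(x)
# Option 1: use whichever categories appeared in each simulated pair.
p_option1 <- mean(sims["stat", ] >= observed)
# Option 2: keep only simulations in which all five categories appeared.
keep <- sims["all_seen", ] == 1
p_option2 <- mean(sims["stat", keep] >= observed)
c(option1 = p_option1, option2 = p_option2, share_kept = mean(keep))
```

The `share_kept` entry reports the fraction of simulations in which every category appeared, which is the quantity the second option conditions on.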

Conclusions

We have obtained a range of p-values from $0.086$ through $0.140$, some more legitimately applicable than others. (The Fisher Exact test p-value of $0.119$, by the way, fits within this range.) If your criterion for a significant result is more stringent than $8.6\%$, there is no problem: you will not reject the null hypothesis and so you needn't worry over which tests really are applicable. But if your criterion lies within this range (such as $10\%$), then your choice of test matters.
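For reference, the Fisher exact p-value mentioned above can be obtained with fisher.test(), which accepts tables larger than 2×2. This sketch reuses the same matrix `x`; the name `p_fisher` is introduced here for illustration.

```r
x <- cbind(A = c(1, 3, 1, 3, 15), B = c(2, 0, 0, 8, 12))

# Exact test of independence, conditional on both sets of margins.
p_fisher <- fisher.test(x)$p.value
p_fisher
```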

As the preceding efforts at simulation showed so clearly, which test to use depends on your application. Do you know that only five ethnicities could have been observed? Or are you tracking only the ethnicities that happened to appear in your samples? From the gap in numbering between 2 and 4 I would guess that Ethnicity_3 might be possible but was not observed. As such, if you choose to use a $\chi^2$ statistic based only on the ethnicities observed, then you are in situation (1) and you should report a p-value of $0.086$. If you had collected the data differently--for example, by augmenting the sample sizes until at least one of each ethnicity appeared in the dataset--then an approach comparable to (2) would be more appropriate. The key is to reproduce faithfully all details of your actual sampling procedure within the simulation so that you obtain an honest representation of the distribution of your test statistic.

Planning Follow-on Studies

It may be worth remarking that even if you view this range of results as being immaterial--you would make the same decision regardless--the choice of test can nevertheless make a big difference if you plan to conduct additional experiments in the hope of demonstrating an effect. Under that assumption, by using a p-value of $0.086$ (and adopting a significance threshold of $0.05$) you would need a dataset approximately $(Z_{0.05}/Z_{0.086})^2 = 1.45$ times as great as the current one, whereas by using a p-value of $0.140$ you would want to collect $2.32$ times as much data, which will cost $60\%$ more to do.

(The "$Z_{*}$" are quantiles of a standard Normal distribution, invoked here as a rough approximation to a $\chi^2$ power and sample size analysis. The point is not to do an accurate power analysis, but only to observe that it takes relatively few additional data to lower a p-value that is near $0.05$ to below $0.05$ -- assuming the effect is real! -- compared to the amount of data needed to lower a p-value that is far from $0.05$ to below $0.05$.)
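The arithmetic behind these ratios is short enough to check directly. The sketch assumes $Z_p$ denotes the upper-tail standard Normal quantile, i.e. qnorm(1 - p), which is the convention that matches the figures quoted above; the helper `z` and the names `ratio_086` and `ratio_140` are introduced here for illustration.

```r
z <- function(p) qnorm(1 - p)   # upper-tail standard Normal quantile

ratio_086 <- (z(0.05) / z(0.086))^2   # data multiplier starting from p = 0.086
ratio_140 <- (z(0.05) / z(0.140))^2   # data multiplier starting from p = 0.140

# The two multipliers, and how much more data the second requires than the first.
round(c(ratio_086, ratio_140, ratio_140 / ratio_086), 2)
```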
