Solved – Correlation or independence on contingency table for large N

categorical-data, chi-squared-test, correlation, large-data, r

I have a dataset with about 35,000 individuals described by around 15 categorical variables.

I'm trying to study the independence / correlation between these 15 categorical variables. My first idea was, for each pair of variables, to build a contingency table and compute the $\chi^2$ statistic, then compare the statistics across pairs. However, because the sample is so large, the $\chi^2$ test is always significant, and I'm having difficulty interpreting and comparing the results for each pair of variables.

So, I can summarize my question as follows:

  1. For large datasets, when I know $\chi^2$ will almost always be significant, is there an alternative test that will give more reasonable results?

I also have two ideas of my own:

  1. I was thinking of taking many bootstrap samples of, say, 1,000 individuals each, calculating the correlation on each sample, and then averaging over all the samples. The average should be a good representation of the full dataset, but I feel like I'm somehow cheating.
  2. Can I simply compare the magnitudes of the $\chi^2$ statistics between the different pairs of variables? The degrees of freedom differ (the variables have different numbers of categories), which leads me to think this won't make sense.
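Idea 1 can be sketched in a few lines. This is a hypothetical pure-Python illustration (not a recommended implementation): it repeatedly draws subsamples, computes an association measure on each, and averages. I use Cramér's V here rather than raw $\chi^2$, since the raw statistic scales with the subsample size.

```python
import math
import random
from collections import Counter

def cramers_v_pairs(pairs):
    """Cramér's V for a list of (x, y) category pairs."""
    n = len(pairs)
    xy = Counter(pairs)
    x_tot = Counter(x for x, _ in pairs)
    y_tot = Counter(y for _, y in pairs)
    chi2 = 0.0
    for x, nx in x_tot.items():
        for y, ny in y_tot.items():
            exp = nx * ny / n  # expected count under independence
            chi2 += (xy.get((x, y), 0) - exp) ** 2 / exp
    k = min(len(x_tot), len(y_tot))  # smaller number of categories
    return math.sqrt(chi2 / (n * (k - 1)))

def subsampled_v(pairs, size=1000, reps=200, seed=0):
    """Average Cramér's V over repeated random subsamples (idea 1)."""
    rng = random.Random(seed)
    return sum(cramers_v_pairs(rng.sample(pairs, size))
               for _ in range(reps)) / reps
```

One caveat that may explain the "cheating" feeling: Cramér's V is biased upward in small samples, so the subsampled average tends to overstate weak associations relative to the full-sample value.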

Best Answer

Answering my own question (because no one gave an answer) based on another post.

Unless your observations have a cost–benefit tradeoff of some kind (e.g. paying subjects), there's not really any such thing as too many. More observations give better parameter estimates.

The $\chi^2$ test itself handles very large samples just fine (the post I'm drawing on involved 10,000,000 observations).

Your "problem" isn't a problem at all. The estimate of a parameter becomes very good when N is very large to the point that any measurable deviation from no difference becomes a statistically significant difference. That doesn't mean the difference is meaningful or practically significant. That's a judgment call you'll have to make.

One way to help is to calculate an effect size. For contingency tables, Cramér's V (ϕc) is typically used:

$$V = \sqrt{\frac{\chi^2}{N\,(k-1)}}$$

where $k$ is the smallest dimension of the table (the minimum of the number of rows and columns). Note that for a goodness-of-fit test you use the number of rows instead of the smallest dimension, and the statistic is then interpreted as the tendency toward a single outcome. Because V is normalized by both N and the table size, it is comparable across variable pairs with different degrees of freedom. In the post I'm drawing on, Cramér's V was an extraordinarily tiny number, indicating a very small effect.
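As a concrete illustration, here is the formula above computed from scratch in plain Python (a sketch, assuming the two-way table is given as a list of rows; in R you could use e.g. `vcd::assocstats` instead):

```python
import math

def cramers_v(table):
    """Cramér's V from a two-way contingency table (list of rows)."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    # Pearson chi-squared: sum of (observed - expected)^2 / expected
    chi2 = sum(
        (obs - row_tot[i] * col_tot[j] / n) ** 2 / (row_tot[i] * col_tot[j] / n)
        for i, row in enumerate(table)
        for j, obs in enumerate(row)
    )
    k = min(len(table), len(table[0]))  # smallest dimension of the table
    return math.sqrt(chi2 / (n * (k - 1)))
```

On a perfectly associated 2×2 table V is 1, and under exact independence it is 0, regardless of N, which is what makes it comparable across variable pairs.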

But in a case like that, the numbers so obviously tell a tale of a very small effect that just showing them is sufficient: you would say that the expected probabilities are nearly identical to the observed ones, and leave it at that.

In summary: the chi-squared test will show significant differences because N is large. In that case it is best to look at an effect size such as Cramér's V rather than at the p-value. Random sampling can reduce N, but it is an unsatisfactory solution.

Alternatively, calculate confidence intervals for the effect size or for the observed proportions.
