I have a dataset with about 35,000 individuals described by around 15 categorical variables.
I'm trying to study the independence / correlation between these 15 categorical variables. My first idea was to build a contingency table for each pair of variables, compute the $\chi^2$ statistic, and then compare the statistics across pairs. However, because the sample is so large, the $\chi^2$ test is always significant, and I'm having difficulty interpreting and comparing the results across pairs.
So, I can summarize my question as follows:
- For large datasets, when I know $\chi^2$ will almost always be significant, is there an alternative test that will give more reasonable results?
I also have two ideas:
- I was thinking of taking many bootstrap samples of, say, 1,000 individuals each. On each sample, calculate the correlation, then average over all the bootstrap samples. The average should be a good representation of the overall sample, but I feel like I'm somehow cheating.
- Can I simply compare the magnitudes of the $\chi^2$ statistic between the different pairs of variables? The degrees of freedom are different (the variables have different numbers of categories), which leads me to think this won't make sense.
Best Answer
Answering my own question (because no one gave an answer) based on another post.
In summary: the chi-squared test will almost always be significant because N is large. In this case it is best to look at the size of the test statistic rather than the p-value. Random sampling can reduce N, but it is an unsatisfactory solution.
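One common way to make the size of the statistic comparable across pairs with different table dimensions is Cramér's V, which rescales $\chi^2$ by N and by the smaller table dimension minus one. This measure is not named in the post I am summarizing, so treat the following as a minimal sketch under that assumption; the function name `cramers_v` is my own, and it uses only the Python standard library.

```python
from collections import Counter
from math import sqrt


def cramers_v(x, y):
    """Cramér's V for two equal-length sequences of category labels.

    V = sqrt(chi2 / (n * (min(r, c) - 1))), where r and c are the numbers
    of distinct categories observed in x and y. The uncorrected Pearson
    chi-squared statistic is used (no continuity or bias correction).
    """
    n = len(x)
    joint = Counter(zip(x, y))  # observed cell counts
    row = Counter(x)            # marginal counts for x
    col = Counter(y)            # marginal counts for y
    k = min(len(row), len(col)) - 1
    if n == 0 or k == 0:
        return 0.0  # a constant variable carries no association
    chi2 = 0.0
    for a, ra in row.items():
        for b, cb in col.items():
            expected = ra * cb / n
            chi2 += (joint[(a, b)] - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))


# Toy usage: two identical variables versus two unrelated ones.
print(cramers_v(list("aabb"), list("xxyy")))  # perfectly associated
print(cramers_v(list("aabb"), list("xyxy")))  # no association
```

V lies between 0 and 1 regardless of N or the number of categories, so ranking the variable pairs by V sidesteps the degrees-of-freedom problem raised in the question.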
Alternatively, calculate the confidence intervals.
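The post does not say which quantity the intervals are for. Assuming they are meant for an effect-size measure such as Cramér's V (my assumption), a percentile bootstrap is one simple way to get them. The sketch below resamples individuals with replacement at the full sample size; `cramers_v` and `bootstrap_ci` are illustrative names, and the defaults (`n_boot`, `alpha`, `seed`) are arbitrary choices.

```python
import random
from collections import Counter
from math import sqrt


def cramers_v(x, y):
    """Cramér's V from the uncorrected Pearson chi-squared statistic."""
    n = len(x)
    joint, row, col = Counter(zip(x, y)), Counter(x), Counter(y)
    k = min(len(row), len(col)) - 1
    if n == 0 or k == 0:
        return 0.0
    chi2 = sum(
        (joint[(a, b)] - ra * cb / n) ** 2 / (ra * cb / n)
        for a, ra in row.items()
        for b, cb in col.items()
    )
    return sqrt(chi2 / (n * k))


def bootstrap_ci(x, y, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Cramér's V.

    Individuals (paired observations) are resampled with replacement,
    keeping each x[i] together with its y[i].
    """
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cramers_v([x[i] for i in idx], [y[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Toy usage with made-up data: y mostly follows x, with some mismatches.
x = ["a", "b"] * 100
y = x[:150] + ["a"] * 50
print(cramers_v(x, y), bootstrap_ci(x, y, n_boot=500))
```

With N around 35,000 the intervals will be narrow, which is the point: they show how large each association is rather than merely that it is non-zero. The percentile interval is crude when V is close to 0, because V cannot be negative.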