Solved – Chi-square independence test – grouping/subdividing variables

chi-squared-test

This question is about the effects of grouping (or equally, subdividing) data on the outcome of a Chi-Square test. I'm relatively new to all this so please be gentle.

Anyway: I've been through the example below from this tutorial Manufacturing Defects; and got the same answers as the tutorial; so far so good.

enter image description here

I've been trying to digest what the test result means in human terms. In real life situations it is entirely possible that (say) defects C and D are hard to distinguish, or perhaps even defects B, C and D are similarly hard (or any other grouping of columns, I just chose these arbitrarily). For example in the latter case I would re-run the test with a different table, below, where the right hand column is the sum of the original columns B, C and D. The blue background cells are the row/col/grand totals.

enter image description here

We are getting near the question at last; here, below, is a summary of the results of three sets of analysis. The top row is the case shown in the tutorial, the lower two are my invention based on grouping the same data.

enter image description here

Finally! How should I understand these results? I was initially surprised to see that the last case changed from being significant to not. After a bit of reflection I realise that the C + D group of the second test (and the B+C+D group of the third case) has more data per cell than in the original analysis. Typically I'd expect the outcome to be different if the sample size is changed.

Q1 Is "all" that is happening is that the test is bound to give different results with the data being less sub-divided?

Q2 Does this still follow in the 3rd case where the significance test result changes?

Q3 Is there any other way of looking at it?

Best Answer

a) These tests test essentially different things, so you shouldn't be surprised that they give different results. If you look at the data, defect B has a particular pattern, namely a minimum in shift 3 where the others have their maximum (by some distance). This may be the main source for the significant results in the first two tests. In the third test B is aggregated with C and D, and that special shape is no longer visible. Test 1 tests the null hypothesis that "all four defects have the same behaviour over shifts", and B clearly has not. Test 3 doesn't test anything anymore about (isolated) B, B is lumped together with C and D, and together they don't look that different over the shifts from A, so the test doesn't find anything significant anymore. Such effects happen generally when aggregating categories, although it may not make much of a difference as long as the aggregated categories are statistically similar.

b) Testing too much isn't always advisable. For example, if you're running one test at 5% level, in case the null hypothesis is true, you have a 5% error probability to find significance. If you're running three tests, the probability to find significance at least in one test in case all null hypotheses are true is obviously larger than 5%, as every single test has a 5% (type I) error probability already on its own. It can be up to 15%, but this depends on how the different tests depend on each other. I would normally only run one test, and in case of significance look at the numbers to see what is the most striking thing to cause dependence between rows and columns. (B's minimum at shift 3 is the most striking thing for me here.) Admittedly I sometimes run more tests for curiosity as well, but I take the results with a grain of salt then. Certainly this is not how big scientific discoveries are made!

c) The 5% cutoff is pretty artificial and looking at p-values can't hurt. Your results seem fairly unambiguous, but for example if the 5% significance threshold is 5.99 and you observe 5.92 or something, despite not being significant one can take this as some (albeit weak) indication that something may be going on, about as weak or strong as 6.01, say. I'd only pay strong attention to such a cutoff if a black or white decision is really needed.

Related Question