Choice of grouping in Chi-Squared test

chi-squared-test

Let's say I have a categorical variable. I try to test the null hypothesis that each category has the same count (of something) using a Pearson's chi-squared test. I may not be able to reject the null hypothesis using the original categories, but if I group the categories together in the right way, I can reject it. (For example, $\{a,b,c\}$ have a higher count than $\{d,e,f\}$.) It seems, though, that if I choose my groupings based on my sample distribution, then I'm overfitting. In simulations, by grouping counts drawn from a uniform distribution in whatever way looked best, I've been able to reject the null hypothesis more often than my significance level should allow. However, I want to be quantitative about this error/abuse I'm committing. For example, I may be willing to group $\{a,d,e\},\{b,c,f\}$, but no other partition would make sense in my context. In that case I would be more confident in the choice to group or not to group than if I had considered all possible partitions.
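The simulation I ran looks roughly like the following stdlib-only Python sketch (the category count of 6, sample size of 120, trial count, and restriction to 3-vs-3 splits are illustrative choices of mine, not part of the original setup). It cherry-picks the grouping with the largest chi-squared statistic and shows the resulting rejection rate under a truly uniform null:

```python
import itertools, math, random

def chi2_sf_1df(x):
    """Survival function of the chi-squared distribution with 1 df."""
    return math.erfc(math.sqrt(x / 2))

def best_split_pvalue(counts):
    """P-value of the best (post hoc) 3-vs-3 grouping of six categories.

    Each 3-vs-3 split yields a 1-df chi-squared test of the two group
    totals against equal expected counts; we keep the largest statistic,
    which is exactly the overfitting abuse described above.
    """
    n = sum(counts)
    best = 0.0
    for left in itertools.combinations(range(6), 3):
        o1 = sum(counts[i] for i in left)
        stat = (o1 - n / 2) ** 2 / (n / 2) + ((n - o1) - n / 2) ** 2 / (n / 2)
        best = max(best, stat)
    return chi2_sf_1df(best)

random.seed(0)
trials, rejections = 2000, 0
for _ in range(trials):
    counts = [0] * 6
    for _ in range(120):              # 120 draws, truly uniform over a..f
        counts[random.randrange(6)] += 1
    if best_split_pvalue(counts) < 0.05:
        rejections += 1

rate = rejections / trials
print(rate)                           # well above the nominal 0.05
```

Because the best of the (ten distinct) 3-vs-3 splits is selected after seeing the data, the type I error rate lands far above the nominal 5%.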

Is there some way to quantify this type of overfitting? I thought it might be hiding in the degrees of freedom, or maybe it's a type of parameter and something like AIC or BIC might be useful.

Best Answer

This procedure is basically the idea behind "CHi-squared Automatic Interaction Detection", or "CHAID", described by G. V. Kass in 1980. The general setting is very similar to yours: you want to best predict a categorical response from a combination of other categorical predictors, and you do this by finding the split with the maximal $\chi^2$ value.

A description of the algorithm and the issues around adjusting for statistical significance are given in Kass (1980). In that paper the Bonferroni correction is used to adjust for the selection of the maximal $\chi^2$ value.
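In the spirit of that correction (a plain Bonferroni penalty on the number of groupings examined, not Kass's exact CHAID multiplier), one way to quantify the "{a,d,e} vs. all partitions" distinction from the question is to multiply the smallest p-value by the number of candidate partitions you were genuinely willing to consider. A sketch, with hypothetical counts:

```python
import itertools, math

def chi2_sf_1df(x):
    """Survival function of the chi-squared distribution with 1 df."""
    return math.erfc(math.sqrt(x / 2))

def grouped_test(counts, partitions):
    """Max 1-df chi-squared over candidate two-group splits, with a
    Bonferroni correction for the number of partitions examined.

    Each candidate is given as a tuple of category indices forming one
    group; its complement forms the other. Expected counts come from a
    uniform null over the original categories.
    """
    n, k = sum(counts), len(counts)
    best = 0.0
    for left in partitions:
        o1 = sum(counts[i] for i in left)
        e1 = n * len(left) / k          # expected group total under the null
        stat = (o1 - e1) ** 2 / e1 + ((n - o1) - (n - e1)) ** 2 / (n - e1)
        best = max(best, stat)
    p_raw = chi2_sf_1df(best)
    return p_raw, min(1.0, p_raw * len(partitions))

counts = [30, 25, 28, 12, 10, 15]       # hypothetical counts for a..f
# Only {a,d,e} vs {b,c,f} makes sense in context: no multiplicity penalty.
print(grouped_test(counts, [(0, 3, 4)]))
# Every 3-vs-3 split was on the table (10 distinct): 10x penalty.
all_splits = [c for c in itertools.combinations(range(6), 3) if 0 in c]
print(grouped_test(counts, all_splits))
```

With a single pre-specified partition the adjusted and raw p-values coincide, which matches the intuition in the question that committing to one grouping in advance is far less of an abuse than searching over all of them.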

Some actual theory is available for the case of reduction to a $2\times2$ table (Kass, 1975).

There is an R package called CHAID which implements the algorithm and is available on R-Forge.

Although it is a little different from your question, a similar situation arises when dichotomizing a continuous variable to predict another dichotomous variable. Namely, where should you put the cut-point? This is discussed in Miller and Siegmund (1980) and Halpern (1982), among others.
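To make the cut-point version concrete, here is a sketch of that search (my own illustration of the setting, not the adjusted reference distribution those papers derive): scan every cut of a continuous predictor against a binary outcome and keep the 2x2-table chi-squared maximized over cuts.

```python
def max_selected_chi2(x, y):
    """Return (best_cut, max chi-squared) over all cuts of x against binary y.

    For each cut-point, the data form a 2x2 table (below/above cut by
    y=1/y=0); the 1-df statistic is n*(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)).
    Selecting the maximum over cuts inflates it relative to a fixed,
    pre-specified cut -- the same abuse as choosing a grouping post hoc.
    """
    pairs = sorted(zip(x, y))
    n, total1 = len(pairs), sum(y)
    best_cut, best_stat = None, -1.0
    left_n = left_1 = 0
    for i in range(n - 1):
        left_n += 1
        left_1 += pairs[i][1]
        if pairs[i][0] == pairs[i + 1][0]:
            continue                     # no valid cut between tied x values
        a, b = left_1, left_n - left_1   # below cut: y=1, y=0
        c = total1 - left_1              # above cut: y=1
        d = (n - left_n) - c             # above cut: y=0
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        if denom == 0:
            continue
        stat = n * (a * d - b * c) ** 2 / denom
        if stat > best_stat:
            best_cut = (pairs[i][0] + pairs[i + 1][0]) / 2
            best_stat = stat

    return best_cut, best_stat

# Toy data with a clean separation at x = 4.5:
print(max_selected_chi2([1, 2, 3, 4, 5, 6, 7, 8],
                        [0, 0, 0, 0, 1, 1, 1, 1]))  # → (4.5, 8.0)
```

The maximized statistic no longer follows a 1-df chi-squared distribution, which is exactly the problem Miller and Siegmund (1980) and Halpern (1982) address.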

Yet another setting where this type of question comes up is in change-point estimation or segmentation, though it has been too long since I looked at those papers to recall authors.

References:

Halpern, J. (1982). Maximally selected chi square statistics for small samples. Biometrics, 1017-1023.

Kass, G. V. (1975). Significance testing in automatic interaction detection (AID). Applied Statistics, 178-189.

Kass, G.V. (1980). An Exploratory Technique for Investigating Large Quantities of Categorical Data. Applied Statistics, 29(2), 119-127.

Miller, R. and Siegmund, D. (1980). Maximally Selected Chi-Squares. Technical Report 64. Stanford, CA: Division of Biostatistics, Stanford University.
