Solved – Chi-Squared significance test for stopping criteria in decision tree

cartchi-squared-testcorrelation

Going through the paper of BFTree(Best First Decision Tree) from (Haijian Shi (2007). Best-first decision tree learning. Hamilton, NZ). I read for pre-pruning do a local attribute selection. And the goal is to find whether the attribute is significantly correlated with the class.(stop further splitting if no attributes are significantly correlated with the class otherwise do the splitting). For that matter chi-squared test is used as the Statistical significant test.

Can some one please help to understand how the test is performed.
I see the formula to be used to get the chi squared statistics is

$\sum((obs_{freq}-exp_{freq})^2/exp_{freq})$

but how its used here as to find any correlation between the attribute and the class.

Best Answer

I wasn't familiar with this algorithm but looked through Shi's MSc thesis now. It seems that chi-squared tests are mentioned in the discussion of Section 2.2 (Pruning in standard decision trees) but are not used in the best-first decision tree learning. Instead Section 3.4 (Pruning) uses cross-validation of different impurity measures.

The classic reference for using chi-squared tests in decision tree induction is the CHAID algorithm of Kass (1980); it is also referenced in Shi's thesis. Moreover, chi-squared tests are used for unbiased splitting variable selection in the work of Loh and co-workers on the QUEST (Loh & Shih 1997) and GUIDE (Loh 2002) algorithms (among others). However, they use the tests for variable selection and not for pre-pruning. Instead, additional cost-complexity-type post-pruning is used. The CTree (conditional inference tree) framework of Hothorn et al. (2006) uses a pre-pruning approach based on statistical significance tests which encompasses chi-squared-type tests, correlation-type tests or maximally selected statistics. So if you're interested in using chi-squared tests for decision tree induction, I would recommend that you look at these references.

Related Question