Solved – What methods exist for finding optimal splits to discretize continuous data with respect to a target variable

cart, continuous data, discrete data, optimization, regression-strategies

I'm doing some research into methods for discretizing a continuous variable, paired with a binary target variable, to find the split points that maximise the reduction in an impurity measure (Gini/entropy).

First off, I'm having some trouble coming up with google-able terms. "Optimal split" seems to refer to choosing which variable to split on in a decision tree, as opposed to how to split that variable up. Is there an established name for this problem?

I imagine that, to convert a continuous variable into a binary variable, I could try a split point at each distinct value in the data and keep the one that gives the largest reduction in entropy or Gini impurity.
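For concreteness, here is a minimal sketch of that exhaustive search, assuming a 0/1 target and Gini impurity as the criterion. The function name `best_binary_split` is only illustrative and does not come from any library. Sorting once and keeping running counts means every candidate threshold is scored in a single pass.

```python
def best_binary_split(x, y):
    """Return (threshold, weighted_gini) for the best split of the form x <= t.

    x: list of numbers, y: list of 0/1 labels (an assumption of this sketch).
    Lower weighted Gini is better; the threshold is None if x has one value.
    """
    pairs = sorted(zip(x, y))
    n = len(pairs)
    total_pos = sum(label for _, label in pairs)
    best_t, best_score = None, float("inf")
    left_n = left_pos = 0
    for i in range(n - 1):
        value, label = pairs[i]
        left_n += 1
        left_pos += label
        next_value = pairs[i + 1][0]
        if value == next_value:
            continue  # a cut cannot separate tied x values
        right_n = n - left_n
        right_pos = total_pos - left_pos
        p_left = left_pos / left_n
        p_right = right_pos / right_n
        # Gini impurity of each side, weighted by the share of rows it holds.
        score = (left_n * 2 * p_left * (1 - p_left)
                 + right_n * 2 * p_right * (1 - p_right)) / n
        if score < best_score:
            # Place the cut midway between the two neighbouring values.
            best_t, best_score = (value + next_value) / 2, score
    return best_t, best_score


print(best_binary_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]))  # (6.5, 0.0)
```

Minimising the weighted impurity of the two children is the same as maximising the impurity reduction, since the parent's impurity is fixed.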

But extending this beyond binary splits would grow the search space enough to make that method pretty expensive.
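One way to see that the multi-way case need not be searched by brute force: because bins are contiguous ranges and the total impurity is a sum over bins, a dynamic program over the sorted distinct values finds the best `k` bins in roughly O(k·m²) time for `m` distinct values. The sketch below assumes a 0/1 target and Gini impurity; `optimal_bins` and `gini_cost` are hypothetical names for illustration, not an established library routine.

```python
def gini_cost(n, pos):
    """Size-weighted Gini impurity of one bin: n * 2p(1-p), with p = pos / n."""
    if n == 0:
        return 0.0
    p = pos / n
    return n * 2.0 * p * (1.0 - p)


def optimal_bins(x, y, k):
    """Cut x into k contiguous bins (fewer if x has fewer distinct values)
    minimising total weighted Gini. Returns (cutpoints, weighted_gini)."""
    # Collapse ties so a cut can never separate equal x values.
    groups = {}
    for xi, yi in zip(x, y):
        counts = groups.setdefault(xi, [0, 0])
        counts[0] += 1
        counts[1] += yi
    values = sorted(groups)
    m = len(values)
    # Prefix sums of row counts and positives over the distinct values.
    cum_n, cum_pos = [0] * (m + 1), [0] * (m + 1)
    for i, v in enumerate(values):
        cum_n[i + 1] = cum_n[i] + groups[v][0]
        cum_pos[i + 1] = cum_pos[i] + groups[v][1]

    def cost(i, j):  # cost of one bin covering values[i:j]
        return gini_cost(cum_n[j] - cum_n[i], cum_pos[j] - cum_pos[i])

    inf = float("inf")
    k = min(k, m)
    # best[b][j]: minimal cost of splitting values[:j] into exactly b bins.
    best = [[inf] * (m + 1) for _ in range(k + 1)]
    back = [[0] * (m + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for b in range(1, k + 1):
        for j in range(b, m + 1):
            for i in range(b - 1, j):  # last bin is values[i:j]
                c = best[b - 1][i] + cost(i, j)
                if c < best[b][j]:
                    best[b][j], back[b][j] = c, i
    # Walk the back-pointers to recover the cutpoints, right to left.
    cuts, j = [], m
    for b in range(k, 1, -1):
        i = back[b][j]
        cuts.append((values[i - 1] + values[i]) / 2)
        j = i
    cuts.reverse()
    return cuts, best[k][m] / cum_n[m]


x = [1, 2, 3, 10, 11, 12, 20, 21, 22]
y = [0, 0, 0, 1, 1, 1, 0, 0, 0]
print(optimal_bins(x, y, 3))  # ([6.5, 16.0], 0.0)
```

Note that adding bins can only lower the in-sample impurity, so the number of bins has to be fixed in advance or penalised; the sketch does not address that choice.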

Are there any popular methods for solving this?

Best Answer

Your question raises so many issues that it is difficult to know where to start. First of all you need to make sure that the accuracy score you wish to optimize is a proper scoring rule, i.e., that it is not optimized by a bogus model using the wrong features. Second, it is rare in nature to have discontinuities in predictors other than time. Third, when there are no true discontinuities in predictors, any algorithm that attempts to find such cutpoints will yield answers that other analytical methods or other datasets will surely disagree with.
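The third point can be checked by simulation. The sketch below is an illustration under stated assumptions, not a result from the literature: it draws data from a smooth logistic relationship with no true discontinuity, then recomputes the "optimal" Gini cutpoint on bootstrap resamples. The function names, sample sizes, and seeds are all arbitrary choices made for this example.

```python
import math
import random


def best_cut(x, y):
    """Threshold minimising weighted Gini for a split x <= t (0/1 labels)."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    total_pos = sum(label for _, label in pairs)
    best_t, best_score = None, float("inf")
    left_n = left_pos = 0
    for i in range(n - 1):
        left_n += 1
        left_pos += pairs[i][1]
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # a cut cannot separate tied x values
        right_n, right_pos = n - left_n, total_pos - left_pos
        p_l, p_r = left_pos / left_n, right_pos / right_n
        score = (left_n * 2 * p_l * (1 - p_l)
                 + right_n * 2 * p_r * (1 - p_r)) / n
        if score < best_score:
            best_t, best_score = (pairs[i][0] + pairs[i + 1][0]) / 2, score
    return best_t


def simulate(n=300, seed=1):
    """Smooth truth: P(y = 1 | x) = 1 / (1 + exp(-x)), with no jump anywhere."""
    rng = random.Random(seed)
    x = [rng.uniform(-3, 3) for _ in range(n)]
    y = [1 if rng.random() < 1 / (1 + math.exp(-xi)) else 0 for xi in x]
    return x, y


def bootstrap_cuts(x, y, n_boot=200, seed=0):
    """Recompute the best cutpoint on n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(x)
    cuts = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        t = best_cut([x[i] for i in idx], [y[i] for i in idx])
        if t is not None:
            cuts.append(t)
    return cuts


x, y = simulate()
cuts = sorted(bootstrap_cuts(x, y))
# Spread of the "optimal" cutpoint across resamples of the same data.
print("min:", cuts[0], "median:", cuts[len(cuts) // 2], "max:", cuts[-1])
```

If the chosen cutpoint reflected a real feature of the data, it would barely move between resamples; the spread printed here gives a rough sense of how much it moves when the underlying relationship is smooth.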

The literature on the horrendous problems of dichotomizing continuous variables is now rather mature. One overview may be found at http://biostat.mc.vanderbilt.edu/CatContinuous . There are many methods for keeping continuous variables continuous, e.g., regression splines and random forests.
