I'm looking for an optimal binning (discretization) method for a continuous variable with respect to a given binary response (target) variable, with the maximum number of intervals as a parameter.
Example: I have a set of observations of people with a "height" (continuous numeric) variable and a "has_back_pains" (binary) variable. I want to discretize height into at most 3 intervals (groups) with different proportions of people with back pain, so that the algorithm maximizes the difference between the groups (subject to given restrictions, for instance that each interval contains at least x observations).
The obvious solution to this problem would be to use a decision tree (a simple one-variable model), but I can't find any function in R that takes a "maximal number of branches" parameter – they all split the variable into 2 groups (<= x and > x).
SAS Miner has a "maximum branch" parameter, but I'm looking for a non-commercial solution.
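For what it's worth, outside R this exact parameter does exist in free software: scikit-learn's DecisionTreeClassifier caps the number of leaves via max_leaf_nodes. A minimal sketch on made-up data (the variable names mirror the example above; the data-generating process is invented for illustration):

```python
# Sketch (Python/scikit-learn, not R): a single-predictor decision tree where
# max_leaf_nodes caps the number of intervals and min_samples_leaf enforces
# a minimum interval size.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
height = rng.normal(170, 10, 1000)
# hypothetical response: back-pain probability rises with height
has_back_pains = (rng.random(1000) < (height - 140) / 60).astype(int)

tree = DecisionTreeClassifier(
    max_leaf_nodes=3,     # at most 3 intervals (leaves)
    min_samples_leaf=50,  # each interval has at least 50 observations
)
tree.fit(height.reshape(-1, 1), has_back_pains)

# internal-node thresholds are the interval boundaries (leaves store -2)
cutpoints = sorted(t for t in tree.tree_.threshold if t != -2)
print(cutpoints)  # up to two cutpoints -> at most three intervals
```

With max_leaf_nodes set, scikit-learn grows the tree best-first, so the splits it keeps are the ones with the largest impurity reduction, which is close to "maximize the difference between the groups".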
Some of my variables have only a few unique values (and could be treated as discrete variables), but I want to discretize them as well, into a smaller number of intervals.
The closest solution to my problem is implemented in the smbinning package in R (which relies on the ctree function from the party package), but it has two drawbacks: it's impossible to set the number of intervals (though you can work around this by changing the p parameter), and it doesn't work when the data vector has fewer than 10 unique values. In any case, you can see an example of its output below (the Cutpoint and Odds columns are the crucial ones):
Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec BadRate Odds LnOdds WoE IV
1 <= 272 9081 169 8912 9081 169 8912 0.1874 0.9814 0.0190 -3.9653 -0.6527 0.0596
2 <= 311 8541 246 8295 17622 415 17207 0.1762 0.9712 0.0297 -3.5181 -0.2055 0.0068
3 <= 335 2986 163 2823 20608 578 20030 0.0616 0.9454 0.0577 -2.8518 0.4608 0.0163
4 Missing 27852 1125 26727 48460 1703 46757 0.5747 0.9596 0.0421 -3.1679 0.1447 0.0129
5 Total 48460 1703 46757 NA NA NA 1.0000 0.9649 0.0364 -3.3126 0.0000 0.0956
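For reference, the derived columns in a table like this follow directly from the per-bin good/bad counts: Odds = CntGood/CntBad, WoE is the bin's log-odds minus the overall log-odds, and each bin's IV contribution is (share of goods − share of bads) × WoE. A sketch in Python/pandas, plugging in the counts from the table above:

```python
# Sketch (Python/pandas): reproducing the Odds, LnOdds, WoE, and IV columns
# from per-bin good/bad counts; the counts are copied from the table above.
import numpy as np
import pandas as pd

bins = pd.DataFrame({
    "Cutpoint": ["<= 272", "<= 311", "<= 335", "Missing"],
    "CntGood":  [169, 246, 163, 1125],
    "CntBad":   [8912, 8295, 2823, 26727],
})

total_good = bins["CntGood"].sum()
total_bad = bins["CntBad"].sum()

bins["Odds"] = bins["CntGood"] / bins["CntBad"]
bins["LnOdds"] = np.log(bins["Odds"])
# WoE: bin log-odds relative to the overall log-odds
bins["WoE"] = bins["LnOdds"] - np.log(total_good / total_bad)
# per-bin IV contribution; the sum is the variable's information value
bins["IV"] = (bins["CntGood"] / total_good
              - bins["CntBad"] / total_bad) * bins["WoE"]

print(bins.round(4))
print("total IV:", round(bins["IV"].sum(), 4))  # matches the 0.0956 above
```

This makes it easy to recompute the comparison metrics for any candidate binning, not just the one smbinning picks.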
Oh, I'm fully aware that binning results in information loss and that there are better methods, but I'm going to use it for data visualization and treat those variables as factors.
Best Answer
While reading this book (Nagarajan, 2013 [1]), I came across this valuable information, which I am shamelessly citing here:
Using prior knowledge on the data. The boundaries of the intervals are defined, for each variable, to correspond to significantly different real-world scenarios, such as the concentration of a particular pollutant (absent, dangerous, lethal) or age classes (child, adult, elderly).
Using heuristics before learning the structure of the network. Some examples are: Sturges, Freedman-Diaconis, or Scott rules (Venables and Ripley, 2002).
Choosing the number of intervals and their boundaries to balance accuracy and information loss (Kohavi and Sahami, 1996), again one variable at a time and before the network structure has been learned. A similar approach considering pairs of variables is presented in Hartemink (2001).
Performing learning and discretization iteratively until no improvement is made (Friedman and Goldszmidt, 1996).
These strategies represent different trade-offs between the accuracy of the discrete representation of the original data and the computational efficiency of the transformation.
This information is provided in case you want to justify the binning method you choose, rather than just using a package directly.
[1]: Nagarajan R. (2013), Bayesian Networks in R, with Applications in Systems Biology, Springer.