Solved – Optimal Binning with respect to a given response variable

binning, dataset, discrete data, optimization, r

I'm looking for an optimal binning method (discretization) for a continuous variable with respect to a given binary response (target) variable, with the maximum number of intervals as a parameter.

Example: I have a set of observations of people with a "height" variable (numeric, continuous) and a "has_back_pains" variable (binary). I want to discretize height into at most 3 intervals (groups) with different proportions of people with back pains, so that the algorithm maximizes the difference between the groups (subject to given restrictions, for instance that each interval contains at least x observations).

The obvious solution to this problem would be to use decision trees (a simple one-variable model), but I can't find any function in R that takes a "maximal number of branches" parameter: all of them split the variable into 2 groups (<= x and > x).
SAS Miner has a "maximum branch" parameter, but I'm looking for a non-commercial solution.
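One non-commercial workaround, sketched here under assumptions of my own (a hypothetical data frame df with columns height and has_back_pains, and a minimum bucket size of 50): grow a deep binary tree on the single variable with rpart, then prune it back to the largest subtree with at most 2 splits, since 3 leaves on one variable correspond exactly to 3 intervals.

    library(rpart)

    ## Hypothetical names: df, height, has_back_pains. minbucket enforces
    ## the "at least x observations per interval" restriction (here x = 50).
    fit <- rpart(has_back_pains ~ height, data = df, method = "class",
                 control = rpart.control(cp = 0, maxdepth = 5, minbucket = 50,
                                         maxcompete = 0, maxsurrogate = 0))

    ## Prune to the largest subtree with at most 2 splits, i.e. 3 intervals.
    cpt  <- as.data.frame(fit$cptable)
    fit3 <- prune(fit, cp = cpt$CP[max(which(cpt$nsplit <= 2))])

    ## The surviving split points are the interval boundaries.
    cuts <- sort(unique(fit3$splits[, "index"]))
    df$height_bin <- cut(df$height, breaks = c(-Inf, cuts, Inf))

Because the tree only ever picks cutpoints among observed values, the same trick also works for variables with just a few unique values.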

Some of my variables have just a few unique values (and could be treated as discrete variables), but I want to discretize them as well, into a smaller number of intervals.

The closest solution to my problem is implemented in the smbinning package in R (which relies on the ctree function from the party package), but it has two drawbacks: it's impossible to set the number of intervals directly (although you can work around this by changing the p parameter), and it doesn't work when the data vector has fewer than 10 unique values. Anyway, you can see example output below (the Cutpoint and Odds columns are crucial), followed by a sketch of a typical call:

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec BadRate   Odds  LnOdds     WoE     IV
1   <= 272   9081     169   8912      9081        169      8912 0.1874  0.9814 0.0190 -3.9653 -0.6527 0.0596
2   <= 311   8541     246   8295     17622        415     17207 0.1762  0.9712 0.0297 -3.5181 -0.2055 0.0068
3   <= 335   2986     163   2823     20608        578     20030 0.0616  0.9454 0.0577 -2.8518  0.4608 0.0163
4  Missing  27852    1125  26727     48460       1703     46757 0.5747  0.9596 0.0421 -3.1679  0.1447 0.0129
5    Total  48460    1703  46757        NA         NA        NA 1.0000  0.9649 0.0364 -3.3126  0.0000 0.0956
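For completeness, a minimal sketch of the kind of call that produces such a table; the data frame and column names (mydata, has_back_pains, height) are placeholders, and p is smbinning's minimum share of records per bin, which only indirectly controls the number of intervals:

    library(smbinning)

    ## y must be an integer 0/1 column. Raising p (default 0.05) tends to
    ## yield fewer cutpoints; lowering it allows more.
    result <- smbinning(df = mydata, y = "has_back_pains", x = "height", p = 0.10)

    result$ivtable   # the Cutpoint / Odds / WoE / IV table shown above

    ## Append the binned variable to the data as a factor.
    mydata <- smbinning.gen(mydata, result, chrname = "height_bin")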

Oh, I'm fully aware that binning results in information loss and that there are better methods, but I'm going to use it for data visualization and treat those variables as factors.

Best Answer

While reading this book (Nagarajan, 2013 [1]), I came across this valuable information, which I am shamelessly citing here:

  • Using prior knowledge on the data. The boundaries of the intervals are defined, for each variable, to correspond to significantly different real-world scenarios, such as the concentration of a particular pollutant (absent, dangerous, lethal) or age classes (child, adult, elderly).

  • Using heuristics before learning the structure of the network. Some examples are the Sturges, Freedman-Diaconis, or Scott rules (Venables and Ripley, 2002). (An R sketch of these rules follows the list.)

  • Choosing the number of intervals and their boundaries to balance accuracy and information loss (Kohavi and Sahami, 1996), again one variable at a time and before the network structure has been learned. A similar approach considering pairs of variables is presented in Hartemink (2001).

  • Performing learning and discretization iteratively until no improvement is made (Friedman and Goldszmidt, 1996).

These strategies represent different trade-offs between the accuracy of the discrete representation of the original data and the computational efficiency of the transformation.
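As a side note on the second strategy: base R already ships the three heuristic rules mentioned above (in grDevices), though they are unsupervised, i.e. they choose a number of equal-width bins without looking at the response. A quick sketch with toy data:

    x <- rnorm(1000)        # toy data

    nclass.Sturges(x)       # Sturges' rule
    nclass.FD(x)            # Freedman-Diaconis rule
    nclass.scott(x)         # Scott's rule

    ## cut() turns a chosen bin count into a factor:
    x_bin <- cut(x, breaks = nclass.FD(x))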

This information is provided in case you want to justify the binning method you wish to use, rather than just using a package directly.

[1]: Nagarajan R. (2013), Bayesian Networks in R, with Applications in Systems Biology, Springer.