Solved – How to find and evaluate an optimal discretization of a continuous variable with the $\chi^2$ criterion

chi-squared-test · discrete-data · machine-learning · r · supervised-learning

I have a data set with a continuous variable and a binary target variable (0 and 1).

I need to discretize the continuous variable (for logistic regression) with respect to the target variable, under the constraint that the number of observations in each interval should be balanced. I tried machine learning algorithms such as ChiMerge and decision trees. ChiMerge gave me intervals with very unbalanced counts (one interval with 3 observations next to another with 1000), and the decision trees were hard to interpret.

I came to the conclusion that an optimal discretization should maximise the $\chi^2$ statistic between the discretized variable and the target variable, while keeping roughly the same number of observations in each interval.

Is there an algorithm for solving this?

This is how it could look in R (def is the target variable and x the variable to be discretized). I calculated Tschuprow's $T$ to evaluate the "correlation" between the transformed variable and the target, because the $\chi^2$ statistic tends to increase with the number of intervals. I'm not certain this is the right way.

Is there another way of evaluating whether my discretization is optimal, other than Tschuprow's $T$ (which increases when the number of classes decreases)?

chitest <- function(x) {
  # Hand-picked breakpoints; the last interval runs up to max(x)
  interv <- cut(x, c(0, 1.6, 1.9, 2.3, 2.9, max(x)), include.lowest = TRUE)
  X2 <- chisq.test(df.train$def, as.numeric(interv))$statistic
  # Tschuprow's T with r = nlevels(interv) interval levels and c = 2 target classes
  Tschup <- sqrt(X2 / (nrow(df.train) * sqrt((nlevels(interv) - 1) * (2 - 1))))
  print(list(Chi2 = X2, freq = table(interv),
             def = table(df.train$def, interv), Tschuprow = Tschup))
}
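One way to satisfy the balanced-interval constraint directly is to place the breakpoints at quantiles (equal-frequency binning) and then score the result with $\chi^2$ and Tschuprow's $T$. A minimal, self-contained sketch; the variable names, the toy data, and the number of bins are illustrative, not from the question's data set:

```r
# Equal-frequency binning scored with chi-square and Tschuprow's T.
# Quantile breakpoints guarantee (near-)balanced interval counts.
chitest.ef <- function(x, target, n.bins = 5) {
  breaks <- unique(quantile(x, probs = seq(0, 1, length.out = n.bins + 1)))
  interv <- cut(x, breaks, include.lowest = TRUE)
  X2 <- suppressWarnings(chisq.test(target, interv)$statistic)
  # Tschuprow's T with r = nlevels(interv) interval levels, c = 2 target classes
  Tschup <- sqrt(X2 / (length(x) * sqrt((nlevels(interv) - 1) * (2 - 1))))
  list(Chi2 = unname(X2), freq = table(interv), Tschuprow = unname(Tschup))
}

# Toy data: continuous x, binary target whose probability depends on x
set.seed(1)
x <- rexp(1000)
def <- rbinom(1000, 1, plogis(x - 1))
res <- chitest.ef(x, def)
res$freq  # each of the 5 intervals holds 200 observations
```

With continuous data and distinct quantiles, every interval gets the same count by construction, so only the placement of the cut points (here, the quantile grid) is left to optimise.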

Best Answer

There are many possible ways to discretise a continuous variable: see the survey in [Garcia 2013].

On page 739 of that survey I could see at least five methods based on chi-square. The optimality of the discretization actually depends on the task you want to use the discretised variable for; in your case, logistic regression. And as discussed in [Garcia 2013], finding the optimal discretization for a given task is NP-complete.

There are lots of heuristics, though. The same paper discusses at least 50 of them. Given my machine learning background (I guess people in statistics prefer other things), I am often biased towards Fayyad and Irani's Minimum Description Length (MDL) method. It is available in the R package discretization.
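As a sketch of what that looks like: `discretization::mdlp()` takes a data frame whose last column is the class label and discretizes the other numeric columns with the Fayyad-Irani criterion. The data and variable names below are made up for the example:

```r
library(discretization)  # install.packages("discretization") if needed

# Toy data: two well-separated class-conditional distributions,
# so a single cut point should suffice
set.seed(42)
d <- data.frame(x   = c(rnorm(200, mean = 0), rnorm(200, mean = 3)),
                def = rep(0:1, each = 200))

out <- mdlp(d)
out$cutp             # cut point(s) chosen by the MDL criterion for x
head(out$Disc.data)  # x replaced by its interval index, def unchanged
```

Note that MDL chooses the number of intervals itself, so the resulting bins are generally not equal-frequency; you would still need to check the interval counts against your balance constraint.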

As you said, chi-square is biased towards a high number of intervals, and so are many other statistics (such as the information gain used in the MDL method). However, MDL tries to find a good trade-off between the information gain of the discretized variable with respect to the class and the complexity (number of intervals) of the discretised variable. Give it a try.
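To see the bias concretely, here is a small simulation (my own illustration, not from the paper): even when the target is independent of the variable, the raw $\chi^2$ statistic grows roughly with the number of intervals, because its expectation under independence equals the degrees of freedom $(k-1)(2-1)$.

```r
set.seed(7)
x   <- runif(5000)
def <- rbinom(5000, 1, 0.5)  # target independent of x by construction

ks <- c(2, 5, 20, 50)
X2 <- sapply(ks, function(k) {
  suppressWarnings(chisq.test(def, cut(x, k))$statistic)
})
# Under independence E[X2] = k - 1, so X2 inflates as k grows
# even though there is nothing to find
data.frame(bins = ks, X2 = round(X2, 1))
```

This is exactly why a penalised criterion (Tschuprow's $T$, or MDL's complexity term) is needed before comparing discretizations with different numbers of intervals.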