Solved – How to discretise continuous attributes while implementing the ID3 algorithm

cart, data mining, discrete data, threshold

I am trying to implement the ID3 algorithm on a data set. However, all attributes are continuous and can take values between 1 and 10. I found that we have to specify the bin intervals for discretization, but I couldn't understand exactly how to do this.

Can someone explain how to do this? The data set I am using is the Breast Cancer data from Wisconsin hospitals.

Best Answer

ID3 is an algorithm for building a decision tree classifier based on maximizing information gain at each level of splitting across all available attributes. It's a precursor to the C4.5 algorithm.
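For reference, these are the standard definitions of entropy and information gain. The symbols (S for the set of instances at a node, A for the attribute being split on, p_c and S_v as described afterwards) are notation chosen here for illustration, not taken from the original answer.

```latex
H(S) = -\sum_{c} p_c \log_2 p_c,
\qquad
\mathrm{Gain}(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|}\, H(S_v)
```

Here p_c is the fraction of instances in S that belong to class c, and S_v is the subset of S that ends up in branch v of the split on A. A pure (single-class) node has H = 0, so a split with high gain is one whose branches are close to pure.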

With this data, the task is to classify each instance as either benign or malignant. Since each attribute takes integer values in the range 1-10, the values aren't strictly continuous: they can't take decimal values. For each attribute, you'll need to evaluate each integer value as a candidate split point and determine which split produces the most homogeneous grouping of instances at that level of the tree. This is done by calculating the information gain for each possible split and selecting the one with the greatest gain (ID3 is a greedy algorithm: it picks the best split at each step and never revisits earlier choices).
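As a rough illustration of the split search for a single attribute, here is a minimal Python sketch. It assumes binary splits of the form `value <= threshold`, which is one common way to handle ordered values; classic ID3 would instead create one branch per distinct value. The function names and the toy data are made up for this example and are not taken from the Wisconsin data set.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) for the binary split `value <= threshold`
    with the highest information gain, or (None, 0.0) if no split helps."""
    parent = entropy(labels)
    best = (None, 0.0)
    # Candidate thresholds: every distinct value except the largest,
    # so that both sides of the split are non-empty.
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = parent - weighted
        if gain > best[1]:
            best = (t, gain)
    return best


# Made-up toy values for one attribute and the matching class labels.
attribute_values = [1, 2, 3, 4, 7, 8, 9, 10]
diagnosis = ["benign"] * 4 + ["malignant"] * 4
print(best_threshold(attribute_values, diagnosis))  # (4, 1.0)
```

In this toy case the split `value <= 4` separates the two classes perfectly, so the gain equals the full parent entropy of 1 bit. On real data you would run the same search over every attribute and keep the attribute/threshold pair with the largest gain.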

You can do this by hand, but it's far more practical to run the algorithm in a tool such as Weka or R. If you're creating your own implementation, you'll need to test each possible split and select the one with the greatest information gain, unless the group is already homogeneous, in which case you'd assign the class label and turn the node into a leaf.
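If you do write your own implementation, the recursion described above might look roughly like the following Python sketch. It is only one possible design: it assumes binary `value <= threshold` splits (classic ID3 branches once per distinct value), represents the tree as nested dictionaries, and falls back to a majority-class leaf when no split has positive gain. All names (`best_split`, `build_tree`, `predict`) and the toy rows are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_split(rows, labels):
    """Try every attribute and threshold; return (gain, attribute_index, threshold)."""
    parent = entropy(labels)
    best = (0.0, None, None)
    for a in range(len(rows[0])):
        # Skip the largest value so both sides of the split are non-empty.
        for t in sorted({r[a] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[a] <= t]
            right = [y for r, y in zip(rows, labels) if r[a] > t]
            weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            gain = parent - weighted
            if gain > best[0]:
                best = (gain, a, t)
    return best


def build_tree(rows, labels):
    """Greedily grow a tree; leaves are class labels, inner nodes are dicts."""
    # Homogeneous group: assign the class and make this node a leaf.
    if len(set(labels)) == 1:
        return labels[0]
    gain, a, t = best_split(rows, labels)
    # No split improves purity: fall back to the majority class (an assumption).
    if a is None:
        return Counter(labels).most_common(1)[0][0]
    left = [(r, y) for r, y in zip(rows, labels) if r[a] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[a] > t]
    return {
        "attribute": a,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left]),
        "right": build_tree([r for r, _ in right], [y for _, y in right]),
    }


def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's class label."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["attribute"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


# Made-up toy data: two integer attributes per instance, values in 1-10.
rows = [(1, 1), (2, 1), (3, 2), (8, 7), (9, 10), (10, 8)]
labels = ["benign"] * 3 + ["malignant"] * 3
tree = build_tree(rows, labels)
print(tree)
print(predict(tree, (2, 1)), predict(tree, (9, 9)))
```

The nested-dictionary representation keeps the sketch short; a small node class would work just as well. Note that this version never prunes, so on noisy data it will keep splitting until every leaf is pure or no split has positive gain.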