C4.5: How to select the split point (threshold) for a continuous attribute

cart threshold

Using the "play golf" or "play ball" data (listed at the bottom), to pick the root node we look at Outlook, Temperature, Humidity, and Wind, to see which has the highest GainRatio.

Outlook ends up as the root because it has the highest GainRatio. What confuses me is Humidity, a continuous attribute: the split point chosen for it is 80, with GainRatio = 0.1087, even though 65 has a higher GainRatio = 0.1285. The split point 80 does have the higher Gain, but not the higher GainRatio.

The literature I have seen says, roughly, "pick the split point for a continuous attribute to be the one giving the most gain". It seems counterintuitive to me that the split point is chosen on Gain alone, when the next decision node is chosen by comparing all the attributes on GainRatio.

I hope to gain some clarity here.

Thanks.

EDIT:
The crux of the question is: what is the appropriate method for selecting the threshold split point of a continuous attribute? Is it (1) the Gain or (2) the Gain Ratio?
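
For reference, these are the usual definitions behind the three quantities compared here. The symbols are notation introduced for this note, not taken from any particular source: S is the set of instances at the node, p_c the fraction of S in class c, and S_v the subset of S sent down branch v of attribute A. For a continuous attribute with threshold t there are just two branches, the instances with value <= t and those with value > t.

$$
\begin{aligned}
H(S) &= -\sum_{c} p_c \log_2 p_c \\
\mathrm{Gain}(S, A) &= H(S) - \sum_{v} \frac{|S_v|}{|S|}\, H(S_v) \\
\mathrm{SplitInfo}(S, A) &= -\sum_{v} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|} \\
\mathrm{GainRatio}(S, A) &= \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)}
\end{aligned}
$$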

The calculations are as follows:

OUTLOOK:
Gain = 0.2467
SplitInfo = 1.5774
Gain Ratio = 0.1564

TEMPERATURE:
Gain = 0.0292
SplitInfo = 1.5567
Gain Ratio = 0.0188

HUMIDITY:
Possible split points = { 65, 70, 75, 78, 80, 85, 90, 95, 96 }

Split 65:
Gain = 0.0477
SplitInfo = 0.3712
Gain Ratio = 0.1285

Split 80:
Gain = 0.1022
SplitInfo = 0.9403
Gain Ratio = 0.1087

WIND:
Gain = 0.0481
SplitInfo = 0.9852
Gain Ratio = 0.0488
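
The Humidity figures can be re-derived with a short script. This is only a sketch under a few assumptions: a split sends `value <= threshold` to one branch and the rest to the other, the candidate thresholds are the distinct observed values except the maximum (96 would leave one branch empty), and the helper names `entropy` and `split_stats` are made up for illustration.

```python
from collections import Counter
from math import log2

# Humidity and Play columns from the data table, in row order.
humidity = [85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80]
play = ["no", "no", "yes", "yes", "yes", "no", "yes", "no",
        "yes", "yes", "yes", "yes", "yes", "no"]


def entropy(items):
    """Shannon entropy (base 2) of a list of hashable items."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())


def split_stats(values, labels, threshold):
    """Return (Gain, SplitInfo, GainRatio) for the split value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    # Weighted entropy of the class labels after the split.
    remainder = sum(len(part) / n * entropy(part) for part in (left, right))
    gain = entropy(labels) - remainder
    # SplitInfo is the entropy of the branch sizes, ignoring the class labels.
    split_info = entropy(["left"] * len(left) + ["right"] * len(right))
    return gain, split_info, gain / split_info


# Every distinct value except the maximum gives two non-empty branches.
candidates = sorted(set(humidity))[:-1]
stats = {t: split_stats(humidity, play, t) for t in candidates}

for t, (gain, split_info, ratio) in stats.items():
    print(f"<= {t}: Gain={gain:.4f}  SplitInfo={split_info:.4f}  GainRatio={ratio:.4f}")

best_by_gain = max(stats, key=lambda t: stats[t][0])
best_by_ratio = max(stats, key=lambda t: stats[t][2])
print("best by Gain:", best_by_gain, " best by GainRatio:", best_by_ratio)
```

Ranking by Gain picks 80, while ranking by GainRatio picks 65. Threshold 95 produces the same 13/1 partition sizes with the same class mix as 65, so it ties with 65 on every figure.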

DATA:
Outlook   Temperature  Humidity  Wind  Play
sun       hot          85        low   no
sun       hot          90        high  no
overcast  hot          78        low   yes
rain      sweet        96        low   yes
rain      cold         80        low   yes
rain      cold         70        high  no
overcast  cold         65        high  yes
sun       sweet        95        low   no
sun       cold         70        low   yes
rain      sweet        80        low   yes
sun       sweet        70        high  yes
overcast  sweet        90        high  yes
overcast  hot          75        low   yes
rain      sweet        80        high  no

  • sorry, could not format data nicely

Best Answer

Quinlan covers more sophisticated handling of continuous attributes in his later work on C4.5. The conundrum is that, for a continuous attribute, the gain ratio depends on the threshold itself: the threshold changes not only the gain but also the split information in the denominator.

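In particular, if the threshold apportions the instances nearly equally, the split information (the entropy of the branch sizes, which sits in the denominator) is at its largest, so the gain ratio is pushed down. A lopsided threshold is rewarded instead. In your Humidity numbers, the threshold 65 isolates a single instance, so its SplitInfo is only 0.3712, well under half that of the 9/5 split at 80; that small denominator is why 65 comes out ahead on gain ratio despite having less than half the gain. Therefore, Quinlan advocates going back to the regular information gain for choosing the threshold, while continuing to use the gain ratio for choosing the attribute in the first place. So the answer to your question is (1): pick the threshold by Gain, then compare attributes by Gain Ratio.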
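
The two-stage rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not C4.5 itself: only Humidity is treated as continuous, splits are `value <= threshold`, the names `evaluate`, `gain_and_split_info`, `ROWS`, and `COLUMNS` are hypothetical, and any further corrections a real C4.5 implementation may apply are left out.

```python
from collections import Counter
from math import log2

# (Outlook, Temperature, Humidity, Wind, Play) rows copied from the question.
ROWS = [
    ("sun", "hot", 85, "low", "no"),
    ("sun", "hot", 90, "high", "no"),
    ("overcast", "hot", 78, "low", "yes"),
    ("rain", "sweet", 96, "low", "yes"),
    ("rain", "cold", 80, "low", "yes"),
    ("rain", "cold", 70, "high", "no"),
    ("overcast", "cold", 65, "high", "yes"),
    ("sun", "sweet", 95, "low", "no"),
    ("sun", "cold", 70, "low", "yes"),
    ("rain", "sweet", 80, "low", "yes"),
    ("sun", "sweet", 70, "high", "yes"),
    ("overcast", "sweet", 90, "high", "yes"),
    ("overcast", "hot", 75, "low", "yes"),
    ("rain", "sweet", 80, "high", "no"),
]
COLUMNS = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}
CONTINUOUS = {"Humidity"}  # assumption: only Humidity is treated as numeric
LABELS = [row[-1] for row in ROWS]


def entropy(items):
    """Shannon entropy (base 2) of a list of hashable items."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())


def gain_and_split_info(branch_ids):
    """Gain and SplitInfo for a partition given one branch id per row."""
    groups = {}
    for branch, label in zip(branch_ids, LABELS):
        groups.setdefault(branch, []).append(label)
    n = len(LABELS)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(LABELS) - remainder, entropy(branch_ids)


def evaluate(name):
    """Return (GainRatio, Gain, threshold); threshold is None if nominal."""
    values = [row[COLUMNS[name]] for row in ROWS]
    if name not in CONTINUOUS:
        gain, split_info = gain_and_split_info(values)
        return gain / split_info, gain, None
    # Stage 1: choose the threshold with the largest Gain (not GainRatio).
    scored = []
    for t in sorted(set(values))[:-1]:  # the maximum would not split the data
        gain, split_info = gain_and_split_info([v <= t for v in values])
        scored.append((gain, gain / split_info, t))
    gain, ratio, threshold = max(scored, key=lambda s: s[0])
    return ratio, gain, threshold


# Stage 2: compare the attributes on GainRatio.
results = {name: evaluate(name) for name in COLUMNS}
root = max(results, key=lambda name: results[name][0])

for name, (ratio, gain, threshold) in results.items():
    print(f"{name}: GainRatio={ratio:.4f} Gain={gain:.4f} threshold={threshold}")
print("root:", root)
```

With this rule Humidity enters the comparison with its Gain-optimal threshold of 80, and Outlook still has the highest gain ratio of the four attributes, so it remains the root.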