Solved – Random Forests and Information gain

cart · information-theory · machine-learning · self-study

Suppose you are building a random forest model that splits each node on the attribute with the highest information gain. From the image below, select the attribute with the highest information gain.

A) Outlook
B) Humidity
C) Windy
D) Temperature

The provided solution says:

"Solution: A

Information gain increases with the average purity of the subsets, so option A is the right answer."

What does "average purity of subsets" mean here?

Best Answer

"Average purity of subsets" means the average of the purity metric over the subsets produced by the split. In your example, splitting on Outlook produces 3 subsets; the information gain is then computed with a formula that weights each subset by its size:

$$\text{Gain}(S, \text{Outlook}) = H(S) - \sum_{v\in \text{Values}(\text{Outlook})} \frac{|S_v|}{|S|} H(S_v)$$

Here $H(S)$ is the entropy of the whole set, $H(S_v)$ is the entropy of a subset, and $\frac{|S_v|}{|S|}$ is the size of the subset divided by the size of the whole set.

$$H(S) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} = 0.94$$
$$H(S_{\text{Sunny}}) = -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} = 0.97$$
$$H(S_{\text{Overcast}}) = 0$$
$$H(S_{\text{Rainy}}) = -\tfrac{3}{5}\log_2\tfrac{3}{5} - \tfrac{2}{5}\log_2\tfrac{2}{5} = 0.97$$
$$\text{Gain}(S, \text{Outlook}) = 0.94 - 0.97\cdot\tfrac{5}{14} - 0\cdot\tfrac{4}{14} - 0.97\cdot\tfrac{5}{14} = 0.94 - 0.347 - 0.347 = 0.246$$

The average entropy (i.e. impurity) of the subsets is $\frac{0.97 + 0.97 + 0}{3} = 0.647$.
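As a quick check, the numbers above can be reproduced in a few lines of Python (a minimal sketch; the class counts are read straight off the worked example):

```python
from math import log2

def entropy(counts):
    """Shannon entropy of a label distribution, given as class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

H_S = entropy([9, 5])  # entropy of the whole 14-row set, ~0.940

# (yes, no) counts in each Outlook subset
subsets = {"Sunny": [2, 3], "Overcast": [4, 0], "Rainy": [3, 2]}

# Weighted-average entropy after the split, then the information gain
total = 14
weighted = sum(sum(c) / total * entropy(c) for c in subsets.values())
gain = H_S - weighted
print(round(gain, 3))  # 0.247; the 0.246 above comes from rounded intermediates
```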

If you split on Windy:

$$H(S_{\text{false}}) = -\tfrac{6}{8}\log_2\tfrac{6}{8} - \tfrac{2}{8}\log_2\tfrac{2}{8} = 0.81$$
$$H(S_{\text{true}}) = -\tfrac{3}{6}\log_2\tfrac{3}{6} - \tfrac{3}{6}\log_2\tfrac{3}{6} = 1$$
$$\text{Gain}(S, \text{Windy}) = 0.94 - 0.81\cdot\tfrac{8}{14} - 1\cdot\tfrac{6}{14} = 0.049$$

Average entropy: $\frac{0.81 + 1}{2} = 0.905$

And so on.
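Repeating this for every attribute can be automated. Below is a sketch that assumes the standard 14-row "play tennis" dataset (the image is not reproduced here, but the class counts used in the calculations above match that dataset exactly):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Information gain of splitting `rows` on column index `attr`."""
    gain = entropy(labels)
    total = len(labels)
    for value in set(r[attr] for r in rows):
        sub = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        gain -= len(sub) / total * entropy(sub)
    return gain

# Assumed dataset: (Outlook, Temperature, Humidity, Windy) -> Play
data = [
    (("Sunny", "Hot", "High", "false"), "no"),
    (("Sunny", "Hot", "High", "true"), "no"),
    (("Overcast", "Hot", "High", "false"), "yes"),
    (("Rainy", "Mild", "High", "false"), "yes"),
    (("Rainy", "Cool", "Normal", "false"), "yes"),
    (("Rainy", "Cool", "Normal", "true"), "no"),
    (("Overcast", "Cool", "Normal", "true"), "yes"),
    (("Sunny", "Mild", "High", "false"), "no"),
    (("Sunny", "Cool", "Normal", "false"), "yes"),
    (("Rainy", "Mild", "Normal", "false"), "yes"),
    (("Sunny", "Mild", "Normal", "true"), "yes"),
    (("Overcast", "Mild", "High", "true"), "yes"),
    (("Overcast", "Hot", "Normal", "false"), "yes"),
    (("Rainy", "Mild", "High", "true"), "no"),
]
rows, labels = zip(*data)

# Outlook has the largest gain, which is why the answer is A
for i, name in enumerate(["Outlook", "Temperature", "Humidity", "Windy"]):
    print(f"{name}: {info_gain(rows, labels, i):.3f}")
```

Running it prints gains of roughly 0.247 for Outlook, 0.029 for Temperature, 0.152 for Humidity, and 0.048 for Windy, confirming that Outlook is the best split.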