Suppose you are building a random forest model, which splits a node on the attribute that has the highest information gain. In the image below, select the attribute with the highest information gain.
A) Outlook
B) Humidity
C) Windy
D) Temperature
The provided solution says:
"Solution: A
Information gain increases with the average purity of subsets. So option A would be the right answer."
Best Answer
"Average purity of subsets" means the average of the purity metrics of each subset after the split. In your example, splitting on Outlook yields 3 subsets; information gain is then computed with a formula that weights each subset by its size:

$$\text{Gain}(S, \text{Outlook}) = H(S) - \sum_{v\in Values(\text{Outlook})} \frac{|S_v|}{|S|} H(S_v)$$

where $H(S)$ is the entropy of the whole set, $H(S_v)$ is the entropy of subset $S_v$, and $\frac{|S_v|}{|S|}$ is the number of samples in the subset divided by the number of samples in the whole set.

$$H(S) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.94$$
$$H(S_{\text{Sunny}}) = -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} \approx 0.97$$
$$H(S_{\text{Overcast}}) = 0$$
$$H(S_{\text{Rainy}}) = -\tfrac{3}{5}\log_2\tfrac{3}{5} - \tfrac{2}{5}\log_2\tfrac{2}{5} \approx 0.97$$
$$\text{Gain}(S, \text{Outlook}) = 0.94 - \tfrac{5}{14}\cdot 0.97 - \tfrac{4}{14}\cdot 0 - \tfrac{5}{14}\cdot 0.97 = 0.94 - 0.347 - 0.347 \approx 0.246$$

Average entropy (i.e. impurity): $\frac{0.97 + 0 + 0.97}{3} \approx 0.647$
If you split on Windy:

$$H(S_{\text{false}}) = -\tfrac{6}{8}\log_2\tfrac{6}{8} - \tfrac{2}{8}\log_2\tfrac{2}{8} \approx 0.81$$
$$H(S_{\text{true}}) = -\tfrac{3}{6}\log_2\tfrac{3}{6} - \tfrac{3}{6}\log_2\tfrac{3}{6} = 1$$
$$\text{Gain}(S, \text{Windy}) = 0.94 - \tfrac{8}{14}\cdot 0.81 - \tfrac{6}{14}\cdot 1 \approx 0.049$$

Average entropy: $\frac{0.81 + 1}{2} \approx 0.905$
And so on.