I would like to know whether Breiman's random forest (as implemented in the R randomForest package) uses information gain or the Gini index as its splitting criterion (the criterion for selecting the attribute to split on). I tried to find this out on http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm and in the documentation for the randomForest package in R, but the only thing I found is that the Gini index can be used for computing variable importance.
Solved – Does Breiman’s random forest use information gain or Gini index
entropy, gini, r, random forest
Related Solutions
Ensemble methods (such as random forests) require some element of variation in the datasets that the individual base classifiers are grown on (otherwise random forests would end up with a forest of trees that are too similar). As decision trees are highly sensitive to the observations in the training set, varying the observations (using the bootstrap) was, I suppose, a natural approach to getting the required diversity. The obvious alternative is to vary the features that are used, e.g. train each tree on a subset of the original features. Using the bootstrap samples also allows us to estimate the out-of-bag (OOB) error rate and variable importance.
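As a concrete illustration of the bootstrap/OOB mechanics, here is a minimal R sketch using the built-in iris data (the object names are just for illustration):

```r
# A bootstrap sample of the training rows leaves roughly
# 1 - (1 - 1/n)^n ~ 36.8% of the observations "out of bag" for that tree.
set.seed(1)
n    <- nrow(iris)
boot <- sample(seq_len(n), size = n, replace = TRUE)  # rows used to grow one tree
oob  <- setdiff(seq_len(n), boot)                     # rows left out for that tree

length(unique(boot)) / n   # fraction of distinct rows in the bootstrap sample (~0.632)
length(oob) / n            # fraction out of bag (~0.368)
```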
The second point is essentially another way of injecting randomness into the forest. It also helps reduce the correlation among the trees (by using a low mtry value), with the trade-off of (potentially) weakening the predictive power. Using too large a value of mtry will cause the trees to become increasingly similar to one another (in the extreme you end up with bagging).
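A quick way to see the effect of mtry is to grow forests with different values and compare their OOB error rates. This is only a sketch on the built-in iris data; the exact error values depend on the data and the random seed:

```r
library(randomForest)

set.seed(42)
# Larger mtry -> more similar (more correlated) trees; with iris,
# mtry = 4 uses all features and is essentially bagging.
for (m in c(1, 2, 4)) {
  rf <- randomForest(Species ~ ., data = iris, mtry = m, ntree = 500)
  cat("mtry =", m, " OOB error =", round(rf$err.rate[500, "OOB"], 4), "\n")
}
```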
I believe that the reason for not pruning is more that it's not necessary than anything else. With a single decision tree you would normally prune it, since it's highly susceptible to overfitting. However, by using the bootstrap samples and growing many trees, random forests can grow trees that are individually strong but not particularly correlated with one another. Basically, the individual trees are overfit, but provided their errors are not correlated, the forest should be reasonably accurate.
The reason it works well is similar to Condorcet's jury theorem (and the logic behind methods such as boosting). Basically you have lots of weak learners that only need to perform marginally better than random guessing. If this is true you can keep adding weak learners, and in the limit you would get perfect predictions from your ensemble. In practice this is limited by the errors of the learners becoming correlated, which prevents the ensemble's performance from improving further.
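A small back-of-the-envelope check of the jury-theorem argument, assuming idealized, fully independent voters that are each correct with probability 0.55 (an assumption real trees never satisfy exactly):

```r
# Majority vote of m independent "weak learners", each correct with
# probability p slightly above 0.5. As m grows, the probability that
# the majority is correct approaches 1 (Condorcet).
p <- 0.55
for (m in c(1, 11, 101, 1001)) {
  # P(more than half of the m independent voters are correct)
  acc <- pbinom(floor(m / 2), size = m, prob = p, lower.tail = FALSE)
  cat(m, "voters: majority-vote accuracy =", round(acc, 4), "\n")
}
```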
Given that your model exhibits good accuracy, you can just use it to predict the class labels of the records in the unlabeled dataset. However, you cannot evaluate its performance on unlabeled data.
Be careful to assess the quality of your model on the labeled data by cross-validation; it is not enough to check the training error rate.
If your model is not accurate enough, you might consider semi-supervised learning, where the unlabeled data is used in order to improve the quality of your model via inductive learning. The accuracy should still be computed by cross-validation on your labeled data.
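For instance, a minimal sketch of this workflow with randomForest, assuming (hypothetically) a labeled data frame train_df with a factor outcome y and an unlabeled data frame new_df with the same predictors:

```r
library(randomForest)

# Hypothetical data: 'train_df' has the predictors plus a factor column 'y';
# 'new_df' has the same predictors but no labels.
set.seed(7)
k     <- 5
folds <- sample(rep(seq_len(k), length.out = nrow(train_df)))

cv_acc <- sapply(seq_len(k), function(i) {
  fit  <- randomForest(y ~ ., data = train_df[folds != i, ])
  pred <- predict(fit, newdata = train_df[folds == i, ])
  mean(pred == train_df$y[folds == i])
})
mean(cv_acc)  # cross-validated accuracy on the labeled data

# If the cross-validated accuracy is acceptable, refit on all labeled data
# and predict the labels of the unlabeled records.
final_fit  <- randomForest(y ~ ., data = train_df)
new_labels <- predict(final_fit, newdata = new_df)
```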
Have a look at [Criminisi et al., Decision Forests: A Unified Framework for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning], Chapter 7 on semi-supervised learning and Section 7.4 on induction with semi-supervised learning.
Best Answer
The randomForest package in R by A. Liaw is a port of the original code: a mix of (translated) C code, some remaining Fortran code, and R wrapper code. To decide the overall best split across break points and across the mtry candidate variables, the code uses a scoring function similar to the Gini gain:
$GiniGain(N,X)=Gini(N)-\frac{\lvert N_{1} \rvert }{\lvert N \rvert }Gini(N_{1})-\frac{\lvert N_{2} \rvert }{\lvert N \rvert }Gini(N_{2})$
Where $X$ is a given feature, $N$ is the node on which the split is to be made, and $N_{1}$ and $N_{2}$ are the two child nodes created by splitting $N$. $\lvert . \rvert $ is the number of elements in a node.
And $Gini(N)=1-\sum_{k=1}^{K}p_{k}^2$, where $K$ is the number of classes and $p_{k}$ is the proportion of class $k$ in the node.
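As a worked example, here is a small R sketch that computes the Gini impurity and Gini gain exactly as defined above (the function names gini and gini_gain are made up for illustration):

```r
# Sketch: Gini impurity and Gini gain for one candidate split,
# following the formulas above.
gini <- function(labels) {
  p <- table(labels) / length(labels)   # class proportions in the node
  1 - sum(p^2)
}

gini_gain <- function(parent, left, right) {
  gini(parent) -
    length(left)  / length(parent) * gini(left) -
    length(right) / length(parent) * gini(right)
}

# Example: splitting iris on Petal.Length < 2.5 separates setosa perfectly.
left  <- iris$Species[iris$Petal.Length <  2.5]
right <- iris$Species[iris$Petal.Length >= 2.5]
gini_gain(iris$Species, left, right)   # ~0.333
```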
But the applied scoring function is not exactly the same; it is an equivalent, more computationally efficient version. $Gini(N)$ and $\lvert N \rvert$ are constant across all compared splits and can therefore be dropped, so maximizing the Gini gain is the same as minimizing the weighted child impurity $\lvert N_{1} \rvert Gini(N_{1}) + \lvert N_{2} \rvert Gini(N_{2})$.
Now inspect one of those weighted terms, writing $nclass_{j,k}$ for the count of target class $k$ in daughter node $j$ (so $p_{j,k}=nclass_{j,k}/\lvert N_{j} \rvert$):
$\lvert N_{2} \rvert Gini(N_{2}) = \lvert N_{2} \rvert \left(1-\sum_{k=1}^{K}p_{2,k}^{2}\right) = \lvert N_{2} \rvert - \lvert N_{2} \rvert \sum_{k=1}^{K}\frac{nclass_{2,k}^{2}}{\lvert N_{2} \rvert^{2}} = \lvert N_{2} \rvert - \sum_{k=1}^{K}\frac{nclass_{2,k}^{2}}{\lvert N_{2} \rvert}$
Notice that one factor of $\lvert N_{2} \rvert$ in the numerator cancels against the denominator. Since $\lvert N_{1} \rvert + \lvert N_{2} \rvert = \lvert N \rvert$ is also constant across splits, it can be dropped as well, and after flipping the sign the best-split decision becomes: maximize the node-size-weighted sum of squared class prevalences,
score $= \lvert N_{1} \rvert \sum_{k=1}^{K}p_{1,k}^{2} + \lvert N_{2} \rvert \sum_{k=1}^{K}p_{2,k}^{2} = \sum_{k=1}^{K}\frac{nclass_{1,k}^{2}}{\lvert N_{1} \rvert} + \sum_{k=1}^{K}\frac{nclass_{2,k}^{2}}{\lvert N_{2} \rvert} = numerator_1/denominator_1 + numerator_2/denominator_2$
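A quick numerical check of this equivalence (a sketch on the iris data, not the package's code): the weighted child impurity equals $\lvert N_{1} \rvert + \lvert N_{2} \rvert$ minus the score, so minimizing one is the same as maximizing the other.

```r
# Check numerically that |N1|*Gini(N1) + |N2|*Gini(N2) equals
# |N1| + |N2| - score for the split used above.
gini  <- function(labels) 1 - sum((table(labels) / length(labels))^2)
score <- function(left, right) {
  sum(table(left)^2) / length(left) + sum(table(right)^2) / length(right)
}

left  <- iris$Species[iris$Petal.Length <  2.5]
right <- iris$Species[iris$Petal.Length >= 2.5]

length(left) * gini(left) + length(right) * gini(right)   # weighted child impurity
length(left) + length(right) - score(left, right)         # same number
```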
The implementation also allows for classwise up/down weighting of samples. Another important detail: when the implementation updates this modified Gini gain while sweeping across break points, moving a single sample from one node to the other is very efficient. The sample's class count can be subtracted from the numerators/denominators of one node and added to the other's. I wrote a prototype RF some months ago that naively recomputed the Gini gain from scratch for every break point, and that was slower :)
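A rough sketch of that constant-time update idea (my own illustration in R, not the package's C code): keep per-class counts and node sizes for the two children, adjust them when a sample crosses the break point, and recompute the score from those counters.

```r
# Keep, for each child node, the per-class counts and the node size.
# Moving one sample of class k from the right node to the left node
# only touches those counters, after which the score is cheap to recompute.
move_left <- function(state, k) {
  state$n1[k] <- state$n1[k] + 1; state$size1 <- state$size1 + 1
  state$n2[k] <- state$n2[k] - 1; state$size2 <- state$size2 - 1
  state
}
score_of <- function(state) {
  sum(state$n1^2) / state$size1 + sum(state$n2^2) / state$size2
}

# Toy example with two classes: start with everything in the right node,
# then slide one sample of class 1 across the split and re-score.
state <- list(n1 = c(0, 0), size1 = 0, n2 = c(30, 20), size2 = 50)
state <- move_left(state, 1)
score_of(state)
```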
If several splits score equally well, a random winner is picked.
This answer was based on inspecting the source file "randomForest.x.x.tar.gz/src/classTree.c", lines 209-250.