Stratification seeks to ensure that each fold is representative of all strata of the data. Generally this is done in a supervised way for classification, aiming to ensure each class is (approximately) equally represented in each test fold (the test folds are, of course, combined in a complementary way to form the training folds).
The intuition behind this relates to the bias of most classification algorithms. They tend to weight each instance equally, which means overrepresented classes get too much weight (e.g. when optimizing F-measure, Accuracy or a complementary form of error). Stratification is not so important for an algorithm that weights each class equally (e.g. when optimizing Kappa, Informedness or ROC AUC) or according to a cost matrix (e.g. one that assigns a value to each correct classification and/or a cost to each way of misclassifying). See, e.g.
D. M. W. Powers (2014), What the F-measure doesn't measure: Features, Flaws, Fallacies and Fixes. http://arxiv.org/pdf/1503.06410
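For concreteness, here is a minimal scikit-learn sketch of supervised stratification (the toy data, imbalance ratio and fold count are my own illustrative assumptions, not from the question):

```python
# Minimal sketch: plain vs. stratified K-fold on an imbalanced toy problem.
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.array([0] * 90 + [1] * 10)   # heavily imbalanced two-class labels

# Plain K-fold can leave a test fold with very few (or no) minority instances.
for _, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    print("plain fold, minority count:", int((y[test_idx] == 1).sum()))

# Stratified K-fold keeps the class proportions roughly equal in every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for _, test_idx in skf.split(X, y):
    print("stratified fold, minority count:", int((y[test_idx] == 1).sum()))
```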
One specific issue that is important even for unbiased or balanced algorithms is that they tend not to be able to learn or test a class that isn't represented in a fold at all; furthermore, even when only one instance of a class is present in a fold, generalization cannot be learned or, respectively, evaluated. However, even this consideration isn't universal and, for example, doesn't apply so much to one-class learning, which tries to determine what is normal for an individual class and effectively identifies outliers as being a different class, given that cross-validation is about determining statistics, not generating a specific classifier.
On the other hand, supervised stratification compromises the technical purity of the evaluation: the labels of the test data shouldn't affect training, yet in stratification they are used in the selection of the training instances. Unsupervised stratification is also possible, spreading similar data across folds by looking only at the attributes of the data, not the true class. See, e.g.
N. A. Diamantidis, D. Karlis, E. A. Giakoumakis (1997), Unsupervised stratification of cross-validation for accuracy estimation. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.469.8855
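A rough sketch of the unsupervised idea (this is not the exact algorithm of Diamantidis et al.; clustering on the attributes and then stratifying on the cluster ids is just one crude way to spread similar instances across folds, with the cluster count and toy data as assumptions):

```python
# Cluster on the attributes only, then stratify folds on the cluster ids.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))           # attributes only, no class labels used

cluster_ids = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

folds = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
             .split(X, cluster_ids))    # each fold samples every cluster
```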
Stratification can also be applied to regression rather than classification, in which case, as in unsupervised stratification, similarity rather than identity is used; the supervised version, however, uses the known true function value.
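For regression, one common convention (my own choice here, not prescribed above) is to bin the continuous target into quantiles and stratify on the bins:

```python
# The quantile bins act as surrogate classes for StratifiedKFold.
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)    # continuous target

edges = np.quantile(y, np.linspace(0, 1, 11)[1:-1])   # 10 quantile bins
y_binned = np.digitize(y, edges)

folds = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
             .split(X, y_binned))       # each test fold spans the range of y
```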
Further complications are rare classes and multilabel classification, where classifications are being made along multiple (independent) dimensions. Here, tuples of the true labels across all dimensions can be treated as classes for the purpose of cross-validation. However, not all combinations necessarily occur, and some combinations may be rare. Rare classes and rare combinations are a problem in that a class/combination that occurs at least once but fewer than K times (in K-fold CV) cannot be represented in every test fold. In such cases, one could instead consider a form of stratified bootstrapping (sampling with replacement to generate a full-size training fold, with repetitions expected and an expected 36.8% of instances left unselected for testing, and with one instance of each class selected initially, without replacement, for the test fold).
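A sketch of how such a stratified bootstrap could look; the helper name and the details of how the unselected instances join the test fold are my own reading of the description:

```python
import numpy as np

def stratified_bootstrap(y, rng):
    y = np.asarray(y)
    idx = np.arange(len(y))
    # one guaranteed test instance per class, chosen without replacement
    test = np.array([rng.choice(idx[y == c]) for c in np.unique(y)])
    remaining = np.setdiff1d(idx, test)
    # full-size training fold drawn with replacement from the rest;
    # roughly 36.8% of `remaining` is never drawn and joins the test fold
    train = rng.choice(remaining, size=len(y), replace=True)
    test = np.union1d(test, np.setdiff1d(remaining, np.unique(train)))
    return train, test

rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 45 + [2] * 5)   # class 2 is rare
train_idx, test_idx = stratified_bootstrap(y, rng)
```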
Another approach to multilabel stratification is to try to stratify or bootstrap each class dimension separately, without seeking to ensure representative selection of combinations. With L labels, N instances, and K_kl instances of class k for label l, we can randomly choose (without replacement) from the corresponding set of labeled instances D_kl approximately N/LK_kl instances. This does not ensure optimal balance but rather seeks balance heuristically. It can be improved by barring selection of labels at or over quota unless there is no choice (as some combinations do not occur or are rare). Problems tend to mean either that there is too little data or that the dimensions are not independent.
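One greedy way to implement the spirit of this per-label heuristic (a sketch of my own, not the exact selection scheme described above) is to assign each instance to the fold that still has the greatest remaining per-label quota for that instance's classes:

```python
import numpy as np

def per_label_stratified_folds(Y, n_folds, rng):
    """Y: (N, L) integer array of class indices, one column per label dimension."""
    N, L = Y.shape
    # remaining per-fold quota for every (label, class) pair
    quota = {}
    for l in range(L):
        for k, c in zip(*np.unique(Y[:, l], return_counts=True)):
            quota[(l, k)] = np.full(n_folds, c / n_folds)
    folds = [[] for _ in range(n_folds)]
    for i in rng.permutation(N):
        # prefer the fold with the largest remaining quota for this instance's
        # classes; folds at or over quota are only used when there is no choice
        need = [sum(quota[(l, Y[i, l])][f] for l in range(L))
                for f in range(n_folds)]
        f = int(np.argmax(need))
        folds[f].append(int(i))
        for l in range(L):
            quota[(l, Y[i, l])][f] -= 1
    return [np.array(fold) for fold in folds]

rng = np.random.default_rng(0)
Y = rng.integers(0, 3, size=(60, 2))   # 2 label dimensions, 3 classes each
folds = per_label_stratified_folds(Y, n_folds=5, rng=rng)
```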
I think you got it quite right, but not exactly. Here is my suggestion:
- Separate the dataset into a test set and a train+validation set.
- Perform a grid search using cross validation on the train+validation set to find the optimal hyperparameters, fitting on the train splits and scoring on the validation splits (for a random forest, this would be defining your mtry).
- Refit on the entire train+validation set with the optimal hyperparameters and report the error using the test set.
Only this way do you ensure that the performance is measured on a part of the data the model has never seen. I recommend repeating split no. 1 a few times, for example as a 10-fold CV, to make the performance measure less prone to variance.
The mlr package has good explanations of this nested cross-validation: https://mlr-org.github.io/mlr-tutorial/devel/html/nested_resampling/index.html
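In scikit-learn terms, the three steps plus the repeated outer split look roughly like this (the dataset, parameter grid and random forest settings are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# 1. split off a test set; keep the rest as train+validation
X_trval, X_test, y_trval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 2. grid search with (inner) cross validation over an mtry-like parameter
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid={"max_features": [2, 4, 8]}, cv=5)
search.fit(X_trval, y_trval)

# 3. GridSearchCV refits on all of train+validation with the best parameters;
#    the held-out test set then gives the performance estimate
print("test accuracy:", search.score(X_test, y_test))

# repeating split no. 1, e.g. as a 10-fold outer CV, is nested cross validation
print("nested CV accuracy:", cross_val_score(search, X, y, cv=10).mean())
```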
Yes, out-of-bag performance for a random forest is very similar to cross validation. Essentially what you get is leave-one-out with surrogate random forests that use fewer trees. So if done correctly, you get a slightly pessimistic bias. The exact bias and variance properties will be somewhat different from externally cross-validating your random forest.
As with cross validation, the crucial point for correctness (i.e. a slight pessimistic bias, not a large optimistic bias) is the implicit assumption that each row of your data is an independent case. If this assumption is not met, the out-of-bag estimate will be overoptimistic (as would a "plain" cross validation be) - and in that situation it may be much easier to set up an outer cross validation that splits into independent groups than to make the random forest deal with such dependence structures.
Assuming you have this independence between rows, you can use the random forest's out-of-bag performance estimate just like the corresponding cross validation estimate: either as an estimate of generalization error or for model tuning (the parameters mentioned by @horaceT, or e.g. boosting). If you use it for model tuning, as always, you need another independent estimate of the final model's generalization error.
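A small sketch comparing the two estimates on toy data (the data and settings are assumptions; the point is only that oob_score_ and an external CV play the same role):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)
print("OOB accuracy:  ", rf.oob_score_)

print("10-fold CV acc:", cross_val_score(
    RandomForestClassifier(n_estimators=500, random_state=0), X, y, cv=10).mean())

# If rows are not independent (e.g. repeated measurements per subject), both
# numbers become optimistic; split by group instead (e.g. GroupKFold).
```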
That being said, the number of trees and the number of variables are reasonably easy to fix, so random forest is one of the models I consider when sample sizes are too small for data-driven model tuning.
- Prediction error will not increase with a higher number of trees - it just stops decreasing at some point. So you can simply throw in a bit more computation time and be OK.
- The number of variables to consider in each tree will depend on your data, but IMHO isn't very critical either (in the sense that you can e.g. use experience from previous applications with similar data to fix it).
- Leaf size (for classification) is again typically left at 1 - this doesn't cost generalization performance, just computation time and memory.
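Mapped onto scikit-learn parameter names ("fix rather than tune"; the mapping to these names is my assumption, other packages name them differently):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=1000,     # more trees only cost time; error won't increase
    max_features="sqrt",   # mtry equivalent; the default heuristic is usually fine
    min_samples_leaf=1,    # leaf size left at 1 for classification
    random_state=0,
)
```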