From section 7.10.2 of Elements of Statistical Learning (free online, and it's great):
Consider a classification problem with a large number of predictors, as may
arise, for example, in genomic or proteomic applications. A typical strategy
for analysis might be as follows:
1. Screen the predictors: find a subset of “good” predictors that show
fairly strong (univariate) correlation with the class labels.
2. Using just this subset of predictors, build a multivariate classifier.
3. Use cross-validation to estimate the unknown tuning parameters and
to estimate the prediction error of the final model.
Is this a correct application of cross-validation? Consider a scenario with
N = 50 samples in two equal-sized classes, and p = 5000 quantitative
predictors (standard Gaussian) that are independent of the class labels.
The true (test) error rate of any classifier is 50%. We carried out the above
recipe, choosing in step (1) the 100 predictors having highest correlation
with the class labels, and then using a 1-nearest neighbor classifier, based
on just these 100 predictors, in step (2). Over 50 simulations from this
setting, the average CV error rate was 3%. This is far lower than the true
error rate of 50%.
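Here is a minimal sketch of that simulation, assuming NumPy and scikit-learn (the variable names are mine, not from the book):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
N, p, k = 50, 5000, 100

X = rng.standard_normal((N, p))       # predictors independent of the labels
y = np.repeat([0, 1], N // 2)         # two equal-sized classes

# WRONG: screen the predictors on ALL N samples before cross-validating.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
ys = (y - y.mean()) / y.std()
corr = np.abs(Xs.T @ ys) / N          # |correlation| of each predictor with y
top = np.argsort(corr)[-k:]           # the 100 most correlated predictors

errors = []
for train, test in StratifiedKFold(n_splits=5).split(X, y):
    knn = KNeighborsClassifier(n_neighbors=1)
    knn.fit(X[train][:, top], y[train])
    errors.append(np.mean(knn.predict(X[test][:, top]) != y[test]))

print(f"CV error with leaky screening: {np.mean(errors):.2%}")  # far below 50%
```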
What has happened? The problem is that the predictors have an unfair
advantage, as they were chosen in step (1) on the basis of all of the samples.
Leaving samples out after the variables have been selected does not correctly mimic the application of the classifier to a completely independent test set, since these predictors “have already seen” the left-out samples.
We selected the 100 predictors having largest correlation with the class labels over all 50 samples. Then we chose a random set of 10 samples, as we would do in five-fold cross-validation, and computed the correlations of the pre-selected 100 predictors with the class labels over just these 10 samples (shown in the top panel of a figure in the book, not reproduced here). We see that the correlations average about 0.28, rather than 0, as one might expect.
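For contrast, here is a sketch of the correct recipe, where the screening happens inside each fold. A scikit-learn Pipeline is one way to arrange this; SelectKBest with the ANOVA F-statistic stands in for the correlation screen:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5000))   # pure-noise predictors
y = np.repeat([0, 1], 25)             # two balanced classes

# The screening step lives inside the pipeline, so cross_val_score refits it
# on each fold's training samples only; no information about the held-out
# fold leaks into the predictor selection.
pipe = make_pipeline(SelectKBest(f_classif, k=100),
                     KNeighborsClassifier(n_neighbors=1))
acc = cross_val_score(pipe, X, y, cv=5)
print(f"CV error with honest screening: {1 - acc.mean():.2%}")  # close to 50%
```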
I had exactly the same question as you, and was a bit sad to find that no answers had been posted on your topic...
That said, I found this paper: One-Vs-All Binarization Technique in the Context of Random Forest (https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2015-5.pdf), published in 2015.
The authors report better classification performance with one-versus-rest Random Forest classifiers than with standard multiclass Random Forest ones.
They do not give many clues as to why it works so well, except that the trees generated in the one-versus-rest context are simpler.
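If you want to try the comparison yourself, here is a rough sketch with scikit-learn; the digits dataset is just a stand-in, so don't expect it to reproduce the paper's numbers:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier

X, y = load_digits(return_X_y=True)   # 10-class toy problem

# Standard multiclass forest vs. the same forest binarized one-vs-rest.
multiclass = RandomForestClassifier(n_estimators=200, random_state=0)
one_vs_rest = OneVsRestClassifier(
    RandomForestClassifier(n_estimators=200, random_state=0))

print("multiclass :", cross_val_score(multiclass, X, y, cv=5).mean())
print("one-vs-rest:", cross_val_score(one_vs_rest, X, y, cv=5).mean())
```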
I am wondering if you found some answers yourself since you posted your question?
Best Answer
Generally, it's not a great idea to meddle with feature weights: RF (and machine learning algorithms in general) works out the importance of features by itself.
See also: https://stackoverflow.com/questions/38034702/how-to-put-more-weight-on-certain-features-in-machine-learning
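To illustrate the point, Random Forest already exposes the relevance it learns for each feature, e.g. as feature_importances_ in scikit-learn. A toy sketch (the dataset here is synthetic, not yours):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 10 features, only 3 of which actually carry signal.
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, n_redundant=0, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

for i, imp in enumerate(rf.feature_importances_):
    print(f"feature {i}: importance {imp:.3f}")  # informative ones rank highest
```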