Solved – Should random forests based on same data but different random seeds be compared

machine-learning, random-forest, random-generation

I know that re-running a random forest with a different random seed fits a different model. I'm wondering whether it's acceptable to compare random forest models run under different random seeds and to take the one with the highest 10-fold cross-validation accuracy on the training data forward for downstream work.

As an example, below is the distribution of 10-fold CV accuracies in a dataset with 147 samples and 278 binary features used to predict disease state (two classes: healthy / diseased). This distribution is based on 100 RF replicates with different random seeds:
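The seed-to-seed spread described above can be reproduced with a short sketch. Synthetic data stands in for the real dataset, and the replicate count and forest size are reduced here for speed; these are illustrative assumptions, not the question's actual setup.

```python
# Sketch: distribution of 10-fold CV accuracy across random seeds.
# make_classification stands in for the 147-sample binary-feature dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=147, n_features=50,
                           n_informative=10, random_state=0)

accuracies = []
for seed in range(10):  # the question used 100 replicates
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    accuracies.append(scores.mean())

accuracies = np.array(accuracies)
print(f"mean={accuracies.mean():.3f}  sd={accuracies.std():.3f}  "
      f"range=[{accuracies.min():.3f}, {accuracies.max():.3f}]")
```

The spread of this distribution is exactly the seed-induced variability the question is about: each replicate sees the same data, so any difference in CV accuracy comes from the randomness inside the forest.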

Is it wrong to take the model with the highest accuracy for feature selection and for predicting on my test data?

I'm also interested in comparing the chosen model's 10-fold CV accuracy to the accuracies of additional RF models fit to the same dataset with randomized disease labels, as an alternative way to assess significance (I think my sample size may be too small to split into training and test sets). I'm concerned that this approach might be biased if I choose the model with the highest accuracy.
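The label-randomization idea above is a permutation test, and scikit-learn provides it directly as `permutation_test_score`. A minimal sketch, again on synthetic stand-in data: note the seed is fixed here, because running the test on a seed-selected "best" model would bias the p-value downward, which is precisely the concern raised above.

```python
# Sketch: permutation test of CV accuracy against label-shuffled nulls.
# The dataset and all sizes here are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import permutation_test_score

X, y = make_classification(n_samples=147, n_features=50,
                           n_informative=10, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# score: CV accuracy on true labels; perm_scores: accuracies on
# shuffled labels; pvalue: fraction of null scores >= observed score.
score, perm_scores, pvalue = permutation_test_score(
    clf, X, y, cv=10, n_permutations=20, random_state=0)

print(f"observed accuracy={score:.3f}  permutation p-value={pvalue:.3f}")
```

To keep the test honest, whatever selection procedure is applied to the real labels (e.g. "best of 100 seeds") would have to be applied inside every permutation as well.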

Best Answer

Accuracy is just another random variable that depends on your model, the seed, the train/test split, the quality of your data, and so on. Maximizing this random variable does not automatically give you the model that generalizes best.

Besides metrics like accuracy, log loss, or ROC AUC, you might also look at other learning characteristics: the learning curves of your training and test data as more data are added, the gap between training and test error, and so on. Ultimately, what you are facing is the bias-variance tradeoff that lives in every model.
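The learning-curve diagnostic mentioned here can be sketched with scikit-learn's `learning_curve` (synthetic data again; the sizes are illustrative assumptions). A large, persistent gap between training and validation accuracy points at variance; low scores on both point at bias.

```python
# Sketch: train-vs-validation accuracy as the training set grows.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=300, n_features=50,
                           n_informative=10, random_state=0)

# Evaluate at 5 training-set sizes, each with 5-fold CV.
sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, cv=5, train_sizes=np.linspace(0.2, 1.0, 5),
    scoring="accuracy")

for n, tr, va in zip(sizes, train_scores.mean(axis=1),
                     val_scores.mean(axis=1)):
    print(f"n={n:3d}  train={tr:.3f}  val={va:.3f}  gap={tr - va:.3f}")
```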

See https://en.m.wikipedia.org/wiki/Bias–variance_tradeoff

To answer your question: no, you should not select your model solely by comparing accuracies across random seeds.
