Reporting the best test accuracy in a research paper

accuracy, cross-validation, machine learning, random forest, reporting

I am running a random forest classifier for binary classification. I have split the dataset into training and testing sets. From the training data, I select 70% for hyperparameter tuning, using 5-fold cross-validation. I then take the hyperparameters that produce the highest mean cross-validation accuracy (let's call it validation accuracy), train the model on the full training data, and evaluate it on the testing data. This gives me the testing accuracy.
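For concreteness, the pipeline described above can be sketched as follows. This is a simplified illustration using scikit-learn on synthetic data; the dataset, parameter grid, and split sizes are assumptions, not part of the original setup (and tuning here uses the whole training set rather than a 70% subset):

```python
# Minimal sketch of the described pipeline: hold out a test set, tune a
# random forest with 5-fold CV on the training data, then score once on test.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data, purely illustrative.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Hold out the test set; it is not touched until the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Hyperparameter tuning with 5-fold cross-validation on the training data.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
    cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

# GridSearchCV refits the best configuration on the full training data
# (refit=True by default); score that model once on the held-out test set.
val_acc = grid.best_score_          # mean cross-validation accuracy
test_acc = grid.best_estimator_.score(X_test, y_test)
print(f"mean CV accuracy: {val_acc:.3f}, test accuracy: {test_acc:.3f}")
```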

My question is:

i) Should I report the best validation accuracy or the best testing accuracy?

I can set the seed (by providing random_state = xyz in the random forest classifier) to fix the samples used for bootstrapping, and choose the seed that gives the best test accuracy.

If I need to report both accuracies, can I select the seeds that give a good validation accuracy and a good testing accuracy? What if I take the best test accuracy I got and report that in the paper?

Best Answer

If I need to report both accuracies, can I select the seeds that give a good validation accuracy and a good testing accuracy? What if I take the best test accuracy I got and report that in the paper?

No. The idea of a held-out test set is that you are not allowed to look at it until your model is ready; only then do you use it to obtain the final metric, on which you take the model or leave it. If you used test-set performance to tune the model, that would be a straightforward way to overfit to the test set. Trying different seeds and picking the best result is exactly that: you are choosing the model that best fits the test set, with no guarantee whatsoever that it is the model that would generalize best. You would be cheating both yourself and your readers.
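The inflation is easy to demonstrate empirically. Below is a hedged sketch (on assumed synthetic data) that trains the same random forest under many seeds and compares the best test accuracy over seeds with the typical one; the gap between the two is precisely the optimistic bias you would introduce by seed-picking:

```python
# Train the identical model configuration under many random seeds and
# record each resulting test accuracy.
import statistics
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, purely illustrative.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

accs = []
for seed in range(20):
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    clf.fit(X_train, y_train)
    accs.append(clf.score(X_test, y_test))

# Reporting max(accs) instead of a typical value is selection on the test
# set: the maximum over seeds is never below the mean over seeds.
print(f"mean over seeds: {statistics.mean(accs):.3f}, "
      f"best seed:       {max(accs):.3f}")
```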

On the other hand, if you used something like $k$-fold cross-validation to assess the performance of the model (not to tune it), then you should report the average of the metrics, ideally accompanied by some measure of variability such as the standard deviation.
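That kind of assessment-only reporting looks like the following sketch (again on assumed synthetic data, with an illustrative model configuration):

```python
# Assess (not tune) a fixed model with 5-fold cross-validation and report
# the mean accuracy together with its standard deviation across folds.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data, purely illustrative.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")

# Report mean plus a measure of variability, not the best single fold.
print(f"accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```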