Statistical Validation of RandomForest Models

cross-validation, machine learning, random forest

I am currently working on a RandomForest-based prediction method using protein sequence data. I have generated two models: the first model (NF) uses a standard set of features, and the second model (HF) uses hybrid features. I have calculated the Matthews Correlation Coefficient (MCC) and accuracy for both, with the following results:

Model 1 (NF): Training Accuracy – 62.85%, Testing Accuracy – 56.38%, MCC – 0.1673

Model 2 (HF): Training Accuracy – 60.34%, Testing Accuracy – 61.78%, MCC – 0.1856
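(For reference, I compute MCC from the 2x2 confusion matrix of the test predictions; a minimal R sketch of the calculation, where the helper name is just illustrative:)

    # Matthews Correlation Coefficient from binary confusion-matrix counts
    mcc_from_counts <- function(tp, tn, fp, fn) {
      num <- tp * tn - fp * fn
      den <- sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
      if (den == 0) return(0)   # conventional value when a margin is zero
      num / den
    }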

The testing data is an independent dataset (i.e., not included in the training data).

Since there is a trade-off between accuracy and MCC across the two models, I am confused about their predictive power. Could you please share your thoughts on which model I should consider for further analysis? Apart from accuracy and MCC, are there any other measures I should consider for validation?

Thanks in advance.

Best Answer

I like the idea of parsimony - the fewer variables in the model, the better, unless you are driven theoretically, of course. Feature selection refers to the process of choosing which variables to use in the model (finding the best combination of variables), and there are lots of different options for it (worth a read). That said, the random forest algorithm has a built-in variable importance measure that you can generate as a starting point - but be very careful with it, because there are noted biases in this measure; see Strobl et al. in The R Journal.
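For example, assuming you are in R with the randomForest package (the object and column names here are just placeholders), a quick starting point would be:

    library(randomForest)

    # Fit with importance = TRUE so permutation importance is recorded
    rf_fit <- randomForest(label ~ ., data = train_data, importance = TRUE)

    importance(rf_fit)      # MeanDecreaseAccuracy / MeanDecreaseGini per feature
    varImpPlot(rf_fit)      # quick visual ranking

    # For the less biased, conditional importance discussed by Strobl et al.:
    # library(party)
    # cf_fit <- cforest(label ~ ., data = train_data,
    #                   controls = cforest_unbiased(ntree = 500, mtry = 5))
    # varimp(cf_fit, conditional = TRUE)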

I trust you have varied the number of variables randomly sampled at each node (mtry in R's randomForest), as well as the depth of the trees, the splitting criteria, etc.
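If you haven't, something like the following (a rough sketch using the randomForest package; the data objects are placeholders) lets you tune mtry and constrain tree size:

    library(randomForest)

    # Search over mtry using the built-in OOB-error-based helper
    tuned <- tuneRF(x = train_x, y = train_y,
                    ntreeTry = 500, stepFactor = 1.5, improve = 0.01)

    # Or fit directly, controlling tree size via nodesize / maxnodes
    rf_fit <- randomForest(x = train_x, y = train_y,
                           ntree = 1000, mtry = 8,
                           nodesize = 5, maxnodes = 64)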

On the face of it, the second model looks slightly better to me, simply because the accuracy reproduces across the training and test results. It always concerns me when the test-set accuracy is notably lower than the training accuracy, because it can mean something is wrong with the model (overfitting, for instance). I trust you have made sure that your training and test sets are balanced, at least on the dependent variable you are looking to classify. If the outcome is binary (0/1) and roughly balanced (50/50), your models are not really doing much better than chance.
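Checking the class balance is a one-liner in R (assuming the outcome column is called label, which is just a placeholder name):

    # Class proportions in the training and test sets - should be roughly similar
    prop.table(table(train_data$label))
    prop.table(table(test_data$label))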

A very important thing to look at is sensitivity (the proportion of actual positives in a binary 0/1 task that are correctly classified) and specificity (the proportion of actual negatives that are correctly classified).
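In R, caret's confusionMatrix() reports both of these (plus kappa and balanced accuracy) directly from your predictions; a minimal sketch, assuming a fitted model and a test set with a factor outcome called label:

    library(caret)

    preds <- predict(rf_fit, newdata = test_data)
    # positive = "1" assumes the positive class level is labelled "1"
    confusionMatrix(data = preds, reference = test_data$label, positive = "1")
    # Output includes Sensitivity, Specificity, Kappa, Balanced Accuracy, etc.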

If possible, I would compare this model against other machine learning algorithms such as boosted trees, support vector machines (which do reasonably well on gene data), etc.
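With caret you can train the competing algorithms under the same resampling scheme and compare them directly; a rough sketch (the method codes are standard caret model names, the rest is placeholder):

    library(caret)

    ctrl <- trainControl(method = "cv", number = 10)

    fit_rf  <- train(label ~ ., data = train_data, method = "rf",        trControl = ctrl)
    fit_gbm <- train(label ~ ., data = train_data, method = "gbm",       trControl = ctrl, verbose = FALSE)
    fit_svm <- train(label ~ ., data = train_data, method = "svmRadial", trControl = ctrl)

    # Compare resampled accuracy/kappa across the three models
    summary(resamples(list(RF = fit_rf, GBM = fit_gbm, SVM = fit_svm)))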

I am not sure what package you are using - hope that helps.

If you are using R, look up the caret package on CRAN (a really good introduction to some of the ideas here, and great for getting alternative measures of performance).

Paul D
