I like the idea of parsimony: the fewer variables in the model, the better, unless you are driven by theory of course. Feature selection refers to the process of choosing which variables to use in the model (finding the best combination of variables), and there are many different approaches worth reading about. That said, the random forest algorithm has a built-in variable importance measure that you can generate as a starting point. Be very, very careful with it, though, because it has well-documented biases - see Strobl et al. in The R Journal.
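As a minimal sketch of that starting point, here is how the importance measure can be pulled out of `randomForest` in R. The data frame `df` and its columns are simulated purely for illustration:

```r
library(randomForest)

set.seed(42)
df <- data.frame(
  y  = factor(rep(c("a", "b"), each = 50)),
  x1 = c(rnorm(50, 0), rnorm(50, 1.5)),  # informative predictor
  x2 = rnorm(100),                       # pure noise
  x3 = rnorm(100)                        # pure noise
)

fit <- randomForest(y ~ ., data = df, importance = TRUE, ntree = 500)

# type = 1 gives permutation importance (mean decrease in accuracy),
# generally less biased than the Gini-based measure (type = 2) --
# this is the bias Strobl et al. warn about.
imp <- importance(fit, type = 1)
imp[order(imp, decreasing = TRUE), , drop = FALSE]
```

With this simulated signal, the informative predictor `x1` should rank at the top; on real data, treat the ranking as a rough guide rather than a verdict.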
I trust you have varied the number of variables randomly sampled as candidates at each split (this is mtry in R), as well as the depth of the trees, the splitting criteria, and so on.
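One quick way to vary mtry is `randomForest::tuneRF`, which walks mtry up and down while the out-of-bag error keeps improving; `caret::train` with a `tuneGrid` is another option. The data here are simulated for illustration:

```r
library(randomForest)

set.seed(42)
x <- data.frame(x1 = rnorm(200), x2 = rnorm(200),
                x3 = rnorm(200), x4 = rnorm(200))
y <- factor(ifelse(x$x1 + x$x2 + rnorm(200, sd = 0.5) > 0, "1", "0"))

# Starting from mtryStart, tuneRF multiplies/divides mtry by stepFactor
# as long as the OOB error improves by at least `improve`.
res <- tuneRF(x, y, mtryStart = 2, ntreeTry = 300, stepFactor = 1.5,
              improve = 0.01, trace = FALSE, plot = FALSE)
res  # matrix of the mtry values tried and their OOB error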
In terms of appearance, the second model looks slightly better to me, simply because its accuracy is consistent between the train and test results. It always concerns me when test-set accuracy is notably lower than training accuracy, because it suggests something may be wrong with the model. I trust you have made sure that your train and test sets are balanced, at least on the dependent variable you are looking to classify. If that variable is binary (0,1), your models are not really doing much better than chance (50/50).
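A stratified split is an easy way to guarantee that balance: `caret::createDataPartition` samples within each class so the class proportions carry over into both sets. The 70/30 data below are made up for illustration:

```r
library(caret)

set.seed(42)
df <- data.frame(y = factor(rep(c("0", "1"), times = c(70, 30))),
                 x = rnorm(100))

idx   <- createDataPartition(df$y, p = 0.7, list = FALSE)
train <- df[idx, ]
test  <- df[-idx, ]

prop.table(table(train$y))  # roughly 70/30, matching the full data
prop.table(table(test$y))   # roughly 70/30 as well
```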
A very important thing to look at is the sensitivity (the proportion of true positives in a binary 0/1 task that are correctly classified) and the specificity (the proportion of true negatives that are correctly classified).
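Both come straight out of `caret::confusionMatrix`. A sketch, with a simulated model and data standing in for yours:

```r
library(caret)
library(randomForest)

set.seed(42)
df <- data.frame(
  y  = factor(rep(c("0", "1"), each = 100)),
  x1 = c(rnorm(100, 0), rnorm(100, 1.5)),
  x2 = rnorm(200)
)
idx   <- createDataPartition(df$y, p = 0.7, list = FALSE)
train <- df[idx, ]
test  <- df[-idx, ]

fit  <- randomForest(y ~ ., data = train, ntree = 300)
pred <- predict(fit, newdata = test)

# `positive` tells caret which class counts as the "positive" one
cm <- confusionMatrix(pred, test$y, positive = "1")
cm$byClass[c("Sensitivity", "Specificity")]
```

If the two numbers differ sharply, your model may be trading one class off against the other even when overall accuracy looks fine.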
If possible, I would compare this model against other machine learning algorithms, such as boosted trees or support vector machines (which perform reasonably well on gene data).
I am not sure which package you are using, but if you are using R, look up caret on CRAN (a really good introduction to some of the ideas here, and great for getting alternative measures of performance). Hope that helps.
Paul D
Best Answer
This result does not mean that you have overfitting.
First of all, CV is more reliable than a single test set -- you can have bad (or good) luck in selecting the test set, which results in a pessimistic (or optimistic) bias with respect to the true accuracy. CV effectively smooths this problem out by repeating the test-selection procedure. Worse still for the test-set approach, RF is a stochastic algorithm, so two runs with different seeds will give you different test accuracies, and the difference may be even bigger than that between CV and the test set.
Second, you may use the standard deviation of the accuracy across all CV runs to test whether:
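The per-run accuracies are easy to get hold of in caret: with repeated cross-validation, `train` stores each fold's accuracy in `$resample`, and their standard deviation gives the spread to compare the train/test gap against. A sketch on simulated data:

```r
library(caret)

set.seed(42)
df <- data.frame(
  y  = factor(rep(c("0", "1"), each = 100)),
  x1 = c(rnorm(100, 0), rnorm(100, 1.5)),
  x2 = rnorm(200)
)

# 10-fold CV repeated 3 times = 30 resampled accuracy estimates
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
fit  <- train(y ~ ., data = df, method = "rf", trControl = ctrl,
              tuneGrid = data.frame(mtry = 2))

fit$resample$Accuracy        # accuracy of each of the 30 folds
mean(fit$resample$Accuracy)  # the CV estimate itself
sd(fit$resample$Accuracy)    # its spread across CV runs
```

If the gap between CV accuracy and test-set accuracy is within a standard deviation or two of this spread, there is little reason to suspect overfitting.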