Random Forest is a bagging algorithm rather than a boosting algorithm.
They are two opposite ways to achieve low error.
We know that error can be decomposed into bias and variance. An overly complex model has low bias but large variance, while an overly simple model has low variance but large bias; both lead to high error, but for two different reasons. As a result, two different ways to solve the problem came to people's minds (maybe Breiman's and others'): variance reduction for a complex model, or bias reduction for a simple model, which refer to random forest and boosting respectively.
Random forest reduces the variance of a large number of "complex" models with low bias. We can see that the component elements are not "weak" models but overly complex models. If you read about the algorithm, the underlying trees are grown "somewhat" as large as possible. The underlying trees are independent, parallel models. Additionally, random variable selection is introduced into them to make them even more independent, which makes random forest perform better than ordinary bagging and earns it the name "random".
Boosting, by contrast, reduces the bias of a large number of "small" models with low variance. They are "weak" models, as you quoted. The underlying elements are somewhat like a "chain" or "nested" iterative model for the bias at each level. So they are not independent, parallel models; instead, each model is built on top of all the former small models by weighting. That is the so-called "boosting", one model at a time.
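To make the contrast concrete, here is a minimal sketch (my own illustration) in scikit-learn: a random forest averaging deep, low-bias trees versus a boosting ensemble stacking shallow, low-variance trees.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)

# Bagging: deep (low-bias) trees grown independently in parallel;
# averaging their predictions reduces variance.
rf = RandomForestRegressor(n_estimators=300, max_depth=None,
                           max_features="sqrt", random_state=0)

# Boosting: shallow (low-variance) "weak" trees fitted sequentially,
# each one correcting the bias left by the ensemble built so far.
gb = GradientBoostingRegressor(n_estimators=300, max_depth=2,
                               learning_rate=0.05, random_state=0)

for model in (rf, gb):
    print(type(model).__name__, cross_val_score(model, X, y, cv=5).mean())
```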
Breiman's papers and books discuss trees, random forest, and boosting quite a lot. They help you understand the principles behind the algorithms.
I guess you have tunnel-visioned on tuning too many non-useful hyperparameters, because an easy-to-use grid-search functionality allowed you to do so.
Notice that all your explained variances differ only in the fourth digit. You have found what appears to be a negligibly better model setting. But even of that you cannot be sure, because:
- the RF model is non-deterministic, so performance will vary slightly
- a CV only estimates future model performance with limited precision
- n-fold CV is not perfectly reproducible and should be repeated to increase precision
- grid tuning should be performed with nested CV, but I don't think that is your problem here
Only "grid-tune" max_features. It has only 6 possoble values. You can run each 5 times and plot it. Check if some setting is repetitively better, probably you find anything from 2-4 perform fine. Max_depth is by default unlimited and that is optimal as long data is not very noisy. You set it to 25, which in practice is unlimited because already $2^{15}$=32000 and you "only" have 26000 samples. Changing these other hyper parameter will only give you shorter training times(useful) and/or more robust models. Thumb-rule: as explained variance is way above 50%, you do not need to make your model more robust by limiting depth of trees (max_depth, min_samples_split) to e.g. 3. Max_depth 15 is quite deep, and probably plenty deep enough, just as 2000 are trees enough. So raising and lowering number of trees and depth within the quite fine range does not change anything, and it will be really hard and non-rewarding to find the true best setting.
So you have performed a grid search and learned that RF will have the same performance in the parameter space you have tested.
If you obtain a test set from a different source, you should expect a drop in performance. Your CV only estimates the model performance if the future test set is drawn from exactly the same population.
With 1400 test samples, sampling error alone could swing the measured performance by +/- 0.03, I guess.
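As a rough back-of-the-envelope check (my own, assuming the metric fluctuates like a sample mean): the sampling fluctuation scales as $1/\sqrt{n}$, and $1/\sqrt{1400} \approx 0.027$, which is roughly that +/- 0.03.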
If you swapped to e.g. boosting algorithms, grid-tuning multiple parameters would be a more rewarding tool.
To improve your model, maybe you can refine your features. Look at variable importance to see which features work well (see the sketch below). Could you maybe derive new features with an even higher variable importance? Since your explained variance is quite high (low noise), you may benefit from swapping to xgboost. You may also spend time wondering whether this chase for better model performance on some target by some metric (explained variance) is useful for your specific purpose. Maybe you don't need the model to be that accurate when predicting large values, so you could e.g. log-transform your target. Maybe you only want to rank your predictions, so explained variance could be replaced with the Spearman rank coefficient.
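A minimal sketch of inspecting variable importance, assuming `model` is your fitted RandomForestRegressor and `feature_names` lists your columns (both names are hypothetical):

```python
import numpy as np

# Impurity-based importances from the fitted forest, largest first.
importances = model.feature_importances_
for idx in np.argsort(importances)[::-1]:
    print(f"{feature_names[idx]}: {importances[idx]:.3f}")
```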
Happy modelling :)
Best Answer
You typically plot a confusion matrix of your test set (showing recall and precision) and report an F1 score on them.
If you have the correct labels of your test set in `y_test` and your predicted labels in `pred`, then your F1 score is:
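A minimal sketch, assuming scikit-learn's `f1_score`; `average=None` returns one score per class rather than a single aggregate:

```python
from sklearn.metrics import f1_score

# One F1 score per class; use average="macro" or "weighted"
# if you want a single summary number instead.
scores = f1_score(y_test, pred, average=None)
print(scores)
```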
These are the scores you likely want to plot.
You can also use accuracy:
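For instance, assuming scikit-learn's `accuracy_score`:

```python
from sklearn.metrics import accuracy_score

# Single overall fraction of correct predictions on the test set.
print(accuracy_score(y_test, pred))
```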
However, you get more insight from a confusion matrix.
You can plot a confusion matrix like so, assuming you have the full set of your labels in `categories`: