Solved – Classification score for Random Forest

classification, python, random-forest, validation

I'm learning about Decision Trees and Random Forests, but there is something I don't really understand.

I have a training set and a cross-validation set. I need to train different Random Forests, each with a different number of trees. For each forest, I need to plot the classification score for the training set and the cross-validation set (a validation curve).

But what is the classification score for a Random Forest? Do I need to count the number of misclassifications? And how do I plot this?

PS: I use the Python scikit-learn package.

Best Answer

You typically plot a confusion matrix for your test set (which shows recall and precision per class) and report an F1 score on it.
If the correct labels of your test set are in y_test and your predicted labels are in pred, then your F1 score is:

from sklearn import metrics
# testing score (macro-averaged F1 covers both binary and multi-class labels)
score = metrics.f1_score(y_test, pred, average='macro')
# training score
score_train = metrics.f1_score(y_train, pred_train, average='macro')

These are the scores you likely want to plot.
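To address the plotting part of the question directly, here is a minimal sketch of a validation curve. It assumes X_train/y_train and X_val/y_val are your training and cross-validation splits (those variable names, the tree counts, and the random_state are illustrative). It trains one RandomForestClassifier per value of n_estimators and plots the training and cross-validation F1 scores:

import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

tree_counts = [10, 25, 50, 100, 200]   # values of n_estimators to try
train_scores, val_scores = [], []

for n in tree_counts:
    clf = RandomForestClassifier(n_estimators=n, random_state=0)
    clf.fit(X_train, y_train)
    # macro-averaged F1 works for binary and multi-class labels alike
    train_scores.append(f1_score(y_train, clf.predict(X_train), average='macro'))
    val_scores.append(f1_score(y_val, clf.predict(X_val), average='macro'))

plt.plot(tree_counts, train_scores, marker='o', label='training')
plt.plot(tree_counts, val_scores, marker='o', label='cross-validation')
plt.xlabel('number of trees (n_estimators)')
plt.ylabel('F1 score (macro)')
plt.legend()
plt.savefig('rf_validation_curve.png')

If you would rather score over cross-validation folds than a fixed validation set, sklearn.model_selection.validation_curve can compute the train/validation scores over a parameter range for you.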

You can also use accuracy:

pscore = metrics.accuracy_score(y_test, pred)
pscore_train = metrics.accuracy_score(y_train, pred_train)
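As an aside, scikit-learn classifiers also expose a score() method that returns mean accuracy directly, so you can get the same numbers without an explicit predict call. A minimal sketch, assuming clf is your fitted classifier and X_test/X_train are the matching feature matrices:

# clf.score(X, y) is mean accuracy, i.e. the same value as
# metrics.accuracy_score(y, clf.predict(X))
pscore = clf.score(X_test, y_test)
pscore_train = clf.score(X_train, y_train)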

However, you get more insight from a confusion matrix.

You can plot a confusion matrix like so, assuming you have the full set of labels in categories:

import numpy as np
import matplotlib.pyplot as pl
# get overall accuracy and F1 score to print at top of plot
pscore = metrics.accuracy_score(y_test, pred)
score = metrics.f1_score(y_test, pred, average='macro')
# get size of the full label set
dur = len(categories)
print "Building testing confusion matrix..."
# initialize score matrices
trueScores = np.zeros(shape=(dur,dur))
predScores = np.zeros(shape=(dur,dur))
# populate totals
for i in range(len(y_test)):
  trueIdx = y_test[i]
  predIdx = pred[i]
  trueScores[trueIdx,trueIdx] += 1
  predScores[trueIdx,predIdx] += 1
# create %-based results
trueSums = np.sum(trueScores,axis=0)
conf = np.zeros(shape=predScores.shape)
for i in range(len(predScores)):
  for j in range(dur):
    # skip labels with no actual items in the test set (avoids division by zero)
    if trueSums[i] > 0:
      conf[i,j] = predScores[i,j] / trueSums[i]
# plot the confusion matrix
hq = pl.figure(figsize=(15,15))
aq = hq.add_subplot(1,1,1)
aq.set_aspect(1)
res = aq.imshow(conf,cmap='Greens',interpolation='nearest',vmin=-0.05,vmax=1.)
width = len(conf)
height = len(conf[0])
# label each grid cell with the misclassification rates
for w in range(width):
  for h in range(height):
      pval = conf[w][h]
      # white text on dark cells, black text on light cells
      c = 'k'
      if pval > 0.5: c = 'w'
      if pval > 0.001:
        if w == h:
          aq.annotate("{0:1.1f}%\n{1:1.0f}/{2:1.0f}".format(pval*100.,predScores[w][h],trueSums[w]), xy=(h, w), 
                  horizontalalignment='center',
                  verticalalignment='center',color=c,size=10)
        else:
          aq.annotate("{0:1.1f}%\n{1:1.0f}".format(pval*100.,predScores[w][h]), xy=(h, w), 
                  horizontalalignment='center',
                  verticalalignment='center',color=c,size=10)
# label the axes
pl.xticks(range(width), categories[:width],rotation=90,size=10)
pl.yticks(range(height), categories[:height],size=10)
# add a title with the F1 score and accuracy (lbl is assumed to be a short model-name string, e.g. "RandomForest", defined earlier)
aq.set_title(lbl + " Prediction, Test Set (f1: "+"{0:1.3f}".format(score)+', accuracy: '+'{0:2.1f}%'.format(100*pscore)+", " + str(len(y_test)) + " items)",fontname='Arial',size=10,color='k')
aq.set_ylabel("Actual",fontname='Arial',size=10,color='k')
aq.set_xlabel("Predicted",fontname='Arial',size=10,color='k')
pl.grid(True,axis='both')
# save it
pl.savefig("pred.conf.test.png")

and you end up with something like this (the example below comes from a LiblinearSVC model). Darker green means better performance, and a solid diagonal indicates good performance overall. Labels missing from the test set show up as empty rows. The plot also gives you a good visual of which labels are being confused with which. For example, take a look at the "Music" column. The diagonal shows that 75.7% of the "Music" items were classified correctly. Travel down the column and you can see what the misclassified items really were. There was clearly some confusion with music-related labels such as "Tuba", "Viola", and "Violin", suggesting that "Music" may be too general a label to predict when more specific ones are available.

Confusion Matrix example
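If you are on scikit-learn 1.0 or newer, a similar plot can be produced with far less code using the built-in ConfusionMatrixDisplay. A minimal sketch, assuming (as in the loop above) that y_test and pred hold integer-encoded labels and categories holds the display names; normalize='true' gives the same per-actual-class percentages as the trueSums division:

from sklearn.metrics import ConfusionMatrixDisplay

# rows are normalised by the number of actual items in each class,
# matching the trueSums normalisation in the manual version
disp = ConfusionMatrixDisplay.from_predictions(
    y_test, pred,
    labels=list(range(len(categories))),   # integer-encoded label values
    display_labels=categories,             # human-readable names
    normalize='true',
    cmap='Greens',
    xticks_rotation=90,
)
disp.ax_.set_title("Prediction, Test Set")
disp.figure_.savefig("pred.conf.test.builtin.png")

The manual loop above is still handy if you want the "percent plus count/total" annotation in each cell, which the built-in display does not draw.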