Solved – ROC Curves for different classifiers

classification, random forest, roc, svm

I am trying to compare the classification performance of several classifiers. So far I am using SVM, random forest, AdaBoost.M1, and naive Bayes. 70% of the data is used for training (and a ROC curve is plotted on it), while the remaining 30% is used for testing (with its own ROC curve). Being fairly new to machine learning, and to ROC curves in particular, I have the following questions:

  1. Can classifiers be compared using their ROC curves? If so, should the comparison be based on AUC, and should I also discuss accuracy, F-measure, and so on?

  2. Should I evaluate using the ROC curve computed on the training data or the one computed on the test data? Which is more meaningful, and why?

  3. In my experiments, the test ROC curve differs from the training one, especially for the SVM classifier. The SVM's training ROC curve is very good compared with its test curve (see below). Is this normal? If so, how should I analyze and present it? Pointers would be really helpful. The first figure is based on the training data, where the SVM's curve looks very good; the second is based on the test data, where the SVM does much worse and is outperformed by the random forest. Should I conclude that the random forest is doing a better job?

[Figure 1: Training ROC curves - see how good the SVM (rbf) curve is]

[Figure 2: Test ROC curves - note the SVM curve in particular]
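
For reference, the setup described above corresponds roughly to the following sketch; scikit-learn estimators and a synthetic dataset are stand-ins for the actual models and data, which are not shown in the question:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in for the real dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 70% train / 30% test, as described in the question
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

classifiers = {
    "SVM (rbf)": SVC(kernel="rbf", probability=True, random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}

# Fit each model on the training split and plot its ROC curve on the test split
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    scores = clf.predict_proba(X_test)[:, 1]       # positive-class probabilities
    fpr, tpr, _ = roc_curve(y_test, scores)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y_test, scores):.2f})")

plt.plot([0, 1], [0, 1], "k--")                    # chance diagonal
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```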

Best Answer

  1. Yes, you can compare the AUCs of different classifiers. Including additional measures such as accuracy, F-score, precision, and recall in the analysis can be useful too (the first sketch after this list shows how these can be reported side by side).

  2. The test measures are the better basis: they show the out-of-sample performance of your system, which is closer to the "real life" scenario, i.e. performance on previously unseen data. Training measures can be informative too, but they depend on the training algorithm itself: one algorithm might use regularization that deliberately lowers training accuracy while another does not, yet the two might generalize equally well, which I think is what is happening in your first figure (the second sketch below overlays the training and test ROC curves of a single model to visualize that gap).

  3. Yes, this can happen; I would simply ignore the training curves. An improvement would be k-fold cross-validation instead of a single 70% train / 30% test split, so that the whole dataset is used to measure performance (the third sketch below shows cross-validated AUC).
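
For point 1, a minimal sketch of how AUC can be reported alongside accuracy, F-score, precision, and recall on the held-out test set; the synthetic data and the particular scikit-learn models are assumptions standing in for the question's actual setup:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

models = {
    "SVM (rbf)": SVC(kernel="rbf", probability=True, random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
}

for name, clf in models.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)                   # hard labels (default threshold)
    y_score = clf.predict_proba(X_test)[:, 1]      # scores used for the ROC/AUC
    print(f"{name:13s}"
          f" AUC={roc_auc_score(y_test, y_score):.3f}"
          f" acc={accuracy_score(y_test, y_pred):.3f}"
          f" F1={f1_score(y_test, y_pred):.3f}"
          f" prec={precision_score(y_test, y_pred):.3f}"
          f" rec={recall_score(y_test, y_pred):.3f}")
```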
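
For point 2, a sketch that overlays the training and test ROC curves of a single model (an RBF SVM on synthetic data here, as an assumption) to make the kind of gap shown in the question's figures visible:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

svm = SVC(kernel="rbf", probability=True, random_state=0).fit(X_train, y_train)

# Same model, two evaluations: in-sample (train) vs out-of-sample (test)
for split, X_part, y_part in [("train", X_train, y_train),
                              ("test", X_test, y_test)]:
    scores = svm.predict_proba(X_part)[:, 1]
    fpr, tpr, _ = roc_curve(y_part, scores)
    plt.plot(fpr, tpr,
             label=f"SVM {split} (AUC = {roc_auc_score(y_part, scores):.2f})")

plt.plot([0, 1], [0, 1], "k--")                    # chance diagonal
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```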
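
For point 3, a sketch of cross-validated AUC (5-fold stratified CV here, an arbitrary choice) so that the whole dataset contributes to the performance estimate instead of a single 70/30 split:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

classifiers = {
    "SVM (rbf)": SVC(kernel="rbf", probability=True, random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, clf in classifiers.items():
    # Each model is refit on every fold; AUC is measured on the held-out fold
    aucs = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
    print(f"{name:13s} mean AUC = {aucs.mean():.3f} (+/- {aucs.std():.3f})")
```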