Solved – Evaluation of classifiers: learning curves vs ROC curves

accuracy, classification, machine learning, roc

I would like to compare 2 different classifiers for a multiclass text classification problem with a large training dataset. I am unsure whether I should use ROC curves or learning curves to compare the 2 classifiers.

On one hand, learning curves are useful for deciding the size of the training dataset, since you can find the size at which the classifier stops learning (and maybe degrades). So the best classifier in this case might be the one reaching the highest accuracy with the smallest training set.
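For concreteness, here is a minimal sketch of computing a learning curve with scikit-learn. The synthetic dataset and logistic-regression classifier are placeholders standing in for your real document-term matrix and models:

```python
# A minimal learning-curve sketch, assuming scikit-learn; make_classification
# is a stand-in for your real multiclass text data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=50,
                           n_informative=10, n_classes=3, random_state=0)

# Evaluate the classifier at 10 increasing training-set sizes with 5-fold CV.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=5, scoring="accuracy")

# The size at which mean validation accuracy flattens out is roughly
# where the classifier stops benefiting from more data.
for n, acc in zip(sizes, val_scores.mean(axis=1)):
    print(f"n={n:5d}  val accuracy={acc:.3f}")
```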

On the other hand, ROC curves let you find a point with the right trade-off between sensitivity and specificity. The best classifier in this case is the one closest to the top-left corner, achieving the highest TPR for a given FPR.
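Here is a minimal sketch of that criterion, again assuming scikit-learn and a synthetic dataset. Note that ROC analysis is inherently binary; for a multiclass problem you would repeat this one-vs-rest per class:

```python
# A minimal ROC sketch for a binary task, assuming scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Confidence scores for the positive class.
scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)

# Threshold whose (FPR, TPR) point lies closest to the top-left corner (0, 1).
best = np.argmin(np.hypot(fpr - 0.0, tpr - 1.0))
print(f"AUC={roc_auc_score(y_te, scores):.3f}  "
      f"threshold={thresholds[best]:.3f}  "
      f"TPR={tpr[best]:.3f}  FPR={fpr[best]:.3f}")
```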

Should I use both evaluation methods? Is it possible for a classifier with a better learning curve to have a worse ROC curve, and vice versa?

Best Answer

A learning curve is only a diagnostic tool: it tells you how fast your model learns and whether your analysis is stuck in a quirky regime of too-small training sets (or too-small ensembles, if applicable). The only part of this plot that is interesting for model assessment is its end, i.e. the final performance -- and that does not need a plot to be reported.
Selecting a model based on a learning curve, as you sketched in your question, is a rather poor idea, because you are likely to select the model that is best at overfitting too small a sample.

About ROCs... a ROC curve is a method for assessing binary models that produce a confidence score that an object belongs to one class, and possibly also for finding the best threshold to convert such scores into an actual classifier.
What you describe is rather an idea to plot your classifiers' performance as a scatterplot of TPR/FPR in ROC space and use the closest-to-top-left-corner criterion to select the one that best balances false alarms and misses -- this particular aim can be achieved more elegantly by simply selecting the model with the best F-score (the harmonic mean of precision and recall).
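A minimal sketch of that F-score-based selection, assuming scikit-learn; the two classifiers and the macro averaging are illustrative choices, with macro averaging being one common option for multiclass problems:

```python
# Compare two classifiers by macro F1 on a held-out set (sketch, assuming
# scikit-learn; the synthetic data stands in for your real text features).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=50, n_informative=10,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, clf in [("logreg", LogisticRegression(max_iter=1000)),
                  ("forest", RandomForestClassifier(random_state=0))]:
    pred = clf.fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: macro F1 = {f1_score(y_te, pred, average='macro'):.3f}")
```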
