Solved – Comparing 2 classifiers with unlimited training data

classification, dataset, machine learning, unsupervised learning

I would like to compare two text classifiers, C1 and C2, which can be trained with "unlimited" noisy training data: I can use as much data as I want for training, but the data are very noisy. I am considering the two options below. Which one would be better?

1) Set up a dataset size interval 0–z, z being a very large number. Identify the dataset size x at which C1 scores its maximum accuracy (e.g. using a learning curve). Then train C2 with that dataset size x.

2) Identify the dataset size y at which C1 stops learning (e.g. a very small slope of the learning curve). Then train C2 with that dataset size y.

Please note that I am not asking about methods for comparing classifiers, but about how to decide the training dataset size at which the two classifiers should be compared.
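
For concreteness, here is a minimal numpy sketch of the two options, assuming the learning curve of C1 has already been measured; the sizes, accuracies, and slope threshold below are made-up numbers for illustration only.

```python
import numpy as np

# Hypothetical measured learning curve: accuracy at each training set size.
# Replace with the sizes/accuracies from your own experiments.
sizes = np.array([1_000, 2_000, 5_000, 10_000, 20_000, 50_000, 100_000])
accuracy = np.array([0.61, 0.66, 0.71, 0.74, 0.755, 0.752, 0.754])

# Option 1: size x at which C1 reaches its maximum measured accuracy.
x = sizes[np.argmax(accuracy)]

# Option 2: size y at which the curve flattens, i.e. the first size where the
# slope (accuracy gain per additional training example) drops below a threshold.
slopes = np.abs(np.diff(accuracy) / np.diff(sizes))
threshold = 1e-6          # accuracy gain per extra example; tune to your data
flat = np.where(slopes < threshold)[0]
y = sizes[flat[0] + 1] if len(flat) else sizes[-1]

print(f"Option 1 (max accuracy): train C2 with {x} examples")
print(f"Option 2 (plateau):      train C2 with {y} examples")
```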

Edit:
Below is the learning curve for one classifier. The X axis is the training dataset size, the Y axis is accuracy. For the purpose of comparing two classifiers, should I pick the size at which the highest accuracy was found (red), or the size at which the accuracy becomes (more or less) stable (green)?

[Learning curve plot: accuracy (Y axis) vs. training dataset size (X axis), with the maximum-accuracy point marked in red and the plateau region in green.]

Edit 2: @cbeleites @Dikran Marsupial I am sorry for the lack of information:

  • I have 3 classes in my problem.
  • The above learning curve was generated by increasing the training
    dataset size, starting from 1,000 examples/class (3,000 in total) up to
    140,000 examples/class (420,000 in total); at each step, 1,000 new
    examples/class (3,000 in total) are added, and the retrained model is
    evaluated at every iteration on the same test dataset, composed of
    350 examples/class (1,050 examples in total). A sketch of this loop is
    given after the list.
  • Training dataset samples were automatically labeled; test samples
    were manually labeled.
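
A minimal sketch of that incremental training loop, using scikit-learn. The data-access helpers `get_noisy_train_chunk` and `load_test_set` (and the module `my_data`) are hypothetical placeholders for however the automatically labeled training chunks and the manually labeled test set are obtained, and the TF-IDF + linear model is just an example classifier, not necessarily the one used here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# Hypothetical helpers -- replace with your own data access code.
# get_noisy_train_chunk(step) should return 1,000 automatically labeled
# texts per class (3,000 in total) for the given step; the manually labeled
# test set is fixed: 350 texts per class (1,050 in total).
from my_data import get_noisy_train_chunk, load_test_set  # assumed module

test_texts, test_labels = load_test_set()

train_texts, train_labels = [], []
curve = []  # (training set size, test accuracy) pairs

for step in range(140):                       # 140 steps of 3,000 examples each
    texts, labels = get_noisy_train_chunk(step)
    train_texts.extend(texts)
    train_labels.extend(labels)

    model = make_pipeline(TfidfVectorizer(), SGDClassifier())
    model.fit(train_texts, train_labels)      # retrain from scratch each step

    acc = accuracy_score(test_labels, model.predict(test_texts))
    curve.append((len(train_texts), acc))

sizes, accuracies = map(np.array, zip(*curve))
```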

Best Answer

Do I understand you correctly that you want to measure whether C1 is a faster/slower learner than C2?

With unlimited training data, I'd definitely construct (measure) the learning curves. That allows you to discuss both of the questions you pose.

As Dikran already hints, the learning curve has a variance as well as a bias component: training on smaller data sets gives systematically worse models, but there is also higher variance between different models trained with small $n_{train}$, which I would also include in a discussion of which classifier is better.
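
A minimal sketch of how one might measure that variance component, by training several models per training set size on different random subsamples; the module `my_experiment` and the helper `train_and_score` are hypothetical placeholders for your own training and evaluation code, and the sizes and repeat count are examples only.

```python
import numpy as np

# train_and_score(texts, labels) is a hypothetical stand-in for fitting a
# classifier on the given subsample and scoring it on the fixed test set.
from my_experiment import train_and_score, all_texts, all_labels  # assumed

rng = np.random.default_rng(0)
n_repeats = 10

for n_train in (3_000, 30_000, 300_000):
    scores = []
    for _ in range(n_repeats):
        # draw a fresh random training subsample of the given size
        idx = rng.choice(len(all_texts), size=n_train, replace=False)
        scores.append(train_and_score([all_texts[i] for i in idx],
                                      [all_labels[i] for i in idx]))
    print(f"n_train={n_train}: accuracy {np.mean(scores):.3f} "
          f"+/- {np.std(scores, ddof=1):.3f} over {n_repeats} repeats")
```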

Make sure you test with a large enough test sample size: proportions of counts (such as classifier accuracy) suffer from high variance, which can mess up your conclusions. As you have an unlimited data source, you are in the very comfortable situation that it is actually possible to measure the learning curves without too much additional testing error on them.
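
To illustrate that binomial variance, here is a small sketch (assuming statsmodels is available) that computes Wilson confidence intervals for the same observed accuracy at different test set sizes; the accuracy of 0.80 and the test sizes are made-up numbers.

```python
from statsmodels.stats.proportion import proportion_confint

# The same observed accuracy of 0.80 gives a much wider confidence interval
# with a small test set than with a large one.
for n_test in (1_050, 10_000, 100_000):
    correct = int(0.80 * n_test)
    lo, hi = proportion_confint(correct, n_test, alpha=0.05, method="wilson")
    print(f"n_test={n_test:6d}: 95% CI for accuracy 0.80 = [{lo:.3f}, {hi:.3f}]")
```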

I just got a paper accepted that summarizes some thoughts and findings about Sample Size Planning for Classification Models. The DOI does not yet work, but anyway here is the accepted manuscript at arXiv.

Of course, computation time is now a consideration. Some thoughts on this:

  • how much computer time you are willing to spend will depend on what you need your comparison for.

  • if it's just about finding a practically working set-up, I'd also be pragmatic about the time it takes to reach a decision.

  • if it's a scientific question, I'd quote my old supervisor: "Computer time is not a scientific argument". This is meant in the sense that saving a couple of days or even a few weeks of server time by compromising the conclusions you can draw is not a good idea*.
    All the more so as better calculations do not necessarily require more of your time here: setting up the calculations takes roughly the same effort whether you use a fine grid of training sample sizes or a coarse one, or whether you measure the variance with 1,000 iterations or just 10. This means that you can run the calculations in an order that gives you a "sneak preview" of the results quite fast, then sketch the results, and pull in the fine-grained numbers at the end.

(*) I may add that I come from an experimental field where you can easily spend months or years on sample collection and weeks or months on measurements, which do not run by themselves the way a simulation runs on a server.


Update about bootstrapping / cross validation

It is certainly possible to use (iterated/repeated) cross validation or out-of-bootstrap testing to measure the learning curve. Using resampling schemes instead of a proper independent test set is sensible if you are in a small-sample-size situation, i.e. you do not have enough independent samples both to train a good classifier and to properly measure its performance. According to the question, this is not the case here.
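
Should you ever be in such a small-sample situation, here is a minimal sketch of how a learning curve could be measured with repeated random splits, using scikit-learn's learning_curve; the 20 newsgroups categories merely stand in for your three text classes, and the TF-IDF + logistic regression pipeline is an example classifier, not an assumption about C1 or C2.

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups   # stand-in text data
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, learning_curve
from sklearn.pipeline import make_pipeline

# Three-class stand-in dataset; replace with your own texts and labels.
data = fetch_20newsgroups(subset="train",
                          categories=["sci.med", "sci.space", "rec.autos"])
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Repeated random splits instead of a fixed independent test set.
train_sizes, _, test_scores = learning_curve(
    model, data.data, data.target,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=ShuffleSplit(n_splits=20, test_size=0.2, random_state=0),
)
for n, scores in zip(train_sizes, test_scores):
    print(f"n_train={n}: accuracy {scores.mean():.3f} "
          f"+/- {scores.std(ddof=1):.3f} over the splits")
```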

Data-driven model optimization

One more general point: choosing a "working point" (i.e. the training sample size here) from the learning curve is a data-driven decision. This means that you need another independent validation of the "final" model (trained with that sample size) on yet another independent test set. However, if the test data used to measure the learning curve were independent and of huge (really large) sample size, then the risk of overfitting to that test set is minute. That is, if you do find a drop in performance on the final test data, it indicates either a test sample size that was too small for determining the learning curve, or a problem in your data analysis set-up (data not independent, training data leaking into test data).


Update 2: limited test sample size
is a real problem. Comparing many classifiers (each $n_{train}$ you evaluate ultimately leads to one classifier!) is a multiple-testing problem from a statistics point of view. That means that judging all of them by the same test set "skims" the variance uncertainty of the testing, which leads to overfitting.
(This is just another way of expressing the danger of cherry-picking that Dikran commented about.)

You really need to reserve an independent test set for the final evaluation if you want to be able to state the accuracy of the finally chosen model.
While it is hard to overfit to a test set of millions of instances, it is much easier to overfit to 350 samples per class.

Therefore, the paper I linked above may be of more interest to you than I initially thought: it also shows how to calculate how many test samples you need to show e.g. the superiority of one classifier (with fixed hyperparameters) over another. As you can test all models with the same test set, you may be lucky and be able to somewhat reduce the required test sample size by doing paired tests. For a paired comparison of two classifiers, the McNemar test would be a keyword.
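
A minimal sketch of an exact McNemar test for such a paired comparison, assuming you have the predicted and true labels of both classifiers on the same test set; the predictions below are randomly generated purely for illustration.

```python
import numpy as np
from scipy.stats import binomtest

def mcnemar_exact(pred1, pred2, truth):
    """Exact McNemar test: are the two classifiers' error rates different?"""
    correct1 = pred1 == truth
    correct2 = pred2 == truth
    b = int(np.sum(correct1 & ~correct2))   # only classifier 1 correct
    c = int(np.sum(~correct1 & correct2))   # only classifier 2 correct
    # Under H0 (equal error rates), each discordant case is a fair coin flip.
    return binomtest(b, b + c, 0.5).pvalue

# Made-up predictions for a 3-class problem with 1,050 test instances.
rng = np.random.default_rng(0)
truth = rng.integers(0, 3, size=1050)
pred1 = np.where(rng.random(1050) < 0.80, truth, rng.integers(0, 3, size=1050))
pred2 = np.where(rng.random(1050) < 0.78, truth, rng.integers(0, 3, size=1050))
print(f"McNemar exact p-value: {mcnemar_exact(pred1, pred2, truth):.3f}")
```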