Classification Accuracy – Comparing the Impact of Training Data Size with a Fixed Test Set

accuracy, classification, dataset, train

I am training a classifier using BERT and want to check how the accuracy changes as the training data size increases. So far I have 1k annotated samples; I measured the accuracy for different subset sizes of this set (200, 400, 600, 800, 1000), splitting each subset into training and test data with an 80:20 ratio.
The problem that occurred to me is that I was always using different test samples to assess the accuracy. However, if I understand correctly, the better approach would be to keep a constant test set across all training subset sizes.
My questions now:

Is this thinking correct?
If yes, would I then choose 20% of the whole dataset (e.g. 1000*0.2 = 200) as the test set for all 5 training sizes (200, 400, 600, 800, 1000) when reporting accuracy? (A code sketch of this setup follows below.)
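
Here is a minimal sketch of that setup, assuming synthetic data and a plain logistic-regression classifier as stand-ins for the annotated corpus and the BERT fine-tuning step (neither is part of the original question). Note that once 200 samples are held out as a fixed test set, the usable training sizes top out at 800:

```python
# Hold out one fixed 20% test set, then train on increasing subsets of the rest.
# Synthetic data and LogisticRegression are placeholders for the real corpus
# and the BERT classifier (assumptions for illustration only).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# One fixed, stratified 200-sample test set, chosen up front.
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Train on increasing subsets of the remaining 800 samples,
# but always evaluate on the same 200-sample test set.
for n_train in (160, 320, 480, 640, 800):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train_full[:n_train], y_train_full[:n_train])
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"train size {n_train:4d}: accuracy = {acc:.3f}")
```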

Best Answer

There are two important things to consider here.

First, unless you have a very high signal-to-noise ratio, your sample size is too small for reliable use of split-sample validation. See Frank Harrell's blog post specifically on that subject, where he suggests that 20,000 or more cases are needed for that approach.

The classic train/test split implicitly assumes a single held-out test sample, so that you evaluate the performance of one particular model developed with a separate training (and perhaps validation) set. In an approach such as the one you describe, by contrast, you are evaluating the performance of multiple models, each trained and tested on different sets. Even putting the small-sample problem aside, it's not clear what to report in that case: which of those models are you defining as the model to report?

Repeated cross-validation and bootstrapping are better approaches for a data sample of 1000. Strictly, those methods evaluate the modeling process rather than a specific model: you report the model fit on the entire sample, but estimate how well that type of modeling would perform if you repeatedly applied it to new data samples.
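
As a rough illustration of the repeated cross-validation idea, here is a sketch using scikit-learn's RepeatedStratifiedKFold; the synthetic data and logistic regression are again placeholders for the real corpus and BERT model:

```python
# Repeated 5-fold cross-validation on the full sample of 1000 estimates how
# the *modeling process* performs across many different splits; the model you
# actually report is the one fit on all 1000 samples at the end.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(f"mean accuracy {scores.mean():.3f} +/- {scores.std():.3f}")

# The reported model: fit once on the entire sample.
final_model = LogisticRegression(max_iter=1000).fit(X, y)
```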

Second, accuracy is not a good measure of model performance. Accuracy is typically evaluated at a probability cutoff of 0.5 (often a hidden assumption of the software), which is only appropriate if false-positive and false-negative classifications have the same costs. This site has many pages devoted to the inadequacies of accuracy and the superiority of strictly proper scoring rules. Once you have a well-calibrated probability model, you can apply it more precisely for particular classification cost tradeoffs.
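
To make the scoring-rule point concrete, here is a short sketch comparing accuracy at the hidden 0.5 cutoff with the Brier score and log loss computed from predicted probabilities (again with placeholder data and model, not the asker's actual setup):

```python
# Contrast accuracy (thresholded at 0.5) with strictly proper scoring rules
# (Brier score, log loss) evaluated on held-out predicted probabilities.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, brier_score_loss, log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # predicted probability of the positive class

print("accuracy   ", accuracy_score(y_te, (proba >= 0.5).astype(int)))  # hidden 0.5 cutoff
print("Brier score", brier_score_loss(y_te, proba))                     # proper scoring rule
print("log loss   ", log_loss(y_te, proba))                             # proper scoring rule
```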
