Solved – Increasing the sample size does not improve classification performance

classification, data mining, machine learning, svm

I am training an SVM classifier on a given document collection. I started by training on 500 documents, then added another 500, and so on. In other words, I have three training sets of 500, 1000, and 1500 documents, where each smaller set is a subset of the next larger one. I validate each model against the same test set.

Sample size   Precision   Recall    Accuracy   AUC
500           79.62%      67.49%    77.65%     0.854
1000          82.49%      77.94%    82.67%     0.890
1500          81.64%      78.08%    82.28%     0.888

Performance is best with the 1000-document training set. It looks as if the extra 500 samples added to form the 1500-document set actually harm the model. How can I explain this observation?
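
For concreteness, here is a minimal sketch of my setup, assuming scikit-learn and synthetic placeholder data in place of my vectorized documents:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import precision_score, recall_score, accuracy_score, roc_auc_score

    # Placeholder for the vectorized document collection.
    X, y = make_classification(n_samples=2000, n_features=50, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=500, random_state=0)

    for n in (500, 1000, 1500):
        clf = SVC(kernel="linear").fit(X_train[:n], y_train[:n])  # nested subsets
        y_pred = clf.predict(X_test)
        y_score = clf.decision_function(X_test)  # margin scores for the AUC
        print(n,
              precision_score(y_test, y_pred),
              recall_score(y_test, y_pred),
              accuracy_score(y_test, y_pred),
              roc_auc_score(y_test, y_score))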

Best Answer

Increasing the training size does not necessarily help the classifier; it may even degrade its generalization ability.

In your experiment, the unexpected drop in performance as the training size grows could be due to one of the following factors:

1- Randomness: If you simply run the experiment again, you may see a different result. This applies only if the classifier uses any randomized procedure during training (see the sketch below).
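
One way to check this is to repeat the fit under several seeds and watch whether the scores move; a sketch assuming scikit-learn, with make_classification standing in for your data:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=1500, n_features=50, random_state=0)  # placeholder data

    # If accuracy varies across seeds, something in the pipeline is randomized
    # (here the train/test split; it could also be shuffling or calibration).
    for seed in range(5):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=seed)
        acc = SVC(kernel="linear").fit(X_tr, y_tr).score(X_te, y_te)
        print(f"seed={seed}: accuracy={acc:.4f}")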

2- Parameter Optimization: For example, in an SVM, if the data is not linearly separable, increasing the training size may require re-tuning the penalty parameter C, which controls how much slack is allowed (@Douglas). Re-optimizing this parameter accounts for new training points that violate the linear separability of the space (see the sketch below).
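
A minimal re-tuning sketch (scikit-learn assumed, placeholder data again): the best C for 1000 documents is not necessarily the best C for 1500, so the search should be repeated whenever the training set grows.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=1500, n_features=50, random_state=0)  # placeholder data

    # Cross-validated search over the soft-margin penalty C.
    grid = GridSearchCV(SVC(kernel="linear"),
                        param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                        cv=5)
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)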

3- Overfitting: Training some classifiers for longer, or on extra training points, may lead to good performance on the training data but worse performance on the test set. Your classifier may fit the training points so closely that it becomes difficult for it to predict new points with different characteristics (see the sketch below).
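
A quick diagnostic is to compare training and test accuracy at each sample size; a widening gap is a sign of overfitting (placeholder data as before):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=2000, n_features=50, random_state=0)  # placeholder data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=500, random_state=0)

    for n in (500, 1000, 1500):
        clf = SVC(kernel="rbf").fit(X_train[:n], y_train[:n])
        train_acc = clf.score(X_train[:n], y_train[:n])
        test_acc = clf.score(X_test, y_test)
        print(f"n={n}: train={train_acc:.4f}, test={test_acc:.4f}, gap={train_acc - test_acc:.4f}")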

4- Experiment Design: It would be more informative to run your experiment more than once on different parts of the data (cross-validation) and report the scores. You would then have a mean accuracy and a standard deviation, which are more reliable indicators of the behavior you observed (see the sketch below).
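
Cross-validation yields the mean and standard deviation directly; a sketch assuming scikit-learn with placeholder data:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=1500, n_features=50, random_state=0)  # placeholder data

    # 10-fold cross-validation: report mean +/- std instead of a single score.
    scores = cross_val_score(SVC(kernel="linear"), X, y, cv=10)
    print(f"accuracy = {scores.mean():.4f} +/- {scores.std():.4f}")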

My advice is to run the same experiment again with the same settings. If you get a different result, check the random parts of your code. Even if you get the same result, use cross-validation. Finally, you may tune some of the SVM's parameters.