Solved – Training data set size and SVM classifier

classification, pattern recognition, svm

I want to do multi-class classification for human action recognition, and I plan to collect data. How can I estimate the minimum data set size? What are the important parameters?

Best Answer

Bit late to the party, but we had a look into this for multivariate data (with some 100s of variates, spectroscopic data):

Beleites, C. and Neugebauer, U. and Bocklitz, T. and Krafft, C. and Popp, J.: Sample size planning for classification models. Anal Chim Acta, 2013, 760, 25-33. DOI: 10.1016/j.aca.2012.11.007
accepted manuscript on arXiv: 1211.1323

The bottom line is that in small-sample situations (below a few hundred patients), testing is more difficult than training: estimating e.g. sensitivity or a similar proportion requires a certain absolute number of individuals (patients) in each class. For training, very powerful regularization and aggregation strategies exist, so we can cope with rather small training sample sizes by reducing model complexity (the required training sample size is relative to complexity) or by measuring model instability and averaging the unstable models (aggregation).

We found that for typical study sizes in our field (biospectroscopy), learning curves cannot be estimated with enough certainty to allow conclusions that are better than the domain- or application-specific rules of thumb.

Another point is that while you may not be sure that you train the best possible model if you decide on complexity beforehand, you can still give a fair evaluation of how the model you got performs. This way you avoid wasting precious patients, who could be used for either training or testing, on the inner loop of a nested validation design for data-driven optimization of model complexity. That inner loop may be totally useless anyway, because you don't have enough patients to allow comparisons.

One important rule of thumb from medical stats is: in order to estimate a single proportion (e.g. sensitivity, or percentage affected), you need to have at least 100 patients in the denominator.
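To get a feeling for where a rule of thumb like this comes from, here is a minimal sketch that computes a confidence interval for an observed proportion at different denominators. It is not taken from the paper: the Wilson score interval is my choice of interval method, and the function name `wilson_interval`, the observed sensitivity of 90 %, and the case numbers are all assumptions made up for illustration.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    z = 1.96 gives an approximate 95 % interval.
    `n` is the denominator, e.g. the number of truly positive
    test cases when the proportion is a sensitivity.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical example: the same observed sensitivity (90 %) measured
# with different numbers of truly positive test cases.
for n_pos in (10, 20, 100, 400):
    correct = round(0.9 * n_pos)
    lo, hi = wilson_interval(correct, n_pos)
    print(f"n = {n_pos:3d}: 95% CI = [{lo:.3f}, {hi:.3f}], width = {hi - lo:.3f}")
```

With 90 of 100 positive cases recognized, the interval is still about 0.12 wide, and with only 10 or 20 cases it is far wider. Note that the denominator counts the test cases of one class only, so for multi-class problems the requirement applies to each class separately.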
