Solved – Is this training dataset enough for training and testing classification model

machine learning, sample-size, svm, train, weka

My training dataset contains just 2 classes with 40 features.

In case 1, class 1 has 35 samples and class 2 has 700 samples.

In case 2, class 1 has 65 samples and class 2 again has 700 samples.

Is my training dataset large enough to build a model with an SVM classifier or some other classifier?

I'm using WEKA. The test options are 10-fold cross-validation and a 66% percentage split, and I get very good results.

Best Answer

From my experience, I would answer "Yes" to your question, although a wiser answer would be "It depends". You may refer to the following threads for extended discussion: "How large a training set is needed?", "SVM with unequal group sizes in training data", and "What is the minimum training set size required for a given number of features for document classification?"

The answer to the question is not straightforward, since many factors play a role, including the number of features, the model parameters, and the experimental setup. Normally, more training samples are preferred because they give a richer representation of the classes of interest. Yet if your features discriminate the two classes well, a relatively small number of samples can suffice.

In your case, depending on the performance you achieve, you may want to apply feature selection and filter out some of the 40 features. With fewer features to fit, the model can make better use of the few samples you have.

Another critical factor to consider is the degree of imbalance between the classes. In the cases provided, for each positive training sample (assuming the smaller class is the positive one), there are roughly 11 to 20 samples from the negative class (700/65 ≈ 10.8 and 700/35 = 20). In such a scenario, be careful with the performance measure you use to evaluate the method, as accuracy can be misleading. Look more closely at sensitivity (recall) and precision, e.g. with precision-recall curves.
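To make the point about accuracy concrete, here is a minimal sketch in plain Python. It uses the class counts from the question and scores a hypothetical baseline that always predicts the majority class; that baseline and the helper names (`confusion_counts`, `summarize`) are assumptions for illustration, not something from the original poster's WEKA setup.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def summarize(n_pos, n_neg):
    """Score a baseline that always predicts the majority (negative) class."""
    y_true = [1] * n_pos + [0] * n_neg
    y_pred = [0] * len(y_true)  # hypothetical do-nothing classifier
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "imbalance_ratio": n_neg / n_pos,
        "accuracy": (tp + tn) / len(y_true),
        # Precision is undefined when nothing is predicted positive; report 0.0.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Class counts taken from the question: 35 or 65 positives against 700 negatives.
results = {"case 1": summarize(35, 700), "case 2": summarize(65, 700)}
for name, stats in results.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

The baseline reaches about 95% accuracy in case 1 and about 92% in case 2 while finding none of the minority samples (recall 0.0), which is why "very good results" measured by accuracy alone say little here.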

Finally, an SVM with properly tuned parameters might be robust in your case, but again this depends on how good the features are. Other methods to consider are ensemble classifiers such as random forest or AdaBoost, which are less prone to this imbalance issue.
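One thing that is often tuned for an SVM on imbalanced data is a per-class misclassification cost. The sketch below computes inverse-frequency ("balanced") class weights for the two cases in the question. The heuristic, n_samples / (n_classes * class_count), is a common convention and an assumption here, not something the answer prescribes, and `balanced_class_weights` is a hypothetical helper name. How such weights are passed to a classifier depends on the tool you use.

```python
def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Class counts taken from the question.
cases = {
    "case 1": {"class 1": 35, "class 2": 700},
    "case 2": {"class 1": 65, "class 2": 700},
}

weights = {name: balanced_class_weights(counts) for name, counts in cases.items()}
for name, class_weights in weights.items():
    print(name, {label: round(w, 3) for label, w in class_weights.items()})
```

With these weights, an error on a minority sample costs about 20 times more than an error on a majority sample in case 1 (10.5 versus 0.525), so both classes contribute equally to the total weighted error.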
