Solved – How does the size of the test set affect the performance of a model

accuracy, classification, machine-learning, model-evaluation, supervised-learning

My data set is split 80:20 into training and test sets. I performed 10-fold cross validation on the training set and, on each iteration, also evaluated on the 20% test set (so the test set is never touched during training). Finally, I get the scores by averaging the scores over the iterations.
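For concreteness, here is a minimal sketch of that protocol in scikit-learn. The data, the classifier choice (`RandomForestClassifier`), and all variable names are illustrative assumptions, not taken from the question:

```python
# Sketch only: placeholder data and an illustrative classifier choice.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, train_test_split

# X: (n_samples, 8) sensor features, y: labels for the 7 classes (placeholders)
X, y = np.random.rand(700, 8), np.random.randint(0, 7, 700)

# 80:20 split; stratify so every class appears in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
cv_scores, test_scores = [], []
for fit_idx, val_idx in cv.split(X_tr, y_tr):
    clf = RandomForestClassifier(random_state=0)
    clf.fit(X_tr[fit_idx], y_tr[fit_idx])
    cv_scores.append(clf.score(X_tr[val_idx], y_tr[val_idx]))  # CV fold accuracy
    test_scores.append(clf.score(X_te, y_te))                  # held-out test accuracy

print("mean CV accuracy:   %.3f" % np.mean(cv_scores))
print("mean test accuracy: %.3f" % np.mean(test_scores))
```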
I am working on a 7-class classification problem with data generated by 8 sensors (8 features). Every time, the classifier misclassifies the last class. I tried decreasing the number of classes, but the last class still got misclassified.

Finally, I started shrinking the test set to enlarge the training set. I got good results (90% accuracy) when the test set is only 8% of the data.

Is there any other way, or any scope for increasing the scores, without further decreasing the size of the test set?
Following are the snippets of the two cases: [screenshots of the results were attached here]

Best Answer

Generally speaking, with more training data, the model will learn the underlying distribution of the real data better. Since a larger training set in your case improves performance on both the training and test sets, you should get more data if you can. The performance on the test set may not be reliable if the test set is small: your scores could change noticeably if you switched to a different test set. That is one of the reasons to perform cross validation. I suggest you also take a look at the CV accuracy, and at how it changes as the training-set size grows (a learning curve).
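A quick way to inspect that is scikit-learn's `learning_curve`. The sketch below uses placeholder data and an illustrative classifier; adapt it to your own pipeline:

```python
# Sketch only: placeholder data and an illustrative classifier choice.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = np.random.rand(700, 8), np.random.randint(0, 7, 700)  # placeholders

sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(random_state=0),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),  # 10% .. 100% of the training data
    cv=10,
    scoring="accuracy",
)
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n_train={n}  train_acc={tr:.3f}  cv_acc={va:.3f}")
```

If the CV accuracy is still climbing at 100% of the available training data, that supports the conclusion that more data (rather than a smaller test set) is what the model needs.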

You should also take a look at the class distribution in your training and test sets. If there are only a few data points for the last class, the classifier will not be able to learn it well. You can upsample the minority class to make the classifier work better.
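One simple way to do both checks is sketched below with `sklearn.utils.resample`; the data and the assumption that class 6 is the minority are placeholders. Note that upsampling should be applied to the training data only, never to the test set, to avoid leakage:

```python
# Sketch only: placeholder data; class 6 as minority is an assumption.
from collections import Counter
import numpy as np
from sklearn.utils import resample

X, y = np.random.rand(700, 8), np.random.randint(0, 7, 700)
print("class counts:", Counter(y))  # look for under-represented classes

minority = 6  # hypothetical: suppose the last class is under-represented
X_min, y_min = X[y == minority], y[y == minority]
X_rest, y_rest = X[y != minority], y[y != minority]

# Sample the minority class with replacement up to the size of the largest class
n_target = max(Counter(y).values())
X_up, y_up = resample(X_min, y_min, replace=True,
                      n_samples=n_target, random_state=0)

X_bal = np.vstack([X_rest, X_up])
y_bal = np.concatenate([y_rest, y_up])
print("balanced counts:", Counter(y_bal))
```

As a lighter-weight alternative, many scikit-learn classifiers accept `class_weight="balanced"`, which reweights classes during fitting instead of duplicating samples.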
