Solved – Should I get 100% classification accuracy on training data

Tags: accuracy, classification, machine learning, train, validation

I've been getting inconsistent results on a binary classification problem I'm trying to solve with a linear classifier and a custom feature extraction pipeline, so I decided to do a quick sanity check for bugs by training and testing my classifier on the same dataset. I expected this to yield very high (100%?) accuracy, recall, and precision, but to my surprise I got results comparable to, or even lower than, the ones I normally get on distinct training and test sets (~70% recall).

Should a classifier be very accurate when applied to its own training data, or do I just have a bug in my code? I'm not very experienced in ML so any help at all would be greatly appreciated! Thanks!!

Best Answer

No. Your data may not be perfectly classifiable, especially by a linear classifier, and this is not necessarily a bug in your code or a flaw in your pipeline. The features you are using may simply not contain enough discriminative information to separate the two classes with a single straight line (hyperplane), so even on its own training data the model cannot reach 100% accuracy.
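As a minimal illustration (assuming Python with scikit-learn is available; the XOR-style data below is a made-up example, not your pipeline), a linear model evaluated on its own training data still stays near chance when the classes are not linearly separable:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic XOR-like data: no single line can separate the two classes
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # label depends on the sign of the product

clf = LogisticRegression().fit(X, y)

# Training accuracy hovers around 0.5 even though we score on the training set
print("training accuracy:", clf.score(X, y))
```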

You can try non-linear models, which can fit the training data better but also carry a higher risk of over-fitting. Evaluating on a held-out validation set can help you tell whether you need a different model or whether the problem lies in the nature of your data.
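A rough sketch of that comparison, again assuming scikit-learn; here `X` and `y` are placeholders for your extracted features and binary labels (the synthetic data is just to make the snippet runnable):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-ins for your own feature matrix X and labels y
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Hold out a validation set so over-fitting is visible
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

for name, model in [("linear SVM", SVC(kernel="linear")),
                    ("RBF SVM", SVC(kernel="rbf"))]:
    model.fit(X_train, y_train)
    print(f"{name}: train={model.score(X_train, y_train):.2f} "
          f"val={model.score(X_val, y_val):.2f}")
```

A large gap between training and validation accuracy points to over-fitting; low accuracy on both suggests the features themselves lack signal.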
