Solved – Is it better to pre-filter the entire data set or just the training subset?

Tags: cross-validation, dataset, machine-learning

I am currently working on a classifier for the qualitative spectral analysis of alloys.

One of the problems I faced is the preparation of samples for classifier training. Samples have to be machined into chips that can physically fit into the spectrometer's sample chamber.

During cross-validation I noticed a few very peculiar misclassifications that simply should not happen (the elemental compositions are too different), especially considering the overall performance of the classifier. Upon careful examination I found what looks like a contaminated (or possibly mislabeled, although this is very unlikely) sample. An anomaly detector confirms this and makes it very simple to find and remove such samples.

My question is: what would be considered better practice in machine learning?

  • Filter the entire data set, so that neither the training nor the cross-validation data contains outliers.
  • Filter only the training set and leave the outliers in the cross-validation set.

The real-world samples this classifier is intended for are unlikely to have this sort of contamination.

Best Answer

You pretty much answered your own question with that last sentence. Assuming you are building a model to be used in a real-world application, you want its training and evaluation setup to be as close as possible to said application.

If you have no outliers in reality, then get rid of them in your own data set too. Having no outliers in a real-world application sounds too good to be true, though, so be certain of that assumption, or you will end up far too optimistic about the model's performance.
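To make the first option concrete, a minimal sketch of "filter the entire data set before splitting" is below. The z-score filter and the numeric data are stand-ins (the question's actual anomaly detector and spectral features are not specified); the point is only that the filtering step runs on the full data set, so neither partition retains the contaminated sample.

```python
import statistics

def filter_outliers(samples, z_threshold=2.0):
    """Drop samples lying more than z_threshold standard deviations from
    the mean -- a simple stand-in for the anomaly detector in the question."""
    mean = statistics.fmean(samples)
    stdev = statistics.stdev(samples)
    if stdev == 0:
        return list(samples)
    return [x for x in samples if abs(x - mean) / stdev <= z_threshold]

# Hypothetical measurements: one contaminated sample among similar ones.
data = [1.02, 0.98, 1.01, 0.99, 1.00, 1.03, 0.97, 9.50]

# Filter the ENTIRE data set first, then split, so that neither the
# training fold nor the cross-validation fold contains the outlier.
clean = filter_outliers(data)
train, cv = clean[: len(clean) // 2], clean[len(clean) // 2 :]
```

Note that with a small sample a single extreme value inflates the standard deviation, which caps how large any z-score can get; a robust detector (e.g. one based on the median absolute deviation) is usually preferable in practice.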