Solved – Naive Bayes with duplicated data

The training set for my naive Bayes classifier contains some duplicate samples. Should we train naive Bayes with the duplicate samples included, or should we eliminate all the duplicates and then train?

I have points both for and against eliminating the duplicates.

For:
Since a duplicate sample does not add any new knowledge to the system, we should eliminate it.
If there is a large number of duplicate samples, the model we build will be biased towards the duplicated samples. Hence duplicates should be avoided.

Against:
If a sample occurs multiple times, it is natural for the model to be biased towards that sample. Hence duplicates should not be filtered out.

Please suggest which way is right, if a right way exists at all.

Best Answer

I think you've mostly answered the question. If the duplicates are part of the natural sampling process, and you're going to see them when you actually apply the classifier at test time, then you should include them. In this case they do add information, since they are telling the classifier about the distribution of the inputs that it is going to encounter.
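To make the "they do add information" point concrete, here is a minimal sketch of how repeated samples shift the counts that naive Bayes is built from. The toy data and the helper names `class_priors` and `feature_likelihood` are made up for illustration, and the estimates are plain maximum-likelihood counts without smoothing, so treat it as a rough picture rather than a full classifier.

```python
from collections import Counter


def class_priors(samples):
    """Maximum-likelihood class priors P(y) from (features, label) pairs."""
    counts = Counter(label for _, label in samples)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def feature_likelihood(samples, index, value, label):
    """Maximum-likelihood estimate of P(x[index] == value | y == label)."""
    in_class = [x for x, y in samples if y == label]
    matches = sum(1 for x in in_class if x[index] == value)
    return matches / len(in_class)


# Hypothetical toy data: the first sample occurs three times.
train = [
    (("sunny", "hot"), "no"),
    (("sunny", "hot"), "no"),
    (("sunny", "hot"), "no"),
    (("rainy", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("sunny", "mild"), "yes"),
]

# Drop exact repeats while keeping the original order.
deduplicated = list(dict.fromkeys(train))

priors_with = class_priors(train)
priors_without = class_priors(deduplicated)
lik_with = feature_likelihood(train, 0, "sunny", "no")
lik_without = feature_likelihood(deduplicated, 0, "sunny", "no")

print("P(no) with duplicates:   ", priors_with["no"])
print("P(no) without duplicates:", priors_without["no"])
print("P(sunny | no) with duplicates:   ", lik_with)
print("P(sunny | no) without duplicates:", lik_without)
```

Both the class prior and the per-feature likelihood move when the repeats are dropped. If the repeats reflect how often that input really occurs, the estimates with duplicates are the ones that match what the classifier will see.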

If the duplicates arise for some reason that's not going to apply at test time (e.g. they come from overlapping data samples that were combined together), then they do not reflect the true distribution of samples and it makes sense to remove them. Be aware, though, that this may point to more serious issues with your training set in general.
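One way to separate the two cases is sketched below, under the assumption that every record carries a unique identifier (the `id` field, the record layout, and the `merge_sources` helper are hypothetical and not something the question mentions). Copies that share an identifier are artifacts of merging overlapping sources and are collapsed, while distinct records that merely have identical features are kept, since those are genuine repeats.

```python
def merge_sources(*sources):
    """Merge lists of records, keeping one copy per record id.

    Assumes each record is a dict with a unique 'id' key (hypothetical).
    """
    seen = {}
    for source in sources:
        for record in source:
            # Keep the first copy of each id; later copies are merge artifacts.
            seen.setdefault(record["id"], record)
    return list(seen.values())


# Hypothetical overlapping extracts: record 2 appears in both.
batch_a = [
    {"id": 1, "features": ("sunny", "hot"), "label": "no"},
    {"id": 2, "features": ("sunny", "hot"), "label": "no"},
]
batch_b = [
    {"id": 2, "features": ("sunny", "hot"), "label": "no"},
    {"id": 3, "features": ("rainy", "mild"), "label": "yes"},
]

merged = merge_sources(batch_a, batch_b)
print([record["id"] for record in merged])
```

Records 1 and 2 have the same features and label but different identifiers, so both survive the merge. Only the second copy of record 2 is dropped.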