Solved – Why does training naive Bayes on a data set in which all the features are repeated increase the confidence of the naive Bayes probability estimates

classification, naive bayes, text mining

I am looking for a toy example to understand this behavior, preferably a text-classification one.

I read the following on page 7 of http://people.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf:

Furthermore, even when naive Bayes has good classification accuracy, its probability estimates tend to be poor. To understand why, imagine training naive Bayes on a data set in which all the features are repeated, that is, $x = (x_1, x_1, x_2, x_2, \ldots, x_K, x_K)$. This will increase the confidence of the naive Bayes probability estimates, even though no new information has been added to the data.

I am unable to produce a toy example for which $$p(y) \prod_{i=1}^{K} p(x_i \mid y)$$ changes by duplicating the features.

Best Answer

I'm not sure what kind of example you are looking for, but to understand this behavior you simply need to consider the following. For a duplicated variable you have $$p(x_1=k, x_2=k \mid y) = p(x_1=k \mid y) = p(x_2=k \mid y),$$ yet naive Bayes models this as $$p_{\text{NB}}(x_1=k, x_2=k \mid y) = p(x_1=k \mid y)\,p(x_2=k \mid y) = p(x_1=k, x_2=k \mid y)^2.$$ Squaring each class-conditional likelihood exaggerates the gap between the classes, so after normalization the posterior $p(y \mid x)$ is pushed closer to 0 or 1, even though no new information has been added.
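To see this numerically, here is a minimal sketch of my own (not taken from the tutorial), with a uniform prior over two classes and a single binary feature whose class-conditional probabilities I have made up for illustration. Duplicating the feature squares each likelihood and moves the posterior from about 0.67 to 0.80:

```python
# Toy example: two classes with a uniform prior and one binary feature,
# with assumed values p(x1=1 | y=0) = 0.8 and p(x1=1 | y=1) = 0.4.
# Duplicating the feature counts the same likelihood twice, which pushes
# the naive Bayes posterior further from 0.5.

def posterior(prior, likelihoods):
    """Naive Bayes posterior: normalize prior * product of per-feature likelihoods."""
    scores = []
    for p_y, lik in zip(prior, likelihoods):
        s = p_y
        for l in lik:
            s *= l
        scores.append(s)
    z = sum(scores)
    return [s / z for s in scores]

prior = [0.5, 0.5]

# Observation x1 = 1
single = [[0.8], [0.4]]                 # p(x1=1 | y) for y = 0, 1
duplicated = [[0.8, 0.8], [0.4, 0.4]]   # same feature counted twice

print(posterior(prior, single))      # [0.667, 0.333]
print(posterior(prior, duplicated))  # [0.8, 0.2]  -- more "confident"
```

The same mechanism applies to text classification: if every word in a document is counted twice, each word's likelihood enters the product twice, and the posterior becomes more extreme without any new evidence.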
