Solved – Why does training naive Bayes on a data set in which all the features are repeated increase the confidence of the naive Bayes probability estimates

classification, naive bayes, text mining

I am looking for a toy example to understand this behavior, preferably a text classification one.

I read the following on page 7 of the source:

Furthermore, even when naive Bayes has good classification accuracy, its probability estimates tend to be poor. To understand why, imagine training naive Bayes on a data set in which all the features are repeated, that is, x = (x1,x1,x2,x2,…,xK,xK). This will increase the confidence of the naive Bayes probability estimates, even though no new information has been added to the data.

I am not able to produce any toy example for which $$p(y) \prod_{i=1}^{K} p(x_i|y)$$ changes when the features are duplicated.

Best Answer

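I'm not sure what kind of example you are looking for, but to understand this behavior you only need to consider the following. For a duplicated variable, where x2 is an exact copy of x1, the true joint likelihood is $$p(x_1=k, x_2=k|y)=p(x_1=k|y)=p(x_2=k|y)$$ Yet naive Bayes treats the two copies as conditionally independent and models this as $$p_{NaiveBayes}(x_1=k,x_2=k|y)=p(x_1=k|y)p(x_2=k|y)=p(x_1=k,x_2=k|y)^2$$ Every class-conditional likelihood is therefore squared, which squares the likelihood ratio between the classes. After normalization the posterior is pushed toward 0 or 1, even though the duplicate carries no new information.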
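Since the question asks for a toy text example, here is a minimal sketch of the squaring effect. Everything in it is an assumption made for illustration: the two classes (spam/ham), the two word features ("free", "meeting"), the probability values, and the helper name `nb_posterior`. It assumes the per-feature conditional probabilities have already been estimated. Training on duplicated columns yields the same estimate for each copy, so duplication is modeled by raising the likelihood product to the power `copies`.

```python
from math import prod

def nb_posterior(priors, cond_probs, copies=1):
    """Naive Bayes posterior over classes for one observed document.

    priors:     {class: p(y)}
    cond_probs: {class: [p(x_i = observed value | y) for each feature i]}
    copies:     number of times every feature appears in the feature vector
    """
    # Each copy of a feature contributes the same factor, so the whole
    # likelihood product is raised to the power `copies`.
    scores = {y: priors[y] * prod(cond_probs[y]) ** copies for y in priors}
    total = sum(scores.values())
    return {y: score / total for y, score in scores.items()}

# Hypothetical numbers for a document containing both "free" and "meeting".
priors = {"spam": 0.5, "ham": 0.5}
cond_probs = {
    "spam": [0.8, 0.3],  # p("free" | spam), p("meeting" | spam)
    "ham": [0.2, 0.6],   # p("free" | ham),  p("meeting" | ham)
}

original = nb_posterior(priors, cond_probs, copies=1)
duplicated = nb_posterior(priors, cond_probs, copies=2)
print(round(original["spam"], 3), round(duplicated["spam"], 3))
```

With these made-up numbers the posterior for spam goes from about 0.667 to 0.8. The predicted class is the same, only the confidence grows, and more copies would push it further toward 1.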
