Solved – Poor classification performance with naiveBayes

Tags: classification, naive bayes, r

I started to play with the naiveBayes function from the e1071 package. It looks simple on small test examples, but it performs poorly on my actual task. I have ~4000 observations belonging to 2 classes, described by ~19000 numeric variables. The entire dataset was split into training and test sets, and a naiveBayes model was fit on the training set.
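Roughly, the workflow looked like the sketch below (the data frame `dat`, its `class` column, and the 70/30 split are placeholders, not my actual data):

```r
library(e1071)

## Hypothetical setup: 'dat' is a data.frame with ~19000 numeric predictor
## columns plus a factor column 'class' with two levels.
set.seed(1)
train_idx <- sample(nrow(dat), size = round(0.7 * nrow(dat)))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

## Fit Naive Bayes on the training set and predict class labels on the test set
nb_fit  <- naiveBayes(class ~ ., data = train)
nb_pred <- predict(nb_fit, newdata = test)
```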

The prediction performance of the model on both the training and test sets was poor (~55%), much worse than a Random Forest model, for example (~80%). Almost all observations were classified as class 2, even though the dataset is balanced and the ratio of class 1 to class 2 is almost 1:1.
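The check that showed this looked roughly like the following (again using the placeholder names from above):

```r
## Confusion matrix: rows = true class, columns = predicted class
conf <- table(truth = test$class, predicted = nb_pred)
conf

## Overall accuracy (~0.55 in my case)
sum(diag(conf)) / sum(conf)

## Distribution of predicted labels -- nearly everything ends up in class 2
table(nb_pred)
```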

What can be the reason for such bad performance? How can I improve the model and the prediction results? As I'm relatively new to the Bayes approach, any suggestion will be helpful.

As I understand it, the core assumption of the Naive Bayes approach is independence of the variables. But I'm sure that many variables in my dataset are highly correlated. Could this be the source of the problem? Should I use variable selection with Naive Bayes?
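If variable selection is the way to go, would something along these lines be reasonable? This is only a sketch: findCorrelation() is from the caret package, the 0.9 cutoff is arbitrary, and with ~19000 variables the full correlation matrix gets large, so a cheaper univariate filter might be needed first.

```r
library(e1071)
library(caret)   # for findCorrelation()

## Drop one variable out of each highly correlated pair, using the training set only
x_train  <- train[, setdiff(names(train), "class")]
cor_mat  <- cor(x_train)                           # large: ~19000 x 19000
drop_idx <- findCorrelation(cor_mat, cutoff = 0.9) # column indices to remove
x_kept   <- x_train[, setdiff(seq_along(x_train), drop_idx)]

## Refit Naive Bayes on the reduced predictor set
nb_fit2  <- naiveBayes(x_kept, train$class)
nb_pred2 <- predict(nb_fit2, newdata = test[, names(x_kept)])
mean(nb_pred2 == test$class)   # test accuracy after filtering
```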

Best Answer

Indeed, as you have mentioned yourself, the lack of independence (and relevance) of the explanatory variables is crucial. It is also no surprise at all that Random Forest behaves much better than a Naive Bayes classifier, since it is far more robust to overfitting, especially in your situation where you have almost five times as many explanatory variables as observations. Virtually any 'ensemble method' will do better than a simple Naive Bayes classifier. You could try ensembling Naive Bayes classifiers, in the spirit of what is described in this short text.
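A crude way to do that, in a random-subspace / bagging spirit, is sketched below. Everything here is hypothetical: the object names (`train`, `test`, the `class` column) follow your train/test split, and the number of models and of features per model are arbitrary choices to tune.

```r
library(e1071)

set.seed(42)
n_models  <- 50                            # number of Naive Bayes models in the ensemble
n_feats   <- 100                           # random features given to each model
feat_pool <- setdiff(names(train), "class")

## Each model sees a bootstrap sample of rows and a random subset of columns
votes <- sapply(seq_len(n_models), function(i) {
  rows  <- sample(nrow(train), replace = TRUE)
  feats <- sample(feat_pool, n_feats)
  fit   <- naiveBayes(train[rows, feats], train$class[rows])
  as.character(predict(fit, newdata = test[, feats]))
})

## Majority vote over the ensemble for each test observation
ensemble_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(ensemble_pred == as.character(test$class))   # ensemble accuracy
```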
