Naive Bayes – Understanding the ‘Learning’ in Naive Bayes

algorithms · classification · machine learning · naive bayes

As I recall, algorithms like nearest neighbor don't build a model from the training data and then apply that model to the test data. They just take each new instance and compare it against all of the stored training data to find the closest one, and so on.

What about Naive Bayes? It seems to be similar. Neural networks, for example, learn parameters and then apply the resulting model to test data, but for Naive Bayes I don't see where the learning takes place. There don't appear to be any learned parameters; it seems to look at the entire dataset again for each prediction. Can anyone comment on this?

Additionally, what then is the use of a training/test split? I can see that we would want a labeled test set, but beyond that I don't see why we need to separate training from testing.

Best Answer

Unlike the nearest neighbor algorithm, Naive Bayes is not a lazy method; real learning does take place. The parameters learned in Naive Bayes are the prior probabilities of the different classes, as well as the likelihoods of the different features for each class. In the test phase, these learned parameters are used to estimate the probability of each class for a given sample.
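As a toy illustration (the numbers here are made up): if 3 of 10 training emails are spam, the learned prior is $P(\text{spam}) = 3/10$; if the word "free" occurs in 2 of those 3 spam emails, the learned likelihood is $P(\text{free}|\text{spam}) = 2/3$. Both quantities are computed once from the training set and then simply looked up at prediction time.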

In other words, in Naive Bayes, for each sample in the test set, the parameters determined during training are used to estimate the probability of that sample belonging to each class. For example, $P(c|x)\propto P(c)P(x_1|c)P(x_2|c)\cdots P(x_n|c)$, where $c$ is a class and $x = (x_1, \dots, x_n)$ is a test sample. All the quantities $P(c)$ and $P(x_i|c)$ are parameters determined during training and then used during testing. This is similar to a neural network in that parameters are learned and then applied, though what is learned and how it is applied differ.
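To make the training/test separation concrete, here is a minimal sketch of a Bernoulli-style Naive Bayes in plain Python. The dataset, feature names, and add-one smoothing are illustrative assumptions, not something from the answer above:

```python
from collections import Counter, defaultdict

# Toy training data (hypothetical): binary features per sample, plus a class label.
train = [
    ({"free": 1, "meeting": 0}, "spam"),
    ({"free": 1, "meeting": 0}, "spam"),
    ({"free": 0, "meeting": 1}, "ham"),
    ({"free": 0, "meeting": 1}, "ham"),
    ({"free": 1, "meeting": 1}, "ham"),
]

# --- Training: estimate P(c) and P(x_i = 1 | c) once, from the training set ---
class_counts = Counter(label for _, label in train)
n = len(train)
priors = {c: class_counts[c] / n for c in class_counts}  # P(c)

feature_counts = defaultdict(Counter)  # feature_counts[c][f] = # samples of class c with f = 1
for features, label in train:
    for f, v in features.items():
        feature_counts[label][f] += v

def likelihood(f, c):
    # P(x_f = 1 | c) with add-one (Laplace) smoothing to avoid zero probabilities
    return (feature_counts[c][f] + 1) / (class_counts[c] + 2)

# --- Testing: only the learned parameters are used; train is never revisited ---
def predict(features):
    scores = {}
    for c in priors:
        score = priors[c]
        for f, v in features.items():
            p = likelihood(f, c)
            score *= p if v else (1 - p)
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}  # normalized P(c|x)

print(predict({"free": 1, "meeting": 0}))  # e.g. {'spam': 0.70..., 'ham': 0.29...}
```

Note that `predict` touches only `priors` and `feature_counts`, which is exactly the sense in which the training data has been compressed into parameters, in contrast to nearest neighbor.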

As an example, take a look at the Naive Bayes implementation in nltk. See the train and prob_classify methods. In the train method, label_probdist and feature_probdist are computed, and in the prob_classify method these parameters are used to estimate the probabilities of the different classes for a test sample. Note that _label_probdist and _feature_probdist are initialized from label_probdist and feature_probdist, respectively, in the constructor.
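A minimal usage sketch of that nltk classifier, assuming made-up feature names and labels, might look like this:

```python
import nltk

# Hypothetical labeled feature sets: (feature_dict, label) pairs
train_data = [
    ({"free": True, "meeting": False}, "spam"),
    ({"free": True, "meeting": False}, "spam"),
    ({"free": False, "meeting": True}, "ham"),
    ({"free": False, "meeting": True}, "ham"),
]

# train() estimates label_probdist (the priors) and feature_probdist
# (the per-class feature likelihoods) from the training data
classifier = nltk.NaiveBayesClassifier.train(train_data)

# prob_classify() only reuses those stored distributions;
# it never goes back to train_data
dist = classifier.prob_classify({"free": True, "meeting": False})
for label in dist.samples():
    print(label, dist.prob(label))
```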

About your second question (the final paragraph): even for lazy methods such as nearest neighbor, we still need to split the data into training and test sets. This is because we want to evaluate the performance of a model built from the training data on samples that were not seen during training, in order to obtain a reasonable estimate of how well the model generalizes.
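As a concrete illustration, here is a minimal sketch using scikit-learn; the iris dataset, the 30% holdout size, and the Gaussian variant of Naive Bayes are all arbitrary choices for the example:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Hold out 30% of the labeled data; the model never sees it during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = GaussianNB()
model.fit(X_train, y_train)  # parameters estimated from the training set only

y_pred = model.predict(X_test)  # learned parameters applied to unseen samples
print(accuracy_score(y_test, y_pred))  # estimate of generalization performance
```

Accuracy measured on X_train instead of X_test would be optimistically biased, which is exactly why the split matters even when the learner itself is lazy.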