Test/training data set split for Naive Bayes classifier after model finalized

machine-learning, naive-bayes

I've been learning about Naive Bayes classifiers using the nltk package in Python. I'm working on a gender classification model. I have some labeled data for names with male/female probabilities, and to create the model I used an 80:20 split between training and testing sets.
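
For concreteness, here is a minimal sketch of this kind of setup. It uses the nltk names corpus and a toy last-letter feature purely as stand-ins for my actual labeled data and features:

    import random

    import nltk
    from nltk.corpus import names  # stand-in for my own labeled name data
    # nltk.download('names')       # needed once to fetch the corpus

    def gender_features(name):
        # Toy feature extractor: just the last letter of the name.
        return {'last_letter': name[-1].lower()}

    labeled = ([(n, 'male') for n in names.words('male.txt')] +
               [(n, 'female') for n in names.words('female.txt')])
    random.shuffle(labeled)
    featuresets = [(gender_features(n), g) for n, g in labeled]

    # 80:20 split: first 80% for training, last 20% held out for testing.
    cut = int(0.8 * len(featuresets))
    train_set, test_set = featuresets[:cut], featuresets[cut:]

    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(nltk.classify.accuracy(classifier, test_set))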

I understand the importance of keeping these sets separate while you are optimizing your model, but once you've determined the features you want to include, doesn't it make sense to shift all of your existing labeled data into the training set when you actually apply the model to new data? My intuition is that this way, when I apply the model to new, unseen names in a real-world application, it will have been trained on a larger data set. Is this correct, or are you supposed to keep the split and use only part of the data for training even after you've settled on your features?

If you do maintain the split, do you always have to use the exact same data in the training set, or can you shuffle which data goes into the training and testing sets each time you run the model? (My intuition here is that the training set should stay fixed, but I'm not sure.)

Best Answer

  1. The split into training and test sets is used to evaluate your algorithm (here, the classifier) in terms of its accuracy. Once you have systematically evaluated the classifier's parameters using the split data, the model you actually apply to real data should be trained on all of your labelled data (see the sketch below).

  2. Shuffling which data ends up in the training and test sets is the idea behind cross-validation. Cross-validation is such a helpful technique that it became the name of this forum. There are many sources on the internet providing detailed explanations of cross-validation; if you have any concrete questions about it, feel free to ask.
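
To make both points concrete, here is a rough sketch, again using the nltk names corpus and a last-letter feature as placeholders for your own data and features. Shuffled k-fold cross-validation is used only to estimate accuracy; afterwards, the final classifier is trained on all of the labelled data:

    import random

    import nltk
    from nltk.corpus import names  # placeholder for your own labelled data

    def gender_features(name):
        return {'last_letter': name[-1].lower()}

    labeled = ([(n, 'male') for n in names.words('male.txt')] +
               [(n, 'female') for n in names.words('female.txt')])
    random.seed(0)          # fix the shuffle so the folds are reproducible
    random.shuffle(labeled)
    featuresets = [(gender_features(n), g) for n, g in labeled]

    # 1. Estimate accuracy with k-fold cross-validation on the shuffled data.
    k = 5
    fold_size = len(featuresets) // k
    scores = []
    for i in range(k):
        test_fold = featuresets[i * fold_size:(i + 1) * fold_size]
        train_folds = featuresets[:i * fold_size] + featuresets[(i + 1) * fold_size:]
        clf = nltk.NaiveBayesClassifier.train(train_folds)
        scores.append(nltk.classify.accuracy(clf, test_fold))
    print('mean cross-validated accuracy:', sum(scores) / len(scores))

    # 2. With features and parameters fixed, train the final model on ALL labelled data.
    final_classifier = nltk.NaiveBayesClassifier.train(featuresets)
    # final_classifier is what you apply to new, unseen names.

The cross-validated accuracy is your estimate of how the final classifier will behave on new names, while the deployed model no longer gives up any labelled data to a held-out set.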

Some questions on this forum deal with the details of cross-validation:

Choice of K in K-fold cross-validation

Further, I think your question is partly answered here:

Training with the full dataset after cross-validation?