Test/training data set split for Naive Bayes classifier after model finalized

machine-learning, naive-bayes

I've been learning about Naive Bayes classifiers using the nltk package in Python. I'm working on a gender classification model. I have some labeled data for names with male/female probabilities, and to create the model I used an 80:20 split between training and testing sets.
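
For concreteness, here is a minimal sketch of this kind of setup. It uses the nltk names corpus and a toy last-letter feature purely as stand-ins for my actual labeled data and features:

    import random

    import nltk
    from nltk.corpus import names  # stand-in for my own labeled name data
    # nltk.download('names')       # needed once to fetch the corpus

    def gender_features(name):
        # Toy feature extractor: just the last letter of the name.
        return {'last_letter': name[-1].lower()}

    labeled = ([(n, 'male') for n in names.words('male.txt')] +
               [(n, 'female') for n in names.words('female.txt')])
    random.shuffle(labeled)
    featuresets = [(gender_features(n), g) for n, g in labeled]

    # 80:20 split: first 80% for training, last 20% held out for testing.
    cut = int(0.8 * len(featuresets))
    train_set, test_set = featuresets[:cut], featuresets[cut:]

    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(nltk.classify.accuracy(classifier, test_set))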

I understand the importance of keeping these sets separate while you are optimizing your model, but once you've determined the features you want to include, doesn't it make sense to shift all of your existing labeled data into the training set when you actually apply the model to new data? My intuition is that this way, when I apply the model to new, unseen names in a real-world application, it will have been trained on a larger data set. Is this correct, or are you supposed to keep the split and use only part of the data for training even after you've settled on your features?

If you do maintain the split, do you always have to use the exact same data in the training set, or can you shuffle which data goes into the training and testing sets each time you run the model? (My intuition here is that the training set should stay fixed, but I'm not sure.)

Best Answer

  1. The split into training and test sets is used to evaluate your algorithm (here, the classifier) in terms of its accuracy. Once you have systematically evaluated the classifier's parameters using the split data, the model you actually apply to real data should be trained on all of your labelled data (see the sketch below).

  2. Shuffling which data ends up in the training and test sets is the idea behind cross-validation. Cross-validation is such a helpful technique that it became the name of this forum. There are many sources on the internet providing detailed explanations of cross-validation; if you have any concrete questions about it, feel free to ask.
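
To make both points concrete, here is a rough sketch, again using the nltk names corpus and a last-letter feature as placeholders for your own data and features. Shuffled k-fold cross-validation is used only to estimate accuracy; afterwards, the final classifier is trained on all of the labelled data:

    import random

    import nltk
    from nltk.corpus import names  # placeholder for your own labelled data

    def gender_features(name):
        return {'last_letter': name[-1].lower()}

    labeled = ([(n, 'male') for n in names.words('male.txt')] +
               [(n, 'female') for n in names.words('female.txt')])
    random.seed(0)          # fix the shuffle so the folds are reproducible
    random.shuffle(labeled)
    featuresets = [(gender_features(n), g) for n, g in labeled]

    # 1. Estimate accuracy with k-fold cross-validation on the shuffled data.
    k = 5
    fold_size = len(featuresets) // k
    scores = []
    for i in range(k):
        test_fold = featuresets[i * fold_size:(i + 1) * fold_size]
        train_folds = featuresets[:i * fold_size] + featuresets[(i + 1) * fold_size:]
        clf = nltk.NaiveBayesClassifier.train(train_folds)
        scores.append(nltk.classify.accuracy(clf, test_fold))
    print('mean cross-validated accuracy:', sum(scores) / len(scores))

    # 2. With features and parameters fixed, train the final model on ALL labelled data.
    final_classifier = nltk.NaiveBayesClassifier.train(featuresets)
    # final_classifier is what you apply to new, unseen names.

The cross-validated accuracy is your estimate of how the final classifier will behave on new names, while the deployed model no longer gives up any labelled data to a held-out set.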

Some questions on this forum deal with the details of cross-validation:

Choice of K in K-fold cross-validation

Further, I think your question is partly answered here:

Training with the full dataset after cross-validation?