Solved – When *not* to split your data into training and testing


So, I was thinking of situations in which you would not split your data into training and testing sets and would instead train on the entire dataset, at the risk of "overfitting". Suppose my dataset has 10 columns, of which 9 are categorical (with no particular order) and one is numeric. It could happen that your training split does not capture every level of the original categorical variables. When you then try to evaluate your model on the test set, which contains levels that were never seen during training, your model would potentially not know what to do, and in R, or wherever your model is implemented, you would get an error.
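A minimal sketch in R of this failure mode (the data frame, column names, and values below are hypothetical, chosen only for illustration):

```r
# Hypothetical data: one factor predictor and a numeric response
set.seed(1)
df <- data.frame(
  color = factor(c("red", "red", "blue", "blue", "green", "green")),
  y     = c(1.2, 1.1, 2.3, 2.1, 3.2, 3.4)
)

# Suppose the split happens to put every "green" row in the test set
train <- df[df$color != "green", ]
test  <- df[df$color == "green", ]
train$color <- droplevels(train$color)  # training never sees "green"

fit <- lm(y ~ color, data = train)

# Fails with an error along the lines of
# "factor color has new levels green":
predict(fit, newdata = test)
```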

Another situation could be one where you are not interested in predicting new data, but rather in getting a feel for the relationships/patterns between the predictors and the response variable.

Would these not be situations where, even though your model will potentially be overfitted, you would want to train on the entire dataset?

Thanks!

Best Answer

I agree with your first point: if the amount of data is limited, then we can use the entire dataset for model building instead of splitting it into two sets.

However, I do not completely agree with your second point: even when there is no new data to predict and all we want to do is discover patterns, a testing set is sometimes still needed.

For example, suppose we want to use K-means to "discover patterns" in the data. The number of clusters to use can be chosen with the help of a held-out testing set.
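A rough sketch of what that could look like in R (the synthetic data, the 70/30 split, and the `holdout_wss` helper are assumptions for illustration, not part of the answer):

```r
# Synthetic data with two true clusters
set.seed(42)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))

idx   <- sample(nrow(x), 0.7 * nrow(x))
train <- x[idx, ]
test  <- x[-idx, ]

# Held-out error: squared distance from each test point
# to the nearest centroid learned on the training set
holdout_wss <- function(centers, test) {
  sum(apply(test, 1, function(p)
    min(colSums((t(centers) - p)^2))))
}

for (k in 1:6) {
  km <- kmeans(train, centers = k, nstart = 10)
  cat(sprintf("k = %d  holdout WSS = %.1f\n",
              k, holdout_wss(km$centers, test)))
}
```

The held-out error will generally keep shrinking as k grows, but it tends to flatten out near the true number of clusters, so looking for that elbow on the testing set is one way to pick k without rewarding a model that merely memorizes the training points.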

Always keep in mind that data can contain noise; if we do whatever we can to fit one data set perfectly, the discovered patterns can be false and may not generalize well.