Machine-Learning – Is it Legitimate to Refit the Best Model with Test Data in Model Building?

machine-learning, model-selection, predictive-models

For model building I typically apply the following process (which I am simplifying somewhat here for brevity; a rough R sketch of the three steps follows the list):

  1. Split the data into test and training
  2. Use cross-validation on my training data to find the best model parameters relative to some performance metric (accuracy, ROC, etc.) and pick this as my best model
  3. Evaluate my model on the test data
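
Here is a minimal sketch of steps 1–3, assuming the caret and randomForest packages; `stock_df` (a data frame ordered by date with a numeric `target` column) is a placeholder of my own, not part of the question:

```r
## Minimal sketch of steps 1-3, assuming the caret and randomForest packages.
## `stock_df` is a placeholder data frame, ordered by date, with a numeric
## column `target` (e.g. tomorrow's price) to predict.
library(caret)

## Step 1: hold out the most recent 20% of rows as the test set
n_total    <- nrow(stock_df)
test_idx   <- (floor(0.8 * n_total) + 1):n_total
train_data <- stock_df[-test_idx, ]
test_data  <- stock_df[test_idx, ]

## Step 2: cross-validate on the training data to tune mtry for a random forest
set.seed(123)
ctrl   <- trainControl(method = "cv", number = 5)
rf_fit <- train(target ~ ., data = train_data,
                method    = "rf",
                trControl = ctrl,
                tuneGrid  = expand.grid(mtry = c(2, 5, 8)))

## Step 3: evaluate the chosen model on the held-out (most recent) data
test_pred <- predict(rf_fit, newdata = test_data)
postResample(pred = test_pred, obs = test_data$target)
```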

So far so good.

One problem that arises is that my test data is often valuable to me. Suppose I am predicting the value of a stock. On the one hand, I want my test data to be the most recent data so that I can see how the model performs on recent stock prices. On the other hand, I want to fit my model on my latest stock price data because that is the most relevant data for predicting tomorrow's stock price.

But after applying step 3 above, I am left with a model that was not fit using the test data.

My question is: is it legitimate (or standard practice) to refit the model with the final tuning parameters to the entire data set as a final fourth step? For example, suppose in step 2 my best parameter for a random forest model is mtry = 5. Then at the end of step 3 I refit the model to the entire data set with mtry = 5. Note that by using seeds in R I am able to ensure the model I picked in step 2 will be identical in form to the model I fit in this final step, i.e. I do not want to refit the final model with 5 new random features; I want to use the same features as the model I picked in step 2.
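
A sketch of this proposed step 4, assuming the randomForest package and reusing the placeholder objects from the sketch above (`tomorrow_features` is likewise a placeholder):

```r
## Sketch of the proposed step 4: refit on the entire data set with the
## tuned mtry, assuming the randomForest package.
library(randomForest)

full_data <- rbind(train_data, test_data)  # the entire data set

set.seed(123)  # a fixed seed, as described above for reproducibility in R
final_model <- randomForest(target ~ ., data = full_data, mtry = 5)

## The refit model, trained on all the data with the tuned mtry, is what
## would be used for tomorrow's prediction, e.g.:
## predict(final_model, newdata = tomorrow_features)
```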

Thoughts, please. Is this good or bad practice? Does it violate some fundamental principle of the machine learning approach?

Best Answer

Yes.

Of course, the accuracy you report should come from the model trained with the test set held out. But now it is time to make predictions, and you want them to be as good as they can be. Refitting on all the data is certainly legitimate and important, for exactly the reasons you mention: your most recent data is what you want to test against and also what you want the final predictions to be based on. There are, however, a few things worth being careful about.

In some machine learning algorithms, the tuning parameters are sensitive to the size of the training data. For instance, in $k$-nearest-neighbour regression the optimal $k$ grows with $n$, so you may want to cross-validate at several training-set sizes and extrapolate to find the best $k$ for your full data set.
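
An illustrative sketch of that idea, assuming the FNN package; `x` (a numeric predictor matrix) and `y` (a numeric response) are placeholders of mine, not from the original answer:

```r
## Illustrative sketch: find the CV-optimal k at several training-set sizes
## and inspect how it grows with n, assuming the FNN package.
## `x` (numeric predictor matrix) and `y` (numeric response) are placeholders.
library(FNN)

best_k_for <- function(x, y, ks = 1:30) {
  ## leave-one-out CV error for each candidate k (knn.reg performs LOOCV
  ## when no test set is supplied); return the k with the smallest error
  errs <- sapply(ks, function(k) {
    mean((knn.reg(train = x, y = y, k = k)$pred - y)^2)
  })
  ks[which.min(errs)]
}

sizes  <- c(250, 500, 1000)
best_k <- sapply(sizes, function(n) {
  idx <- sample(nrow(x), n)
  best_k_for(x[idx, , drop = FALSE], y[idx])
})

## Look at the trend of best k versus n and extrapolate to the full n
cbind(n = sizes, best_k = best_k)
```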

Another example is a regularized GLM, with an objective function of the form $f(\theta) = L(\theta) + \lambda R(\theta)$, where $L$ is your loss and $R$ is your regularization term. If you use total log-loss as $L$, it will be sensitive to the training-set size, whereas average log-loss will not. Using average log-loss therefore makes the $\lambda$ you establish in cross-validation usable with more data.
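
To spell out the contrast (the notation below is my own gloss, extending the $L$ and $R$ above), the two objectives differ only in how the loss term scales with $n$:

$$
f_{\text{total}}(\theta)=\sum_{i=1}^{n}\ell(y_i,x_i;\theta)+\lambda R(\theta),
\qquad
f_{\text{avg}}(\theta)=\frac{1}{n}\sum_{i=1}^{n}\ell(y_i,x_i;\theta)+\lambda R(\theta).
$$

With the total form, the loss term grows roughly in proportion to $n$, so a $\lambda$ tuned on smaller CV folds effectively under-regularizes once you refit on the full data set (you would need to scale it up roughly in proportion to the sample size). With the average form, the loss term stays on the same scale regardless of $n$, so the tuned $\lambda$ carries over.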
