Solved – When imputing missing values in a test set, should the new values come from the training set or be recalculated from the test set

data-imputation, data-leakage, missing-data, model-evaluation

Both answers to this question on imputing missing values note that, when imputing missing values in a test set for model evaluation, the replacement values should be the ones calculated and used in the training process (not calculated anew on the test data).

The author of Hands-On Machine Learning with Scikit-Learn & TensorFlow also suggests the same:

…you should compute the median value on the training set, and use it to fill the missing values in the training set, but also don't forget to save the median value that you have computed. You will need it later to replace missing values in the test set when you want to evaluate your system, and also once the system goes live to replace missing values in new data.

For instance, if missing values are being replaced with the median, the process for test set evaluation should be the following (a code sketch is given after the list):

  1. Create the train/test split
  2. Calculate the median values for numerical variable(s) in the training set and save this value
  3. Train the model on the training set
  4. Use the median value(s) saved in step 2 to fill the missing value(s) in the test set
  5. Evaluate the model's performance on the test set
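In scikit-learn terms, these steps might look like the minimal sketch below. The toy DataFrame, the column name `age`, and the choice of classifier are placeholders for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Toy data with some missing values (purely illustrative)
df = pd.DataFrame({
    "age": [25, np.nan, 40, 35, np.nan, 50, 28, 33],
    "y":   [0, 1, 0, 1, 0, 1, 0, 1],
})

# 1. Create the train/test split
X_train, X_test, y_train, y_test = train_test_split(
    df[["age"]], df["y"], test_size=0.25, random_state=0, stratify=df["y"]
)

# 2. Calculate the median on the training set only and save it
imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)  # the training median is stored inside the imputer

# 3. Train the model on the (imputed) training set
model = LogisticRegression()
model.fit(imputer.transform(X_train), y_train)

# 4. Fill missing values in the test set with the *training* median
X_test_imputed = imputer.transform(X_test)

# 5. Evaluate the model's performance on the test set
print(model.score(X_test_imputed, y_test))
```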

This process strikes me as counter-intuitive. Wouldn't the goal be to replicate the entire process/pipeline (including the imputation step, variable selection, outlier detection/removal, etc.) on the test set to avoid data leakage? It seems this would more closely approximate the process when applied to new data, since the evaluation would be "blind" to the values from the training set.

Quoting ESL (7.10.2, "The Wrong and Right Way to Do Cross-validation"):

In general, with a multistep modeling procedure, cross-validation must be applied to the entire sequence of modeling steps. In particular, samples must be "left out" before any selection or filtering steps are applied.

While the context for the above is (1) feature selection and (2) cross-validation, wouldn't the same rationale apply to (1) imputing missing values and (2) a single train/test split? If not, why is it bad practice to re-compute the imputed values on the test set?

Best Answer

It depends on your model and the data. You can calculate the median globally if you have all the data at hand anyway. For example, in a Kaggle competition, the training and test sets are fixed and usually both at hand, so you can calculate the median of the features over the whole dataset at once. It might improve your competition score, which in this case may be seen as the purpose of the model.

That does not work, though, if new data is coming in, because you cannot calculate the median of data that you don't yet have. So, a model that has to evaluate new samples (to which you did not have access during model building) should be built and validated with that in mind. Here, it is important to validate the model with the same method it will be used with in production. So, if you do cross-validation, you should use the same technique as in production: in this case, calculate the median on each training fold and impute it into both that training fold and its validation fold. Likewise, calculate the median on the whole training set and impute it into the test set.
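As a sketch of this per-fold imputation (assuming scikit-learn; the data below is synthetic), placing the imputer inside a Pipeline means cross_val_score refits it on each training fold, so each fold's median is computed without ever seeing its validation fold:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with roughly 10% missing entries (purely illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[rng.random(X.shape) < 0.1] = np.nan
y = rng.integers(0, 2, size=100)

# Because the imputer sits inside the pipeline, cross_val_score refits it on
# each training fold; the fold's median never leaks from the held-out fold.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("model", LogisticRegression()),
])

print(cross_val_score(pipe, X, y, cv=5).mean())
```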

On the other hand, you could have a time series where you calculate a rolling median over a given time span. I am thinking of a data stream where you have access to all the data, but it is ordered in time. Here you could impute in a time-dependent manner, e.g. using the median of the past 24 hours. In that case the imputed values are calculated from the training set and the test set largely independently, with a little overlap in the time span where the two sets border each other.
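A minimal pandas sketch of this idea, assuming an hourly series (the data below is purely illustrative): each missing reading is filled with the rolling median of the trailing 24 hours.

```python
import numpy as np
import pandas as pd

# Hourly readings with a few gaps (purely illustrative)
idx = pd.date_range("2024-01-01", periods=200, freq="h")
s = pd.Series(np.random.randn(200).cumsum(), index=idx)
s.iloc[[10, 50, 120]] = np.nan

# Rolling median over the trailing 24 hours; NaNs in the window are ignored,
# so each gap is filled from the observations just before it.
rolling_median = s.rolling("24h", min_periods=1).median()
s_imputed = s.fillna(rolling_median)
print(s_imputed.isna().sum())  # no remaining gaps
```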

The leakage problem is more concerning when data leaks from the test or validation set into the training set, or from a validation fold into a training fold. In that case you may end up with overly optimistic validation/test scores, while in production the model does not perform as expected. Using medians computed on the training set to impute the test set is not problematic at all.
