Data Imputation Techniques Using Testing Data – A Research Design Perspective

data-imputation, research-design

My supervisor has instructed another person in my lab to use both the training and testing data to impute missing values before building a machine learning model. The results of the analysis haven't been put into a publication yet, but my feeling was that A) this is wrong: the model should be trained without receiving any information from the testing set, and B) if this were to go into a publication it would be dodgy at best and potentially illegal if you didn't report it. I would be surprised if the results were published if it were reported.

Are my suspicions correct? Is there a rigorous reason why?

Best Answer

The out-of-sample data mimic the real situation of applying the model to unseen data, such as expecting Siri or Alexa to understand speech that has yet to be uttered, perhaps even by people who have yet to be born. When you are modeling, you treat the out-of-sample data as if they do not exist.$^{\dagger}$ Imputation is part of the modeling: if the imputed values are computed from the pooled data, the training set absorbs information about the test set's distribution, and the test-set performance no longer estimates performance on genuinely unseen data. Consequently, this approach is unacceptable.
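To make the distinction concrete, here is a minimal sketch in Python with scikit-learn (the arrays are made up for illustration): the imputer is fit on the training data only, and the statistics it learned there are then applied to the test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data standing in for a real dataset (values are arbitrary).
X_train = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])
X_test = np.array([[np.nan, 4.0], [6.0, np.nan]])

# Fit the imputer on the training data only...
imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)

# ...then apply those training-derived statistics to both sets.
# Missing test-set entries are filled with *training* means, so no
# information flows from the test set back into model fitting.
X_train_imputed = imputer.transform(X_train)
X_test_imputed = imputer.transform(X_test)
```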

I like the speech-recognition analogy and would gladly deploy it if a colleague of mine suggested this buffoonery. I invite you to use it too.

$^{\dagger}$It gets somewhat more complicated than this because of ideas like cross validation and having a train/validate/test split. With ($5$-fold) cross validation, you take four folds for training and do all of the modeling steps, including the imputation, on them alone, ignoring the fifth fold. Then you repeat, ignoring a different fold (which means you now train on previously ignored data), et cetera. This way, each training set is ignorant of its corresponding out-of-sample fold. A code sketch of this follows.
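As a sketch of the cross-validation point (the data below are synthetic), bundling the imputer and the model into one scikit-learn pipeline lets `cross_val_score` refit the imputation on the four training folds at each split, so the held-out fold never influences the imputed values it is scored on:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data with ~10% of entries knocked out at random.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[rng.random(X.shape) < 0.1] = np.nan
y = rng.integers(0, 2, size=100)

# Because the imputer lives inside the pipeline, cross_val_score
# refits it on the training folds at each of the 5 splits; the
# ignored fold contributes nothing to the imputation it is scored on.
pipeline = make_pipeline(SimpleImputer(strategy="mean"),
                         LogisticRegression())
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```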
