Solved – Imputing missing values on a testing set

machine learning, missing data, predictive-models

I'm a newbie to machine learning so forgive me if the answer to this question is obvious.
I have been working on a binary prediction problem using logistic regression. Using a selection of categorical and continuous predictors, I have been able to achieve an AUC of about $0.7$ on a testing set.
I have been comparing several data pre-processing approaches, each a combination of the following filtering steps:

  • no data filtering
  • removing mean-based outliers without replacement
  • removing mean-based outliers with mean replacement & additionally replacing
    NAs with the mean.
  • removing median absolute deviation outliers without replacement
  • removing median absolute deviation outliers with mean replacement &
    additionally replacing NAs with the mean.
  • repeating the five procedures above on a data set that has all of the NAs
    removed.

I find that my model is most predictive on a testing set when I remove all median absolute deviation outliers, replace them with the mean, and additionally replace pre-existing NAs with the mean.

Is it OK to impute missing values with the mean when implementing the model?

Thanks!

Best Answer

Yes.

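It is fine to perform mean imputation. However, make sure to calculate the mean (or any other statistic used in pre-processing) on the training data only, and then apply that same value to the test set. Otherwise, information from the test set leaks into your model and the test performance will look better than it really is.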
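
As a rough illustration, here is a minimal sketch in plain Python of what "calculate on train, apply to test" could look like. The helper names `fit_column_means` and `impute` and the toy data are made up for this example, and missing values are assumed to be encoded as NaN. It is not tied to any particular library.

```python
import math
from statistics import mean


def fit_column_means(train_rows):
    """Compute per-column means from the training rows, skipping NaN entries."""
    n_cols = len(train_rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in train_rows if not math.isnan(row[j])]
        means.append(mean(observed))
    return means


def impute(rows, means):
    """Return a copy of rows with each NaN replaced by the stored column mean."""
    return [
        [means[j] if math.isnan(value) else value for j, value in enumerate(row)]
        for row in rows
    ]


nan = float("nan")

# Toy data (hypothetical): two continuous predictors with some missing values.
train = [[1.0, 10.0], [3.0, nan], [nan, 30.0]]
test = [[nan, 50.0], [7.0, nan]]

# The means are estimated from the training set only ...
train_means = fit_column_means(train)

# ... and the same stored means are reused for both sets.
train_filled = impute(train, train_means)
test_filled = impute(test, train_means)
```

The point is that `test` is never passed to `fit_column_means`. The same idea would apply to the median and median absolute deviation used for outlier detection: estimate them on the training data and reuse them unchanged on the test data.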