Solved – Imputation of missing data before or after centering and scaling

centering, data preprocessing, data-imputation, machine learning

I want to impute missing values in a dataset for machine learning (kNN imputation). Is it better to scale and center the data before the imputation or afterwards?

Since the scaling and centering might rely on min and max values, in the first case the subsequent imputation might add new max / min values and distort the already scaled/centered data.

However, the imputation process might also profit from a scaled and centered dataset.
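To make that concern concrete: kNN imputation picks neighbours by distance, and an unscaled distance is dominated by whichever feature has the largest numeric range. The sketch below is only an illustration with made-up rows (age in years, income in dollars) and a hypothetical `nearest` helper; it shows the nearest neighbour changing once the columns are standardized. It does not perform imputation itself.

```python
import math
from statistics import mean, pstdev

# Hypothetical rows for illustration: (age in years, income in dollars).
query = (25, 50000)
candidates = {
    "a": (26, 51000),  # similar age and similar income
    "b": (60, 50200),  # very different age, almost identical income
    "c": (40, 90000),
    "d": (35, 20000),
}

def nearest(point, others):
    """Return the key of the row with the smallest Euclidean distance to point."""
    return min(others, key=lambda k: math.dist(point, others[k]))

# Column-wise center and scale computed over all rows, query included.
rows = [query, *candidates.values()]
columns = list(zip(*rows))
centers = [mean(col) for col in columns]
scales = [pstdev(col) for col in columns]

def standardize(row):
    """Subtract the column mean and divide by the column standard deviation."""
    return tuple((v - c) / s for v, c, s in zip(row, centers, scales))

nearest_raw = nearest(query, candidates)
nearest_scaled = nearest(
    standardize(query), {k: standardize(v) for k, v in candidates.items()}
)

print(nearest_raw)     # b: income differences swamp the 35-year age gap
print(nearest_scaled)  # a: both features now contribute comparably
```

On the raw data the 35-year age gap to row `b` barely registers next to income differences in the hundreds of dollars, so `b` wins; after standardization row `a` is the closest.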

What do you think is better, and why?

Best Answer

Presumably, if you really need to center & scale the data, that should be done after imputation, since the imputed values can change which center and scale are the correct ones to use.

Generally, the imputation should be the very first step in any analysis you do.

EDIT answer to comment by @Inon:

You say that imputation should preserve center & scale, and also standardization. Why? If the missing values truly are missing completely at random, maybe it does not matter much. But generally missingness might depend on other observed variables, and then estimates of the mean and scale computed from the observed values alone can be skewed by this pattern in the missingness. Imputation (preferably multiple imputation) is a way to fight this skewing. If you impute after scaling, you just preserve the bias introduced by the missingness mechanism, which defeats the purpose of imputing.
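A small simulation can illustrate the argument. This is a minimal sketch under stated assumptions, not a recommended procedure: the data are simulated, `y` is made missing whenever the observed variable `x` exceeds 0.5, and a single regression imputation stands in for kNN or multiple imputation to keep the code short.

```python
import random
from statistics import mean

random.seed(0)

# Simulated data (assumption for illustration): y depends on x, and y is
# missing whenever x is large, so missingness depends on an observed variable.
n = 2000
x = [random.gauss(0, 1) for _ in range(n)]
y_full = [2 * xi + random.gauss(0, 0.5) for xi in x]
y_obs = [yi if xi <= 0.5 else None for xi, yi in zip(x, y_full)]

# Center estimated from the observed values only, as happens when the data
# are centered before imputation.
observed = [(xi, yi) for xi, yi in zip(x, y_obs) if yi is not None]
mean_observed = mean(yi for _, yi in observed)

# Least-squares fit of y on x using the complete cases.
mx = mean(xi for xi, _ in observed)
slope = (sum((xi - mx) * (yi - mean_observed) for xi, yi in observed)
         / sum((xi - mx) ** 2 for xi, _ in observed))
intercept = mean_observed - slope * mx

# Fill each missing y with its regression prediction, then estimate the center.
y_imputed = [yi if yi is not None else intercept + slope * xi
             for xi, yi in zip(x, y_obs)]
mean_imputed = mean(y_imputed)
mean_full = mean(y_full)  # known here only because the data are simulated

print(f"mean of full data:       {mean_full:.2f}")
print(f"mean of observed values: {mean_observed:.2f}")
print(f"mean after imputation:   {mean_imputed:.2f}")
```

The observed-only mean sits well below the full-data mean, while the mean computed after imputation is close to it. Centering with the observed-only mean would bake that bias into the scaled data. Note that a single deterministic imputation like this one understates the spread of `y`, which is one reason multiple imputation is preferred.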
