Data imputation and normalization when using polynomial regression

Tags: data-imputation, normalization, polynomial-regression

The question is about the practical use of polynomial regression.
Let's say there is a dataset with columns A, B, and T, where T is the dependent variable and A and B are independent variables. A and B contain missing values. I want to fill in the gaps with the mean and then normalize the values by the formula

$(x - u) / s,$

where $u$ is the mean and $s$ is the standard deviation.
Everything is clear when I use linear regression. But what about polynomial regression?
For a quadratic polynomial, the columns $A^2$, $B^2$, and $AB$ are added. How should I fill in $AB$ when the values of $A$ and $B$ are missing?
With the product of the means? And when calculating $AB$, should I multiply the normalized values, or normalize the result afterwards?
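
To make the workflow concrete, here is a minimal sketch of what I have in mind, assuming pandas and scikit-learn are available (the column names follow the question and the toy values are made up):

```python
# Minimal sketch of the intended workflow: mean imputation, standardization
# (x - u) / s, then quadratic terms built from the imputed, scaled columns.
# Assumes pandas and scikit-learn; the data below are made up.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

df = pd.DataFrame({
    "A": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, np.nan, 8.0],
    "B": [2.0, np.nan, 6.0, 8.0, 10.0, np.nan, 14.0, 16.0],
    "T": [1.1, 2.0, 2.9, 4.2, 5.1, 6.0, 7.2, 8.1],
})
X, y = df[["A", "B"]], df["T"]

model = make_pipeline(
    SimpleImputer(strategy="mean"),       # fill gaps with the column mean
    StandardScaler(),                     # (x - u) / s
    PolynomialFeatures(degree=2, include_bias=False),  # A, B, A^2, AB, B^2
    LinearRegression(),
)
model.fit(X, y)
print(model.predict(X))
```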

Best Answer

I want to fill in the gaps with the mean, then normalize values

First, single imputations of missing predictor values are likely to lead to bias. See van Buuren's Flexible Imputation of Missing Data.

Second, there is usually no need to normalize the predictor values in this type of regression.

Third, for derived variables like $A^2$, $B^2$ and $AB$, van Buuren says in section 6.4.1:

The easiest way to deal with the problem is to leave any derived data outside the imputation process.

So your best choice is to do multiple imputation of the missing data on $A$ and $B$ and then just let standard design-matrix calculations produce the polynomial terms from the $A$ and $B$ values in each imputed data set.
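
A rough sketch of that recipe in Python, using scikit-learn's IterativeImputer with posterior sampling as a stand-in for a dedicated multiple-imputation package such as mice in R; the data, column names, and the choice of five imputations are hypothetical:

```python
# Rough sketch: multiple imputation of A and B, polynomial terms rebuilt
# inside each imputed data set, point estimates pooled across imputations.
# Uses scikit-learn's IterativeImputer with sample_posterior=True as a
# stand-in for a dedicated MI package; the data below are made up.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({
    "A": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, np.nan, 8.0],
    "B": [2.0, np.nan, 6.0, 8.0, 10.0, np.nan, 14.0, 16.0],
    "T": [1.1, 2.0, 2.9, 4.2, 5.1, 6.0, 7.2, 8.1],
})

m = 5          # number of imputed data sets
coefs = []
for i in range(m):
    # Impute only the raw predictors; derived terms stay outside imputation.
    imputer = IterativeImputer(sample_posterior=True, random_state=i)
    ab = imputer.fit_transform(df[["A", "B"]])

    # Let the design-matrix step build A^2, B^2, AB from the imputed values.
    X = PolynomialFeatures(degree=2, include_bias=False).fit_transform(ab)
    coefs.append(LinearRegression().fit(X, df["T"]).coef_)

# Pool point estimates across imputations (the mean, as in Rubin's rules).
print(np.mean(coefs, axis=0))
```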
