Solved – K-Nearest Neighbor imputation explanation

data-imputation, k-nearest-neighbour

I have a dataframe with some missing data in it. I need to deal with the missing values before trying anything else.

I've seen that knnImputation in R is a good choice, but I would like to understand what it really does first. (I'm just a student trying to get into data science.)

I think I've understood the kNN classifier, but I can't find any good documentation on kNN imputation.

Do you know of any? Or could someone explain it a little?
Also, is k chosen the same way for kNN imputation as for the kNN classifier?
Thanks!

Best Answer

The $k$ nearest neighbors algorithm can be used to impute missing data: for an observation with missing values, find its $k$ closest neighbors and fill in each missing value from the non-missing values of those neighbors. There are several possible approaches. You can use a 1-NN scheme, where you find the single most similar neighbor and use its value as the replacement. Alternatively, you can use $k$ neighbors and take the mean of their values, or a weighted mean in which the weights decrease with the distance to each neighbor, so the closer a neighbor is, the more it contributes. The weighted mean seems to be the most commonly used variant.
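To make the three variants concrete, here is a minimal sketch in Python with NumPy. It is not the code behind R's knnImputation, only an illustration under some assumptions of mine: the function name `knn_impute` is hypothetical, distances are Euclidean and computed on the columns observed in the incomplete row, weights are inverse distances, and a small constant avoids division by zero. Features are also assumed to be on comparable scales.

```python
import numpy as np

def knn_impute(X, k=2, weighted=True):
    """Fill NaNs row by row from the k nearest rows (illustrative helper, not a library API)."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    n_rows = X.shape[0]
    for i in range(n_rows):
        missing = np.isnan(X[i])
        if not missing.any():
            continue
        observed = ~missing
        for j in np.where(missing)[0]:
            # Donors: other rows that have a value in column j and in every
            # column that is observed in row i.
            donors = np.array([
                r for r in range(n_rows)
                if r != i
                and not np.isnan(X[r, j])
                and not np.isnan(X[r, observed]).any()
            ], dtype=int)
            if donors.size == 0:
                continue  # nothing to borrow from; leave the NaN in place
            # Euclidean distance on the columns observed in row i (assumption).
            diffs = X[donors][:, observed] - X[i, observed]
            dists = np.sqrt((diffs ** 2).sum(axis=1))
            order = np.argsort(dists, kind="stable")[:k]
            nearest, d = donors[order], dists[order]
            # Inverse-distance weights (assumption); equal weights give the plain mean.
            w = 1.0 / (d + 1e-12) if weighted else np.ones_like(d)
            filled[i, j] = np.average(X[nearest, j], weights=w)
    return filled

# Toy data: the last row is missing its third value.
X = np.array([
    [0.0, 0.0, 10.0],
    [1.0, 0.0, 20.0],
    [10.0, 10.0, 100.0],
    [0.0, 1.0, np.nan],
])

one_nn = knn_impute(X, k=1)                    # copies the closest row's value
plain = knn_impute(X, k=2, weighted=False)     # mean of the two closest rows
weighted = knn_impute(X, k=2, weighted=True)   # closer row counts more
print(one_nn[3, 2], plain[3, 2], weighted[3, 2])
```

In the toy data, the two nearest rows hold 10 and 20 in the missing column, so the plain mean sits halfway between them while the weighted mean is pulled toward 10, the value of the closer row. The distant third row is ignored for $k \le 2$.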

See also this brief article by Yohan Obadia on Medium and the paper "Nearest neighbor imputation algorithms: a critical evaluation" by Beretta and Santaniello for a more detailed discussion.
