Solved – K-nearest neighbour imputation of missing values

data-imputation, k nearest neighbour, machine learning, MATLAB, missing data

I have a dataset where the columns correspond to features and the rows correspond to data points. I have around 5'000 data points and 8 features. Now, I would like to impute the missing values with the nearest neighbour method. For this I'm using the Matlab function knnimpute.

Let's say feature 4 of row 10 has a missing value. Should I search for the nearest data points (rows) or the nearest columns? I lean towards searching the nearest data points, because I want the feature value of the closest data point. I think in this case I have to call knnimpute(data'), i.e. on the transposed data.

Of course there is the possibility that a whole row has only missing values (or more than 50% missing values). I think Matlab does no imputation if a whole row has only missing values.

Is there a rule for what to do if a whole row has only missing values? And what should I do if, for example, more than 50% of the values in a row are missing?

Best Answer

  1. Should I search the nearest data points (rows) or the nearest columns?
    Search for the nearest data points (rows), as you suspected, and impute the missing value in feature j using the jth feature of the nearest neighbours. I don't know why Matlab's knnimpute() works by columns, but given that it does, transposing the dataset is indeed the correct thing to do.
  2. Is there a rule what to do if a whole row has only missing values? And what should I do if there are e.g. more than 50% missing values in a row?
    If a whole row (or rather, a whole column after transposing) has only missing values, knnimpute() will most certainly fail. You can fill such a column with a fixed value, say 0, so that it doesn't affect the dissimilarity measure. If instead a given row (column) has a lot of missing values and you don't want to (or can't) use knnimpute(), you can implement your own imputation technique. A standard technique is to use the mean of the feature itself, counting only non-missing values, which is easy to do in Matlab thanks to the nanmean() function. On StackOverflow I posted an answer, which you can find here, regarding several missing data imputation techniques. It may help you choose a suitable technique for such a drastic scenario.
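
The two points above can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy matrix is made up, the 50% cut-off is taken from the question and is not an established rule, and the variable names (sparseRows, meanFilled, result) are hypothetical. It uses mean(..., 'omitnan') in place of nanmean() so that the fallback runs without the Statistics Toolbox, and it only calls knnimpute() if that function is on the path.

```matlab
% Toy data (made up): rows are data points, columns are features.
data = [1.0  2.0  NaN;
        1.1  NaN  3.0;
        NaN  NaN  NaN;    % row with only missing values
        0.9  2.1  3.2;
        NaN  NaN  2.8];   % row with more than 50% missing values

% Flag rows whose share of missing values exceeds the threshold.
% The 0.5 cut-off comes from the question and is an assumption, not a rule.
fracMissing = mean(isnan(data), 2);
sparseRows  = fracMissing > 0.5;

% Per-feature mean over the observed values only
% (same idea as nanmean(data, 1) from the Statistics Toolbox).
colMeans = mean(data, 1, 'omitnan');

% Fallback for sparse rows: fill each gap with the mean of its feature.
meanFilled = data;
for i = find(sparseRows)'
    gaps = isnan(meanFilled(i, :));
    meanFilled(i, gaps) = colMeans(gaps);
end

% For the remaining gaps, knnimpute searches for the nearest columns, so
% pass the transpose to search over data points, then transpose back.
result = meanFilled;
if exist('knnimpute', 'file') == 2
    result = knnimpute(meanFilled')';
end
```

Filling the sparse rows first means knnimpute() never sees a data point that consists only of missing values. Whether mean-filled rows should then be allowed to act as neighbours for other points is a design choice; excluding them from the matrix passed to knnimpute() is an equally reasonable variant.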