I have a dataset where the columns correspond to features and the rows correspond to data points. I have around 5'000 data points and 8 features. Now I would like to impute the missing values with the nearest-neighbour method. For this I'm using the Matlab function knnimpute.
Let's say feature 4 of row 10 has a missing value. Should I search for the nearest data points (rows) or the nearest columns? I tend towards searching the nearest data points, because I want the feature value of the closest data point. I think in this case I have to call knnimpute(data'), i.e. on the transposed matrix.
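A minimal sketch of what I have in mind, assuming `data` is the 5'000-by-8 matrix with NaN marking missing values (knnimpute requires the Bioinformatics Toolbox):

```matlab
% knnimpute searches for nearest-neighbour *columns*, so transpose first
% to make the data points the columns, then transpose back afterwards.
imputed = knnimpute(data')';        % neighbours = data points (rows)
% imputed = knnimpute(data', 3)';   % optionally use k = 3 neighbours
```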
Of course there is the possibility that a whole row contains only missing values (or more than 50% missing values). I think Matlab does no imputation if a whole row consists only of missing values.
Is there a rule for what to do if a whole row has only missing values? And what should I do if, e.g., more than 50% of the values in a row are missing?
Best Answer
Yes indeed. You should search for the nearest data point (i.e. row) and impute the missing value in feature j using the j-th feature from the nearest neighbours. I don't know why knnimpute() in Matlab works by columns, but given that it does, it is indeed correct to transpose the dataset.

If a whole row (or rather, column, after transposing) has only missing values, knnimpute() will most certainly fail. You can fill such a column with a given value, say 0, so that it doesn't affect the dissimilarity measure. If a given row (column) instead has a lot of missing values and you don't want to (or can't) use knnimpute(), you can implement your very own imputation technique. A standard technique is the mean of the column itself (counting only non-missing values, of course, which you can easily compute in Matlab thanks to the nanmean() function).

On StackOverflow I posted an answer, which you can find here, covering several missing-data imputation techniques. Maybe you can read it and choose a suitable technique for such a drastic scenario.