Solved – Scaling on Categorical Variables for KNN Imputation

categorical datadata-imputationk nearest neighbourmultidimensional scalingr

Problem:
I am attempting to impute on a data set in R (6000+ rows, 55 columns) with high NA proportions in most variables (from 10 – 80% missing) and have found evidence to support the KNN approach to the high NA problem. My data are survey responses with factors that have differing levels (Yes/No, Never-Always, 1-10, etc). I am new to imputing on such a data set.

My Current Assumptions:
Just to preface, my assumptions may not be correct since I am new to KNN imputation but from what I understand:

  1. KNN algorithm can predict categorical outcome variables (mine is binomial)

  2. KNN algorithm can use categorical predictor variables (mine are varied in levels)

  3. KNN imputation can only be done effectively if data is on the same scale. (Ex – if one 'satisfaction rating' variable has range of 1 – 10 but 'likelihood to recommend' has levels 1 – 5 then 'satisfaction rating' would have a greater effect on the Euclidian distance, making the nearest neighbors falsely selected)

Questions:

  1. Can categorical predictor variables of differing number of levels be used in the KNN algorithm to impute missing data?

  2. If yes to the 1st question, how do you scale them?

If these questions are unanswerable, does anyone know of an imputation practice that could fit my situation? (do not suggest mode imputation)

Best Answer

I have asked a somewhat similar question in which I require advice on scaling categorical data with just two levels: 1 and 0. To me, this type of encoding appears to be already standardised as the range is 1 when coercing to numeric atomic vector type. Your categorical data containing differing numbers of levels can be used in kNN but you would need to create dummy variables where the range of each is 1. Each level will become a variable itself and a value of 1 or 0 is all each observation can take.

For the imputation question, personally I would be cautious about imputing subjective responses. They are likely to be highly uncertain. You could try to predict the missing values using the other attributes present. A model within a model if you will.

Related Question