How to handle missing values when computing similarity (or distances)?
(I have binary feature values and do use the simple matching coefficient, but I feel that the answer to this question may be more general)
I can think of two options:
- Remove missing values
- Count missing values as error
But removing null values has the problem that a high score can be achieved with only one/a few values (see example A). And counting missing as error has the problem that real miss matches should be counted higher than missing values (see example B).
Is there a technique that has none of these shortcomings?
Example A
Instance1 Instance2
Feature1 missing missing
Feature2 1 missing
Feature3 0 0
Feature4 missing 1
Feature5 missing missing
Feature6 missing missing
Simple-matching-similarity-REMOVE = 1/1 (twice as high as B)
Simple-matching-similarity-COUNT-AS-ERROR = 1/6
Example B
Instance1 Instance2
Feature1 missing 1
Feature2 1 0
Feature3 0 0
Feature4 1 1
Feature5 1 0
Feature6 0 missing
Simple-matching-similarity-REMOVE = 2/4
Simple-matching-similarity-COUNT-AS-ERROR = 2/6 (twice as high as A)
Best Answer
As far as I know, there isn't any formal theoretical framework that describes how to do this. A heuristic technique that I've seen used in the past for similar types of missing-data problems is to replace missing values (but for purposes of performing the relative distance calculation only, not for the rest of your analysis!) with a sensibly chosen default value. In your case, a sensible default might be to choose the mean or median of each feature value, tabulated over all instances which do have the feature present. Alternatively, if you wanted to penalize instances with missing values a little bit (i.e., causing them to be treated a little more "conservatively", or more likely dissimilar, since you just don't know for sure) you might substitute missing feature values with the mean value plus some margin, say one sigma or something like that. As I said, it's a heuristic technique, so there's no well-defined "correct answer", you'd have to use your own judgment in deciding precisely which values to substitute for the missing instances.