Similarity measures with missing values

binary data, distance-functions, missing data, similarities

How should missing values be handled when computing similarities (or distances)?
(I have binary feature values and use the simple matching coefficient, but I feel that the answer to this question may be more general.)

I can think of two options:

  • Remove missing values
  • Count missing values as error

But removing missing values has the problem that a high score can be achieved from only one or a few features (see Example A). And counting missing values as errors has the problem that real mismatches should count for more than missing values (see Example B).

Is there a technique that has none of these shortcomings?

Example A

           Instance1 Instance2
  Feature1 missing   missing
  Feature2 1         missing
  Feature3 0         0
  Feature4 missing   1
  Feature5 missing   missing
  Feature6 missing   missing

  Simple-matching-similarity-REMOVE = 1/1 (twice as high as B)
  Simple-matching-similarity-COUNT-AS-ERROR = 1/6

Example B

           Instance1 Instance2
  Feature1 missing   1
  Feature2 1         0
  Feature3 0         0
  Feature4 1         1
  Feature5 1         0
  Feature6 0         missing

  Simple-matching-similarity-REMOVE = 2/4
  Simple-matching-similarity-COUNT-AS-ERROR = 2/6 (twice as high as A)
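
For reference, the four scores in Examples A and B can be reproduced with a minimal sketch like the one below. The function names `smc_remove` and `smc_count_as_error` and the use of `None` to encode a missing value are assumptions made for illustration, not part of any library.

```python
from fractions import Fraction

MISSING = None  # assumption: a missing feature value is encoded as None


def smc_remove(a, b):
    """Simple matching over the features present in both instances.

    Returns None when no feature is present in both (nothing to compare).
    """
    pairs = [(x, y) for x, y in zip(a, b)
             if x is not MISSING and y is not MISSING]
    if not pairs:
        return None
    matches = sum(x == y for x, y in pairs)
    return Fraction(matches, len(pairs))


def smc_count_as_error(a, b):
    """Simple matching over all features; any missing value counts as a mismatch."""
    matches = sum(x is not MISSING and y is not MISSING and x == y
                  for x, y in zip(a, b))
    return Fraction(matches, len(a))


# Example A
a1 = [None, 1, 0, None, None, None]
a2 = [None, None, 0, 1, None, None]

# Example B
b1 = [None, 1, 0, 1, 1, 0]
b2 = [1, 0, 0, 1, 0, None]

print(smc_remove(a1, a2), smc_count_as_error(a1, a2))  # 1 1/6
print(smc_remove(b1, b2), smc_count_as_error(b1, b2))  # 1/2 1/3
```

The fractions print in lowest terms, so 2/4 appears as 1/2 and 2/6 as 1/3.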

Best Answer

As far as I know, there is no formal theoretical framework that describes how to do this. A heuristic I have seen used for similar missing-data problems is to replace each missing value with a sensibly chosen default, but only for the purpose of the distance calculation, not for the rest of your analysis. In your case, a sensible default might be the mean or median of that feature, computed over all instances in which the feature is present.

Alternatively, if you want to penalize instances with missing values a little (that is, treat them more "conservatively", as more likely dissimilar, since you just don't know for sure), you might substitute the mean plus some margin, say one sigma. As I said, this is a heuristic, so there is no well-defined "correct answer"; you will have to use your own judgment in deciding which values to substitute for the missing ones.
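
A minimal sketch of the mean-substitution heuristic is shown below, under several assumptions that are not in the answer itself: missing values are encoded as `None`, the four instances from Examples A and B are pooled into one toy dataset to compute the per-feature means, and similarity is taken as one minus the mean absolute difference, which reduces to the simple matching coefficient when every value is 0 or 1. The helper names are hypothetical.

```python
def feature_means(instances):
    """Per-feature mean over the instances in which that feature is present."""
    n_features = len(instances[0])
    means = []
    for j in range(n_features):
        present = [row[j] for row in instances if row[j] is not None]
        # Assumption: fall back to 0.5 if a feature is missing everywhere.
        means.append(sum(present) / len(present) if present else 0.5)
    return means


def impute(row, defaults):
    """Replace missing values with the defaults (for the comparison only)."""
    return [d if v is None else v for v, d in zip(row, defaults)]


def soft_matching_similarity(a, b, defaults):
    """1 - mean absolute difference after substituting the defaults."""
    xa, xb = impute(a, defaults), impute(b, defaults)
    return 1 - sum(abs(x - y) for x, y in zip(xa, xb)) / len(xa)


# Toy dataset: the four instances from Examples A and B, pooled (an assumption).
a1 = [None, 1, 0, None, None, None]
a2 = [None, None, 0, 1, None, None]
b1 = [None, 1, 0, 1, 1, 0]
b2 = [1, 0, 0, 1, 0, None]

defaults = feature_means([a1, a2, b1, b2])

print(soft_matching_similarity(a1, a2, defaults))
print(soft_matching_similarity(b1, b2, defaults))
```

One property worth keeping in mind with this sketch: two instances that are both missing the same feature receive the same default and therefore agree on it, so very sparse instances can still come out as highly similar. Whether that is acceptable is part of the judgment call the answer describes.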
