Solved – How does Gower’s dissimilarity handle missing values in numeric columns?

clustering, distance, gower-similarity, missing data, r

I have a question about Gower dissimilarity: how does the Gower measure handle missing values in numeric columns, given that it standardizes each column by the range of that attribute?

I have read the documentation of both functions, daisy and gower.dist, in R, as well as their original source (chapter 1 of Kaufman and Rousseeuw (1990)), but I am still confused.
http://www.inside-r.org/packages/cran/StatMatch/docs/gower.dist
https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/daisy.html

I also looked at similar discussions on this website, such as
Gower distance and MDS: How to determine which variables count?
but I did not find an answer.

Also, imputing the data with dummy or mean values is not an option for me; I need my data as it is. My data are students' exam marks.

Best Answer

It's your choice. There is no "correct" way.

The most "correct" way would be to work with two similarities: an upper bound and a lower bound.

Consider this toy example:

dist(  [A, B],  [C,?] )

If the missing value is D, then you get a similarity of 0; that is your worst case. But if the missing value is B, and, say, no other record contains a B or an A, then this record could even be the most similar object.
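To make the two bounds concrete, here is a minimal Python sketch (not the R implementation of daisy or gower.dist). It works with dissimilarity, taken here as 1 minus similarity, and averages the columns with equal weight. The function names and the spec tuples are hypothetical, and for numeric columns it assumes the missing value lies somewhere inside the observed range of that column.

```python
def column_bounds(x, y, spec):
    """Bounds (low, high) on one column's contribution, each within [0, 1].

    spec is ("cat",) for a categorical column, or ("num", col_min, col_max)
    for a numeric column with col_min < col_max. None marks a missing value.
    """
    if spec[0] == "cat":
        if x is None or y is None:
            return 0.0, 1.0  # the missing value may match or differ
        d = 0.0 if x == y else 1.0
        return d, d
    _, col_min, col_max = spec
    rng = col_max - col_min
    if x is None and y is None:
        return 0.0, 1.0
    if x is None or y is None:
        known = y if x is None else x
        # Assumption: the missing value lies inside [col_min, col_max],
        # so the farthest it can be from the known value is one of the ends.
        return 0.0, max(known - col_min, col_max - known) / rng
    d = abs(x - y) / rng
    return d, d


def gower_bounds(a, b, specs):
    """Lower and upper bound on the Gower dissimilarity of records a and b."""
    pairs = [column_bounds(x, y, s) for x, y, s in zip(a, b, specs)]
    lower = sum(p[0] for p in pairs) / len(pairs)
    upper = sum(p[1] for p in pairs) / len(pairs)
    return lower, upper


# The toy example: dist([A, B], [C, ?]) with two categorical columns.
print(gower_bounds(["A", "B"], ["C", None], [("cat",), ("cat",)]))  # (0.5, 1.0)

# Hypothetical exam marks on a 0-100 scale; one student's second mark is missing.
marks = [("num", 0, 100), ("num", 0, 100)]
print(gower_bounds([80, 55], [70, None], marks))
```

In the toy example the upper bound of 1.0 corresponds to the worst case (similarity 0, the missing value is D), and the lower bound of 0.5 to the case where the missing value is B.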

But then you would need algorithms that can handle such a pair of bounds well, and I don't know of any.

A popular approach is missing value imputation. By replacing missing values (at least temporarily) with your best estimate, you are often closest to the real result.
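As a rough illustration of temporary imputation, the sketch below fills each missing mark with the mean of the observed values in the same column. The column mean is only a placeholder for "your best estimate"; the helper name and the marks are made up for the example.

```python
def impute_column_means(rows):
    """Return a copy of rows where each None is replaced by its column's observed mean."""
    n_cols = len(rows[0])
    means = []
    for k in range(n_cols):
        observed = [r[k] for r in rows if r[k] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[k] if v is None else v for k, v in enumerate(r)] for r in rows]


# Hypothetical exam marks; None marks a missing value.
rows = [[80, 55], [70, None], [40, 75]]
print(impute_column_means(rows))  # [[80, 55], [70, 65.0], [40, 75]]
```

The original rows are left untouched, so the imputed copy can be used for the distance computation only and discarded afterwards.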

Another popular approach is to ignore records with missing data.
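A less drastic variant is to ignore only the missing columns for each pair of records rather than whole records. This is how Gower's coefficient is usually defined (a column gets weight 0 for a pair when either value is missing), and it is what the daisy documentation linked in the question describes. The Python sketch below illustrates the idea for numeric columns only; it is not the R code, the function names are hypothetical, and the ranges are taken over the observed values of each column.

```python
def observed_ranges(rows):
    """Range (max - min) of each column, computed from non-missing values only."""
    n_cols = len(rows[0])
    ranges = []
    for k in range(n_cols):
        observed = [r[k] for r in rows if r[k] is not None]
        ranges.append(max(observed) - min(observed))
    return ranges


def gower_pairwise(a, b, ranges):
    """Gower dissimilarity over numeric columns, skipping every column
    in which either record has a missing value (None)."""
    total = 0.0
    used = 0
    for x, y, rng in zip(a, b, ranges):
        if x is None or y is None:
            continue  # weight 0: this column does not count for this pair
        # A constant column (range 0) contributes nothing to the distance.
        total += abs(x - y) / rng if rng > 0 else 0.0
        used += 1
    if used == 0:
        return None  # the two records share no observed column
    return total / used


# Hypothetical exam marks for three students; None marks a missing mark.
rows = [[80, 55, 90], [70, None, 60], [40, 75, None]]
ranges = observed_ranges(rows)
print(gower_pairwise(rows[0], rows[1], ranges))  # 0.625
print(gower_pairwise(rows[1], rows[2], ranges))  # 0.75
```

Note that each pair is then averaged over a different number of columns, so dissimilarities for records with many missing values rest on less information.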
