I have 9 numeric and 5 binary (0-1) variables, with 73 samples in my dataset. I know that the Gower distance is a good metric for datasets with mixed variables.
I tried both the daisy (cluster) and gower.dist (StatMatch) functions. Both functions accept weights; I assigned a weight of 5 to the numeric attributes and 1 to the binary ones.
But they give different distance matrices. Shouldn't they give the same results?
These are my features and first sample.
A B C D E F G H I J K L M N
800 1200 0 0 0 0 1 2 0.31 0.33 0.1 0.62 0.35 0.44
A: Numeric (Square feet)
B: Numeric (Dollar)
C-D-E-F-G: Binary (Yes-No)
H: Numeric (Number of children)
I-J-K-L-M-N: Numeric (Percent)
Best Answer
They in fact do give the same results. I am not sure how you are comparing them but here is an example:
Now that I understand you are adding an extra component to the daisy function, there is still a solution. It lies in the documentation for gower.dist. The key part is in the first part of the documentation, namely that columns of mode logical will be considered as binary asymmetric variables. So you want to make sure your data structure is appropriate.
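To see why the symmetric/asymmetric distinction can change the numbers, here is a minimal, package-independent sketch of a weighted Gower distance in plain Python. It is not the code of daisy or gower.dist; the function name `gower`, the column-type labels, and the two toy rows are all made up for illustration. The only assumption is the usual definition: numeric columns contribute |x - y| / range, symmetric binaries contribute 0 or 1, and asymmetric binaries are skipped entirely when both values are 0.

```python
def gower(a, b, kinds, ranges, weights):
    """Weighted Gower distance between two rows (illustrative sketch).

    kinds[k]  : "num", "sym" (symmetric binary) or "asym" (asymmetric binary)
    ranges[k] : column range, used only for numeric columns
    weights[k]: per-column weight
    """
    num = den = 0.0
    for x, y, kind, rng, w in zip(a, b, kinds, ranges, weights):
        if kind == "num":
            d = abs(x - y) / rng          # range-normalised difference
        else:
            if kind == "asym" and x == 0 and y == 0:
                continue                  # joint absence (0-0) is ignored
            d = 0.0 if x == y else 1.0    # simple match / mismatch
        num += w * d
        den += w
    # If every column was skipped there is nothing to compare.
    return num / den if den > 0 else float("nan")


# Hypothetical toy rows: one numeric column (range 400) and two binary columns.
row1 = [800, 0, 1]
row2 = [1000, 0, 0]
ranges = [400, 1, 1]
weights = [5, 1, 1]   # weight 5 for numeric, 1 for binary, as in the question

d_sym = gower(row1, row2, ["num", "sym", "sym"], ranges, weights)
d_asym = gower(row1, row2, ["num", "asym", "asym"], ranges, weights)
print(d_sym, d_asym)
```

With identical data and identical weights, the two calls differ only in how the 0-0 pair in the second column is counted, and that alone changes the denominator. So if one function sees the binary columns as logical (asymmetric) and the other sees them as numeric or as two-level factors, different matrices are to be expected.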