Solved – Gower distance with R functions; “gower.dist” and “daisy”

clustering, distance, gower-similarity, r

I have 9 numeric and 5 binary (0-1) variables, with 73 samples in my dataset. I know that the Gower distance is a good metric for datasets with mixed variables.

I tried both the daisy (cluster) and gower.dist (StatMatch) functions. Both functions accept weights; I assigned them as follows: 5 for the numeric attributes and 1 for the binary ones.

But they give different distance matrices. Shouldn't they give the same results?
These are my features and the first sample:

A    B      C   D   E   F   G   H    I       J       K   L       M       N  
800 1200    0   0   0   0   1   2   0.31    0.33    0.1 0.62    0.35    0.44

A: Numeric (square feet)
B: Numeric (dollars)
C, D, E, F, G: Binary (yes/no)
H: Numeric (number of children)
I, J, K, L, M, N: Numeric (percent)

Best Answer

They do in fact give the same results. I am not sure how you are comparing them, but here is an example:

# Create example data
set.seed(123)
# create nominal variable
nom <- factor(rep(letters[1:3], each=10))
# create numeric variables
vars <- as.matrix(replicate(17, rnorm(30)))
df <- data.frame(nom, vars)

library(cluster)
daisy.mat <- as.matrix(daisy(df, metric="gower"))

library(StatMatch)
gower.mat <- gower.dist(df)

# you can look directly to see the numbers are the same
head(daisy.mat, 3)
head(gower.mat, 3)

# yet identical() returns FALSE. Why?
identical(daisy.mat, gower.mat)
[1] FALSE

# This is because of extremely small differences
# in the numbers returned by the two functions
max(abs(daisy.mat - gower.mat))
[1] 5.551115e-17

# all.equal() compares within a numeric tolerance instead of bit for bit
all.equal(daisy.mat, gower.mat, check.attributes = FALSE)
[1] TRUE
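
The same effect can be reproduced in base R without either package. This is a minimal sketch with made-up values, showing why identical() is too strict for floating-point results and why a tolerance-based check such as all.equal() is the usual choice:

```r
# Two ways of arriving at "0.3" that differ in the last bits
a <- 0.1 + 0.2
b <- 0.3

exact_match <- identical(a, b)          # bit-for-bit comparison
abs_diff    <- abs(a - b)               # tiny, but not zero
close_match <- isTRUE(all.equal(a, b))  # comparison within a tolerance

print(exact_match)
print(abs_diff)
print(close_match)
```

Note that all.equal() returns a character description rather than FALSE when the values differ, so wrap it in isTRUE() whenever it is used inside a condition.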

Now that I understand you are passing an extra argument to daisy, there is still a solution. It lies in the documentation for gower.dist: the first part states that columns of mode logical are treated as binary asymmetric variables. So you need to make sure the columns of your data have the appropriate types.
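
To see why the symmetric/asymmetric distinction matters, here is a hand computation for two toy records with one binary and one numeric variable. It is only a sketch of the Gower formula with equal weights; the values and the assumed range of the numeric variable are made up for illustration.

```r
# Two toy records (hypothetical values): one binary flag, one numeric variable
flag_1 <- 0;   flag_2 <- 0
size_1 <- 800; size_2 <- 1000
size_range <- 400  # assumed max - min of 'size' over the whole data set

# Per-variable dissimilarities, each in [0, 1]
d_flag <- as.numeric(flag_1 != flag_2)       # 0, the flags agree
d_size <- abs(size_1 - size_2) / size_range  # 200 / 400 = 0.5

# Symmetric binary: the 0/0 agreement counts, so both variables are averaged
gower_symm <- (d_flag + d_size) / 2

# Asymmetric binary: a 0/0 pair is treated as uninformative, so the variable
# is dropped from both the numerator and the denominator
both_absent <- flag_1 == 0 && flag_2 == 0
gower_asymm <- if (both_absent) d_size / 1 else (d_flag + d_size) / 2

print(c(symmetric = gower_symm, asymmetric = gower_asymm))
```

The same pair of records ends up at distance 0.25 under the symmetric treatment and 0.5 under the asymmetric one, so two functions that treat the 0/1 columns differently will not return the same matrix.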

set.seed(123)
# create nominal variable
nom <- factor(rep(letters[1:3], each=10))
# create binary variables
bin <- as.matrix(replicate(5, sample(c(0, 1), 30, replace = TRUE)))
# create numeric variables
vars <- as.matrix(replicate(9, rnorm(30)))
df <- data.frame(nom, bin, vars)

# You can see that the binary columns are numeric, not logical
# We need to change this
str(df)
'data.frame':   30 obs. of  15 variables:
     $ nom : Factor w/ 3 levels "a","b","c": 1 1 1 1 1 1 1 1 1 1 ...
     $ X1  : num  0 1 0 1 1 0 1 1 1 0 ...
     $ X2  : num  1 1 1 1 0 0 1 0 0 0 ...
     $ X3  : num  1 0 0 0 1 0 1 1 1 0 ...
     $ X4  : num  0 1 0 1 0 0 1 0 0 1 ...
     $ X5  : num  1 0 0 0 0 1 0 0 0 1 ...
     $ X1.1: num  1.026 -0.285 -1.221 0.181 -0.139 ...
     $ X2.1: num  -0.045 -0.785 -1.668 -0.38 0.919 ...
     $ X3.1: num  1.13 -1.46 0.74 1.91 -1.44 ...
     $ X4.1: num  0.298 0.637 -0.484 0.517 0.369 ...
     $ X5.1: num  1.997 0.601 -1.251 -0.611 -1.185 ...
     $ X6  : num  0.0597 -0.7046 -0.7172 0.8847 -1.0156 ...
     $ X7  : num  -0.0886 1.0808 0.6308 -0.1136 -1.5329 ...
     $ X8  : num  0.134 0.221 1.641 -0.219 0.168 ...
     $ X9  : num  0.704 -0.106 -1.259 1.684 0.911 ...


# make the binary columns logical
df[, 2:6] <- lapply(df[, 2:6], function(x) x == 1)

# now the columns have the correct types
str(df)
'data.frame':   30 obs. of  15 variables:
     $ nom : Factor w/ 3 levels "a","b","c": 1 1 1 1 1 1 1 1 1 1 ...
     $ X1  : logi  FALSE TRUE FALSE TRUE TRUE FALSE ...
     $ X2  : logi  TRUE TRUE TRUE TRUE FALSE FALSE ...
     $ X3  : logi  TRUE FALSE FALSE FALSE TRUE FALSE ...
     $ X4  : logi  FALSE TRUE FALSE TRUE FALSE FALSE ...
     $ X5  : logi  TRUE FALSE FALSE FALSE FALSE TRUE ...
     $ X1.1: num  1.026 -0.285 -1.221 0.181 -0.139 ...
     $ X2.1: num  -0.045 -0.785 -1.668 -0.38 0.919 ...
     $ X3.1: num  1.13 -1.46 0.74 1.91 -1.44 ...
     $ X4.1: num  0.298 0.637 -0.484 0.517 0.369 ...
     $ X5.1: num  1.997 0.601 -1.251 -0.611 -1.185 ...
     $ X6  : num  0.0597 -0.7046 -0.7172 0.8847 -1.0156 ...
     $ X7  : num  -0.0886 1.0808 0.6308 -0.1136 -1.5329 ...
     $ X8  : num  0.134 0.221 1.641 -0.219 0.168 ...
     $ X9  : num  0.704 -0.106 -1.259 1.684 0.911 ...


# now you can make your calls
daisy.mat <- as.matrix(daisy(df, metric="gower", type=list(asymm=2:6)))
gower.mat <- gower.dist(df)

# and you can see that the results are the same
all.equal(daisy.mat, gower.mat, check.attributes = FALSE)
[1] TRUE
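
The question also mentions weights. A hedged sketch of how the weighted comparison might look is below. It assumes that daisy() takes the weights through its weights argument and gower.dist() through var.weights (check the documentation of your installed versions), and it reads the question's weighting as 5 for each numeric column and 1 for each binary one. The data are simulated, not the asker's, and the names df_w and w are made up for this example.

```r
set.seed(123)
# Simulated stand-in data: 5 logical (binary) and 9 numeric columns, 30 rows
bin  <- as.data.frame(replicate(5, sample(c(TRUE, FALSE), 30, replace = TRUE)))
vars <- as.data.frame(replicate(9, rnorm(30)))
names(bin)  <- paste0("bin", 1:5)
names(vars) <- paste0("num", 1:9)
df_w <- cbind(bin, vars)

# Assumed reading of the question: weight 1 per binary column, 5 per numeric
w <- c(rep(1, 5), rep(5, 9))

# Run the comparison only if both packages are installed
if (requireNamespace("cluster", quietly = TRUE) &&
    requireNamespace("StatMatch", quietly = TRUE)) {
  daisy.w <- as.matrix(cluster::daisy(df_w, metric = "gower",
                                      type = list(asymm = 1:5), weights = w))
  gower.w <- StatMatch::gower.dist(df_w, var.weights = w)
  # Compare within tolerance rather than bit for bit
  print(all.equal(daisy.w, gower.w, check.attributes = FALSE))
}
```

The weight vector needs one entry per column, in column order, for both calls. If the comparison does not come out TRUE on your own data, the first things to check are the column types and whether the same vector was passed to both functions.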
