Solved – Assign new data to a cluster (using Gower distance and PAM algorithm)

clusteringk medoidsk-meansr

I have a dataset which has mixed data types and hence I used Gower dissimilarity matrix as input to cluster the data using Partitioning Around Medoids (PAM) algorithm.

I wanted to know if there is any way by which I can assign new data points using the existing PAM model. Since I have used Gower distance, I am not sure of how to go about it as I understand that calculating Gower dissimilarity for 1 new data point is not possible.

Is there any way in R like we have fit_transform in Python – for Gower distances? And is it even possible to save PAM/K-means models in R and use the same to "predict"/assign clusters to new data?

Best Answer

In theory if you know the medoids from the train clustering, you just need to calculate the distances to these medoids again in your test data, and assign it to the closest. So below I use the iris example:

library(cluster)
set.seed(111)
idx = sample(nrow(iris),100)
trn = iris[idx,]
test = iris[-idx,]

mdl = pam(daisy(iris[idx,],metric="gower"),3)

we get out the medoids like this:

trn[mdl$id.med,]
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
40           5.1         3.4          1.5         0.2     setosa
100          5.7         2.8          4.1         1.3 versicolor
113          6.8         3.0          5.5         2.1  virginica

So below I write a function to take these 3 medoid rows out of the train data, calculate a distance matrix with the test data, and extract for each test data, the closest medoid:

predict_pam = function(model,traindata,newdata){

nclus = length(model$id.med)
DM = daisy(rbind(traindata[model$id.med,],newdata),metric="gower")
max.col(-as.matrix(DM)[-c(1:nclus),1:nclus])

}

You can see it works pretty well:

predict_pam(mdl,trn,test)
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3
[39] 3 3 3 3 3 3 3 3 3 3 3 3

We can visualize this:

library(MASS)
library(ggplot2)

df = data.frame(cmdscale(daisy(rbind(trn,test),metric="gower")),
rep(c("train","test"),c(nrow(trn),nrow(test))))
colnames(df) = c("Dim1","Dim2","data")

df$cluster = c(mdl$clustering,predict_pam(mdl,trn,test))
df$cluster = factor(df$cluster)

ggplot(df,aes(x=Dim1,y=Dim2,col=cluster,shape=data)) + 
geom_point() + facet_wrap(~data)

enter image description here

Related Question