Solved – Clustering of points based on vector feature similarities in R

clustering, machine-learning, r, similarities

I have as input a number of points that I need to partition into clusters. Each point has a number of features that should be used to measure the similarity between each point and the others. Some of these features are scalar values (a single number) and others are vectors.

For example, assume that each point has the following features:

  1. S1: scalar value

  2. V1: 48 $\times$ 1 vector

  3. V2: 48 $\times$ 1 vector

For example, one point may have (S1, V1, V2) = (100, {0, 100, 20, 30, …}, {75, 0, 10, 5, …}).

My hypothesis is to use cosine similarity to measure how similar the vector V1 (or V2) of one point is to the V1 (or V2) of another point. I have already computed the similarity matrices between all points in terms of V1 and V2 similarities.
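For reference, a pairwise cosine-similarity matrix for one vector feature can be computed in a few lines of base R. This is a sketch: the matrix `V1` below (one row per point, 48 columns) is randomly generated stand-in data, not the actual features.

```r
# Sketch: pairwise cosine similarity between the V1 vectors of n points.
# V1 is a hypothetical n x 48 matrix, one row per point (dummy data here).
set.seed(1)
V1 <- matrix(runif(5 * 48), nrow = 5)

cosine_sim <- function(M) {
  # Normalize each row to unit length; the cross-product of the
  # normalized rows then gives all pairwise cosine similarities.
  Mn <- M / sqrt(rowSums(M^2))
  Mn %*% t(Mn)
}

S_v1 <- cosine_sim(V1)  # symmetric n x n matrix, diagonal = 1
```

The same function applied to V2 yields the second similarity matrix.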

By exploring the standard clustering algorithms in R, I have found that k-means uses the Euclidean distance, which might be suitable for clustering points according to their scalar values, but it doesn't work for the situation where I have hybrid feature types (scalars and vectors). K-medoids clustering also seems to support only the Euclidean and Manhattan distances.

I think what should be done is to generate one more distance/similarity matrix between all points based on the scalar value. We would then end up with three similarity matrices, one per feature, each showing how similar every pair of points is regardless of whether the feature is a scalar or a vector, and those matrices could be used to find the neighbourhood of points while clustering.
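One way to sketch this fusion step: convert the cosine similarities to dissimilarities (1 − similarity), rescale the scalar-based distances to a comparable range, and average the three matrices. The equal weighting and the rescaling to [0, 1] are assumptions for illustration, not a prescribed recipe; the data below is random stand-in data.

```r
# Sketch: fuse a scalar-based distance matrix with two cosine
# dissimilarity matrices into one combined distance matrix D.
set.seed(2)
n  <- 6
s1 <- runif(n, 0, 200)                 # scalar feature S1 (dummy)
V1 <- matrix(runif(n * 48), nrow = n)  # vector feature V1 (dummy)
V2 <- matrix(runif(n * 48), nrow = n)  # vector feature V2 (dummy)

cosine_dist <- function(M) {
  Mn <- M / sqrt(rowSums(M^2))
  1 - Mn %*% t(Mn)                     # dissimilarity = 1 - cosine similarity
}

D_s1 <- as.matrix(dist(s1))            # |S1_i - S1_j| for the scalar
D_s1 <- D_s1 / max(D_s1)               # rescale to [0, 1] (assumed choice)

# Equal weights are a placeholder; they could be tuned per feature.
D <- (D_s1 + cosine_dist(V1) + cosine_dist(V2)) / 3
```

The resulting `D` is a symmetric matrix with zeros on the diagonal, which is exactly what distance-matrix-based clustering algorithms expect.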

I wonder if there is an implementation of a clustering algorithm that accepts as input the similarity matrices (or, alternatively, the dissimilarity/distance matrices) between the vector features of multiple points and uses them for clustering?

Best Answer

Actually, K-medoids (aka PAM) can work with any kind of distance or similarity metric, and so can DBSCAN, OPTICS and hierarchical clustering. The latter, however, is usually implemented in $O(n^3)$, so it is not an option if you have a lot of instances.
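In R, both PAM and hierarchical clustering accept a precomputed dissimilarity object directly. A minimal sketch, assuming `D` is your combined symmetric distance matrix (random stand-in data here) and `k = 3` is an arbitrary choice:

```r
# Sketch: cluster directly from a precomputed distance matrix.
library(cluster)  # provides pam() (PAM / k-medoids)

set.seed(3)
D <- as.matrix(dist(matrix(runif(10 * 3), nrow = 10)))  # dummy 10 x 10 matrix

# PAM on the distance matrix itself; diss = TRUE marks x as dissimilarities.
fit_pam <- pam(as.dist(D), k = 3, diss = TRUE)
fit_pam$clustering              # cluster label per point

# Hierarchical clustering on the same matrix, cut into 3 groups.
fit_hc      <- hclust(as.dist(D), method = "average")
clusters_hc <- cutree(fit_hc, k = 3)
```

`as.dist()` converts the full symmetric matrix into the lower-triangular `dist` object these functions expect.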