Solved – Weighted clustering on large spatial dataset


I have a very large (36k items) spatial dataset of locations of commercial land uses with their corresponding square footages. I am hoping to use the pam() function in R (from the {cluster} package) to form clusters around a set of centers determined by other methods.

I am trying to figure out how to weight the individual points such that large square footages have more attraction to other points than small square footages. My initial thought was to duplicate each point once per 1000 square feet, such that a 100,000 square foot point would be duplicated 100 times. However, I've read elsewhere that the clustering algorithms are computationally intensive. The package documentation suggests using clara() for large datasets, but this method won't allow me to specify the medoids beforehand.

Is there another method for weighted clustering? Am I perhaps going about this all wrong?

Best Answer

If using pam, you will need to define your own dissimilarity matrix. A simple example would be to let $\textrm{distCOMBO}(i,j) = \alpha \cdot \textrm{distEuc}(i,j) + (1-\alpha) \cdot \textrm{distSize}(i,j) \cdot g(\textrm{distEuc}(i,j))$, where distEuc is the Euclidean distance and distSize is a dissimilarity you define from the attraction associated with the size (square footage) of the two locations. Finally, $g()$ scales the size contribution relative to the distance, in the spirit of gravity, where the pull between two masses depends on both their sizes and the distance between them. You can define $g$ as you see fit: perhaps the inverse of the Euclidean distance, perhaps some other function (e.g. 0 if two locations are in the same region, 1 otherwise).

While there are potentially better dissimilarity functions, this first draft is appealing because it is tunable: letting $\alpha$ range over $[0,1]$ moves from the size term alone ($\alpha = 0$) to plain Euclidean distance ($\alpha = 1$).
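As a rough illustration, the combined dissimilarity could be built and handed to pam() along the following lines. This is only a sketch on synthetic data: the coordinates, square footages, center indices, the choice of distSize (one minus the larger of the two scaled sizes), the bounded inverse $g(d) = 1/(1+d)$, and $\alpha = 0.7$ are all assumptions made for the example, not part of the original answer. It relies on pam() accepting initial medoids through its `medoids` argument and skipping the swap phase with `do.swap = FALSE`, so the predetermined centers stay fixed.

```r
library(cluster)  # provides pam()

set.seed(1)
n    <- 200                                        # small synthetic stand-in for the real data
xy   <- cbind(x = runif(n), y = runif(n))          # hypothetical coordinates
sqft <- round(rlnorm(n, meanlog = 9, sdlog = 1))   # hypothetical square footages

# Euclidean distance between locations, rescaled to [0, 1]
dist_euc <- as.matrix(dist(xy))
dist_euc <- dist_euc / max(dist_euc)

# distSize (assumption): small whenever either location is large,
# so large sites look "closer" to everything else
w         <- sqft / max(sqft)
dist_size <- 1 - outer(w, w, pmax)

# g (assumption): bounded inverse of distance, avoids division by zero
g <- function(d) 1 / (1 + d)

alpha      <- 0.7                                  # tuning parameter in [0, 1]
dist_combo <- alpha * dist_euc + (1 - alpha) * dist_size * g(dist_euc)
diag(dist_combo) <- 0                              # a point is at distance 0 from itself

# Hypothetical row indices of the predetermined centers
centers <- c(10, 50, 120)

# Supply the centers as medoids and skip the swap phase so they are not moved
fit <- pam(as.dist(dist_combo), k = length(centers),
           medoids = centers, do.swap = FALSE)

table(fit$clustering)   # cluster sizes
fit$id.med              # indices of the medoids actually used
```

Re-running with different values of `alpha` shows how strongly size reshapes the assignment. Note that at n = 36,000 the full dissimilarity object holds about 648 million pairs (roughly 5 GB as doubles), so memory, rather than pam() itself, may be the first constraint.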
