Solved – Weighted clustering on large spatial dataset

clustering, r, spatial

I have a very large (36k items) spatial dataset of locations of commercial land uses with their corresponding square footages. I am hoping to use the pam() command in R (from the {cluster} package) to form clusters around a set of centers determined by other methods.

I am trying to figure out how to weight the individual points so that large square footages attract other points more strongly than small square footages do. My initial thought was to duplicate each point once per 1000 square feet, so that a 100,000 square foot point would be duplicated 100 times. However, I've read elsewhere that the clustering algorithms are computationally intensive. The package documentation suggests using clara() for large datasets, but this method won't allow me to specify the medoids beforehand.
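As a rough illustration of the duplication idea, the sketch below replicates each row once per started 1000 square feet. The data frame `pts` and its columns are hypothetical toy data, not the real dataset.

```r
# Hypothetical toy data: three locations with square footages
pts <- data.frame(x = c(0, 1, 2),
                  y = c(0, 1, 0),
                  sqft = c(1000, 100000, 2500))

# One copy per started 1000 square feet
copies   <- ceiling(pts$sqft / 1000)
expanded <- pts[rep(seq_len(nrow(pts)), times = copies), ]

nrow(expanded)   # 1 + 100 + 3 = 104 rows
```

Because pam() works on all n(n-1)/2 pairwise dissimilarities, multiplying the number of points by some factor multiplies the work and memory by roughly the square of that factor, which is why duplication gets expensive quickly.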

Is there another method for weighted clustering? Am I perhaps going about this all wrong?

Best Answer

If you use pam(), you will need to define your own dissimilarity matrix. A simple example is $\textrm{distCOMBO}(i,j) = \alpha \cdot \textrm{distEuc}(i,j) + (1-\alpha) \cdot \textrm{distSize}(i,j) \cdot g(\textrm{distEuc}(i,j))$, where distEuc is the Euclidean distance and distSize is a dissimilarity you define, based on the attraction associated with the size (square footage) of the locations. Finally, $g()$ scales the size contribution relative to the distance (à la gravity, where attraction depends on both the sizes of two masses and the distance between them). You can define $g$ as you see fit: perhaps an inverse of the Euclidean distance, or some other function (e.g. 0 if two locations are in the same region, 1 otherwise).

While there are potentially better similarity functions, this first draft is appealing because it is tunable (i.e. let $\alpha$ range over $[0,1]$).
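A minimal sketch of this combined dissimilarity in R, on synthetic data, might look as follows. The particular `dist_size` and `g()` are illustrative assumptions (the answer leaves both open), and the indices in `centers` are hypothetical stand-ins for the predetermined centers. The `medoids` and `do.swap` arguments of pam() are used to keep those centers fixed.

```r
library(cluster)

set.seed(1)
n    <- 200
xy   <- cbind(x = runif(n), y = runif(n))   # synthetic coordinates
sqft <- rlnorm(n, meanlog = 9, sdlog = 1)   # synthetic square footages

alpha <- 0.7                                # weight on pure distance

# Euclidean distance, rescaled to [0, 1]
dist_euc <- as.matrix(dist(xy))
dist_euc <- dist_euc / max(dist_euc)

# Assumed size term: close to 0 when both locations are large,
# close to 1 when at least one of them is small
s         <- sqft / max(sqft)
dist_size <- 1 - outer(s, s)

# Assumed g(): identity on the scaled distance, so pairs of large
# locations are effectively pulled closer together
g <- function(d) d

dist_combo <- alpha * dist_euc + (1 - alpha) * dist_size * g(dist_euc)

# Hypothetical row indices of the predetermined centers
centers <- c(10, 50, 120)

# do.swap = FALSE keeps the supplied medoids instead of optimising them
fit <- pam(as.dist(dist_combo), k = length(centers),
           medoids = centers, do.swap = FALSE)

table(fit$clustering)
```

Note that a full dissimilarity matrix for 36k points holds about 648 million pairs, roughly 5 GB as doubles, so a sketch like this would need plenty of memory (or a subsample) at the scale described in the question.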

Related Solutions

Solved – Clustering spatial data in R

There are several approaches to scalable clustering: divide and conquer, parallel clustering, and incremental clustering. These are general strategies, after which you can apply standard clustering methods. One method I particularly like is DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which is one of the most widely used clustering algorithms.

Solved – Density-based spatial clustering of applications with noise (DBSCAN) clustering in R

I'm still stuck with this problem. I have received some suggestions from the R mailing list (thanks to Christian Hennig) that I attach here:

Have you considered the dbscan function in library fpc, or was it another one? The fpc::dbscan() function doesn't have a "distance" parameter but several options, one of which may resolve your memory problem (look up the documentation of the "memory" parameter).

Using a distance matrix for hundreds of thousands of points is a recipe for disaster (memory-wise). I'm not sure whether the function that you used did that, but fpc::dbscan() can avoid it.

It is true that fpc::dbscan() requires tuning constants that the user has to provide. There is unfortunately no general rule how to do this; it would be necessary to understand the method and the meaning of the constants, and how this translates into the requirements of your application.

You may try several different choices and do some cluster validation to see what works, but I can't explain this in general terms easily via email.

I have made some attempts with my data but without any success:

"Yes, I have tried dbscan from fpc but I'm still stuck on the memory problem. Regarding your answer, I'm not sure which memory parameter I should look at. Following is the code I tried with dbscan parameters; maybe you can see if there is any mistake.

> sstdat=read.csv("sst.dat",sep=";",header=F,col.names=c("lon","lat","sst"))
> library(fpc)
> sst1=subset(sstdat, sst<50)
> sst2=subset(sst1, lon>-6)
> sst2=subset(sst2, lon<40)
> sst2=subset(sst2, lat<46)
> dbscan(sst2$sst, 0.1, MinPts = 5, scale = FALSE, method = c("hybrid"), 
         seeds = FALSE, showplot = FALSE, countmode = NULL)
Error: cannot allocate vector of size 858.2 Mb
> head(sst2)
       lon   lat   sst
1257 35.18 24.98 26.78
1258 35.22 24.98 26.78
1259 35.27 24.98 26.78
1260 35.31 24.98 26.78
1261 35.35 24.98 26.78
1262 35.40 24.98 26.85

In this example I only apply dbscan() to the temperature values, not lon/lat, so the eps parameter is 0.1. Because it is a gridded data set, every point is surrounded by eight data points, so I thought that at least 5 of the surrounding points should be within the reachability distance. But I'm not sure that considering only the temperature value is the right approach; I may be missing the spatial information. How should I deal with longitude and latitude data?

Dimensions of sst2 are: 152243 rows x 3 columns."

I am sharing these mail messages here in case any of you can shed some light on R and DBSCAN. Thanks again.
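One possible reading of the advice above, offered only as a sketch: in fpc::dbscan() the memory behaviour appears to be governed by the `method` argument, where "raw" works on the raw data and avoids building a distance matrix (at the cost of speed). The example below uses a small synthetic grid as a stand-in for sst2, and standardises lon, lat and sst so that a single eps applies to all three. The eps value of 0.5 is an arbitrary assumption that would need tuning on real data.

```r
# Synthetic stand-in for sst2 (lon, lat, sst); not the real data
set.seed(42)
grid <- expand.grid(lon = seq(-6, 40, length.out = 40),
                    lat = seq(30, 46, length.out = 25))
grid$sst <- ifelse(grid$lon < 17, 18, 26) + rnorm(nrow(grid), sd = 0.2)

# Standardise the columns so eps is comparable across lon, lat and sst
x <- scale(as.matrix(grid[, c("lon", "lat", "sst")]))

# Run only if fpc is installed; method = "raw" avoids a distance matrix
if (requireNamespace("fpc", quietly = TRUE)) {
  fit <- fpc::dbscan(x, eps = 0.5, MinPts = 5, method = "raw")
  print(table(fit$cluster))   # cluster 0 holds the noise points
}
```

Including the scaled coordinates means that points are grouped only when they are close both in space and in temperature, which addresses the concern about losing the spatial information.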
