Density-based spatial clustering of applications with noise (DBSCAN) clustering in R

clustering, r, spatial

This question started as "Clustering spatial data in R" and has now become a DBSCAN question.

As the responses to the first question suggested, I searched for information about DBSCAN and read some documentation about it. New questions have arisen.

DBSCAN requires some parameters, one of which is "distance". As my data are three-dimensional (longitude, latitude and temperature), which "distance" should I use? Which dimension is that distance related to? I suppose it should be temperature. How do I find such a minimum distance with R?

Another parameter is the minimum number of points needed to form a cluster. Is there any method to find that number? Unfortunately, I haven't found one.
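One common heuristic, described in the original DBSCAN paper, is to fix the minimum number of points (the paper suggests 4 for two-dimensional data) and inspect the sorted distance from each point to its k-th nearest neighbour; a sharp bend in that curve is a candidate for the distance parameter. The following is only a minimal sketch of that idea in base R. The data are synthetic and made up for illustration, and the full distance matrix it builds is affordable only for a small sample, not for the full grid.

```r
# Sketch: sorted k-nearest-neighbour distances on a small synthetic sample.
# The points below are invented for illustration; they are not the SST grid.
set.seed(1)
sample_pts <- rbind(
  matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),  # dense blob 1 (100 points)
  matrix(rnorm(200, mean = 3, sd = 0.3), ncol = 2),  # dense blob 2 (100 points)
  matrix(runif(40, min = -2, max = 5), ncol = 2)     # scattered noise (20 points)
)

k <- 4  # candidate MinPts; an assumption, to be tuned for the application

# Full distance matrix: fine for a few hundred points, NOT for the full grid.
d <- as.matrix(dist(sample_pts))

# For each point, sort its distances. Position 1 is the point itself
# (distance 0), so position k + 1 is the distance to its k-th neighbour.
kdist <- apply(d, 1, function(row) sort(row)[k + 1])

# Sorted k-distances: a sharp bend ("knee") suggests a candidate eps.
kdist_sorted <- sort(kdist)
if (interactive()) {
  plot(kdist_sorted, type = "l",
       xlab = "points sorted by k-distance",
       ylab = paste0(k, "-NN distance"))
}
```

On a large data set the same curve can be drawn from a random subsample of the rows. The CRAN package dbscan also provides a kNNdistplot() function for this kind of plot, which may be more convenient than the hand-rolled version above.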

Searching through Google, I could not find an R example of using dbscan on a dataset similar to mine. Do you know of any website with such examples, so that I can read them and try to adapt them to my case?

The last question is that my first R attempt with DBSCAN (without a proper answer to the previous questions) resulted in a memory problem: R says it cannot allocate a vector. I start with a 4 km spaced grid of 779191 points, which ends up as approximately 300000 rows x 3 columns (latitude, longitude and temperature) after removing invalid SST points. Any hints on how to address this memory problem? Does it depend on my computer or on DBSCAN itself?

Thanks for your patience in reading a long and probably boring message, and for your help.

Best Answer

I'm still stuck with this problem. I have received some suggestions from the R mailing list (thanks to Christian Hennig) that I attach here:

Have you considered the dbscan function in library fpc, or was it another one? The fpc::dbscan() function doesn't have a "distance" parameter but several options, one of which may resolve your memory problem (look up the documentation of the "memory" parameter).

Using a distance matrix for hundreds of thousands of points is a recipe for disaster (memory-wise). I'm not sure whether the function that you used did that, but fpc::dbscan() can avoid it.

It is true that fpc::dbscan() requires tuning constants that the user has to provide. There is unfortunately no general rule how to do this; it would be necessary to understand the method and the meaning of the constants, and how this translates into the requirements of your application.

You may try several different choices and do some cluster validation to see what works, but I can't explain this in general terms easily via email.
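To put the distance-matrix warning above into numbers, here is a rough back-of-the-envelope sketch. It assumes distances are stored as 8-byte doubles and that only the lower triangle is kept, as base R's dist() does; the helper name dist_matrix_gib is made up for this illustration.

```r
# Rough memory needed to hold all pairwise distances as 8-byte doubles.
# dist() stores the lower triangle only: n * (n - 1) / 2 values.
# dist_matrix_gib is a hypothetical helper, not part of any package.
dist_matrix_gib <- function(n) {
  n_pairs <- n * (n - 1) / 2   # number of distinct point pairs
  n_pairs * 8 / 1024^3         # bytes -> GiB
}

dist_matrix_gib(152243)  # rows left after the lon/lat subsetting
dist_matrix_gib(300000)  # approximate rows of valid SST points
```

Under these assumptions the two cases come to roughly 86 GiB and 335 GiB, far beyond the memory of a typical desktop, so the problem lies in the approach rather than in the particular computer. The size reported in an R "cannot allocate vector" error is only the single allocation that failed, not the total memory that would have been needed.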

I have made some attempts with my data but without any success:

"Yes, I have tried dbscan from fpc but I'm still stuck on the memory problem. Regarding your answer, I'm not sure which memory parameter should I look at. Following is the code I tried with dbscan parameters, maybe you can see if there is any mistake.

> sstdat=read.csv("sst.dat",sep=";",header=F,col.names=c("lon","lat","sst"))
> library(fpc)
> sst1=subset(sstdat, sst<50)
> sst2=subset(sst1, lon>-6)
> sst2=subset(sst2, lon<40)
> sst2=subset(sst2, lat<46)
> dbscan(sst2$sst, 0.1, MinPts = 5, scale = FALSE, method = c("hybrid"), 
         seeds = FALSE, showplot = FALSE, countmode = NULL)
Error: cannot allocate vector of size 858.2 Mb
> head(sst2)
       lon   lat   sst
1257 35.18 24.98 26.78
1258 35.22 24.98 26.78
1259 35.27 24.98 26.78
1260 35.31 24.98 26.78
1261 35.35 24.98 26.78
1262 35.40 24.98 26.85

In this example I only apply dbscan() to temperature values, not lon/lat, so eps parameter is 0.1. As it is a gridded data set any point is surrounded by eight data points, then I thought that at least 5 of the surrounding points should be within the reachability distance. But I'm not sure I'm getting the right approach by only considering temperature value, maybe then I'm missing spatial information. How should I deal with longitude and latitude data?

Dimensions of sst2 are: 152243 rows x 3 columns "
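One possible way to bring the spatial coordinates in, sketched here under stated assumptions rather than as a tested solution for this data set: pass all three columns to dbscan() after putting them on a comparable scale, and use method = "raw", which according to the fpc documentation works on the raw data and avoids computing a distance matrix (saving memory, possibly at the cost of speed). This may be the option the mailing-list reply was pointing at. The data below are a synthetic stand-in for sst2, and the eps and MinPts values are illustrative guesses, not recommendations.

```r
# Synthetic stand-in for sst2 (hypothetical values, same column names).
set.seed(42)
grid <- expand.grid(lon = seq(0, 2, by = 0.05), lat = seq(30, 32, by = 0.05))
grid$sst <- ifelse(grid$lon < 1, 20, 24) + rnorm(nrow(grid), sd = 0.1)

# Standardise each column so that degrees of lon/lat and degrees of
# temperature contribute comparably to the Euclidean distance.
# This weighting is an assumption; another scaling may suit the application better.
features <- scale(grid[, c("lon", "lat", "sst")])

if (requireNamespace("fpc", quietly = TRUE)) {
  # method = "raw": no distance matrix is built (lighter on memory, slower).
  # eps is now a radius in the scaled 3-D space, not a temperature difference.
  fit <- fpc::dbscan(features, eps = 0.5, MinPts = 5, method = "raw")
  print(table(fit$cluster))  # cluster label 0 marks noise points
}
```

With all three columns in the distance, eps no longer has temperature units, so a value such as 0.1 chosen for SST alone would need to be revisited, for example by looking at nearest-neighbour distances in the scaled data.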

I am sharing these mail messages here in case any of you can shed some light on R and DBSCAN. Thanks again.
