I'm still stuck with this problem. I have received some suggestions from the R mailing list (thanks to Christian Hennig) that I attach here:
Have you considered the dbscan function in library fpc, or was it another one? The fpc::dbscan() function doesn't have a "distance" parameter but several options, one of which may resolve your memory problem (look up the documentation of the "memory" parameter).
Using a distance matrix for hundreds of thousands of points is a recipe for disaster (memory-wise). I'm not sure whether the function that you used did that, but fpc::dbscan() can avoid it.
It is true that fpc::dbscan() requires tuning constants that the user has to provide. There is unfortunately no general rule how to do this; it would be necessary to understand the method and the meaning of the constants, and how this translates into the requirements of your application.
You may try several different choices and do some cluster validation
to see what works, but I can't explain this in general terms easily
via email.
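To put the distance-matrix warning into numbers, here is a back-of-the-envelope sketch in base R. It assumes double-precision storage (8 bytes per value) and uses the row count of sst2 reported further down in this post; the variable names are only illustrative.

```r
n <- 152243                       # rows of sst2, as reported in this post
bytes_per_value <- 8              # assumption: distances stored as doubles

# Full n x n matrix versus the lower triangle that dist() stores
gib_full <- n^2 * bytes_per_value / 2^30
gib_dist <- n * (n - 1) / 2 * bytes_per_value / 2^30

round(c(full = gib_full, lower_triangle = gib_dist), 1)
```

Either figure is far beyond the memory of a typical workstation, which is why an approach that avoids the full distance matrix matters here.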
I have made some attempts with my data but without any success:
"Yes, I have tried dbscan from fpc but I'm still stuck on the memory problem. Regarding your answer, I'm not sure which memory parameter should I look at. Following is the code I tried with dbscan parameters, maybe you can see if there is any mistake.
> sstdat=read.csv("sst.dat",sep=";",header=F,col.names=c("lon","lat","sst"))
> library(fpc)
> sst1=subset(sstdat, sst<50)
> sst2=subset(sst1, lon>-6)
> sst2=subset(sst2, lon<40)
> sst2=subset(sst2, lat<46)
> dbscan(sst2$sst, 0.1, MinPts = 5, scale = FALSE, method = c("hybrid"),
seeds = FALSE, showplot = FALSE, countmode = NULL)
Error: cannot allocate vector of size 858.2 Mb
> head(sst2)
       lon   lat   sst
1257 35.18 24.98 26.78
1258 35.22 24.98 26.78
1259 35.27 24.98 26.78
1260 35.31 24.98 26.78
1261 35.35 24.98 26.78
1262 35.40 24.98 26.85
In this example I apply dbscan() only to the temperature values, not to lon/lat, so the eps parameter is 0.1. Because the data set is gridded, every point is surrounded by eight data points, so I thought that at least 5 of the surrounding points should be within the reachability distance. But I'm not sure that considering only the temperature value is the right approach; I may be missing spatial information. How should I deal with the longitude and latitude data?
The dimensions of sst2 are 152243 rows x 3 columns."
I am sharing these mail messages here in case any of you can shed some light on R and DBSCAN. Thanks again.
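For anyone who wants to experiment, here is a minimal sketch of one way to bring longitude and latitude into the clustering while avoiding a full distance matrix. It rests on several assumptions: that the memory-related option mentioned in the mail is the method argument of fpc::dbscan(), whose "raw" setting is documented as avoiding the distance matrix at the cost of speed; that standardising lon, lat and sst with scale() is an acceptable way to weight them (it treats all three equally, which may not suit the application); and that eps = 0.3 is an arbitrary value that would need tuning. The grid below is hypothetical stand-in data, not my sst.dat.

```r
# Hypothetical stand-in for sst2: a small regular grid with two temperature regimes
set.seed(1)
grid <- expand.grid(lon = seq(-5, 5, by = 0.5), lat = seq(30, 40, by = 0.5))
grid$sst <- ifelse(grid$lon < 0, 18, 26) + rnorm(nrow(grid), sd = 0.05)

# Standardise the columns so that a single eps applies to position and temperature
features <- scale(as.matrix(grid[, c("lon", "lat", "sst")]))

# Run only if fpc is installed; method = "raw" works on the raw data
# instead of a precomputed distance matrix
if (requireNamespace("fpc", quietly = TRUE)) {
  fit <- fpc::dbscan(features, eps = 0.3, MinPts = 5, method = "raw")
  print(table(fit$cluster))   # cluster 0 holds the noise points
}
```

Whether equal weighting of position and temperature is sensible depends on the question being asked; rescaling one column before clustering is a simple way to change that balance.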
Best Answer
If using pam, you will need to define your own dissimilarity matrix. A simple example would be to let $\textrm{distCOMBO}(i,j) = \alpha \cdot \textrm{distEuc}(i,j) + (1-\alpha) \cdot \textrm{distSize}(i,j) \cdot g(\textrm{distEuc}(i,j))$, where distEuc is the Euclidean distance and distSize is something you determine, based on the attraction associated with the size of the locations (square footage). Finally, $g()$ is used to scale the size contribution relative to the distance (a la gravity and the relationship between the sizes of two masses and their distance). You can define $g$ as you see fit: perhaps it is an inverse of the Euclidean distance, perhaps some other function (e.g. 0 if two locations are in the same region, 1 otherwise). While there are potentially better similarity functions, this first draft is appealing because it is tunable (i.e. let $\alpha$ range over $[0,1]$).
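As a rough illustration of the combined dissimilarity, the sketch below builds distCOMBO for a handful of made-up locations and passes it to pam() from the cluster package. Everything specific in it is an assumption chosen for the example: the simulated data frame locs, rescaling distSize to [0, 1], the choice g(d) = 1 / (1 + d), alpha = 0.7 and k = 3.

```r
library(cluster)  # provides pam()

# Hypothetical locations: coordinates plus a size attribute (e.g. square footage)
set.seed(42)
locs <- data.frame(x = runif(30), y = runif(30), size = runif(30, 500, 5000))

dist_euc  <- as.matrix(dist(locs[, c("x", "y")]))  # Euclidean distance
dist_size <- as.matrix(dist(locs$size))            # absolute size difference
dist_size <- dist_size / max(dist_size)            # assumption: rescale to [0, 1]

# One possible g(): the size term fades as locations get further apart
g <- function(d) 1 / (1 + d)

alpha <- 0.7
dist_combo <- alpha * dist_euc + (1 - alpha) * dist_size * g(dist_euc)

# pam() treats a "dist" object as a dissimilarity matrix
fit <- pam(as.dist(dist_combo), k = 3)
table(fit$clustering)
```

Rerunning with several values of alpha between 0 and 1 shows how strongly the size term influences the resulting clusters.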