I'm still stuck with this problem. I have received some suggestions from the R mailing list (thanks to Christian Hennig) that I attach here:
Have you considered the dbscan function in library fpc, or was it another one? The fpc::dbscan() function doesn't have a "distance" parameter but several options, one of which may resolve your memory problem (look up the documentation of the "memory" parameter).

Using a distance matrix for hundreds of thousands of points is a recipe for disaster (memory-wise). I'm not sure whether the function that you used did that, but fpc::dbscan() can avoid it.

It is true that fpc::dbscan() requires tuning constants that the user has to provide. There is unfortunately no general rule for how to do this; it would be necessary to understand the method and the meaning of the constants, and how this translates into the requirements of your application.

You may try several different choices and do some cluster validation to see what works, but I can't explain this in general terms easily via email.
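For reference, fpc::dbscan() can be pointed at the raw coordinate matrix so that distances are computed on the fly instead of materializing the full n x n distance matrix. A minimal sketch (the toy data and the eps/MinPts values are placeholders of mine, not recommendations):

```r
library(fpc)

# Toy data standing in for the real set; columns are lon, lat, sst.
set.seed(1)
pts <- cbind(lon = runif(1000, -6, 40),
             lat = runif(1000, 30, 46),
             sst = runif(1000, 10, 30))

# method = "raw" computes distances on the fly from the raw data,
# avoiding the full distance matrix; method = "hybrid" is a compromise
# that builds partial distance matrices.
fit <- dbscan(pts, eps = 1, MinPts = 5, method = "raw")
fit$cluster[1:10]  # cluster label per point (0 = noise)
```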
I have made some attempts with my data but without any success:
"Yes, I have tried dbscan from fpc but I'm still stuck on the memory problem. Regarding your answer, I'm not sure which memory parameter I should look at. Below is the code I tried with the dbscan parameters; maybe you can spot a mistake.
> sstdat=read.csv("sst.dat",sep=";",header=F,col.names=c("lon","lat","sst"))
> library(fpc)
> sst1=subset(sstdat, sst<50)
> sst2=subset(sst1, lon>-6)
> sst2=subset(sst2, lon<40)
> sst2=subset(sst2, lat<46)
> dbscan(sst2$sst, 0.1, MinPts = 5, scale = FALSE, method = c("hybrid"),
seeds = FALSE, showplot = FALSE, countmode = NULL)
Error: cannot allocate vector of size 858.2 Mb
> head(sst2)
lon lat sst
1257 35.18 24.98 26.78
1258 35.22 24.98 26.78
1259 35.27 24.98 26.78
1260 35.31 24.98 26.78
1261 35.35 24.98 26.78
1262 35.40 24.98 26.85
In this example I only apply dbscan() to the temperature values, not lon/lat, so the eps parameter is 0.1. Since it is a gridded data set, every point is surrounded by eight data points, so I thought that at least 5 of the surrounding points should be within the reachability distance. But I'm not sure this is the right approach when considering only the temperature value; maybe I'm missing spatial information. How should I deal with longitude and latitude data?
The dimensions of sst2 are 152243 rows x 3 columns."
I share these mail messages here in case any of you can shed some light on R and DBSCAN. Thanks again.
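One way to keep the spatial information (my own sketch, not something suggested in the thread) is to standardize lon, lat and sst and run DBSCAN on all three columns together, so that a single eps is measured in comparable units:

```r
library(fpc)

# Hypothetical stand-in for sst2 (lon, lat, sst); replace with the real data.
set.seed(2)
sst2 <- data.frame(lon = runif(500, -6, 40),
                   lat = runif(500, 30, 46),
                   sst = runif(500, 15, 30))

# Standardize so one eps applies to degrees and degrees Celsius alike.
m <- scale(as.matrix(sst2[, c("lon", "lat", "sst")]))

# method = "raw" avoids building the full distance matrix.
# eps and MinPts here are placeholders that would need tuning.
fit <- dbscan(m, eps = 0.2, MinPts = 5, method = "raw")
table(fit$cluster)  # 0 = noise points
```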
I know of people who spatially cluster individual crime types: see the CrimeStat documentation for a number of applied examples. I don't see much utility in trying to separate different clusters based on the crime type though. Many places are crime generalists, such as a busy commercial area which will have many assaults, robberies, and thefts. These overlapping hot spots would be difficult to separate in any supervised clustering technique.
About the only crime type for which I might expect this to be feasible is residential burglary; those hotspots tend to differ from areas where crime is elevated because more people are walking around and interacting.
I can see some utility in such a project though. A hotspot that has many different crime types and a hotspot that only has one crime type may require different strategies by the police department to address the crime problems. That might call for unsupervised classification though.
Best Answer
K-means attempts to group observations by spatial proximity. If you were to specify 2 clusters (k=2), for example, you might find two groups of observations that were (hopefully) spaced far apart. In that case, you might find that values of low latitude and low longitude cluster in the same group as values with low temperature. Conversely, the other cluster may show that high-latitude and high-longitude observations tend to fall in the same space as high temperatures. In this example, you might infer that it found some measure of association between the attributes based on proximity. Note that the analysis is greatly augmented by visualizing the results (which is easy to do in 3 dimensions), because the clusters may not be well separated even though you force 2 or more clusters. There's also no guarantee (at least that I know of) that you will assign the correct number of clusters for your problem to begin with; another reason to look at the results.
If you were to look only at temperature clusters, you might find a tendency to measure distinctly different groups of temperatures that were not randomly distributed... but again, much of that meaning could also be investigated just by looking at the data itself, or by using other statistical measures (Fisher's linear discriminant, for example).
I generated a simulation (via R) to illustrate the above example. I started by generating synthetic data with low values for lat, long and temp, and high values for the same set. Then I concatenated the low and high values together and made a data frame (matrix) of the 3 attributes. K-means with k=2 was able to find very good separation between the groups without prior knowledge of their associations (as can be seen in the summary, where it grouped all of the 1st half in one set and the 2nd half in the other, as we would expect). Notice the summary results also show good separation between the groups it found (99.4%).
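A sketch along the lines of that simulation (the specific synthetic values are my own assumptions, so the exact separation percentage will differ):

```r
set.seed(3)
# Low-valued and high-valued groups for lat, long, temp.
low  <- cbind(lat = rnorm(50, 5, 1),  long = rnorm(50, 5, 1),  temp = rnorm(50, 10, 1))
high <- cbind(lat = rnorm(50, 50, 1), long = rnorm(50, 50, 1), temp = rnorm(50, 30, 1))

dat <- rbind(low, high)            # concatenate into one matrix
fit <- kmeans(dat, centers = 2)    # k = 2, no prior labels

# With well-separated groups, each half should land in its own cluster,
# and between_SS / total_SS should be close to 1.
table(fit$cluster[1:50]); table(fit$cluster[51:100])
fit$betweenss / fit$totss
```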