I'm still stuck with this problem. I have received some suggestions from the R mailing list (thanks to Christian Hennig) that I attach here:
Have you considered the dbscan function in library fpc, or was it another one? The fpc::dbscan() function doesn't have a "distance" parameter but several options, one of which may resolve your memory problem (look up the documentation of the "memory" parameter).

Using a distance matrix for hundreds of thousands of points is a recipe for disaster (memory-wise). I'm not sure whether the function that you used did that, but fpc::dbscan() can avoid it.

It is true that fpc::dbscan() requires tuning constants that the user has to provide. There is unfortunately no general rule for how to do this; it would be necessary to understand the method and the meaning of the constants, and how this translates into the requirements of your application.

You may try several different choices and do some cluster validation to see what works, but I can't explain this in general terms easily via email.
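For reference, fpc::dbscan() can be pointed at the raw coordinate matrix so that distances are computed on the fly instead of materializing the full n x n distance matrix. A minimal sketch (the toy data and the eps/MinPts values are placeholders of mine, not recommendations):

```r
library(fpc)

# Toy data standing in for the real set; columns are lon, lat, sst.
set.seed(1)
pts <- cbind(lon = runif(1000, -6, 40),
             lat = runif(1000, 30, 46),
             sst = runif(1000, 10, 30))

# method = "raw" computes distances on the fly from the raw data,
# avoiding the full distance matrix; method = "hybrid" is a compromise
# that builds partial distance matrices.
fit <- dbscan(pts, eps = 1, MinPts = 5, method = "raw")
fit$cluster[1:10]  # cluster label per point (0 = noise)
```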
I have made some attempts with my data but without any success:
"Yes, I have tried dbscan from fpc but I'm still stuck on the memory problem. Regarding your answer, I'm not sure which memory parameter I should look at. Below is the code I tried with the dbscan parameters; maybe you can spot a mistake.
> sstdat=read.csv("sst.dat",sep=";",header=F,col.names=c("lon","lat","sst"))
> library(fpc)
> sst1=subset(sstdat, sst<50)
> sst2=subset(sst1, lon>-6)
> sst2=subset(sst2, lon<40)
> sst2=subset(sst2, lat<46)
> dbscan(sst2$sst, 0.1, MinPts = 5, scale = FALSE, method = c("hybrid"),
seeds = FALSE, showplot = FALSE, countmode = NULL)
Error: cannot allocate vector of size 858.2 Mb
> head(sst2)
lon lat sst
1257 35.18 24.98 26.78
1258 35.22 24.98 26.78
1259 35.27 24.98 26.78
1260 35.31 24.98 26.78
1261 35.35 24.98 26.78
1262 35.40 24.98 26.85
In this example I only apply dbscan() to the temperature values, not lon/lat, so the eps parameter is 0.1. Since it is a gridded data set, every point is surrounded by eight data points, so I thought that at least 5 of the surrounding points should be within the reachability distance. But I'm not sure this is the right approach when considering only the temperature value; maybe I'm missing spatial information. How should I deal with longitude and latitude data?
The dimensions of sst2 are 152243 rows x 3 columns."
I share these mail messages here in case any of you can shed some light on R and DBSCAN. Thanks again.
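One way to keep the spatial information (my own sketch, not something suggested in the thread) is to standardize lon, lat and sst and run DBSCAN on all three columns together, so that a single eps is measured in comparable units:

```r
library(fpc)

# Hypothetical stand-in for sst2 (lon, lat, sst); replace with the real data.
set.seed(2)
sst2 <- data.frame(lon = runif(500, -6, 40),
                   lat = runif(500, 30, 46),
                   sst = runif(500, 15, 30))

# Standardize so one eps applies to degrees and degrees Celsius alike.
m <- scale(as.matrix(sst2[, c("lon", "lat", "sst")]))

# method = "raw" avoids building the full distance matrix.
# eps and MinPts here are placeholders that would need tuning.
fit <- dbscan(m, eps = 0.2, MinPts = 5, method = "raw")
table(fit$cluster)  # 0 = noise points
```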
I know of people who spatially cluster individual crime types: see the CrimeStat documentation for a number of applied examples. I don't see much utility in trying to separate different clusters based on the crime type though. Many places are crime generalists, such as a busy commercial area which will have many assaults, robberies, and thefts. These overlapping hot spots would be difficult to separate in any supervised clustering technique.
About the only crime type for which I might expect this to be feasible is residential burglary; those hotspots tend to differ from areas where crime is elevated because more people are walking around and interacting.
I can see some utility in such a project though. A hotspot that has many different crime types and a hotspot that only has one crime type may require different strategies by the police department to address the crime problems. That might call for unsupervised classification though.
Best Answer
K-means attempts to group observations by spatial proximity. If you were to specify 2 clusters (k=2), for example, you might find two groups of observations that were (hopefully) spaced far apart. In that case, you might find that values of low latitude and low longitude cluster in the same group as values with low temperature. Conversely, the other cluster may show that high-latitude and high-longitude observations tend to fall in the same space as high temperatures. In this example, you might infer that it found some measure of association between the attributes based on proximity. Note that the analysis is greatly augmented by visualizing the results (which is easy to do in 3 dimensions), because the clusters may not be well separated even though you force 2 or more clusters. There's also no guarantee (at least that I know of) that you will assign the correct number of clusters for your problem to begin with; another reason to look at the results.
If you were to look only at temperature clusters, you might find a tendency to measure distinctly different groups of temperatures that were not randomly distributed... but again, much of that meaning could also be investigated just by looking at the data itself, or by using other statistical measures (Fisher's linear discriminant, for example).
I generated a simulation (via R) to illustrate the above example. I started by generating synthetic data with low values for lat, long and temp, and high values for the same set. Then I concatenated the low and high values together and made a data frame (matrix) of the 3 attributes. K-means with k=2 was able to find very good separation between the groups without prior knowledge of their associations (as can be seen in the summary, where it grouped all of the 1st half in one set and the 2nd half in the other, as we would expect). Notice the summary results also show good separation between the groups it found (99.4%).
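A sketch along the lines of that simulation (the specific synthetic values are my own assumptions, so the exact separation percentage will differ):

```r
set.seed(3)
# Low-valued and high-valued groups for lat, long, temp.
low  <- cbind(lat = rnorm(50, 5, 1),  long = rnorm(50, 5, 1),  temp = rnorm(50, 10, 1))
high <- cbind(lat = rnorm(50, 50, 1), long = rnorm(50, 50, 1), temp = rnorm(50, 30, 1))

dat <- rbind(low, high)            # concatenate into one matrix
fit <- kmeans(dat, centers = 2)    # k = 2, no prior labels

# With well-separated groups, each half should land in its own cluster,
# and between_SS / total_SS should be close to 1.
table(fit$cluster[1:50]); table(fit$cluster[51:100])
fit$betweenss / fit$totss
```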