Solved – Clustering with 3 attributes

Tags: clustering, data mining, k-means, spatial

Please bear with me because I am very new to data mining.

I have a dataset with 3 attributes: latitude, longitude and temperature. I want to find clusters in the temperature data, but I also want latitude and longitude to influence the clustering, so that temperature is not the only determining factor.

If I, let's say, build K-means clusters using all these attributes (in WEKA for example), what do the resulting clusters tell me? Can I get any interpretation of how latitude and longitude information is related to clusters of temperature? What is the correct way to go here?

Best Answer

K-means groups observations by their proximity in attribute space, so with latitude, longitude and temperature as inputs, all three contribute to the distance between observations. If you specify 2 clusters (k=2), for example, you might find two groups that are (hopefully) spaced far apart. You might then find that observations with low latitude and low longitude fall in the same cluster as those with low temperature, while the other cluster shows that high-latitude, high-longitude observations tend to fall in the same space as high temperatures. In this example, you might infer that k-means found some measure of association between the attributes based on proximity. Note that the analysis is greatly helped by visualizing the results (which is easy to do in 3 dimensions), because the clusters may not be well separated even though you force 2 or more of them. There's also no guarantee (at least that I know of) that you chose the correct number of clusters for your problem to begin with, which is another reason to look at the results.
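To make "proximity" concrete, here is a minimal Python sketch of the assignment step only: each observation goes to whichever centroid is nearest in Euclidean distance over (latitude, longitude, temperature). The observations and centroids are made-up values for illustration, not data from the question.

```python
import math

# Hypothetical observations: (latitude, longitude, temperature).
observations = [
    (10.0, 20.0, 5.0),
    (12.0, 22.0, 6.5),
    (48.0, 60.0, 27.0),
    (50.0, 61.0, 29.0),
]

# Hypothetical centroids for k=2, in the same attribute order.
centroids = [(11.0, 21.0, 6.0), (49.0, 60.0, 28.0)]


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))


# One cluster label per observation.
labels = [nearest_centroid(p, centroids) for p in observations]
print(labels)
```

Because the distance is computed on raw values, an attribute with a much larger numeric range would dominate it. Rescaling the attributes before clustering is a common way to balance their influence.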

If you were to look only at temperature clusters, you might find distinctly different groups of temperatures that are not randomly distributed. But again, much of that meaning could also be investigated just by looking at the data itself, or by using other statistical measures (Fisher's linear discriminant, for example).

I ran a simulation in R to illustrate the example above. I generated synthetic data with low values for lat, long and temp, and a second set with high values for the same attributes. Then I concatenated the low and high sets and made a data frame (matrix) of the 3 attributes. K-means with k=2 found very good separation between the groups without prior knowledge of their associations: as the summary shows, it put all of the first half in one cluster and all of the second half in the other, as we would expect. The summary also shows good separation between the groups it found (99.4%).

[Image: temperature-kmeans]
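The R code behind the simulation is not shown, so the following is only a rough Python sketch of the same idea, not the original script. The group centres (10 and 50), the spread (2.0), the group size (50) and the seeds are all assumptions chosen for illustration, and the k-means here is a plain hand-written Lloyd's algorithm rather than R's `kmeans`. The ratio it prints is between-cluster sum of squares over total sum of squares, which is presumably what the 99.4% figure in the summary refers to.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct observations
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = mean_point(members)
    return labels


def between_ss_ratio(points, labels):
    """Share of the total sum of squares explained by the clustering."""
    grand = mean_point(points)
    total_ss = sum(sq_dist(p, grand) for p in points)
    within_ss = 0.0
    for j in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == j]
        centre = mean_point(members)
        within_ss += sum(sq_dist(p, centre) for p in members)
    return (total_ss - within_ss) / total_ss


def make_group(rng, centre, n, spread=2.0):
    """n synthetic (lat, long, temp) rows scattered around one centre value."""
    return [tuple(rng.gauss(centre, spread) for _ in range(3)) for _ in range(n)]


rng = random.Random(42)           # assumed seed, for repeatability only
n = 50                            # assumed group size
low = make_group(rng, 10.0, n)    # low lat, low long, low temp
high = make_group(rng, 50.0, n)   # high lat, high long, high temp
data = low + high                 # concatenate the two sets

labels = kmeans(data, k=2)
ratio = between_ss_ratio(data, labels)
print("first half labels: ", set(labels[:n]))
print("second half labels:", set(labels[n:]))
print(f"between_SS / total_SS = {ratio:.1%}")
```

With groups this far apart, each half should end up under a single label and the ratio should be close to 1. Shrinking the gap between the centres or increasing the spread is a quick way to see how the separation degrades.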
