[GIS] Appropriate distance metric for spatial clustering of geographic coordinates

clustering, distance

I have a set of locations in geographic coordinates, and I would like to group the points using hierarchical clustering followed by tree-cutting at various "heights" in order to calculate group-wise means of variables recorded at each location.

Hierarchical clustering on a Euclidean distance matrix of the raw geographic coordinates may, I presume, be a misleading way to form groups, because a degree of latitude and a degree of longitude do not cover equal distances on the ground.

I can then imagine two ways forward:

1) Using the great circle distance for the distance metric.
2) Converting the geographic coordinates to an equally-scaled projection and then finding the Euclidean distance.

Apart from option two being more complicated to perform, are these approaches equivalent?
And what exactly is the meaning of the tree cutting height in these cases?
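
To make the concern concrete, here is a minimal sketch in Python (standard library only, not part of the original question) that compares the great-circle distance with distances measured in degrees and in a simple local projection. It assumes a spherical Earth with a mean radius of 6,371,008.8 m and uses the haversine formula; the function names, the equirectangular projection and the sample coordinates are all illustrative assumptions.

```python
import math

EARTH_RADIUS_M = 6371008.8  # assumed mean Earth radius (spherical approximation)

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two points given in degrees."""
    lam1, phi1, lam2, phi2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin((lam2 - lam1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def equirectangular_m(lon1, lat1, lon2, lat2):
    """Euclidean distance in metres after a simple local equirectangular
    projection centred on the mean latitude of the two points."""
    lat0 = math.radians((lat1 + lat2) / 2)
    dx = EARTH_RADIUS_M * math.radians(lon2 - lon1) * math.cos(lat0)
    dy = EARTH_RADIUS_M * math.radians(lat2 - lat1)
    return math.hypot(dx, dy)

# Two hypothetical pairs, each exactly one degree of longitude apart,
# so their Euclidean distance in degrees is identical (1.0).
d_lat50 = haversine_m(-100.0, 50.0, -99.0, 50.0)
d_lat55 = haversine_m(-100.0, 55.0, -99.0, 55.0)
print(f"1 degree of longitude at 50N: {d_lat50 / 1000:.1f} km")
print(f"1 degree of longitude at 55N: {d_lat55 / 1000:.1f} km")

# For a nearby pair, the local projection and the great-circle
# distance should agree closely.
gc = haversine_m(-100.0, 50.0, -99.0, 50.5)
proj = equirectangular_m(-100.0, 50.0, -99.0, 50.5)
print(f"great circle: {gc:.0f} m, projected Euclidean: {proj:.0f} m")
```

Under these assumptions, the two one-degree pairs are several kilometres apart in ground length even though they are equally far apart in degrees, while a projection centred on a small area tracks the great-circle distance closely. The two options are therefore close but not identical, and the agreement can be expected to degrade as the study area grows.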

Best Answer

Thanks to @whuber for setting me on the right track here. It looks as if no additional answers will be forthcoming, so I will settle this question by posting my own observations, which may be useful for others learning about distances, clustering, and projections.

The following R code, using the geosphere, rgdal, and sp packages, demonstrates that a carefully selected projection can give an accurate distance matrix (where accurate means close to the geodesic distance) when points are up to 2000 km apart (the plot axes are in metres).

library(sp)
library(rgdal)
library(geosphere)

## Produce 200 randomly positioned geographic coordinates
## in central Canada
xyLatLon <- data.frame(lon=(runif(200)*-30)-85,
                       lat=(runif(200)*5)+50)

## Convert to a Lambert Conformal Conic projection that should
## reasonably approximate the true distance
## paste() joins the pieces with single spaces into one PROJ.4 string
newProj <- paste("+proj=lcc +lat_1=49 +lat_2=77 +lat_0=63.390675",
                 "+lon_0=-91.86666666666666 +x_0=6200000 +y_0=3000000",
                 "+ellps=GRS80 +units=m +no_defs")
xyLcc <- spTransform(SpatialPoints(xyLatLon,
                                   proj4string=CRS("+proj=longlat")),
                     CRS(newProj))

## Find the geodesic distance matrix from geographic coordinates
## assuming the WGS84 ellipsoid
xyDist1 <- distm(xyLatLon, fun=distMeeus)

## Find the Euclidean distance matrix from the projection
xyDist2 <- as.matrix(dist(coordinates(xyLcc)))

## Find the Euclidean distance matrix of the geographic coordinates
xyDist3 <- as.matrix(dist(xyLatLon))

Plots of the elements of these three distance matrices are shown below. The plot on the left indicates that Euclidean distance in the selected projection is highly correlated with the geodesic distance across the range of distances used here, while the plot on the right demonstrates the considerable error that would be expected if unprojected geographic coordinates were used.

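[Figure: two scatter plots of distance-matrix elements. Left: projected Euclidean distance compared with geodesic distance. Right: Euclidean distance of unprojected coordinates compared with geodesic distance.]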
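
On the meaning of the tree-cutting height, a hedged sketch may help. R's hclust uses complete linkage by default, and under complete linkage the height at which two clusters merge is the largest pairwise distance between their members. If the distance matrix is in metres, cutting the tree at height h therefore yields groups in which no two points are more than h metres apart. The Python code below (standard library only) is a naive illustration of that rule, not hclust itself; the spherical Earth radius, the function names and the sample points are assumptions.

```python
import math
from itertools import combinations

EARTH_RADIUS_M = 6371008.8  # assumed mean Earth radius (spherical approximation)

def haversine_m(p, q):
    """Great-circle distance in metres between two (lon, lat) pairs in degrees."""
    lam1, phi1, lam2, phi2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin((lam2 - lam1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def complete_linkage_cut(points, height_m):
    """Naive agglomerative clustering with complete linkage, stopped as soon
    as the next merge would exceed `height_m` (i.e. the tree is cut there)."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Complete linkage: the largest distance between any two members.
        return max(haversine_m(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        d, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                      for i, j in combinations(range(len(clusters)), 2))
        if d > height_m:
            break
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters

# Hypothetical (lon, lat) points: three near 100W and two near 90W.
points = [(-100.0, 52.0), (-100.5, 52.1), (-99.8, 51.9),
          (-90.0, 53.0), (-90.3, 53.2)]
clusters = complete_linkage_cut(points, height_m=100_000)  # cut at 100 km
print(clusters)
```

With a geodesic or projected distance matrix the cut height has a physical unit (metres here), whereas with unprojected coordinates it is a number of degrees whose ground length varies with latitude and direction. Other linkage methods give the height a different meaning, so this interpretation applies only to complete linkage.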

PostGIS Spatial Clustering – Techniques for Spatial Clustering with PostGIS

1) k-means with `kmeans-postgresql`

Installation: You need to compile and install this from source code, which is easier to do on *NIX than on Windows (where I don't know where to start). If you have PostgreSQL installed from packages, make sure you also have the development packages (e.g., postgresql-devel for CentOS).

Download, extract, build and install:

wget http://api.pgxn.org/dist/kmeans/1.1.0/kmeans-1.1.0.zip
unzip kmeans-1.1.0.zip
cd kmeans-1.1.0/
make USE_PGXS=1
sudo make install

Enable the extension in a database (using psql, pgAdmin, etc.):

CREATE EXTENSION kmeans;

Usage/Example: You should have a table of points somewhere (I drew a bunch of pseudo-random points in QGIS). Here is an example of what I did:

SELECT kmeans, count(*), ST_Centroid(ST_Collect(geom)) AS geom
FROM (
  SELECT kmeans(ARRAY[ST_X(geom), ST_Y(geom)], 5) OVER (), geom
  FROM rand_point
) AS ksub
GROUP BY kmeans
ORDER BY kmeans;

The 5 I provided as the second argument of the kmeans window function is the integer K, which here produces five clusters. You can change this to whatever integer you want.
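
To show what the K argument controls, here is a minimal sketch of the standard k-means iteration (Lloyd's algorithm) in Python. It is not the extension's implementation: seeding the centroids with the first k points is an assumption made to keep the example deterministic, and the sample points are made up.

```python
import math

def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm on 2D points; returns (labels, centroids)."""
    # Assumption: use the first k points as initial centroids.
    centroids = [points[i] for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the result is stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))
    return labels, centroids

# Hypothetical points forming two obvious groups.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

Because the distances are plain Euclidean distances on the X and Y values, the clusters are only as meaningful as the coordinate system of the geometries; projected coordinates are the safer input.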

Below are the 31 pseudo-random points I drew and the five centroids, each labelled with the count of points in its cluster. This was created using the above SQL query.

[Figure: Kmeans]

You can also attempt to illustrate where these clusters are with ST_MinimumBoundingCircle:

SELECT kmeans, ST_MinimumBoundingCircle(ST_Collect(geom)) AS circle
FROM (
  SELECT kmeans(ARRAY[ST_X(geom), ST_Y(geom)], 5) OVER (), geom
  FROM rand_point
) AS ksub
GROUP BY kmeans
ORDER BY kmeans;

[Figure: Kmeans2, the minimum bounding circle of each cluster]

2) Clustering within a threshold distance with `ST_ClusterWithin`

This aggregate function has been available since PostGIS 2.2. It returns an array of GeometryCollections, where each collection is one cluster built by linking geometries that are within the given distance of each other.

Here is an example use, where a distance of 100.0 is the threshold that results in 5 different clusters:

SELECT row_number() over () AS id,
  ST_NumGeometries(gc),
  gc AS geom_collection,
  ST_Centroid(gc) AS centroid,
  ST_MinimumBoundingCircle(gc) AS circle,
  sqrt(ST_Area(ST_MinimumBoundingCircle(gc)) / pi()) AS radius
FROM (
  SELECT unnest(ST_ClusterWithin(geom, 100)) gc
  FROM rand_point
) f;

The largest, middle cluster has an enclosing circle with a radius of 65.3 units, i.e. a diameter of about 130 units, which is larger than the threshold. This is because each member geometry is within the threshold distance of at least one other member, so they are chained together into one larger cluster.
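
The chaining effect described above can be reproduced with a small sketch in Python. It groups points into connected components, linking any two points that lie within the threshold of each other. The code is an illustration with made-up points, not PostGIS's implementation, and treating a distance exactly equal to the threshold as "within" is an assumption.

```python
import math

def cluster_within(points, threshold):
    """Group points into connected components; two points are linked when
    their Euclidean distance is at most `threshold` (assumed inclusive)."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = min(unvisited)
        unvisited.remove(seed)
        component, frontier = [seed], [seed]
        while frontier:
            i = frontier.pop()
            # Collect every not-yet-assigned point close enough to point i.
            near = [j for j in unvisited
                    if math.dist(points[i], points[j]) <= threshold]
            for j in near:
                unvisited.remove(j)
            component.extend(near)
            frontier.extend(near)
        clusters.append(sorted(component))
    return clusters

# Hypothetical points: four in a row, 60 units apart, plus one far away.
points = [(0, 0), (60, 0), (120, 0), (180, 0), (500, 0)]
clusters = cluster_within(points, 100)
print(clusters)
```

The first four points end up in a single cluster spanning 180 units even though the threshold is 100, because each one is within 100 units of its neighbour in the chain.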

[GIS] Fast Clustering Algorithm for Geographic Data

Use the OpenLayers Cluster strategy, which lets you display points representing clusters of features (geopoints) within some pixel distance. Visit http://openlayers.org/dev/examples/strategy-cluster.html for the complete source code.

Alternatively, check out http://gmaps-utility-library-dev.googlecode.com/svn/tags/markerclusterer/1.0/examples/simple_example.html to create cluster maps from geopoints.
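
The idea of clustering by pixel distance can be sketched as follows. This is a simplified greedy scheme written in Python for illustration, not the source of either library: each point joins the first existing cluster whose anchor point lies within the pixel distance, otherwise it starts a new cluster. The resolution parameter (map units per pixel), the anchor rule and all names are assumptions.

```python
import math

def pixel_clusters(points, distance_px, resolution):
    """Greedy clustering of map points by on-screen distance.

    points      -- list of (x, y) in map units
    distance_px -- maximum distance, in pixels, to join a cluster
    resolution  -- map units per pixel at the current zoom (assumed)
    """
    clusters = []  # each cluster: {"anchor": (x, y), "members": [indices]}
    for idx, p in enumerate(points):
        for c in clusters:
            # Convert the map-unit distance to pixels before comparing.
            if math.dist(p, c["anchor"]) / resolution <= distance_px:
                c["members"].append(idx)
                break
        else:
            # No cluster was close enough: this point anchors a new one.
            clusters.append({"anchor": p, "members": [idx]})
    return [c["members"] for c in clusters]

# Hypothetical points in map units, clustered at two zoom levels.
points = [(0, 0), (5, 0), (100, 0), (104, 3)]
zoomed_in = pixel_clusters(points, distance_px=20, resolution=1.0)
zoomed_out = pixel_clusters(points, distance_px=20, resolution=10.0)
print(zoomed_in, zoomed_out)
```

A single pass like this is fast because it only compares each point with the existing cluster anchors, and the result changes with zoom: a coarser resolution puts more points within the same pixel distance, so clusters merge as the map zooms out.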
