PostGIS – Using GROUP BY for Linear Spatial Clusters

clustering, postgis

I have a series of points in a horizontal line, arranged into discrete spatial clusters by the underlying data pattern. Here's an example of what they look like:

[Image: example points along a horizontal line, with clusters A, B and C highlighted]

How can I use PostGIS to group them spatially into the highlighted sets A, B and C?

This is a harder problem than it initially seemed to me. So far I've tried:

Using ST_X() to order them, then lag() to identify points between which there are gaps. Unfortunately this just groups the points into "gap" groups and "interior" groups.
Using convex hulls around arbitrary spatial thresholds. This both failed to separate all clusters and had the side effect of creating mid-cluster separations. (For future reference, this was a dumb idea.)
Using ST_GeoHash() to cluster them. This doesn't accurately catch the linear nature of the geometry.

Best Answer

I'm not at a computer that has access to PostGIS right now, so this is untested, but I think this algorithm might work. If you also have groups that are separated vertically, you would need an additional inclusion or exclusion clause on ST_Y().

-- BaseTable(ID, Shape) and the separation distance of 50 are placeholders to adapt.

-- Get every pair of objects within the separation distance
-- (each object also pairs with itself)
CREATE TEMP TABLE tablex AS
SELECT T1.ID AS ID1,
       T2.ID AS ID2
FROM   BaseTable AS T1
       INNER JOIN
       BaseTable AS T2
           ON ST_DWithin(T1.Shape, T2.Shape, 50);

-- Start with each object in its own group
CREATE TEMP TABLE tablex2 AS
SELECT ID AS theGroup, ID AS ID2
FROM   BaseTable;

DO $$
DECLARE
  changed integer;
BEGIN
  -- Loop for as long as a group ID can still be lowered
  LOOP
    -- Give each object the lowest group ID found among its neighbours
    UPDATE tablex2 AS G
    SET    theGroup = N.theGroup
    FROM  (
            SELECT X.ID1, MIN(G2.theGroup) AS theGroup
            FROM   tablex AS X
                   INNER JOIN
                   tablex2 AS G2
                       ON G2.ID2 = X.ID2
            GROUP BY X.ID1
          ) AS N
    WHERE  G.ID2 = N.ID1
           AND
           N.theGroup < G.theGroup;

    -- Stop once a full pass changes nothing
    GET DIAGNOSTICS changed = ROW_COUNT;
    EXIT WHEN changed = 0;
  END LOOP;
END
$$;

--Show final results: each group is named after its lowest ID
SELECT theGroup, ID2
FROM   tablex2
ORDER BY theGroup, ID2;

DROP TABLE tablex;
DROP TABLE tablex2;
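
For the specific case in the question, where the points lie along one horizontal line, the same chaining idea can probably be expressed without a loop. The sketch below extends the ST_X()/lag() attempt from the question: lag() flags each point that follows a gap, and a running sum of those flags turns the flags into cluster numbers. The table `pts`, its columns, the sample coordinates and the gap threshold of 50 are all hypothetical; `x` stands in for ST_X(geom), so the example runs on plain PostgreSQL.

```sql
-- Hypothetical sample data: x stands in for ST_X(geom) of each point.
WITH pts(id, x) AS (
  VALUES (1, 0.0), (2, 10.0), (3, 20.0),
         (4, 100.0), (5, 110.0),
         (6, 300.0), (7, 305.0), (8, 315.0)
),
flagged AS (
  SELECT id, x,
         -- 1 when the gap to the previous point exceeds the threshold (50 is an assumption);
         -- the first point has no previous point, so the comparison is NULL and it gets 0
         CASE WHEN x - lag(x) OVER (ORDER BY x) > 50 THEN 1 ELSE 0 END AS new_cluster
  FROM pts
)
SELECT id, x,
       -- running count of gaps seen so far, plus one, is the cluster number
       1 + sum(new_cluster) OVER (ORDER BY x) AS cluster_id
FROM flagged
ORDER BY x;
-- cluster_id for ids 1..8: 1, 1, 1, 2, 2, 3, 3, 3
```

Against a real table, replacing `pts` with the point table and `x` with ST_X(geom) should give a `cluster_id` that can be used directly in GROUP BY. This only considers spacing along X, so it assumes the points really are collinear.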

1) k-means with `kmeans-postgresql`

Installation: You need to compile and install this from source code, which is easier to do on *NIX than on Windows (where I don't know where to start). If you have PostgreSQL installed from packages, make sure you also have the development packages (e.g., postgresql-devel for CentOS).

Download, extract, build and install:

wget http://api.pgxn.org/dist/kmeans/1.1.0/kmeans-1.1.0.zip
unzip kmeans-1.1.0.zip
cd kmeans-1.1.0/
make USE_PGXS=1
sudo make install

Enable the extension in a database (using psql, pgAdmin, etc.):

CREATE EXTENSION kmeans;
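
A quick way to confirm that `make install` put the extension where the server can find it is to query the built-in pg_available_extensions view. This is a generic PostgreSQL check rather than anything specific to kmeans-postgresql; the only assumption is that the extension is registered under the name `kmeans`, as in the CREATE EXTENSION statement above.

```sql
-- No row back means the server cannot see the extension's control file.
-- installed_version stays NULL until CREATE EXTENSION has been run in this database.
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name = 'kmeans';
```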

Usage/Example: You should have a table of points somewhere (I drew a bunch of pseudo-random points in QGIS). Here is an example of what I did:

SELECT kmeans, count(*), ST_Centroid(ST_Collect(geom)) AS geom
FROM (
  SELECT kmeans(ARRAY[ST_X(geom), ST_Y(geom)], 5) OVER (), geom
  FROM rand_point
) AS ksub
GROUP BY kmeans
ORDER BY kmeans;

The 5 I provided as the second argument of the kmeans window function is the integer K, which here produces five clusters. You can change this to whatever integer you want.

Below are the 31 pseudo-random points I drew and the five centroids, each labelled with the number of points in its cluster. This was created using the above SQL query.

[Image: the 31 points and the five k-means centroids, labelled with cluster counts]

You can also attempt to illustrate where these clusters are with ST_MinimumBoundingCircle:

SELECT kmeans, ST_MinimumBoundingCircle(ST_Collect(geom)) AS circle
FROM (
  SELECT kmeans(ARRAY[ST_X(geom), ST_Y(geom)], 5) OVER (), geom
  FROM rand_point
) AS ksub
GROUP BY kmeans
ORDER BY kmeans;

[Image: minimum bounding circles around the five k-means clusters]

2) Clustering within a threshold distance with `ST_ClusterWithin`

This aggregate function is available since PostGIS 2.2. It returns an array of GeometryCollections, where each GeometryCollection holds a set of geometries separated from one another by no more than the specified distance.

Here is an example use, where a distance of 100.0 is the threshold that results in 5 different clusters:

SELECT row_number() over () AS id,
  ST_NumGeometries(gc),
  gc AS geom_collection,
  ST_Centroid(gc) AS centroid,
  ST_MinimumBoundingCircle(gc) AS circle,
  sqrt(ST_Area(ST_MinimumBoundingCircle(gc)) / pi()) AS radius
FROM (
  SELECT unnest(ST_ClusterWithin(geom, 100)) gc
  FROM rand_point
) f;

The largest middle cluster has an enclosing circle with a radius of 65.3 units, i.e. a diameter of about 130, which is larger than the threshold. This is because the threshold applies to the distance between each member geometry and its nearest neighbour in the cluster, not to the overall size of the cluster, so nearby points chain together into one larger cluster.
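
The chaining effect is easy to reproduce on a handful of made-up points. In this minimal sketch the coordinates are hypothetical: four points spaced 60 units apart along the X axis, plus one outlier at x = 400. With a threshold of 100, the first four should end up in a single cluster spanning 180 units, because each one is within 100 units of its neighbour, while the outlier forms its own cluster.

```sql
-- Hypothetical points: four spaced 60 apart, plus one far away.
WITH pts(geom) AS (
  VALUES (ST_MakePoint(0, 0)),
         (ST_MakePoint(60, 0)),
         (ST_MakePoint(120, 0)),
         (ST_MakePoint(180, 0)),
         (ST_MakePoint(400, 0))
)
SELECT ST_NumGeometries(gc)      AS num_points,
       -- width of the cluster along X, which may exceed the threshold
       ST_XMax(gc) - ST_XMin(gc) AS extent_x
FROM (
  SELECT unnest(ST_ClusterWithin(geom, 100)) AS gc
  FROM pts
) AS f
ORDER BY num_points DESC;
```

If clusters need a hard upper limit on their overall size, a threshold on neighbour distance alone will not enforce it.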
