[GIS] How to cluster all points in a postgis table together based on their distance from each other

clusteringpostgispostgresqlsql

I have a PostGIS (in postgres 9.3) table which has Points. I want to figure out the "clusters" in the data. If 2 points are < X distance apart, they should be added to a cluster, and this should be done recursively. In essense, a cluster is a subset of the points, where each point is within X distance of at least one other point in the cluster.

However I cannot think of any way to do this with a PostgreSQL query. Is there any single query to do this, or do I need to write a programme that will externally query the data, and build up the clusters, and continue to query for who knows how many steps?

Best Answer

If you can updgrade to postgis 2.2.0 you might be lucky because that feature has just been introduced. From the doc: "ST_ClusterWithin is an aggregate function that returns an array of GeometryCollections, where each GeometryCollection represents a set of geometries separated by no more than the specified distance."

Check this: http://postgis.net/docs/manual-2.2/ST_ClusterWithin.html

1) k-means with `kmeans-postgresql`

Installation: You need to compile and install this from source code, which is easier to do on *NIX than Windows (I don't know where to start). If you have PostgreSQL installed from packages, make sure you also have the development packages (e.g., postgresql-devel for CentOS).

Download, extract, build and install:

wget http://api.pgxn.org/dist/kmeans/1.1.0/kmeans-1.1.0.zip
unzip kmeans-1.1.0.zip
cd kmeans-1.1.0/
make USE_PGXS=1
sudo make install

Enable the extension in a database (using psql, pgAdmin, etc.):

CREATE EXTENSION kmeans;

Usage/Example: You should have a table of points somewhere (I drew a bunch of pseudo random points in QGIS). Here is an example with what I did:

SELECT kmeans, count(*), ST_Centroid(ST_Collect(geom)) AS geom
FROM (
  SELECT kmeans(ARRAY[ST_X(geom), ST_Y(geom)], 5) OVER (), geom
  FROM rand_point
) AS ksub
GROUP BY kmeans
ORDER BY kmeans;

the 5 I provided in the second argument of the kmeans window function is the K integer to produce five clusters. You can change this to whatever integer you want.

Below is the 31 pseudo random points I drew and the five centroids with the label showing the count in each cluster. This was created using the above SQL query.

Kmeans

You can also attempt to illustrate where these clusters are with ST_MinimumBoundingCircle:

SELECT kmeans, ST_MinimumBoundingCircle(ST_Collect(geom)) AS circle
FROM (
  SELECT kmeans(ARRAY[ST_X(geom), ST_Y(geom)], 5) OVER (), geom
  FROM rand_point
) AS ksub
GROUP BY kmeans
ORDER BY kmeans;

Kmeans2

2) Clustering within a threshold distance with `ST_ClusterWithin`

This aggregate function is included with PostGIS 2.2, and returns an array of GeometryCollections where all the components are within a distance of each other.

Here is an example use, where a distance of 100.0 is the threshold that results in 5 different clusters:

SELECT row_number() over () AS id,
  ST_NumGeometries(gc),
  gc AS geom_collection,
  ST_Centroid(gc) AS centroid,
  ST_MinimumBoundingCircle(gc) AS circle,
  sqrt(ST_Area(ST_MinimumBoundingCircle(gc)) / pi()) AS radius
FROM (
  SELECT unnest(ST_ClusterWithin(geom, 100)) gc
  FROM rand_point
) f;

The largest middle cluster has a enclosing circle radius of 65.3 units or about 130, which is larger than the threshold. This is because the individual distances between the member geometries is less than the threshold, so it ties it together as one larger cluster.

PostGIS Clustering – Get a Single Cluster from Cloud of Points with Specified Maximum Diameter

I've written a bottom-up hierarchical clustering algorithm, it has extra parameters that might not be useful to other users, but those should be easy to remove in implementation. First, create a new type to have point ids and geometries.

CREATE TYPE pt AS (
gid character varying(32),
the_geom geometry(Point))

and a type with cluster id

CREATE TYPE clustered_pt AS (
collection_id character varying(16),
gid character varying(32),
the_geom geometry(Point)
cluster_id int)

Next the algorithm function

CREATE OR REPLACE FUNCTION buc(collection_id character varying, points pt[], radius integer)
RETURNS SETOF clustered_pt AS
$BODY$

DECLARE
srid int;
joined_clusters int[];

BEGIN

--If there's only 1 point, don't bother with the loop.
IF array_length(points,1)<2 THEN
    RETURN QUERY SELECT collection_id::character varying(16), gid, the_geom, 1 FROM unnest(points);
    RETURN;
END IF;

CREATE TEMPORARY TABLE IF NOT EXISTS points2 (LIKE pt) ON COMMIT DROP;

BEGIN
    ALTER TABLE points2 ADD COLUMN cluster_id serial;
EXCEPTION
    WHEN duplicate_column THEN --do nothing. Exception comes up when using this function multiple times
END;

TRUNCATE points2;
    --inserting points in
INSERT INTO points2(gid, the_geom)
    (SELECT (unnest(points)).* ); 

--Store the srid to reconvert points after, assumes all points have the same SRID
srid := ST_SRID(the_geom) FROM points2 LIMIT 1;

UPDATE points2 --transforming points to a UTM coordinate system so distances will be calculated in meters.
SET the_geom =  ST_TRANSFORM(the_geom,26986);

LOOP
    --If the smallest maximum distance between two clusters is greater than 2x the desired cluster radius, then there are no more clusters to be formed
    IF (SELECT ST_MaxDistance(ST_Collect(a.the_geom),ST_Collect(b.the_geom))  FROM points2 a, points2 b
        WHERE a.cluster_id <> b.cluster_id
        GROUP BY a.cluster_id, b.cluster_id 
        ORDER BY ST_MaxDistance(ST_Collect(a.the_geom),ST_Collect(b.the_geom)) LIMIT 1)
        > 2 * radius
    THEN
        EXIT;
    END IF;

    joined_clusters := ARRAY[a.cluster_id,b.cluster_id]
        FROM points2 a, points2 b
        WHERE a.cluster_id <> b.cluster_id
        GROUP BY a.cluster_id, b.cluster_id
        ORDER BY ST_MaxDistance(ST_Collect(a.the_geom),ST_Collect(b.the_geom)) 
        LIMIT 1;

    UPDATE points2
    SET cluster_id = joined_clusters[1]
    WHERE cluster_id = joined_clusters[2];

    --If there's only 1 cluster left, exit loop
    IF (SELECT COUNT(DISTINCT cluster_id) FROM points2) < 2 THEN
        EXIT;

    END IF;

END LOOP;

RETURN QUERY SELECT collection_id::character varying(16), gid, ST_TRANSFORM(the_geom, srid)::geometry(point), cluster_id FROM points2;
END;
$BODY$
LANGUAGE plpgsql

Because of the function syntax in psql, implementation goes as

WITH subq AS(
    SELECT collection_id, ARRAY_AGG((gid, the_geom)::pt) AS points
    FROM data
    GROUP BY collection_id)
SELECT (clusters).* FROM 
    (SELECT buc(collection_id, points, radius) AS clusters FROM subq
) y;

If one were clustering one gigantic point cloud rather than many little clouds, it would be a good idea to add a GiST index on the temporary table. This takes 30 min to process 1.8M collections with 3M total points (i7-3930K @3.2GHz w/ 64GB ram). Any other suggestions for optimization welcome!

Best Answer

Related Solutions

PostGIS Spatial Clustering – Techniques for Spatial Clustering with PostGIS

1) k-means with kmeans-postgresql

2) Clustering within a threshold distance with ST_ClusterWithin

PostGIS Clustering – Get a Single Cluster from Cloud of Points with Specified Maximum Diameter

Related Question

1) k-means with `kmeans-postgresql`

2) Clustering within a threshold distance with `ST_ClusterWithin`