[GIS] Clustering geographic points

clusteringpostgisshapely

I have a bunch of thousands of points.
I'd like to group them, in a user-defined number of clusters for example or based on some kind of "best number of clusters", based for example on their relatives distances.
Which kind of algorithm is available on postGIS and/or shapely (preferred) to do that?
I can simply make some buffers around them and a union but I wonder if there is some other advanced technique that are computationally efficient on large dataset.

Edit:

I've found the DBSCAN algorithm [1] interesting and it seems rather simple to insert in an existing code. There is also the OPTICS algorithm [2] but it seems harder to find a solution.

[1]https://en.wikipedia.org/wiki/DBSCAN
[2]https://en.wikipedia.org/wiki/OPTICS_algorithm

Anyway, I'm stuck with DBSCAN now.

How to store the cluster label in which a point lays within a new column cluster in a PostgreSQL table? Actually, the points ID are not stored along their coordinates within the cluster…
Some checking operations on coordinates in order to retrieve points IDs in original array would be a bit overrated, in my opinion.

Best Answer

I assume from your question that your points have some kind of natural clustering. You can use the ST_ClusterDBSCAN function to try to assign cluster id's to every point. Since this is a window function, you will have to group the points together youself later on.

Here is an example use case: Say you have clusters of trees where some of them would be seen as a forest and some of them are just small patches of trees, depending on how many trees are together and close they are to eachother. There should be at least 100 trees to call it a forest and they should be no more than 10 meters apart. Your cluster function would look like: ST_ClusterDBSCAN(geom, eps := 10, minpoints := 100)

The result would be a cluster id and one step later you can aggregate these points into whatever you like (e.g.

WITH clusters AS (
  SELECT ST_ClusterDBSCAN(geom, eps := 10, minpoints := 100) as clusterid, geom 
  FROM mybigtable
)
SELECT clusterid, ST_Collect(geom) as cluster_geometry FROM clusters GROUP BY clusterid

Now it is up to you to play with the parameters to find the right clusters. Keep in mind that this algorithm assumes a normal distribution for every clusters, so it wouldn't work if every cluster has different eps and minpoints parameters.

Related Solutions

PostGIS Clustering – Get a Single Cluster from Cloud of Points with Specified Maximum Diameter

I've written a bottom-up hierarchical clustering algorithm, it has extra parameters that might not be useful to other users, but those should be easy to remove in implementation. First, create a new type to have point ids and geometries.

CREATE TYPE pt AS (
gid character varying(32),
the_geom geometry(Point))

and a type with cluster id

CREATE TYPE clustered_pt AS (
collection_id character varying(16),
gid character varying(32),
the_geom geometry(Point)
cluster_id int)

Next the algorithm function

CREATE OR REPLACE FUNCTION buc(collection_id character varying, points pt[], radius integer)
RETURNS SETOF clustered_pt AS
$BODY$

DECLARE
srid int;
joined_clusters int[];

BEGIN

--If there's only 1 point, don't bother with the loop.
IF array_length(points,1)<2 THEN
    RETURN QUERY SELECT collection_id::character varying(16), gid, the_geom, 1 FROM unnest(points);
    RETURN;
END IF;

CREATE TEMPORARY TABLE IF NOT EXISTS points2 (LIKE pt) ON COMMIT DROP;

BEGIN
    ALTER TABLE points2 ADD COLUMN cluster_id serial;
EXCEPTION
    WHEN duplicate_column THEN --do nothing. Exception comes up when using this function multiple times
END;

TRUNCATE points2;
    --inserting points in
INSERT INTO points2(gid, the_geom)
    (SELECT (unnest(points)).* ); 

--Store the srid to reconvert points after, assumes all points have the same SRID
srid := ST_SRID(the_geom) FROM points2 LIMIT 1;

UPDATE points2 --transforming points to a UTM coordinate system so distances will be calculated in meters.
SET the_geom =  ST_TRANSFORM(the_geom,26986);

LOOP
    --If the smallest maximum distance between two clusters is greater than 2x the desired cluster radius, then there are no more clusters to be formed
    IF (SELECT ST_MaxDistance(ST_Collect(a.the_geom),ST_Collect(b.the_geom))  FROM points2 a, points2 b
        WHERE a.cluster_id <> b.cluster_id
        GROUP BY a.cluster_id, b.cluster_id 
        ORDER BY ST_MaxDistance(ST_Collect(a.the_geom),ST_Collect(b.the_geom)) LIMIT 1)
        > 2 * radius
    THEN
        EXIT;
    END IF;

    joined_clusters := ARRAY[a.cluster_id,b.cluster_id]
        FROM points2 a, points2 b
        WHERE a.cluster_id <> b.cluster_id
        GROUP BY a.cluster_id, b.cluster_id
        ORDER BY ST_MaxDistance(ST_Collect(a.the_geom),ST_Collect(b.the_geom)) 
        LIMIT 1;

    UPDATE points2
    SET cluster_id = joined_clusters[1]
    WHERE cluster_id = joined_clusters[2];

    --If there's only 1 cluster left, exit loop
    IF (SELECT COUNT(DISTINCT cluster_id) FROM points2) < 2 THEN
        EXIT;

    END IF;

END LOOP;

RETURN QUERY SELECT collection_id::character varying(16), gid, ST_TRANSFORM(the_geom, srid)::geometry(point), cluster_id FROM points2;
END;
$BODY$
LANGUAGE plpgsql

Because of the function syntax in psql, implementation goes as

WITH subq AS(
    SELECT collection_id, ARRAY_AGG((gid, the_geom)::pt) AS points
    FROM data
    GROUP BY collection_id)
SELECT (clusters).* FROM 
    (SELECT buc(collection_id, points, radius) AS clusters FROM subq
) y;

If one were clustering one gigantic point cloud rather than many little clouds, it would be a good idea to add a GiST index on the temporary table. This takes 30 min to process 1.8M collections with 3M total points (i7-3930K @3.2GHz w/ 64GB ram). Any other suggestions for optimization welcome!

Best Answer

Related Solutions

PostGIS Clustering – Get a Single Cluster from Cloud of Points with Specified Maximum Diameter

Related Question