PostGIS – Using ST_ClusterWithin() on Table

postgis

I am using the postgis ST_ClusterWithin function to cluster my spatial data in a table. I want to get the rows of the table as a result of the query clustering query (maybe the Pk's of each element in each cluster) instead of the geometry collection. Returning geometry collection seems to loose the context of the data by just giving me the geometry and nothing else. Now will I have to query the database with a ST_Contains or ST_Within again on the database to form the clusters using the rows in that table. Is there a different way. I am using a query similar to Spatial clustering with PostGIS

Best Answer

In PostGIS 2.2, your options are to either (a) do a join after-the-fact to get back your IDs or other relevant information, or (b) abuse the Z or M ordinates to sneak some additional information into the geometry objects.

The newer clustering options in PostGIS 2.3 are more flexible; they're implemented as windowing functions and can provide a cluster ID for each row in a query. For more info, check out the docs for ST_ClusterKMeans and ST_ClusterDBSCAN.

1) k-means with `kmeans-postgresql`

Installation: You need to compile and install this from source code, which is easier to do on *NIX than Windows (I don't know where to start). If you have PostgreSQL installed from packages, make sure you also have the development packages (e.g., postgresql-devel for CentOS).

Download, extract, build and install:

wget http://api.pgxn.org/dist/kmeans/1.1.0/kmeans-1.1.0.zip
unzip kmeans-1.1.0.zip
cd kmeans-1.1.0/
make USE_PGXS=1
sudo make install

Enable the extension in a database (using psql, pgAdmin, etc.):

CREATE EXTENSION kmeans;

Usage/Example: You should have a table of points somewhere (I drew a bunch of pseudo random points in QGIS). Here is an example with what I did:

SELECT kmeans, count(*), ST_Centroid(ST_Collect(geom)) AS geom
FROM (
  SELECT kmeans(ARRAY[ST_X(geom), ST_Y(geom)], 5) OVER (), geom
  FROM rand_point
) AS ksub
GROUP BY kmeans
ORDER BY kmeans;

the 5 I provided in the second argument of the kmeans window function is the K integer to produce five clusters. You can change this to whatever integer you want.

Below is the 31 pseudo random points I drew and the five centroids with the label showing the count in each cluster. This was created using the above SQL query.

Kmeans

You can also attempt to illustrate where these clusters are with ST_MinimumBoundingCircle:

SELECT kmeans, ST_MinimumBoundingCircle(ST_Collect(geom)) AS circle
FROM (
  SELECT kmeans(ARRAY[ST_X(geom), ST_Y(geom)], 5) OVER (), geom
  FROM rand_point
) AS ksub
GROUP BY kmeans
ORDER BY kmeans;

Kmeans2

2) Clustering within a threshold distance with `ST_ClusterWithin`

This aggregate function is included with PostGIS 2.2, and returns an array of GeometryCollections where all the components are within a distance of each other.

Here is an example use, where a distance of 100.0 is the threshold that results in 5 different clusters:

SELECT row_number() over () AS id,
  ST_NumGeometries(gc),
  gc AS geom_collection,
  ST_Centroid(gc) AS centroid,
  ST_MinimumBoundingCircle(gc) AS circle,
  sqrt(ST_Area(ST_MinimumBoundingCircle(gc)) / pi()) AS radius
FROM (
  SELECT unnest(ST_ClusterWithin(geom, 100)) gc
  FROM rand_point
) f;

The largest middle cluster has a enclosing circle radius of 65.3 units or about 130, which is larger than the threshold. This is because the individual distances between the member geometries is less than the threshold, so it ties it together as one larger cluster.

[GIS] PostgreSQL/PostGIS spatial index – no speed up

The most efficient index for the query expressed in your question is the one on gid as it is the only column that appears in a where expression:

 CREATE INDEX table_gid ON table (gid);

You can safely drop the gist index as it will only consume space and slow inserts/updates/deletes down.

Long explanation

As I said the most effective index in your case is the one on gid as it will allow the db engine to retrieve rows faster (with retrieval usually being the slowest part of the process). After that it will probably better compute the result of the

  ST_Contains(a.way, b.way)

espression without looking at the index. The reason is that the query planner will likely estimate that the extra cost of looking up the gist index on both columns versus looking up the a.way and b.way values directly is not worth the effort as the total number of rows to look up is probably very small especially if the index is unique.

As a rule of thumb remember that the planner will probably favor a table scan over an index scan for small datasets (dataset sizes are estimated by looking at the table statistics).

Best Answer

Related Solutions

PostGIS Spatial Clustering – Techniques for Spatial Clustering with PostGIS

1) k-means with kmeans-postgresql

2) Clustering within a threshold distance with ST_ClusterWithin

[GIS] PostgreSQL/PostGIS spatial index – no speed up

Long explanation

Related Question

1) k-means with `kmeans-postgresql`

2) Clustering within a threshold distance with `ST_ClusterWithin`