qgis – Clustering Points and Summing Up Attributes Per Cluster in QGIS

clusteringconnectivity-analysisqgis

We want to find out the ideal location for sheds and the required dimensions of each shed. As input, we have a point layer with locations (each representing one arable field) and the estimated yield per point (per arable field).

What we now try to do is first to create four clusters for these fields. This should serve as an approximation of the ideal location for a new shed. We've done this by simply using the "point cluster" option in the "symbology" section and adjusting the distance until only four cluster points remained. These cluster points were then saved as new point shapefile.

How do we determine what points are actually clustered in each of the cluster points?

With this information, we can then sum up the estimated yields for all the single points per cluster. This would allow us to calculate whether the shed has to be designed for 10 tons or 100 tons of grain.

Best Answer

I would try K-means clustering algorithm in the QGIS Processing Toolbox (under Vector analysis group).

Just by setting the Number of clusters as 4, it will produce a new Clusters layer with an attribute field CLUSTER_ID (values= 0, 1, 2, 3).

Then an expression like SUM("yield", "CLUSTER_ID") in the Field Calculator will return the total yield for each cluster. (E.G. the Sum_per_Cluster in the below example).

[Update]

To obtain center point per the group (cluster), please try Mean coordinate(s) geoalgorithm in Processing Toolbox > Vector analysis.

Mean coordinates dialog window will show an option Unique ID field. Select CLUSTER_ID field.

Related Solutions

PostGIS Clustering – Get a Single Cluster from Cloud of Points with Specified Maximum Diameter

I've written a bottom-up hierarchical clustering algorithm, it has extra parameters that might not be useful to other users, but those should be easy to remove in implementation. First, create a new type to have point ids and geometries.

CREATE TYPE pt AS (
gid character varying(32),
the_geom geometry(Point))

and a type with cluster id

CREATE TYPE clustered_pt AS (
collection_id character varying(16),
gid character varying(32),
the_geom geometry(Point)
cluster_id int)

Next the algorithm function

CREATE OR REPLACE FUNCTION buc(collection_id character varying, points pt[], radius integer)
RETURNS SETOF clustered_pt AS
$BODY$

DECLARE
srid int;
joined_clusters int[];

BEGIN

--If there's only 1 point, don't bother with the loop.
IF array_length(points,1)<2 THEN
    RETURN QUERY SELECT collection_id::character varying(16), gid, the_geom, 1 FROM unnest(points);
    RETURN;
END IF;

CREATE TEMPORARY TABLE IF NOT EXISTS points2 (LIKE pt) ON COMMIT DROP;

BEGIN
    ALTER TABLE points2 ADD COLUMN cluster_id serial;
EXCEPTION
    WHEN duplicate_column THEN --do nothing. Exception comes up when using this function multiple times
END;

TRUNCATE points2;
    --inserting points in
INSERT INTO points2(gid, the_geom)
    (SELECT (unnest(points)).* ); 

--Store the srid to reconvert points after, assumes all points have the same SRID
srid := ST_SRID(the_geom) FROM points2 LIMIT 1;

UPDATE points2 --transforming points to a UTM coordinate system so distances will be calculated in meters.
SET the_geom =  ST_TRANSFORM(the_geom,26986);

LOOP
    --If the smallest maximum distance between two clusters is greater than 2x the desired cluster radius, then there are no more clusters to be formed
    IF (SELECT ST_MaxDistance(ST_Collect(a.the_geom),ST_Collect(b.the_geom))  FROM points2 a, points2 b
        WHERE a.cluster_id <> b.cluster_id
        GROUP BY a.cluster_id, b.cluster_id 
        ORDER BY ST_MaxDistance(ST_Collect(a.the_geom),ST_Collect(b.the_geom)) LIMIT 1)
        > 2 * radius
    THEN
        EXIT;
    END IF;

    joined_clusters := ARRAY[a.cluster_id,b.cluster_id]
        FROM points2 a, points2 b
        WHERE a.cluster_id <> b.cluster_id
        GROUP BY a.cluster_id, b.cluster_id
        ORDER BY ST_MaxDistance(ST_Collect(a.the_geom),ST_Collect(b.the_geom)) 
        LIMIT 1;

    UPDATE points2
    SET cluster_id = joined_clusters[1]
    WHERE cluster_id = joined_clusters[2];

    --If there's only 1 cluster left, exit loop
    IF (SELECT COUNT(DISTINCT cluster_id) FROM points2) < 2 THEN
        EXIT;

    END IF;

END LOOP;

RETURN QUERY SELECT collection_id::character varying(16), gid, ST_TRANSFORM(the_geom, srid)::geometry(point), cluster_id FROM points2;
END;
$BODY$
LANGUAGE plpgsql

Because of the function syntax in psql, implementation goes as

WITH subq AS(
    SELECT collection_id, ARRAY_AGG((gid, the_geom)::pt) AS points
    FROM data
    GROUP BY collection_id)
SELECT (clusters).* FROM 
    (SELECT buc(collection_id, points, radius) AS clusters FROM subq
) y;

If one were clustering one gigantic point cloud rather than many little clouds, it would be a good idea to add a GiST index on the temporary table. This takes 30 min to process 1.8M collections with 3M total points (i7-3930K @3.2GHz w/ 64GB ram). Any other suggestions for optimization welcome!

QGIS Analysis – Find Mean Distance Between Cluster Center and Points

On the layer with the center point, you can use QGIS expressions. Get an array of the points from the same cluster using array_foreach and then for each of these points create create a line to the center and get it's length. You get an array of lengths. With array_mean, calculate the mean. This is how the expression looks like:

array_mean (
    array_foreach (
        overlay_nearest('points', $geometry, limit:=-1),
        length (
            make_line (
                $geometry,
                @element
            )
        )
    )
)

Based on the data you have and how it is structured, you have to include a condition so that only points belonging to the same center are considered.

From the red center point, the expression generates a line to each of the white dots from the points layer, measures the length and gets the mean. The lines itself that you see here are generated with an expression based on this one with Geometry generator, see on the right side:

Best Answer

Related Solutions

PostGIS Clustering – Get a Single Cluster from Cloud of Points with Specified Maximum Diameter

QGIS Analysis – Find Mean Distance Between Cluster Center and Points

Related Question