QGIS – Grouping Points into Clusters Containing Similar Number of Features

clusteringgroupingpointpolygon-creationqgis

The question

Using QGIS, how to cluster unevenly distributed points from the same layer to groups of about the same size / containing the same number of features?

The problem

I have a large point layer, containing more then 350'000 unevenly distributed features: buildings all over an administrative region containing cities, villages, as well as countryside. I want to split the layer in N (let's say: 500) separate layers, thus each containing ca. 700 features. Or, said otherwise, ideally would be to get any number of clusters, but each containing no less than 500 and no more then 1000 points – and being as compact as possible (no "gerrymandering" shapes).

What I tried

  • First idea was to create a grid and create a separate layer of all points in the same grid cell. A 2*2 km grid resulted in cells containing between 1 and 2600 features.

  • A second idea was to use K-means clustering. Cluster size, however, resulted in cluster size of between 13 and over 3600 features.

    Both solutions are not ideal, as each "cluster" should contain more or less the same number of features as can be seen on the screenshot, where each polygon contains between 24 and 31 points.

  • I also tried GRASS v.cluster. I only got results using density clustering method, but that did not solve the problem, either.

Simplified sketch to illustrate the problem: unevenly sized polygons ("clusters"), each containing between 24 and 31 points:
enter image description here

Best Answer

I found a way to achieve at least in part what I want to do, even if it is not an exact workflow, more a heuristic approach. So I would still be glad to get better solutions.

What this solution does in an iterative approach and reliably is to create clusters with a maximum numbers of points. A minimum number of points, however cannot be achieved in this way. So I think with this approach, I could reduce the problem to merging contiguous polygons in a way that the sum of an attribute value (no. of points inside the polygon) does not exceed a certain maximum (see here).

How the solution works:

  1. Create a rough grid - for my region of interest (ca. 40 x 60 km), I choose a 10 km x 10 km grid. Edit: I realised that it is better to start with one single grid cell, one that covers the whole extent of the point layer. Like this, one deals with rectangles, not polygons, so the expression in step 4 has to be adapted. The rest remains the same.

  2. Count the number of points per grid cell, using this expression (e.g. to create a new attribute point_count with Field calculator):

    array_length (overlay_contains ('point_layer', $id))

  3. Select by expression to select cells that contain no points with the expression point_count = 0. Than delete these cells.

  4. In a next step, all cells that contain more than a certain number of points (here: 800) should be split in four quadrants (edit: for better results, maybe splitting just in two parts would be an option), thus splitting the grid in four smaller polygons so that each contains less points than the initial one. Polygons that already contain a lower number of points than the threshold remain unchanged. To achieve that, I use Geometry by expression with this expression (edit: see at the bottom for the expression if using rectangles in step 1):

    if (
        point_count > 800,
        collect_geometries (
            make_square( 
                centroid ($geometry), 
                make_point (x_max($geometry), y_max($geometry))
            ),
            make_square( 
                centroid ($geometry), 
                make_point (x_min($geometry), y_max($geometry))
            ),
            make_square( 
                centroid ($geometry), 
                make_point (x_min($geometry), y_min($geometry))
            ),
            make_square( 
                centroid ($geometry), 
                make_point (x_max($geometry), y_min($geometry))
            )
        ),
        $geometry
    )
    
    
    
  5. Convert Multipart to single parts

  6. Continue with step 2, as long as there are cells that contain more than to defined number of points (threshold, here: 800).

The result is a layer containing (in my case a number of 1196) square shaped polygons different sizes, containing between 1 and 800 points. There are five different sizes of squares (10 km, 5 km, 2.5 km, 1.25 km, 0.625 km side length) in my case, thus 5 iterations were needed. Now it would be possible to manually merge polygons with just a few points in it with neighboring ones.

Result: a grid with different sized squares, containing between 1 (white) and 799 (red) points. Left: overview, right: zoomed in to show smaller squares. Labels show the no. of points contained in each polygon:

enter image description here

Edit

If using a single rectangle covering the whole extent of the point layer instead of a square grid, the solution gives better results. In this case, the expression in step 4 has to be adapted as follows, here used for 1000 instead of 800 points (the rest remains the same):

if (
    point_count > 1000,
    collect_geometries(
        make_polygon(
            make_line(
                make_point (x_min($geometry),y_max($geometry)),
                make_point (x_min ($geometry), y_min($geometry)+(y_max($geometry)-y_min ($geometry))/2),
                centroid ($geometry),
                make_point (x_min ($geometry)+(x_max($geometry)-x_min($geometry))/2,y_max($geometry))
            )
        ),
        make_polygon(
            make_line(
                make_point (x_max($geometry),y_max($geometry)),
                make_point (x_min ($geometry)+(x_max($geometry)-x_min($geometry))/2, y_max($geometry)),
                centroid ($geometry),
                make_point (x_max($geometry),y_min($geometry)+(y_max($geometry)-y_min ($geometry))/2)
            )
        ),
        make_polygon(
            make_line(
                centroid ($geometry),
                make_point (x_max($geometry),y_min($geometry)+(y_max($geometry)-y_min ($geometry))/2),
                make_point (x_max($geometry),y_min($geometry)),
                make_point (x_min ($geometry)+(x_max($geometry)-x_min($geometry))/2, y_min($geometry))
            )
        ),
        make_polygon(
            make_line(
                centroid ($geometry),
                make_point (x_min ($geometry)+(x_max($geometry)-x_min($geometry))/2, y_min($geometry)),
                make_point (x_min($geometry),y_min($geometry)),
                make_point (x_min($geometry),y_min($geometry)+(y_max($geometry)-y_min ($geometry))/2)
            )
        )
    ),
    $geometry
)

With this variant, I get 943 rectangle shaped polygons of 6 different sizes (7 iterations; the first iteration does not produce any polygons with less then 1000 points) with a range of 1 to 997 points inside the polygons.

The result looks like this:

enter image description here

Related Question