QGIS – Grouping Points into Clusters Containing Similar Number of Features

clusteringgroupingpointpolygon-creationqgis

The question

Using QGIS, how to cluster unevenly distributed points from the same layer to groups of about the same size / containing the same number of features?

The problem

I have a large point layer, containing more then 350'000 unevenly distributed features: buildings all over an administrative region containing cities, villages, as well as countryside. I want to split the layer in N (let's say: 500) separate layers, thus each containing ca. 700 features. Or, said otherwise, ideally would be to get any number of clusters, but each containing no less than 500 and no more then 1000 points – and being as compact as possible (no "gerrymandering" shapes).

What I tried

First idea was to create a grid and create a separate layer of all points in the same grid cell. A 2*2 km grid resulted in cells containing between 1 and 2600 features.
A second idea was to use K-means clustering. Cluster size, however, resulted in cluster size of between 13 and over 3600 features.

Both solutions are not ideal, as each "cluster" should contain more or less the same number of features as can be seen on the screenshot, where each polygon contains between 24 and 31 points.
I also tried GRASS v.cluster. I only got results using density clustering method, but that did not solve the problem, either.

Simplified sketch to illustrate the problem: unevenly sized polygons ("clusters"), each containing between 24 and 31 points:

Best Answer

I found a way to achieve at least in part what I want to do, even if it is not an exact workflow, more a heuristic approach. So I would still be glad to get better solutions.

What this solution does in an iterative approach and reliably is to create clusters with a maximum numbers of points. A minimum number of points, however cannot be achieved in this way. So I think with this approach, I could reduce the problem to merging contiguous polygons in a way that the sum of an attribute value (no. of points inside the polygon) does not exceed a certain maximum (see here).

How the solution works:

Create a rough grid - for my region of interest (ca. 40 x 60 km), I choose a 10 km x 10 km grid. Edit: I realised that it is better to start with one single grid cell, one that covers the whole extent of the point layer. Like this, one deals with rectangles, not polygons, so the expression in step 4 has to be adapted. The rest remains the same.
Count the number of points per grid cell, using this expression (e.g. to create a new attribute point_count with Field calculator):

array_length (overlay_contains ('point_layer', $id))
Select by expression to select cells that contain no points with the expression point_count = 0. Than delete these cells.

In a next step, all cells that contain more than a certain number of points (here: 800) should be split in four quadrants (edit: for better results, maybe splitting just in two parts would be an option), thus splitting the grid in four smaller polygons so that each contains less points than the initial one. Polygons that already contain a lower number of points than the threshold remain unchanged. To achieve that, I use Geometry by expression with this expression (edit: see at the bottom for the expression if using rectangles in step 1):

if (
    point_count > 800,
    collect_geometries (
        make_square( 
            centroid ($geometry), 
            make_point (x_max($geometry), y_max($geometry))
        ),
        make_square( 
            centroid ($geometry), 
            make_point (x_min($geometry), y_max($geometry))
        ),
        make_square( 
            centroid ($geometry), 
            make_point (x_min($geometry), y_min($geometry))
        ),
        make_square( 
            centroid ($geometry), 
            make_point (x_max($geometry), y_min($geometry))
        )
    ),
    $geometry
)

Convert Multipart to single parts
Continue with step 2, as long as there are cells that contain more than to defined number of points (threshold, here: 800).

The result is a layer containing (in my case a number of 1196) square shaped polygons different sizes, containing between 1 and 800 points. There are five different sizes of squares (10 km, 5 km, 2.5 km, 1.25 km, 0.625 km side length) in my case, thus 5 iterations were needed. Now it would be possible to manually merge polygons with just a few points in it with neighboring ones.

Result: a grid with different sized squares, containing between 1 (white) and 799 (red) points. Left: overview, right: zoomed in to show smaller squares. Labels show the no. of points contained in each polygon:

Edit

If using a single rectangle covering the whole extent of the point layer instead of a square grid, the solution gives better results. In this case, the expression in step 4 has to be adapted as follows, here used for 1000 instead of 800 points (the rest remains the same):

if (
    point_count > 1000,
    collect_geometries(
        make_polygon(
            make_line(
                make_point (x_min($geometry),y_max($geometry)),
                make_point (x_min ($geometry), y_min($geometry)+(y_max($geometry)-y_min ($geometry))/2),
                centroid ($geometry),
                make_point (x_min ($geometry)+(x_max($geometry)-x_min($geometry))/2,y_max($geometry))
            )
        ),
        make_polygon(
            make_line(
                make_point (x_max($geometry),y_max($geometry)),
                make_point (x_min ($geometry)+(x_max($geometry)-x_min($geometry))/2, y_max($geometry)),
                centroid ($geometry),
                make_point (x_max($geometry),y_min($geometry)+(y_max($geometry)-y_min ($geometry))/2)
            )
        ),
        make_polygon(
            make_line(
                centroid ($geometry),
                make_point (x_max($geometry),y_min($geometry)+(y_max($geometry)-y_min ($geometry))/2),
                make_point (x_max($geometry),y_min($geometry)),
                make_point (x_min ($geometry)+(x_max($geometry)-x_min($geometry))/2, y_min($geometry))
            )
        ),
        make_polygon(
            make_line(
                centroid ($geometry),
                make_point (x_min ($geometry)+(x_max($geometry)-x_min($geometry))/2, y_min($geometry)),
                make_point (x_min($geometry),y_min($geometry)),
                make_point (x_min($geometry),y_min($geometry)+(y_max($geometry)-y_min ($geometry))/2)
            )
        )
    ),
    $geometry
)

With this variant, I get 943 rectangle shaped polygons of 6 different sizes (7 iterations; the first iteration does not produce any polygons with less then 1000 points) with a range of 1 to 997 points inside the polygons.

The result looks like this:

Related Solutions

Generating Irregular Grid with Minimum Points – How To Achieve This?

I see MerseyViking has recommended a quadtree. I was going to suggest the same thing and in order to explain it, here's the code and an example. The code is written in R but ought to port easily to, say, Python.

The idea is remarkably simple: split the points approximately in half in the x-direction, then recursively split the two halves along the y-direction, alternating directions at each level, until no more splitting is desired.

Because the intent is to disguise actual point locations, it is useful to introduce some randomness into the splits. One fast simple way to do this is to split at a quantile set a small random amount away from 50%. In this fashion (a) the splitting values are highly unlikely to coincide with data coordinates, so that the points will fall uniquely into quadrants created by the partitioning, and (b) point coordinates will be impossible to reconstruct precisely from the quadtree.

Because the intention is to maintain a minimum quantity k of nodes within each quadtree leaf, we implement a restricted form of quadtree. It will support (1) clustering points into groups having between k and 2*k-1 elements each and (2) mapping the quadrants.

This R code creates a tree of nodes and terminal leaves, distinguishing them by class. The class labeling expedites post-processing such as plotting, shown below. The code uses numeric values for the ids. This works up to depths of 52 in the tree (using doubles; if unsigned long integers are used, the maximum depth is 32). For deeper trees (which are highly unlikely in any application, because at least k * 2^52 points would be involved), ids would have to be strings.

quadtree <- function(xy, k=1) {
  d = dim(xy)[2]
  quad <- function(xy, i, id=1) {
    if (length(xy) < 2*k*d) {
      rv = list(id=id, value=xy)
      class(rv) <- "quadtree.leaf"
    }
    else {
      q0 <- (1 + runif(1,min=-1/2,max=1/2)/dim(xy)[1])/2 # Random quantile near the median
      x0 <- quantile(xy[,i], q0)
      j <- i %% d + 1 # (Works for octrees, too...)
      rv <- list(index=i, threshold=x0, 
                 lower=quad(xy[xy[,i] <= x0, ], j, id*2), 
                 upper=quad(xy[xy[,i] > x0, ], j, id*2+1))
      class(rv) <- "quadtree"
    }
    return(rv)
  }
  quad(xy, 1)
}

Note that the recursive divide-and-conquer design of this algorithm (and, consequently, of most of the post-processing algorithms) means that the time requirement is O(m) and RAM usage is O(n) where m is the number of cells and n is the number of points. m is proportional to n divided by the minimum points per cell, k. This is useful for estimating computation times. For instance, if it takes 13 seconds to partition n=10^6 points into cells of 50-99 points (k=50), m = 10^6/50 = 20000. If you want instead to partition down to 5-9 points per cell (k=5), m is 10 times larger, so the timing goes up to about 130 seconds. (Because the process of splitting a set of coordinates around their middles gets faster as the cells get smaller, the actual timing was only 90 seconds.) To go all the way to k=1 point per cell, it will take about six times longer still, or nine minutes, and we can expect the code actually to be a little faster than that.

Before going further, let's generate some interesting irregularly spaced data and create their restricted quadtree (0.29 seconds elapsed time):

Quadtree

Here's the code to produce these plots. It exploits R's polymorphism: points.quadtree will be called whenever the points function is applied to a quadtree object, for instance. The power of this is evident in the extreme simplicity of the function to color the points according to their cluster identifier:

points.quadtree <- function(q, ...) {
  points(q$lower, ...); points(q$upper, ...)
}
points.quadtree.leaf <- function(q, ...) {
  points(q$value, col=hsv(q$id), ...)
}

Plotting the grid itself is a little trickier because it requires repeated clipping of the thresholds used for the quadtree partitioning, but the same recursive approach is simple and elegant. Use a variant to construct polygonal representations of the quadrants if desired.

lines.quadtree <- function(q, xylim, ...) {
  i <- q$index
  j <- 3 - q$index
  clip <- function(xylim.clip, i, upper) {
    if (upper) xylim.clip[1, i] <- max(q$threshold, xylim.clip[1,i]) else 
      xylim.clip[2,i] <- min(q$threshold, xylim.clip[2,i])
    xylim.clip
  } 
  if(q$threshold > xylim[1,i]) lines(q$lower, clip(xylim, i, FALSE), ...)
  if(q$threshold < xylim[2,i]) lines(q$upper, clip(xylim, i, TRUE), ...)
  xlim <- xylim[, j]
  xy <- cbind(c(q$threshold, q$threshold), xlim)
  lines(xy[, order(i:j)],  ...)
}
lines.quadtree.leaf <- function(q, xylim, ...) {} # Nothing to do at leaves!

As another example, I generated 1,000,000 points and partitioned them into groups of 5-9 each. Timing was 91.7 seconds.

n <- 25000       # Points per cluster
n.centers <- 40  # Number of cluster centers
sd <- 1/2        # Standard deviation of each cluster
set.seed(17)
centers <- matrix(runif(n.centers*2, min=c(-90, 30), max=c(-75, 40)), ncol=2, byrow=TRUE)
xy <- matrix(apply(centers, 1, function(x) rnorm(n*2, mean=x, sd=sd)), ncol=2, byrow=TRUE)
k <- 5
system.time(qt <- quadtree(xy, k))
#
# Set up to map the full extent of the quadtree.
#
xylim <- cbind(x=c(min(xy[,1]), max(xy[,1])), y=c(min(xy[,2]), max(xy[,2])))
plot(xylim, type="n", xlab="x", ylab="y", main="Quadtree")
#
# This is all the code needed for the plot!
#
lines(qt, xylim, col="Gray")
points(qt, pch=".")

enter image description here

As an example of how to interact with a GIS, let's write out all the quadtree cells as a polygon shapefile using the shapefiles library. The code emulates the clipping routines of lines.quadtree, but this time it has to generate vector descriptions of the cells. These are output as data frames for use with the shapefiles library.

cell <- function(q, xylim, ...) {
  if (class(q)=="quadtree") f <- cell.quadtree else f <- cell.quadtree.leaf
  f(q, xylim, ...)
}
cell.quadtree <- function(q, xylim, ...) {
  i <- q$index
  j <- 3 - q$index
  clip <- function(xylim.clip, i, upper) {
    if (upper) xylim.clip[1, i] <- max(q$threshold, xylim.clip[1,i]) else 
      xylim.clip[2,i] <- min(q$threshold, xylim.clip[2,i])
    xylim.clip
  } 
  d <- data.frame(id=NULL, x=NULL, y=NULL)
  if(q$threshold > xylim[1,i]) d <- cell(q$lower, clip(xylim, i, FALSE), ...)
  if(q$threshold < xylim[2,i]) d <- rbind(d, cell(q$upper, clip(xylim, i, TRUE), ...))
  d
}
cell.quadtree.leaf <- function(q, xylim) {
  data.frame(id = q$id, 
             x = c(xylim[1,1], xylim[2,1], xylim[2,1], xylim[1,1], xylim[1,1]),
             y = c(xylim[1,2], xylim[1,2], xylim[2,2], xylim[2,2], xylim[1,2]))
}

The points themselves can be read directly using read.shp or by importing a data file of (x,y) coordinates.

Example of use:

qt <- quadtree(xy, k)
xylim <- cbind(x=c(min(xy[,1]), max(xy[,1])), y=c(min(xy[,2]), max(xy[,2])))
polys <- cell(qt, xylim)
polys.attr <- data.frame(id=unique(polys$id))
library(shapefiles)
polys.shapefile <- convert.to.shapefile(polys, polys.attr, "id", 5)
write.shapefile(polys.shapefile, "f:/temp/quadtree", arcgis=TRUE)

(Use any desired extent for xylim here to window into a subregion or to expand the mapping to a larger region; this code defaults to the extent of the points.)

This alone is enough: a spatial join of these polygons to the original points will identify the clusters. Once identified, database "summarize" operations will generate summary statistics of the points within each cell.

QGIS Point Clustering – How to Create Dynamic Point Clusters in QGIS

If you are ready to upgrade, qgis 3 has a "point cluster" renderer that do exactly what you are looking for.

With qgis 2 you could try the "Point displacement" renderer, at last it will prevent overlap

Best Answer

Related Solutions

Generating Irregular Grid with Minimum Points – How To Achieve This?

QGIS Point Clustering – How to Create Dynamic Point Clusters in QGIS

Related Question