Given a large (~ 1 million) sample of unevenly distributed points – is it possible to generate irregular grid (in size, but could also be irregular in shape if that is possible?) that will contain specified minimum amount of n points?
It is of less importance for me if genereted 'cells' of such grid contain exactly n number of points or at least n points.
I'm aware of solutions like genvecgrid in ArcGIS or Create Grid Layer in QGIS/mmgis however they will all create regular grids which will result in an output with empty cells (smaller problem – I could simply discard them) or cells with count of points less than n (bigger problem since I'd need a solution to aggregate those cells, probably using some tools from here?).
I've been googling around to no avail and am open to both commercial (ArcGIS & extensions) or free (Python, PostGIS, R) solutions.
Best Answer
I see MerseyViking has recommended a quadtree. I was going to suggest the same thing and in order to explain it, here's the code and an example. The code is written in
R
but ought to port easily to, say, Python.The idea is remarkably simple: split the points approximately in half in the x-direction, then recursively split the two halves along the y-direction, alternating directions at each level, until no more splitting is desired.
Because the intent is to disguise actual point locations, it is useful to introduce some randomness into the splits. One fast simple way to do this is to split at a quantile set a small random amount away from 50%. In this fashion (a) the splitting values are highly unlikely to coincide with data coordinates, so that the points will fall uniquely into quadrants created by the partitioning, and (b) point coordinates will be impossible to reconstruct precisely from the quadtree.
Because the intention is to maintain a minimum quantity
k
of nodes within each quadtree leaf, we implement a restricted form of quadtree. It will support (1) clustering points into groups having betweenk
and 2*k
-1 elements each and (2) mapping the quadrants.This
R
code creates a tree of nodes and terminal leaves, distinguishing them by class. The class labeling expedites post-processing such as plotting, shown below. The code uses numeric values for the ids. This works up to depths of 52 in the tree (using doubles; if unsigned long integers are used, the maximum depth is 32). For deeper trees (which are highly unlikely in any application, because at leastk
* 2^52 points would be involved), ids would have to be strings.Note that the recursive divide-and-conquer design of this algorithm (and, consequently, of most of the post-processing algorithms) means that the time requirement is O(m) and RAM usage is O(n) where
m
is the number of cells andn
is the number of points.m
is proportional ton
divided by the minimum points per cell,k
. This is useful for estimating computation times. For instance, if it takes 13 seconds to partition n=10^6 points into cells of 50-99 points (k=50), m = 10^6/50 = 20000. If you want instead to partition down to 5-9 points per cell (k=5), m is 10 times larger, so the timing goes up to about 130 seconds. (Because the process of splitting a set of coordinates around their middles gets faster as the cells get smaller, the actual timing was only 90 seconds.) To go all the way to k=1 point per cell, it will take about six times longer still, or nine minutes, and we can expect the code actually to be a little faster than that.Before going further, let's generate some interesting irregularly spaced data and create their restricted quadtree (0.29 seconds elapsed time):
Here's the code to produce these plots. It exploits
R
's polymorphism:points.quadtree
will be called whenever thepoints
function is applied to aquadtree
object, for instance. The power of this is evident in the extreme simplicity of the function to color the points according to their cluster identifier:Plotting the grid itself is a little trickier because it requires repeated clipping of the thresholds used for the quadtree partitioning, but the same recursive approach is simple and elegant. Use a variant to construct polygonal representations of the quadrants if desired.
As another example, I generated 1,000,000 points and partitioned them into groups of 5-9 each. Timing was 91.7 seconds.
As an example of how to interact with a GIS, let's write out all the quadtree cells as a polygon shapefile using the
shapefiles
library. The code emulates the clipping routines oflines.quadtree
, but this time it has to generate vector descriptions of the cells. These are output as data frames for use with theshapefiles
library.The points themselves can be read directly using
read.shp
or by importing a data file of (x,y) coordinates.Example of use:
(Use any desired extent for
xylim
here to window into a subregion or to expand the mapping to a larger region; this code defaults to the extent of the points.)This alone is enough: a spatial join of these polygons to the original points will identify the clusters. Once identified, database "summarize" operations will generate summary statistics of the points within each cell.