[GIS] Point in polygon algorithm for multiple polygons

algorithm, google maps, point-in-polygon, postgis

I have a Google map with a bunch of polygons on it.

Here is the problem I'm interested in: given a (lat, lng) point, what's the best way to determine all the polygons that this point lies in?

The obvious way is to run a "point in polygon" algorithm iteratively for each polygon, but I was wondering whether there is a more efficient algorithm to answer such queries, especially if you have thousands of polygons.

Best Answer

As with almost all such questions, the optimal approach depends on the "use cases" and how the features are represented. The use cases are typically distinguished by (a) whether there are many or few objects in each layer and (b) whether either (or both) layers allow for precomputing some data structures; that is, whether one or both of them is sufficiently static and unchanging to make the investment in precomputation worthwhile.

In the present case, this yields the following scenarios. Normally the points are dynamic: that is, they are not given beforehand. (If they are available in advance, or in very large groups, some optimizations based on sorting them will be available.) Let Q be the number of query points and P be the number of polygon vertices.

Vector polygon data

(1) Few points, few polygon vertices in toto. Use a brute-force procedure, such as the classic line-stabbing algorithm. For any decent method, the cost is O(P*Q), because it costs O(1) time to compare a point to a polygon edge and all such comparisons have to be made.
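To make the brute-force case concrete, here is a minimal sketch in Python of the classic line-stabbing (ray-casting) test, assuming each polygon is represented simply as a list of (x, y) vertex pairs; all names here are illustrative rather than taken from any particular library:

```python
from typing import List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]   # vertex ring; the closing vertex need not be repeated

def point_in_polygon(pt: Point, poly: Polygon) -> bool:
    """Line-stabbing test: cast a horizontal ray to the right of the
    point and count how many polygon edges it crosses."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside

def polygons_containing(pt: Point, polygons: List[Polygon]) -> List[int]:
    """Brute force: test the point against every polygon, O(P) per query."""
    return [i for i, poly in enumerate(polygons) if point_in_polygon(pt, poly)]
```

Each edge comparison is O(1), and every edge of every polygon is visited for every query point, which is where the O(P*Q) total comes from.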

(2) Possibly many polygon vertices, but they are dynamic: each time a point is used in the query, the polygons might all have changed. Again use a brute-force algorithm. The cost is still O(P*Q), which will be large because P will be large, but there's no helping that. If the changes are small or controlled (e.g., the polygons are slightly changing shape or simply moving around slowly) you might be able to use a version of the next solution and find an efficient way to update the data structures as the polygons change. That would likely be a matter for original research.

(3) Many polygon vertices and static polygons (that is, the polygon layer will rarely change). Precompute a data structure to support the search (which could be based on a line sweep or a quadtree algorithm). The cost of precomputation for these algorithms is O(P*log(P)), but the cost of the queries becomes O(Q*log(P)), so the total cost is O((P+Q)*log(P)).
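One readily available precomputed structure for this scenario is an R-tree over the polygons. Below is a sketch using Shapely's STRtree (this assumes Shapely 2.0 or later, where query returns integer indices of bounding-box hits, which still need an exact containment check):

```python
# Assumes Shapely >= 2.0, where STRtree.query returns integer indices.
from shapely.geometry import Point, Polygon
from shapely.strtree import STRtree

polygons = [
    Polygon([(0, 0), (4, 0), (4, 4), (0, 4)]),
    Polygon([(2, 2), (6, 2), (6, 6), (2, 6)]),
    # ... thousands more
]

tree = STRtree(polygons)              # one-time O(P*log(P)) precomputation

def polygons_containing(lng, lat):
    pt = Point(lng, lat)
    candidates = tree.query(pt)       # bounding-box hits only
    return [i for i in candidates if polygons[i].contains(pt)]

print(polygons_containing(3.0, 3.0))  # both squares contain the point -> [0, 1]
```

Building the tree is the one-time O(P*log(P)) investment; each query then only has to examine the few polygons whose bounding boxes actually contain the point.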

Some improvements are available in special cases, such as

(a) All polygons are convex (preprocessing the polygons can be done more quickly),

(b) All polygon interiors are disjoint, in which case you can think of their union as being a single polygon (which allows for straightforward efficient algorithms, such as those based on triangulation), and

(c) Most polygons are not very tortuous--that is, they occupy large portions of their bounding boxes--in which case you can do an initial test based on the bounding boxes only and then refine that solution. This is a popular optimization (see the sketch after this list).

(d) The number of points is large. Sorting them might improve the timing. For instance, when implementing a left-to-right line sweep point-in-polygon algorithm, you would sort the points on their first coordinate, allowing you to sweep over the points at the same time you sweep over the polygon edges. I'm not aware that such an optimization has been published. One that has been published, though, is to perform a constrained triangulation of the union of all the points and polygon vertices: once that triangulation is complete, identifying the interior points should be quick. Computational cost will scale as O(Q*log(Q) + (P+Q)*log(P+Q)).
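Improvement (c) can be bolted onto the brute-force approach with very little code: precompute each polygon's bounding box and run the full test only when the cheap box test passes. A sketch, reusing point_in_polygon and the plain vertex-list representation from the first example:

```python
def bounding_box(poly):
    """Axis-aligned bounding box of a vertex list."""
    xs = [x for x, _ in poly]
    ys = [y for _, y in poly]
    return min(xs), min(ys), max(xs), max(ys)

def polygons_containing_bbox(pt, polygons, boxes):
    x, y = pt
    hits = []
    for i, (xmin, ymin, xmax, ymax) in enumerate(boxes):
        if xmin <= x <= xmax and ymin <= y <= ymax:   # cheap reject
            if point_in_polygon(pt, polygons[i]):     # expensive confirm
                hits.append(i)
    return hits

boxes = [bounding_box(p) for p in polygons]           # precompute once
```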

Raster polygon data

This is incredibly easy: view the polygon layer as a binary indicator raster (1=inside a polygon, 0=outside). (This could require a lookup table to convert raster values to inside/outside indicators.) Each point probe now requires O(1) effort to index the raster cell and read its value. Total effort is O(Q).
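A sketch of the raster probe, assuming the polygon layer has already been burned into a NumPy array at a known origin and cell size (all names and values below are illustrative):

```python
import numpy as np

# Illustrative geotransform: a global grid at 0.1 degree resolution.
x_origin, y_origin = -180.0, 90.0    # upper-left corner of the raster
cell_size = 0.1                      # degrees per cell
grid = np.zeros((1800, 3600), dtype=np.uint8)   # 1 = inside a polygon, 0 = outside

def inside_any_polygon(lng, lat):
    col = int((lng - x_origin) / cell_size)
    row = int((y_origin - lat) / cell_size)     # rows increase downward
    return bool(grid[row, col])                 # O(1) per probe
```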

In general

A nice hybrid solution in the case of many static vector polygons (vector case 3 above) is initially to rasterize the polygons, perhaps even with a coarse resolution, this time distinguishing any cells intersecting any part of a polygon boundary (give them a value of 2, say). Using a raster probe (cost: O(1)) typically results in a definite answer (the point is known to be inside or outside), but occasionally results in an indefinite answer (the point falls in a cell through which at least one edge passes), in which case the more expensive O(log(P)) vector query is made. This method incurs some extra storage cost for the raster, but in many cases even a small raster (one MB will allow for a 2000 by 2000 raster that stores {0,1,2,null} values) can confer huge advantages in computational time. Asymptotically, the computational effort is the same as for a vector solution, but in practice it is O(Q + P*log(P)) and possibly as low as O(Q+P) (achieved by using a very fine resolution for the raster and using brute-force methods for the very rare vector queries that have to be performed).
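A sketch of the hybrid probe, assuming the raster now stores 0 = outside, 1 = inside, 2 = a polygon boundary crosses the cell, and reusing the grid parameters and the STRtree/polygons from the earlier sketches for the occasional exact query:

```python
from shapely.geometry import Point

def inside_any_polygon_hybrid(lng, lat):
    col = int((lng - x_origin) / cell_size)
    row = int((y_origin - lat) / cell_size)
    cell = grid[row, col]                 # 0, 1, or 2
    if cell != 2:
        return cell == 1                  # definite answer, O(1)
    # Ambiguous cell: an edge passes through it, so fall back to the
    # O(log(P)) vector query against the precomputed STRtree.
    pt = Point(lng, lat)
    return any(polygons[i].contains(pt) for i in tree.query(pt))
```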
