Python – Fastest Way to Join Many Points to Many Polygons

pythonspatial-join

I need to attribute a geographical area (based on shapefile polygons) to 8 million GPS points. There are 405 polygons. I currently have the code below, and a quick extrapolation tells me it will take 84 hours to compute this.

I am new to GIS, but intuitively, I feel that there has to be a smarter way around this than, for each point, randomly testing each polygon with a point-in-polygon function until there is match. For example, simply putting the points and polygons in 2 groups (such as "East" and "West") simply based on their easternmost/westernmost longitude may divide by two the number of operations to perform by almost 2.

Are there indeed such much efficient algorithms? Are any of them easily implementable in python?

from shapely.geometry import shape, Point, Polygon
import pandas as pd
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap

plt.figure(figsize=(16,16))
m = Basemap(projection='merc',llcrnrlat=15,urcrnrlat=55,llcrnrlon=70,urcrnrlon=140,lat_ts=20,resolution='c')
m.readshapefile(china_shapefile, 'prefectures', drawbounds=True)

print len(m.prefectures)

#get data
china_shapefile="C:/Users/.../CHN_adm2_SIMPLIFIED/CHN_adm2"
points_csv="C:/Users/.../sample_map_data.csv"


points=pd.read_csv(points_csv)
points=points.dropna()

lat=points["Latitude"].tolist()
lon=points["Longitude"].tolist()
coordinates=zip(lat,lon)

for coordinate in coordinates:
    for info,shape in zip(m.prefectures_info,m.prefectures):
        poly = Polygon(shape)
        #print info["NAME_1"],info["NAME_2"]
        xpt,ypt = m(coordinate[1],coordinate[0])
        point1 = Point(xpt,ypt)
        if point1.within(poly):
            print coordinate,info["NAME_1"],info["NAME_2"]#,info["NAME_3"]
            break

Best Answer

As blord-castillo says, you can use a Spatial Index. They are many modules in Python for that (PiPy: spatial index, R-tree, Quad-Ttree, Kd-tree, Interval-tree, ...).

The most used method for indexing spatial data is the R-tree index (also used by QGIS or PyQGIS as stated by GeoJohn, see geoprocessing across multiple vector layers QGIS2 for example)

The fastest is Rtree (rtree python polygon index) because it is a simple Python wrapper of the C library libspatialindex. It is used in More Efficient Spatial join in Python without QGIS, ArcGIS, PostGIS, etc, for example (with Fiona, Shapely and Rtree).

Rather than using Pandas, you should use GeoPandas that includes optionally rtree as a dependency (spatial joins, Spatial indexes for geo data frames and series or More Efficient Spatial join in Python without QGIS, ArcGIS, PostGIS, etc with GeoPandas)

This reference gives you 2 solutions for your problem (spatial join between points and polygons) with the limitations of the procedure:

  • Without an index, you must iterate through all the geometries (polygons and points).
  • With a bounding spatial index (Spatial Index RTree), you iterate only through the geometries which have a chance to intersect with your current geometry ('filter' which can save a considerable amount of calculations and time...)
  • But a Spatial Index is not a magic wand. When a very large part of the dataset has to be retrieved, a Spatial Index cannot give any speed benefit.