GeoPandas – Speeding Up GeoPandas Spatial Join Operations

Tags: geopandas, python, spatial-join

I am using GeoPandas's sjoin function to join two dataframes: dataframeA has latitude and longitude information, whereas dataframeB has polygon information. The number of rows in dataframeA may vary (~70M), while dataframeB is fixed at 825k rows. I want to perform a point-in-polygon operation and update dataframeA with information from dataframeB. Here is my code, which works (rtree and libspatialindex have been installed):

# build the polygon GeoDataFrame (dataFromReadCSV and geometry come from earlier steps)
dataframeB = gpd.GeoDataFrame(dataFromReadCSV, crs="EPSG:4326", geometry=geometry)
# build point geometries from the latitude/longitude columns
dataframeA = gpd.GeoDataFrame(dataframeA, crs="EPSG:4326", geometry=gpd.points_from_xy(dataframeA.longitude, dataframeA.latitude))
# left spatial join: keep every point, attach the attributes of the polygon it falls within
dataframeA = gpd.sjoin(dataframeA, dataframeB, op='within', how='left')

Since the memory requirement for this task is very high, I chunk dataframeA before the sjoin and append the results to disk. This process has been working fine.
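A rough sketch of that chunking loop, for context (the chunk size and output file name are illustrative, not taken from the question):

import geopandas as gpd

chunk_size = 8_000_000  # roughly 7-8M rows per chunk, as described above
for start in range(0, len(dataframeA), chunk_size):
    chunk = dataframeA.iloc[start:start + chunk_size]
    joined = gpd.sjoin(chunk, dataframeB, op='within', how='left')
    # append each joined chunk to disk so memory usage stays bounded
    joined.drop(columns='geometry').to_csv(
        'joined_results.csv', mode='a', header=(start == 0), index=False
    )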

Environment: Python 3.6; Dask for a high-performance cluster

Problem: For a chunk of dataframeA (~7-8M rows), the join takes about 2-3 hours. I know point-in-polygon is computationally expensive.

Is there a way to speed this up?

Best Answer

You can get a significant speedup if you have GeoPandas 0.8 and PyGEOS installed (geopandas.org/install.html#using-the-optional-pygeos-dependency). PyGEOS uses vectorized NumPy ufuncs and can be orders of magnitude faster than standard Shapely.
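A minimal sketch of opting into the PyGEOS backend, assuming GeoPandas 0.8+ with the pygeos package installed (the sjoin call itself is unchanged):

import geopandas as gpd

# True if PyGEOS was detected and enabled when geopandas was imported
print(gpd.options.use_pygeos)

# opt in explicitly; requires the pygeos package to be installed
gpd.options.use_pygeos = True

# the join is written exactly as before; the speedup comes from the vectorized backend
dataframeA = gpd.sjoin(dataframeA, dataframeB, op='within', how='left')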

Note that PyGEOS will be part of Shapely 2.0, so once that is released, installing PyGEOS separately will not be needed.