[GIS] Speed up row-wise point in polygon with Geopandas

geopandaspoint-in-polygonpython

I have a geopandas GeoDataframe which contains some attributes and a geometry column which is filled with shapely Point(lon, lat) objects. Moreover, there is a GeoSeries which is equally indexed as the GeoDataframe and which contains shapely Polygon() objects. The polygons are derived from the commonly known world borders shapefile which means that many of them are rather large and complicated.

My goal is to join both GeoDataframes on a common attribute and then do a point in polygon test for each feature-pair, giving back all rows from the Points GeoDataframe for which the result is true. I think my approach given below works, however, it is quite slow and I wonder if there is any way to speed this up significantly.

In order to test this under similar conditions, you could download the World Country Points shapefile from the bottom of this page to get approx. 180.000 points as well as World Borders shapefile to get the polygons. Save them in the same folder as your Python script, then load them into GeoDataFrames with

import geopandas as gpd

print('Reading points...')
PointsGeodataframe = gpd.GeoDataFrame.from_file('Global_Points_2013March12_HIU_USDoS.shp')
print('Reading polygons...')    
PolygonsGeodataframe = gpd.GeoDataFrame.from_file('TM_WORLD_BORDERS-0.3.shp')

At first, I merge both dataframes into one while joining on the columns storing the two digit ISO 3166-1 alpha-2 codes, keeping keys from the Points GeoDataframe.

print('Merging GeoDataframes...')
merged = PointsGeodataframe.merge(PolygonsGeodataframe, left_on='iso_alpha2', right_on='ISO2', how='left')
print(merged.head(5))

Out:

CC   NAME_x REGION_x                                    geometry_x  \
0  AG  ALGERIA   AFRICA  POINT (-2.056197991715294 35.07505011900781)   
1  AG  ALGERIA   AFRICA   POINT (-1.93859720224998 35.08791119378083)   
2  AG  ALGERIA   AFRICA  POINT (-2.116729200249893 35.09038023113033)   
3  AG  ALGERIA   AFRICA  POINT (-2.136513829499904 35.09662084358041)   
4  AG  ALGERIA   AFRICA  POINT (-2.169619420782624 35.10288084768303)   

  iso_alpha2 iso_alpha3  iso_num  tld    AREA FIPS ISO2 ISO3     LAT    LON  \
0         DZ        DZA       12  .dz  238174   AG   DZ  DZA  28.163  2.632   
1         DZ        DZA       12  .dz  238174   AG   DZ  DZA  28.163  2.632   
2         DZ        DZA       12  .dz  238174   AG   DZ  DZA  28.163  2.632   
3         DZ        DZA       12  .dz  238174   AG   DZ  DZA  28.163  2.632   
4         DZ        DZA       12  .dz  238174   AG   DZ  DZA  28.163  2.632   

    NAME_y   POP2005  REGION_y  SUBREGION  UN  \
0  Algeria  32854159         2         15  12   
1  Algeria  32854159         2         15  12   
2  Algeria  32854159         2         15  12   
3  Algeria  32854159         2         15  12   
4  Algeria  32854159         2         15  12   

                                          geometry_y  
0  POLYGON ((2.96361 36.802216, 2.981389 36.80693...  
1  POLYGON ((2.96361 36.802216, 2.981389 36.80693...  
2  POLYGON ((2.96361 36.802216, 2.981389 36.80693...  
3  POLYGON ((2.96361 36.802216, 2.981389 36.80693...  
4  POLYGON ((2.96361 36.802216, 2.981389 36.80693...  

To do a row-wise check if the Point (stored in geometry_x) is located within the polygon (stored in geometry_y) and get back all rows where this is the case, boolean indexing can be used (the geometry_x and geometry_y columns must be transformed into GeoSeries in order to use shapely's within on them:

rowsWherePointIsInPolygon = merged[gpd.GeoSeries(merged['geometry_x']).within(gpd.GeoSeries(merged['geometry_y']))]
print(rowsWherePointIsInPolygon)

This works fine for a small amount of rows. However, when doing this for the example shapefiles, it gets stuck for about half an hour on a rather fast machine and even longer on slow ones.

Is there any possibility to speed up this operation or is it already as fast as it gets? I don't think that there is any sense here to use a spatial index because the problem is not to find out to which of 400.000 polygons one point belongs to but to check if one point is in one polygon for a lot of points.

Without really knowing if this has something to do with it, I also tried setting

from shapely import speedups
speedups.enable()

but this did not make any change.

Best Answer

I would recommend either using the geopandas-cython branch here or pygeos.

If you use pygeos, I would recommend converting the geometries from shapely to the pygeos version first for the best speedups.