I have a MultiPolygon representing the city boundary of Houston and it has an extremely complicated boundary. I also have a set of ~900,000 points (that has the same minimum bounding box as that Houston polygon). About ~400,000 of these points are within the polygon but the others lie outside it. Using python, geopandas, and shapely I tried intersecting this polygon with my points using r-tree. But because the points and polygon have the same minimum bounding box, r-tree offers no speed-up. The process currently takes 30+ minutes.
Which (if any) type of spatial index can I use to accelerate my intersection query when the polygon and points have the same minimum bounding box?
Edit to add code snippet here:
sindex = gdf['geometry'].sindex
possible_matches_index = list(sindex.intersection(polygon.bounds))
possible_matches = gdf.iloc[possible_matches_index]
points_in_polygon = possible_matches[possible_matches.intersects(polygon)]
Best Answer
I try to explain with an image. If you do not agree that this strategy works I will delete my answer. However, I will continue to use it in my own work.
I took Texas as an example and I split it into 1 by 1 degree pieces. The grid that shows behind corresponds with the R-Tree index despite at the boundaries where the real R-Tree BBOXes are smaller.
In the Texas case a point never hits more than one R-Tree box. In case of adjacent polygons there may be more hits, but never more than a handful. Next thing to do is to make an Intersects test with a true geometry of the corresponding piece. If point is inside that piece it is inside Texas as well, and in the USA, North America etc.