[GIS] Python geopandas dataframe of polygons — determine nearest neighbor polygon

geopandas python shapely

I am trying to figure out how to dynamically create expanded regions of interest based on polygon geometries in geopandas until some threshold is satisfied (essentially custom regions across the contiguous US).

I have a geopandas dataframe with the following columns:

unitspaces_geodf[['unit_space_count', 'city', 'state_code', 'latitude', 'longitude', 'cbsa_code', 'geometry']]

where geometry is the corresponding polygon for the cbsa_code (generated from a shape file).

I can get a total count of unit_space_count by grouping the dataframe on cbsa_code:

unitspaces_geodf.groupby('cbsa_code')['unit_space_count'].sum()

Based on the sum() value of unit_space_count in each cbsa_code (let's say a threshold of 100) I want to determine the next nearest geometry and then combine the two geometries (and concatenate the cbsa_code) until unit_space_count is above the threshold. This is essentially creating a new neighborhood/region. Then remove these two (or more) cbsa_code entries from the pool of available entries to concatenate with.

So in the example above, the last line of the groupby clause shows a cbsa_code of 12660 and the total unit_space_count is 72.

For this record, I want to determine the nearest neighbor (using the geometry column and hopefully an out-of-the-box geopandas or shapely method) to generate a new combined cbsa-code (something like 12660-12620 if 12620 happened to be the nearest neighbor) and a new unit_space_count of 1268 (1196 + 72).

I think I may need to first create a map of all the grouped values to see which regions should be joined with which, but after that I need help with determining how to do these calculations.

Best Answer

You can use the Python rtree library to build a spatial index; its nearest method then returns the indexed item closest to any given query geometry. Shapely also ships its own R-tree implementation (shapely.strtree.STRtree) that behaves similarly, though I always use rtree. This is probably the fastest way, as it needs only one index query per record.
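A minimal sketch of the spatial-index approach, using Shapely's STRtree instead of the rtree package. It assumes Shapely 2.x, where query_nearest returns positions into the list used to build the tree. The three squares and the codes 12620 and 12700 attached to them are made-up stand-ins for real CBSA polygons.

```python
from shapely.geometry import box
from shapely.strtree import STRtree

# Hypothetical stand-ins for the CBSA polygons: three unit squares in a row.
polygons = {
    "12660": box(0, 0, 1, 1),
    "12620": box(2, 0, 3, 1),
    "12700": box(10, 0, 11, 1),
}
codes = list(polygons)
geoms = [polygons[code] for code in codes]

# Build the index once; every later lookup reuses it.
tree = STRtree(geoms)


def nearest_other(position):
    """Return the position of the geometry nearest to geoms[position]."""
    # exclusive=True skips geometries equal to the query geometry,
    # so a polygon is never reported as its own nearest neighbor.
    hits = tree.query_nearest(geoms[position], exclusive=True)
    return int(hits[0])


nearest_code = {codes[i]: codes[nearest_other(i)] for i in range(len(codes))}
print(nearest_code)
```

Distances are planar and measured in the units of the coordinates, so longitude/latitude polygons would normally be reprojected to a projected CRS before indexing.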

Otherwise, you'll need to compare every geometry to every other one using the shapely distance method and choose the smallest distance. In pandas that would be a cross join of your dataset with itself plus a distance column; then keep only the rows where the IDs are not equal (equal IDs mean you measured the distance from a polygon to itself) and, for each ID, pick the row with the shortest distance.
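The brute-force comparison can also drive the merging described in the question. The sketch below is one possible greedy policy, not an established algorithm: it repeatedly takes a region below the threshold, finds its nearest remaining region with shapely's distance method, and unions the two. The function name merge_small_regions, the sample polygons, and every count except 72 and 1196 are hypothetical.

```python
from shapely.geometry import box

THRESHOLD = 100  # example threshold from the question

# Hypothetical regions: cbsa_code -> (unit_space_count, polygon)
regions = {
    "12660": (72, box(0, 0, 1, 1)),
    "12620": (1196, box(2, 0, 3, 1)),
    "12700": (60, box(10, 0, 11, 1)),
    "12740": (50, box(12, 0, 13, 1)),
}


def merge_small_regions(regions, threshold):
    """Greedily merge under-threshold regions with their nearest neighbor."""
    pool = dict(regions)
    while len(pool) > 1:
        small = [code for code, (count, _) in pool.items() if count < threshold]
        if not small:
            break  # every remaining region meets the threshold
        code = small[0]
        count, geom = pool.pop(code)
        # Brute force: measure the distance to every remaining region
        # and keep the code with the smallest distance.
        nearest = min(pool, key=lambda other: geom.distance(pool[other][1]))
        n_count, n_geom = pool.pop(nearest)
        # Concatenate the codes, add the counts, and union the polygons.
        pool[f"{code}-{nearest}"] = (count + n_count, geom.union(n_geom))
    return pool


merged = merge_small_regions(regions, THRESHOLD)
for code, (count, _) in merged.items():
    print(code, count)
```

In this sketch a region that already meets the threshold can still be absorbed as someone else's nearest neighbor, which matches the 12660-12620 example in the question; a stricter policy would filter those regions out of the candidate pool.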

Related Solutions

[GIS] Geopandas Line Polygon Intersection

When comparing geodataframes with geometry operations in Geopandas, the geometries are first matched by index. In the case where there is no matching index (because you only have a single polygon for instance) then the result will be False.
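A small sketch of that index-matching behavior, using the two lines from the output further down and a made-up unit-square polygon. The variable names aligned and against_one are illustrative only.

```python
import geopandas as gpd
from shapely.geometry import LineString, box

line_gdf = gpd.GeoDataFrame(geometry=[
    LineString([(0.5, 0.5), (0.7, 0.7)]),
    LineString([(0.9, 0.9), (0.2, 0.6)]),
])
# A single hypothetical polygon, so only index label 0 exists on this side.
poly_gdf = gpd.GeoDataFrame(geometry=[box(0, 0, 1, 1)])

# Element-wise comparison: rows are paired by index label. Line 1 has no
# partner polygon, so its result is False. geopandas may warn that the
# two indexes differ.
aligned = line_gdf.geometry.intersects(poly_gdf.geometry)

# Passing one shapely geometry instead compares every line against it.
against_one = line_gdf.geometry.intersects(poly_gdf.geometry.iloc[0])

print(aligned.tolist())
print(against_one.tolist())
```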

If it were to compare each object in the GeoSeries you would instead need to get back a full rectangular dataframe of boolean values, and this would likely be very inefficient.

If you do want to compare all geometries then you have two options. The first (and probably easiest) is to use the geopandas sjoin method:

gpd.sjoin(line_gdf, poly_gdf, predicate='intersects')

This returns a new GeoDataFrame with the geometries for each object on the left dataframe repeated for each geometry they intersect in the right, with the index of the object in the right, i.e.:

                        geometry  index_right
0  LINESTRING (0.5 0.5, 0.7 0.7)            0
1  LINESTRING (0.9 0.9, 0.2 0.6)            0

The second method is to use the pandas apply method on the GeoSeries to return the rectangular dataframe:

line_gdf.geometry.apply(lambda g: poly_gdf.intersects(g))

Which in turn returns (with increasing inefficiency as the dataframes grow):

index_right     0
index_left
0            True
1            True

In general, unless you need the full rectangular matrix, my advice would be to stick to the sjoin method.

[GIS] Iterate over geopandas dataframe returning nearest point on each LineString

You might want to try rtree. It is the fastest way to retrieve the nearest geometries, especially when a large number of geometries is involved. Geoff Boeing provides an excellent example of using the R-tree spatial index that geopandas itself implements.
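A rough sketch of the idea, using Shapely's STRtree rather than the rtree package and assuming Shapely 2.x, where nearest returns a position into the list used to build the tree. The two lines and the query point are made up for illustration.

```python
from shapely.geometry import LineString, Point
from shapely.strtree import STRtree

# Hypothetical data: two horizontal lines and one query point.
lines = [
    LineString([(0, 0), (10, 0)]),
    LineString([(0, 5), (10, 5)]),
]
point = Point(3, 1)

# Index lookup: which line is closest to the point?
tree = STRtree(lines)
idx = int(tree.nearest(point))
nearest_line = lines[idx]

# project() gives the distance along the line of the closest location;
# interpolate() turns that distance back into a point on the line.
nearest_pt = nearest_line.interpolate(nearest_line.project(point))

# The same two calls give the closest point on every line, without the index.
per_line = [line.interpolate(line.project(point)) for line in lines]

print(idx, nearest_pt.wkt)
```

The index only pays off when there are many lines and you need the closest one; if you want the nearest point on each line, the project/interpolate pair alone is enough.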
