[GIS] Python geopandas dataframe of polygons — determine nearest neighbor polygon

geopandaspythonshapely

I am trying to figure out how to dynamically create expanded regions of interest based on polygon geometries in geopandas until some threshold is satisfied (essentially custom regions across the contiguous US).

I have a geopandas dataframe with the following columns:

unitspaces_geodf[['unit_space_count', 'city', 'state_code', 'latitude', 'longitude', 'cbsa_code', 'geometry']]

where geometry is the corresponding polygon for the cbsa_code (generated from a shape file).

raw geodataframe

I can get a total count of unit_space_count by grouping the dataframe on cbsa_code:

unitspaces_geodf.groupby('cbsa_code')['unit_space_count'].sum()

groupby

Based on the sum() value of unit_space_count in each cbsa_code (let's say a threshold of 100) I want to determine the next nearest geometry and then combine the two geometries (and concatenate the cbsa_code) until unit_space_count is above the threshold. This is essentially creating a new neighborhood/region. Then remove these two (or more) cbsa_code entries from the pool of available entries to concatenate with.

So in the example above, the last line of the groupby clause shows a cbsa_code of 12660 and the total unit_space_count is 72.

For this record, I want to determine the nearest neighbor (using the geometry column and hopefully an out-of-the-box geopandas or shapely method) to generate a new combined cbsa-code (something like 12660-12620 if 12620 happened to be the nearest neighbor) and a new unit_space_count of 1268 (1196 + 72).

I think I may need to first create a map of all the grouped values to see which regions should be joined with which, but after that I need help with determining how to do these calculations.

Best Answer

You can use the Python rtree library to build up a spatial index, which then has a nearest method you can use to get the nearest geometry in the index to any given query. I think Shapely also comes with an rtree implementation which behaves similarly, but I could be wrong - I always use rtree. This is probably the fastest way, as it only requires one calculation for each record.

Otherwise, you'll need to compare every geometry to every other one using the shapely distance method, and choose the smallest one. I guess in Pandas that would be a full outer join of your dataset to itself, and add a column of distance, and then query out the records where ID's are not equal (implying you calculated the distance to the same polygon) and distance is the shortest.