I am trying to figure out how to dynamically create expanded regions of interest based on polygon geometries in geopandas until some threshold is satisfied (essentially custom regions across the contiguous US).
I have a geopandas dataframe with the following columns:
unitspaces_geodf[['unit_space_count', 'city', 'state_code', 'latitude', 'longitude', 'cbsa_code', 'geometry']]
where geometry
is the corresponding polygon for the cbsa_code
(generated from a shape file).
I can get a total count of unit_space_count
by grouping the dataframe on cbsa_code
:
unitspaces_geodf.groupby('cbsa_code')['unit_space_count'].sum()
Based on the sum()
value of unit_space_count
in each cbsa_code
(let's say a threshold of 100) I want to determine the next nearest geometry
and then combine the two geometries (and concatenate the cbsa_code
) until unit_space_count
is above the threshold. This is essentially creating a new neighborhood/region. Then remove these two (or more) cbsa_code
entries from the pool of available entries to concatenate with.
So in the example above, the last line of the groupby
clause shows a cbsa_code
of 12660
and the total unit_space_count
is 72
.
For this record, I want to determine the nearest neighbor (using the geometry
column and hopefully an out-of-the-box geopandas or shapely method) to generate a new combined cbsa-code
(something like 12660-12620
if 12620
happened to be the nearest neighbor) and a new unit_space_count
of 1268 (1196 + 72).
I think I may need to first create a map of all the grouped values to see which regions should be joined with which, but after that I need help with determining how to do these calculations.
Best Answer
You can use the Python rtree library to build up a spatial index, which then has a
nearest
method you can use to get the nearest geometry in the index to any given query. I think Shapely also comes with an rtree implementation which behaves similarly, but I could be wrong - I always usertree
. This is probably the fastest way, as it only requires one calculation for each record.Otherwise, you'll need to compare every geometry to every other one using the shapely
distance
method, and choose the smallest one. I guess in Pandas that would be a full outer join of your dataset to itself, and add a column of distance, and then query out the records where ID's are not equal (implying you calculated the distance to the same polygon) and distance is the shortest.