I have two polygon-shapefiles representing similar data. However, the features in these layers are not always 100 % identical. So, a feature in shapefile 1 may be present in shapefile 2, but the form of the polygon is not exactly the same (vertices missing or different). Also, shapefile 1 contains features which are not present in shapefile 2. The opposite is also true.
I'm writing a Python script to merge these datasets. I would like to obtain a layer with all features from shapefile 1, and the attributes of the features of shapefile 2 for those features that are nearly identical to the features of shapefile 1. To define 'Nearly identical' I use the criterium of 75% of the area is intersecting.
How can I efficiently compare the features of shapefile 1 to the features of shapefile 2?
This code works to transfer the attributes of an 'nearly identical' feature of shp2 to shp1. However, it is extremly slow on large datasets, because of the itertuples.
df1 = gpd.read_file('shapefile1.shp')
df2 = gpd.read_file('shapefile2.shp')
df1['attribute1_of_shp2'] = 0
df1['attribute2_of_shp2'] = 0
for feature in df2.itertuples():
geom = getattr(feature, 'geometry')
attr1 = getattr(feature, 'attribute1')
attr2 = getattr(feature, 'attribute2')
intersection = df1['geometry'].intersection(geom)
df1['RelIntersection'] = intersection.area/df1.area
df1.loc[df['RelIntersection'] > .75, 'attribute1_of_shp2'] = attr1
df1.loc[df['RelIntersection'] > .75, 'attribute2_of_shp2'] = attr2
Is there a way to speed-up this?
I've been looking to geopandas overlay but did not find a working solution yet.
Best Answer
Consider the following example dataframes:
They both have 3 records, of which 2 overlap mostly:
If we want to merge those two datasets based on more complex criterion (in this case a kind of 'mostly overlapping'), we cannot just use
pandas.merge
(on an attribute column) orgeopandas.sjoin
(geometries overlap exactly). But we could take an approach where we first calculate the index of the mostly overlapping items, and then with this index, subset our original frames and concatenate them.Let's define this function that for a given Polygon, returns where it overlaps with the geometries in another GeoSeries:
In the case of the example dataframes, this returns
indicating that row 0 of
df1
matches with row 0 of df2, row 1 ofdf1
has no match, and row 2 matches with row 1 ofdf2
.Since there can potentially be multiple matches per row (is that correct), we need to convert this a bit (and drop the NaNs, as we cannot index with that):
Now, we can use this to subset
df2
and concat it withdf1
:This gives for this example dataframe:
Of course, you could then first drop the 'geometry' columns of
df2
to not end up with two columns, or only select those attributes ofdf2_matched
that you want to add todf1
.