Consider the following example dataframes:
import geopandas
from shapely.geometry import Polygon
df1 = geopandas.GeoDataFrame(
{'geometry': [Polygon([(0, 0), (0, 1), (1, 1), (1, 0)]),
Polygon([(1, 1), (2, 1), (2, 2), (1, 2)]),
Polygon([(2, 0), (3, 0), (3, 1), (2, 1)])],
'attr1': [1, 2, 3]})
df2 = geopandas.GeoDataFrame(
{'geometry': [Polygon([(0, 0), (0, 0.9), (1.1, 1), (1, 0.1)]),
Polygon([(2, 0.1), (3.1, 0), (3, 1.1), (2, 1)]),
Polygon([(2, 2), (3, 2), (3, 3), (2, 3)])],
'attr2': [1, 2, 3]})
They both have 3 records, of which 2 overlap mostly:
If we want to merge those two datasets based on more complex criterion (in this case a kind of 'mostly overlapping'), we cannot just use pandas.merge
(on an attribute column) or geopandas.sjoin
(geometries overlap exactly). But we could take an approach where we first calculate the index of the mostly overlapping items, and then with this index, subset our original frames and concatenate them.
Let's define this function that for a given Polygon, returns where it overlaps with the geometries in another GeoSeries:
def nearly_identical(geoms, p):
nearly = (geoms.intersection(p).area / p.area) > 0.75
# return index values where nearly is True
return pd.Series(nearly.index[nearly])
matches = df1.geometry.apply(lambda x: nearly_identical(df2, x))
In the case of the example dataframes, this returns
>>> matches
0
0 0.0
1 NaN
2 1.0
indicating that row 0 of df1
matches with row 0 of df2, row 1 of df1
has no match, and row 2 matches with row 1 of df2
.
Since there can potentially be multiple matches per row (is that correct), we need to convert this a bit (and drop the NaNs, as we cannot index with that):
>>> matches2 = matches.unstack().reset_index(0, drop=True).dropna()
>>> matches2
0 0.0
2 1.0
Now, we can use this to subset df2
and concat it with df1
:
# take those values of df2 that have a match with df1
df2_matched = df2.reindex(index=matches2.values)
# overwrite its index with the corresponding index in df1 (for the matching row),
# so we can concatenate them based on this index
df2_matched.index = matches2.index
df_merged = pd.concat([df1, df2_matched], axis=1)
This gives for this example dataframe:
>>> df_merged
attr1 geometry attr2 geometry
0 1 POLYGON ((0 0, 0 1, 1 1, 1 0, 0 0)) 1.0 POLYGON ((0 0, 0 0.9, 1.1 1, 1 0.1, 0 0))
1 2 POLYGON ((1 1, 2 1, 2 2, 1 2, 1 1)) NaN NaN
2 3 POLYGON ((2 0, 3 0, 3 1, 2 1, 2 0)) 2.0 POLYGON ((2 0.1, 3.1 0, 3 1.1, 2 1, 2 0.1))
Of course, you could then first drop the 'geometry' columns of df2
to not end up with two columns, or only select those attributes of df2_matched
that you want to add to df1
.
Best Answer
As @Vince said in the comments, the behavior of the function is consistent with the definition of
overlap
(emphasis mine):The possible values for
sjoin
'spredicate
argument (formerly namedop
) are:['covers', 'within', 'contains', 'crosses', None, 'intersects', 'touches', 'covered_by', 'contains_properly', 'overlaps']
.Their definitions are copied below:
contains_properly
predicate.