GeoPandas Issues – Fixing Overlaps Sjoin Failures When Two Geometries Are Identical Using GeoPandas

geopandas geos shapely

I am trying to find all polygons that overlap in a collection.
In my unit test I added the following:

import geopandas as gpd
import pandas as pd
from shapely.geometry import box

s1 = [1, 2, 3, 4, 0.5]
s2 = [5, 6, 7, 8, 0.5]
s3 = [1, 2, 3, 4, 0.3]
# s3 = [1.1, 2.1, 3.1, 4.1, 0.3] # Works
df = pd.DataFrame(
    [s1, s2, s3], columns=["e_1", "n_1", "e_2", "n_2", "value"]
)
crs = "+proj=longlat +a=1000000 +b=1000000 +no_defs"
gdf = gpd.GeoDataFrame(
    df,
    crs=crs,
    geometry=[
        box(e_1, n_1, e_2, n_2)
        for _, e_1, n_1, e_2, n_2 in df[
            ["e_1", "n_1", "e_2", "n_2"]
        ].itertuples()
    ],
)
# older GeoPandas releases called the `predicate` argument `op`
joined = gpd.sjoin(gdf.copy(),
                   gdf,
                   how='inner', predicate='overlaps')
joined.shape

which gives (0, 12), meaning that no rows were joined.

If I uncomment the s3 line with 1.1, 2.1, etc., I get the expected result (2, 12).

Have I misunderstood how the overlaps predicate works, or is this a bug?

Best Answer

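As @Vince said in the comments, the behavior of the function is consistent with the definition of overlaps (note the phrase "but not all"):

Geometries overlaps if they have more than one but not all points in common, have the same dimension, and the intersection of the interiors of the geometries has the same dimension as the geometries themselves.

Two identical geometries have all of their points in common, so overlaps is False for them. That is why the s1 and s3 boxes in the question are not joined, while the shifted s3 is.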
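To see the definition in action, here is a minimal sketch that uses Shapely directly. The names a, b and c are mine, chosen to mirror s1, s3 and the commented-out s3 from the question. Identical boxes are equal and intersect, but they do not overlap:

```python
from shapely.geometry import box

a = box(1, 2, 3, 4)           # same extent as s1 in the question
b = box(1, 2, 3, 4)           # identical to a, like s3
c = box(1.1, 2.1, 3.1, 4.1)   # shifted copy, like the commented-out s3

# predicates evaluated on the pair of identical boxes
identical = {
    "overlaps": a.overlaps(b),
    "equals": a.equals(b),
    "intersects": a.intersects(b),
    "covers": a.covers(b),
}
# the shifted box shares some, but not all, points with a
shifted_overlaps = a.overlaps(c)

print(identical, shifted_overlaps)
```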

The possible values for sjoin's predicate argument (formerly named op) are: ['covers', 'within', 'contains', 'crosses', None, 'intersects', 'touches', 'covered_by', 'contains_properly', 'overlaps'].

Their definitions are copied below:

overlaps: Geometries overlaps if they have more than one but not all points in common, have the same dimension, and the intersection of the interiors of the geometries has the same dimension as the geometries themselves.
covered_by: An object A is said to be covered by another object B if no points of A lie in the exterior of B. This is the inverse of covers.
touches: An object is said to touch other if it has at least one point in common with other and its interior does not intersect with any part of the other. Overlapping features therefore do not touch.
intersects: An object is said to intersect other if its boundary and interior intersects in any way with those of the other.
crosses: An object is said to cross other if its interior intersects the interior of the other but does not contain it, and the dimension of the intersection is less than the dimension of the one or the other.
contains: An object is said to contain other if at least one point of other lies in the interior and no points of other lie in the exterior of the object. (Therefore, any given polygon does not contain its own boundary – there is not any point that lies in the interior.) If either object is empty, this operation returns False. This is the inverse of within() in the sense that the expression a.contains(b) == b.within(a) always evaluates to True.
within: An object is said to be within other if at least one of its points is located in the interior and no points are located in the exterior of the other. If either object is empty, this operation returns False. This is the inverse of contains() in the sense that the expression a.within(b) == b.contains(a) always evaluates to True.
covers: An object A is said to cover another object B if no points of B lie in the exterior of A. If either object is empty, this operation returns False.
contains_properly: Returns True if geometry B is completely inside geometry A, with no common boundary points. A contains B properly if B intersects the interior of A but not the boundary (or exterior). This means that a geometry A does not “contain properly” itself, which contrasts with the contains function, where common points on the boundary are allowed. I wasn't able to find more details about the contains_properly predicate.
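If the goal is to find every pair of polygons that shares some area, identical polygons included, one option is to join on intersects and then keep only the pairs whose intersection has a non-zero area. This is a sketch under a few assumptions, not something taken from the GeoPandas documentation: it assumes a GeoPandas version where sjoin accepts predicate, and a default unnamed index so that the right-hand index ends up in a column called index_right. The last two boxes and the name overlapping_pairs are made up for illustration.

```python
import geopandas as gpd
from shapely.geometry import box

gdf = gpd.GeoDataFrame(
    {"value": [0.5, 0.5, 0.3, 0.1, 0.2]},
    geometry=[
        box(1, 2, 3, 4),  # s1
        box(5, 6, 7, 8),  # s2
        box(1, 2, 3, 4),  # s3, identical to s1
        box(2, 3, 4, 5),  # hypothetical box that partially overlaps s1 and s3
        box(7, 6, 9, 8),  # hypothetical box that only shares an edge with s2
    ],
)

# candidate pairs: anything that intersects, including touching and identical shapes
pairs = gpd.sjoin(gdf, gdf, how="inner", predicate="intersects")
# drop the trivial match of each row with itself
pairs = pairs[pairs.index.to_numpy() != pairs["index_right"].to_numpy()]

# look up both geometries of each pair and measure the area they share
left = gdf.geometry.loc[pairs.index].reset_index(drop=True)
right = gdf.geometry.loc[pairs["index_right"]].reset_index(drop=True)
shared_area = left.intersection(right).area

# keep pairs that share area; pairs that only touch share none
pairs = pairs[(shared_area > 0).to_numpy()]
overlapping_pairs = sorted(
    (int(i), int(j)) for i, j in zip(pairs.index, pairs["index_right"])
)
print(overlapping_pairs)
```

Each pair appears twice, once in each direction, because the frame is joined with itself.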

Related Solutions

[GIS] Merging two datasets where polygons are nearly identical using geopandas

Consider the following example dataframes:

import geopandas
import pandas as pd
from shapely.geometry import Polygon

df1 = geopandas.GeoDataFrame(
    {'geometry': [Polygon([(0, 0), (0, 1), (1, 1), (1, 0)]),
                  Polygon([(1, 1), (2, 1), (2, 2), (1, 2)]),
                  Polygon([(2, 0), (3, 0), (3, 1), (2, 1)])],
     'attr1': [1, 2, 3]})

df2 = geopandas.GeoDataFrame(
    {'geometry': [Polygon([(0, 0), (0, 0.9), (1.1, 1), (1, 0.1)]),
                  Polygon([(2, 0.1), (3.1, 0), (3, 1.1), (2, 1)]),
                  Polygon([(2, 2), (3, 2), (3, 3), (2, 3)])],
     'attr2': [1, 2, 3]})

Both have 3 records, 2 of which mostly overlap.

If we want to merge those two datasets based on a more complex criterion (here, a kind of 'mostly overlapping'), we cannot just use pandas.merge (which matches on an attribute column) or geopandas.sjoin (which matches on a binary spatial predicate and has no notion of how much two geometries overlap). Instead, we can first calculate the index of the mostly overlapping items, then use this index to subset the original frames and concatenate them.

Let's define a function that, for a given Polygon, returns the index values of the geometries in another GeoSeries that it mostly overlaps:

def nearly_identical(geoms, p):
    # True where more than 75% of p's area is covered by a geometry in geoms
    nearly = (geoms.intersection(p).area / p.area) > 0.75
    # return index values where nearly is True
    return pd.Series(nearly.index[nearly])

matches = df1.geometry.apply(lambda x: nearly_identical(df2.geometry, x))

In the case of the example dataframes, this returns

>>> matches
     0
0  0.0
1  NaN
2  1.0

indicating that row 0 of df1 matches with row 0 of df2, row 1 of df1 has no match, and row 2 matches with row 1 of df2.
Since there can potentially be multiple matches per row, we need to reshape this a bit (and drop the NaNs, as we cannot index with them):

>>> matches2 = matches.unstack().reset_index(0, drop=True).dropna()
>>> matches2
0    0.0
2    1.0

Now, we can use this to subset df2 and concat it with df1:

# take those values of df2 that have a match with df1
df2_matched = df2.reindex(index=matches2.values)
# overwrite its index with the corresponding index in df1 (for the matching row), 
# so we can concatenate them based on this index
df2_matched.index = matches2.index

df_merged = pd.concat([df1, df2_matched], axis=1)

This gives for this example dataframe:

>>> df_merged
   attr1                             geometry  attr2                                     geometry
0      1  POLYGON ((0 0, 0 1, 1 1, 1 0, 0 0))    1.0    POLYGON ((0 0, 0 0.9, 1.1 1, 1 0.1, 0 0))
1      2  POLYGON ((1 1, 2 1, 2 2, 1 2, 1 1))    NaN                                          NaN
2      3  POLYGON ((2 0, 3 0, 3 1, 2 1, 2 0))    2.0  POLYGON ((2 0.1, 3.1 0, 3 1.1, 2 1, 2 0.1))

Of course, you could first drop the 'geometry' column of df2 so that you do not end up with two geometry columns, or select only those attributes of df2_matched that you want to add to df1.
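For comparison, here is a sketch of the same 'mostly overlapping' merge built on sjoin instead of apply. It is an alternative I am suggesting, not part of the approach above, and it assumes a GeoPandas version where sjoin accepts predicate and names the right-hand index column index_right. The names pairs, ratio, matched and merged are made up for illustration; the 0.75 threshold is the one used above.

```python
import geopandas as gpd
from shapely.geometry import Polygon

df1 = gpd.GeoDataFrame(
    {'geometry': [Polygon([(0, 0), (0, 1), (1, 1), (1, 0)]),
                  Polygon([(1, 1), (2, 1), (2, 2), (1, 2)]),
                  Polygon([(2, 0), (3, 0), (3, 1), (2, 1)])],
     'attr1': [1, 2, 3]})

df2 = gpd.GeoDataFrame(
    {'geometry': [Polygon([(0, 0), (0, 0.9), (1.1, 1), (1, 0.1)]),
                  Polygon([(2, 0.1), (3.1, 0), (3, 1.1), (2, 1)]),
                  Polygon([(2, 2), (3, 2), (3, 3), (2, 3)])],
     'attr2': [1, 2, 3]})

# candidate pairs: every df1 polygon with every df2 polygon it intersects
pairs = gpd.sjoin(df1, df2, how="inner", predicate="intersects")

# share of each df1 polygon's area that the candidate df2 polygon covers
left = df1.geometry.loc[pairs.index].reset_index(drop=True)
right = df2.geometry.loc[pairs["index_right"]].reset_index(drop=True)
ratio = left.intersection(right).area / left.area

# keep only the 'mostly overlapping' pairs and attach df2's attributes to df1
matched = pairs[(ratio > 0.75).to_numpy()]
merged = df1.join(matched[["index_right", "attr2"]])
print(merged[["attr1", "index_right", "attr2"]])
```

Rows of df1 without a match keep NaN in the df2 columns, as in the concatenated result above. If a df1 polygon had several matches, the join would repeat that row once per match.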

[GIS] Perform sjoin in geopandas leads to: AttributeError: 'GeoSeries' object has no attribute 'columns'

geopandas.sjoin expects a GeoDataFrame, not a GeoSeries. So instead of

gpd.sjoin(gdf["geom"], exp_union_gdf, how="inner", predicate='intersects')

you can do

gpd.sjoin(gdf, exp_union_gdf, how="inner", predicate='intersects')
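If all you have is a GeoSeries, you can wrap it in a GeoDataFrame before joining. A minimal sketch, with made-up data and names (points, zones, points_gdf) that are not from the question:

```python
import geopandas as gpd
from shapely.geometry import Point, box

# a bare GeoSeries cannot be passed to sjoin directly
points = gpd.GeoSeries([Point(0.5, 0.5), Point(5, 5)])
zones = gpd.GeoDataFrame({"zone": ["a"]}, geometry=[box(0, 0, 1, 1)])

# wrap the GeoSeries in a GeoDataFrame so that sjoin accepts it
points_gdf = gpd.GeoDataFrame(geometry=points)

joined = gpd.sjoin(points_gdf, zones, how="inner", predicate="intersects")
print(joined)
```

Only the first point falls inside the zone, so the inner join keeps a single row.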
