[GIS] Merging two datasets where polygons are nearly identical using geopandas

geopandaspandasshapely

I have two polygon-shapefiles representing similar data. However, the features in these layers are not always 100 % identical. So, a feature in shapefile 1 may be present in shapefile 2, but the form of the polygon is not exactly the same (vertices missing or different). Also, shapefile 1 contains features which are not present in shapefile 2. The opposite is also true.

I'm writing a Python script to merge these datasets. I would like to obtain a layer with all features from shapefile 1, and the attributes of the features of shapefile 2 for those features that are nearly identical to the features of shapefile 1. To define 'Nearly identical' I use the criterium of 75% of the area is intersecting.

How can I efficiently compare the features of shapefile 1 to the features of shapefile 2?

This code works to transfer the attributes of an 'nearly identical' feature of shp2 to shp1. However, it is extremly slow on large datasets, because of the itertuples.

df1 = gpd.read_file('shapefile1.shp')
df2 = gpd.read_file('shapefile2.shp')

df1['attribute1_of_shp2'] = 0 
df1['attribute2_of_shp2'] = 0 
for feature in df2.itertuples(): 
    geom = getattr(feature, 'geometry') 
    attr1 = getattr(feature, 'attribute1') 
    attr2 = getattr(feature, 'attribute2') 
    intersection = df1['geometry'].intersection(geom)
    df1['RelIntersection'] = intersection.area/df1.area
    df1.loc[df['RelIntersection'] > .75, 'attribute1_of_shp2'] = attr1
    df1.loc[df['RelIntersection'] > .75, 'attribute2_of_shp2'] = attr2

Is there a way to speed-up this?

I've been looking to geopandas overlay but did not find a working solution yet.

Best Answer

Consider the following example dataframes:

import geopandas
from shapely.geometry import Polygon

df1 = geopandas.GeoDataFrame(
    {'geometry': [Polygon([(0, 0), (0, 1), (1, 1), (1, 0)]),
                  Polygon([(1, 1), (2, 1), (2, 2), (1, 2)]),
                  Polygon([(2, 0), (3, 0), (3, 1), (2, 1)])],
     'attr1': [1, 2, 3]})

df2 = geopandas.GeoDataFrame(
    {'geometry': [Polygon([(0, 0), (0, 0.9), (1.1, 1), (1, 0.1)]),
                  Polygon([(2, 0.1), (3.1, 0), (3, 1.1), (2, 1)]),
                  Polygon([(2, 2), (3, 2), (3, 3), (2, 3)])],
     'attr2': [1, 2, 3]})

They both have 3 records, of which 2 overlap mostly:

If we want to merge those two datasets based on more complex criterion (in this case a kind of 'mostly overlapping'), we cannot just use pandas.merge (on an attribute column) or geopandas.sjoin (geometries overlap exactly). But we could take an approach where we first calculate the index of the mostly overlapping items, and then with this index, subset our original frames and concatenate them.

Let's define this function that for a given Polygon, returns where it overlaps with the geometries in another GeoSeries:

def nearly_identical(geoms, p):
    nearly = (geoms.intersection(p).area / p.area) > 0.75
    # return index values where nearly is True
    return pd.Series(nearly.index[nearly])

matches = df1.geometry.apply(lambda x: nearly_identical(df2, x))

In the case of the example dataframes, this returns

>>> matches
     0
0  0.0
1  NaN
2  1.0

indicating that row 0 of df1 matches with row 0 of df2, row 1 of df1 has no match, and row 2 matches with row 1 of df2.
Since there can potentially be multiple matches per row (is that correct), we need to convert this a bit (and drop the NaNs, as we cannot index with that):

>>> matches2 = matches.unstack().reset_index(0, drop=True).dropna()
>>> matches2
0    0.0
2    1.0

Now, we can use this to subset df2 and concat it with df1:

# take those values of df2 that have a match with df1
df2_matched = df2.reindex(index=matches2.values)
# overwrite its index with the corresponding index in df1 (for the matching row), 
# so we can concatenate them based on this index
df2_matched.index = matches2.index

df_merged = pd.concat([df1, df2_matched], axis=1)

This gives for this example dataframe:

>>> df_merged
   attr1                             geometry  attr2                                     geometry
0      1  POLYGON ((0 0, 0 1, 1 1, 1 0, 0 0))    1.0    POLYGON ((0 0, 0 0.9, 1.1 1, 1 0.1, 0 0))
1      2  POLYGON ((1 1, 2 1, 2 2, 1 2, 1 1))    NaN                                          NaN
2      3  POLYGON ((2 0, 3 0, 3 1, 2 1, 2 0))    2.0  POLYGON ((2 0.1, 3.1 0, 3 1.1, 2 1, 2 0.1))

Of course, you could then first drop the 'geometry' columns of df2 to not end up with two columns, or only select those attributes of df2_matched that you want to add to df1.

Related Solutions

GeoPandas – Merging Overlapping Features

The GeoDataFrame

import geopandas as gpd
g1 = gpd.GeoDataFrame.from_file("poly_intersect.shp")
g1.shape
(4, 3)

1) You can use the itertools module

a) If you want to merge the intersections of the overlapping polygons

import itertools
geoms = g1['geometry'].tolist()
intersection_iter = gpd.GeoDataFrame(gpd.GeoSeries([poly[0].intersection(poly[1]) for poly in  itertools.combinations(geoms, 2) if poly[0].intersects(poly[1])]), columns=['geometry'])
intersection_iter.to_file("intersection_iter.shp")

Union

union_iter = intersection_iter.unary_union

b) If you want to merge the intersected polygons change intersection by union (all the polygons overlap in my example)

2) You can use GeoPandas Overlay

auto_inter = gpd.overlay(g1, g1, how='intersection')
auto_inter.shape
(7,4)

The resulting GeoDataframe

GeoPandas add the intersection geometries to the existing geometries, therefore

intersection = auto_inter[4:7]
intersection.to_file("intersection.shp")

Union

union = intersection.unary_union

b) use gpd.overlay(g1, g1, how='union')

GeoPandas – Intersecting Two Shape Problem Using GeoPandas

1) The problem is that the intersection of two polygons is not always a polygon

import geopandas as gp
poly1 = gp.read_file("poly_origin.shp")
poly2 = gp.read_file("poly_test.shp")
data = []
for index, orig in poly1.iterrows():
    for index2, ref in poly2.iterrows():      
        if ref['geometry'].intersects(orig['geometry']): 
         owdspd=orig['id']
         data.append({'geometry':ref['geometry'].intersection(orig['geometry']),'wdspd':owdspd})

for geom in data: 
   print geom
{'geometry': <shapely.geometry.polygon.Polygon object at 0x10aab9c10>, 'wdspd': 1}
{'geometry': <shapely.geometry.polygon.Polygon object at 0x10aab9e50>, 'wdspd': 2}
{'geometry': <shapely.geometry.linestring.LineString object at 0x10aacc250>, 'wdspd': 3}

The result is 2 Polygons and a LineString therefore the error

"ValueError: Geometry column cannot contain mutiple geometry types when writing to file"

2) A solution is to convert all the geometries to the same type

from shapely.geometry import LineString
for i in data:
 geom = i['geometry']
 if geom.geom_type=='Polygon':
      i['geometry']=  LineString(list(geom.exterior.coords))

Now, you can convert to a GeoDataFrame

df = gp.GeoDataFrame(data,columns=['geometry','wdspd'])
df                          geometry                   wdspd
0  LINESTRING (96.42123294751703 -44.632201588700...      1
1  LINESTRING (282.773470482277 -127.548992317643...      2
2  LINESTRING (144.3726591760299 -221.86142322097...      3

3) Or to use GeoPandas Overlay

inter = gp.overlay(poly1, poly2, how='intersection')
inter
     id  id_2                geometry
 0   2     2  POLYGON ((282.773470482277 -127.5489923176438,...
 1   1     1  POLYGON ((51.0880537778328 -116.6195085418168,...

Export the result

df.to_file("linestring_intersection.shp")

Best Answer

Related Solutions

GeoPandas – Merging Overlapping Features

GeoPandas – Intersecting Two Shape Problem Using GeoPandas

Export the result

Related Question