GeoPandas – Using sjoin for Spatial Join

Tags: geodataframe, geopandas, point-in-polygon, python, spatial-join

I have a GeoDataFrame, called "cities", with 12,431 observations of geographical units. I also have another layer file with points, called "points". Both of them use the CRS EPSG:4326.

I want to create a variable in the GeoDataFrame that indicates whether each unit contains a point from the other file. My code does this correctly.

However, when I run len(df3), the number of observations has increased to 12,599. How is that possible? I have checked every column for missing values but did not find any.

Where do the 168 extra observations come from?

I need to maintain the same number of observations as in "cities".

This is the code I am using:

import numpy as np
import pandas as pd
import geopandas as gpd

df3 = gpd.sjoin(cities, points[['geometry']], how="left") 
df3['points'] = np.where(np.isnan(df3['index_right']), 0, 1)
 
del(df3['index_right'])

Best Answer

A left join returns one row for every polygon/point match, so each polygon is repeated once per point it contains. A polygon with 2 points in it therefore becomes two rows in the output:

import geopandas as gpd

city = gpd.read_file(r"/home/bera/Desktop/GIStest/cities.shp")

#city.shape
#Out[2]: (3, 3)

# city.Name
# Out[9]: 
# 0    Manchester
# 1    Birmingham
# 2     Sheffield

pnt = gpd.read_file(r"/home/bera/Desktop/GIStest/pointsss.shp")
#pnt.shape
#Out[3]: (5, 2)

df3 = gpd.sjoin(city, pnt[['geometry']], how="left") 
#df3.shape
#Out[5]: (4, 4)


# df3.Name
# Out[10]: 
# 0    Manchester
# 0    Manchester
# 1    Birmingham
# 2     Sheffield

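# Examples of what you can do to process the duplicated rows:

# df3.drop_duplicates(subset="Name", keep="first")  # Keep only the first row per city
# df3.groupby("Name").count()  # Count the non-null values per city (index_right gives the number of points)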
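If the goal is to keep exactly one row per city, as in the question, one option is to skip the left join and flag the cities whose index shows up in an inner join. The following is only a minimal sketch under assumptions: the square "cities" and the five points are made-up toy data standing in for the shapefiles above, and the column name n_points is hypothetical.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Toy data (hypothetical): three square "cities" and five points,
# two of which fall inside the first city.
cities = gpd.GeoDataFrame(
    {"Name": ["Manchester", "Birmingham", "Sheffield"]},
    geometry=[
        Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]),
        Polygon([(3, 0), (5, 0), (5, 2), (3, 2)]),
        Polygon([(6, 0), (8, 0), (8, 2), (6, 2)]),
    ],
    crs="EPSG:4326",
)
points = gpd.GeoDataFrame(
    geometry=[Point(0.5, 0.5), Point(1.5, 1.5),
              Point(10, 10), Point(11, 11), Point(12, 12)],
    crs="EPSG:4326",
)

# An inner join has one row per (city, point) match and keeps the city's index.
matches = gpd.sjoin(cities, points[["geometry"]], how="inner")

result = cities.copy()
# 1 if the city's index appears among the matches, else 0.
result["points"] = result.index.isin(matches.index).astype(int)
# Number of matched points per city; cities without matches get 0.
result["n_points"] = matches.groupby(level=0).size().reindex(result.index, fill_value=0)

print(len(cities), len(result))
print(result[["Name", "points", "n_points"]])
```

Because the flag is written onto a copy of cities rather than onto the join result, the number of rows cannot change, however many points fall inside a single polygon.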

Related Solutions

[GIS] Performing sjoin on polygons and lines without intersection using GeoPandas

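At the time of writing, the geopandas.sjoin function only supports the 'intersects', 'within' and 'contains' predicates, and not a "nearest" one. (GeoPandas 0.10 and later provide a separate geopandas.sjoin_nearest function for this.)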

You can write a custom function to find the id of the nearest linestring for each polygon, and then merge on that. This could look like:

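import pandas as pd

def nearest_linestring(polygon, df_lines):
    # Distance from this polygon to every linestring; keep the id of the closest one
    idx = df_lines.geometry.distance(polygon).idxmin()
    return df_lines.loc[idx, 'id']

# Store the id of the nearest linestring on each polygon row
df_polygon['id_nearest_line'] = df_polygon.geometry.apply(nearest_linestring, df_lines=df_lines)

# Join the line attributes onto the polygons via that id
pd.merge(df_polygon, df_lines, left_on='id_nearest_line', right_on='id', how='inner')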

However, an important remark about this approach: it only finds a single nearest linestring, so if several linestrings intersect a given polygon (all at distance zero), it will not return them all. It should be possible to update the function for that though.
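As a rough sketch of such an update, the function could return the ids of every linestring that shares the minimum distance instead of a single one. The toy geometries, the poly_id column and the name nearest_linestrings are made up for illustration; only the 'id' column follows the code above.

```python
import geopandas as gpd
from shapely.geometry import LineString, Polygon

# Toy data (hypothetical): two lines cross the polygon, one is far away.
df_lines = gpd.GeoDataFrame(
    {"id": [10, 20, 30]},
    geometry=[
        LineString([(0, 0), (4, 4)]),      # crosses the polygon
        LineString([(0, 4), (4, 0)]),      # crosses the polygon too
        LineString([(10, 10), (12, 12)]),  # far away
    ],
)
df_polygon = gpd.GeoDataFrame(
    {"poly_id": ["a"]},
    geometry=[Polygon([(1, 1), (3, 1), (3, 3), (1, 3)])],
)

def nearest_linestrings(polygon, df_lines):
    """Return the ids of all linestrings at the minimum distance to polygon."""
    distances = df_lines.geometry.distance(polygon)
    return df_lines.loc[distances == distances.min(), "id"].tolist()

# One list of ids per polygon row.
df_polygon["ids_nearest_lines"] = [
    nearest_linestrings(poly, df_lines) for poly in df_polygon.geometry
]
print(df_polygon[["poly_id", "ids_nearest_lines"]])
```

Each polygon row then carries a list of ids, so a later merge would first need one row per (polygon, id) pair.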
Second remark: if you have a lot of data, calculating the distance to every linestring, as the function above does, might not be very efficient. You could use a spatial index to improve this, but I would only worry about that if the speed actually turns out to be a problem.
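If speed does become a problem, and assuming GeoPandas 0.10 or later is available (with Shapely 2.0 or PyGEOS), geopandas.sjoin_nearest performs the nearest lookup through the spatial index and can replace the custom function. A minimal sketch on made-up data, where the column names poly_id, line_id and distance are illustrative choices:

```python
import geopandas as gpd
from shapely.geometry import LineString, Polygon

# Toy data (hypothetical), in planar coordinates so distances are meaningful.
df_polygon = gpd.GeoDataFrame(
    {"poly_id": ["a", "b"]},
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(8, 0), (9, 0), (9, 1), (8, 1)]),
    ],
)
df_lines = gpd.GeoDataFrame(
    {"line_id": [1, 2]},
    geometry=[
        LineString([(2, 0), (2, 1)]),
        LineString([(10, 0), (10, 1)]),
    ],
)

# Attach to each polygon the attributes of its nearest line,
# plus the distance to it in the column named by distance_col.
nearest = gpd.sjoin_nearest(df_polygon, df_lines, how="left", distance_col="distance")
print(nearest[["poly_id", "line_id", "distance"]])
```

With real data in EPSG:4326, the layers would normally be reprojected to a projected CRS first, since distances computed in degrees are not meaningful.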

GeoPandas – Speeding Up GeoPandas Spatial Join Operations

You can get a significant speedup if you have GeoPandas 0.8 and PyGEOS installed (geopandas.org/install.html#using-the-optional-pygeos-dependency). PyGEOS uses vectorized NumPy ufuncs and can be orders of magnitude faster than standard Shapely.

Note that PyGEOS will be part of Shapely 2.0, so once that is released, installing PyGEOS separately will not be needed.
