GeoPandas – Using sjoin for Spatial Join

geodataframe geopandas point-in-polygon python spatial-join

I have a GeoDataFrame with 12,431 observations of geographical units, called "cities". I also have another layer with points, called "points". Both use the CRS EPSG:4326.

I want to create a variable in the GeoDataFrame that indicates whether each unit contains a point from the other file. My code does this correctly.

However, len(df3) shows that the number of observations has increased to 12,599. How is that possible? I have checked every column for missing values but found none.

Why are there 168 additional observations?

I need to maintain the same number of observations as in "cities".

This is the code I am using:

import numpy as np
import pandas as pd
import geopandas as gpd

df3 = gpd.sjoin(cities, points[['geometry']], how="left") 
df3['points'] = np.where(np.isnan(df3['index_right']), 0, 1)
 
del(df3['index_right'])

Best Answer

With how="left", sjoin returns one row for every polygon/point match, so a polygon is repeated once for each point it contains. A polygon with two points in it therefore becomes two rows in the output:

import geopandas as gpd

city = gpd.read_file(r"/home/bera/Desktop/GIStest/cities.shp")

#city.shape
#Out[2]: (3, 3)

# city.Name
# Out[9]: 
# 0    Manchester
# 1    Birmingham
# 2     Sheffield

pnt = gpd.read_file(r"/home/bera/Desktop/GIStest/pointsss.shp")
#pnt.shape
#Out[3]: (5, 2)

df3 = gpd.sjoin(city, pnt[['geometry']], how="left") 
#df3.shape
#Out[5]: (4, 4)


# df3.Name
# Out[10]: 
# 0    Manchester
# 0    Manchester
# 1    Birmingham
# 2     Sheffield

# Examples of what you can do to process the duplicates:

# df3.drop_duplicates(subset="Name", keep="first")  # keep one row per city
# df3.groupby("Name").count()  # count non-null values per city
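To keep exactly one row per city and still get the 0/1 flag from the question, one option is to collapse the join result on the original city index rather than on a column such as "Name", which may not be unique in a real dataset. The following is only a minimal sketch: the toy polygons, points, and the names `joined`, `has_point`, and `result` are made up for illustration, and it assumes the right-hand index is unnamed so that sjoin produces a column called `index_right`.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Hypothetical toy data standing in for "cities" and "points".
cities = gpd.GeoDataFrame(
    {"Name": ["Manchester", "Birmingham", "Sheffield"]},
    geometry=[
        Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]),
        Polygon([(3, 0), (5, 0), (5, 2), (3, 2)]),
        Polygon([(6, 0), (8, 0), (8, 2), (6, 2)]),
    ],
    crs="EPSG:4326",
)
points = gpd.GeoDataFrame(
    geometry=[Point(0.5, 0.5), Point(1.5, 1.5), Point(4, 1)],
    crs="EPSG:4326",
)

# Left join: one row per (city, point) match, so the first city,
# which contains two points, appears twice.
joined = gpd.sjoin(cities, points[["geometry"]], how="left")

# The index of `joined` repeats the original city index, so grouping on it
# gives one value per city: True if any matched point exists.
has_point = joined["index_right"].notna().groupby(level=0).any()

# Attach the flag to a copy of the original frame; row count is unchanged.
result = cities.copy()
result["points"] = has_point.astype(int)

print(len(cities), len(joined), len(result))
print(result[["Name", "points"]])
```

Because the flag is written back onto a copy of `cities`, the number of observations stays the same as in the original layer, however many points fall inside each polygon.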
