Python DBSCAN – Clustering Points with DBSCAN Considering Polygon Boundaries

Tags: partition, point, polygon, python, spatial-cluster

I have a GeoPandas dataframe called geo containing:

  • 431978 unique spatial Points representing all households in a city.
  • 112737 Polygons representing all the premises (panden) in the city.

The dataframe looks like this (note that the x, y columns come from the Point geometry column):

[screenshot of the geo dataframe]

I need to cluster all these Points within a 5 meter range using DBSCAN, but partitioned over the premises, so that clustered Points cannot cross a Polygon boundary.

So the following code produces the WRONG clustering, as depicted in the screenshot:

from sklearn.cluster import DBSCAN

epsilon = 5
coords = geo[['x', 'y']].values
db = (DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='euclidean')
      .fit(coords))

cluster_labels = db.labels_
unique_labels = set(cluster_labels)
# number of clusters (with min_samples=1 every point joins a cluster, so there is no noise label)
num_clusters = len(unique_labels)
print('number of clustered verblijfsobjecten: {}'.format(num_clusters))

[screenshot: the wrong clustering, with clusters crossing Polygon boundaries]

The right DBSCAN cluster assignment, again, would take into account the boundaries of the Polygons, which the clustered Points CANNOT cross. To depict the right clustering for one of the premises:

[screenshot: the correct clustering for one premise]

I assume running the DBSCAN algorithm on each group resulting from geo.groupby(['pand_id']) and the get_group method would be the way to go?

In SQL this is easy, but my question is how to do this in Python/sklearn. Thanks!

SQL statement:

Q = """
drop table if exists TABLE.SCHEMA;
create table service_afvalcontainers.stag_vo_cluster as
select
    pand_id,
    ST_X(geometrie) x,
    ST_Y(geometrie) y,
    ST_ClusterDBSCAN(geometrie, 5, 1) over (partition by pand_id) as cluster,
    geometrie
from
    TABLE.SCHEMA bla bla
;
"""
UPDATE: In Python I tried:

### get pand_id values to feed into the loop
unique_pand_id = geo.pand_id.unique().tolist()

coords_frame = []
for pand in unique_pand_id[:10]:
    coords = geo.loc[geo['pand_id'] == pand,
                     ('index', 'pand_id', 'verblijfsobject_id', 'x', 'y')].values
    coords_frame.append(coords)

cluster_label_frame = []
epsilon = 5
for xy in [[row[3:5] for row in group] for group in coords_frame]:
    db = (DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='euclidean')
          .fit(xy))
    cluster_label_frame.append(db.labels_)

This cluster_label_frame indeed looks OK, but the thing is, I need to attach these cluster ids back onto the original geo dataframe. I probably need to add the geo index to this label frame, with which I can hopefully join back to the geo frame. It doesn't feel like the right approach, but it's a step further…
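One way to do that join, assuming the 'index' column saved in each coords_frame group holds the original geo index: DBSCAN returns labels in the same row order as its input, so each group's saved indices can be paired with its labels and concatenated. A minimal sketch with toy stand-in data (the coords_frame arrays below are hypothetical, mimicking the [index, pand_id, verblijfsobject_id, x, y] layout built above):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

# toy stand-in for coords_frame: two premises, coordinates in metres
coords_frame = [
    np.array([[0, 1, 10, 0.0, 0.0],
              [1, 1, 11, 3.0, 0.0],
              [2, 1, 12, 50.0, 0.0]]),
    np.array([[3, 2, 13, 0.0, 0.0],
              [4, 2, 14, 2.0, 0.0]]),
]

pieces = []
for group in coords_frame:
    # cluster only the x, y columns; labels come back in input row order
    labels = DBSCAN(eps=5, min_samples=1).fit(group[:, 3:5]).labels_
    pieces.append(pd.DataFrame({'index': group[:, 0].astype(int),
                                'cluster': labels}))

label_frame = pd.concat(pieces, ignore_index=True)
# join back onto the original frame via the saved index, e.g.:
# geo = geo.merge(label_frame, on='index')
print(label_frame)
```

Note that labels restart at 0 for every group, so a (pand_id, cluster) pair, not cluster alone, identifies a cluster.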

[screenshot of cluster_label_frame]

Best Answer

Yes, a good approach would be to separate your points by pand_id and then run DBSCAN for each of those.

To filter your dataframe depending on the content of the pand_id column you can use the .loc[] feature, for example:

geo.loc[geo['pand_id'] == 123]

This will yield all the rows in your dataset that have 123 in their pand_id column.

Furthermore, if the pand_id values vary greatly, so that listing them explicitly is impractical, you can obtain the unique values in the column with:

geo['pand_id'].unique()

You can then iterate over those values, filter with .loc[], obtain the X, Y values you need, and run the DBSCAN algorithm on each subset.
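Putting those pieces together, a minimal sketch of the per-premise loop (the toy geo frame below is a hypothetical stand-in with numeric x, y in metres; the running offset, an assumption on my part, just keeps labels unique across pand_id groups):

```python
import pandas as pd
from sklearn.cluster import DBSCAN

# toy stand-in for the geo GeoDataFrame: two premises
geo = pd.DataFrame({
    'pand_id': [1, 1, 1, 2, 2],
    'x': [0.0, 3.0, 50.0, 0.0, 2.0],
    'y': [0.0, 0.0, 0.0, 0.0, 0.0],
})

geo['cluster'] = -1   # column to receive the per-premise labels
offset = 0            # makes labels unique across pand_id groups
for pand_id, group in geo.groupby('pand_id'):
    # DBSCAN sees only this premise's points, so clusters cannot
    # cross the premise boundary by construction
    labels = DBSCAN(eps=5, min_samples=1).fit(group[['x', 'y']].values).labels_
    geo.loc[group.index, 'cluster'] = labels + offset
    offset += labels.max() + 1

print(geo)
```

Because groupby preserves the group's original index, the labels land on the right rows of geo without any separate join, which sidesteps the bookkeeping in the updated question.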
