I have a GeoPandas dataframe called geo containing:
- 431978 unique spatial Points representing all households in a city.
- 112737 Polygons representing all the premises (panden) in the city.
The dataframe looks like this (note that the x, y coordinates come from the Point geometry column):
I need to cluster all these Points within a 5 meter range using DBSCAN, but partitioned over the premises, so that clustered Points cannot cross the boundaries of a Polygon.
So the following code produces the WRONG clustering as theoretically depicted in the screenshot:
from sklearn.cluster import DBSCAN

epsilon = 5
coords = geo[['x', 'y']].values
db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree',
            metric='euclidean').fit(coords)
cluster_labels = db.labels_
unique_labels = set(cluster_labels)
# number of clusters
num_clusters = len(unique_labels)
print('number of clustered verblijfsobjecten: {}'.format(num_clusters))
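To make the problem concrete, here is a minimal illustration (with made-up coordinates) of why the global run is wrong: two points only 3 meters apart but belonging to different panden still end up in the same cluster, because plain DBSCAN knows nothing about the Polygon boundaries.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# two hypothetical points in DIFFERENT panden, only 3 m apart
coords = np.array([[0.0, 0.0],   # point in pand A
                   [3.0, 0.0]])  # point in pand B

labels = DBSCAN(eps=5, min_samples=1).fit(coords).labels_
# the global run merges both points into one cluster,
# ignoring the polygon boundary between them
```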
The right DBSCAN cluster assignment, again, would take into account the boundaries of the Polygons, which the clustered Points CANNOT cross. So, to depict the right clustering for one of the premises:
I assume applying the DBSCAN algorithm to each group resulting from the geo.groupby(['pand_id']) and get_group methods would be the way to go?
In SQL this is easy, but my question is how to do this in Python/sklearn. Thanks!
SQL statement:
Q = """
drop table if exists TABLE.SCHEMA;
create table service_afvalcontainers.stag_vo_cluster as
select
    pand_id,
    ST_X(geometrie) x,
    ST_Y(geometrie) y,
    ST_ClusterDBSCAN(geometrie, 5, 1) over (partition by pand_id) as cluster,
    geometrie
from
    TABLE.SCHEMA bla bla
;
"""
UPDATED ANSWER PART: In Python I tried:
# get pand_id values to feed into the loop
unique_pand_id = geo.pand_id.unique().tolist()

coords_frame = []
for pand in unique_pand_id[:10]:
    coords = geo.loc[geo['pand_id'] == pand,
                     ['index', 'pand_id', 'verblijfsobject_id', 'x', 'y']].values
    coords_frame.append(coords)

cluster_label_frame = []
epsilon = 5
for xy in [group[:, 3:5] for group in coords_frame]:
    db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree',
                metric='euclidean').fit(xy)
    cluster_label_frame.append(db.labels_)
This cluster_label_frame indeed looks ok, but the thing is, I need to attach these cluster ids back onto the original geo dataframe. I probably need to add the geo index to this label frame, with which I can hopefully join back to the geo frame. It doesn't feel like the right approach, but it's a step further…
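One way to do that join, sketched below with tiny made-up stand-ins for the coords_frame and cluster_label_frame built above (the real ones carry (index, pand_id, verblijfsobject_id, x, y) per row, so column 0 is the saved geo index): stack the saved index column next to the concatenated labels and merge back on it.

```python
import numpy as np
import pandas as pd

# hypothetical stand-ins for coords_frame / cluster_label_frame:
# each coords group holds (index, pand_id, verblijfsobject_id, x, y) per row
coords_frame = [
    np.array([[0, 'a', 'v0', 0.0, 0.0],
              [1, 'a', 'v1', 1.0, 0.0]], dtype=object),
    np.array([[2, 'b', 'v2', 5.0, 5.0]], dtype=object),
]
cluster_label_frame = [np.array([0, 0]), np.array([0])]

# stack the saved index column next to the per-group labels...
label_df = pd.DataFrame({
    'index': np.concatenate([g[:, 0] for g in coords_frame]).astype(int),
    'cluster': np.concatenate(cluster_label_frame),
})

# ...and join the labels back onto the original frame on that index column
geo = pd.DataFrame({'index': [0, 1, 2], 'pand_id': ['a', 'a', 'b']})
geo = geo.merge(label_df, on='index', how='left')
```

Note that the labels restart at 0 within each group, so a globally unique cluster id needs the (pand_id, cluster) pair.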
Best Answer
Yes, a good approach would be to separate your points by pand_id and then run DBSCAN for each of those.
To filter your dataframe depending on the content of the pand_id column you can use .loc[], for example geo.loc[geo['pand_id'] == 123]. This will yield all the elements in your dataset that have 123 in their pand_id column.
Furthermore, if the pand_id values vary greatly, and thus knowing them explicitly might be a problem, you can obtain the unique values in the column with geo['pand_id'].unique(). You can then iterate over those values, filter with .loc[], obtain the X, Y values or whatever else you need, and proceed with the DBSCAN algorithm.
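Putting those steps together, here is a minimal sketch (with a tiny made-up geo, since the real dataframe isn't shown) of running DBSCAN per pand_id via groupby and writing the labels straight back, which mirrors the SQL's PARTITION BY pand_id behavior:

```python
import pandas as pd
from sklearn.cluster import DBSCAN

# tiny made-up stand-in for geo: two panden, three points in pand 1
geo = pd.DataFrame({
    'pand_id': [1, 1, 1, 2, 2],
    'x': [0.0, 3.0, 20.0, 0.0, 1.0],
    'y': [0.0, 0.0, 0.0, 0.0, 0.0],
})

pieces = []
for pand, g in geo.groupby('pand_id'):
    # one DBSCAN run per pand; labels restart at 0 within each group
    labels = DBSCAN(eps=5, min_samples=1).fit(g[['x', 'y']].values).labels_
    pieces.append(pd.Series(labels, index=g.index))

# concat keeps each point's original index, so assignment aligns correctly
geo['cluster'] = pd.concat(pieces)
```

As with ST_ClusterDBSCAN over a partition, cluster ids are only unique within a pand_id; combine them as (pand_id, cluster) if you need globally unique cluster keys.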