[GIS] Removing duplicate points in GeoPandas using distance and time

Tags: duplication, geopandas, point, python, shapely

I have a list of coordinates for areas that are plotted on a map:

user_id id  latitude    longitude   requested_at
84  106 13.0472367  77.5022635  12-10-2020 14:59
84  107 13.0472789  77.5024498  12-10-2020 15:00
84  109 13.0478857  77.50110649 12-10-2020 15:01
84  113 13.0472431  77.5022722  12-10-2020 15:02
84  269 13.0473241  77.5022914  12-10-2020 15:58
84  272 13.0472491  77.5022977  12-10-2020 16:02
84  298 13.0472387  77.5021328  12-10-2020 16:13
84  300 13.047189   77.5022587  12-10-2020 16:14
84  361 13.0473381  77.5023609  12-10-2020 17:13
84  1755    13.0473334  77.5023081  14-10-2020 15:53
84  1844    13.0472747  77.502287   14-10-2020 18:01
3883    5374    12.8655367  77.7969649  18-10-2020 11:54
3883    7005    12.865576   77.7971244  19-10-2020 20:04
3883    7094    12.8654815  77.7972272  19-10-2020 21:32
3883    7703    12.8654448  77.7971621  20-10-2020 14:19
3883    8256    12.8654733  77.7970371  21-10-2020 07:59
3883    10022   12.8654014  77.7971733  22-10-2020 18:33
3883    10514   12.8654521  77.7970823  23-10-2020 08:23
3883    24186   12.8655376  77.796956   03-11-2020 10:16
3883    25685   12.8654658  77.7970327  04-11-2020 16:11
3883    29091   14.4539237  75.9065827  07-11-2020 16:33
41802   11757   12.8399959  77.6432516  13-10-2020 21:25
41802   11809   12.8399985  77.6432539  14-10-2020 20:38
41802   11900   12.839994   77.6432571  14-10-2020 20:39
41802   12215   12.8400107  77.6432862  16-10-2020 11:28
41802   27308   12.8419143  77.6434106  19-10-2020 07:49
41802   27309   12.8400799  77.6431911  19-10-2020 07:50
41802   6935    13.21153259 77.6601718  23-10-2020 11:03
41802   6939    13.21157837 77.66084737 23-10-2020 11:04
41802   6941    13.1345632  77.5726076  23-10-2020 18:36
41802   11727   13.134561   77.5726105  23-10-2020 18:44
41802   11736   13.1345605  77.5726143  23-10-2020 18:47
41802   27414   12.8399924  77.6432909  29-10-2020 10:38
41802   27434   12.8399968  77.643295   29-10-2020 10:39
41802   27443   12.8399749  77.6433084  29-10-2020 14:53
41802   27449   12.8399812  77.6432899  29-10-2020 15:16
41802   27461   12.83844757 77.69411675 06-11-2020 07:56
41802   27468   12.83850098 77.69418132 06-11-2020 08:02
41802   27451   12.8400088  77.6432962  07-11-2020 10:43

Using these as points, I am trying to remove duplicate latitude/longitude pairs that lie within 500 meters of each other, based on the user_id and the hour of the request.

# the {'init': 'epsg:4326'} form is deprecated; pass the CRS as an authority string instead
data_gpd = gpd.GeoDataFrame(Raw_trip_1, geometry=gpd.points_from_xy(Raw_trip_1.longitude, Raw_trip_1.latitude), crs="EPSG:4326")

How do I remove the duplicates and keep the last value? In pandas I would follow these steps:

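import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

# keep only the last row for each (user_id, requested_at) pair
Raw_trip_2 = Raw_trip_1.drop_duplicates(subset=['user_id', 'requested_at'], keep='last')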

coords = Raw_trip_2[["latitude", "longitude"]].values



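kms_per_radian = 6371.0088  # mean Earth radius in km
epsilon = 0.5 / kms_per_radian  # 500 m in radians, the unit the haversine metric uses
db = DBSCAN(eps=epsilon, min_samples=4, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))
cluster_labels = db.labels_
indices = db.core_sample_indices_
# label -1 marks noise points, so it is not counted as a cluster
num_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
clusters = pd.Series([coords[cluster_labels == n] for n in range(num_clusters)])
print('Number of clusters: {}'.format(num_clusters))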
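
A note on units for the clustering attempt above: scikit-learn's haversine metric works on [latitude, longitude] pairs in radians, so eps has to be an angle in radians as well, not a distance. The following is a minimal sketch using only the standard library. The helper names meters_to_radians and haversine_m are made up for illustration, and the mean Earth radius of 6371.0088 km is assumed.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumed value)


def meters_to_radians(meters):
    """Convert a ground distance to the angle it subtends at the Earth's center."""
    return (meters / 1000.0) / EARTH_RADIUS_KM


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * 1000.0 * math.asin(math.sqrt(a))


# eps value that corresponds to a 500 m neighborhood
eps_500m = meters_to_radians(500)

# two consecutive fixes of user 84, and two far-apart fixes of user 3883 (from the table)
near = haversine_m(13.0472367, 77.5022635, 13.0472789, 77.5024498)
far = haversine_m(12.8654658, 77.7970327, 14.4539237, 75.9065827)

print(f"eps for 500 m: {eps_500m:.3e} rad")
print(f"near pair: {near:.1f} m, far pair: {far / 1000:.1f} km")
```

The near pair falls well inside the 500 m radius, while the far pair does not, which is the distinction the eps value has to capture.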

The ideal result would be that I get one geo latitude and longitude per hour per user.

Best Answer

If you just want a solution for your ideal result,

get one geo latitude and longitude per hour per user

then there is no need to remove duplicates or work with point clusters.

I assume that you have the data as a CSV file:

import pandas as pd
from datetime import datetime

test_df = pd.read_csv("test_df.csv")

# 1. convert datetime string in requested_at column to datetime object
#    (the format matches the sample in the question, e.g. "12-10-2020 14:59")
test_df["req_at_dt"] = test_df["requested_at"].apply(
    lambda dt_string: datetime.strptime(dt_string, '%d-%m-%Y %H:%M')
)

# 2. convert datetime object to datetime string by hour
test_df["req_at_hr"] = test_df["req_at_dt"].apply(
    lambda dt: datetime.strftime(dt, '%d-%m-%Y %H')
)

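# 3. group the dataframe by user and hour and select the last coordinates
#    (sort by time first so that "last" means the latest request in each hour;
#    selecting several columns after groupby needs a list, hence the double brackets)
result = (
    test_df.sort_values("req_at_dt")
    .groupby(["user_id", "req_at_hr"])[["latitude", "longitude"]]
    .last()
)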
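
For reference, the same idea can be sketched as a self-contained script that needs no CSV file. This is only a sketch under stated assumptions: the sample rows are copied from the question, the column name req_hour is made up here, and dt.floor("h") is swapped in for the strftime round trip so that the hour stays a real timestamp instead of a string.

```python
import io

import pandas as pd

# a few rows from the question's table, written as CSV for the example
SAMPLE = """user_id,id,latitude,longitude,requested_at
84,106,13.0472367,77.5022635,12-10-2020 14:59
84,107,13.0472789,77.5024498,12-10-2020 15:00
84,109,13.0478857,77.50110649,12-10-2020 15:01
84,113,13.0472431,77.5022722,12-10-2020 15:02
84,269,13.0473241,77.5022914,12-10-2020 15:58
84,272,13.0472491,77.5022977,12-10-2020 16:02
3883,5374,12.8655367,77.7969649,18-10-2020 11:54
3883,7005,12.865576,77.7971244,19-10-2020 20:04
"""

df = pd.read_csv(io.StringIO(SAMPLE))

# parse day-first timestamps without seconds, as they appear in the question
df["requested_at"] = pd.to_datetime(df["requested_at"], format="%d-%m-%Y %H:%M")

# truncate each timestamp to the start of its hour (hypothetical column name)
df["req_hour"] = df["requested_at"].dt.floor("h")

# sort by time so that last() returns the latest request within each hour
last_per_hour = (
    df.sort_values("requested_at")
    .groupby(["user_id", "req_hour"], as_index=False)
    .last()
)

print(last_per_hour[["user_id", "req_hour", "id", "latitude", "longitude"]])
```

With as_index=False the grouping keys stay as ordinary columns, which makes it easier to pass the result on to gpd.points_from_xy if a GeoDataFrame is needed afterwards.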