[GIS] remove duplicate points based on a specified distance

postgispostgis-2.0

i have a table of points with each point acquired on a specified date, by a specified source, and having a measured value. i want to consider points as duplicates if they have the same date and if they are within a certain distance of each other (say <10 m), even if they came from different sources. here's a brief example using 8 points:

create table points (gid serial primary key, date date, source varchar, value integer);

select addGeometryColumn('points', 'geom', 32633, 'point', 2);

insert into points (date, source, value, geom) values
  ('1985-06-19', 'B', 22, st_geomFromText('POINT(593835 3910494)', 32633)),
  ('1985-06-19', 'A', 48, st_geomFromText('POINT(602920 3910591)', 32633)),
  ('1987-04-17', 'B', 12, st_geomFromText('POINT(592317 3910479)', 32633)),
  ('1993-07-22', 'A', 16, st_geomFromText('POINT(592323 3910476)', 32633)),
  ('1993-07-22', 'B', 30, st_geomFromText('POINT(592317 3910479)', 32633)),
  ('2000-07-09', 'A', 20, st_geomFromText('POINT(592108 3909121)', 32633)),
  ('2006-07-14', 'A', 22, st_geomFromText('POINT(592933 3909842)', 32633)),
  ('2006-07-14', 'B', 18, st_geomFromText('POINT(592931 3909843)', 32633));

i'm trying to create a query that should result in 6 points, since points 4/5 and 7/8 would be considered duplicates because they have the same date and <10 m from each other. i also want the query to select the duplicate point that has the maximum value, so in my example the result set would be ('geom' field not shown):

 gid |    date    | source | value 
-----+------------+--------+-------
   1 | 1985-06-19 | B      |    22
   2 | 1985-06-19 | A      |    48
   3 | 1987-04-17 | B      |    12
   5 | 1993-07-22 | B      |    30
   6 | 2000-07-09 | A      |    20
   7 | 2006-07-14 | A      |    22
(6 rows)

been trying various queries using self joining and the st_dwithin and st_distance functions, but can't seem to nail it down. i'm using postgis 2.0 and postgresql 9.1. thanks!

update: based on the link whuber suggested, i explored using the st_snapToGrid function and this query seems to produce the result i'm looking for:

select distinct on (date, grid_location) grouped.*
from
(select *, st_snapToGrid(geom, 10, 10) as grid_location
from points order by date, grid_location, value desc) as grouped;

update 2: the above query eliminated most of the duplicates in my actual dataset, but i found several still making it through since sometimes points would be snapped to different grids if they happened to be bisected by a grid boundary. i've posted another solution as an answer to my question.

Best Answer

this seems to work, although not a compact, single query solution. it also only works if the duplicates are a result of 2 sources (it wouldn't select a single value if duplicate values came from sources 'A', 'B', and 'C'):

--table of just duplicates
create table dups as
select t1.*, st_centroid(st_union(t1.geom, t2.geom)) as cent_geom
from points as t1 join points as t2 on st_dwithin(t1.geom, t2.geom, 10)
where t1.date = t2.date and t1.gid != t2.gid
order by date, value desc;

--table of no duplicates
create table no_dups as select points.*
from points left join dups on points.gid = dups.gid
where dups.gid is null;

--insert distinct duplicates
insert into no_dups
select distinct on (date, cent_geom) gid, date, source, value, geom
from dups;

select * from no_dups
order by gid;
Related Question