[GIS] remove duplicate points based on a specified distance

postgispostgis-2.0

i have a table of points with each point acquired on a specified date, by a specified source, and having a measured value. i want to consider points as duplicates if they have the same date and if they are within a certain distance of each other (say <10 m), even if they came from different sources. here's a brief example using 8 points:

create table points (gid serial primary key, date date, source varchar, value integer);

select addGeometryColumn('points', 'geom', 32633, 'point', 2);

insert into points (date, source, value, geom) values
  ('1985-06-19', 'B', 22, st_geomFromText('POINT(593835 3910494)', 32633)),
  ('1985-06-19', 'A', 48, st_geomFromText('POINT(602920 3910591)', 32633)),
  ('1987-04-17', 'B', 12, st_geomFromText('POINT(592317 3910479)', 32633)),
  ('1993-07-22', 'A', 16, st_geomFromText('POINT(592323 3910476)', 32633)),
  ('1993-07-22', 'B', 30, st_geomFromText('POINT(592317 3910479)', 32633)),
  ('2000-07-09', 'A', 20, st_geomFromText('POINT(592108 3909121)', 32633)),
  ('2006-07-14', 'A', 22, st_geomFromText('POINT(592933 3909842)', 32633)),
  ('2006-07-14', 'B', 18, st_geomFromText('POINT(592931 3909843)', 32633));

i'm trying to create a query that should result in 6 points, since points 4/5 and 7/8 would be considered duplicates because they have the same date and <10 m from each other. i also want the query to select the duplicate point that has the maximum value, so in my example the result set would be ('geom' field not shown):

 gid |    date    | source | value 
-----+------------+--------+-------
   1 | 1985-06-19 | B      |    22
   2 | 1985-06-19 | A      |    48
   3 | 1987-04-17 | B      |    12
   5 | 1993-07-22 | B      |    30
   6 | 2000-07-09 | A      |    20
   7 | 2006-07-14 | A      |    22
(6 rows)

been trying various queries using self joining and the st_dwithin and st_distance functions, but can't seem to nail it down. i'm using postgis 2.0 and postgresql 9.1. thanks!

update: based on the link whuber suggested, i explored using the st_snapToGrid function and this query seems to produce the result i'm looking for:

select distinct on (date, grid_location) grouped.*
from
(select *, st_snapToGrid(geom, 10, 10) as grid_location
from points order by date, grid_location, value desc) as grouped;

update 2: the above query eliminated most of the duplicates in my actual dataset, but i found several still making it through since sometimes points would be snapped to different grids if they happened to be bisected by a grid boundary. i've posted another solution as an answer to my question.

Best Answer

this seems to work, although not a compact, single query solution. it also only works if the duplicates are a result of 2 sources (it wouldn't select a single value if duplicate values came from sources 'A', 'B', and 'C'):

--table of just duplicates
create table dups as
select t1.*, st_centroid(st_union(t1.geom, t2.geom)) as cent_geom
from points as t1 join points as t2 on st_dwithin(t1.geom, t2.geom, 10)
where t1.date = t2.date and t1.gid != t2.gid
order by date, value desc;

--table of no duplicates
create table no_dups as select points.*
from points left join dups on points.gid = dups.gid
where dups.gid is null;

--insert distinct duplicates
insert into no_dups
select distinct on (date, cent_geom) gid, date, source, value, geom
from dups;

select * from no_dups
order by gid;

Related Solutions

[GIS] How to debug ‘Layer is not valid error’ from Python plugin

It seems your "key" field must have type integer and not text. I reproduced your problem with the following snippet:

uri = QgsDataSourceURI()
uri.setConnection("localhost", "5432", "mydb", "user", "pass")
query="SELECT  id, CAST (id as text)|| '_' || (ST_DumpPoints(geom)).path[2] as key, (ST_DumpPoints(geom)).geom as points FROM line"
uri.setDataSource("", "(" + query + ")", "points", "", "key")
vl = iface.addVectorLayer(uri.uri(), "QueryLayer", "postgres")

Only after I changed the "key" field type to "int", I could get the layer loaded into QGIS:

uri = QgsDataSourceURI()
uri.setConnection("localhost", "5432", "mydb", "user", "pass")
query="SELECT  id, CAST(CAST (id as text)|| (ST_DumpPoints(geom)).path[2] as int) as key, (ST_DumpPoints(geom)).geom as points FROM line"
uri.setDataSource("", "(" + query + ")", "points", "", "key")
vl = iface.addVectorLayer(uri.uri(), "QueryLayer", "postgres")

Note that I had to adjust your gid to id and your the_geom to geom, in order to work with my own data.

postgis – How to Start Path Search Based on Source and Target Locations in Other Layers using pgRouting

As @robin loche suggested, you don't assign the points in your from and to point lists to the network vertices. You use the locations of your from and to points to find the closest start and destination vertices in the network for each search.

The following nested queries do this in stages:

generate all pairs of source and destination points
find closest network vertices for each point pair
run the routing algorithm

The results of pgr_Dijkstra are returned here as a sequence of rows for each from, to point pair with each row containing a step in the route sequence in a composite type. You may want to nest this to unpack those results a bit more. Being a stream application, you probably also have a directed network.

This should get you started.

ALTER TABLE streams ADD COLUMN length float8; -- add length field for pgr costs
UPDATE streams SET length = ST_Length(geom::geography);

WITH all_pairs AS (
  -- all pairs of start and end geometries with IDs
  -- that get carried through so your routing results
  -- match with the pt IDs you know.
  SELECT f.gid AS fid, f.geom as fgeom,
         t.gid as tid, t.geom as tgeom
    FROM from_pts AS f,
         to_pts AS t
), vertices AS (
SELECT fid, tid,
       (SELECT id -- proximity search for closest from vertex
          FROM streams_vertices_pgr
         ORDER BY the_geom <-> fgeom
         LIMIT 1) as fv,
       (SELECT id -- proximity search for closest to vertex
          FROM streams_vertices_pgr
         ORDER BY the_geom <-> tgeom
         LIMIT 1) as tv
  FROM all_pairs
)
SELECT fid, tid, pgr_Dijkstra(
  'SELECT gid AS id, source, target, length AS cost FROM streams',
  fv, tv,
  directed =>false
) from vertices;

Best Answer

Related Solutions

[GIS] How to debug ‘Layer is not valid error’ from Python plugin

postgis – How to Start Path Search Based on Source and Target Locations in Other Layers using pgRouting

Related Question