[GIS] Compare only valid polygon geometries in PostGIS

errorinvalid-datapostgisspatial-indextopology

I'm trying to run a fuzzy geometry comparison on a large polygon dataset (>2 million rows), but am getting an error like so:

ERROR: ptarray_contains_point called on unclosed ring
SQL state: XX000

The problem lies with a few annoying invalid geometries being present in one of the tables, and this error is raised on the 'ST_DWithin' statement (see code excerpt below):

SELECT
    DISTINCT ON (t1.gid) t1.gid, t1.id, t2.gid, t2.id,
    (CASE WHEN ST_INTERSECTS(t1.geom, ST_MakeValid(t2.geom)) IS false THEN '9999.99'
          ELSE ST_HausdorffDistance(t1.geom, t2.geom)
    END) as hausdorff
FROM 
    (SELECT id, gid, geom FROM new_lyr WHERE gid = 12345) as t1,
    (SELECT id, gid, geom FROM old_lyr) as t2
WHERE
    ST_DWithin(t1.geom, t2.geom, 2)
ORDER BY
    t1.gid, hausdorff;

My question is: How can I somehow bypass the error, while still making use of the in-built PostGIS spatial index (i.e. the ST_DWithin statement). I have tried to put ST_MakeValid(t2.geom) into the WHERE clause, but this takes almost forever to run since it tries to make all the geometries in table t2 valid! Is there any other workaround way to resolve this, either in SQL, or would it be better just to clean up the geometry data in those tables and re-run the query?

Best Answer

This should skip over invalid geometries.

SELECT
   DISTINCT ON (t1.gid) t1.gid, t1.id, t2.gid, t2.id,
(CASE WHEN ST_INTERSECTS(t1.geom, t2.geom) IS false THEN '9999.99'
      ELSE ST_HausdorffDistance(t1.geom, t2.geom)
END) as hausdorff
 FROM 
    (SELECT id, gid, geom FROM new_lyr WHERE gid = 12345) as t1,
   (SELECT id, gid, geom FROM old_lyr WHERE ST_IsValid(geom) ) as t2
 WHERE
      ST_DWithin(t1.geom, t2.geom, 2)
ORDER BY
    t1.gid, hausdorff;

You might want to compare teh performance with putting the ST_IsValid check in the outter where. I think both should work but the planner isn't guaranteed to run them in order.

SELECT
   DISTINCT ON (t1.gid) t1.gid, t1.id, t2.gid, t2.id,
(CASE WHEN ST_INTERSECTS(t1.geom, t2.geom) IS false THEN '9999.99'
      ELSE ST_HausdorffDistance(t1.geom, t2.geom)
END) as hausdorff
 FROM 
    (SELECT id, gid, geom FROM new_lyr WHERE gid = 12345) as t1,
   (SELECT id, gid, geom FROM old_lyr ) as t2
 WHERE
      ST_IsValid(t2.geom)  AND ST_DWithin(t1.geom, t2.geom, 2)
ORDER BY
    t1.gid, hausdorff;

Update to original answer to force the planner's hand.

Here I use ST_Expand && which is just a bounding box check so should be safe with invalid geometries as well. This will materialized the t2 table for those where bounding box is close enough to t2. Using && should be safe with even invalid geometries since invalid geometries should have a bounding box still. This may on downside produce a large worktable with false-positives.

Note the next part I use _ST_DWithin instead of ST_DWithin. The reason is that ST_DWithin is just syntactic sugar for ST_Expand(geom,2) && geom and _ST_DWithin, so no need to do the && again if you've already done it.

 WITH t1 AS (SELECT id, gid, geom FROM new_lyr WHERE gid = 12345),
      t2 AS (SELECT o.id, o.gid, o.geom FROM old_lyr AS o 
       INNER JOIN t1 ON  ST_Expand(t1.geom, 2) && o.geom
             WHERE ST_IsValid(o.geom) )
 SELECT
   DISTINCT ON (t1.gid) t1.gid, t1.id, t2.gid, t2.id,
(CASE WHEN ST_INTERSECTS(t1.geom, t2.geom) IS false THEN '9999.99'
      ELSE ST_HausdorffDistance(t1.geom, t2.geom)
END) as hausdorff
 FROM 
    t1,
    t2
 WHERE
      _ST_DWithin(t1.geom, t2.geom, 2)
ORDER BY
    t1.gid, hausdorff;