PostGIS ST_Union Function – Why Is ST_Union Output Disproportionately Large?

polygonpostgispostgresqlunion

I have two multi-polygon geometry tables (see the figures below) in my PostGIS database and I would like to create a union of these two tables. They are ~12mb and ~1mb in size. However, the union of these two tables which I created using ST_Union is ~1.1gb.

The code I used

CREATE TABLE myschema.myunion AS 
SELECT
        ST_Union(mymultipoly1.geom ,
        mymultipoly2.geom)
FROM
        mymultipoly1,
        mymultipoly2

It runs fast. My only problem is the size of the output. I don't have to keep the attributes of either of the input tables. But I don't want to meddle with the geometry too much as I need to calculate the area of the output reasonably accurately. My attempts so far

I used ST_Simplify(ST_Union... (with a tolerance value of 10m) and reduced the output table by half the size.
I tried to use ST_Collect(ST_Union... which gives me the following error ERROR: array size exceeds the maximum allowed (1073741823). I think this is because ST_Collect attempts to combine all rows of the multi polygon into one row.

I would like to know why the resulting table is huge and is there a more size-efficient way of doing this?

Best Answer

If a geometric union of both sets of geometries is what you are after, this should be the most convenient, and conveniently also the most performant way, to do so:

CREATE TABLE myschema.myunion AS (
  SELECT ST_Union(geom) AS geom
  FROM   (
    SELECT geom,
           ST_ClusterDBSCAN(geom, 0, 1) OVER () AS _clst
    FROM   (
      SELECT geom
      FROM   mymultipoly1
      UNION ALL
      SELECT geom
      FROM   mymultipoly2
    ) AS table_union
  ) AS geometric_clustering
  GROUP BY
         _clst
);

Here

we first create a UNION ALL of both tables into table_union
we then cluster that virtual table by spatial intersection into geometric_clustering - passing eps => 0 to ST_ClusterDBSCAN has just his handy effect, and as an in-memory operation it is fast - and since we also want isolated geometries having a separate cluster id (_clst), we pass in minpoints => 1
we then ST_Union based on _clst

Addendum:

Given that we are looking to minimize the operational cost of ST_Union to a minimum, and for a geometric union operation on multiple sets of geometries to also make sense, we need to collect

geometries in a that do not overlap with geometries in b
geometries in b that do not overlap with geometries in a

to simply add them into the result, and determine

geometries in a that do overlap with geometries in a, identified by group
geometries in b that do overlap with geometries in b, identified by group
(groups of) geometries from above that overlap between a and b, identified by group

to then being able to create unions over the identified groups.

In row set theory we'd require multiple JOIN operations in sequence, spread across multiple SQL statements (and eventually set-UNIONed into a single source set), to find those candidates for the geometric union operation.

Using ST_ClusterDBSCAN here primarily serves convenience, and secondarily may boost performance. As an in-memory operation over a simple input set-UNION of both tables and by setting the maximum candidate distance to 0 (read: geometries have to be within 0 distance of each other to get considered the same cluster), it is fast to identify groups of overlapping geometries, as well as adding those that are isolated (by specifying that clusters can be of size 1). As a Window function, it is able to assign ids (here: _clst) to rows belonging to the same cluster, and that we can later use to GROUP BY and ST_Union over.

Bonus:

Since isolated geometries will get an own _clst assignment each, and subsequently get passed to (and potentially unnecessarily processed by) ST_Union, we can conditionally add geometries to the result set in cases where we expect a large amount of input geometries to be isolated:

CREATE TABLE myschema.myunion AS (
  WITH
    cluster AS (
      SELECT *
      FROM   (
        SELECT geom,
               ST_ClusterDBSCAN(geom, 0, 2) OVER () AS _clst
        FROM   (
          SELECT geom
          FROM   mymultipoly1
          UNION ALL
          SELECT geom
          FROM   mymultipoly2
        ) AS table_union
      ) AS geometric_clustering
      ORDER BY
             _clst NULLS FIRST
    )
  SELECT geom
  FROM   cluster
  WHERE  _clst IS NULL
  UNION ALL
  SELECT ST_Union(geom) AS geom
  FROM   cluster
  WHERE  _clst IS NOT NULL
  GROUP BY
         _clst
);

We set minpoints => 2 to make the function assign NULLs to isolated geometries.

Related Solutions

[GIS] PostGIS union multiple tables, big dataset, faster approach

For what it's worth, the ST_Collect() function is much faster then a call to ST_Union().

The speed increase is mainly attributed to the fact that ST_Collect() returns a Geometry Collection object, and doesn't try to dissolve overlapping geometries.

You can create a single, OGC-compliant geometry (e.g. a POLYGON or MULTIPOLYGON) by wrapping ST_Collect() in a 0-width buffer, e.g. ST_Buffer(ST_Collect(geometry), 0).

I've used it to generate global-scale label points for OpenStreetMap water features (a rather large dataset). With proper GIST indexes as noted by previous respondents, it's a pretty fast operation.

Good Luck!

[GIS] Alternative to ST_Union/ST_MemUnion for merging overlapping polygons using PostGIS

The problem is that ST_Union will add all geometries to the output, not only the overlapping ones, causing postgres to complain. So if there are not too many overlapping polygons, you should try to isolate them first and then run a union. This should work:

SELECT (ST_Dump(ST_Union(a.geom))).geom  
FROM polygons a ,polygons b -- compare the table with itself
WHERE ST_Intersects(a.geom, b.geom) -- only when a geom overlaps
AND (a.id < b.id) -- but not with itself and only one way (hence the < )

The query will likely still take a considerable amount of time, try with a smaller subset (say 40k) first to get an estimate and make sure you have indexed geometry and id.

After you did this, you still have to add the non-overlapping polygons.

Best Answer

Related Solutions

[GIS] PostGIS union multiple tables, big dataset, faster approach

[GIS] Alternative to ST_Union/ST_MemUnion for merging overlapping polygons using PostGIS

Related Question