PostGIS ST_Union Function – Why Is ST_Union Output Disproportionately Large?

polygonpostgispostgresqlunion

I have two multi-polygon geometry tables (see the figures below) in my PostGIS database and I would like to create a union of these two tables. They are ~12mb and ~1mb in size. However, the union of these two tables which I created using ST_Union is ~1.1gb.

The code I used

CREATE TABLE myschema.myunion AS 
SELECT
        ST_Union(mymultipoly1.geom ,
        mymultipoly2.geom)
FROM
        mymultipoly1,
        mymultipoly2

It runs fast. My only problem is the size of the output. I don't have to keep the attributes of either of the input tables. But I don't want to meddle with the geometry too much as I need to calculate the area of the output reasonably accurately. My attempts so far

  1. I used ST_Simplify(ST_Union... (with a tolerance value of 10m) and reduced the output table by half the size.
  2. I tried to use ST_Collect(ST_Union... which gives me the following error ERROR: array size exceeds the maximum allowed (1073741823). I think this is because ST_Collect attempts to combine all rows of the multi polygon into one row.

I would like to know why the resulting table is huge and is there a more size-efficient way of doing this?

enter image description here

Best Answer

If a geometric union of both sets of geometries is what you are after, this should be the most convenient, and conveniently also the most performant way, to do so:

CREATE TABLE myschema.myunion AS (
  SELECT ST_Union(geom) AS geom
  FROM   (
    SELECT geom,
           ST_ClusterDBSCAN(geom, 0, 1) OVER () AS _clst
    FROM   (
      SELECT geom
      FROM   mymultipoly1
      UNION ALL
      SELECT geom
      FROM   mymultipoly2
    ) AS table_union
  ) AS geometric_clustering
  GROUP BY
         _clst
);

Here

  • we first create a UNION ALL of both tables into table_union
  • we then cluster that virtual table by spatial intersection into geometric_clustering - passing eps => 0 to ST_ClusterDBSCAN has just his handy effect, and as an in-memory operation it is fast - and since we also want isolated geometries having a separate cluster id (_clst), we pass in minpoints => 1
  • we then ST_Union based on _clst

Addendum:

Given that we are looking to minimize the operational cost of ST_Union to a minimum, and for a geometric union operation on multiple sets of geometries to also make sense, we need to collect

  • geometries in a that do not overlap with geometries in b
  • geometries in b that do not overlap with geometries in a

to simply add them into the result, and determine

  • geometries in a that do overlap with geometries in a, identified by group
  • geometries in b that do overlap with geometries in b, identified by group
  • (groups of) geometries from above that overlap between a and b, identified by group

to then being able to create unions over the identified groups.

In row set theory we'd require multiple JOIN operations in sequence, spread across multiple SQL statements (and eventually set-UNIONed into a single source set), to find those candidates for the geometric union operation.


Using ST_ClusterDBSCAN here primarily serves convenience, and secondarily may boost performance. As an in-memory operation over a simple input set-UNION of both tables and by setting the maximum candidate distance to 0 (read: geometries have to be within 0 distance of each other to get considered the same cluster), it is fast to identify groups of overlapping geometries, as well as adding those that are isolated (by specifying that clusters can be of size 1). As a Window function, it is able to assign ids (here: _clst) to rows belonging to the same cluster, and that we can later use to GROUP BY and ST_Union over.


Bonus:

Since isolated geometries will get an own _clst assignment each, and subsequently get passed to (and potentially unnecessarily processed by) ST_Union, we can conditionally add geometries to the result set in cases where we expect a large amount of input geometries to be isolated:

CREATE TABLE myschema.myunion AS (
  WITH
    cluster AS (
      SELECT *
      FROM   (
        SELECT geom,
               ST_ClusterDBSCAN(geom, 0, 2) OVER () AS _clst
        FROM   (
          SELECT geom
          FROM   mymultipoly1
          UNION ALL
          SELECT geom
          FROM   mymultipoly2
        ) AS table_union
      ) AS geometric_clustering
      ORDER BY
             _clst NULLS FIRST
    )
  SELECT geom
  FROM   cluster
  WHERE  _clst IS NULL
  UNION ALL
  SELECT ST_Union(geom) AS geom
  FROM   cluster
  WHERE  _clst IS NOT NULL
  GROUP BY
         _clst
);

We set minpoints => 2 to make the function assign NULLs to isolated geometries.

Related Question