I have two multi-polygon geometry tables (see the figures below) in my PostGIS database and I would like to create a union of these two tables. They are ~12mb and ~1mb in size. However, the union of these two tables which I created using ST_Union
is ~1.1gb.
The code I used
CREATE TABLE myschema.myunion AS
SELECT
ST_Union(mymultipoly1.geom ,
mymultipoly2.geom)
FROM
mymultipoly1,
mymultipoly2
It runs fast. My only problem is the size of the output. I don't have to keep the attributes of either of the input tables. But I don't want to meddle with the geometry too much as I need to calculate the area of the output reasonably accurately. My attempts so far
- I used
ST_Simplify(ST_Union...
(with a tolerance value of 10m) and reduced the output table by half the size. - I tried to use
ST_Collect(ST_Union...
which gives me the following errorERROR: array size exceeds the maximum allowed (1073741823)
. I think this is because ST_Collect attempts to combine all rows of the multi polygon into one row.
I would like to know why the resulting table is huge and is there a more size-efficient way of doing this?
Best Answer
If a geometric union of both sets of geometries is what you are after, this should be the most convenient, and conveniently also the most performant way, to do so:
Here
UNION ALL
of both tables intotable_union
geometric_clustering
- passingeps => 0
toST_ClusterDBSCAN
has just his handy effect, and as an in-memory operation it is fast - and since we also want isolated geometries having a separate cluster id (_clst
), we pass inminpoints => 1
ST_Union
based on_clst
Addendum:
Given that we are looking to minimize the operational cost of
ST_Union
to a minimum, and for a geometric union operation on multiple sets of geometries to also make sense, we need to collecta
that do not overlap with geometries inb
b
that do not overlap with geometries ina
to simply add them into the result, and determine
a
that do overlap with geometries ina
, identified by groupb
that do overlap with geometries inb
, identified by groupa
andb
, identified by groupto then being able to create unions over the identified groups.
In row set theory we'd require multiple
JOIN
operations in sequence, spread across multiple SQL statements (and eventually set-UNION
ed into a single source set), to find those candidates for the geometric union operation.Using
ST_ClusterDBSCAN
here primarily serves convenience, and secondarily may boost performance. As an in-memory operation over a simple input set-UNION
of both tables and by setting the maximum candidate distance to0
(read: geometries have to be within0
distance of each other to get considered the same cluster), it is fast to identify groups of overlapping geometries, as well as adding those that are isolated (by specifying that clusters can be of size 1). As a Window function, it is able to assign ids (here:_clst
) to rows belonging to the same cluster, and that we can later use toGROUP BY
andST_Union
over.Bonus:
Since isolated geometries will get an own
_clst
assignment each, and subsequently get passed to (and potentially unnecessarily processed by)ST_Union
, we can conditionally add geometries to the result set in cases where we expect a large amount of input geometries to be isolated:We set
minpoints => 2
to make the function assignNULL
s to isolated geometries.