PostGIS – Enhancing Performance for Geohash Aggregation

elasticsearch geohash json postgis postgresql

I'm using the GeoNames (https://www.geonames.org/) dataset and want to aggregate the points into geohash cells of a specific precision. Beforehand, I filter the points with a bounding box. This is the query I came up with:

WITH bbox AS (
  SELECT name, the_geom
  FROM geonames
  WHERE ST_Contains(ST_MakeEnvelope(-29.79, 16.38, 64.05, 90.26, 4326), the_geom)
)
SELECT COUNT(name), ST_GeoHash(the_geom, 2)
FROM bbox
GROUP BY ST_GeoHash(the_geom, 2);

The output looks like this:

+-------+------------+
| count | st_geohash |
+-------+------------+
| 34200 | tm         |
+-------+------------+
| 3     | up         |
+-------+------------+
| ...   | ...        |
+-------+------------+

This is the query plan:

"HashAggregate  (cost=24426.50..24429.00 rows=200 width=40) (actual time=5805.214..5805.229 rows=121 loops=1)"
"  Group Key: st_geohash(bbox.the_geom, 2)"
"  CTE bbox"
"    ->  Bitmap Heap Scan on geonames  (cost=376.34..24317.79 rows=3953 width=46) (actual time=454.394..2950.692 rows=3349419 loops=1)"
"          Recheck Cond: ('0103000020E610000001000000050000000AD7A3703DCA3DC0E17A14AE476130400AD7A3703DCA3DC0713D0AD7A39056403333333333035040713D0AD7A39056403333333333035040E17A14AE476130400AD7A3703DCA3DC0E17A14AE47613040'::geometry ~ the_geom)"
"          Filter: _st_contains('0103000020E610000001000000050000000AD7A3703DCA3DC0E17A14AE476130400AD7A3703DCA3DC0713D0AD7A39056403333333333035040713D0AD7A39056403333333333035040E17A14AE476130400AD7A3703DCA3DC0E17A14AE47613040'::geometry, the_geom)"
"          Rows Removed by Filter: 18"
"          Heap Blocks: exact=48141"
"          ->  Bitmap Index Scan on idx_geonames_geom  (cost=0.00..375.35 rows=11858 width=0) (actual time=444.950..444.950 rows=3349437 loops=1)"
"                Index Cond: ('0103000020E610000001000000050000000AD7A3703DCA3DC0E17A14AE476130400AD7A3703DCA3DC0713D0AD7A39056403333333333035040713D0AD7A39056403333333333035040E17A14AE476130400AD7A3703DCA3DC0E17A14AE47613040'::geometry ~ the_geom)"
"  ->  CTE Scan on bbox  (cost=0.00..88.94 rows=3953 width=64) (actual time=454.401..5030.976 rows=3349419 loops=1)"
"Planning time: 0.492 ms"
"Execution time: 5832.977 ms"

Is there a way to improve the performance of this query?
I'm also testing the same thing with Elasticsearch 6.6, where a query producing the same output is a lot faster:

{
    "aggregations" : {
        "zoomed-in" : {
            "filter" : {
                "geo_bounding_box" : {
                    "location" : {
                        "top_left" : "64.05, -29.79",
                        "bottom_right" : "16.38, 90.26"
                    }
                }
            },
            "aggregations":{
                "zoom1":{
                    "geohash_grid" : {
                        "field": "location",
                        "precision": 2,
                        "size": 100000
                    }
                }
            }
        }
    }
}

Best Answer

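In PostgreSQL versions before 12, common table expressions are always materialized. This acts as an optimization fence: the planner cannot merge the CTE into the outer query. (From version 12 on, a non-recursive CTE without side effects that is referenced only once is inlined by default.)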

To allow more optimizations, move bbox into a view, or inline it as a subquery:

SELECT COUNT(name), ST_GeoHash(the_geom, 2)
FROM (
  SELECT name, the_geom
  FROM geonames
  WHERE ST_Contains(ST_MakeEnvelope(-29.79, 16.38, 64.05, 90.26, 4326), the_geom)
) AS bbox
GROUP BY ST_GeoHash(the_geom, 2);
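
The other two routes can be sketched as follows. The view name geonames_bbox is made up for illustration, and the NOT MATERIALIZED keyword exists only in PostgreSQL 12 and later, so the second variant is an option only on those versions.

```sql
-- Option 1: wrap the filter in a view (geonames_bbox is a hypothetical name).
-- A view is expanded into the outer query, so the planner sees one statement.
CREATE VIEW geonames_bbox AS
SELECT name, the_geom
FROM geonames
WHERE ST_Contains(ST_MakeEnvelope(-29.79, 16.38, 64.05, 90.26, 4326), the_geom);

SELECT COUNT(name), ST_GeoHash(the_geom, 2)
FROM geonames_bbox
GROUP BY ST_GeoHash(the_geom, 2);

-- Option 2 (PostgreSQL 12+ only): keep the CTE but ask for it to be inlined.
WITH bbox AS NOT MATERIALIZED (
  SELECT name, the_geom
  FROM geonames
  WHERE ST_Contains(ST_MakeEnvelope(-29.79, 16.38, 64.05, 90.26, 4326), the_geom)
)
SELECT COUNT(name), ST_GeoHash(the_geom, 2)
FROM bbox
GROUP BY ST_GeoHash(the_geom, 2);
```

Running each variant under EXPLAIN ANALYZE should show whether the CTE Scan node disappears from the plan. How much time that saves depends on the data, so it is worth measuring rather than assuming.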

Related Solutions

[GIS] Improve performance of a PostGIS st_dwithin query

The EXPLAIN output doesn't show an index coming into play, which could be for two reasons:

1. You don't have one. So make one with CREATE INDEX tree_gix ON trees USING GIST (geom);
2. Your data is in geographic coordinates, so the distance of 100 is read as 100 degrees and your spatial join isn't really doing anything selective (it's joining every tree to all other trees, every time). In that case, either (a) change to using the geography type or (b) move your data to an appropriate planar projection (I recommend (b)).

Finally, yeah, the LEFT JOIN is not doing anything useful for you unless you need to keep lonely trees, those with no partners within the radius, in the result set. I'd remove the geometric and species GROUP BY as well: you already have a unique id in there, so the other variables are just noise.

SELECT 
  a.tree_id, a.species, avg(b.age) as age_avg, 
  count(*) as samples, a.geom
FROM trees a 
JOIN trees b
ON ST_DWithin(a.geom, b.geom, 100) AND a.species = b.species
WHERE a.age IS NULL
GROUP BY a.tree_id;

Other than the first two possible errors above, the query itself looks pretty neat and clean.
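
The two ways out of the geographic-coordinates problem might look like this. It is a sketch under assumptions the question does not state: that geom currently carries SRID 4326, and that EPSG:32633 (UTM zone 33N) fits the area, which is only a placeholder for a projection appropriate to your data. The names geom_utm, trees_geom_utm_gix and trees_geog_gix are made up.

```sql
-- (b) Planar projection: store a projected copy of the geometry and index it.
ALTER TABLE trees ADD COLUMN geom_utm geometry;
UPDATE trees SET geom_utm = ST_Transform(geom, 32633);  -- 32633 is a placeholder SRID
CREATE INDEX trees_geom_utm_gix ON trees USING GIST (geom_utm);
ANALYZE trees;
-- The join condition then measures metres:
--   ON ST_DWithin(a.geom_utm, b.geom_utm, 100) AND a.species = b.species

-- (a) Geography type: index the cast, so distances are metres on the spheroid.
CREATE INDEX trees_geog_gix ON trees USING GIST ((geom::geography));
ANALYZE trees;
--   ON ST_DWithin(a.geom::geography, b.geom::geography, 100) AND a.species = b.species
```

Option (b) keeps the cheaper planar distance calculation, which is presumably why the answer prefers it. Option (a) avoids choosing a projection, at the cost of slower spheroidal math.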

[GIS] How to use St_intersects with different geometry type

Fast query results for ST_Intersects hinge on the fact that not every pair of inputs needs to be tested. PostGIS avoids testing every pair of geometries by implicitly testing the arguments to ST_Intersects with the bounding box intersection operator &&, so that only geometries whose bounding boxes intersect need to be passed to ST_Intersects. When your geometry columns are indexed, PostgreSQL can use the index to fetch only geometries that pass the && filter, significantly reducing the number of comparisons.

Here's the problem. The index provides the bounding boxes of a.geom and b.geom, but not of ST_Centroid(a.geom). You and I know that whenever ST_Centroid(a.geom) && b.geom is true, then a.geom && b.geom must also be true (the centroid always lies inside the bounding box of its geometry), but PostgreSQL has no way to know this.

You can fix this by manually forcing an a.geom && b.geom comparison, which can take advantage of the index.

SELECT a.id, b.id 
FROM PolygonLayer1 a, PolygonLayer2 b
WHERE a.geom && b.geom AND ST_Intersects(ST_Centroid(a.geom), b.geom)
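
To check whether the explicit && condition is actually served by an index, one could make sure both tables have a GiST index and then look at the plan. The index names below are hypothetical. Note that IF NOT EXISTS only checks the index name, so if the geom columns are already indexed under other names, skip the CREATE INDEX statements.

```sql
-- Hypothetical index names; skip these if the geom columns are already indexed.
CREATE INDEX IF NOT EXISTS polygonlayer1_geom_gix ON PolygonLayer1 USING GIST (geom);
CREATE INDEX IF NOT EXISTS polygonlayer2_geom_gix ON PolygonLayer2 USING GIST (geom);

-- Refresh planner statistics so the new indexes are costed sensibly.
ANALYZE PolygonLayer1;
ANALYZE PolygonLayer2;

-- Show the plan without running the query.
EXPLAIN
SELECT a.id, b.id
FROM PolygonLayer1 a, PolygonLayer2 b
WHERE a.geom && b.geom AND ST_Intersects(ST_Centroid(a.geom), b.geom);
```

An Index Scan or Bitmap Index Scan node on one of the geom indexes would suggest that the bounding-box filter is index-assisted. A pair of sequential scans joined by a nested loop would suggest it is not.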

This doesn't explain why you're getting good performance in Case 1, because I have no idea.
