[GIS] Combined Hstore key/value and spatial query too slow to handle bigger OSM extracts

openstreetmappostgispostgresqlsql

I'm trying to compute some statistics for OSM data using PostgreSQL 9.3.5 and PostGIS 2.1.4. I started with a small bavaria extract which I downloaded from Geofabrik. The db schema is the normal API 0.6 schema, the data was imported via the dump approach into Postgres (using the pgsnapshot_schema_0.6*.sql scripts which come with osmosis). ANALYZE VACUUM was also performed.

The only custom made thing I'm using is a polygon table which contains multipolygons for all administrative boundary relations. The polygon geometry was not simplified in any way.

What I'm now trying to achieve is counting all nodes which are inside of the admin=6 boundaries of bavaria. Here is my SQL query:

SELECT relpoly.id, count(node) 
FROM bavaria.relpolygons relpoly, bavaria.nodes node
WHERE relpoly.tags @> '"boundary"=>"administrative","admin_level"=>"6"'::hstore 
AND ST_Intersects(relpoly.geom, node.geom)
GROUP BY relpoly.id;

The runtime of this query is terrible because Postgres is doing a nested loop join and scans over all nodes for every admin=6 boundary. FYI, bavaria is divided into 98 admin=6 polygons and there are about 30 million nodes in the bavaria extract.

Is it possible to avoid this sub-optimal query execution and to tell Postgres that it should scan all nodes only once (e.g., by incrementing a counter for the corresponding polygon in the result set or by using hints)?

Edit:

1) a spatial index exists on the bavaria nodes:

CREATE INDEX idx_nodes_geom ON bavaria.nodes USING gist (geom);

2) the query plan look like this:

HashAggregate  (cost=284908.49..284908.75 rows=26 width=103)
  ->  Nested Loop  (cost=111.27..283900.80 rows=201537 width=103)
        ->  Bitmap Heap Scan on relpolygons relpoly  (cost=4.48..102.29 rows=26 width=5886)
              Recheck Cond: (tags @> '"boundary"=>"administrative", "admin_level"=>"6"'::hstore)
              ->  Bitmap Index Scan on relpolygons_geom_tags  (cost=0.00..4.47 rows=26 width=0)
                    Index Cond: (tags @> '"boundary"=>"administrative", "admin_level"=>"6"'::hstore)
        ->  Bitmap Heap Scan on nodes node  (cost=106.79..10905.50 rows=983 width=127)
              Recheck Cond: (relpoly.geom && geom)
              Filter: _st_intersects(relpoly.geom, geom)
              ->  Bitmap Index Scan on idx_nodes_geom  (cost=0.00..106.55 rows=2950 width=0)
                    Index Cond: (relpoly.geom && geom)

I created the following two indexes, but the query plan (and runtime) did not change

CREATE INDEX relpolygons_tags_boundary on bavaria.relpolygons( (tags->'boundary') );
CREATE INDEX relpolygons_tags_admin on bavaria.relpolygons( (tags->'admin_level') );
ANALYZE bavaria.relpolygons;

Best Answer

The best way to index hstore tags is using GIN or GIST indexes, which from the docs, support the @>, ?, ?& and ?| operators, that is, searches on keys and key/value pairs. You approach of using a function to extract the tags for a B-tree index is reasonable, but because you are also checking for specific key/value pairs, the analyzer has chosen a full table scan.

I don't have access to bavaria.relpolygons, but based on a similar query for OSM UK on speed limits and highway tags, I get this for my explain on the following query:

SELECT count(*) 
 FROM ways 
WHERE tags @> 'highway=>motorway'::hstore 
 AND tags @> 'maxspeed=>"50 mph"'::hstore;


Aggregate  (cost=48.66..48.67 rows=1 width=0)
    ->  Index Scan using ix_ways_tags_gist on ways  (cost=0.42..48.64 rows=11 width=0)
     Index Cond: ((tags @> '"highway"=>"motorway"'::hstore) AND (tags @> '"maxspeed"=>"50 mph"'::hstore))

which shows a direct index scan (using the gist index), which for a table with 10 million rows is encouraging. The index was created with the simple:

CREATE INDEX ix_ways_tags_gist ON ways USING gist (tags);

While I can't check you spatial condition, I am guessing it is less selective than

WHERE relpoly.tags @> '"boundary"=>"administrative","admin_level"=>"6"'::hstore.

and would therefore only be used for a recheck condition.

There is also this great SO answer on the difference between GIN and GIST indexes. The general finding is that GIN indexes while bigger and slower to build, are much faster on text retrieval problems.

Sorry to answer so late, but I have recently being doing similar work on OSM and hstore, and discovered that not only did I once star this question, but that I could now answer it :D.

Related Solutions

[GIS] Improve performance of a PostGIS st_dwithin query

The explain doesn't show an index coming into play, which could be for two reasons:

You don't have one. So make one with CREATE INDEX tree_gix ON trees USING GIST (geom)
Your data is in geographic coordinates, so your spatial join isn't really doing anything selective (it's joining every tree to all other trees, every time). In that case, either (a) change to using the geography type or (b) move your data to an appropriate planar projection (I recommend (b)).

Finally, yeah, the LEFT JOIN is not doing anything useful for you unless there are lonely trees with no partners in the radius you need to keep in the result set. I'd remove the geometric and species GROUP BY as well, since you already have a unique id in there, the other variables are just noise.

SELECT 
  a.tree_id, a.species, avg(b.age) as age_avg, 
  count(*) as samples, a.geom
FROM trees a 
JOIN trees b
ON ST_DWithin(a.geom, b.geom, 100) AND a.species = b.species
WHERE a.age IS NULL
GROUP BY a.tree_id;

Other than the first two possible errors above, the query itself looks pretty neat and clean.

[GIS] How to use St_intersects with different geometry type

Fast query results for ST_Intersects hinge on the fact that not every pair of inputs needs to be tested. PostGIS avoids testing every pair of geometries by implicitly testing the arguments to ST_Intersects with the bounding box intersection operator &&, so that only geometries whose bounding boxes intersect need to be passed to ST_Intersects. When your geometry columns are indexed, PostgreSQL can use the index to fetch only geometries that pass the && filter, significantly reducing the number of comparisons.

Here's the problem. The index provides the bounding boxes of a.geom and b.geom, but not ST_Centroid(a.geom). You and I know that whenever a.geom && b.geom is true, then ST_Centroid(a.geom) && b.geom must also be true, but PostgreSQL has no way to know this.

You can fix this by manually forcing an a.geom && b.geom comparison, which can take advantage of the index.

SELECT a.id, b.id 
FROM PolygonLayer1 a, PolygonLayer2 b
WHERE a.geom && b.geom AND ST_Intersects(ST_Centroid(a.geom), b.geom)

This doesn't explain why you're getting good performance in Case 1, because I have no idea.

Best Answer

Related Solutions

[GIS] Improve performance of a PostGIS st_dwithin query

[GIS] How to use St_intersects with different geometry type

Related Question