[GIS] Displaying data from a partitioned PostgreSQL / PostGIS table is unacceptably slow in QGIS

postgispostgresqlqgis

I have two very similar datasets of 27 million polygons stored in PostgreSQL / PostGIS.

One is loaded as a single table, with spatial indexes and regular indexes on key fields.

The other is partitioned based on a grid of geographic areas. Each parition has a spatial index and regular indexes, and has been configured in accordance with the user manual Partitioning guidance; http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html

If I load the single table into QGIS, it loads data into the map window and starts rendering polygons very quickly (when zoomed in to 1:10000).

If I use the master partitioned table and do the same, it takes over 8 minutes before it starts to display data. After the 8 minutes it then is able to render the data at a similar speed to the single table as I zoom and pan.

I am really looking for a way to resolve this 8 plus minutes, as QGIS appears not be reading the partitioned table efficiently. I have thought it may be due to;
– QGIS not picking up any spatial index
– QGIS loading the entire 27 million records into its cache / memory
– QGIS not getting the correct bounding box for my window view and loading all the data in

Am I missing some setting or configuration? I have loaded it within GeoMedia without issue, and performed a bounding box query within FME on both the single table and partitioned (both performed at similar speeds). QGIS can connect to PostgreSQL / PostGIS and get the schema / table listings quite quickly with 'Use estimated table metadata' checked. My issue is adding it to the map.

UPDATE 2016/05/18 15.20

On advice from user30184 I enablled PostgreSQL logging to identify the queries hitting the database server.

They showed the following;

For the Single table;

2016-05-18 14:47:23 BST QGISLOG: statement: SELECT st_estimatedextent('schema1','table1','geometry')
2016-05-18 14:47:24 BST [unknown]LOG: statement: SET application_name='QGIS'
2016-05-18 14:47:24 BST QGISLOG: statement: BEGIN READ ONLY
2016-05-18 14:47:24 BST QGISLOG: statement: DECLARE qgis_1 BINARY CURSOR FOR SELECT st_asbinary("geometry",'NDR'),"primaryindex" FROM "schema1"."table1" WHERE "geometry" && st_makeenvelope(minx,miny,maxx,maxy,27700)
2016-05-18 14:47:24 BST QGISLOG: statement: FETCH FORWARD 2000 FROM qgis_1
2016-05-18 14:47:32 BST QGISLOG: statement: FETCH FORWARD 2000 FROM qgis_1
2016-05-18 14:47:32 BST QGISLOG: statement: FETCH FORWARD 2000 FROM qgis_1
2016-05-18 14:47:32 BST QGISLOG: statement: CLOSE qgis_1
2016-05-18 14:47:32 BST QGISLOG: statement: COMMIT

For the Partitioned table;

2016-05-18 14:49:57 BST QGISLOG: statement: SELECT st_extent("geometry") FROM "schema2"."table2"
2016-05-18 14:58:23 BST [unknown]LOG: statement: SET application_name='QGIS'
2016-05-18 14:58:23 BST QGISLOG: statement: BEGIN READ ONLY
2016-05-18 14:58:23 BST QGISLOG: statement: DECLARE qgis_1 BINARY CURSOR FOR SELECT st_asbinary("geometry",'NDR'),"poly_id" FROM "schema2"."table2" WHERE "geometry" && st_makeenvelope(minx,miny,maxx,maxy,27700)
2016-05-18 14:58:23 BST QGISLOG: statement: FETCH FORWARD 2000 FROM qgis_1
2016-05-18 14:58:24 BST QGISLOG: statement: FETCH FORWARD 2000 FROM qgis_1
2016-05-18 14:58:24 BST QGISLOG: statement: FETCH FORWARD 2000 FROM qgis_1
2016-05-18 14:58:24 BST QGISLOG: statement: CLOSE qgis_1
2016-05-18 14:58:24 BST QGISLOG: statement: COMMIT

The thing I noticed is that the Single table is using an estimated extent (st_estimatedextent) whereas the Partitioned table is using the full (st_extent). I suspect this is where the time delay is coming in. In pgAdminIII I ran the query st_estimatedextent in a worksheet for the Single table and it brought back a result instantly, whereas for the Partition it had the following;

ERROR: stats for "nds.geometry" do not exist
********** Error **********

ERROR: stats for "nds.geometry" do not exist
SQL state: XX000

I guess it hasn't picked up the stats on the child partitions at all. I Analysed the master partition table, but that hasn't appeared to have populated any statistics. Any thoughts?

Best Answer

PostgreSQL partitioning is only going to make things faster if you have a non-spatial constraint in your query and that constraint is applied to each child table in the partition via constraint exclusion. (The constraint exclusion config also has to be turned on.)

Partitioning will be useful

  • for performance, if you have a key you can build constrained children on, or
  • for management, if you have a set of tables you want to be able backup/restore/archive but still query "as one".

When you query the parent table with something like

SELECT * FROM partition_parent

the database is going to do something like

SELECT * FROM child_1 
  UNION ALL SELECT * FROM child_2 
  UNION ALL ... SELECT * FROM child_n

and there's no performance win there. Every record is going to be read. Worse, since spatial is not supported in constraint exclusion, you cannot build child tables such that non-participating tables are excluded from the query as you zoom in. So you'll select all records from all tables, every time, before applying a spatial filter (like a QGIS zoom window) to the set.

So, I'm guessing you think partitioning can provide you something it cannot; maybe you shouldn't use it?

Related Question