PostGIS Performance – How to Speed Up Query in PostGIS

performancepoint-in-polygonpostgis

I know this question (or a version of it) has probably been asked but I cannot seem to figure out a solution to my problem.

The core spatial operation is finding which of my points are inside polygons. I am using my own coordinates and Census Bureau shapefiles in my database.

I have a PostGIS database set up with a schema used for doing my stuff. It looks like this:

create schema temp2;

/* load data from csv into this table
*/
create table temp2.coords (
id integer primary key,
lat float,
lng float
);

/* contains the geoms
*/
create table temp2.geoms (
id integer primary key,
geom geometry(point, 4269)
);

/* my spatial index
*/
create index temp2_geoms_idx on temp2.geoms using gist(geom);

/* save the result
*/
create table temp2.result (
id integer primary key,
countycode varchar(64)
);

The main issue is that when I run the query, it takes forever for a small amount of points. It took about 30 min for 100k points. From what I have read, this should happen way faster, and I am sure it has to do with the query that I have.

I have experimented with different spatial indices, different kinds of queries, and read the manual. I think I am misunderstanding something about how the indices work.

This is my query:

select
main.id,
c.cntyidfp
from temp2.geoms as main
left outer join tiger.county as c
on st_intersects(c.the_geom, main.geom);

Explain from the query:

"Gather  (cost=1000.00..1870856.75 rows=6683599 width=10)"
"  Workers Planned: 2"
"  ->  Nested Loop Left Join  (cost=0.00..1201496.85 rows=2784833 width=10)"
"        ->  Parallel Seq Scan on geoms main  (cost=0.00..77500.33 rows=2583333 width=36)"
"        ->  Append  (cost=0.00..0.42 rows=2 width=39218)"
"              ->  Seq Scan on county c  (cost=0.00..0.00 rows=1 width=41)"
"              ->  Index Scan using tiger_data_county_the_geom_gist on county_all c_1  (cost=0.15..0.42 rows=1 width=39230)"
"                    Index Cond: (the_geom && main.geom)"
"                    Filter: _st_intersects(the_geom, main.geom)"
"                    Filter: ((the_geom && main.geom) AND _st_intersects(the_geom, main.geom))"

Shouldn't a simple query like the one I have written be fast?

I thought the whole point of the spatial index was to speed up the "st_" functions.

Shouldn't setting the spatial index to the geometry speed this up?

How can the query be written to avoid this?

EDIT: Query was changed to:

select
m.id,
c.statefp
from tiger.state as c
inner join temp2.geoms2 as m
on st_intersects(m.geom, c.the_geom);

And was able to get a significant speed increase. About 6.2 million points in under 1 minute.

Best Answer

It is usually faster to query the other way around: start from the polygons and identify the points that intersects. This is especially true if you have more points than polygons.

Following the comments, the query would be:

select m.id, c.statefp 
from tiger.state as c 
inner join temp2.geoms2 as m 
  on st_intersects(m.geom, c.the_geom);

Related Solutions

[GIS] Improve performance of a PostGIS st_dwithin query

The explain doesn't show an index coming into play, which could be for two reasons:

You don't have one. So make one with CREATE INDEX tree_gix ON trees USING GIST (geom)
Your data is in geographic coordinates, so your spatial join isn't really doing anything selective (it's joining every tree to all other trees, every time). In that case, either (a) change to using the geography type or (b) move your data to an appropriate planar projection (I recommend (b)).

Finally, yeah, the LEFT JOIN is not doing anything useful for you unless there are lonely trees with no partners in the radius you need to keep in the result set. I'd remove the geometric and species GROUP BY as well, since you already have a unique id in there, the other variables are just noise.

SELECT 
  a.tree_id, a.species, avg(b.age) as age_avg, 
  count(*) as samples, a.geom
FROM trees a 
JOIN trees b
ON ST_DWithin(a.geom, b.geom, 100) AND a.species = b.species
WHERE a.age IS NULL
GROUP BY a.tree_id;

Other than the first two possible errors above, the query itself looks pretty neat and clean.

[GIS] How to use St_intersects with different geometry type

Fast query results for ST_Intersects hinge on the fact that not every pair of inputs needs to be tested. PostGIS avoids testing every pair of geometries by implicitly testing the arguments to ST_Intersects with the bounding box intersection operator &&, so that only geometries whose bounding boxes intersect need to be passed to ST_Intersects. When your geometry columns are indexed, PostgreSQL can use the index to fetch only geometries that pass the && filter, significantly reducing the number of comparisons.

Here's the problem. The index provides the bounding boxes of a.geom and b.geom, but not ST_Centroid(a.geom). You and I know that whenever a.geom && b.geom is true, then ST_Centroid(a.geom) && b.geom must also be true, but PostgreSQL has no way to know this.

You can fix this by manually forcing an a.geom && b.geom comparison, which can take advantage of the index.

SELECT a.id, b.id 
FROM PolygonLayer1 a, PolygonLayer2 b
WHERE a.geom && b.geom AND ST_Intersects(ST_Centroid(a.geom), b.geom)

This doesn't explain why you're getting good performance in Case 1, because I have no idea.

Best Answer

Related Solutions

[GIS] Improve performance of a PostGIS st_dwithin query

[GIS] How to use St_intersects with different geometry type

Related Question