PostGIS Performance – How to Speed Up Query in PostGIS

performancepoint-in-polygonpostgis

I know this question (or a version of it) has probably been asked but I cannot seem to figure out a solution to my problem.

The core spatial operation is finding which of my points are inside polygons. I am using my own coordinates and Census Bureau shapefiles in my database.

I have a PostGIS database set up with a schema used for doing my stuff. It looks like this:

create schema temp2;

/* load data from csv into this table
*/
create table temp2.coords (
id integer primary key,
lat float,
lng float
);

/* contains the geoms
*/
create table temp2.geoms (
id integer primary key,
geom geometry(point, 4269)
);

/* my spatial index
*/
create index temp2_geoms_idx on temp2.geoms using gist(geom);

/* save the result
*/
create table temp2.result (
id integer primary key,
countycode varchar(64)
);

The main issue is that when I run the query, it takes forever for a small amount of points. It took about 30 min for 100k points. From what I have read, this should happen way faster, and I am sure it has to do with the query that I have.

I have experimented with different spatial indices, different kinds of queries, and read the manual. I think I am misunderstanding something about how the indices work.

This is my query:

select
main.id,
c.cntyidfp
from temp2.geoms as main
left outer join tiger.county as c
on st_intersects(c.the_geom, main.geom);

Explain from the query:

"Gather  (cost=1000.00..1870856.75 rows=6683599 width=10)"
"  Workers Planned: 2"
"  ->  Nested Loop Left Join  (cost=0.00..1201496.85 rows=2784833 width=10)"
"        ->  Parallel Seq Scan on geoms main  (cost=0.00..77500.33 rows=2583333 width=36)"
"        ->  Append  (cost=0.00..0.42 rows=2 width=39218)"
"              ->  Seq Scan on county c  (cost=0.00..0.00 rows=1 width=41)"
"              ->  Index Scan using tiger_data_county_the_geom_gist on county_all c_1  (cost=0.15..0.42 rows=1 width=39230)"
"                    Index Cond: (the_geom && main.geom)"
"                    Filter: _st_intersects(the_geom, main.geom)"
"                    Filter: ((the_geom && main.geom) AND _st_intersects(the_geom, main.geom))"

Shouldn't a simple query like the one I have written be fast?

I thought the whole point of the spatial index was to speed up the "st_" functions.

Shouldn't setting the spatial index to the geometry speed this up?

How can the query be written to avoid this?

EDIT: Query was changed to:

select
m.id,
c.statefp
from tiger.state as c
inner join temp2.geoms2 as m
on st_intersects(m.geom, c.the_geom);

And was able to get a significant speed increase. About 6.2 million points in under 1 minute.

Best Answer

It is usually faster to query the other way around: start from the polygons and identify the points that intersects. This is especially true if you have more points than polygons.

Following the comments, the query would be:

select m.id, c.statefp 
from tiger.state as c 
inner join temp2.geoms2 as m 
  on st_intersects(m.geom, c.the_geom);