[GIS] How fast should I expect PostGIS to geocode well-formatted addresses

address, geocoding, postgis, postgresql, tiger


I've installed PostgreSQL 9.3.7 and PostGIS 2.1.7, loaded the nation data and all states data but have found geocoding to be much slower than I anticipated. Did I set my expectations too high? I am getting an average of 3 individual geocodes per second. I need to do about 5 million and I don't want to wait three weeks for this.

This is a virtual machine for processing giant R matrices, and I installed this database on the side, so the configuration might look a little goofy. If a major alteration to the VM's configuration would help, I can change it.

Hardware specs

Memory: 65GB
Processors: 6
lscpu gives me this:

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                6
On-line CPU(s) list:   0-5
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             6
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 58
Stepping:              0
CPU MHz:               2400.000
BogoMIPS:              4800.00
Hypervisor vendor:     VMware
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0-5

The OS is CentOS; uname -rv gives this:

# uname -rv
2.6.32-504.16.2.el6.x86_64 #1 SMP Wed Apr 22 06:48:29 UTC 2015

PostgreSQL config

> select version()
"PostgreSQL 9.3.7 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-11), 64-bit"
> select PostGIS_Full_version()
POSTGIS="2.1.7 r13414" GEOS="3.4.2-CAPI-1.8.2 r3921" PROJ="Rel. 4.8.0, 6 March 2012" GDAL="GDAL 1.9.2, released 2012/10/08" LIBXML="2.7.6" LIBJSON="UNKNOWN" TOPOLOGY RASTER"

Based on previous suggestions for these types of queries, I upped shared_buffers in postgresql.conf to about 1/4 of the available RAM and effective_cache_size to 1/2 of RAM:

shared_buffers = 16096MB     
effective_cache_size = 31765MB
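
(To confirm which values are actually in effect after a restart, a quick sanity check in psql, using nothing PostGIS-specific:)

SHOW shared_buffers;
SHOW effective_cache_size;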

I have run install_missing_indexes() and (after resolving duplicate inserts into some tables) it completed without errors.
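
For reference, the maintenance step meant here is roughly the following (assuming the tiger schema is on the search_path; re-running the full vacuum list is optional):

SELECT install_missing_indexes();
VACUUM ANALYZE tiger.addr;
VACUUM ANALYZE tiger.edges;
VACUUM ANALYZE tiger.faces;
VACUUM ANALYZE tiger.featnames;
VACUUM ANALYZE tiger.place;
VACUUM ANALYZE tiger.zip_state;
VACUUM ANALYZE tiger.zip_lookup_base;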

Geocoding SQL example #1 (batch) ~ mean rate is 2.8 geocodes/sec

I am following the example from http://postgis.net/docs/Geocode.html, which has me create a table containing the addresses to geocode and then run an SQL UPDATE:

UPDATE addresses_to_geocode
              SET  (rating, longitude, latitude,geo) 
              = ( COALESCE((g.geom).rating,-1),
              ST_X((g.geom).geomout)::numeric(8,5), 
              ST_Y((g.geom).geomout)::numeric(8,5),
              geo )
              FROM (SELECT "PatientId" as PatientId
              FROM addresses_to_geocode 
              WHERE "rating" IS NULL ORDER BY PatientId LIMIT 1000) As a
              LEFT JOIN (SELECT "PatientId" as PatientId, (geocode("Address",1)) As geom
              FROM addresses_to_geocode As ag
              WHERE ag.rating IS NULL ORDER BY PatientId LIMIT 1000) As g ON a.PatientId = g.PatientId
              WHERE a.PatientId = addresses_to_geocode."PatientId";

I'm using a batch size of 1000 above and it returns in 337.70 seconds. It's a little slower for smaller batches.

Geocoding SQL example #2 (row by row) ~ mean rate is 1.2 geocodes/sec

When I dig into my addresses by doing the geocodes one at a time with a statement that looks like this (btw, the example below took 4.14 seconds),

SELECT g.rating, ST_X(g.geomout) As lon, ST_Y(g.geomout) As lat, 
    (addy).address As stno, (addy).streetname As street, 
    (addy).streettypeabbrev As styp, (addy).location As city, 
    (addy).stateabbrev As st,(addy).zip 
FROM geocode('6433 DROMOLAND Cir NW, MASSILLON, OH 44646',1) As g;

it's a little slower (about 2.5x per record), but I can look at the distribution of query times and see that a minority of lengthy queries are slowing things down the most (I only have lookup times for the first 2,600 of the 5 million). That is, the fastest 10% average about 100 ms and the slowest 10% average 3.69 seconds, while the mean is 754 ms and the median is 340 ms.

# Just some interaction with the data in R
> range(lookupTimes[1:2600])
[1]  0.00 11.54
> median(lookupTimes[1:2600])
[1] 0.34
> mean(lookupTimes[1:2600])
[1] 0.7541808
> mean(sort(lookupTimes[1:2600])[1:260])
[1] 0.09984615
> mean(sort(lookupTimes[1:2600],decreasing=TRUE)[1:260])
[1] 3.691269
> hist(lookupTimes[1:2600])

[Histogram: geocoding times for the first 2,600 rows]

Other thoughts

If I can't get an order-of-magnitude increase in performance, I figured I could at least make an educated guess about predicting slow geocode times, but it is not obvious to me why the slower addresses seem to take so much longer. I'm running the original address through a custom normalization step to make sure it is formatted nicely before the geocode() function gets it:

sql=paste0("select pprint_addy(normalize_address('",myAddress,"'))")

where myAddress is an [Address], [City], [ST] [Zip] string compiled from a user address table in a non-PostgreSQL database.
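
For reference, the same normalization can be run directly in SQL; the address below is just the one from example #2:

SELECT pprint_addy(normalize_address('6433 DROMOLAND Cir NW, MASSILLON, OH 44646'));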

I tried (and failed) to install the pagc_normalize_address extension, but it is not clear that it would bring the kind of improvement I am looking for.

Edit: added monitoring info as per suggestion.

Performance

One CPU is pegged: [edit: only one processor is used per query, so I have 5 unused CPUs]

top - 14:10:26 up 1 day,  3:11,  4 users,  load average: 1.02, 1.01, 0.93
Tasks: 219 total,   2 running, 217 sleeping,   0 stopped,   0 zombie
Cpu(s): 15.4%us,  1.5%sy,  0.0%ni, 83.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  65056588k total, 64613476k used,   443112k free,    97096k buffers
Swap: 262139900k total,    77164k used, 262062736k free, 62745284k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3130 postgres  20   0 16.3g 8.8g 8.7g R 99.7 14.2 170:14.06 postmaster
11139 aolsson   20   0 15140 1316  932 R  0.3  0.0   0:07.78 top
11675 aolsson   20   0  135m 1836 1504 S  0.3  0.0   0:00.01 wget
    1 root      20   0 19364 1064  884 S  0.0  0.0   0:01.84 init
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.06 kthreadd

Sample of disk activity on the data partition while one proc is pegged at 100%: [edit: only one processor in use by this query]

# dstat -tdD dm-3 1
----system---- --dsk/dm-3-
  date/time   | read  writ
12-06 14:06:36|1818k 3632k
12-06 14:06:37|   0     0
12-06 14:06:38|   0     0
12-06 14:06:39|   0     0
12-06 14:06:40|   0    40k
12-06 14:06:41|   0     0
12-06 14:06:42|   0     0
12-06 14:06:43|   0  8192B
12-06 14:06:44|   0  8192B
12-06 14:06:45| 120k   60k
12-06 14:06:46|   0     0
12-06 14:06:47|   0     0
12-06 14:06:48|   0     0
12-06 14:06:49|   0     0
12-06 14:06:50|   0    28k
12-06 14:06:51|   0    96k
12-06 14:06:52|   0     0
12-06 14:06:53|   0     0
12-06 14:06:54|   0     0 ^C

Analyze that SQL

This is from EXPLAIN ANALYZE on that query:

"Update on addresses_to_geocode  (cost=1.30..8390.04 rows=1000 width=272) (actual time=363608.219..363608.219 rows=0 loops=1)"
"  ->  Merge Left Join  (cost=1.30..8390.04 rows=1000 width=272) (actual time=110.934..324648.385 rows=1000 loops=1)"
"        Merge Cond: (a.patientid = g.patientid)"
"        ->  Nested Loop  (cost=0.86..8336.82 rows=1000 width=184) (actual time=10.676..34.241 rows=1000 loops=1)"
"              ->  Subquery Scan on a  (cost=0.43..54.32 rows=1000 width=32) (actual time=10.664..18.779 rows=1000 loops=1)"
"                    ->  Limit  (cost=0.43..44.32 rows=1000 width=4) (actual time=10.658..17.478 rows=1000 loops=1)"
"                          ->  Index Scan using "addresses_to_geocode_PatientId_idx" on addresses_to_geocode addresses_to_geocode_1  (cost=0.43..195279.22 rows=4449758 width=4) (actual time=10.657..17.021 rows=1000 loops=1)"
"                                Filter: (rating IS NULL)"
"                                Rows Removed by Filter: 24110"
"              ->  Index Scan using "addresses_to_geocode_PatientId_idx" on addresses_to_geocode  (cost=0.43..8.27 rows=1 width=152) (actual time=0.010..0.013 rows=1 loops=1000)"
"                    Index Cond: ("PatientId" = a.patientid)"
"        ->  Materialize  (cost=0.43..18.22 rows=1000 width=96) (actual time=100.233..324594.558 rows=943 loops=1)"
"              ->  Subquery Scan on g  (cost=0.43..15.72 rows=1000 width=96) (actual time=100.230..324593.435 rows=943 loops=1)"
"                    ->  Limit  (cost=0.43..5.72 rows=1000 width=42) (actual time=100.225..324591.603 rows=943 loops=1)"
"                          ->  Index Scan using "addresses_to_geocode_PatientId_idx" on addresses_to_geocode ag  (cost=0.43..23534259.93 rows=4449758000 width=42) (actual time=100.225..324591.146 rows=943 loops=1)"
"                                Filter: (rating IS NULL)"
"                                Rows Removed by Filter: 24110"
"Total runtime: 363608.316 ms"

See better breakdown at http://explain.depesz.com/s/vogS

Best Answer

I've spent a lot of time experimenting with this; I think it's better to post my findings separately since they come at the problem from a different angle.

This is really a complex topic; see my blog post about the geocoding server setup and the script I used for more details. Here is a brief summary:

A server loaded with data for only 2 states is always faster than a server loaded with data for all 50 states.

I verified this on my home PC at different times and on two different Amazon AWS servers.

My AWS free-tier server with 2 states of data has only 1 GB of RAM, but it gives consistent 43~59 ms performance for datasets of 1,000 records and 45,000 records.

I used exactly the same setup procedure on an 8 GB RAM AWS server with all states loaded, with exactly the same script and data, and the performance dropped to 80~105 ms.

My theory is that when the geocoder cannot match an address exactly, it starts to broaden the search range and ignore some parts, like the zip code or city. That's why the geocode documentation boasts that it can recognize an address with the wrong zip code, although that lookup took 3000 ms.

With data for only 2 states loaded, the server takes much less time on a fruitless search or a match with a very low score, because it can only search within 2 states.

I tried to limit this by setting the restrict_region parameter of the geocode function to the state multipolygons, hoping that would avoid the fruitless searching, since I'm pretty sure most of my addresses have the correct state. Compare these two versions:

  select geocode('501 Fairmount DR , Annapolis, MD 20137',1); 
  select geocode('501 Fairmount DR , Annapolis, MD 20137', 1, the_geom) from tiger.state where statefp = '24';

The only difference the second version makes is that, normally, if I run the same query again immediately it is much quicker because the related data has been cached; the second version disables that effect.

So restrict_region is not working as I hoped; maybe it is only used to filter multi-hit results, not to limit the search range.

You can tune your Postgres config a little bit.

The usual suspects of installing missing indexes and vacuum analyze didn't make any difference for me, because the download script has already done the necessary maintenance, unless you messed something up.

However, setting the Postgres config according to this post did help. My full-scale server with 50 states was taking 320 ms with the default configuration for some badly shaped data; it improved to 185 ms with 2 GB shared_buffers and a 5 GB effective cache size, and went further to 100 ms with most settings tuned according to that post.
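
As a rough sketch, the postgresql.conf change meant here is on the order of the following (these are the 2 GB / 5 GB values from my 8 GB server; scale them to your own RAM, and the remaining knobs from that post are tuned in the same way):

shared_buffers = 2GB
effective_cache_size = 5GB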

This one is more specific to PostGIS, and its suggested settings seemed to be similar.

The batch size for each commit didn't matter much in my case. The geocode documentation uses a batch size of 3. I experimented with values of 1, 3, 5, up to 10, and didn't find any significant difference. With a smaller batch size you make more commits and updates, but I think the real bottleneck is not there. Actually, I'm using a batch size of 1 now. Because there are always some unexpected, ill-formed addresses that cause exceptions, I mark the whole batch that errored as ignored and proceed with the remaining rows. With a batch size of 1, I don't need to process the table a second time to geocode the possibly good records in a batch that was marked as ignored.

Of course this depends on how your batch script works. I'll post my script with more details later.
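
In the meantime, here is a minimal sketch of the batch-size-1 idea, written against the question's addresses_to_geocode table (its column names are taken from the question; using -1 as the "ignored/failed" marker is my own convention):

DO $$
DECLARE
    r        RECORD;
    v_rating integer;
    v_geom   geometry;
BEGIN
    FOR r IN
        SELECT "PatientId", "Address"
        FROM addresses_to_geocode
        WHERE rating IS NULL
        ORDER BY "PatientId"
        LIMIT 1000
    LOOP
        BEGIN
            -- one geocode per row; max_results = 1 keeps only the best hit
            SELECT g.rating, g.geomout
              INTO v_rating, v_geom
              FROM geocode(r."Address", 1) AS g;

            UPDATE addresses_to_geocode
               SET rating    = COALESCE(v_rating, -1),
                   longitude = ST_X(v_geom)::numeric(8,5),
                   latitude  = ST_Y(v_geom)::numeric(8,5)
             WHERE "PatientId" = r."PatientId";
        EXCEPTION WHEN OTHERS THEN
            -- an ill-formed address costs only this one row, not the whole batch
            UPDATE addresses_to_geocode
               SET rating = -1
             WHERE "PatientId" = r."PatientId";
        END;
    END LOOP;
END $$;

A row that errors just gets rating = -1 and the loop moves on, which is the "mark it ignored and keep going" behavior described above.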

You can try using normalize_address to filter out bad addresses, if it suits your usage. I saw somebody mention this somewhere, but I wasn't sure how it would work, since the normalize function only works on the format; it cannot really tell you which addresses are invalid.

Later I realized that if an address is in obviously bad shape and you want to skip it, this can help. For example, I have lots of addresses missing the street number or even the street name. Normalizing all addresses first is relatively fast; then you can filter out the obviously bad addresses and skip them. However, this didn't suit my usage, since an address without a street number or even a street name can still be mapped to the street or city, and that information is still useful to me.
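
A minimal sketch of that pre-filter, written against the question's addresses_to_geocode table (the column names come from the question; treating a missing house number or street name as "obviously bad" is just my assumption):

SELECT "PatientId", "Address"
FROM addresses_to_geocode
WHERE (normalize_address("Address")).address IS NULL
   OR (normalize_address("Address")).streetname IS NULL;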

And most of the addresses that cannot be geocoded in my case actually have all the fields; there is just no match in the database. You cannot filter these addresses out merely by normalizing them.

EDIT: For more details, see my blog post about the geocoding server setup and the script I used.

EDIT 2: I've finished geocoding 2 million addresses and did a lot of cleanup on the addresses based on the geocoding results. With better-cleaned input, the next batch job runs much faster. By "clean" I mean that some addresses are obviously wrong and should be removed, or have unexpected content that causes the geocoder problems. My theory is: removing bad addresses avoids messing up the cache, which improves performance on the good addresses significantly.

I separated the input by state to make sure every job can have all the data needed for geocoding cached in RAM. However, every bad address in a job makes the geocoder search across more states, which can mess up the cache.
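
A sketch of that per-state split, again using the question's table for illustration; the state_abbrev helper column is hypothetical and is derived with normalize_address:

ALTER TABLE addresses_to_geocode ADD COLUMN state_abbrev text;

UPDATE addresses_to_geocode
   SET state_abbrev = (normalize_address("Address")).stateabbrev;

-- each geocoding session then restricts itself to a single state's rows, e.g.
SELECT count(*)
  FROM addresses_to_geocode
 WHERE state_abbrev = 'MD' AND rating IS NULL;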