[GIS] Does the geocode() example in the PostGIS Tiger geocoder intentionally geocode all records first and only limit later?

geocoding, postgis, tiger

The geocode function in the PostGIS Tiger geocoder extension has an example of batch geocoding, which seems to be used by many people. However, I'm not sure whether the behavior I observed with this example is intentional.

UPDATE addresses_to_geocode
SET (rating, new_address, lon, lat)
  = ( COALESCE((g.geo).rating, -1), pprint_addy((g.geo).addy),
      ST_X((g.geo).geomout)::numeric(8,5), ST_Y((g.geo).geomout)::numeric(8,5) )
FROM (SELECT addid
      FROM addresses_to_geocode
      WHERE rating IS NULL ORDER BY addid LIMIT 3) AS a
LEFT JOIN (SELECT addid, geocode(address, 1) AS geo
           FROM addresses_to_geocode AS ag
           WHERE ag.rating IS NULL ORDER BY addid LIMIT 3) AS g ON a.addid = g.addid
WHERE a.addid = addresses_to_geocode.addid;

The example code is supposed to geocode only 3 addresses that have not yet been geocoded and then update the table, so you can run it again and again, updating 3 rows each time.

With a 10-address table, the example code takes 18 seconds to run (I know the performance of my setup is not good; it used to be much better when I only had data for 3 states on an SSD, but now I have 100 GB of data on a regular hard drive and it is much slower), and 10 seconds on the 2nd run.

With a 20-address table, it takes 45 s on the first run and 40 s on the 2nd run. With a 500-address table, it kept running until I cancelled it.

This makes me believe that the example code actually geocodes all rows in the table first, then picks the first 3 results to update the table.

I observed this kind of behavior earlier with LIMIT:

SELECT geocode(address_string,1) FROM address_sample LIMIT 4;

takes much longer if the table is big, while this query always takes a similar time no matter how big the table is:

SELECT geocode(sample.address_string, 1) 
    FROM (SELECT address_string FROM address_sample LIMIT 4) as sample;
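
If you want to make that intent explicit, the same subset-first behavior can also be written with a materialized CTE. This is only a sketch of an equivalent query, assuming PostgreSQL 12 or later for the MATERIALIZED keyword (older versions materialize CTEs by default):

-- Materialize the 4-row sample first, then geocode only those rows
WITH sample AS MATERIALIZED (
    SELECT address_string
    FROM address_sample
    LIMIT 4
)
SELECT geocode(sample.address_string, 1)
FROM sample;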

So I modified the example code into this:

UPDATE address_table
SET (rating, output_address, lon, lat, geomout)
  = ( COALESCE((a.geo).rating, -1), pprint_addy((a.geo).addy),
      ST_X((a.geo).geomout)::numeric(8,5), ST_Y((a.geo).geomout)::numeric(8,5),
      (a.geo).geomout )
FROM (SELECT sample.addid, geocode(sample.input_address, 1) AS geo
      FROM (SELECT addid, input_address
            FROM address_table
            WHERE rating IS NULL ORDER BY addid LIMIT 3) AS sample
     ) AS a
WHERE a.addid = address_table.addid;

Now it always runs in 5 seconds, no matter how big the table is. However, there is a new problem: the first row in my table is a bad address. The original code assigns a rating of -1 when it cannot find a match, so later runs skip this bad address. My modified version doesn't update the rating column to -1, so every run tries to geocode it and then skips it.

In summary, I actually have two questions:

  1. Is the behavior I observed in the example code intentional? The documentation says

    for large numbers of addresses you don't want to update all at once,
    since the whole geocode must commit at once

I'm not sure what this means. If the whole table is always geocoded anyway, why limit the output?

  2. How can I make my modified version set the rating column of a bad address to -1?

EDIT: I solved my 2nd question.

The LEFT JOIN in the example code is needed to return a result table in which bad addresses get a rating of -1. The new code runs in 3~4 seconds for 3 rows every time, no matter what the table size is.

UPDATE address_table
SET (rating, output_address, lon, lat, geomout)
  = ( COALESCE((g.geo).rating, -1), pprint_addy((g.geo).addy),
      ST_X((g.geo).geomout)::numeric(8,5), ST_Y((g.geo).geomout)::numeric(8,5),
      (g.geo).geomout )
FROM (SELECT addid
      FROM address_table
      WHERE rating IS NULL ORDER BY addid LIMIT 3) AS a
LEFT JOIN (SELECT sample.addid, geocode(sample.input_address, 1) AS geo
           FROM (SELECT addid, input_address
                 FROM address_table
                 WHERE rating IS NULL ORDER BY addid LIMIT 3) AS sample
          ) AS g ON a.addid = g.addid
WHERE a.addid = address_table.addid;

EDIT 2: I ran the vacuum commands following the suggestion of @LR1234567. They didn't improve the general geocoding performance, but the example code now performs similarly to my modified version. Maybe the usage in the example code depends on the Tiger schema being cleaned up?

EDIT 3: The example code still has problems.
My data has an invalid address (with an incomplete zip code) in the first row. The example code takes forever to run if started from the first row, and that time is directly related to the table size: 271 seconds for a 100-row table. Once this invalid row has been processed and marked as -1, the example code can process the normal rows in reasonable time. My modified code, however, processes the invalid row within 4 seconds.

The only difference between the example code and my version is that I subset the table in the FROM clause instead of the WHERE clause.
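
To check which plan you are getting on your own data, you can run EXPLAIN ANALYZE on either version of the UPDATE. Here is a minimal sketch using my modified statement from above; note that EXPLAIN ANALYZE actually executes the UPDATE, so the surrounding transaction and ROLLBACK are just a way to keep the test run from changing the table:

BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE address_table
SET (rating, output_address, lon, lat, geomout)
  = ( COALESCE((a.geo).rating, -1), pprint_addy((a.geo).addy),
      ST_X((a.geo).geomout)::numeric(8,5), ST_Y((a.geo).geomout)::numeric(8,5),
      (a.geo).geomout )
FROM (SELECT sample.addid, geocode(sample.input_address, 1) AS geo
      FROM (SELECT addid, input_address
            FROM address_table
            WHERE rating IS NULL ORDER BY addid LIMIT 3) AS sample
     ) AS a
WHERE a.addid = address_table.addid;
ROLLBACK;  -- discard the test update so the rows stay un-geocoded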

EDIT 4: Here are the EXPLAIN ANALYZE results, which verify my observation:

  1. The example code run for the first time on a 100-row table, with the first-row address invalid:

The first step, the scan, took 284 s to geocode 99 rows.

The second step, the limit to 3 rows, happened afterwards, which is too late.

  2. My modified code on the same table:

The first step, the geocoding, only processed 3 rows.

The second step, the limit, has the same start and end time as the first step, so they happened together.

  3. After the first row with the invalid address was marked with -1, running the example code again for the other rows:

The rows were limited before geocoding.

So this could be related to how PostgreSQL plans the query when there is an invalid row. Maybe other people didn't have an invalid row at the top, so they didn't notice this problem. I sorted my input addresses by zip code, which is why the invalid row appeared at the top.

Best Answer

Hopefully this answers your questions.

1) You don't want to geocode a whole table at once because of the way Postgres works. All work in an UPDATE gets committed as a single transaction, which means two things:

a) If, for whatever reason, your update crashes in the middle, you lose all the work already done in that update.

b) Since Postgres commits everything as a single transaction, lots of memory and resources are held up for the duration of the update, which beyond a certain point slows things down.
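
As an illustration of committing in batches, here is a minimal sketch of a procedure that repeats the 3-row update and commits after each batch. It assumes PostgreSQL 11+ (where procedures may COMMIT between statements) and reuses the table and column names from your corrected query (address_table, input_address, and so on); the procedure name geocode_in_batches is just an example:

CREATE OR REPLACE PROCEDURE geocode_in_batches(batch_size integer DEFAULT 3)
LANGUAGE plpgsql
AS $$
DECLARE
    rows_done integer;
BEGIN
    LOOP
        -- Same statement as the corrected version in the question:
        -- the LEFT JOIN marks unmatchable addresses with rating = -1,
        -- so they are not picked up again by the next batch.
        UPDATE address_table
        SET (rating, output_address, lon, lat, geomout)
          = ( COALESCE((g.geo).rating, -1), pprint_addy((g.geo).addy),
              ST_X((g.geo).geomout)::numeric(8,5), ST_Y((g.geo).geomout)::numeric(8,5),
              (g.geo).geomout )
        FROM (SELECT addid
              FROM address_table
              WHERE rating IS NULL ORDER BY addid LIMIT batch_size) AS a
        LEFT JOIN (SELECT sample.addid, geocode(sample.input_address, 1) AS geo
                   FROM (SELECT addid, input_address
                         FROM address_table
                         WHERE rating IS NULL ORDER BY addid LIMIT batch_size) AS sample
                  ) AS g ON a.addid = g.addid
        WHERE a.addid = address_table.addid;

        GET DIAGNOSTICS rows_done = ROW_COUNT;
        EXIT WHEN rows_done = 0;

        COMMIT;  -- each batch is its own transaction, so a crash only loses the current batch
    END LOOP;
END;
$$;

-- CALL geocode_in_batches(3);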

2) Your timings seem pretty bad. Did you run VACUUM ANALYZE across the whole hierarchy, as detailed in http://postgis.net/docs/postgis_installation.html#install_tiger_geocoder_extension ?

SELECT install_missing_indexes();
vacuum analyze verbose tiger.addr;
vacuum analyze verbose tiger.edges;
vacuum analyze verbose tiger.faces;
vacuum analyze verbose tiger.featnames;
vacuum analyze verbose tiger.place;
vacuum analyze verbose tiger.cousub;
vacuum analyze verbose tiger.county;
vacuum analyze verbose tiger.state;