[GIS] Does the geocode() example in the PostGIS Tiger geocoder intentionally geocode all records first and only limit later?

geocoding, postgis, tiger

The geocode function in the PostGIS Tiger geocoder extension has an example of batch geocoding, which seems to be used by many people. However, I'm not sure whether the behavior I observed with this example is intentional.

UPDATE addresses_to_geocode
SET (rating, new_address, lon, lat)
  = ( COALESCE((g.geo).rating, -1), pprint_addy((g.geo).addy),
      ST_X((g.geo).geomout)::numeric(8,5), ST_Y((g.geo).geomout)::numeric(8,5) )
FROM (SELECT addid
      FROM addresses_to_geocode
      WHERE rating IS NULL ORDER BY addid LIMIT 3) AS a
LEFT JOIN (SELECT addid, geocode(address, 1) AS geo
           FROM addresses_to_geocode AS ag
           WHERE ag.rating IS NULL ORDER BY addid LIMIT 3) AS g ON a.addid = g.addid
WHERE a.addid = addresses_to_geocode.addid;

The example code is supposed to geocode only 3 addresses that have not yet been geocoded and then update the table, so you can run it again and again, updating 3 rows each time.

With a 10-address table, the example code takes 18 seconds to run (I know the performance of my setup is not good; it used to be much better when I only had data for 3 states on an SSD, but now I have 100 GB of data on a regular hard drive and it is much slower), and 10 seconds on the 2nd run.

With a 20-address table, it takes 45 s on the first run and 40 s on the 2nd run. With a 500-address table, it kept running until I cancelled it.

This makes me believe that the example code actually geocodes all rows in the table first, then picks the first 3 results to update the table.

I observed this kind of behavior earlier with LIMIT:

SELECT geocode(address_string,1) FROM address_sample LIMIT 4;

takes much longer if the table is big, while this query always takes a similar time no matter how big the table is:

SELECT geocode(sample.address_string, 1) 
    FROM (SELECT address_string FROM address_sample LIMIT 4) as sample;
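
If you want to make that intent explicit, the same subset-first behavior can also be written with a materialized CTE. This is only a sketch of an equivalent query, assuming PostgreSQL 12 or later for the MATERIALIZED keyword (older versions materialize CTEs by default):

-- Materialize the 4-row sample first, then geocode only those rows
WITH sample AS MATERIALIZED (
    SELECT address_string
    FROM address_sample
    LIMIT 4
)
SELECT geocode(sample.address_string, 1)
FROM sample;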

So I modified the example code into this:

UPDATE address_table
SET (rating, output_address, lon, lat, geomout)
  = ( COALESCE((a.geo).rating, -1), pprint_addy((a.geo).addy),
      ST_X((a.geo).geomout)::numeric(8,5), ST_Y((a.geo).geomout)::numeric(8,5),
      (a.geo).geomout )
FROM (SELECT sample.addid, geocode(sample.input_address, 1) AS geo
      FROM (SELECT addid, input_address
            FROM address_table
            WHERE rating IS NULL ORDER BY addid LIMIT 3) AS sample
     ) AS a
WHERE a.addid = address_table.addid;

Now it always runs in 5 seconds, no matter how big the table is. However, there is a new problem: the first row in my table is a bad address. The original code assigns a rating of -1 when it cannot find a match, so later runs skip this bad address. My modified version doesn't update the rating column to -1, so every run tries to geocode it and then skips it.

In summary, I actually have two questions:

  1. Is the behavior I observed in the example code intentional? The documentation says

    for large numbers of addresses you don't want to update all at once,
    since the whole geocode must commit at once

I'm not sure what this means. If the whole table is always geocoded anyway, why limit the output?

  2. How can I make my modified version set the rating column of a bad address to -1?

EDIT: I solved my 2nd question.

The LEFT JOIN in the example code is needed to return a result table in which bad addresses get a rating of -1. The new code runs in 3~4 seconds for 3 rows every time, no matter what the table size is.

UPDATE address_table
SET (rating, output_address, lon, lat, geomout)
  = ( COALESCE((g.geo).rating, -1), pprint_addy((g.geo).addy),
      ST_X((g.geo).geomout)::numeric(8,5), ST_Y((g.geo).geomout)::numeric(8,5),
      (g.geo).geomout )
FROM (SELECT addid
      FROM address_table
      WHERE rating IS NULL ORDER BY addid LIMIT 3) AS a
LEFT JOIN (SELECT sample.addid, geocode(sample.input_address, 1) AS geo
           FROM (SELECT addid, input_address
                 FROM address_table
                 WHERE rating IS NULL ORDER BY addid LIMIT 3) AS sample
          ) AS g ON a.addid = g.addid
WHERE a.addid = address_table.addid;

EDIT 2: I ran the vacuum commands following the suggestion of @LR1234567. They didn't improve the general geocoding performance, but the example code now performs similarly to my modified version. Maybe the usage in the example code depends on the Tiger schema being cleaned up?

EDIT 3: The example code still has problems.
My data has an invalid address (with an incomplete zip code) in the first row. The example code takes forever to run if started from the first row, and that time is directly related to the table size: 271 seconds for a 100-row table. Once this invalid row has been processed and marked as -1, the example code can process the normal rows in reasonable time. My modified code, however, processes the invalid row within 4 seconds.

The only difference between the example code and my version is that I subset the table in the FROM clause instead of the WHERE clause.
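
To check which plan you are getting on your own data, you can run EXPLAIN ANALYZE on either version of the UPDATE. Here is a minimal sketch using my modified statement from above; note that EXPLAIN ANALYZE actually executes the UPDATE, so the surrounding transaction and ROLLBACK are just a way to keep the test run from changing the table:

BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE address_table
SET (rating, output_address, lon, lat, geomout)
  = ( COALESCE((a.geo).rating, -1), pprint_addy((a.geo).addy),
      ST_X((a.geo).geomout)::numeric(8,5), ST_Y((a.geo).geomout)::numeric(8,5),
      (a.geo).geomout )
FROM (SELECT sample.addid, geocode(sample.input_address, 1) AS geo
      FROM (SELECT addid, input_address
            FROM address_table
            WHERE rating IS NULL ORDER BY addid LIMIT 3) AS sample
     ) AS a
WHERE a.addid = address_table.addid;
ROLLBACK;  -- discard the test update so the rows stay un-geocoded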

EDIT 4: Here are the EXPLAIN ANALYZE results, which verify my observation:

  1. The example code run for the first time on a 100-row table, with the first-row address invalid:

The first step, the scan, took 284 s to geocode 99 rows.

The second step, the limit to 3 rows, happened afterwards, which is too late.

  2. My modified code on the same table:

The first step, the geocoding, only processed 3 rows.

The second step, the limit, has the same start and end time as the first step, so they happened together.

  3. After the first row with the invalid address was marked with -1, running the example code again for the other rows:

The rows were limited before geocoding.

So this could be related to how PostgreSQL plans the query when there is an invalid row. Maybe other people didn't have an invalid row at the top, so they didn't notice this problem. I sorted my input addresses by zip code, which is why the invalid row appeared at the top.

Best Answer

Hopefully this answers your questions.

1) You don't want to geocode a whole table at once because of the way Postgres works. All work in an UPDATE gets committed as a single transaction, which means two things:

a) If, for whatever reason, your update crashes in the middle, you lose all the work already done in that update.

b) Since Postgres commits everything as a single transaction, lots of memory and resources are held up for the duration of the update, which beyond a certain point slows things down.
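
As an illustration of committing in batches, here is a minimal sketch of a procedure that repeats the 3-row update and commits after each batch. It assumes PostgreSQL 11+ (where procedures may COMMIT between statements) and reuses the table and column names from your corrected query (address_table, input_address, and so on); the procedure name geocode_in_batches is just an example:

CREATE OR REPLACE PROCEDURE geocode_in_batches(batch_size integer DEFAULT 3)
LANGUAGE plpgsql
AS $$
DECLARE
    rows_done integer;
BEGIN
    LOOP
        -- Same statement as the corrected version in the question:
        -- the LEFT JOIN marks unmatchable addresses with rating = -1,
        -- so they are not picked up again by the next batch.
        UPDATE address_table
        SET (rating, output_address, lon, lat, geomout)
          = ( COALESCE((g.geo).rating, -1), pprint_addy((g.geo).addy),
              ST_X((g.geo).geomout)::numeric(8,5), ST_Y((g.geo).geomout)::numeric(8,5),
              (g.geo).geomout )
        FROM (SELECT addid
              FROM address_table
              WHERE rating IS NULL ORDER BY addid LIMIT batch_size) AS a
        LEFT JOIN (SELECT sample.addid, geocode(sample.input_address, 1) AS geo
                   FROM (SELECT addid, input_address
                         FROM address_table
                         WHERE rating IS NULL ORDER BY addid LIMIT batch_size) AS sample
                  ) AS g ON a.addid = g.addid
        WHERE a.addid = address_table.addid;

        GET DIAGNOSTICS rows_done = ROW_COUNT;
        EXIT WHEN rows_done = 0;

        COMMIT;  -- each batch is its own transaction, so a crash only loses the current batch
    END LOOP;
END;
$$;

-- CALL geocode_in_batches(3);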

2) Your timings seem pretty bad. Did you run VACUUM ANALYZE across the whole hierarchy, as detailed in http://postgis.net/docs/postgis_installation.html#install_tiger_geocoder_extension ?

SELECT install_missing_indexes();
vacuum analyze verbose tiger.addr;
vacuum analyze verbose tiger.edges;
vacuum analyze verbose tiger.faces;
vacuum analyze verbose tiger.featnames;
vacuum analyze verbose tiger.place;
vacuum analyze verbose tiger.cousub;
vacuum analyze verbose tiger.county;
vacuum analyze verbose tiger.state;