[GIS] Neighbors within distance from every row

large datasetsnearest neighborpostgispostgis-2.0postgresql

I have a research project of 180,000 entries and I need to calculate the number of neighbors (points from the same table) that exist within a 300 meter radius for each point. My method is very slow, and I'm asking if there is a better way.

My initial thought was to left-join the table to itself, then calculate ST_Distance_Sphere for every row of the result table, then aggregate count the number of entries that are less than 300 meters – but this is turning out to be extremely slow. The initial JOIN in particular is taking more than an hour to complete (it's still running).

SELECT source_id as id, sum(case when dist_meters < 305 THEN 1 ELSE 0 END) as neighbors_within_distance FROM
(SELECT 
    a.id as source_id,
    a.biz_lat as source_lat,
    a.biz_lon as source_lon,
    b.id as match_id, 
    b.biz_lat,
    b.biz_lon,
    ST_Distance_Sphere(ST_MakePoint(a.biz_lon, a.biz_lat), ST_MakePoint(b.biz_lon,b.biz_lat)) as dist_meters
FROM a LEFT JOIN a as b ON 1=1) as processed group by source_id

I don't currently have any indexes, and the lat/lon columns are stored as floats. I could perform a KNN analysis first to select the nearest 500 points, and then filter the smaller data set down to only neighbors that meet the 300 meter criteria – but I'm not sure how to do that quickly.

Is there a better way to solve this problem?

EDIT : Here are my setup queries and example data rows

DROP TABLE IF EXISTS a;
CREATE TABLE a (
    id              integer,
    name            text,
    street_address  text,
    city            text,
    state           text,
    zip             text,
    county          text,
    telephone       text,
    fax             text,
    email           text,
    web             text,
    type            text,
    num_employees   text,
    sic_1           text,
    sic_2           text,
    headquarters    text,
    revenue         text,
    biz_lat         float,
    biz_lon         float,
    class           text,
    encoding_type   text,
    bbox1           float,
    bbox2           float,
    bbox3           float,
    bbox4           float,
    display         text,
    source          text
);

An example row from the dataset

 ID Company Name    Street Address  City    State/Province  Postal Code County  Telephone Number    Fax Number  Company Email Address   URL/Web Address Company Type    No. of Employees    Primary SIC Code 1  Primary SIC Code 2  Headquarters    Sales/Revenue   lat lon class   type    bbox1   bbox2   bbox3   bbox4   display source
54776   26430. bizname  123 fake    Seguin  Texas   78155   El Paso (830) 555-3910              PRIVATE - PARENT    6   7359    Equipment rental & leasin, nec  Headquarters    1,098,000 (USD) 29.56570181 -97.94911784     place   house  29.56565181 29.56575181 -97.94916784    -97.94906784     123 Fake Seguin Guadalupe County  Texas 78155  United States of America    

Best Answer

You have pretty much suggested the answer to your own question already, which is that you don't have any indexes.

I would suggest that you convert your lat/lon columns to a geography (point) data type, add a spatial index, and rewrite the query to use ST_DWithin, which uses a spatial index, if available, and works with geometries or geographies.

ALTER TABLE a ADD column geom geography;
UPDATE a SET geom = ST_MakePoint(biz_lon, biz_lat);
CREATE INDEX ix_spatial ON a USING GIST (geom);

SELECT a.id, b.id
FROM a, a as b
WHERE ST_DWITHIN (a.geom, b.geom, 300) 
AND a.id != b.id;

This kind of construct is essentially a Cartesian product, or cross (spatial self) join, where ST_DWithin and the index dramatically reducing the search space. The a, b is short hand for a CROSS JOIN b. Note the a.id != b.id to avoid comparing a point with itself. Also, note that the 3rd form of ST_DWithin allows you to set a 4th parameter, use_spheroid to true, which will give slightly less accurate results, but be somewhat quicker, as it avoids the far more complex maths required to do spheroid calculations.

Overall, I would expect ST_DWithin on indexed geometries to be orders of magnitude faster than what you have.

If you want to get a list of all the points that are within 300 meters of every other point, you can use the array_agg function around b.id, eg,

SELECT a.id, array_to_string(array_agg(b.id), ',')
FROM a CROSS JOIN b
WHERE ST_DWITHIN (a.geom, b.geom, 300) 
AND a.id != b.id
GROUP BY a.id;

which will now give you comma separated list of all those ids from b that are within 300 meters for each id in a.