I have a research project of 180,000 entries and I need to calculate the number of neighbors (points from the same table) that exist within a 300 meter radius for each point. My method is very slow, and I'm asking if there is a better way.
My initial thought was to left-join the table to itself, then calculate ST_Distance_Sphere for every row of the result table, then aggregate count the number of entries that are less than 300 meters – but this is turning out to be extremely slow. The initial JOIN in particular is taking more than an hour to complete (it's still running).
SELECT source_id as id, sum(case when dist_meters < 305 THEN 1 ELSE 0 END) as neighbors_within_distance FROM
(SELECT
a.id as source_id,
a.biz_lat as source_lat,
a.biz_lon as source_lon,
b.id as match_id,
b.biz_lat,
b.biz_lon,
ST_Distance_Sphere(ST_MakePoint(a.biz_lon, a.biz_lat), ST_MakePoint(b.biz_lon,b.biz_lat)) as dist_meters
FROM a LEFT JOIN a as b ON 1=1) as processed group by source_id
I don't currently have any indexes, and the lat/lon columns are stored as floats. I could perform a KNN analysis first to select the nearest 500 points, and then filter the smaller data set down to only neighbors that meet the 300 meter criteria – but I'm not sure how to do that quickly.
Is there a better way to solve this problem?
EDIT : Here are my setup queries and example data rows
DROP TABLE IF EXISTS a;
CREATE TABLE a (
id integer,
name text,
street_address text,
city text,
state text,
zip text,
county text,
telephone text,
fax text,
email text,
web text,
type text,
num_employees text,
sic_1 text,
sic_2 text,
headquarters text,
revenue text,
biz_lat float,
biz_lon float,
class text,
encoding_type text,
bbox1 float,
bbox2 float,
bbox3 float,
bbox4 float,
display text,
source text
);
An example row from the dataset
ID Company Name Street Address City State/Province Postal Code County Telephone Number Fax Number Company Email Address URL/Web Address Company Type No. of Employees Primary SIC Code 1 Primary SIC Code 2 Headquarters Sales/Revenue lat lon class type bbox1 bbox2 bbox3 bbox4 display source
54776 26430. bizname 123 fake Seguin Texas 78155 El Paso (830) 555-3910 PRIVATE - PARENT 6 7359 Equipment rental & leasin, nec Headquarters 1,098,000 (USD) 29.56570181 -97.94911784 place house 29.56565181 29.56575181 -97.94916784 -97.94906784 123 Fake Seguin Guadalupe County Texas 78155 United States of America
Best Answer
You have pretty much suggested the answer to your own question already, which is that you don't have any indexes.
I would suggest that you convert your lat/lon columns to a geography (point) data type, add a spatial index, and rewrite the query to use ST_DWithin, which uses a spatial index, if available, and works with geometries or geographies.
This kind of construct is essentially a Cartesian product, or cross (spatial self) join, where ST_DWithin and the index dramatically reducing the search space. The a, b is short hand for a CROSS JOIN b. Note the
a.id != b.id
to avoid comparing a point with itself. Also, note that the 3rd form ofST_DWithin
allows you to set a 4th parameter,use_spheroid
to true, which will give slightly less accurate results, but be somewhat quicker, as it avoids the far more complex maths required to do spheroid calculations.Overall, I would expect ST_DWithin on indexed geometries to be orders of magnitude faster than what you have.
If you want to get a list of all the points that are within 300 meters of every other point, you can use the
array_agg
function around b.id, eg,which will now give you comma separated list of all those ids from b that are within 300 meters for each id in a.