[GIS] Geopandas performance appears quite slow

I'd like to confirm the performance I am observing when using distance on a GeoPandas GeoSeries.

Problem

Operations take far longer than the comparable operation in PostGIS. I would like to know whether this is a known issue and, if so, whether there are ways to make GeoPandas more performant, particularly for geometric operations such as buffering.

Goal

For each geometry in a set, I would like to calculate an aggregate (e.g. sum()) over that geometry plus all others that fall within a given distance of it.

Example

Given a set of geometries, I would like to build a quarter-mile (402 meter) buffer around each one and sum the attribute du (dwelling units) over the geometries that fall within it.

Current Strategy

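Current method, using only the centroids in an attempt to be performant:

# df is a GeoDataFrame with a 'wkb_geometry' column and a numeric 'du' column
# precompute the centroids of every geometry once
centroids = df['wkb_geometry'].centroid

def test_measure(row):
    # centroid of the reference geometry for this row
    center = row['wkb_geometry'].centroid
    # sum du over every row whose centroid lies within 402 m of it
    return df.loc[centroids.distance(center) < 402, 'du'].sum()

df.apply(test_measure, axis=1)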

Precomputing the centroids and then using distance does appear to be more efficient than buffering and using within. Regardless of method, the cost grows as n^2, where n is the row count: test_measure runs once for each of the n rows, and each run compares the reference geometry against all n rows.
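To make the quadratic growth concrete, here is a small sketch that counts the pairwise distance evaluations implied at each row count mentioned in this question. It is plain arithmetic rather than a benchmark, and the helper name pairwise_evaluations is made up for illustration.

```python
def pairwise_evaluations(n_rows: int) -> int:
    """Distance checks when each of n rows is compared against all n rows."""
    return n_rows * n_rows


# Row counts taken from the timings quoted in the question.
for n in (100, 1_000, 24_000):
    print(f"{n:>6} rows -> {pairwise_evaluations(n):>12,} distance evaluations")
```

Going from 100 to 1,000 rows multiplies the work by 100, which is the same order of magnitude as the roughly 65-fold jump in the timings reported below (0.17 s to 11.11 s).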

Times:

100 rows: 0.17s
1000 rows: 11.11s

Prior strategy

Previously, with buffering and within, it took about:

100 rows: 1s
1000 rows: 100s

Prior method, not using centroids:

def test_measure(row):
    buffer_shape = row['wkb_geometry'].buffer(500)
    return df.loc[df['wkb_geometry'].within(buffer_shape), 'du'].sum()

Thoughts

A similar operation in PostGIS (using st_dwithin) runs in the following times.

Times:

100 rows: not run
1000 rows: not run
24000 rows: 75s

For reference, here is an example of that sort of SQL query:

CREATE OR REPLACE FUNCTION agg_within_dist(
    in_id int,
    in_geometry geometry,
    OUT id int,
    OUT du float)
AS
$$
    SELECT 
        $1 AS geography_id, 
        SUM(CAST(ref.du AS float)) AS du
    FROM s1.s1_scenario_final AS REF WHERE st_dwithin($2, ref.wkb_geometry, 402);
$$
COST 10000 
LANGUAGE SQL STABLE strict;


SELECT (f).* 
FROM (
        SELECT agg_within_dist(geography_id, wkb_geometry) AS f
        FROM scenario
     ) s

Best Answer

You probably have an index in your database, but your Python code does not use one (the rtree module might help: http://geoffboeing.com/2016/10/r-tree-spatial-index-python/). This can be a big issue depending on your geometries. Do many points fall into your buffers? You can time each step to see where the time is spent. I guess it will be in the distance < 402 part.
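To illustrate why an index matters, here is a minimal pure-Python sketch of the idea. It is not the rtree module and not GeoPandas' API: it swaps in a simple uniform grid, assumes the centroids are available as plain (x, y, du) tuples in a projected, meter-based coordinate system, and all function names are hypothetical. Each lookup only inspects the 3x3 block of grid cells around the reference point instead of every row.

```python
import math
import random
from collections import defaultdict

RADIUS = 402.0  # quarter mile in meters, as in the question


def build_grid(points, cell=RADIUS):
    """Bucket point indices by the grid cell containing them."""
    grid = defaultdict(list)
    for i, (x, y, _du) in enumerate(points):
        grid[(math.floor(x / cell), math.floor(y / cell))].append(i)
    return grid


def sum_within(points, grid, i, radius=RADIUS, cell=RADIUS):
    """Sum du over points within radius of point i (point i included).

    Only the 3x3 neighbouring cells are checked, which is sufficient
    as long as cell >= radius.
    """
    x, y, _ = points[i]
    cx, cy = math.floor(x / cell), math.floor(y / cell)
    total = 0
    for gx in (cx - 1, cx, cx + 1):
        for gy in (cy - 1, cy, cy + 1):
            for j in grid.get((gx, gy), ()):
                px, py, du = points[j]
                if math.hypot(px - x, py - y) < radius:
                    total += du
    return total


def sum_within_bruteforce(points, i, radius=RADIUS):
    """Reference version: compare point i against every point."""
    x, y, _ = points[i]
    return sum(du for px, py, du in points
               if math.hypot(px - x, py - y) < radius)


# Synthetic data: 500 random centroids in a 5 km square, integer du values.
random.seed(0)
points = [(random.uniform(0, 5000), random.uniform(0, 5000), random.randint(0, 20))
          for _ in range(500)]
grid = build_grid(points)
indexed = [sum_within(points, grid, i) for i in range(len(points))]
```

The grid version returns the same sums as the brute-force comparison while touching only nearby points. An R-tree plays the same role for arbitrary geometries, and it is the kind of structure a database can use behind st_dwithin when a spatial index exists.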

The second thing is that GeoPandas is quite new, and I am not sure how it implements these functions. Usually such a library is a wrapper around C code, as pure Python would otherwise be really slow. PostGIS is older, has had more time for refactoring, and runs entirely in C. Also, the way databases work (row-level memory pages) is optimized for speed when searching for rows (objects).
