[GIS] How to vectorize creating a Shapely Polygon in Pandas

geopandas optimization pandas python shapely

I have a GeoPandas DataFrame with a polygon in each row. I want to add a column with coordinates of a bounding box of each polygon. I can do it this way:

from shapely.geometry import Polygon

def create_bbox(row):
    # bounds is a (minx, miny, maxx, maxy) tuple
    xmin, ymin, xmax, ymax = row.geometry.bounds
    return Polygon.from_bounds(xmin, ymin, xmax, ymax)

osm_buildings['bbox'] = osm_buildings.apply(lambda row: create_bbox(row), axis=1)

However, because of the size of the dataset, I need to speed this process up. I want to use vectorization. What I've tried is this:

osm_buildings['bbox'] = Polygon.from_bounds(
     osm_buildings.geometry.bounds.minx, 
     osm_buildings.geometry.bounds.miny, 
     osm_buildings.geometry.bounds.maxx, 
     osm_buildings.geometry.bounds.maxy)

However, I get

*** ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), 
a.any() or a.all().

I don't really understand which "truth value" the error is referring to, or how I can fix it.

Question: How can I make this operation faster than using apply()? I am asking specifically about vectorization but if there is something else to speed it up I'm interested as well.

Best Answer

A simple thing to try when speeding up an apply call is swifter. Once it is installed (directly, or through pip or conda), using it is as simple as adding

import swifter

And then changing

osm_buildings['bbox'] = osm_buildings.apply(lambda row: create_bbox(row), axis=1)

to

osm_buildings['bbox'] = osm_buildings.swifter.apply(lambda row: create_bbox(row), axis=1)

It attempts to run the function in a vectorized fashion (if possible) and otherwise uses Dask to parallelize the work. It's not magic, but whether or not you manage to vectorize your function, it should at the very least make the most of any free CPUs you may have lying around.

Your vectorization attempt:

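You are passing whole Series to a function that builds a single polygon. osm_buildings.geometry.bounds.minx returns a Series (the minx of every geometry's bounds), whereas Polygon.from_bounds expects four scalars and returns one polygon. The "truth value" in the message refers to pandas refusing to reduce a Series to a single True or False: that ValueError is raised whenever a Series is used where one boolean is expected, which presumably happens somewhere inside the polygon construction when the Series are treated as plain numbers.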
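
To illustrate where this kind of ValueError comes from, here is a minimal sketch using pandas alone. The Series values are made up, and the `if` statement only stands in for whatever scalar check happens during polygon construction; where exactly Shapely does this depends on the version.

```python
import pandas as pd

# Made-up stand-ins for osm_buildings.geometry.bounds.minx / .maxx
minx = pd.Series([0.0, 4.0])
maxx = pd.Series([2.0, 6.0])

# Comparing two Series is element-wise and yields a Series of booleans
comparison = minx < maxx

message = None
try:
    # Using that Series where a single True/False is expected fails
    if comparison:
        pass
except ValueError as exc:
    message = str(exc)

print(message)
```

Reducing the Series explicitly, for example with `comparison.all()`, avoids the error, but that would not help here because from_bounds still builds only one polygon.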

The from_bounds method of a shapely Polygon cannot be used inside of a vectorized function.
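
If upgrading is an option, a vectorized route does seem to exist. The sketch below is not part of the timings in this answer and assumes Shapely 2.0 or later, where `shapely.bounds` and `shapely.box` accept arrays. The two triangles are made-up sample data.

```python
import numpy as np
import shapely
from shapely.geometry import Polygon

# Made-up sample geometries standing in for osm_buildings.geometry
geoms = np.array(
    [
        Polygon([(0, 0), (2, 0), (1, 1)]),
        Polygon([(5, 5), (6, 8), (4, 7)]),
    ],
    dtype=object,
)

# One call returns an (n, 4) array of minx, miny, maxx, maxy
bounds = shapely.bounds(geoms)

# shapely.box takes four arrays and returns an array of box polygons
bboxes = shapely.box(bounds[:, 0], bounds[:, 1], bounds[:, 2], bounds[:, 3])
```

On a GeoDataFrame the same idea would read something like `shapely.box(*osm_buildings.geometry.bounds.to_numpy().T)`. GeoPandas also provides `GeoSeries.envelope`, which may be a simpler way to get the same boxes.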

swifter application:

I tested this on a GeoDataFrame of shape (3989589, 6) and found that swifter actually decreases performance. I suspect this is because from_bounds is not vectorizable, so the overhead of splitting up the task is much higher than the cost of the computation itself.

Without vectorization:

A more elegant way to write your current (non-vectorized) implementation is

osm_buildings['bbox'] = osm_buildings.geometry.apply(lambda geom: Polygon.from_bounds(*geom.bounds))

NB: using apply on a GeoSeries (osm_buildings.geometry) instead of the whole geodataframe increases the speed substantially since the amount of data it has to parse is drastically reduced.
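
The star-unpacking in that one-liner works because `bounds` is a plain tuple in the same order as the arguments of from_bounds. A small sketch on a single, made-up geometry:

```python
from shapely.geometry import Polygon

# Hypothetical triangle, used only for illustration
geom = Polygon([(0, 0), (2, 0), (1, 1)])

# bounds is ordered (minx, miny, maxx, maxy), matching from_bounds
bbox = Polygon.from_bounds(*geom.bounds)

print(bbox.bounds)
```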

My computer took 54.7s to run this task (so 13.71 seconds per million rows). How much faster do you need it to run?