[GIS] How to vectorize creating a Shapely Polygon in Pandas

geopandas optimization pandas python shapely

I have a GeoPandas DataFrame with a polygon in each row. I want to add a column with coordinates of a bounding box of each polygon. I can do it this way:

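from shapely.geometry import Polygon

def create_bbox(row):
    # bounds is a (minx, miny, maxx, maxy) tuple for this row's geometry
    xmin, ymin, xmax, ymax = row.geometry.bounds
    return Polygon.from_bounds(xmin, ymin, xmax, ymax)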

osm_buildings['bbox'] = osm_buildings.apply(lambda row: create_bbox(row), axis=1)

However, because of the size of the dataset, I need to speed this process up. I want to use vectorization. What I've tried is this:

osm_buildings['bbox'] = Polygon.from_bounds(
     osm_buildings.geometry.bounds.minx, 
     osm_buildings.geometry.bounds.miny, 
     osm_buildings.geometry.bounds.maxx, 
     osm_buildings.geometry.bounds.maxy)

However, I get

*** ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), 
a.any() or a.all().

I don't really understand which "truth value" the error is referring to, or how I can fix it.

Question: How can I make this operation faster than using apply()? I am asking specifically about vectorization but if there is something else to speed it up I'm interested as well.

Best Answer

A simple way to speed up an apply function is swifter. Once installed (can be installed directly or through pip or conda), it's as simple as adding

import swifter

And then changing

osm_buildings['bbox'] = osm_buildings.apply(lambda row: create_bbox(row), axis=1)

to

osm_buildings['bbox'] = osm_buildings.swifter.apply(lambda row: create_bbox(row), axis=1)

swifter attempts to run the function in a vectorized fashion where possible, and otherwise uses Dask to parallelize it. It's not magic, but whether or not you manage to vectorize your function, this should at the very least make the most of any free CPUs you may have lying around.

Your vectorization attempt:

osm_buildings.geometry.bounds.minx returns a Series (the minx of every geometry's bounds), whereas Polygon.from_bounds expects four scalar values and returns a single polygon. You are therefore trying to build one polygon from four whole Series, and the ValueError is raised when one of those Series ends up being evaluated as a single True/False value.
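
The "truth value" wording comes from pandas rather than from Shapely. Here is a minimal sketch, using made-up minx/maxx values, of how any code that expects a single True/False but receives a Series triggers the same message:

```python
import pandas as pd

# Hypothetical bounds for two geometries
minx = pd.Series([0.0, 5.0])
maxx = pd.Series([2.0, 9.0])

# An element-wise comparison yields a Series of booleans, not one boolean
comparison = minx < maxx

try:
    # Using the Series where a single True/False is expected raises ValueError
    if comparison:
        pass
except ValueError as exc:
    message = str(exc)
    print(message)

# Reducing the Series explicitly gives an unambiguous answer
all_valid = bool(comparison.all())
print(all_valid)
```

Code written for scalars, such as Polygon.from_bounds, can run into this kind of check when it is handed whole columns.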

The from_bounds method of a shapely Polygon cannot be used inside of a vectorized function.
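
If a vectorized route is still wanted, one option that avoids from_bounds altogether is sketched below. It assumes Shapely 2.0 or later, where shapely.box accepts arrays, and uses a tiny made-up GeoDataFrame as a stand-in for osm_buildings. GeoSeries.envelope is shown as an alternative that returns the bounding rectangle of each geometry directly.

```python
import geopandas as gpd
import shapely
from shapely.geometry import Polygon

# Hypothetical stand-in for the real osm_buildings GeoDataFrame
osm_buildings = gpd.GeoDataFrame(
    geometry=[
        Polygon([(0, 0), (2, 1), (1, 3)]),
        Polygon([(5, 5), (6, 8), (9, 6)]),
    ]
)

# bounds is a DataFrame with columns minx, miny, maxx, maxy
bounds = osm_buildings.geometry.bounds

# Assumes Shapely >= 2.0: shapely.box builds one rectangle per array element
boxes = shapely.box(
    bounds["minx"].to_numpy(),
    bounds["miny"].to_numpy(),
    bounds["maxx"].to_numpy(),
    bounds["maxy"].to_numpy(),
)
osm_buildings["bbox"] = gpd.GeoSeries(
    boxes, index=osm_buildings.index, crs=osm_buildings.crs
)

# Alternative: let GeoPandas compute the bounding rectangles itself
envelopes = osm_buildings.geometry.envelope
```

I have not timed either variant against the apply version on the dataset mentioned here, so treat any speedup as something to measure yourself.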

swifter application:

I tested this on a GeoDataFrame of shape (3989589, 6) and found that swifter actually decreases performance. I suspect this is because from_bounds is not vectorizable, so the overhead of splitting up the task is much higher than the cost of the computation itself.

Without vectorization:

A more elegant way to write your current (non-vectorized) implementation is

osm_buildings['bbox'] = osm_buildings.geometry.apply(lambda geom: Polygon.from_bounds(*geom.bounds))

NB: using apply on a GeoSeries (osm_buildings.geometry) instead of the whole GeoDataFrame increases the speed substantially, since the amount of data it has to process is drastically reduced.

My computer took 54.7s to run this task (so 13.71 seconds per million rows). How much faster do you need it to run?

Related Solutions

[GIS] Speed up row-wise point in polygon with Geopandas

I would recommend using either the geopandas-cython branch or pygeos.

If you use pygeos, I would recommend converting the geometries from shapely to the pygeos version first for the best speedups.

[GIS] Assign a point to polygon using pandas and shapely

If you're working with spatial data and Pandas you should take a look at GeoPandas.

The example below demonstrates how to perform a spatial join in GeoPandas (which uses Shapely). A GeoDataFrame object is created from a list of cities and their coordinates and is joined to an ESRI Shapefile containing countries.

import pandas
import geopandas
import geopandas.tools
from shapely.geometry import Point

# Create a DataFrame with some cities, including their location
raw_data = [
    ("London", 51.5, -0.1),
    ("Paris", 48.9, 2.4),
    ("San Francisco", 37.8, -122.4),
]
places = pandas.DataFrame(raw_data, columns=["name", "latitude", "longitude"])

# Create the geometry column from the coordinates
# Remember that longitude is east-west (i.e. X) and latitude is north-south (i.e. Y)
places["geometry"] = places.apply(lambda row: Point(row["longitude"], row["latitude"]), axis=1)
del(places["latitude"], places["longitude"])

# Convert to a GeoDataFrame
places = geopandas.GeoDataFrame(places, geometry="geometry")

# Declare the coordinate system for the places GeoDataFrame
# GeoPandas doesn't do any transformations automatically when performing
# the spatial join. The layers are already in the same CRS (WGS84) so no
# transformation is needed.
places.crs = "EPSG:4326"

# Load the countries polygons
countries = geopandas.GeoDataFrame.from_file("ne_110m_admin_0_countries.shp")
# Drop all columns except the name and polygon geometry
countries = countries[["name", "geometry"]]

# Perform the spatial join
result = geopandas.tools.sjoin(places, countries, how="left")

# Print the results...
print(result.head())

Note that when this was written the spatial join feature was still fairly new and was only available in the development version; it has since been included in stable GeoPandas releases.

https://github.com/geopandas/geopandas

The result looks like this:

       name_left             geometry  index_right      name_right
0         London    POINT (-0.1 51.5)           57  United Kingdom
1          Paris     POINT (2.4 48.9)           55          France
2  San Francisco  POINT (-122.4 37.8)          168   United States

You can also use GeoPandas to plot the data with matplotlib:

import matplotlib.pyplot as plt
fig, ax = plt.subplots(1)
countries.plot(ax=ax, color="#cccccc")
places.plot(ax=ax, markersize=5, color="#cc0000")
plt.show()
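
For larger inputs, the row-wise Point construction in the example above can be replaced with geopandas.points_from_xy, which builds all the points from the coordinate columns at once. The sketch below also runs the join against two rough, made-up rectangles ("Region A" and "Region B") instead of the countries shapefile, so it can be run without downloading any data.

```python
import geopandas
import pandas
from shapely.geometry import box

raw_data = [
    ("London", 51.5, -0.1),
    ("Paris", 48.9, 2.4),
]
places = pandas.DataFrame(raw_data, columns=["name", "latitude", "longitude"])

# Build every point in one call: longitude is X, latitude is Y
places = geopandas.GeoDataFrame(
    places[["name"]],
    geometry=geopandas.points_from_xy(places["longitude"], places["latitude"]),
    crs="EPSG:4326",
)

# Hypothetical rectangles standing in for real country polygons
regions = geopandas.GeoDataFrame(
    {"name": ["Region A", "Region B"]},
    geometry=[box(-6, 50, 1.5, 59), box(1.6, 42, 8, 51)],
    crs="EPSG:4326",
)

# Same left spatial join as above, using the default predicate
result = geopandas.sjoin(places, regions, how="left")
print(result[["name_left", "name_right"]])
```

Because both frames have a "name" column, the joined result carries name_left and name_right, as in the output shown earlier.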
