Python – Broadcasting Shapely Point on Numpy Array

Tags: numpy, python, shapely

I have some code that looks something like this:

def get_objs(all_objs, line, my_distance):
    """
    Gets all objs within my_distance of line.

    all_objs is a numpy array of shape (x, 4) denoting bounding-box coordinates.
    line is a LineString object with an arbitrary number of points. 
    """
    line_objs = []

    for o in all_objs:
        x_avg, y_avg = np.mean([o[0], o[1]]), np.mean([o[2], o[3]])

        p = Point(x_avg, y_avg)  # Point is a Shapely type

        if p.distance(line) <= my_distance:
            line_objs.append(o)

    return line_objs

My question is whether I can broadcast in this case, because this code is really slow on my data and I suspect a vectorised broadcast would help a ton. I have checked the numpy docs, but they don't make it clear that you can broadcast your inputs to a new type and then do something with that type.

Likewise, the Shapely docs don't make it clear that you can apply the .distance() function in element-wise fashion so that you can compare against some geometric object. Perhaps I am being a bit optimistic. What I have in mind is something like:

x_avgs = (all_objs[:, 0] + all_objs[:, 1]) / 2
y_avgs = (all_objs[:, 2] + all_objs[:, 3]) / 2
xy_avgs = np.array((x_avgs, y_avgs)).transpose()

# element-wise initialisation and comparison, followed by
# masking for true-y elements.
line_objs = (Point(xy_avgs).distance(line) < my_distance).nonzero()

return line_objs

The above does not work, since Point()s don't accept arrays 🙁


Update: after some (very statistically shaky) benchmarking, a quick consultation with timeit suggests the np.mean() calls are the real culprit:

>>> setup = "import numpy as np; a = [(x, y, x, y) for x in range(100) for y in range(100)];"
>>> print(timeit.timeit("[(np.mean([o[0], o[1]]), np.mean([o[2], o[3]])) for o in a]", setup=setup, number=100))
30.91517815797124
>>> print(timeit.timeit("[((o[0] + o[1]) / 2, (o[2] + o[3]) / 2) for o in a]", setup=setup, number=100))
0.3588130779680796

… That's insane! That's nearly 2 orders of magnitude on a sample much smaller than mine. Perhaps vectorisation isn't necessary after all.
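For what it's worth, the averaging vectorises trivially once the data is a numpy array; a minimal sketch using the question's own indexing (o[0], o[1] for x and o[2], o[3] for y) on the same synthetic data as the timeit setup:

```python
import numpy as np

# same bounding boxes as the timeit setup, as a (10000, 4) array
a = np.array([(x, y, x, y) for x in range(100) for y in range(100)])

# one vectorised pass over all rows, no per-row np.mean() calls
x_avgs = (a[:, 0] + a[:, 1]) / 2
y_avgs = (a[:, 2] + a[:, 3]) / 2
xy_avgs = np.stack((x_avgs, y_avgs), axis=1)  # shape (10000, 2)
```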

Best Answer

Shapely can indeed give you this information much faster. You can "broadcast" the whole array to a MultiPoint() object, which is a sequence of Point() objects, in a single step.

Next, you may want to create a buffer for the line. A buffer around a line is the polygon representing the area in which all points are within a given distance. Because you want to return the rows from all_objs that fall within the boundaries, you need to test individual points from the multipoint object (otherwise a simple intersection would have done, e.g. multipoint & line.buffer(distance)). Iterating over a multipoint gives you individual Point() instances, but much faster than if you created separate Point() objects for each row.
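As a concrete illustration of the buffer idea (the line and points here are toy values, not the poster's data):

```python
from shapely.geometry import LineString, MultiPoint, Point

line = LineString([(0, 0), (1, 1)])
buffer = line.buffer(0.5)  # polygon of all points within 0.5 of the line

# a point on the line lies inside the buffer, a far-away point does not
assert buffer.contains(Point(0.5, 0.5))
assert not buffer.contains(Point(3, 3))

# if you only needed the points themselves (not the original rows),
# a plain intersection with the buffer would suffice:
nearby = MultiPoint([(0.5, 0.5), (3, 3)]) & buffer
```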

First, just use ndarray.sum() as the base of your averages, then use numpy.stack() to combine the averages into a single array (shape: (..., 2)). Then use asMultiPoint() to create a Shapely MultiPoint() object from the array:

import numpy as np
from shapely.geometry import asMultiPoint

x_avgs = all_objs[:, :2].sum(axis=1) / 2
y_avgs = all_objs[:, 2:].sum(axis=1) / 2
xy_avgs = np.stack((x_avgs, y_avgs), axis=1)

points = asMultiPoint(xy_avgs)

Next, you want to create your line buffer, and from that buffer create a prepared object as that provides the fastest method to test point containment:

from shapely.prepared import prep
prepped_buffer = prep(line.buffer(distance))

prepped_buffer has a .contains() method that returns True if a given point falls within its boundaries.
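A quick sketch of what prep() buys you, again on toy geometry:

```python
from shapely.geometry import LineString, Point
from shapely.prepared import prep

buffer = LineString([(0, 0), (1, 1)]).buffer(0.5)
prepped = prep(buffer)

# a prepared geometry caches internal spatial indexes, so repeated
# containment tests against the same polygon are much cheaper
assert prepped.contains(Point(0.5, 0.5))
assert not prepped.contains(Point(3, 3))
```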

Finally, use the prepped_buffer to determine what rows represent a point that's within the buffer boundary. I'd use numpy.compress(), and map() to avoid switching between Python and C constantly:

# boolean array indicating which points fall within distance of the line
contained = np.fromiter(map(prepped_buffer.contains, points), dtype=bool)
# take all rows from all_objs that are within distance of line
result = np.compress(contained, all_objs, axis=0)

I'm feeding the map() result into np.compress() via np.fromiter() as that's a lot faster than using list(); the latter has to create individual Python bool objects, while np.fromiter() creates a much more compact array with a single byte per boolean value.
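The compactness is easy to see on a toy iterable (the generator here is just illustrative):

```python
import numpy as np

flags = (x % 2 == 0 for x in range(10))  # any iterable of booleans
arr = np.fromiter(flags, dtype=bool)     # compact array, 1 byte per element

# a CPython bool object, by contrast, costs ~28 bytes plus a pointer
assert arr.itemsize == 1
assert arr.tolist() == [True, False] * 5
```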

This takes about 80 ms on 10,000 values. I ran the setup in a separate IPython cell, creating a function for the above process:

# setup
import numpy as np
from shapely.geometry import LineString, asMultiPoint
from shapely.prepared import prep

a = np.array([(x, y, x, y) for x in range(100) for y in range(100)])
line = LineString([(0, 0), (1, 1)])
distance = 0.5

def select_by_linebuffer(a, line, distance):
    x_avgs = a[:, :2].sum(axis=1) / 2
    y_avgs = a[:, 2:].sum(axis=1) / 2
    points = asMultiPoint(np.stack((x_avgs, y_avgs), axis=1))
    prepped_buffer = prep(line.buffer(distance))
    return np.compress(
        np.fromiter(map(prepped_buffer.contains, points), dtype=bool),
        a,
        axis=0
    )

then used the %timeit IPython magic command to time the whole operation:

%timeit select_by_linebuffer(a, line, distance)
# 82.9 ms ± 4.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Comparing this to your get_objs() function:

%timeit get_objs(a, line, distance)
# 534 ms ± 30.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So ~80 ms vs over 500 ms, a win by a factor of 6 or 7. As you increase the size of the input, the gap only gets (much) wider, as the vectorised method has very little overhead:

# 250k rows
a = np.array([(x, y, x, y) for x in range(500) for y in range(500)])
%timeit select_by_linebuffer(a, line, distance)
# 86.9 ms ± 1.95 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit get_objs(a, line, distance)
# 12.5 s ± 121 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So for 250,000 rows, the vectorised method took only a few milliseconds more, while the Python loop approach took 12.5 seconds (!).

I tried to run the test on 1 million rows, and my MacBook Pro (2.9 GHz Intel Core i7) took so long to run get_objs() that I aborted the run after 5 minutes, while the vectorised version finished the job in 121 milliseconds.

The difference is that with np.compress() you get a single numpy array, while your get_objs() function built up a Python list of separate arrays. I'd argue that that's much better.
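Note that asMultiPoint() was removed in Shapely 2.0; a sketch of the same pipeline with the current API (MultiPoint() accepts an (n, 2) coordinate array directly, and .geoms is needed to iterate a multi-part geometry):

```python
import numpy as np
from shapely.geometry import LineString, MultiPoint
from shapely.prepared import prep

def select_by_linebuffer(a, line, distance):
    # centre points of each bounding box, in one vectorised pass
    x_avgs = a[:, :2].sum(axis=1) / 2
    y_avgs = a[:, 2:].sum(axis=1) / 2
    points = MultiPoint(np.stack((x_avgs, y_avgs), axis=1))
    prepped_buffer = prep(line.buffer(distance))
    return np.compress(
        np.fromiter(map(prepped_buffer.contains, points.geoms), dtype=bool),
        a,
        axis=0,
    )

# toy data: one box centred near the line, one far away
line = LineString([(0, 0), (1, 1)])
a = np.array([[0, 1, 0, 1], [6, 6, 6, 6]])
result = select_by_linebuffer(a, line, distance=0.5)
```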
