I have some code that looks something like this:
```python
import numpy as np
from shapely.geometry import Point

def get_objs(all_objs, line, my_distance):
    """
    Gets all objs within my_distance of line.

    all_objs is a numpy array of shape (x, 4) denoting bounding-box
    coordinates. line is a Shapely LineString object with an arbitrary
    number of points.
    """
    line_objs = []
    for o in all_objs:
        x_avg, y_avg = np.mean([o[0], o[1]]), np.mean([o[2], o[3]])
        p = Point(x_avg, y_avg)  # Point is a Shapely type
        if p.distance(line) <= my_distance:
            line_objs.append(o)
    return line_objs
```
My question is whether I can broadcast in this case, because this code is really slow on my data, and I suspect a vectorised broadcast would help a lot. I have checked the numpy docs, but they don't make it clear whether you can broadcast your inputs to a new type and then do something with that type. Likewise, the Shapely docs don't make it clear whether you can apply the `.distance()` method element-wise so that you can compare against some geometric object. Perhaps I am being a bit optimistic. What I have in mind is something like:
```python
x_avgs = (all_objs[:, 0] + all_objs[:, 1]) / 2
y_avgs = (all_objs[:, 2] + all_objs[:, 3]) / 2
xy_avgs = np.array((x_avgs, y_avgs)).transpose()
# element-wise initialisation and comparison, followed by
# masking for truthy elements.
line_objs = (Point(xy_avgs).distance(line) < my_distance).nonzero()
return line_objs
```
The above does not work, since `Point()`s don't accept arrays 🙁
After some (very statistically shaky) benchmarking via a quick consultation with `timeit`:
```python
>>> setup = "import numpy as np; a = [(x, y, x, y) for x in range(100) for y in range(100)];"
>>> print(timeit.timeit("[(np.mean([o[0], o[1]]), np.mean([o[2], o[3]])) for o in a]", setup=setup, number=100))
30.91517815797124
>>> print(timeit.timeit("[((o[0] + o[1]) / 2, (o[2] + o[3]) / 2) for o in a]", setup=setup, number=100))
0.3588130779680796
```
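For completeness, the fully vectorised form of the same averages (whole-column arithmetic, as in my attempted snippet above) drops the Python loop entirely:

```python
import numpy as np

# the same sample data as in the timeit setup, as a numpy array
a = np.array([(x, y, x, y) for x in range(100) for y in range(100)], dtype=float)

# two whole-column operations replace 10,000 loop iterations
x_avgs = (a[:, 0] + a[:, 1]) / 2
y_avgs = (a[:, 2] + a[:, 3]) / 2
```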
… That's insane! That's nearly 2 orders of magnitude on a sample much smaller than mine. Perhaps vectorisation isn't necessary after all.
Best Answer
Indeed, Shapely can give you this information much faster. You can broadcast to a `MultiPoint()` object, which is a sequence of `Point()` objects, in a single step.

Next, you may want to create a buffer for the line. A buffer around a line is the polygon representing the area in which all points are within a given distance of it. Because you want to return the rows from `all_objs` that fall within the boundaries, you need to test individual points from the multipoint object (otherwise a simple intersection would have done, e.g. `multipoint & line.buffer(distance)`). Iterating over a multipoint gives you individual `Point()` instances, but much faster than if you created a separate `Point()` object for each row.

First, just use `ndarray.sum()` as the base of your averages, then use `numpy.stack()` to combine the averages into a single array (shape: `(..., 2)`), and use `asMultiPoint()` to create a Shapely `MultiPoint()` object from that array. Next, create your line buffer, and from that buffer create a prepared object, as that provides the fastest way to test point containment.
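A sketch of those steps, with made-up sample inputs; note that `asMultiPoint()` was removed in Shapely 2.0, where the `MultiPoint()` constructor accepts the coordinate array directly, which is what I use here:

```python
import numpy as np
from shapely.geometry import LineString, MultiPoint
from shapely.prepared import prep

# made-up sample inputs: rows are (x1, x2, y1, y2) bounding boxes
all_objs = np.array([[0.0, 2.0, 0.0, 2.0],
                     [10.0, 12.0, 10.0, 12.0]])
line = LineString([(0, 0), (5, 5)])
my_distance = 3.0

# column sums as the base of the averages, stacked into shape (N, 2)
xy_avgs = np.stack(
    ((all_objs[:, 0] + all_objs[:, 1]) / 2,
     (all_objs[:, 2] + all_objs[:, 3]) / 2),
    axis=-1,
)

# one MultiPoint for all rows, created in a single step
multipoint = MultiPoint(xy_avgs)

# polygon of everything within my_distance of the line, prepared
# so that repeated containment tests are fast
prepped_buffer = prep(line.buffer(my_distance))
```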
`prepped_buffer` has a `.contains()` method that returns `True` if a given point falls within its boundaries.

Finally, use the `prepped_buffer` to determine which rows represent a point that's within the buffer boundary. I'd use `numpy.compress()` and `map()` to avoid switching between Python and C constantly.

I'm feeding the `map()` result into `np.compress()` via `np.fromiter()`, as that's a lot faster than using `list()`; the latter has to create individual Python `bool` references, while `np.fromiter()` creates a much more compact array with a single byte per boolean value. This takes about 80 ms on 10,000 values. I ran the setup in a separate IPython cell, creating a function for the above process,
then used the `%timeit` IPython magic command to time the whole operation and compared it against your `get_objs()` function. The result: roughly 80 ms vs. over 500 ms, a win by a factor of 6 or 7. As you increase the size of the input, the gap only gets (much) wider, as the vectorised method has very little overhead.
For 250,000 rows, the vectorised method took only a few milliseconds more, while the Python loop approach took 12.5 seconds (!).
I tried to run the test on 1 million rows, and my MacBook Pro (2.9 GHz Intel Core i7) took so long to run `get_objs()` that I aborted the run after 5 minutes, while the vectorised version managed the job in 121 milliseconds.

A further difference: with `np.compress()` you get a single numpy array back, while your `get_objs()` function built up a Python list of separate arrays. I'd argue the former is much better.
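To illustrate that last difference with made-up data:

```python
import numpy as np

arr = np.arange(12.0).reshape(4, 3)
mask = np.array([True, False, True, False])

# np.compress returns one contiguous array of the selected rows,
# not a Python list of separate row arrays
result = np.compress(mask, arr, axis=0)
```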