Shapely can give you this information so much faster, indeed. You can broadcast to a MultiPoint()
object, which is sequence of Point()
objects, in a single step.
Next, you may want to create a buffer for the line. A buffer around a line is the polygon representing the area in which all points are within a given distance. Because you want to return the rows from all_objs
that fall within the boundaries, you need to test individual points from the multipoint object (otherwise a simple intersection would have done, e.g. multipoint & line.buffer(distance)
). Iterating over a multipoint gives you individual Point()
instances, but much faster than if you created separate Point()
objects for each row.
First, just use ndarray.sum()
as the base of your averages, then use numpy.stack()
to combine the averages into a single array (shape: (..., 2)
). Then use asMultiPoint()
to create a Shapely MultiPoint()
object from the array:
from shapely.geometry import asMultiPoint
x_avgs = all_objs[:, :2].sum(axis=1) / 2
y_avgs = all_objs[:, 1:3].sum(axis=1) / 2
xy_avgs = np.stack((x_avgs, y_avgs), axis=1)
points = asMultiPoint(xy_avgs)
Next, you want to create your line buffer, and from that buffer create a prepared object as that provides the fastest method to test point containment:
from shapely.prepared import prep
prepped_buffer = prep(line.buffer(distance))
prepped_buffer
has a .contains()
method that returns True
if a given point falls within its boundaries.
Finally, use the prepped_buffer
to determine what rows represent a point that's within the buffer boundary. I'd use numpy.compress()
, and map()
to avoid switching between Python and C constantly:
# boolean array indicating what points were contained within distance from the line
contained = np.fromiter(map(prepped_buffer.contains, points), np.bool)
# take all rows from all_objs that are within distance from line
result = np.compress(contained, all_objs, axis=0)
I'm feeding the map()
result into np.compress()
via np.fromiter()
as that's a lot faster than using list()
; the latter has to create individual Python bool
references, while np.fromiter()
creates a much more compact object with single bits for the boolean values (8 to a byte).
This takes abo ut 80ms on 10.000 values. I ran a setup in a separate IPython cell, creating a function for the above process:
# setup
import numpy as np
from shapely.geometry import LineString
a = np.array([(x, y, x, y) for x in range(100) for y in range(100)])
line = LineString([(0, 0), (1, 1)])
distance = 0.5
def select_by_linebuffer(a, line, distance):
x_avgs = a[:, :2].sum(axis=1) / 2
y_avgs = a[:, 1:3].sum(axis=1) / 2
xy_avgs = asMultiPoint(np.stack((x_avgs, y_avgs), axis=1))
prepped_buffer = prep(line.buffer(distance))
return np.compress(
np.fromiter(map(prepped_buffer.contains, points), dtype=np.bool),
a,
axis=0
)
then used the %timeit
IPython magic command to time the whole operation:
%timeit select_by_linebuffer(a, line, distance)
# 82.9 ms ± 4.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Comparing this to your get_objs()
function:
%timeit get_objs(a, line, distance)
# 534 ms ± 30.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So ~80ms vs over 500ms, a win by a factor of 6 or 7. As you increase the size of the input, this only gets (much) worse, as the vectorised method has very little overhead:
# 250k rows
a = np.array([(x, y, x, y) for x in range(500) for y in range(500)])
%timeit select_by_linebuffer(a, line, distance)
# 86.9 ms ± 1.95 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit get_objs(a, line, distance)
12.5 s ± 121 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So for 250.000 rows, the vectorised method took a few milliseconds more time, while the Python loop approach took 12.5 seconds (!).
I tried to run the test on 1 million rows and my MacBook Pro 2.9 GHz Intel Core i7 laptop took such a long time to run get_objs()
that after 5 minutes I aborted the run, while the vectorized version managed to run the job in 121 milliseconds.
The difference is that with np.compress()
you get a single numpy array, while your get_objs()
function built up a Python list of separate arrays. I'd argue that that's much better.
Best Answer
The only thing your code is missing is automatisation of
indexline
, I am correct? Just use the first point and its x coordinate in each line to determine the correct order to fill the arrays.If you also want to automatically determine the shape of the array, you can use this: