Python – Convert Geopandas Points to Numpy Array

geopandasnumpypythonpython 3

I have a geopackage with a point feature. Which contains points along 19 lines. So, the number of points per line is exactly the same. Therefore, I could see it as a matrix in numpy. Moreover, each point has an attribute "AffectedbyRiver", where 1 (white) and 0 (red). How can a extract that information from geopandas to numpy array as an array 204-by-19 ? I mean, as a matrix with zeros and ones.

Note: The figure below is a screenshot from QGIS3x

import geopandas as gpd
preplot = gpd.read_file('geodatabase.gpkg', layer='SourceRegularPoints')
preplot.plot(figsize=(12,8))

Link to geopackage

here is my code (manually):

indexline = [19,18,17,16,15,14,13,12,11,10,1,2,3,4,5,6,7,8,9]


array = np.zeros((204,19))
x_coor =  np.zeros((204,19))
y_coor = np.zeros((204,19))

for i in indexline:
    array[:,i-1]=preplot.loc[preplot['org_fid']==i].AffectedbyRiver.values
    x_coor[:,i-1]=preplot.loc[preplot['org_fid']==i].X.values
    y_coor[:,i-1]=preplot.loc[preplot['org_fid']==i].y.values

plt.scatter(x_coor.flatten(),y_coor.flatten(),c=array.flatten(),cmap='gray')

Best Answer

The only thing your code is missing is automatisation of indexline, I am correct? Just use the first point and its x coordinate in each line to determine the correct order to fill the arrays.

indexline = preplot[preplot['distance']==0].sort_values('X').org_fid

array = np.zeros((204,19))
x_coor =  np.zeros((204,19))
y_coor = np.zeros((204,19))

for i in indexline:
    array[:,i-1]=preplot.loc[preplot['org_fid']==i].AffectedbyRiver.values
    x_coor[:,i-1]=preplot.loc[preplot['org_fid']==i].X.values
    y_coor[:,i-1]=preplot.loc[preplot['org_fid']==i].y.values

If you also want to automatically determine the shape of the array, you can use this:

x = gdf['distance'].unique().shape[0]
y = gdf['org_fid'].unique().shape[0]

array = np.zeros((x, y))
x_coor =  np.zeros((x, y))
y_coor = np.zeros((x, y))

Related Solutions

Python – Broadcasting Shapely Point on Numpy Array

Shapely can give you this information so much faster, indeed. You can broadcast to a MultiPoint() object, which is sequence of Point() objects, in a single step.

Next, you may want to create a buffer for the line. A buffer around a line is the polygon representing the area in which all points are within a given distance. Because you want to return the rows from all_objs that fall within the boundaries, you need to test individual points from the multipoint object (otherwise a simple intersection would have done, e.g. multipoint & line.buffer(distance)). Iterating over a multipoint gives you individual Point() instances, but much faster than if you created separate Point() objects for each row.

First, just use ndarray.sum() as the base of your averages, then use numpy.stack() to combine the averages into a single array (shape: (..., 2)). Then use asMultiPoint() to create a Shapely MultiPoint() object from the array:

from shapely.geometry import asMultiPoint

x_avgs = all_objs[:, :2].sum(axis=1) / 2
y_avgs = all_objs[:, 1:3].sum(axis=1) / 2
xy_avgs = np.stack((x_avgs, y_avgs), axis=1)

points = asMultiPoint(xy_avgs)

Next, you want to create your line buffer, and from that buffer create a prepared object as that provides the fastest method to test point containment:

from shapely.prepared import prep
prepped_buffer = prep(line.buffer(distance))

prepped_buffer has a .contains() method that returns True if a given point falls within its boundaries.

Finally, use the prepped_buffer to determine what rows represent a point that's within the buffer boundary. I'd use numpy.compress(), and map() to avoid switching between Python and C constantly:

# boolean array indicating what points were contained within distance from the line
contained = np.fromiter(map(prepped_buffer.contains, points), np.bool)
# take all rows from all_objs that are within distance from line
result = np.compress(contained, all_objs, axis=0)

I'm feeding the map() result into np.compress() via np.fromiter() as that's a lot faster than using list(); the latter has to create individual Python bool references, while np.fromiter() creates a much more compact object with single bits for the boolean values (8 to a byte).

This takes abo ut 80ms on 10.000 values. I ran a setup in a separate IPython cell, creating a function for the above process:

# setup
import numpy as np
from shapely.geometry import LineString
a = np.array([(x, y, x, y) for x in range(100) for y in range(100)])
line = LineString([(0, 0), (1, 1)])
distance = 0.5

def select_by_linebuffer(a, line, distance):
    x_avgs = a[:, :2].sum(axis=1) / 2
    y_avgs = a[:, 1:3].sum(axis=1) / 2
    xy_avgs = asMultiPoint(np.stack((x_avgs, y_avgs), axis=1))
    prepped_buffer = prep(line.buffer(distance))
    return np.compress(
        np.fromiter(map(prepped_buffer.contains, points), dtype=np.bool),
        a,
        axis=0
    )

then used the %timeit IPython magic command to time the whole operation:

%timeit select_by_linebuffer(a, line, distance)
# 82.9 ms ± 4.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Comparing this to your get_objs() function:

%timeit get_objs(a, line, distance)
# 534 ms ± 30.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So ~80ms vs over 500ms, a win by a factor of 6 or 7. As you increase the size of the input, this only gets (much) worse, as the vectorised method has very little overhead:

# 250k rows
a = np.array([(x, y, x, y) for x in range(500) for y in range(500)])
%timeit select_by_linebuffer(a, line, distance)
# 86.9 ms ± 1.95 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit get_objs(a, line, distance)
12.5 s ± 121 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So for 250.000 rows, the vectorised method took a few milliseconds more time, while the Python loop approach took 12.5 seconds (!).

I tried to run the test on 1 million rows and my MacBook Pro 2.9 GHz Intel Core i7 laptop took such a long time to run get_objs() that after 5 minutes I aborted the run, while the vectorized version managed to run the job in 121 milliseconds.

The difference is that with np.compress() you get a single numpy array, while your get_objs() function built up a Python list of separate arrays. I'd argue that that's much better.

Python GDAL – Convert Numpy Masked Array to Raster: Comprehensive Guide

In your method, change it like the following lines and that will solve the problem:

def array2raster(newRasterfn, rasterOrigin, pixelWidth, pixelHeight, array, path):
    """ This function works fine, until the raster mask. It does the work whatsoever. 
        Not to be used unless necessary. """

    # Create a local copy of the array
    temp = array.copy()

    # Given that the array is a masked array, the mask is "non existent".
    # Consequently, all values bigger than the maximum value (temp > np.max(temp))
    # Can be stored as "nan".
    temp[np.where(temp > np.max(temp))] = np.nan

In the process of writing the raster, outband.SetNoDataValue will recognize the nan as the data to set to "whatever the number" you want.

Best Answer

Related Solutions

Python – Broadcasting Shapely Point on Numpy Array

Python GDAL – Convert Numpy Masked Array to Raster: Comprehensive Guide

Related Question