Determining Adjacency in GeoPandas GeoDataFrame – Optimization Techniques

adjacency, geopandas, memory, optimization, topology

Main question

What is an efficient (i.e., good balance of memory & processing time) way to compute adjacency using a GeoPandas GeoDataFrame?

For the rest of my question, I'll focus on the case where the GeoDataFrame contains Polygon shapes, but the question is fundamentally agnostic to the type of geometries contained in the GeoDataFrame.

Reproducible example

# Loading important libraries
import geopandas as gpd
from shapely.geometry import Polygon

# Creating a function that makes square polygons
def make_square_polygon(centroid_x, centroid_y, side_length):
    return Polygon(
        [(centroid_x - side_length/2, centroid_y - side_length/2),
         (centroid_x + side_length/2, centroid_y - side_length/2),
         (centroid_x + side_length/2, centroid_y + side_length/2),
         (centroid_x - side_length/2, centroid_y + side_length/2)])

# Creating the GeoDataFrame
gdf = gpd.GeoDataFrame({'id':list(range(1,10)),
                        'geometry':[make_square_polygon(0.5, 0.5, 1),
                                    make_square_polygon(1.5, 0.5, 1),
                                    make_square_polygon(2.5, 0.5, 1),
                                    make_square_polygon(0.5, 1.5, 1),
                                    make_square_polygon(1.5, 1.5, 1),
                                    make_square_polygon(2.5, 1.5, 1),
                                    make_square_polygon(0.5, 2.5, 1),
                                    make_square_polygon(1.5, 2.5, 1),
                                    make_square_polygon(2.5, 2.5, 1),]},
                       geometry='geometry')

# Plotting the GeoDataFrame
gdf.plot(color='white', edgecolor='black')

The code above produces the following plot (I've manually added the IDs in red):
GeoDataFrame with nine squares arranged in 3x3 grid

Ultimately, my goal is to find out what polygons are neighbors (i.e., which polygons are touching each other). So for the example above, the results would represent the following:

  • for the square with ID 1, the IDs of the squares that touch it are: 2, 4, and 5
  • for the square with ID 2, the IDs of the squares that touch it are: 1, 3, 4, 5, and 6
  • for the square with ID 3, the IDs of the squares that touch it are: 2, 5, and 6
  • for the square with ID 4, the IDs of the squares that touch it are: 1, 2, 5, 7, and 8
  • for the square with ID 5, the IDs of the squares that touch it are: 1, 2, 3, 4, 6, 7, 8 and 9
  • for the square with ID 6, the IDs of the squares that touch it are: 2, 3, 5, 8 and 9
  • for the square with ID 7, the IDs of the squares that touch it are: 4, 5, and 8
  • for the square with ID 8, the IDs of the squares that touch it are: 4, 5, 6, 7, and 9
  • for the square with ID 9, the IDs of the squares that touch it are: 5, 6, and 8

What the results should look like

The results could be represented in multiple different ways.

Optimal results

Optimally, the results would look like this:

  1. an array with 9 positions, each one representing the rows in the original GeoDataFrame. This array is populated with lists of the IDs or indices of the adjacent geometries (similar to what I have described in the example above).

Lists of adjacent items

  2. a long Mx2 matrix with each row containing a pair of IDs or indices from the original GeoDataFrame that touch each other. In this case, M represents the number of pairs of features that touch each other.

Adjacency pairs

Note that the list in the example above has several repeats (e.g., 1 & 2 and 2 & 1). Ideally, this list might even be further simplified to remove those duplicates.

2nd best results

Less optimally, the results would look like this:

  1. a 9×9 matrix with Trues and Falses

Adjacency matrix

  2. a long (R²)x3 matrix representing all possible combinations of pairs of elements (e.g., 1-1, 1-2, 1-3, 2-1, 2-2, 2-3, …). Here, the first and second columns would hold the IDs or indices of the elements of the original GeoDataFrame, and the third column would hold boolean values indicating whether or not the two features touch. In this case, R represents the number of elements in the original GeoDataFrame.

All pairs with boolean results

I classify these as "less optimal" because they require a lot of memory to store (especially when dealing with large datasets).

My attempt

Here's my attempt at writing a function that performs the operation described above:

def get_adj_matrix(input_gdf):
    nrows = input_gdf.shape[0]
    geometry_column_name = input_gdf._geometry_column_name
    temp = input_gdf.merge(input_gdf, how='cross')
    adj_matrix = (gpd.GeoSeries(temp[f'{geometry_column_name}_x'])
                  .touches(gpd.GeoSeries(temp[f'{geometry_column_name}_y']))
                  .values.reshape((nrows, nrows)))
    return adj_matrix

print(get_adj_matrix(gdf))
# array([[False,  True, False,  True,  True, False, False, False, False],
#        [ True, False,  True,  True,  True,  True, False, False, False],
#        [False,  True, False, False,  True,  True, False, False, False],
#        [ True,  True, False, False,  True, False,  True,  True, False],
#        [ True,  True,  True,  True, False,  True,  True,  True,  True],
#        [False,  True,  True, False,  True, False, False,  True,  True],
#        [False, False, False,  True,  True, False, False,  True, False],
#        [False, False, False,  True,  True,  True,  True, False,  True],
#        [False, False, False, False,  True,  True, False,  True, False]])

The code above works just fine. The problem, however, is two-fold:

  • The code is messy/ugly. It's weird that I need to do a cross merge, create a vector that has (R²) elements and then reshape it back to be an RxR matrix.
  • The function is very fast because it uses NumPy's/Pandas'/GeoPandas' vectorized methods. However, for larger datasets, it is going to consume A LOT of memory because this approach requires us to make a dataframe of size (R²)x(2*C) and then a matrix of size RxR. (R represents the number of elements/rows in the original GeoDataFrame and C represents the number of columns in the original GeoDataFrame).

Simply put, my approach generates a 2nd best answer (as defined above) and I'd like to compute an Optimal answer which not only looks optimal in the end, but also avoids going through the large memory-consuming intermediate steps where huge RxR matrices or (R²)x(2*C) DataFrames are generated.

Back to my main question

How can I quickly compute adjacency among the elements of a GeoDataFrame, with the results in a memory-efficient form (as described in the "Optimal results" section above), and without depending on Python's rather slow for-loops? If possible, it'd be great to avoid apply methods as well.

Best Answer

After researching some more, I found the libpysal library, which has tools to calculate exactly what I'm looking for. Here is how we can get the results in multiple different formats:

Setup

# Importing libpysal
import libpysal as lp

# Calculating adjacency
gdf_neighbors = lp.weights.Queen.from_dataframe(gdf)

The gdf_neighbors object above can be manipulated to generate the results in several different ways.

Full adjacency matrix

We can get the full adjacency matrix using the full() method. This spits out a tuple with two items: the first one containing the adjacency matrix and the second one containing the list of indices from the original dataframe:

gdf_adj_mtx, gdf_adj_mtx_indices = gdf_neighbors.full()

print(gdf_adj_mtx)
# array([[0., 1., 0., 1., 1., 0., 0., 0., 0.],
#        [1., 0., 1., 1., 1., 1., 0., 0., 0.],
#        [0., 1., 0., 0., 1., 1., 0., 0., 0.],
#        [1., 1., 0., 0., 1., 0., 1., 1., 0.],
#        [1., 1., 1., 1., 0., 1., 1., 1., 1.],
#        [0., 1., 1., 0., 1., 0., 0., 1., 1.],
#        [0., 0., 0., 1., 1., 0., 0., 1., 0.],
#        [0., 0., 0., 1., 1., 1., 1., 0., 1.],
#        [0., 0., 0., 0., 1., 1., 0., 1., 0.]])

print(gdf_adj_mtx_indices)
# [0, 1, 2, 3, 4, 5, 6, 7, 8]

Adjacency list with duplicates

We can generate a list containing only the adjacent pairs using the to_adjlist() method.

Note, however, that this method produces mirrored duplicates: each pair appears twice, e.g., both (0, 1) and (1, 0).

gdf_adj_list = gdf_neighbors.to_adjlist()

print(gdf_adj_list.shape)
# (40, 3)

print(gdf_adj_list)
#     focal  neighbor  weight
# 0       0         1     1.0
# 1       0         3     1.0
# 2       0         4     1.0
# 3       1         0     1.0
# 4       1         2     1.0
# 5       1         3     1.0
# 6       1         4     1.0
# 7       1         5     1.0
# 8       2         1     1.0
# 9       2         4     1.0
# 10      2         5     1.0
# 11      3         0     1.0
# 12      3         1     1.0
# 13      3         4     1.0
# 14      3         6     1.0
# 15      3         7     1.0
# 16      4         0     1.0
# 17      4         1     1.0
# 18      4         2     1.0
# 19      4         3     1.0
# 20      4         5     1.0
# 21      4         6     1.0
# 22      4         7     1.0
# 23      4         8     1.0
# 24      5         1     1.0
# 25      5         2     1.0
# 26      5         4     1.0
# 27      5         7     1.0
# 28      5         8     1.0
# 29      6         3     1.0
# 30      6         4     1.0
# 31      6         7     1.0
# 32      7         3     1.0
# 33      7         4     1.0
# 34      7         5     1.0
# 35      7         6     1.0
# 36      7         8     1.0
# 37      8         4     1.0
# 38      8         5     1.0
# 39      8         7     1.0

Adjacency list without duplicates

If we really want to eliminate those duplicates, one small cleanup step suffices: keep only the rows where focal is less than neighbor.

# Getting the full adjacency list
gdf_adj_list = gdf_neighbors.to_adjlist()

# Trimming down and removing duplicates
gdf_adj_list_no_dups = gdf_adj_list.loc[gdf_adj_list['focal']<gdf_adj_list['neighbor']]

print(gdf_adj_list_no_dups.shape)
# (20, 3)

print(gdf_adj_list_no_dups)
#     focal  neighbor  weight
# 0       0         1     1.0
# 1       0         3     1.0
# 2       0         4     1.0
# 4       1         2     1.0
# 5       1         3     1.0
# 6       1         4     1.0
# 7       1         5     1.0
# 9       2         4     1.0
# 10      2         5     1.0
# 13      3         4     1.0
# 14      3         6     1.0
# 15      3         7     1.0
# 20      4         5     1.0
# 21      4         6     1.0
# 22      4         7     1.0
# 23      4         8     1.0
# 27      5         7     1.0
# 28      5         8     1.0
# 31      6         7     1.0
# 36      7         8     1.0

Notice how the list above eliminates those "mirrored" duplicates.
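For completeness, recent versions of GeoPandas can produce the same deduplicated pair list on their own via the built-in spatial index, without libpysal. This is a hedged sketch assuming GeoPandas >= 0.12 (where sindex.query accepts an array of geometries); it rebuilds the example grid so it runs standalone:

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Rebuild the 3x3 grid of unit squares from the question
def make_square_polygon(cx, cy, s):
    return Polygon([(cx - s/2, cy - s/2), (cx + s/2, cy - s/2),
                    (cx + s/2, cy + s/2), (cx - s/2, cy + s/2)])

gdf = gpd.GeoDataFrame(
    {'id': list(range(1, 10))},
    geometry=[make_square_polygon(x + 0.5, y + 0.5, 1)
              for y in range(3) for x in range(3)])

# The R-tree prunes pairs whose bounding boxes don't meet, so the exact
# `touches` test only runs on nearby candidates -- no RxR intermediate
left, right = gdf.sindex.query(gdf.geometry, predicate="touches")

# Keep each unordered pair once (mirrored duplicates have left > right)
pairs = [(l, r) for l, r in zip(left, right) if l < r]
print(len(pairs))
```

The result is exactly the deduplicated Mx2 pair list described in the "Optimal results" section, expressed in dataframe index positions.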

Sources

I found all of this information in libpysal's documentation, specifically the portion that discusses spatial weights and contiguity. This page also has super valuable resources regarding how to use SciPy's compressed sparse graph module to store the results. It's all super useful and well worth the read.