Determining Adjacency in GeoPandas GeoDataFrame – Optimization Techniques

adjacency, geopandas, memory, optimization, topology

Main question

What is an efficient (i.e., good balance of memory & processing time) way to compute adjacency using a GeoPandas GeoDataFrame?

For the rest of my question, I'll focus on the case where the GeoDataFrame contains Polygon shapes, but the question is fundamentally agnostic to the type of geometries contained in the GeoDataFrame.

Reproducible example

# Loading important libraries
import geopandas as gpd
from shapely.geometry import Polygon

# Creating a function that makes square polygons
def make_square_polygon(centroid_x, centroid_y, side_length):
    return Polygon(
        [(centroid_x - side_length/2, centroid_y - side_length/2),
         (centroid_x + side_length/2, centroid_y - side_length/2),
         (centroid_x + side_length/2, centroid_y + side_length/2),
         (centroid_x - side_length/2, centroid_y + side_length/2)])

# Creating the GeoDataFrame
gdf = gpd.GeoDataFrame({'id':list(range(1,10)),
                        'geometry':[make_square_polygon(0.5, 0.5, 1),
                                    make_square_polygon(1.5, 0.5, 1),
                                    make_square_polygon(2.5, 0.5, 1),
                                    make_square_polygon(0.5, 1.5, 1),
                                    make_square_polygon(1.5, 1.5, 1),
                                    make_square_polygon(2.5, 1.5, 1),
                                    make_square_polygon(0.5, 2.5, 1),
                                    make_square_polygon(1.5, 2.5, 1),
                                    make_square_polygon(2.5, 2.5, 1),]},
                       geometry='geometry')

# Plotting the GeoDataFrame
gdf.plot(color='white', edgecolor='black')

The code above produces the following plot (I've manually added the IDs in red):
GeoDataFrame with nine squares arranged in 3x3 grid

Ultimately, my goal is to find out what polygons are neighbors (i.e., which polygons are touching each other). So for the example above, the results would represent the following:

  • for the square with ID 1, the IDs of the squares that touch it are: 2, 4, and 5
  • for the square with ID 2, the IDs of the squares that touch it are: 1, 3, 4, 5, and 6
  • for the square with ID 3, the IDs of the squares that touch it are: 2, 5, and 6
  • for the square with ID 4, the IDs of the squares that touch it are: 1, 2, 5, 7, and 8
  • for the square with ID 5, the IDs of the squares that touch it are: 1, 2, 3, 4, 6, 7, 8 and 9
  • for the square with ID 6, the IDs of the squares that touch it are: 2, 3, 5, 8 and 9
  • for the square with ID 7, the IDs of the squares that touch it are: 4, 5, and 8
  • for the square with ID 8, the IDs of the squares that touch it are: 4, 5, 6, 7, and 9
  • for the square with ID 9, the IDs of the squares that touch it are: 5, 6, and 8

What the results should look like

The results could be represented in multiple different ways.

Optimal results

Optimally, the results would look like this:

  1. an array with 9 positions, each one representing the rows in the original GeoDataFrame. This array is populated with lists of the IDs or indices of the adjacent geometries (similar to what I have described in the example above).

Lists of adjacent items

  2. a long Mx2 matrix with each row containing a pair of IDs or indices from the original GeoDataFrame that touch each other. In this case, M represents the number of pairs of features that touch each other.

Adjacency pairs

Note that the list in the example above has several repeats (e.g., 1 & 2 and 2 & 1). Ideally, this list might even be further simplified to remove those duplicates.

2nd best results

Less optimally, the results would look like this:

  1. a 9×9 matrix with Trues and Falses

Adjacency matrix

  2. a long (R²)x3 matrix representing all possible combinations of pairs of elements (e.g., 1-1, 1-2, 1-3, 2-1, 2-2, 2-3, …). Here, the first and second columns would hold the IDs or indices of the elements of the original GeoDataFrame, and the third column would hold boolean values indicating whether or not the two features touch. In this case, R represents the number of elements in the original GeoDataFrame.

All pairs with boolean results

I classify these as "less optimal" because they require a lot of memory to store (especially when dealing with large datasets).

My attempt

Here's my attempt at writing a function that performs the operation described above:

def get_adj_matrix(input_gdf):
    nrows = input_gdf.shape[0]
    geometry_column_name = input_gdf._geometry_column_name
    temp = input_gdf.merge(input_gdf, how='cross')
    adj_matrix = (gpd.GeoSeries(temp[f'{geometry_column_name}_x'])
                  .touches(gpd.GeoSeries(temp[f'{geometry_column_name}_y']))
                  .values.reshape((nrows, nrows)))
    return adj_matrix

print(get_adj_matrix(gdf))
# array([[False,  True, False,  True,  True, False, False, False, False],
#        [ True, False,  True,  True,  True,  True, False, False, False],
#        [False,  True, False, False,  True,  True, False, False, False],
#        [ True,  True, False, False,  True, False,  True,  True, False],
#        [ True,  True,  True,  True, False,  True,  True,  True,  True],
#        [False,  True,  True, False,  True, False, False,  True,  True],
#        [False, False, False,  True,  True, False, False,  True, False],
#        [False, False, False,  True,  True,  True,  True, False,  True],
#        [False, False, False, False,  True,  True, False,  True, False]])

The code above works just fine. The problem, however, is two-fold:

  • The code is messy/ugly. It's weird that I need to do a cross merge, create a vector that has (R²) elements and then reshape it back to be an RxR matrix.
  • The function is very fast because it uses NumPy's/Pandas'/GeoPandas' vectorized methods. However, for larger datasets, it is going to consume A LOT of memory because this approach requires us to make a dataframe of size (R²)x(2*C) and then a matrix of size RxR. (R represents the number of elements/rows in the original GeoDataFrame and C represents the number of columns in the original GeoDataFrame).

Simply put, my approach generates a 2nd best answer (as defined above) and I'd like to compute an Optimal answer which not only looks optimal in the end, but also avoids going through the large memory-consuming intermediate steps where huge RxR matrices or (R²)x(2*C) DataFrames are generated.

Back to my main question

How can I quickly compute adjacency among the elements of a GeoDataFrame, with the results in a memory-efficient form (as described in the "Optimal results" section above), and without depending on Python's rather slow for-loops? If possible, it'd be great to avoid apply methods as well.

Best Answer

After researching some more, I found the libpysal library, which has tools to calculate exactly what I'm looking for. Here is how we can get the results in multiple different formats:

Setup

# Importing libpysal
import libpysal as lp

# Calculating adjacency
gdf_neighbors = lp.weights.Queen.from_dataframe(gdf)

The gdf_neighbors object above can be manipulated to generate the results in several different ways.

Full adjacency matrix

We can get the full adjacency matrix using the full() method. This spits out a tuple with two items: the first one containing the adjacency matrix and the second one containing the list of indices from the original dataframe:

gdf_adj_mtx, gdf_adj_mtx_indices = gdf_neighbors.full()

print(gdf_adj_mtx)
# array([[0., 1., 0., 1., 1., 0., 0., 0., 0.],
#        [1., 0., 1., 1., 1., 1., 0., 0., 0.],
#        [0., 1., 0., 0., 1., 1., 0., 0., 0.],
#        [1., 1., 0., 0., 1., 0., 1., 1., 0.],
#        [1., 1., 1., 1., 0., 1., 1., 1., 1.],
#        [0., 1., 1., 0., 1., 0., 0., 1., 1.],
#        [0., 0., 0., 1., 1., 0., 0., 1., 0.],
#        [0., 0., 0., 1., 1., 1., 1., 0., 1.],
#        [0., 0., 0., 0., 1., 1., 0., 1., 0.]])

print(gdf_adj_mtx_indices)
# [0, 1, 2, 3, 4, 5, 6, 7, 8]

Adjacency list with duplicates

We can generate a list containing only the adjacent pairs using the to_adjlist() method.

Note, however, that this method produces mirrored duplicates: each pair appears twice, e.g., both (0, 1) and (1, 0).

gdf_adj_list = gdf_neighbors.to_adjlist()

print(gdf_adj_list.shape)
# (40, 3)

print(gdf_adj_list)
#     focal  neighbor  weight
# 0       0         1     1.0
# 1       0         3     1.0
# 2       0         4     1.0
# 3       1         0     1.0
# 4       1         2     1.0
# 5       1         3     1.0
# 6       1         4     1.0
# 7       1         5     1.0
# 8       2         1     1.0
# 9       2         4     1.0
# 10      2         5     1.0
# 11      3         0     1.0
# 12      3         1     1.0
# 13      3         4     1.0
# 14      3         6     1.0
# 15      3         7     1.0
# 16      4         0     1.0
# 17      4         1     1.0
# 18      4         2     1.0
# 19      4         3     1.0
# 20      4         5     1.0
# 21      4         6     1.0
# 22      4         7     1.0
# 23      4         8     1.0
# 24      5         1     1.0
# 25      5         2     1.0
# 26      5         4     1.0
# 27      5         7     1.0
# 28      5         8     1.0
# 29      6         3     1.0
# 30      6         4     1.0
# 31      6         7     1.0
# 32      7         3     1.0
# 33      7         4     1.0
# 34      7         5     1.0
# 35      7         6     1.0
# 36      7         8     1.0
# 37      8         4     1.0
# 38      8         5     1.0
# 39      8         7     1.0

Adjacency list without duplicates

If we really want to eliminate those duplicates, one small cleanup step suffices: keep only the rows where focal is less than neighbor.

# Getting the full adjacency list
gdf_adj_list = gdf_neighbors.to_adjlist()

# Trimming down and removing duplicates
gdf_adj_list_no_dups = gdf_adj_list.loc[gdf_adj_list['focal']<gdf_adj_list['neighbor']]

print(gdf_adj_list_no_dups.shape)
# (20, 3)

print(gdf_adj_list_no_dups)
#     focal  neighbor  weight
# 0       0         1     1.0
# 1       0         3     1.0
# 2       0         4     1.0
# 4       1         2     1.0
# 5       1         3     1.0
# 6       1         4     1.0
# 7       1         5     1.0
# 9       2         4     1.0
# 10      2         5     1.0
# 13      3         4     1.0
# 14      3         6     1.0
# 15      3         7     1.0
# 20      4         5     1.0
# 21      4         6     1.0
# 22      4         7     1.0
# 23      4         8     1.0
# 27      5         7     1.0
# 28      5         8     1.0
# 31      6         7     1.0
# 36      7         8     1.0

Notice how the list above eliminates those "mirrored" duplicates.
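For completeness, recent versions of GeoPandas can produce the same deduplicated pair list on their own via the built-in spatial index, without libpysal. This is a hedged sketch assuming GeoPandas >= 0.12 (where sindex.query accepts an array of geometries); it rebuilds the example grid so it runs standalone:

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Rebuild the 3x3 grid of unit squares from the question
def make_square_polygon(cx, cy, s):
    return Polygon([(cx - s/2, cy - s/2), (cx + s/2, cy - s/2),
                    (cx + s/2, cy + s/2), (cx - s/2, cy + s/2)])

gdf = gpd.GeoDataFrame(
    {'id': list(range(1, 10))},
    geometry=[make_square_polygon(x + 0.5, y + 0.5, 1)
              for y in range(3) for x in range(3)])

# The R-tree prunes pairs whose bounding boxes don't meet, so the exact
# `touches` test only runs on nearby candidates -- no RxR intermediate
left, right = gdf.sindex.query(gdf.geometry, predicate="touches")

# Keep each unordered pair once (mirrored duplicates have left > right)
pairs = [(l, r) for l, r in zip(left, right) if l < r]
print(len(pairs))
```

The result is exactly the deduplicated Mx2 pair list described in the "Optimal results" section, expressed in dataframe index positions.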

Sources

I found all of this information in libpysal's documentation, specifically the portion that discusses spatial weights and contiguity. This page also has super valuable resources regarding how to use SciPy's compressed sparse graph module to store the results. It's all super useful and well worth the read.