Python – Parallelization of RTree Index for Spatial Indexing

python rtree spatial-index

Is it possible to share an rtree index in memory between multiple processes? I have managed to use rtree in a multiprocessing environment with joblib, but the problem is that each process has its own copy of the index. With a fairly large, multi-gigabyte index and several processes, it runs out of memory.

I know that rtree is not thread-safe, but if the processes only perform read operations, would it be possible to store the index in shared memory?

Best Answer

One of the authors says no: read operations are not thread-safe either.

However, you could explore using a lock to prevent reads from happening simultaneously, or running a separate process/thread whose sole job is to tend to the index and querying it by passing data through queues.
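As a minimal sketch of the lock idea, assuming several threads share one index object: the wrapper below lets only one query run at a time. `LockedIndex` is a hypothetical name, and the wrapped object only needs an `intersection(bbox)` method, which is how an rtree index is normally queried.

```python
import threading


class LockedIndex:
    """Wrap a spatial index so that only one thread queries it at a time."""

    def __init__(self, index):
        self._index = index            # e.g. an rtree index (assumption)
        self._lock = threading.Lock()

    def intersection(self, bbox):
        with self._lock:
            # Consume the result while still holding the lock, in case the
            # wrapped index hands back a lazy iterator.
            return list(self._index.intersection(bbox))
```

Every thread would then call `locked.intersection(bbox)` instead of touching the underlying index directly.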

If most of your code's execution time is spent doing queries, then serializing access in either of these ways would probably hurt performance too much to be feasible, but if hitting the index is just a small part of a larger processing pipeline, it could work out.
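The dedicated-owner idea could look roughly like this. It is a sketch using threads and `queue.Queue`; the function names `serve_index` and `query` are made up for illustration, and the index is again assumed to expose `intersection(bbox)`.

```python
import queue
import threading


def serve_index(index, requests):
    """Owner loop: the only code that ever touches `index`."""
    while True:
        item = requests.get()
        if item is None:               # sentinel value: shut down
            break
        bbox, reply = item
        reply.put(list(index.intersection(bbox)))


def query(requests, bbox):
    """Called from any worker thread; blocks until the owner answers."""
    reply = queue.Queue(maxsize=1)     # private reply channel for this request
    requests.put((bbox, reply))
    return reply.get()
```

The owner would be started once with `threading.Thread(target=serve_index, args=(index, requests))` and stopped by putting `None` on the request queue. Across processes the same pattern would need `multiprocessing` queues or pipes and a different reply mechanism, since a per-request queue cannot simply be sent through another queue.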

"Multi gigabyte index" also suggests that you might be storing [a copy of] the queryable data in the index itself. If your goal is simply to reduce the memory footprint, you could perhaps leave obj set to None and instead use the returned ID to look up the data, which lives in shared memory, while continuing to make a copy of the index for each separate process.
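One way to sketch the ID-lookup idea uses the standard library's `multiprocessing.shared_memory` (Python 3.8+). The record layout here (two float64 values per item) and the helper names are assumptions made for illustration; real payloads would need their own fixed-size encoding or an offset table.

```python
import struct
from multiprocessing import shared_memory

# Hypothetical payload layout: two little-endian float64 values per item.
RECORD = struct.Struct("<dd")


def create_payload_block(records):
    """Pack fixed-size records into a new shared block, one slot per ID."""
    shm = shared_memory.SharedMemory(create=True,
                                     size=RECORD.size * len(records))
    for item_id, rec in enumerate(records):
        RECORD.pack_into(shm.buf, item_id * RECORD.size, *rec)
    return shm


def lookup(shm, item_id):
    """Read a single record by ID straight from the shared block."""
    return RECORD.unpack_from(shm.buf, item_id * RECORD.size)
```

Each worker would attach with `shared_memory.SharedMemory(name=...)`, build its own small index by calling `insert(item_id, bbox)` without an `obj`, and pass the IDs returned by `intersection(bbox)` to `lookup`. The creating process should call `close()` and `unlink()` on the block when it is done.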