In doing geospatial data science work, it is very important to think about optimizing the code you are writing. How can you make datasets with hundreds of millions of rows aggregate or join faster? This is where concepts such as spatial indices come in. In this post, I will talk about how a spatial index gets implemented, what its benefits and limitations are, and take a look at Uber’s open source H3 indexing library for some cool spatial data science applications. Let’s get started!
A regular index is the kind of thing you might find at the end of a book: a list of words and where they have shown up in the text. This helps you quickly look up any reference to a word you’re interested in within a certain text. Without this handy tool, you would need to manually look through every page of your book, searching for that one mention you wanted to read about.
In modern databases, this issue of querying and searching is also very pertinent. Indexing often makes looking up data faster than filtering, and you can create indices based on a column of interest. For geospatial data in particular, engineers often need to look at operations like “intersection” or “is nearby to”. How can we make a spatial index so that these operations are as fast as possible? First, let’s take a look at some of this geospatial data:
Let’s say that we want to run a query to determine if these two shapes are intersecting. By construction, spatial databases create their index out of a bounding box that contains the geometry:
For answering whether these two features intersect, the database will compare whether the two bounding boxes have any area in common. As you can see, this can quickly lead to false positives. To fix this issue, spatial databases like PostGIS typically partition these large bounding boxes into smaller and smaller ones:
These partitions are stored in R-trees. R-trees are a hierarchical data structure: they keep track of the large “parent” bounding box, its children, its children’s children and so on. Every parent’s bounding box contains its children’s bounding boxes:
The operation “intersect” is one of the key operations that benefits from this structure. While querying an interesection, the database looks down this tree asking “does the current bounding box intersect the feature of interest?”. If yes, it looks at that bounding box’s children and asks the same question. As it does so, it is able to quickly traverse the tree, skipping the branches that do not have an intersection and thus improve the query’s performance. In the end, it returns the intersecting geometry as desired.
Let’s now take a concrete look at what using a regular row-wise procedure vs a spatial index looks like. I’ll be using 2 datasets representing NYC’s Census Tracts as well as City Facilities (both licensed through Open Data, and available here and here). First, let’s try out the “intersection” operation in GeoPandas on one of the Census Tract geometries. ‘Intersection’ in GeoPandas is a row-wise function checking each row of the column of interest against our geometry and looking at whether they intersect or not.
GeoPandas also offers a spatial index operation that uses R-trees and allows us to perform intersections as well. Here is a runtime comparison of the two methods over 100 runs of the intersection operation (note: because the default intersection function is slow, I only selected around 100 geometries from the original dataset):
As you can see, the spatial index approach offered much improved performance over the vanilla intersection method. In fact, here are the 95% confidence intervals for the runtimes of each: