Large-Scale Geospatial Data Analysis with R-Trees


The Python package Rtree provides an implementation of the R-tree data structure and comes with a number of handy features, such as nearest neighbor searches, intersection searches or multi-dimensional indexes.

We can conveniently install the package with Python’s package manager pip: pip install Rtree.

Basics

Before we handle geometries like points or polygons, we cover the basic usage of the Rtree package.

The index module helps us to construct a spatial index. This index is built up automatically by inserting bounding boxes of our objects. The bounding boxes are defined by specifying their left, bottom, right and top coordinates. Note that we insert a bounding box together with an identifier (in the above example 0 and 1). The ID will help us to identify the bounding box when performing queries:

The index is queried for a given rectangle, again specified by its left, bottom, right and top coordinates. The result of the intersection method are the IDs of the objects that are contained within the search window (examples 1–3). The result is empty in the case that the search window is beyond the bounds of data we have in the index (example 4). Similarly, we use the nearest method to find the k-nearest objects to a given search window:

Working with points, lines and polygons

In the previous section, we saw how an index is constructed by inserting bounding boxes of objects. We now want to continue by using points, lines and polygons for these objects. The package Shapely provides an easy way of working with these kind of geometries in Python:

Above, we first create a point, a line, and a polygon. Next, the bounding boxes of these objects are inserted into an index using IDs 0, 1, and 2. We now query the index for different search windows:

The illustration below shows the geometries and search windows:

Green: Point, Line and Polygon. Red: Search Windows. Image by the author.

Searching all trees in the Upper East Side

We finally have everything needed to extract all trees within the Upper East Side! We will go through a code snippet below, however the full version can be found here.

Green: Trees in New York City. Blue: Upper East Side. Orange: Bounding box of the Upper East Side. Image by the author, with map data from © Mapbox and © OpenStreetMap.

First, we load all required geometries using the GeoPandas package:

Next, we create an R-tree index containing all trees in New York City:

Now, we generate a list of potential candidates, i.e. all trees that are within the bounding box of the Upper East Side:

Finally, we iterate through all potential candidates to extract the ones that are fully within the Upper East Side:

In this article, we learned how R-trees organize geographic information by partitioning the underlaying space into rectangles. This structure makes R-trees extremely fast for spatial lookups. In our New York City street tree example, utilizing an R-tree reduced the number of operations by a factor of 60. We also saw how to work with R-trees in Python. The speed-up in our example was achieved with just four lines of code: initializing the index (1 line), constructing the index (2 lines), and using the intersection function to find nearby candidates (1 line).

So why are R-trees not used everywhere? While we gain time by reducing the number of search operations, we lose time by constructing the index. For the latter we literally have to iterate through the entire dataset. This makes R-trees not suitable for applications requiring only a small number of searches or applications where the index changes often (because of tree rebalancing).

R-trees have come a long way since their invention by Antonin Guttman in 1984. Nowadays, they are found in all sorts of applications, for example in computer graphics [2], video games [3], traffic control systems [4], and most prominently in databases for spatial data management [5]. And perhaps in your next geospatial data analysis, too!



Source link

Leave a Comment