The GPBoost algorithm extends linear mixed effects and Gaussian process models by replacing the linear fixed effects function with a non-parametric, non-linear function modeled using tree-boosting. This article shows how the GPBoost algorithm implemented in the `GPBoost` library can be used for modeling data with a spatial and grouped structure. We demonstrate the functionality of the `GPBoost` library using European GDP data, which is an example of areal spatial econometric data.

## Table of contents

**∘ Introduction**

· Table of contents

· Basic workflow of GPBoost

· Data description

· Data loading and short visualization

**∘ Training a GPBoost model**

**∘ Choosing tuning parameters**

**∘ Model interpretation**

· Estimated random effects model

· Spatial effect map

· Understanding the fixed effects function

**∘ Extensions**

· Separate random effects for different time periods

· Interaction between space and fixed effects predictor variables

· Large data

· Other spatial random effects models

· (Generalized) linear mixed effects and Gaussian process models

**∘ References**

## Basic workflow of GPBoost

Applying a GPBoost model (= combined tree-boosting and random effects / GP models) involves the following main steps:

- Define a `GPModel` in which you specify the following:
  – A random effects model (e.g., spatial random effects, grouped random effects, combined spatial and grouped, etc.)
  – The likelihood (= the distribution of the response variable conditional on the fixed and random effects)
- Create a `gpb.Dataset` with the response variable and the fixed effects predictor variables
- Choose tuning parameters, e.g., using the function `gpb.grid.search.tune.parameters`
- Train the model with the `gpboost` / `gpb.train` function
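The steps above can be sketched as follows. This is an illustrative outline only, not the analysis of this article; `coords`, `X`, and `y` are placeholders for spatial coordinates, predictor variables, and the response variable.

```r
library(gpboost)

# 1. Define a GPModel: here, a spatial Gaussian process random effect
#    with a Gaussian likelihood (coords is a placeholder coordinate matrix)
gp_model <- GPModel(gp_coords = coords, cov_function = "exponential",
                    likelihood = "gaussian")

# 2. Create a gpb.Dataset with the response y and predictor variables X
boost_data <- gpb.Dataset(data = X, label = y)

# 3. Choose tuning parameters (fixed here for brevity; ideally chosen
#    with gpb.grid.search.tune.parameters, see below)
params <- list(learning_rate = 0.01, max_depth = 2)

# 4. Train the combined tree-boosting and random effects model
gpboost_model <- gpb.train(data = boost_data, gp_model = gp_model,
                           nrounds = 100, params = params)
```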

In this article, we demonstrate these steps using a real-world data set. In addition, we show how to interpret fitted models, and we look at several extensions and additional features of the `GPBoost` library. All results are obtained using `GPBoost` version 1.2.1. This demo uses the R package, but the corresponding Python package provides the same functionality.

## Data description

The data used in this demo is European gross domestic product (GDP) data. It can be downloaded from https://raw.githubusercontent.com/fabsig/GPBoost/master/data/gdp_european_regions.csv. The data was collected by Massimo Giannini, University of Rome Tor Vergata, from Eurostat and kindly provided to Fabio Sigrist for a talk at the University of Rome Tor Vergata on June 16, 2023.

Data was collected for 242 European regions for the two years 2000 and 2021, i.e., the total number of data points is 484. The response variable is (log) GDP / capita. There are four predictor variables:

- `L`: (log) share of employment (empl/pop)
- `K`: (log) fixed capital / population
- `pop`: log(population)
- `edu`: share of tertiary education

Further, there are centroid spatial coordinates for every region (longitude and latitude), a spatial region ID for the 242 different European regions, and an additional spatial cluster ID which identifies the cluster a region belongs to (there are two clusters).

## Data loading and short visualization

We first load the data and create a map illustrating the (log) GDP / capita over space. We create two maps: one with all data and another one excluding some remote islands. In the commented code below, we also show how to create a spatial plot when no shape file with spatial polygons is available.

```r
library(gpboost)
library(ggplot2)
library(gridExtra)
library(viridis)
library(sf)

# Load data
data <- read.csv("https://raw.githubusercontent.com/fabsig/GPBoost/master/data/gdp_european_regions.csv")
FID <- data$FID
data <- as.matrix(data[, names(data) != "FID"]) # convert to matrix since the boosting part currently does not support data.frames
covars <- c("L", "K", "pop", "edu")

# Load shape file for spatial plots
cur_tempfile <- tempfile()
download.file(url = "https://raw.githubusercontent.com/fabsig/GPBoost/master/data/shape_European_regions.zip", destfile = cur_tempfile)
out_directory <- tempfile()
unzip(cur_tempfile, exdir = out_directory)
shape <- st_read(dsn = out_directory)

# Create spatial plot of GDP
data_plot <- merge(shape, data.frame(FID = FID, y = data[, "y"]), by = "FID")
p1 <- ggplot(data_plot) + geom_sf(aes(group = FID, fill = y)) +
  scale_fill_viridis(name = "GDP / capita (log)", option = "B") +
  ggtitle("GDP / capita (log)")
# Same plot excluding islands
p2 <- ggplot(data_plot) + geom_sf(aes(group = FID, fill = y)) +
  scale_fill_viridis(name = "GDP / capita (log)", option = "B") +
  xlim(2700000, 6400000) + ylim(1500000, 5200000) +
  ggtitle("GDP / capita (log) -- excluding islands")
grid.arrange(p1, p2, ncol = 2)

# # Spatial plot without a shape file
# p1 <- ggplot(data = data.frame(Lat = data[, "Lat"], Long = data[, "Long"],
#                                GDP = data[, "y"]), aes(x = Long, y = Lat, color = GDP)) +
#   geom_point(size = 2, alpha = 0.5) + scale_color_viridis(option = "B") +
#   ggtitle("GDP / capita (log)")
# p2 <- ggplot(data = data.frame(Lat = data[, "Lat"], Long = data[, "Long"],
#                                GDP = data[, "y"]), aes(x = Long, y = Lat, color = GDP)) +
#   geom_point(size = 3, alpha = 0.5) + scale_color_viridis(option = "B") +
#   ggtitle("GDP / capita (log) -- Europe excluding islands") + xlim(-10, 28) + ylim(35, 67)
# grid.arrange(p1, p2, ncol = 2)
```

## Training a GPBoost model

In the following, we use a Gaussian process model with an exponential covariance function to model spatial random effects. Additionally, we include grouped random effects for the cluster variable `cl`. In the `GPBoost` library, Gaussian process random effects are defined via the `gp_coords` argument and grouped random effects via the `group_data` argument of the `GPModel` constructor. The above-mentioned predictor variables are used in the fixed effects tree-ensemble function. We fit the model using the `gpboost` function (or, equivalently, `gpb.train`). Note that we use the tuning parameters that are selected below via cross-validation.

```r
gp_model <- GPModel(group_data = data[, c("cl")],
                    gp_coords = data[, c("Long", "Lat")],
                    likelihood = "gaussian", cov_function = "exponential")
params <- list(learning_rate = 0.01, max_depth = 2, num_leaves = 2^10,
               min_data_in_leaf = 10, lambda_l2 = 0)
nrounds <- 37
# gp_model$set_optim_params(params = list(trace = TRUE)) # to monitor hyperparameter estimation
boost_data <- gpb.Dataset(data = data[, covars], label = data[, "y"])
gpboost_model <- gpboost(data = boost_data, gp_model = gp_model,
                         nrounds = nrounds, params = params,
                         verbose = 1) # same as gpb.train
```

## Choosing tuning parameters

It is important that tuning parameters are appropriately chosen for boosting. There are no universal default values, and every data set will likely need different tuning parameters. Below we show how tuning parameters can be chosen using the `gpb.grid.search.tune.parameters` function. Note that this can take some time. Instead of a full grid search as below, one can also do a random grid search to speed things up (see the `num_try_random` argument). We use the mean squared error (`mse`) as the prediction accuracy measure on the validation data. Alternatively, one can also use, e.g., the test negative log-likelihood (`test_neg_log_likelihood`, the default if nothing is specified), which also takes prediction uncertainty into account.

```r
gp_model <- GPModel(group_data = data[, "cl"],
                    gp_coords = data[, c("Long", "Lat")],
                    likelihood = "gaussian", cov_function = "exponential")
boost_data <- gpb.Dataset(data = data[, covars], label = data[, "y"])
param_grid <- list("learning_rate" = c(1, 0.1, 0.01),
                   "min_data_in_leaf" = c(10, 100, 1000),
                   "max_depth" = c(1, 2, 3, 5, 10),
                   "lambda_l2" = c(0, 1, 10))
other_params <- list(num_leaves = 2^10)
set.seed(1)
opt_params <- gpb.grid.search.tune.parameters(param_grid = param_grid, params = other_params,
                                              num_try_random = NULL, nfold = 4,
                                              data = boost_data, gp_model = gp_model,
                                              nrounds = 1000, early_stopping_rounds = 10,
                                              verbose_eval = 1, metric = "mse") # metric = "test_neg_log_likelihood"
opt_params
# ***** New best test score (l2 = 0.0255393919591794) found for the following parameter combination:
# learning_rate: 0.01, min_data_in_leaf: 10, max_depth: 2, lambda_l2: 0, nrounds: 37
```

## Estimated random effects model

Information on the estimated random effects model can be obtained by calling the `summary` function on the `gp_model`. For Gaussian processes, `GP_var` is the marginal variance, i.e., the amount of spatial correlation or structured spatial variation, and `GP_range` is the range, or scale, parameter that measures how fast the correlation decays over space. For an exponential covariance function, three times this value (approx. 17 here) is the distance at which the (residual) spatial correlation is essentially zero (below 5%). As the results below show, the amount of spatial correlation is relatively small: the marginal variance of 0.0089 is small compared to the total variance of the response variable, which is approx. 0.29. Additionally, the error term and the `cl` grouped random effects also have small variances (0.013 and 0.022). We thus conclude that most of the variance in the response variable is explained by the fixed effects predictor variables.

```r
summary(gp_model)
## =====================================================
## Covariance parameters (random effects):
##            Param.
## Error_term 0.0131
## Group_1    0.0221
## GP_var     0.0089
## GP_range   5.7502
## =====================================================
```

## Spatial effect map

We can plot the estimated (“predicted”) spatial random effects at the training data locations by calling the `predict` function on the training data; see the code below. Such a plot shows the spatial effect after factoring out the effect of the fixed effects predictor variables. Note that these spatial effects take into account both the spatial Gaussian process and the grouped region cluster random effects. If one wants to obtain only the spatial random effects from the Gaussian process part, one can use the `predict_training_data_random_effects` function (see the commented code below). Alternatively, one can also do spatial extrapolation (= “kriging”), but this makes little sense for areal data.

```r
pred <- predict(gpboost_model, group_data_pred = data[1:242, c("cl")],
                gp_coords_pred = data[1:242, c("Long", "Lat")],
                data = data[1:242, covars], predict_var = TRUE, pred_latent = TRUE)
data_plot <- merge(shape, data.frame(FID = FID, y = pred$random_effect_mean), by = "FID")
plot_mu <- ggplot(data_plot) + geom_sf(aes(group = FID, fill = y)) +
  scale_fill_viridis(name = "Spatial effect", option = "B") +
  ggtitle("Spatial effect (mean)") + xlim(2700000, 6400000) + ylim(1500000, 5200000)
data_plot <- merge(shape, data.frame(FID = FID, y = sqrt(pred$random_effect_cov)), by = "FID") # variance -> std. dev.
plot_sd <- ggplot(data_plot) + geom_sf(aes(group = FID, fill = y)) +
  scale_fill_viridis(name = "Std. dev.", option = "B") +
  ggtitle("Uncertainty (std. dev.)") + xlim(2700000, 6400000) + ylim(1500000, 5200000)
grid.arrange(plot_mu, plot_sd, ncol = 2)

# Only spatial effects from the Gaussian process
# rand_effs <- predict_training_data_random_effects(gp_model, predict_var = TRUE)[1:242, c("GP", "GP_var")]
# data_plot <- merge(shape, data.frame(FID = FID, y = rand_effs[, 1]), by = "FID")
# plot_mu <- ggplot(data_plot) + geom_sf(aes(group = FID, fill = y)) +
#   scale_fill_viridis(name = "Spatial effect (mean)", option = "B") +
#   ggtitle("Spatial effect from Gaussian process (mean)") + xlim(2700000, 6400000) + ylim(1500000, 5200000)
# data_plot <- merge(shape, data.frame(FID = FID, y = sqrt(rand_effs[, 2])), by = "FID")
# plot_sd <- ggplot(data_plot) + geom_sf(aes(group = FID, fill = y)) +
#   scale_fill_viridis(name = "Uncertainty (std. dev.)", option = "B") +
#   ggtitle("Uncertainty (std. dev.)") + xlim(2700000, 6400000) + ylim(1500000, 5200000)
# grid.arrange(plot_mu, plot_sd, ncol = 2)
```

## Understanding the fixed effects function

There are several tools for understanding the form of the fixed effects function. Below we consider variable importance measures, interaction measures, and dependence plots. Specifically, we look at

- SHAP values
- SHAP dependence plots
- Split-based variable importance
- Friedman’s H-statistic
- Partial dependence plots (one and two-dimensional)

As the results below show, the information obtained from SHAP values and SHAP dependence plots is the same as when looking at traditional variable importance measures and partial dependence plots. The most important variables are ‘K’ and ‘edu’. From the dependence plots, we conclude that there are non-linearities. For instance, the effect of K is almost flat for large and small values of K and increasing in-between. Further, the effect of edu is steeper for small values of edu and flattens for larger values of edu. Friedman’s H-statistic indicates that there are some interactions. For the two variables with the largest amount of interaction, L and pop, we create a two-dimensional partial dependence plot below.

```r
# SHAP values
library(SHAPforxgboost)
shap.plot.summary.wrap1(gpboost_model, X = data[, covars]) + ggtitle("SHAP values")

# SHAP dependence plots
shap_long <- shap.prep(gpboost_model, X_train = data[, covars])
shap.plot.dependence(data_long = shap_long, x = covars[2],
                     color_feature = covars[4], smooth = FALSE, size = 2) +
  ggtitle("SHAP dependence plot for K")
shap.plot.dependence(data_long = shap_long, x = covars[4],
                     color_feature = covars[2], smooth = FALSE, size = 2) +
  ggtitle("SHAP dependence plot for edu")

# Split-based feature importances
feature_importances <- gpb.importance(gpboost_model, percentage = TRUE)
gpb.plot.importance(feature_importances, top_n = 25, measure = "Gain",
                    main = "Split-based variable importances")

# H-statistic for interactions (takes a few seconds)
# Note: there are no interactions if max_depth = 1
library(flashlight)
fl <- flashlight(model = gpboost_model, data = data.frame(y = data[, "y"], data[, covars]),
                 y = "y", label = "gpb",
                 predict_fun = function(m, X) predict(m, data.matrix(X[, covars]),
                                                      gp_coords_pred = matrix(-100, ncol = 2, nrow = dim(X)[1]),
                                                      group_data_pred = matrix(-1, ncol = 1, nrow = dim(X)[1]),
                                                      pred_latent = TRUE)$fixed_effect)
plot(imp <- light_interaction(fl, v = covars, pairwise = TRUE)) +
  ggtitle("H interaction statistic")

# Partial dependence plots
gpb.plot.partial.dependence(gpboost_model, data[, covars], variable = 2, xlab = covars[2],
                            ylab = "gdp", main = "Partial dependence plot")
gpb.plot.partial.dependence(gpboost_model, data[, covars], variable = 4, xlab = covars[4],
                            ylab = "gdp", main = "Partial dependence plot")

# Two-dimensional partial dependence plot (to visualize interactions)
i <- 1; j <- 3 # i vs. j
gpb.plot.part.dep.interact(gpboost_model, data[, covars], variables = c(i, j), xlab = covars[i],
                           ylab = covars[j], main = "Pairwise partial dependence plot")
```

## Separate random effects for different time periods

In the above model, we have used the same random effects for both years 2000 and 2021. Alternatively, one can also use independent spatial and grouped random effects for different time periods (*independent under the prior; conditional on the data, there is dependence …*). In the `GPBoost` library, this can be done via the `cluster_ids` argument. `cluster_ids` needs to be a vector of length equal to the sample size, and every entry indicates the “cluster” to which an observation belongs. Different clusters then have independent spatial and grouped random effects, but the hyperparameters (e.g., marginal variance, variance components, etc.) and the fixed effects function are the same across clusters.

Below we show how to fit such a model and create two separate spatial maps. As the results below show, the spatial effects are quite different for the two years 2000 and 2021. In particular, there is less (residual) spatial variation (or heterogeneity) for the year 2021 compared to 2000. This is confirmed by looking at the standard deviation of the predicted spatial random effects, which is almost twice as large for 2000 as for 2021 (see below). A further conclusion is that in 2000, there were more regions with “low” GDP numbers (spatial effects below 0), and this is no longer the case in 2021.

```r
gp_model <- GPModel(group_data = data[, c("cl")], gp_coords = data[, c("Long", "Lat")],
                    likelihood = "gaussian", cov_function = "exponential",
                    cluster_ids = c(rep(1, 242), rep(2, 242)))
boost_data <- gpb.Dataset(data = data[, covars], label = data[, "y"])
params <- list(learning_rate = 0.01, max_depth = 1, num_leaves = 2^10,
               min_data_in_leaf = 10, lambda_l2 = 1)
# Note: we use the same tuning parameters as above. Ideally, they would be chosen again
gpboost_model <- gpboost(data = boost_data, gp_model = gp_model, nrounds = nrounds,
                         params = params, verbose = 0)

# Separate spatial maps for the years 2000 and 2021
pred <- predict(gpboost_model, group_data_pred = data[, c("cl")],
                gp_coords_pred = data[, c("Long", "Lat")],
                data = data[, covars], predict_var = TRUE, pred_latent = TRUE,
                cluster_ids_pred = c(rep(1, 242), rep(2, 242)))
data_plot <- merge(shape, data.frame(FID = FID, y = pred$random_effect_mean[1:242]), by = "FID")
plot_mu_2000 <- ggplot(data_plot) + geom_sf(aes(group = FID, fill = y)) +
  scale_fill_viridis(name = "Spatial effect", option = "B") +
  ggtitle("Spatial effect for 2000 (mean)") + xlim(2700000, 6400000) + ylim(1500000, 5200000)
data_plot <- merge(shape, data.frame(FID = FID, y = pred$random_effect_mean[243:484]), by = "FID")
plot_mu_2021 <- ggplot(data_plot) + geom_sf(aes(group = FID, fill = y)) +
  scale_fill_viridis(name = "Spatial effect", option = "B") +
  ggtitle("Spatial effect for 2021 (mean)") + xlim(2700000, 6400000) + ylim(1500000, 5200000)
grid.arrange(plot_mu_2000, plot_mu_2021, ncol = 2)
sd(pred$random_effect_mean[1:242])   # 0.2321267
sd(pred$random_effect_mean[243:484]) # 0.1286398
```

## Interaction between space and fixed effects predictor variables

In the above model, there is no interaction between the random effects and the fixed effects predictor variables, i.e., no interaction between the spatial coordinates and the fixed effects predictor variables. Such an interaction can be modeled by additionally including the random effects input variables (= the spatial coordinates or the categorical grouping variable) in the fixed effects function. The code below shows how this can be done. As the variable importance plot below shows, the coordinates are not used in the tree-ensemble, and there is thus no such interaction detectable for this data set.

```r
gp_model <- GPModel(group_data = data[, c("cl")], gp_coords = data[, c("Long", "Lat")],
                    likelihood = "gaussian", cov_function = "exponential")
covars_interact <- c(c("Long", "Lat"), covars) # add spatial coordinates to fixed effects predictor variables
boost_data <- gpb.Dataset(data = data[, covars_interact], label = data[, "y"])
params <- list(learning_rate = 0.01, max_depth = 1, num_leaves = 2^10,
               min_data_in_leaf = 10, lambda_l2 = 1)
# Note: we use the same tuning parameters as above. Ideally, they would be chosen again
gpboost_model <- gpboost(data = boost_data, gp_model = gp_model, nrounds = nrounds,
                         params = params, verbose = 0)
feature_importances <- gpb.importance(gpboost_model, percentage = TRUE)
gpb.plot.importance(feature_importances, top_n = 25, measure = "Gain",
                    main = "Variable importances when including coordinates in the fixed effects")
```

## Large data

For large data sets, calculations with Gaussian processes are slow, and one has to use an approximation. The `GPBoost` library implements several such approximations. For instance, setting `gp_approx = "vecchia"` in the `GPModel` constructor will use a Vecchia approximation. The data set used in this article is relatively small, and we can do all calculations exactly.
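As a sketch, defining such a model for the data of this article could look as follows. The value of `num_neighbors` below is an illustrative choice (it controls the accuracy vs. speed trade-off of the Vecchia approximation), not a recommendation.

```r
# Sketch: spatial Gaussian process model with a Vecchia approximation
# (num_neighbors = number of nearest neighbors used in the approximation;
#  20 is an illustrative value)
gp_model <- GPModel(gp_coords = data[, c("Long", "Lat")],
                    cov_function = "exponential", likelihood = "gaussian",
                    gp_approx = "vecchia", num_neighbors = 20)
```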

## Other spatial random effects models

Above, we have used a Gaussian process to model spatial random effects. Since the data is areal data, another option is to use models that rely on neighborhood information, such as CAR and SAR models. These models are currently not implemented in the `GPBoost` library (*they might be added in the future; contact the author*).

Another option is to use a grouped random effects model defined by the categorical region ID variable for modeling spatial effects. The code below shows how this can be done in `GPBoost`. However, this model essentially ignores the spatial structure of the data.

```r
gp_model <- GPModel(group_data = data[, c("group", "cl")], likelihood = "gaussian")
```

## (Generalized) linear mixed effects and Gaussian process models

(Generalized) linear mixed effects and Gaussian process models can also be fitted with the `GPBoost` library. The code below shows how this can be done using the `fitGPModel` function. Note that one needs to manually add a column of 1's to include an intercept.

```r
X_incl_1 <- cbind(Intercept = rep(1, dim(data)[1]), data[, covars])
gp_model <- fitGPModel(group_data = data[, c("cl")], gp_coords = data[, c("Long", "Lat")],
                       likelihood = "gaussian", cov_function = "exponential",
                       y = data[, "y"], X = X_incl_1, params = list(std_dev = TRUE))
summary(gp_model)
## =====================================================
## Model summary:
##  Log-lik     AIC     BIC
##   195.97 -373.94 -336.30
## Nb. observations: 484
## Nb. groups: 2 (Group_1)
## -----------------------------------------------------
## Covariance parameters (random effects):
##            Param. Std. dev.
## Error_term 0.0215    0.0017
## Group_1    0.0003    0.0008
## GP_var     0.0126    0.0047
## GP_range   7.2823    3.6585
## -----------------------------------------------------
## Linear regression coefficients (fixed effects):
##            Param. Std. dev. z value P(>|z|)
## Intercept 16.2816    0.4559 35.7128  0.0000
## L          0.4243    0.0565  7.5042  0.0000
## K          0.6493    0.0212 30.6755  0.0000
## pop        0.0134    0.0097  1.3812  0.1672
## edu        0.0100    0.0009 10.9645  0.0000
## =====================================================
```