Mixed Effects Machine Learning for Longitudinal & Panel Data with GPBoost (Part III) | by Fabio Sigrist


A demo of GPBoost in Python & R using real-world data

Illustration of longitudinal data: time series plots for different subjects (idcode) — Image by author

In Part I and Part II of this series, we showed how random effects can be used for modeling high-cardinality categorical variables in machine learning models, and we introduced the GPBoost library, which implements the GPBoost algorithm combining tree-boosting with random effects. In this article, we demonstrate how the Python and R packages of the GPBoost library can be used for longitudinal data (aka repeated measures or panel data). You might want to read Part II of this series first, as it introduces the GPBoost library. GPBoost version 1.2.1 is used in this demo.

Table of contents

1 Data: description, loading, and sample split
2 Modeling options for longitudinal data in GPBoost
· · 2.1 Subject grouped random effects
· · 2.2 Fixed effects only
· · 2.3 Subject and time grouped random effects
· · 2.4 Subject random effects with temporal random slopes
· · 2.5 Subject-specific AR(1) / Gaussian process models
· · 2.6 Subject grouped random effects and a joint AR(1) model
3 Training a GPBoost model
4 Choosing tuning parameters
5 Prediction
6 Conclusion and references

The data used in this demo is the wages data, which was already used in Part II. It can be downloaded from here. The data set contains a total of 28’013 samples for 4’711 persons, for whom data was measured over several years. Such data is called longitudinal data, or panel data, since for every subject (person ID = idcode), data was collected repeatedly over time (years = t). In other words, the samples for every level of the categorical variable idcode are repeated measurements over time. The response variable is the logarithmic real wage (ln_wage), and the data includes several predictor variables such as age, total work…
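To make the structure of longitudinal data concrete, here is a minimal sketch in plain Python. It does not load the real wages data: the values, the number of persons, the year range, and the simulated person-specific effect are all made up for illustration. Only the column names idcode, t, and ln_wage are taken from the description above.

```python
import random
from collections import defaultdict

random.seed(1)

# Synthetic stand-in for the wages data (NOT the real data set).
# "Long" format: one row per (idcode, t) pair.
rows = []
for idcode in range(1, 6):                 # 5 hypothetical persons
    subject_effect = random.gauss(0, 0.3)  # made-up person-specific deviation
    n_years = random.randint(3, 6)         # persons are observed for different numbers of years
    start = random.randint(68, 75)         # hypothetical first year of observation
    for t in range(start, start + n_years):
        ln_wage = 1.5 + 0.02 * (t - 68) + subject_effect + random.gauss(0, 0.1)
        rows.append({"idcode": idcode, "t": t, "ln_wage": ln_wage})

# Group the repeated measurements by subject (one group per level of idcode)
by_subject = defaultdict(list)
for row in rows:
    by_subject[row["idcode"]].append(row)

# Summarize each subject's time series
for idcode, obs in by_subject.items():
    years = [r["t"] for r in obs]
    mean_wage = sum(r["ln_wage"] for r in obs) / len(obs)
    print(f"idcode={idcode}: {len(obs)} samples, "
          f"years {years[0]}-{years[-1]}, mean ln_wage={mean_wage:.2f}")
```

Each level of idcode owns several rows, one per year, so samples from the same person are not independent of each other. This dependence within subjects is what the modeling options for longitudinal data are meant to account for.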


