High-cardinality categorical variables are variables for which the number of different levels is large relative to the sample size of a data set, or in other words, there are few data points per level of a categorical variable. Machine learning methods can have difficulties with high-cardinality variables. In this article, we argue that random effects are an effective tool for modeling high-cardinality categorical variables in machine learning models. In particular, we empirically compare several versions of two of the most successful machine learning methods, tree-boosting and deep neural networks, as well as linear mixed effects models using multiple tabular data sets with high-cardinality categorical variables. Our results show that, first, machine learning models with random effects perform better than their counterparts without random effects, and, second, tree-boosting with random effects outperforms deep neural networks with random effects.

**Table of contents**

· 1 Introduction

· 2 Random effects for modeling high-cardinality categorical variables

· 3 Comparison of different methods using real-world data sets

· 4 Conclusion

· References

A simple strategy for dealing with categorical variables is to use one-hot encoding or dummy variables. But this approach often does not work well for high-cardinality categorical variables for the reasons described below. For neural networks, a frequently adopted solution is entity embeddings [Guo and Berkhahn, 2016], which map every level of a categorical variable into a low-dimensional Euclidean space. For tree-boosting, a simple approach is to assign a number to every level of a categorical variable and then treat this as a one-dimensional numeric variable. An alternative solution, implemented in the `LightGBM` boosting library [Ke et al., 2017], partitions all levels into two subsets using an approximate approach [Fisher, 1958] when finding splits in the tree-building algorithm. Further, the `CatBoost` boosting library [Prokhorenkova et al., 2018] implements an approach based on ordered target statistics calculated using random partitions of the training data for handling categorical predictor variables.
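To make these encodings concrete, the following numpy sketch (our own illustration; the variable names are made up) contrasts label encoding, one-hot encoding, and a toy entity-embedding lookup:

```python
import numpy as np

rng = np.random.default_rng(0)
levels = np.array(["a", "b", "c", "a", "c"])  # a raw categorical variable (n = 5)

# Label encoding: map every level to an integer 0..q-1 (the "numeric" approach)
uniques, codes = np.unique(levels, return_inverse=True)
q = len(uniques)  # number of levels

# One-hot / dummy encoding: one binary column per level, shape (n, q)
one_hot = np.eye(q)[codes]

# Entity embedding: look up a low-dimensional vector per level, shape (n, d).
# In a neural network the table is learned; here it is just random.
d = 2
embedding_table = rng.normal(size=(q, d))
embedded = embedding_table[codes]
```

For a high-cardinality variable, q is large, so the one-hot matrix becomes very wide and sparse, while the embedding keeps the dimension fixed at d.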

Random effects are an effective tool for modeling high-cardinality categorical variables. In the regression case with a single high-cardinality categorical variable, a random effects model can be written as

y_ij = F(x_ij) + b_i + ε_ij,  b_i ~ N(0, σ²_1),  ε_ij ~ N(0, σ²),

where j = 1,…,n_i is the sample index within level i, with n_i being the number of samples for which the categorical variable attains level i, and i = 1,…,q denotes the level, with q being the total number of levels of the categorical variable. The total number of samples is thus n = n_1 + n_2 + … + n_q. Such a model is also called a mixed effects model since it contains both fixed effects F(x_ij) and random effects b_i; x_ij are the fixed effects predictor variables, or features. Mixed effects models can be extended to other response variable distributions (e.g., classification) and multiple categorical variables.
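For illustration, data from such a model with a single high-cardinality categorical variable can be simulated in a few lines (a sketch with made-up parameter values, not code from any of the libraries discussed below):

```python
import numpy as np

rng = np.random.default_rng(1)
q, n_i = 1000, 5                # many levels, few samples per level
sigma2_1, sigma2 = 1.0, 1.0     # variances of the random effects and the error term

group = np.repeat(np.arange(q), n_i)     # level i of every sample
x = rng.uniform(size=q * n_i)            # fixed effects feature x_ij
F = np.sin(2 * np.pi * x)                # some non-linear fixed effects function F
b = rng.normal(scale=np.sqrt(sigma2_1), size=q)        # random effects b_i
eps = rng.normal(scale=np.sqrt(sigma2), size=q * n_i)  # error term eps_ij
y = F + b[group] + eps                   # y_ij = F(x_ij) + b_i + eps_ij
```

Note that all samples sharing the same level i also share the same b_i, which is what introduces the dependence among samples mentioned below.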

Traditionally, random effects were used in linear models in which F is assumed to be a linear function. In recent years, linear mixed effects models have been extended to non-linear ones using random forest [Hajjem et al., 2014], tree-boosting [Sigrist, 2022, 2023a], and, most recently (in terms of first public preprint), deep neural networks [Simchoni and Rosset, 2021, 2023]. In contrast to classical independent machine learning models, random effects introduce dependence among samples.

**Why are random effects useful for high-cardinality categorical variables?**

For high-cardinality categorical variables, there is little data for every level. Intuitively, if the response variable has a different (conditional) mean for many levels, traditional machine learning models (with, e.g., one-hot encoding, embeddings, or simply one-dimensional numeric variables) may struggle with over- or underfitting on such data. From the point of view of the classical bias-variance trade-off, independent machine learning models may have difficulties balancing this trade-off and finding an appropriate amount of regularization. For instance, overfitting may occur, meaning that a model has low bias but high variance.
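This trade-off is easy to demonstrate numerically. The following self-contained sketch (simulated data, not one of the benchmark data sets) compares ignoring the categorical variable, using per-level sample means, and shrinking the per-level means towards the global mean:

```python
import numpy as np

rng = np.random.default_rng(2)
q, n_i = 2000, 2                 # many levels, only 2 training samples per level
b = rng.normal(size=q)           # true per-level effects, variance 1
y_train = b[:, None] + rng.normal(size=(q, n_i))  # noisy training data, error variance 1
y_test = b + rng.normal(size=q)                   # one test sample per level

level_means = y_train.mean(axis=1)
global_mean = y_train.mean()

pred_ignore = np.full(q, global_mean)             # underfitting: high bias, low variance
pred_full = level_means                           # overfitting: low bias, high variance
shrink = 1.0 / (1.0 / n_i + 1.0)                  # shrinkage factor with both variances = 1
pred_shrunk = global_mean + shrink * (level_means - global_mean)

mse = lambda pred: float(np.mean((y_test - pred) ** 2))
# With so few samples per level, shrinking the level means towards the global
# mean gives a lower test MSE than either of the two extremes.
```

This is precisely the "information pooling" behavior of random effects models discussed below.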

Broadly speaking, random effects act as a prior, or regularizer, that models the difficult part of a function, i.e., the part whose “dimension” is similar to the total sample size, and, in doing so, they provide an effective way of balancing over- and underfitting, or bias and variance. For instance, for a single categorical variable, random effects models shrink estimates of group intercept effects towards the global mean. This process is sometimes also called “information pooling”. It represents a trade-off between completely ignoring the categorical variable (= underfitting / high bias and low variance) and giving every level of the categorical variable “complete freedom” in estimation (= overfitting / low bias and high variance). Importantly, the amount of regularization, which is determined by the variance parameters of the model, is learned from the data. Specifically, in the above single-level random effects model, a (point) prediction for the response variable for a sample with predictor variables x_p and categorical variable at level i is given by

ŷ_p = F(x_p) + σ²_1 / (σ²/n_i + σ²_1) · (ȳ_i − F̄_i),

where F(x_p) is the trained function evaluated at x_p, σ²_1 and σ² are variance estimates, and ȳ_i and F̄_i are the sample means of y_ij and F(x_ij), respectively, for level i. Ignoring the categorical variable would give the prediction ŷ_p = F(x_p), and a fully flexible model without regularization gives ŷ_p = F(x_p) + (ȳ_i − F̄_i). That is, the random effects model interpolates between these two extremes via the shrinkage factor σ²_1 / (σ²/n_i + σ²_1), which approaches one when the number of samples n_i for level i is large, so that shrinkage matters most for sparsely populated levels. Related to this, random effects models allow for more efficient (i.e., lower variance) estimation of the fixed effects function F(·) [Sigrist, 2022].
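Numerically, the shrinkage factor behaves as follows (a minimal sketch using the symbols above; the variance values and sample means are made up):

```python
def shrinkage_factor(sigma2_1, sigma2, n_i):
    """Multiplier applied to (y_bar_i - F_bar_i) in the random effects prediction."""
    return sigma2_1 / (sigma2 / n_i + sigma2_1)

# With sigma2_1 = sigma2 = 1: strong shrinkage for small n_i, almost none for large n_i
f_small = shrinkage_factor(1.0, 1.0, 1)    # 0.5: halfway between the two extremes
f_large = shrinkage_factor(1.0, 1.0, 100)  # ~0.99: close to the fully flexible model

# Point prediction for a sample from a level i with n_i = 4 observed samples
F_p = 0.3                    # trained fixed effects function evaluated at x_p
y_bar_i, F_bar_i = 1.2, 0.4  # sample means for level i
y_pred = F_p + shrinkage_factor(1.0, 1.0, 4) * (y_bar_i - F_bar_i)  # 0.3 + 0.8 * 0.8
```

The factor is always between 0 and 1: close to 0 for sparsely populated levels (predictions stay near F(x_p)) and close to 1 for well-populated levels (predictions approach the fully flexible model).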

In line with this argument, Sigrist [2023a, Section 4.1] finds in empirical experiments that the advantage of tree-boosting combined with random effects (“GPBoost”) over classical independent tree-boosting (“LogitBoost”) grows as the number of samples per level of a categorical variable decreases, i.e., as the cardinality of the categorical variable increases. The results are reproduced in Figure 1. They are obtained by simulating binary classification data with 5000 samples, a non-linear predictor function, and a categorical variable with successively more levels, i.e., fewer samples per level; see Sigrist [2023a] for more details. The results show that the fewer samples there are per level of the categorical variable (= the more levels there are), the larger the difference in test error between GPBoost and LogitBoost.

In the following, we compare several methods using multiple real-world data sets with high-cardinality categorical variables. We use all publicly available tabular data sets and the same experimental setting as in Simchoni and Rosset [2021, 2023]. In addition, we include the Wages data set analyzed in Sigrist [2022].

We consider the following methods:

- ’Linear’: linear mixed effects models
- ’NN Embed’: deep neural networks with embeddings
- ’LMMNN’: combining deep neural networks and random effects [Simchoni and Rosset, 2021, 2023]
- ’LGBM_Num’: tree-boosting by assigning a number to every level of categorical variables and considering these as one-dimensional numeric variables
- ’LGBM_Cat’: tree-boosting with the `LightGBM` approach [Ke et al., 2017] for categorical variables
- ’CatBoost’: tree-boosting with the `CatBoost` approach [Prokhorenkova et al., 2018] for categorical variables
- ’GPBoost’: combining tree-boosting and random effects [Sigrist, 2022, 2023a]

Note that, recently (version 1.6 and later), the `XGBoost` library [Chen and Guestrin, 2016] has also implemented the same approach as `LightGBM` for handling categorical variables. We do not consider this as a separate approach here.

We use the data sets summarized in Table 1.

For all methods with random effects, we include random effects for every categorical variable mentioned in Table 1 with no prior correlation among random effects. The Rossmann, AUImport, and Wages data sets are longitudinal data sets. For these, we also include linear and quadratic random slopes; see (the future) Part III of this series. See Simchoni and Rosset [2021, 2023] and Sigrist [2023b] for more details on the data sets.

We perform 5-fold cross-validation (CV) on every data set with the test mean squared error (MSE) to measure prediction accuracy. See Sigrist [2023b] for detailed information on the experimental setting. Code for pre-processing the data with instructions on how to download the data and code for running the experiments can be found here. Pre-processed data for modeling can also be found on the above webpage for data sets for which the license of the original source permits it.
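The evaluation protocol can be sketched as follows (a generic numpy version with a placeholder model, not the actual experiment code):

```python
import numpy as np

def five_fold_cv_mse(X, y, fit_predict, seed=0):
    """5-fold CV: return the average test MSE over the five folds.

    fit_predict(X_train, y_train, X_test) must return predictions for X_test.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, 5)
    mses = []
    for k in range(5):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(5) if j != k])
        pred = fit_predict(X[train], y[train], X[test])
        mses.append(float(np.mean((y[test] - pred) ** 2)))
    return float(np.mean(mses))

# Placeholder "model": predict the training mean for every test sample
X = np.arange(20.0).reshape(-1, 1)
y = np.ones(20)
mse = five_fold_cv_mse(X, y, lambda X_tr, y_tr, X_te: np.full(len(X_te), y_tr.mean()))
```

In the actual experiments, `fit_predict` would wrap one of the methods listed above.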

The results are summarized in Figure 2, which shows average relative differences to the lowest test MSE. These are obtained by first calculating, for every data set, the relative difference between a method’s test MSE and the lowest MSE, and then averaging over all data sets. Detailed results can be found in Sigrist [2023b]. We observe that combined tree-boosting and random effects (GPBoost) has the highest prediction accuracy, with an average relative difference to the best result of approx. 7%. The second-best results are obtained by the categorical variables approach of `LightGBM` (LGBM_Cat) and neural networks with random effects (LMMNN), both having an average relative difference to the best method of approx. 17%. CatBoost and linear mixed effects models perform substantially worse, with an average relative difference to the best method of almost 50%. Given that CatBoost is specifically designed to handle categorical features, this is somewhat sobering. Neural networks with embeddings perform worst overall, with an average relative difference to the best result of more than 150%. Tree-boosting with the categorical variables transformed to one-dimensional numeric variables (LGBM_Num) performs slightly better, with an average relative difference to the best result of approximately 100%. In their online documentation, `LightGBM` recommends: *“For a categorical feature with high cardinality, it often works best to treat the feature as numeric”* (as of July 6, 2023). We clearly come to a different conclusion.
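The summary metric used above can be computed as follows (a sketch with made-up MSE values, not the actual benchmark numbers):

```python
import numpy as np

# rows: data sets, columns: methods (made-up test MSEs for illustration)
mse = np.array([[1.0, 1.2, 2.0],
                [0.5, 0.6, 0.5]])

best = mse.min(axis=1, keepdims=True)   # lowest test MSE per data set
rel_diff = (mse - best) / best          # relative difference to the best, per data set
avg_rel_diff = rel_diff.mean(axis=0)    # average over data sets, per method
```

The best method on every data set contributes a relative difference of zero, so a method that is best everywhere would have an average relative difference of 0%.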

We have empirically compared several methods on tabular data with high-cardinality categorical variables. Our results show that, first, machine learning models with random effects perform better than their counterparts without random effects, and, second, tree-boosting with random effects outperforms deep neural networks with random effects. While there may be several possible reasons for the latter finding, this is in line with the recent work of Grinsztajn et al. [2022] who find that tree-boosting outperforms deep neural networks (and also random forest) on tabular data without high-cardinality categorical variables. Similarly, Shwartz-Ziv and Armon [2022] conclude that tree-boosting “outperforms deep models on tabular data.”

In Part II of this series, we will show how to apply the `GPBoost` library with a demo using one of the above-mentioned real-world data sets. In Part III, we will show how longitudinal, aka panel, data can be modeled with the `GPBoost` library.

- T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2016.
- W. D. Fisher. On grouping for maximum homogeneity. Journal of the American Statistical Association, 53(284):789–798, 1958.
- L. Grinsztajn, E. Oyallon, and G. Varoquaux. Why do tree-based models still outperform deep learning on typical tabular data? In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 507–520. Curran Associates, Inc., 2022.
- C. Guo and F. Berkhahn. Entity embeddings of categorical variables. arXiv preprint arXiv:1604.06737, 2016.
- A. Hajjem, F. Bellavance, and D. Larocque. Mixed-effects random forest for clustered data. Journal of Statistical Computation and Simulation, 84(6):1313–1328, 2014.
- G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pages 3149–3157, 2017.
- L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and A. Gulin. CatBoost: unbiased boosting with categorical features. In Advances in Neural Information Processing Systems, pages 6638–6648, 2018.
- R. Shwartz-Ziv and A. Armon. Tabular data: Deep learning is not all you need. Information Fusion, 81:84–90, 2022.
- F. Sigrist. Gaussian Process Boosting. The Journal of Machine Learning Research, 23(1):10565–10610, 2022.
- F. Sigrist. Latent Gaussian Model Boosting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2):1894–1905, 2023a.
- F. Sigrist. A Comparison of Machine Learning Methods for Data with High-Cardinality Categorical Variables. arXiv preprint arXiv:2307.02071, 2023b.
- G. Simchoni and S. Rosset. Using random effects to account for high-cardinality categorical features and repeated measures in deep neural networks. Advances in Neural Information Processing Systems, 34:25111–25122, 2021.
- G. Simchoni and S. Rosset. Integrating Random Effects in Deep Neural Networks. Journal of Machine Learning Research, 24(156):1–57, 2023.