A tour of the most important technological breakthroughs behind modern industrial recommender systems
Recommender systems are among the fastest-evolving industrial Machine Learning applications today. From a business point of view, this is not a surprise: better recommendations bring more users. It’s as simple as that.
The underlying technology however is far from simple. Ever since the rise of deep learning — powered by the commoditization of GPUs — recommender systems have become more and more complex.
In this post, we’ll take a tour of a handful of the most important modeling breakthroughs from the past decade, roughly reconstructing the pivotal points marking the rise of deep learning in recommender systems. It’s a story of technological breakthroughs, scientific exploration, and an arms race spanning continents and corporations.
Buckle up. Our tour starts in 2017’s Singapore.
Any discussion of deep learning in recommender systems would be incomplete without a mention of one of the most important breakthroughs in the field, Neural Collaborative Filtering (NCF), introduced in He et al. (2017) from the National University of Singapore.
Prior to NCF, the gold standard in recommender systems was matrix factorization: we learn latent vectors (aka embeddings) for both users and items, and then generate recommendations for a user by taking the dot product of the user vector with each of the item vectors. The larger the dot product, the better the predicted match. As such, matrix factorization can simply be viewed as a linear model of latent factors.
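To make this concrete, here’s a minimal NumPy sketch of matrix-factorization scoring. All dimensions and the (random) embedding tables are illustrative stand-ins; in a real system the embeddings would be learned from interaction data:

```python
import numpy as np

# Illustrative sizes; real systems have millions of users/items.
rng = np.random.default_rng(0)
n_users, n_items, dim = 100, 50, 8

user_embeddings = rng.normal(size=(n_users, dim))  # learned during training
item_embeddings = rng.normal(size=(n_items, dim))  # learned during training

def score(user_id: int, item_id: int) -> float:
    """Predicted affinity: the larger the dot product, the better the match."""
    return float(user_embeddings[user_id] @ item_embeddings[item_id])

def recommend(user_id: int, k: int = 5) -> np.ndarray:
    """Top-k items for a user, ranked by dot product against all item vectors."""
    scores = item_embeddings @ user_embeddings[user_id]
    return np.argsort(scores)[::-1][:k]
```

Note that scoring a user against the whole catalog is a single matrix-vector product, which is part of why matrix factorization scales so well.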
The key idea in NCF is to replace the inner product in matrix factorization with a neural network. In practice, this is done by first concatenating the user and item embeddings, and then passing them into a multi-layer perceptron (MLP) with a single task head that predicts user engagement such as click. Both the MLP weights and the embedding weights (which map ids to their respective embeddings) are then learned during model training via backpropagation of loss gradients.
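The forward pass can be sketched in a few lines of NumPy. The layer sizes and randomly initialized weights below are illustrative, not from the paper; in practice both the MLP weights and the embedding tables are trained with backpropagation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, dim = 100, 50, 8

# Embedding tables mapping ids to vectors (learned via backprop in practice).
user_emb = rng.normal(size=(n_users, dim))
item_emb = rng.normal(size=(n_items, dim))

# MLP weights: two hidden layers plus a single task head (illustrative sizes).
W1 = rng.normal(size=(2 * dim, 32)); b1 = np.zeros(32)
W2 = rng.normal(size=(32, 16));      b2 = np.zeros(16)
w_out = rng.normal(size=16);         b_out = 0.0

def relu(x):    return np.maximum(x, 0.0)
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

def ncf_forward(user_id: int, item_id: int) -> float:
    """Predicted engagement probability (e.g. click)."""
    # Concatenation replaces the dot product of matrix factorization.
    x = np.concatenate([user_emb[user_id], item_emb[item_id]])
    h = relu(x @ W1 + b1)
    h = relu(h @ W2 + b2)
    return float(sigmoid(h @ w_out + b_out))
```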
The hypothesis behind NCF is that user/item interactions aren’t linear, as assumed in matrix factorization, but non-linear. If that’s true, we should see better performance as we add more layers to the MLP. And that’s precisely what He et al. find: with 4 layers, they beat the best matrix factorization algorithms of the time by around 5% in hit rate on the MovieLens and Pinterest benchmark datasets.
He et al. demonstrated the immense value of deep learning in recommender systems, marking the pivotal transition away from matrix factorization and towards deep recommenders.
Our tour continues from Singapore to Mountain View, California.
While NCF revolutionized the domain of recommender systems, it lacks an ingredient that turned out to be extremely important for the success of recommenders: cross features. The idea of cross features was popularized in Google’s 2016 paper “Wide & Deep Learning for Recommender Systems”.
What is a cross feature? It’s a second-order feature that’s created by “crossing” two of the original features. For example, in the Google Play Store, first-order features include the impressed app and the list of user-installed apps. These two can be combined to create powerful cross features, such as

AND(user_installed_app=“netflix”, impression_app=“hulu”),

which is 1 if the user has Netflix installed and the impressed app is Hulu.
Cross features can also be more coarse-grained, for example by crossing the categories of the installed apps with the category of the impressed app, and so on. The authors argue that adding cross features of different granularities enables both memorization (from the more granular crosses) and generalization (from the less granular ones).
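A tiny sketch of what such hand-engineered crosses look like in code. The feature names and values here are illustrative, not Google’s actual features:

```python
# First-order features for one ad impression (illustrative values).
user_installed_apps = {"netflix", "spotify"}
impressed_app = "hulu"
user_installed_categories = {"video", "music"}
impressed_category = "video"

# Granular cross: fires only for one specific app pair -> memorization.
x_netflix_hulu = int("netflix" in user_installed_apps
                     and impressed_app == "hulu")

# Coarser cross: fires for any app pair in these categories -> generalization.
x_video_video = int("video" in user_installed_categories
                    and impressed_category == "video")
```

The catch, as the next section explains, is that someone has to decide which of these crosses to build, by hand, for every feature pair that might matter.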
The key architectural choice in Wide&Deep is to have both a wide module, which is a linear layer that takes all cross features directly as inputs, and a deep module, which is essentially an NCF, and then combine both modules into a single output task head that learns from user/app engagements.
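A minimal sketch of that two-module design, again with illustrative dimensions and randomly initialized weights standing in for learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):    return np.maximum(x, 0.0)
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

# Wide module: a linear layer over manually engineered cross features.
n_cross = 10
w_wide = rng.normal(size=n_cross)

# Deep module: an MLP over concatenated embeddings (NCF-style).
dim_in = 16
W1 = rng.normal(size=(dim_in, 32)); b1 = np.zeros(32)
w_deep = rng.normal(size=32)

def wide_and_deep(cross_features: np.ndarray, embeddings: np.ndarray) -> float:
    """Both modules feed a single sigmoid-activated task head."""
    wide_logit = cross_features @ w_wide
    deep_logit = relu(embeddings @ W1 + b1) @ w_deep
    return float(sigmoid(wide_logit + deep_logit))
```

The key design choice is that the cross features skip the deep stack entirely and hit the task head directly, so a single memorized cross can dominate the prediction when it fires.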
And indeed, Wide&Deep works remarkably well: the authors find a 1% lift in online app acquisitions from going deep-only to wide and deep. Consider that Google makes tens of billions of dollars in revenue each year from its Play Store, and it’s easy to see how impactful Wide&Deep was.
Wide&Deep proved the significance of cross features, but it has a huge downside: the cross features need to be engineered manually, a tedious process that requires engineering resources, infrastructure, and domain expertise. Cross features à la Wide & Deep are expensive. They don’t scale.
Enter “Deep and Cross neural networks” (DCN), introduced in a 2017 paper, also from Google. The key idea in DCN is to replace the wide component in Wide&Deep with a “cross neural network”, a neural network dedicated to learning cross features of arbitrarily high order.
What makes a cross neural network different from a standard MLP? As a reminder, in an MLP each neuron in the next layer is a linear combination of all neurons in the previous layer, passed through a non-linearity:

x_{l+1} = f(W_l x_l + b_l).
By contrast, in the cross neural network the next layer is constructed by crossing the input layer with the previous layer, forming second-order combinations:

x_{l+1} = x_0 (x_l^T w_l) + b_l + x_l.
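Here is a minimal NumPy sketch of one such cross layer and a small stack of them. The weights are random stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
x0 = rng.normal(size=d)  # the input feature vector

def cross_layer(x0: np.ndarray, xl: np.ndarray,
                w: np.ndarray, b: np.ndarray) -> np.ndarray:
    """One cross layer: x_{l+1} = x0 * (xl . w) + b + xl.
    Multiplying by x0 again at every layer raises the polynomial
    degree of the learned feature interactions by one."""
    return x0 * (xl @ w) + b + xl

# Stack L cross layers (weights/biases illustrative, learned in practice).
L = 3
x = x0
for _ in range(L):
    x = cross_layer(x0, x, rng.normal(size=d), np.zeros(d))
```

Note the residual term `+ xl`: each layer adds new, higher-order crosses on top of the interactions already accumulated, rather than replacing them.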
Hence, a cross neural network of depth L will learn cross features in the form of polynomials of degree up to L. The deeper the cross network, the higher the order of the interactions it learns.
And indeed, the experiments confirm that DCN works. Compared to a model with just the deep component, DCN has a 0.1% lower logloss (which is considered to be statistically significant) on the Criteo display ads benchmark dataset. And that’s without any manual feature engineering, as in Wide&Deep!
(It would have been nice to see a comparison between DCN and Wide&Deep. Alas, the authors of DCN didn’t have a good method to manually create cross features for the Criteo dataset, and hence skipped this comparison.)
Next, our tour takes us from 2017’s Google to 2017’s Huawei.
Huawei’s solution for deep recommendation, “DeepFM”, also replaces manual feature engineering in the wide component of Wide&Deep with a dedicated neural network that learns cross features. However, unlike DCN, the wide component is not a cross neural network, but instead a so-called FM (“factorization machine”) layer.
What does the FM layer do? It simply takes the dot products of all pairs of embeddings. For example, if a movie recommender takes 4 id-features as inputs (user id, movie id, actor ids, and director id), then the model learns embeddings for each of these id features, and the FM layer computes 6 dot products, corresponding to the pairs user-movie, user-actor, user-director, movie-actor, movie-director, and actor-director. It’s a comeback of the idea behind matrix factorization. The output of the FM layer is then combined with the output of the deep component into a sigmoid-activated output, resulting in the model’s prediction.
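Using the movie example, the FM layer boils down to a few lines. The embeddings here are random placeholders for learned ones:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
dim = 8

# Illustrative embeddings for the 4 id-features from the movie example.
fields = {
    "user":     rng.normal(size=dim),
    "movie":    rng.normal(size=dim),
    "actor":    rng.normal(size=dim),
    "director": rng.normal(size=dim),
}

def fm_layer(embeddings: dict) -> np.ndarray:
    """Dot products of all pairs of field embeddings: C(4, 2) = 6 values."""
    return np.array([embeddings[a] @ embeddings[b]
                     for a, b in combinations(embeddings, 2)])

pairwise = fm_layer(fields)  # user-movie, user-actor, ..., actor-director
```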
And indeed, as you may have guessed, DeepFM has been shown to work: the authors show that DeepFM beats a host of competitors (including Google’s Wide&Deep) by more than 0.37% and 0.42% in terms of AUC and logloss, respectively, on company-internal data.
Let’s leave Google and Huawei for now. The next stop on our tour is 2019’s Meta.
Meta’s DLRM (“Deep Learning Recommendation Model”) architecture, presented in Naumov et al. (2019), works as follows: all categorical features are transformed into embeddings using embedding tables, while all dense features are passed into an MLP that computes embeddings for them as well. Importantly, all embeddings have the same dimension. We then compute the dot products of all pairs of embeddings, concatenate them into a single vector, and pass that vector through a final MLP with a single sigmoid-activated task head that produces the prediction.
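The full DLRM forward pass, sketched in NumPy with illustrative sizes and random stand-in weights (3 categorical features plus one dense block, so 4 same-dimension embeddings and 6 pairwise dot products):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
dim = 8

def relu(x):    return np.maximum(x, 0.0)
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

# Embedding tables for 3 categorical features (sizes illustrative).
emb_tables = [rng.normal(size=(100, dim)) for _ in range(3)]

# Bottom MLP: projects dense features into the same dimension as embeddings.
n_dense = 13
W_dense = rng.normal(size=(n_dense, dim))

# Top MLP over the concatenated pairwise dot products: C(4, 2) = 6 inputs.
W_top = rng.normal(size=(6, 16)); b_top = np.zeros(16)
w_out = rng.normal(size=16)

def dlrm_forward(cat_ids, dense: np.ndarray) -> float:
    vecs = [table[i] for table, i in zip(emb_tables, cat_ids)]
    vecs.append(relu(dense @ W_dense))         # dense features -> same dim
    dots = np.array([a @ b for a, b in combinations(vecs, 2)])
    return float(sigmoid(relu(dots @ W_top + b_top) @ w_out))
```

Notice that the only feature interactions the top MLP ever sees are those pairwise dot products, which is exactly the second-order limitation discussed below.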
DLRM, then, is almost something like a simplified version of DeepFM: if you take DeepFM and drop the deep component (keeping just the FM component), you have something like DLRM, but without DLRM’s dense MLP.
In experiments, Naumov et al show that DLRM beats DCN in terms of both training and validation accuracy on the Criteo display ads benchmark dataset. This result indicates that the deep component in DCN may indeed be redundant, and all that we really need in order to make the best possible recommendations are just the feature interactions, which in DLRM are captured with the dot products.
In contrast to DCN, the feature interactions in DLRM are limited to second order: they’re just dot products of all pairs of embeddings. Going back to the movie example (with features user, movie, actors, director), the second-order interactions are user-movie, user-actor, user-director, movie-actor, movie-director, and actor-director. A third-order interaction would be something like user-movie-director or user-actor-director. Certain users may be fans of Steven Spielberg movies starring Tom Hanks, and there should be a cross feature for that! Alas, in standard DLRM, there isn’t. That’s a major limitation.
Enter DHEN, the final landmark paper in our tour of modern recommender systems. DHEN stands for “Deep Hierarchical Ensemble Network”, and the key idea is to create a “hierarchy” of cross features that grows deeper with the number of DHEN layers.
It’s easiest to understand DHEN with a simple example first. Suppose we have two input features going into DHEN, and let’s denote them by A and B (which could stand for user ids and video ids, for example). A 2-layer DHEN module would then create the entire hierarchy of cross features up to second order, namely:
A, AxA, AxB, B, BxB,
where “x” is one or a combination of several interaction modules, such as:
- dot product,
- linear: y = Wx, or
- the cross module from DCN.
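A toy sketch of the layering idea in NumPy. The two interaction modules and all weights below are illustrative stand-ins for DHEN’s richer, learned module set:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

def dot_module(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Dot-product interaction, broadcast back to dimension d."""
    return np.full(d, x @ y)

def linear_module(x: np.ndarray, y: np.ndarray, W: np.ndarray) -> np.ndarray:
    """A linear interaction module (weights learned in practice)."""
    return W @ (x + y)

def dhen_layer(x: np.ndarray, y: np.ndarray, W: np.ndarray) -> np.ndarray:
    """One toy DHEN-style layer: apply several interaction modules to a
    pair of representations and combine the results. Stacking such layers
    composes interactions of interactions, i.e. a growing hierarchy."""
    return dot_module(x, y) + linear_module(x, y, W)

A = rng.normal(size=d)        # e.g. a user-id embedding (illustrative)
B = rng.normal(size=d)        # e.g. a video-id embedding (illustrative)
W = rng.normal(size=(d, d))

h1 = dhen_layer(A, B, W)      # second-order features
h2 = dhen_layer(h1, A, W)     # third-order: interactions of interactions
```

The point of the sketch is the recursion: feeding a layer’s output back in as an input is what lifts the interactions beyond second order, and also what makes the real thing so expensive.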
DHEN is a beast, and its computational complexity (due to its recursive nature) is a nightmare. In order to get it to work, the authors of the DHEN paper had to invent a new distributed training paradigm called “Hybrid Sharded Data Parallel”, which achieves 1.2x higher training throughput than the (then) state of the art.
But most importantly, the beast works: in their experiments on internal click-through rate data, the authors measure a 0.27% improvement in NE (normalized entropy) compared to DLRM, using a stack of 8 (!) DHEN layers.
And this concludes our tour. Allow me to summarize each of these landmarks with a single headline:
- NCF: All we need are embeddings for users and items. The MLP will handle the rest.
- Wide&Deep: Cross features matter. In fact, they’re so important we feed them directly into the task head.
- DCN: Cross features matter, but shouldn’t be engineered by hand. Let the cross neural network handle that.
- DeepFM: Let’s generate cross features in the FM layer instead, and still keep the deep component from Wide&Deep.
- DLRM: FM is all we need — and also another, dedicated MLP for dense features.
- DHEN: FM is not enough. We need a hierarchy of higher-order feature interactions, beyond just second order. And a bunch of systems optimizations to make it all work in practice.
And the journey is really just getting started. At the time of this writing, DCN has evolved into DCN-M, DeepFM has evolved into xDeepFM, and the leaderboard of the Criteo competition has been claimed by Huawei’s latest invention, FinalMLP.
Given the huge economic incentive for better recommendations, it’s guaranteed that we’ll continue to see new breakthroughs in this domain for the foreseeable future. Watch this space.