Building Powerful Recommender Systems with Deep Learning | by Lina Faik | Jul, 2023


A Step-by-Step Implementation Using the PyTorch Library TorchRec

Recommending the right product to customers at the right time is a prevalent challenge across industries. For instance, bankers are constantly looking to suggest highly relevant services to their existing or potential customers. Retailers strive to recommend appealing products that meet customer tastes. Similarly, social networks aim to build captivating feeds to foster user adoption.

Despite being a widely explored use case, achieving satisfactory performance remains difficult due to the unique nature of the problem. The main reasons include the abundance of high-cardinality categorical data, which often leads to sparsity issues, and the computational scale of the use case, which poses scalability problems. It’s only recently that recommendation models have harnessed neural networks.

In this context, Meta has developed and made openly available a deep learning recommendation model (DLRM). The model is particularly remarkable for combining the principles of collaborative filtering and predictive analytics and for being suitable for large-scale production.


The objective of this article is to guide you through a step-by-step implementation using the PyTorch library TorchRec, enabling you to effectively address your own recommendation use case.

After reading this article, you will understand:

  1. How does the DLRM model work?
  2. What sets DLRM models apart and makes them powerful and scalable?
  3. How can you implement your own recommendation system from end to end?

The article requires general knowledge of the recommender system problem and familiarity with the PyTorch library. The experiments described in the article were carried out using the libraries TorchRec and PyTorch. You can find the code here on GitHub.

Let’s first dive into the complexities of the DLRM model and explore its underlying principles and mechanisms.

1.1. An Overview of the Model Design

To provide a more tangible illustration, let’s consider the scenario of an online retailer looking to create a personalized feed for every customer visiting their website.

In order to achieve this, the retailer can train a model that predicts the probability of a customer purchasing a particular product. This model assigns a score to each product for each individual customer, based on various factors. The feed is built by ranking the scores.
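
Turning the model's outputs into a feed is then a plain sort. A toy illustration with made-up scores (not part of the DLRM pipeline itself):

```python
# Hypothetical purchase probabilities predicted for one customer
scores = {"sneakers": 0.82, "backpack": 0.41, "raincoat": 0.67}

# The feed simply lists products by descending score
feed = sorted(scores, key=scores.get, reverse=True)
print(feed)  # ['sneakers', 'raincoat', 'backpack']
```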

In this case, the model can learn from historical data that encompasses a range of information for each customer and product. This includes numerical variables like customer age and product price, as well as categorical characteristics such as product type, color, and more.

Here’s where the DLRM model excels: it has the remarkable ability to leverage both numerical and categorical variables, even when dealing with a large number of unique categories. This enables the model to comprehensively analyze and understand complex relationships between features. To see why, let’s take a look at the model architecture in Figure 1.

Figure 1 — DLRM model architecture, illustration by the author, inspired from [5]

Categorical features

DLRM learns an embedding table for each categorical feature and uses them to map these variables to dense representations. Hence, each categorical feature is represented as a vector of the same length.

Numerical features

DLRM processes numerical features through an MLP, called the bottom MLP. The output of this MLP has the same dimension as the embedding vectors.

Pairwise interaction

DLRM calculates the dot product between all pairs of embedding vectors and the processed numerical features. This allows the model to capture second-order feature interactions.

Concatenation and final output

DLRM concatenates these dot products with the processed numerical features and uses the results to feed another MLP, called the top MLP. The final probability is obtained by passing the output of this MLP to a sigmoid function.
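
The four steps above can be condensed into a toy forward pass in plain PyTorch. This is an illustrative sketch, not the TorchRec implementation: the class name, layer sizes, and feature counts are made up, and it only handles single-valued categorical features.

```python
import torch
import torch.nn as nn

class MiniDLRM(nn.Module):
    # Toy DLRM sketch: 2 categorical features, 3 numerical features, embedding dim 4
    def __init__(self, cardinalities=(10, 20), num_dense=3, dim=4):
        super().__init__()
        # One embedding table per categorical feature
        self.tables = nn.ModuleList(nn.Embedding(c, dim) for c in cardinalities)
        # Bottom MLP maps numerical features to the embedding dimension
        self.bottom_mlp = nn.Sequential(nn.Linear(num_dense, 8), nn.ReLU(), nn.Linear(8, dim))
        # Top MLP consumes the processed dense vector plus all pairwise dot products
        n_vecs = len(cardinalities) + 1       # embedding vectors + processed dense vector
        n_pairs = n_vecs * (n_vecs - 1) // 2  # second-order interactions
        self.top_mlp = nn.Sequential(nn.Linear(dim + n_pairs, 8), nn.ReLU(), nn.Linear(8, 1))

    def forward(self, dense, sparse_ids):
        d = self.bottom_mlp(dense)                           # (B, dim)
        vecs = [d] + [t(sparse_ids[:, i]) for i, t in enumerate(self.tables)]
        stacked = torch.stack(vecs, dim=1)                   # (B, n_vecs, dim)
        inter = torch.bmm(stacked, stacked.transpose(1, 2))  # all pairwise dot products
        i, j = torch.triu_indices(stacked.size(1), stacked.size(1), offset=1)
        pairs = inter[:, i, j]                               # (B, n_pairs), upper triangle only
        return torch.sigmoid(self.top_mlp(torch.cat([d, pairs], dim=1)))

model = MiniDLRM()
probs = model(torch.randn(5, 3), torch.randint(0, 10, (5, 2)))
print(probs.shape)  # torch.Size([5, 1])
```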

1.2. The Model Implementation

While the potential of the model appears promising in theory, its practical implementation presents a computational hurdle.

Usually, recommendation use cases involve handling vast volumes of data. Using DLRM models, in particular, introduces a very large number of parameters, far higher than in common deep learning models. Consequently, this amplifies the computational demands associated with their implementation.

  1. The majority of parameters in DLRMs can be attributed to the embeddings, as they consist of multiple tables, each demanding a large amount of memory. This makes DLRMs computationally demanding in terms of both memory capacity and bandwidth.
  2. Although the memory footprint of MLP parameters is smaller, they still require substantial computational resources.
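
A back-of-the-envelope calculation makes the imbalance concrete. The cardinalities and layer sizes below are illustrative assumptions (26 sparse features with one-million-entry vocabularies, embedding dimension 128), not measurements of a specific model:

```python
# Hypothetical setup: 26 categorical features with large vocabularies
embedding_dim = 128
cardinalities = [1_000_000] * 26  # one assumed vocabulary size per sparse feature
embedding_params = sum(c * embedding_dim for c in cardinalities)

def mlp_params(sizes):
    # weights + biases for each Linear layer in an MLP
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

bottom = mlp_params([13, 512, 256, embedding_dim])  # assuming 13 dense inputs
# Top MLP input: 128 dense dims + 27*26/2 = 351 pairwise dot products (setup-dependent)
top = mlp_params([479, 512, 512, 256, 1])
mlp_total = bottom + top

print(f"embedding params: {embedding_params:,}")  # 3,328,000,000
print(f"MLP params:       {mlp_total:,}")
print(f"ratio: {embedding_params / mlp_total:.0f}x")
```

Under these assumptions the embedding tables hold thousands of times more parameters than the MLPs, which is why the embeddings dominate the memory budget.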

To mitigate the memory bottleneck, DLRM relies on a unique combination of model parallelism for the embeddings and data parallelism for the MLPs.

This section provides a detailed step-by-step guide on how to implement your own recommendation system from start to finish.

2.1. Data Transformation and Batch Construction

The first step involves converting the data into tensors and organizing them into batches for input into the model.

To illustrate this process, let’s consider a dataframe in which each sparse feature column holds a variable-length list of encoded category ids, dense feature columns hold numerical values, and the label is binary.
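
Such a `sample` dataframe can be mocked up as follows; the column names and values here are invented for illustration:

```python
import pandas as pd

# Hypothetical example: sparse columns hold variable-length lists of encoded
# category ids, dense columns hold floats, and the label is binary
cols_sparse = ["cat_1", "cat_2"]
cols_dense = ["num_1", "num_2"]
col_label = "label"

sample = pd.DataFrame({
    "cat_1": [[0], [1, 2], [0, 2]],  # one list of category ids per row
    "cat_2": [[1], [0], [2, 1]],
    "num_1": [0.3, 1.2, -0.5],
    "num_2": [25.0, 31.0, 47.0],
    "label": [0, 1, 0],
})
print(sample.shape)  # (3, 5)
```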

For sparse features, we need to concatenate the values into a single vector and compute the lengths. This can be accomplished using the KeyedJaggedTensor.from_lengths_sync function, which takes both elements as input. Here’s an example of the Python script:

values = sample[cols_sparse].sum(axis=0).sum(axis=0)
values = torch.tensor(values).to(device)
# values = tensor([1, 0, 2, 0, 2, 2, 0, 2, 0, 1, 0, 1, 2, 0], device='cuda:0')

lengths = torch.tensor(
    pd.concat([sample[feat].apply(lambda x: len(x)) for feat in cols_sparse]).values,
    dtype=torch.int32,
).to(device)
# lengths = tensor([1, 1, 1, 1, 1, 2, 3, 2, 2, 0], device='cuda:0', dtype=torch.int32)

sparse_features = KeyedJaggedTensor.from_lengths_sync(
    keys=cols_sparse,
    values=values,
    lengths=lengths,
)

For dense features and labels, the process is more straightforward. Here’s an example of the Python script:

dense_features = torch.tensor(sample[cols_dense].values, dtype=torch.float32).to(device)
labels = torch.tensor(sample[col_label].values, dtype=torch.int32).to(device)

By using the outputs from the previous steps, it becomes possible to construct a batch. Here’s an example of the Python script:

batch = Batch(
    dense_features=dense_features,
    sparse_features=sparse_features,
    labels=labels,
)

For a more comprehensive implementation, you can refer to the file in the corresponding GitHub repository.

2.2. Model Initialization and Optimization Setup

The next step involves initializing the model, as demonstrated in the following Python code:

# Initialize the model and set up optimization

# Define the dimensionality of the embeddings used in the model
embedding_dim = 10

# Calculate the number of embeddings per feature
num_embeddings_per_feature = {c: len(v) for c, v in map_sparse.items()}

# Define the layer sizes for the dense architecture
dense_arch_layer_sizes = [512, 256, embedding_dim]

# Define the layer sizes for the overall architecture
over_arch_layer_sizes = [512, 512, 256, 1]

# Specify whether to use Adagrad optimizer or SGD optimizer
adagrad = False

# Set the epsilon value for Adagrad optimizer
eps = 1e-8

# Set the learning rate for optimization
learning_rate = 0.01

# Create a list of EmbeddingBagConfig objects for each sparse feature
eb_configs = [
    EmbeddingBagConfig(
        name=f"t_{feature_name}", embedding_dim=embedding_dim,
        num_embeddings=num_embeddings_per_feature[feature_name + '_enc'],
        feature_names=[feature_name + '_enc'])
    for feature_name in cols_sparse
]

# Initialize the DLRM model with the embedding bag collection and architecture specifications
dlrm_model = DLRM(
    embedding_bag_collection=EmbeddingBagCollection(tables=eb_configs, device=device),
    dense_in_features=len(cols_dense),
    dense_arch_layer_sizes=dense_arch_layer_sizes,
    over_arch_layer_sizes=over_arch_layer_sizes,
    dense_device=device,
)

# Create a DLRMTrain instance for handling training operations
train_model = DLRMTrain(dlrm_model).to(device)

# Choose the appropriate optimizer class for the embedding parameters
embedding_optimizer = torch.optim.Adagrad if adagrad else torch.optim.SGD

# Set the optimizer keyword arguments
optimizer_kwargs = {"lr": learning_rate}
if adagrad:
    optimizer_kwargs["eps"] = eps

# Apply the optimizer to the sparse architecture parameters during the backward pass
apply_optimizer_in_backward(
    embedding_optimizer,
    train_model.model.sparse_arch.parameters(),
    optimizer_kwargs,
)

# Initialize the dense optimizer with the remaining (non-embedding) parameters
dense_optimizer = KeyedOptimizerWrapper(
    dict(in_backward_optimizer_filter(train_model.named_parameters())),
    optimizer_with_params(adagrad, learning_rate, eps),
)

# Create a CombinedOptimizer instance to handle optimization
optimizer = CombinedOptimizer([dense_optimizer])

The model can then be trained and evaluated using the following code:

loss, (loss2, logits, labels) = train_model(batch)
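
DLRMTrain wraps the model so that each call returns the loss alongside detached outputs, as unpacked above; the surrounding loop is standard PyTorch. Since running the real TorchRec model requires the full setup above, here is a runnable sketch with a stand-in module that mimics that return signature (a toy logistic regression on synthetic data; everything in it is illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyTrainModule(nn.Module):
    # Stand-in for DLRMTrain: returns (loss, (detached_loss, logits, labels))
    def __init__(self, n_features=4):
        super().__init__()
        self.linear = nn.Linear(n_features, 1)
        self.loss_fn = nn.BCEWithLogitsLoss()

    def forward(self, batch):
        dense, labels = batch
        logits = self.linear(dense).squeeze(1)
        loss = self.loss_fn(logits, labels.float())
        return loss, (loss.detach(), logits.detach(), labels)

train_model = ToyTrainModule()
optimizer = torch.optim.SGD(train_model.parameters(), lr=0.1)

# Toy, linearly separable data so the loss visibly decreases
batches = []
for _ in range(20):
    dense = torch.randn(8, 4)
    batches.append((dense, (dense[:, 0] > 0).long()))

losses = []
train_model.train()
for batch in batches:
    optimizer.zero_grad()
    loss, (loss_detached, logits, labels) = train_model(batch)
    loss.backward()   # in the real model, gradients flow through both MLPs and embeddings
    optimizer.step()
    losses.append(float(loss_detached))

# Evaluation: same call, no gradient tracking
train_model.eval()
with torch.no_grad():
    val_loss, _ = train_model(batches[0])
```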

For a more comprehensive implementation, you can refer to the file in the corresponding GitHub repository.

✔ The DLRM model presents a compelling approach to effectively combine numerical and categorical features using embeddings, enabling the model to capture intricate patterns and relationships.

✔ Although its architecture requires considerable computational resources, its implementation incorporates a unique combination of model parallelism and data parallelism, making the model scalable to production.

✔ However, due to limited data availability, the model’s performance has not been extensively tested across diverse real-world datasets. This raises uncertainty about its effectiveness in practical scenarios.

✔ Additionally, the model necessitates tuning a considerable number of parameters, further complicating the process.

✔ Considering this, simpler models like LGBM may offer comparable performance with easier implementation, tuning, and long-term maintenance, without the same computational overhead.

[1] M. Naumov et al., Deep Learning Recommendation Model for Personalization and Recommendation Systems, May 2019

[2] GitHub repository of the Facebook team’s initial open-source implementation of the DLRM model

[3] DLRM: An advanced, open source deep learning recommendation model, Meta AI Blog, July 2019

[4] TorchRec, a PyTorch library for modern production recommendation systems

[5] Vinh Nguyen, Tomasz Grel and Mengdi Huang, Optimizing the Deep Learning Recommendation Model on NVIDIA GPUs, June 2020
