In this article, we will explore how policy gradient reinforcement learning can be used for A/B optimization. It is a simple demo to observe the policy gradient method, where we will deeply understand the underlying mechanism and visualize the learning process step by step.
Reinforcement learning is a fundamental concept in machine learning, along with supervised, self-supervised and unsupervised learning. In reinforcement learning, an agent tries to find the optimal set of actions within an environment to maximize a reward. Reinforcement learning has become widely known as a method that can beat the best players at Go and chess when combined with neural networks as highly flexible agents.
The neural network used as the agent learns to optimize the policy step by step by maximizing the reward obtained. Several strategies have been developed to update the parameters of the neural network, such as policy gradient, q-learning or actor critic learning. The policy gradient method is the closest to backpropagation, which is commonly used for supervised and self-supervised learning of neural networks. However, in reinforcement learning, we do not directly evaluate each action as we would in supervised learning, but rather we try to maximize the total reward and let the neural network decide the individual actions to take. The actions are chosen from a probability distribution, which provides a high degree of flexibility for exploration. At the beginning of the optimization, the actions are chosen randomly and the agent explores different policies. Over time, some actions prove to be more useful than others and the probability distribution collapses to unambiguous decisions. Unlike other reinforcement learning methods, the user does not have to control this balance between exploration and exploitation, the optimal balance is found by the gradient policy method itself.
Often, the optimal policy to maximize the reward is achieved by a sequence of actions, where each action leads to a new state of the environment. However, the gradient policy method can also be used to find the optimal action that statistically gives the highest reward. Such a situation is often found when performing A/B optimization, which is a very common technique for choosing the better of two options. For example, in marketing, A/B tests are used to select the advertising scheme that leads to higher sales. Which ad would you rather click on? Option A: “Get the most out of your data: I am a professional Data Scientist — I can help you analyze your data” or option B “Struggling with your data? Professional Data Analyst has free capacity to help you automate your data analysis”?
The difficulty with A/B optimization is that the click rate is variable. For example, after seeing the ad on a website, each user may have different preferences, be in different moods, and therefore react differently. Because of this variability, we need statistical techniques to choose the better ad scheme. Common methods for comparing options A and B are hypothesis tests, such as t-tests. To perform a t-test, the two potential versions of the ad must be displayed for a period of time to collect click rates from users. In order to get a significant evaluation of the preferred ad scheme, a fairly long period of exploration is required, with the disadvantage of a potential loss of revenue since during exploration the better and worse ads are randomly displayed equally often. It is advantageous to maximize the click rate by showing the better ad more often as soon as possible. By performing the A/B optimization with the gradient policy method, the agent will first randomly explore both variants A and B, but will receive higher rewards for the ad that leads to higher click rates, so the agent will quickly learn to show the better ad to users more often and maximize the click rate and revenue.
In our example, we have two ad creative options, where we assume that option A has a click probability of 30% and option B has a click probability of 40%. We run an ad campaign with 1000 ad impressions. If we perform exploration only and show both options equally often, we can expect an average click rate of 35% and an average of 350 clicks in total. If we knew that B would be clicked on more often, we would show only B and get an average of 400 clicks. However, if we were unlucky and chose to display only A, we would get an average of only 300 clicks. With the policy gradient method, which we will explore in more detail in a moment, we can achieve an average of 391 clicks, clearly showing that quickly applying the learned policy leads to a number of clicks that is almost as high as if we had chosen the better option B in the first place.
We run the A/B optimization with a gradient policy method on a small neural network using the TensorFlow library. First, we need some imports.
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
The neural network contains only one layer with a single neuron that decides which ad to show. Since we have no information about the user’s preference, location, time, or anything else, the decision is based on a zero input to the neural network and we do not need the non-linearity that we would achieve with large neural networks. Training is achieved by adjusting the bias of this neuron.
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(1, activation="sigmoid", input_shape=(1,)))
A function is used to select the action, either displaying option A or option B, using the neural network. The function is decorated with tf.function(), which creates a static computation graph, allowing it to run much faster than in eager mode. Using TensorFlow’s GradientTape function, we collect gradients during the ad selection process. Each time a user enters the website, the neural network produces an output that is treated as the probability of selecting ad variant A or variant B to be presented to the user. Since we only have one neuron with sigmoid activation, the output is a single number between 0 and 1. If the output is 0.5, there is a 50% chance that ad B will be displayed and a 50% chance that ad A will be displayed. If the output is 0.8, there is an 80% chance that ad B will be displayed and the remaining 20% chance that ad A will be displayed. The action is selected by comparing the output of the neural network to a uniformly distributed random number between 0 and 1. If the random number is smaller than the output, the action is True (1) and ad B is selected; if the random number is greater than the output, the action is False (0) and ad A is selected. The loss value measures the difference between the output of the neural network and the selected action using binary_crossentropy_loss. Then we create the gradients of the loss with respect to the model parameters.
with tf.GradientTape() as tape:
output = model(np.array([[0.0]])) # [0 ... 1]
action = (tf.random.uniform((1, 1)) < output) # [0 or 1]
loss = tf.reduce_mean(tf.keras.losses.binary_crossentropy(action, output))
grads = tape.gradient(loss, model.trainable_variables)
return output, action, loss, grads
We run the training over 1000 ad impressions. In each steps, the ad is presented once and a new user has the chance to click the ad. To evaluate the learning process, we count the total number of clicks after this period. The learning rate is defined as 0.5. We will discuss the influence of the learning rate on the total number of clicks later.
STEPS = 1000
LR = 0.5
Now, let us run the ad campaign. The neural network will improve its prediction over time. Through reinforcement learning, training and application occur simultaneously. In practice, the selected ad is now displayed on the website and we have to wait and see if the user clicks on the ad or leaves the website without clicking it. In the code, we simply simulate whether a user clicks or not. As described above, ad A has a 30% chance of being clicked, while ad B has a 40% chance of being clicked. The click can be treated directly as a reward to train the neural network. The reward is used to modify the gradients. If the user clicked the ad, the gradient of that action is left unchanged, but if the user did not click the ad, the gradient is reversed. Finally, gradient descent updates the parameters of the neural network by assigning new weight and bias values to the neural network.
for step in range(STEPS):
output, action, loss, grads = action_selection(model)
if action == False: # Action A
reward = float(np.random.random() < 0.4)
if action == True: # Action B
reward = float(np.random.random() < 0.5)
grads_adjusted = 
for var_index in range(len(model.trainable_variables)):
grads_adjusted.append((reward-0.5)*2 * grads[var_index])
The following figure summarizes the evolution of the learning process.
In total, the campaign of 1000 ad impressions shown in the previous figure has resulted in a total of 393 clicks, which is quite close to 400, the number of clicks we would expect if we selected only the better ad B. We first review the learning process by observing all graphs at initial step = 1. We observe that the neural network output starts at 0.5, resulting in a random selection of the ad with 50% probability for both ad B and ad A. The binary_crossentropy_loss measures the difference between the model output and the action taken. Since the action is either 0 or 1, the initial loss value is the negative logarithm of the model output 0.5, which is approximately 0.7. Since we only have a single neuron in our neural network, the gradients contain two scalar values for the weight and bias of that neuron. If ad A is selected, the gradient of the bias is a positive number, and if ad B is selected, the gradient of the bias is a negative number. The gradient of the weight parameter is always zero because the input to the neural network is zero. The reward is highly random because there is only a 30% – 40% chance that the ad will be clicked. If the ad is clicked, we get a reward and the gradient is unchanged, otherwise we reverse the gradient. The adjusted gradients are multiplied by the learning rate and subtracted from the initial parameters of the neural network. We can see that the bias value starts at zero and becomes more negative when positive adjusted gradients are applied and more positive when negative adjusted gradients are applied.
During the ad campaign, the output of the neural network tends towards one, increasing the chance that ad B will be selected. However, even if the model output is already close to one, there is still a small chance to display ad A. As the model output approaches one, there is a small loss value if action B is selected and we obtain small negative gradients, but in the rarer cases that ad A is chosen, a larger loss value is obtained, visible as occasional peaks and as large positive gradients. After the rewards are collected, it is observed that some of these positive peaks are reversed in the adjusted gradients, as these actions did not lead to clicks. As ad B has a higher click probability, the small negative adjusted gradients are applied more often then the positive gradients stemming from clicks on ad A. Thus, the bias value of the model increases by small steps and on the rare occasions that ad A is selected and clicked, the bias value decreases. The output of the model is provided by the sigmoid function applied to the model bias value.
In this demo, we observed, that a neural network can learn to select the better of two options and apply that option more frequently to maximize rewards. On average 391 clicks will be obtained in this setting, where ad A has a click probability of 30% and ad B has a click probability of 40%. In practice, these probabilities would be much lower, and the difference between them might be smaller, making it harder for the neural network to explore the better option. The policy gradient method has the advantage of automatically adjusting the balance between exploration and exploitation. However, this balance is affected by the learning rate. Higher learning rates would result in a shorter exploration phase and faster application of the learned policy, as shown in the following graph, where the learning rate is increased from 0.01 to 10. The model output, averaged over 100 individual ad campagins, increases more rapidly with increasing learning rate up to a learning rate of 1. However, at higher learning rates, there is a risk of adapting to the wrong action, which may appear better only during the short exploration period. At high learning rates the model output is adjusting too quickly, leading to unstable decisions.
Thus, there is an optimal learning rate to choose, which may be difficult to find in practice since nothing is known about the click probabilities beforehand. Varying the learning rate from 0.01 to 10.0 shows that a broad maximum for the sum of clicks is obtained for learning rates between 0.1 and 2.0. Higher learning rates clearly increase the standard deviation, demonstrating the instability of the learning process, also leading to a reduction of the average sum of clicks.
This demo shows how reinforcement learning can be used for A/B optimization. It is a simple example to illustrate the underlying process of the policy gradient method. We have learned how the neural network updates its parameters based on adjusted gradients depending on whether the selected ad was clicked. Applying the learned policy quickly maximizes the click rate. However, choosing the optimal learning rate can be difficult in practice.
You can find the complete code and a streamlit demo on huggingface.co: https://huggingface.co/spaces/Bernd-Ebenhoch/AB_optimization