## An in-depth explanation of “Thompson Sampling”, a more efficient alternative to A/B testing for online learning

Imagine you have two ads to choose from: the red one and the blue one. Of course, you would like to show your users the ad with the highest click rate.

But how do you find out which ad has the highest click rate?

The most common approach to answering this question is an A/B test. This means setting aside a group of users, showing the first ad to half of them and the second ad to the other half. At the end, you compute the click rate of each alternative and select the best one.

Suppose that, at the end of the A/B test, you have the following results:

The blue version is clearly superior to the red one: a click rate of 18% against a click rate of 11%. But **this means that we lost many opportunities: we could have shown the blue ad to many more users, and thus we could have obtained many more clicks**.

On the other hand, what if we stopped the experiment very early, say after only 20 users?

We know intuitively that, after 20 users, the results are not reliable enough to send the best-performing variant to all the users.

In general, **the problem with A/B tests** is that:

- **if we set aside too many users, we lose opportunities on the less-performing variants**;
- **if we set aside too few users, the test is inconclusive**.

In other words, A/B tests are inefficient because they are too static. Ideally, we would need a smart system that is able to learn dynamically as it gets more data.

This system should:

- explore the different alternatives when the results are too small to be reliable;
- exploit the results when they start becoming reliable enough, by sending more and more traffic to the best-performing alternative.

Good news: such a system exists and is called Thompson Sampling.

The approach we have seen above tried to evaluate each alternative with a single number: its click rate. The problem with this approach is that a single number doesn’t express the uncertainty associated with the estimate itself.

To solve this problem, Thompson Sampling proposes to use a full probability distribution rather than a single number.

The objective of the probability distribution is to express the uncertainty about the estimate of the metric.

Once we have our distributions — one for each alternative — Thompson Sampling works by drawing a random number from each distribution. Then, the alternative associated with the highest number is shown to the user.
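To make this selection step concrete, here is a minimal sketch using made-up click/miss counts for the two ads (illustrative values, not the article's data):

```python
import numpy as np

rng = np.random.default_rng(42)

# hypothetical click/miss counts observed so far for each ad
counts = {
    "red": {"clicks": 11, "misses": 89},
    "blue": {"clicks": 18, "misses": 82},
}

# draw one random number from each ad's Beta distribution
draws = {name: rng.beta(c["clicks"], c["misses"]) for name, c in counts.items()}

# the ad associated with the highest draw is shown to the user
chosen = max(draws, key=draws.get)
print(chosen)
```

Each new user triggers one such round of draws, so which ad gets shown can change from user to user while uncertainty is high.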

What’s the point of doing this? Well, the idea is that when the distributions express high uncertainty, the outcome depends largely on chance.

In other words, the less confident we are in our beliefs, the more the system explores the different alternatives. Conversely, as confidence increases, the system increasingly exploits the best-performing alternative.

Let’s see two probability distributions that could be obtained from the results we have seen above.

If you draw random numbers from these two distributions, you will find that the number drawn from the red distribution is greater than the number drawn from the blue distribution 24% of the time. This confirms our intuition numerically: the difference is not yet statistically significant.
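A probability like this can be estimated by simulation. Here is a minimal sketch; the click/miss counts below are illustrative assumptions rather than the actual data behind the chart, so the resulting percentage will differ:

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative early-stage counts for the two variants
red_clicks, red_misses = 1, 9
blue_clicks, blue_misses = 2, 9

# draw many samples from each Beta distribution
red = rng.beta(red_clicks, red_misses, size=100_000)
blue = rng.beta(blue_clicks, blue_misses, size=100_000)

# fraction of draws in which red beats blue
p_red_wins = (red > blue).mean()
print(round(p_red_wins, 2))
```

When this fraction is far from zero, the two distributions overlap heavily and the evidence is still inconclusive.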

But what about after 10,000 impressions?

Now we are very confident that the blue page performs better than the red page. And, in fact, it’s practically impossible that the number drawn from the red distribution will be greater than the one drawn from the blue distribution.

In our example, since we have a binary outcome (click or miss), the go-to distribution is the Beta distribution. The cool thing about Beta is that it is entirely defined by two parameters, `a` and `b`, that can be interpreted in a very straightforward way:

- `a`: number of successes (in our case, number of clicks);
- `b`: number of failures (in our case, number of misses).

The expected value of the distribution is `a / (a + b)`, which is our quantity of interest: the click-through rate.

The Beta distribution is also available in SciPy, so it is very easy to compute:

```python
import numpy as np
from scipy import stats

# input: number of clicks and number of misses
clicks = 1
misses = 4

# get 1,000 equally spaced points between 0 and 1 for plotting purposes
x = np.linspace(start=0, stop=1, num=1_000)

# calculate the probability density function of the Beta distribution
beta_pdf = stats.beta(a=clicks, b=misses).pdf(x=x)
```

Let’s plot some examples. Take a click rate of 20%: what happens to the Beta distribution when the number of impressions increases?

As we expected, as the number of users increases, the results are more and more certain: this translates to a distribution that is increasingly concentrated around the expected value: 20%.

In other words, **working with probability distributions allows us to assign a quantitative measure of certainty to our qualitative evaluation**.
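We can check this concentration numerically: keeping the click rate fixed at 20% and increasing the number of impressions, the standard deviation of the Beta distribution shrinks. A quick sketch:

```python
from scipy import stats

click_rate = 0.2

# standard deviation of the Beta distribution for a growing number of impressions
stds = {}
for n in (10, 100, 1_000):
    clicks = int(click_rate * n)
    misses = n - clicks
    stds[n] = stats.beta(a=clicks, b=misses).std()

print(stds)
```

The standard deviation is exactly the "measure of certainty" mentioned above: the smaller it gets, the more we trust the estimated click rate.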

If you took Statistics 101, you may wonder: “Wait a minute. According to the Central Limit Theorem, if we have independent trials we should use the Normal distribution. So why are we using the Beta distribution?”

Indeed, this is a good point. Let’s see how to compute both the Beta and the Normal probability density functions in Python.

```python
import numpy as np
from scipy import stats

# input: number of clicks and number of misses
clicks = 1
misses = 4

# compute n and the click rate
n = clicks + misses
click_rate = clicks / n

# get 1,000 equally spaced points between 0 and 1 for plotting purposes
x = np.linspace(start=0, stop=1, num=1_000)

# calculate the probability density function of the Beta distribution
beta_pdf = stats.beta(a=clicks, b=misses).pdf(x=x)

# calculate the probability density function of the Normal distribution
normal_pdf = stats.norm(
    loc=click_rate,
    scale=np.sqrt(click_rate * (1 - click_rate) / n),
).pdf(x=x)
```

Let’s repeat this process for different numbers of users and compare the two distributions:

As you can see, the Beta distribution and the Normal distribution become more and more similar as the number of impressions grows. They become practically indistinguishable after just 50 impressions.

So, using the Beta or the Normal distribution wouldn’t make much difference. This is great news because it means that — thanks to CLT — we can always use the Normal distribution, regardless of the metric we choose.
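This convergence can be verified directly: the maximum gap between the two density curves shrinks as the number of impressions grows. A minimal sketch at a fixed 20% click rate:

```python
import numpy as np
from scipy import stats

def max_pdf_gap(n, click_rate=0.2):
    """Largest absolute gap between the Beta and Normal densities."""
    clicks = int(click_rate * n)
    misses = n - clicks
    # avoid the endpoints, where the Beta density can be badly behaved
    x = np.linspace(0.001, 0.999, 1_000)
    beta_pdf = stats.beta(a=clicks, b=misses).pdf(x)
    normal_pdf = stats.norm(
        loc=click_rate,
        scale=np.sqrt(click_rate * (1 - click_rate) / n),
    ).pdf(x)
    return np.abs(beta_pdf - normal_pdf).max()

print(max_pdf_gap(10), max_pdf_gap(100))
```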

Let’s walk through an example to see Thompson Sampling at work.

We want to test 4 versions of an advertisement: grey, red, green, and blue. Suppose that we also know the real click rate for each version.

As in the previous paragraph, we will use the Beta distribution, but with a small adjustment: since the parameters of Beta (`a` and `b`) must be strictly greater than 0, whenever at least one of them is zero we will add 1 to both.

```python
import numpy as np

def draw_from_beta(clicks, misses):
    """Draw a random number from the Beta distribution."""
    if min(clicks, misses) == 0:
        clicks += 1
        misses += 1
    return np.random.beta(a=clicks, b=misses)
```

For each new user, we must do the following:

- Based on the current count of clicks and misses for each variant, get the corresponding Beta distribution.
- Draw a number from each variant’s distribution obtained at step 1.
- Show the user the variant associated with the highest number.
- Update the counter with the outcome obtained on the current user (click or miss).

Let’s see a graphical representation of this process for the first 1,000 users:

As you can see, after 100 iterations our belief is still not aligned with the truth: the expected value of the green variant is greater than that of the blue one. But this is just due to chance. As experience accumulates, our estimates will converge to the truth. This means that:

- the mean of the distribution will be closer to the true rate;
- the standard deviation of the distribution will be closer and closer to zero.

Let’s see how these two quantities evolve during the first 400 iterations.

As we have seen, after 1,000 impressions this is the result:

Thompson Sampling is so effective that, after only 1,000 iterations, it has already concentrated 50.6% of the displays on the best alternative (blue) and 37.7% on the second-best (green).

On the contrary, what would happen if we used the A/B testing approach, sending every alternative to the same number of users? Showing each ad to 250 users would produce the following result:

Using Thompson Sampling we got 145 clicks, whereas with A/B testing we got 135 clicks. That’s 7.4% more clicks thanks to Thompson Sampling! And the difference would become even bigger if we carried out more iterations.
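To see why shifting traffic pays off, consider some assumed true click rates (illustrative values, not the article's): uniform A/B allocation earns the average of the four rates, while an oracle showing only the best variant earns the top rate. Thompson Sampling lands in between, moving toward the oracle as it learns:

```python
# assumed true click rates (illustrative values, not the article's)
true_rates = {"grey": 0.05, "red": 0.10, "green": 0.15, "blue": 0.20}

# expected clicks under uniform A/B allocation: 250 users per variant
ab_expected = sum(250 * rate for rate in true_rates.values())

# upper bound for any strategy: all 1,000 users see the best variant
best_possible = 1_000 * max(true_rates.values())

print(ab_expected, best_possible)
```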

Thompson Sampling is perfect for online learning because it efficiently addresses the exploration/exploitation dilemma.

It does so by assigning a probability distribution to each variant under test. The distribution serves to express the uncertainty associated with the estimate.

The fact that Thompson Sampling dynamically adapts to the knowledge accumulated from the previous iterations makes it more efficient than A/B testing.

For instance, in our example with 4 variants, Thompson Sampling obtained over 7% more clicks than A/B testing in just 1,000 iterations.