## Applying Reinforcement Learning strategies to real-world use cases, especially in dynamic pricing, can reveal many surprises

In the vast world of decision-making problems, one dilemma is particularly owned by Reinforcement Learning strategies: exploration versus exploitation. Imagine walking into a casino with rows of slot machines (also known as “one-armed bandits”) where each machine pays out a different, unknown reward. Do you explore and play each machine to discover which one has the highest payout, or do you stick to one machine, hoping it’s the jackpot? This metaphorical scenario underpins the concept of the Multi-armed Bandit (MAB) problem. The objective is to find a strategy that maximizes the rewards over a series of plays. While exploration offers new insights, exploitation leverages the information you already possess.

Now, transpose this principle to dynamic pricing in a retail scenario. Suppose you are an e-commerce store owner with a new product. You aren’t certain about its optimal selling price. How do you set a price that maximizes your revenue? Should you explore different prices to understand customer willingness to pay, or should you exploit a price that has been performing well historically? Dynamic pricing is essentially a MAB problem in disguise. At each time step, every candidate price point can be seen as an “arm” of a slot machine and the revenue generated from that price is its “reward.” Another way to see this is that the objective of dynamic pricing is to swiftly and accurately measure how a customer base’s demand reacts to varying price points. In simpler terms, the aim is to pinpoint the demand curve that best mirrors customer behavior.

In this article, we’ll explore four Multi-armed Bandit algorithms to evaluate their efficacy against a well-defined (though not straightforward) demand curve. We’ll then dissect the primary strengths and limitations of each algorithm and delve into the key metrics that are instrumental in gauging their performance.

Traditionally, demand curves in economics describe the relationship between the price of a product and the quantity of that product consumers are willing to buy. They generally slope downwards, reflecting the common observation that as price rises, demand typically falls, and vice versa. Think of popular products such as smartphones or concert tickets. If prices are lowered, more people tend to buy, but if prices skyrocket, even ardent fans might think twice.

Yet in our context, we’ll model the demand curve slightly differently: we’re plotting price against probability. Why? Because in dynamic pricing scenarios, especially for digital goods or services, it’s often more meaningful to think in terms of the likelihood of a sale at a given price than to speculate on exact quantities. In such environments, each pricing attempt can be seen as an exploration of the likelihood of success (i.e., a purchase), which can be easily modeled as a Bernoulli random variable with a probability *p* that depends on the test price.

Here’s where it gets particularly interesting: while intuitively one might think the task of our Multi-armed Bandit algorithms is to unearth the ideal price where the probability of purchase is highest, it’s not quite so straightforward. In fact, our ultimate goal is to maximize the revenue (or the margin). This means we’re not searching for the price that gets the most people to click ‘buy’ — we’re searching for the price that, when multiplied by its associated purchase probability, gives the highest expected return. Imagine setting a high price that fewer people pay, but where each sale generates significant revenue. On the flip side, a very low price might attract more buyers, yet the total revenue might still be lower than in the high-price scenario. So, in our context, talking about the ‘demand curve’ is somewhat unconventional, as our target curve will primarily represent the probability of purchase rather than demand directly.

Now, getting to the math, let’s start by saying that consumer behavior, especially when dealing with price sensitivity, isn’t always linear. A linear model might suggest that for every incremental increase in price, there’s a constant decrement in demand. In reality, this relationship is often more complex and nonlinear. One way to model this behavior is by using logistic functions, which can capture this nuanced relationship more effectively. Our chosen model for the demand curve is then:
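One logistic parameterization consistent with the description of the parameters below (the exact form is an assumption here, chosen for concreteness) is:

```
p(x) = \frac{2a}{1 + e^{bx}}
```

With this form, p(0) = a, so *a* is the highest purchase probability attainable at non-negative prices, and p(x) decays toward 0 as the price x grows, at a rate governed by *b*.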

Here, *a* denotes the maximum achievable probability of purchase, while *b* modulates the sensitivity of the demand curve to price changes. A higher value of *b* means a steeper curve, one that approaches low purchase probabilities more rapidly as the price increases.

For any given price point, we’ll then be able to obtain an associated purchase probability, *p*. We can then feed *p* into a Bernoulli random variable generator to simulate a customer’s response to a particular price proposal. In other words, given a price, we can easily emulate our reward function.
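As a minimal sketch of that reward emulation, assuming a logistic demand curve of the form p(x) = 2a/(1 + e^(b·x)) and illustrative parameter values (a = 0.8, b = 0.05 are not from the article, just placeholders):

```python
import math
import random

def purchase_probability(price, a=0.8, b=0.05):
    # Illustrative logistic-style demand curve: p(0) = a,
    # decreasing toward 0 as the price grows.
    return 2 * a / (1 + math.exp(b * price))

def sample_reward(price, a=0.8, b=0.05, rng=random):
    # Bernoulli draw: 1 if the simulated customer buys at this price, else 0.
    p = purchase_probability(price, a, b)
    return 1 if rng.random() < p else 0
```

Each call to `sample_reward` plays one “arm” once: over many trials at the same price, the empirical purchase rate converges to the underlying probability.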

Next, we can multiply this function by the price in order to get the expected revenue for a given price point:
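Assuming the logistic form p(x) = 2a/(1 + e^(b·x)) used above (a parameterization chosen here for concreteness), the expected revenue at price x is:

```
R(x) = x \cdot p(x) = \frac{2ax}{1 + e^{bx}}
```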

Unsurprisingly, this function does not reach its maximum at the price with the highest purchase probability. Also, the price associated with the maximum does not depend on the value of the parameter *a*, while the maximum expected return does.

With some recollection from calculus, we can also derive the formula for the derivative (you’ll need a combination of the product rule and the chain rule). It’s not exactly a relaxing exercise, but it’s nothing too challenging. Here is the analytical expression of the derivative of the expected revenue:
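Under the assumed logistic form p(x) = 2a/(1 + e^(b·x)), differentiating R(x) = 2ax/(1 + e^(b·x)) with the product and chain rules gives:

```
R'(x) = \frac{2a\left(1 + e^{bx} - bx\,e^{bx}\right)}{\left(1 + e^{bx}\right)^2}
```

Setting the numerator to zero yields 1 + e^(bx) = bx·e^(bx), a condition involving only *b* — which is why the revenue-maximizing price is independent of *a*, even though the revenue attained there scales with *a*.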

This derivative allows us to find the exact price that maximizes our expected revenue curve. By pairing this formula with a numerical root-finding algorithm, we can easily determine the price at which the derivative equals zero — which, in turn, is the price that maximizes the expected revenue.

And this is exactly what we need, since by fixing the values of *a* and *b*, we’ll immediately know the target price that our bandits will have to find. Coding this in Python is a matter of a few lines of code:
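A sketch of that computation, again assuming the logistic form p(x) = 2a/(1 + e^(b·x)) (the parameterization, and the use of plain bisection instead of a library root-finder, are choices made here for illustration):

```python
import math

def revenue_derivative(price, a, b):
    # Derivative of the expected revenue R(x) = 2ax / (1 + exp(b*x)),
    # obtained via the product and chain rules.
    e = math.exp(b * price)
    return 2 * a * (1 + e - b * price * e) / (1 + e) ** 2

def optimal_price(a, b, lo=1e-9, hi=1000.0, tol=1e-9):
    # Bisection on the derivative: it is positive at low prices and
    # negative at high prices, so its root maximizes expected revenue.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if revenue_derivative(mid, a, b) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

As expected, calling `optimal_price` with different values of `a` but the same `b` returns the same target price, while `b` alone moves the optimum.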