Once again we’re off to the casino, and this time it’s situated in sunny Monte Carlo, made famous by its appearance in the classic movie Madagascar 3: Europe’s Most Wanted (although there’s a slight chance that it was already famous).
On our last visit to a casino we looked at the multi-armed bandit, using it to visualise the problem of choosing the best action when confronted with many possible actions.
In terms of Reinforcement Learning, the bandit problem can be thought of as representing a single state and the actions available within that state. Monte Carlo methods extend this idea to cover multiple, interrelated states.
Additionally, in the previous problems we’ve looked at, we’ve always been given a full model of the environment. This model defines both the transition probabilities, that describe the chances of moving from one state to the next, and the reward received for making this transition.
In Monte Carlo methods this isn’t the case. No model is given and instead the agent must discover the properties of the environment through exploration, gathering information as it moves from one state to the next. In other words, Monte Carlo methods learn from experience.
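The core idea of learning from experience can be sketched with a simple, hypothetical example (not one from this article): the agent faces a slot machine whose payout distribution it cannot see, so it estimates the machine's expected payout purely by sampling and averaging. The payout probabilities below are made up for illustration.

```python
import random

def sample_reward(rng):
    """A hypothetical slot machine: pays 10 with probability 0.2, else 0.
    The agent never sees these numbers; it only observes sampled payouts."""
    return 10.0 if rng.random() < 0.2 else 0.0

def estimate_value(n_samples, seed=0):
    """Monte Carlo estimate: the average of the sampled rewards converges
    to the true expected payout (here 0.2 * 10 = 2.0) as n_samples grows."""
    rng = random.Random(seed)
    return sum(sample_reward(rng) for _ in range(n_samples)) / n_samples

print(estimate_value(100_000))  # close to 2.0
```

No model of the machine is ever built; the estimate comes entirely from gathered experience, which is exactly the principle Monte Carlo methods apply across multiple states.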
An interactive version of this article is also available in notebook form, where you can run all of the code snippets described below.
All of the previous articles in this series can be found here: A Baby Robot’s Guide To Reinforcement Learning.
And, for a quick recap of the theory and terminology used in this article, check out State Values and Policy Evaluation in 5 minutes.
In the prediction problem we want to find how good it is to be in a particular state of the environment. This "goodness" is represented by the state value: the total reward the agent can expect to collect when starting from that state.
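The prediction problem can be sketched with first-visit Monte Carlo evaluation on a small, hypothetical two-state chain (the environment below is illustrative, not one defined in this article): run complete episodes, then estimate each state's value as the average of the returns that followed the first visit to that state.

```python
import random
from collections import defaultdict

def run_episode(rng):
    """Sample one episode from a hypothetical chain: from state A the agent
    moves to B (reward 1) with probability 0.7, otherwise terminates with
    reward 0; from B it always terminates with reward 2."""
    steps, state = [], "A"
    while state != "end":
        if state == "A":
            nxt, r = ("B", 1.0) if rng.random() < 0.7 else ("end", 0.0)
        else:
            nxt, r = ("end", 2.0)
        steps.append((state, r))
        state = nxt
    return steps

def first_visit_mc(n_episodes, gamma=1.0, seed=0):
    """Estimate state values by averaging the return that follows the
    first visit to each state, over many sampled episodes."""
    totals, counts = defaultdict(float), defaultdict(int)
    rng = random.Random(seed)
    for _ in range(n_episodes):
        steps = run_episode(rng)
        g, first_returns = 0.0, {}
        # Walk the episode backwards, accumulating the discounted return;
        # the last assignment per state is its earliest (first) visit.
        for state, r in reversed(steps):
            g = r + gamma * g
            first_returns[state] = g
        for state, g in first_returns.items():
            totals[state] += g
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

print(first_visit_mc(10_000))  # V(A) close to 2.1, V(B) = 2.0
```

No transition probabilities or rewards are given to the learner; the value estimates emerge purely from the sampled episodes, which is the essence of Monte Carlo prediction.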