The Monte Carlo (MC) and the Temporal-Difference (TD) methods are both fundamental technics in the field of reinforcement learning; they solve the prediction problem based on the experiences from interacting with the environment rather than the environment’s model. However, the TD method is a combination of MC methods and Dynamic Programming (DP), making it differs from the MC method in the aspects of the update rule, the bootstrapping, and the bias/variance. TD methods are also proven to have better performance and faster convergence compared to the MC in most cases.
In this post, we’ll compare TD and MC, or more specifically, the TD(0) and constant-α MC methods, on a simple grid environment and a more comprehensive Random Walk  environment. Hoping this post can help readers interested in Reinforcement Learning better understand how each method updates the state-value function and how their performance differs in the same testing environment.
We will implement algorithms and comparisons in Python, and libraries used in this post are as follows:
The introduction of TD(0) and constant-α MC
The constant-α MC method is a regular MC method with a constant step size parameter α, and this constant parameter helps to make the value estimate more sensible to the recent experience. In practice, the choice of the α value depends on a trade-off between stability and adaptability. The following is the MC method’s equation for updating the state-value function at time t:
The TD(0) is a special case of TD(λ) that only looks one step ahead and is the simplest form of TD learning. This method updates the state-value function with TD error, the difference between the estimated value of the state and the reward plus the estimated value of the next state. A constant step size parameter α works the same as in the MC method above. The following is the TD(0)’s equation for updating the state-value function at time t:
Generally speaking, the difference between MC and TD methods happens on three aspects:
- Update rule: MC methods update values only after the episode ends; this could be problematic if the episode is very long, which slows down the program, or in the continuing task that does not have episodes at all. On the contrary, TD method updates value estimates at each time step; this is online learning and can be particularly useful in continuing tasks.
- Bootstrapping: The term “bootstrapping” in reinforcement learning refers to updating value estimates based on other value estimates. TD(0) method bases its update on the value of the following state, so it is a bootstrapping method; on the contrary, MC does not use bootstrapping as it updates value directly from the returns (G).
- Bias/Variance: MC methods are unbiased because they estimate the value by weighing the actual returns observed without making estimates during the episode; however, MC methods have high variance, especially when the number of samples is low. On the contrary, TD methods have biases because they use bootstrapping, and the bias can vary based on the actual implementation; TD methods have low variance because it uses the immediate reward plus the estimate of the next state, which smooths out the fluctuation that raises from the randomness in rewards and actions.
Evaluating TD(0) and constant-α MC on simple gridworld setup
To make their difference more straightforward, we can set up a simple Gridworld test environment with two fixed trajectories, run both algorithms on the setup until converged, and check how they update the values differently.
First we can setup the testing environment with the following codes:
The left figure above shows a simple gridworld environment setup. All the colored cells represent terminal states; the agent gets a +1 reward when stepping into the red cell but gets a -1 reward when stepping into blue cells. All other steps on the grid return a reward of zero. The right figure above marks two preset paths: one arrives at the blue cell while another stops at the red cell; the intersection of the paths helps maximize the value difference between the two methods.
Then we can use the equations in the previous section to evaluate the environment. We do not discount the return nor the estimate, and set the α to a small value 1e-3. When the absolute sum of the value increments is lower than a threshold of 1e-3, we consider the values converged.
The result of the evaluation is as follows:
The two algorithms’ different ways of estimating values become pretty apparent in the image above. The MC method is loyal to the returns of the path, so the values on each path directly represent how it ends. Nevertheless, the TD method provides a better prediction, especially on the blue path — the values on the blue path before the intersection also indicate the possibility of getting to the red cell.
With this minimal case in mind, we are ready to move to a much more complicated example and try to find out the difference in performance between the two methods.
The Random Walk task is a simple Markov reward process proposed by Sutton et al. for TD and MC prediction purposes, as the images below show. In this task, the agent starts from the center node C. The agent takes a right or left step with equal probability on each node. There are two terminal states on both ends of the chain. The reward of getting into the left end is 0, and getting into the right end is +1. All the steps before the termination generate a reward of 0.
And we can use the following code to create the Random Walk environment:
=====Test: checking environment setup=====
Links: None ← Node A → Node B
Reward: 0 ← Node A → 0
Links: Node A ← Node B → Node C
Reward: 0 ← Node B → 0
Links: Node B ← Node C → Node D
Reward: 0 ← Node C → 0
Links: Node C ← Node D → Node E
Reward: 0 ← Node D → 0
Links: Node D ← Node E → None
Reward: 0 ← Node E → 1
The true value for each node of the environment under a random policy is [1/6, 2/6, 3/6, 4/6, 5/6]. The value was calculated by the policy evaluation with Bellam equation:
Our task here is to find how close the values estimated by both algorithms are to the true value; we can arbitrarily assume that the algorithm produces a closer value function to the true value function, measured by the averaged root mean square error (RMS), indicating a better performance.
With the environment ready, we can start running both methods on the Random Walk environment and compare they performance. First let’s take a look at both algorithms:
As mentioned earlier, the MC method should wait until the episode ends to update the values from the tail of the trajectory, while the TD method updates the values incrementally. This difference brings a trick when initializing the state-value function: In MC, the state-value function does not include the terminal states, while in TD(0), the function should include the terminal state with the value of 0 because the TD(0) method always looks one step ahead before the episode ends.
The α parameter selection in this implementation references the that proposed in the book ; the parameters for the MC method are [0.01, 0.02, 0.03, 0.04] while that for the TD method are [0.05, 0.10, 0.15]. I wondered why the author didn’t choose the same parameter set on both algorithms until I ran the MD method with parameters for TD: the TD parameters are too high for the MC method and thus cannot unveil the MC’s best performance. Therefore, we will stick with the book’s setup in the parameter sweep. Now, let’s run both algorithms to find out their performance on the Random Walk setup.
The result after 100 runs of comparison is as the image above shows. The TD method generally yields better value estimations than the MC methods, and the TD with α = 0.05 can get really close to the true value. The graph also shows the MC method has a higher variance compared to the TD ones, as the orchid lines fluctuate more than the steel blue lines.
It is worth noticing that for both algorithms, when the α is (relatively) high, the RMS loss first goes down and then up again. Such a phenomenon is due to the combined effect of the value initialization and the α value. We initialized a relatively high value of 0.5, which is higher than the true value of Nodes A and B. As the random policy makes a 50% chance of choosing a “wrong” step, which takes the agent away from the right terminal state, a higher α value will also emphasize the wrong steps and make the result away from the true value.
Now let’s try to reduce the initial value to 0.1 and run the comparison again, see if the problem alleviated:
The lower initial value apparently helps relieve the problem; there are no noticeable “go down, then go up” effects. However, the side effect of a lower initial value is that the learning is less efficient, as the RMS loss never goes below 0.05 after 150 episodes. Therefore, there is a trade-on between the initial value, the parameter, and the algorithm’s performance.
One last piece I want to bring up in this post is the comparison of batch training on both algorithms.
Consider we’re facing the following situation: we only accumulated a limited number of experiences on the Random Walk task, or we can only run a certain number of episodes due to the limitation of time and computation. The idea of batch updating  was proposed to deal with such a situation by fully utilizing the existing trajectories.
The idea of batch training is to update the values on a batch of trajectories repeatedly until the values converge to an answer. The values will only be updated once all batch experiences are fully processed. Let’s implement the batch training for both algorithms on the Random Walk environment to see if the TD method still performs better than the MC method.
The result of batch training shows the TD method is still doing better than the MC method with limited experience, and the gap between the performance of the two algorithms is quite apparent.
In this post, we discussed the difference between the constant-α MC method and TD(0) methods and compared their performance on the Random Walk task. The TD method overrides the MC methods in all the tests in this post, so considering TD as a method for reinforcement learning tasks is a preferable choice. However, it doesn’t mean that TD is always better than MC, as the latter has a most obvious advantage: no bias. If we’re facing a task that has no tolerance for biases, then MC could be a better choice; otherwise, TD could better deal with general cases.
 Midjourney Terms of Service: https://docs.midjourney.com/docs/terms-of-service
 Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An introduction. MIT Press, 2018.
My GitHub repo of this post: [Link].