Monte Carlo (MC) and Temporal-Difference (TD) methods are both fundamental techniques in reinforcement learning; they solve the prediction problem from experience gained by interacting with the environment rather than from a model of the environment. The TD method combines ideas from MC methods and Dynamic Programming (DP), and it differs from the MC method in its update rule, its use of bootstrapping, and its bias/variance trade-off. TD methods also tend to perform better and converge faster than MC methods in many settings.

In this post, we'll compare TD and MC, or more specifically, the TD(0) and constant-α MC methods, on a simple grid environment and on the more comprehensive Random Walk [2] environment. We hope this post helps readers interested in reinforcement learning better understand how each method updates the state-value function and how their performance differs in the same testing environment.

We will implement the algorithms and comparisons in Python; the libraries used in this post are as follows:

```
python==3.9.16
numpy==1.24.3
matplotlib==3.7.1
```

## An introduction to TD(0) and constant-α MC

The constant-α MC method is a regular MC method with a constant step-size parameter α; this constant parameter makes the value estimate more sensitive to recent experience. In practice, the choice of α involves a trade-off between stability and adaptability. The following is the MC method's equation for updating the state-value function at time t:
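$$V(S_t) \leftarrow V(S_t) + \alpha\left[G_t - V(S_t)\right]$$

where $G_t$ is the actual return observed following time $t$.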

TD(0) is a special case of TD(λ) that looks only one step ahead and is the simplest form of TD learning. This method updates the state-value function with the TD error: the difference between the reward plus the discounted estimated value of the next state and the estimated value of the current state. The constant step-size parameter α works the same way as in the MC method above. The following is TD(0)'s equation for updating the state-value function at time t:
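$$V(S_t) \leftarrow V(S_t) + \alpha\left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right]$$

where the bracketed term $R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is the TD error and $\gamma$ is the discount factor.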

Generally speaking, MC and TD methods differ in three aspects:

- **Update rule:** MC methods update values only after the episode ends. This can be problematic when episodes are very long, which slows down the program, or in continuing tasks that have no episodes at all. In contrast, the TD method updates value estimates at each time step; this is online learning and can be particularly useful in continuing tasks.
- **Bootstrapping:** The term "bootstrapping" in reinforcement learning refers to updating value estimates based on other value estimates. TD(0) bases its update on the value of the following state, so it is a bootstrapping method; in contrast, MC does not bootstrap, as it updates values directly from the observed returns (G).
- **Bias/Variance:** MC methods are unbiased because they estimate values from the actual returns observed, without relying on intermediate estimates during the episode; however, they have high variance, especially when the number of samples is low. In contrast, TD methods are biased because they bootstrap, and the bias can vary based on the actual implementation; they have low variance because each update uses the immediate reward plus the estimate of the next state, which smooths out fluctuations that arise from randomness in rewards and actions.
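To make the update-rule contrast concrete, here is a minimal sketch of one episode processed by both update styles; the toy episode and all names (`alpha`, `gamma`, `V_mc`, `V_td`) are illustrative assumptions, not code from this post.

```python
# Illustrative sketch: TD(0) updates online and bootstraps from V[next_state];
# constant-alpha MC waits for the episode to end and updates toward the return G.
alpha, gamma = 0.1, 1.0
V_td = {"A": 0.0, "B": 0.0}
V_mc = {"A": 0.0, "B": 0.0}

# A toy episode: A -> B -> terminal, as (state, reward, next_state) transitions.
episode = [("A", 0.0, "B"), ("B", 1.0, None)]

# TD(0): update at every step, bootstrapping from the current estimate of the
# next state (terminal states have value 0).
for s, r, s_next in episode:
    v_next = V_td[s_next] if s_next is not None else 0.0
    V_td[s] += alpha * (r + gamma * v_next - V_td[s])

# Constant-alpha MC: after the episode ends, compute the return G backwards
# and move each visited state's value toward it.
G = 0.0
for s, r, _ in reversed(episode):
    G = r + gamma * G
    V_mc[s] += alpha * (G - V_mc[s])
```

After this single episode, TD(0) has not yet changed V(A), because its bootstrapped target V(B) was still zero when A was updated, while MC has already moved V(A) toward the observed return.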

## Evaluating TD(0) and constant-α MC on a simple Gridworld setup

To make their differences more concrete, we can set up a simple Gridworld test environment with two fixed trajectories, run both algorithms on it until convergence, and examine how they update the values differently.

First, we can set up the testing environment with the following code: