# Solving Reinforcement Learning Racetrack Exercise with Off-policy Monte Carlo Control

In the Off-policy Monte Carlo Control section of the book Reinforcement Learning: An Introduction, 2nd Edition (page 112), the author leaves us with an interesting exercise: using the weighted importance sampling off-policy Monte Carlo method to find the fastest way to drive on both tracks. This exercise is comprehensive: it asks us to consider and build almost every component of a reinforcement learning task, such as the environment, agent, reward, actions, termination conditions, and the algorithm itself. Solving it is fun and helps us build a solid understanding of the interaction between algorithm and environment, the importance of a correct episodic task definition, and how value initialization affects the training outcome. In this post, I hope to share my understanding of, and solution to, this exercise with everyone interested in reinforcement learning.

As mentioned above, this exercise asks us to find a policy that drives a race car from the starting line to the finishing line as fast as possible without running into gravel or off the track. After carefully reading the exercise description, I listed some key points that are essential to completing this task:

• Map representation: maps in this context are 2D matrices with (row_index, column_index) as coordinates. The value of each cell represents the state of that cell; for instance, we can use 0 for gravel, 1 for the track surface, 0.4 for the starting region, and 0.8 for the finishing line. Any row or column index outside the matrix can be considered out of bounds.
• Car representation: we can directly use the matrix’s coordinates to represent the car’s position.
• Velocity and control: the velocity space is discrete and consists of horizontal and vertical speeds, represented as a tuple (row_speed, col_speed). The speed limit on both axes is (-5, 5), and the speed on each axis is incremented by +1, 0, or -1 at each step; therefore, there are nine possible actions per step. Additionally, the two speed components cannot both be zero except at the starting line, and the vertical velocity (row speed) cannot be negative, as we don’t want the car to drive back toward the starting line.
• Reward and episode: the reward for each step before crossing the finishing line is -1. When the car runs off the track, it is reset to one of the starting cells. The episode ends ONLY when the car successfully crosses the finishing line.
• Starting states: we randomly choose a starting cell for the car on the starting line; the car’s initial speed is (0, 0), per the exercise’s description.
• Zero-acceleration challenge: the author proposes a small zero-acceleration challenge: at each time step, with probability 0.1, the action takes no effect and the car keeps its previous speed. We can implement this challenge in the training loop instead of adding the feature to the environment.
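The velocity rules above can be sketched as a small helper. This is only an illustration of the constraints, not code from the final environment; the function name and signature are my own, and I assume the (-5, 5) limit is exclusive, i.e. each speed component lies in [-4, 4] (with the row speed further restricted to [0, 4]):

```python
# Hypothetical sketch of the velocity rules listed above (names are mine).
# Each action is an acceleration (d_row, d_col) with components in {-1, 0, +1},
# giving the nine possible actions per step.
ACTIONS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]

def next_speed(row_speed, col_speed, action, at_start_line):
    """Apply an acceleration and enforce the exercise's speed constraints."""
    dr, dc = action
    new_row = max(0, min(4, row_speed + dr))    # row speed: never negative, below the limit
    new_col = max(-4, min(4, col_speed + dc))   # col speed: bounded on both sides
    if new_row == 0 and new_col == 0 and not at_start_line:
        # Both components may be zero only at the starting line; otherwise
        # reject the action and keep the previous speed.
        return row_speed, col_speed
    return new_row, new_col
```

The zero-acceleration challenge then amounts to replacing the chosen action with (0, 0) with probability 0.1 before calling such a helper.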

The solution to the exercise is split into two posts; in this post, we’ll focus on building the racetrack environment. The file structure of this exercise is as follows:

|-- race_track_env
| |-- maps
| | |-- build_tracks.py // this file is used to generate track maps
| | |-- track_a.npy // track a data
| | |-- track_b.npy // track b data
| |-- race_track.py // race track environment
|-- exercise_5_12_racetrack.py // the solution to this exercise

And the libraries used in this implementation are as follows:

python==3.9.16
numpy==1.24.3
matplotlib==3.7.1
pygame==2.5.0

We can represent track maps as 2D matrices whose values indicate track states. To stay loyal to the exercise, I built the same maps shown in the book by assigning the matrix values manually. The maps are saved as separate .npy files so that the environment can read them during training instead of generating them at runtime.
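As a minimal illustration of this approach, the snippet below hand-builds a tiny toy map with the cell codes from the list above (0 = gravel, 1 = track, 0.4 = starting line, 0.8 = finishing line) and saves it as a .npy file. The shape, layout, and file name are placeholders, not the actual tracks from the book, which build_tracks.py constructs the same way at a larger scale:

```python
import numpy as np

# Toy 6x5 map; everything defaults to 0 (gravel).
track = np.zeros((6, 5))
track[1:5, 1:4] = 1.0   # track surface in the middle
track[5, 1:4] = 0.4     # starting line on the bottom row
track[1, 4] = 0.8       # finishing cell on the right edge

# Save once, then let the environment load it during training.
np.save("track_demo.npy", track)
loaded = np.load("track_demo.npy")
```

Loading with np.load returns the matrix exactly as saved, so the environment only needs the file path.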