Trick or ReTreat: Implementing Reinforcement Learning 👻
So far, I have introduced the basic concepts of Reinforcement Learning (RL) and outlined the environment and rules for my RL project, Trick or ReTreat.
In this post, I’ll explore two key libraries in RL: OpenAI’s Gym and Stable Baselines3, both of which streamline the process of building and training RL agents. Using the Lunar Lander environment, I’ll showcase two demos: one with the agent taking random actions and another with the same agent after training.
OpenAI’s Gym
In this project, I will use OpenAI’s Gymnasium library (a maintained fork of the original Gym library) to build and train my reinforcement learning agent for Trick or ReTreat. It is one of the most widely used libraries in reinforcement learning.
To get familiar with Gymnasium, I began by experimenting with its pre-defined environments. This approach helped me gain a solid understanding before diving into creating a custom environment for Trick or ReTreat. To start, I’ll give a brief introduction to Gymnasium, summarising its key features and functions based on its documentation.
Environments
Thinking back to the previous post, an environment represents everything an agent interacts with. This includes the various states, the actions the agent can take and the placement of rewards. Let’s see how we can represent this in OpenAI’s Gymnasium.
Initialising Environments
Using pre-defined environments from Gymnasium is fairly straightforward - simply call the make() function to initialise an environment.
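As a quick sketch (using CartPole-v1 purely as an example of a pre-defined environment ID):

```python
import gymnasium as gym

# Initialise a pre-defined environment by its ID.
env = gym.make("CartPole-v1")
```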
Interacting with the Environment
In Gymnasium, environments are represented as classes. Below is an overview of the key functions for interacting with these environments (a short example loop using them follows this list):

1. reset()
   - Resets the environment to its initial state.
   - Must be called before step() to start a new episode.
   - The initial state usually includes some randomness to encourage exploration; this randomness can be controlled with the seed parameter.
2. step()
   - Input: the action A that the agent takes.
   - Returns the following:
     - New state (called observation): the environment’s new state after taking action A.
     - Reward: the reward received for taking action A.
     - Terminated: a Boolean indicating whether the agent has reached a goal or completed the task.
     - Truncated: a Boolean that is True if the episode ends due to step limits or other specified conditions.
     - Info: a dictionary of additional information, useful for debugging or analysis.
3. render()
   - Helps visualise what the agent observes in the environment.
   - render_mode is commonly set to:
     - rgb_array: Returns an image of the environment as an array of shape (x, y, 3).
     - human: Displays the environment in real time.
4. close()
   - Used to clean up and close the environment after use.
   - This includes closing all rendering windows.
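To make these functions concrete, here is a minimal interaction loop where the agent simply samples random actions (CartPole-v1 is used here purely as an example environment):

```python
import gymnasium as gym

env = gym.make("CartPole-v1", render_mode="rgb_array")

# reset() must be called first; the seed makes the initial state reproducible.
observation, info = env.reset(seed=42)

for _ in range(100):
    action = env.action_space.sample()  # pick a random action
    observation, reward, terminated, truncated, info = env.step(action)

    frame = env.render()  # in rgb_array mode this returns an image array

    if terminated or truncated:  # episode ended: start a new one
        observation, info = env.reset()

env.close()  # clean up rendering resources
```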
Spaces
Spaces define the format of actions and observations. They describe the types of actions the agent can perform and the types of observations the agent can receive from the environment.
Each environment must have the following two types of spaces defined:
1. action_space: Represents all the possible actions an agent can take.
2. observation_space: Represents all possible states that the environment can return to the agent.
Each of these spaces must be defined with a type.
Different Types of Spaces
The main types of spaces are listed below (a short sketch defining each follows this list):

1. Box:
   - Used for continuous values within a specified range.
   - Example for Actions: An agent can move within an angle range of 0 to 180 degrees.
   - Example for Observations: A temperature sensor that measures temperatures between 30 and 40 degrees.
2. Discrete:
   - Represents a finite set of actions or states.
   - Example for Actions: An agent can move in only four directions: up, down, left and right.
   - Example for Observations: A square on a game board with two possible states: empty or containing a reward.
3. MultiBinary:
   - Represents binary values, often used for switches or sensors that are on/off.
   - Example for Actions: Buttons on a controller, pressed or not pressed.
   - Example for Observations: A light sensor that returns 1 if activated and 0 if not.

These are the basic spaces, but there are others for more complex environments.
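Here is a short sketch of how each of these space types is defined with gymnasium.spaces; the specific numbers are only illustrative:

```python
import numpy as np
from gymnasium import spaces

# Box: continuous values within a range, e.g. an angle between 0 and 180 degrees.
angle_space = spaces.Box(low=0.0, high=180.0, shape=(1,), dtype=np.float32)

# Discrete: a finite set of options, e.g. four movement directions.
move_space = spaces.Discrete(4)  # 0: up, 1: down, 2: left, 3: right

# MultiBinary: several independent on/off values, e.g. three controller buttons.
buttons_space = spaces.MultiBinary(3)

print(angle_space.sample(), move_space.sample(), buttons_space.sample())
```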
Stable Baselines3
After watching many tutorials, I noticed that a lot of people use the Stable Baselines3 library.
Stable Baselines3 is a library for RL, specifically designed to work with OpenAI Gym environments. It simplifies the process of setting up, training and evaluating RL agents. In the example below, I’ll share my first attempt at using RL with Stable Baselines3.
Wrappers
Wrappers are templates used to modify the functionality of environments without changing the original environment setup. For instance, if you want to change rewards or reshape some observations, you can do this using a wrapper instead of having to re-create the environment.
There are various types of wrappers depending on the specific changes or functionalities you need from the environment.
While you can use wrappers directly from Gymnasium, Stable Baselines3 also provides its own wrappers, which are designed specifically to support and simplify the training of RL agents.
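As a purely illustrative sketch (not something used in the demos below), a Gymnasium RewardWrapper can rescale rewards without touching the underlying environment:

```python
import gymnasium as gym

class ScaledReward(gym.RewardWrapper):
    """Scale every reward by a constant factor without modifying the environment."""

    def __init__(self, env, scale=0.1):
        super().__init__(env)
        self.scale = scale

    def reward(self, reward):
        # Called on every reward returned by step().
        return reward * self.scale

env = ScaledReward(gym.make("CartPole-v1"), scale=0.1)
```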
Vectorised Environments
One of the most commonly used wrappers in Stable Baselines3 is the vectorised environment wrapper. These wrappers allow you to manage multiple instances of an environment at the same time, letting the agent interact with several copies of the environment simultaneously, which leads to faster training and greater diversity in the agent’s experiences.
In the demo below, I used DummyVecEnv to run four instances of the Lunar Lander environment.
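As a brief sketch, creating those four instances with DummyVecEnv looks roughly like this (assuming the LunarLander-v2 environment ID):

```python
import gymnasium as gym
from stable_baselines3.common.vec_env import DummyVecEnv

# Four copies of the environment, each created by its own factory function.
env = DummyVecEnv([lambda: gym.make("LunarLander-v2") for _ in range(4)])
```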
Lunar Lander
The Lunar Lander environment is a classic control reinforcement learning task where an agent must successfully land on a landing pad while managing its speed, angle and engine thrust.
Below is an overview of the agent’s action space, observation space, rewards and episode end conditions.
Action Space
The agent can take any of the following four discrete actions:
- 0: Do nothing
- 1: Fire the left engine
- 2: Fire the main engine
- 3: Fire the right engine
Observation Space
The observation space is defined as a Box with the following properties:
- Lower Bound: [-2.5, -2.5, -10, -10, -6.2831855, -10, 0, 0]
- Upper Bound: [2.5, 2.5, 10, 10, 6.2831855, 10, 1, 1]
- Shape: 8-dimensional vector
The observation vector contains the following elements:
1. X Position: Horizontal position of the lander.
2. Y Position: Vertical position of the lander.
3. X Velocity: Velocity of the lander along the x-axis.
4. Y Velocity: Velocity of the lander along the y-axis.
5. Angle of Lander: The current angle of the lander.
6. Angular Velocity: The rotation speed of the lander.
7. Left Leg Contact: 1 if the left leg is in contact with the ground, 0 if not.
8. Right Leg Contact: 1 if the right leg is in contact with the ground, 0 if not.
Reward Structure
The goal of the agent is to land between the two flags. Rewards are given based on the following criteria:
- The closer the lander is to the landing pad, the more points are awarded.
- Points are also awarded for reducing the lander’s speed.
- The reward decreases the more the lander is tilted.
- Each leg in contact with the ground awards an additional 10 points.
- Firing a side engine incurs a penalty of 0.03 points each frame (indicated by red dots in the rendering).
- Firing the main engine incurs a larger penalty of 0.3 points each frame.
- An additional reward of +100 points is given for a safe landing, while crashing results in a penalty of -100 points.
- A total reward above 200 points indicates a good landing and strong performance by the agent.
Episode End Conditions
An episode can end in two ways:
- Truncation: The episode is truncated if it runs past the environment’s step limit (a score of 200 points marks the task as solved, but does not by itself end the episode).
- Termination: The episode terminates if the lander crashes, goes out of bounds or comes to rest (is no longer awake).
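A quick way to check this overview against the environment itself is to print its spaces; a minimal sketch (assuming the LunarLander-v2 ID and the Box2D extra installed):

```python
import gymnasium as gym

env = gym.make("LunarLander-v2")
print(env.action_space)       # Discrete(4): the four actions listed above
print(env.observation_space)  # Box with the 8-dimensional bounds listed above
env.close()
```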
Random Action Selection
To start, I explored how the agent performs when it randomly selects actions from its action space. This helped me familiarise myself with the basic concepts of both libraries. I limited the number of steps to 1000 to avoid lengthy runtimes.
Code Snippets
I set up the Lunar Lander environment using gym.make(), setting render_mode to human to get visual output.
To test the agent with random actions, I limit the run time with a step cap (max_steps) and a fixed number of episodes, since with random action selection the agent may take a long time to complete the task.
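The original snippets aren’t reproduced here, but the setup and random-action test looked roughly like the sketch below (the LunarLander-v2 ID is my assumption about the exact version; the 5 episodes and 1000-step cap match the results that follow):

```python
import gymnasium as gym

# Set up Lunar Lander with human rendering for visual output.
env = gym.make("LunarLander-v2", render_mode="human")

episodes = 5      # number of test episodes
max_steps = 1000  # cap on steps per episode to avoid lengthy runtimes

for episode in range(1, episodes + 1):
    observation, info = env.reset()
    total_reward = 0.0

    for _ in range(max_steps):
        action = env.action_space.sample()  # random action selection
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        if terminated or truncated:
            break

    print(f"Episode {episode}: {total_reward:.2f}")

env.close()
```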
Results
I collected the total reward per episode (with each episode capped at 1000 steps). Below are the results:

| Episode | Score |
|---|---|
| 1 | -238.99 |
| 2 | -151.35 |
| 3 | -293.43 |
| 4 | -498.66 |
| 5 | -94.48 |
It’s clear that the performance was poor, as all scores are negative. The agent fails to achieve the task through random actions, highlighting the need for training so that the agent can learn about its environment.
Visualisation
The actions taken by the agent show no purpose; it struggles to control itself as it falls towards the ground.
Proximal Policy Optimisation Algorithm
After exploring the Lunar Lander environment through random action selection, I wanted to improve the agent’s performance through training.
The Stable Baselines3 documentation is very thorough and provides a list of all available RL algorithms for training. In the end, I chose Proximal Policy Optimisation (PPO) because it is relatively simple to understand and quite stable.
What is PPO?
PPO is a policy-based algorithm, meaning it learns a policy directly by increasing the probability of taking high-reward actions. PPO makes gradual, controlled updates to the policy, so the agent doesn’t make drastic changes all at once.
After each batch of episodes, the agent updates its policy based on its observations of the environment. While collecting observations, it records the reward associated with each action taken. PPO compares the outcome of each action with what was expected (based on previous experience) and uses this comparison to update the action probabilities, increasing the likelihood of actions that receive higher rewards. If the difference between the old and new action probabilities is too drastic, PPO uses ‘clipping’ to keep learning stable.
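For reference, this clipping idea is captured by PPO’s clipped surrogate objective from the original PPO paper, where r_t(θ) is the ratio between the new and old action probabilities, Â_t is the advantage estimate (how much better the action was than expected) and ε is the clipping range:

$$
L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]
$$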
Code Snippets
Before training the agent, I first vectorise the environment to create four parallel instances. This allows faster training, as the agent can explore multiple states at the same time. I decided not to render the training environment to avoid the added computational cost, since visual inspection during training isn’t really needed.
I set the log path to better understand the training metrics in TensorBoard later.
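A sketch of that setup, using Stable Baselines3’s make_vec_env helper (which wraps the environment copies in a DummyVecEnv by default); the log directory name is just my placeholder:

```python
import os
from stable_baselines3.common.env_util import make_vec_env

# Four parallel Lunar Lander instances, no rendering during training.
env = make_vec_env("LunarLander-v2", n_envs=4)

# Directory where TensorBoard training logs will be written.
log_path = os.path.join("training", "logs")
```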
I then initialised my model with PPO using the MLP policy (the default policy). The MLP policy is a simple neural network of fully connected layers; it takes in environment observations and outputs actions. Essentially, this policy performs updates using a neural network to help the agent ‘learn’. There are other policies available, but for this demo the MLP policy seemed the best choice since the Lunar Lander environment isn’t too complex.
After instantiating the model, I train the agent, setting total_timesteps to one million. The Lunar Lander agent has several actions to try out, so I thought a high timestep count would give it enough opportunity to explore the environment well.
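Continuing from the environment and log path sketched above, model creation and training might look like this (settings other than the one-million timesteps are my assumptions):

```python
from stable_baselines3 import PPO

# MlpPolicy: a fully connected neural network mapping observations to actions.
model = PPO("MlpPolicy", env, verbose=1, tensorboard_log=log_path)

# Train for one million timesteps across the four parallel environments.
model.learn(total_timesteps=1_000_000)
```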
To evaluate the model’s performance, I use evaluate_policy. This is a Stable Baselines3 function which returns a tuple containing the average reward and the standard deviation of rewards across a given number of episodes.
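A sketch of the evaluation call (evaluating on a fresh single environment here; the exact evaluation setup is my assumption):

```python
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy

eval_env = make_vec_env("LunarLander-v2", n_envs=1)

# Returns (mean reward, standard deviation of reward) over the given episodes.
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10)
print(mean_reward, std_reward)
```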
The results show the agent achieves a strong average reward of 249 across 10 episodes, indicating success at learning and completing the task. The standard deviation of 34 suggests some variability in episode scores; ideally this would be a bit lower to show more consistency in the agent’s performance.
Testing Results
After training the model, I tested its performance over 10 episodes. Here are the results:
| Episode | Score |
|---|---|
| 1 | 17.01 |
| 2 | 262.80 |
| 3 | -28.20 |
| 4 | 266.47 |
| 5 | 293.32 |
| 6 | 242.91 |
| 7 | 278.50 |
| 8 | 250.77 |
| 9 | 241.46 |
| 10 | 280.66 |
Overall, the agent shows much stronger performance, with most scores above 200. There are still a few low and even negative scores, which may suggest the agent struggled in certain scenarios.
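For completeness, the testing loop mirrors the random-action loop from earlier, but asks the trained model for actions instead; roughly (again assuming the LunarLander-v2 ID and the model object from the training sketch above):

```python
import gymnasium as gym

test_env = gym.make("LunarLander-v2", render_mode="human")

for episode in range(1, 11):
    observation, info = test_env.reset()
    total_reward = 0.0
    done = False

    while not done:
        # Use the trained policy instead of sampling random actions.
        action, _states = model.predict(observation, deterministic=True)
        observation, reward, terminated, truncated, info = test_env.step(int(action))
        total_reward += reward
        done = terminated or truncated

    print(f"Episode {episode}: {total_reward:.2f}")

test_env.close()
```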
Visualisation
Clear improvement in the agent’s performance after training.
Summary
So far, I have covered the basics of reinforcement learning and demonstrated how to train agents using OpenAI’s Gymnasium and Stable Baselines3, applying the PPO algorithm in the Lunar Lander environment. While PPO worked well overall, there were a few cases where the agent’s performance was inconsistent, with some episodes scoring lower than expected.
Next comes the real challenge! I will be applying what I have learnt so far to set up my own custom environment for Trick or ReTreat and attempt to train an agent using Q-learning!