Implementing Q-learning algorithm
In the previous article in this series, we covered the basics and theoretical background of the Q-learning algorithm. Now that we have the theory, it's time to put it into practice! In today's article, we will implement the Q-learning algorithm in a chosen environment, teach the agent to act optimally, and test the proposed solution.
The plan of the article is as follows:
- Quick recap,
- Multi-armed bandit problem,
- Python code implementation,
- Summary.
Quick recap
Before starting this article, be sure that you have read and understood the previous one well. The information contained there will be very useful to us today. The article is available at the link: https://wizzdev.com/blog/what-is-q-learning/
If you created the project and the development environment described in the previous article, you can use it again today. The entire implementation is available in our GitHub repository at: https://github.com/wizzdev-pl/Q-learning-blog-post .
Multi-armed bandit problem
Before we start implementing our Q-learning solution, there is one last thing to mention. One of the biggest problems in reinforcement learning is estimating the value of a reward: you often have to look far back in time to identify the key moment or step that led to receiving a particular reward. So far, our learning loop always chooses the action with the highest estimated reward at each step, but in practice the problem is a bit more complex.
Let’s discover the Multi-armed bandit problem.
Source: https://en.wikipedia.org/wiki/Multi-armed_bandit
Imagine you are in a casino with several slot machines (also called “one-armed bandits”). Each machine has a different probability of winning, but you don’t know what these probabilities are. Your goal is to maximize your winnings by deciding which machines to play. Your dilemma comes down to two competing needs:
- Exploration: You need to try out different machines to learn about their winning probabilities,
- Exploitation: Once you have some information, you want to use it to play the machines that seem to give the best rewards.
The challenge is finding the optimal balance between exploring new options and exploiting known ones to maximize your overall gains.
We are going to use this concept in our program by occasionally executing a randomly drawn action to explore, and we will also see how to gradually reduce exploration as the agent learns. For now, the most important thing to remember is that an exploration strategy that always selects the action with the highest estimated reward is called “greedy”. We will instead use the ε-greedy strategy, which simply means that from time to time, with probability ε, we perform a random action instead of the one with the highest predicted reward. A minimal sketch of this idea is shown below.
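To make the idea concrete, here is a minimal sketch (not part of our Taxi implementation) of ε-greedy action selection on a simulated three-armed bandit. The arm win probabilities, the value of ε, and the number of pulls are made-up illustrative numbers.

import numpy as np

rng = np.random.default_rng(0)
true_probs = [0.3, 0.5, 0.8]           # unknown to the player
estimates = np.zeros(len(true_probs))  # running average reward per arm
counts = np.zeros(len(true_probs))
epsilon = 0.1                          # probability of exploring

for _ in range(1000):
    if rng.uniform(0, 1) < epsilon:
        arm = int(rng.integers(len(true_probs)))  # explore: random arm
    else:
        arm = int(np.argmax(estimates))           # exploit: best arm so far
    reward = float(rng.uniform(0, 1) < true_probs[arm])
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print(estimates)  # the estimate for the most-played arm should approach its true win probability

After enough pulls, the greedy choice settles on the arm with the highest estimated reward, while the occasional random pulls keep refining the estimates of the other arms.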
Python code implementation
Hurray, we’ve gone through all the theory we need! We know what Q-learning is, what the rewards are, what the learning loop looks like, and what improvements can be made to achieve better results. It’s time to check it in practice!
Now is a good time to choose the problem we will deal with. We could, of course, try to tackle the CartPole from the previous article, but let’s choose something more realistic. Today we will optimize a taxi driver!
For our environment we will choose “Taxi-v3” (https://gymnasium.farama.org/environments/toy_text/taxi/). As the documentation suggests, the Taxi Problem involves navigating to passengers in a grid world, picking them up and dropping them off at one of four locations. As you can see above, our driver is not working well at the moment, but we will change that soon!
Let’s first explore the problem purely based on the information from the documentation. We know that there are four designated pick-up and drop-off locations marked with different colors, the taxi starts at a random square, and the passenger starts at one of the designated locations. Our goal is to move the taxi to the passenger’s location, pick up the passenger, drive to the passenger’s desired destination, and drop the passenger off. Once the passenger is dropped off, the episode ends. The documentation also tells us that there are six possible actions and a staggering 500 states, so our Q matrix will have 500 rows and 6 columns. Let’s try to implement a learning loop.
NOTE!
We highly encourage you to read all documentation of the problem before attempting to solve it. Try to learn as many details as possible that could help you optimize the solution.
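For example, the 500 states come from 25 taxi positions × 5 passenger locations (the four colored squares plus “in the taxi”) × 4 destinations. As a rough sketch, assuming the unwrapped Taxi environment exposes a decode helper for the state index, you can inspect a single state like this:

import gymnasium as gym

env = gym.make("Taxi-v3")
state, _ = env.reset()
# Assumption: the underlying Taxi environment provides decode(state)
taxi_row, taxi_col, passenger_loc, destination = env.unwrapped.decode(state)
print(f"State {state}: taxi at ({taxi_row}, {taxi_col}), "
      f"passenger location {passenger_loc}, destination {destination}")
env.close()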
Besides our environment, we will need the numpy library for matrix calculations and tqdm for a nice visualization of the learning progress. Install them with:
pip install numpy tqdm
Let’s walk through the code first and then run it:
env = gym.make("Taxi-v3")  # Create environment
print(f"Available actions: {env.action_space.n}")
print(f"Available observations: {env.observation_space.n}")

n_observations = env.observation_space.n
n_actions = env.action_space.n
Q_matrix = np.zeros((n_observations, n_actions))

lr = 0.4  # learning rate
discount_factor = 0.6
exploration_factor = 0.7  # epsilon
min_exploration_factor = 0.2  # epsilon minimum
decay_rate = 1e-4  # First we explore a lot and over time we explore exponentially less and less
We start by creating the environment and printing the available states and actions (if you haven’t read the documentation, this information is also available straight from the code). Then we set our learning constants, nothing new here. Let’s move on to the main part, the learning loop:
print("--- Training ---")
for e in tqdm(range(10000)):  # Training episodes
    current_state, _ = env.reset()
    # env.render()  # Turn off rendering for training
    done = False
    while not done:
        if np.random.uniform(0, 1) < exploration_factor:
            # Choose action (here: sample/random action)
            action = env.action_space.sample()
        else:
            # Pick the action with the highest estimated reward
            action = np.argmax(Q_matrix[current_state])
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated  # Episode ends on drop-off or step limit
        # Update Q matrix
        Q_matrix[current_state, action] = (1.0 - lr) * Q_matrix[
            current_state, action
        ] + lr * (reward + discount_factor * max(Q_matrix[next_state]))
        current_state = next_state
    exploration_factor = max(
        min_exploration_factor, np.exp(-decay_rate * e))

print("Training ended with Q matrix:")
print(Q_matrix)
For this problem we train for 10,000 episodes. We first reset the environment to its initial state; to speed things up we turn off the rendering GUI during training (don’t worry, we will turn it back on when testing the solution). Next we set a helper flag that indicates the end of an episode (the environment reports this either when the task is completed or when its step limit is reached); this flag also controls the inner while loop. Inside the loop we first choose the action to execute, but, remembering the multi-armed bandit problem, we first draw whether the action will be random (exploration) or the one with the best estimated reward (exploitation). We execute the chosen action and remember its outcome. Then we move to the most important step: updating the Q matrix. If you don’t remember the formula for updating the matrix weights, go back to the chapter with the Bellman equation and revise it. After the matrix update we assign the new state as the current state for the next loop iteration. The last little detail is the update of the exploration factor: we want to explore a lot at first and exponentially less and less over time, however we still want to maintain some minimal level of exploration, and that’s what the last line is for.
Once the algorithm has been trained, it’s time to test it! The entire code (which we will place in the qlearning_demo.py file), including the testing part, looks as follows:
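For reference, the update performed in the loop above is exactly the rule from the Bellman equation chapter of the previous article, written with our variable names:

Q_matrix[state, action] = (1 - lr) * Q_matrix[state, action] + lr * (reward + discount_factor * max(Q_matrix[next_state]))

where lr is the learning rate and discount_factor plays the role of γ.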
from tqdm import tqdm
import numpy as np
import gymnasium as gym


def main():
    env = gym.make("Taxi-v3")  # Create environment
    print(f"Available actions: {env.action_space.n}")
    print(f"Available observations: {env.observation_space.n}")

    # Q-learning matrix will have dimensions observations x actions
    n_observations = env.observation_space.n
    n_actions = env.action_space.n
    Q_matrix = np.zeros((n_observations, n_actions))

    # Learning constants
    lr = 0.4
    discount_factor = 0.6
    exploration_factor = 0.7
    min_exploration_factor = 0.2
    # First we explore a lot and over time we explore exponentially less and less
    decay_rate = 1e-4

    print("--- Training ---")
    for e in tqdm(range(10000)):  # Training episodes
        current_state, _ = env.reset()  # Reset the environment
        # env.render()  # Turn off rendering for training
        done = False
        while not done:
            if np.random.uniform(0, 1) < exploration_factor:
                # Choose action (here: sample/random action)
                action = env.action_space.sample()
            else:
                # Pick the action with the highest estimated reward
                action = np.argmax(Q_matrix[current_state])
            next_state, reward, terminated, truncated, _ = env.step(
                action)  # Execute the action
            done = terminated or truncated
            # Update Q matrix
            Q_matrix[current_state, action] = (1.0 - lr) * Q_matrix[
                current_state, action
            ] + lr * (reward + discount_factor * max(Q_matrix[next_state]))
            current_state = next_state
        exploration_factor = max(
            min_exploration_factor, np.exp(-decay_rate * e))

    print("Training ended with Q matrix:")
    print(Q_matrix)

    print("--- Testing ---")
    epochs = 50
    total_reward = []
    env = gym.make("Taxi-v3", render_mode="human")
    for _ in tqdm(range(epochs)):
        done = False
        epoch_reward = 0
        current_state, _ = env.reset()
        while not done:
            action = np.argmax(Q_matrix[current_state])
            current_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            epoch_reward += reward
        total_reward.append(epoch_reward)
    print(f"Total reward: {total_reward}")
    env.close()  # Close the environment


if __name__ == "__main__":
    main()
Let’s run it with:
python3 qlearning_demo.py
Result:
Great success! Our driver performs his task correctly!
NOTE!
Try changing the values of the learning hyperparameters and see how the behavior changes. Is searching for the optimal settings of these parameters an easy task? A minimal sketch of one way to automate such a comparison is shown below.
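If you want to experiment systematically, the sketch below (not part of the original script; train_and_evaluate is just an illustrative helper name and the parameter values are arbitrary) shows one way to train and evaluate the agent for a few combinations of learning rate and discount factor without rendering:

import itertools

import gymnasium as gym
import numpy as np


def train_and_evaluate(lr, discount_factor, episodes=5000, eval_episodes=20):
    """Train a tabular Q-learning agent on Taxi-v3 and return its average greedy test reward."""
    env = gym.make("Taxi-v3")
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    epsilon, min_epsilon, decay_rate = 1.0, 0.2, 1e-4
    for e in range(episodes):
        state, _ = env.reset()
        done = False
        while not done:
            if np.random.uniform(0, 1) < epsilon:
                action = env.action_space.sample()  # explore
            else:
                action = np.argmax(Q[state])        # exploit
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            Q[state, action] = (1.0 - lr) * Q[state, action] + lr * (
                reward + discount_factor * np.max(Q[next_state])
            )
            state = next_state
        epsilon = max(min_epsilon, np.exp(-decay_rate * e))

    # Evaluate the greedy policy without rendering
    rewards = []
    for _ in range(eval_episodes):
        state, _ = env.reset()
        done, total = False, 0
        while not done:
            state, reward, terminated, truncated, _ = env.step(np.argmax(Q[state]))
            done = terminated or truncated
            total += reward
        rewards.append(total)
    env.close()
    return np.mean(rewards)


if __name__ == "__main__":
    for lr, gamma in itertools.product([0.1, 0.4, 0.8], [0.6, 0.99]):
        print(f"lr={lr}, gamma={gamma}: average test reward {train_and_evaluate(lr, gamma):.1f}")

Training several configurations in a row takes a while, so you may want to lower the number of episodes while experimenting.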
You can easily notice that the learning process is very slow at the very beginning. This is because all values of the Q matrix start at 0, so the algorithm must explore before it can update these weights in a meaningful way. Over time, the weights take on useful values and the learning process accelerates.
Summary
Today we saw how simple and quick it is to implement a working Q-learning algorithm. We put into practice the theoretical knowledge we acquired earlier and saw how to analyze a problem and extract as much as we can from the available information. However, one noticeable disadvantage of this solution remains: as the number of actions and states increases, the Q matrix grows to huge sizes (here it is only 500 × 6 = 3,000 entries, but richer environments quickly need far more), increasing memory and time complexity and making training significantly more difficult. The solution to this problem may be to introduce neural networks into the algorithm, but we will talk about that another time.
External links: