
How to implement the Q-learning algorithm for a Gymnasium environment? Q-learning part #2

Implementing the Q-learning algorithm

In the previous article in this series, we covered the basics and the theoretical background of the Q-learning algorithm. We have the theoretical knowledge, now it’s time to put it into practice! In today’s article, we will learn the secrets of implementing the Q-learning algorithm in a chosen environment. We will teach the agent to operate optimally and test the proposed solution.

 

The plan of the article is as follows:

  • Quick recap,
  • Multi-armed bandit problem,
  • Python code implementation,
  • Summary.

 

Quick recap

Before starting this article, be sure that you have read and understood the previous one well. The information contained there will be very useful to us today. The article is available at the link: https://wizzdev.com/blog/what-is-q-learning/

If, while reading the previous article, you created the project and the environment described there for developing machine learning methods, you can also use it today. The entire implementation is available in our GitHub repository at: https://github.com/wizzdev-pl/Q-learning-blog-post.

 

Multi-armed bandit problem

Before we start implementing our Q-learning solution, there is one last thing to mention. The biggest problem in reinforcement learning is estimating the value of the reward. You often have to look far back in time to identify the key moment or step that led to receiving a particular reward. So far, our learning loop always chooses the action with the highest estimated reward in each step, but in practice the problem is a bit more complex.

 

Let’s take a look at the multi-armed bandit problem.

Source: https://en.wikipedia.org/wiki/Multi-armed_bandit

 

Imagine you are in a casino with several slot machines (also called “one-armed bandits”). Each machine has a different probability of winning, but you don’t know what these probabilities are. Your goal is to maximize your winnings by deciding which machines to play. Your dilemma comes down to two competing needs:

  • Exploration: You need to try out different machines to learn about their winning probabilities,
  • Exploitation: Once you have some information, you want to use it to play the machines that seem to give the best rewards.

The challenge is finding the optimal balance between exploring new options and exploiting known ones to maximize your overall gains.

We are going to use this concept in our program by occasionally executing a randomly drawn action to explore, and we will also learn how to gradually reduce exploration as training progresses. For now, the most important thing to remember is that an exploration strategy that always selects the action with the highest estimated reward is called “greedy”. We will, however, use the ε-greedy strategy, which simply means that from time to time, with probability ε, we perform a random action instead of the one with the highest predicted reward.
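To make the idea concrete, here is a minimal, self-contained sketch of ε-greedy action selection on a toy multi-armed bandit. The bandit probabilities, the epsilon value and the variable names are made up for illustration and are not part of the Taxi solution we build below.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bandit: three slot machines with unknown win probabilities
true_win_probs = [0.2, 0.5, 0.8]
estimated_values = np.zeros(3)  # our running estimate of each machine's reward
pull_counts = np.zeros(3)

epsilon = 0.1  # probability of exploring instead of exploiting

for _ in range(1000):
    if rng.uniform(0, 1) < epsilon:
        arm = int(rng.integers(0, 3))           # exploration: random machine
    else:
        arm = int(np.argmax(estimated_values))  # exploitation: best machine so far

    reward = 1.0 if rng.uniform(0, 1) < true_win_probs[arm] else 0.0

    # Incremental average of observed rewards for the chosen machine
    pull_counts[arm] += 1
    estimated_values[arm] += (reward - estimated_values[arm]) / pull_counts[arm]

print(estimated_values)  # should roughly approach the true win probabilities

With ε = 0.1 the loop exploits the best-looking machine about 90% of the time, but still samples the others often enough to correct a bad early estimate.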

 

Python code implementation

 

Hurray, we’ve gone through all the theory we need! We know what Q-learning is, what the rewards are, what the learning loop looks like and what improvements can be made to achieve better results. It’s time to check it in practice!

Now is a good time to choose the problem we will deal with. We could, of course, try to deal with the CartPole from the previous article, but let’s choose something more realistic. Today we will optimize a taxi driver!

 

For our environment we will choose “Taxi-v3” (https://gymnasium.farama.org/environments/toy_text/taxi/). As the documentation suggests, the Taxi Problem involves navigating to passengers in a grid world, picking them up and dropping them off at one of four locations. As you can see above, our driver is not working well at the moment, but we will change that soon!

Let’s first explore the problem purely based on the information from the documentation. We know that there are four designated pick-up and drop-off locations marked with different colors, the taxi starts off at a random square and the passenger at one of the designated locations. Our goal for this task is to move the taxi to the passenger’s location, pick up the passenger, drive to the passenger’s desired destination, and drop the passenger off. Once the passenger is dropped off, the episode ends. The documentation also tells us that there are six possible actions we can take and a staggering 500 states, so our Q matrix will have 500 rows and 6 columns. Let’s try to implement a learning loop.
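Before we do, a quick aside on where those 500 states come from. According to the Gymnasium documentation, a state encodes the taxi position on a 5×5 grid, the passenger location (4 designated spots plus “inside the taxi”) and the destination, which gives 25 × 5 × 4 = 500 combinations. The short sketch below just walks through that bookkeeping; the encoding formula is quoted from the Taxi-v3 documentation and the example values are arbitrary.

# 25 taxi squares * 5 passenger locations (4 spots + "in taxi") * 4 destinations = 500
n_states = 5 * 5 * 5 * 4
print(n_states)  # 500

# State index used by Taxi-v3 (per the Gymnasium docs):
# ((taxi_row * 5 + taxi_col) * 5 + passenger_location) * 4 + destination
taxi_row, taxi_col, passenger_location, destination = 2, 3, 1, 0
state = ((taxi_row * 5 + taxi_col) * 5 + passenger_location) * 4 + destination
print(state)  # one of the 500 possible state indices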

 

NOTE!

We highly encourage you to read the full documentation of the problem before attempting to solve it. Try to learn as many details as possible that could help you optimize the solution.

Besides our environment, we will need the numpy library for the matrix calculations and tqdm for a nice visualization of the learning progress. Install them with:

pip install numpy tqdm

 

Let’s walk through the code first and then run it:

 

env = gym.make("Taxi-v3")  # Create environment
print(f"Available actions: {env.action_space.n}")
print(f"Available observations: {env.observation_space.n}")

n_observations = env.observation_space.n
n_actions = env.action_space.n
Q_matrix = np.zeros((n_observations, n_actions))

lr = 0.4  # learning rate
discount_factor = 0.6
exploration_factor = 0.7  # epsilon
min_exploration_factor = 0.2  # epsilon minimum
decay_rate = 1e-4  # First we explore a lot and over time we explore exponentially less and less

 

We start by creating the environment and printing the number of available states and actions (if someone hasn’t read the documentation, this information is also available in the code). We then move on to setting our learning constants, nothing new here. Now for the main part, the learning loop:

 

print("— Training —")
for e in tqdm(range(10000)):  # Episodes of the simulation
    current_state, _ = env.reset()
    # env.render()  # Turn off rendering for training
    done = False
    while not done:
        if np.random.uniform(0, 1) < exploration_factor:
            # Explore: choose a random action
            action = env.action_space.sample()
        else:
            # Exploit: choose the action with the highest estimated reward
            action = np.argmax(Q_matrix[current_state])
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Update Q matrix
        Q_matrix[current_state, action] = (1.0 - lr) * Q_matrix[
            current_state, action
        ] + lr * (reward + discount_factor * np.max(Q_matrix[next_state]))

        current_state = next_state

    exploration_factor = max(min_exploration_factor, np.exp(-decay_rate * e))

print("Training ended with Q matrix:")
print(Q_matrix)

 

For this problem we train for 10,000 episodes. At the start of each episode we reset the environment to its initial state; to speed things up, the rendering GUI stays off during training, don’t worry, we will turn it back on when testing the solution. Next we set a helper flag that indicates the end of the episode (either the passenger has been delivered or the environment’s step limit was reached, so the simulation needs a reset); this flag also controls the inner while loop, which we’ll get to now. In each step we choose the action to execute, but remembering the multi-armed bandit problem, we first draw whether it will be a random action (exploration) or the action with the best estimated reward (exploitation). We execute the chosen action and remember its outcome. Next we move to the most important step, updating the Q matrix. If you don’t remember the formula for updating the matrix weights, go back to the chapter with the Bellman equation and revise it. After the matrix update we assign the new state as the current state for the next loop iteration. The last little detail is the update of the exploration factor: we want to explore a lot at first and exponentially less over time, however, we still want to maintain some minimal level of exploration, and that’s what the last line is for.
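As a quick sanity check on that decay schedule, here is the formula from the last line evaluated at a few episode indices (using the constants above; the numbers are just arithmetic, not training output):

import numpy as np

decay_rate = 1e-4
min_exploration_factor = 0.2

for e in [0, 1000, 5000, 10000]:
    eps = max(min_exploration_factor, np.exp(-decay_rate * e))
    print(e, round(eps, 3))
# 0 -> 1.0, 1000 -> 0.905, 5000 -> 0.607, 10000 -> 0.368

Note that the 0.2 floor is never actually reached within 10,000 episodes, and that after the very first episode the schedule overwrites the initial value of 0.7 with 1.0.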

 

Once the algorithm has learned, it’s time to test it! The entire code (which we will place in the qlearning_demo.py file), including the testing part, looks as follows:

 

from tqdm import tqdm
import numpy as np
import gymnasium as gym


def main():
    env = gym.make("Taxi-v3")  # Create environment
    print(f"Available actions: {env.action_space.n}")
    print(f"Available observations: {env.observation_space.n}")

    # Q-learning matrix will have dimensions observations x actions
    n_observations = env.observation_space.n
    n_actions = env.action_space.n
    Q_matrix = np.zeros((n_observations, n_actions))

    # Learning constants
    lr = 0.4
    discount_factor = 0.6
    exploration_factor = 0.7
    min_exploration_factor = 0.2
    # First we explore a lot and over time we explore exponentially less and less
    decay_rate = 1e-4

    print("— Training —")
    for e in tqdm(range(10000)):  # Episodes of the simulation
        current_state, _ = env.reset()  # Reset the environment
        # env.render()  # Turn off rendering for training
        done = False
        while not done:
            if np.random.uniform(0, 1) < exploration_factor:
                # Explore: choose a random action
                action = env.action_space.sample()
            else:
                # Exploit: choose the action with the highest estimated reward
                action = np.argmax(Q_matrix[current_state])
            next_state, reward, terminated, truncated, _ = env.step(action)  # Execute the action
            done = terminated or truncated
            # Update Q matrix
            Q_matrix[current_state, action] = (1.0 - lr) * Q_matrix[
                current_state, action
            ] + lr * (reward + discount_factor * np.max(Q_matrix[next_state]))

            current_state = next_state

        exploration_factor = max(min_exploration_factor, np.exp(-decay_rate * e))

    print("Training ended with Q matrix:")
    print(Q_matrix)

    print("— Testing —")
    epochs = 50
    total_reward = []
    env = gym.make("Taxi-v3", render_mode="human")
    for _ in tqdm(range(epochs)):
        done = False
        epoch_reward = 0
        current_state, _ = env.reset()
        while not done:
            action = np.argmax(Q_matrix[current_state])
            current_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            epoch_reward += reward
        total_reward.append(epoch_reward)
    print(f"Total reward: {total_reward}")

    env.close()  # Close the environment


if __name__ == "__main__":
    main()

 

Let’s run it with:

 

python3 qlearning_demo.py

 

Result:

Great success! Our driver performs his task correctly!
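If you would rather see a single summary number than the raw list of 50 episode rewards, you can add something like the following at the end of main(), just before env.close() (total_reward and np already exist in the script; the exact values will differ between runs):

    # Summarize the test episodes (illustrative addition, not part of the original script)
    print(f"Average reward over {len(total_reward)} test episodes: {np.mean(total_reward):.2f}")
    print(f"Worst episode: {min(total_reward)}, best episode: {max(total_reward)}")

For a well-trained agent the average should come out positive, since the +20 drop-off bonus outweighs the -1 per-step penalties on a short route.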

 

NOTE!

Try changing the values of the learning hyperparameters and observe how the behavior changes. Is searching for the optimal settings of these parameters an easy task?
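One convenient way to experiment is to factor the training and testing code out of main() into functions and sweep over a few settings. The sketch below is purely illustrative: it assumes hypothetical train(lr, discount_factor, decay_rate) and evaluate(Q_matrix) helpers that do not exist in the script as written.

# Hypothetical sweep, assuming train() and evaluate() were factored out of main()
for lr in (0.1, 0.4, 0.8):
    for discount_factor in (0.6, 0.9):
        Q = train(lr=lr, discount_factor=discount_factor, decay_rate=1e-4)
        score = evaluate(Q)  # e.g. average reward over 50 greedy test episodes
        print(f"lr={lr}, gamma={discount_factor}: average test reward {score:.2f}")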

 

You can easily notice that the learning process is very slow at the very beginning. This is because the values of the Q matrix are all equal to 0, so the algorithm must explore before it can update these weights in a meaningful way. Over time the weights take on concrete values and the learning process accelerates.
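A tiny illustration of why the start is so slow, using made-up numbers: as long as a state’s row in the Q matrix is all zeros, the greedy choice carries no information, so useful progress only begins once the occasional reward has propagated into the table.

import numpy as np

Q_row = np.zeros(6)      # a fresh row of the Q matrix: no action looks better than any other
print(np.argmax(Q_row))  # always 0, greedy selection carries no information yet

Q_row[3] = 0.5           # once an update gives one action a non-zero value...
print(np.argmax(Q_row))  # ...greedy selection immediately starts preferring it (prints 3)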

 

Summary

Today we saw how simple and quick it is to implement a working Q-learning algorithm. We put the theoretical knowledge we acquired earlier into practice and saw how to analyze the problem and extract as much as we can from the available information. However, there still remains one noticeable disadvantage of this solution: as the number of actions and states increases, our Q matrix grows to huge sizes, increasing memory and time complexity, which makes training significantly more difficult. The solution to this problem may be to introduce neural networks into this algorithm, but we will talk about this another time.

 
