
What is Q-learning? Q-Learning Part #1

Reinforcement learning (RL) is one of the most dynamic and fastest-growing fields within artificial intelligence and machine learning. Unlike traditional supervised learning, where models are trained on a fixed dataset, reinforcement learning agents draw heavily on the idea of how people learn – through trial and error, continually improving their behavior based on feedback from their actions. This learning paradigm has been instrumental in solving complex problems ranging from game playing and robotics to financial modeling and autonomous systems. In this article we provide an overview of the fundamental concepts and key ideas, and we introduce the most important information needed to create a working implementation of Q-learning, which we will cover in the next article.

 

The plan of the article is as follows:

  • What is Q-learning,
  • How to reward a system,
  • How to choose an action,
  • Learning process,
  • Gymnasium,
  • Summary.

 

What is Q-learning

Let’s start by going back to the core of machine learning and reminding ourselves what problems this field is best at solving. In traditional programming the user provides the system with input data as well as rules (usually defined as algorithms) and expects it to produce specific results. In machine learning (ML) we change this approach a bit: we provide input data together with the expected results and expect the system to produce the rules. In other words, we do not give the computer a recipe in the form of algorithms or rules, as in classical programming; instead, we give the system tools that allow it to learn such rules on its own. So far, most of you have probably dealt with or at least heard of the two main machine learning paradigms – supervised and unsupervised learning.

 

Source: https://www.mathworks.com/discovery/machine-learning.html

These paradigms are great at dealing with tasks where input data samples need to be assigned a specific output result (data pairs: inputs and the corresponding expected outputs), e.g. a combination of three features maps to one class. However, what if we do not have clearly labeled data, the data depends on the context of the situation, or the task is so complex that it is impossible to create a clear connection between a specific state and a specific result – the correctness of a decision becomes visible only in later stages? Such problems come up in everyday life, for example:

  • playing chess – it is difficult to determine the correctness of a single pawn move; only the subsequent turns show the result,
  • swimming – it is difficult to call a single movement of one limb correct; only combining many movements in the right order brings success,

and many more.

 

NOTE!

Can you name more problems from everyday life where it is difficult to determine correctness based on a single action? 

 

For this type of problem, the topic of today’s article – reinforcement learning – is the best fit. In RL, learning occurs through interaction with the environment: the agent acts in the environment and receives an evaluation of the results of its activity (the reinforcement), rather than feedback on each individual action. The observations made by the agent are used not only for taking actions, but also to improve the rules of the agent’s behavior so that it can perform its task better in the future. 

Source: https://gymnasium.farama.org/content/basic_usage/

Today we will focus on Q-learning, a model-free reinforcement learning algorithm that helps an agent learn the value of taking a specific action in a given state. It’s called “model-free” because it doesn’t need a model of the environment to work. Q-learning also handles environments where the outcomes of actions are unpredictable (stochastic) without needing any special adaptation.

 

We can define our task as follows: learn how to behave in order to achieve a goal by acting in the environment (the environment affects the agent and the agent affects the environment) – or, more simply: create an agent that, placed in a certain environment, will choose its actions so as to collect as much reward as possible.

The agent’s actions transform the environment, which changes its state, and some actions yield a reward. Approaching the problem from a more theoretical side, we can say that Q-learning assumes a first-order Markov process (https://en.wikipedia.org/wiki/Markov_chain), which simply means that the probability of the next state depends only on the current state and the action taken in that state, and does not depend on the preceding states and actions.

 

Let’s introduce the symbols that we will be using:

  • states – s_0, s_1, s_2, … – the state of the environment, e.g. the arrangement of pieces on the chessboard,
  • actions – a_0, a_1, a_2, … – the actions taken by the agent, e.g. one move of a piece on the chessboard,
  • rewards – r_0, r_1, r_2, … – the reward received after an action (more about this in the next section),
  • episode – a set of steps after which the state of the environment is reset; the episode ends with state s_n, where n is the number of steps in the episode, e.g. one game of chess.

Most importantly: the strategy of choosing actions is called the policy.

 

How to reward a system

As we mentioned previously, the agent learns on the basis of the correctness of its actions, so at this point we can ask a very valid question: how do we reward an agent, and can we also punish it for taking bad actions?

Well, it turns out that the reward for taking a specific action in a specific state can be either positive (a good move) or negative (a punishment for bad behavior). To reward the system we usually use a point-based scheme, e.g. +10 points for achieving the goal at the end of the episode, -1 point for a wrong move, -10 points for a failed episode. 

 

At this stage we can easily calculate the total reward for one full episode:

R = r_1 + r_2 + r_3 + … + r_n

However, we come to a certain problem: since the process is a Markov chain and the reward depends on the change in the state of the environment, it is hard to determine what future rewards will be (the further we look ahead, the less certain we are). So how do we estimate the total future reward during learning? Think about it this way: it is easy to determine the reward for the next step or the next two steps if we know the environment well, but it is hard to estimate what the reward will be five steps ahead. To resolve this problem we introduce a discount factor γ ∈ [0, 1].

R_i = r_i + γ·r_{i+1} + γ²·r_{i+2} + …,

  • γ close to 0 – the algorithm cares almost only about the immediate reward (it analyzes only the current situation),
  • γ close to 1 – future rewards count almost as much as the immediate one; γ equal to 1 effectively assumes that the same action always produces the same reward (not attainable in a stochastic environment).

The discount factor is one of the learning hyperparameters whose value affects the results obtained. We can now transform the previous equation into the form:

R_i = r_i + γ·r_{i+1} + γ²·r_{i+2} + … = r_i + γ·R_{i+1}

where R_i is the total discounted future reward counted from step i, R_{i+1} is the total discounted future reward counted from step i+1, γ is the discount factor and r_i is the reward received in step i. 

NOTE!

Take a moment to analyze the above transformation and try to derive the equation yourself based on previous information. 
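
To make the discounting more concrete, here is a minimal sketch in Python with a made-up list of rewards and an assumed γ = 0.9; it computes the discounted return both from the explicit sum and from the recursive form above, and both give the same value.

GAMMA = 0.9  # discount factor, an assumed value for this example

rewards = [1.0, -1.0, 0.0, 10.0]  # example rewards r_0, r_1, r_2, r_3


def discounted_return_sum(rewards, gamma):
    # Explicit sum: R_0 = r_0 + gamma*r_1 + gamma^2*r_2 + ...
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


def discounted_return_recursive(rewards, gamma):
    # Recursive form: R_i = r_i + gamma * R_{i+1}, evaluated from the last step backwards
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


print(discounted_return_sum(rewards, GAMMA))        # ≈ 7.39
print(discounted_return_recursive(rewards, GAMMA))  # ≈ 7.39 as well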

How to choose an action

We know the basic concepts of reinforcement learning and how to reward the system and estimate its total reward; let’s now focus on the next piece of the puzzle: how does the agent choose which action to take? 

Classic Q-learning is based on a Q matrix of size states (observations) × actions. The Q matrix stores the discounted reward expected after taking action a in state s:

Q(s_i, a_i) = max_π R_{i+1}

The function Q determines the quality of an action that we can take in a specific state; actions providing a higher reward have a higher quality.

We already know what Q(s_i, a_i) is and what R_{i+1} is (we hope at this point everybody knows what max means), but what does π stand for in this equation? We need to come back to the first chapter of this article, where we stated that the strategy of choosing actions is called the policy (marked as π). It turns out that there are many different types of policies, resulting in different final outcomes, but for the purpose of today’s considerations we will assume that our policy is defined as:

π(s) = argmax_a Q(s, a)

In other words, we choose the action that maximizes the value of Q. Now we know how to choose an action to achieve the best results, but one question still remains unanswered: how do we determine and update the values of Q during learning? We know that the values of Q represent the discounted reward obtained after taking an action (selected using the policy) in a specific state, but how do we know that discounted reward in advance? Was all our deliberation for nothing? No! We have one more trick up our sleeve that we will use now.
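
Before we move on, here is a minimal sketch of how such a Q matrix and the greedy policy could look in code, using NumPy; the numbers of states and actions are arbitrary values chosen for the example.

import numpy as np

N_STATES = 16   # hypothetical number of states
N_ACTIONS = 4   # hypothetical number of actions

# Q matrix of size states x actions, initialized with zeros
Q = np.zeros((N_STATES, N_ACTIONS))


def policy(state):
    # Greedy policy: pi(s) = argmax_a Q(s, a)
    return int(np.argmax(Q[state]))


print(policy(0))  # with an all-zero Q matrix this simply returns action 0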

 

Learning process

We are able to write the reward equation as a relationship between the current and the next state; let’s use this fact to determine the values of the Q matrix.

Ladies and gentlemen, the Bellman equation:

Q(s, a) = r + γ·max_{a’} Q(s’, a’)

The full maximum reward for a given state s is the current reward r plus the maximum discounted future reward for the next state, γ·max_{a’} Q(s’, a’), where:

  • s – current state,
  • s’ – next state,
  • a – current action,
  • a’ – next action,
  • r – current reward,
  • γ – discount factor.

The above equation is the key to iteratively determining the values of Q.

 

What will the training loop look like? In each step the agent observes the current state, selects an action according to its policy, executes it, receives a reward and observes the next state, and then updates the Q matrix (a minimal sketch of such a loop is shown after the update rule below).

How to update the Q matrix:

Q(s, a) = Q(s, a) + α·(r + γ·max_{a’} Q(s’, a’) – Q(s, a))

where α ∈ [0, 1] is the learning rate – the update-step weight.
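
To tie everything together, below is a minimal sketch of such a training loop written against the Gymnasium FrozenLake-v1 environment purely for illustration – the choice of environment and the values of α and γ are assumptions made for this example, not the setup of the next article. Actions are selected with the greedy policy defined earlier and the Q matrix is updated with the rule above.

import gymnasium as gym
import numpy as np

ALPHA = 0.1   # learning rate (assumed value)
GAMMA = 0.9   # discount factor (assumed value)

env = gym.make("FrozenLake-v1")  # a small discrete environment: 16 states, 4 actions
Q = np.zeros((env.observation_space.n, env.action_space.n))  # Q matrix: states x actions

for episode in range(1000):
    state, info = env.reset()
    done = False
    while not done:
        action = int(np.argmax(Q[state]))  # greedy policy: argmax_a Q(s, a)
        next_state, reward, terminated, truncated, info = env.step(action)
        # Q(s, a) = Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
        Q[state, action] += ALPHA * (reward + GAMMA * np.max(Q[next_state]) - Q[state, action])
        state = next_state
        done = terminated or truncated

env.close()

Note that this sketch uses the purely greedy policy from the previous section – keep that in mind when thinking about the question in the note below.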

 

NOTE!

Do you think this implementation of the learning loop has any disadvantages? Can you point them out? We will answer this question in the next article but maybe you can see it already. 

 

Gymnasium

That was a lot of theory, but trust us, we will need it to implement a working solution in the next article. A good understanding of how the algorithm works is crucial to success. At this point, to take a break from theory, we can introduce the environment where we will be testing our solutions. Today we will be using the Gymnasium framework.

Source: https://pypi.org/project/gymnasium/ 

The Gymnasium library is a popular toolkit in Python for developing and comparing reinforcement learning algorithms. It provides a collection of environments, such as games and simulations, where agents can be trained and tested.

Let’s test it out!

 

NOTE!

We will implement our solutions using Python 3.10.12. We recommend that you create a virtual environment (https://docs.python.org/3/library/venv.html), install all required libraries and test the code yourself!

 

First install the library:

pip install gymnasium

 

Now we will run a simple demo script. Create a file named gymnasium_demo.py and paste in the code below:

 

import gymnasium as gym


def main():
    # Environment: https://gymnasium.farama.org/environments/classic_control/cart_pole/
    env = gym.make("CartPole-v1", render_mode="human")  # Create environment
    env.reset()  # Reset the environment to base state
    for _ in range(1000):  # Steps of the simulation
        action = env.action_space.sample()  # Choose action (here: sample/random action)
        env.render()  # Render GUI
        observation, reward, done, tr, info = env.step(action)  # Execute the action
        if done:
            env.reset()
    env.close()  # Close the environment


if __name__ == "__main__":
    main()

 

NOTE!

All codes are also available on our repository in GitHub service: https://github.com/wizzdev-pl/Q-learning-blog-post. Feel free to clone the repository and experiment!

 

Let’s run the program with command:

python3 gymnasium_demo.py

 

We should see a window with our first simulation looking like this:

 

Let’s quickly walk through the code and explain what is happening there. In the main function, the first thing we do is create the environment for the simulation and reset it to ensure a clean start:
 

env = gym.make("CartPole-v1", render_mode="human")
env.reset()

For this particular example we use the CartPole-v1 environment; you can explore all of the available ones on the Gymnasium website under the Environments section. The next stage of our code is the main loop:

 

for _ in range(1000):
    action = env.action_space.sample()  # Choose action (here: sample/random action)
    env.render()
    observation, reward, done, tr, info = env.step(action)
    if done:
        env.reset()

In the main loop we first choose the action, in this case a random one; note that different environments have different available actions, and the easiest way to discover the actions available for a particular environment is to visit its documentation: https://gymnasium.farama.org/environments/classic_control/cart_pole/. After the action is chosen we render the graphical interface so that we can see what happened (later we will render the GUI only while testing and disable it during learning to speed up the process). The next step is the most important, since it executes the action: the step function returns a lot of information about the simulation – the observation (the next state), the reward, the done flag (whether we reached a terminal point and need to reset the simulation), the truncated flag (whether a truncation condition outside the scope of the task is satisfied; we don’t need to worry about that for now) and info containing diagnostic information. At the end we check whether we reached the terminal state and, if so, reset the simulation.

 

And that’s it! We’ve successfully run the simulation environment. We are one step closer to implementing the Q-learning algorithm, which we will cover in the next article in this series, so be sure to follow our next publications!

 

NOTE!

Try running different environments! Explore the Gymnasium website to find out what other environments are available. Find the information on how many states, actions and rewards are available in different environments.
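
If you prefer to check those numbers directly from code rather than from the documentation, a short sketch like the one below may help; the environment names are just example picks, and the reward scheme of each environment is described on its documentation page.

import gymnasium as gym

# Print the action and observation spaces of a few example environments
for name in ["CartPole-v1", "MountainCar-v0", "FrozenLake-v1"]:
    env = gym.make(name)
    print(name, env.action_space, env.observation_space)
    env.close()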

 

Summary

The Q-learning algorithm, despite being one of the simpler reinforcement learning algorithms, allows you to obtain truly effective solutions. Today we went through the basics and a theoretical introduction to the problem: we learned what the reward for the system is, how the system learns, and where to experiment with and train our agent. We have gained strong theoretical foundations; soon, in the next article, we will use this knowledge to implement a working algorithm in the test environment.

 
