In the previous two articles we started exploring the interesting universe of reinforcement learning. First we went through the basics of the third paradigm within machine learning – reinforcement learning. Just to refresh our memory, we saw that the approach of this type of learning is unlike the previously explored supervised and unsupervised learning. In reinforcement learning, a self-learning agent learns through interaction between itself and the environment.

The agent wants to achieve some kind of goal within the mentioned environment while it interacts with it. This interaction is divided into time steps. In each time step, an action is performed by the agent. This action changes the state of the environment, and based on its success the agent gets a certain reward. This way the agent learns which actions should be performed and which shouldn’t in a given environment state.

This is oddly similar to the way we as humans learn. When we are babies, we experiment. We perform some actions and get a response from the environment based on it. If the response is positive (reward) we mark those actions as good, otherwise (punishment) we mark them as bad. 

Interaction with an Environment

In those articles, we also mentioned the mathematical framework that is used to describe this set of problems – Markov Decision Processes (MDPs). This framework gives us a formal description of the problem in which an agent seeks the best policy, defined as the function π. The policy maps states of the environment to the best action that the agent can take in a certain environment state. Inside MDPs, we recognize several concepts, like a set of states – S, a set of actions – A, the expected reward that the agent will get going from one state to another – Ra(s, s’), etc.

Reinforcement Learning

One very important concept that we introduced is the value of taking action a in state s under policy π. It is the expected discounted reward that the agent will get if it starts from state s, takes the action a and then follows policy π – qπ(s, a):

    qπ(s, a) = Eπ[Rt+1 + γRt+2 + γ²Rt+3 + … | St = s, At = a]

where γ is the discount factor which determines how much importance we want to give to future rewards. In essence, this value represents the quality of an action. This brings us to the first reinforcement learning algorithm – Q-Learning.
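To make the role of γ concrete, here is a minimal sketch that computes the discounted sum of a short reward sequence. The function name `discounted_return` and the reward values are made up for illustration and are not taken from the earlier articles.

```python
def discounted_return(rewards, gamma):
    """Sum rewards, weighting the reward k steps ahead by gamma**k."""
    total = 0.0
    # Walk backwards so that each step computes G = r + gamma * G_next.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical rewards: three step penalties of -1, then +20 at the goal.
rewards = [-1, -1, -1, 20]

print(discounted_return(rewards, gamma=1.0))  # every reward counts fully
print(discounted_return(rewards, gamma=0.5))  # distant rewards count less
```

With γ close to 1 the final reward dominates the sum, while a small γ makes the agent care mostly about the immediate step penalties.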

Q – Learning

To recap what we discussed previously, Q-Learning estimates the aforementioned value of taking action a in state s under policy π – q. That is how it got its name. During the training iterations it updates these Q-Values for each state-action combination. Essentially it is described by the formula:

    Q(St, At) ← Q(St, At) + α [Rt+1 + γ maxQ(St+1, a) − Q(St, At)]

Here α is the learning rate, and the maximum is taken over all actions a available in the state St+1.

The Q-Value for a particular state-action combination can be seen as the quality of an action taken from that state. As you can see, the policy still determines which state-action pairs are visited and updated, but nothing more: the update itself always uses the best Q-Value of the next state, regardless of which action the policy picks there. This is why Q-Learning is referred to as an off-policy algorithm.

The interesting point in the formula is maxQ(St+1, a). This means that the Q-Value of the current step is based on the Q-Value of the future step. It is confusing, I know. This means that we initialize the Q-Values for St and St+1 to some random values at first. In the first training iteration we update the Q-Value in the state St based on the reward and on that random Q-Value in the state St+1. Since the reward is still guiding our system, the estimates will eventually converge towards the best result.

All these Q-Values are stored inside the Q-Table, which is just a matrix with rows for states and columns for actions:

Example of Q-Table

To make it even clearer, we can break Q-Learning down into steps. It would look something like this:

  1. Initialize all Q-Values in the Q-Table arbitrarily, and the Q-Value of the terminal state to 0:
    Q(s, a) = n, ∀s ∈ S, ∀a ∈ A(s)
    Q(terminal-state, ·) = 0
  2. Pick the action a from the set of actions A(s) available in the current state s, as determined by the policy π.
  3. Perform action a
  4. Observe reward R and the next state s’
  5. For all possible actions from the state s’ select the one with the highest Q-Value – a’.
  6. Update value for the state using the formula:
    Q(s, a) ← Q(s, a) + α [R + γQ(s’, a’) − Q(s, a)]
  7. Repeat steps 2-6 for each time step until the terminal state is reached
  8. Repeat steps 2-7 for each episode
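The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the five-state "chain" environment, its rewards, the hyperparameters and the uniformly random behaviour policy are all hypothetical choices made for this sketch, not the Taxi setup from the previous article.

```python
import random

N_STATES = 5              # states 0..4 of a hypothetical toy chain
TERMINAL = N_STATES - 1   # reaching state 4 ends the episode
ACTIONS = [0, 1]          # 0 = move left, 1 = move right


def step(state, action):
    """Toy environment: returns (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == TERMINAL
    reward = 10 if done else -1
    return next_state, reward, done


def train(episodes=500, alpha=0.1, gamma=0.9, seed=0):
    rng = random.Random(seed)
    # Step 1: initialize the Q-Table. Zeros are one valid arbitrary choice;
    # the terminal row stays at 0 because it is never updated.
    q_table = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]

    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Step 2: pick an action. The behaviour policy here is uniformly
            # random, which Q-Learning tolerates because it is off-policy.
            action = rng.choice(ACTIONS)
            # Steps 3-4: perform it, observe the reward and the next state.
            next_state, reward, done = step(state, action)
            # Step 5: the highest Q-Value available from the next state.
            best_next = max(q_table[next_state])
            # Step 6: temporal-difference update of Q(s, a).
            td_target = reward + gamma * best_next
            q_table[state][action] += alpha * (td_target - q_table[state][action])
            state = next_state
    return q_table


q_table = train()
# Read the greedy policy off the table for every non-terminal state.
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(TERMINAL)]
print(greedy_policy)
```

Even though the agent behaves randomly during training, the greedy policy read from the table moves right in every state, which is the shortest route to the goal in this toy chain.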

In the previous article, you can find an implementation of this algorithm using Python.

It is important to note that this type of learning can get stuck in certain scenarios that might not be the best solution for the problem. For example, the algorithm can learn that the best thing to do in state s is to perform action a’ and go to state s’. However, it may never have performed action a” and ended up in state s”, which could be a better option. For this reason, we use the parameter epsilon. It determines whether we explore new actions and maybe come up with a better solution – exploration, or go with the already learned route – exploitation.
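As a small illustration of how epsilon steers this choice, here is a sketch of epsilon-greedy action selection. The function name `choose_action` and the Q-Values are hypothetical and only serve the example.

```python
import random


def choose_action(q_values, epsilon, rng=random):
    """Epsilon-greedy selection over the Q-Values of a single state."""
    if rng.random() < epsilon:
        # Exploration: ignore what was learned and try a random action.
        return rng.randrange(len(q_values))
    # Exploitation: take the action with the highest current Q-Value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.2, 1.5, -0.3]   # made-up Q-Values for one state

greedy_action = choose_action(q_values, epsilon=0.0)   # never explores
random_action = choose_action(q_values, epsilon=1.0)   # always explores
print(greedy_action, random_action)
```

With epsilon set to 0 the agent always exploits, and with epsilon set to 1 it always explores; values in between mix the two behaviours.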


The problem with Q-Learning is, of course, scaling. When we are talking about complicated environments, like playing a video game, the number of states and actions can grow enormously. A table becomes an impractical approach for this problem. That is where artificial neural networks come into play.

Deep Q – Learning

Deep Q-Learning harnesses the power of deep learning with so-called Deep Q-Networks. These are standard feed-forward neural networks which are utilized for calculating the Q-Value. In this case, the agent has to store previous experiences in a local memory and use the maximum output of the neural networks to get the new Q-Value.

Deep Q-Learning

The important thing to notice here is that Deep Q-Networks don’t use standard supervised learning, simply because we don’t have a labeled expected output. The target depends on the network's own value estimates, so it is continuously changing with each iteration. For this reason the agent doesn’t use just one neural network, but two of them. So, how does this all fit together? The first network, called the Q-Network, calculates the Q-Value in the state St, while the other network, called the Target Network, calculates the Q-Value in the state St+1.

Speaking more formally, given the current state St, the Q-Network retrieves the action-values Q(St, a). At the same time the Target Network uses the next state St+1 to calculate Q(St+1, a) for the Temporal Difference target. In order to stabilize the training of the two networks, on every N-th iteration the parameters of the Q-Network are copied over to the Target Network. The whole process is presented in the image below.
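The periodic copy can be illustrated without any deep learning library. In the sketch below the "networks" are just dictionaries of numbers, and `SYNC_EVERY` is a hypothetical name for N; with Keras models the copy would typically be done with `get_weights` and `set_weights` instead.

```python
import copy

SYNC_EVERY = 4   # hypothetical N: copy the weights every 4th iteration

# Stand-ins for network parameters; a real network would hold weight tensors.
q_network_weights = {"w": [0.0, 0.0]}
target_network_weights = copy.deepcopy(q_network_weights)

sync_iterations = []
for iteration in range(1, 11):
    # Pretend that a gradient step changed the Q-Network on every iteration.
    q_network_weights["w"] = [w + 0.1 for w in q_network_weights["w"]]

    if iteration % SYNC_EVERY == 0:
        # Hard update: the Target Network becomes a frozen snapshot.
        target_network_weights = copy.deepcopy(q_network_weights)
        sync_iterations.append(iteration)

print(sync_iterations)
print(q_network_weights["w"], target_network_weights["w"])
```

Between two copies the Target Network stays fixed, so the targets it produces do not move while the Q-Network is being trained against them.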

Target Network and Q-Network

We already mentioned that the agent has to store previous experiences. Deep Q-Learning goes one step further and utilizes one more concept in order to improve the agent's performance – experience replay. It has been shown empirically that the neural network training process is more stable when training is done on a random batch of previous experiences. Experience replay is nothing more than the memory that stores those experiences in the form of a tuple <s, s’, a, r>:

  • s – State of the agent
  • a – Action that was taken in the state by the agent
  • r – Immediate reward received in state s for action a
  • s’ – Next state of the agent after state s
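A replay memory of this kind can be sketched with `deque` from the standard library. The capacity of 2000, the helper names `store` and `sample_batch`, and the extra `terminated` flag in the tuple are assumptions made for this illustration.

```python
import random
from collections import deque

# Bounded memory: once it is full, the oldest experience is dropped.
replay_memory = deque(maxlen=2000)   # the capacity is an arbitrary assumption


def store(state, action, reward, next_state, terminated):
    """Append one transition; `terminated` marks the end of an episode."""
    replay_memory.append((state, action, reward, next_state, terminated))


def sample_batch(batch_size, rng=random):
    """Draw a random minibatch of stored transitions without replacement."""
    return rng.sample(list(replay_memory), batch_size)


# Fill the memory with made-up transitions, then sample a minibatch.
for t in range(100):
    store(state=t, action=t % 6, reward=-1, next_state=t + 1, terminated=False)

batch = sample_batch(32)
print(len(batch), batch[0])
```

Sampling at random breaks the correlation between consecutive time steps, which is the property the stability argument above relies on.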

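Both networks use random batches of <s, s’, a, r> from the experience replay to calculate Q-Values, and the Q-Network is then updated by backpropagation. The loss is calculated using the squared difference between the target Q-Value and the predicted Q-Value:

    Loss = (r + γ max Qtarget(s’, a’) − Q(s, a))²

Here Q(s, a) is the prediction of the Q-Network, and Qtarget(s’, a’) is the output of the Target Network, maximized over the actions a’ available in s’.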

Note that this is performed only for the training of the Q-Network, while its parameters are transferred to the Target Network later.
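For a single experience, the target and the loss can be worked out by hand. The sketch below uses made-up numbers in place of network outputs; the names `td_target` and `squared_loss`, the value of γ and the handling of terminal states are assumptions of this example.

```python
GAMMA = 0.9   # discount factor (assumed value)


def td_target(reward, next_q_values, terminated, gamma=GAMMA):
    """Target built from the Target Network's Q-Values for the next state."""
    if terminated:
        return reward   # no future reward after a terminal state
    return reward + gamma * max(next_q_values)


def squared_loss(target, predicted):
    """Squared difference between target and predicted Q-Value."""
    return (target - predicted) ** 2


# Made-up numbers standing in for the outputs of the two networks.
predicted_q = 1.0                 # Q-Network: Q(s, a) for the action taken
next_q_values = [0.5, 2.0, 1.0]   # Target Network: Q-Values of all actions in s'

target = td_target(reward=-1, next_q_values=next_q_values, terminated=False)
loss = squared_loss(target, predicted_q)
print(target, loss)
```

Here the target is -1 + 0.9 · 2.0 = 0.8, so the Q-Network's estimate of 1.0 is nudged downwards during training.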

To sum it all up, we can split the whole process of Deep Q-Learning into steps:

  1. Provide the state of the environment to the agent. The agent uses the Q-Network to get the Q-Values of all possible actions in that state.
  2. Pick the action a, based on the epsilon value. Meaning, either select a random action (exploration) or select the action with the maximum Q-Value (exploitation).
  3. Perform action a
  4. Observe reward r and the next state s’
  5. Store this information in the experience replay memory <s, s’, a, r>
  6. Sample random batches from experience replay memory and perform training of the Q-Network.
  7. Each Nth iteration, copy the weights values from the Q-Network to the Target Network.
  8. Repeat steps 2-7 for each episode



In order to run the code from this article, you have to have Python 3 installed on your machine. In this example, we are using Python 3.7. The implementation is done using TensorFlow 2.0. The complete guide on how to install and use TensorFlow 2.0 can be found here.

Also, you have to install OpenAI Gym or, to be more specific, Atari Gym. You can install it by running:

pip install gym[atari]

If you are using Windows, installation is not this straightforward, so you can follow this article in order to do it correctly.

OpenAI Gym has its own API and its own way of working. However, we will not go into depth on how interaction with the environment is done from the code. We will mention a few important topics here and there. Because of this, we strongly suggest you check out this article if you are not familiar with the concept and the API of OpenAI Gym.

There is one more module you need to install in order for the code from this article to work, and that is the progressbar library. It does not do anything essential; it is there just for cosmetic purposes.


Just like in the previous article, we are using the Gym environment called Taxi-v2. This is a very simple environment. To sum it up, there are 4 locations in the environment and the goal of the agent (taxi) is to pick up the passenger at one location and drop him off at another. The agent can perform 6 actions (south, north, west, east, pickup, drop-off). You can find more information about this environment here.


First, we import all necessary modules and libraries:

Note that apart from standard libraries and modules like numpy, tensorflow and gym, we imported deque from collections. We will use it for the experience replay memory. After this we can create the environment:

We use the make function to instantiate an object of the Taxi-v2 environment. The current state of the environment and the agent can be presented with the render method. The important thing is that we can access all states of the environment using the observation_space property and all actions of the environment using action_space. This environment has 500 states and 6 possible actions. Apart from these, the OpenAI Gym API has two more methods we need to mention. The first one is the reset method, which resets the environment and returns a random initial state. The other one is the step method, which performs an action and steps the environment by one time step.

After this we can finally implement the agent. The Deep Q-Learning agent is implemented within the Agent class. Here is what that looks like:

We know, that is a lot of code. Let’s split it up and explore some important parts of it. Of course, the whole agent is initialized inside of the constructor:

First we initialize the size of the state and action space based on the environment object that is passed to this agent. We also initialize an optimizer and the experience replay memory. Then we build the Q-Network and the Target Network with the _build_compile_model method and align their weights with the alighn_target_model method. The _build_compile_model method is probably the most interesting one in this whole implementation, because it contains the core of the implementation. Let’s peek at it:

We see that the first layer that is used in this model is the Embedding layer. This layer is most commonly used in language processing, so you might be curious what it is doing here. The problem that we are facing with the Taxi-v2 environment is that it returns a discrete value (a single number) for the state. This means that we need a more compact representation for the large number of possible values. In the Embedding layer, the parameter input_dim refers to the number of distinct values we have and output_dim refers to the size of the vector space we want to map them into. To sum it up, we want to represent each of the 500 possible states by a vector of 10 values, and the Embedding layer is used for exactly this. After this layer, the Reshape layer prepares the data for the feed-forward neural network with three Dense layers.
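Conceptually, an embedding is just a lookup table with one row per state. The plain-Python sketch below mimics that idea under stated assumptions: the table is filled with small random numbers instead of trained weights, and the names `embedding_table` and `embed` are hypothetical, not part of the agent's code.

```python
import random

NUM_STATES = 500     # number of discrete Taxi states, from the text
EMBEDDING_DIM = 10   # size of the vector each state is mapped to, from the text

rng = random.Random(0)
# One row per state; in a real network these numbers are trained weights.
embedding_table = [
    [rng.uniform(-0.05, 0.05) for _ in range(EMBEDDING_DIM)]
    for _ in range(NUM_STATES)
]


def embed(state):
    """An embedding lookup: state id -> its row vector."""
    return embedding_table[state]


vector = embed(328)   # 328 is an arbitrary example state id
print(len(vector))
```

The Dense layers that follow never see the raw state number, only this 10-value vector, which the network is free to adjust during training.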

The whole exploitation-exploration concept we mentioned in the previous chapter is implemented inside the act function. Based on the epsilon value we either invoke the Q-Network to make a prediction, or we pick a random action. Like this:

Finally, let’s take a look at the retrain method. In this method we pick random samples from the experience replay memory and train the Q-Network:

Now that we are familiar with the Agent class implementation, let’s create an object of it and prepare for training:

From the output of this code sample, we can see the structure of the networks in the agent:

Now let’s run the training using the steps explained in the previous chapter:

You can notice that this process is fairly similar to the training process we explored in the previous article for standard Q-Learning.


In this article we explored Deep Q-Learning. This is the first reinforcement learning approach in this series that utilizes neural networks. With it we addressed the scaling problem we had with standard Q-Learning and paved the way for more complex systems. In the next article, we will see how we can add convolutional networks to this whole picture.

Thank you for reading!

Read more posts from the author at Rubik’s Code.