A while back we wrote a couple of articles about Reinforcement Learning. We explored how it differs from supervised and unsupervised learning and what kind of problem it is trying to solve. We also found out what Q-Learning is, one way to actually do reinforcement learning, and saw how to implement it with Python, with neural networks and with convolutional neural networks. For this purpose we used TensorFlow 2 to implement various neural networks and used them within learning agents that we wrote ourselves. However, it turned out that this is a lot of work and that there is always a possibility of missing some crucial part of the ecosystem. That is why the folks at Google gave us TF-Agents, an easier way to quickly utilize reinforcement learning.

However, let’s first run through the basics really quickly, just to remind ourselves which parts of the ecosystem exist in reinforcement learning and why it is so hard to cover all aspects of it, not to mention leaving enough room for changing configuration and fine-tuning models for better results. Unlike other types of learning (supervised and unsupervised), reinforcement learning utilizes the power of interaction. The ecosystem that represents this type of learning consists of a learning agent and the environment. The agent is trying to achieve some kind of goal within that environment. In its pursuit, it performs numerous actions. Every action changes the state of the environment and results in feedback from it. Feedback comes in the form of a reward or a punishment. Based on this feedback, the agent learns which actions are suitable in certain states.

In general, we use Markov Decision Processes (MDPs) as a mathematical tool to describe this process. MDPs give us a formal description of the problem in which an agent tries to find the best policy, meaning the best action that the agent can take in a certain state of the environment. The policy is defined as the function π. This function maps states of the environment to the best action that the agent can take in that state. MDPs are constructed of several concepts and described with the tuple (S, A, Pa, Ra), where:

  • S – The set of states. At each time step t, the agent gets the environment’s state St, where St ∈ S.
  • A – The set of actions that the agent can perform. Based on the state St, the agent decides to perform an action At, where At ∈ A(St) and A(St) represents the set of possible actions in the state St.
  • Pa – Or more precisely Pa(s, s’), represents the probability that action a in state s at time step t will result in state s’ at time step t+1.
  • Ra – Or more precisely Ra(s, s’), represents the expected reward received after going from state s to state s’ as a result of action a.

The algorithm we explore in this article is based on one very important concept: the value of taking action a in state s under policy π. It represents the expected discounted sum of rewards that the agent will get if it starts from state s, takes action a and follows policy π afterwards, and it is denoted qπ(s, a). As a mathematical formula it can be written down like this:

    qπ(s, a) = Eπ[ Rt+1 + γRt+2 + γ²Rt+3 + … | St = s, At = a ]

where γ is the discount factor, which determines how much importance we want to give to future rewards. If you check Quora, you will find a theory that this value represents the quality of an action, ergo it is denoted with q. However, a lot of people consider this just a coincidence. One way or another, this brings us to the reinforcement learning algorithm that we implement in this article using TF-Agents: Q-Learning.
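To make the role of γ a bit more concrete, here is a minimal sketch that computes the discounted sum of a reward sequence. The function name discounted_return and the sample rewards are made up for this illustration and are not part of any library.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the k-th future reward is weighted by gamma**k."""
    total = 0.0
    # Walk backwards so that each step computes G = r + gamma * G_next.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three time steps with a reward of 1 each (illustrative values).
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # future rewards count a little less
print(discounted_return([1.0, 1.0, 1.0], gamma=0.0))  # only the immediate reward counts
```

With γ = 0.9 the result is roughly 1 + 0.9 + 0.81 = 2.71, while with γ = 0 the agent becomes completely short-sighted and only the first reward survives.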


Q-Learning is an algorithm that estimates the q value (or Q-Value), i.e. the value of taking action a in state s under policy π. That is where the name comes from. During the training iterations it updates these Q-Values for each state-action combination, meaning we are forming a table in which every state-action pair is assigned a value, the Q-Value. The process of updating these values during training is described by the formula:

    Q(St, At) ← Q(St, At) + α [Rt+1 + γ max Q(St+1, a) − Q(St, At)]

where α is the learning rate and the maximum is taken over all actions a available in the state St+1.

As we mentioned previously, the Q-Value for a particular state-action pair can be seen as the quality of that particular action in that particular state. If this value is higher, we expect that the agent will get a higher reward if it takes this action. This is the mechanism that the learning agent uses to know which “way” it should go in order to achieve the defined goal. The policy in this case determines which state-action pairs are visited and updated, but nothing more. This is why this algorithm is often described as off-policy.

The important part of the formula above is max Q(St+1, a). Note the t+1 annotation. This means that the Q-Value of the current time step is based on the Q-Value of the future time step. I know, it is confusing. In practice, we initialize the Q-Values for all states to some arbitrary values at first. In the first training iteration we update the Q-Value in the state St based on the reward and on the arbitrary Q-Value of the state St+1. Since every update is anchored to an actual reward and not only to the Q-Values themselves, the estimates improve with each iteration and the system converges to the best result.
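A single update can be worked through by hand. The sketch below applies one Q-Learning update to one state-action pair; the helper q_update and all the numbers are invented for illustration.

```python
def q_update(q_sa, reward, max_q_next, alpha, gamma):
    """One Q-Learning update for a single state-action pair."""
    td_target = reward + gamma * max_q_next   # the value the estimate is pulled toward
    td_error = td_target - q_sa               # how far off the current estimate is
    return q_sa + alpha * td_error


# Illustrative numbers: the current estimate is 0.5, the agent received a reward
# of 1 and the best Q-Value in the next state is currently 2.0.
new_q = q_update(q_sa=0.5, reward=1.0, max_q_next=2.0, alpha=0.1, gamma=0.9)
print(round(new_q, 4))
```

Here the target is 1 + 0.9 · 2.0 = 2.8, so the estimate moves a tenth of the way from 0.5 toward 2.8 and ends up at about 0.73. Even though 2.0 was just an arbitrary initial guess, the reward term keeps nudging the estimates in the right direction.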

All these Q-Values are stored, one for each state-action pair, inside the Q-Table and updated in every iteration.

To make it even clearer, we can break Q-Learning down into steps. It would look something like this:

  1. Initialize all Q-Values in the Q-Table arbitrarily, and the Q-Value of the terminal state to 0:
    Q(s, a) = n, ∀s ∈ S, ∀a ∈ A(s)
    Q(terminal-state, ·) = 0
  2. Pick the action a from the set of actions A(s) available in the current state s, as determined by the policy π.
  3. Perform action a
  4. Observe reward R and the next state s’
  5. For all possible actions from the state s’ select the one with the highest Q-Value – a’.
  6. Update value for the state using the formula: 
    Q(s, a) ← Q(s, a) + α [R + γQ(s’, a’) − Q(s, a)]
  7. Repeat steps 2-6 for each time step until the terminal state is reached
  8. Repeat steps 2-7 for each episode
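The steps above can be sketched in a few lines of plain Python. This is only a toy illustration under our own assumptions: the five-state "chain" environment, the step function and all hyperparameters are made up, and the behaviour policy simply picks random actions (which is fine for Q-Learning because it is off-policy).

```python
import random

N_STATES = 5          # states 0..4; state 4 is the terminal state
ACTIONS = (0, 1)      # 0 = move left, 1 = move right


def step(state, action):
    """Toy chain environment: reward 1 only when the terminal state is reached."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def train(episodes=500, alpha=0.1, gamma=0.9, seed=0):
    rng = random.Random(seed)
    # Step 1: arbitrary initial values (zeros here); the terminal row stays at 0.
    q_table = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):                                 # step 8
        state = 0
        while state != N_STATES - 1:                          # step 7
            action = rng.choice(ACTIONS)                      # step 2 (random behaviour policy)
            next_state, reward = step(state, action)          # steps 3 and 4
            best_next = max(q_table[next_state])              # step 5
            q_table[state][action] += alpha * (               # step 6
                reward + gamma * best_next - q_table[state][action])
            state = next_state
    return q_table


q_table = train()
# Greedy action for every non-terminal state.
greedy = [row.index(max(row)) for row in q_table[:-1]]
print(greedy)
```

After training, the greedy action in every non-terminal state should be 1 (move right), since that is the shortest way to the rewarding terminal state.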

In one of the previous articles, you can find a simple implementation of this algorithm.

It is important to note that this type of learning can get stuck in a scenario that is not the best solution for the problem. For example, the algorithm can learn that the best thing to do in state s is to perform action a’ and go to state s’. However, it may never have tried to perform action a” and end up in state s”, which could be a better option. Loosely speaking, the system can overfit to what it has already seen. Because of this, we use an additional parameter which defines whether the agent will explore actions it hasn’t performed thus far (exploration), or take the safe, already learned route (exploitation). This parameter is denoted with epsilon.
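A common way to use epsilon is the so-called epsilon-greedy rule: with probability epsilon take a random action, otherwise take the action with the highest Q-Value. Here is a minimal sketch; the function name epsilon_greedy and the sample Q-Values are our own choices for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, greedy otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                        # exploration
    return max(range(len(q_values)), key=lambda a: q_values[a])    # exploitation


q_values = [0.1, 0.7, 0.3]   # illustrative Q-Values for three actions
print(epsilon_greedy(q_values, epsilon=0.0))   # epsilon 0 means always greedy
print(epsilon_greedy(q_values, epsilon=1.0))   # epsilon 1 means always random
```

In practice epsilon often starts high and is decreased during training, so the agent explores a lot at the beginning and relies more on what it has learned later on.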

However, scaling is a big problem for Q-Learning. Environments get complicated. Imagine how many states, and how many possible actions for each state, can exist in a video game. Imagine how big the Q-Table can get in that case and how it should be handled. This is where artificial neural networks come into play.

Deep Q-Learning

Deep Q-Learning harnesses the power of deep learning with so-called Deep Q-Networks, or DQN for short. In this scenario, these networks are just standard feed-forward neural networks which are used for predicting the Q-Values of all actions in a given state. In order for this approach to work, the agent has to store previous experiences in a local memory, but more on that later.

One of the important things to realize is that DQNs don’t use standard supervised learning. This is because the agent doesn’t have labels (expected outputs) provided beforehand. Instead, the training targets are derived from the network’s own value estimates, so, in a nutshell, the target is continuously changing with each iteration. Because of this the agent doesn’t have just one neural network, but two of them. The first network, which is referred to as the Q-Network, calculates the Q-Value in the state St. The second network, referred to as the Target Network, calculates the Q-Value in the state St+1.

Speaking more formally, given the current state St, the Q-Network retrieves the action-values Q(St, a). At the same time the Target Network uses the next state St+1 to calculate Q(St+1, a) for the Temporal Difference target. In order to stabilize the training of the two networks, on every N-th iteration the parameters of the Q-Network are copied over to the Target Network. The whole process is presented in the image above. A while back we implemented this process using Python and TensorFlow 2. You can check out that implementation here.
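To show the interplay of the two networks without any deep learning library, here is a toy sketch. LinearQ is a made-up stand-in for a neural network, the "gradient step" is just a placeholder increment, and SYNC_EVERY plays the role of N; none of these names come from TF-Agents.

```python
import copy


class LinearQ:
    """Tiny stand-in for a neural network: Q(s, a) = weights[a] . s"""

    def __init__(self, weights):
        self.weights = weights

    def q_values(self, state):
        return [sum(w * x for w, x in zip(row, state)) for row in self.weights]


def td_target(target_net, reward, next_state, gamma, done):
    """Temporal Difference target computed with the Target Network."""
    if done:
        return reward                       # no future value after a terminal state
    return reward + gamma * max(target_net.q_values(next_state))


q_net = LinearQ([[0.5, 0.0], [0.0, 1.0]])
target_net = copy.deepcopy(q_net)           # both networks start out identical

SYNC_EVERY = 3                              # the "N" from the text; arbitrary value
for iteration in range(1, 6):
    q_net.weights[1][1] += 0.5              # placeholder for a real gradient step
    if iteration % SYNC_EVERY == 0:
        target_net = copy.deepcopy(q_net)   # copy Q-Network parameters over

print(q_net.weights[1][1], target_net.weights[1][1])
print(td_target(target_net, reward=1.0, next_state=[1.0, 1.0], gamma=0.5, done=False))
```

After five iterations the Q-Network has moved on, while the Target Network still holds the parameters from the last synchronization at iteration 3. That lag is exactly what keeps the training target from shifting on every single update.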

Experience Replay

We already mentioned that the agent, in order to train neural networks, has to store previous experiences. Deep Q-Learning takes this to the next level and uses one more concept to improve performance: experience replay. This concept is used for one more reason, which is to stabilize the training process. In a nutshell, the agent uses random batches of experiences to train the networks. Experience replay is the memory that stores those experiences in the form of a tuple <s, s’, a, r>:

  • s – State of the agent
  • a – Action that was taken in the state by the agent
  • r – Immediate reward received for taking action a in state s
  • s’ – Next state of the agent after state s
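A minimal sketch of such a memory, using only the Python standard library, could look like this. The class name ReplayMemory and its methods are hypothetical names chosen for the illustration; later in this article the same role is played by a replay buffer from TF-Agents.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size store of <s, s', a, r> tuples; the oldest entries are dropped first."""

    def __init__(self, capacity, seed=None):
        self._buffer = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def store(self, state, next_state, action, reward):
        self._buffer.append((state, next_state, action, reward))

    def sample(self, batch_size):
        # Uniform random batch, drawn without replacement.
        return self._rng.sample(list(self._buffer), batch_size)

    def __len__(self):
        return len(self._buffer)


memory = ReplayMemory(capacity=3, seed=0)
for t in range(5):
    memory.store(state=t, next_state=t + 1, action=0, reward=1.0)

print(len(memory))                 # only the 3 most recent experiences are kept
batch = memory.sample(batch_size=2)
print(batch)
```

Sampling at random, instead of always training on the latest transition, breaks the strong correlation between consecutive experiences, which is one of the reasons this trick stabilizes training.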

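The network uses random batches of <s, s’, a, r> from the memory to calculate Q-Values and base backpropagation on that. The loss is calculated using the squared difference between the target Q-Value and the predicted Q-Value:

    Loss = (r + γ max Q’(s’, a’) − Q(s, a))²

where Q is the Q-Network, Q’ is the Target Network and the maximum is taken over all actions a’ available in the state s’.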

Note that this is performed only for the training of Q-Network, while parameters are copied over to Target Network as we previously mentioned.
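As a rough illustration of the squared difference between target and predicted Q-Values, the sketch below averages it over a small batch. The two "networks" are hypothetical stand-in functions that return fixed Q-Values, and terminal states are ignored to keep the example short.

```python
def dqn_loss(batch, q_net, target_net, gamma):
    """Mean squared difference between target and predicted Q-Values for a batch."""
    total = 0.0
    for state, next_state, action, reward in batch:
        predicted = q_net(state)[action]                        # Q-Network, state s
        target = reward + gamma * max(target_net(next_state))   # Target Network, state s'
        total += (target - predicted) ** 2
    return total / len(batch)


# Hypothetical stand-ins: each "network" maps a state to a list of Q-Values.
def q_net(state):
    return [0.0, 1.0]


def target_net(state):
    return [2.0, 0.0]


# Two experiences in the <s, s', a, r> layout used in the text.
batch = [(0, 1, 1, 1.0), (1, 2, 0, 0.0)]
print(dqn_loss(batch, q_net, target_net, gamma=0.5))
```

For the first experience the prediction is 1.0 and the target is 1 + 0.5 · 2 = 2.0; for the second the prediction is 0.0 and the target is 1.0. Both squared errors equal 1, so the mean loss is 1.0. In a real agent this value would be backpropagated through the Q-Network only.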

To sum it all up, the whole process of Deep Q-Learning can be broken down into several steps:

  1. Provide the state of the environment to the agent. The agent uses the Q-Network to get the Q-Values of all possible actions in that state.
  2. Pick the action a, based on the epsilon value. Meaning, either select a random action (exploration) or select the action with the maximum Q-Value (exploitation).
  3. Perform action a
  4. Observe reward r and the next state s’
  5. Store this information in the experience replay memory <s, s’, a, r>
  6. Sample random batches from experience replay memory and perform training of the Q-Network.
  7. Each Nth iteration, copy the weight values from the Q-Network to the Target Network.
  8. Repeat steps 2-7 for each episode


So, as you can see, there are a few elements that need to be implemented in order to get a reinforcement learning system up and running. First there is the learning agent, which utilizes a neural network with experience replay under the hood. Apart from that, there is the environment with its sets of actions and rewards. In the previous article, we implemented everything on our own with Python, TensorFlow 2 and OpenAI Gym. However, this time we utilize TF-Agents, Google’s library for reinforcement learning. TF-Agents provides modular components that can be modified and extended, and at the core of it there are different types of agents that can be used. In this article we use the DQN agent, which is an agent for Deep Q-Learning. Also, this library provides easy access to OpenAI Gym, Atari, and DM Control and all their environments. To install TF-Agents simply run this command:

pip install tf-agents

This library works with both TensorFlow 1 and TensorFlow 2, so make sure you have it installed as well. More on TensorFlow 2 can be found here, but you can simply install the CPU version with the command:

pip install tensorflow

In this article we use the CartPole-v0 environment, which is sort of a ‘Hello World’ example in the world of reinforcement learning.

In this experimental environment, a pole is attached to a cart which moves along a track. The system is controlled by applying a force of +1 or -1 to the cart and moving it left or right. The pole is in an upright position in the beginning, and the goal is to prevent it from falling. For every time step in which the pole doesn’t fall, a reward of +1 is provided. The complete episode ends when the pole is more than 12 degrees from vertical, or the cart moves more than 2.4 units from the center.


The first thing we need to do is to import necessary libraries and define some constants:

import base64
import imageio
import matplotlib
import matplotlib.pyplot as plt
import tensorflow as tf
from tf_agents.agents.dqn.dqn_agent import DqnAgent
from tf_agents.networks.q_network import QNetwork
from tf_agents.environments import suite_gym
from tf_agents.environments import tf_py_environment
from tf_agents.policies.random_tf_policy import RandomTFPolicy
from tf_agents.replay_buffers.tf_uniform_replay_buffer import TFUniformReplayBuffer
from tf_agents.trajectories import trajectory
from tf_agents.utils import common
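# Globals
NUMBER_ITERATION = 20000   # matches the 20000 iterations in the training log
COLLECTION_STEPS = 1       # assumed value: environment steps collected per iteration
BATCH_SIZE = 64            # assumed value: experiences sampled per training step
EVAL_EPISODES = 10         # same as the default used by get_average_return
EVAL_INTERVAL = 1000       # matches the logging interval in the training log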

You can see that we imported TensorFlow and a lot of modules from TF-Agents. One of the classes we imported is DqnAgent, a specific agent that can perform Deep Q-Learning. This is really cool and saves us a lot of time. We also imported the QNetwork class, which is an abstraction of the neural network that we use for learning. As with transfer learning, this saves us a bunch of time. We also import suite_gym and tf_py_environment. The first module grants us access to training environments. Since all of these environments are implemented in Python, we need to wrap them for use with TensorFlow. That is what tf_py_environment is used for. For experience replay, we use the class TFUniformReplayBuffer, and in this buffer we store trajectories. A trajectory is a tuple that contains the state of the environment at some time step, the action that the agent takes in that state and the state in which the environment ends up after that action is performed. Cool, now we have our building blocks, so let’s proceed with the implementation of the ecosystem.


After importing all necessary modules, we need to construct the environment. In fact, we need two environments, one for training and the other one for evaluation. Here is how we do that:

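train_env = suite_gym.load('CartPole-v0')
evaluation_env = suite_gym.load('CartPole-v0')

# Inspect what the environment exposes before wrapping it.
print('Observation Spec:')
print(train_env.time_step_spec().observation)
print('Reward Spec:')
print(train_env.time_step_spec().reward)
print('Action Spec:')
print(train_env.action_spec())

# Wrap the Python environments so they can be used with TensorFlow.
train_env = tf_py_environment.TFPyEnvironment(train_env)
evaluation_env = tf_py_environment.TFPyEnvironment(evaluation_env)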

Loading the environment is easy using suite_gym. Once the environment is loaded, we can easily inspect its states, the rewards it returns and the actions that the agent can perform in it. However, these environments are implemented in Python, so we need to wrap them in order to use them with TensorFlow. Awesome, we have our first piece of the puzzle prepared.


Now, we can build the DQN agent. Before we proceed with that, we need to create an instance of the QNetwork class. Here is what the constructor of that class looks like:

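class QNetwork(network.Network):
    """Feed Forward network."""

    def __init__(self,
                 input_tensor_spec,
                 action_spec,
                 # ... other optional parameters omitted in this excerpt ...
                 fc_layer_params=(75, 40),
                 # ... other optional parameters omitted in this excerpt ...
                 ):
        """Creates an instance of `QNetwork`.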

Here we have two obligatory parameters and a number of optional ones. We must define input_tensor_spec, which describes the possible states of the environment, and action_spec, which describes the possible actions that the agent can take in that environment. Among the other parameters, fc_layer_params is of great importance to us. Using this parameter, we can define the number of neurons for each hidden layer. We use this constructor like this:

hidden_layers = (100,)
q_network = QNetwork(
    train_env.observation_spec(), train_env.action_spec(),
    fc_layer_params=hidden_layers)

We define only one hidden layer with 100 neurons and pass information about the training environment on to the QNetwork constructor. Now we can instantiate an object of the DqnAgent class.

counter = tf.Variable(0)
agent = DqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    q_network=q_network,
    optimizer=tf.compat.v1.train.AdamOptimizer(learning_rate=1e-3),
    td_errors_loss_fn=common.element_wise_squared_loss,
    train_step_counter=counter)

agent.initialize()

This object is initialized with information about the training environment, the QNetwork object and the optimizer. In the end, we must call the initialize method on this object. We implement one more function on top of this – get_average_return.

def get_average_return(environment, policy, episodes=10):
    total_return = 0.0
    for _ in range(episodes):
        time_step = environment.reset()
        episode_return = 0.0
        # Play one full episode and add up the rewards.
        while not time_step.is_last():
            action_step = policy.action(time_step)
            time_step = environment.step(action_step.action)
            episode_return += time_step.reward
        total_return += episode_return
    avg_return = total_return / episodes
    return avg_return.numpy()[0]

This function calculates how much reward the agent has gained per episode on average.

Experience Replay

Ok, let’s build the last piece of the Deep Q-Learning ecosystem – Experience Replay. For this purpose, we implement the class with the same name:

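class ExperienceReplay(object):
    def __init__(self, agent, environment):
        # max_length is an assumed value; tune it together with the batch size.
        self._replay_buffer = TFUniformReplayBuffer(
            data_spec=agent.collect_data_spec,
            batch_size=environment.batch_size,
            max_length=50000)
        self._random_policy = RandomTFPolicy(environment.time_step_spec(),
                                             environment.action_spec())
        self._fill_buffer(environment, self._random_policy, steps=100)
        # num_steps=2 yields pairs of adjacent time steps, as DQN training expects.
        self.dataset = self._replay_buffer.as_dataset(
            sample_batch_size=BATCH_SIZE,
            num_steps=2).prefetch(3)
        self.iterator = iter(self.dataset)

    def _fill_buffer(self, environment, policy, steps):
        for _ in range(steps):
            self.timestamp_data(environment, policy)

    def timestamp_data(self, environment, policy):
        time_step = environment.current_time_step()
        action_step = policy.action(time_step)
        next_time_step = environment.step(action_step.action)
        timestamp_trajectory = trajectory.from_transition(time_step, action_step, next_time_step)
        # Store the collected trajectory in the buffer.
        self._replay_buffer.add_batch(timestamp_trajectory)

experience_replay = ExperienceReplay(agent, train_env)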

In the constructor of this class, we initialize the replay buffer, which is an object of the class TFUniformReplayBuffer. If your agent is not getting good results, you can play with the batch size and the length of the buffer. We also create an instance of RandomTFPolicy. This one is used to fill the buffer with initial values, which is done by calling the internal function _fill_buffer. This method in turn calls the timestamp_data method once for each collection step. The timestamp_data method then forms a trajectory from the current state and the action defined by the policy. This trajectory is stored in the buffer. The final step of the constructor is to create an iterable tf.data.Dataset pipeline which feeds data to the agent.

Training and Evaluation

Once we have all this prepared, implementing the training process is straightforward:

agent.train_step_counter.assign(0)

avg_return = get_average_return(evaluation_env, agent.policy, EVAL_EPISODES)
returns = [avg_return]

for _ in range(NUMBER_ITERATION):
    # Collect new experience with the agent's collect policy.
    for _ in range(COLLECTION_STEPS):
        experience_replay.timestamp_data(train_env, agent.collect_policy)

    # Sample a batch from the replay buffer and train on it.
    experience, info = next(experience_replay.iterator)
    train_loss = agent.train(experience).loss

    if agent.train_step_counter.numpy() % EVAL_INTERVAL == 0:
        avg_return = get_average_return(evaluation_env, agent.policy, EVAL_EPISODES)
        print('Iteration {0} – Average Return = {1}, Loss = {2}.'.format(agent.train_step_counter.numpy(), avg_return, train_loss))
        returns.append(avg_return)

First, we reset the agent’s train step counter to 0 and get the initial average return. Then the training process starts and runs for the defined number of iterations. During each iteration we first collect data from the environment and then use that data to train the agent’s neural networks. We also periodically print the average return on the evaluation environment together with the training loss. Here is what that looks like:

Iteration 1000 – Average Return = 0.8999999761581421, Loss = 6.6086745262146.
Iteration 2000 – Average Return = 2.0999999046325684, Loss = 45.229454040527344.
Iteration 3000 – Average Return = 4.800000190734863, Loss = 15.170860290527344.
Iteration 4000 – Average Return = 14.399999618530273, Loss = 82.94471740722656.
Iteration 5000 – Average Return = 20.0, Loss = 25.855594635009766.
Iteration 6000 – Average Return = 20.0, Loss = 14.144152641296387.
Iteration 7000 – Average Return = 20.0, Loss = 329.9082336425781.
Iteration 8000 – Average Return = 20.0, Loss = 27.24919319152832.
Iteration 9000 – Average Return = 20.0, Loss = 18.222564697265625.
Iteration 10000 – Average Return = 20.0, Loss = 81.72532653808594.
Iteration 11000 – Average Return = 20.0, Loss = 508.6898498535156.
Iteration 12000 – Average Return = 20.0, Loss = 84.64437866210938.
Iteration 13000 – Average Return = 20.0, Loss = 58.284549713134766.
Iteration 14000 – Average Return = 20.0, Loss = 1083.6463623046875.
Iteration 15000 – Average Return = 20.0, Loss = 1778.074951171875.
Iteration 16000 – Average Return = 20.0, Loss = 56.717002868652344.
Iteration 17000 – Average Return = 20.0, Loss = 62.18504333496094.
Iteration 18000 – Average Return = 20.0, Loss = 1133.4283447265625.
Iteration 19000 – Average Return = 20.0, Loss = 2548.564697265625.
Iteration 20000 – Average Return = 20.0, Loss = 111.26058959960938.

Or if we plot it out:

In the end, we can get frames from the evaluation environment using the render method and make a gif or a video out of them:


In this article, we explored the vast world of Q-Learning. We had a chance to see where it all comes from and how it can be implemented using different techniques. For us, the most interesting approach to this algorithm is Deep Q-Learning, which utilizes neural networks. Finally, we saw how this approach can be implemented using the TF-Agents library.

Thank you for reading!

Read more posts from the author at Rubik’s Code.