Select Page

Reinforcement learning is field that keeps growing and not only because of the breakthroughs in deep learning. Sure if we talk about deep reinforcement learning, it uses neural networks underneath, but there is more to it than that. In our journey through the world of reinforcement learning we focused on one of the most popular reinforcement learning algorithms out there Q-Learning. This approach is considered one of the biggest breakthroughs in Temporal Difference control. In this article, we are going to explore one variation and improvement of this algorithm – Double Q-Learning.

Eager to learn how to build Deep Learning systems using Tensorflow 2 and Python? Get the ebook here!

In general, reinforcement learning is a mechanism to solve problems that can be presented with Markov Decision Processes (MDPs). This type of learning relies on interaction of the learning agent with some kind of environment. This agent that is trying to achieve some kind of  goal within that environment, and environment has certain states. In it’s pursue, agent performs numerous actions. Every action changes the state of the environment and results in feedback from it. Feedback comes in a form of reward or punishment. Based on this agent learns which actions are suitable and which are not, and build so called policy.

Q-Learning is one of the most well known algorithms in the world of reinforcement learning.


This algorithm estimates the Q-Value, i.e. the value of taking action a in state s under policy π. This can be considered as quality of action a in state s. During the training process agent updates Q-Values for each state-action combination, meaning it forms a table, where for each action and the state we store Q-Value. The process of updating these values during the training process is described by the formula:

As we mentioned previously, this Q-Value for a particular state-action pair can be observed as the quality of that particular action in that particular state. Higher Q-Value indicates greater reward for the learning agent. This is the mechanism that the learning agent uses to find out how to achieve defined goal. The policy in this case determines which state–action pairs are visited and updated.

The important part of the formula above is maxQ(St+1, a). Note the t+1 annotation. This means that Q-value of the current time step is based on the Q-value of the future time step. Spooky, I know. This means that we initialize Q-Values for states St and St+1 to some random values at first. In the first training iteration we update Q-Value in the state St based on reward and on those random value of Q-Value in the state St+1. Since the whole system is driven by the reward and not by the Q-Value itself system converge to the best result.

To get it even more clear we can brake down Q-Learning into the steps. It would look something like this:

  1. Initialize all Q-Values in the Q-Table arbitrary, and the Q value of terminal-state to 0:
    Q(s, a) = n, ∀s ∈ S∀a ∈ A(s) 
    Q(terminal-state, ·) = 0
  2. Pick the action a, from the set of actions defined for that state A(s) defined by the policy π.
  3. Perform action a
  4. Observe reward R and the next state s’
  5. For all possible actions from the state s’ select the one with the highest Q-Value – a’.
  6. Update value for the state using the formula: 
    Q(s, a) ← Q(s, a) + α [R + γQ(s’, a’) − Q(s, a)]
  7. Repeat steps 2-5 for each time step until the terminal state is reached
  8. Repeat steps 2-6 for each episode

In one of the previous articles, you can find the simple implementation of this algorithm.

Q-Learning Problem

However, this important part of the formula maxQ(St+1, a) is at the same time the biggest problem of Q-Learning. In fact, this is the reason why this algorithm performs poorly in some stochastic environments. Because of max operator Q-Learning can overestimate Q-Values for certain actions. It can be tricked that some actions are worth perusing, even if those actions result in the lower reward in the end. Let’s consider this scenario:

  • Environment has 4 states – X, Y, Z, W.
  • States Z and W are terminal states.
  • X is starting state and there are two actions that agent can undertake in this state:
    • Up – Reward for this action is 0 and next state is Y.
    • Down – Reward for this action is 0 and next state is terminal state Z.
  • From Y state agent can take multiple actions all taking them to the terminal state W. The reward of these actions is random value which follows a normal distribution with mean -1 and a variance 2 – N(-1, 2). This means that after a large number of iterations reward will be negative.

This MDP is presented in the image below:

Because after many iterations cumulative reward for the agent will be negative, if it take action Up from the state X, this action should never be taken. However, since we use max operator to update the Q-Value ,which could be positive for this action, learning agent takes this action as valid option. Meaning, estimator is taking maximum possible value instead of mean value and in this particular situation that is wrong.

Double Q-Learning

The solution for this problem was proposed by Hado van Hasselt in his 2010 paper. What he proposes is that instead using one set of data and one estimator, to use two estimators. This effectively means that instead of using one Q-Value for each state-action pair, we should use two values – QA and QB. Technically, this approach focuses on finding action a* that maximizes QA in the state next state s’ – (Q(s’, a*) = max Q(s’, a)). Then it uses this action to get the value of second Q-ValueQB(s’, a*). Finally it uses QB(s’, a*) in order to update QA(s, a):

This is done the other way around too for QB.

Let’s brake down Double Q-Learning process into the steps. It would look something like this:

  • Initialize all QA, QB and starting state – s
  • Repeat
    • Pick the action a and based on QA(s, ) and QB(s, ) get r and s’
    • Update(A) or Update(B) (pick at random)
    • If Update(A)
      • Pick the action a* = argmax QA(s’, a)
      • Update QA
        QA(s, a) ← QA(s, a) + α [R + γQB(s’, a*) − QA(s, a)]
    • If Update(B)
      • Pick the action b* = argmax QB(s’, a)
      • Update QB
        QB(s, a) ← QB(s, a) + α [R + γQA(s’, b*) − QB(s, a)]
    • s ← s’
  • Until End

So, why this works? Well, in the paper it is mathematically proven that expected value of QB for the action a* is smaller or equal to the maximum value of QA(s’, a*), i.e. E(QB(s’, a*)) ≤ Max QA(s’, a*). This means that if we perform large number of iterations the expected value of QB(s’, a*) is going to be smaller than maximal value of QA(s’, a*). In turn, QA(s, a) is never updated with a maximum value and thus never overestimated.


In this article, we had a chance to see the intuition behind Double Q-Learning. In the next article we will implement this algorithm using Python and after that we will involve some neural networks for extra fun.

Thank you for reading!

Read more posts from the author at Rubik’s Code.