**Reinforcement learning** is a field that keeps **growing**, and not only because of the breakthroughs in **deep learning**. Sure, if we talk about deep reinforcement learning, it uses **neural networks** underneath, but there is **more** to it than that. In our journey through the world of reinforcement learning we focused on one of the most **popular** reinforcement learning algorithms out there – **Q-Learning**. This approach is considered one of the biggest breakthroughs in Temporal Difference control. In this article, we are going to explore one variation and **improvement** of this algorithm – *Double Q-Learning*.


In general, reinforcement learning is a mechanism for solving problems that can be modeled as **Markov Decision Processes (MDPs)**. This type of learning relies on the interaction of a learning agent with some kind of environment. The agent is trying to achieve some kind of **goal** within that environment, and the environment has certain states. In its pursuit, the agent performs numerous **actions**. Every action changes the state of the environment and results in **feedback** from it. Feedback comes in the form of a reward or a punishment. Based on this, the agent learns which actions are suitable and which are not, and builds a so-called policy.

Q-Learning is one of the most **well-known** algorithms in the world of reinforcement learning.

## Q-Learning

This algorithm estimates the *Q-Value*, i.e. the value of taking **action** *a* in **state** *s* under **policy** *π*. This can be considered the *quality* of action *a* in state *s*. During the training process the agent updates the *Q-Values* for each state-action combination, meaning it forms a **table** where, for each action and each state, we store a *Q-Value*. The process of updating these values during training is described by the **formula**:

*Q(St, At) ← Q(St, At) + α [Rt+1 + γ maxQ(St+1, a) − Q(St, At)]*

As we mentioned previously, the *Q-Value* for a particular state-action pair can be observed as the **quality** of that particular action in that particular state. A higher *Q-Value* indicates a **greater** reward for the learning agent. This is the mechanism the learning agent uses to find out how to achieve the defined goal. The **policy** in this case determines which state–action pairs are visited and updated.

The important part of the formula above is *maxQ(St+1, a)*. Note the *t+1* annotation. This means that the Q-Value of the **current time step** is based on the Q-Value of the **future time step**. Spooky, I know. In practice, we initialize the *Q-Values* for states *St* and *St+1* to some random values at first. In the first training iteration we update the *Q-Value* in the state *St* based on the reward and on that random *Q-Value* in the state *St+1*. Since the whole system is driven by the **reward** and not by the *Q-Value* itself, the system **converges** to the best result.
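To see a single update concretely, here is one step computed with made-up numbers for the learning rate *α*, the discount factor *γ*, the reward, and the initial *Q-Values*:

```python
alpha, gamma = 0.5, 0.9      # hypothetical learning rate and discount factor
q_current = 0.0              # Q(St, At), initialized arbitrarily
reward = 1.0                 # reward observed after taking the action
max_q_next = 2.0             # max over actions of Q(St+1, a), also arbitrary at first

# Q(St, At) <- Q(St, At) + alpha * [R + gamma * max Q(St+1, a) - Q(St, At)]
q_current += alpha * (reward + gamma * max_q_next - q_current)
print(q_current)             # 0 + 0.5 * (1 + 0.9 * 2 - 0) = 1.4
```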

To make it even clearer, we can break down *Q-Learning* into steps, with a short code sketch following the list. It would look something like this:

1. **Initialize** all *Q-Values* in the *Q-Table* arbitrarily, and the Q-Value of the terminal state to 0:
*Q(s, a) = n, ∀s ∈ S, ∀a ∈ A(s)*
*Q(terminal-state, ·) = 0*
2. **Pick** the action *a* from the set of actions defined for that state *A(s)*, as defined by the policy π.
3. **Perform** action *a*.
4. **Observe** the reward *R* and the next state *s’*.
5. For all possible actions from the state *s’*, **pick** the one with the **highest** *Q-Value* – *a’* – and **update** the value for the state using the formula: *Q(s, a) ← Q(s, a) + α [R + γQ(s’, a’) − Q(s, a)]*
6. **Repeat** steps 2-5 for each time step until the terminal state is reached.
7. **Repeat** steps 2-6 for each episode.
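Putting these steps together, here is a minimal Python sketch of tabular *Q-Learning*. It assumes a hypothetical Gym-style environment with `reset()` and `step(action)` methods and a finite list of actions; the function name and the default hyperparameters are illustrative, not prescribed by the algorithm.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    # Q-Table; unseen (state, action) pairs default to 0
    Q = defaultdict(float)

    for _ in range(episodes):
        state = env.reset()          # assumed: returns the starting state
        done = False
        while not done:
            # Epsilon-greedy policy: explore with probability epsilon
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])

            next_state, reward, done = env.step(action)  # assumed interface

            # a' is the action with the highest Q-Value in s'
            best_next = max(Q[(next_state, a)] for a in actions)

            # Q(s, a) <- Q(s, a) + alpha * [R + gamma * Q(s', a') - Q(s, a)]
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state

    return Q
```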

In one of the **previous articles**, you can find a simple **implementation** of this algorithm.

## Q-Learning Problem

However, this **important** part of the formula, *maxQ(St+1, a)*, is at the same time the biggest **problem** of *Q-Learning*. In fact, this is the reason why the algorithm performs **poorly** in some stochastic environments. Because of the *max* operator, *Q-Learning* can **overestimate** the *Q-Values* of certain actions. It can be **tricked** into believing that some actions are worth pursuing, even if those actions result in a **lower** reward in the end. Let’s consider this scenario:

- The **environment** has 4 states – X, Y, Z, W.
- States Z and W are **terminal** states.
- X is the **starting** state, and there are two actions the agent can take in this state:
  - Up – the reward for this action is 0 and the next state is Y.
  - Down – the reward for this action is 0 and the next state is the terminal state Z.
- From state Y the agent can take **multiple** actions, all of them leading to the terminal state W. The reward for these actions is a random value which follows a normal distribution with **mean** -1 and **variance** 2 – *N(-1, 2)*. This means that over a large number of iterations the average reward will be negative.

This **MDP** is presented in the image below:

Because after many iterations the **cumulative** reward will be negative if the agent takes the action *Up* from the state *X*, this action should never be taken. However, since we use the *max* operator to **update** the *Q-Value*, which *could* be **positive** for this action, the learning agent treats this action as a valid option. In other words, the estimator takes the maximum possible value instead of the mean value, and in this particular situation that is **wrong**.
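We can see this bias with a quick simulation: estimate each action's value from a few noisy rewards drawn from *N(-1, 2)*, then take the *max* over actions, as *Q-Learning* does. The numbers of actions and samples below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_samples = 10, 5      # hypothetical: a few noisy samples per action

# Every action truly has value -1; rewards are drawn from N(-1, 2)
samples = rng.normal(loc=-1.0, scale=np.sqrt(2.0), size=(n_actions, n_samples))
estimates = samples.mean(axis=1)  # per-action Q-Value estimates

print("true value of every action: -1.0")
print("max over the estimates:    ", estimates.max())  # often near or above 0
```

Because the *max* picks out whichever estimate happened to be pushed up by noise, its expectation sits above the true value of -1.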

## Double Q-Learning

The **solution** to this problem was proposed by Hado van Hasselt in his 2010 **paper**. What he proposes is that, instead of using one set of data and one estimator, we use **two estimators**. This effectively means that instead of one *Q-Value* for each state-action pair, we use **two** values – *QA* and *QB*. Technically, this approach first finds the action *a** that **maximizes** *QA* in the next state *s’* – *(Q(s’, a*) = max Q(s’, a))*. Then it uses this action to get the value of the **second** *Q-Value* – *QB(s’, a*)*. Finally, it uses *QB(s’, a*)* to update *QA(s, a)*:

*QA(s, a) ← QA(s, a) + α [R + γQB(s’, a*) − QA(s, a)]*

This is done the other way around too for *QB*.

Let’s break down the *Double Q-Learning* process into **steps**, with a code sketch following the list. It would look something like this:

1. **Initialize** all *QA*, *QB* and the starting state – *s*.
2. **Repeat**:
   1. **Pick** the action *a* based on *QA(s, ·)* and *QB(s, ·)*, then observe *r* and *s’*.
   2. Choose either *Update(A)* or *Update(B)* (**pick** at random).
   3. **If** *Update(A)*:
      - **Pick** the action *a* = argmax QA(s’, a)*
      - **Update** *QA*: *QA(s, a) ← QA(s, a) + α [R + γQB(s’, a*) − QA(s, a)]*
   4. **If** *Update(B)*:
      - **Pick** the action *b* = argmax QB(s’, a)*
      - **Update** *QB*: *QB(s, a) ← QB(s, a) + α [R + γQA(s’, b*) − QB(s, a)]*
   5. *s ← s’*
3. **Until End**
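Here is a minimal Python sketch of these steps, under the same assumptions as the earlier *Q-Learning* sketch (a hypothetical Gym-style `env` with `reset()` and `step(action)`, and a finite list of actions):

```python
import random
from collections import defaultdict

def double_q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    QA = defaultdict(float)   # first estimator
    QB = defaultdict(float)   # second estimator

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Behave epsilon-greedily with respect to QA(s, .) + QB(s, .)
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: QA[(state, a)] + QB[(state, a)])

            next_state, reward, done = env.step(action)

            if random.random() < 0.5:    # Update(A), picked at random
                # a* maximizes QA in s', but its value is read from QB
                a_star = max(actions, key=lambda a: QA[(next_state, a)])
                target = reward + gamma * QB[(next_state, a_star)]
                QA[(state, action)] += alpha * (target - QA[(state, action)])
            else:                        # Update(B)
                # b* maximizes QB in s', but its value is read from QA
                b_star = max(actions, key=lambda a: QB[(next_state, a)])
                target = reward + gamma * QA[(next_state, b_star)]
                QB[(state, action)] += alpha * (target - QB[(state, action)])

            state = next_state

    return QA, QB
```

Because the estimator that selects the action and the estimator that evaluates it are decoupled, noise that inflates one of them does not automatically inflate the other.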

So, why does this work? Well, in the paper it is mathematically **proven** that the expected value of *QB* for the action *a** is **smaller than or equal** to the maximum value of *QA(s’, a*)*, i.e. *E(QB(s’, a*)) ≤ max QA(s’, a*)*. This means that over a large number of iterations the expected value of *QB(s’, a*)* is going to be **smaller** than the maximal value of *QA(s’, a*)*. In turn, *QA(s, a)* is never updated with a **maximum** value and thus never **overestimated**.

## Conclusion

In this article, we had a chance to see the **intuition** behind *Double Q-Learning*. In the next article, we will implement this algorithm using *Python*, and after that we will involve some **neural networks** for extra fun.

Thank you for reading!

Read more posts from the author at **Rubik’s Code**.
