So far in our journey through the world of reinforcement learning we have covered several topics. First we kicked things off with an introduction to reinforcement learning and saw how this paradigm functions. Then we learned the simplest form of it – Q-Learning. Finally, we merged that algorithm with artificial neural networks and created Deep Q-Learning. Now, if there is one thing data scientists like to do, it is to merge concepts and create new, beautiful and unexpected models. That is why in this article we will find out what happens when we give the learning agent the ability to “see”, i.e. what happens when we bring convolutional neural networks into the Deep Q-Learning framework.
Before we start exploring this topic, let’s remind ourselves what we’ve learned so far. Unlike other types of learning (supervised and unsupervised), reinforcement learning uses the power of interaction. This means that a learning agent is trying to achieve a certain goal within some environment. In its pursuit, it performs numerous actions, which change the state of the environment and result in feedback from the environment. The feedback comes in the form of reward or punishment. Based on this, the agent learns which actions are suitable in which states.
We use Markov Decision Processes (MDPs) as a mathematical framework to describe this process. MDPs give us a formal description of the problem in which an agent seeks the best policy, defined as the function π. A policy is a function that maps states of the environment to the best action that the agent can take in a given state. MDPs are built from several concepts and described with the tuple (S, A, Pa, Ra), where:
- S – is the set of states. At each time step t, the agent gets the environment’s state – St, where St ∈ S.
- A – is the set of actions that the agent can take in a given state. The agent makes the decision to perform an action based on the state St – At, where At ∈ A(St). A(St) represents the set of possible actions in the state St.
- Pa – or more precisely Pa(s, s’), represents the probability that action a taken in state s at time step t will result in the state s’ at time step t+1.
- Ra – or more precisely Ra(s, s’), represents the expected reward received after going from state s to the state s’, as a result of action a.
The algorithm we explore in this article is based on one very important concept – the value of taking action a in state s under policy π, denoted qπ(s, a). It represents the expected return that the agent will get if it starts from state s, takes the action a and then follows policy π. As a mathematical formula it can be written down like this:
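qπ(s, a) = Eπ[ Rt+1 + γ·Rt+2 + γ²·Rt+3 + … | St = s, At = a ]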
where γ is the discount factor, which determines how much importance we want to give to future rewards. If you check Quora, you will find a theory that this value represents the quality of an action, which is why it is denoted with q. One way or another, this brings us to the first reinforcement learning algorithm – Q-Learning.
Q-Learning
Q-Learning estimates the q value, i.e. the value of taking action a in state s under policy π. That is where the name comes from. During the training iterations it updates these Q-Values for each state-action combination. It is described by the formula:
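Q(St, At) ← Q(St, At) + α [Rt+1 + γ · max_a Q(St+1, a) − Q(St, At)]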
As we mentioned previously, the Q-Value for a particular state-action combination can be seen as the quality of an action taken from that state. This is the mechanism the learning agent uses to know which “way” it should go in order to achieve the defined goal. The policy in this case only determines which state-action pairs are visited and updated, while the update itself always uses the greedy (maximum) Q-Value of the next state. This is why this algorithm is called off-policy.
The important part of the formula above is maxQ(St+1, a). This means that the Q-Value of the current step is based on the Q-Value of the next step. I know, it sounds confusing. In practice, we initialize the Q-Values for St and St+1 to some random values at first. In the first training iteration we update the Q-Value of the state St based on the reward and on that random Q-Value of the state St+1. Since the reward is still guiding our system, this will eventually converge to the best result.
All these Q-Values are stored for each state-action pair inside the Q-Table and updated in every iteration:
To make it even clearer, we can break Q-Learning down into steps (a short code sketch of this loop follows the list). It would look something like this:
- Initialize all Q-Values in the Q-Table arbitrarily, and the Q-Value of the terminal state to 0:
Q(s, a) = n, ∀s ∈ S, ∀a ∈ A(s)
Q(terminal-state, ·) = 0
- Pick the action a from the set of actions A(s) defined for that state, as determined by the policy π.
- Perform action a
- Observe reward R and the next state s’
- For all possible actions from the state s’ select the one with the highest Q-Value – a’.
- Update the value for the state using the formula:
Q(s, a) ← Q(s, a) + α [R + γ·Q(s’, a’) − Q(s, a)]
- Repeat steps 2-6 for each time step until the terminal state is reached
- Repeat steps 2-7 for each episode
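As referenced above, here is a minimal sketch of these steps in Python. It assumes an OpenAI Gym-style environment with small, discrete observation and action spaces (such as Taxi or FrozenLake); the function name and hyperparameter values are illustrative, not taken from the article:

import numpy as np

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    # 1. Initialize the Q-Table arbitrarily (zeros here); terminal entries stay 0
    q_table = np.zeros((env.observation_space.n, env.action_space.n))

    for _ in range(episodes):
        state = env.reset()
        terminated = False
        while not terminated:
            # 2. Pick an action using an epsilon-greedy policy
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = np.argmax(q_table[state])

            # 3. & 4. Perform the action, observe the reward and the next state
            next_state, reward, terminated, _ = env.step(action)

            # 5. & 6. Select the best next Q-Value and update the table
            best_next = np.max(q_table[next_state])
            q_table[state, action] += alpha * (reward + gamma * best_next - q_table[state, action])

            state = next_state

    return q_table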
In one of the previous articles, you can find a simple implementation of this algorithm.
It is important to note that this type of learning can get stuck in certain scenarios which might not be the best solution for the problem. For example, the algorithm can learn that the best thing to do from state s is to perform action a’ and go to state s’. However, it never performed action a” and ended up in the state s”, which could be a better option. Basically, we can produce a kind of overfitting. For this reason, we use an additional parameter which defines whether the agent will explore actions it hasn’t performed thus far (exploration), or take the safe, already learned route (exploitation). This parameter is denoted with epsilon.
However, scaling is a big problem for Q-Learning. Environments can be complicated. Imagine how many states and possible actions there are in a video game and how big the Q-Table would have to be to handle it. This is not a good approach for the problem we are trying to solve. That is where artificial neural networks come into play.
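To get a feeling for the scale: a single 210×160 RGB screen contains 210 · 160 · 3 = 100,800 pixel values, each with 256 possible intensities, so the number of distinct screens is on the order of 256^100800 – no table could ever enumerate states like that.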
Deep Q-Learning
Deep Q-Learning harnesses the power of deep learning with so-called Deep Q-Networks, or DQNs for short. In this scenario, these networks are just standard feed-forward neural networks which are utilized for predicting the best Q-Value. In order for this approach to work, the agent has to store previous experiences in a local memory.
One of the important things to realize is that DQNs don’t use standard supervised learning. This is because the agent doesn’t have labels (expected output) provided beforehand; the target depends on the policy or value functions themselves. In a nutshell, the target is continuously changing with each iteration. Because of this, the agent doesn’t have just one neural network, but two of them. The first network, which is referred to as the Q-Network, calculates the Q-Value in the state St. The second network, referred to as the Target Network, calculates the Q-Value in the state St+1.
Speaking more formally, given the current state St, the Q-Network retrieves the action-values Q(St, a). At the same time, the Target Network uses the next state St+1 to calculate Q(St+1, a) for the Temporal Difference target. In order to stabilize the training of these two networks, on every N-th iteration the parameters of the Q-Network are copied over to the Target Network. The whole process is presented in the image above.
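Written out, the Temporal Difference target that the Q-Network is trained towards is:

target = Rt+1 + γ · max_a Q_target(St+1, a)

where Q_target denotes the output of the Target Network.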
Experience Replay
We already mentioned that the agent, in order to train the neural networks, has to store previous experiences. Deep Q-Learning takes this a step further and uses one more concept to improve performance and stabilize the training process – experience replay. In a nutshell, the agent trains the networks on random batches of experiences. Experience replay is the memory that stores those experiences in the form of a tuple <s, s’, a, r>:
- s – State of the agent
- a – Action that was taken in the state s by the agent
- r – Immediate reward received in state s for action a
- s’ – Next state of the agent after state s
The network uses random batches of <s, s’, a, r> tuples from the memory to calculate Q-Values and base backpropagation on that. The loss is calculated using the squared difference between the target Q-Value and the predicted Q-Value:
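L = ( r + γ · max_a’ Q_target(s’, a’) − Q(s, a) )²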
Note that this training is performed only on the Q-Network, while its parameters are periodically copied over to the Target Network, as we previously mentioned.
To sum it all up, the whole process of Deep Q-Learning can be broken down into several steps:
- Provide the state of the environment to the agent. The agent uses the Q-Network and the Target Network to get the Q-Values of all possible actions in the given state.
- Pick the action a, based on the epsilon value. Meaning, either select a random action (exploration) or select the action with the maximum Q-Value (exploitation).
- Perform action a
- Observe reward r and the next state s’
- Store this information in the experience replay memory as <s, s’, a, r>
- Sample random batches from experience replay memory and perform training of the Q-Network.
- Every Nth iteration, copy the weight values from the Q-Network to the Target Network.
- Repeat steps 1-7 for each time step until the episode ends, and repeat the whole process for every episode
Convolutional Neural Networks
Ok, Deep Q-Learning is a cool way to solve things, but can we go one step further? Can we teach our agent to play a video game? Can we teach it to do so the way we do it, by “looking” at the video? That is where a special kind of neural network, the Convolutional Neural Network, comes in. What we want to do is make the learning agent use frames from the video game as its input, and learn how to play from those.
In one of the previous articles, we had a chance to examine how Convolutional Neural Networks work. We covered the layers of these networks and their functionalities. To sum it up, the additional layers of a Convolutional Neural Network are used to preprocess the image and put it into a form that a standard neural network can work with. The first step in doing so is detecting certain features or attributes in the input image. This is done by the convolutional layer.
This layer uses filters to detect low-level features, like edges and curves, as well as higher-level features, like a face or a hand. Then the Convolutional Neural Network uses additional layers to introduce non-linearity into the processing, without which the whole network would behave like a simple linear model. After that, additional layers for compressing the image and flattening the data are used. Finally, this information is passed into a neural network, called the Fully-Connected Layer in the world of Convolutional Neural Networks. However, the goal of this article is to show you how to incorporate these concepts into Q-Learning, so more details about these layers, how they work and what the purpose of each of them is, can be found here. If you are interested in how to implement a simple Convolutional Neural Network, check this article here.
Implementation
Technologies
In order to run the code from this article, you have to have Python 3 installed on your local machine. In this example, to be more specific, we are using Python 3.7. The implementation itself is done using TensorFlow 2.0. The complete guide on how to install and use TensorFlow 2.0 can be found here.
Also, you have to install OpenAI Gym, or to be more specific, its Atari environments. You can install them by running:
pip install gym[atari]
If you are using Windows, the installation is not as straightforward, so you can follow this article in order to do it correctly.
OpenAI Gym has its own API and way of working. However, we will not go in depth into how the interaction with the environment is done from the code; we will only mention a few important topics here and there. Because of this, we strongly suggest you check out this article if you are not familiar with the concepts and the API of OpenAI Gym.
There is one more module you need to install in order for the code from this article to work, and that is the tqdm library. It is not doing anything essential – it is just there for cosmetic purposes (progress bars).
Environment
In this article, we use a simple environment called BreakoutDeterministic-v4. It is the classic Atari 2600 game Breakout. The observation is an RGB image of the screen, which is an array of shape (210, 160, 3). Each action is repeatedly performed for a duration of k frames; in the Deterministic variant of the environment k is fixed at 4 (in the non-deterministic variants, k is uniformly sampled from {2, 3, 4}).
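As a quick sanity check, a hedged snippet like the following can be used to inspect these properties; the printed values are what Gym reports for this environment (Breakout exposes a Discrete(4) action space and the (210, 160, 3) observation shape mentioned above):

import gym

env = gym.make("BreakoutDeterministic-v4")
observation = env.reset()

print(observation.shape)                     # (210, 160, 3)
print(env.action_space)                      # Discrete(4)
print(env.unwrapped.get_action_meanings())   # ['NOOP', 'FIRE', 'RIGHT', 'LEFT']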
Code
As usual with Python implementations, we first import the necessary libraries and modules:
import numpy as np
import random
from collections import deque
import matplotlib.pyplot as plt
from PIL import Image
import imageio
import os
import gym
from tqdm import tqdm

from tensorflow.keras import Model, Sequential
from tensorflow.keras.layers import Dense, Conv2D, Flatten, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import Huber
Ok, so we imported some common libraries: numpy for calculations, random for generating random values and deque for the experience replay memory. The second section of imports covers modules used for image manipulation. The final section covers TensorFlow modules. Note that we have imported the layers needed for convolutional calculations, Conv2D and Flatten. Apart from that, notice that we imported the Huber loss. This idea first came from DeepMind:
We also found it helpful to clip the error term from the update […] to be between -1 and 1. Because the absolute value loss function |x| has a derivative of -1 for all negative values of x and a derivative of 1 for all positive values of x, clipping the squared error to be between -1 and 1 corresponds to using an absolute value loss function for errors outside of the (-1,1) interval. This form of error clipping further improved the stability of the algorithm.
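For reference, the Huber loss with δ = 1 (the default in Keras) behaves exactly like this: it is quadratic for small errors and linear for large ones,

L(x) = ½x² for |x| ≤ 1, and L(x) = |x| − ½ otherwise,

which is equivalent to clipping the gradient of the squared-error term to the interval (-1, 1).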
The Huber loss, or Huber function, is great for these situations. Now let’s load and instantiate the environment:
GAME_NAME = "BreakoutDeterministic-v4"
NUMBER_OF_FRAMES = 5

enviroment = gym.make(GAME_NAME).env
In order to ease up image manipulation, we use the ImageProcessor class, accessed through an instance called img_processor in the code below (a hedged sketch of it follows the list). This class has several functions:
- plot_frames – Plots the passed frames.
- resize_and_grayscale – Frames from the environment have dimensions 210x160x3. We want to simplify that, so our convolutional neural network gets better results. Using this function we resize the frames and convert them to grayscale, i.e. we change their dimensions to 84x84x1.
- process_env_state – Does the same as the previous function, but additionally casts the image into a NumPy array.
- save_frame – Saves current state of the environment into defined location.
- makegif – Generates .gif file from the images at a certain location.
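The full ImageProcessor implementation is not listed in this article. Still, a minimal sketch of the core methods could look like this – the 84×84 target size comes from the description above, while the use of PIL for resizing and the exact method bodies are assumptions, not the author’s original code:

import numpy as np
from PIL import Image
import matplotlib.pyplot as plt

class ImageProcessor(object):
    def resize_and_grayscale(self, frame):
        # Convert the RGB frame to grayscale and resize it to 84x84
        return Image.fromarray(frame).convert("L").resize((84, 84))

    def process_env_state(self, state):
        # Same as above, but returned as a float NumPy array with a channel axis
        image = self.resize_and_grayscale(state)
        return np.asarray(image, dtype=np.float32).reshape(84, 84, 1)

    def plot_frames(self, frames, gray=False):
        # Plot the passed frames side by side
        _, axes = plt.subplots(1, len(frames), figsize=(3 * len(frames), 3))
        for axis, frame in zip(axes, frames):
            axis.imshow(frame, cmap="gray" if gray else None)
            axis.axis("off")
        plt.show()

The save_frame and makegif helpers (built on imageio) are omitted here for brevity.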
Thanks to the ImageProcessor class, we are able to explore the environment. First we can plot several states from the environment to see how they look in their original form:
img_processor = ImageProcessor()  # helper instance used throughout the rest of the code

enviroment.reset()
frames = []

for _ in range(NUMBER_OF_FRAMES):
    enviroment.step(enviroment.action_space.sample())
    frames.append(enviroment.ale.getScreenRGB())

img_processor.plot_frames(frames)
The output looks like this:
Then we can see how it looks when this data is prepared for the agent:
enviroment.reset()
frames = []

for _ in range(NUMBER_OF_FRAMES):
    enviroment.step(enviroment.action_space.sample())
    frame = img_processor.resize_and_grayscale(enviroment.ale.getScreenRGB())
    frames.append(frame)

img_processor.plot_frames(frames, gray=True)
The output of the code from above:
Finally, we implement the agent:
class Agent(object):
    def __init__(self, enviroment, optimizer, image_shape):
        # Initialize attributes
        self._action_size = enviroment.action_space.n
        self._optimizer = optimizer
        self._image_shape = image_shape
        self.enviroment = enviroment
        self.expirience_replay = deque(maxlen=100000)

        # Initialize discount and exploration rate
        self.gamma = 0.6
        self.epsilon = 0.1

        # Build networks
        self.q_network = self._build_compile_model()
        self.target_network = self._build_compile_model()
        self.alighn_target_model()

    def store(self, state, action, reward, next_state, terminated):
        # Save an experience tuple into the replay memory
        self.expirience_replay.append((state, action, reward, next_state, terminated))

    def _update_epsilon(self):
        # Note: assumes self.epsilon_decay and self.epsilon_min are set; not used in this example
        self.epsilon -= self.epsilon_decay
        self.epsilon = max(self.epsilon_min, self.epsilon)

    def _build_compile_model(self):
        model = Sequential()
        model.add(Conv2D(32, 8, strides=(4, 4), padding="valid", activation="relu",
                         input_shape=self._image_shape))
        model.add(Conv2D(64, 4, strides=(2, 2), padding="valid", activation="relu",
                         input_shape=self._image_shape))
        model.add(Conv2D(64, 3, strides=(1, 1), padding="valid", activation="relu",
                         input_shape=self._image_shape))
        model.add(Flatten())
        model.add(Dense(512, activation="relu"))
        model.add(Dense(self._action_size))

        huber = Huber()
        model.compile(loss=huber,
                      optimizer=self._optimizer,
                      metrics=["accuracy"])
        return model

    def alighn_target_model(self):
        # Copy the Q-Network weights over to the Target Network
        self.target_network.set_weights(self.q_network.get_weights())

    def act(self, frame):
        # Epsilon-greedy action selection
        if np.random.rand() <= self.epsilon:
            return self.enviroment.action_space.sample()

        frame = np.expand_dims(np.asarray(frame).astype(np.float64), axis=0)
        q_values = self.q_network.predict(frame)
        return np.argmax(q_values[0])

    def retrain(self, batch_size):
        minibatch = random.sample(self.expirience_replay, batch_size)

        for state, action, reward, next_state, terminated in minibatch:
            state = np.expand_dims(np.asarray(state).astype(np.float64), axis=0)
            next_state = np.expand_dims(np.asarray(next_state).astype(np.float64), axis=0)

            target = self.q_network.predict(state)

            if terminated:
                target[0][action] = reward
            else:
                t = self.target_network.predict(next_state)
                target[0][action] = reward + self.gamma * np.amax(t)

            self.q_network.fit(state, target, epochs=1, verbose=0)
This implementation is almost the same as the one we built for Deep Q-Learning. However, that is a lot of code, so let’s split it up and explore some important parts of it. The whole process is initialized inside of the constructor:
def __init__(self, enviroment, optimizer, image_shape):
    # Initialize attributes
    self._action_size = enviroment.action_space.n
    self._optimizer = optimizer
    self._image_shape = image_shape
    self.enviroment = enviroment
    self.expirience_replay = deque(maxlen=100000)

    # Initialize discount and exploration rate
    self.gamma = 0.6
    self.epsilon = 0.1

    # Build networks
    self.q_network = self._build_compile_model()
    self.target_network = self._build_compile_model()
    self.alighn_target_model()
The only difference from the previous implementation is that this one has a larger memory for experience replay. The rest is pretty much the same. First we store the size of the action space and the shape of the input image based on the environment object that is passed to this agent. We also store the optimizer and create the experience replay memory. Then we build the Q-Network and the Target Network with the _build_compile_model method and align their weights with the alighn_target_model method. The _build_compile_model method is probably the most interesting one in this whole block of code, because it contains the core of the implementation. This time we create a convolutional neural network:
def _build_compile_model(self):
    model = Sequential()
    model.add(Conv2D(32, 8, strides=(4, 4), padding="valid", activation="relu",
                     input_shape=self._image_shape))
    model.add(Conv2D(64, 4, strides=(2, 2), padding="valid", activation="relu",
                     input_shape=self._image_shape))
    model.add(Conv2D(64, 3, strides=(1, 1), padding="valid", activation="relu",
                     input_shape=self._image_shape))
    model.add(Flatten())
    model.add(Dense(512, activation="relu"))
    model.add(Dense(self._action_size))

    huber = Huber()
    model.compile(loss=huber,
                  optimizer=self._optimizer,
                  metrics=["accuracy"])
    return model
Nothing groundbreaking, right? Just several Conv2D layers, followed by Flatten and a couple of Dense layers. The whole exploration-exploitation concept we mentioned in the previous chapter is handled inside of the act function. Based on the epsilon value we either invoke the Q-Network to make a prediction, or we pick a random action. Like this:
def act(self, frame):
    # Exploration: with probability epsilon pick a random action
    if np.random.rand() <= self.epsilon:
        return self.enviroment.action_space.sample()

    # Exploitation: pick the action with the highest predicted Q-Value
    frame = np.expand_dims(np.asarray(frame).astype(np.float64), axis=0)
    q_values = self.q_network.predict(frame)
    return np.argmax(q_values[0])
Finally, let’s take a look at the retrain method. In this method we pick random samples from the experience replay memory and train the Q-Network:
def retrain(self, batch_size):
    minibatch = random.sample(self.expirience_replay, batch_size)

    for state, action, reward, next_state, terminated in minibatch:
        state = np.expand_dims(np.asarray(state).astype(np.float64), axis=0)
        next_state = np.expand_dims(np.asarray(next_state).astype(np.float64), axis=0)

        # Predicted Q-Values for the current state from the Q-Network
        target = self.q_network.predict(state)

        if terminated:
            target[0][action] = reward
        else:
            # The TD target uses the Target Network on the next state
            t = self.target_network.predict(next_state)
            target[0][action] = reward + self.gamma * np.amax(t)

        self.q_network.fit(state, target, epochs=1, verbose=0)
Now almost all preparations are done. Let’s create an agent object and initialize everything before training:
optimizer = Adam(learning_rate=0.01)

# Use the shape of the processed frame, since that is what the networks will receive
state = img_processor.process_env_state(enviroment.reset())
agent = Agent(enviroment, optimizer, state.shape)

batch_size = 32
num_of_episodes = 1000
timesteps_per_episode = 1000

agent.q_network.summary()
From the output of this code sample, we can see the structure of the networks inside the agent:
Now let’s run the training using the steps explained in the Deep Q-Learning chapter:
for e in tqdm(range(0, num_of_episodes)):
    # Reset the enviroment and preprocess the first frame
    state = img_processor.process_env_state(enviroment.reset())

    # Initialize variables
    reward = 0
    terminated = False

    for timestep in range(timesteps_per_episode):
        # Pick an action
        action = agent.act(state)

        # Take the action and preprocess the resulting frame
        next_state, reward, terminated, info = enviroment.step(action)
        next_state = img_processor.process_env_state(next_state)
        agent.store(state, action, reward, next_state, terminated)

        state = next_state

        if terminated:
            agent.alighn_target_model()
            break

        if len(agent.expirience_replay) > batch_size:
            agent.retrain(batch_size)
Once this is done, we can run the evaluation:
total_epochs, total_penalties = 0, 0
num_of_episodes = 10
enviroment.reset()
counter = 0

for e in range(num_of_episodes):
    state = enviroment.reset()
    state = img_processor.process_env_state(state)

    epochs = 0
    penalties = 0
    reward = 0
    total_reward = 0
    terminated = False

    for timesteps in tqdm(range(timesteps_per_episode)):
        action = agent.act(state)
        state, reward, terminated, info = enviroment.step(action)
        state = img_processor.process_env_state(state)

        if reward == -10:
            penalties += 1

        total_reward += reward
        epochs += 1

        # Save the current frame so we can build a gif afterwards
        img_processor.save_frame("images/frame_{}.png".format(counter))
        counter += 1

    total_penalties += penalties
    total_epochs += epochs

img_processor.makegif("images/")

print("**********************************")
print("Done!")
print("**********************************")
Finally, we got some results. They are pretty cool:
Conclusion
In this article we got a chance to go one step further and merge convolutional neural networks into the existing Deep Q-Learning process. This way we based the behavior of the learning agent on visual input, not on some arbitrary state number. We also got a chance to see how this agent behaves in one of the Atari 2600 games. There are several improvements that we didn’t mention here, like Dual Networking, but that is a topic for some other time.
Thanks for reading!
Read more posts from the author at Rubik’s Code.