Edward observed his cats as they tried to escape from home-made puzzle boxes. The puzzles were simple: all a cat had to do was pull a string or push a pole, and it was out. When first confronted with a puzzle, the cats took a long time to solve it. However, when faced with the same or a similar problem again, they were able to solve it and escape much faster. He came up with the term law of effect, which states:

Responses that produce a satisfying effect in a particular situation become more likely to occur again in that situation, and responses that produce a discomforting effect become less likely to occur again in that situation.

Oh yeah, did I forget to mention that it is 1898 and that we are talking about psychologist Edward L. Thorndike? Around the same time, Ivan Pavlov, who won the Nobel Prize in 1904 for his work on the physiology of digestion, experimented with his dogs and developed the theory of classical conditioning. He noticed that his dogs drooled when they saw the person who fed them, even when there was no food.

Later, in the 20th century, B.F. Skinner built on both of these approaches and invented the operant conditioning chamber, or “Skinner Box”. Unlike Edward Thorndike’s puzzles, this box gave subjects (typically rats or pigeons) only one or two simple, repeatable options. Using data from these experiments, he and his colleagues defined operant conditioning as a learning process in which the strength of a behavior is modified by reinforcement or punishment.

Why are we talking about all this? What does it mean to us, except that we need pets if we want to become famous psychologists? And what does it have to do with artificial intelligence? Well, these topics explore a type of learning in which a subject interacts with its environment. This is the way we humans learn as well. When we were babies, we experimented. We performed some actions and got a response from the environment. If the response was positive (a reward), we repeated those actions; otherwise (a punishment), we stopped doing them. In this article, we will explore reinforcement learning, a type of learning inspired by this goal-directed learning from interaction.

Reinforcement Learning

We have written about many types of machine learning on this site, mainly focusing on supervised and unsupervised learning. Reinforcement learning has a different scope: it tries to solve a different kind of problem. It observes an agent that performs certain actions in an environment and models the agent’s behavior based on the rewards it gets for those actions. In that respect it differs from both of the aforementioned types of learning.

In supervised learning, an agent learns how to map certain inputs to some output. It can learn this because, during the learning process, it is provided with training inputs and the labeled expected outputs for those inputs. Using this approach we are able to solve many types of problems, mostly ones that are classification or regression problems in nature. This is an important type of learning, and it is the approach most commonly used commercially today.

[Figure: Supervised learning]

Another type of learning is unsupervised learning. In this type of learning, the agent is provided only with input data, and it needs to make some sort of sense out of it. The agent is basically trying to find patterns in otherwise unstructured data. This type of learning is usually used for clustering and similar grouping problems. One might be tempted to think that reinforcement learning is the same as unsupervised learning, because neither is given expected results during the learning process. However, they are conceptually different: in reinforcement learning the agent is trying to maximize the reward it gets, not to find hidden patterns.

As you can see, neither of these types of learning solves the problem of interaction with the environment the way reinforcement learning does. That is why this type of learning is considered the third paradigm of machine learning. Many authors actually consider it the future of AI. That may well be true if you consider that learning from interaction is the tool natural selection itself relies on.

[Figure: Reinforcement learning]

Ok, we spent a lot of time talking about what reinforcement learning isn’t, so let’s now see what it is. As mentioned previously, this type of learning implies an interaction between an active agent and its environment. The agent tries to achieve a defined goal within that environment. Every action of the agent puts the environment in a different state and affects the future options and opportunities of the agent. Since the effects of an action cannot be fully predicted, because of the uncertainty of the environment, the agent must monitor the environment. You can already identify some of the main elements of reinforcement learning. However, there are a couple of hidden ones.


So far, we have talked a lot about the agent and the environment. Apart from that, we mentioned that the agent performs actions that change the state of the environment. There are a few additional elements of reinforcement learning that explain this process in more detail. They are:

  • Reward
  • Policy
  • Value Function

Let’s look at each of them in more detail.


Reward

The reward defines the goal of the agent. In essence, we split the whole process into time steps and assume that in every time step the agent performs some action. After every action, the environment changes its state and gives the agent a reward in the form of a number. This number depends on the agent’s action in time step t, but also on the state of the environment in time step t. This way the agent can influence the reward in two ways: directly, through its actions, or indirectly, by changing the state of the environment. In a nutshell, rewards define whether an action is good or bad, and the agent tries to maximize the total reward it receives over time.


Policy

The policy is the core element of reinforcement learning. It defines the action that the agent is going to perform in a certain environment state. Essentially, it maps states of the environment to actions of the agent. This element got its inspiration from psychology as well: it corresponds to a “set of stimulus-response rules”. At first glance, it may seem that a policy is just a simple lookup table or function. However, things can be much more complicated than that, and policies can be stochastic, meaning that they assign a probability to each action in a state instead of picking a single one.
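To make the difference between a table-like policy and a stochastic one concrete, here is a minimal Python sketch. The state names, action names and probabilities are made up for illustration only; they do not come from any particular problem or library.

```python
import random

# Hypothetical states and actions, chosen only for illustration.
# A deterministic policy: a plain lookup table from state to action.
deterministic_policy = {
    "hungry": "eat",
    "tired": "sleep",
}

# A stochastic policy: for every state, a probability for each action.
stochastic_policy = {
    "hungry": {"eat": 0.9, "sleep": 0.1},
    "tired": {"eat": 0.2, "sleep": 0.8},
}


def act_deterministic(state):
    """Return the single action the table prescribes for this state."""
    return deterministic_policy[state]


def act_stochastic(state, rng=random):
    """Sample an action according to the probabilities for this state."""
    actions = list(stochastic_policy[state].keys())
    weights = list(stochastic_policy[state].values())
    return rng.choices(actions, weights=weights, k=1)[0]
```

Calling `act_deterministic("hungry")` always returns `"eat"`, while `act_stochastic("hungry")` returns `"eat"` most of the time and `"sleep"` occasionally.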

Value Function

We already established that in each time step the agent gets a reward based on its action and the state of the environment in the previous time step. The agent sees the reward as the immediate desirability of being in a certain state. If we make an analogy with humans, the reward is a short-term goal. Unlike the reward, the value function defines a long-term goal for the agent. The value of a state is the amount of reward the agent can expect to accumulate in the future, starting from that state. In other words, the value is a prediction of future rewards, and the agent sees it as the long-term desirability of being in a certain state.

From the agent’s point of view, rewards are primary, while values are secondary: values are only predictions of rewards, and because the environment is stochastic those predictions are uncertain. However, the decision about which action to take next is always based on values. The agent always tries to get into the state with the highest value, since this means it will collect more reward in the long run. Just like we humans do.
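A tiny worked example may help separate the two ideas. The sketch below uses made-up numbers: from a start state the agent can move to state "A" (immediate reward 1, nothing afterwards) or to state "B" (immediate reward 0, but a reward of 10 one step later). Values here are plain sums of future rewards without any discounting, which is a simplifying assumption of this sketch.

```python
# Hypothetical numbers for illustration only.
# Immediate reward for moving from the start state into each state.
immediate_reward = {"A": 1, "B": 0}

# Rewards the agent will collect after it is already in each state.
future_rewards = {"A": [0, 0], "B": [10, 0]}

# Value of a state: total reward that can be accumulated from it onward
# (undiscounted sum, a simplifying assumption for this sketch).
value = {state: sum(rewards) for state, rewards in future_rewards.items()}

# Choosing by immediate reward alone prefers "A".
best_by_reward = max(immediate_reward, key=immediate_reward.get)

# Choosing by immediate reward plus the value of the next state prefers "B".
best_by_value = max(value, key=lambda s: immediate_reward[s] + value[s])

print(best_by_reward, best_by_value)  # prints "A B"
```

An agent that looks only at the immediate reward picks "A" and ends up with a total of 1, while an agent that takes values into account picks "B" and ends up with 10.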

Markov Decision Processes

Now that we know the elements, we have a better picture of reinforcement learning. The agent interacts with the environment in discrete time steps by applying an action in every step. Based on that action, the environment changes its state and gives a reward in numerical form. The agent uses the value function to come up with a policy that maximizes the reward over time. This is the problem reinforcement learning is trying to solve.
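The interaction loop described above can be sketched in a few lines of Python. The `CoinFlipEnvironment` class, its `step` method and the two example policies are hypothetical, written only for this illustration; they are not taken from any library.

```python
import random


class CoinFlipEnvironment:
    """Toy environment (hypothetical): the state is the face of a coin."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = self.rng.choice(["heads", "tails"])

    def step(self, action):
        """Apply the action and return (next_state, reward)."""
        # Reward is 1 if the action matches the current state, otherwise 0.
        reward = 1 if action == self.state else 0
        # The environment then moves to a new random state.
        self.state = self.rng.choice(["heads", "tails"])
        return self.state, reward


def run_episode(env, policy, steps=10):
    """Let the agent interact with the environment for a fixed number of steps."""
    total_reward = 0
    state = env.state
    for _ in range(steps):
        action = policy(state)            # agent picks an action for the state
        state, reward = env.step(action)  # environment changes state, gives reward
        total_reward += reward            # agent accumulates the reward
    return total_reward


def always_heads(state):
    """A trivial policy that ignores the state."""
    return "heads"


def copy_state(state):
    """A policy that uses the observed state; in this toy it is always rewarded."""
    return state
```

Running `run_episode(CoinFlipEnvironment(), copy_state)` collects the maximum reward of 10, whereas `always_heads` is rewarded only in the steps where the coin happens to show heads.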

In order to represent this mathematically, we use a framework called Markov Decision Processes (MDPs). Almost all reinforcement learning problems can be formalized using this framework. Generally speaking, MDPs are used for modeling decision making in which the result of a decision is partly random and partly under the control of the decision maker. This is a perfect fit for reinforcement learning. In this article, we consider finite MDPs, meaning that the number of states the environment can be in and the number of actions the agent can take are both finite.

[Figure: Interaction in reinforcement learning]

A Markov Decision Process is a tuple of four elements (S, A, Pa, Ra):

  • S – Represents the set of states. At each time step t, the agent gets the environment’s state – St, where St ∈ S.
  • A – Represents the set of actions that the agent can take. At each time step t, based on the received state St, the agent makes the decision to perform an action – At, where At ∈ A(St). A(St) represents the set of possible actions in the state St.
  • Pa – Or more precisely Pa(s, s’), represents the probability that action a, taken in state s at time step t, will result in state s’ at time step t+1.
  • Ra – Or more precisely Ra(s, s’), represents the expected reward received after going from state s to state s’ as a result of action a.
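As a rough illustration of the four elements listed above, a finite MDP can be stored in plain Python dictionaries. The two-state example below, including its state names, action names, probabilities and rewards, is invented for this sketch.

```python
# A tiny, made-up finite MDP with two states.
states = ["low", "high"]                      # S
actions = {"low": ["wait", "charge"],         # A(s): actions available in s
           "high": ["wait", "search"]}

# P[(s, a)][s_next] = probability of reaching s_next after taking a in s.
P = {
    ("low", "wait"): {"low": 1.0},
    ("low", "charge"): {"high": 1.0},
    ("high", "wait"): {"high": 1.0},
    ("high", "search"): {"high": 0.7, "low": 0.3},
}

# R[(s, a, s_next)] = expected reward for that transition.
R = {
    ("low", "wait", "low"): 0.0,
    ("low", "charge", "high"): 0.0,
    ("high", "wait", "high"): 1.0,
    ("high", "search", "high"): 5.0,
    ("high", "search", "low"): 5.0,
}


def check_mdp(states, actions, P, R):
    """Verify that the dictionaries describe a consistent finite MDP."""
    for s in states:
        for a in actions[s]:
            next_probs = P[(s, a)]
            # Transition probabilities out of (s, a) must sum to 1.
            assert abs(sum(next_probs.values()) - 1.0) < 1e-9
            for s_next in next_probs:
                # Every reachable state must be a known state with a reward.
                assert s_next in states
                assert (s, a, s_next) in R
    return True
```

Keying the transition table by (state, action) pairs keeps A(s) explicit: an action that is not available in a state simply has no entry.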

As we mentioned, the problem we formalize with an MDP is the search for the best policy for the agent. The policy is defined as a function π that maps each state s ∈ S and action a ∈ A(s) to the probability π(a|s) of taking action a when in state s.

Using the policy function π, we can define the value of a state s, Vπ(s), as the expected reward accumulated when starting in state s and following policy π.
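One standard way to write this formally is shown below. It assumes a discount factor γ ∈ [0, 1], which is not part of the four-element tuple above, and writes R_{t+k+1} for the reward received at time step t+k+1; both are notational assumptions of this sketch.

```latex
V_\pi(s) = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s \right]
```

With γ = 1 this is simply the expected sum of all future rewards; with γ < 1, rewards further in the future count for less.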

This function Vπ(s) is called the state-value function for policy π. Similarly, we can define the value of taking action a in state s under policy π, qπ(s, a), as the expected reward that the agent will get if it starts from state s, takes action a and follows policy π afterwards.
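A standard way to write the action value formally is shown below. As an assumption of this sketch, γ ∈ [0, 1] denotes a discount factor and R_{t+k+1} the reward received at time step t+k+1; neither symbol appears in the tuple definition above.

```latex
q_\pi(s, a) = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s,\ A_t = a \right]
```

The only difference from the value of a state is the extra condition A_t = a: the first action is fixed, and the policy π is followed from the next step onward.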


Conclusion

In this article, we explored the basic concepts of reinforcement learning. We defined its most important elements and formulated the reinforcement learning problem mathematically using Markov Decision Processes. Of course, since this is a huge topic, we didn’t cover everything about this type of learning and the problems it faces, so we encourage you to explore these topics in more detail. Following this path, in the next article we will explore one solution for the problems presented here: Q-Learning.

Thank you for reading!

Read more posts from the author at Rubik’s Code.