A couple of days ago a friend of mine, who had started exploring deep learning, asked me “Hey man, can you explain this cross-entropy thing to me?”. Now, that is a tough question, because this topic has never sat right with me. My self-doubt kicked in, so in my mind the question actually sounded more like “Can you explain cross-entropy to yourself?”. Even in my conference talks, I usually avoid mentioning it. Every time I start talking about it, I get all confused and convoluted. You know the saying often attributed to Einstein: “If you can’t explain it simply, you don’t understand it well enough.” So my anxiety kicked in as well. However, after a couple of iterations, I was able to explain the concept to my friend and even grasp it better myself. I was really happy that I managed to pull it off. Jokingly she said “You should write a blog post about it.”, which I thought was actually a good idea. So, in this article, you will learn what cross-entropy is and how we use it in machine learning and deep learning.

In general, the usual goal of machine learning and deep learning models is to solve classification and regression problems. When we are talking about classification, during the training process the model learns how to map inputs to probabilistic predictions. As you probably already know, during the training process in supervised learning, the model incrementally adjusts its parameters so that its predictions get closer to the expected values, i.e. closer to the ground truth. For example, let’s consider a dataset that contains 3 classes of images: snake, plane and Samuel L. Jackson.

Each image is labeled using the one-hot encoding, meaning classes are mutually exclusive.

Snake: [1 0 0]
Plane: [0 1 0]
Samuel L. Jackson: [0 0 1]
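
The labeling above can be sketched in a few lines of Python. This is only an illustration: the `CLASSES` list and the `one_hot` helper are names made up for this example, not part of any library.

```python
# Hypothetical class names, in the same order as the label vectors above.
CLASSES = ["snake", "plane", "samuel_l_jackson"]


def one_hot(label, classes=CLASSES):
    """Return a list with 1 at the position of `label` and 0 everywhere else."""
    if label not in classes:
        raise ValueError(f"unknown label: {label}")
    return [1 if c == label else 0 for c in classes]


plane_label = one_hot("plane")  # [0, 1, 0]
```

Because exactly one position is set to 1, a one-hot vector is also a valid probability distribution that puts all of its mass on the true class.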

Now, while we are training a model, we give images as inputs and get an array of probabilities as output. In this particular example, if we put an image of a plane into our model, we get an output with three numbers, each representing the probability of a single class, e.g. y’ = [ 0.4 0.5 0.1 ]. This differs from the expected value y = [ 0 1 0 ]. To get better, the model changes its parameters so that y’ moves closer to y. However, this leaves us with several questions, like “What does getting better actually mean?”, “What is the measure or quantity that tells me how far y’ is from y?” and “How much should I tweak the parameters in my model?”. Cross-entropy is one possible tool for this. It is a single number that tells us how badly our model is doing, and its gradient with respect to the parameters tells us in which “direction” we should tweak them.


Back in 1948, mathematician and electrical engineer Claude Shannon was trying to figure out ways to send messages without losing any information. He was thinking in terms of average message length, meaning that he tried to encode a message using the smallest number of bits. Apart from that, he assumed that the decoder should be able to restore the message losslessly, meaning there should be no loss of information at all. That is how he introduced the concept of entropy in his paper “A Mathematical Theory of Communication”.

Entropy is defined as the minimum average encoding size per transmission with which a source can send messages to a destination without losing any information in the process. Mathematically, we can use a probability distribution P to define entropy (denoted as H). If we are talking about categorical variables, the formula looks like this:

$$H(P) = -\sum_{x} P(x) \log_2 P(x)$$

When we consider quantitative variables, we use the integral form:

$$H(P) = -\int P(x) \log_2 P(x) \, dx$$

Here x is a quantitative (continuous) variable, and P(x) is its probability density function.
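
To make the categorical form concrete, here is a minimal sketch of entropy for a discrete distribution. It assumes the logarithm is taken in base 2, so the result is in bits, and it uses the usual convention that terms with zero probability contribute nothing. The function name `entropy` is chosen just for this example.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a discrete distribution given as a list of probabilities."""
    # Terms with probability 0 are skipped (convention: 0 * log 0 = 0).
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)


fair_coin = entropy([0.5, 0.5])      # 1 bit: one yes/no question per toss
biased_coin = entropy([0.9, 0.1])    # less than 1 bit: the outcome is more predictable
certain = entropy([0, 1, 0])         # 0 bits: a one-hot label carries no uncertainty
```

The more predictable the source, the fewer bits we need on average, which is exactly what the formula expresses.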


Hopefully you can see where we are going with this. In our example from the beginning of the article, as an output we get the probabilities of each image class for the given input, i.e. we get a probability distribution. This can be viewed as our encoding tool. Basically, we use a probability distribution as a means to encode the input. The optimal encoding tool would be the true distribution y, whose average encoding size is the entropy. However, what we have is the distribution y’. This means that cross-entropy can be defined as the average number of bits we need to encode information coming from y using the wrong encoding tool y’. Mathematically, this can be written like this:

$$H(y, y') = -\sum_{i} y_i \log_2 y'_i$$

The other way to write this expression is using expectation:

$$H(y, y') = \mathbb{E}_{x \sim y}\left[-\log_2 y'(x)\right]$$

In H(y, y’) the expectation is taken over y, while the encoding size comes from y’. From this, we can conclude that H(y, y’) and H(y’, y) are in general not the same. When y’ = y, the calculation becomes the entropy itself. Entropy is the theoretical minimum average size, so the cross-entropy is always greater than or equal to the entropy.
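
A short sketch can illustrate both points. It assumes base-2 logarithms (bits) and predicted probabilities that are strictly positive wherever the true distribution is positive. The helper names and the distributions `p` and `q` are made up for this illustration.

```python
import math


def entropy(p):
    """Shannon entropy in bits; zero-probability terms are skipped."""
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(y, y_pred):
    """Cross-entropy H(y, y_pred) in bits.

    Terms with y_i == 0 contribute nothing. Assumes y_pred_i > 0 wherever
    y_i > 0, otherwise math.log2 raises a ValueError.
    """
    return sum(-yi * math.log2(qi) for yi, qi in zip(y, y_pred) if yi > 0)


# The plane example: only the true class contributes, so the loss is -log2(0.5).
plane_loss = cross_entropy([0, 1, 0], [0.4, 0.5, 0.1])  # 1.0 bit

# Two hypothetical distributions to show that the order of arguments matters.
p = [0.5, 0.5]
q = [0.9, 0.1]
h_pq = cross_entropy(p, q)  # encode p with a code built for q
h_qp = cross_entropy(q, p)  # encode q with a code built for p
```

Here `h_pq` and `h_qp` differ, and each is at least as large as the entropy of its first argument.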

To sum it up, the entropy of y is the optimal average encoding size, the one we would reach if our output matched y exactly. However, because our output is some other distribution y’, what we actually get is the cross-entropy, which is never smaller than the entropy. Now, all we need to do is get the difference between them so we can improve our model. Here we need to introduce one more term: Kullback–Leibler divergence.

KL Divergence

This term was introduced by Solomon Kullback and Richard Leibler back in 1951 as the directed divergence between two distributions. Kullback preferred the term discrimination information. This topic is heavily discussed in Kullback’s 1959 book, Information Theory and Statistics. Essentially, KL divergence is the difference between cross-entropy and entropy. It can be written down like this:

$$D_{KL}(y \,\|\, y') = H(y, y') - H(y) = \sum_{i} y_i \log_2 \frac{y_i}{y'_i}$$

We can say that it measures the number of extra bits we’ll need on average if we encode the output with y’ instead of with y. This value is never negative, and it is 0 only when y’ = y, so we are trying to get it as close to 0 as possible. Since the entropy of y is fixed by the labels and does not depend on the model’s parameters, minimizing cross-entropy is the same as minimizing KL divergence.
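
The relationship between the three quantities can be checked numerically. The sketch below assumes base-2 logarithms and strictly positive predictions wherever the true distribution is positive. The function names and the distributions `p` and `q` are hypothetical, picked only for this example.

```python
import math


def entropy(p):
    """Shannon entropy in bits; zero-probability terms are skipped."""
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(y, y_pred):
    """Cross-entropy H(y, y_pred) in bits; assumes y_pred_i > 0 where y_i > 0."""
    return sum(-yi * math.log2(qi) for yi, qi in zip(y, y_pred) if yi > 0)


def kl_divergence(y, y_pred):
    """KL divergence computed as cross-entropy minus entropy."""
    return cross_entropy(y, y_pred) - entropy(y)


def kl_divergence_direct(y, y_pred):
    """KL divergence computed from the sum of y_i * log2(y_i / y_pred_i)."""
    return sum(yi * math.log2(yi / qi) for yi, qi in zip(y, y_pred) if yi > 0)


p = [0.5, 0.5]
q = [0.9, 0.1]
extra_bits = kl_divergence(p, q)   # extra bits paid for using q instead of p
no_penalty = kl_divergence(p, p)   # 0.0: no extra cost when the distributions match
```

Both ways of computing the divergence agree up to floating-point error, and the result drops to zero only when the two distributions are identical.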

To sum it up using our example from the beginning: during our training process, we put the image of Samuel L. Jackson into our model. However, we don’t get the correct label for it, but some probabilities for each class of image. For example, instead of y = [ 0 0 1 ] we get something like y’ = [ 0.1 0.2 0.7 ]. This means that instead of the perfect encoding y (whose cost is the entropy), we got the imperfect encoding y’ (whose cost is the cross-entropy). Using these values we calculate the KL divergence and we aim to minimize it. That is how we know how to modify the parameters of our model.
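
Plugging in the numbers from this example gives a feel for the scale. The calculation assumes base-2 logarithms and the convention 0 · log 0 = 0. The second prediction, [ 0.05 0.05 0.9 ], is a hypothetical improved output added only for comparison.

$$H(y) = -(0 \cdot \log_2 0 + 0 \cdot \log_2 0 + 1 \cdot \log_2 1) = 0$$

$$H(y, y') = -(0 \cdot \log_2 0.1 + 0 \cdot \log_2 0.2 + 1 \cdot \log_2 0.7) = -\log_2 0.7 \approx 0.515 \text{ bits}$$

$$D_{KL}(y \,\|\, y') = H(y, y') - H(y) \approx 0.515 \text{ bits}$$

If a later training step produced y’ = [ 0.05 0.05 0.9 ], the cross-entropy would drop to $-\log_2 0.9 \approx 0.152$ bits. With one-hot labels the entropy is zero, so the cross-entropy and the KL divergence are the same number.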

Binary Cross-Entropy

What we covered so far is called categorical cross-entropy, since we considered an example with multiple classes. However, we are sure you have heard the term binary cross-entropy. When we are talking about binary cross-entropy, we are really talking about categorical cross-entropy with two classes. The two classes are mutually exclusive, so the probability of the second class is determined by the probability of the first, i.e. the distributions y and y’ can be written down as:

$$y = [\, y_1, \; 1 - y_1 \,], \qquad y' = [\, y'_1, \; 1 - y'_1 \,]$$

This in turn means that we can write down our cross-entropy as:

$$H(y, y') = -\left[\, y_1 \log_2 y'_1 + (1 - y_1) \log_2 (1 - y'_1) \,\right]$$

This is a formula that you have probably seen during your college years.
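
As a rough sketch, the binary case can be written as a function of the label and a single predicted probability. The code assumes base-2 logarithms to stay in bits (deep learning frameworks typically use the natural logarithm, which only rescales the value). The `eps` clipping is an extra assumption added here to avoid taking the log of 0, and all names are hypothetical.

```python
import math


def cross_entropy(y, y_pred):
    """Categorical cross-entropy in bits; assumes y_pred_i > 0 where y_i > 0."""
    return sum(-yi * math.log2(qi) for yi, qi in zip(y, y_pred) if yi > 0)


def binary_cross_entropy(y_true, p, eps=1e-12):
    """Binary cross-entropy in bits.

    y_true is the label of the first class (0 or 1) and p is the predicted
    probability of that class. p is clipped to [eps, 1 - eps] so that the
    logarithm is always defined (an assumption of this sketch).
    """
    p = min(max(p, eps), 1 - eps)
    return -(y_true * math.log2(p) + (1 - y_true) * math.log2(1 - p))


confident = binary_cross_entropy(1, 0.9)   # small loss: correct and confident
unsure = binary_cross_entropy(1, 0.5)      # 1.0 bit: a coin flip
wrong = binary_cross_entropy(1, 0.1)       # large loss: confident and wrong

# Same value as the two-class categorical form with y = [1, 0] and y' = [0.9, 0.1].
same_as_categorical = cross_entropy([1, 0], [0.9, 0.1])
```

The loss grows quickly as the model becomes confidently wrong, which is what makes it useful as a training signal.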


In this article, we covered the somewhat convoluted topic of cross-entropy. We explored the nature of entropy, how that concept extends into cross-entropy and what KL divergence is. Apart from that, we saw that binary cross-entropy is simply categorical cross-entropy with two classes. In general, we saw why we use this concept for calculating loss and how we can use it as a tool for making our models better.

Read more posts from the author at Rubik’s Code.
