A couple of days ago a friend of mine, who had started exploring deep learning, asked me, “Hey man, can you explain this *cross-entropy* thing to me?” Now, that is a tough question, because this topic has never sat right with me. My self-doubt kicked in, so in my mind the question actually sounded more like “Can you explain *cross-entropy* to yourself?” Even in my conference talks, I usually avoid mentioning it. Every time I start talking about it, I get all confused and convoluted. You know the quote usually attributed to Einstein, “*If you can’t explain it simply, you don’t understand it well enough.*”, so my anxiety kicked in as well. However, after a couple of **iterations**, I was able to explain the concept to my friend and even to grasp it better myself. I was really happy that I managed to pull it off. Jokingly she said, “You should write a **blog post** about it.”, which I thought was actually a good idea. So, in this article, you will learn what cross-entropy is and how we use it in machine learning and deep learning.

In general, the usual goal of machine learning and deep learning **models** is to solve classification and regression problems. When we are talking about classification, during the training process the model learns how to map inputs to probabilistic **predictions**. As you probably already know, during the training process in supervised learning, the model is **incrementally adjusting** its parameters so that predictions get closer to the expected values, i.e. closer to the ground truth. For example, let’s consider a dataset that contains 3 classes of images: snake, plane and Samuel L. Jackson.

Each image is labeled using the **one-hot encoding**, meaning classes are mutually exclusive.

| Class | Label |
|---|---|
| Snake | [1 0 0] |
| Plane | [0 1 0] |
| Samuel L. Jackson | [0 0 1] |
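The labels in the table can be produced with a few lines of Python. This is only a minimal sketch: the `CLASSES` list and the `one_hot` helper are names made up for this illustration, and the class order is an assumption that simply mirrors the table.

```python
# Hypothetical class list; the order fixes which position each class gets.
CLASSES = ["snake", "plane", "samuel_l_jackson"]


def one_hot(class_name, classes=CLASSES):
    """Return a list with 1 at the class's position and 0 everywhere else."""
    index = classes.index(class_name)
    return [1 if i == index else 0 for i in range(len(classes))]


for name in CLASSES:
    print(name, one_hot(name))
```

Exactly one position is set to 1 for every image, which is what "mutually exclusive" means in practice.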

Now, while we are training a model, we give images as inputs and get an array of **probabilities** as output. In this particular example, if we put an image of a plane into our model, we get an output with three numbers, each representing the probability of a **single class**, e.g. *y’ = [ 0.4, 0.5, 0.1 ]*. This differs from the expected value *y = [ 0 1 0 ]*. To get better, the model changes its parameters so that *y’* moves toward *y*. However, this leaves us with several **questions**, like “What does getting **better** actually mean?”, “What is the **measure** or quantity that tells me how far *y’* is from *y*?” and “How **much** should I tweak the parameters in my model?”. *Cross-entropy* is one possible solution, one possible **tool** for this. It tells us how **badly** our model is doing, and from that we can work out in which “direction” we should tweak the parameters of the model.

## Entropy

Back in 1948, mathematician and electrical engineer Claude Shannon was trying to figure out ways to send messages **without losing** any information. He was thinking in terms of an average **message length**, meaning that he tried to encode a message using the **smallest** number of bits. Apart from that, he assumed that the decoder should be able to restore that message **losslessly**, meaning there should be **no loss** of information at all. That is how he invented the concept of **entropy** in his paper “*A Mathematical Theory of Communication*”.

**Entropy** is defined as the minimum average encoding size per transmission with which a source can send messages to a destination without losing any information in the process. Mathematically, we can use a probability distribution to define entropy (denoted as *H*). If we are talking about categorical variables, that **formula** looks something like this:

$$H(P) = -\sum_{x} P(x) \log_2 P(x)$$

Here the sum runs over all possible values *x*, and the base-2 logarithm gives the result in bits.

When we consider quantitative variables, we use the **integral** form:

$$H(P) = -\int P(x) \log_2 P(x)\,dx$$

Here *x* is a quantitative variable, and *P(x)* is the probability density function.
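To make the categorical case concrete, here is a small sketch that computes entropy in bits for a few distributions. The function name `entropy` and the sample distributions are our own choices for illustration; terms with probability 0 are skipped, following the usual convention that they contribute nothing.

```python
import math


def entropy(p):
    """Shannon entropy, in bits, of a discrete distribution given as a list."""
    # Outcomes with probability 0 are skipped: they contribute nothing.
    return sum(-p_i * math.log2(p_i) for p_i in p if p_i > 0)


print(entropy([0.5, 0.5]))       # fair coin: 1 bit per toss
print(entropy([0.25] * 4))       # four equally likely outcomes: 2 bits
print(entropy([0.4, 0.5, 0.1]))  # the model output from the plane example
print(entropy([0, 1, 0]))        # a one-hot label has no uncertainty: 0 bits
```

The more evenly the probability is spread, the more bits we need on average; a one-hot label is completely certain, so its entropy is 0.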

## Cross-entropy

You can probably guess where we are going with this. In our example from the beginning of the article, the output is the probability of each class for the input image, i.e. we get a **probability distribution**. This can be seen as our **encoding tool**. Basically, we use a probability distribution as a means to encode the input. Our optimal tool would be the true distribution *y*, whose average encoding size is the entropy. However, we have the distribution *y’*. This means that *cross-entropy* can be **defined** as the average number of bits we need to encode information from *y* using the wrong encoding tool *y’*. Mathematically, this can be written like this:

$$H(y, y') = -\sum_{i} y_i \log_2 y'_i$$

The other way to write this expression is using expectation:

$$H(y, y') = \mathbb{E}_{x \sim y}\left[-\log_2 y'(x)\right]$$

*H(y, y’)* takes the **expectation** under *y* of the encoding size given by *y’*. Because the two distributions play different roles, *H(y, y’)* and *H(y’, y)* are in general **not the same**. When *y = y’*, the calculation becomes the entropy itself. Since entropy is the theoretical minimum average size, the cross-entropy is always greater than or equal to the entropy.
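The properties above are easy to check numerically. The sketch below assumes two made-up distributions, `y` and `y_pred`, picked only for illustration. It also assumes `y_pred` is strictly positive wherever `y` is, since otherwise the logarithm is undefined.

```python
import math


def cross_entropy(y, y_pred):
    """Cross-entropy H(y, y') in bits: code lengths from y', weighted by y."""
    # Assumes y_pred[i] > 0 wherever y[i] > 0; math.log2(0) would raise.
    return sum(-y_i * math.log2(q_i) for y_i, q_i in zip(y, y_pred) if y_i > 0)


y = [0.7, 0.2, 0.1]       # hypothetical true distribution
y_pred = [0.4, 0.5, 0.1]  # hypothetical model distribution

print(cross_entropy(y, y_pred))  # H(y, y')
print(cross_entropy(y_pred, y))  # H(y', y): a different number
print(cross_entropy(y, y))       # H(y, y): the entropy of y, the smallest of the three
```

For the plane example from the beginning, `cross_entropy([0, 1, 0], [0.4, 0.5, 0.1])` reduces to `-log2(0.5)`, which is exactly 1 bit.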

To sum it up, the entropy of *y* is the optimal average encoding size, the one we would get if our output matched *y* exactly. However, our model outputs some other distribution *y’*, so what we actually pay is the *cross-entropy*, which is never smaller than the entropy. Now, all we need to do is get the **difference** between them so we can improve our model. Here we need to introduce one more term: **Kullback–Leibler divergence**.

## KL Divergence

This term was introduced by Solomon Kullback and Richard Leibler back in 1951 as the **directed divergence** between two distributions. Kullback preferred the term **discrimination information**. This topic is heavily discussed in Kullback’s 1959 book, *Information Theory and Statistics*. Essentially, KL divergence is the difference between *cross-entropy* and entropy. It can be written down like this:

$$D_{KL}(y \,\|\, y') = H(y, y') - H(y) = \sum_{i} y_i \log_2 \frac{y_i}{y'_i}$$

We can say that it measures the number of **extra** bits we’ll need on average if we encode output with *y’* instead of with *y*. This value is never negative, and it is 0 only when *y’* equals *y*. The entropy of *y* is fixed by the labels and does not depend on the model, so **minimizing** *cross-entropy* is the same as **minimizing** KL divergence, pushing it as close to 0 as possible.
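Here is a small sketch of that relationship. It computes KL divergence in two ways, as cross-entropy minus entropy and directly from the ratio of probabilities, and the two should agree up to floating-point error. The function names and the two sample distributions are assumptions made for this illustration.

```python
import math


def entropy(p):
    """Entropy of p in bits."""
    return sum(-p_i * math.log2(p_i) for p_i in p if p_i > 0)


def cross_entropy(p, q):
    """Cross-entropy H(p, q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(-p_i * math.log2(q_i) for p_i, q_i in zip(p, q) if p_i > 0)


def kl_divergence(p, q):
    """KL divergence as the extra bits: cross-entropy minus entropy."""
    return cross_entropy(p, q) - entropy(p)


def kl_divergence_direct(p, q):
    """The same quantity computed directly from the probability ratios."""
    return sum(p_i * math.log2(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)


y = [0.7, 0.2, 0.1]       # hypothetical true distribution
y_pred = [0.4, 0.5, 0.1]  # hypothetical model distribution

print(kl_divergence(y, y_pred))         # extra bits paid for using y'
print(kl_divergence_direct(y, y_pred))  # same value, computed directly
print(kl_divergence(y, y))              # no extra bits when y' equals y
```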

To sum it up on our example from the beginning: during training, we put the image of Samuel L. Jackson into our model. However, we don’t get the correct label back, but some probabilities for each class. For example, instead of *y = [ 0 0 1 ]* we get something like *y’ = [ 0.1 0.2 0.7 ]*. This means that instead of the perfect encoding *y* (whose cost is the entropy), we got the imperfect encoding *y’* (whose cost is the cross-entropy). Using these values we calculate KL divergence and aim to minimize it. Here the label is one-hot, so its entropy is 0 and the KL divergence equals the cross-entropy, −log₂ 0.7 ≈ 0.51 bits. That is how we know how to modify the parameters of our model.
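To see how the loss can drive a parameter update, here is a rough sketch under assumptions the article does not make. We assume the model's output comes from a softmax over three raw scores (`logits`), and we treat those scores as the only parameters. With a softmax output and a natural-log cross-entropy, the gradient of the loss with respect to the scores is simply *y’ − y*, so plain gradient descent nudges the scores toward the label. The starting scores, the learning rate and the step count are arbitrary illustrative values. Using the natural log instead of base 2 only rescales the loss by a constant.

```python
import math


def softmax(z):
    """Turn raw scores into probabilities; subtracting the max avoids overflow."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_nats(y, y_pred):
    """Cross-entropy with the natural log (nats instead of bits)."""
    return sum(-y_i * math.log(q_i) for y_i, q_i in zip(y, y_pred) if y_i > 0)


y = [0, 0, 1]             # label for Samuel L. Jackson
logits = [0.0, 0.0, 0.0]  # hypothetical starting scores: all classes equally likely
learning_rate = 0.5       # arbitrary value for the sketch

losses = []
for step in range(20):
    y_pred = softmax(logits)
    losses.append(cross_entropy_nats(y, y_pred))
    # Gradient of the loss with respect to each score is (y' - y).
    grad = [q_i - y_i for q_i, y_i in zip(y_pred, y)]
    logits = [z - learning_rate * g for z, g in zip(logits, grad)]

print(losses[0], losses[-1])  # the loss shrinks as training proceeds
print(softmax(logits))        # probability mass has moved to the third class
```

A real model has many more parameters and computes the gradient through all of its layers, but the signal at the output is the same: the difference between *y’* and *y*.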

## Binary Cross-Entropy

What we covered so far was something called **categorical cross-entropy**, since we considered an example with multiple classes. However, you have surely heard the term **binary cross-entropy**. When we are talking about binary cross-entropy, we are really talking about categorical cross-entropy with **two classes**. The two classes are mutually **exclusive**, so each distribution is fully described by a single number:

$$y = [\,y_1,\ 1 - y_1\,], \qquad y' = [\,y'_1,\ 1 - y'_1\,]$$

This in turn means that we can write down our cross-entropy as:

$$H(y, y') = -\sum_{i=1}^{2} y_i \log_2 y'_i = -\big(y_1 \log_2 y'_1 + (1 - y_1) \log_2 (1 - y'_1)\big)$$

A formula that you have probably seen during your college years.
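A quick sketch can confirm that the binary formula is just the two-class case of the categorical one. The function names and the sample values below are made up for the illustration, and the predicted probability is assumed to lie strictly between 0 and 1. Many libraries use the natural log here; base 2 is kept to stay consistent with the bits used throughout this article.

```python
import math


def binary_cross_entropy(y, y_pred):
    """Binary cross-entropy in bits for a label y (0 or 1) and a predicted
    probability y_pred of the positive class; assumes 0 < y_pred < 1."""
    return -(y * math.log2(y_pred) + (1 - y) * math.log2(1 - y_pred))


def cross_entropy(p, q):
    """Categorical cross-entropy H(p, q) in bits."""
    return sum(-p_i * math.log2(q_i) for p_i, q_i in zip(p, q) if p_i > 0)


y, y_pred = 1, 0.8  # hypothetical label and model output

print(binary_cross_entropy(y, y_pred))
print(cross_entropy([y, 1 - y], [y_pred, 1 - y_pred]))  # same value
```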

## Conclusion

In this article, we covered the somewhat convoluted topic of **cross-entropy**. We explored the nature of **entropy**, how that concept extends into cross-entropy and what **KL divergence** is. Apart from that, we saw that binary cross-entropy is just regular cross-entropy with two classes. In general, we saw why we use this concept for **calculating loss** and how we can use it as a tool for making our models **better**.

Read more posts from the author at **Rubik’s Code**.