Going deep into Entropy, Cross-Entropy, and KL Divergence

In Machine Learning we hear a lot about Entropy, Cross-Entropy, and KL Divergence.

In the Machine Learning literature, these terms come up frequently in constructs such as:

Entropy in Decision Trees and Random Forests.

Cross-Entropy as a cost function in classification problems.

KL Divergence in Knowledge Distillation.

and many other places.

Now, if you start thinking about it, you will recollect that Entropy was taught in Information Theory back in our undergrad days, among other places.

Now, think about how and why Machine Learning uses this concept from Information Theory.

That’s me! Thinking deeply about why Machine Learning uses the concept of Entropy from Information Theory.

Today, we will go deep into these topics.

It all started when Claude Shannon wrote the seminal paper called “A Mathematical Theory of Communication” in 1948.

With this paper, he put forward a solid foundation for Information Theory.

The goal was to reliably send a message from a sender to a recipient.

In the digital age, messages are composed of bits (1 or 0). When we send information, not all bits are useful: some are redundant, some are erroneous, and so on. Shannon proposed that every bit of transferred information should reduce the recipient’s uncertainty by a factor of 2.

Let us take an example:

Suppose you have appeared for a Machine Learning exam and are wondering whether you passed or failed. You called the teacher to enquire about the result, and the teacher informed you that you have passed. Now, it does not matter how this information was conveyed (“You have Passed”, “Pass”, “P”, 1, etc.): it reduced your uncertainty by a factor of 2 (considering both outcomes were equally likely). We can say that the teacher sent you 1 bit of information because it reduced your uncertainty by a factor of 2.

Now, consider a case where you would also like to know your actual grade, which has eight possible values (3, 4, 5, 6, 7, 8, 9, 10).

Since you are studying Entropy and Cross-Entropy, I assume you will get a grade of 3 or higher :)

Now, if your teacher informs you of your grade, he reduces your uncertainty by a factor of 8, i.e. 2³. This can be considered as 3 bits of useful information. The number of bits communicated can be calculated by taking the binary logarithm of the uncertainty-reduction factor, e.g. log2(8) = 3.
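To make this concrete, here is a minimal Python sketch (just the standard math module) that computes the bits from the uncertainty-reduction factor:

```python
import math

# Bits of information = binary log of the uncertainty-reduction factor.
print(math.log2(2))  # pass/fail, both equally likely -> 1.0 bit
print(math.log2(8))  # 8 equally likely grades        -> 3.0 bits
```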

Consider a different case where you have prepared well for the exam and now the probability of passing is 75% and the probability of failing is 25%. Given the above conditions, if your teacher informs you that you have failed, any guesses on how many bits of information were conveyed?

Answer: 2 bits (log2(4)), because he reduced your uncertainty by a factor of 4 (1/0.25).

Note: the uncertainty-reduction factor is the inverse of the event’s probability.

The number of bits transferred can also be written as -log2(P(x)).

If the teacher informs you that you have passed, the number of bits transferred will be -log2(0.75) ≈ 0.41.

Given these conditions, on average your teacher is transferring:

0.75*0.41 + 0.25*2 = 0.81 bits
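As a quick sanity check, a couple of lines of Python reproduce this number:

```python
import math

p_pass, p_fail = 0.75, 0.25
# Average information = sum over outcomes of P(x) * (-log2(P(x)))
avg_bits = p_pass * -math.log2(p_pass) + p_fail * -math.log2(p_fail)
print(round(avg_bits, 2))  # 0.81
```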

This quantity is nothing but Entropy.

Formally, Entropy can be defined as the average amount of information that you get from a random sample drawn from a probability distribution. It is a measure of how unpredictable the distribution is. To illustrate, let’s say Geoffrey Hinton appears for that exam. The probability of him passing is 1, so the entropy in this case will be 0 (I guess you get the idea of where I am going with this :))
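Wrapping this up as a small helper function (a minimal sketch) makes the definition, and the Hinton example, easy to play with:

```python
import math

def entropy(probs):
    """Average information, in bits, carried by one sample from a discrete distribution."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([0.75, 0.25]))  # ~0.81 bits: the exam example above
print(entropy([1.0]))         # 0.0 bits: a sure outcome carries no information
```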

Let’s talk about Cross-Entropy now

Loosely speaking, it is the average message length, in bits, needed to convey a sample using a given coding scheme.

Let’s say you have 8 possible grades and your teacher uses the code below to convey this information:

[Image: Probability Distribution of Grades, with each of the 8 grades assigned a fixed 3-bit code]

In this case, the cross-entropy is 3 bits.

Suppose you have put in a lot of effort and your chances of getting the different grades are as follows:

[Image: Probability Distribution of Grades: P(3) = 0.35, P(4) = 0.35, P(5) = 0.1, P(6) = 0.1, P(7) = 0.04, P(8) = 0.04, P(9) = 0.01, P(10) = 0.01]

In this scenario, the entropy is -(0.35*log2(0.35) + 0.35*log2(0.35) + 0.1*log2(0.1) + 0.1*log2(0.1) + 0.04*log2(0.04) + 0.04*log2(0.04) + 0.01*log2(0.01) + 0.01*log2(0.01)) = 2.23 bits.
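The same figure can be reproduced with a few lines of Python (a quick sketch using the probabilities above):

```python
import math

grade_probs = [0.35, 0.35, 0.1, 0.1, 0.04, 0.04, 0.01, 0.01]
# Entropy = -sum of P(grade) * log2(P(grade))
H = -sum(p * math.log2(p) for p in grade_probs)
print(round(H, 2))  # 2.23 bits
```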

Now, the entropy of this distribution is only 2.23 bits, but your teacher is using 3 bits to convey this information. Can we do better (use fewer than 3 bits)?

We can do better by using a different code for each message (preferably fewer bits for the more probable events). Let’s say we build a code like this:

[Image: Probability Distribution of Grades with a variable-length code: the two most likely grades get 2-bit codes, the next two get 3-bit codes, then 4-bit and 5-bit codes for the least likely grades]

Let’s calculate the cross-entropy:

0.35*2 + 0.35*2 + 0.1*3 + 0.1*3 + 0.04*4 + 0.04*4 + 0.01*5 + 0.01*5

The cross-entropy comes to 2.42 bits. This is good, but still not as low as the 2.23-bit entropy.

Exercise for the reader: why have we chosen the codes in this seemingly arbitrary order (why not 1000 or 1111; after all, they have fewer bits than 11101)? Put the answer in the comments.

Now, if we reverse the order of the probabilities (keeping the same codes), you will see that the average number of bits drastically increases to 4.58 bits.
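Here is a small sketch that reproduces both averages, assuming code lengths of 2, 2, 3, 3, 4, 4, 5 and 5 bits for the eight grades as in the table above:

```python
probs = [0.35, 0.35, 0.1, 0.1, 0.04, 0.04, 0.01, 0.01]  # P(grade 3) ... P(grade 10)
code_lengths = [2, 2, 3, 3, 4, 4, 5, 5]                 # bits per grade, shortest for the likeliest

# Average message length = sum of P(grade) * length of that grade's code
avg_len = sum(p * l for p, l in zip(probs, code_lengths))
print(round(avg_len, 2))  # 2.42 bits

# Same code lengths handed out in reverse order (longest codes for the likeliest grades)
avg_len_reversed = sum(p * l for p, l in zip(probs, reversed(code_lengths)))
print(round(avg_len_reversed, 2))  # 4.58 bits
```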

We are sending almost twice the number of bits necessary. This happens because the code we are using makes an implicit assumption about the distribution (a 0.35 probability of getting grade 3, and so on).

Let’s assume that the distribution behind the 2.42-bit code is our actual distribution (p) and the reversed distribution implicitly assumed by the 4.58-bit encoding is our predicted distribution (q).

Formally, Cross-Entropy can be defined as:

H(p, q) = -Σ p(x) * log2(q(x))

where p is the true distribution and q is the predicted distribution.

As you can see, it has the same form as Entropy; the only difference is that we take the log of the predicted distribution instead of the true distribution.

If our predictions are perfect, the Cross-Entropy will simply be equal to the Entropy.

But if the distributions differ, the Cross-Entropy will be greater than the Entropy by some number of bits. This extra amount is called the KL Divergence.

Cross-Entropy = Entropy + KL Divergence
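We can verify this identity numerically, using the grade probabilities above as p and their reversed version as a deliberately bad prediction q (a quick sketch):

```python
import math

p = [0.35, 0.35, 0.1, 0.1, 0.04, 0.04, 0.01, 0.01]  # true distribution over grades
q = list(reversed(p))                               # a (bad) predicted distribution

entropy       = -sum(pi * math.log2(pi) for pi in p)
cross_entropy = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))
kl_divergence = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

print(round(entropy, 2), round(cross_entropy, 2), round(kl_divergence, 2))
print(math.isclose(cross_entropy, entropy + kl_divergence))  # True
```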

Now, coming back to the question I asked earlier: why use Cross-Entropy Loss in classification problems?

For any given input, there is a true distribution over the classes. Let’s take the example of the Iris Dataset.

https://archive.ics.uci.edu/ml/datasets/iris

Suppose the true label for a particular flower is Virginica, so the true distribution puts all its probability mass on Virginica, while our classifier predicts Virginica with a probability of 0.8. We can use the Cross-Entropy between these two distributions as a cost function. This is also called log loss.

In this case, the Cross-Entropy Loss will be -log(0.8) = 0.22 (using the natural log). Instead, if the predicted probability of Virginica were only 25%, the loss would increase to -log(0.25) = 1.386.

As you can see, the loss increases as the predicted probability distribution moves away from the actual probability distribution, which is the whole point of choosing Cross-Entropy as a cost function.
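Here is a minimal sketch of this loss for the Iris example, assuming the true class is Virginica and the classifier assigns it a probability of 0.8 (or 0.25); the probabilities for the other two classes are made up for illustration:

```python
import math

def cross_entropy_loss(true_dist, predicted_dist):
    """Log loss between a one-hot true distribution and a predicted distribution (natural log)."""
    return -sum(t * math.log(q) for t, q in zip(true_dist, predicted_dist) if t > 0)

true_dist = [0.0, 0.0, 1.0]  # one-hot over (Setosa, Versicolor, Virginica): the flower is a Virginica

print(round(cross_entropy_loss(true_dist, [0.10, 0.10, 0.80]), 3))    # 0.223
print(round(cross_entropy_loss(true_dist, [0.375, 0.375, 0.25]), 3))  # 1.386
```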

References:

Claude E. Shannon, “A Mathematical Theory of Communication” (1948): https://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf

Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow.
