Entropy and KL divergence

Nick Griffiths · Jul 03, 2020

Shannon Entropy

Shannon entropy is a property of a probability distribution: it measures the average uncertainty, or information content, of the distribution's outcomes.

For a discrete distribution with only one event, so that P(x) = 1, the entropy is zero:

library(entropy)

entropy(1)
## [1] 0
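
With its default settings, entropy() essentially uses the maximum-likelihood (plug-in) estimate: normalize the counts to proportions p and return -sum(p * log(p)). A minimal hand-rolled sketch (assuming natural-log units, the package default, and treating 0 * log(0) as 0):

entropy_manual <- function(counts) {
  p <- counts / sum(counts)  # counts -> proportions
  p <- p[p > 0]              # drop zero-probability outcomes
  -sum(p * log(p))
}

entropy_manual(c(1, 1))
## [1] 0.6931472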

Entropy is proportional to the number of binary variables (bits) we need to capture all (equally likely) discrete outcomes; the constant of proportionality is log(2) ≈ 0.693 because entropy() reports values in nats (natural log) by default:

entropy(c("1" = 1,"0" = 1)) # one binary variable
## [1] 0.6931472

entropy(c("11" = 1,"00" = 1, "10" = 1, "01" = 1)) # two variables
## [1] 1.386294

entropy(c("111" = 1, "110" = 1, "101" = 1, "100" = 1,
          "011" = 1, "010" = 1, "001" = 1, "000" = 1)) # 3 variables
## [1] 2.079442
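
Dividing by log(2) converts nats to bits and recovers the variable counts directly (a quick check; the names on the counts play no role, so plain rep() vectors work too):

entropy(rep(1, 2)) / log(2)  # one binary variable
## [1] 1
entropy(rep(1, 4)) / log(2)  # two binary variables
## [1] 2
entropy(rep(1, 8)) / log(2)  # three binary variables
## [1] 3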

If we have two events and weight one of them more, entropy decreases. This reflects the fact that as we increase the weight, we approach the single-event distribution, which has no entropy:

entropy(c(1,1))
## [1] 0.6931472

entropy(c(1,3))
## [1] 0.5623351

entropy(c(1,10))
## [1] 0.3046361

entropy(c(1,10^6))
## [1] 1.48155e-05
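
The same plug-in formula reproduces these values by hand; for counts c(1, 3) the proportions are 0.25 and 0.75:

-(0.25 * log(0.25) + 0.75 * log(0.75))
## [1] 0.5623351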

Similarly, for continuous distributions, entropy increases as the distribution gets flatter and wider (higher variance). Here the normal density is evaluated on an integer grid and the values are treated as weights:

entropy(dnorm(-10:10))
## [1] 1.418938

entropy(dnorm(-10:10, sd = 100))
## [1] 3.044521
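
With a unit-width grid, the first value is no accident: it matches the closed-form differential entropy of a standard normal, 0.5 * log(2 * pi * e) ≈ 1.419. The second value is capped near log(21) ≈ 3.045, since the sd = 100 density is nearly uniform over the 21 grid points:

0.5 * log(2 * pi * exp(1))  # differential entropy of N(0, 1)
## [1] 1.418939
log(21)                     # entropy of a uniform distribution on 21 points
## [1] 3.044522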

Kullback–Leibler divergence

KL divergence quantifies the difference between two distributions over the same outcomes, in the same units as entropy. The KL divergence of a distribution from itself is 0:

KL.empirical(c(1,1), c(1,1))
## [1] 0
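
KL.empirical() also normalizes the counts to proportions and applies the plug-in formula KL(p, q) = sum(p * log(p / q)). A minimal hand-rolled sketch (again in nats, skipping terms where p is zero):

KL_manual <- function(counts1, counts2) {
  p <- counts1 / sum(counts1)
  q <- counts2 / sum(counts2)
  sum(ifelse(p > 0, p * log(p / q), 0))  # zero-probability terms contribute nothing
}

KL_manual(c(1, 1), c(1, 3))
## [1] 0.143841
KL.empirical(c(1, 1), c(1, 3))
## [1] 0.143841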

It can be read as the entropy carried by the additional binary variables (bits) that the higher-entropy distribution needs to fully represent its outcomes; in the example below, that is one extra bit:

dist1 <- c("11" = 1,"00" = 1, "10" = 1, "01" = 1) # 2 binary variables
dist2 <- c("111" = 1, "110" = 1, "101" = 1, "100" = 1,
          "011" = 1, "010" = 1, "001" = 1, "000" = 1) # 3 binary variables

KL.empirical(dist1, dist2)
## [1] 0.6931472

entropy(c("1" = 1, "0" = 1)) # entropy of one binary variable
## [1] 0.6931472
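
Treating dist1 as a distribution over dist2's eight outcomes with zero mass on four of them, each nonzero term of the plug-in formula is (1/4) * log((1/4) / (1/8)), and the four terms sum to exactly one bit:

4 * (1/4) * log((1/4) / (1/8))
## [1] 0.6931472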

Cross Entropy

Cross entropy is closely related to KL divergence and also contrasts two distributions. It is just the KL divergence plus the entropy of the first distribution (the lower-entropy d1 in the examples below): cross-entropy(d1, d2) = entropy(d1) + KL(d1, d2).

When d2 (the higher-entropy distribution below) has equally weighted outcomes, i.e. is uniform over n values, the cross-entropy equals its entropy, log(n), regardless of the other distribution:

d1 <- c(1,3,2,1,1)

d2 <- c(1,1,1,1,1)

entropy(d1)
## [1] 1.494175
entropy(d2)
## [1] 1.609438

entropy(d1) + KL.empirical(d1,d2)
## [1] 1.609438
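
Equivalently, the cross-entropy can be computed directly as -sum(p * log(q)), averaging d2's log-probabilities under d1. A hand computation (in nats) reproduces the log(5) above:

cross_entropy <- function(counts1, counts2) {
  p <- counts1 / sum(counts1)
  q <- counts2 / sum(counts2)
  -sum(ifelse(p > 0, p * log(q), 0))
}

cross_entropy(d1, d2)
## [1] 1.609438
log(5)
## [1] 1.609438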

Otherwise the cross-entropy can be either higher or lower than the entropy of d2. For example, if the lower-entropy d1 removes a value that d2 weights heavily, the cross-entropy exceeds d2's entropy:

d1 <- c(1,0,2,1,1)
entropy(d1)
## [1] 1.332179

d2 <- c(1,2,2,1,1)

entropy(d2)
## [1] 1.549826
entropy(d1) + KL.empirical(d1,d2)
## [1] 1.668651

If d1 instead removes a value that d2 weights lightly, the cross-entropy is lower than d2's entropy:

d1 <- c(1,2,2,0,1)
entropy(d1)
## [1] 1.329661

d2 <- c(1,2,2,1,1)

entropy(d2)
## [1] 1.549826
entropy(d1) + KL.empirical(d1,d2)
## [1] 1.483812

Finally, cross-entropy has an important connection with the likelihood. If d1 is the empirical distribution of observed data (the process we are modeling) and d2 is a model we have built, then the cross-entropy is the negative log-likelihood of the model divided by the sample size. We can do maximum likelihood estimation by minimizing the cross-entropy.
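
As a sketch of that connection with made-up numbers (obs as hypothetical observed counts standing in for the process, model as hypothetical model probabilities): the log-likelihood of iid categorical data is sum(counts * log(model)), which is exactly -n times the cross-entropy of the empirical proportions against the model, so the model that maximizes the likelihood is the one that minimizes the cross-entropy.

obs   <- c(3, 1, 6)        # hypothetical observed counts, n = 10
model <- c(0.3, 0.2, 0.5)  # hypothetical model probabilities

loglik <- sum(obs * log(model))     # categorical log-likelihood of the model
p_emp  <- obs / sum(obs)            # empirical distribution of the data
ce     <- -sum(p_emp * log(model))  # cross-entropy of data vs model

loglik
## [1] -9.380239
-sum(obs) * ce
## [1] -9.380239

# The cross-entropy (and so the negative log-likelihood) is minimized when the
# model matches the empirical proportions, where it reduces to entropy(obs):
-sum(p_emp * log(p_emp))
## [1] 0.8979457
entropy(obs)
## [1] 0.8979457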