Shannon Entropy
Shannon entropy is a property of a probability distribution.
For a discrete distribution with a single outcome, P(x) = 1, the entropy is zero:
library(entropy)
entropy(1)
## [1] 0
For equally likely discrete outcomes, entropy is proportional to the number of binary variables (bits) needed to capture them all. By default entropy() reports the result in nats (natural log), so each additional bit adds log(2) ≈ 0.693:
entropy(c("1" = 1,"0" = 1)) # one binary variable
## [1] 0.6931472
entropy(c("11" = 1,"00" = 1, "10" = 1, "01" = 1)) # two variables
## [1] 1.386294
entropy(c("111" = 1, "110" = 1, "101" = 1, "100" = 1,
"011" = 1, "010" = 1, "001" = 1, "000" = 1)) # 3 variables
## [1] 2.079442
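The relationship is easiest to see in bits: entropy() also takes a unit argument, and with unit = "log2" the uniform distributions above come out to exactly 1, 2, and 3 bits:
entropy(c("1" = 1, "0" = 1), unit = "log2")
## [1] 1
entropy(c("11" = 1, "00" = 1, "10" = 1, "01" = 1), unit = "log2")
## [1] 2
entropy(c("111" = 1, "110" = 1, "101" = 1, "100" = 1,
          "011" = 1, "010" = 1, "001" = 1, "000" = 1), unit = "log2")
## [1] 3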
If we have two events and weight one of them more heavily, entropy decreases. This reflects the fact that as the imbalance grows, we approach the single-event distribution, which has no entropy:
entropy(c(1,1))
## [1] 0.6931472
entropy(c(1,3))
## [1] 0.5623351
entropy(c(1,10))
## [1] 0.3046361
entropy(c(1,10^6))
## [1] 1.48155e-05
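These values come straight from the definition: normalize the counts into probabilities p and compute -sum(p * log(p)). A small manual check (manual_entropy is our own helper, not part of the entropy package):
# Manual check of the plug-in entropy estimate
manual_entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]                  # treat 0 * log(0) as 0
  -sum(p * log(p))
}
manual_entropy(c(1, 3))
## [1] 0.5623351
manual_entropy(c(1, 10))
## [1] 0.3046361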
Similarly, for continuous distributions, entropy increases as the distribution gets flatter and wider, i.e. as the variance grows. Here we discretize the normal density onto the grid -10:10 and let entropy() treat the density values as weights:
entropy(dnorm(-10:10))
## [1] 1.418938
entropy(dnorm(-10:10, sd = 100))
## [1] 3.044521
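For the standard normal this grid approximation lands very close to the closed-form differential entropy, 0.5 * log(2 * pi * e), because the grid has unit spacing and covers essentially all of the mass; for sd = 100 the grid covers only a sliver of the distribution, so that value is better read as approaching the entropy of a uniform distribution over the 21 grid points. A quick comparison:
0.5 * log(2 * pi * exp(1))   # differential entropy of N(0, 1)
## [1] 1.418939
log(21)                      # uniform over the 21 grid points
## [1] 3.044522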
Kullback–Leibler Divergence
KL divergence measures how one distribution differs from a second, reference distribution: it is the expected extra information needed to encode outcomes drawn from the first distribution using a code optimized for the second. The KL divergence of a distribution with itself is 0:
KL.empirical(c(1,1), c(1,1))
## [1] 0
In the example below, dist2 needs one more binary variable (bit) than dist1 to represent its outcomes. Encoding dist1 with dist2's longer code wastes exactly that one extra bit per outcome, so the KL divergence equals the entropy of a single binary variable:
dist1 <- c("11" = 1,"00" = 1, "10" = 1, "01" = 1) # 2 binary variables
dist2 <- c("111" = 1, "110" = 1, "101" = 1, "100" = 1,
"011" = 1, "010" = 1, "001" = 1, "000" = 1) # 3 binary variables
KL.empirical(dist1, dist2)
## [1] 0.6931472
entropy(c("1" = 1, "0" = 1)) # entropy of one binary variable
## [1] 0.6931472
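We can reproduce this number with the plug-in formula sum(p * log(p/q)), taking p uniform over dist1's four outcomes and q as the 1/8 weight each of them carries under dist2's finer three-bit encoding (a manual check; the pairing of outcomes between the two encodings is our own interpretation):
p <- rep(1/4, 4)       # dist1: uniform over its four outcomes
q <- rep(1/8, 4)       # weight of each of those outcomes under dist2
sum(p * log(p / q))    # one extra bit, in nats
## [1] 0.6931472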
Cross Entropy
Cross entropy is closely related to KL divergence and also contrasts two distributions. It is the entropy of the first (reference) distribution plus the KL divergence from it to the second: below, the cross-entropy of d1 and d2 is entropy(d1) + KL.empirical(d1, d2). In these examples d1 also happens to be the lower-entropy distribution.
When the second distribution (d2 below) has equally weighted outcomes, the cross-entropy equals d2's entropy, regardless of the first distribution:
d1 <- c(1,3,2,1,1)
d2 <- c(1,1,1,1,1)
entropy(d1)
## [1] 1.494175
entropy(d2)
## [1] 1.609438
entropy(d1) + KL.empirical(d1,d2)
## [1] 1.609438
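We can also compute the cross-entropy directly from its definition, -sum(p * log(q)), instead of going through the entropy-plus-KL decomposition (cross_entropy below is our own helper, not a function from the entropy package):
# Direct definition: weight the log-probabilities of d2 by the probabilities of d1
cross_entropy <- function(y1, y2) {
  p <- y1 / sum(y1)
  q <- y2 / sum(y2)
  -sum(ifelse(p > 0, p * log(q), 0))   # outcomes with p = 0 contribute nothing
}
cross_entropy(d1, d2)
## [1] 1.609438
The same helper reproduces the values in the two examples that follow.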
Otherwise the cross-entropy can be either higher or lower than entropy(d2). For example, if d1 drops an outcome that d2 weights heavily, the cross-entropy exceeds d2's entropy:
d1 <- c(1,0,2,1,1)
entropy(d1)
## [1] 1.332179
d2 <- c(1,2,2,1,1)
entropy(d2)
## [1] 1.549826
entropy(d1) + KL.empirical(d1,d2)
## [1] 1.668651
If d1 instead drops an outcome that d2 gives little weight, the cross-entropy falls below d2's entropy:
d1 <- c(1,2,2,0,1)
entropy(d1)
## [1] 1.329661
d2 <- c(1,2,2,1,1)
entropy(d2)
## [1] 1.549826
entropy(d1) + KL.empirical(d1,d2)
## [1] 1.483812
Finally, cross-entropy has an important connection with the likelihood. If d1 is the empirical distribution of the process we are modeling and d2 is the model we have built, then the cross-entropy is the negative log-likelihood of the data under the model, divided by the number of observations. We can therefore do maximum likelihood estimation by minimizing the cross-entropy.
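A small sketch of that connection (the counts and model probabilities here are made up for illustration):
obs <- c(6, 3, 1)                # observed counts from the process (d1)
q   <- c(0.5, 0.3, 0.2)          # model probabilities (d2)
p   <- obs / sum(obs)            # empirical distribution of the data

sum(obs * log(q))                # log-likelihood of the data under the model
## [1] -9.380239
cross_ent <- -sum(p * log(q))    # cross-entropy of the data and the model
-sum(obs) * cross_ent            # minus (sample size) times the cross-entropy
## [1] -9.380239
The log-likelihood is maximized, and the cross-entropy minimized, when the model probabilities q match the empirical distribution p.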