Shannon’s entropy
Definition (Discrete)
Let $X$ be a discrete random variable on a finite set $\mathcal{X}=\{x_1,x_2,\ldots,x_n\}$, with probability distribution function $p(x)=\Pr(X=x)$. The entropy $H(X)$ of $X$ is defined as

$H(X) = -\sum_{x\in\mathcal{X}} p(x)\log p(x).$  (1)
The convention $0\log 0 = 0$ is adopted in the definition. The logarithm is usually taken to the base 2, in which case the entropy is measured in “bits,” or to the base $e$, in which case $H(X)$ is measured in “nats.”
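As a quick illustration, here is a minimal Python sketch of the definition above; the function name shannon_entropy and the example distributions are ours, not part of the original text.

    import math

    def shannon_entropy(probs, base=2):
        """Entropy of a discrete distribution given as a list of probabilities.

        Terms with p = 0 are skipped, matching the convention 0 log 0 = 0;
        base=2 gives bits, base=math.e gives nats.
        """
        return -sum(p * math.log(p, base) for p in probs if p > 0)

    print(shannon_entropy([0.5, 0.5]))           # fair coin: 1.0 bit
    print(shannon_entropy([0.9, 0.1]))           # biased coin: about 0.469 bits
    print(shannon_entropy([0.5, 0.5], math.e))   # fair coin: about 0.693 nats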
If $X$ and $Y$ are random variables on $\mathcal{X}$ and $\mathcal{Y}$ respectively, the joint entropy of $X$ and $Y$ is

$H(X,Y) = -\sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}} p(x,y)\log p(x,y),$

where $p(x,y)$ denotes the joint distribution of $X$ and $Y$.
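A small self-contained sketch of the joint entropy; the joint table below (two independent fair bits) is a made-up example of ours.

    import math

    # Joint entropy of two independent fair bits: H(X, Y) = H(X) + H(Y) = 2 bits.
    joint = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
    H_XY = -sum(p * math.log2(p) for p in joint.values() if p > 0)
    print(H_XY)   # 2.0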
Discussion
The Shannon entropy was first introduced by Shannon in 1948 in his landmark paper “A Mathematical Theory of Communication.” The entropy is a functional of the probability distribution function $p(x)$, and is sometimes written as $H(p(x_1),p(x_2),\ldots,p(x_n))$.
It is noted that the entropy of $X$ does not depend on the actual values of $X$; it only depends on $p(x)$. The definition of Shannon's entropy can be written as an expectation

$H(X) = -E[\log p(X)].$
The quantity $-\log p(x)$ is interpreted as the information content of the outcome $x$, and is also called the Hartley information of $x$. Hence Shannon's entropy is the average amount of information contained in the random variable $X$; it is also the uncertainty removed after the actual outcome of $X$ is revealed.
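To make the expectation reading concrete, here is a hedged sketch: we estimate $E[-\log_2 p(X)]$ by Monte Carlo sampling and compare it with the direct sum. The three-outcome distribution is our own example.

    import math
    import random

    probs = [0.5, 0.25, 0.25]   # a made-up three-outcome distribution

    # Direct formula: -sum p log2 p.
    direct = -sum(p * math.log2(p) for p in probs)

    # Expectation form: average the information content -log2 p(x) over samples of X.
    random.seed(0)
    samples = random.choices(range(len(probs)), weights=probs, k=200_000)
    estimate = sum(-math.log2(probs[i]) for i in samples) / len(samples)

    print(direct)     # 1.5 bits
    print(estimate)   # close to 1.5 bits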
Characterization
We write $H(X)$ as $H_n(p_1,p_2,\ldots,p_n)$, where $p_i=p(x_i)$. The Shannon entropy satisfies the following properties.
- 1.
For any $n$, $H_n(p_1,p_2,\ldots,p_n)$ is a continuous and symmetric function of the variables $p_1$, $p_2$, $\ldots$, $p_n$.
- 2.
An event of probability zero does not contribute to the entropy, i.e. for any $n$,

$H_{n+1}(p_1,\ldots,p_n,0) = H_n(p_1,\ldots,p_n).$
- 3.
Entropy is maximized when the probability distribution is uniform. For all $n$,

$H_n(p_1,\ldots,p_n) \le H_n\!\left(\tfrac{1}{n},\ldots,\tfrac{1}{n}\right) = \log n.$

This follows from Jensen's inequality,

$H(X) = E\!\left[\log\frac{1}{p(X)}\right] \le \log E\!\left[\frac{1}{p(X)}\right] = \log n.$
- 4.
If $p_{ij}$, $1\le i\le m$, $1\le j\le n$, are non-negative real numbers summing up to one, and $q_i=\sum_{j=1}^n p_{ij}$, then

$H_{mn}(p_{11},\ldots,p_{mn}) = H_m(q_1,\ldots,q_m) + \sum_{i=1}^m q_i\,H_n\!\left(\frac{p_{i1}}{q_i},\ldots,\frac{p_{in}}{q_i}\right).$
If we partition the $mn$ outcomes of the random experiment into $m$ groups, each group containing $n$ elements, we can do the experiment in two steps: first determine the group to which the actual outcome belongs, and second find the outcome in this group. The probability that you will observe group $i$ is $q_i$. The conditional probability distribution function given group $i$ is $(p_{i1}/q_i,\ldots,p_{in}/q_i)$. The entropy

$H_n\!\left(\frac{p_{i1}}{q_i},\ldots,\frac{p_{in}}{q_i}\right)$

is the entropy of the probability distribution conditioned on group $i$. Property 4 says that the total information is the sum of the information you gain in the first step, $H_m(q_1,\ldots,q_m)$, and a weighted sum of the entropies conditioned on each group (a numerical check of properties 3 and 4 is sketched just after this list).
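The following Python sketch checks properties 3 and 4 numerically on a made-up 2-by-3 table of joint probabilities; the table itself is our assumption, not from the original text.

    import math

    def H(probs):
        # Shannon entropy in bits, with the convention 0 log 0 = 0.
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Hypothetical 2x3 table of probabilities p_ij summing to one.
    p = [[0.10, 0.20, 0.30],
         [0.15, 0.15, 0.10]]
    q = [sum(row) for row in p]                       # group probabilities q_i

    # Property 4 (grouping): H_mn equals H_m(q) plus the weighted conditional entropies.
    lhs = H([pij for row in p for pij in row])
    rhs = H(q) + sum(qi * H([pij / qi for pij in row]) for qi, row in zip(q, p))
    print(abs(lhs - rhs) < 1e-12)                     # True

    # Property 3: the uniform distribution over the 6 outcomes maximizes entropy.
    print(lhs <= math.log2(6))                        # True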
Khinchin in 1957 showed that the only function satisfying the above assumptions is of the form

$H_n(p_1,\ldots,p_n) = -k\sum_{i=1}^n p_i\log p_i,$

where $k$ is a positive constant, essentially a choice of unit of measure.
Definition (Continuous)
Entropy in the continuous case is called differential entropy (http://planetmath.org/DifferentialEntropy).
Discussion—Continuous Entropy
Despite its seductively analogous form, continuous entropy cannot be obtained as a limiting case of discrete entropy.
We wish to obtain a generally finite measure as the “bin size” goes to zero. In the discrete case, the bin size is the (implicit) width of each of the $n$ (finite or infinite) bins/buckets/states whose probabilities are the $p_i$. As we generalize to the continuous domain, we must make this width explicit.
To do this, start with a continuous function $f$ discretized into bins of size $\Delta$, as shown in the figure:
As the figure indicates, by the mean-value theorem there exists a value $x_i$ in each bin such that

$f(x_i)\Delta = \int_{i\Delta}^{(i+1)\Delta} f(x)\,dx,$  (2)
and thus the integral of the function $f$ can be approximated (in the Riemannian sense) by

$\int_{-\infty}^{\infty} f(x)\,dx = \lim_{\Delta\to 0}\sum_{i=-\infty}^{\infty} f(x_i)\Delta,$  (3)
where this limit and “bin size goes to zero” are equivalent.
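As a quick sanity check of (3), here is a small sketch of ours: it uses a standard Gaussian density and bin midpoints in place of the exact mean-value points $x_i$, and the Riemann sum tends to 1.

    import math

    # Riemann-sum approximation of the integral of a standard Gaussian density.
    f = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

    for delta in (1.0, 0.1, 0.01):
        n = round(10 / delta)                              # cover roughly [-10, 10]
        total = sum(f((i + 0.5) * delta) * delta for i in range(-n, n))
        print(delta, total)                                # total approaches 1.0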
We will denote
$H^{\Delta} := -\sum_{i=-\infty}^{\infty} f(x_i)\Delta\,\log\bigl(f(x_i)\Delta\bigr),$  (4)
and expanding the logarithm we have
$H^{\Delta} = -\sum_{i=-\infty}^{\infty} f(x_i)\Delta\log f(x_i) - \sum_{i=-\infty}^{\infty} f(x_i)\Delta\log\Delta$  (5)

$\phantom{H^{\Delta}} = -\sum_{i=-\infty}^{\infty} f(x_i)\Delta\log f(x_i) - \log\Delta\sum_{i=-\infty}^{\infty} f(x_i)\Delta.$  (6)
As $\Delta\to 0$, we have
$\sum_{i=-\infty}^{\infty} f(x_i)\Delta \;\to\; \int_{-\infty}^{\infty} f(x)\,dx = 1$  (7)

$\sum_{i=-\infty}^{\infty} f(x_i)\Delta\log f(x_i) \;\to\; \int_{-\infty}^{\infty} f(x)\log f(x)\,dx.$  (8)
This leads us to our definition of the differential entropy (continuous entropy):
$h[f] := \lim_{\Delta\to 0}\Bigl(H^{\Delta} + \log\Delta\Bigr) = -\int_{-\infty}^{\infty} f(x)\log f(x)\,dx.$  (9)
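A final hedged sketch, using our own example: for a standard Gaussian density the differential entropy is $\tfrac{1}{2}\ln(2\pi e)\approx 1.4189$ nats, and the discretized quantity $H^{\Delta}+\log\Delta$ approaches it as the bin size shrinks.

    import math

    # Compare H_delta + ln(delta) with the exact differential entropy of a
    # standard Gaussian, 0.5 * ln(2*pi*e) nats, as the bin size delta shrinks.
    f = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
    exact = 0.5 * math.log(2 * math.pi * math.e)

    for delta in (0.5, 0.1, 0.01):
        n = round(10 / delta)                              # bins covering [-10, 10]
        xs = [(i + 0.5) * delta for i in range(-n, n)]     # bin midpoints
        H_delta = -sum(f(x) * delta * math.log(f(x) * delta) for x in xs)
        print(delta, H_delta + math.log(delta), exact)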