Shannon’s entropy
Definition (Discrete)
Let $X$ be a discrete random variable on a finite set $\mathcal{X}=\{x_1,x_2,\ldots,x_n\}$, with probability distribution function $p(x)=\Pr(X=x)$. The entropy $H(X)$ of $X$ is defined as

$H(X) = -\sum_{x\in\mathcal{X}} p(x)\log p(x).$  (1)
The convention $0\log 0 = 0$ is adopted in the definition. The logarithm is usually taken to the base 2, in which case the entropy is measured in “bits,” or to the base $e$, in which case $H(X)$ is measured in “nats.”
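As a quick illustration, here is a minimal Python sketch of the definition above; the function name shannon_entropy and the example distributions are ours, not part of the original text.

    import math

    def shannon_entropy(probs, base=2):
        """Entropy of a discrete distribution given as a list of probabilities.

        Terms with p = 0 are skipped, matching the convention 0 log 0 = 0;
        base=2 gives bits, base=math.e gives nats.
        """
        return -sum(p * math.log(p, base) for p in probs if p > 0)

    print(shannon_entropy([0.5, 0.5]))           # fair coin: 1.0 bit
    print(shannon_entropy([0.9, 0.1]))           # biased coin: about 0.469 bits
    print(shannon_entropy([0.5, 0.5], math.e))   # fair coin: about 0.693 nats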
If $X$ and $Y$ are random variables on $\mathcal{X}$ and $\mathcal{Y}$ respectively, the joint entropy of $X$ and $Y$ is

$H(X,Y) = -\sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}} p(x,y)\log p(x,y),$

where $p(x,y)$ denotes the joint distribution of $X$ and $Y$.
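A small self-contained sketch of the joint entropy; the joint table below (two independent fair bits) is a made-up example of ours.

    import math

    # Joint entropy of two independent fair bits: H(X, Y) = H(X) + H(Y) = 2 bits.
    joint = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
    H_XY = -sum(p * math.log2(p) for p in joint.values() if p > 0)
    print(H_XY)   # 2.0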
Discussion
The Shannon entropy was first introduced by Shannon in 1948 in his landmark paper “A Mathematical Theory of Communication.” The entropy is a functional of the probability distribution function $p(x)$, and is sometimes written as $H(p(x_1),p(x_2),\ldots,p(x_n))$.
It is noted that the entropy of $X$ does not depend on the actual values of $X$; it only depends on $p(x)$. The definition of Shannon's entropy can be written as an expectation

$H(X) = -E[\log p(X)].$
The quantity $-\log p(x)$ is interpreted as the information content of the outcome $x$, and is also called the Hartley information of $x$. Hence Shannon's entropy is the average amount of information contained in the random variable $X$; it is also the uncertainty removed after the actual outcome of $X$ is revealed.
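To make the expectation reading concrete, here is a hedged sketch: we estimate $E[-\log_2 p(X)]$ by Monte Carlo sampling and compare it with the direct sum. The three-outcome distribution is our own example.

    import math
    import random

    probs = [0.5, 0.25, 0.25]   # a made-up three-outcome distribution

    # Direct formula: -sum p log2 p.
    direct = -sum(p * math.log2(p) for p in probs)

    # Expectation form: average the information content -log2 p(x) over samples of X.
    random.seed(0)
    samples = random.choices(range(len(probs)), weights=probs, k=200_000)
    estimate = sum(-math.log2(probs[i]) for i in samples) / len(samples)

    print(direct)     # 1.5 bits
    print(estimate)   # close to 1.5 bits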
Characterization
We write $H(X)$ as $H_n(p_1,p_2,\ldots,p_n)$, where $p_i=p(x_i)$. The Shannon entropy satisfies the following properties.
- 1.
For any $n$, $H_n(p_1,p_2,\ldots,p_n)$ is a continuous and symmetric function of the variables $p_1$, $p_2$, $\ldots$, $p_n$.
- 2.
An event of probability zero does not contribute to the entropy, i.e. for any $n$,

$H_{n+1}(p_1,\ldots,p_n,0) = H_n(p_1,\ldots,p_n).$
- 3.
Entropy is maximized when the probability distribution is uniform. For all $n$,

$H_n(p_1,\ldots,p_n) \le H_n\!\left(\tfrac{1}{n},\ldots,\tfrac{1}{n}\right) = \log n.$

This follows from Jensen's inequality,

$H(X) = E\!\left[\log\frac{1}{p(X)}\right] \le \log E\!\left[\frac{1}{p(X)}\right] = \log n.$
- 4.
If $p_{ij}$, $1\le i\le m$, $1\le j\le n$, are non-negative real numbers summing up to one, and $q_i=\sum_{j=1}^n p_{ij}$, then

$H_{mn}(p_{11},\ldots,p_{mn}) = H_m(q_1,\ldots,q_m) + \sum_{i=1}^m q_i\,H_n\!\left(\frac{p_{i1}}{q_i},\ldots,\frac{p_{in}}{q_i}\right).$
If we partition the $mn$ outcomes of the random experiment into $m$ groups, each group containing $n$ elements, we can do the experiment in two steps: first determine the group to which the actual outcome belongs, and second find the outcome in this group. The probability that you will observe group $i$ is $q_i$. The conditional probability distribution function given group $i$ is $(p_{i1}/q_i,\ldots,p_{in}/q_i)$. The entropy

$H_n\!\left(\frac{p_{i1}}{q_i},\ldots,\frac{p_{in}}{q_i}\right)$

is the entropy of the probability distribution conditioned on group $i$. Property 4 says that the total information is the sum of the information you gain in the first step, $H_m(q_1,\ldots,q_m)$, and a weighted sum of the entropies conditioned on each group (a numerical check of properties 3 and 4 is sketched just after this list).
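The following Python sketch checks properties 3 and 4 numerically on a made-up 2-by-3 table of joint probabilities; the table itself is our assumption, not from the original text.

    import math

    def H(probs):
        # Shannon entropy in bits, with the convention 0 log 0 = 0.
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Hypothetical 2x3 table of probabilities p_ij summing to one.
    p = [[0.10, 0.20, 0.30],
         [0.15, 0.15, 0.10]]
    q = [sum(row) for row in p]                       # group probabilities q_i

    # Property 4 (grouping): H_mn equals H_m(q) plus the weighted conditional entropies.
    lhs = H([pij for row in p for pij in row])
    rhs = H(q) + sum(qi * H([pij / qi for pij in row]) for qi, row in zip(q, p))
    print(abs(lhs - rhs) < 1e-12)                     # True

    # Property 3: the uniform distribution over the 6 outcomes maximizes entropy.
    print(lhs <= math.log2(6))                        # True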
Khinchin in 1957 showed that the only function satisfying the above assumptions is of the form

$H_n(p_1,\ldots,p_n) = -k\sum_{i=1}^n p_i\log p_i,$

where $k$ is a positive constant, essentially a choice of unit of measure.
Definition (Continuous)
Entropy in the continuous case is called differential entropy (http://planetmath.org/DifferentialEntropy).
Discussion—Continuous Entropy
Despite its seductively analogous form, continuous entropy cannot be obtained as a limiting case of discrete entropy.
We wish to obtain a generally finite measure as the “bin size” goes to zero. In the discrete case, the bin size is the (implicit) width of each of the $n$ (finite or infinite) bins/buckets/states whose probabilities are the $p_i$. As we generalize to the continuous domain, we must make this width explicit.
To do this, start with a continuous function $f$ discretized into bins of size $\Delta$, as shown in the figure:
As the figure indicates, by the mean-value theorem there exists a value $x_i$ in each bin such that

$f(x_i)\Delta = \int_{i\Delta}^{(i+1)\Delta} f(x)\,dx,$  (2)
and thus the integral of the function $f$ can be approximated (in the Riemannian sense) by

$\int_{-\infty}^{\infty} f(x)\,dx = \lim_{\Delta\to 0}\sum_{i=-\infty}^{\infty} f(x_i)\Delta,$  (3)
where this limit and “bin size goes to zero” are equivalent.
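As a quick sanity check of (3), here is a small sketch of ours: it uses a standard Gaussian density and bin midpoints in place of the exact mean-value points $x_i$, and the Riemann sum tends to 1.

    import math

    # Riemann-sum approximation of the integral of a standard Gaussian density.
    f = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

    for delta in (1.0, 0.1, 0.01):
        n = round(10 / delta)                              # cover roughly [-10, 10]
        total = sum(f((i + 0.5) * delta) * delta for i in range(-n, n))
        print(delta, total)                                # total approaches 1.0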
We will denote
$H^{\Delta} := -\sum_{i=-\infty}^{\infty} f(x_i)\Delta\,\log\bigl(f(x_i)\Delta\bigr),$  (4)
and expanding the logarithm we have
$H^{\Delta} = -\sum_{i=-\infty}^{\infty} f(x_i)\Delta\log f(x_i) - \sum_{i=-\infty}^{\infty} f(x_i)\Delta\log\Delta$  (5)

$\phantom{H^{\Delta}} = -\sum_{i=-\infty}^{\infty} f(x_i)\Delta\log f(x_i) - \log\Delta\sum_{i=-\infty}^{\infty} f(x_i)\Delta.$  (6)
As $\Delta\to 0$, we have
$\sum_{i=-\infty}^{\infty} f(x_i)\Delta \;\to\; \int_{-\infty}^{\infty} f(x)\,dx = 1$  (7)

$\sum_{i=-\infty}^{\infty} f(x_i)\Delta\log f(x_i) \;\to\; \int_{-\infty}^{\infty} f(x)\log f(x)\,dx.$  (8)
This leads us to our definition of the differential entropy (continuous entropy):
$h[f] := \lim_{\Delta\to 0}\Bigl(H^{\Delta} + \log\Delta\Bigr) = -\int_{-\infty}^{\infty} f(x)\log f(x)\,dx.$  (9)
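A final hedged sketch, using our own example: for a standard Gaussian density the differential entropy is $\tfrac{1}{2}\ln(2\pi e)\approx 1.4189$ nats, and the discretized quantity $H^{\Delta}+\log\Delta$ approaches it as the bin size shrinks.

    import math

    # Compare H_delta + ln(delta) with the exact differential entropy of a
    # standard Gaussian, 0.5 * ln(2*pi*e) nats, as the bin size delta shrinks.
    f = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
    exact = 0.5 * math.log(2 * math.pi * math.e)

    for delta in (0.5, 0.1, 0.01):
        n = round(10 / delta)                              # bins covering [-10, 10]
        xs = [(i + 0.5) * delta for i in range(-n, n)]     # bin midpoints
        H_delta = -sum(f(x) * delta * math.log(f(x) * delta) for x in xs)
        print(delta, H_delta + math.log(delta), exact)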