Setting

For a discrete r.v. $X$, where $p(x) = \Pr(X = x)$ is the PMF, we define the entropy of $X$ as

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x).$$

Intuition Behind Entropy

Imagine you are predicting the outcome of an event, such as flipping a coin. If the coin is fair, the result (heads or tails) is maximally uncertain, and observing the outcome provides a "surprising" or informative answer. Conversely, if the coin is heavily biased (e.g., 99% chance of heads), the outcome is predictable, so learning the result yields little new information. Entropy formalizes this intuition:

  • High entropy: Outcomes are uncertain or diverse (e.g., fair coin).
  • Low entropy: Outcomes are predictable or concentrated (e.g., biased coin).

Entropy as Expected "Surprise"

Each outcome $x$ of $X$ carries a self-information (or "surprise") defined as $\log \frac{1}{p(x)}$. Rare events (small $p(x)$) yield large self-information, as they are more surprising. Entropy is the average self-information over all outcomes:

$$H(X) = \mathbb{E}_{X \sim p}\left[\log \frac{1}{p(X)}\right].$$

This means entropy measures the expected unpredictability of $X$.

Binary Entropy

For a binary variable $X$ (e.g., a coin flip) with $\Pr(X = 1) = p$ and $\Pr(X = 0) = 1 - p$:

$$H(X) = H(p) = -p \log p - (1 - p) \log (1 - p).$$

This function peaks at $p = \frac{1}{2}$ (fair coin, $H = 1$ bit) and drops to $0$ for $p = 0$ or $p = 1$ (no uncertainty).
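As a quick numeric illustration, the sketch below (plain Python, log base 2; the helper name binary_entropy is ours, not from the text) evaluates $H(p)$ in bits and confirms the peak at $p = \frac{1}{2}$ and the near-zero value for a heavily biased coin.

```python
import math

def binary_entropy(p: float) -> float:
    """H(p) = -p log2 p - (1-p) log2 (1-p), with the convention 0 log 0 := 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))   # 1.0 bit: fair coin, maximal uncertainty
print(binary_entropy(0.99))  # ~0.081 bits: heavily biased coin, nearly predictable
print(binary_entropy(0.0))   # 0.0 bits: no uncertainty at all
```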

Joint Entropy and Conditional Entropy

Joint Entropy

The joint entropy $H(X, Y)$ of a pair of discrete random variables $(X, Y)$ with a joint distribution $p(x, y)$ is defined as

$$H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y),$$

which can also be expressed as the expectation form

$$H(X, Y) = -\mathbb{E}\left[\log p(X, Y)\right].$$

Bijection and Joint Entropy

If there exists a bijection between $X$ and $Y$, then the entropies are equal: $H(X) = H(Y) = H(X, Y)$. A bijection implies a one-to-one and onto mapping, meaning each outcome of $X$ corresponds uniquely to an outcome of $Y$, and vice versa. Therefore, the uncertainty (and thus information entropy) is the same for both.
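A small sketch of this invariance (the PMF and the relabelling map are our own hypothetical example): applying a bijection to the outcomes of $X$ changes only the labels, so the entropy is unchanged.

```python
import math

def entropy(pmf):
    """H = -sum p log2 p over outcomes with positive probability."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

p_x = {0: 0.5, 1: 0.3, 2: 0.2}           # PMF of X
f = {0: "a", 1: "b", 2: "c"}             # a bijection on the outcomes
p_y = {f[x]: p for x, p in p_x.items()}  # PMF of Y = f(X)

print(entropy(p_x))  # ~1.485 bits
print(entropy(p_y))  # same value: relabelling outcomes does not change entropy
```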

Conditional Entropy

The conditional entropy of $Y$ given $X$ is defined as

$$H(Y \mid X) = \sum_{x \in \mathcal{X}} p(x) H(Y \mid X = x) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y \mid x) = -\mathbb{E}\left[\log p(Y \mid X)\right].$$

Chain Rule

$$H(X, Y) = H(X) + H(Y \mid X).$$

Equivalently,

$$\log p(X, Y) = \log p(X) + \log p(Y \mid X),$$

and taking the expectation of both sides gives the chain rule.

Corollary

$$H(X, Y \mid Z) = H(X \mid Z) + H(Y \mid X, Z).$$

Proof

The proof follows the same expansion as the chain rule, with every term additionally conditioned on $Z$.

Asymmetry

Note that $H(Y \mid X) \neq H(X \mid Y)$ in general. However, $H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$.
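The chain rule $H(X, Y) = H(X) + H(Y \mid X)$ can be checked numerically. The sketch below uses a hypothetical 2×2 joint PMF of our choosing and computes $H(X, Y)$, $H(X)$, and $H(Y \mid X)$ directly from their definitions.

```python
import math

# joint PMF p(x, y) on a small alphabet (a hypothetical example, not from the text)
p_xy = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.4, (1, 1): 0.1}

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

H_xy = H(p_xy.values())                                   # joint entropy H(X, Y)
p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
H_x = H(p_x.values())                                     # marginal entropy H(X)
# conditional entropy H(Y | X) = -sum p(x, y) log2 p(y | x)
H_y_given_x = -sum(p * math.log2(p / p_x[x]) for (x, _), p in p_xy.items() if p > 0)

print(H_xy, H_x + H_y_given_x)  # the two numbers agree: H(X, Y) = H(X) + H(Y | X)
```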

Relative Entropy and Mutual Information

K-L Divergence (Relative Entropy)

see also Kullback–Leibler Divergence

The relative entropy or K-L divergence between two probability mass functions $p(x)$ and $q(x)$ is defined as

$$D(p \,\|\, q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} = \mathbb{E}_{p}\left[\log \frac{p(X)}{q(X)}\right].$$

Note

In this definition, we use the convention that $0 \log \frac{0}{0} = 0$, $0 \log \frac{0}{q} = 0$, and $p \log \frac{p}{0} = \infty$. Thus, if there is any symbol $x \in \mathcal{X}$ such that $p(x) > 0$ and $q(x) = 0$, then $D(p \,\|\, q) = \infty$.

Asymmetry

In general $D(p \,\|\, q) \neq D(q \,\|\, p)$, so relative entropy is not symmetric (and is not a true distance metric).
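A minimal sketch of the definition and its asymmetry, using two hypothetical PMFs on a binary alphabet (the helper name kl is ours):

```python
import math

def kl(p, q):
    """D(p || q) = sum p(x) log2 (p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

print(kl(p, q))  # ~0.737 bits
print(kl(q, p))  # ~0.531 bits: a different value, so D(p||q) != D(q||p) in general
```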

Mutual Information

Consider two random variables $X$ and $Y$ with a joint distribution $p(x, y)$ and marginal probability mass functions $p(x)$ and $p(y)$. The mutual information $I(X; Y)$ is the relative entropy between the joint distribution $p(x, y)$ and the product distribution $p(x)p(y)$:

$$I(X; Y) = D\big(p(x, y) \,\|\, p(x)p(y)\big) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)p(y)}.$$

Relationship with independence: $I(X; Y) = 0$ if and only if $X$ and $Y$ are independent.

Relationship between Mutual Information and Entropy

$$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y).$$

In particular,

$$I(X; X) = H(X) - H(X \mid X) = H(X),$$

so entropy is sometimes called self-information.

By this relationship we also have

$$I(X; Y) = I(Y; X).$$
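The sketch below (a hypothetical joint PMF of our choosing) computes $I(X; Y)$ both as $D\big(p(x, y) \,\|\, p(x)p(y)\big)$ and as $H(X) + H(Y) - H(X, Y)$, and the two values agree.

```python
import math

# hypothetical joint PMF of (X, Y) on {0,1} x {0,1}
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# I(X; Y) as the KL divergence between p(x, y) and the product p(x) p(y)
I_kl = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items() if p > 0)
# I(X; Y) via the entropy identity H(X) + H(Y) - H(X, Y)
I_entropy = H(p_x.values()) + H(p_y.values()) - H(p_xy.values())

print(I_kl, I_entropy)  # both computations give the same (non-negative) number
```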

More Chain Rules

Chain rule of entropy

$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1).$$

Conditional mutual information

For random variables $X$, $Y$, and $Z$, the conditional mutual information is defined as

$$I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z) = \mathbb{E}\left[\log \frac{p(X, Y \mid Z)}{p(X \mid Z)\, p(Y \mid Z)}\right].$$

Chain rule for information

$$I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \mid X_{i-1}, \ldots, X_1).$$

Conditional relative entropy

$$D\big(p(y \mid x) \,\|\, q(y \mid x)\big) = \sum_{x} p(x) \sum_{y} p(y \mid x) \log \frac{p(y \mid x)}{q(y \mid x)}.$$

Chain rule for relative entropy

$$D\big(p(x, y) \,\|\, q(x, y)\big) = D\big(p(x) \,\|\, q(x)\big) + D\big(p(y \mid x) \,\|\, q(y \mid x)\big).$$

Jensen's Inequality

See also Jensen's Inequality

Definition

If $f$ is a convex function and $X$ is a random variable,

$$\mathbb{E}\left[f(X)\right] \ge f\left(\mathbb{E}[X]\right).$$

Moreover, if $f$ is strictly convex, the equality implies that $X = \mathbb{E}[X]$ with probability $1$ (i.e., $X$ is a constant).

Information inequality

$$D(p \,\|\, q) \ge 0,$$

with equality if and only if $p(x) = q(x)$ for all $x$.

Non-negativity of mutual information

$$I(X; Y) \ge 0,$$

with equality if and only if $X$ and $Y$ are independent.

Upper bound of $H(X)$

$$H(X) \le \log |\mathcal{X}|,$$

where $|\mathcal{X}|$ denotes the number of elements in the range of $X$, with equality if and only if $X$ has a uniform distribution over $\mathcal{X}$.

We can prove this by calculating the KL divergence between any PMF $p(x)$ and the uniform PMF $u(x) = \frac{1}{|\mathcal{X}|}$:

$$0 \le D(p \,\|\, u) = \sum_{x} p(x) \log \frac{p(x)}{u(x)} = \log |\mathcal{X}| - H(X).$$
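A quick numeric check of this identity on an arbitrary 4-outcome PMF (our own example values): $D(p \,\|\, u) = \log |\mathcal{X}| - H(X) \ge 0$, hence $H(X) \le \log |\mathcal{X}|$.

```python
import math

p = [0.7, 0.1, 0.1, 0.1]          # an arbitrary PMF on 4 outcomes
u = [1 / len(p)] * len(p)         # the uniform PMF on the same alphabet

H = -sum(x * math.log2(x) for x in p if x > 0)
D_pu = sum(x * math.log2(x / ux) for x, ux in zip(p, u) if x > 0)

print(H)                          # ~1.357 bits, below log2(4) = 2
print(math.log2(len(p)) - D_pu)   # equals H: D(p || u) = log|X| - H(X) >= 0
```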

Conditioning reduces entropy (Information can't hurt)

$$H(X \mid Y) \le H(X),$$

with equality if and only if $X$ and $Y$ are independent, since

$$I(X; Y) = H(X) - H(X \mid Y) \ge 0.$$

Independence bound on entropy

Let $X_1, X_2, \ldots, X_n$ be drawn according to $p(x_1, x_2, \ldots, x_n)$. Then

$$H(X_1, X_2, \ldots, X_n) \le \sum_{i=1}^{n} H(X_i),$$

with equality if and only if the $X_i$ are independent.

Proof. By the chain rule for entropies,

$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1) \le \sum_{i=1}^{n} H(X_i),$$

where the inequality follows because conditioning reduces entropy.

Log Sum Inequality

Definition

For non-negative numbers $a_1, a_2, \ldots, a_n$ and $b_1, b_2, \ldots, b_n$,

$$\sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} \ge \left(\sum_{i=1}^{n} a_i\right) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i},$$

with equality if and only if $\frac{a_i}{b_i}$ is constant.

Proof. Setting $f(t) = t \log t$ (which is strictly convex), $\alpha_i = \frac{b_i}{\sum_{j} b_j}$, and $t_i = \frac{a_i}{b_i}$, by Jensen's inequality

$$\sum_{i} \alpha_i f(t_i) \ge f\left(\sum_{i} \alpha_i t_i\right),$$

which is the log sum inequality after substitution.
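A numeric spot-check of the inequality on arbitrary non-negative numbers (our own example values):

```python
import math

a = [1.0, 2.0, 3.0]
b = [2.0, 1.0, 4.0]

lhs = sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b))
rhs = sum(a) * math.log2(sum(a) / sum(b))

print(lhs, rhs)  # lhs >= rhs, with equality only when a_i / b_i is constant
```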

Convexity of KL divergence

$D(p \,\|\, q)$ is convex in the pair $(p, q)$: that is, if $(p_1, q_1)$ and $(p_2, q_2)$ are two pairs of probability mass functions, then

$$D\big(\lambda p_1 + (1 - \lambda) p_2 \,\|\, \lambda q_1 + (1 - \lambda) q_2\big) \le \lambda D(p_1 \,\|\, q_1) + (1 - \lambda) D(p_2 \,\|\, q_2)$$

for all $0 \le \lambda \le 1$.
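A numeric spot-check of this convexity (the pairs $(p_1, q_1)$, $(p_2, q_2)$ and the weight $\lambda = 0.4$ are our own example values): the divergence of the mixtures is at most the mixture of the divergences.

```python
import math

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p1, q1 = [0.8, 0.2], [0.5, 0.5]
p2, q2 = [0.3, 0.7], [0.6, 0.4]
lam = 0.4
p_mix = [lam * a + (1 - lam) * b for a, b in zip(p1, p2)]
q_mix = [lam * a + (1 - lam) * b for a, b in zip(q1, q2)]

print(kl(p_mix, q_mix))                           # KL of the mixtures ...
print(lam * kl(p1, q1) + (1 - lam) * kl(p2, q2))  # ... is at most the mixture of the KLs
```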

Concavity of entropy

$H(p)$ is a concave function of $p$, since $H(p) = \log |\mathcal{X}| - D(p \,\|\, u)$ (where $u$ is the uniform distribution over $\mathcal{X}$) and $D(p \,\|\, u)$ is convex in $p$.

Data-Processing Inequality (DPI)

Markov chain

Random variables $X, Y, Z$ are said to form a Markov chain in that order (denoted by $X \to Y \to Z$) if the joint probability mass function can be written as

$$p(x, y, z) = p(x)\, p(y \mid x)\, p(z \mid y),$$

i.e., $X$ and $Z$ are conditionally independent given $Y$.

Data-processing inequality

If $X \to Y \to Z$, then $I(X; Y) \ge I(X; Z)$.

Idea

This means that processing $Y$ to produce $Z$ cannot increase the amount of information $Z$ contains about $X$. At best, $Z$ retains the same amount of information as $Y$, but typically, some information is lost.

The data-processing inequality captures the idea that processing data cannot increase the amount of information it contains about the original source. It formalizes the intuition that each step of processing (e.g., encoding, transmitting, decoding) can only preserve or reduce information, never create new information about the original data.

Proof: By the chain rule, we can expand mutual information in two different ways:

$$I(X; Y, Z) = I(X; Z) + I(X; Y \mid Z) = I(X; Y) + I(X; Z \mid Y).$$

Since $X$ and $Z$ are conditionally independent given $Y$, we have $I(X; Z \mid Y) = 0$. Since $I(X; Y \mid Z) \ge 0$, we have $I(X; Y) \ge I(X; Z)$.

Similarly, one can prove that $I(Y; Z) \ge I(X; Z)$.
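The inequality can also be illustrated numerically. The sketch below builds a hypothetical Markov chain $X \to Y \to Z$ from two binary symmetric channels (the flip probabilities 0.1 and 0.2 are our own choices) and checks that $I(X; Z) \le I(X; Y)$.

```python
import math

# X -> Y -> Z: X is a fair bit, Y is X through a binary symmetric channel with
# flip probability 0.1, and Z is Y through another BSC with flip probability 0.2.
eps1, eps2 = 0.1, 0.2
p_x = {0: 0.5, 1: 0.5}
p_xy = {(x, y): p_x[x] * (1 - eps1 if y == x else eps1) for x in (0, 1) for y in (0, 1)}
p_xz = {}
for x in (0, 1):
    for z in (0, 1):
        # marginalise over y: p(x, z) = sum_y p(x) p(y|x) p(z|y)
        p_xz[(x, z)] = sum(
            p_x[x] * (1 - eps1 if y == x else eps1) * (1 - eps2 if z == y else eps2)
            for y in (0, 1)
        )

def mutual_information(p_ab):
    p_a, p_b = {}, {}
    for (a, b), p in p_ab.items():
        p_a[a] = p_a.get(a, 0) + p
        p_b[b] = p_b.get(b, 0) + p
    return sum(p * math.log2(p / (p_a[a] * p_b[b])) for (a, b), p in p_ab.items() if p > 0)

print(mutual_information(p_xy))  # I(X; Y) ~ 0.531 bits
print(mutual_information(p_xz))  # I(X; Z) is smaller: processing Y into Z lost information
```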

Corollary. If $X \to Y \to Z$, then $I(X; Y \mid Z) \le I(X; Y)$.

Why

  • When $Z$ is known, some of the information $Y$ contains about $X$ might already be "explained" by $Z$. This reduces the additional information $Y$ can provide about $X$ beyond what $Z$ already tells us.
  • In other words, $Z$ acts as a "summary" or "partial explanation" of $Y$, so knowing $Z$ reduces the uncertainty about $X$ that $Y$ can resolve.

Sufficient Statistics

Reference: Sufficient Statistic

Suppose that we have a family of probability mass functions $\{f_\theta(x)\}$ indexed by $\theta$, and let $X$ be a sample from a distribution in this family. Let $T(X)$ be any statistic (i.e. a function of the sample) like the sample mean or sample variance. Then $\theta \to X \to T(X)$, and by the data-processing inequality, we have

$$I(\theta; T(X)) \le I(\theta; X)$$

for any distribution on $\theta$. However, if equality holds, no information is lost.

A function $T(X)$ is said to be a sufficient statistic relative to the family $\{f_\theta(x)\}$ if $X$ is independent of $\theta$ given $T(X)$ for any distribution on $\theta$ (i.e., $\theta \to T(X) \to X$ forms a Markov chain).

This is the same as the condition

$$I(\theta; X) = I(\theta; T(X))$$

(since we have both $I(\theta; T(X)) \le I(\theta; X)$ and $I(\theta; X) \le I(\theta; T(X))$).

A statistic $T(X)$ is a minimal sufficient statistic relative to $\{f_\theta(x)\}$ if it is a function of every other sufficient statistic $U$. Interpreting this in terms of the data-processing inequality, this implies that

$$\theta \to T(X) \to U(X) \to X.$$

Hence, a minimal sufficient statistic maximally compresses the information about $\theta$ in the sample. Other sufficient statistics may contain additional irrelevant information. For example, for a normal distribution with mean $\theta$, the pair of functions giving the mean of all odd samples and the mean of all even samples is a sufficient statistic, but not a minimal sufficient statistic (a minimal one is the single function giving the mean of all samples).
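A small numeric illustration of sufficiency (a hypothetical two-point prior on $\theta$ and two Bernoulli($\theta$) samples; all names here are ours): $T(X) = X_1 + X_2$ is sufficient for the Bernoulli family, so $I(\theta; X) = I(\theta; T(X))$.

```python
import math
from itertools import product

# Hypothetical setup: theta takes two values with a uniform prior, and X = (X1, X2)
# are i.i.d. Bernoulli(theta). T(X) = X1 + X2 is sufficient, so I(theta; X) = I(theta; T).
thetas = {0.3: 0.5, 0.7: 0.5}

def mutual_information(p_ab):
    p_a, p_b = {}, {}
    for (a, b), p in p_ab.items():
        p_a[a] = p_a.get(a, 0) + p
        p_b[b] = p_b.get(b, 0) + p
    return sum(p * math.log2(p / (p_a[a] * p_b[b])) for (a, b), p in p_ab.items() if p > 0)

p_theta_x = {}   # joint distribution of (theta, (x1, x2))
p_theta_t = {}   # joint distribution of (theta, x1 + x2)
for theta, prior in thetas.items():
    for x1, x2 in product((0, 1), repeat=2):
        p = prior * (theta if x1 else 1 - theta) * (theta if x2 else 1 - theta)
        p_theta_x[(theta, (x1, x2))] = p
        key = (theta, x1 + x2)
        p_theta_t[key] = p_theta_t.get(key, 0) + p

print(mutual_information(p_theta_x))  # I(theta; X)
print(mutual_information(p_theta_t))  # same value: no information about theta is lost
```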

Fano's Inequality

Suppose that we know a random variable $Y$ and we wish to guess the value of a correlated random variable $X$. Fano's inequality relates the probability of error in guessing the random variable $X$ to its conditional entropy $H(X \mid Y)$.

Fano's Inequality. For any estimator $\hat{X}$ such that $X \to Y \to \hat{X}$, with an error probability $P_e = \Pr(\hat{X} \neq X)$, we have

$$H(P_e) + P_e \log |\mathcal{X}| \ge H(X \mid \hat{X}) \ge H(X \mid Y).$$

The core idea here is that

  1. The uncertainty about $X$ given our estimate $\hat{X}$ (i.e., $H(X \mid \hat{X})$) must be at least as large as the uncertainty about $X$ given the original variable $Y$ (i.e., $H(X \mid Y)$). This makes intuitive sense because $\hat{X}$ is derived from $Y$, so it can't contain more information about $X$ than $Y$ itself does.
  2. The term $H(P_e) + P_e \log |\mathcal{X}|$ is an upper bound on $H(X \mid \hat{X})$. Here, $H(P_e)$ is the entropy of the error event (i.e., the entropy of a binary random variable that is $1$ with probability $P_e$ and $0$ with probability $1 - P_e$). The term $P_e \log |\mathcal{X}|$ accounts for the worst-case scenario: when an error occurs, we have no information about $X$, so we might have to choose from all possible values in the alphabet $\mathcal{X}$, resulting in an uncertainty of at most $\log |\mathcal{X}|$.

Note that $H(P_e) \le 1$. This inequality can be weakened to

$$1 + P_e \log |\mathcal{X}| \ge H(X \mid Y)$$

or

$$P_e \ge \frac{H(X \mid Y) - 1}{\log |\mathcal{X}|}.$$

Proof. Define the error indicator $E = \mathbf{1}\{\hat{X} \neq X\}$. The first inequality can be proven by expanding $H(E, X \mid \hat{X})$ in two ways:

$$H(E, X \mid \hat{X}) = H(X \mid \hat{X}) + H(E \mid X, \hat{X}) = H(X \mid \hat{X}),$$

since $E$ is a function of $X$ and $\hat{X}$, so $H(E \mid X, \hat{X}) = 0$; and

$$H(E, X \mid \hat{X}) = H(E \mid \hat{X}) + H(X \mid E, \hat{X}) \le H(P_e) + P_e \log |\mathcal{X}|,$$

where $H(E \mid \hat{X}) \le H(E) = H(P_e)$, and for the term $H(X \mid E, \hat{X})$ we used

$$H(X \mid E, \hat{X}) = \Pr(E = 0)\, H(X \mid \hat{X}, E = 0) + \Pr(E = 1)\, H(X \mid \hat{X}, E = 1) \le (1 - P_e) \cdot 0 + P_e \log |\mathcal{X}|.$$

Corollary. For any two random variables $X$ and $Y$, let $P_e = \Pr(X \neq Y)$. We have

$$H(P_e) + P_e \log |\mathcal{X}| \ge H(X \mid Y).$$

Corollary. Let $P_e = \Pr(X \neq \hat{X})$, and let $\hat{X}: \mathcal{Y} \to \mathcal{X}$; then

$$H(P_e) + P_e \log(|\mathcal{X}| - 1) \ge H(X \mid Y).$$

The proof of the theorem goes through without change from the original proof, except that

$$H(X \mid E, \hat{X}) \le (1 - P_e) \cdot 0 + P_e \log(|\mathcal{X}| - 1),$$

since given $E = 1$ and $\hat{X}$, the range of possible outcomes of $X$ is $|\mathcal{X}| - 1$ (we can rule out the value $\hat{X}$ itself).
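A numeric spot-check of the second corollary on a hypothetical joint PMF and estimator (all values below are our own example): the left-hand side $H(P_e) + P_e \log(|\mathcal{X}| - 1)$ dominates $H(X \mid Y)$.

```python
import math

# Hypothetical joint PMF of (X, Y) with X in {0, 1, 2} and Y in {0, 1},
# and a deterministic estimator xhat(y) used to guess X from Y.
p_xy = {(0, 0): 0.35, (1, 0): 0.10, (2, 0): 0.05,
        (0, 1): 0.05, (1, 1): 0.30, (2, 1): 0.15}
xhat = {0: 0, 1: 1}                     # estimate X as 0 when Y = 0, as 1 when Y = 1

P_e = sum(p for (x, y), p in p_xy.items() if xhat[y] != x)           # error probability
p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}
H_x_given_y = -sum(p * math.log2(p / p_y[y]) for (x, y), p in p_xy.items() if p > 0)
H_Pe = -P_e * math.log2(P_e) - (1 - P_e) * math.log2(1 - P_e)

lhs = H_Pe + P_e * math.log2(3 - 1)     # H(P_e) + P_e log(|X| - 1)
print(lhs, H_x_given_y)                 # lhs >= H(X | Y), as Fano's inequality requires
```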

Lemma. If $X$ and $X'$ are i.i.d. with entropy $H(X)$,

$$\Pr(X = X') \ge 2^{-H(X)},$$

with equality if and only if $X$ has a uniform distribution.

Proof of Lemma. Suppose that $X \sim p(x)$. By Jensen's inequality, we have

$$2^{\mathbb{E}[\log p(X)]} \le \mathbb{E}\left[2^{\log p(X)}\right],$$

which implies that

$$2^{-H(X)} = 2^{\sum_x p(x) \log p(x)} \le \sum_x p(x)\, 2^{\log p(x)} = \sum_x p(x)^2 = \Pr(X = X').$$
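A quick numeric check of the lemma on an arbitrary PMF (our own example): the collision probability $\sum_x p(x)^2$ is at least $2^{-H(X)}$.

```python
import math

p = [0.5, 0.25, 0.25]                   # PMF of X (and of the i.i.d. copy X')
H = -sum(x * math.log2(x) for x in p)
collision = sum(x * x for x in p)       # Pr(X = X') = sum_x p(x)^2

print(collision, 2 ** -H)               # 0.375 >= 2^{-H(X)} = 2^{-1.5} ~ 0.354
```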