Setting

For a discrete r.v. $X$, where $p(x) = \Pr(X = x)$ is the PMF, we define the entropy of $X$ as

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x).$$

Intuition Behind Entropy

Imagine you are predicting the outcome of an event, such as flipping a coin. If the coin is fair, the result (heads or tails) is maximally uncertain, and observing the outcome provides a "surprising" or informative answer. Conversely, if the coin is heavily biased (e.g., 99% chance of heads), the outcome is predictable, so learning the result yields little new information. Entropy formalizes this intuition:

  • High entropy: Outcomes are uncertain or diverse (e.g., fair coin).
  • Low entropy: Outcomes are predictable or concentrated (e.g., biased coin).

Entropy as Expected "Surprise"

Each outcome $x$ of $X$ carries a self-information (or "surprise") defined as $\log \frac{1}{p(x)}$. Rare events (small $p(x)$) yield large self-information, as they are more surprising. Entropy is the average self-information over all outcomes:

$$H(X) = \mathbb{E}_{X \sim p}\left[\log \frac{1}{p(X)}\right].$$

This means entropy measures the expected unpredictability of $X$.

Binary Entropy

For a binary variable $X$ (e.g., a coin flip) with $\Pr(X = 1) = p$ and $\Pr(X = 0) = 1 - p$:

$$H(X) = H(p) = -p \log p - (1 - p) \log (1 - p).$$

This function peaks at $p = \frac{1}{2}$ (fair coin, $H = 1$ bit) and drops to $0$ for $p = 0$ or $p = 1$ (no uncertainty).
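As a quick numeric illustration, the sketch below (plain Python, log base 2; the helper name binary_entropy is ours, not from the text) evaluates $H(p)$ in bits and confirms the peak at $p = \frac{1}{2}$ and the near-zero value for a heavily biased coin.

```python
import math

def binary_entropy(p: float) -> float:
    """H(p) = -p log2 p - (1-p) log2 (1-p), with the convention 0 log 0 := 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))   # 1.0 bit: fair coin, maximal uncertainty
print(binary_entropy(0.99))  # ~0.081 bits: heavily biased coin, nearly predictable
print(binary_entropy(0.0))   # 0.0 bits: no uncertainty at all
```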

Joint Entropy and Conditional Entropy

Joint Entropy

The joint entropy $H(X, Y)$ of a pair of discrete random variables $(X, Y)$ with a joint distribution $p(x, y)$ is defined as

$$H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y),$$

which can also be expressed as the expectation form

$$H(X, Y) = -\mathbb{E}\left[\log p(X, Y)\right].$$

Bijection and Joint Entropy

If there exists a bijection between $X$ and $Y$, then the entropies are equal: $H(X) = H(Y) = H(X, Y)$. A bijection implies a one-to-one and onto mapping, meaning each outcome of $X$ corresponds uniquely to an outcome of $Y$, and vice versa. Therefore, the uncertainty (and thus information entropy) is the same for both.
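A small sketch of this invariance (the PMF and the relabelling map are our own hypothetical example): applying a bijection to the outcomes of $X$ changes only the labels, so the entropy is unchanged.

```python
import math

def entropy(pmf):
    """H = -sum p log2 p over outcomes with positive probability."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

p_x = {0: 0.5, 1: 0.3, 2: 0.2}           # PMF of X
f = {0: "a", 1: "b", 2: "c"}             # a bijection on the outcomes
p_y = {f[x]: p for x, p in p_x.items()}  # PMF of Y = f(X)

print(entropy(p_x))  # ~1.485 bits
print(entropy(p_y))  # same value: relabelling outcomes does not change entropy
```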

Conditional Entropy

The conditional entropy of $Y$ given $X$ is defined as

$$H(Y \mid X) = \sum_{x \in \mathcal{X}} p(x) H(Y \mid X = x) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y \mid x) = -\mathbb{E}\left[\log p(Y \mid X)\right].$$

Chain Rule

$$H(X, Y) = H(X) + H(Y \mid X).$$

Equivalently,

$$\log p(X, Y) = \log p(X) + \log p(Y \mid X),$$

and taking the expectation of both sides gives the chain rule.

Corollary

$$H(X, Y \mid Z) = H(X \mid Z) + H(Y \mid X, Z).$$

Proof

The proof follows the same expansion as the chain rule, with every term additionally conditioned on $Z$.

Asymmetry

Note that $H(Y \mid X) \neq H(X \mid Y)$ in general. However, $H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$.
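The chain rule $H(X, Y) = H(X) + H(Y \mid X)$ can be checked numerically. The sketch below uses a hypothetical 2×2 joint PMF of our choosing and computes $H(X, Y)$, $H(X)$, and $H(Y \mid X)$ directly from their definitions.

```python
import math

# joint PMF p(x, y) on a small alphabet (a hypothetical example, not from the text)
p_xy = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.4, (1, 1): 0.1}

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

H_xy = H(p_xy.values())                                   # joint entropy H(X, Y)
p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
H_x = H(p_x.values())                                     # marginal entropy H(X)
# conditional entropy H(Y | X) = -sum p(x, y) log2 p(y | x)
H_y_given_x = -sum(p * math.log2(p / p_x[x]) for (x, _), p in p_xy.items() if p > 0)

print(H_xy, H_x + H_y_given_x)  # the two numbers agree: H(X, Y) = H(X) + H(Y | X)
```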

Relative Entropy and Mutual Information

K-L Divergence (Relative Entropy)

see also Kullback–Leibler Divergence

The relative entropy or K-L divergence between two probability mass functions $p(x)$ and $q(x)$ is defined as

$$D(p \,\|\, q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} = \mathbb{E}_{p}\left[\log \frac{p(X)}{q(X)}\right].$$

Note

In this definition, we use the convention that $0 \log \frac{0}{0} = 0$, $0 \log \frac{0}{q} = 0$, and $p \log \frac{p}{0} = \infty$. Thus, if there is any symbol $x \in \mathcal{X}$ such that $p(x) > 0$ and $q(x) = 0$, then $D(p \,\|\, q) = \infty$.

Asymmetry

In general $D(p \,\|\, q) \neq D(q \,\|\, p)$, so relative entropy is not symmetric (and is not a true distance metric).
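A minimal sketch of the definition and its asymmetry, using two hypothetical PMFs on a binary alphabet (the helper name kl is ours):

```python
import math

def kl(p, q):
    """D(p || q) = sum p(x) log2 (p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

print(kl(p, q))  # ~0.737 bits
print(kl(q, p))  # ~0.531 bits: a different value, so D(p||q) != D(q||p) in general
```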

Mutual Information

Consider two random variables $X$ and $Y$ with a joint distribution $p(x, y)$ and marginal probability mass functions $p(x)$ and $p(y)$. The mutual information $I(X; Y)$ is the relative entropy between the joint distribution $p(x, y)$ and the product distribution $p(x)p(y)$:

$$I(X; Y) = D\big(p(x, y) \,\|\, p(x)p(y)\big) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)p(y)}.$$

Relationship with independence: $I(X; Y) = 0$ if and only if $X$ and $Y$ are independent.

Relationship between Mutual Information and Entropy

$$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y).$$

In particular,

$$I(X; X) = H(X) - H(X \mid X) = H(X),$$

so entropy is sometimes called self-information.

By this relationship we also have

$$I(X; Y) = I(Y; X).$$
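The sketch below (a hypothetical joint PMF of our choosing) computes $I(X; Y)$ both as $D\big(p(x, y) \,\|\, p(x)p(y)\big)$ and as $H(X) + H(Y) - H(X, Y)$, and the two values agree.

```python
import math

# hypothetical joint PMF of (X, Y) on {0,1} x {0,1}
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# I(X; Y) as the KL divergence between p(x, y) and the product p(x) p(y)
I_kl = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items() if p > 0)
# I(X; Y) via the entropy identity H(X) + H(Y) - H(X, Y)
I_entropy = H(p_x.values()) + H(p_y.values()) - H(p_xy.values())

print(I_kl, I_entropy)  # both computations give the same (non-negative) number
```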

More Chain Rules

Chain rule of entropy

$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1).$$

Conditional mutual information

For random variables $X$, $Y$, and $Z$, the conditional mutual information is defined as

$$I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z) = \mathbb{E}\left[\log \frac{p(X, Y \mid Z)}{p(X \mid Z)\, p(Y \mid Z)}\right].$$

Chain rule for information

$$I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \mid X_{i-1}, \ldots, X_1).$$

Conditional relative entropy

$$D\big(p(y \mid x) \,\|\, q(y \mid x)\big) = \sum_{x} p(x) \sum_{y} p(y \mid x) \log \frac{p(y \mid x)}{q(y \mid x)}.$$

Chain rule for relative entropy

$$D\big(p(x, y) \,\|\, q(x, y)\big) = D\big(p(x) \,\|\, q(x)\big) + D\big(p(y \mid x) \,\|\, q(y \mid x)\big).$$

Jensen's Inequality

See also Jensen's Inequality

Definition

If $f$ is a convex function and $X$ is a random variable,

$$\mathbb{E}\left[f(X)\right] \ge f\left(\mathbb{E}[X]\right).$$

Moreover, if $f$ is strictly convex, the equality implies that $X = \mathbb{E}[X]$ with probability $1$ (i.e., $X$ is a constant).

Information inequality

$$D(p \,\|\, q) \ge 0,$$

with equality if and only if $p(x) = q(x)$ for all $x$.

Non-negativity of mutual information

$$I(X; Y) \ge 0,$$

with equality if and only if $X$ and $Y$ are independent.

Upper bound of $H(X)$

$$H(X) \le \log |\mathcal{X}|,$$

where $|\mathcal{X}|$ denotes the number of elements in the range of $X$, with equality if and only if $X$ has a uniform distribution over $\mathcal{X}$.

We can prove this by calculating the KL divergence between any PMF $p(x)$ and the uniform PMF $u(x) = \frac{1}{|\mathcal{X}|}$:

$$0 \le D(p \,\|\, u) = \sum_{x} p(x) \log \frac{p(x)}{u(x)} = \log |\mathcal{X}| - H(X).$$
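A quick numeric check of this identity on an arbitrary 4-outcome PMF (our own example values): $D(p \,\|\, u) = \log |\mathcal{X}| - H(X) \ge 0$, hence $H(X) \le \log |\mathcal{X}|$.

```python
import math

p = [0.7, 0.1, 0.1, 0.1]          # an arbitrary PMF on 4 outcomes
u = [1 / len(p)] * len(p)         # the uniform PMF on the same alphabet

H = -sum(x * math.log2(x) for x in p if x > 0)
D_pu = sum(x * math.log2(x / ux) for x, ux in zip(p, u) if x > 0)

print(H)                          # ~1.357 bits, below log2(4) = 2
print(math.log2(len(p)) - D_pu)   # equals H: D(p || u) = log|X| - H(X) >= 0
```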

Conditioning reduces entropy (Information can't hurt)

$$H(X \mid Y) \le H(X),$$

with equality if and only if $X$ and $Y$ are independent, since

$$I(X; Y) = H(X) - H(X \mid Y) \ge 0.$$

Independence bound on entropy

Let $X_1, X_2, \ldots, X_n$ be drawn according to $p(x_1, x_2, \ldots, x_n)$. Then

$$H(X_1, X_2, \ldots, X_n) \le \sum_{i=1}^{n} H(X_i),$$

with equality if and only if the $X_i$ are independent.

Proof. By the chain rule for entropies,

$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1) \le \sum_{i=1}^{n} H(X_i),$$

where the inequality follows because conditioning reduces entropy.

Log Sum Inequality

Definition

For non-negative numbers $a_1, a_2, \ldots, a_n$ and $b_1, b_2, \ldots, b_n$,

$$\sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} \ge \left(\sum_{i=1}^{n} a_i\right) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i},$$

with equality if and only if $\frac{a_i}{b_i}$ is constant.

Proof. Setting $f(t) = t \log t$ (which is strictly convex), $\alpha_i = \frac{b_i}{\sum_{j} b_j}$, and $t_i = \frac{a_i}{b_i}$, by Jensen's inequality

$$\sum_{i} \alpha_i f(t_i) \ge f\left(\sum_{i} \alpha_i t_i\right),$$

which is the log sum inequality after substitution.
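A numeric spot-check of the inequality on arbitrary non-negative numbers (our own example values):

```python
import math

a = [1.0, 2.0, 3.0]
b = [2.0, 1.0, 4.0]

lhs = sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b))
rhs = sum(a) * math.log2(sum(a) / sum(b))

print(lhs, rhs)  # lhs >= rhs, with equality only when a_i / b_i is constant
```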

Convexity of KL divergence

$D(p \,\|\, q)$ is convex in the pair $(p, q)$: that is, if $(p_1, q_1)$ and $(p_2, q_2)$ are two pairs of probability mass functions, then

$$D\big(\lambda p_1 + (1 - \lambda) p_2 \,\|\, \lambda q_1 + (1 - \lambda) q_2\big) \le \lambda D(p_1 \,\|\, q_1) + (1 - \lambda) D(p_2 \,\|\, q_2)$$

for all $0 \le \lambda \le 1$.
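A numeric spot-check of this convexity (the pairs $(p_1, q_1)$, $(p_2, q_2)$ and the weight $\lambda = 0.4$ are our own example values): the divergence of the mixtures is at most the mixture of the divergences.

```python
import math

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p1, q1 = [0.8, 0.2], [0.5, 0.5]
p2, q2 = [0.3, 0.7], [0.6, 0.4]
lam = 0.4
p_mix = [lam * a + (1 - lam) * b for a, b in zip(p1, p2)]
q_mix = [lam * a + (1 - lam) * b for a, b in zip(q1, q2)]

print(kl(p_mix, q_mix))                           # KL of the mixtures ...
print(lam * kl(p1, q1) + (1 - lam) * kl(p2, q2))  # ... is at most the mixture of the KLs
```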

Concavity of entropy

$H(p)$ is a concave function of $p$, since $H(p) = \log |\mathcal{X}| - D(p \,\|\, u)$ (where $u$ is the uniform distribution over $\mathcal{X}$) and $D(p \,\|\, u)$ is convex in $p$.

Data-Processing Inequality (DPI)

Markov chain

Random variables $X, Y, Z$ are said to form a Markov chain in that order (denoted by $X \to Y \to Z$) if the joint probability mass function can be written as

$$p(x, y, z) = p(x)\, p(y \mid x)\, p(z \mid y),$$

i.e., $X$ and $Z$ are conditionally independent given $Y$.

Data-processing inequality

If $X \to Y \to Z$, then $I(X; Y) \ge I(X; Z)$.

Idea

This means that processing $Y$ to produce $Z$ cannot increase the amount of information $Z$ contains about $X$. At best, $Z$ retains the same amount of information as $Y$, but typically, some information is lost.

The data-processing inequality captures the idea that processing data cannot increase the amount of information it contains about the original source. It formalizes the intuition that each step of processing (e.g., encoding, transmitting, decoding) can only preserve or reduce information, never create new information about the original data.

Proof: By the chain rule, we can expand mutual information in two different ways:

$$I(X; Y, Z) = I(X; Z) + I(X; Y \mid Z) = I(X; Y) + I(X; Z \mid Y).$$

Since $X$ and $Z$ are conditionally independent given $Y$, we have $I(X; Z \mid Y) = 0$. Since $I(X; Y \mid Z) \ge 0$, we have $I(X; Y) \ge I(X; Z)$.

Similarly, one can prove that $I(Y; Z) \ge I(X; Z)$.
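The inequality can also be illustrated numerically. The sketch below builds a hypothetical Markov chain $X \to Y \to Z$ from two binary symmetric channels (the flip probabilities 0.1 and 0.2 are our own choices) and checks that $I(X; Z) \le I(X; Y)$.

```python
import math

# X -> Y -> Z: X is a fair bit, Y is X through a binary symmetric channel with
# flip probability 0.1, and Z is Y through another BSC with flip probability 0.2.
eps1, eps2 = 0.1, 0.2
p_x = {0: 0.5, 1: 0.5}
p_xy = {(x, y): p_x[x] * (1 - eps1 if y == x else eps1) for x in (0, 1) for y in (0, 1)}
p_xz = {}
for x in (0, 1):
    for z in (0, 1):
        # marginalise over y: p(x, z) = sum_y p(x) p(y|x) p(z|y)
        p_xz[(x, z)] = sum(
            p_x[x] * (1 - eps1 if y == x else eps1) * (1 - eps2 if z == y else eps2)
            for y in (0, 1)
        )

def mutual_information(p_ab):
    p_a, p_b = {}, {}
    for (a, b), p in p_ab.items():
        p_a[a] = p_a.get(a, 0) + p
        p_b[b] = p_b.get(b, 0) + p
    return sum(p * math.log2(p / (p_a[a] * p_b[b])) for (a, b), p in p_ab.items() if p > 0)

print(mutual_information(p_xy))  # I(X; Y) ~ 0.531 bits
print(mutual_information(p_xz))  # I(X; Z) is smaller: processing Y into Z lost information
```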

Corollary. If $X \to Y \to Z$, then $I(X; Y \mid Z) \le I(X; Y)$.

Why

  • When $Z$ is known, some of the information $Y$ contains about $X$ might already be "explained" by $Z$. This reduces the additional information $Y$ can provide about $X$ beyond what $Z$ already tells us.
  • In other words, $Z$ acts as a "summary" or "partial explanation" of $Y$, so knowing $Z$ reduces the uncertainty about $X$ that $Y$ can resolve.

Sufficient Statistics

Reference: Sufficient Statistic

Suppose that we have a family of probability mass functions $\{f_\theta(x)\}$ indexed by $\theta$, and let $X$ be a sample from a distribution in this family. Let $T(X)$ be any statistic (i.e. a function of the sample) like the sample mean or sample variance. Then $\theta \to X \to T(X)$, and by the data-processing inequality, we have

$$I(\theta; T(X)) \le I(\theta; X)$$

for any distribution on $\theta$. However, if equality holds, no information is lost.

A function $T(X)$ is said to be a sufficient statistic relative to the family $\{f_\theta(x)\}$ if $X$ is independent of $\theta$ given $T(X)$ for any distribution on $\theta$ (i.e., $\theta \to T(X) \to X$ forms a Markov chain).

This is the same as the condition

$$I(\theta; X) = I(\theta; T(X))$$

(since we have both $I(\theta; T(X)) \le I(\theta; X)$ and $I(\theta; X) \le I(\theta; T(X))$).

A statistic $T(X)$ is a minimal sufficient statistic relative to $\{f_\theta(x)\}$ if it is a function of every other sufficient statistic $U$. Interpreting this in terms of the data-processing inequality, this implies that

$$\theta \to T(X) \to U(X) \to X.$$

Hence, a minimal sufficient statistic maximally compresses the information about $\theta$ in the sample. Other sufficient statistics may contain additional irrelevant information. For example, for a normal distribution with mean $\theta$, the pair of functions giving the mean of all odd samples and the mean of all even samples is a sufficient statistic, but not a minimal sufficient statistic (a minimal one is the single function giving the mean of all samples).
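A small numeric illustration of sufficiency (a hypothetical two-point prior on $\theta$ and two Bernoulli($\theta$) samples; all names here are ours): $T(X) = X_1 + X_2$ is sufficient for the Bernoulli family, so $I(\theta; X) = I(\theta; T(X))$.

```python
import math
from itertools import product

# Hypothetical setup: theta takes two values with a uniform prior, and X = (X1, X2)
# are i.i.d. Bernoulli(theta). T(X) = X1 + X2 is sufficient, so I(theta; X) = I(theta; T).
thetas = {0.3: 0.5, 0.7: 0.5}

def mutual_information(p_ab):
    p_a, p_b = {}, {}
    for (a, b), p in p_ab.items():
        p_a[a] = p_a.get(a, 0) + p
        p_b[b] = p_b.get(b, 0) + p
    return sum(p * math.log2(p / (p_a[a] * p_b[b])) for (a, b), p in p_ab.items() if p > 0)

p_theta_x = {}   # joint distribution of (theta, (x1, x2))
p_theta_t = {}   # joint distribution of (theta, x1 + x2)
for theta, prior in thetas.items():
    for x1, x2 in product((0, 1), repeat=2):
        p = prior * (theta if x1 else 1 - theta) * (theta if x2 else 1 - theta)
        p_theta_x[(theta, (x1, x2))] = p
        key = (theta, x1 + x2)
        p_theta_t[key] = p_theta_t.get(key, 0) + p

print(mutual_information(p_theta_x))  # I(theta; X)
print(mutual_information(p_theta_t))  # same value: no information about theta is lost
```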

Fano's Inequality

Suppose that we know a random variable $Y$ and we wish to guess the value of a correlated random variable $X$. Fano's inequality relates the probability of error in guessing the random variable $X$ to its conditional entropy $H(X \mid Y)$.

Fano's Inequality. For any estimator $\hat{X}$ such that $X \to Y \to \hat{X}$, with an error probability $P_e = \Pr(\hat{X} \neq X)$, we have

$$H(P_e) + P_e \log |\mathcal{X}| \ge H(X \mid \hat{X}) \ge H(X \mid Y).$$

The core idea here is that

  1. The uncertainty about $X$ given our estimate $\hat{X}$ (i.e., $H(X \mid \hat{X})$) must be at least as large as the uncertainty about $X$ given the original variable $Y$ (i.e., $H(X \mid Y)$). This makes intuitive sense because $\hat{X}$ is derived from $Y$, so it can't contain more information about $X$ than $Y$ itself does.
  2. The term $H(P_e) + P_e \log |\mathcal{X}|$ is an upper bound on $H(X \mid \hat{X})$. Here, $H(P_e)$ is the entropy of the error event (i.e., the entropy of a binary random variable that is $1$ with probability $P_e$ and $0$ with probability $1 - P_e$). The term $P_e \log |\mathcal{X}|$ accounts for the worst-case scenario: when an error occurs, we have no information about $X$, so we might have to choose from all possible values in the alphabet $\mathcal{X}$, resulting in an uncertainty of at most $\log |\mathcal{X}|$.

Note that $H(P_e) \le 1$. This inequality can be weakened to

$$1 + P_e \log |\mathcal{X}| \ge H(X \mid Y)$$

or

$$P_e \ge \frac{H(X \mid Y) - 1}{\log |\mathcal{X}|}.$$

Proof. Define the error indicator $E = \mathbf{1}\{\hat{X} \neq X\}$. The first inequality can be proven by expanding $H(E, X \mid \hat{X})$ in two ways:

$$H(E, X \mid \hat{X}) = H(X \mid \hat{X}) + H(E \mid X, \hat{X}) = H(X \mid \hat{X}),$$

since $E$ is a function of $X$ and $\hat{X}$, so $H(E \mid X, \hat{X}) = 0$; and

$$H(E, X \mid \hat{X}) = H(E \mid \hat{X}) + H(X \mid E, \hat{X}) \le H(P_e) + P_e \log |\mathcal{X}|,$$

where $H(E \mid \hat{X}) \le H(E) = H(P_e)$, and for the term $H(X \mid E, \hat{X})$ we used

$$H(X \mid E, \hat{X}) = \Pr(E = 0)\, H(X \mid \hat{X}, E = 0) + \Pr(E = 1)\, H(X \mid \hat{X}, E = 1) \le (1 - P_e) \cdot 0 + P_e \log |\mathcal{X}|.$$

Corollary. For any two random variables $X$ and $Y$, let $P_e = \Pr(X \neq Y)$. We have

$$H(P_e) + P_e \log |\mathcal{X}| \ge H(X \mid Y).$$

Corollary. Let $P_e = \Pr(X \neq \hat{X})$, and let $\hat{X}: \mathcal{Y} \to \mathcal{X}$; then

$$H(P_e) + P_e \log(|\mathcal{X}| - 1) \ge H(X \mid Y).$$

The proof of the theorem goes through without change from the original proof, except that

$$H(X \mid E, \hat{X}) \le (1 - P_e) \cdot 0 + P_e \log(|\mathcal{X}| - 1),$$

since given $E = 1$ and $\hat{X}$, the range of possible outcomes of $X$ is $|\mathcal{X}| - 1$ (we can rule out the value $\hat{X}$ itself).
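A numeric spot-check of the second corollary on a hypothetical joint PMF and estimator (all values below are our own example): the left-hand side $H(P_e) + P_e \log(|\mathcal{X}| - 1)$ dominates $H(X \mid Y)$.

```python
import math

# Hypothetical joint PMF of (X, Y) with X in {0, 1, 2} and Y in {0, 1},
# and a deterministic estimator xhat(y) used to guess X from Y.
p_xy = {(0, 0): 0.35, (1, 0): 0.10, (2, 0): 0.05,
        (0, 1): 0.05, (1, 1): 0.30, (2, 1): 0.15}
xhat = {0: 0, 1: 1}                     # estimate X as 0 when Y = 0, as 1 when Y = 1

P_e = sum(p for (x, y), p in p_xy.items() if xhat[y] != x)           # error probability
p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}
H_x_given_y = -sum(p * math.log2(p / p_y[y]) for (x, y), p in p_xy.items() if p > 0)
H_Pe = -P_e * math.log2(P_e) - (1 - P_e) * math.log2(1 - P_e)

lhs = H_Pe + P_e * math.log2(3 - 1)     # H(P_e) + P_e log(|X| - 1)
print(lhs, H_x_given_y)                 # lhs >= H(X | Y), as Fano's inequality requires
```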

Lemma. If $X$ and $X'$ are i.i.d. with entropy $H(X)$,

$$\Pr(X = X') \ge 2^{-H(X)},$$

with equality if and only if $X$ has a uniform distribution.

Proof of Lemma. Suppose that $X \sim p(x)$. By Jensen's inequality, we have

$$2^{\mathbb{E}[\log p(X)]} \le \mathbb{E}\left[2^{\log p(X)}\right],$$

which implies that

$$2^{-H(X)} = 2^{\sum_x p(x) \log p(x)} \le \sum_x p(x)\, 2^{\log p(x)} = \sum_x p(x)^2 = \Pr(X = X').$$
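A quick numeric check of the lemma on an arbitrary PMF (our own example): the collision probability $\sum_x p(x)^2$ is at least $2^{-H(X)}$.

```python
import math

p = [0.5, 0.25, 0.25]                   # PMF of X (and of the i.i.d. copy X')
H = -sum(x * math.log2(x) for x in p)
collision = sum(x * x for x in p)       # Pr(X = X') = sum_x p(x)^2

print(collision, 2 ** -H)               # 0.375 >= 2^{-H(X)} = 2^{-1.5} ~ 0.354
```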