Setting
For a discrete r.v. $X$, where $p(x) = \Pr\{X = x\}$ is the PMF, we define the entropy of $X$ as
$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x).$$
Intuition Behind Entropy
Imagine you are predicting the outcome of an event, such as flipping a coin. If the coin is fair, the result (heads or tails) is maximally uncertain, and observing the outcome provides a "surprising" or informative answer. Conversely, if the coin is heavily biased (e.g., 99% chance of heads), the outcome is predictable, so learning the result yields little new information. Entropy formalizes this intuition:
- High entropy: Outcomes are uncertain or diverse (e.g., fair coin).
- Low entropy: Outcomes are predictable or concentrated (e.g., biased coin).
Entropy as Expected "Surprise"
Each outcome $x$ of $X$ carries a self-information (or "surprise") defined as $\log \frac{1}{p(x)}$. Rare events (small $p(x)$) yield large self-information, as they are more surprising. Entropy is the average self-information over all outcomes:
$$H(X) = \mathbb{E}_p\left[\log \frac{1}{p(X)}\right].$$
This means entropy measures the expected unpredictability of $X$.
Binary Entropy
For a binary variable $X$ (e.g., a coin flip) with $\Pr\{X = 1\} = p$ and $\Pr\{X = 0\} = 1 - p$:
$$H(X) = -p \log p - (1 - p) \log (1 - p) =: H(p).$$
This function peaks at $p = \frac{1}{2}$ (fair coin, $H = 1$ bit) and drops to $0$ for $p = 0$ or $p = 1$ (no uncertainty).
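As a quick numerical illustration (a minimal NumPy sketch; the probabilities below are arbitrary choices, not from the text), the binary entropy function can be evaluated directly:

```python
import numpy as np

def binary_entropy(p: float) -> float:
    """Binary entropy H(p) in bits, with the convention 0 log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

for p in [0.01, 0.25, 0.5, 0.75, 0.99]:
    print(f"p = {p:.2f}  H(p) = {binary_entropy(p):.4f} bits")
# The maximum (1 bit) is attained at p = 0.5; H(p) -> 0 as p -> 0 or 1.
```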
Joint Entropy and Conditional Entropy
Joint Entropy
The joint entropy $H(X, Y)$ of a pair of discrete random variables $(X, Y)$ with a joint distribution $p(x, y)$ is defined as
$$H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y),$$
which can also be expressed in the expectation form
$$H(X, Y) = -\mathbb{E}\left[\log p(X, Y)\right].$$
Bijection and Joint Entropy
If there exists a bijection between $X$ and $Y$, then their entropies are equal: $H(X) = H(Y)$ (and indeed $H(X, Y) = H(X) = H(Y)$). A bijection implies a one-to-one and onto mapping, meaning each outcome of $X$ corresponds uniquely to an outcome of $Y$, and vice versa. Therefore, the uncertainty (and thus the entropy) is the same for both.
Conditional Entropy
The conditional entropy $H(Y \mid X)$ is defined as
$$H(Y \mid X) = \sum_{x \in \mathcal{X}} p(x)\, H(Y \mid X = x) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y \mid x) = -\mathbb{E}\left[\log p(Y \mid X)\right].$$
Chain Rule
$$H(X, Y) = H(X) + H(Y \mid X).$$
Equivalently, taking logarithms of $p(x, y) = p(x)\, p(y \mid x)$,
$$\log p(X, Y) = \log p(X) + \log p(Y \mid X).$$
Corollary
$$H(X, Y \mid Z) = H(X \mid Z) + H(Y \mid X, Z).$$
Proof
$$H(X, Y) = -\sum_{x, y} p(x, y) \log p(x, y) = -\sum_{x, y} p(x, y) \log \big(p(x)\, p(y \mid x)\big) = -\sum_{x} p(x) \log p(x) - \sum_{x, y} p(x, y) \log p(y \mid x) = H(X) + H(Y \mid X).$$
Asymmetry
Note that $H(Y \mid X) \neq H(X \mid Y)$ in general; however, $H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$.
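The chain rule $H(X, Y) = H(X) + H(Y \mid X)$ and the asymmetry of conditional entropy can be checked numerically. The sketch below (a minimal NumPy example; the joint PMF is an arbitrary, hypothetical choice) computes the conditional entropies directly from their definitions:

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a probability array, with 0 log 0 = 0."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

# An arbitrary (hypothetical) joint PMF p(x, y); rows index x, columns index y.
pxy = np.array([[0.25, 0.10, 0.05],
                [0.05, 0.30, 0.25]])

px = pxy.sum(axis=1)   # marginal p(x)
py = pxy.sum(axis=0)   # marginal p(y)

# Conditional entropies from the definition H(Y|X) = sum_x p(x) H(Y | X = x),
# and symmetrically for H(X|Y).
H_Y_given_X = sum(px[i] * entropy(pxy[i, :] / px[i]) for i in range(len(px)))
H_X_given_Y = sum(py[j] * entropy(pxy[:, j] / py[j]) for j in range(len(py)))

print(f"H(X,Y)        = {entropy(pxy):.4f} bits")
print(f"H(X) + H(Y|X) = {entropy(px) + H_Y_given_X:.4f} bits")  # chain rule
print(f"H(Y) + H(X|Y) = {entropy(py) + H_X_given_Y:.4f} bits")  # chain rule
print(f"H(Y|X) = {H_Y_given_X:.4f}, H(X|Y) = {H_X_given_Y:.4f}  (asymmetric in general)")
```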
Relative Entropy and Mutual Information
K-L Divergence (Relative Entropy)
see also Kullback–Leibler Divergence
The relative entropy or Kullback–Leibler divergence between two probability mass functions $p(x)$ and $q(x)$ is defined as
$$D(p \| q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} = \mathbb{E}_p\left[\log \frac{p(X)}{q(X)}\right].$$
Note
In this definition, we use the convention that $0 \log \frac{0}{0} = 0$, $0 \log \frac{0}{q} = 0$, and $p \log \frac{p}{0} = \infty$. Thus, if there is any symbol $x \in \mathcal{X}$ such that $p(x) > 0$ and $q(x) = 0$, then $D(p \| q) = \infty$.
Asymmetry
In general, $D(p \| q) \neq D(q \| p)$; relative entropy is not symmetric and does not satisfy the triangle inequality, so it is not a true metric.
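A small numerical sketch (with two arbitrary, hypothetical PMFs chosen only for illustration) makes the asymmetry concrete; the helper also follows the zero-probability conventions noted above:

```python
import numpy as np

def kl_divergence(p, q):
    """D(p || q) in bits, using 0 log(0/q) = 0 and p log(p/0) = infinity."""
    total = 0.0
    for pi, qi in zip(np.asarray(p, float), np.asarray(q, float)):
        if pi == 0:
            continue             # 0 log (0/q) = 0
        if qi == 0:
            return float("inf")  # p log (p/0) = infinity
        total += pi * np.log2(pi / qi)
    return total

# Two arbitrary (hypothetical) PMFs on the same 3-symbol alphabet.
p = np.array([0.7, 0.2, 0.1])
q = np.array([1/3, 1/3, 1/3])

print(f"D(p||q) = {kl_divergence(p, q):.4f} bits")
print(f"D(q||p) = {kl_divergence(q, p):.4f} bits  (not equal: KL is asymmetric)")
```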
Mutual Information
Consider two random variables $X$ and $Y$ with a joint probability mass function $p(x, y)$ and marginal probability mass functions $p(x)$ and $p(y)$. The mutual information $I(X; Y)$ is the relative entropy between the joint distribution and the product of the marginals:
$$I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} = D\big(p(x, y) \,\|\, p(x)\, p(y)\big).$$
Relationship with independence: $I(X; Y) = 0$ if and only if $X$ and $Y$ are independent.
Relationship between Mutual Information and Entropy
$$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y).$$
In particular,
$$I(X; X) = H(X) - H(X \mid X) = H(X),$$
so entropy is sometimes referred to as self-information. By this relationship we also have the symmetry $I(X; Y) = I(Y; X)$.
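The KL form and the entropy identity can be checked against each other on a toy joint PMF (an arbitrary, hypothetical choice used only for illustration), along with the independence case:

```python
import numpy as np

def entropy(p):
    """Entropy in bits, with 0 log 0 = 0."""
    p = np.asarray(p, float).ravel()
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

# An arbitrary (hypothetical) joint PMF p(x, y).
pxy = np.array([[0.30, 0.10],
                [0.15, 0.45]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

# I(X;Y) as the KL divergence D( p(x,y) || p(x) p(y) ).
prod = np.outer(px, py)
nz = pxy > 0
I_kl = np.sum(pxy[nz] * np.log2(pxy[nz] / prod[nz]))

# I(X;Y) from the entropy identity H(X) + H(Y) - H(X,Y).
I_ent = entropy(px) + entropy(py) - entropy(pxy)

print(f"I(X;Y) via KL form   : {I_kl:.4f} bits")
print(f"I(X;Y) via entropies : {I_ent:.4f} bits")

# For an independent pair, the joint factors as p(x)p(y) and I(X;Y) = 0.
indep = np.outer([0.4, 0.6], [0.5, 0.5])
print(f"I(X;Y) for independent X, Y: "
      f"{entropy(indep.sum(1)) + entropy(indep.sum(0)) - entropy(indep):.4f} bits")
```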
More Chain Rules
Chain rule of entropy
Let $X_1, X_2, \ldots, X_n$ be drawn according to $p(x_1, x_2, \ldots, x_n)$. Then
$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1).$$
Conditional mutual information
For random variables $X$ and $Y$ given $Z$, the conditional mutual information is defined as
$$I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z) = \mathbb{E}_{p(x, y, z)}\left[\log \frac{p(X, Y \mid Z)}{p(X \mid Z)\, p(Y \mid Z)}\right].$$
Chain rule for information
$$I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \mid X_{i-1}, \ldots, X_1).$$
Conditional relative entropy
$$D\big(p(y \mid x) \,\|\, q(y \mid x)\big) = \sum_{x} p(x) \sum_{y} p(y \mid x) \log \frac{p(y \mid x)}{q(y \mid x)} = \mathbb{E}_{p(x, y)}\left[\log \frac{p(Y \mid X)}{q(Y \mid X)}\right].$$
Chain rule for relative entropy
$$D\big(p(x, y) \,\|\, q(x, y)\big) = D\big(p(x) \,\|\, q(x)\big) + D\big(p(y \mid x) \,\|\, q(y \mid x)\big).$$
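These chain rules are easy to verify numerically. The sketch below checks the chain rule for relative entropy on two small, arbitrary joint PMFs (hypothetical values with full support, so all divergences are finite):

```python
import numpy as np

def kl(p, q):
    """D(p || q) in bits; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, float).ravel(), np.asarray(q, float).ravel()
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

# Two arbitrary (hypothetical) joint PMFs p(x, y) and q(x, y) on the same 2x2 alphabet.
pxy = np.array([[0.30, 0.10],
                [0.15, 0.45]])
qxy = np.array([[0.20, 0.20],
                [0.25, 0.35]])

px, qx = pxy.sum(axis=1), qxy.sum(axis=1)

# Conditional relative entropy D( p(y|x) || q(y|x) ) = sum_x p(x) D( p(y|x=x) || q(y|x=x) ).
D_cond = sum(px[i] * kl(pxy[i] / px[i], qxy[i] / qx[i]) for i in range(len(px)))

print(f"D(p(x,y) || q(x,y))                   = {kl(pxy, qxy):.4f} bits")
print(f"D(p(x) || q(x)) + D(p(y|x) || q(y|x)) = {kl(px, qx) + D_cond:.4f} bits")  # equal by the chain rule
```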
Jensen's Inequality
See also Jensen's Inequality
Definition
If $f$ is a convex function and $X$ is a random variable, then
$$\mathbb{E}[f(X)] \geq f(\mathbb{E}[X]).$$
Moreover, if $f$ is strictly convex, equality implies that $X = \mathbb{E}[X]$ with probability $1$ (i.e., $X$ is a constant).
Information inequality
Let $p(x)$ and $q(x)$, $x \in \mathcal{X}$, be two probability mass functions. Then
$$D(p \| q) \geq 0,$$
with equality if and only if $p(x) = q(x)$ for all $x$.
Non-negativity of mutual information
For any two random variables $X$ and $Y$,
$$I(X; Y) \geq 0,$$
with equality if and only if $X$ and $Y$ are independent.
Upper bound of $H(X)$
$H(X) \leq \log |\mathcal{X}|$, where $|\mathcal{X}|$ denotes the number of elements in the range of $X$, with equality if and only if $X$ has a uniform distribution over $\mathcal{X}$.
We can prove this by calculating the KL divergence between an arbitrary PMF $p(x)$ and the uniform PMF $u(x) = \frac{1}{|\mathcal{X}|}$:
$$0 \leq D(p \| u) = \sum_{x} p(x) \log \frac{p(x)}{u(x)} = \log |\mathcal{X}| - H(X).$$
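The identity $D(p \| u) = \log |\mathcal{X}| - H(X)$ used in this argument can be checked numerically on an arbitrary PMF (the values below are illustrative only):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

# An arbitrary (hypothetical) PMF on a 4-symbol alphabet and the uniform PMF u.
p = np.array([0.5, 0.25, 0.15, 0.10])
u = np.full(4, 0.25)

print(f"H(p)             = {entropy(p):.4f} bits")
print(f"log |X|          = {np.log2(4):.4f} bits")
print(f"log|X| - D(p||u) = {np.log2(4) - kl(p, u):.4f} bits  # equals H(p)")
```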
Conditioning reduces entropy (Information can't hurt)
$$H(X \mid Y) \leq H(X),$$
since $I(X; Y) = H(X) - H(X \mid Y) \geq 0$, with equality if and only if $X$ and $Y$ are independent.
Independence bound on entropy
Let $X_1, X_2, \ldots, X_n$ be drawn according to $p(x_1, x_2, \ldots, x_n)$. Then
$$H(X_1, X_2, \ldots, X_n) \leq \sum_{i=1}^{n} H(X_i),$$
with equality if and only if the $X_i$ are independent.
Proof. By the chain rule for entropies,
$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1) \leq \sum_{i=1}^{n} H(X_i),$$
where the inequality follows because conditioning reduces entropy.
Log Sum Inequality
Definition
For non-negative numbers $a_1, a_2, \ldots, a_n$ and $b_1, b_2, \ldots, b_n$,
$$\sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} \geq \left(\sum_{i=1}^{n} a_i\right) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i},$$
with equality if and only if $\frac{a_i}{b_i}$ is constant.
Proof. Assume without loss of generality that $a_i > 0$ and $b_i > 0$. Setting $\alpha_i = \frac{b_i}{\sum_j b_j}$ and $t_i = \frac{a_i}{b_i}$, and noting that $f(t) = t \log t$ is strictly convex, by Jensen's inequality
$$\sum_i \alpha_i f(t_i) \geq f\left(\sum_i \alpha_i t_i\right),$$
which, after substituting back and multiplying both sides by $\sum_j b_j$, is exactly the log sum inequality.
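A quick numerical check of the log sum inequality, using arbitrary positive numbers (illustrative values only), including the equality case where $a_i / b_i$ is constant:

```python
import numpy as np

# Arbitrary (hypothetical) positive numbers a_i and b_i.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 4.0])

lhs = np.sum(a * np.log2(a / b))
rhs = a.sum() * np.log2(a.sum() / b.sum())

print(f"sum_i a_i log(a_i / b_i)       = {lhs:.4f}")
print(f"(sum a) log(sum a / sum b)     = {rhs:.4f}")
print(f"log sum inequality holds       : {lhs >= rhs}")

# With a_i / b_i constant, the two sides coincide.
a2 = np.array([1.0, 2.0, 3.0])
b2 = 2 * a2
print(np.isclose(np.sum(a2 * np.log2(a2 / b2)),
                 a2.sum() * np.log2(a2.sum() / b2.sum())))
```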
Convexity of KL divergence
$D(p \| q)$ is convex in the pair $(p, q)$: that is, if $(p_1, q_1)$ and $(p_2, q_2)$ are two pairs of probability mass functions, then
$$D\big(\lambda p_1 + (1 - \lambda) p_2 \,\big\|\, \lambda q_1 + (1 - \lambda) q_2\big) \leq \lambda D(p_1 \| q_1) + (1 - \lambda) D(p_2 \| q_2)$$
for all $0 \leq \lambda \leq 1$.
Concavity of entropy
$H(p)$ is a concave function of $p$, since $H(p) = \log |\mathcal{X}| - D(p \| u)$, where $u$ is the uniform distribution over $\mathcal{X}$, and $D(p \| u)$ is convex in $p$.
Data-Processing Inequality (DPI)
Markov chain
Random variables $X, Y, Z$ are said to form a Markov chain in that order (denoted by $X \to Y \to Z$) if the joint probability mass function can be written as
$$p(x, y, z) = p(x)\, p(y \mid x)\, p(z \mid y),$$
i.e., if $Z$ is conditionally independent of $X$ given $Y$.
Data-processing inequality
If $X \to Y \to Z$, then $I(X; Y) \geq I(X; Z)$.
Idea
This means that processing $Y$ to produce $Z$ cannot increase the amount of information $Z$ contains about $X$. At best, $Z$ retains the same amount of information about $X$ as $Y$ does, but typically some information is lost.
The data-processing inequality captures the idea that processing data cannot increase the amount of information it contains about the original source. It formalizes the intuition that each step of processing (e.g., encoding, transmitting, decoding) can only preserve or reduce information, never create new information about the original data.
Proof: By the chain rule, we can expand mutual information in two different ways:
$$I(X; Y, Z) = I(X; Z) + I(X; Y \mid Z) = I(X; Y) + I(X; Z \mid Y).$$
Since $X$ and $Z$ are conditionally independent given $Y$, we have $I(X; Z \mid Y) = 0$. Since $I(X; Y \mid Z) \geq 0$, we have
$$I(X; Y) \geq I(X; Z).$$
Similarly, one can prove that $I(Y; Z) \geq I(X; Z)$.
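The data-processing inequality can be verified numerically on a small, hypothetical Markov chain $X \to Y \to Z$ built from arbitrary transition matrices (illustrative values only):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, float).ravel()
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def mutual_information(pxy):
    """I(X;Y) in bits from a joint PMF table."""
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    return entropy(px) + entropy(py) - entropy(pxy)

# A hypothetical Markov chain X -> Y -> Z: p(x, y, z) = p(x) p(y|x) p(z|y).
px = np.array([0.3, 0.7])                # p(x)
py_given_x = np.array([[0.9, 0.1],       # p(y|x), rows indexed by x
                       [0.2, 0.8]])
pz_given_y = np.array([[0.7, 0.3],       # p(z|y), rows indexed by y
                       [0.4, 0.6]])

# Joint distributions needed for the three mutual informations.
pxy = px[:, None] * py_given_x           # p(x, y)
py = pxy.sum(axis=0)                     # p(y)
pyz = py[:, None] * pz_given_y           # p(y, z)
pxz = pxy @ pz_given_y                   # p(x, z) = sum_y p(x, y) p(z|y)

I_XY, I_XZ, I_YZ = map(mutual_information, (pxy, pxz, pyz))
print(f"I(X;Y) = {I_XY:.4f}  >=  I(X;Z) = {I_XZ:.4f}")   # data-processing inequality
print(f"I(Y;Z) = {I_YZ:.4f}  >=  I(X;Z) = {I_XZ:.4f}")
```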
Corollary. If $X \to Y \to Z$, then $I(X; Y \mid Z) \leq I(X; Y)$.
Why
- When $Z$ is known, some of the information $Y$ contains about $X$ might already be "explained" by $Z$. This reduces the additional information $Y$ can provide about $X$ beyond what $Z$ already tells us.
- In other words, $Z$ acts as a "summary" or "partial explanation" of $Y$, so knowing $Z$ reduces the uncertainty about $X$ that $Y$ can resolve.
Sufficient Statistics
Reference: Sufficient Statistic
Suppose that we have a family of probability mass functions $\{f_\theta(x)\}$ indexed by $\theta$, and let $X$ be a sample from a distribution in this family. Let $T(X)$ be any statistic (i.e., a function of the sample), like the sample mean or sample variance. Then $\theta \to X \to T(X)$, and by the data-processing inequality, we have
$$I(\theta; T(X)) \leq I(\theta; X)$$
for any distribution on $\theta$. However, if equality holds, no information is lost.
A function $T(X)$ is said to be a sufficient statistic relative to the family $\{f_\theta(x)\}$ if $X$ is independent of $\theta$ given $T(X)$ for any distribution on $\theta$ (i.e., $\theta \to T(X) \to X$ forms a Markov chain).
This is the same as the condition
$$I(\theta; X) = I(\theta; T(X))$$
(since we have both $I(\theta; T(X)) \leq I(\theta; X)$ from $\theta \to X \to T(X)$ and $I(\theta; X) \leq I(\theta; T(X))$ from $\theta \to T(X) \to X$).
A statistic $T(X)$ is a minimal sufficient statistic relative to $\{f_\theta(x)\}$ if it is a function of every other sufficient statistic $U$. Interpreting this in terms of the data-processing inequality, this implies that
$$\theta \to T(X) \to U(X) \to X.$$
Hence, a minimal sufficient statistic maximally compresses the information about $\theta$ in the sample. Other sufficient statistics may contain additional irrelevant information. For example, for a normal distribution with mean $\theta$, the pair of functions giving the mean of all odd-indexed samples and the mean of all even-indexed samples is a sufficient statistic, but not a minimal sufficient statistic (the minimal one is a single function, the mean of all samples).
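The defining property $I(\theta; X) = I(\theta; T(X))$ can be checked numerically in a toy setting (a hypothetical two-point prior on a Bernoulli parameter with $n = 3$ i.i.d. samples; all values are illustrative), where the sample sum is sufficient but a single sample is not:

```python
import numpy as np
from itertools import product
from math import comb

def entropy(p):
    p = np.asarray(p, float).ravel()
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def mi(joint):
    """I(A;B) in bits from a joint PMF table with rows A and columns B."""
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    return entropy(pa) + entropy(pb) - entropy(joint)

# Hypothetical setup: theta uniform on {0.3, 0.7}, X = (X1, X2, X3) i.i.d. Bernoulli(theta).
thetas, prior, n = [0.3, 0.7], [0.5, 0.5], 3
samples = list(product([0, 1], repeat=n))

# p(theta, x) over all 2^n samples, and the induced p(theta, T) for T(x) = sum(x).
p_theta_x = np.array([[pr * th**sum(x) * (1 - th)**(n - sum(x)) for x in samples]
                      for th, pr in zip(thetas, prior)])
p_theta_t = np.array([[pr * comb(n, t) * th**t * (1 - th)**(n - t) for t in range(n + 1)]
                      for th, pr in zip(thetas, prior)])
# A non-sufficient statistic for comparison: U(x) = x_1 alone.
p_theta_x1 = np.array([[pr * (1 - th), pr * th] for th, pr in zip(thetas, prior)])

print(f"I(theta; X)    = {mi(p_theta_x):.4f} bits")
print(f"I(theta; T(X)) = {mi(p_theta_t):.4f} bits   # equal: T = sum(X) is sufficient")
print(f"I(theta; X_1)  = {mi(p_theta_x1):.4f} bits   # smaller: X_1 alone is not sufficient")
```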
Fano's Inequality
Suppose that we know a random variable $Y$ and we wish to guess the value of a correlated random variable $X$. Fano's inequality relates the probability of error in guessing the random variable $X$ to its conditional entropy $H(X \mid Y)$.
Fano's Inequality. For any estimator $\hat{X}$ such that $X \to Y \to \hat{X}$, with an error probability $P_e = \Pr\{\hat{X} \neq X\}$, we have
$$H(P_e) + P_e \log |\mathcal{X}| \geq H(X \mid \hat{X}) \geq H(X \mid Y).$$
The core idea here is that
- The uncertainty about $X$ given our estimate $\hat{X}$ (i.e., $H(X \mid \hat{X})$) must be at least as large as the uncertainty about $X$ given the original variable $Y$ (i.e., $H(X \mid Y)$). This makes intuitive sense because $\hat{X}$ is derived from $Y$, so it can't contain more information about $X$ than $Y$ itself does.
- The term $H(P_e) + P_e \log |\mathcal{X}|$ is an upper bound on $H(X \mid \hat{X})$. Here, $H(P_e)$ is the entropy of the error event (i.e., the entropy of a binary random variable that is $1$ with probability $P_e$ and $0$ with probability $1 - P_e$). The term $P_e \log |\mathcal{X}|$ accounts for the worst-case scenario: when an error occurs, we have no information about $X$, so we might have to choose from all possible values in the alphabet $\mathcal{X}$, resulting in an uncertainty of at most $\log |\mathcal{X}|$.
Note that $H(P_e) \leq 1$ (in bits). This inequality can be weakened to
$$1 + P_e \log |\mathcal{X}| \geq H(X \mid Y),$$
or
$$P_e \geq \frac{H(X \mid Y) - 1}{\log |\mathcal{X}|}.$$
Proof. Define an error random variable $E = 1$ if $\hat{X} \neq X$ and $E = 0$ if $\hat{X} = X$. The first inequality can be proven by expanding $H(E, X \mid \hat{X})$ in two ways:
$$H(E, X \mid \hat{X}) = H(X \mid \hat{X}) + \underbrace{H(E \mid X, \hat{X})}_{=0} = H(E \mid \hat{X}) + H(X \mid E, \hat{X}) \leq H(P_e) + P_e \log |\mathcal{X}|,$$
where for the term $H(X \mid E, \hat{X})$ we used
$$H(X \mid E, \hat{X}) = \Pr(E = 0)\, H(X \mid \hat{X}, E = 0) + \Pr(E = 1)\, H(X \mid \hat{X}, E = 1) \leq (1 - P_e) \cdot 0 + P_e \log |\mathcal{X}|.$$
The second inequality, $H(X \mid \hat{X}) \geq H(X \mid Y)$, follows from the data-processing inequality, since $X \to Y \to \hat{X}$.
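Fano's inequality can also be checked numerically. The sketch below uses an arbitrary, hypothetical joint PMF (illustrative values only) and the MAP estimator $\hat{X}(Y) = \arg\max_x p(x \mid y)$; the bound holds for this and any other estimator based on $Y$:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, float).ravel()
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

# An arbitrary (hypothetical) joint PMF p(x, y) over |X| = 3 values of x and 2 of y.
pxy = np.array([[0.20, 0.05],
                [0.10, 0.25],
                [0.15, 0.25]])
py = pxy.sum(axis=0)

# H(X|Y) = H(X,Y) - H(Y).
H_X_given_Y = entropy(pxy) - entropy(py)

# MAP estimator: for each y, guess the x maximizing p(x, y); P_e is its error probability.
xhat = pxy.argmax(axis=0)
P_e = 1.0 - sum(pxy[xhat[j], j] for j in range(pxy.shape[1]))

H_Pe = entropy([P_e, 1 - P_e])
bound = H_Pe + P_e * np.log2(pxy.shape[0])

print(f"H(X|Y)              = {H_X_given_Y:.4f} bits")
print(f"H(P_e) + P_e log|X| = {bound:.4f} bits  (Fano bound, must be >= H(X|Y))")
```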
Corollary. For any two random variables $X$ and $Y$, let $P_e = \Pr\{X \neq Y\}$. We have
$$H(P_e) + P_e \log |\mathcal{X}| \geq H(X \mid Y).$$
Corollary. Let $P_e = \Pr\{\hat{X} \neq X\}$, and let $\hat{X} : \mathcal{Y} \to \mathcal{X}$; then
$$H(P_e) + P_e \log\big(|\mathcal{X}| - 1\big) \geq H(X \mid Y).$$
The proof of the theorem goes through without change, except that
$$H(X \mid E, \hat{X}) \leq (1 - P_e) \cdot 0 + P_e \log\big(|\mathcal{X}| - 1\big),$$
since given $E = 1$ and the value of $\hat{X}$, the range of possible outcomes of $X$ is $\mathcal{X} \setminus \{\hat{X}\}$, which has $|\mathcal{X}| - 1$ elements.
Lemma. If $X$ and $X'$ are i.i.d. with entropy $H(X)$, then
$$\Pr\{X = X'\} \geq 2^{-H(X)},$$
with equality if and only if $X$ has a uniform distribution.
Proof of Lemma. Suppose that $X \sim p(x)$. By Jensen's inequality (applied to the convex function $2^t$), we have
$$2^{\mathbb{E}[\log p(X)]} \leq \mathbb{E}\left[2^{\log p(X)}\right],$$
which implies that
$$2^{-H(X)} = 2^{\sum_x p(x) \log p(x)} \leq \sum_x p(x)\, 2^{\log p(x)} = \sum_x p(x)^2 = \Pr\{X = X'\}.$$
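The lemma is easy to verify numerically for an arbitrary PMF (illustrative values only), including the equality case for the uniform distribution:

```python
import numpy as np

# An arbitrary (hypothetical) PMF p(x); X and X' are drawn i.i.d. from it.
p = np.array([0.5, 0.25, 0.15, 0.10])

H = -np.sum(p * np.log2(p))
collision_prob = np.sum(p ** 2)        # Pr{X = X'} = sum_x p(x)^2

print(f"Pr{{X = X'}} = {collision_prob:.4f}")
print(f"2^(-H(X))   = {2 ** (-H):.4f}   # lemma: Pr{{X = X'}} >= 2^(-H(X))")

# Equality holds for the uniform distribution.
u = np.full(4, 0.25)
print(np.isclose(np.sum(u ** 2), 2 ** (np.sum(u * np.log2(u)))))
```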