TLDR: A brief summary of a paper that I read recently about why it’s hard to estimate information entropy.
I recently came across a paper with some proofs of lower bounds on the number of samples you need to accurately estimate various information-theoretic quantities in a “black box” way. Although I enjoyed the proofs, I found them hard to understand as written, and so I thought I’d post a more understandable explanation of the core results.
Suppose we have some random variable \(X\) that we can draw i.i.d samples from but know nothing else about. How many samples do we need to draw to accurately estimate its entropy \(H(X)\), its KL-divergence from another random variable \(D(Y \parallel X)\), or its mutual information with another random variable \(I(X; Y)\)? We will show that for our estimate to be “accurate with high probability”, the number of samples we require must be exponential in the quantity we’re estimating.
I’ll present the arguments from the paper fairly informally (in particular, I won’t actually define what I mean by “high probability”, although it shouldn’t be too hard to formalize the notion that I’m using). The core argument is fairly straightforward and I think the formalization doesn’t add much to understanding the underlying logic.
Core Argument
We’ll use the same generic argument for all three cases. Suppose we want to estimate \(f(X)\) in a black box way, where our only information about \(X\) comes from drawing \(N\) i.i.d samples. We will first observe that if our estimator \(E\) is within \(f(X) \pm \epsilon\) with “high probability” (i.e. “accurate with high probability”), then \(E - \epsilon \le f(X)\) with “high probability”, giving us a “black-box lower bound” estimator that is correct with “high probability”.
But we’ll show that for any black-box lower-bound estimate of \(f(X)\) that is correct with “high probability”, our lower-bound estimate will be smaller than some function \(B(N)\) with non-negligible probability. That implies that \(E - \epsilon \le B(N)\), or \(E \le B(N) + \epsilon\), with non-negligible probability.
Thus, for \(E\) to be “accurate with high probability”, we need \(B(N) + \epsilon \ge f(X) - \epsilon\). This forces us to conclude that \(N \ge B^{-1}(f(X) - 2\epsilon)\) for \(E\) to be accurate with high probability.
In other words, if we can show that any black-box lower-bound estimate of \(f(X)\) that is correct with high probability must be \(\le B(N)\) with non-negligible probability, then we can conclude that for our black-box estimator to be accurate with high probability, we need \(N \ge B^{-1}(f(X) - 2\epsilon)\).
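Spelled out (this is just the inequalities above chained together, and it assumes \(B\) is increasing, which it will be for all three bounds below): an accurate \(E\) satisfies \(E \ge f(X) - \epsilon\) with high probability, the adversarial argument gives \(E \le B(N) + \epsilon\) with non-negligible probability, and since those two events must sometimes occur together,

\[ f(X) - \epsilon \le E \le B(N) + \epsilon \implies B(N) \ge f(X) - 2\epsilon \implies N \ge B^{-1}(f(X) - 2\epsilon). \]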
Kullback-Leibler Divergence Estimation
We’ll start with the black-box lower bound for Kullback-Leibler divergence (this is section 3 of the paper), which IMO is the most straightforward of the three arguments.
Suppose we want to lower-bound the KL-divergence of two random variables \(P\) and \(Q\) in a “black box” way, where our only information about \(Q\) comes from drawing i.i.d samples from it (we can assume we know the exact pmf/pdf of \(P\)).
Now, let us define a new random variable \(\tilde{Q}\) that we generate from a two-step process: first, we flip a coin that is heads with probability \(\frac{1}{N}\). If we get heads, then we sample \(\tilde{Q}\) from \(P\). Otherwise, we sample \(\tilde{Q}\) from \(Q\).
Now, if we denote the pmf of \(P\) and \(Q\) by \(p(x)\) and \(q(x)\) respectively, then by construction we have that the pmf of \(\tilde{Q}\) is \(\tilde{q}(x) = \frac{(N-1)q(x) + p(x)}{N}\), which implies that:
\[\begin{aligned} D(P \parallel \tilde{Q}) &= \sum p(x) \lg \frac{p(x)}{\tilde{q}(x)} \\ &= \sum p(x) \lg \frac{N p(x)}{(N-1)q(x) + p(x)} \\ &= \lg N + \sum p(x) \lg \frac{p(x)}{(N-1)q(x) + p(x)} \\ &\le \lg N \end{aligned} \]
(using the fact that \( p(x) \le (N-1)q(x) + p(x)\) since \((N-1)q(x) \ge 0\), and so \( \lg \frac{p(x)}{(N-1)q(x) + p(x)} \le 0\)).
But with probability \((1 - \frac{1}{N})^N \ge \frac{1}{4}\) (for \(N \ge 2\)), all \(N\) of our coin flips come up tails, in which case a sample of \(N\) draws from \(\tilde{Q}\) is indistinguishable from a sample of \(N\) draws from \(Q\).
Thus, any black-box lower-bound estimator of \(D(P \parallel Q)\) that is correct with high probability must also be \(\le \lg N\) with non-negligible probability: a non-negligible fraction of the samples it sees from \(Q\) could equally well have come from \(\tilde{Q}\), where a correct lower bound has to be \(\le D(P \parallel \tilde{Q}) \le \lg N\).
Applying our core argument with \(B(N) = \lg N\), we thus have that any accurate black-box estimator of \(D(P \parallel Q)\) requires \(N > 2^{D(P \parallel Q) - 2\epsilon}\) samples, which is exponential in \(D(P \parallel Q)\).
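For a sense of scale (numbers picked out of thin air): if \(D(P \parallel Q) = 40\) bits and we want accuracy \(\epsilon = 1\) bit, this bound says we need more than \(2^{38} \approx 2.7 \times 10^{11}\) samples.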
Note that this argument generalizes perfectly well to continuous \(Q\) as well: just replace pmf with pdf and replace our sums with integrals and the proofs naturally follow.
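As a quick sanity check on the construction, here is a small numerical sketch (entirely mine, not from the paper; the support size, sample budget, and distributions are made up) showing that the mixture \(\tilde{Q}\) always sits within \(\lg N\) bits of \(P\), even when \(D(P \parallel Q)\) itself is large:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_bits(p, q):
    """D(p || q) in bits; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

N = 100  # hypothetical sample budget
K = 50   # hypothetical support size

# Two arbitrary "spiky" distributions, so that D(P || Q) is reasonably large.
p = rng.dirichlet(np.full(K, 0.05))
q = rng.dirichlet(np.full(K, 0.05))

# The adversarial mixture: with probability 1/N sample from P, otherwise from Q.
q_tilde = (p + (N - 1) * q) / N

print(f"D(P || Q)   = {kl_bits(p, q):.2f} bits")
print(f"D(P || Q~)  = {kl_bits(p, q_tilde):.2f} bits")
print(f"lg N        = {np.log2(N):.2f} bits")
```

No matter how we pick \(p\) and \(q\), the second number stays below the third, while \(N\) draws from \(\tilde{Q}\) look exactly like \(N\) draws from \(Q\) at least a quarter of the time.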
Bounds on Discrete Entropy Estimation
We can use a similar trick (replacing a random variable with another, adversarially chosen random variable whose samples are “indistinguishable” with non-negligible probability) to also derive some bounds on sample sizes for black-box entropy estimation.
Suppose we have some discrete random variable \(P\) with pmf \(p(x)\), and we want to lower-bound \(H(P)\) with high probability using a black-box estimator. We know that entropy is only concerned with the probability mass function \(p(x)\): the actual identity of the events in our sample space is irrelevant. That means that given an i.i.d sample from \(P\), the frequency of frequencies vector is a sufficient statistic for estimating information entropy, where we define the frequency of frequencies vector \(F\) as the vector where \(F_r\) records the number of distinct elements of \(P\)’s support that we see exactly \(r\) times in our sample (you might recognize this from Good-Turing frequency estimation). The frequency of frequencies vector stores exactly the information that we need for entropy (frequencies) while discarding the actual identity of the objects that we observed in our sample.
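For concreteness, here is that statistic in code (a throwaway helper of mine, not something from the paper):

```python
from collections import Counter

def frequency_of_frequencies(sample):
    """F_r = the number of distinct values that appear exactly r times in the sample."""
    counts = Counter(sample)           # value -> number of occurrences
    return Counter(counts.values())    # r -> number of distinct values seen r times

# 'a' appears 3 times, 'b' twice, 'c' and 'd' once each:
print(frequency_of_frequencies("aaabbcd"))  # Counter({1: 2, 3: 1, 2: 1})
```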
We will show that for such black-box lower-bound estimators, we have \(E \le \lg (2N^2) = 2 \lg N + 1\) with non-negligible probability. To do so, let’s only consider \(P\) with at least \(2N^2\) elements in its support (if \(P\)’s support is smaller, then we trivially have that \(H(P) < \lg 2N^2\), and so any lower bound of \(H(P)\) is also \(\le \lg2N^2\)).
For such \(P\), let \(\Omega\) denote \(P\)’s support, let \(A\) be the set of the \(N^2\) most likely outcomes of \(P\), and let \(B\) be the set of the next \(N^2\) most likely outcomes (breaking ties arbitrarily). We will construct a new random variable \(\tilde{P}\) by first drawing a random sample from \(P\): if our sample is \(\in A\), then we will return that sample. Otherwise, we will draw a sample uniformly from \(B\). In other words, this random variable has pmf:
\[\tilde{p}(x) = \begin{cases} p(x) ,& x \in A \\ \frac{1}{N^2} \Pr [ P \not\in A ] ,& x \in B \\ 0 ,& x \not \in A \land x \not \in B \end{cases} \]
Because the support of \(\tilde{P}\) has size at most \(2N^2\), we trivially have that \(H(\tilde{P}) \le \lg 2N^2 = 2 \lg N + 1\).
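To see the entropy collapse concretely, here is a small sketch of the construction (again mine, with an arbitrary made-up \(P\) and sample budget):

```python
import numpy as np

rng = np.random.default_rng(1)

def entropy_bits(p):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

N = 10           # hypothetical sample budget
K = 10 * N**2    # a support much larger than 2 N^2

# An arbitrary high-entropy distribution P over K outcomes.
p = rng.dirichlet(np.full(K, 1.0))

# A = the N^2 most likely outcomes, B = the next N^2 most likely.
order = np.argsort(p)[::-1]
A, B = order[:N**2], order[N**2:2 * N**2]

# P~ keeps P's probabilities on A and spreads the remaining mass uniformly over B.
p_tilde = np.zeros_like(p)
p_tilde[A] = p[A]
p_tilde[B] = (1.0 - p[A].sum()) / N**2

print(f"H(P)       = {entropy_bits(p):.2f} bits")
print(f"H(P~)      = {entropy_bits(p_tilde):.2f} bits")
print(f"lg(2 N^2)  = {np.log2(2 * N**2):.2f} bits")
```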
We will now argue that, to our entropy estimator, i.i.d samples from \(P\) are indistinguishable from samples from \(\tilde{P}\) as long as no element \(x \not\in A\) is sampled more than once. When there are no such repeats, the entropy estimator just sees a frequency of frequencies vector built from the samples in \(A\) (which by construction have the same distribution under both \(P\) and \(\tilde{P}\)) along with a bunch of singleton samples from \(\Omega - A\) (which are also indistinguishable between \(P\) and \(\tilde{P}\), because we are discarding their identities).
Such collision-free samples also occur with non-negligible probability: suppose we draw an i.i.d sample of size \(N\) from either \(P\) or \(\tilde{P}\). Every \(x \not\in A\) has \(p(x) \le \frac{1}{N^2}\) (because \(A\) already contains \(N^2\) outcomes that are each at least as likely as \(x\)) and \(\tilde{p}(x) \le \frac{1}{N^2}\) (by construction), so we can upper-bound the probability of a collision in \(\Omega - A\) by the probability of a collision among samples drawn uniformly over \(B\), a set of size \(N^2\). The probability of having no such collisions is thus lower-bounded for both \(P\) and \(\tilde{P}\) by:
\[\prod_{k=0}^{N-1} (1 - \frac{k}{N^2}) \ge (1 - \frac{1}{N})^N \ge \frac{1}{4} \]
(each of the \(N\) factors above is at least \(1 - \frac{N-1}{N^2} \ge 1 - \frac{1}{N}\), which gives the first inequality), and so such collision-free samples happen with non-negligible probability. Thus, we have shown that any black-box lower-bound estimator of \(H(P)\) that is correct with high probability must be \(\le 2 \lg N + 1\) with non-negligible probability.
Using our core argument (with \(B(N) = 2 \lg N + 1\)), we thus have that for any accurate-with-high-probability black-box estimator of \(H(P)\), we need \(N \ge 2^{\frac{H(P) - 1}{2} - \epsilon} = \Omega(2^{-\epsilon}\sqrt{2^{H(P)}})\) samples.
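For the same made-up numbers as before: estimating the entropy of a distribution with \(H(P) = 40\) bits to within \(\epsilon = 1\) bit needs \(N \ge 2^{18.5} \approx 3.7 \times 10^5\) samples, which is far gentler than the KL-divergence case thanks to the square root.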
Bounds on Mutual Information
We know that for any two random variables \(X, Y\), we have that \(I(X; Y) = H(X) - H(X|Y) \le H(X)\). Thus, any lower bound on \(I(X; Y)\) is also a lower bound on \(H(X)\), and so any black-box lower-bound estimator of \(I(X; Y)\) that is correct with high probability must also be \(\le 2 \lg N + 1\) with non-negligible probability. We thus get the same form of bound: any accurate-with-high-probability black-box estimator of \(I(X; Y)\) needs \(N = \Omega(2^{-\epsilon}\sqrt{2^{I(X; Y)}})\) samples.
Closing Thoughts
I enjoyed these proofs because they were relatively straightforward: the key insight in each case is picking the right adversarial distribution to bound the lower-bound estimators, and once you have the right adversary the mathematics isn’t particularly complicated.
One interesting point is that these are exponential bounds in the quantity itself: that is, if we want to estimate the entropy of a high-entropy distribution, this implies we need lots of samples. But if we want to estimate low-entropy distributions, the mutual information of mostly independent random variables, or KL-divergences between almost identical random variables, then these bounds don’t tell us all that much.