Amateur Hour

Converging Random Variables

Or why are there 3 different definitions?
May 6, 2015. Filed under Technical
Tags: statistics, math

Note: In Dec 2019, I re-worked the exposition and expanded certain parts of this blog post.

In an introduction to probability, you’ll probably deal with lots of random variables (RVs). Like any other sequence, we want to talk about the convergence of sequences of random variables too (e.g. convergence of the sample mean to the population mean in the law of large numbers). The problem is that there are three major definitions of random variable convergence:

  1. Convergence in probability:

\[X_n \overset{P}\to c \iff \forall \epsilon > 0, \lim_{n \to \infty} \Pr [|X_n - c| < \epsilon] = 1 \]

  2. Almost sure convergence:

\[X_n \overset{as}\to c \iff \Pr \big[ \lim_{n \to \infty} X_n = c \big] = 1 \]

  3. Convergence in rth mean:

\[X_n \overset{L^r}\to c \iff \lim_{n \to \infty} \mathbb{E}\big[|X_n - c|^r\big] = 0 \]

Why are there three definitions and how are they related? I couldn’t find a particularly good explanation anywhere, so I’m going to write my own. The following picture of the Law of Large Numbers will be useful to think about:

[Figure: Random Walk]

Convergence in Probability: \(X_n \overset{P}\to X\)

Let’s start with convergence in probability. The definition of a random variable converging in probability is:

\[\forall \epsilon > 0, \lim_{n \to \infty} \Pr [|X_n - X| < \epsilon] = 1 \]

To interpret this definition, consider the picture above. If the dotted lines represents our \(\epsilon\)-region/error-tolerance (i.e. \((X - \epsilon, X + \epsilon)\)), this means that as \(n \to \infty\), almost all of our random walks will be between those error bars.
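To make this picture concrete, here’s a small simulation sketch (hypothetical — not from the original post): we draw many sample-mean paths of fair coin flips and measure what fraction of them sit inside the \(\epsilon\)-band around the true mean at a few checkpoints. The fraction climbs toward 1 as \(n\) grows, which is exactly convergence in probability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate many "random walks" of sample means of fair coin flips.
n_paths, n_steps, eps = 10_000, 2_000, 0.05
flips = rng.integers(0, 2, size=(n_paths, n_steps))
sample_means = np.cumsum(flips, axis=1) / np.arange(1, n_steps + 1)

# Fraction of paths inside the epsilon-band around the true mean 0.5:
# it should climb toward 1 as n grows.
for n in (10, 100, 1_000, 2_000):
    inside = np.mean(np.abs(sample_means[:, n - 1] - 0.5) < eps)
    print(f"n={n:5d}  fraction within eps of 0.5: {inside:.3f}")
```

Note that this only checks the *marginal* distribution at each \(n\) — deliberately so, since that is all convergence in probability talks about.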

This is a decently strong notion of random variable convergence, but it’s “weak” in two ways:

  1. This definition doesn’t say much about any particular random walk. If we sample a particular random walk \(x_1, x_2, \ldots\), it’s possible that this sequence of numbers does not actually converge to anything!
  2. This definition doesn’t say anything about how far random walks can stray outside of the \(\epsilon\) boundary. Although most random walks should be within the \(\epsilon\) boundary, the random walks that aren’t can be arbitrarily far away!

To address these two different weaknesses, we have two different ways we could potentially strengthen our notion of random variable convergence.

Almost Sure Convergence: \(X_n \overset{as}\to X\)

Almost sure convergence tackles the first problem: it essentially says that if we consider every individual random walk sample, almost all of them converge to our desired value.

Now, a random variable is just a (measurable) function from our sample space to some (measurable) metric space. If we call our original sample space \(S\), then this intuition can then be written down as:

\[\Pr \big[ \{w \in S \mid \lim_{n \to \infty} X_n(w) = c \} \big] = 1 \]

which is often written more succinctly as:

\[\Pr \big[\lim_{n \to \infty} X_n = c \big] = 1 \]

If we consider a graphical interpretation of our definition, then this says that for any given \(\epsilon\)-region, almost all of our random walks will stay within our error bars forever after some large (per-sample) \(N_\epsilon\). Note that we might not know what that \(N_\epsilon\) is for our particular random walk draw – we just know that eventually our random walk is almost certain to end up (and stay) within the error bars.

As a concrete example, let’s consider the number of steps a person walks in a day as our random variable, with \(X_n\) being the number of steps they take on day \(n\). Clearly, at some point, the person dies and \(X_n = 0\). But more importantly, after the death, \(X_n\) will always stay at 0. We don’t know when this’ll happen, but we’re certain that it will happen eventually.

Now, given our intuitions we can clearly see that if \(X_n \overset{as}\to c\), then \(X_n \overset{P}\to c\). More formally, suppose that \(\Pr \big[ \lim_{n \to \infty} X_n = c \big] = 1\). Then, let’s consider the set of events

\[\begin{aligned} A_m(\epsilon) &= \{ w \in S : \exists n > m, |X_n(w) - c| \ge \epsilon \} \\ A_\infty(\epsilon) &= \bigcap_{m \ge 1} A_m(\epsilon) \end{aligned} \]

Clearly \(A_{m+1}(\epsilon) \subseteq A_m(\epsilon)\), so by continuity from above \(\lim_{m \to \infty} \Pr [A_m(\epsilon)] = \Pr[A_\infty(\epsilon)]\). But of course, \(\Pr[A_\infty(\epsilon)] = 0\): otherwise, we would have a non-zero probability of drawing a random walk that does not converge to \(c\). Since \(\Pr [|X_n - c| \ge \epsilon] \le \Pr[A_{n-1}(\epsilon)]\), this implies that \(\lim_{n \to \infty} \Pr [|X_n - c| \ge \epsilon] = 0\) – because we have proved this for all positive \(\epsilon\), we have shown that \(X_n \overset{P}\to c\).

Naturally, the converse does not hold: consider the random sequence \(X_n\) that is \(n\) with probability \(\frac{1}{n}\) and \(0\) with probability \(\frac{n-1}{n}\), where we draw each variable in the process independently from all the others. Clearly,

\[\lim_{n \to \infty} \Pr[ |X_n - 0| < \epsilon] = \lim_{n \to \infty} \frac{n-1}{n} = 1 \]

But for any particular random walk, the probability of staying at \(0\) forever after time point \(N\) is:

\[\prod_{n=N}^\infty \frac{n-1}{n} = \frac{N-1}{N} \cdot \frac{N}{N+1} \cdot \frac{N+1}{N+2} \cdot \ldots = \lim_{M \to \infty} \frac{N-1}{M} = 0 \]

So almost none of our random walks actually converge to \(0\)!
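A quick simulation sketch of this counterexample (hypothetical code, with a finite horizon standing in for \(n \to \infty\)): at any fixed large \(n\), almost every draw of \(X_n\) is \(0\), yet most individual paths keep producing spikes long after any cutoff \(N\).

```python
import numpy as np

rng = np.random.default_rng(1)

# X_n = n with probability 1/n, else 0, each drawn independently.
n_paths, n_steps, N = 5_000, 5_000, 1_000
n = np.arange(1, n_steps + 1)
X = np.where(rng.random((n_paths, n_steps)) < 1.0 / n, n, 0)

# Convergence in probability: at a fixed large n, nearly all draws are 0.
print("fraction of paths with X_n = 0 at n=5000:", np.mean(X[:, -1] == 0))

# But individual paths don't settle: many still spike after step N.
# (Over an infinite horizon this fraction would tend to 1; truncating at
# n=5000 caps it at 1 - N/n_steps = 0.8 in expectation.)
spiked = np.mean(np.any(X[:, N:] != 0, axis=1))
print("fraction of paths with a spike after n=1000:", spiked)
```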

Convergence in rth mean: \(X_n \overset{L^r}\to c\)

Convergence in rth mean is a strengthening of convergence in probability that takes into account how far away each random walk is. The definition here is:

\[\lim_{n \to \infty} \mathbb{E}\big[|X_n - c|^r\big] = 0 \]

This ensures that no random walk can stray too far from \(c\) (where “too far away” is weighted by how likely that particular random walk is). As \(r\) increases, the requirement gets stricter: convergence in rth mean implies convergence in sth mean whenever \(r > s > 0\).
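The \(r > s\) claim follows from a standard Jensen’s-inequality argument (sketched here, assuming \(r > s > 0\)): since \(x \mapsto x^{r/s}\) is convex,

\[\mathbb{E}\big[|X_n - c|^s\big]^{r/s} \le \mathbb{E}\Big[\big(|X_n - c|^s\big)^{r/s}\Big] = \mathbb{E}\big[|X_n - c|^r\big] \]

so if the right-hand side goes to \(0\), then \(\mathbb{E}\big[|X_n - c|^s\big]\) must go to \(0\) as well.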

Convergence in rth mean also implies convergence in probability when \(r > 0\): using Markov’s inequality, we get:

\[\Pr\big[|X_n - c| \ge \epsilon \big] = \Pr\big[|X_n - c|^r \ge \epsilon^r \big] \le \frac{\mathbb{E} \big[|X_n - c|^r\big]}{\epsilon^r} \]

Because the right-hand side goes to \(0\) (by definition of \(L^r\) convergence), the left-hand side must also go to \(0\). Thus, \(X_n\) must converge to \(c\) in probability too.

The converse is not true, however: let’s consider the same random walk with \(\Pr[X_n = n] = \frac{1}{n}, \Pr[X_n = 0] = \frac{n-1}{n}\). We already know that this converges in probability, but \(\mathbb{E}[|X_n - 0|] = 1\), which does not go to 0.

Unfortunately, AFAICT there is no necessary relationship between \(L^r\) convergence and almost sure convergence. If we tweak this random sequence such that \(\Pr[X_n = 1] = \frac{1}{n}, \Pr[X_n = 0] = \frac{n-1}{n}\), then suddenly this sequence converges in mean (\(\mathbb{E}|X_n| = \frac{1}{n} \to 0\)) but it still does not converge almost surely.
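A simulation sketch of the tweaked sequence (hypothetical code, finite horizon again): the sample mean of \(|X_n|\) at large \(n\) is tiny, yet most paths keep hitting \(1\) after any cutoff \(N\), so they don’t settle down.

```python
import numpy as np

rng = np.random.default_rng(2)

# Tweaked sequence: X_n = 1 with probability 1/n, else 0, independently.
n_paths, n_steps, N = 5_000, 5_000, 1_000
n = np.arange(1, n_steps + 1)
X = (rng.random((n_paths, n_steps)) < 1.0 / n).astype(float)

# Convergence in mean: E|X_n - 0| = 1/n, so the sample mean at n=5000
# should be close to 1/5000.
print("sample mean of |X_5000|:", X[:, -1].mean())

# But most paths still hit 1 after step N, so they don't converge
# pointwise (the same telescoping-product argument as before applies).
hit = np.mean(np.any(X[:, N:] != 0, axis=1))
print("fraction of paths hitting 1 after n=1000:", hit)
```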

We can also construct a sequence that converges almost surely but does not converge in mean: consider the sequence \(X_n\) that is i.i.d. with \(\Pr[X_n = 2] = \frac{1}{2}\) and \(\Pr[X_n = 0] = \frac{1}{2}\). Now, let \(P_n = \prod_{k=1}^n X_k\). By construction, almost all \(P_n\) will eventually be 0:

\[\Pr[P_n = 0] = \Pr[\exists k \le n : X_k = 0] = 1 - 2^{-n} \to 1 \]

But if we consider the pmf of \(P_n\) separately, we see that \(\Pr[P_n = 2^n] = 2^{-n}\) and \(\Pr[P_n = 0] = 1 - 2^{-n}\). Then \(\mathbb{E}[|P_n - 0|^r] = 2^{nr} \cdot 2^{-n} = 2^{n(r-1)}\), which does not converge to 0 for \(r \ge 1\).
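One more simulation sketch (hypothetical code): nearly every path of \(P_n\) is stuck at \(0\) after only a few steps, yet the sample mean stays near \(1\), because the mean is carried entirely by the vanishingly rare all-2s paths. (We stop at \(n = 10\) so that those rare paths still show up in the sample.)

```python
import numpy as np

rng = np.random.default_rng(3)

# X_k i.i.d. with Pr[X_k = 2] = Pr[X_k = 0] = 1/2; P_n = X_1 * ... * X_n.
n_paths, n_steps = 100_000, 10
X = 2 * rng.integers(0, 2, size=(n_paths, n_steps))
P = np.cumprod(X, axis=1)

# Almost sure convergence to 0: once any X_k = 0, the path is 0 forever.
# The true probability Pr[P_10 = 0] is 1 - 2^-10, about 0.999.
print("fraction of paths with P_10 = 0:", np.mean(P[:, -1] == 0))

# No convergence in mean: E[P_n] = 2^n * 2^-n = 1 for every n, so the
# sample mean hovers around 1 despite almost every path being 0.
print("sample mean of P_10:", P[:, -1].mean())
```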

Questions that I still have: