Amateur Hour

Three Derivations of the Gaussian Distribution

Or riffing on E. T. Jaynes
Oct 21, 2019. Filed under Technical
Tags: statistics, math

This discussion is adapted from E. T. Jaynes’s Probability Theory: The Logic of Science.

Preliminaries

The Gaussian (or normal) distribution is a special distribution parametrized by mean \(\mu\) and variance \(\sigma^2\), with pdf:

\[p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(x - \mu)^2} \]

The key fact about the Gaussian distribution (and the reason for its ubiquity) is that its pdf is the exponential of a quadratic function – any pdf proportional to \(e^{-ax^2 + bx + c}\) (with \(a > 0\)) is a Gaussian distribution.
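
To see why, complete the square in the exponent (assuming \(a > 0\), which is needed for the density to be normalizable):

\[-ax^2 + bx + c = -a\Big(x - \frac{b}{2a}\Big)^2 + \Big(c + \frac{b^2}{4a}\Big) \]

so the density is proportional to \(e^{-a(x - \frac{b}{2a})^2}\), i.e. a Gaussian with \(\mu = \frac{b}{2a}\) and \(\sigma^2 = \frac{1}{2a}\).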

Where does the normalizing constant come from? The derivation is easiest to show for a standard Gaussian distribution with \(\mu=0, \sigma^2=1\), which has pdf:

\[p(x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2} \]

We can compute \(\int_{-\infty}^{\infty} e^{-\frac{1}{2}x^2} dx = \sqrt{2\pi}\) (hence the normalizing constant \(\frac{1}{\sqrt{2\pi}}\)) by a clever change to polar coordinates:

\[\begin{aligned} \int_{-\infty}^{\infty} e^{-\frac{1}{2}x^2} dx &= \sqrt{\big[\int_{-\infty}^{\infty} e^{-\frac{1}{2}x^2} dx \big]^2} \\ &= \sqrt{\int_{-\infty}^{\infty} e^{-\frac{1}{2}x^2} dx \int_{-\infty}^{\infty} e^{-\frac{1}{2}y^2} dy } \\ &= \sqrt{\iint_{\mathbb{R}^2} e^{-\frac{1}{2}(x^2 + y^2)} dx \, dy} \\ &= \sqrt{\iint_{\mathbb{R}^2} e^{-\frac{1}{2}r^2} r \, dr \, d\theta} \\ &= \sqrt{2\pi \int_{0}^\infty r e^{-\frac{1}{2}r^2} dr} = \sqrt{2\pi} \end{aligned} \]
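
As a quick sanity check (not part of the derivation), we can confirm the value of this integral numerically; the sketch below assumes numpy and scipy are available:

```python
import numpy as np
from scipy import integrate

# Numerically integrate e^{-x^2/2} over the real line and compare to sqrt(2*pi).
value, abs_err = integrate.quad(lambda x: np.exp(-0.5 * x**2), -np.inf, np.inf)
print(value, np.sqrt(2 * np.pi))  # both print ~2.5066
```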

Herschel-Maxwell Derivation

Suppose we are drawing points \((x, y)\) from a 2-dimensional distribution with pdf \(p(x, y)\) centered on \((0, 0)\). We will make three assumptions:

  1. That the probability density at a point depends only on its radial distance from the origin, and not on the angle. In other words, \(p(x, y) = f(\sqrt{x^2 + y^2})\) for some function \(f\).
  2. That \(x\) and \(y\) coordinates are independent of each other. In other words, \(p(x, y) = g(x)g(y)\) for some function \(g\).
  3. That \(p(x, y)\) is a smooth function.

These three assumptions are enough to show that both \(x\) and \(y\) are drawn from a normal distribution! This is a really nice derivation of the Gaussian distribution that uses very few assumptions. Most of these assumptions are about the geometry of space rather than probability, indicating that the Gaussian distribution arises quite naturally given the structure of our reality.

By setting \(y=0\), we can immediately see that \(p(x, 0) = g(x)g(0) = f(|x|)\). Since \(f(|x|)\) is an even function of \(x\), so is \(g\) (assuming \(g(0) \ne 0\)), and in particular \(g(x) = g(\sqrt{x^2})\). This implies that:

\[\begin{aligned} g(x)g(y) &= p(x, y) = f(\sqrt{x^2 + y^2}) = g(\sqrt{x^2 + y^2})g(0) \\ \frac{g(x)}{g(0)}\frac{g(y)}{g(0)} &= \frac{g(\sqrt{x^2 + y^2})}{g(0)} \\ \frac{g(\sqrt{x^2})}{g(0)}\frac{g(\sqrt{y^2})}{g(0)} &= \frac{g(\sqrt{x^2 + y^2})}{g(0)} \\ h(x^2)h(y^2) &= h(x^2 + y^2) \end{aligned} \]

where we define \(h(a) = \frac{g(\sqrt{a})}{g(0)}\). In other words, we have found that \(h(a)h(b) = h(a + b)\). This should immediately remind you of an exponential function \(h(a) = c^a\) for some base \(c\) (because exponentiation converts addition into multiplication). We can also show this formally: for any positive integer \(q \in \mathbb{Z}^+\), repeatedly applying the identity gives \(h(q) = \prod_{k=1}^q h(1) = h(1)^q\). Combining this with \(h(1) = h(\frac{q}{q}) = h(\frac{1}{q})^q\) gives \(h(\frac{1}{q}) = h(1)^{\frac{1}{q}}\). To handle negative numbers, note that \(h(0) = \frac{g(\sqrt{0})}{g(0)} = 1\), so for any integer \(p \in \mathbb{Z}\), \(1 = h(0) = h(p - p) = h(p)h(-p) = h(1)^p h(-p)\), and thus \(h(-p) = h(p)^{-1} = h(1)^{-p}\). Putting this together, for every rational number \(\frac{p}{q} \in \mathbb{Q}\) we have \(h(\frac{p}{q}) = h(\frac{1}{q})^p = h(1)^\frac{p}{q}\). Continuity of \(h\) extends this to the irrational numbers, so \(h(a) = h(1)^a = c^a\) for the constant \(c = h(1)\).

Thus, \(\frac{g(\sqrt{a})}{g(0)} = h(a) = c^a\), which implies that \(g(x) = g(0) c^{x^2} = g(0) e^{x^2 \ln c}\). For this to be a normalizable probability density we need \(\ln c < 0\); writing \(\ln c = -\frac{1}{2\sigma^2}\), this is exactly a Gaussian density with mean \(\mu = 0\) and variance \(\sigma^2\). Thus, under our assumptions, both the \(x\) and \(y\) coordinates have a Gaussian distribution.
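
As a quick converse check (not part of the argument above), here is a small sketch, assuming numpy is available, verifying that the 2-D standard Gaussian density does satisfy both assumptions:

```python
import numpy as np

# Illustrative check that the standard 2-D Gaussian density satisfies
# both Herschel-Maxwell assumptions.
def p(x, y):
    return np.exp(-0.5 * (x**2 + y**2)) / (2 * np.pi)

def g(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

rng = np.random.default_rng(0)
xs, ys = rng.normal(size=100), rng.normal(size=100)

# Assumption 2: p(x, y) factorizes as g(x) * g(y).
assert np.allclose(p(xs, ys), g(xs) * g(ys))

# Assumption 1: p(x, y) depends only on the radius sqrt(x^2 + y^2).
thetas = rng.uniform(0, 2 * np.pi, size=100)
r = np.sqrt(xs**2 + ys**2)
assert np.allclose(p(xs, ys), p(r * np.cos(thetas), r * np.sin(thetas)))

print("both assumptions hold for the 2-D standard Gaussian")
```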

Gauss’s Derivation

We will now examine Gauss’s derivation of the normal distribution, which is famous enough that he got his name attached (hence, Gaussian distribution). This derivation uses slightly more probabilistic machinery, but shows a deep connection between the normal distribution and arithmetic means.

In Gauss’s derivation, we will draw \(n\) measurements \(x_1, x_2, \ldots, x_n \in \mathbb{R}\) of some quantity \(\mu \in \mathbb{R}\). Given our measurements, we want to infer the correct value of \(\mu\). We will show that, under some mild assumptions, the only way for the “correct” estimate of \(\mu\) to always be the arithmetic mean \(\frac{1}{n}\sum x_k\) is for our measurements to be drawn from a Gaussian distribution.

  1. We assume that all of our observations are i.i.d. from some distribution with pdf \(p(x; \mu)\) that is parametrized by \(\mu\). In other words, \(p(x_1, x_2, \ldots, x_n) = \prod_{k=1}^n p(x_k; \mu)\).
  2. One intuitive way to infer \(\mu\) is to find the estimator \(\hat{\mu}\) that maximizes our likelihood function \(\prod_{k=1}^{n} p(x_k; \hat{\mu})\) (i.e. the maximum likelihood estimator, or MLE). We will assume that this maximum likelihood estimator \(\hat{\mu}\) is unique and that, for every possible set of measurements, it equals the empirical mean \(\frac{1}{n} \sum_{k=1}^n x_k\).
  3. That the probability density of each measurement depends only on its distance from \(\mu\). In other words, our pdf is \(p(x; \mu) = f(|x - \mu|)\) for some smooth (differentiable) function \(f\).

Now, our likelihood function is maximized whenever its logarithm is maximized (because log is a strictly increasing function). From calculus, we know that an interior maximum can only occur where the derivative is equal to 0, which gives us

\[\begin{aligned} 0 &= \frac{d}{d\mu} \ln \prod_{k=1}^n p(x_k; \mu) = \frac{d}{d\mu} \sum_{k=1}^n \ln p(x_k; \mu) = \sum_{k=1}^n \frac{d}{d\mu} \ln p(x_k; \mu) \\ &= \sum_{k=1}^n \frac{d}{d\mu} \ln f(|x_k - \mu|) = -\sum_{k=1}^n g'(x_k - \mu) \end{aligned} \]

where we have defined \(g(u) = \ln f(|u|)\) (the overall minus sign from the chain rule does not matter, since we are setting the sum to zero). Now, by assumption, this sum vanishes at \(\hat{\mu} = \frac{1}{n} \sum_{k=1}^n x_k\). This must be true for all possible measurements, so in particular it holds for the two measurements \(x_1 = \mu, x_2 = -\mu\), whose mean is \(\hat{\mu} = 0\); this implies that \(g'(\mu) + g'(-\mu) = 0\), i.e. \(g'\) is an odd function. Our identity must also hold for \(x_1 = -n\mu\) with all other \(x_{k \ne 1} = 0\), whose mean is \(\hat{\mu} = -\mu\), so that the residuals \(x_k - \hat{\mu}\) are \(-(n-1)\mu\) once and \(\mu\) for the remaining \(n - 1\) measurements. That means that:

\[\begin{aligned} 0 &= g'(-(n-1)\mu) + (n-1)g'(\mu) = -g'((n-1)\mu) + (n-1)g'(\mu) \\ g'((n-1)\mu) &= (n-1)g'(\mu) \\ \end{aligned} \]

Repeating the rational-then-continuity argument from the Herschel-Maxwell derivation, this implies that \(g'\) is a linear function, and so \(g'(x - \mu) = a(x-\mu)\) for some constant \(a\). This implies that:

\[\begin{aligned} g(x - \mu) &= \frac{1}{2} a(x - \mu)^2 + K \\ \ln p(x; \mu) &= \frac{1}{2} a(x - \mu)^2 + K \\ p(x; \mu) &= Ce^{\frac{1}{2} a(x - \mu)^2} \end{aligned} \]

which, for \(a < 0\) (as required for the density to be normalizable), is exactly a Gaussian distribution with mean \(\mu\) and variance \(\sigma^2 = -\frac{1}{a}\).
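
This connection is easy to see numerically in the forward direction: if the measurements really are Gaussian, maximizing the likelihood over \(\mu\) recovers the sample mean. Here is a minimal sketch, assuming numpy and scipy, with an arbitrarily chosen true mean of 3 and variance of 4:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustration: for Gaussian measurements, the mu that maximizes the likelihood
# (equivalently, minimizes the negative log-likelihood) is the sample mean.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=1000)

def neg_log_likelihood(mu, sigma2=4.0):
    return 0.5 * np.sum((x - mu) ** 2) / sigma2 + 0.5 * len(x) * np.log(2 * np.pi * sigma2)

result = minimize_scalar(neg_log_likelihood)
print(result.x, x.mean())  # these agree up to optimizer tolerance
```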

The Central Limit Theorem

The last derivation is the most involved and comes out of the famed Central Limit Theorem of statistics. Suppose we have \(n\) observations \(x_1, x_2, \ldots, x_n\) that are i.i.d. from some distribution with mean \(\mu=0\) and variance \(\sigma^2 = 1\). We will show that the distribution of \(\frac{1}{\sqrt{n}}\sum x_k\) converges to the standard normal distribution as \(n \rightarrow \infty\).

The proof of this theorem will rely heavily on the characteristic function \(\varphi_X(t) = \mathbb{E} \big[ e^{itX} \big]\), which you might also recognize as the Fourier transform of the pdf.

In some texts, this proof is done via the moment-generating function \(M_X(t) = \mathbb{E} \big[ e^{tX} \big]\). The proof is similar but applies to fewer distributions because the moment-generating function does not exist for some distributions. Consider a distribution with pdf \(p(x) = \frac{3}{x^4}\) and support \([1, \infty)\). Then its moment generating function is:

\[M_X(t) = \mathbb{E}[e^{tX}] = \int_1^\infty \frac{3}{x^4} e^{tx} dx \]

which diverges for \(t > 0\) (the exponential grows faster than the inverse polynomial shrinks). This distribution still has finite mean and variance (though its third moment is infinite), so the central limit theorem still applies to it in a way that cannot be proven via moment-generating functions.
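
Concretely, its first two moments are finite while the third diverges:

\[\begin{aligned} \mathbb{E}[X] &= \int_1^\infty \frac{3}{x^3} dx = \frac{3}{2}, \qquad \mathbb{E}[X^2] = \int_1^\infty \frac{3}{x^2} dx = 3, \qquad \operatorname{Var}[X] = 3 - \frac{9}{4} = \frac{3}{4} \\ \mathbb{E}[X^3] &= \int_1^\infty \frac{3}{x} dx = \infty \end{aligned} \]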

The characteristic function, on the other hand, always exists, because \(|e^{itX}| = 1\) for all real \(t\) and \(X\) (by Euler’s formula). Thus, \(\int e^{itx} p(x) dx\) is always guaranteed to converge.

The key characteristics of characteristic functions that we’ll need for our proof are that for independent random variables \(X\) and \(Y\) and constant \(c \in \mathbb{R}\):

\[\begin{aligned} \varphi_{X + Y}(t) &= \mathbb{E}[e^{it(X + Y)}] = \mathbb{E}[e^{itX}e^{itY}] = \mathbb{E}[e^{itX}]\mathbb{E}[e^{itY}] = \varphi_X(t) \varphi_Y(t) \\ \varphi_{cX}(t) &= \mathbb{E}[e^{itcX}] = \varphi_{X}(ct) \\ \varphi_{X}(t) &= \mathbb{E}[e^{itX}] = \mathbb{E}\Big[\sum_{k=0}^\infty \frac{(it)^k}{k!} X^k\Big] = \sum_{k=0}^\infty \frac{(it)^k}{k!} \mathbb{E}[X^k] \\ \frac{d^n}{dt^n}\varphi_{X}(t)\Big|_{t=0} &= \mathbb{E}\Big[\frac{d^n}{dt^n} e^{itX}\Big]\Big|_{t=0} = i^n \mathbb{E}[X^n] \end{aligned} \]

(Note that the first fact is really the statement that the pdf of a sum of independent random variables is the convolution of their pdfs, and the Fourier transform turns convolution into multiplication; the second is the usual scaling property of the Fourier transform.)
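
These identities are also easy to check by simulation. Here is a minimal sketch (assuming numpy, with arbitrarily chosen distributions for \(X\) and \(Y\)) comparing the empirical characteristic function of \(X + Y\) against the product of the individual ones:

```python
import numpy as np

# Empirically check that phi_{X+Y}(t) = phi_X(t) * phi_Y(t) for independent X and Y,
# using the sample average of e^{itX} as an estimate of the characteristic function.
rng = np.random.default_rng(0)
x = rng.exponential(size=200_000) - 1.0   # arbitrary non-Gaussian choices
y = rng.uniform(-1, 1, size=200_000)

def empirical_char(samples, t):
    return np.mean(np.exp(1j * t * samples))

for t in [0.3, 1.0, 2.5]:
    print(t, empirical_char(x + y, t), empirical_char(x, t) * empirical_char(y, t))
```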

In particular, because we have specified that \(0 = \mu = \mathbb{E}[X]\) and \(1 = \sigma^2 = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = \mathbb{E}[X^2]\), we know that \(\varphi_{X}'(0) = 0\) and \(\varphi_{X}''(0) = -1\). We also know by definition that \(\varphi_{X}(0) = \mathbb{E}[e^0] = 1\).

Now, Taylor’s theorem tells us that we can approximate \(\varphi_X\) near \(0\) by a quadratic: \(\varphi_X(t) = \varphi_X (0) + \varphi_X'(0) t + \frac{1}{2} \varphi_X''(0) t^2 + o(t^2) = 1 - \frac{1}{2}t^2 + o(t^2)\), using little-o notation to indicate that the remainder shrinks to 0 faster than \(t^2\) does as \(t \rightarrow 0\).

Putting this together and defining \(Z_n = \frac{1}{\sqrt{n}}\sum_{k=1}^n X_k\), we have:

\[\begin{aligned} \varphi_{Z_n}(t) &= \prod_{k=1}^n \varphi_{X_k}(\frac{t}{\sqrt{n}}) = \varphi_X(\frac{t}{\sqrt{n}})^n = \big[ 1 - \frac{t^2}{2n} + o(\frac{t^2}{n}) \big]^n \\ \\ \lim_{n \rightarrow \infty} \varphi_{Z_n}(t) &= \lim_{n \rightarrow \infty} \big[1 - \frac{t^2}{2n} + o(\frac{t^2}{n}) \big]^n = \lim_{n \rightarrow \infty} (1 - \frac{t^2}{2n})^n = e^{-\frac{1}{2}t^2} \end{aligned} \]

Alternatively (depending on which version of Taylor’s theorem you learned), Taylor’s theorem tells us that there exists \(\xi\) between \(0\) and \(t\) such that \(\varphi_X(t) = \varphi_X(0) + \varphi_X'(0)t + \frac{1}{2} \varphi_X''(\xi) t^2 = 1 + \frac{1}{2} \varphi_X''(\xi)t^2 = 1 - \frac{1}{2}t^2 + \frac{\varphi_X''(\xi) + 1}{2} t^2\). This gives us

\[\begin{aligned} \varphi_{Z_n}(t) &= \varphi_X(\frac{t}{\sqrt{n}})^n = \big[ 1 - \frac{t^2}{2n} + \frac{\varphi_X''(\xi) + 1}{2}\frac{t^2}{n}\big]^n \\ \lim_{n \rightarrow \infty} \varphi_{Z_n}(t) &= \lim_{n \rightarrow \infty} \big[ 1 - \frac{t^2}{2n} + \frac{\varphi_X''(\xi) + 1}{2}\frac{t^2}{n}\big]^n \\ &= \lim_{n \rightarrow \infty} \big[ 1 - \frac{t^2}{2n} + \frac{\varphi_X''(0) + 1}{2}\frac{t^2}{n}\big]^n = \lim_{n \rightarrow \infty} \big[ 1 - \frac{t^2}{2n}\big]^n \\ &= e^{-\frac{1}{2}t^2} \end{aligned} \]

Here we are using the fact that \(\xi\) lies between \(0\) and \(\frac{t}{\sqrt{n}}\), so \(\xi \rightarrow 0\) as \(n \rightarrow \infty\) and, by continuity, \(\varphi_X''(\xi) \rightarrow \varphi_X''(0) = -1\).
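
As an aside (not needed for the proof), we can watch this convergence numerically by simulating \(Z_n\) for a decidedly non-Gaussian starting distribution. The sketch below (assuming numpy and scipy) uses a uniform distribution standardized to mean 0 and variance 1:

```python
import numpy as np
from scipy import stats

# Simulate Z_n = (1/sqrt(n)) * sum(X_k) for X ~ Uniform(-sqrt(3), sqrt(3)),
# which has mean 0 and variance 1, and compare its quantiles to N(0, 1).
rng = np.random.default_rng(0)
n, trials = 500, 20_000
x = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(trials, n))
z = x.sum(axis=1) / np.sqrt(n)

for q in [0.1, 0.25, 0.5, 0.75, 0.9]:
    print(q, np.quantile(z, q), stats.norm.ppf(q))  # empirical vs. standard normal quantiles
```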

Now, what distribution has characteristic function \(\varphi_Z(t) = e^{-\frac{1}{2}t^2}\)? It turns out this is the standard normal distribution!

\[\begin{aligned} \varphi_Z(t) &= \mathbb{E}[e^{itZ}] \\ &= \int_{-\infty}^{\infty} e^{itz} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}z^2} dz = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{itz -\frac{1}{2}z^2} dz \\ &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}\big[ (z - it)^2 + t^2 \big]} dz \\ &= \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}t^2} \int_{-\infty}^\infty e^{-\frac{1}{2} (z - it)^2} dz \\ &= e^{-\frac{1}{2}t^2} \end{aligned} \]

where I’m going to elide some complex analysis in the last step (you can roughly think of it as a change of variables \(\tilde{z} = z - it\) together with \(\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}\tilde{z}^2} d\tilde{z} = 1\), although that isn’t totally kosher with complex variables; if we had used the moment-generating function here, the corresponding change of variables would actually be legitimate).
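
If the elided complex analysis feels unsatisfying, a numerical check (a sketch assuming numpy and scipy) confirms the result by integrating the real and imaginary parts directly:

```python
import numpy as np
from scipy import integrate

# Numerically check that the characteristic function of the standard normal is exp(-t^2/2).
def pdf(z):
    return np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)

def char_fn(t):
    # E[e^{itZ}] = E[cos(tZ)] + i E[sin(tZ)]; the imaginary part vanishes by symmetry.
    real, _ = integrate.quad(lambda z: np.cos(t * z) * pdf(z), -np.inf, np.inf)
    imag, _ = integrate.quad(lambda z: np.sin(t * z) * pdf(z), -np.inf, np.inf)
    return real + 1j * imag

for t in [0.0, 0.5, 1.0, 2.0]:
    print(t, char_fn(t), np.exp(-0.5 * t**2))
```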

Now, this is actually another pretty cool result. Remembering that \(\varphi_Z(t)\) can be interpreted as the Fourier transform of the pdf, this shows that the Gaussian density is an eigenvector (or rather, an eigenfunction, since the vector space here is a space of functions rather than \(\mathbb{R}^n\)) of the Fourier transform!