Entropy and the Central Limit Theorem

NOTE: This page is still in early draft form, with missing sections.

What is “special” about normal distributions? The central limit theorem, which says that normal distributions arise as the result of many independent variables summed together. Hence why normal distributions (and their cousins, such as log-normal distributions) appear in statistics

Normal distributions are also, in some sense, the limiting behavior of heat flow, or of Brownian motion.

They are also equal to their own Fourier transform.

And I recently learned that they are entropy-maximizing.

Entropy as a concept in thermodynamics. Learned the name in high school chemistry. “Entropy is always increasing”.

Entropy as a concept in information theory. Entropy of a probability distribution. Shannon’s channel coding theorem. Statistical decision theory, modern ML, etc uses terms like cross-entropy.

The point of this page is to more clearly define entropy, and try to tie all of these viewpoints on normal distributions under the framework of information theory, more specifically entropy-maximization.

What is entropy?

Entropy measures how uncertain or disparate the outcomes of a random event are.

It has a very concrete formula. If a random event $X$ can take on $n$ different possible outcomes with probabilities $p_1, p_2, \ldots, p_n$ (whose sum equals $1$) then the entropy, $H(X)$, is negative the sum of each probability value times its logarithm.

$$H(X) = -\sum\limits_{i=1}^{n}p_i\log(p_i)$$

Entropy (unlike e.g. mean or standard deviation) doesn’t require that the event outcomes are structured in any way, such as being numerical values. The measure of “spread-out-ness” has nothing to do with such values, only that the outcomes are distinct. So in a sense entropy is one of the most “basic” properties of probability theory.

A simple example to ground ourselves is an event with just two possible outcomes, e.g. flipping an unfair coin which turns up heads with probability $p$ and tails with probability $1-p$. The entropy is symmetric with respect to interchanging $p$ and $1-p$ (which is the same as replacing heads for tails). When $p$ is close to $0$ or close to $1$, the entropy is almost zero, and the entropy is highest when $p=1/2$.

The logarithm here can be taken in any base. In coding and information theory it’s preferred to be log base 2 (as I’ll justify in the next section), and in statistical mechanics it’s preferred to be log base e (as I’ll justify in the section after that). Using log base 2 for now means that flipping a fair coin is an entropy 1 random event.

An interpretation of entropy in plain English is “the expected amount of information gained from one sample”, where unexpected outcomes have higher information content than lower information ones — more precisely, a probability-$p$ outcome has “information content: equal to $-\log(p)$.¹

Why logarithms? In what sense is $-\log(p)$ a sensible way to measure “information content”? That’s is a nice segue to Shannon’s theory of communication.

Entropy and information theory

There’s a story about a conversation between von Neumann and Shannon at the time that Shannon was establishing his seminal paper on information theory, A Mathematical Theory of Communication. Shannon had defined the quantity above, and considered calling it either “information” or “uncertainty”. Von Neumann suggested

you should call it ‘entropy’ and for two reasons; first, the function is already in use in thermodynamics under that name; second, and more importantly, most people don’t know what entropy really is, and if you use the word ‘entropy’ in an argument you will win every time.

I guess it’s a bit like using “AI” or “quantum” today. Today, entropy appears as a concept in areas ranging from thermodynamics to statistical decision theory, and I hope this article makes the unified nature between the two clearer.

Let’s suppose that you’re communicating pieces of information across a channel, and that information is drawn from

In the context of communication and coding, the entropy of a distribution is the expected number of bits required to “address” samples of the distribution.

Prefix-free strings.

Example 1: Messages composed of instructions U, D, L, and R, with respective probabilities 1/2, 1/4, 1/8, 1/8. Described in Grant Sanderson’s recent video. Encode with $0$, $10$, $110$, and $111$. Entropy is $1.75$, and this encoding achieves that value.

Example 2: Messages composed of instructions YES, NO, and MU, each with equal probability 1/3. The best possible seems to be $0$, $10$, and $11$, which has $\frac{5}{3} \approx 1.667$ per character, which is worse than $H(X) = \log_2(3) \approx 1.585$. However, if we encode *pairs* of these instructions, we can use $000, 001, 010, 011, 100, 101, 110, 1110, 1111$ which has $3*7/9 + 4*2/9 = 29/9 \approx 3.222$ per two characters, or $1.611$ per character. We can get better encoding rate by using triples of instructions, and in the limit we can get as close as we like to (but no better than) the entropy.

Aside: A brief perspective on modern LLMs

Grant Sanderson recently began a video series “Compression is Intelligence”, on the information theory origins of modern large language models. I recommend checking it out (write more here)

You can play the same game with frequency analysis on letters of the English language. However, letters of the English language are not independent, you can chunk into words.
There’s a technical challenge here to find the entropy of the English language (or human symbolic language more generally), and a good encoding for it. How do you do this when you don’t have the probability distribution? This leads to the next topic: *modeling* a probability distribution based on samples of it, which is where Fisher information is a key concept. For example, concepts like Fisher information tell us how many samples of a distribution you would need to be able to model it to a given level of accuracy.
What language models have done is to ingest the largest sample set available (all of the data on the internet, which has a certain representative bias) and attempt to compress it as cleanly as possible into a model which can encode these statistical correlations and make predictions.
While this is the underlying, base-level theory, the magic in LLMs and modern AI is in all of the new unexpected emergent behavior which arises atop it. T

Entropy in thermodynamics

Imagine that you have a fluid (such as a gas or a liquid) living in a collection of $n$ connected chambers, labeled $1$ through $n$. Let $p_1, p_2, \ldots, p_n$ be a sequence of probabilities summing to 1, and $p_i$ is the fraction of particles which are in chamber $i$. Suppose that there are $N$ total fluid particles. Then how many possible ways are there to arrange the individual fluid particles so that they are distributed in this way?

(example image: above the boxes are shaded according to probabilities, while below we see individual particles floating around, and N can be varied with a slider.)

Let’s unpack that question. The picture above can be viewed macroscopically or microscopically. The macroscopic state we observe is that each chamber $i$ has a fraction $p_i$ of the particles. The microscopic state labels the individual particles $1$ through $N$ (considering them to be distinct), and then contains the further information of which particles live in which chamber. There are many possible microscopic states which lead to the same macroscopic state, and I’m asking for a calculation of just how many there are. Let’s call this number $\Omega$.

(example applet with 3 chambers and 5 particles as 2, 2, 1: 30 possibilities.)

The total number of microstates can be written explicitly as a ratio of factorials $$\Omega = \frac{N!}{(Np_1)!(Np_2)!\ldots (Np_n)!}$$ Taking the natural logarithm of both sides gives $$\log \Omega = \log(N!) – \sum\limits_{i=1}^n \log((Np_i)!)$$ Stirling’s approximation tells us that the natural logarithm of a factorial grows in a controlled manner for large values: $\log(k!) \approx k\log(k) – k + \frac{1}{2}\log(2\pi k) + O(1/k)$. Applying this approximation to the sum given yields (after algebraic simplification)

$$\log \Omega \approx N\log(N) – N + \frac{1}{2}\log(2\pi N) – \sum\limits_{i=1}^n (Np_i\log(Np_i) – Np_i – \frac{1}{2}\log(2\pi Np_i))$$

After using the fact that $\sum\limits_{i=1}^n p_i=1$, many terms cancel here, leaving us with

$$\log \Omega \approx -N\sum\limits_{i=1}^n p_i\log(p_i) + O(\log(N))$$

In other words, the dominant term in the number of possible microstates is the entropy of the distribution, with the relation being of the form $\frac{\log \Omega}{N} = H(X)$ as $N \rightarrow \infty$. The quantity $\log \Omega$ has an information-theoretic interpretation, as the approximate number of bits required to specify a given microstate, given the macrostate.

If entropy always increases (paraphrase of the second law of thermodynamics), then it would seem that the particles will always equilibriate to an equal proportion in each chamber. But obviously that’s not always true, e.g. if the chambers have different sizes. That’s because this entropy calculation says nothing about the physics of the system — such as whether one chamber is preferred over another for some reason, or even the process of how particles transition from one chamber to another. The next section runs through a few examples of entropy-maximizing distributions under constraints, which can be thought of as analogous to adding simple physical laws to the picture above.

Three entropy-maximizing examples

These are drawn from some notes from Keith Conrad.

The physics

Motivate from gas molecules within a system of connected chambers. Maximum entropy occurs at the uniform distribution. Do calculation. What we’ve done so far motivates this, using smoothing: if two probabilities were unequal, you could increase the entropy by smoothing them to be equal.

Next, consider a system with an infinite numbers of chambers arranged vertically, so that particles need more energy to reach the higher chambers — a dis-incentivizing force which prevents the particles from spreading out uniformly among the chambers. The total energy in the system is directly related to the average energy $T$ per particle (the temperature). Supposing that total energy is held fixed, the number of particles in the chamber at every level E equilibriates over time to a quantity proportional to e^{-E/T} — called the Gibbs distribution. The mathematical version of this statement is: for a fixed constant among all probability distributions on the positive real numbers with mean $c$, the one with maximal entropy is $p(x)=\frac{1}{c}e^{-x/c}. So the Gibbs distribution is just the one which maximizes entropy. Depicted below is a simulation of Feller diffusion, which flows an arbitrary distribution towards this Gibbs distribution (locally simulating the jumping of energy levels).

Lastly, consider unconstrained flow in both directions. Shape tends towards normal distribution. If we consider continuously normalizing the variance — a “soft” version of the constrained example before — then it well and truly becomes a normal distribution. I don’t know a physical manifestation of this, but it’s widespread in statistics.

Learning distributions

TODO Write an intro to motivate what this section is about. Concepts come from statistical learning theory

Relative entropy as “distance”

Use to quantify how far off two distributions on the same set are from each other. Relative entropy $H(X || G) = H(G) – H(X)$, which implies $H(G)$ is maximal, this is by an explicit calculation.

Score functions and Fisher information

When our outcome space has a geometry, can define score function on it: the derivative of information (log-density) with respect to X-value. Visualize it as a vector field pointing uphill direction. Has expected value zero. Used for “optimal transport”?

Fisher information is the expected value of the squared score function, i.e. the score function’s variance. Statistical significance of this. Represents how “peaky” the function is.

Basic inequality involving Fisher information (Stam’s inequality, AKA Cramer-Rao bound). The fact that equality holds iff Gaussian already implies towards the CLT, because Gaussian is therefore the only stable distribution under self-convolution-and-scaling (all others will increase in entropy and decrease Fisher information under this operation).

Flow processes

Ornstein-Uhlenbeck flow and Feller diffusion. Increasing entropy, and derivative is Fisher information.

Gradient-flow along this distance towards the normal distribution can be interpreted as movement in the vector space of probability distributions (where probability distributions are functions on the sample space). Worth bringing up this visualization of probability distributions.

This is like heat flow ; in the positive-value case, it’s literally like energy level transitions in matter.

Proof of the Central Limit Theorem

A special property of the Gaussian — it’s the only distribution that’s fixed under self-convolution-and-scaling (all others increase entropy value). However, another distribution could in-theory

First core idea is a visualization of the meaning of Fisher information, as a vector field representing gradient of log-density (and being used for “optimal transport”). This is from two sections back

Second core idea is that entropy (relative to Gaussian) is an integral of Fisher information. Or equivalently, the derivative of the entropy as we add in Gaussian noise is equal to the Fisher information. The semigroup evolution commutes with self-convolution. This is from the first and third sections

Third core idea, which is the main new one, is that scaled Fisher information is subadditive between normalized iid sums — and that Fisher information tends to zero for increasing n. This is established by showing that the Fisher informations are decreasing as we go by powers of 2 (using formal properties of the score function; establishing this will come down to an interpretation of what the score function means, as FI is its mean-squared value) and thus they do converge — I don’t understand Barron’s argument yet, and perhaps it’s hard to make graphical.

Alternative formulation of the third core idea is in the paper of Naor. Characterizes the Fisher information of a PDF on R which is obtained by marginalizing an n-dimensional one. We then interpret the normalized sum of n copies of X as an n-dimensional density, appropriately marginalized along (1, 1, …, 1).

Essentially, the third step uses the “fattening up” of the n-fold convolutions by small Gaussians, and then careful manipulation of inequalities to get the fact that the Fisher information (relative to Gaussian) converges to zero.

Below: visualization of such