Before an election, a polling firm surveys 1,500 people and predicts the national vote share within a few percentage points. Out of millions of voters, how can 1,500 responses tell you anything reliable?
The answer comes from two theorems that are arguably the most important results in all of probability and statistics.
The law of large numbers (LLN) guarantees that the sample mean converges to the population mean. The central limit theorem (CLT) describes how — the sampling distribution of the mean is approximately Normal, regardless of the underlying distribution. Together, they justify the core logic of statistics: draw a sample, compute a statistic, and make an inference about the population.
The previous article showed how to update probabilities when new information arrives. This article addresses the deeper question: why can a sample tell you anything about a population at all?
This article covers:
Four modes of convergence and their relationships
Weak and strong laws of large numbers
The central limit theorem (Lindeberg–Lévy) and Berry–Esseen bound
When the CLT fails (heavy-tailed distributions)
The delta method for transformations of asymptotically Normal statistics
Convergence Concepts
Before stating the LLN and CLT, you need a precise vocabulary for “a sequence of random variables approaches a limit.” There are four standard modes, each with a different strength. Think of them as increasingly strict standards of proof for the claim “\(X_n\) gets close to \(X\).”
Four Modes of Convergence (Casella & Berger, 2002)
Let \(X_1, X_2, \ldots\) be a sequence of random variables and \(X\) a target.
1. Almost sure (a.s.): \(P\!\left(\lim_{n\to\infty} X_n = X\right) = 1\)
2. In probability: \(\lim_{n\to\infty} P(|X_n - X| > \varepsilon) = 0\) for every \(\varepsilon > 0\)
3. In \(L^r\) (mean): \(\lim_{n\to\infty} E[|X_n - X|^r] = 0\) for some \(r \ge 1\)
4. In distribution: \(\lim_{n\to\infty} F_{X_n}(x) = F_X(x)\) at every continuity point of \(F_X\)
The implications form a hierarchy:
Almost sure convergence \(\Rightarrow\) convergence in probability
\(L^r\) convergence \(\Rightarrow\) convergence in probability
Convergence in probability \(\Rightarrow\) convergence in distribution
The reverse implications do not hold in general. Almost sure and \(L^r\) convergence are independent of each other — neither implies the other without additional conditions. Figure 1 shows this hierarchy.
Figure 1: Hierarchy of convergence concepts
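Convergence in probability, the middle rung of this hierarchy, can be checked directly by simulation: fix a tolerance \(\varepsilon\) and watch \(P(|\bar{X}_n - \mu| > \varepsilon)\) shrink as \(n\) grows. A minimal sketch with Bernoulli sample means (the values of \(\varepsilon\) and \(p\) are illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=12345)
eps = 0.05          # tolerance in the definition of convergence in probability
p = 0.3             # Bernoulli(p): population mean is p
n_paths = 20_000    # independent sample-mean replications per n

for n in [10, 100, 1000, 10_000]:
    # sample mean of n Bernoulli(p) draws, one per replication
    xbar = rng.binomial(n, p, size=n_paths) / n
    prob_far = np.mean(np.abs(xbar - p) > eps)
    print(f"n={n:6d}: P(|Xbar_n - p| > {eps}) ~= {prob_far:.4f}")
```

The estimated probability drops toward zero, which is exactly the definition of \(\bar{X}_n \xrightarrow{P} p\).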
Law of Large Numbers
Let \(X_1, X_2, \ldots\) be i.i.d. random variables with \(E[X_i] = \mu\) and (for the weak law) \(\text{Var}(X_i) = \sigma^2 < \infty\). Define the sample mean \(\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i\).
Weak Law (WLLN)
The weak law of large numbers states that \(\bar{X}_n \xrightarrow{P} \mu\).
A clean proof uses Chebyshev's inequality (from the previous article):
\[P(|\bar{X}_n - \mu| \ge \varepsilon) \le \frac{\text{Var}(\bar{X}_n)}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2} \to 0\]
Strong Law (SLLN)
The strong law strengthens this to almost sure convergence: \(\bar{X}_n \xrightarrow{a.s.} \mu\). This requires only \(E[|X_i|] < \infty\) — no finite variance assumption (Casella and Berger 2002, Ch. 5).
The practical difference: the weak law says the sample mean is probably close to \(\mu\) for large \(n\). The strong law says it is certainly close for large \(n\) (with probability 1). For most applications, the distinction is academic. But the strong law is what justifies Monte Carlo simulation: it guarantees that your simulation average converges to the true value, not just that it is likely to be close.
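The Chebyshev step in the WLLN proof also yields a crude but fully rigorous sample-size rule: to guarantee \(P(|\bar{X}_n - \mu| \ge \varepsilon) \le \delta\), it suffices that \(n \ge \sigma^2/(\varepsilon^2 \delta)\). A minimal sketch (the numeric values are illustrative):

```python
import math

def chebyshev_sample_size(sigma: float, eps: float, delta: float) -> int:
    """Smallest n with sigma^2 / (n * eps^2) <= delta."""
    return math.ceil(sigma**2 / (eps**2 * delta))

# Illustrative: sigma = 1, want the mean within 0.1 with probability 0.95
n = chebyshev_sample_size(sigma=1.0, eps=0.1, delta=0.05)
print(n)  # 2000
```

The bound is deliberately conservative: for the same guarantee the CLT-based calculation \(n \approx (1.96\,\sigma/\varepsilon)^2 \approx 385\) suffices, but it is only approximate, while Chebyshev holds exactly for any finite-variance distribution.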
Simulation
Figure 2 shows 10 sample-mean paths for four distributions. In every case, the paths converge to \(\mu\) as \(n\) grows, regardless of the shape of the original distribution.
Figure 2: LLN: sample-mean convergence for Normal, Exponential, Uniform, and Poisson
Code
import numpy as np

rng = np.random.default_rng(seed=12345)

# Exponential(1): mu = 1
samples = rng.exponential(1.0, size=100_000)
running_mean = np.cumsum(samples) / np.arange(1, 100_001)

print(f"n=100: mean = {running_mean[99]:.4f}")
print(f"n=10000: mean = {running_mean[9999]:.4f}")
print(f"n=100000: mean = {running_mean[99999]:.4f}")
print(f"True mu = 1.0")
n=100: mean = 0.9452
n=10000: mean = 0.9971
n=100000: mean = 0.9977
True mu = 1.0
Central Limit Theorem
The LLN tells you that \(\bar{X}_n \to \mu\). The CLT tells you how fast and in what shape.
Theorem (Lindeberg–Lévy CLT; Casella & Berger, 2002)
Let \(X_1, X_2, \ldots\) be i.i.d. with \(E[X_i] = \mu\) and \(0 < \text{Var}(X_i) = \sigma^2 < \infty\). Then:
\[\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} N(0, 1)\]
The remarkable fact: the CLT does not care about the shape of the original distribution. Whether \(X_i\) is Uniform, Exponential, Bernoulli, or any other distribution with finite variance, the standardized sample mean converges to a standard Normal. This is why the Normal distribution appears everywhere in statistics — it is the universal attractor for averages.
Berry–Esseen Bound
The CLT is an asymptotic result — it says nothing about how large \(n\) must be. The Berry–Esseen theorem (Berry 1941; Esseen 1942) quantifies the approximation error:
\[\sup_x |F_{\bar{Z}_n}(x) - \Phi(x)| \le \frac{C \cdot E[|X_1 - \mu|^3]}{\sigma^3 \sqrt{n}}\]
where \(C \le 0.4748\). This gives an \(O(1/\sqrt{n})\) rate, meaning halving the approximation error requires quadrupling the sample size.
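For the Exponential(1) distribution used in the simulation below, the bound can be evaluated directly: \(\mu = \sigma = 1\) and \(E[|X - \mu|^3] = 12/e - 2 \approx 2.415\). A minimal sketch, using the constant \(C = 0.4748\) from above (quadrature is just one way to get the third absolute central moment):

```python
import math
from scipy.integrate import quad

C = 0.4748  # Berry-Esseen constant (best published upper bound)

# Exponential(1): mu = sigma = 1; third absolute central moment by quadrature
rho, _ = quad(lambda x: abs(x - 1.0)**3 * math.exp(-x), 0, math.inf)
print(f"E|X - mu|^3 = {rho:.4f}")  # 12/e - 2 ~= 2.4146

for n in [30, 100, 1000]:
    bound = C * rho / math.sqrt(n)  # sigma = 1
    print(f"n={n:5d}: sup-norm CDF error <= {bound:.4f}")
```

At \(n = 30\) the guaranteed error is still about 0.21 in the worst case, which is why visual closeness in Figure 3 is reassuring but not a proof of accuracy.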
Simulation
Figure 3 shows standardized sample means for three non-Normal distributions at \(n = 1, 5, 30, 100\). By \(n = 30\), all three are close to the \(N(0,1)\) curve. By \(n = 100\), the match is nearly exact.
Figure 3: CLT: standardised sample means converge to \(N(0,1)\)
Central Limit Theorem animation
Code
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=12345)

# Exponential(1): skewed, mu=1, sigma=1
n = 30
z_scores = []
for _ in range(10_000):
    sample = rng.exponential(1.0, size=n)
    z = (sample.mean() - 1.0) / (1.0 / np.sqrt(n))
    z_scores.append(z)
z_scores = np.array(z_scores)

print(f"Exponential, n={n}:")
print(f" Mean of z-scores: {z_scores.mean():.4f} (should be ~0)")
print(f" Std of z-scores: {z_scores.std():.4f} (should be ~1)")
ks = stats.kstest(z_scores, 'norm')
print(f" KS test p-value: {ks.pvalue:.4f}")
Exponential, n=30:
Mean of z-scores: -0.0070 (should be ~0)
Std of z-scores: 0.9880 (should be ~1)
KS test p-value: 0.0001
With 10,000 replications the KS test is powerful enough to detect the small residual skew still present at \(n = 30\): the Normal approximation is close but not exact, consistent with the Berry–Esseen bound.
When the CLT Fails
The CLT requires finite variance. Distributions with infinite variance — or even infinite mean — violate this assumption. The Cauchy distribution is the canonical counterexample: it has no finite mean or variance.
For Cauchy-distributed \(X_i\), the sample mean \(\bar{X}_n\) has the same distribution as a single observation, regardless of \(n\). Averaging does not help. No matter how many observations you collect, the sample mean is just as noisy as a single draw.
Figure 4 shows the distribution of \(\bar{X}_n\) for \(n = 1, 10, 100, 1000\). Unlike the CLT distributions, these never approach the Normal shape — the heavy tails persist at every sample size.
Figure 4: CLT failure: Cauchy sample means do not converge to Normal
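The failure is easy to check numerically: because \(\bar{X}_n\) is again standard Cauchy, its interquartile range stays near 2 (the exact IQR of a standard Cauchy) at every \(n\), instead of shrinking like \(1/\sqrt{n}\). A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(seed=12345)
n_paths = 5_000  # independent sample-mean replications per n

for n in [1, 10, 100, 1000]:
    # sample mean of n standard Cauchy draws, one per replication
    xbar = rng.standard_cauchy((n_paths, n)).mean(axis=1)
    q25, q75 = np.percentile(xbar, [25, 75])
    print(f"n={n:4d}: IQR of sample mean = {q75 - q25:.3f}")  # stays near 2
```

The IQR is used instead of the standard deviation because the Cauchy has no finite variance, so sample standard deviations would themselves be unstable.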
Why this matters in practice. Financial return distributions often have heavier tails than the Normal. If the underlying data have infinite variance (or very large kurtosis), the CLT-based approximation can be dangerously inaccurate. Risk models that assume Normal tails can dramatically underestimate the probability of extreme losses. This is one reason regulators require stress testing beyond VaR: the Normal approximation may not hold in the tails where it matters most.
Delta Method
The CLT applies to sample means. What if you need the distribution of a function of the sample mean? The delta method extends the CLT to smooth transformations.
Theorem (Delta Method; Casella & Berger, 2002)
If \(\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2)\) and \(g\) is differentiable at \(\mu\) with \(g'(\mu) \ne 0\), then:
\[\sqrt{n}(g(\bar{X}_n) - g(\mu)) \xrightarrow{d} N(0, \sigma^2 [g'(\mu)]^2)\]
Example: Variance-Stabilizing Transform
For \(X_i \sim \text{Poisson}(\lambda)\), \(\text{Var}(\bar{X}_n) = \lambda / n\) depends on \(\lambda\). Apply \(g(x) = \sqrt{x}\):
\[\sqrt{\bar{X}_n} \;\dot\sim\; N\!\left(\sqrt{\lambda},\; \frac{1}{4n}\right)\]
The variance \(1/(4n)\) no longer depends on \(\lambda\) — a variance-stabilizing transform.
Figure 5 compares the simulated distribution of \(\sqrt{\bar{X}_n}\) with the delta method prediction for \(\text{Poisson}(9)\) at various sample sizes. The approximation is accurate even at \(n = 10\).
Figure 5: Delta method: \(\sqrt{\bar{X}}\) for Poisson(9)
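The delta prediction for the spread of \(\sqrt{\bar{X}_n}\) can be checked by simulation. A short sketch for \(\text{Poisson}(9)\), comparing the observed standard deviation with the \(1/(2\sqrt{n})\) prediction:

```python
import numpy as np

rng = np.random.default_rng(seed=12345)
lam = 9.0

for n in [10, 30, 100, 500]:
    samples = rng.poisson(lam, (10_000, n))
    sqrt_xbar = np.sqrt(samples.mean(axis=1))
    pred_std = 1 / (2 * np.sqrt(n))  # delta method: sd of sqrt(Xbar_n)
    print(f"n={n:3d}: observed std = {sqrt_xbar.std():.4f}, "
          f"delta prediction = {pred_std:.4f}")
```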
Summary and Connections
This article covered the two theorems that connect samples to populations:
Convergence concepts: a.s., in probability, in \(L^r\), and in distribution form a hierarchy where each level is strictly weaker than the ones above it
LLN: the sample mean converges to the population mean — the weak law needs finite variance, the strong law only finite mean
CLT: the standardized sample mean converges to \(N(0,1)\), regardless of the original distribution’s shape — this is why the Normal distribution is ubiquitous
Berry–Esseen: the CLT approximation improves at rate \(O(1/\sqrt{n})\)
CLT failure: distributions without finite variance (e.g., Cauchy) break the CLT entirely — averaging does not help, which has serious implications for risk models with heavy-tailed data
Delta method: extends the CLT to smooth functions of sample means
Next: Stochastic Processes — when random variables are indexed by time, you get stochastic processes. These are the mathematical objects behind interest rate models, stock price dynamics, and default intensity processes.
Application preview: The CLT is the theoretical foundation of Monte Carlo simulation — every time you average simulation results and build a confidence interval, you are invoking the CLT. In stress testing, the Berry–Esseen bound tells you how many scenarios you need to trust the Normal approximation. The delta method appears whenever you compute confidence intervals for risk metrics that are nonlinear functions of estimated parameters.
References
Berry, Andrew C. 1941. “The Accuracy of the Gaussian Approximation to the Sum of Independent Variates.” Transactions of the American Mathematical Society 49 (1): 122–36. https://doi.org/10.1090/S0002-9947-1941-0003498-3.
Casella, George, and Roger L. Berger. 2002. Statistical Inference. 2nd ed. Cengage Learning.
Esseen, Carl-Gustav. 1942. “On the Liapounoff Limit of Error in the Theory of Probability.” Arkiv För Matematik, Astronomi Och Fysik 28A (9): 1–19.