Before an election, a polling firm surveys 1,500 people and predicts the national vote share within a few percentage points. Out of millions of voters, how can 1,500 responses tell you anything reliable?
The answer comes from two theorems that are arguably the most important results in all of probability and statistics.
The law of large numbers (LLN) guarantees that the sample mean converges to the population mean. The central limit theorem (CLT) describes how — the sampling distribution of the mean is approximately Normal, regardless of the underlying distribution. Together, they justify the core logic of statistics: draw a sample, compute a statistic, and make an inference about the population.
The previous article showed how to update probabilities when new information arrives. This article addresses the deeper question: why can a sample tell you anything about a population at all?
This article covers:
Four modes of convergence and their relationships
Weak and strong laws of large numbers
The central limit theorem (Lindeberg–Lévy) and Berry–Esseen bound
When the CLT fails (heavy-tailed distributions)
The delta method for transformations of asymptotically Normal statistics
Convergence Concepts
Before stating the LLN and CLT, you need a precise vocabulary for “a sequence of random variables approaches a limit.” There are four standard modes, each with a different strength. Think of them as increasingly strict standards of proof for the claim “\(X_n\) gets close to \(X\).”
Four Modes of Convergence (Casella & Berger, 2002)
Let \(X_1, X_2, \ldots\) be a sequence of random variables and \(X\) a target.
1. Almost sure (a.s.): \(P\!\left(\lim_{n\to\infty} X_n = X\right) = 1\)
2. In probability: \(\lim_{n\to\infty} P(|X_n - X| > \varepsilon) = 0\) for every \(\varepsilon > 0\)
3. In \(L^r\) (mean): \(\lim_{n\to\infty} E[|X_n - X|^r] = 0\) for some \(r \ge 1\)
4. In distribution: \(\lim_{n\to\infty} F_{X_n}(x) = F_X(x)\) at every continuity point of \(F_X\)
The implications form a hierarchy:
Almost sure convergence \(\Rightarrow\) convergence in probability
\(L^r\) convergence \(\Rightarrow\) convergence in probability
Convergence in probability \(\Rightarrow\) convergence in distribution
The reverse implications do not hold in general. Almost sure and \(L^r\) convergence are independent of each other — neither implies the other without additional conditions. Figure 1 shows this hierarchy.
Figure 1: Hierarchy of convergence concepts
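Convergence in probability, the middle rung of this hierarchy, can be checked directly by simulation: fix a tolerance \(\varepsilon\) and watch \(P(|\bar{X}_n - \mu| > \varepsilon)\) shrink as \(n\) grows. A minimal sketch with Bernoulli sample means (the values of \(\varepsilon\) and \(p\) are illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=12345)
eps = 0.05          # tolerance in the definition of convergence in probability
p = 0.3             # Bernoulli(p): population mean is p
n_paths = 20_000    # independent sample-mean replications per n

for n in [10, 100, 1000, 10_000]:
    # sample mean of n Bernoulli(p) draws, one per replication
    xbar = rng.binomial(n, p, size=n_paths) / n
    prob_far = np.mean(np.abs(xbar - p) > eps)
    print(f"n={n:6d}: P(|Xbar_n - p| > {eps}) ~= {prob_far:.4f}")
```

The estimated probability drops toward zero, which is exactly the definition of \(\bar{X}_n \xrightarrow{P} p\).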
Law of Large Numbers
Let \(X_1, X_2, \ldots\) be i.i.d. random variables with \(E[X_i] = \mu\) and (for the weak law) \(\text{Var}(X_i) = \sigma^2 < \infty\). Define the sample mean \(\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i\).
Weak Law (WLLN)
The weak law of large numbers states that \(\bar{X}_n \xrightarrow{P} \mu\).
A clean proof uses Chebyshev's inequality (from the previous article):
\[P(|\bar{X}_n - \mu| \ge \varepsilon) \le \frac{\text{Var}(\bar{X}_n)}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2} \to 0\]
Strong Law (SLLN)
The strong law strengthens this to almost sure convergence: \(\bar{X}_n \xrightarrow{a.s.} \mu\). This requires only \(E[|X_i|] < \infty\) — no finite variance assumption (Casella and Berger 2002, Ch. 5).
The practical difference: the weak law says the sample mean is probably close to \(\mu\) for large \(n\). The strong law says it is certainly close for large \(n\) (with probability 1). For most applications, the distinction is academic. But the strong law is what justifies Monte Carlo simulation: it guarantees that your simulation average converges to the true value, not just that it is likely to be close.
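The Chebyshev step in the WLLN proof also yields a crude but fully rigorous sample-size rule: to guarantee \(P(|\bar{X}_n - \mu| \ge \varepsilon) \le \delta\), it suffices that \(n \ge \sigma^2/(\varepsilon^2 \delta)\). A minimal sketch (the numeric values are illustrative):

```python
import math

def chebyshev_sample_size(sigma: float, eps: float, delta: float) -> int:
    """Smallest n with sigma^2 / (n * eps^2) <= delta."""
    return math.ceil(sigma**2 / (eps**2 * delta))

# Illustrative: sigma = 1, want the mean within 0.1 with probability 0.95
n = chebyshev_sample_size(sigma=1.0, eps=0.1, delta=0.05)
print(n)  # 2000
```

The bound is deliberately conservative: for the same guarantee the CLT-based calculation \(n \approx (1.96\,\sigma/\varepsilon)^2 \approx 385\) suffices, but it is only approximate, while Chebyshev holds exactly for any finite-variance distribution.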
Simulation
Figure 2 shows 10 sample-mean paths for four distributions. In every case, the paths converge to \(\mu\) as \(n\) grows, regardless of the shape of the original distribution.
Figure 2: LLN: sample-mean convergence for Normal, Exponential, Uniform, and Poisson
Code
import numpy as np

rng = np.random.default_rng(seed=12345)

# Exponential(1): mu = 1
samples = rng.exponential(1.0, size=100_000)
running_mean = np.cumsum(samples) / np.arange(1, 100_001)

print(f"n=100: mean = {running_mean[99]:.4f}")
print(f"n=10000: mean = {running_mean[9999]:.4f}")
print(f"n=100000: mean = {running_mean[99999]:.4f}")
print(f"True mu = 1.0")
n=100: mean = 0.9452
n=10000: mean = 0.9971
n=100000: mean = 0.9977
True mu = 1.0
Central Limit Theorem
The LLN tells you that \(\bar{X}_n \to \mu\). The CLT tells you how fast and in what shape.
Theorem (Lindeberg–Lévy CLT; Casella & Berger, 2002)
Let \(X_1, X_2, \ldots\) be i.i.d. with \(E[X_i] = \mu\) and \(0 < \text{Var}(X_i) = \sigma^2 < \infty\). Then:
\[\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} N(0, 1)\]
The remarkable fact: the CLT does not care about the shape of the original distribution. Whether \(X_i\) is Uniform, Exponential, Bernoulli, or any other distribution with finite variance, the standardized sample mean converges to a standard Normal. This is why the Normal distribution appears everywhere in statistics — it is the universal attractor for averages.
Berry–Esseen Bound
The CLT is an asymptotic result — it says nothing about how large \(n\) must be. The Berry–Esseen theorem (Berry 1941; Esseen 1942) quantifies the approximation error:
\[\sup_x |F_{\bar{Z}_n}(x) - \Phi(x)| \le \frac{C \cdot E[|X_1 - \mu|^3]}{\sigma^3 \sqrt{n}}\]
where \(C \le 0.4748\). This gives an \(O(1/\sqrt{n})\) rate, meaning halving the approximation error requires quadrupling the sample size.
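For the Exponential(1) distribution used in the simulation below, the bound can be evaluated directly: \(\mu = \sigma = 1\) and \(E[|X - \mu|^3] = 12/e - 2 \approx 2.415\). A minimal sketch, using the constant \(C = 0.4748\) from above (quadrature is just one way to get the third absolute central moment):

```python
import math
from scipy.integrate import quad

C = 0.4748  # Berry-Esseen constant (best published upper bound)

# Exponential(1): mu = sigma = 1; third absolute central moment by quadrature
rho, _ = quad(lambda x: abs(x - 1.0)**3 * math.exp(-x), 0, math.inf)
print(f"E|X - mu|^3 = {rho:.4f}")  # 12/e - 2 ~= 2.4146

for n in [30, 100, 1000]:
    bound = C * rho / math.sqrt(n)  # sigma = 1
    print(f"n={n:5d}: sup-norm CDF error <= {bound:.4f}")
```

At \(n = 30\) the guaranteed error is still about 0.21 in the worst case, which is why visual closeness in Figure 3 is reassuring but not a proof of accuracy.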
Simulation
Figure 3 shows standardized sample means for three non-Normal distributions at \(n = 1, 5, 30, 100\). By \(n = 30\), all three are close to the \(N(0,1)\) curve. By \(n = 100\), the match is nearly exact.
Figure 3: CLT: standardised sample means converge to \(N(0,1)\)
Central Limit Theorem animation
Code
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=12345)

# Exponential(1): skewed, mu=1, sigma=1
n = 30
z_scores = []
for _ in range(10_000):
    sample = rng.exponential(1.0, size=n)
    z = (sample.mean() - 1.0) / (1.0 / np.sqrt(n))
    z_scores.append(z)
z_scores = np.array(z_scores)

print(f"Exponential, n={n}:")
print(f" Mean of z-scores: {z_scores.mean():.4f} (should be ~0)")
print(f" Std of z-scores: {z_scores.std():.4f} (should be ~1)")
ks = stats.kstest(z_scores, 'norm')
print(f" KS test p-value: {ks.pvalue:.4f}")
Exponential, n=30:
Mean of z-scores: -0.0070 (should be ~0)
Std of z-scores: 0.9880 (should be ~1)
KS test p-value: 0.0001
With 10,000 replications the KS test is powerful enough to detect the small residual skew still present at \(n = 30\): the Normal approximation is close but not exact, consistent with the Berry–Esseen bound.
When the CLT Fails
The CLT requires finite variance. Distributions with infinite variance — or even infinite mean — violate this assumption. The Cauchy distribution is the canonical counterexample: it has no finite mean or variance.
For Cauchy-distributed \(X_i\), the sample mean \(\bar{X}_n\) has the same distribution as a single observation, regardless of \(n\). Averaging does not help. No matter how many observations you collect, the sample mean is just as noisy as a single draw.
Figure 4 shows the distribution of \(\bar{X}_n\) for \(n = 1, 10, 100, 1000\). Unlike the CLT distributions, these never approach the Normal shape — the heavy tails persist at every sample size.
Figure 4: CLT failure: Cauchy sample means do not converge to Normal
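The failure is easy to check numerically: because \(\bar{X}_n\) is again standard Cauchy, its interquartile range stays near 2 (the exact IQR of a standard Cauchy) at every \(n\), instead of shrinking like \(1/\sqrt{n}\). A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(seed=12345)
n_paths = 5_000  # independent sample-mean replications per n

for n in [1, 10, 100, 1000]:
    # sample mean of n standard Cauchy draws, one per replication
    xbar = rng.standard_cauchy((n_paths, n)).mean(axis=1)
    q25, q75 = np.percentile(xbar, [25, 75])
    print(f"n={n:4d}: IQR of sample mean = {q75 - q25:.3f}")  # stays near 2
```

The IQR is used instead of the standard deviation because the Cauchy has no finite variance, so sample standard deviations would themselves be unstable.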
Why this matters in practice. Financial return distributions often have heavier tails than the Normal. If the underlying data have infinite variance (or very large kurtosis), the CLT-based approximation can be dangerously inaccurate. Risk models that assume Normal tails can dramatically underestimate the probability of extreme losses. This is one reason regulators require stress testing beyond VaR: the Normal approximation may not hold in the tails where it matters most.
Delta Method
The CLT applies to sample means. What if you need the distribution of a function of the sample mean? The delta method extends the CLT to smooth transformations.
Theorem (Delta Method; Casella & Berger, 2002)
If \(\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2)\) and \(g\) is differentiable at \(\mu\) with \(g'(\mu) \ne 0\), then:
\[\sqrt{n}(g(\bar{X}_n) - g(\mu)) \xrightarrow{d} N(0, \sigma^2 [g'(\mu)]^2)\]
Example: Variance-Stabilizing Transform
For \(X_i \sim \text{Poisson}(\lambda)\), \(\text{Var}(\bar{X}_n) = \lambda / n\) depends on \(\lambda\). Apply \(g(x) = \sqrt{x}\):
\[\sqrt{\bar{X}_n} \;\dot\sim\; N\!\left(\sqrt{\lambda},\; \frac{1}{4n}\right)\]
The variance \(1/(4n)\) no longer depends on \(\lambda\) — a variance-stabilizing transform.
Figure 5 compares the simulated distribution of \(\sqrt{\bar{X}_n}\) with the delta method prediction for \(\text{Poisson}(9)\) at various sample sizes. The approximation is accurate even at \(n = 10\).
Figure 5: Delta method: \(\sqrt{\bar{X}}\) for Poisson(9)
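The delta prediction for the spread of \(\sqrt{\bar{X}_n}\) can be checked by simulation. A short sketch for \(\text{Poisson}(9)\), comparing the observed standard deviation with the \(1/(2\sqrt{n})\) prediction:

```python
import numpy as np

rng = np.random.default_rng(seed=12345)
lam = 9.0

for n in [10, 30, 100, 500]:
    samples = rng.poisson(lam, (10_000, n))
    sqrt_xbar = np.sqrt(samples.mean(axis=1))
    pred_std = 1 / (2 * np.sqrt(n))  # delta method: sd of sqrt(Xbar_n)
    print(f"n={n:3d}: observed std = {sqrt_xbar.std():.4f}, "
          f"delta prediction = {pred_std:.4f}")
```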
Summary and Connections
This article covered the two theorems that connect samples to populations:
Convergence concepts: a.s., in probability, in \(L^r\), and in distribution form a hierarchy where each level is strictly weaker than the ones above it
LLN: the sample mean converges to the population mean — the weak law needs finite variance, the strong law only finite mean
CLT: the standardized sample mean converges to \(N(0,1)\), regardless of the original distribution’s shape — this is why the Normal distribution is ubiquitous
Berry–Esseen: the CLT approximation improves at rate \(O(1/\sqrt{n})\)
CLT failure: distributions without finite variance (e.g., Cauchy) break the CLT entirely — averaging does not help, which has serious implications for risk models with heavy-tailed data
Delta method: extends the CLT to smooth functions of sample means
Next: Stochastic Processes — when random variables are indexed by time, you get stochastic processes. These are the mathematical objects behind interest rate models, stock price dynamics, and default intensity processes.
Application preview: The CLT is the theoretical foundation of Monte Carlo simulation — every time you average simulation results and build a confidence interval, you are invoking the CLT. In stress testing, the Berry–Esseen bound tells you how many scenarios you need to trust the Normal approximation. The delta method appears whenever you compute confidence intervals for risk metrics that are nonlinear functions of estimated parameters.
References
Berry, Andrew C. 1941. “The Accuracy of the Gaussian Approximation to the Sum of Independent Variates.” Transactions of the American Mathematical Society 49 (1): 122–36. https://doi.org/10.1090/S0002-9947-1941-0003498-3.
Casella, George, and Roger L. Berger. 2002. Statistical Inference. 2nd ed. Cengage Learning.
Esseen, Carl-Gustav. 1942. “On the Liapounoff Limit of Error in the Theory of Probability.” Arkiv För Matematik, Astronomi Och Fysik 28A (9): 1–19.