The Axiomatic Foundation of Probability

Tags: probability, foundations

Author: Universe Office

Published: April 4, 2026

Introduction

Imagine rolling a standard six-sided die. Before it lands, six outcomes are possible: 1, 2, 3, 4, 5, or 6. You might ask: “What is the chance of rolling an even number?” To answer, you need three things — a list of all possible outcomes, a rule for which questions are allowed, and a way to assign numerical answers.

That three-part structure is exactly what a probability space formalizes. Every statistical model, every simulation, every risk calculation rests on it. This article builds the concept from the ground up: starting with the die, then abstracting to the general framework due to Kolmogorov (Kolmogorov 1933).

A probability space is like a complete rulebook for a game of chance: \(\Omega\) lists all possible outcomes, \(\mathcal{F}\) specifies which questions you are allowed to ask about the outcome, and \(P\) assigns a numerical answer to each question.

This article covers:

  • The three components of a probability space: sample space, sigma-algebra, and probability measure
  • Kolmogorov’s three axioms
  • Uniform and non-uniform probability assignments
  • Convergence of relative frequency to theoretical probability
  • The additivity axiom verified through simulation

From Dice to Definitions

A Concrete Starting Point

Roll a fair die once. The six faces give us \(\Omega = \{1, 2, 3, 4, 5, 6\}\). Now consider some questions you might ask:

  • “Did I roll a 4?” — this corresponds to the subset \(\{4\}\).
  • “Is the result even?” — this corresponds to \(\{2, 4, 6\}\).
  • “Did something happen?” — this corresponds to \(\Omega\) itself.

Each of these questions picks out a subset of \(\Omega\). We call such subsets events. To do probability, we need to organize these events and assign each one a number between 0 and 1.

This leads us to three ingredients: the set of outcomes, the collection of events, and the rule for assigning probabilities. Mathematicians package these into a single object.

The Probability Space

Definition (Probability Space)

A probability space is a triple \((\Omega, \mathcal{F}, P)\) where:

  • \(\Omega\) is the sample space — the set of all possible outcomes
  • \(\mathcal{F}\) is a \(\sigma\)-algebra (sigma-algebra) on \(\Omega\) — a collection of events (subsets of \(\Omega\)) to which we can assign probabilities
  • \(P\) is a probability measure — a function \(P : \mathcal{F} \to [0, 1]\) satisfying Kolmogorov’s axioms

For the die, \(\Omega = \{1, 2, 3, 4, 5, 6\}\), \(\mathcal{F} = 2^\Omega\) (all 64 subsets), and \(P\) assigns \(1/6\) to each face.
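The triple for the die can be written out explicitly in a few lines of Python. This is a sketch for illustration only; `power_set` and the uniform measure `P` are names we introduce here, not library functions.

```python
from itertools import chain, combinations

# Sample space for one roll of a die
omega = frozenset({1, 2, 3, 4, 5, 6})

# Event space F: the power set of omega (2^6 = 64 events)
def power_set(s):
    s = list(s)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(s, r) for r in range(len(s) + 1))]

F = power_set(omega)

# Uniform probability measure: each outcome carries weight 1/6,
# so an event's probability is its size divided by |omega|
def P(event):
    return len(event) / len(omega)

print(len(F))                   # 64 events
print(P(frozenset({2, 4, 6})))  # P(even) = 0.5
print(P(omega))                 # P(Omega) = 1.0
```

Representing events as `frozenset`s keeps them hashable, which is convenient once you start collecting them into larger structures.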

Sample Space \(\Omega\)

The sample space simply lists every outcome that could occur. Its contents depend on the problem, not on the mathematics. For a single die roll, \(\Omega = \{1, 2, 3, 4, 5, 6\}\). For a coin flip, \(\Omega = \{H, T\}\). For the lifetime of a loan, \(\Omega = [0, \infty)\).

Sigma-Algebra \(\mathcal{F}\)

Think of \(\mathcal{F}\) as the set of all questions you are allowed to ask. Not every subset of \(\Omega\) necessarily qualifies as an event. The collection \(\mathcal{F}\) must satisfy three properties — it must contain \(\Omega\) itself, be closed under complementation, and be closed under countable unions. Such a collection is called a \(\sigma\)-algebra.

For finite sample spaces like our die example, we can safely take \(\mathcal{F}\) as the power set (all subsets). But this distinction becomes critical in continuous probability spaces. For uncountable spaces like \(\mathbb{R}\), the Borel \(\sigma\)-algebra provides a rich enough collection without running into measure-theoretic paradoxes.
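For a finite space the three closure properties can be checked by brute force. The snippet below (our own verification sketch, with illustrative names) confirms that the power set of the die's sample space satisfies all three; for a finite collection, countable unions reduce to finite ones.

```python
from itertools import chain, combinations

omega = frozenset({1, 2, 3, 4, 5, 6})

# The power set of omega: every subset is an event
F = {frozenset(c) for c in chain.from_iterable(
    combinations(omega, r) for r in range(len(omega) + 1))}

# Property 1: contains the sample space itself
assert omega in F

# Property 2: closed under complementation
assert all(omega - A in F for A in F)

# Property 3: closed under unions (finite suffices here)
assert all(A | B in F for A in F for B in F)

print("The power set is a sigma-algebra on omega.")
```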

Probability Measure \(P\)

The probability measure assigns a number in \([0, 1]\) to each event in \(\mathcal{F}\). It answers the question “how likely?” for every allowed event. This assignment must satisfy three axioms.

Kolmogorov’s Axioms

The entire theory of probability rests on three axioms (Kolmogorov 1933; Casella and Berger 2002):

  1. Non-negativity: \(P(A) \ge 0\) for every event \(A \in \mathcal{F}\)
  2. Normalization: \(P(\Omega) = 1\)
  3. Countable additivity (\(\sigma\)-additivity): If \(A_1, A_2, \ldots\) are pairwise disjoint events, then \(P\!\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)\)

The first two axioms are intuitive: probabilities are non-negative, and something must happen. The third axiom is the deep one. It says that for any countable collection of mutually exclusive events, the probability of their union equals the sum of their individual probabilities.

Why countable additivity rather than just finite additivity? Finite additivity alone cannot guarantee the continuity properties needed for limits. Without countable additivity, you cannot prove that \(P(A_n) \to P(A)\) when \(A_n \uparrow A\), and foundational results like the law of large numbers break down.
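A small numeric illustration of that continuity property, using coin-flip events of our own choosing (not from the die example): let \(A_n\) be "the first head appears within the first \(n\) fair-coin flips". These events increase to \(A\) = "a head eventually appears", and \(P(A_n) = 1 - (1/2)^n\) converges to \(P(A) = 1\), exactly as continuity from below demands.

```python
# Continuity from below: A_n increases to A, and P(A_n) -> P(A).
# Here P(A_n) = 1 - (1/2)**n and P(A) = 1.
for n in [1, 2, 5, 10, 20, 50]:
    p_n = 1 - 0.5**n
    print(f"n = {n:>2}:  P(A_n) = {p_n:.12f}")
```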

Everything else — conditional probability, Bayes’ theorem, the law of large numbers — is a consequence of these three statements.

Simulation: Fair Die vs. Loaded Die

To see the axioms in action, consider two dice: a fair die where each face has probability \(1/6\), and a loaded die where \(P(6) = 1/3\) with the remaining probability split equally among faces 1–5 (each \(2/15\)).

Both assignments satisfy the axioms. The difference is the choice of probability measure, not the validity of the framework.

Code
import numpy as np

# Fair die
fair = np.ones(6) / 6
print("Fair die:")
print(f"  Probabilities: {fair}")
print(f"  Sum = {fair.sum():.1f}  (Axiom 2)")
print(f"  All >= 0: {(fair >= 0).all()}  (Axiom 1)")

# Loaded die
loaded = np.array([2/15, 2/15, 2/15, 2/15, 2/15, 1/3])
print("\nLoaded die:")
print(f"  Probabilities: {np.round(loaded, 4)}")
print(f"  Sum = {loaded.sum():.1f}  (Axiom 2)")
print(f"  All >= 0: {(loaded >= 0).all()}  (Axiom 1)")
Fair die:
  Probabilities: [0.16666667 0.16666667 0.16666667 0.16666667 0.16666667 0.16666667]
  Sum = 1.0  (Axiom 2)
  All >= 0: True  (Axiom 1)

Loaded die:
  Probabilities: [0.1333 0.1333 0.1333 0.1333 0.1333 0.3333]
  Sum = 1.0  (Axiom 2)
  All >= 0: True  (Axiom 1)

Figure 1 compares observed relative frequencies from 100,000 rolls against the theoretical probabilities for each die. The close agreement confirms that the simulation respects the specified probability measure.

Figure 1: Fair die and loaded die: observed vs. theoretical probabilities

For the loaded die, a chi-squared goodness-of-fit test yields \(\chi^2 = 10.55\) with \(p = 0.061\): the test does not reject the theoretical distribution at the 5% significance level, so the observed frequencies are consistent with it.
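A test along those lines can be run with `scipy.stats.chisquare`. The sketch below uses an illustrative seed, so its statistic and p-value will differ from the figures quoted above; the point is the procedure, not the exact numbers.

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(seed=42)  # illustrative seed
n = 100_000
loaded = np.array([2/15] * 5 + [1/3])

# Simulate rolls of the loaded die and tally each face
rolls = rng.choice(np.arange(1, 7), size=n, p=loaded)
observed = np.bincount(rolls, minlength=7)[1:]
expected = loaded * n

stat, p_value = chisquare(observed, expected)
print(f"chi2 = {stat:.2f}, p = {p_value:.3f}")
```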

Convergence of Relative Frequency

The axioms define probability as a mathematical object, not as a long-run frequency. Yet the two perspectives connect through the law of large numbers: as the number of trials grows, relative frequency converges to the theoretical probability.

Figure 2 tracks the relative frequency of rolling an even number on a fair die. The theoretical probability is \(P(\text{even}) = 1/2\). After 50,000 rolls, the relative frequency settles to within 0.004 of the target. The 95% confidence band \(\pm 1.96\sqrt{p(1-p)/n}\) shrinks as \(n\) grows, illustrating the \(O(1/\sqrt{n})\) convergence rate.

Figure 2: Convergence of relative frequency to \(P(\text{even}) = 0.5\)

[Animation: Monte Carlo convergence of relative frequency]

This convergence is not an axiom — it is a theorem that follows from the axioms. But it is what makes the axiomatic framework useful: the numbers you compute from the axioms match the frequencies you observe in practice.
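The running frequency behind Figure 2 can be reproduced in a few lines. This is a sketch with an illustrative seed, so the exact frequencies will differ from the figure, but the shrinking band is the same.

```python
import numpy as np

rng = np.random.default_rng(seed=2026)  # illustrative seed
n = 50_000
rolls = rng.integers(1, 7, size=n)

# Running relative frequency of "even" after each roll
is_even = (rolls % 2 == 0)
running_freq = np.cumsum(is_even) / np.arange(1, n + 1)

# 95% band half-width at p = 0.5: 1.96 * sqrt(p(1-p)/k)
for k in [100, 1_000, 10_000, 50_000]:
    half_width = 1.96 * np.sqrt(0.25 / k)
    print(f"n = {k:>6}: freq = {running_freq[k - 1]:.4f}, "
          f"band = ±{half_width:.4f}")
```

Note the \(O(1/\sqrt{n})\) pattern: each tenfold increase in \(n\) shrinks the band by a factor of roughly \(\sqrt{10} \approx 3.16\).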

Countable Additivity in Action

The third axiom is the workhorse of probability. For disjoint events, probabilities add:

\[ P(A \cup B) = P(A) + P(B) \qquad \text{when } A \cap B = \emptyset \]

This is the finite case of the axiom. The full countable version extends this to infinitely many disjoint events — a property essential for working with continuous distributions and infinite sequences.
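The countable case can also be seen numerically with an example of our own: for a fair coin, let \(A_k\) be "the first head appears on flip \(k\)". These events are pairwise disjoint with \(P(A_k) = (1/2)^k\), their union is "a head eventually appears", and countable additivity forces the probabilities to sum to 1.

```python
# Partial sums of P(A_k) = (1/2)**k over countably many
# disjoint events; by countable additivity they must total
# P(union) = 1 in the limit.
total = sum(0.5**k for k in range(1, 61))
print(f"sum of P(A_k) for k = 1..60: {total:.15f}")  # approaches 1
```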

To verify the finite case empirically, define two disjoint events on a fair die: \(A = \{1, 2\}\) and \(B = \{5, 6\}\). Theoretically, \(P(A) = P(B) = 2/6\) and \(P(A \cup B) = 4/6\).

Code
import numpy as np

rng = np.random.default_rng(seed=12345)
rolls = rng.integers(1, 7, size=100_000)

# Disjoint events: A = {1,2}, B = {5,6}
in_A = np.isin(rolls, [1, 2])
in_B = np.isin(rolls, [5, 6])

P_A = in_A.mean()
P_B = in_B.mean()
P_union = (in_A | in_B).mean()

print(f"P(A)         = {P_A:.4f}  (theoretical: {2/6:.4f})")
print(f"P(B)         = {P_B:.4f}  (theoretical: {2/6:.4f})")
print(f"P(A) + P(B)  = {P_A + P_B:.4f}")
print(f"P(A ∪ B)     = {P_union:.4f}  (theoretical: {4/6:.4f})")
print(f"Difference   = {abs((P_A + P_B) - P_union):.6f}")
P(A)         = 0.3332  (theoretical: 0.3333)
P(B)         = 0.3323  (theoretical: 0.3333)
P(A) + P(B)  = 0.6655
P(A ∪ B)     = 0.6655  (theoretical: 0.6667)
Difference   = 0.000000

Figure 3 visualizes this comparison. Because \(A\) and \(B\) are disjoint, the empirical counts add exactly, so \(P(A) + P(B)\) and \(P(A \cup B)\) agree to machine precision — the additivity axiom holds in the simulation by construction, just as it does in the theory.

Figure 3: Countable additivity: \(P(A \cup B) = P(A) + P(B)\) for disjoint events

Summary and Connections

This article built the probability space \((\Omega, \mathcal{F}, P)\) from the ground up. The key takeaways:

  • A probability space has three components: sample space, \(\sigma\)-algebra, and probability measure
  • Three axioms (non-negativity, normalization, countable additivity) are sufficient to build all of probability theory
  • The \(\sigma\)-algebra specifies which events can carry probabilities — a technicality for finite spaces, but essential for continuous ones
  • Countable additivity, not just finite additivity, is what makes limits and convergence theorems work
  • Different probability measures on the same sample space encode different models of the world
  • Relative frequency converges to theoretical probability (a consequence, not an axiom)

Next: Random Variables — once you have a probability space, you need a way to map outcomes to numbers. That is what random variables do.

Application preview: In credit risk, the sample space of a loan outcome might be \(\Omega = \{\text{default}, \text{no default}\}\). The probability measure assigns \(P(\text{default}) = PD\). Portfolio models extend this to \(n\) loans, where the \(\sigma\)-algebra encodes every possible combination of defaults. The axioms ensure that portfolio-level probabilities are consistent with individual-loan probabilities.
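That portfolio setup can be sketched in a few lines. The parameters here (`n_loans`, `PD`, the seed) are illustrative, and defaults are assumed independent — real portfolio models add default correlation on top of this.

```python
import numpy as np

rng = np.random.default_rng(seed=7)  # illustrative seed
n_loans, PD = 100, 0.02              # illustrative parameters

# Each simulated outcome is a point in Omega = {0,1}^n:
# a vector of default indicators for the n loans
defaults = rng.random((50_000, n_loans)) < PD
n_defaults = defaults.sum(axis=1)

# Event "at least one default" vs. the complement rule the
# axioms give us: P(>= 1 default) = 1 - (1 - PD)**n
empirical = (n_defaults >= 1).mean()
theoretical = 1 - (1 - PD)**n_loans
print(f"empirical:   {empirical:.4f}")
print(f"theoretical: {theoretical:.4f}")
```

The complement rule used here is itself a two-line consequence of the axioms: \(P(A) + P(A^c) = P(\Omega) = 1\) by additivity and normalization.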

References

Casella, George, and Roger L. Berger. 2002. Statistical Inference. 2nd ed. Cengage Learning.
Kolmogorov, Andrey N. 1933. Grundbegriffe Der Wahrscheinlichkeitsrechnung. Springer.