---
title: "Expectation, Variance, and Moments"
subtitle: "Summarizing Distributions with Numbers"
author: "Universe Office"
date: 2026-04-04
categories: [probability, foundations]
bibliography: references.bib
format:
  html:
    code-fold: true
    toc: true
---
## Introduction
Imagine you commute to work every day. Some days it takes 20 minutes, others 40. If someone asks "how long is your commute?", you do not list every trip --- you say something like "about 30 minutes, give or take 10." That "about 30" is the expectation. The "give or take 10" is the standard deviation. Without realizing it, you have summarized an entire distribution with two numbers.
The [previous article](../random-variables/index.qmd) introduced random variables and their distributions (PMFs, PDFs, CDFs). A full distribution contains every detail, but in practice you need concise numerical summaries: "Where is the center?" "How spread out is it?" "Is it symmetric?" "How heavy are the tails?"
This article develops the machinery for answering those questions --- expectation, variance, higher moments, and the relationships between pairs of random variables [@casella2002; @wasserman2004].
## Expectation
### What Expectation Means
The **expected value** of a random variable is its **long-run average**. If you could repeat an experiment infinitely many times and average the results, the number you would get is the expectation. It is the distribution's center of gravity --- the point where the probability mass balances.
::: {.callout-note}
## Definition (Expectation; Casella & Berger, 2002)
The **expectation** (or **expected value**, **mean**) of a random variable $X$ is:
- **Discrete**: $E[X] = \sum_x x \cdot p(x)$, where $p$ is the PMF
- **Continuous**: $E[X] = \int_{-\infty}^{\infty} x \cdot f(x)\,dx$, where $f$ is the PDF
provided the sum or integral converges absolutely.
:::
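To make the discrete formula concrete, here is a minimal sketch computing $E[X]$ directly from the definition for a fair six-sided die (a toy example assumed for illustration):

```{python}
#| label: expectation-die-check
import numpy as np

# PMF of a fair six-sided die: p(x) = 1/6 for x in {1, ..., 6}
x = np.arange(1, 7)
p = np.full(6, 1 / 6)

# E[X] = sum over x of x * p(x)
expectation = np.sum(x * p)
print(expectation)  # 3.5
```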
### Linearity
Expectation is **linear**: for any constants $a, b$ and random variable $X$,
$$
E[aX + b] = aE[X] + b
$$
More generally, for random variables $X_1, \ldots, X_n$ (not necessarily independent):
$$
E\left[\sum_{i=1}^n a_i X_i\right] = \sum_{i=1}^n a_i E[X_i]
$$
Linearity is the single most useful property of expectation. It holds regardless of dependence between the variables.
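A quick simulation sketch of this point, using $Y = X^2$ as a deliberately dependent example (an assumption for illustration):

```{python}
#| label: linearity-dependence-check
import numpy as np

rng = np.random.default_rng(seed=0)

# Y is a deterministic function of X, so they are strongly dependent,
# yet E[X + Y] = E[X] + E[Y] still holds
X = rng.normal(loc=1.0, scale=1.0, size=100_000)
Y = X**2

print(f"E[X + Y]    = {(X + Y).mean():.4f}")
print(f"E[X] + E[Y] = {X.mean() + Y.mean():.4f}")
```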
### LOTUS (Law of the Unconscious Statistician)
To compute $E[g(X)]$, you do **not** need the distribution of $Y = g(X)$. Instead:
$$
E[g(X)] = \begin{cases}
\sum_x g(x) \cdot p(x) & \text{(discrete)} \\
\int_{-\infty}^{\infty} g(x) \cdot f(x)\,dx & \text{(continuous)}
\end{cases}
$$
This result, known as **LOTUS**, is indispensable: it lets you compute expectations of transformed variables directly from the original distribution.
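As a sketch of LOTUS at work, take $X \sim \text{Uniform}(0, 1)$ and $g(x) = x^2$ (a toy example assumed here; the integral $\int_0^1 x^2\,dx$ equals $1/3$):

```{python}
#| label: lotus-check
import numpy as np
from scipy import integrate, stats

# LOTUS: integrate g(x) * f(x) directly, without ever deriving
# the distribution of Y = X^2
f = stats.uniform(loc=0, scale=1).pdf
lotus_value, _ = integrate.quad(lambda x: x**2 * f(x), 0, 1)
print(f"LOTUS integral   = {lotus_value:.4f}")

# Cross-check by Monte Carlo: transform the samples, then average
rng = np.random.default_rng(seed=0)
samples = rng.uniform(size=100_000)
print(f"Monte Carlo mean = {(samples**2).mean():.4f}")
```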
### Simulation
@fig-expectation demonstrates the law of large numbers in action. As the sample size grows, the running sample mean converges to the theoretical expectation for three distributions.
::: {#fig-expectation}
Running sample mean converging to the theoretical expectation as the sample size grows, for three distributions.
:::
```{python}
#| label: expectation-demo
import numpy as np
from scipy import stats
rng = np.random.default_rng(seed=12345)
# Verify E[X] for Binomial(n=20, p=0.3)
X = stats.binom(n=20, p=0.3)
samples = rng.binomial(n=20, p=0.3, size=100_000)
print(f"Binomial(20, 0.3):")
print(f" Theoretical E[X] = {X.mean():.4f}")
print(f" Simulated mean = {samples.mean():.4f}")
# Verify linearity: E[2X + 3] = 2*E[X] + 3
print(f"\nLinearity check: E[2X + 3]")
print(f" Theoretical = {2 * X.mean() + 3:.4f}")
print(f" Simulated = {(2 * samples + 3).mean():.4f}")
```
## Variance and Standard Deviation
### The Idea: Measuring Spread
Expectation tells you where the distribution is centered, but nothing about how spread out it is. Two distributions can have the same mean but very different shapes. **Variance** measures the average squared distance from the mean --- it quantifies how much a random variable typically deviates from its expected value.
::: {.callout-note}
## Definition (Variance; Casella & Berger, 2002)
The **variance** of $X$ is:
$$
\text{Var}(X) = E[(X - \mu)^2]
$$
where $\mu = E[X]$. The **standard deviation** is $\text{SD}(X) = \sqrt{\text{Var}(X)}$.
:::
A large variance means the distribution is spread out. A small variance means it is concentrated near the mean. The standard deviation has the same units as $X$, making it more interpretable than variance.
### Computational Formula
The following identity is often more convenient for calculation:
$$
\text{Var}(X) = E[X^2] - (E[X])^2
$$
*Proof*: $\text{Var}(X) = E[(X - \mu)^2] = E[X^2 - 2\mu X + \mu^2] = E[X^2] - 2\mu E[X] + \mu^2 = E[X^2] - 2\mu^2 + \mu^2 = E[X^2] - \mu^2$, using $E[X] = \mu$.
### Properties
For constants $a, b$:
$$
\text{Var}(aX + b) = a^2 \text{Var}(X)
$$
Adding a constant shifts the distribution but does not change its spread. Scaling by $a$ scales the variance by $a^2$.
If $X_1, \ldots, X_n$ are **independent**:
$$
\text{Var}\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \text{Var}(X_i)
$$
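A brief simulation sketch of both cases, using Normal variables as an assumed example (additivity holds under independence, but fails for the dependent sum $X + X$):

```{python}
#| label: variance-additivity-check
import numpy as np

rng = np.random.default_rng(seed=0)

# Independent X and Y: variances add (4 + 9 = 13)
X = rng.normal(0, 2, size=100_000)
Y = rng.normal(0, 3, size=100_000)
print(f"Var(X + Y) = {np.var(X + Y):.3f} (theory: 13)")

# Dependent case: Var(X + X) = Var(2X) = 4 * Var(X) = 16, not 4 + 4 = 8
print(f"Var(X + X) = {np.var(X + X):.3f} (theory: 16)")
```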
### Simulation
@fig-variance shows Normal distributions with different variances, making the effect of $\sigma^2$ visually clear.
::: {#fig-variance}
Normal densities with the same mean but different variances.
:::
```{python}
#| label: variance-demo
import numpy as np
from scipy import stats
rng = np.random.default_rng(seed=12345)
# Verify Var(X) = E[X^2] - (E[X])^2 for Exponential(lambda=2)
X = stats.expon(scale=0.5) # scale = 1/lambda
samples = rng.exponential(scale=0.5, size=100_000)
var_def = np.mean((samples - samples.mean())**2)
var_formula = np.mean(samples**2) - samples.mean()**2
print(f"Exponential(lambda=2):")
print(f" Theoretical Var = {X.var():.6f}")
print(f" Definition = {var_def:.6f}")
print(f" E[X^2]-(E[X])^2 = {var_formula:.6f}")
```
## Higher Moments and Moment Generating Function
### Moments
The **$k$-th moment** of $X$ is $E[X^k]$, and the **$k$-th central moment** is $E[(X - \mu)^k]$. The first moment is the mean; the second central moment is the variance.
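As a quick sketch, both kinds of moments can be estimated directly from samples; the example below assumes $X \sim \text{Exponential}(1)$, whose $k$-th raw moment is $k!$:

```{python}
#| label: moments-check
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
samples = rng.exponential(scale=1.0, size=200_000)

# Raw moments E[X^k] of Exponential(1) are k! = 1, 2, 6, ...
for k in [1, 2, 3]:
    print(f"E[X^{k}] ~= {np.mean(samples**k):.3f}")

# Second central moment is the variance (theory: 1)
print(f"2nd central moment ~= {stats.moment(samples, moment=2):.3f}")
```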
### Skewness and Kurtosis
::: {.callout-note}
## Definition (Skewness and Kurtosis; Casella & Berger, 2002)
The **skewness** measures asymmetry:
$$
\gamma_1 = E\left[\left(\frac{X - \mu}{\sigma}\right)^3\right]
$$
The **excess kurtosis** measures tail heaviness relative to the Normal distribution:
$$
\gamma_2 = E\left[\left(\frac{X - \mu}{\sigma}\right)^4\right] - 3
$$
:::
- $\gamma_1 > 0$: right-skewed (long right tail)
- $\gamma_1 < 0$: left-skewed (long left tail)
- $\gamma_2 > 0$: heavier tails than Normal (leptokurtic)
- $\gamma_2 < 0$: lighter tails than Normal (platykurtic)
**Why these matter in practice.** In risk management, skewness tells you whether losses are more likely to be extreme in one direction. Positive skewness in credit losses means occasional large defaults. Kurtosis measures how likely extreme events are. A distribution with high excess kurtosis produces more "black swan" events than a Normal distribution with the same mean and variance --- this is why regulatory capital models pay close attention to tail behavior.
@fig-skewness-kurtosis compares distributions with different skewness and kurtosis.
::: {#fig-skewness-kurtosis}
Distributions with different skewness and kurtosis.
:::
### Moment Generating Function
::: {.callout-note}
## Definition (MGF; Casella & Berger, 2002)
The **moment generating function** of $X$ is:
$$
M_X(t) = E[e^{tX}]
$$
provided this expectation exists for $t$ in a neighborhood of 0.
:::
The MGF is called "moment generating" because:
$$
E[X^k] = M_X^{(k)}(0) = \frac{d^k}{dt^k} M_X(t) \bigg|_{t=0}
$$
Key properties:
1. **Uniqueness**: If $M_X(t) = M_Y(t)$ for all $t$ in a neighborhood of 0, then $X$ and $Y$ have the same distribution
2. **Independence**: If $X$ and $Y$ are independent, then $M_{X+Y}(t) = M_X(t) \cdot M_Y(t)$
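To illustrate the moment-generating property numerically, the sketch below takes the known MGF of Exponential(1), $M(t) = 1/(1 - t)$, and differentiates it at $t = 0$ by finite differences (an illustration, not part of the derivations above):

```{python}
#| label: mgf-check
# MGF of Exponential(1): M(t) = 1 / (1 - t), defined for t < 1
def M(t):
    return 1.0 / (1.0 - t)

h = 1e-4
# Central differences approximate M'(0) and M''(0)
m1 = (M(h) - M(-h)) / (2 * h)          # E[X]   = 1
m2 = (M(h) - 2 * M(0) + M(-h)) / h**2  # E[X^2] = 2
print(f"E[X]   ~= {m1:.4f}")
print(f"E[X^2] ~= {m2:.4f}")
```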
```{python}
#| label: skewness-kurtosis-demo
from scipy import stats
distributions = {
"Normal(0,1)": stats.norm(0, 1),
"Exponential(1)": stats.expon(scale=1),
"Chi-squared(5)": stats.chi2(df=5),
"t(5)": stats.t(df=5),
}
print(f"{'Distribution':<20} {'Skewness':>10} {'Ex. Kurtosis':>14}")
print("-" * 46)
for name, dist in distributions.items():
print(f"{name:<20} {float(dist.stats(moments='s')):>10.4f} {float(dist.stats(moments='k')):>14.4f}")
```
## Covariance and Correlation
### Covariance
When you have two random variables, you often want to know whether they move together. **Covariance** captures the direction and strength of their linear relationship.
::: {.callout-note}
## Definition (Covariance; Casella & Berger, 2002)
The **covariance** of two random variables $X$ and $Y$ is:
$$
\text{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - E[X]E[Y]
$$
:::
- $\text{Cov}(X, Y) > 0$: $X$ and $Y$ tend to move in the same direction
- $\text{Cov}(X, Y) < 0$: $X$ and $Y$ tend to move in opposite directions
### Correlation
The **Pearson correlation coefficient** normalizes covariance to $[-1, 1]$:
$$
\rho(X, Y) = \frac{\text{Cov}(X, Y)}{\text{SD}(X) \cdot \text{SD}(Y)}
$$
### Independence and Uncorrelatedness
::: {.callout-important}
If $X$ and $Y$ are **independent**, then $\text{Cov}(X, Y) = 0$ (they are uncorrelated). The converse is **false** in general. Uncorrelated does not imply independent.
:::
A classic counterexample: let $X \sim N(0, 1)$ and $Y = X^2$. Then $\text{Cov}(X, Y) = E[X^3] = 0$ (by symmetry of the Normal distribution), but $X$ and $Y$ are clearly dependent --- $Y$ is a deterministic function of $X$.
@fig-correlation shows bivariate Normal distributions with different correlation coefficients.
::: {#fig-correlation}
Bivariate Normal samples with different correlation coefficients.
:::
```{python}
#| label: covariance-demo
import numpy as np
rng = np.random.default_rng(seed=12345)
# Bivariate Normal with rho = 0.7
rho = 0.7
mean = [0, 0]
cov_matrix = [[1, rho], [rho, 1]]
samples = rng.multivariate_normal(mean, cov_matrix, size=100_000)
print(f"Bivariate Normal (rho = {rho}):")
print(f" Theoretical Cov = {rho:.4f}")
print(f" Simulated Cov = {np.cov(samples.T)[0, 1]:.4f}")
print(f" Simulated Corr = {np.corrcoef(samples.T)[0, 1]:.4f}")
# Counterexample: X and X^2 are uncorrelated but dependent
X = rng.standard_normal(100_000)
Y = X**2
print(f"\nCounterexample: X ~ N(0,1), Y = X^2")
print(f" Corr(X, Y) = {np.corrcoef(X, Y)[0, 1]:.4f} (~ 0)")
print(f" Dependent? Yes (Y is a function of X)")
```
## Inequalities
These three inequalities provide bounds on probabilities and expectations without knowing the full distribution. They are the "safety nets" of probability theory.
### Markov's Inequality
For a non-negative random variable $X$ and $a > 0$:
$$
P(X \ge a) \le \frac{E[X]}{a}
$$
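A quick simulation sketch, assuming $X \sim \text{Exponential}(1)$ so that $E[X] = 1$ and the bound becomes $1/a$ (the bound always holds, though it is often loose):

```{python}
#| label: markov-check
import numpy as np

rng = np.random.default_rng(seed=0)
samples = rng.exponential(scale=1.0, size=100_000)

# Markov: P(X >= a) <= E[X] / a = 1 / a for Exponential(1)
for a in [1, 2, 5]:
    actual = np.mean(samples >= a)
    print(f"a = {a}: P(X >= a) = {actual:.4f} <= bound {1 / a:.4f}")
```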
### Chebyshev's Inequality
For any random variable $X$ with finite mean $\mu$ and variance $\sigma^2$, and $k > 0$:
$$
P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}
$$
This provides a distribution-free bound on the probability of deviating from the mean by more than $k$ standard deviations. No matter what the distribution looks like, at most 25% of the probability can lie beyond $2\sigma$ from the mean.
### Jensen's Inequality
If $g$ is a **convex** function:
$$
g(E[X]) \le E[g(X)]
$$
If $g$ is **concave**, the inequality reverses. This inequality has far-reaching consequences --- for example, it explains why $E[\log X] \le \log E[X]$ and why diversification reduces risk.
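As a minimal sketch of Jensen's inequality with the concave function $\log$, assume $X$ is Lognormal(0, 1), for which both sides are known exactly: $E[\log X] = 0$ while $\log E[X] = 1/2$:

```{python}
#| label: jensen-check
import numpy as np

rng = np.random.default_rng(seed=0)

# log X ~ N(0, 1), so E[log X] = 0 and log E[X] = log(e^{1/2}) = 0.5
X = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)
print(f"E[log X] = {np.log(X).mean():.4f}")
print(f"log E[X] = {np.log(X.mean()):.4f}")
```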
@fig-inequalities demonstrates Chebyshev's inequality via simulation: the actual probability of falling outside $k\sigma$ is always below the Chebyshev bound.
::: {#fig-inequalities}
Empirical tail probabilities versus the Chebyshev bound.
:::
```{python}
#| label: inequalities-demo
import numpy as np
from scipy import stats
rng = np.random.default_rng(seed=12345)
# Chebyshev's inequality for Exponential(1)
X = stats.expon(scale=1)
samples = rng.exponential(scale=1, size=100_000)
mu = samples.mean()
sigma = samples.std()
print("Chebyshev's inequality for Exponential(1):")
print(f"{'k':>4} {'Chebyshev bound':>16} {'Actual P':>12}")
print("-" * 36)
for k in [1, 1.5, 2, 3, 4]:
bound = 1 / k**2
actual = np.mean(np.abs(samples - mu) >= k * sigma)
print(f"{k:>4.1f} {bound:>16.4f} {actual:>12.4f}")
```
## Summary and Connections
This article developed the core numerical summaries for probability distributions:
- **Expectation** is the long-run average --- the center of gravity of a distribution. It is linear, which makes it remarkably easy to work with.
- **Variance** quantifies spread as the average squared deviation from the mean. The computational formula $E[X^2] - (E[X])^2$ simplifies calculation.
- **Skewness** and **kurtosis** capture asymmetry and tail behavior --- both critical for assessing risk beyond what the mean and variance reveal.
- **Covariance** and **correlation** describe linear relationships between variables. Remember: uncorrelated does not mean independent.
- **Markov**, **Chebyshev**, and **Jensen** provide distribution-free bounds that hold universally.
**Next**: [Conditional Probability and Expectation](../conditional-probability/index.qmd) --- conditioning is the mechanism by which new information updates probabilities and expectations.
**Application preview**: In risk management, **Value at Risk (VaR)** and **Expected Shortfall (ES)** are direct applications of the concepts developed here. VaR is a quantile of the loss distribution, while ES is a conditional expectation. Portfolio risk depends critically on the covariance structure among assets --- the same tools of variance and correlation introduced above are what make modern portfolio theory work.
## References
::: {#refs}
:::