---
title: "Expectation, Variance, and Moments"
subtitle: "Summarizing Distributions with Numbers"
author: "Universe Office"
date: 2026-04-04
categories: [probability, foundations]
bibliography: references.bib
format:
  html:
    code-fold: true
    toc: true
---
## Introduction
Imagine you commute to work every day. Some days it takes 20 minutes, others 40. If someone asks "how long is your commute?", you do not list every trip --- you say something like "about 30 minutes, give or take 10." That "about 30" is the expectation. The "give or take 10" is the standard deviation. Without realizing it, you have summarized an entire distribution with two numbers.
The [previous article](../random-variables/index.qmd) introduced random variables and their distributions (PMFs, PDFs, CDFs). A full distribution contains every detail, but in practice you need concise numerical summaries: "Where is the center?" "How spread out is it?" "Is it symmetric?" "How heavy are the tails?"
This article develops the machinery for answering those questions --- expectation, variance, higher moments, and the relationships between pairs of random variables [@casella2002; @wasserman2004].
## Expectation
### What Expectation Means
The **expected value** of a random variable is its **long-run average**. If you could repeat an experiment infinitely many times and average the results, the number you would get is the expectation. It is the distribution's center of gravity --- the point where the probability mass balances.
::: {.callout-note}
## Definition (Expectation; Casella & Berger, 2002)
The **expectation** (or **expected value**, **mean**) of a random variable $X$ is:
- **Discrete**: $E[X] = \sum_x x \cdot p(x)$, where $p$ is the PMF
- **Continuous**: $E[X] = \int_{-\infty}^{\infty} x \cdot f(x)\,dx$, where $f$ is the PDF
provided the sum or integral converges absolutely.
:::
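To make the discrete formula concrete, here is a minimal sketch computing $E[X]$ directly from the definition for a fair six-sided die (a toy example assumed for illustration):

```{python}
#| label: expectation-die-check
import numpy as np

# PMF of a fair six-sided die: p(x) = 1/6 for x in {1, ..., 6}
x = np.arange(1, 7)
p = np.full(6, 1 / 6)

# E[X] = sum over x of x * p(x)
expectation = np.sum(x * p)
print(expectation)  # 3.5
```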
### Linearity
Expectation is **linear**: for any constants $a, b$ and random variable $X$,
$$
E[aX + b] = aE[X] + b
$$
More generally, for random variables $X_1, \ldots, X_n$ (not necessarily independent):
$$
E\left[\sum_{i=1}^n a_i X_i\right] = \sum_{i=1}^n a_i E[X_i]
$$
Linearity is the single most useful property of expectation. It holds regardless of dependence between the variables.
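A quick simulation sketch of this point, using $Y = X^2$ as a deliberately dependent example (an assumption for illustration):

```{python}
#| label: linearity-dependence-check
import numpy as np

rng = np.random.default_rng(seed=0)

# Y is a deterministic function of X, so they are strongly dependent,
# yet E[X + Y] = E[X] + E[Y] still holds
X = rng.normal(loc=1.0, scale=1.0, size=100_000)
Y = X**2

print(f"E[X + Y]    = {(X + Y).mean():.4f}")
print(f"E[X] + E[Y] = {X.mean() + Y.mean():.4f}")
```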
### LOTUS (Law of the Unconscious Statistician)
To compute $E[g(X)]$, you do **not** need the distribution of $Y = g(X)$. Instead:
$$
E[g(X)] = \begin{cases}
\sum_x g(x) \cdot p(x) & \text{(discrete)} \\
\int_{-\infty}^{\infty} g(x) \cdot f(x)\,dx & \text{(continuous)}
\end{cases}
$$
This result, known as **LOTUS**, is indispensable: it lets you compute expectations of transformed variables directly from the original distribution.
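As a sketch of LOTUS at work, take $X \sim \text{Uniform}(0, 1)$ and $g(x) = x^2$ (a toy example assumed here; the integral $\int_0^1 x^2\,dx$ equals $1/3$):

```{python}
#| label: lotus-check
import numpy as np
from scipy import integrate, stats

# LOTUS: integrate g(x) * f(x) directly, without ever deriving
# the distribution of Y = X^2
f = stats.uniform(loc=0, scale=1).pdf
lotus_value, _ = integrate.quad(lambda x: x**2 * f(x), 0, 1)
print(f"LOTUS integral   = {lotus_value:.4f}")

# Cross-check by Monte Carlo: transform the samples, then average
rng = np.random.default_rng(seed=0)
samples = rng.uniform(size=100_000)
print(f"Monte Carlo mean = {(samples**2).mean():.4f}")
```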
### Simulation
@fig-expectation demonstrates the law of large numbers in action. As the sample size grows, the running sample mean converges to the theoretical expectation for three distributions.
::: {#fig-expectation}
Running sample mean converging to the theoretical expectation as the sample size grows, for three distributions.
:::
```{python}
#| label: expectation-demo
import numpy as np
from scipy import stats
rng = np.random.default_rng(seed=12345)
# Verify E[X] for Binomial(n=20, p=0.3)
X = stats.binom(n=20, p=0.3)
samples = rng.binomial(n=20, p=0.3, size=100_000)
print(f"Binomial(20, 0.3):")
print(f" Theoretical E[X] = {X.mean():.4f}")
print(f" Simulated mean = {samples.mean():.4f}")
# Verify linearity: E[2X + 3] = 2*E[X] + 3
print(f"\nLinearity check: E[2X + 3]")
print(f" Theoretical = {2 * X.mean() + 3:.4f}")
print(f" Simulated = {(2 * samples + 3).mean():.4f}")
```
## Variance and Standard Deviation
### The Idea: Measuring Spread
Expectation tells you where the distribution is centered, but nothing about how spread out it is. Two distributions can have the same mean but very different shapes. **Variance** measures the average squared distance from the mean --- it quantifies how much a random variable typically deviates from its expected value.
::: {.callout-note}
## Definition (Variance; Casella & Berger, 2002)
The **variance** of $X$ is:
$$
\text{Var}(X) = E[(X - \mu)^2]
$$
where $\mu = E[X]$. The **standard deviation** is $\text{SD}(X) = \sqrt{\text{Var}(X)}$.
:::
A large variance means the distribution is spread out. A small variance means it is concentrated near the mean. The standard deviation has the same units as $X$, making it more interpretable than variance.
### Computational Formula
The following identity is often more convenient for calculation:
$$
\text{Var}(X) = E[X^2] - (E[X])^2
$$
*Proof*: $\text{Var}(X) = E[(X - \mu)^2] = E[X^2 - 2\mu X + \mu^2] = E[X^2] - 2\mu E[X] + \mu^2 = E[X^2] - 2\mu^2 + \mu^2 = E[X^2] - \mu^2$, using $E[X] = \mu$.
### Properties
For constants $a, b$:
$$
\text{Var}(aX + b) = a^2 \text{Var}(X)
$$
Adding a constant shifts the distribution but does not change its spread. Scaling by $a$ scales the variance by $a^2$.
If $X_1, \ldots, X_n$ are **independent**:
$$
\text{Var}\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \text{Var}(X_i)
$$
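A brief simulation sketch of both cases, using Normal variables as an assumed example (additivity holds under independence, but fails for the dependent sum $X + X$):

```{python}
#| label: variance-additivity-check
import numpy as np

rng = np.random.default_rng(seed=0)

# Independent X and Y: variances add (4 + 9 = 13)
X = rng.normal(0, 2, size=100_000)
Y = rng.normal(0, 3, size=100_000)
print(f"Var(X + Y) = {np.var(X + Y):.3f} (theory: 13)")

# Dependent case: Var(X + X) = Var(2X) = 4 * Var(X) = 16, not 4 + 4 = 8
print(f"Var(X + X) = {np.var(X + X):.3f} (theory: 16)")
```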
### Simulation
@fig-variance shows Normal distributions with different variances, making the effect of $\sigma^2$ visually clear.
::: {#fig-variance}
Normal densities with the same mean but different variances.
:::
```{python}
#| label: variance-demo
import numpy as np
from scipy import stats
rng = np.random.default_rng(seed=12345)
# Verify Var(X) = E[X^2] - (E[X])^2 for Exponential(lambda=2)
X = stats.expon(scale=0.5) # scale = 1/lambda
samples = rng.exponential(scale=0.5, size=100_000)
var_def = np.mean((samples - samples.mean())**2)
var_formula = np.mean(samples**2) - samples.mean()**2
print(f"Exponential(lambda=2):")
print(f" Theoretical Var = {X.var():.6f}")
print(f" Definition = {var_def:.6f}")
print(f" E[X^2]-(E[X])^2 = {var_formula:.6f}")
```
## Higher Moments and Moment Generating Function
### Moments
The **$k$-th moment** of $X$ is $E[X^k]$, and the **$k$-th central moment** is $E[(X - \mu)^k]$. The first moment is the mean; the second central moment is the variance.
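As a quick sketch, both kinds of moments can be estimated directly from samples; the example below assumes $X \sim \text{Exponential}(1)$, whose $k$-th raw moment is $k!$:

```{python}
#| label: moments-check
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
samples = rng.exponential(scale=1.0, size=200_000)

# Raw moments E[X^k] of Exponential(1) are k! = 1, 2, 6, ...
for k in [1, 2, 3]:
    print(f"E[X^{k}] ~= {np.mean(samples**k):.3f}")

# Second central moment is the variance (theory: 1)
print(f"2nd central moment ~= {stats.moment(samples, moment=2):.3f}")
```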
### Skewness and Kurtosis
::: {.callout-note}
## Definition (Skewness and Kurtosis; Casella & Berger, 2002)
The **skewness** measures asymmetry:
$$
\gamma_1 = E\left[\left(\frac{X - \mu}{\sigma}\right)^3\right]
$$
The **excess kurtosis** measures tail heaviness relative to the Normal distribution:
$$
\gamma_2 = E\left[\left(\frac{X - \mu}{\sigma}\right)^4\right] - 3
$$
:::
- $\gamma_1 > 0$: right-skewed (long right tail)
- $\gamma_1 < 0$: left-skewed (long left tail)
- $\gamma_2 > 0$: heavier tails than Normal (leptokurtic)
- $\gamma_2 < 0$: lighter tails than Normal (platykurtic)
**Why these matter in practice.** In risk management, skewness tells you whether losses are more likely to be extreme in one direction. Positive skewness in credit losses means occasional large defaults. Kurtosis measures how likely extreme events are. A distribution with high excess kurtosis produces more "black swan" events than a Normal distribution with the same mean and variance --- this is why regulatory capital models pay close attention to tail behavior.
@fig-skewness-kurtosis compares distributions with different skewness and kurtosis.
::: {#fig-skewness-kurtosis}
Distributions with different skewness and kurtosis.
:::
### Moment Generating Function
::: {.callout-note}
## Definition (MGF; Casella & Berger, 2002)
The **moment generating function** of $X$ is:
$$
M_X(t) = E[e^{tX}]
$$
provided this expectation exists for $t$ in a neighborhood of 0.
:::
The MGF is called "moment generating" because:
$$
E[X^k] = M_X^{(k)}(0) = \frac{d^k}{dt^k} M_X(t) \bigg|_{t=0}
$$
Key properties:
1. **Uniqueness**: If $M_X(t) = M_Y(t)$ for all $t$ in a neighborhood of 0, then $X$ and $Y$ have the same distribution
2. **Independence**: If $X$ and $Y$ are independent, then $M_{X+Y}(t) = M_X(t) \cdot M_Y(t)$
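To illustrate the moment-generating property numerically, the sketch below takes the known MGF of Exponential(1), $M(t) = 1/(1 - t)$, and differentiates it at $t = 0$ by finite differences (an illustration, not part of the derivations above):

```{python}
#| label: mgf-check
# MGF of Exponential(1): M(t) = 1 / (1 - t), defined for t < 1
def M(t):
    return 1.0 / (1.0 - t)

h = 1e-4
# Central differences approximate M'(0) and M''(0)
m1 = (M(h) - M(-h)) / (2 * h)          # E[X]   = 1
m2 = (M(h) - 2 * M(0) + M(-h)) / h**2  # E[X^2] = 2
print(f"E[X]   ~= {m1:.4f}")
print(f"E[X^2] ~= {m2:.4f}")
```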
```{python}
#| label: skewness-kurtosis-demo
from scipy import stats
distributions = {
"Normal(0,1)": stats.norm(0, 1),
"Exponential(1)": stats.expon(scale=1),
"Chi-squared(5)": stats.chi2(df=5),
"t(5)": stats.t(df=5),
}
print(f"{'Distribution':<20} {'Skewness':>10} {'Ex. Kurtosis':>14}")
print("-" * 46)
for name, dist in distributions.items():
print(f"{name:<20} {float(dist.stats(moments='s')):>10.4f} {float(dist.stats(moments='k')):>14.4f}")
```
## Covariance and Correlation
### Covariance
When you have two random variables, you often want to know whether they move together. **Covariance** captures the direction and strength of their linear relationship.
::: {.callout-note}
## Definition (Covariance; Casella & Berger, 2002)
The **covariance** of two random variables $X$ and $Y$ is:
$$
\text{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - E[X]E[Y]
$$
:::
- $\text{Cov}(X, Y) > 0$: $X$ and $Y$ tend to move in the same direction
- $\text{Cov}(X, Y) < 0$: $X$ and $Y$ tend to move in opposite directions
### Correlation
The **Pearson correlation coefficient** normalizes covariance to $[-1, 1]$:
$$
\rho(X, Y) = \frac{\text{Cov}(X, Y)}{\text{SD}(X) \cdot \text{SD}(Y)}
$$
### Independence and Uncorrelatedness
::: {.callout-important}
If $X$ and $Y$ are **independent**, then $\text{Cov}(X, Y) = 0$ (they are uncorrelated). The converse is **false** in general. Uncorrelated does not imply independent.
:::
A classic counterexample: let $X \sim N(0, 1)$ and $Y = X^2$. Then $\text{Cov}(X, Y) = E[X^3] = 0$ (by symmetry of the Normal distribution), but $X$ and $Y$ are clearly dependent --- $Y$ is a deterministic function of $X$.
@fig-correlation shows bivariate Normal distributions with different correlation coefficients.
::: {#fig-correlation}
Bivariate Normal samples with different correlation coefficients.
:::
```{python}
#| label: covariance-demo
import numpy as np
rng = np.random.default_rng(seed=12345)
# Bivariate Normal with rho = 0.7
rho = 0.7
mean = [0, 0]
cov_matrix = [[1, rho], [rho, 1]]
samples = rng.multivariate_normal(mean, cov_matrix, size=100_000)
print(f"Bivariate Normal (rho = {rho}):")
print(f" Theoretical Cov = {rho:.4f}")
print(f" Simulated Cov = {np.cov(samples.T)[0, 1]:.4f}")
print(f" Simulated Corr = {np.corrcoef(samples.T)[0, 1]:.4f}")
# Counterexample: X and X^2 are uncorrelated but dependent
X = rng.standard_normal(100_000)
Y = X**2
print(f"\nCounterexample: X ~ N(0,1), Y = X^2")
print(f" Corr(X, Y) = {np.corrcoef(X, Y)[0, 1]:.4f} (~ 0)")
print(f" Dependent? Yes (Y is a function of X)")
```
## Inequalities
These three inequalities provide bounds on probabilities and expectations without knowing the full distribution. They are the "safety nets" of probability theory.
### Markov's Inequality
For a non-negative random variable $X$ and $a > 0$:
$$
P(X \ge a) \le \frac{E[X]}{a}
$$
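A quick simulation sketch, assuming $X \sim \text{Exponential}(1)$ so that $E[X] = 1$ and the bound becomes $1/a$ (the bound always holds, though it is often loose):

```{python}
#| label: markov-check
import numpy as np

rng = np.random.default_rng(seed=0)
samples = rng.exponential(scale=1.0, size=100_000)

# Markov: P(X >= a) <= E[X] / a = 1 / a for Exponential(1)
for a in [1, 2, 5]:
    actual = np.mean(samples >= a)
    print(f"a = {a}: P(X >= a) = {actual:.4f} <= bound {1 / a:.4f}")
```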
### Chebyshev's Inequality
For any random variable $X$ with finite mean $\mu$ and variance $\sigma^2$, and $k > 0$:
$$
P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}
$$
This provides a distribution-free bound on the probability of deviating from the mean by more than $k$ standard deviations. No matter what the distribution looks like, at most 25% of the probability can lie beyond $2\sigma$ from the mean.
### Jensen's Inequality
If $g$ is a **convex** function:
$$
g(E[X]) \le E[g(X)]
$$
If $g$ is **concave**, the inequality reverses. This inequality has far-reaching consequences --- for example, it explains why $E[\log X] \le \log E[X]$ and why diversification reduces risk.
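As a minimal sketch of Jensen's inequality with the concave function $\log$, assume $X$ is Lognormal(0, 1), for which both sides are known exactly: $E[\log X] = 0$ while $\log E[X] = 1/2$:

```{python}
#| label: jensen-check
import numpy as np

rng = np.random.default_rng(seed=0)

# log X ~ N(0, 1), so E[log X] = 0 and log E[X] = log(e^{1/2}) = 0.5
X = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)
print(f"E[log X] = {np.log(X).mean():.4f}")
print(f"log E[X] = {np.log(X.mean()):.4f}")
```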
@fig-inequalities demonstrates Chebyshev's inequality via simulation: the actual probability of falling outside $k\sigma$ is always below the Chebyshev bound.
::: {#fig-inequalities}
Empirical tail probabilities versus the Chebyshev bound.
:::
```{python}
#| label: inequalities-demo
import numpy as np
from scipy import stats
rng = np.random.default_rng(seed=12345)
# Chebyshev's inequality for Exponential(1)
X = stats.expon(scale=1)
samples = rng.exponential(scale=1, size=100_000)
mu = samples.mean()
sigma = samples.std()
print("Chebyshev's inequality for Exponential(1):")
print(f"{'k':>4} {'Chebyshev bound':>16} {'Actual P':>12}")
print("-" * 36)
for k in [1, 1.5, 2, 3, 4]:
bound = 1 / k**2
actual = np.mean(np.abs(samples - mu) >= k * sigma)
print(f"{k:>4.1f} {bound:>16.4f} {actual:>12.4f}")
```
## Summary and Connections
This article developed the core numerical summaries for probability distributions:
- **Expectation** is the long-run average --- the center of gravity of a distribution. It is linear, which makes it remarkably easy to work with.
- **Variance** quantifies spread as the average squared deviation from the mean. The computational formula $E[X^2] - (E[X])^2$ simplifies calculation.
- **Skewness** and **kurtosis** capture asymmetry and tail behavior --- both critical for assessing risk beyond what the mean and variance reveal.
- **Covariance** and **correlation** describe linear relationships between variables. Remember: uncorrelated does not mean independent.
- **Markov**, **Chebyshev**, and **Jensen** provide distribution-free bounds that hold universally.
**Next**: [Conditional Probability and Expectation](../conditional-probability/index.qmd) --- conditioning is the mechanism by which new information updates probabilities and expectations.
**Application preview**: In risk management, **Value at Risk (VaR)** and **Expected Shortfall (ES)** are direct applications of the concepts developed here. VaR is a quantile of the loss distribution, while ES is a conditional expectation. Portfolio risk depends critically on the covariance structure among assets --- the same tools of variance and correlation introduced above are what make modern portfolio theory work.
## References
::: {#refs}
:::