Stochastic Modeling Erich Baur
ZHAW School of Engineering
Most Important Concepts ¶ Probability Space & Axioms : Foundation of all probability theory
Conditional Probability : Essential for Bayesian reasoning
Expectation & Variance : Fundamental measures of central tendency and spread
Limit Theorems : LLN and CLT are the most important results in probability
Markov Property : Foundation for Markov Chains and many stochastic processes
Stationary Distributions : Long-term behavior of Markov Chains
Poisson Process : Fundamental continuous-time stochastic process
Independence vs. Uncorrelated : Independent implies uncorrelated, but not vice versa (except for jointly normal RVs)
Continuous vs. Discrete : Remember to use PMF for discrete and PDF for continuous RVs
Conditional Probability : Always check that P ( F ) > 0 P(F) > 0 P ( F ) > 0 before using P ( E ∣ F ) P(E|F) P ( E ∣ F )
Expectation of Functions : E [ g ( X ) ] ≠ g ( E [ X ] ) \mathbb{E}[g(X)] \neq g(\mathbb{E}[X]) E [ g ( X )] = g ( E [ X ]) in general (Jensen’s inequality)
Variance of Sums : Var ( X + Y ) = Var ( X ) + Var ( Y ) + 2 Cov ( X , Y ) \operatorname{Var}(X+Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X,Y) Var ( X + Y ) = Var ( X ) + Var ( Y ) + 2 Cov ( X , Y )
Markov Chain Convergence : Requires both irreducibility and aperiodicity
Problem-Solving Strategies ¶ Identify the Distribution : Recognize which distribution the problem involves
Use Definitions : Start with the definition (PMF, PDF, or property)
Check Conditions : Verify all conditions are met before applying theorems
Use Linearity : Expectation is linear - use this whenever possible
Conditioning : Break complex problems into simpler conditional problems
Symmetry : Look for symmetry to simplify calculations
Approximations : Use CLT for large n n n approximations
Probability ¶ P ( E ∣ F ) = P ( E ∩ F ) P ( F ) P ( E ) = P ( E ∣ A ) P ( A ) + P ( E ∣ A c ) P ( A c ) P ( A ∣ E ) = P ( E ∣ A ) P ( A ) P ( E ) P ( A ∩ B ) = P ( A ) P ( B ∣ A ) = P ( B ) P ( A ∣ B ) \begin{align*}
P(E|F) &= \frac{P(E \cap F)}{P(F)} & P(E) &= P(E|A)P(A) + P(E|A^c)P(A^c) \\
P(A|E) &= \frac{P(E|A)P(A)}{P(E)} & P(A \cap B) &= P(A)P(B|A) = P(B)P(A|B)
\end{align*} P ( E ∣ F ) P ( A ∣ E ) = P ( F ) P ( E ∩ F ) = P ( E ) P ( E ∣ A ) P ( A ) P ( E ) P ( A ∩ B ) = P ( E ∣ A ) P ( A ) + P ( E ∣ A c ) P ( A c ) = P ( A ) P ( B ∣ A ) = P ( B ) P ( A ∣ B ) Expectation & Variance ¶ E [ X ] = ∑ x P ( X = x ) E [ X ] = ∫ x f X ( x ) d x Var ( X ) = E [ X 2 ] − ( E [ X ] ) 2 Var ( a X + b ) = a 2 Var ( X ) Cov ( X , Y ) = E [ X Y ] − E [ X ] E [ Y ] ρ X , Y = Cov ( X , Y ) Var ( X ) Var ( Y ) \begin{align*}
\mathbb{E}[X] &= \sum x P(X=x) & \mathbb{E}[X] &= \int x f_X(x) dx \\
\operatorname{Var}(X) &= \mathbb{E}[X^2] - (\mathbb{E}[X])^2 & \operatorname{Var}(aX+b) &= a^2 \operatorname{Var}(X) \\
\operatorname{Cov}(X,Y) &= \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] & \rho_{X,Y} &= \frac{\operatorname{Cov}(X,Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}
\end{align*} E [ X ] Var ( X ) Cov ( X , Y ) = ∑ x P ( X = x ) = E [ X 2 ] − ( E [ X ] ) 2 = E [ X Y ] − E [ X ] E [ Y ] E [ X ] Var ( a X + b ) ρ X , Y = ∫ x f X ( x ) d x = a 2 Var ( X ) = Var ( X ) Var ( Y ) Cov ( X , Y ) Distributions ¶ Binomial: P ( X = k ) = ( n k ) p k ( 1 − p ) n − k E [ X ] = n p , Var ( X ) = n p ( 1 − p ) Poisson: P ( X = k ) = λ k e − λ k ! E [ X ] = λ , Var ( X ) = λ Geometric: P ( X = k ) = ( 1 − p ) k − 1 p E [ X ] = 1 / p , Var ( X ) = ( 1 − p ) / p 2 Normal: f ( x ) = 1 2 π σ 2 e − ( x − μ ) 2 / ( 2 σ 2 ) E [ X ] = μ , Var ( X ) = σ 2 Exp: f ( x ) = λ e − λ x E [ X ] = 1 / λ , Var ( X ) = 1 / λ 2 \begin{align*}
\text{Binomial: } P(X=k) &= \binom{n}{k} p^k (1-p)^{n-k} & \mathbb{E}[X] = np, \operatorname{Var}(X) = np(1-p) \\
\text{Poisson: } P(X=k) &= \frac{\lambda^k e^{-\lambda}}{k!} & \mathbb{E}[X] = \lambda, \operatorname{Var}(X) = \lambda \\
\text{Geometric: } P(X=k) &= (1-p)^{k-1} p & \mathbb{E}[X] = 1/p, \operatorname{Var}(X) = (1-p)/p^2 \\
\text{Normal: } f(x) &= \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x-\mu)^2/(2\sigma^2)} & \mathbb{E}[X] = \mu, \operatorname{Var}(X) = \sigma^2 \\
\text{Exp: } f(x) &= \lambda e^{-\lambda x} & \mathbb{E}[X] = 1/\lambda, \operatorname{Var}(X) = 1/\lambda^2
\end{align*} Binomial: P ( X = k ) Poisson: P ( X = k ) Geometric: P ( X = k ) Normal: f ( x ) Exp: f ( x ) = ( k n ) p k ( 1 − p ) n − k = k ! λ k e − λ = ( 1 − p ) k − 1 p = 2 π σ 2 1 e − ( x − μ ) 2 / ( 2 σ 2 ) = λ e − λ x E [ X ] = n p , Var ( X ) = n p ( 1 − p ) E [ X ] = λ , Var ( X ) = λ E [ X ] = 1/ p , Var ( X ) = ( 1 − p ) / p 2 E [ X ] = μ , Var ( X ) = σ 2 E [ X ] = 1/ λ , Var ( X ) = 1/ λ 2 Limit Theorems ¶ LLN: S n n → P μ CLT: Z = S n − μ σ → d N ( 0 , 1 ) Chebyshev: P ( ∣ X − μ ∣ ≥ ε ) ≤ σ 2 ε 2 Markov: P ( X ≥ a ) ≤ E [ X ] a \begin{align*}
\text{LLN: } \frac{S_n}{n} &\xrightarrow{P} \mu & \text{CLT: } Z = \frac{S_n - \mu}{\sigma} &\xrightarrow{d} N(0,1) \\
\text{Chebyshev: } P(|X-\mu| \geq \varepsilon) &\leq \frac{\sigma^2}{\varepsilon^2} & \text{Markov: } P(X \geq a) &\leq \frac{\mathbb{E}[X]}{a}
\end{align*} LLN: n S n Chebyshev: P ( ∣ X − μ ∣ ≥ ε ) P μ ≤ ε 2 σ 2 CLT: Z = σ S n − μ Markov: P ( X ≥ a ) d N ( 0 , 1 ) ≤ a E [ X ] Markov Chains ¶ Markov: P ( X n + 1 ∣ X n , … , X 0 ) = P ( X n + 1 ∣ X n ) Chapman-Kolmogorov: p i j ( m + n ) = ∑ k p i k ( m ) p k j ( n ) Stationary: π ⃗ = π ⃗ P Convergence: ν ⃗ P n → π ⃗ \begin{align*}
\text{Markov: } P(X_{n+1}|X_n, \ldots, X_0) &= P(X_{n+1}|X_n) & \text{Chapman-Kolmogorov: } p_{ij}^{(m+n)} &= \sum_k p_{ik}^{(m)} p_{kj}^{(n)} \\
\text{Stationary: } \vec{\pi} &= \vec{\pi} P & \text{Convergence: } \vec{\nu} P^n &\to \vec{\pi}
\end{align*} Markov: P ( X n + 1 ∣ X n , … , X 0 ) Stationary: π = P ( X n + 1 ∣ X n ) = π P Chapman-Kolmogorov: p ij ( m + n ) Convergence: ν P n = k ∑ p ik ( m ) p kj ( n ) → π Stochastic Processes ¶ Mean: m X ( t ) = E [ X t ] Correlation: R X ( t 1 , t 2 ) = E [ X t 1 X t 2 ] { t 1 = t 2 t 1 ≠ t 2 Covariance: C X ( t 1 , t 2 ) = R X ( t 1 , t 2 ) − m X ( t 1 ) m X ( t 2 ) Stationary: R X ( t 1 , t 2 ) = R X ( ∣ t 2 − t 1 ∣ ) \begin{align*}
\text{Mean: } m_X(t) &= \mathbb{E}[X_t] & \text{Correlation: } R_X(t_1,t_2) &= \mathbb{E}[X_{t_1}X_{t_2}] \begin{cases} t_1 = t_2 \\ t_1 \neq t_2 \end{cases}\\
\text{Covariance: } C_X(t_1,t_2) &= R_X(t_1,t_2) - m_X(t_1)m_X(t_2) & \text{Stationary: } R_X(t_1,t_2) &= R_X(|t_2-t_1|)
\end{align*} Mean: m X ( t ) Covariance: C X ( t 1 , t 2 ) = E [ X t ] = R X ( t 1 , t 2 ) − m X ( t 1 ) m X ( t 2 ) Correlation: R X ( t 1 , t 2 ) Stationary: R X ( t 1 , t 2 ) = E [ X t 1 X t 2 ] { t 1 = t 2 t 1 = t 2 = R X ( ∣ t 2 − t 1 ∣ ) Differentiation Rules ¶ Power Rule d d x x n = n x n − 1 \frac{d}{dx} x^n = n x^{n-1} d x d x n = n x n − 1 Exponential d d x e a x = a e a x \frac{d}{dx} e^{ax} = a e^{ax} d x d e a x = a e a x Natural Log d d x ln ( x ) = 1 x \frac{d}{dx} \ln(x) = \frac{1}{x} d x d ln ( x ) = x 1 Product Rule d d x [ f ( x ) g ( x ) ] = f ′ ( x ) g ( x ) + f ( x ) g ′ ( x ) \frac{d}{dx} [f(x)g(x)] = f'(x)g(x) + f(x)g'(x) d x d [ f ( x ) g ( x )] = f ′ ( x ) g ( x ) + f ( x ) g ′ ( x ) Quotient Rule d d x [ f ( x ) g ( x ) ] = f ′ ( x ) g ( x ) − f ( x ) g ′ ( x ) [ g ( x ) ] 2 \frac{d}{dx} \left[\frac{f(x)}{g(x)}\right] = \frac{f'(x)g(x) - f(x)g'(x)}{[g(x)]^2} d x d [ g ( x ) f ( x ) ] = [ g ( x ) ] 2 f ′ ( x ) g ( x ) − f ( x ) g ′ ( x ) Chain Rule d d x f ( g ( x ) ) = f ′ ( g ( x ) ) ⋅ g ′ ( x ) \frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x) d x d f ( g ( x )) = f ′ ( g ( x )) ⋅ g ′ ( x )
Integration Rules ¶ Power Rule ∫ x n d x = x n + 1 n + 1 + C \int x^n dx = \frac{x^{n+1}}{n+1} + C ∫ x n d x = n + 1 x n + 1 + C (for n ≠ − 1 n \neq -1 n = − 1 )Exponential ∫ e a x d x = 1 a e a x + C \int e^{ax} dx = \frac{1}{a} e^{ax} + C ∫ e a x d x = a 1 e a x + C Natural Log ∫ 1 x d x = ln ∣ x ∣ + C \int \frac{1}{x} dx = \ln|x | + C ∫ x 1 d x = ln ∣ x ∣ + C Integration by Parts ∫ u d v = u v − ∫ v d u \int u dv = uv - \int v du ∫ u d v = uv − ∫ v d u
Integration by Parts Example :
∫ x e a x d x = x e a x a − 1 a ∫ e a x d x = e a x a 2 ( a x − 1 ) + C \int x e^{ax} dx = \frac{x e^{ax}}{a} - \frac{1}{a} \int e^{ax} dx = \frac{e^{ax}}{a^2} (a x - 1) + C ∫ x e a x d x = a x e a x − a 1 ∫ e a x d x = a 2 e a x ( a x − 1 ) + C Special Integrals ¶ ∫ − ∞ ∞ e − x 2 d x = π ∫ 0 ∞ x n e − a x d x = n ! a n + 1 (Gamma function: Γ ( n + 1 ) = n ! ) ∫ 0 ∞ e − a x d x = 1 a for a > 0 ∫ 0 ∞ x e − a x d x = 1 a 2 ∫ 0 ∞ x 2 e − a x d x = 2 a 3 \begin{align*}
\int_{-\infty}^{\infty} e^{-x^2} dx &= \sqrt{\pi} \\
\int_0^{\infty} x^n e^{-ax} dx &= \frac{n!}{a^{n+1}} \quad \text{(Gamma function: } \Gamma(n+1) = n!) \\
\int_0^{\infty} e^{-ax} dx &= \frac{1}{a} \quad \text{for } a > 0 \\
\int_0^{\infty} x e^{-ax} dx &= \frac{1}{a^2} \\
\int_0^{\infty} x^2 e^{-ax} dx &= \frac{2}{a^3}
\end{align*} ∫ − ∞ ∞ e − x 2 d x ∫ 0 ∞ x n e − a x d x ∫ 0 ∞ e − a x d x ∫ 0 ∞ x e − a x d x ∫ 0 ∞ x 2 e − a x d x = π = a n + 1 n ! (Gamma function: Γ ( n + 1 ) = n !) = a 1 for a > 0 = a 2 1 = a 3 2 Taylor Series ¶ e x = ∑ n = 0 ∞ x n n ! = 1 + x + x 2 2 ! + x 3 3 ! + … ln ( 1 + x ) = ∑ n = 1 ∞ ( − 1 ) n + 1 x n n = x − x 2 2 + x 3 3 − … for ∣ x ∣ < 1 sin ( x ) = ∑ n = 0 ∞ ( − 1 ) n x 2 n + 1 ( 2 n + 1 ) ! = x − x 3 3 ! + x 5 5 ! − … cos ( x ) = ∑ n = 0 ∞ ( − 1 ) n x 2 n ( 2 n ) ! = 1 − x 2 2 ! + x 4 4 ! − … e x ≈ 1 + x + x 2 2 (2nd order Taylor) \begin{align*}
e^x &= \sum_{n=0}^{\infty} \frac{x^n}{n!} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \ldots \\
\ln(1+x) &= \sum_{n=1}^{\infty} (-1)^{n+1} \frac{x^n}{n} = x - \frac{x^2}{2} + \frac{x^3}{3} - \ldots \quad \text{for } |x| < 1 \\
\sin(x) &= \sum_{n=0}^{\infty} (-1)^n \frac{x^{2n+1}}{(2n+1)!} = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \ldots \\
\cos(x) &= \sum_{n=0}^{\infty} (-1)^n \frac{x^{2n}}{(2n)!} = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \ldots \\
e^{x} &\approx 1 + x + \frac{x^2}{2} \quad \text{(2nd order Taylor)}
\end{align*} e x ln ( 1 + x ) sin ( x ) cos ( x ) e x = n = 0 ∑ ∞ n ! x n = 1 + x + 2 ! x 2 + 3 ! x 3 + … = n = 1 ∑ ∞ ( − 1 ) n + 1 n x n = x − 2 x 2 + 3 x 3 − … for ∣ x ∣ < 1 = n = 0 ∑ ∞ ( − 1 ) n ( 2 n + 1 )! x 2 n + 1 = x − 3 ! x 3 + 5 ! x 5 − … = n = 0 ∑ ∞ ( − 1 ) n ( 2 n )! x 2 n = 1 − 2 ! x 2 + 4 ! x 4 − … ≈ 1 + x + 2 x 2 (2nd order Taylor) Common Series ¶ ∑ k = 0 ∞ r k = 1 1 − r for ∣ r ∣ < 1 ∑ k = 1 ∞ k r k − 1 = 1 ( 1 − r ) 2 for ∣ r ∣ < 1 ∑ k = 1 ∞ k 2 r k − 1 = 1 + r ( 1 − r ) 3 for ∣ r ∣ < 1 ∑ k = 1 n k = n ( n + 1 ) 2 ∑ k = 1 n k 2 = n ( n + 1 ) ( 2 n + 1 ) 6 ∑ k = 1 ∞ 1 k 2 = π 2 6 (Basel problem) H n = ∑ k = 1 n 1 k ≈ ln ( n ) + γ where γ ≈ 0.5772 \begin{align*}
\sum_{k=0}^{\infty} r^k &= \frac{1}{1-r} \quad \text{for } |r| < 1 \\
\sum_{k=1}^{\infty} k r^{k-1} &= \frac{1}{(1-r)^2} \quad \text{for } |r| < 1 \\
\sum_{k=1}^{\infty} k^2 r^{k-1} &= \frac{1+r}{(1-r)^3} \quad \text{for } |r| < 1 \\
\sum_{k=1}^{n} k &= \frac{n(n+1)}{2} \\
\sum_{k=1}^{n} k^2 &= \frac{n(n+1)(2n+1)}{6} \\
\sum_{k=1}^{\infty} \frac{1}{k^2} &= \frac{\pi^2}{6} \quad \text{(Basel problem)} \\
H_n &= \sum_{k=1}^n \frac{1}{k} \approx \ln(n) + \gamma \quad \text{where } \gamma \approx 0.5772
\end{align*} k = 0 ∑ ∞ r k k = 1 ∑ ∞ k r k − 1 k = 1 ∑ ∞ k 2 r k − 1 k = 1 ∑ n k k = 1 ∑ n k 2 k = 1 ∑ ∞ k 2 1 H n = 1 − r 1 for ∣ r ∣ < 1 = ( 1 − r ) 2 1 for ∣ r ∣ < 1 = ( 1 − r ) 3 1 + r for ∣ r ∣ < 1 = 2 n ( n + 1 ) = 6 n ( n + 1 ) ( 2 n + 1 ) = 6 π 2 (Basel problem) = k = 1 ∑ n k 1 ≈ ln ( n ) + γ where γ ≈ 0.5772 Probability Theory Fundamentals ¶ Probability Space & Axioms ¶ A probability space is a triple ( Ω , A , P ) (\Omega, \mathcal{A}, P) ( Ω , A , P ) where:
Ω \Omega Ω : Sample space - set of all possible outcomes
A \mathcal{A} A : σ \sigma σ -algebra - set of all measurable events (subsets of Ω \Omega Ω )
P P P : Probability measure satisfying Kolmogorov’s Axioms :
1. 0 ≤ P ( E ) ≤ 1 ∀ E ∈ A 2. P ( Ω ) = 1 3. P ( ⋃ i = 1 ∞ E i ) = ∑ i = 1 ∞ P ( E i ) for pairwise disjoint E i \begin{align*}
1. &\quad 0 \leq P(E) \leq 1 \quad \forall E \in \mathcal{A} \\
2. &\quad P(\Omega) = 1 \\
3. &\quad P\left(\bigcup_{i=1}^{\infty} E_i\right) = \sum_{i=1}^{\infty} P(E_i) \quad \text{for pairwise disjoint } E_i
\end{align*} 1. 2. 3. 0 ≤ P ( E ) ≤ 1 ∀ E ∈ A P ( Ω ) = 1 P ( i = 1 ⋃ ∞ E i ) = i = 1 ∑ ∞ P ( E i ) for pairwise disjoint E i Key Properties ¶ P ( ∅ ) = 0 P ( A c ) = 1 − P ( A ) P ( A ∖ B ) = P ( A ) − P ( A ∩ B ) P ( A ∪ B ) = P ( A ) + P ( B ) − P ( A ∩ B ) (Inclusion-Exclusion) \begin{align*}
P(\emptyset) &= 0 \\
P(A^c) &= 1 - P(A) \\
P(A \setminus B) &= P(A) - P(A \cap B) \\
P(A \cup B) &= P(A) + P(B) - P(A \cap B) \quad \text{(Inclusion-Exclusion)}
\end{align*} P ( ∅ ) P ( A c ) P ( A ∖ B ) P ( A ∪ B ) = 0 = 1 − P ( A ) = P ( A ) − P ( A ∩ B ) = P ( A ) + P ( B ) − P ( A ∩ B ) (Inclusion-Exclusion) Conditional Probability ¶ Interpretation : Restricts the sample space to F F F and considers the relative frequency of E E E within F F F .
Independence ¶ Two events E , F E, F E , F are independent if:
P ( E ∩ F ) = P ( E ) ⋅ P ( F ) P(E \cap F) = P(E) \cdot P(F) P ( E ∩ F ) = P ( E ) ⋅ P ( F ) Equivalently: P ( E ∣ F ) = P ( E ) P(E|F) = P(E) P ( E ∣ F ) = P ( E ) or P ( F ∣ E ) = P ( F ) P(F|E) = P(F) P ( F ∣ E ) = P ( F ) .
For n n n events E 1 , E 2 , … , E n E_1, E_2, \ldots, E_n E 1 , E 2 , … , E n :
P ( ⋂ i = 1 n E i ) = ∏ i = 1 n P ( E i ) P\left(\bigcap_{i=1}^n E_i\right) = \prod_{i=1}^n P(E_i) P ( i = 1 ⋂ n E i ) = i = 1 ∏ n P ( E i ) Important : Independence is pairwise and for all subsets, not just pairs.
Law of Total Probability ¶ For a partition A 1 , A 2 , … , A n A_1, A_2, \ldots, A_n A 1 , A 2 , … , A n of Ω \Omega Ω :
P ( E ) = ∑ i = 1 n P ( E ∣ A i ) ⋅ P ( A i ) P(E) = \sum_{i=1}^n P(E|A_i) \cdot P(A_i) P ( E ) = i = 1 ∑ n P ( E ∣ A i ) ⋅ P ( A i ) Simplified (for binary partition):
P ( E ) = P ( E ∣ A ) ⋅ P ( A ) + P ( E ∣ A c ) ⋅ P ( A c ) P(E) = P(E|A) \cdot P(A) + P(E|A^c) \cdot P(A^c) P ( E ) = P ( E ∣ A ) ⋅ P ( A ) + P ( E ∣ A c ) ⋅ P ( A c ) Bayes’ Theorem ¶ P ( A ∣ E ) = P ( E ∣ A ) ⋅ P ( A ) P ( E ) = P ( E ∣ A ) ⋅ P ( A ) P ( E ∣ A ) ⋅ P ( A ) + P ( E ∣ A c ) ⋅ P ( A c ) P(A|E) = \frac{P(E|A) \cdot P(A)}{P(E)} = \frac{P(E|A) \cdot P(A)}{P(E|A) \cdot P(A) + P(E|A^c) \cdot P(A^c)} P ( A ∣ E ) = P ( E ) P ( E ∣ A ) ⋅ P ( A ) = P ( E ∣ A ) ⋅ P ( A ) + P ( E ∣ A c ) ⋅ P ( A c ) P ( E ∣ A ) ⋅ P ( A ) Example : Medical testing - given test result, compute probability of disease.
Random Variables ¶ Definition ¶ A random variable (RV) X : Ω → R X: \Omega \to \mathbb{R} X : Ω → R assigns a numerical value to each outcome.
Types of Random Variables ¶ Type Definition Example Discrete Takes finite or countable values Dice roll, coin toss Continuous Takes uncountable values Height, time
Probability Mass Function (PMF) ¶ For discrete RV X X X :
P ( X = x i ) = p X ( x i ) where ∑ i p X ( x i ) = 1 P(X = x_i) = p_X(x_i) \quad \text{where } \sum_i p_X(x_i) = 1 P ( X = x i ) = p X ( x i ) where i ∑ p X ( x i ) = 1 Probability Density Function (PDF) ¶ For continuous RV X X X :
P ( a ≤ X ≤ b ) = ∫ a b f X ( x ) d x where ∫ − ∞ ∞ f X ( x ) d x = 1 P(a \leq X \leq b) = \int_a^b f_X(x) \, dx \quad \text{where } \int_{-\infty}^{\infty} f_X(x) \, dx = 1 P ( a ≤ X ≤ b ) = ∫ a b f X ( x ) d x where ∫ − ∞ ∞ f X ( x ) d x = 1 Distribution Function (CDF) ¶ For any RV X X X :
F X ( t ) = P ( X ≤ t ) = { ∑ x i ≤ t P ( X = x i ) for discrete X ∫ − ∞ t f X ( x ) d x for continuous X F_X(t) = P(X \leq t) = \begin{cases}
\sum_{x_i \leq t} P(X = x_i) & \text{for discrete } X \\
\int_{-\infty}^t f_X(x) \, dx & \text{for continuous } X
\end{cases} F X ( t ) = P ( X ≤ t ) = { ∑ x i ≤ t P ( X = x i ) ∫ − ∞ t f X ( x ) d x for discrete X for continuous X Properties : F X F_X F X is right-continuous, non-decreasing, with lim t → − ∞ F X ( t ) = 0 \lim_{t\to-\infty} F_X(t) = 0 lim t → − ∞ F X ( t ) = 0 and lim t → ∞ F X ( t ) = 1 \lim_{t\to\infty} F_X(t) = 1 lim t → ∞ F X ( t ) = 1 .
Expectation (Mean) ¶ E [ X ] = { ∑ i x i P ( X = x i ) for discrete X ∫ − ∞ ∞ x f X ( x ) d x for continuous X \mathbb{E}[X] = \begin{cases}
\sum_i x_i P(X = x_i) & \text{for discrete } X \\
\int_{-\infty}^{\infty} x f_X(x) \, dx & \text{for continuous } X
\end{cases} E [ X ] = { ∑ i x i P ( X = x i ) ∫ − ∞ ∞ x f X ( x ) d x for discrete X for continuous X Linearity of Expectation :
E [ a X + b Y ] = a E [ X ] + b E [ Y ] for any RVs X , Y and constants a , b \mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y] \quad \text{for any RVs } X, Y \text{ and constants } a, b E [ a X + bY ] = a E [ X ] + b E [ Y ] for any RVs X , Y and constants a , b Important : Linearity holds regardless of independence .
Variance ¶ Var ( X ) = E [ ( X − E [ X ] ) 2 ] = E [ X 2 ] − ( E [ X ] ) 2 \operatorname{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 Var ( X ) = E [( X − E [ X ] ) 2 ] = E [ X 2 ] − ( E [ X ] ) 2 Properties :
Var ( a X ) = a 2 Var ( X ) Var ( X + Y ) = Var ( X ) + Var ( Y ) + 2 Cov ( X , Y ) Var ( X + c ) = Var ( X ) for constant c \begin{align*}
\operatorname{Var}(aX) &= a^2 \operatorname{Var}(X) \\
\operatorname{Var}(X + Y) &= \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X, Y) \\
\operatorname{Var}(X + c) &= \operatorname{Var}(X) \quad \text{for constant } c
\end{align*} Var ( a X ) Var ( X + Y ) Var ( X + c ) = a 2 Var ( X ) = Var ( X ) + Var ( Y ) + 2 Cov ( X , Y ) = Var ( X ) for constant c Moment Generating Function (MGF) ¶ ϕ X ( t ) = E [ e t X ] = { ∑ i e t x i P ( X = x i ) for discrete X ∫ − ∞ ∞ e t x f X ( x ) d x for continuous X \phi_X(t) = \mathbb{E}[e^{tX}] = \begin{cases}
\sum_i e^{t x_i} P(X = x_i) & \text{for discrete } X \\
\int_{-\infty}^{\infty} e^{t x} f_X(x) \, dx & \text{for continuous } X
\end{cases} ϕ X ( t ) = E [ e tX ] = { ∑ i e t x i P ( X = x i ) ∫ − ∞ ∞ e t x f X ( x ) d x for discrete X for continuous X Important Property : The n n n -th derivative at t = 0 t=0 t = 0 gives the n n n -th moment:
ϕ X ( n ) ( 0 ) = E [ X n ] \phi_X^{(n)}(0) = \mathbb{E}[X^n] ϕ X ( n ) ( 0 ) = E [ X n ] Taylor Expansion :
ϕ X ( t ) = ∑ n = 0 ∞ t n n ! E [ X n ] \phi_X(t) = \sum_{n=0}^{\infty} \frac{t^n}{n!} \mathbb{E}[X^n] ϕ X ( t ) = n = 0 ∑ ∞ n ! t n E [ X n ] Important Discrete Distributions ¶ Common Discrete Distributions
Distribution PMF Support E [ X ] \mathbb{E}[X] E [ X ] Var ( X ) \operatorname{Var}(X) Var ( X ) MGF Uniform P ( X = k ) = 1 n P(X=k) = \frac{1}{n} P ( X = k ) = n 1 k ∈ { 1 , 2 , … , n } k \in \{1,2,\ldots,n\} k ∈ { 1 , 2 , … , n } n + 1 2 \frac{n+1}{2} 2 n + 1 n 2 − 1 12 \frac{n^2-1}{12} 12 n 2 − 1 e t ( 1 − e n t ) n ( 1 − e t ) \frac{e^t(1-e^{nt})}{n(1-e^t)} n ( 1 − e t ) e t ( 1 − e n t ) Bernoulli P ( X = k ) = p k ( 1 − p ) 1 − k P(X=k) = p^k(1-p)^{1-k} P ( X = k ) = p k ( 1 − p ) 1 − k k ∈ { 0 , 1 } k \in \{0,1\} k ∈ { 0 , 1 } p p p p ( 1 − p ) p(1-p) p ( 1 − p ) p e t + ( 1 − p ) pe^t + (1-p) p e t + ( 1 − p ) Binomial P ( X = k ) = ( n k ) p k ( 1 − p ) n − k P(X=k) = \binom{n}{k}p^k(1-p)^{n-k} P ( X = k ) = ( k n ) p k ( 1 − p ) n − k k ∈ { 0 , 1 , … , n } k \in \{0,1,\ldots,n\} k ∈ { 0 , 1 , … , n } n p np n p n p ( 1 − p ) np(1-p) n p ( 1 − p ) ( p e t + 1 − p ) n (pe^t + 1-p)^n ( p e t + 1 − p ) n Geometric P ( X = k ) = ( 1 − p ) k − 1 p P(X=k) = (1-p)^{k-1}p P ( X = k ) = ( 1 − p ) k − 1 p k ∈ N k \in \mathbb{N} k ∈ N 1 p \frac{1}{p} p 1 1 − p p 2 \frac{1-p}{p^2} p 2 1 − p p e t 1 − ( 1 − p ) e t \frac{pe^t}{1-(1-p)e^t} 1 − ( 1 − p ) e t p e t Poisson P ( X = k ) = λ k e − λ k ! P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!} P ( X = k ) = k ! λ k e − λ k ∈ N 0 k \in \mathbb{N}_0 k ∈ N 0 λ \lambda λ λ \lambda λ e λ ( e t − 1 ) e^{\lambda(e^t-1)} e λ ( e t − 1 )
Bernoulli Distribution ¶ Models a single yes/no experiment with success probability p p p .
Example : Coin toss, single trial in an experiment.
Binomial Distribution ¶ Models the number of successes in n n n independent Bernoulli trials.
Geometric Distribution ¶ Models the number of trials until the first success.
Example : Number of coin tosses until the first head appears.
Coupon Collector’s Problem ¶ Problem : How many boxes must you buy to collect all n n n different coupons?
Solution : Let S n = X 1 + X 2 + … + X n S_n = X_1 + X_2 + \ldots + X_n S n = X 1 + X 2 + … + X n where X k ∼ Geom ( n − ( k − 1 ) n ) X_k \sim \text{Geom}\left(\frac{n-(k-1)}{n}\right) X k ∼ Geom ( n n − ( k − 1 ) ) .
E [ S n ] = n ∑ k = 1 n 1 k = n H n ≈ n ln ( n ) + γ n \mathbb{E}[S_n] = n \sum_{k=1}^n \frac{1}{k} = n H_n \approx n \ln(n) + \gamma n E [ S n ] = n k = 1 ∑ n k 1 = n H n ≈ n ln ( n ) + γn where H n = ∑ k = 1 n 1 k H_n = \sum_{k=1}^n \frac{1}{k} H n = ∑ k = 1 n k 1 is the n n n -th harmonic number and γ ≈ 0.5772 \gamma \approx 0.5772 γ ≈ 0.5772 is the Euler-Mascheroni constant.
Important Continuous Distributions ¶ Common Continuous Distributions
Distribution PDF Support E [ X ] \mathbb{E}[X] E [ X ] Var ( X ) \operatorname{Var}(X) Var ( X ) MGF Uniform f ( x ) = 1 b − a f(x) = \frac{1}{b-a} f ( x ) = b − a 1 x ∈ [ a , b ] x \in [a,b] x ∈ [ a , b ] a + b 2 \frac{a+b}{2} 2 a + b ( b − a ) 2 12 \frac{(b-a)^2}{12} 12 ( b − a ) 2 e t b − e t a t ( b − a ) \frac{e^{tb} - e^{ta}}{t(b-a)} t ( b − a ) e t b − e t a Exponential f ( x ) = λ e − λ x f(x) = \lambda e^{-\lambda x} f ( x ) = λ e − λ x x ≥ 0 x \geq 0 x ≥ 0 1 λ \frac{1}{\lambda} λ 1 1 λ 2 \frac{1}{\lambda^2} λ 2 1 λ λ − t \frac{\lambda}{\lambda - t} λ − t λ Normal f ( x ) = 1 2 π σ 2 e − ( x − μ ) 2 2 σ 2 f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} f ( x ) = 2 π σ 2 1 e − 2 σ 2 ( x − μ ) 2 x ∈ R x \in \mathbb{R} x ∈ R μ \mu μ σ 2 \sigma^2 σ 2 e t μ + 1 2 t 2 σ 2 e^{t\mu + \frac{1}{2}t^2\sigma^2} e t μ + 2 1 t 2 σ 2 Cauchy f ( x ) = 1 π 1 1 + x 2 f(x) = \frac{1}{\pi} \frac{1}{1+x^2} f ( x ) = π 1 1 + x 2 1 x ∈ R x \in \mathbb{R} x ∈ R Undefined Undefined Undefined
Exponential Distribution ¶ Memoryless Property : For X ∼ Exp ( λ ) X \sim \text{Exp}(\lambda) X ∼ Exp ( λ ) :
P ( X > s + t ∣ X > t ) = P ( X > s ) = e − λ s P(X > s + t | X > t) = P(X > s) = e^{-\lambda s} P ( X > s + t ∣ X > t ) = P ( X > s ) = e − λ s Interpretation : The probability of surviving an additional s s s units of time is independent of how long the component has already survived.
Normal Distribution ¶ Standard Normal : Z ∼ N ( 0 , 1 ) Z \sim N(0,1) Z ∼ N ( 0 , 1 ) with CDF Φ ( z ) = P ( Z ≤ z ) \Phi(z) = P(Z \leq z) Φ ( z ) = P ( Z ≤ z ) .
Standardization : If X ∼ N ( μ , σ 2 ) X \sim N(\mu, \sigma^2) X ∼ N ( μ , σ 2 ) , then:
Z = X − μ σ ∼ N ( 0 , 1 ) Z = \frac{X - \mu}{\sigma} \sim N(0,1) Z = σ X − μ ∼ N ( 0 , 1 ) Probability Calculation :
P ( X ≤ t ) = Φ ( t − μ σ ) P(X \leq t) = \Phi\left(\frac{t - \mu}{\sigma}\right) P ( X ≤ t ) = Φ ( σ t − μ ) Symmetric Properties :
Φ ( − z ) = 1 − Φ ( z ) and Φ ( 0 ) = 0.5 \Phi(-z) = 1 - \Phi(z) \quad \text{and} \quad \Phi(0) = 0.5 Φ ( − z ) = 1 − Φ ( z ) and Φ ( 0 ) = 0.5 Cauchy Distribution ¶ Used to model spikes or peaks in large datasets (rare events).
Note : Mean and variance are undefined due to heavy tails.
Joint Distributions & Independence ¶ Joint Distribution Function ¶ For two RVs X , Y X, Y X , Y :
F X , Y ( a , b ) = P ( X ≤ a , Y ≤ b ) F_{X,Y}(a,b) = P(X \leq a, Y \leq b) F X , Y ( a , b ) = P ( X ≤ a , Y ≤ b ) Marginal Distributions :
F X ( a ) = lim b → ∞ F X , Y ( a , b ) F_X(a) = \lim_{b\to\infty} F_{X,Y}(a,b) F X ( a ) = b → ∞ lim F X , Y ( a , b ) Joint PMF/PDF ¶ Discrete :
p X , Y ( x , y ) = P ( X = x , Y = y ) p_{X,Y}(x,y) = P(X = x, Y = y) p X , Y ( x , y ) = P ( X = x , Y = y ) Continuous :
f X , Y ( x , y ) such that ∫ ∫ A f X , Y ( x , y ) d x d y = P ( ( X , Y ) ∈ A ) f_{X,Y}(x,y) \quad \text{such that } \int\int_{A} f_{X,Y}(x,y) \, dx \, dy = P((X,Y) \in A) f X , Y ( x , y ) such that ∫ ∫ A f X , Y ( x , y ) d x d y = P (( X , Y ) ∈ A ) Marginal PDF :
f X ( x ) = ∫ − ∞ ∞ f X , Y ( x , y ) d y f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y) \, dy f X ( x ) = ∫ − ∞ ∞ f X , Y ( x , y ) d y Independence of Random Variables ¶ X X X and Y Y Y are independent if one of the following is true:
F X , Y ( a , b ) = F X ( a ) F Y ( b ) ∀ a , b f X , Y ( x , y ) = f X ( x ) f Y ( y ) , (continous RV) P ( X = x , Y = y ) = P ( X = x ) P ( Y = y ) , (discrete RV) \begin{aligned}
F_{X,Y}(a,b) &= F_X(a) F_Y(b) \quad \forall a,b \\
f_{X,Y}(x,y) &= f_X(x)f_Y(y) \quad \text{, (continous RV)} \\
P(X=x, Y=y) &= P(X=x)P(Y=y) \quad \text{, (discrete RV)}
\end{aligned} F X , Y ( a , b ) f X , Y ( x , y ) P ( X = x , Y = y ) = F X ( a ) F Y ( b ) ∀ a , b = f X ( x ) f Y ( y ) , (continous RV) = P ( X = x ) P ( Y = y ) , (discrete RV) Consequences of Independence :
E [ X Y ] = E [ X ] E [ Y ] Cov ( X , Y ) = 0 Var ( X + Y ) = Var ( X ) + Var ( Y ) ϕ X + Y ( t ) = ϕ X ( t ) ϕ Y ( t ) \begin{align*}
\mathbb{E}[XY] &= \mathbb{E}[X]\mathbb{E}[Y] \\
\operatorname{Cov}(X,Y) &= 0 \\
\operatorname{Var}(X+Y) &= \operatorname{Var}(X) + \operatorname{Var}(Y) \\
\phi_{X+Y}(t) &= \phi_X(t) \phi_Y(t)
\end{align*} E [ X Y ] Cov ( X , Y ) Var ( X + Y ) ϕ X + Y ( t ) = E [ X ] E [ Y ] = 0 = Var ( X ) + Var ( Y ) = ϕ X ( t ) ϕ Y ( t ) Uncorrelated (Cov ( X , Y ) = 0 \operatorname{Cov}(X,Y) = 0 Cov ( X , Y ) = 0 ) does NOT imply independence (except for jointly normal RVs).
Covariance & Correlation ¶ Covariance :
Cov ( X , Y ) = E [ ( X − E [ X ] ) ( Y − E [ Y ] ) ] = E [ X Y ] − E [ X ] E [ Y ] \operatorname{Cov}(X,Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] Cov ( X , Y ) = E [( X − E [ X ]) ( Y − E [ Y ])] = E [ X Y ] − E [ X ] E [ Y ] Correlation :
ρ X , Y = Cov ( X , Y ) Var ( X ) Var ( Y ) ∈ [ − 1 , 1 ] \rho_{X,Y} = \frac{\operatorname{Cov}(X,Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}} \quad \in [-1, 1] ρ X , Y = Var ( X ) Var ( Y ) Cov ( X , Y ) ∈ [ − 1 , 1 ] Properties :
Cov ( X , X ) = Var ( X ) Cov ( X , Y ) = Cov ( Y , X ) Cov ( a X + b , c Y + d ) = a c Cov ( X , Y ) Cov ( X + Y , Z ) = Cov ( X , Z ) + Cov ( Y , Z ) Cov ( X , Y + Z ) = Cov ( X , Y ) + Cov ( X , Z ) Var ( X + Y ) = Var ( X ) + Var ( Y ) + 2 Cov ( X , Y ) \begin{align*}
\operatorname{Cov}(X,X) &= \operatorname{Var}(X) \\
\operatorname{Cov}(X,Y) &= \operatorname{Cov}(Y,X) \\
\operatorname{Cov}(aX + b, cY + d) &= ac \operatorname{Cov}(X,Y) \\
\operatorname{Cov}(X + Y, Z) &= \operatorname{Cov}(X,Z) + \operatorname{Cov}(Y,Z) \\
\operatorname{Cov}(X, Y + Z) &= \operatorname{Cov}(X,Y) + \operatorname{Cov}(X,Z) \\
\operatorname{Var}(X + Y) &= \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X,Y)
\end{align*} Cov ( X , X ) Cov ( X , Y ) Cov ( a X + b , c Y + d ) Cov ( X + Y , Z ) Cov ( X , Y + Z ) Var ( X + Y ) = Var ( X ) = Cov ( Y , X ) = a c Cov ( X , Y ) = Cov ( X , Z ) + Cov ( Y , Z ) = Cov ( X , Y ) + Cov ( X , Z ) = Var ( X ) + Var ( Y ) + 2 Cov ( X , Y ) Conditional Distributions & Expectation ¶ Conditional PMF/PDF ¶ Discrete :
P X ∣ Y ( x ∣ y ) = P ( X = x ∣ Y = y ) = P ( X = x , Y = y ) P ( Y = y ) P_{X|Y}(x|y) = P(X = x | Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)} P X ∣ Y ( x ∣ y ) = P ( X = x ∣ Y = y ) = P ( Y = y ) P ( X = x , Y = y ) Continuous :
f X ∣ Y ( x ∣ y ) = f X , Y ( x , y ) f Y ( y ) f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)} f X ∣ Y ( x ∣ y ) = f Y ( y ) f X , Y ( x , y ) Conditional Expectation ¶ Discrete :
E [ X ∣ Y = y ] = ∑ x x P X ∣ Y ( x ∣ y ) \mathbb{E}[X|Y = y] = \sum_x x P_{X|Y}(x|y) E [ X ∣ Y = y ] = x ∑ x P X ∣ Y ( x ∣ y ) Continuous :
E [ X ∣ Y = y ] = ∫ − ∞ ∞ x f X ∣ Y ( x ∣ y ) d x \mathbb{E}[X|Y = y] = \int_{-\infty}^{\infty} x f_{X|Y}(x|y) \, dx E [ X ∣ Y = y ] = ∫ − ∞ ∞ x f X ∣ Y ( x ∣ y ) d x Important : E [ X ∣ Y ] \mathbb{E}[X|Y] E [ X ∣ Y ] is itself a random variable (function of Y Y Y ).
Law of Total Expectation ¶ E [ X ] = E [ E [ X ∣ Y ] ] \mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X|Y]] E [ X ] = E [ E [ X ∣ Y ]] For discrete Y Y Y :
E [ X ] = ∑ y E [ X ∣ Y = y ] P ( Y = y ) \mathbb{E}[X] = \sum_y \mathbb{E}[X|Y = y] P(Y = y) E [ X ] = y ∑ E [ X ∣ Y = y ] P ( Y = y ) For continuous Y Y Y :
E [ X ] = ∫ − ∞ ∞ E [ X ∣ Y = y ] f Y ( y ) d y \mathbb{E}[X] = \int_{-\infty}^{\infty} \mathbb{E}[X|Y = y] f_Y(y) \, dy E [ X ] = ∫ − ∞ ∞ E [ X ∣ Y = y ] f Y ( y ) d y Conditional Variance ¶ Var ( X ∣ Y = y ) = E [ X 2 ∣ Y = y ] − ( E [ X ∣ Y = y ] ) 2 \operatorname{Var}(X|Y = y) = \mathbb{E}[X^2|Y = y] - (\mathbb{E}[X|Y = y])^2 Var ( X ∣ Y = y ) = E [ X 2 ∣ Y = y ] − ( E [ X ∣ Y = y ] ) 2 Law of Total Variance :
Var ( X ) = E [ Var ( X ∣ Y ) ] + Var ( E [ X ∣ Y ] ) \operatorname{Var}(X) = \mathbb{E}[\operatorname{Var}(X|Y)] + \operatorname{Var}(\mathbb{E}[X|Y]) Var ( X ) = E [ Var ( X ∣ Y )] + Var ( E [ X ∣ Y ]) Mean Squared Error (MSE) Minimization ¶ Proof : Follows from the orthogonality principle in Hilbert spaces.
For a continuous RV X X X with density f X ( x ) f_X(x) f X ( x ) and a differentiable, strictly monotone function g g g :
Y = g ( X ) ⟹ f Y ( y ) = f X ( g − 1 ( y ) ) ⋅ ∣ d d y g − 1 ( y ) ∣ = f X ( g − 1 ( y ) ) ∣ g ′ ( g − 1 ( y ) ) ∣ Y = g(X) \implies f_Y(y) = f_X(g^{-1}(y)) \cdot \left| \frac{d}{dy} g^{-1}(y) \right| = \frac{f_X(g^{-1}(y))}{|g'(g^{-1}(y))|} Y = g ( X ) ⟹ f Y ( y ) = f X ( g − 1 ( y )) ⋅ ∣ ∣ d y d g − 1 ( y ) ∣ ∣ = ∣ g ′ ( g − 1 ( y )) ∣ f X ( g − 1 ( y )) Validity : Only for y y y in the range of g g g ; f Y ( y ) = 0 f_Y(y) = 0 f Y ( y ) = 0 outside this range.
If X ∼ Unif ( [ 0 , 1 ] ) X \sim \text{Unif}([0,1]) X ∼ Unif ([ 0 , 1 ]) and Y = e X Y = e^X Y = e X :
Y ∼ Exp ( 1 ) on [ 1 , e ] with f Y ( y ) = 1 y for y ∈ [ 1 , e ] Y \sim \text{Exp}(1) \text{ on } [1, e] \quad \text{with } f_Y(y) = \frac{1}{y} \text{ for } y \in [1, e] Y ∼ Exp ( 1 ) on [ 1 , e ] with f Y ( y ) = y 1 for y ∈ [ 1 , e ] Verification : ∫ 1 e 1 y d y = ln ( e ) − ln ( 1 ) = 1 \int_1^e \frac{1}{y} \, dy = \ln(e) - \ln(1) = 1 ∫ 1 e y 1 d y = ln ( e ) − ln ( 1 ) = 1 .
Multiple Variables ¶ For Y 1 = g 1 ( X 1 , X 2 ) Y_1 = g_1(X_1, X_2) Y 1 = g 1 ( X 1 , X 2 ) and Y 2 = g 2 ( X 1 , X 2 ) Y_2 = g_2(X_1, X_2) Y 2 = g 2 ( X 1 , X 2 ) with invertible transformation:
f Y 1 , Y 2 ( y 1 , y 2 ) = f X 1 , X 2 ( g 1 − 1 ( y 1 , y 2 ) , g 2 − 1 ( y 1 , y 2 ) ) ⋅ ∣ J ∣ f_{Y_1,Y_2}(y_1,y_2) = f_{X_1,X_2}(g_1^{-1}(y_1,y_2), g_2^{-1}(y_1,y_2)) \cdot |J| f Y 1 , Y 2 ( y 1 , y 2 ) = f X 1 , X 2 ( g 1 − 1 ( y 1 , y 2 ) , g 2 − 1 ( y 1 , y 2 )) ⋅ ∣ J ∣ where ∣ J ∣ |J| ∣ J ∣ is the absolute value of the Jacobian determinant.
Limit Theorems ¶ Law of Large Numbers (LLN) ¶ Weak LLN : For i.i.d. RVs X 1 , X 2 , … X_1, X_2, \ldots X 1 , X 2 , … with E [ X i ] = μ \mathbb{E}[X_i] = \mu E [ X i ] = μ and Var ( X i ) = σ 2 < ∞ \operatorname{Var}(X_i) = \sigma^2 < \infty Var ( X i ) = σ 2 < ∞ :
S n n = X 1 + X 2 + … + X n n → P μ as n → ∞ \frac{S_n}{n} = \frac{X_1 + X_2 + \ldots + X_n}{n} \xrightarrow{P} \mu \quad \text{as } n \to \infty n S n = n X 1 + X 2 + … + X n P μ as n → ∞ Strong LLN : Under the same conditions (with finite mean):
P ( lim n → ∞ S n n = μ ) = 1 (almost sure convergence) P\left(\lim_{n\to\infty} \frac{S_n}{n} = \mu\right) = 1 \quad \text{(almost sure convergence)} P ( n → ∞ lim n S n = μ ) = 1 (almost sure convergence) Interpretation : The sample mean converges to the true mean as n → ∞ n \to \infty n → ∞ .
Central Limit Theorem (CLT) ¶ For i.i.d. RVs X 1 , X 2 , … X_1, X_2, \ldots X 1 , X 2 , … with E [ X i ] = μ \mathbb{E}[X_i] = \mu E [ X i ] = μ and Var ( X i ) = σ 2 < ∞ \operatorname{Var}(X_i) = \sigma^2 < \infty Var ( X i ) = σ 2 < ∞ :
Z = S n − μ σ → d N ( 0 , 1 ) as n → ∞ Z = \frac{S_n - \mu}{\sigma} \xrightarrow{d} N(0,1) \quad \text{as } n \to \infty Z = σ S n − μ d N ( 0 , 1 ) as n → ∞ Chebyshev’s Inequality ¶ For any RV X X X with E [ X ] = μ \mathbb{E}[X] = \mu E [ X ] = μ and Var ( X ) = σ 2 \operatorname{Var}(X) = \sigma^2 Var ( X ) = σ 2 :
P ( ∣ X − μ ∣ ≥ ε ) ≤ σ 2 ε 2 ∀ ε > 0 P(|X - \mu| \geq \varepsilon) \leq \frac{\sigma^2}{\varepsilon^2} \quad \forall \varepsilon > 0 P ( ∣ X − μ ∣ ≥ ε ) ≤ ε 2 σ 2 ∀ ε > 0 Application : Provides a bound on the probability of deviation from the mean.
Example : For S n ∼ Bin ( n , p ) S_n \sim \text{Bin}(n,p) S n ∼ Bin ( n , p ) :
P ( ∣ S n − n p ∣ ≥ ε ) ≤ n p ( 1 − p ) ε 2 P(|S_n - np| \geq \varepsilon) \leq \frac{np(1-p)}{\varepsilon^2} P ( ∣ S n − n p ∣ ≥ ε ) ≤ ε 2 n p ( 1 − p ) Markov’s Inequality ¶ For any non-negative RV X X X :
P ( X ≥ a ) ≤ E [ X ] a ∀ a > 0 P(X \geq a) \leq \frac{\mathbb{E}[X]}{a} \quad \forall a > 0 P ( X ≥ a ) ≤ a E [ X ] ∀ a > 0 Stochastic Processes ¶ Definition ¶ A stochastic process is a collection of RVs X = { X t : t ∈ T } X = \{X_t: t \in T\} X = { X t : t ∈ T } where T T T is an index set (usually time).
Representation :
X : T × Ω → R via ( t , ω ) ↦ X t ( ω ) X: T \times \Omega \to \mathbb{R} \quad \text{via } (t, \omega) \mapsto X_t(\omega) X : T × Ω → R via ( t , ω ) ↦ X t ( ω ) Characterization ¶ A stochastic process is characterized by its finite-dimensional distributions :
F X t 1 , … , X t k ( x 1 , … , x k ) = P ( X t 1 ≤ x 1 , … , X t k ≤ x k ) F_{X_{t_1}, \ldots, X_{t_k}}(x_1, \ldots, x_k) = P(X_{t_1} \leq x_1, \ldots, X_{t_k} \leq x_k) F X t 1 , … , X t k ( x 1 , … , x k ) = P ( X t 1 ≤ x 1 , … , X t k ≤ x k ) for all k ≥ 1 k \geq 1 k ≥ 1 and t 1 < t 2 < … < t k t_1 < t_2 < \ldots < t_k t 1 < t 2 < … < t k .
Mean & Correlation Functions ¶ Mean Function :
m X ( t ) = E [ X t ] m_X(t) = \mathbb{E}[X_t] m X ( t ) = E [ X t ] Correlation Function :
R X ( t 1 , t 2 ) = E [ X t 1 X t 2 ] = Cov ( X t 1 , X t 2 ) Var ( X t 1 ) Var ( X t 2 ) { t 1 = t 2 t 1 ≠ t 2 R_X(t_1, t_2) = \mathbb{E}[X_{t_1} X_{t_2}] = \frac{\operatorname{Cov}(X_{t1},X_{t2})}{\sqrt{\operatorname{Var}(X_{t1})\operatorname{Var}(X_{t2})}} \begin{cases} t_1 = t_2 \\ t_1 \neq t_2 \end{cases} R X ( t 1 , t 2 ) = E [ X t 1 X t 2 ] = Var ( X t 1 ) Var ( X t 2 ) Cov ( X t 1 , X t 2 ) { t 1 = t 2 t 1 = t 2 Covariance Function :
C X ( t 1 , t 2 ) = R X ( t 1 , t 2 ) − m X ( t 1 ) m X ( t 2 ) C_X(t_1, t_2) = R_X(t_1, t_2) - m_X(t_1) m_X(t_2) C X ( t 1 , t 2 ) = R X ( t 1 , t 2 ) − m X ( t 1 ) m X ( t 2 ) Stationary Processes ¶ Definition : A process is stationary if its finite-dimensional distributions are invariant under time shifts:
( X t 1 , … , X t k ) = d ( X t 1 + s , … , X t k + s ) ∀ s (X_{t_1}, \ldots, X_{t_k}) \stackrel{d}{=} (X_{t_1+s}, \ldots, X_{t_k+s}) \quad \forall s ( X t 1 , … , X t k ) = d ( X t 1 + s , … , X t k + s ) ∀ s Wide-Sense Stationary (WSS) :
Constant mean: m X ( t ) = m m_X(t) = m m X ( t ) = m for all t t t
Correlation depends only on time difference: R X ( t 1 , t 2 ) = R X ( ∣ t 2 − t 1 ∣ ) R_X(t_1, t_2) = R_X(|t_2 - t_1|) R X ( t 1 , t 2 ) = R X ( ∣ t 2 − t 1 ∣ )
Examples of Stochastic Processes ¶ Process Description Mean Variance Correlation Bernoulli X n X_n X n i.i.d. ± 1 \pm 1 ± 1 valued2 p − 1 2p-1 2 p − 1 4 p ( 1 − p ) 4p(1-p) 4 p ( 1 − p ) ( 2 p − 1 ) 2 (2p-1)^2 ( 2 p − 1 ) 2 for t 1 ≠ t 2 t_1\neq t_2 t 1 = t 2 Random Walk X n = ∑ i = 1 n B i X_n = \sum_{i=1}^n B_i X n = ∑ i = 1 n B i 0 (if p = 1 / 2 p=1/2 p = 1/2 ) n n n min ( t 1 , t 2 ) \min(t_1, t_2) min ( t 1 , t 2 ) Brownian Motion Continuous limit of RW 0 t t t min ( t 1 , t 2 ) \min(t_1, t_2) min ( t 1 , t 2 ) Poisson Counts events in time λ t \lambda t λ t λ t \lambda t λ t λ min ( t 1 , t 2 ) \lambda \min(t_1, t_2) λ min ( t 1 , t 2 )
Bernoulli Process ¶ X n X_n X n i.i.d. with P ( X n = 1 ) = p P(X_n = 1) = p P ( X n = 1 ) = p and P ( X n = − 1 ) = 1 − p P(X_n = -1) = 1-p P ( X n = − 1 ) = 1 − p .
Random Walk (RW) ¶ X 0 = 0 X_0 = 0 X 0 = 0 , X n = X n − 1 + B n X_n = X_{n-1} + B_n X n = X n − 1 + B n where B n B_n B n i.i.d. ± 1 \pm 1 ± 1 valued.
Symmetric RW : p = 1 / 2 p = 1/2 p = 1/2
Recurrence : In d = 1 , 2 d=1,2 d = 1 , 2 , RW is recurrent (returns to origin infinitely often) with probability 1
Transience : In d ≥ 3 d \geq 3 d ≥ 3 , RW is transient (never returns to origin) with probability 1
CLT for RW :
X n n → d N ( 0 , 1 ) for symmetric RW \frac{X_n}{\sqrt{n}} \xrightarrow{d} N(0,1) \quad \text{for symmetric RW} n X n d N ( 0 , 1 ) for symmetric RW Brownian Motion (BM) ¶ Models particle movement in a liquid (Einstein, 1905)
Models stock market fluctuations (Bachelier, 1900)
Continuous paths, nowhere differentiable
B t ∼ N ( 0 , t ) B_t \sim N(0,t) B t ∼ N ( 0 , t ) for each t t t
Independent increments
Connection to RW : Scaled limit of symmetric RW as step size → 0 \to 0 → 0 and time → ∞ \to \infty → ∞ .
Poisson Process ¶ N t N_t N t counts the number of events in [ 0 , t ] [0,t] [ 0 , t ] for a process with:
N 0 = 0 N_0 = 0 N 0 = 0
Independent increments
N s + t − N s ∼ Pois ( λ t ) N_{s+t} - N_s \sim \text{Pois}(\lambda t) N s + t − N s ∼ Pois ( λ t )
Properties :
E [ N t ] = λ t \mathbb{E}[N_t] = \lambda t E [ N t ] = λ t
Var ( N t ) = λ t \operatorname{Var}(N_t) = \lambda t Var ( N t ) = λ t
C N ( t 1 , t 2 ) = λ min ( t 1 , t 2 ) C_N(t_1, t_2) = \lambda \min(t_1, t_2) C N ( t 1 , t 2 ) = λ min ( t 1 , t 2 )
Inter-arrival times T i ∼ Exp ( λ ) T_i \sim \text{Exp}(\lambda) T i ∼ Exp ( λ ) and independent
Thinning Property : If each event is kept with probability p p p , the thinned process is PP ( p λ ) \text{PP}(p\lambda) PP ( p λ ) .
Merging Property : Sum of independent PP ( λ 1 ) \text{PP}(\lambda_1) PP ( λ 1 ) and PP ( λ 2 ) \text{PP}(\lambda_2) PP ( λ 2 ) is PP ( λ 1 + λ 2 ) \text{PP}(\lambda_1 + \lambda_2) PP ( λ 1 + λ 2 ) .
Random Telegraph Process ¶ X t = ( − 1 ) N t X_t = (-1)^{N_t} X t = ( − 1 ) N t where N t N_t N t is a Poisson process.
Mean : E [ X t ] = e − 2 λ t \mathbb{E}[X_t] = e^{-2\lambda t} E [ X t ] = e − 2 λ t
Correlation : R X ( t 1 , t 2 ) = e − 2 λ ∣ t 2 − t 1 ∣ R_X(t_1, t_2) = e^{-2\lambda |t_2 - t_1|} R X ( t 1 , t 2 ) = e − 2 λ ∣ t 2 − t 1 ∣
Markov Chains ¶ Definition ¶ A Markov Chain (MC) is a discrete-time stochastic process with the Markov Property :
P ( X n + 1 = x n + 1 ∣ X n = x n , X n − 1 = x n − 1 , … , X 0 = x 0 ) = P ( X n + 1 = x n + 1 ∣ X n = x n ) P(X_{n+1} = x_{n+1} | X_n = x_n, X_{n-1} = x_{n-1}, \ldots, X_0 = x_0) = P(X_{n+1} = x_{n+1} | X_n = x_n) P ( X n + 1 = x n + 1 ∣ X n = x n , X n − 1 = x n − 1 , … , X 0 = x 0 ) = P ( X n + 1 = x n + 1 ∣ X n = x n ) Interpretation : Given the present, the future is independent of the past.
Transition Probabilities ¶ One-step :
p i j = P ( X n + 1 = j ∣ X n = i ) p_{ij} = P(X_{n+1} = j | X_n = i) p ij = P ( X n + 1 = j ∣ X n = i ) Transition Matrix :
P = ( p 11 p 12 ⋯ p 1 k p 21 p 22 ⋯ p 2 k ⋮ ⋮ ⋱ ⋮ p k 1 p k 2 ⋯ p k k ) P = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1k} \\ p_{21} & p_{22} & \cdots & p_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ p_{k1} & p_{k2} & \cdots & p_{kk} \end{pmatrix} P = ⎝ ⎛ p 11 p 21 ⋮ p k 1 p 12 p 22 ⋮ p k 2 ⋯ ⋯ ⋱ ⋯ p 1 k p 2 k ⋮ p kk ⎠ ⎞ Properties : p i j ≥ 0 p_{ij} \geq 0 p ij ≥ 0 and ∑ j p i j = 1 \sum_j p_{ij} = 1 ∑ j p ij = 1 for all i i i (stochastic matrix).
m-step :
p i j ( m ) = P ( X n + m = j ∣ X n = i ) p_{ij}^{(m)} = P(X_{n+m} = j | X_n = i) p ij ( m ) = P ( X n + m = j ∣ X n = i ) Chapman-Kolmogorov Equations :
p i j ( m + n ) = ∑ k ∈ S p i k ( m ) p k j ( n ) p_{ij}^{(m+n)} = \sum_{k \in S} p_{ik}^{(m)} p_{kj}^{(n)} p ij ( m + n ) = k ∈ S ∑ p ik ( m ) p kj ( n ) Matrix Form : P ( m + n ) = P m P n P^{(m+n)} = P^m P^n P ( m + n ) = P m P n and P ( m ) = P m P^{(m)} = P^m P ( m ) = P m .
Classification of States ¶ Accessible : i → j i \to j i → j if p i j ( n ) > 0 p_{ij}^{(n)} > 0 p ij ( n ) > 0 for some n ≥ 0 n \geq 0 n ≥ 0
Communicating : i ↔ j i \leftrightarrow j i ↔ j if i → j i \to j i → j and j → i j \to i j → i
Communication Classes : Equivalence classes under ↔ \leftrightarrow ↔
Irreducible : All states communicate (single communication class)
Period : d i = gcd { n ≥ 1 : p i i ( n ) > 0 } d_i = \gcd\{n \geq 1: p_{ii}^{(n)} > 0\} d i = g cd{ n ≥ 1 : p ii ( n ) > 0 }
Aperiodic : d i = 1 d_i = 1 d i = 1 for all i i i
Stationary Distributions ¶ A probability vector π ⃗ = ( π 1 , π 2 , … ) \vec{\pi} = (\pi_1, \pi_2, \ldots) π = ( π 1 , π 2 , … ) is a stationary distribution if:
π ⃗ = π ⃗ P and ∑ i π i = 1 , π i ≥ 0 \vec{\pi} = \vec{\pi} P \quad \text{and} \quad \sum_i \pi_i = 1, \pi_i \geq 0 π = π P and i ∑ π i = 1 , π i ≥ 0 Interpretation : If X 0 ∼ π ⃗ X_0 \sim \vec{\pi} X 0 ∼ π , then X n ∼ π ⃗ X_n \sim \vec{\pi} X n ∼ π for all n n n .
Existence :
Convergence ¶ Corollary : Every row of P n P^n P n converges to π ⃗ \vec{\pi} π as n → ∞ n \to \infty n → ∞ .
Problem : Rank web pages based on link structure.
Model : Random surfer model with:
Google Matrix :
G = α S + ( 1 − α ) 1 n e ⃗ e ⃗ T G = \alpha S + (1-\alpha) \frac{1}{n} \vec{e} \vec{e}^T G = α S + ( 1 − α ) n 1 e e T where S S S is the stochastic matrix from the web graph (with fixes for dangling nodes).
Properties :
Dangling Nodes : Pages without outlinks. Solution: Add uniform row.
Rank Sinks : Groups of pages that trap the random surfer. Solution: Use teleportation.
Continuous-Time Markov Chains ¶ Holding Times ¶ For a continuous-time MC, the time spent in state i i i before jumping:
Poisson Process (Special Case) ¶ A counting process N t N_t N t is a Poisson Process with rate λ > 0 \lambda > 0 λ > 0 if:
N 0 = 0 N_0 = 0 N 0 = 0
Independent increments
N s + t − N s ∼ Pois ( λ t ) N_{s+t} - N_s \sim \text{Pois}(\lambda t) N s + t − N s ∼ Pois ( λ t )
Key Properties :
N t ∼ Pois ( λ t ) N_t \sim \text{Pois}(\lambda t) N t ∼ Pois ( λ t )
E [ N t ] = λ t \mathbb{E}[N_t] = \lambda t E [ N t ] = λ t
Var ( N t ) = λ t \operatorname{Var}(N_t) = \lambda t Var ( N t ) = λ t
Inter-arrival times T i ∼ Exp ( λ ) T_i \sim \text{Exp}(\lambda) T i ∼ Exp ( λ ) and independent
Construction : If T i ∼ Exp ( λ ) T_i \sim \text{Exp}(\lambda) T i ∼ Exp ( λ ) i.i.d., then N t = max { n : ∑ i = 1 n T i ≤ t } N_t = \max\{n: \sum_{i=1}^n T_i \leq t\} N t = max { n : ∑ i = 1 n T i ≤ t } is a Poisson process.
Important Examples from Exercises ¶ Conditional Probability Examples ¶ Example 1 : Two dice, sum equals 6 vs. first die equals 4
E 1 E_1 E 1 : sum equals 6, P ( E 1 ) = 5 36 P(E_1) = \frac{5}{36} P ( E 1 ) = 36 5
E 3 E_3 E 3 : first die equals 4, P ( E 3 ) = 1 6 P(E_3) = \frac{1}{6} P ( E 3 ) = 6 1
P ( E 1 ∣ E 3 ) = 1 6 > P ( E 1 ) P(E_1 | E_3) = \frac{1}{6} > P(E_1) P ( E 1 ∣ E 3 ) = 6 1 > P ( E 1 ) (dependent!)
P ( E 2 ∣ E 3 ) = 1 6 = P ( E 2 ) P(E_2 | E_3) = \frac{1}{6} = P(E_2) P ( E 2 ∣ E 3 ) = 6 1 = P ( E 2 ) (independent!)
Example 2 : Lottery probabilities
Probability of winning with 6 out of 48: P = 1 ( 48 6 ) ≈ 1.2 × 1 0 − 7 P = \frac{1}{\binom{48}{6}} \approx 1.2 \times 10^{-7} P = ( 6 48 ) 1 ≈ 1.2 × 1 0 − 7
Probability that at least one of H H H participants wins: 1 − ( 1 − 1 ( 48 6 ) ) H 1 - \left(1 - \frac{1}{\binom{48}{6}}\right)^H 1 − ( 1 − ( 6 48 ) 1 ) H
Conditional Expectation Examples ¶ Exercise from Week 8 : Given joint density f X , Y ( x , y ) = 6 x y ( 2 − x − y ) f_{X,Y}(x,y) = 6xy(2-x-y) f X , Y ( x , y ) = 6 x y ( 2 − x − y ) for 0 < x , y < 1 0 < x,y < 1 0 < x , y < 1 :
E [ X ∣ Y = y ] = 5 − 4 y 2 ( 4 − 3 y ) for 0 < y < 1 \mathbb{E}[X|Y=y] = \frac{5 - 4y}{2(4 - 3y)} \quad \text{for } 0 < y < 1 E [ X ∣ Y = y ] = 2 ( 4 − 3 y ) 5 − 4 y for 0 < y < 1 Variance Calculation ¶ Exercise from Week 8 : For f X , Y ( x , y ) = 1 18 e − ( x + y ) f_{X,Y}(x,y) = \frac{1}{18} e^{-(x+y)} f X , Y ( x , y ) = 18 1 e − ( x + y ) for 0 < y < x 0 < y < x 0 < y < x :
Var ( X ∣ Y = 2 ) = 36 \operatorname{Var}(X|Y=2) = 36 Var ( X ∣ Y = 2 ) = 36 Geometric Distribution Proof ¶ Exercise from Week 8 : If X ∼ Pois ( Y ) X \sim \text{Pois}(Y) X ∼ Pois ( Y ) and Y ∼ Exp ( λ ) Y \sim \text{Exp}(\lambda) Y ∼ Exp ( λ ) , then W = X + 1 ∼ Geom ( λ λ + 1 ) W = X + 1 \sim \text{Geom}\left(\frac{\lambda}{\lambda+1}\right) W = X + 1 ∼ Geom ( λ + 1 λ ) .
Markov Chain Examples ¶ Example : Gambler’s Ruin
States: { 0 , 1 , 2 , 3 } \{0, 1, 2, 3\} { 0 , 1 , 2 , 3 } (0 and 3 are absorbing)
Communication classes: { 0 } , { 1 , 2 } , { 3 } \{0\}, \{1,2\}, \{3\} { 0 } , { 1 , 2 } , { 3 }
Not irreducible (multiple classes)
Example : Simple Two-State Chain
States: { 1 , 2 } \{1, 2\} { 1 , 2 }
Transition matrix: P = ( 0 1 1 0 ) P = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} P = ( 0 1 1 0 )
Stationary distribution: π ⃗ = ( 1 / 2 , 1 / 2 ) \vec{\pi} = (1/2, 1/2) π = ( 1/2 , 1/2 )
Periodic with period 2
Poisson Process Examples ¶ Example : Machine failures with rate λ = 2 \lambda = 2 λ = 2 per week.
P ( no failure in 2 weeks ) = e − 4 P(\text{no failure in 2 weeks}) = e^{-4} P ( no failure in 2 weeks ) = e − 4
P ( at most 1 failure in 2 weeks ) = e − 4 + 4 e − 4 = 5 e − 4 P(\text{at most 1 failure in 2 weeks}) = e^{-4} + 4e^{-4} = 5e^{-4} P ( at most 1 failure in 2 weeks ) = e − 4 + 4 e − 4 = 5 e − 4
P ( inter-arrival > 5 weeks ) = e − 10 P(\text{inter-arrival > 5 weeks}) = e^{-10} P ( inter-arrival > 5 weeks ) = e − 10