Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Stochastic Modeling

ZHAW School of Engineering

Formula Sheet

Most Important Concepts

  1. Probability Space & Axioms: Foundation of all probability theory

  2. Conditional Probability: Essential for Bayesian reasoning

  3. Expectation & Variance: Fundamental measures of central tendency and spread

  4. Limit Theorems: LLN and CLT are the most important results in probability

  5. Markov Property: Foundation for Markov Chains and many stochastic processes

  6. Stationary Distributions: Long-term behavior of Markov Chains

  7. Poisson Process: Fundamental continuous-time stochastic process

Problem-Solving Strategies

  1. Identify the Distribution: Recognize which distribution the problem involves

  2. Use Definitions: Start with the definition (PMF, PDF, or property)

  3. Check Conditions: Verify all conditions are met before applying theorems

  4. Use Linearity: Expectation is linear - use this whenever possible

  5. Conditioning: Break complex problems into simpler conditional problems

  6. Symmetry: Look for symmetry to simplify calculations

  7. Approximations: Use CLT for large nn approximations

Probability

P(EF)=P(EF)P(F)P(E)=P(EA)P(A)+P(EAc)P(Ac)P(AE)=P(EA)P(A)P(E)P(AB)=P(A)P(BA)=P(B)P(AB)\begin{align*} P(E|F) &= \frac{P(E \cap F)}{P(F)} & P(E) &= P(E|A)P(A) + P(E|A^c)P(A^c) \\ P(A|E) &= \frac{P(E|A)P(A)}{P(E)} & P(A \cap B) &= P(A)P(B|A) = P(B)P(A|B) \end{align*}

Expectation & Variance

E[X]=xP(X=x)E[X]=xfX(x)dxVar(X)=E[X2](E[X])2Var(aX+b)=a2Var(X)Cov(X,Y)=E[XY]E[X]E[Y]ρX,Y=Cov(X,Y)Var(X)Var(Y)\begin{align*} \mathbb{E}[X] &= \sum x P(X=x) & \mathbb{E}[X] &= \int x f_X(x) dx \\ \operatorname{Var}(X) &= \mathbb{E}[X^2] - (\mathbb{E}[X])^2 & \operatorname{Var}(aX+b) &= a^2 \operatorname{Var}(X) \\ \operatorname{Cov}(X,Y) &= \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] & \rho_{X,Y} &= \frac{\operatorname{Cov}(X,Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}} \end{align*}

Distributions

Binomial: P(X=k)=(nk)pk(1p)nkE[X]=np,Var(X)=np(1p)Poisson: P(X=k)=λkeλk!E[X]=λ,Var(X)=λGeometric: P(X=k)=(1p)k1pE[X]=1/p,Var(X)=(1p)/p2Normal: f(x)=12πσ2e(xμ)2/(2σ2)E[X]=μ,Var(X)=σ2Exp: f(x)=λeλxE[X]=1/λ,Var(X)=1/λ2\begin{align*} \text{Binomial: } P(X=k) &= \binom{n}{k} p^k (1-p)^{n-k} & \mathbb{E}[X] = np, \operatorname{Var}(X) = np(1-p) \\ \text{Poisson: } P(X=k) &= \frac{\lambda^k e^{-\lambda}}{k!} & \mathbb{E}[X] = \lambda, \operatorname{Var}(X) = \lambda \\ \text{Geometric: } P(X=k) &= (1-p)^{k-1} p & \mathbb{E}[X] = 1/p, \operatorname{Var}(X) = (1-p)/p^2 \\ \text{Normal: } f(x) &= \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x-\mu)^2/(2\sigma^2)} & \mathbb{E}[X] = \mu, \operatorname{Var}(X) = \sigma^2 \\ \text{Exp: } f(x) &= \lambda e^{-\lambda x} & \mathbb{E}[X] = 1/\lambda, \operatorname{Var}(X) = 1/\lambda^2 \end{align*}

Limit Theorems

LLN: SnnPμCLT: Z=SnμσdN(0,1)Chebyshev: P(Xμε)σ2ε2Markov: P(Xa)E[X]a\begin{align*} \text{LLN: } \frac{S_n}{n} &\xrightarrow{P} \mu & \text{CLT: } Z = \frac{S_n - \mu}{\sigma} &\xrightarrow{d} N(0,1) \\ \text{Chebyshev: } P(|X-\mu| \geq \varepsilon) &\leq \frac{\sigma^2}{\varepsilon^2} & \text{Markov: } P(X \geq a) &\leq \frac{\mathbb{E}[X]}{a} \end{align*}

Markov Chains

Markov: P(Xn+1Xn,,X0)=P(Xn+1Xn)Chapman-Kolmogorov: pij(m+n)=kpik(m)pkj(n)Stationary: π=πPConvergence: νPnπ\begin{align*} \text{Markov: } P(X_{n+1}|X_n, \ldots, X_0) &= P(X_{n+1}|X_n) & \text{Chapman-Kolmogorov: } p_{ij}^{(m+n)} &= \sum_k p_{ik}^{(m)} p_{kj}^{(n)} \\ \text{Stationary: } \vec{\pi} &= \vec{\pi} P & \text{Convergence: } \vec{\nu} P^n &\to \vec{\pi} \end{align*}

Stochastic Processes

Mean: mX(t)=E[Xt]Correlation: RX(t1,t2)=E[Xt1Xt2]{t1=t2t1t2Covariance: CX(t1,t2)=RX(t1,t2)mX(t1)mX(t2)Stationary: RX(t1,t2)=RX(t2t1)\begin{align*} \text{Mean: } m_X(t) &= \mathbb{E}[X_t] & \text{Correlation: } R_X(t_1,t_2) &= \mathbb{E}[X_{t_1}X_{t_2}] \begin{cases} t_1 = t_2 \\ t_1 \neq t_2 \end{cases}\\ \text{Covariance: } C_X(t_1,t_2) &= R_X(t_1,t_2) - m_X(t_1)m_X(t_2) & \text{Stationary: } R_X(t_1,t_2) &= R_X(|t_2-t_1|) \end{align*}

Differentiation Rules

Power Ruleddxxn=nxn1\frac{d}{dx} x^n = n x^{n-1}
Exponentialddxeax=aeax\frac{d}{dx} e^{ax} = a e^{ax}
Natural Logddxln(x)=1x\frac{d}{dx} \ln(x) = \frac{1}{x}
Product Ruleddx[f(x)g(x)]=f(x)g(x)+f(x)g(x)\frac{d}{dx} [f(x)g(x)] = f'(x)g(x) + f(x)g'(x)
Quotient Ruleddx[f(x)g(x)]=f(x)g(x)f(x)g(x)[g(x)]2\frac{d}{dx} \left[\frac{f(x)}{g(x)}\right] = \frac{f'(x)g(x) - f(x)g'(x)}{[g(x)]^2}
Chain Ruleddxf(g(x))=f(g(x))g(x)\frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x)

Integration Rules

Power Rulexndx=xn+1n+1+C\int x^n dx = \frac{x^{n+1}}{n+1} + C (for n1n \neq -1)
Exponentialeaxdx=1aeax+C\int e^{ax} dx = \frac{1}{a} e^{ax} + C
Natural Log1xdx=lnx+C\int \frac{1}{x} dx = \ln|x | + C
Integration by Partsudv=uvvdu\int u dv = uv - \int v du

Integration by Parts Example:

xeaxdx=xeaxa1aeaxdx=eaxa2(ax1)+C\int x e^{ax} dx = \frac{x e^{ax}}{a} - \frac{1}{a} \int e^{ax} dx = \frac{e^{ax}}{a^2} (a x - 1) + C

Special Integrals

ex2dx=π0xneaxdx=n!an+1(Gamma function: Γ(n+1)=n!)0eaxdx=1afor a>00xeaxdx=1a20x2eaxdx=2a3\begin{align*} \int_{-\infty}^{\infty} e^{-x^2} dx &= \sqrt{\pi} \\ \int_0^{\infty} x^n e^{-ax} dx &= \frac{n!}{a^{n+1}} \quad \text{(Gamma function: } \Gamma(n+1) = n!) \\ \int_0^{\infty} e^{-ax} dx &= \frac{1}{a} \quad \text{for } a > 0 \\ \int_0^{\infty} x e^{-ax} dx &= \frac{1}{a^2} \\ \int_0^{\infty} x^2 e^{-ax} dx &= \frac{2}{a^3} \end{align*}

Taylor Series

ex=n=0xnn!=1+x+x22!+x33!+ln(1+x)=n=1(1)n+1xnn=xx22+x33for x<1sin(x)=n=0(1)nx2n+1(2n+1)!=xx33!+x55!cos(x)=n=0(1)nx2n(2n)!=1x22!+x44!ex1+x+x22(2nd order Taylor)\begin{align*} e^x &= \sum_{n=0}^{\infty} \frac{x^n}{n!} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \ldots \\ \ln(1+x) &= \sum_{n=1}^{\infty} (-1)^{n+1} \frac{x^n}{n} = x - \frac{x^2}{2} + \frac{x^3}{3} - \ldots \quad \text{for } |x| < 1 \\ \sin(x) &= \sum_{n=0}^{\infty} (-1)^n \frac{x^{2n+1}}{(2n+1)!} = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \ldots \\ \cos(x) &= \sum_{n=0}^{\infty} (-1)^n \frac{x^{2n}}{(2n)!} = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \ldots \\ e^{x} &\approx 1 + x + \frac{x^2}{2} \quad \text{(2nd order Taylor)} \end{align*}

Common Series

k=0rk=11rfor r<1k=1krk1=1(1r)2for r<1k=1k2rk1=1+r(1r)3for r<1k=1nk=n(n+1)2k=1nk2=n(n+1)(2n+1)6k=11k2=π26(Basel problem)Hn=k=1n1kln(n)+γwhere γ0.5772\begin{align*} \sum_{k=0}^{\infty} r^k &= \frac{1}{1-r} \quad \text{for } |r| < 1 \\ \sum_{k=1}^{\infty} k r^{k-1} &= \frac{1}{(1-r)^2} \quad \text{for } |r| < 1 \\ \sum_{k=1}^{\infty} k^2 r^{k-1} &= \frac{1+r}{(1-r)^3} \quad \text{for } |r| < 1 \\ \sum_{k=1}^{n} k &= \frac{n(n+1)}{2} \\ \sum_{k=1}^{n} k^2 &= \frac{n(n+1)(2n+1)}{6} \\ \sum_{k=1}^{\infty} \frac{1}{k^2} &= \frac{\pi^2}{6} \quad \text{(Basel problem)} \\ H_n &= \sum_{k=1}^n \frac{1}{k} \approx \ln(n) + \gamma \quad \text{where } \gamma \approx 0.5772 \end{align*}

Probability Theory Fundamentals

Probability Space & Axioms

A probability space is a triple (Ω,A,P)(\Omega, \mathcal{A}, P) where:

1.0P(E)1EA2.P(Ω)=13.P(i=1Ei)=i=1P(Ei)for pairwise disjoint Ei\begin{align*} 1. &\quad 0 \leq P(E) \leq 1 \quad \forall E \in \mathcal{A} \\ 2. &\quad P(\Omega) = 1 \\ 3. &\quad P\left(\bigcup_{i=1}^{\infty} E_i\right) = \sum_{i=1}^{\infty} P(E_i) \quad \text{for pairwise disjoint } E_i \end{align*}

Key Properties

P()=0P(Ac)=1P(A)P(AB)=P(A)P(AB)P(AB)=P(A)+P(B)P(AB)(Inclusion-Exclusion)\begin{align*} P(\emptyset) &= 0 \\ P(A^c) &= 1 - P(A) \\ P(A \setminus B) &= P(A) - P(A \cap B) \\ P(A \cup B) &= P(A) + P(B) - P(A \cap B) \quad \text{(Inclusion-Exclusion)} \end{align*}

Conditional Probability

Interpretation: Restricts the sample space to FF and considers the relative frequency of EE within FF.

Independence

For nn events E1,E2,,EnE_1, E_2, \ldots, E_n:

P(i=1nEi)=i=1nP(Ei)P\left(\bigcap_{i=1}^n E_i\right) = \prod_{i=1}^n P(E_i)

Important: Independence is pairwise and for all subsets, not just pairs.

Law of Total Probability

For a partition A1,A2,,AnA_1, A_2, \ldots, A_n of Ω\Omega:

P(E)=i=1nP(EAi)P(Ai)P(E) = \sum_{i=1}^n P(E|A_i) \cdot P(A_i)

Simplified (for binary partition):

P(E)=P(EA)P(A)+P(EAc)P(Ac)P(E) = P(E|A) \cdot P(A) + P(E|A^c) \cdot P(A^c)

Bayes’ Theorem

P(AE)=P(EA)P(A)P(E)=P(EA)P(A)P(EA)P(A)+P(EAc)P(Ac)P(A|E) = \frac{P(E|A) \cdot P(A)}{P(E)} = \frac{P(E|A) \cdot P(A)}{P(E|A) \cdot P(A) + P(E|A^c) \cdot P(A^c)}

Example: Medical testing - given test result, compute probability of disease.


Random Variables

Definition

A random variable (RV) X:ΩRX: \Omega \to \mathbb{R} assigns a numerical value to each outcome.

Types of Random Variables

TypeDefinitionExample
DiscreteTakes finite or countable valuesDice roll, coin toss
ContinuousTakes uncountable valuesHeight, time

Probability Mass Function (PMF)

For discrete RV XX:

P(X=xi)=pX(xi)where ipX(xi)=1P(X = x_i) = p_X(x_i) \quad \text{where } \sum_i p_X(x_i) = 1

Probability Density Function (PDF)

For continuous RV XX:

P(aXb)=abfX(x)dxwhere fX(x)dx=1P(a \leq X \leq b) = \int_a^b f_X(x) \, dx \quad \text{where } \int_{-\infty}^{\infty} f_X(x) \, dx = 1

Distribution Function (CDF)

For any RV XX:

FX(t)=P(Xt)={xitP(X=xi)for discrete XtfX(x)dxfor continuous XF_X(t) = P(X \leq t) = \begin{cases} \sum_{x_i \leq t} P(X = x_i) & \text{for discrete } X \\ \int_{-\infty}^t f_X(x) \, dx & \text{for continuous } X \end{cases}

Properties: FXF_X is right-continuous, non-decreasing, with limtFX(t)=0\lim_{t\to-\infty} F_X(t) = 0 and limtFX(t)=1\lim_{t\to\infty} F_X(t) = 1.

Expectation (Mean)

E[X]={ixiP(X=xi)for discrete XxfX(x)dxfor continuous X\mathbb{E}[X] = \begin{cases} \sum_i x_i P(X = x_i) & \text{for discrete } X \\ \int_{-\infty}^{\infty} x f_X(x) \, dx & \text{for continuous } X \end{cases}

Linearity of Expectation:

E[aX+bY]=aE[X]+bE[Y]for any RVs X,Y and constants a,b\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y] \quad \text{for any RVs } X, Y \text{ and constants } a, b

Important: Linearity holds regardless of independence.

Variance

Var(X)=E[(XE[X])2]=E[X2](E[X])2\operatorname{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2

Properties:

Var(aX)=a2Var(X)Var(X+Y)=Var(X)+Var(Y)+2Cov(X,Y)Var(X+c)=Var(X)for constant c\begin{align*} \operatorname{Var}(aX) &= a^2 \operatorname{Var}(X) \\ \operatorname{Var}(X + Y) &= \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X, Y) \\ \operatorname{Var}(X + c) &= \operatorname{Var}(X) \quad \text{for constant } c \end{align*}

Moment Generating Function (MGF)

ϕX(t)=E[etX]={ietxiP(X=xi)for discrete XetxfX(x)dxfor continuous X\phi_X(t) = \mathbb{E}[e^{tX}] = \begin{cases} \sum_i e^{t x_i} P(X = x_i) & \text{for discrete } X \\ \int_{-\infty}^{\infty} e^{t x} f_X(x) \, dx & \text{for continuous } X \end{cases}

Important Property: The nn-th derivative at t=0t=0 gives the nn-th moment:

ϕX(n)(0)=E[Xn]\phi_X^{(n)}(0) = \mathbb{E}[X^n]

Taylor Expansion:

ϕX(t)=n=0tnn!E[Xn]\phi_X(t) = \sum_{n=0}^{\infty} \frac{t^n}{n!} \mathbb{E}[X^n]

Important Discrete Distributions

Common Discrete Distributions

DistributionPMFSupportE[X]\mathbb{E}[X]Var(X)\operatorname{Var}(X)MGF
UniformP(X=k)=1nP(X=k) = \frac{1}{n}k{1,2,,n}k \in \{1,2,\ldots,n\}n+12\frac{n+1}{2}n2112\frac{n^2-1}{12}et(1ent)n(1et)\frac{e^t(1-e^{nt})}{n(1-e^t)}
BernoulliP(X=k)=pk(1p)1kP(X=k) = p^k(1-p)^{1-k}k{0,1}k \in \{0,1\}ppp(1p)p(1-p)pet+(1p)pe^t + (1-p)
BinomialP(X=k)=(nk)pk(1p)nkP(X=k) = \binom{n}{k}p^k(1-p)^{n-k}k{0,1,,n}k \in \{0,1,\ldots,n\}npnpnp(1p)np(1-p)(pet+1p)n(pe^t + 1-p)^n
GeometricP(X=k)=(1p)k1pP(X=k) = (1-p)^{k-1}pkNk \in \mathbb{N}1p\frac{1}{p}1pp2\frac{1-p}{p^2}pet1(1p)et\frac{pe^t}{1-(1-p)e^t}
PoissonP(X=k)=λkeλk!P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}kN0k \in \mathbb{N}_0λ\lambdaλ\lambdaeλ(et1)e^{\lambda(e^t-1)}

Bernoulli Distribution

Models a single yes/no experiment with success probability pp.

Example: Coin toss, single trial in an experiment.

Binomial Distribution

Models the number of successes in nn independent Bernoulli trials.

Geometric Distribution

Models the number of trials until the first success.

Example: Number of coin tosses until the first head appears.

Coupon Collector’s Problem

Problem: How many boxes must you buy to collect all nn different coupons?

Solution: Let Sn=X1+X2++XnS_n = X_1 + X_2 + \ldots + X_n where XkGeom(n(k1)n)X_k \sim \text{Geom}\left(\frac{n-(k-1)}{n}\right).

E[Sn]=nk=1n1k=nHnnln(n)+γn\mathbb{E}[S_n] = n \sum_{k=1}^n \frac{1}{k} = n H_n \approx n \ln(n) + \gamma n

where Hn=k=1n1kH_n = \sum_{k=1}^n \frac{1}{k} is the nn-th harmonic number and γ0.5772\gamma \approx 0.5772 is the Euler-Mascheroni constant.


Important Continuous Distributions

Common Continuous Distributions

DistributionPDFSupportE[X]\mathbb{E}[X]Var(X)\operatorname{Var}(X)MGF
Uniformf(x)=1baf(x) = \frac{1}{b-a}x[a,b]x \in [a,b]a+b2\frac{a+b}{2}(ba)212\frac{(b-a)^2}{12}etbetat(ba)\frac{e^{tb} - e^{ta}}{t(b-a)}
Exponentialf(x)=λeλxf(x) = \lambda e^{-\lambda x}x0x \geq 01λ\frac{1}{\lambda}1λ2\frac{1}{\lambda^2}λλt\frac{\lambda}{\lambda - t}
Normalf(x)=12πσ2e(xμ)22σ2f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}xRx \in \mathbb{R}μ\muσ2\sigma^2etμ+12t2σ2e^{t\mu + \frac{1}{2}t^2\sigma^2}
Cauchyf(x)=1π11+x2f(x) = \frac{1}{\pi} \frac{1}{1+x^2}xRx \in \mathbb{R}UndefinedUndefinedUndefined

Exponential Distribution

Memoryless Property: For XExp(λ)X \sim \text{Exp}(\lambda):

P(X>s+tX>t)=P(X>s)=eλsP(X > s + t | X > t) = P(X > s) = e^{-\lambda s}

Interpretation: The probability of surviving an additional ss units of time is independent of how long the component has already survived.

Normal Distribution

Standard Normal: ZN(0,1)Z \sim N(0,1) with CDF Φ(z)=P(Zz)\Phi(z) = P(Z \leq z).

Standardization: If XN(μ,σ2)X \sim N(\mu, \sigma^2), then:

Z=XμσN(0,1)Z = \frac{X - \mu}{\sigma} \sim N(0,1)

Probability Calculation:

P(Xt)=Φ(tμσ)P(X \leq t) = \Phi\left(\frac{t - \mu}{\sigma}\right)

Symmetric Properties:

Φ(z)=1Φ(z)andΦ(0)=0.5\Phi(-z) = 1 - \Phi(z) \quad \text{and} \quad \Phi(0) = 0.5

Cauchy Distribution

Used to model spikes or peaks in large datasets (rare events).

Note: Mean and variance are undefined due to heavy tails.


Joint Distributions & Independence

Joint Distribution Function

For two RVs X,YX, Y:

FX,Y(a,b)=P(Xa,Yb)F_{X,Y}(a,b) = P(X \leq a, Y \leq b)

Marginal Distributions:

FX(a)=limbFX,Y(a,b)F_X(a) = \lim_{b\to\infty} F_{X,Y}(a,b)

Joint PMF/PDF

Discrete:

pX,Y(x,y)=P(X=x,Y=y)p_{X,Y}(x,y) = P(X = x, Y = y)

Continuous:

fX,Y(x,y)such that AfX,Y(x,y)dxdy=P((X,Y)A)f_{X,Y}(x,y) \quad \text{such that } \int\int_{A} f_{X,Y}(x,y) \, dx \, dy = P((X,Y) \in A)

Marginal PDF:

fX(x)=fX,Y(x,y)dyf_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y) \, dy

Independence of Random Variables

XX and YY are independent if one of the following is true:

FX,Y(a,b)=FX(a)FY(b)a,bfX,Y(x,y)=fX(x)fY(y), (continous RV)P(X=x,Y=y)=P(X=x)P(Y=y), (discrete RV)\begin{aligned} F_{X,Y}(a,b) &= F_X(a) F_Y(b) \quad \forall a,b \\ f_{X,Y}(x,y) &= f_X(x)f_Y(y) \quad \text{, (continous RV)} \\ P(X=x, Y=y) &= P(X=x)P(Y=y) \quad \text{, (discrete RV)} \end{aligned}

Consequences of Independence:

E[XY]=E[X]E[Y]Cov(X,Y)=0Var(X+Y)=Var(X)+Var(Y)ϕX+Y(t)=ϕX(t)ϕY(t)\begin{align*} \mathbb{E}[XY] &= \mathbb{E}[X]\mathbb{E}[Y] \\ \operatorname{Cov}(X,Y) &= 0 \\ \operatorname{Var}(X+Y) &= \operatorname{Var}(X) + \operatorname{Var}(Y) \\ \phi_{X+Y}(t) &= \phi_X(t) \phi_Y(t) \end{align*}

Covariance & Correlation

Covariance:

Cov(X,Y)=E[(XE[X])(YE[Y])]=E[XY]E[X]E[Y]\operatorname{Cov}(X,Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]

Correlation:

ρX,Y=Cov(X,Y)Var(X)Var(Y)[1,1]\rho_{X,Y} = \frac{\operatorname{Cov}(X,Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}} \quad \in [-1, 1]

Properties:

Cov(X,X)=Var(X)Cov(X,Y)=Cov(Y,X)Cov(aX+b,cY+d)=acCov(X,Y)Cov(X+Y,Z)=Cov(X,Z)+Cov(Y,Z)Cov(X,Y+Z)=Cov(X,Y)+Cov(X,Z)Var(X+Y)=Var(X)+Var(Y)+2Cov(X,Y)\begin{align*} \operatorname{Cov}(X,X) &= \operatorname{Var}(X) \\ \operatorname{Cov}(X,Y) &= \operatorname{Cov}(Y,X) \\ \operatorname{Cov}(aX + b, cY + d) &= ac \operatorname{Cov}(X,Y) \\ \operatorname{Cov}(X + Y, Z) &= \operatorname{Cov}(X,Z) + \operatorname{Cov}(Y,Z) \\ \operatorname{Cov}(X, Y + Z) &= \operatorname{Cov}(X,Y) + \operatorname{Cov}(X,Z) \\ \operatorname{Var}(X + Y) &= \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X,Y) \end{align*}

Conditional Distributions & Expectation

Conditional PMF/PDF

Discrete:

PXY(xy)=P(X=xY=y)=P(X=x,Y=y)P(Y=y)P_{X|Y}(x|y) = P(X = x | Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)}

Continuous:

fXY(xy)=fX,Y(x,y)fY(y)f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)}

Conditional Expectation

Discrete:

E[XY=y]=xxPXY(xy)\mathbb{E}[X|Y = y] = \sum_x x P_{X|Y}(x|y)

Continuous:

E[XY=y]=xfXY(xy)dx\mathbb{E}[X|Y = y] = \int_{-\infty}^{\infty} x f_{X|Y}(x|y) \, dx

Important: E[XY]\mathbb{E}[X|Y] is itself a random variable (function of YY).

Law of Total Expectation

E[X]=E[E[XY]]\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X|Y]]

For discrete YY:

E[X]=yE[XY=y]P(Y=y)\mathbb{E}[X] = \sum_y \mathbb{E}[X|Y = y] P(Y = y)

For continuous YY:

E[X]=E[XY=y]fY(y)dy\mathbb{E}[X] = \int_{-\infty}^{\infty} \mathbb{E}[X|Y = y] f_Y(y) \, dy

Conditional Variance

Var(XY=y)=E[X2Y=y](E[XY=y])2\operatorname{Var}(X|Y = y) = \mathbb{E}[X^2|Y = y] - (\mathbb{E}[X|Y = y])^2

Law of Total Variance:

Var(X)=E[Var(XY)]+Var(E[XY])\operatorname{Var}(X) = \mathbb{E}[\operatorname{Var}(X|Y)] + \operatorname{Var}(\mathbb{E}[X|Y])

Mean Squared Error (MSE) Minimization

Proof: Follows from the orthogonality principle in Hilbert spaces.


Transformation of Random Variables

General Transformation Formula

For a continuous RV XX with density fX(x)f_X(x) and a differentiable, strictly monotone function gg:

Y=g(X)    fY(y)=fX(g1(y))ddyg1(y)=fX(g1(y))g(g1(y))Y = g(X) \implies f_Y(y) = f_X(g^{-1}(y)) \cdot \left| \frac{d}{dy} g^{-1}(y) \right| = \frac{f_X(g^{-1}(y))}{|g'(g^{-1}(y))|}

Validity: Only for yy in the range of gg; fY(y)=0f_Y(y) = 0 outside this range.

Example: Exponential Transformation

If XUnif([0,1])X \sim \text{Unif}([0,1]) and Y=eXY = e^X:

YExp(1) on [1,e]with fY(y)=1y for y[1,e]Y \sim \text{Exp}(1) \text{ on } [1, e] \quad \text{with } f_Y(y) = \frac{1}{y} \text{ for } y \in [1, e]

Verification: 1e1ydy=ln(e)ln(1)=1\int_1^e \frac{1}{y} \, dy = \ln(e) - \ln(1) = 1.

Multiple Variables

For Y1=g1(X1,X2)Y_1 = g_1(X_1, X_2) and Y2=g2(X1,X2)Y_2 = g_2(X_1, X_2) with invertible transformation:

fY1,Y2(y1,y2)=fX1,X2(g11(y1,y2),g21(y1,y2))Jf_{Y_1,Y_2}(y_1,y_2) = f_{X_1,X_2}(g_1^{-1}(y_1,y_2), g_2^{-1}(y_1,y_2)) \cdot |J|

where J|J| is the absolute value of the Jacobian determinant.


Limit Theorems

Law of Large Numbers (LLN)

Weak LLN: For i.i.d. RVs X1,X2,X_1, X_2, \ldots with E[Xi]=μ\mathbb{E}[X_i] = \mu and Var(Xi)=σ2<\operatorname{Var}(X_i) = \sigma^2 < \infty:

Snn=X1+X2++XnnPμas n\frac{S_n}{n} = \frac{X_1 + X_2 + \ldots + X_n}{n} \xrightarrow{P} \mu \quad \text{as } n \to \infty

Strong LLN: Under the same conditions (with finite mean):

P(limnSnn=μ)=1(almost sure convergence)P\left(\lim_{n\to\infty} \frac{S_n}{n} = \mu\right) = 1 \quad \text{(almost sure convergence)}

Interpretation: The sample mean converges to the true mean as nn \to \infty.

Central Limit Theorem (CLT)

For i.i.d. RVs X1,X2,X_1, X_2, \ldots with E[Xi]=μ\mathbb{E}[X_i] = \mu and Var(Xi)=σ2<\operatorname{Var}(X_i) = \sigma^2 < \infty:

Z=SnμσdN(0,1)as nZ = \frac{S_n - \mu}{\sigma} \xrightarrow{d} N(0,1) \quad \text{as } n \to \infty

Chebyshev’s Inequality

For any RV XX with E[X]=μ\mathbb{E}[X] = \mu and Var(X)=σ2\operatorname{Var}(X) = \sigma^2:

P(Xμε)σ2ε2ε>0P(|X - \mu| \geq \varepsilon) \leq \frac{\sigma^2}{\varepsilon^2} \quad \forall \varepsilon > 0

Application: Provides a bound on the probability of deviation from the mean.

Example: For SnBin(n,p)S_n \sim \text{Bin}(n,p):

P(Snnpε)np(1p)ε2P(|S_n - np| \geq \varepsilon) \leq \frac{np(1-p)}{\varepsilon^2}

Markov’s Inequality

For any non-negative RV XX:

P(Xa)E[X]aa>0P(X \geq a) \leq \frac{\mathbb{E}[X]}{a} \quad \forall a > 0

Stochastic Processes

Definition

A stochastic process is a collection of RVs X={Xt:tT}X = \{X_t: t \in T\} where TT is an index set (usually time).

Representation:

X:T×ΩRvia (t,ω)Xt(ω)X: T \times \Omega \to \mathbb{R} \quad \text{via } (t, \omega) \mapsto X_t(\omega)

Characterization

A stochastic process is characterized by its finite-dimensional distributions:

FXt1,,Xtk(x1,,xk)=P(Xt1x1,,Xtkxk)F_{X_{t_1}, \ldots, X_{t_k}}(x_1, \ldots, x_k) = P(X_{t_1} \leq x_1, \ldots, X_{t_k} \leq x_k)

for all k1k \geq 1 and t1<t2<<tkt_1 < t_2 < \ldots < t_k.

Mean & Correlation Functions

Mean Function:

mX(t)=E[Xt]m_X(t) = \mathbb{E}[X_t]

Correlation Function:

RX(t1,t2)=E[Xt1Xt2]=Cov(Xt1,Xt2)Var(Xt1)Var(Xt2){t1=t2t1t2R_X(t_1, t_2) = \mathbb{E}[X_{t_1} X_{t_2}] = \frac{\operatorname{Cov}(X_{t1},X_{t2})}{\sqrt{\operatorname{Var}(X_{t1})\operatorname{Var}(X_{t2})}} \begin{cases} t_1 = t_2 \\ t_1 \neq t_2 \end{cases}

Covariance Function:

CX(t1,t2)=RX(t1,t2)mX(t1)mX(t2)C_X(t_1, t_2) = R_X(t_1, t_2) - m_X(t_1) m_X(t_2)

Stationary Processes

Definition: A process is stationary if its finite-dimensional distributions are invariant under time shifts:

(Xt1,,Xtk)=d(Xt1+s,,Xtk+s)s(X_{t_1}, \ldots, X_{t_k}) \stackrel{d}{=} (X_{t_1+s}, \ldots, X_{t_k+s}) \quad \forall s

Wide-Sense Stationary (WSS):

Examples of Stochastic Processes

ProcessDescriptionMeanVarianceCorrelation
BernoulliXnX_n i.i.d. ±1\pm 1 valued2p12p-14p(1p)4p(1-p)(2p1)2(2p-1)^2 for t1t2t_1\neq t_2
Random WalkXn=i=1nBiX_n = \sum_{i=1}^n B_i0 (if p=1/2p=1/2)nnmin(t1,t2)\min(t_1, t_2)
Brownian MotionContinuous limit of RW0ttmin(t1,t2)\min(t_1, t_2)
PoissonCounts events in timeλt\lambda tλt\lambda tλmin(t1,t2)\lambda \min(t_1, t_2)

Bernoulli Process

XnX_n i.i.d. with P(Xn=1)=pP(X_n = 1) = p and P(Xn=1)=1pP(X_n = -1) = 1-p.

Random Walk (RW)

X0=0X_0 = 0, Xn=Xn1+BnX_n = X_{n-1} + B_n where BnB_n i.i.d. ±1\pm 1 valued.

CLT for RW:

XnndN(0,1)for symmetric RW\frac{X_n}{\sqrt{n}} \xrightarrow{d} N(0,1) \quad \text{for symmetric RW}

Brownian Motion (BM)

Connection to RW: Scaled limit of symmetric RW as step size 0\to 0 and time \to \infty.

Poisson Process

NtN_t counts the number of events in [0,t][0,t] for a process with:

  1. N0=0N_0 = 0

  2. Independent increments

  3. Ns+tNsPois(λt)N_{s+t} - N_s \sim \text{Pois}(\lambda t)

Properties:

Thinning Property: If each event is kept with probability pp, the thinned process is PP(pλ)\text{PP}(p\lambda).

Merging Property: Sum of independent PP(λ1)\text{PP}(\lambda_1) and PP(λ2)\text{PP}(\lambda_2) is PP(λ1+λ2)\text{PP}(\lambda_1 + \lambda_2).

Random Telegraph Process

Xt=(1)NtX_t = (-1)^{N_t} where NtN_t is a Poisson process.

Mean: E[Xt]=e2λt\mathbb{E}[X_t] = e^{-2\lambda t}

Correlation: RX(t1,t2)=e2λt2t1R_X(t_1, t_2) = e^{-2\lambda |t_2 - t_1|}


Markov Chains

Definition

A Markov Chain (MC) is a discrete-time stochastic process with the Markov Property:

P(Xn+1=xn+1Xn=xn,Xn1=xn1,,X0=x0)=P(Xn+1=xn+1Xn=xn)P(X_{n+1} = x_{n+1} | X_n = x_n, X_{n-1} = x_{n-1}, \ldots, X_0 = x_0) = P(X_{n+1} = x_{n+1} | X_n = x_n)

Interpretation: Given the present, the future is independent of the past.

Transition Probabilities

One-step:

pij=P(Xn+1=jXn=i)p_{ij} = P(X_{n+1} = j | X_n = i)

Transition Matrix:

P=(p11p12p1kp21p22p2kpk1pk2pkk)P = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1k} \\ p_{21} & p_{22} & \cdots & p_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ p_{k1} & p_{k2} & \cdots & p_{kk} \end{pmatrix}

Properties: pij0p_{ij} \geq 0 and jpij=1\sum_j p_{ij} = 1 for all ii (stochastic matrix).

m-step:

pij(m)=P(Xn+m=jXn=i)p_{ij}^{(m)} = P(X_{n+m} = j | X_n = i)

Chapman-Kolmogorov Equations:

pij(m+n)=kSpik(m)pkj(n)p_{ij}^{(m+n)} = \sum_{k \in S} p_{ik}^{(m)} p_{kj}^{(n)}

Matrix Form: P(m+n)=PmPnP^{(m+n)} = P^m P^n and P(m)=PmP^{(m)} = P^m.

Classification of States

Stationary Distributions

A probability vector π=(π1,π2,)\vec{\pi} = (\pi_1, \pi_2, \ldots) is a stationary distribution if:

π=πPandiπi=1,πi0\vec{\pi} = \vec{\pi} P \quad \text{and} \quad \sum_i \pi_i = 1, \pi_i \geq 0

Interpretation: If X0πX_0 \sim \vec{\pi}, then XnπX_n \sim \vec{\pi} for all nn.

Existence:

Convergence

Corollary: Every row of PnP^n converges to π\vec{\pi} as nn \to \infty.

PageRank Algorithm

Problem: Rank web pages based on link structure.

Model: Random surfer model with:

Google Matrix:

G=αS+(1α)1neeTG = \alpha S + (1-\alpha) \frac{1}{n} \vec{e} \vec{e}^T

where SS is the stochastic matrix from the web graph (with fixes for dangling nodes).

Properties:

Dangling Nodes: Pages without outlinks. Solution: Add uniform row.

Rank Sinks: Groups of pages that trap the random surfer. Solution: Use teleportation.


Continuous-Time Markov Chains

Holding Times

For a continuous-time MC, the time spent in state ii before jumping:

Poisson Process (Special Case)

A counting process NtN_t is a Poisson Process with rate λ>0\lambda > 0 if:

  1. N0=0N_0 = 0

  2. Independent increments

  3. Ns+tNsPois(λt)N_{s+t} - N_s \sim \text{Pois}(\lambda t)

Key Properties:

Construction: If TiExp(λ)T_i \sim \text{Exp}(\lambda) i.i.d., then Nt=max{n:i=1nTit}N_t = \max\{n: \sum_{i=1}^n T_i \leq t\} is a Poisson process.

Important Examples from Exercises

Conditional Probability Examples

Example 1: Two dice, sum equals 6 vs. first die equals 4

Example 2: Lottery probabilities

Conditional Expectation Examples

Exercise from Week 8: Given joint density fX,Y(x,y)=6xy(2xy)f_{X,Y}(x,y) = 6xy(2-x-y) for 0<x,y<10 < x,y < 1:

E[XY=y]=54y2(43y)for 0<y<1\mathbb{E}[X|Y=y] = \frac{5 - 4y}{2(4 - 3y)} \quad \text{for } 0 < y < 1

Variance Calculation

Exercise from Week 8: For fX,Y(x,y)=118e(x+y)f_{X,Y}(x,y) = \frac{1}{18} e^{-(x+y)} for 0<y<x0 < y < x:

Var(XY=2)=36\operatorname{Var}(X|Y=2) = 36

Geometric Distribution Proof

Exercise from Week 8: If XPois(Y)X \sim \text{Pois}(Y) and YExp(λ)Y \sim \text{Exp}(\lambda), then W=X+1Geom(λλ+1)W = X + 1 \sim \text{Geom}\left(\frac{\lambda}{\lambda+1}\right).

Markov Chain Examples

Example: Gambler’s Ruin

Example: Simple Two-State Chain

Poisson Process Examples

Example: Machine failures with rate λ=2\lambda = 2 per week.