
Bayesian Machine Learning

Philosophical Foundations and Bayes’ Theorem

Interpretations of probability

Frequentist definition: Probability of an event $E$ is the long-run relative frequency of $E$ in infinitely many repetitions of the same experiment. Statements are about data, not parameters ($P(E)$ is defined via an idealized limit of repetitions).

Bayesian definition: Probability is a subjective degree of belief about events or parameters, given current knowledge ($P(E)$ quantifies how plausible you think $E$ is).

Key philosophical contrast

Frequentist: uncertainty is in the world (randomness of experiments).

Bayesian: uncertainty is in the head of the observer (state of information; different observers with different prior knowledge can have different probabilities).

Sampling vs. inference

Sampling (forward problem): Assume the parameter is known and ask: “If $p$ is the probability of success, what is $P(Y = k \mid p)$?” For example: the number of heads in 10 tosses with known $p$ follows $\text{Binomial}(n=10, p)$.

Inference (inverse problem): The data are observed but the parameters are unknown; ask: “Given that I observed $k$ heads in 10 tosses, what do I believe about $p$?”

Frequentist: estimate a single “best” value (e.g. the MLE $\hat p = k/n$)

Bayesian: derive a posterior distribution over $p$: $p(p \mid \text{data})$

Bayes’ theorem and roles of likelihood

Bayes’ theorem (discrete): $P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B)}$

Belief-updater view: $\underbrace{P(A \mid d)}_{\text{posterior}} \propto \underbrace{P(d \mid A)}_{\text{likelihood}} \times \underbrace{P(A)}_{\text{prior}}$

Prior: beliefs before data.

Likelihood: how plausible data are under different parameter values.

Posterior: updated beliefs after data.

Likelihood in both schools:

In frequentist statistics, the likelihood function is used to perform Maximum Likelihood Estimation (MLE). This identifies the single parameter value that maximizes the probability of having produced the observed data.

In Bayesian statistics, the likelihood function serves as a belief updater. It is multiplied by the prior distribution to calculate the posterior, allowing you to estimate a full distribution of all plausible values rather than just one “best” number.
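To make the contrast concrete, a minimal sketch (with hypothetical data of $k=7$ heads in $n=10$ tosses and a Beta(2, 2) prior) comparing the frequentist point estimate to the Bayesian posterior:

```python
import numpy as np
from scipy import stats

n, k = 10, 7          # hypothetical data: 7 successes in 10 trials
alpha, beta = 2, 2    # hypothetical Beta prior

# Frequentist: the single "best" value
p_mle = k / n

# Bayesian: the full conjugate posterior Beta(alpha + k, beta + n - k)
posterior = stats.beta(alpha + k, beta + n - k)
post_mean = posterior.mean()                      # (alpha + k) / (alpha + beta + n)
ci_low, ci_high = posterior.ppf([0.025, 0.975])   # 95% credible interval

print(p_mle)       # 0.7
print(post_mean)   # 9/14 ≈ 0.643, pulled toward the prior mean 0.5
```

The MLE reports one number; the posterior reports a whole distribution, from which the mean, mode, and credible intervals all follow.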

When Bayesian methods are preferable

Small sample sizes / rare events: frequentist asymptotics unreliable.

Strong prior information available (expert opinion, previous studies) and should be integrated.

Need full uncertainty quantification (posterior distributions, credible intervals) rather than point estimates or p-values.

Complex hierarchical or graphical models, where likelihood-based methods get complicated but Bayesian methods remain conceptually coherent.

The computational curse of Bayesian statistics

Core difficulty: the evidence (marginal likelihood) is an often intractable integral: $p(d) = \int p(d \mid \theta)\,p(\theta)\,d\theta$

In simple conjugate families this integral is analytic.

In realistic models (many parameters, non-Gaussian likelihoods), high‑dimensional integrals are impossible to compute in closed form.

This computational curse motivates Markov Chain Monte Carlo (MCMC) and Variational Inference.
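In the Beta-Binomial case the evidence is one of the rare analytic examples, which makes it a convenient sanity check. A minimal sketch (hypothetical values $n=10$, $k=7$, Beta(2, 2) prior) comparing numerical integration against the closed form:

```python
import numpy as np
from scipy import stats, integrate, special

n, k = 10, 7
alpha, beta = 2, 2

def integrand(pi):
    # p(d | pi) * p(pi): likelihood times prior
    return stats.binom.pmf(k, n, pi) * stats.beta.pdf(pi, alpha, beta)

# Evidence p(d) = ∫ p(d | pi) p(pi) dpi, computed numerically
evidence_numeric, _ = integrate.quad(integrand, 0, 1)

# Closed form from conjugacy: C(n, k) * B(alpha + k, beta + n - k) / B(alpha, beta)
evidence_analytic = (special.comb(n, k)
                     * special.beta(alpha + k, beta + n - k)
                     / special.beta(alpha, beta))

print(evidence_numeric, evidence_analytic)  # the two agree
```

In one dimension `quad` handles this easily; with many parameters the same integral becomes high-dimensional, which is exactly the curse that MCMC and variational inference address.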

Conjugate Families and Continuous Inference

Conjugate families are pedagogically useful to understand how priors and data trade off.

Beta-Binomial

Likelihood: $P(Y \mid \pi) \sim \text{Binomial}(n, \pi)$ with pmf $P(Y = k \mid \pi) = \binom{n}{k}\pi^k(1-\pi)^{n-k}$

$\mathbb{E}[Y] = n\pi \quad \operatorname{Var}[Y] = n\pi(1-\pi)$

$\operatorname{Mode}[Y] = \begin{cases} \lfloor (n+1)\pi \rfloor & \text{if } (n+1)\pi \notin \mathbb{Z} \\ (n+1)\pi \text{ and } (n+1)\pi - 1 & \text{if } (n+1)\pi \in \mathbb{Z} \end{cases}$

$\binom{n}{k}$ is the binomial coefficient, the number of ways to choose $k$ successes from $n$ trials: $\binom{n}{k} = \frac{n!}{k!(n-k)!}$

$\pi^k$ is the probability contribution of the $k$ successes

$(1-\pi)^{n-k}$ is the probability contribution of the $n-k$ failures

Prior: $P(\pi) \sim \text{Beta}(\alpha, \beta)$ with density $\text{Beta}(\pi \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\pi^{\alpha-1}(1-\pi)^{\beta-1}$

with: $\Gamma(n) = (n-1)!$ for all $n \in \mathbb{N}$

$\mathbb{E}[\pi] = \frac{\alpha}{\alpha+\beta} \quad \operatorname{Var}[\pi] = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$

$\operatorname{Mode}[\pi] = \dfrac{\alpha-1}{\alpha+\beta-2} \quad \text{if } \alpha, \beta > 1$

Posterior: $P(\pi \mid Y = k) = \text{Beta}(\alpha + k,\; \beta + n - k)$

Key properties: prior “pseudo-counts”: $\alpha - 1$ successes, $\beta - 1$ failures

With more data, likelihood dominates; with little data, prior dominates

Order of data does not matter; only the sufficient statistics $(n, k)$

Notebook Cell
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def plot_beta_binomial(alpha, beta, n, k, figsize=(13, 3)):
    # create figure
    plt.figure(figsize=figsize)

    # numeric evaluation range for pi
    pi_range = np.linspace(0, 1, 1000)

    # prior (scipy pdfs are vectorized over pi_range)
    prior = stats.beta.pdf(pi_range, a=alpha, b=beta)
    plt.plot(pi_range, prior, alpha=0.5, label="prior", c="orange")
    plt.fill_between(pi_range, prior, alpha=0.3, color="orange")

    # likelihood, numerically scaled to integrate to 1 over [0, 1]
    # (a NumPy array is required here; a plain list would break the /= below)
    likelihood = stats.binom.pmf(k=k, n=n, p=pi_range)
    likelihood /= np.sum(likelihood) * (pi_range[1] - pi_range[0])
    plt.plot(pi_range, likelihood, alpha=0.5, label="(scaled) likelihood", c="blue")
    plt.fill_between(pi_range, likelihood, alpha=0.3, color="blue")

    # posterior via the conjugate update Beta(alpha + k, beta + n - k)
    posterior = stats.beta.pdf(pi_range, a=alpha + k, b=beta + n - k)
    plt.plot(pi_range, posterior, alpha=0.5, label="posterior", color="darkgreen")
    plt.fill_between(pi_range, posterior, alpha=0.3, color="darkgreen")

    # enable legend and set descriptive title
    plt.legend(fontsize=14)
    plt.title("$\\alpha = {}, \\beta={}, n={}, k={}$".format(alpha, beta, n, k))

plot_beta_binomial(alpha=5, beta=11, n=20, k=9)
Notebook Cell
# Likelihood (Binomial counts):
import preliz as pz
pz.Binomial(n=1, p=0.75).plot_pdf()
Notebook Cell
# Prior:
import preliz as pz
pz.Beta(alpha=3, beta=5).plot_pdf()

Poisson–Gamma: For Counts

Likelihood: $P(Y \mid \lambda) \sim \text{Poisson}(\lambda)$ with pmf $P(Y = y \mid \lambda) = \frac{e^{-\lambda}\lambda^y}{y!}$

$\mathbb{E}[Y] = \lambda \quad \operatorname{Var}[Y] = \lambda$

$\operatorname{Mode}[Y] = \begin{cases} \lfloor \lambda \rfloor & \text{if } \lambda \notin \mathbb{Z} \\ \lambda \text{ and } \lambda - 1 & \text{if } \lambda \in \mathbb{Z} \end{cases}$

  • $\lambda^y$ is the contribution from $y$ events occurring

  • $e^{-\lambda}$ is the probability that no further events occur

  • $\frac{1}{y!}$ normalizes for the ordering of the events

Prior: $P(\lambda) \sim \text{Gamma}(s, r)$ with density $\frac{r^s}{\Gamma(s)}\lambda^{s-1}e^{-r\lambda}$

with $\Gamma(s) = (s-1)!$ for all $s \in \mathbb{N}$

$\mathbb{E}[\lambda] = \frac{s}{r} \quad \operatorname{Var}[\lambda] = \frac{s}{r^2} \quad \operatorname{Mode}[\lambda] = \frac{s-1}{r} \quad \text{for } s \geq 1$

Posterior: $P(\lambda \mid Y = \mathbf{y}) \sim \text{Gamma}\left(s + \sum_i y_i,\; r + n\right)$

Key properties: prior “pseudo-counts”: $s$ represents the prior event count, $r$ the number of prior observation periods.

With more data, likelihood dominates; with little data, prior dominates.

Order of data does not matter; only the sufficient statistic $\sum_i y_i$.

Notebook Cell
def plot_gamma_poisson(alpha, beta, y, figsize=(13, 3)):
    y = np.asarray(y)
    n = len(y)
    sum_y = y.sum()

    # Posterior hyperparameters
    alpha_post = alpha + sum_y
    beta_post  = beta + n

    # Choose a range for lambda using high quantiles of prior and posterior
    # (robust and adaptive to parameter scale)
    # scipy.stats.gamma uses shape 'a' and scale = 1/rate
    scale_prior = 1.0 / beta
    scale_post  = 1.0 / beta_post

    lam_max = max(
        stats.gamma.ppf(0.999, a=alpha,      scale=scale_prior),
        stats.gamma.ppf(0.999, a=alpha_post, scale=scale_post)
    )
    lam_range = np.linspace(0, lam_max, 1000)

    plt.figure(figsize=figsize)

    # Prior
    prior = stats.gamma.pdf(lam_range, a=alpha, scale=scale_prior)
    plt.plot(lam_range, prior, alpha=0.5, label="prior", c="orange")
    plt.fill_between(lam_range, prior, alpha=0.3, color="orange")

    # Likelihood as function of lambda: L(λ) = ∏_i Pois(y_i | λ)
    # Use log-likelihood for numerical stability
    log_like = np.zeros_like(lam_range)
    for yi in y:
        log_like += stats.poisson.logpmf(yi, mu=lam_range)
    likelihood = np.exp(log_like)

    # Numerically normalize likelihood over lam_range so it behaves like a pdf on this grid
    norm_const = np.trapz(likelihood, lam_range)
    if norm_const > 0:
        likelihood /= norm_const
    plt.plot(lam_range, likelihood, alpha=0.5, label="(scaled) likelihood", c="blue")
    plt.fill_between(lam_range, likelihood, alpha=0.3, color="blue")

    # Posterior
    posterior = stats.gamma.pdf(lam_range, a=alpha_post, scale=scale_post)
    plt.plot(lam_range, posterior, alpha=0.5, label="posterior", color="darkgreen")
    plt.fill_between(lam_range, posterior, alpha=0.3, color="darkgreen")

    # Legend and title
    plt.legend(fontsize=14)
    plt.xlabel(r"$\lambda$")
    plt.title(r"$\alpha = {}, \beta = {}, n = {}, \sum y_i = {}$"
              .format(alpha, beta, n, int(sum_y)))
    plt.tight_layout()
    plt.show()

plot_gamma_poisson(alpha=2.0, beta=0.75, y=[7, 8, 3, 7, 11, 7, 9, 19, 7, 15, 9, 5, 12, 3, 7, 6, 5, 3, 11, 5])

Negative Binomial For Overdispersed Counts

Unlike the Poisson distribution, the variance is not equal to the mean; it is larger (overdispersion). As the dispersion parameter $r \to \infty$, the Negative Binomial distribution converges to the Poisson distribution. The Negative Binomial likelihood does not have a simple closed-form conjugate posterior; posterior inference must be done via MCMC simulation.

Likelihood: $P(Y \mid r, p) \sim \text{NegBin}(r, p)$

$$P(K = k \mid r, p) = \binom{k+r-1}{k}(1-p)^k p^r$$

and, equivalently in the mean parameterization with $\lambda = \mathbb{E}[K]$:

$$P(K = k \mid r, \lambda) = \frac{(r+k-1)!}{(r-1)!\,(r+\lambda)^k} \cdot \frac{\lambda^k}{k!} \cdot \frac{1}{\left(1 + \frac{\lambda}{r}\right)^r}$$

$\mathbb{E}[K \mid r, p] = \frac{r(1-p)}{p} \quad \text{and} \quad \mathbb{E}[K \mid r, \lambda] = \lambda$

$\operatorname{Var}[K \mid r, p] = \frac{r(1-p)}{p^2} \quad \text{and} \quad \operatorname{Var}[K \mid r, \lambda] = \lambda\left(1 + \frac{\lambda}{r}\right)$

  • $p$ is the probability of success

  • $r$ is the number of successes (minus 1 in the binomial coefficient, because the last success is fixed as the stopping criterion)

  • $k$ is the number of failures

Prior: Negative Binomial is typically used in models where conjugacy does not hold. Therefore, we often use:

  • in simple models $\lambda \sim \text{Gamma}(s, r)$, or in regression $\log(\lambda) = \beta_0 + \beta_1 X$ with $\beta \sim \text{Normal}$

  • $r \sim \text{Gamma}$ or $r \sim \text{Exponential}$ (must be positive)
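As a quick numerical check of the overdispersion property, a minimal sketch using `scipy.stats.nbinom` with hypothetical values $r = 5$ and mean $\lambda = 4$:

```python
import numpy as np
from scipy import stats

r, lam = 5.0, 4.0      # hypothetical dispersion r and mean lambda
p = r / (r + lam)      # convert to scipy's (r, p) parameterization

dist = stats.nbinom(r, p)
mean, var = dist.mean(), dist.var()

print(mean)            # lambda = 4.0
print(var)             # lambda * (1 + lambda / r) = 4 * 1.8 = 7.2 > mean: overdispersed
```

As $r$ grows, `var` shrinks toward `mean`, recovering the Poisson limit.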

Normal–Normal (variance needs to be known!)

Likelihood: $P(Y \mid \mu, \sigma) \sim \text{Normal}(\mu, \sigma^2)$ with density $\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right)$

$\mathbb{E}[Y] = \mu \quad \operatorname{Var}[Y] = \sigma^2 \quad \operatorname{Mode}[Y] = \mu$

  • $\mu$ is the mean (location parameter)

  • $\sigma^2$ is the variance (assumed known)

  • $(y-\mu)^2$ measures the squared deviation from the mean

Prior: $P(\mu) \sim \text{Normal}(\mu_0, \tau_0^2)$

$\mathbb{E}[\mu] = \mu_0 \quad \operatorname{Var}[\mu] = \tau_0^2$

  • $\mu_0$ is the prior mean

  • $\tau_0^2$ is the prior variance (reflects uncertainty about $\mu$)

Posterior (single observation $y$): $P(\mu \mid Y = y) \sim \text{Normal}(\mu_n, \tau_n^2)$

$$\mu_n = \frac{\frac{\mu_0}{\tau_0^2} + \frac{y}{\sigma^2}}{\frac{1}{\tau_0^2} + \frac{1}{\sigma^2}} = \frac{\sigma^2\mu_0 + \tau_0^2 y}{\sigma^2 + \tau_0^2}, \quad \tau_n^2 = \frac{1}{\frac{1}{\tau_0^2} + \frac{1}{\sigma^2}} = \frac{\tau_0^2\sigma^2}{\tau_0^2 + \sigma^2}$$

Key properties: the posterior mean is a precision-weighted average of the prior mean $\mu_0$ and the data $y$.

Precision $= \frac{1}{\text{variance}}$: higher precision means more influence.

Posterior variance is always smaller than both prior and data variance (information accumulation).

With more data ($n$ observations), the data precision $\frac{n}{\sigma^2}$ dominates.

Notebook Cell
def plot_normal_normal(mu0, tau0, sigma, y, figsize=(13, 3)):
    y = np.asarray(y)
    n = len(y)
    ybar = y.mean()

    # Posterior variance and mean
    tau_n2 = 1.0 / (n / sigma**2 + 1.0 / tau0**2)
    mu_n = tau_n2 * (n * ybar / sigma**2 + mu0 / tau0**2)
    tau_n = np.sqrt(tau_n2)

    # Choose a plotting range for mu around data and prior
    mu_min = min(ybar - 4 * sigma, mu0 - 4 * tau0, mu_n - 4 * tau_n)
    mu_max = max(ybar + 4 * sigma, mu0 + 4 * tau0, mu_n + 4 * tau_n)
    mu_range = np.linspace(mu_min, mu_max, 1000)

    plt.figure(figsize=figsize)

    # Prior over mu
    prior = stats.norm.pdf(mu_range, loc=mu0, scale=tau0)
    plt.plot(mu_range, prior, alpha=0.5, label="prior", c="orange")
    plt.fill_between(mu_range, prior, alpha=0.3, color="orange")

    # Likelihood as function of mu (up to constant)
    # Equivalent to N(mu | ybar, sigma^2/n)
    like_unnorm = stats.norm.pdf(mu_range, loc=ybar, scale=sigma / np.sqrt(n))
    # Numerically normalize to integrate ~1 on this grid
    norm_const = np.trapz(like_unnorm, mu_range)
    likelihood = like_unnorm / norm_const if norm_const > 0 else like_unnorm

    plt.plot(mu_range, likelihood, alpha=0.5, label="(scaled) likelihood", c="blue")
    plt.fill_between(mu_range, likelihood, alpha=0.3, color="blue")

    # Posterior over mu
    posterior = stats.norm.pdf(mu_range, loc=mu_n, scale=tau_n)
    plt.plot(mu_range, posterior, alpha=0.5, label="posterior", color="darkgreen")
    plt.fill_between(mu_range, posterior, alpha=0.3, color="darkgreen")

    # Legend and title
    plt.legend(fontsize=14)
    plt.xlabel(r"$\mu$")
    plt.title(
        r"$\mu_0 = {:.2f}, \tau_0 = {:.2f}, \sigma = {:.2f}, n = {}, \bar y = {:.2f}$"
        .format(mu0, tau0, sigma, n, ybar)
    )
    plt.tight_layout()
    plt.show()

np.random.seed(0)
plot_normal_normal(mu0=0.0, tau0=1.0, sigma=2.0, y=np.random.normal(loc=5.0, scale=2.0, size=10))

Multivariate Problems

Covariance: Quantifying Relationship Between Variables

$$\operatorname{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)]$$
  • $\operatorname{Cov}(X, X) = \operatorname{Var}[X] \geq 0$: a variable’s covariance with itself is its variance

  • $\operatorname{Cov}(X, Y) \approx 0$: $X$ and $Y$ show no linear relationship; on average there is no “co-variance” (a nonlinear dependence may still exist)

For observed data (empirical), the sample covariance is:

$$\operatorname{cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
  • $\bar{x}$ and $\bar{y}$: the sample means

  • $n$ is the number of observations (the denominator is $n-1$ to account for the loss of one degree of freedom when estimating the means)

Pearson Correlation = Standardized Covariance

Pearson correlation normalizes covariance by the standard deviations of both variables, producing a unitless measure bounded between $-1$ and $1$. On this common scale, $r = 1$ indicates perfect positive correlation, $r = 0$ indicates no (linear) correlation, and $r = -1$ indicates perfect negative correlation.

$$\rho_{XY} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y} \in [-1, 1]$$
  • $\sigma_X$ and $\sigma_Y$ are the standard deviations of $X$ and $Y$ respectively

For observed data (empirical), the Pearson Correlation is:

$$r(\mathbf{x}, \mathbf{y}) = \frac{\operatorname{cov}(\mathbf{x}, \mathbf{y})}{\hat{\sigma}_x \hat{\sigma}_y}$$
  • $\hat{\sigma}_x$ and $\hat{\sigma}_y$ are the sample standard deviations
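A minimal NumPy sketch of both empirical quantities, on synthetic data with a built-in positive relationship:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2 * x + rng.normal(scale=0.5, size=500)   # strongly positively related to x

# Sample covariance (np.cov uses the n-1 denominator by default) and Pearson correlation
cov_xy = np.cov(x, y)[0, 1]
r_xy = np.corrcoef(x, y)[0, 1]

# Correlation is covariance standardized by both sample standard deviations
r_manual = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

print(r_xy, r_manual)   # identical values, close to +1
```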

Multinomial–Dirichlet: For Categorical Counts (Generalized Beta-Binomial)

Use whenever we want to model experiments with > 2 mutually exclusive outcomes!

Likelihood: $P(\mathbf{K} \mid \boldsymbol{\pi}) \sim \text{Multinomial}(n, \boldsymbol{\pi})$

$$P(K_1 = k_1, \ldots, K_r = k_r \mid \boldsymbol{\pi}) = \frac{n!}{\prod_{i=1}^{r} k_i!} \prod_{i=1}^{r} \pi_i^{k_i}$$
$\mathbb{E}[K_i] = n\pi_i \quad \operatorname{Var}[K_i] = n\pi_i(1-\pi_i)$
  • $\prod_{i=1}^{r} \pi_i^{k_i}$ is the contribution from the observed category counts

  • $\frac{n!}{\prod_{i=1}^{r} k_i!}$ normalizes for the number of orderings

  • Constraints: $\sum_{i=1}^{r} k_i = n$ and $\sum_{i=1}^{r} \pi_i = 1$

Prior: $P(\boldsymbol{\pi}) \sim \text{Dirichlet}(\boldsymbol{\alpha})$

$$p(\boldsymbol{\pi}) = \frac{\Gamma\left(\sum_{i=1}^{r} \alpha_i\right)}{\prod_{i=1}^{r} \Gamma(\alpha_i)} \prod_{i=1}^{r} \pi_i^{\alpha_i - 1}$$

with $\alpha_0 = \sum_{i=1}^{r} \alpha_i$

$$\mathbb{E}[\pi_i] = \frac{\alpha_i}{\alpha_0} \quad \operatorname{Var}[\pi_i] = \frac{\frac{\alpha_i}{\alpha_0}\left(1 - \frac{\alpha_i}{\alpha_0}\right)}{1 + \alpha_0} \quad \operatorname{Mode}[\pi_i] = \frac{\alpha_i - 1}{\alpha_0 - r} \quad \text{for } \alpha_i > 1$$
$$\operatorname{Cov}(\pi_i, \pi_j) = \frac{\delta_{ij}\frac{\alpha_i}{\alpha_0} - \frac{\alpha_i \alpha_j}{\alpha_0^2}}{1 + \alpha_0}$$

Posterior: $P(\boldsymbol{\pi} \mid \mathbf{K} = \mathbf{k}) \sim \text{Dirichlet}(\boldsymbol{\alpha} + \mathbf{k})$

$$\text{Dirichlet}(\alpha_1, \ldots, \alpha_r) \to \text{Dirichlet}(\alpha_1 + k_1, \ldots, \alpha_r + k_r)$$

Key properties: prior “pseudo-counts”: $\alpha_i$ represents prior observations in category $i$. Setting $\alpha_i = 1$ for all $i$ (flat prior) means no prior preference.

With more data, likelihood dominates; with little data, prior dominates.

Order of data does not matter; only the sufficient statistic $\mathbf{k} = (k_1, \ldots, k_r)$.

Conjugate family ensures posterior is also Dirichlet with simple parameter update.
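A minimal sketch of the conjugate Dirichlet update, with hypothetical counts over three categories and a flat prior:

```python
import numpy as np
from scipy import stats

alpha = np.array([1.0, 1.0, 1.0])   # flat Dirichlet prior over 3 categories
counts = np.array([8, 3, 9])        # hypothetical observed counts k, n = 20

alpha_post = alpha + counts         # conjugate update: alpha + k
post_mean = alpha_post / alpha_post.sum()

print(post_mean)   # [9, 4, 10] / 23

# Posterior samples of the probability vector pi; each draw lies on the simplex
samples = stats.dirichlet(alpha_post).rvs(1000, random_state=0)
print(samples.sum(axis=1)[:3])   # every sample sums to 1
```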

Multivariate Normal

$$P(\mathbf{X} = \mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right)$$
  • $\mathbf{x} \in \mathbb{R}^d$, $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with mean $\boldsymbol{\mu} \in \mathbb{R}^d$ and covariance matrix $\boldsymbol{\Sigma} \in \mathbb{R}^{d \times d}$

  • Closed under linear transformations, addition, marginalization, conditioning

  • Covariance controls scale and correlation; must be symmetric and positive definite

  • LKJ prior for correlation matrices: flexible prior on correlation structures; together with priors on the standard deviations it gives the LKJCov prior for $\Sigma$
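A minimal sketch of the multivariate normal with hypothetical $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, illustrating the correlation implied by the covariance matrix and closure under marginalization:

```python
import numpy as np
from scipy import stats

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])   # symmetric and positive definite

mvn = stats.multivariate_normal(mean=mu, cov=Sigma)

# Correlation implied by the covariance matrix
rho = Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1])
print(rho)   # 0.8 / sqrt(2) ≈ 0.566

# Closure under marginalization: the first coordinate alone is N(0, 2)
samples = mvn.rvs(size=20000, random_state=0)
print(samples[:, 0].mean(), samples[:, 0].var())   # near 0 and near 2
```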

Multinomial–Dirichlet

  • Likelihood: counts over $K$ categories, $Y \sim \text{Multinomial}(n, \boldsymbol{\pi})$.

  • Prior: $\text{Dirichlet}(\boldsymbol{\alpha})$, $\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_K)$.

  • Posterior: $\text{Dirichlet}(\boldsymbol{\alpha} + \mathbf{k})$, where $k_j$ counts observations in category $j$.

Used for compositional data (election results, topic proportions).

Markov Chain Monte Carlo and the Metropolis Algorithm

Goal: obtain samples from $p(\theta \mid d)$ when no closed form is available. MCMC constructs a Markov chain whose stationary distribution is the posterior.

Metropolis algorithm (random-walk variant)

Notebook Cell
import numpy as np
from scipy import stats
import seaborn as sns
from tqdm.auto import tqdm
import matplotlib.pyplot as plt

def propose_jump(pi, sigma):
    return stats.norm.rvs( loc=pi, scale=sigma )

def decide(pi, pi2, alpha, beta, n, k, debug=False):
    # evaluate prior and likelihood at the current and proposed values
    prior1 = stats.beta.pdf(pi, alpha, beta)
    prior2 = stats.beta.pdf(pi2, alpha, beta)
    likelihood1 = stats.binom.pmf(k=k, n=n, p=pi)
    likelihood2 = stats.binom.pmf(k=k, n=n, p=pi2)

    # acceptance probability (named accept_prob to avoid shadowing the Beta parameter alpha)
    accept_prob = min(1, prior2 / prior1 * likelihood2 / likelihood1)
    if debug: print("pi = {:.5f}, pi2 = {:.5f}, accept_prob = {:.5f}".format(pi, pi2, accept_prob))

    # jump decision
    p = np.random.rand()
    if p <= accept_prob and 0 < pi2 < 1:
        return pi2
    else:
        return pi

def metropolis_chain(n_steps, alpha, beta, n, k, sigma):
    # initialize
    pi = np.random.rand()
    chain = []

    # create chain
    for i in tqdm( range( n_steps ) ):
        pi2 = propose_jump( pi, sigma )
        pi = decide( pi, pi2, alpha, beta, n, k )
        chain.append( pi )

    return chain

def metropolis( n_chains, n_steps, alpha, beta, n, k, sigma ):
    res = [metropolis_chain( n_steps, alpha, beta, n, k, sigma ) for i in range( n_chains)]
    return np.array(res).T # to have chains in columns

def trace_plots(sim):
    for i in range( sim.shape[1] ):
        plt.figure()
        plt.plot( sim[:,i] )
        plt.title("Chain {}".format(i+1))

def density_plots( sim ):
    plt.figure()
    for i in range( sim.shape[1] ):
        sns.kdeplot( sim[:,i], label="chain {}".format(i) )
    plt.legend()

def rhat( sim ):
    total_var = np.var( sim.flatten() )
    individual_var = np.var( sim, axis=0 )
    rhat = np.sqrt( total_var / np.mean(individual_var) )
    return rhat

sim = metropolis(n_chains=1, n_steps=1000, alpha=5, beta=2, n=10, k=8, sigma=0.2)

trace_plots(sim)
density_plots(sim)
print(f"Rhat: {rhat(sim)}")
Rhat: 1.0

For a target density $\pi(\theta) = p(\theta \mid d)$ and a symmetric proposal distribution $q(\theta' \mid \theta)$, the Metropolis steps are:

  1. Initialize $\theta^{(0)}$ (e.g. some reasonable value).

  2. For iterations $t = 0, 1, \dots$:

  3. Propose a new state $\theta' \sim q(\theta' \mid \theta^{(t)})$, e.g. a random walk: $q = \mathcal{N}(\theta^{(t)}, \sigma_{\text{prop}}^2)$.

  4. Compute the acceptance probability $\alpha = \min\left\{1,\;\frac{\pi(\theta')}{\pi(\theta^{(t)})}\right\} = \min\left\{1,\;\frac{p(d \mid \theta')\,p(\theta')}{p(d \mid \theta^{(t)})\,p(\theta^{(t)})}\right\}$

  5. Accept or reject: draw $u \sim \text{Uniform}(0, 1)$; if $u < \alpha$, set $\theta^{(t+1)} = \theta'$ (move), else set $\theta^{(t+1)} = \theta^{(t)}$ (stay).

  6. After burn-in and thinning, use the remaining samples as approximate draws from the posterior.

Proposal distribution: the distribution $q(\theta' \mid \theta)$ from which proposed new parameters are drawn. The choice of its scale strongly affects mixing and the acceptance rate.
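To make the effect of the proposal scale tangible, a minimal sketch of random-walk Metropolis on a standard normal target (a simpler target than the Beta-Binomial model above; the `acceptance_rate` helper is hypothetical):

```python
import numpy as np
from scipy import stats

def acceptance_rate(sigma_prop, n_steps=5000, seed=0):
    """Run random-walk Metropolis on a standard normal target; return the acceptance rate."""
    rng = np.random.default_rng(seed)
    theta, accepted = 0.0, 0
    for _ in range(n_steps):
        proposal = theta + rng.normal(scale=sigma_prop)
        # symmetric proposal: acceptance ratio is just the target density ratio
        log_ratio = stats.norm.logpdf(proposal) - stats.norm.logpdf(theta)
        if np.log(rng.random()) < log_ratio:
            theta, accepted = proposal, accepted + 1
    return accepted / n_steps

print(acceptance_rate(0.05))   # tiny steps: nearly everything accepted, poor exploration
print(acceptance_rate(50.0))   # huge steps: nearly everything rejected, chain rarely moves
```

Neither extreme mixes well; tuning $\sigma_{\text{prop}}$ trades step size against acceptance rate.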

MCMC with PyMC

import pymc as pm
# PyMC for the Beta-Binomial conjugate family
with pm.Model() as beta_binom_model:
    pi = pm.Beta("pi", alpha=5, beta=2)
    y = pm.Binomial("y", p=pi, n=10, observed=8)
    trace = pm.sample(1000)
Notebook Cell
import arviz as az
import pandas as pd

def uncertainty_binominal(trace_posterior, var_name: str, n: int, **kwargs):
    assert var_name is not None, "var_name is required for pi"
    assert n is not None and n > 0, "n must be > 0"

    values = trace_posterior[var_name].values
    aleatoric_var = (n * values * (1.0 - values)).mean()
    epistemic_var = (values * n).var(ddof=1)

    return aleatoric_var, epistemic_var

def uncertainty_normal(trace_posterior, mu: str = "mu", sigma: str = "sigma", **kwargs):
    aleatoric_var = (trace_posterior[sigma].values**2).mean()
    epistemic_var = trace_posterior[mu].var(ddof=1).values

    return aleatoric_var, epistemic_var

def uncertainty_poisson(trace_posterior, var_name: str, n: int, **kwargs):
    assert var_name is not None, "var_name is required for lambda"

    values = trace_posterior[var_name].values
    aleatoric_var = values.mean()
    epistemic_var = values.var(ddof=1)

    return aleatoric_var, epistemic_var

def uncertainty_summary(trace_posterior, family=None, **kwargs):
    """
    Computes the aleatoric (data), epistemic (model uncertainty) and predictive (overall uncertainty) variance of a pymc trace.posterior

    Parameters
    ----------
    trace_posterior : xarray.Dataset
        The `posterior` group of a PyMC trace (e.g. `trace.posterior`).
    family : str
        Either one of 'binominal', 'normal' or 'poisson'.

    For family='binominal':
    -----------------------
    var_name : str
        Name of the Pi variable in the trace_posterior
    n : int
        Number of trials

    For family='normal':
    -----------------------
    mu : str
        Name of the parameter that represents the mean of the normal distribution.

    sigma : str
        Name of the parameter that represents the standard deviation of the normal distribution.

    For family='poisson':
    -----------------------
    var_name : str
        Name of the lambda variable in the trace_posterior
    """
    assert trace_posterior is not None, "trace_posterior must be provided"
    assert family is not None, "family must be provided"

    uncertainty_stats = pd.DataFrame()
    match family:
        case 'binominal':
            aleatoric_var, epistemic_var = uncertainty_binominal(trace_posterior, **kwargs)
        case 'normal':
            aleatoric_var, epistemic_var = uncertainty_normal(trace_posterior, **kwargs)
        case 'poisson':
            aleatoric_var, epistemic_var = uncertainty_poisson(trace_posterior, **kwargs)
        case _:
            raise ValueError(f"Unknown family: {family}")

    predictive_var = aleatoric_var + epistemic_var

    uncertainty_stats[family] = {
        'var_aleatoric': aleatoric_var,
        'var_epistemic': epistemic_var,
        'var_predictive': predictive_var,
        'var_aleatoric_percent': aleatoric_var / predictive_var,
        'var_epistemic_percent': epistemic_var / predictive_var,
    }

    # switch rows and columns:
    uncertainty_stats = uncertainty_stats.T

    return uncertainty_stats

def trace_summary(trace, hdi_prob=0.8, var_names=None):
    # Use `var_names=['pi', ...]` to limit to vars of interest
    stats = az.summary(trace, hdi_prob=hdi_prob, var_names=var_names)

    # Variance = standard deviation ** 2:
    stats.insert(stats.columns.get_loc('sd') + 1, 'var', stats.sd ** 2)
    stats.insert(stats.columns.get_loc('mcse_sd') + 1, 'mcse_var', stats.mcse_sd ** 2)

    hdi_alpha = 1.0 - hdi_prob

    hdi_low = f'hdi_{100 * hdi_alpha / 2:g}%'
    hdi_high = f'hdi_{100 * (1 - hdi_alpha / 2):g}%'

    stats.insert(stats.columns.get_loc(hdi_high) + 1, 'hdi_low', stats[hdi_low])
    stats.insert(stats.columns.get_loc(hdi_high) + 2, 'hdi_high', stats[hdi_high])

    return stats
Notebook Cell
import pymc as pm
import arviz as az

# Data: k successes out of n trials
n = 40
k = 26

with pm.Model() as model:
    # Prior on probability pi
    pi = pm.Beta("pi", alpha=1.0, beta=1.0)  # flat prior

    # Likelihood
    y = pm.Binomial("y", n=n, p=pi, observed=k)

    # Sample with NUTS (HMC variant; more efficient than plain Metropolis)
    trace = pm.sample(
        draws=2000,
        tune=1000,
        chains=4,
        target_accept=0.9,
        random_seed=42
    )

Diagnosing convergence and mixing

Key visual and numeric diagnostics:

Trace plots: Plot each chain’s sampled values over iterations

Trace Plot Diagnostic

For Metropolis specifically, problematic trace plots often indicate: proposal variance too small (tiny steps, high acceptance, poor exploration) or proposal variance too large (proposals rejected, chain rarely moves). Both are tunable by adjusting $\sigma_{\text{prop}}$.

Notebook Cell
import arviz as az
az.plot_trace(trace, kind='trace', figsize=(12, 3))
plt.show()

Chains explore the same region (“fuzzy caterpillars”)

No systematic upward or downward drift.

No chains stuck in a narrow band.

Different chains overlap well.

Run multiple chains: Posterior histograms or kernel density estimates for each chain separately

Run Multiple Chains Diagnostic Plot

Distributions from all chains almost perfectly overlap.

No unique behavior in one chain.

Autocorrelation plots: Autocorrelation function (ACF) of each chain as a function of lag

Source
import arviz as az
import matplotlib.pyplot as plt
az.plot_autocorr(trace, max_lag=30, figsize=(12, 3))
plt.show()

Autocorrelation decays quickly with lag.

Very slow decay or persistent high correlation means poor mixing and low effective sample size.

Rank plots / rank bars: For each parameter, pool all samples from all chains, compute their ranks, then see how ranks are distributed per chain (as in ArviZ’s rank plots)

Source
import arviz as az
import matplotlib.pyplot as plt
az.plot_trace(trace, kind='rank_bars', figsize=(12, 3))
plt.show()

Ranks within each chain are roughly uniformly distributed.

All chains show similar, flat-ish rank patterns.

Strong deviations (U‑shape, spikes, or clearly different patterns per chain) indicate non‑convergence or poor mixing.

$\hat R$ (R-hat, Gelman–Rubin statistic): compares the total variance of the pooled (concatenated) chains to the mean variance within individual chains. It summarizes the overlaid density plots of multiple chains and provides a numerical measure of their disagreement.

$$\hat{R} = \sqrt{\frac{\operatorname{Var}_{\text{total}}}{\operatorname{Var}_{\text{within}}}} = \sqrt{\frac{\operatorname{Var}(\{c_i\})}{\frac{1}{n_c}\sum_{i=1}^{n_c}\operatorname{Var}(c_i)}}$$
Notebook Cell
trace_summary(trace)

$\hat R \approx 1.00$ (e.g. < 1.01–1.05) means the chains have likely converged to the same target distribution.

$\hat R > 1.05$ signals potential non-convergence; increase the number of iterations, use a better initialization, or reparameterize the model.

ESS (effective sample size): Estimates how many independent draws your autocorrelated chain is equivalent to

Notebook Cell
trace_summary(trace)

ESS should be large relative to the number of draws to give stable posterior summaries. Rule of thumb: ESS / N > 10%.

Very small ESS (especially compared to total draws) means strong autocorrelation, poor mixing, unreliable summaries.

ess_bulk (Bulk ESS): Use this to assess reliability of posterior means, medians, and central credible intervals. Estimates the effective sample size for the central part of the distribution (around the median/mean). Interpretation: Higher is better. If ess_bulk is very small relative to the total number of draws, the chain is autocorrelated or has poor mixing in the center of the distribution.

ess_tail (Tail ESS): Use this to assess reliability of HDI/credible interval bounds and tail probabilities. Estimates the effective sample size specifically for the tails of the distribution (extreme quantiles, typically 5th and 95th percentiles). Interpretation: Higher is better. Can be substantially lower than ess_bulk if the tails are poorly explored.
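To show the idea behind ESS, a minimal sketch of a simplified estimator (not ArviZ's implementation) based on the approximation $\text{ESS} \approx N / (1 + 2\sum_k \rho_k)$, where $\rho_k$ is the lag-$k$ autocorrelation:

```python
import numpy as np

def ess(chain, max_lag=200):
    """Crude effective sample size via ESS ≈ N / (1 + 2 * sum of autocorrelations)."""
    chain = np.asarray(chain, dtype=float)
    n = len(chain)
    c = chain - chain.mean()
    var = c @ c / n
    tau = 1.0
    for lag in range(1, max_lag):
        rho = (c[:-lag] @ c[lag:]) / (n * var)
        if rho < 0.05:          # truncate once autocorrelation is negligible
            break
        tau += 2 * rho
    return n / tau

rng = np.random.default_rng(0)
iid = rng.normal(size=5000)      # independent draws: ESS ≈ N
ar1 = np.empty(5000)             # strongly autocorrelated AR(1) chain
ar1[0] = 0.0
for t in range(1, 5000):
    ar1[t] = 0.95 * ar1[t - 1] + rng.normal()

print(ess(iid))   # close to 5000
print(ess(ar1))   # far fewer effective draws despite the same length
```

The autocorrelated chain carries much less information per draw, which is exactly what a small ESS reports.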

Posterior Summaries, Prediction, and Uncertainty

Notebook Cell
import pymc as pm
import arviz as az

# Data: k successes out of n trials
n = 40
k = 26

with pm.Model() as model:
    pi = pm.Beta("pi", alpha=1.0, beta=1.0)  # Prior on probability pi
    y = pm.Binomial("y", n=n, p=pi, observed=k) # Likelihood
    trace = pm.sample( # Sample posterior
        draws=2000,
        tune=1000,
        chains=4,
        target_accept=0.9,
        random_seed=42
    )

az.plot_posterior(trace, hdi_prob=0.80) # Posterior plot: shows density + HDI interval + mean
trace_summary(trace, hdi_prob=0.80)
Loading...
Loading...
Loading...
<Figure size 640x480 with 1 Axes>

Point Summaries: Mean, Variance, Maximum A Posteriori (MAP / Posterior Mode)

Notebook Cell
stats = trace_summary(trace)
stats # Use `stats.loc['pi']['mean']` to access specific values
Loading...

Given posterior samples \theta_1,\dots,\theta_N:

Posterior mean: \hat\theta_{\text{mean}} = \frac{1}{N}\sum_{i=1}^N \theta_i

Posterior variance: \widehat{\text{Var}}(\theta) = \frac{1}{N-1}\sum_{i=1}^N(\theta_i - \hat\theta_{\text{mean}})^2

Maximum A Posteriori (MAP) or Posterior Mode: \hat{\theta}_{\text{MAP}}=\arg\max_{\theta} P(\theta \mid D) = \arg\max_{\theta} P(D\mid\theta)\, P(\theta) = \arg\max_{\theta}\,[\log P(D\mid\theta) + \log P(\theta)]. To compute: use the closed-form posterior mode if available; or differentiate with respect to \theta, set the derivative to zero, and solve for \theta; or use numerical optimization when there is no closed form.
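For the Beta–Binomial model used throughout this section, the MAP has a closed form: the posterior is Beta(\alpha+k, \beta+n-k), whose mode is (\alpha+k-1)/(\alpha+\beta+n-2) when both posterior parameters exceed 1. A quick check with the data above (k = 26, n = 40, flat Beta(1,1) prior):

```python
# Closed-form MAP for a Beta(alpha, beta) prior with a Binomial(n, pi) likelihood
n, k = 40, 26
alpha_prior, beta_prior = 1.0, 1.0

a_post = alpha_prior + k            # posterior Beta parameters
b_post = beta_prior + n - k
pi_map = (a_post - 1) / (a_post + b_post - 2)  # mode of Beta(a, b) for a, b > 1

print(f"Posterior: Beta({a_post:.0f}, {b_post:.0f}), MAP = {pi_map:.3f}")
# With a flat prior the MAP equals the MLE k/n = 0.65
```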

Credible Intervals vs. Confidence Intervals

Source
import numpy as np
import pymc as pm
import matplotlib.pyplot as plt
from scipy.stats import beta
from statsmodels.stats.proportion import proportion_confint

pi_x = np.linspace(0, 1, 500)

# Data: k successes out of n trials
n = 40
k = 26
p_hat = k / n

# Prior parameters: Beta(1,1) -> flat
alpha_prior = 1.0
beta_prior = 1.0

# Posterior parameters
alpha_posterior = alpha_prior + k
beta_posterior  = beta_prior + n - k
pi_y = beta.pdf(pi_x, a=alpha_posterior, b=beta_posterior)

with pm.Model() as model_binom:
    pi = pm.Beta("pi", alpha=alpha_prior, beta=beta_prior) # Prior on probability pi
    y = pm.Binomial("y", n=n, p=pi, observed=k) # Likelihood
    trace = pm.sample( # Sample posterior
        draws=2000,
        tune=1000,
        chains=4,
        target_accept=0.9,
        random_seed=42
    )

stats = trace_summary(trace, hdi_prob=0.95, var_names=['pi'])
ci_low, ci_high = proportion_confint(k, n, alpha=0.05)

# print(f"Posterior mean (pi)                  = {stats.loc['pi']['mean']:.4f}")
# print(f"Bayesian credible interval (HDI)     = {[stats.loc['pi']['hdi_low'], stats.loc['pi']['hdi_high']]}")
# print(f"Frequentist confidence interval (CI) = {[ci_low, ci_high]}")

plt.figure(figsize=(7, 4))

# Posterior density
def plot_density(x, y, label, color='darkgreen'):
    plt.plot(x, y, label=label, lw=2, color=color)
    plt.fill_between(x, y, alpha=0.2, color=color)
plot_density(pi_x, pi_y, label="Posterior density")

# Shade Bayesian credible interval (HDI)
plt.axvspan(stats.loc['pi']['hdi_low'], stats.loc['pi']['hdi_high'], color="green", alpha=0.2, label="Bayesian credible interval (HDI)")

# Mark frequentist CI as a horizontal bar at y=0
plt.hlines(y=0.0,xmin=ci_low,xmax=ci_high,colors="red",linewidth=4,label="Frequentist confidence interval (CI)")

# Mark posterior mean and p_hat
plt.axvline(stats.loc['pi']['mean'], color="k", linestyle="--", label="Posterior mean")
plt.axvline(p_hat, color="blue", linestyle="--", label=r"Sample proportion $\hat p$")

plt.xlim(0, 1)
plt.xlabel(r"$\pi$")
plt.ylabel("density")
plt.title("Posterior density with Bayesian credible interval vs frequentist CI")
plt.legend(loc="upper left", fontsize=9)
plt.tight_layout()
plt.show()
Loading...
Loading...
<Figure size 700x400 with 1 Axes>

Credible interval (Bayesian): For posterior p(\theta\mid d), an 80% credible interval [a,b] satisfies P(a \le \theta \le b \mid d) = 0.8. It is a direct statement about the parameter: “Given data and model, we believe with 80% plausibility that \theta lies in [a,b].”

HDI (Highest Density Interval): Among all 80% intervals, choose the one where the posterior density inside is everywhere at least as high as outside. It minimizes width for a given mass and contains the most probable values by construction.

Confidence interval (frequentist): For an 80% CI, in 80% of repeated experiments the interval constructed by the given procedure will contain the fixed true \theta. It does not allow the statement “\theta is in this specific interval with probability 0.8”.
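The contrast can be made concrete without PyMC by sampling the conjugate Beta posterior with the standard library (a sketch; the Wald CI below is the simplest frequentist interval, not the Wilson interval that `proportion_confint` uses by default):

```python
import math
import random

random.seed(42)
n, k = 40, 26
p_hat = k / n

# Bayesian: Monte Carlo draws from the conjugate posterior Beta(1 + k, 1 + n - k)
draws = sorted(random.betavariate(1 + k, 1 + n - k) for _ in range(20000))
lo = draws[int(0.025 * len(draws))]   # equal-tailed 95% credible interval
hi = draws[int(0.975 * len(draws))]

# Frequentist: 95% Wald confidence interval
se = math.sqrt(p_hat * (1 - p_hat) / n)
ci_lo, ci_hi = p_hat - 1.96 * se, p_hat + 1.96 * se

print(f"95% credible interval: [{lo:.3f}, {hi:.3f}]")
print(f"95% Wald CI:           [{ci_lo:.3f}, {ci_hi:.3f}]")
```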

Aleatoric vs. Epistemic Uncertainty

Law of total variance:

\text{Var}(Y_{\text{new}}\mid d) = \underbrace{\mathbb E_{\theta\mid d}[\text{Var}(Y_{\text{new}}\mid \theta)]}_{\text{aleatoric}} + \underbrace{\text{Var}_{\theta\mid d}\big(\mathbb E[Y_{\text{new}}\mid \theta]\big)}_{\text{epistemic}}

Aleatoric Uncertainty: irreducible data noise (e.g. binomial or Gaussian observation noise).

Epistemic Uncertainty: uncertainty about model parameters; reducible with more/better data.

If epistemic uncertainty exceeds aleatoric uncertainty, collecting more data can meaningfully reduce predictive uncertainty.

If aleatoric uncertainty exceeds epistemic uncertainty, even infinite data cannot substantially reduce the predictive spread (inherent randomness).
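The law of total variance can be verified by plain Monte Carlo for the Beta–Binomial example (a standard-library sketch mirroring what the `uncertainty_summary` helper reports):

```python
import random
import statistics

random.seed(0)
n, k = 40, 26

# Draws from the posterior Beta(1 + k, 1 + n - k)
pis = [random.betavariate(1 + k, 1 + n - k) for _ in range(20000)]

# Aleatoric: average binomial noise E[Var(Y | pi)] = E[n pi (1 - pi)]
aleatoric = statistics.mean(n * p * (1 - p) for p in pis)

# Epistemic: variance of the conditional mean Var(E[Y | pi]) = Var(n pi)
epistemic = statistics.variance(n * p for p in pis)

total = aleatoric + epistemic
print(f"aleatoric = {aleatoric:.2f}, epistemic = {epistemic:.2f}, total = {total:.2f}")
# The total should be close to the posterior predictive variance (~17.2 above)
```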

Notebook Cell
uncertainty_summary(trace.posterior, family='binomial', n=n, var_name='pi')
# uncertainty_summary(trace.posterior, family='normal', mu='mu', sigma='sigma')
# uncertainty_summary(trace.posterior, family='poisson', var_name='lambda')
Loading...
Notebook Cell
# Double check with posterior predictive:

var_name = 'y'
with model:
    predictions = pm.sample_posterior_predictive(trace, var_names=[var_name], random_seed=42)
predictive_var = predictions.posterior_predictive[var_name].var(ddof=1).values

print(f"Predictive Variance = {predictive_var} (for {predictions.posterior_predictive.y.values.size} samples)")
trace_summary(predictions.posterior_predictive, var_names=[var_name])
Loading...
Loading...
Predictive Variance = 17.244173631078883 (for 8000 samples)
Loading...

Hypothesis Testing

Many branches of science depend on hypothesis testing to build knowledge. Hypothesis tests are often used for feature selection (e.g. t-test p-value for features in frequentist linear regression). In Bayesian statistics, one method for model selection is based on hypothesis testing (a model is a hypothesis).

Frequentist vs. Bayesian Hypothesis Tests

Frequentist hypothesis testing can only compute the probability of data under a given hypothesis: set up a null hypothesis and reject it if the probability P(d \mid H_0) is too low. This approach is dangerous when we measure unlikely data or are looking at unlikely hypotheses.

Bayesian hypothesis testing can compute P(H \mid d), i.e. our posterior belief in H given data d! It does not need a null hypothesis and can compare any two or more competing hypotheses. Bayesian and frequentist hypothesis tests return different results whenever there is heavy prior belief in H_0 and we have measured unlikely data in favor of H_1: the Bayesian approach is more robust in this case because it supports the philosophy that “one random accident should not overthrow an established theory”.

One-Sided Hypothesis Tests

Frequentist Hypothesis Test: Null Hypothesis Significance Testing (NHST)

  1. Define H_0 and H_1

  2. Define an estimator \hat\pi and significance level \alpha (false positive rate, usually \alpha = 0.05)

  3. Define a test statistic (e.g. a z-score in the case of a binomial distribution; the type of test statistic depends on the estimator!): z = \frac{\hat\pi - \pi_0}{\sigma} = \frac{\hat\pi - \pi_0}{\sqrt{\frac{\pi_0(1-\pi_0)}{n}}}

  4. Determine the p-value: p = P(Z > z) = 1 - \Phi(z)

  5. If p < \alpha, reject H_0
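The five NHST steps above, carried out for the tongue-rolling data (H_0: \pi \le 0.8 vs. H_1: \pi > 0.8, with k = 26 of n = 40), need only the normal CDF via math.erf:

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n, k = 40, 26
pi0 = 0.80
alpha = 0.05

pi_hat = k / n
z = (pi_hat - pi0) / math.sqrt(pi0 * (1 - pi0) / n)  # z-score under H0
p_value = 1 - phi(z)                                  # one-sided: P(Z > z)

print(f"z = {z:.3f}, p-value = {p_value:.3f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

With the sample proportion 0.65 well below 0.8, z is negative and the one-sided p-value is close to 1, so H_0 is not rejected, matching the Bayesian result below.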

Bayesian Hypothesis Test: Compare posterior probability of hypotheses

Bayesian Hypothesis Testing
  1. Define H_0 and H_1:

    • H_0: 80% of people or less in the local population are able to roll their tongue

    • H_1: More than 80% of people in the local population are able to roll their tongue

  2. Define a prior distribution for \pi (e.g. Beta(1,1) for a flat prior)

  3. Define a likelihood function (e.g. Binomial likelihood)

  4. Sample posterior draws \pi_i

  5. Approximate the posterior probabilities of H_0 and H_1 from the samples

pH1d = np.mean(trace.posterior.pi.values > 0.8)
pH1d, 1.0 - pH1d
Notebook Cell
# With posterior samples pi_i, approximate:
import numpy as np
import pymc as pm

# Data: k successes out of n trials
n = 40
k = 26

with pm.Model() as one_sided_model:
    # Prior on pi: flat Beta(1,1)
    pi = pm.Beta("pi", alpha=1.0, beta=1.0)

    # Likelihood
    y = pm.Binomial("y", n=n, p=pi, observed=k)

    # Sample posterior
    trace = pm.sample(
        draws=3000,
        tune=1000,
        chains=4,
        target_accept=0.9,
        random_seed=42
    )

# Extract posterior samples
pi_samples = trace.posterior["pi"].values.flatten()

# Posterior probability of H1: pi > 0.8
P_H1_post = np.mean(pi_samples > 0.8)
P_H0_post = 1.0 - P_H1_post

print(f"P(H1: pi > 0.8 | data)      = {P_H1_post:.3f}")
print(f"P(H0: pi ≤ 0.8 | data)      = {P_H0_post:.3f}")
print(f"Posterior odds H1:H0        = {P_H1_post / P_H0_post:.3f}")
Loading...
Loading...
P(H1: pi > 0.8 | data)      = 0.010
P(H0: pi ≤ 0.8 | data)      = 0.990
Posterior odds H1:H0        = 0.010

Bayes Factor and Odds

Notebook Cell
# Prior probabilities under Beta(1,1)
P_H0_prior = 0.8
P_H1_prior = 1.0 - P_H0_prior  # P(pi > 0.8) for uniform prior = 0.2

prior_odds = P_H1_prior / P_H0_prior
post_odds  = P_H1_post / P_H0_post

BF_10 = post_odds / prior_odds

print(f"Prior Odds H1:H0            = {prior_odds:.3f}")
print(f"Posterior Odds H1:H0        = {post_odds:.3f}")
print(f"Bayes Factor BF_10          = {BF_10:.3f}")
Prior Odds H1:H0            = 0.250
Posterior Odds H1:H0        = 0.010
Bayes Factor BF_10          = 0.040

Prior odds for H_1 vs. H_0 (how much more likely H_1 is than H_0): \frac{P(H_1)}{P(H_0)}; flip the ratio if you are interested in how much more likely H_0 is than H_1.

Posterior odds after data: \frac{P(H_1\mid d)}{P(H_0\mid d)}

Bayes factor BF = \frac{P(d\mid H_1)}{P(d\mid H_0)} = \frac{\text{posterior odds}}{\text{prior odds}}, which describes how much my belief changed, i.e. how much I learned from the data. Rules of thumb: 1–3: barely worth mentioning, 3–10: substantial, 10–100: strong, greater than 100: decisive evidence in favour of H_1.

Two-Sided Hypothesis Tests

Region of Practical Equivalence (ROPE)

In Bayesian inference, testing whether a parameter exactly equals a specific value is statistically problematic because the probability of a continuous parameter equaling any single point is essentially zero. Instead, ROPE defines a range of values considered “practically equivalent” to the null hypothesis. For testing whether a parameter is practically equal to some value \pi_0, define a ROPE, e.g. [\pi_0-\delta, \pi_0+\delta].

Two Sided Bayesian Hypothesis Test
  • H_0: Exactly 93% of people in the local population are able to roll their tongue     \implies \pi \in [\pi_0-\delta,\pi_0+\delta] for \pi_0 = 0.93 and some \delta > 0.

  • H_1: More or less than 93% of people in the local population are able to roll their tongue     \implies \pi outside that interval.

pi0 = 0.93; delta = 0.03
pi_values = trace.posterior.pi.values
pH0d = np.mean((pi_values > pi0 - delta) &
    (pi_values < pi0 + delta))
pH1d = 1 - pH0d
pH0d, pH1d, (pH0d / pH1d)

If most posterior mass lies inside the ROPE, you accept practical equivalence; otherwise, you conclude a meaningful difference. If the posterior distribution is fairly diffuse, with substantial mass both inside and outside this range, a narrow ROPE (small \delta) can create an inconclusive result. A larger ROPE (bigger \delta) reflects the differences you actually care about practically: if you only need to know whether the true proportion is “around 93%” rather than “exactly 93%”, a wider ROPE makes sense.

Posterior Predictive Checks

  1. Fit model and obtain posterior samples.

  2. For each posterior draw, simulate an entire replicated dataset from the model (posterior predictive).

  3. Compare distribution of replicated data to observed data (histograms, kernel density plots, summary statistics).

If simulated data systematically differ (e.g. lower variance, wrong tail behaviour), the model is likely misspecified (e.g. wrong likelihood family).
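The three PPC steps can be sketched for the Beta–Binomial model without PyMC: simulate one replicated dataset per posterior draw and compare it to the observed count (standard-library sketch; `az.plot_ppc` does this graphically):

```python
import random
import statistics

random.seed(7)
n, k = 40, 26

def binomial(n, p):
    """Simulate a Binomial(n, p) draw as a sum of Bernoulli trials."""
    return sum(random.random() < p for _ in range(n))

# Steps 1-2: for each posterior draw pi_i, simulate a replicated dataset y_rep
y_rep = [binomial(n, random.betavariate(1 + k, 1 + n - k)) for _ in range(4000)]

# Step 3: compare replicated summaries to the observed statistic
print(f"observed k = {k}")
print(f"replicated mean = {statistics.mean(y_rep):.1f}, sd = {statistics.stdev(y_rep):.1f}")
print(f"P(y_rep >= k)   = {sum(y >= k for y in y_rep) / len(y_rep):.2f}")
```

If the observed statistic sits far in the tail of the replicated distribution, the model is suspect; here it sits comfortably in the bulk.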

Source

import bambi as bmb
import pandas as pd
import numpy as np
import arviz as az
import matplotlib.pyplot as plt

data = pd.read_csv("bayesian-machine-learning/advertising.csv")
model = bmb.Model( "sales ~ TV + radio + TV:radio", data=data, family="gaussian" )
trace = model.fit(draws=2000, tune=2000, chains=4, target_accept=0.9, random_seed=42)
model.predict(trace, kind="pps")
az.plot_ppc(trace, num_pp_samples=500, figsize=(12, 3))
plt.show()
Loading...
Loading...
<Figure size 1200x300 with 1 Axes>
Notebook Cell
import numpy as np
import pymc as pm

# Data: k successes out of n trials
n = 40
k = 26

alpha_prior, beta_prior = 1.0, 1.0

with pm.Model() as one_sided_model:
    # Prior on pi: flat Beta(1,1)
    pi = pm.Beta("pi", alpha=alpha_prior, beta=beta_prior)

    # Likelihood
    y = pm.Binomial("y", n=n, p=pi, observed=k)

    # Sample posterior
    trace = pm.sample(
        draws=3000,
        tune=1000,
        chains=4,
        target_accept=0.9,
        random_seed=42
    )

# Extract posterior samples
pi_samples = trace.posterior["pi"].values.flatten()

###################################################
# Compute posterior mass in ROPE and posterior odds
###################################################

# Define ROPE
pi0 = 0.67
delta = 0.025
rope_low  = pi0 - delta
rope_high = pi0 + delta

# Posterior probabilities
P_H0_rope = np.mean((pi_samples >= rope_low) & (pi_samples <= rope_high))
P_H1_rope = 1.0 - P_H0_rope

print(f"ROPE = [{rope_low:.2f}, {rope_high:.2f}]")
print(f"P(H0: pi in ROPE | data)    = {P_H0_rope:.3f}")
print(f"P(H1: pi outside ROPE | data) = {P_H1_rope:.3f}")
print(f"Posterior odds H0:H1        = {P_H0_rope / P_H1_rope:.3f}")

#######################################
# Visualize posterior with ROPE shaded
#######################################
from scipy.stats import beta

alpha_posterior, beta_posterior = alpha_prior + k, beta_prior + n - k
pi_grid = np.linspace(0, 1, 500)
post_pdf = beta.pdf(pi_grid, a=alpha_posterior, b=beta_posterior)

plt.figure(figsize=(7, 4))

# Posterior density
plt.plot(pi_grid, post_pdf, color="darkgreen", lw=2, label="Posterior density")

# Shade ROPE region
mask_rope = (pi_grid >= rope_low) & (pi_grid <= rope_high)
plt.fill_between(pi_grid[mask_rope], post_pdf[mask_rope], color="purple", alpha=0.3, label="ROPE")

# Vertical line at pi0
plt.axvline(pi0, color="k", linestyle="--", label=r"$\pi_0$ (target value)")

plt.xlabel(r"$\pi$")
plt.ylabel("density")
plt.xlim(0, 1)
plt.title(r"Posterior with ROPE: $H_0: \pi \in [{:.2f}, {:.2f}]$".format(rope_low, rope_high))
plt.legend()
plt.tight_layout()
plt.show()
Loading...
Loading...
ROPE = [0.65, 0.70]
P(H0: pi in ROPE | data)    = 0.245
P(H1: pi outside ROPE | data) = 0.755
Posterior odds H0:H1        = 0.324
<Figure size 700x400 with 1 Axes>

Model Checking, Selection, and Multivariate Distributions

Model selection criteria

The Fully Bayesian way: Use Bayes Factor (based on marginal likelihood). This is conceptually clean, but marginal likelihood can be hard to compute.

Use Predictive Criteria instead: These criteria trade off fit vs complexity; Bayesian evidence automatically penalizes overly complex models (“built-in Occam’s razor”).

  • RMSE, MAE (point prediction error using posterior predictive means)

  • ELPD / LOO (Expected Log Predictive Density with leave-one-out cross-validation, Computed with PSIS-LOO, higher is better)

  • Bayesian R2R^2 for regression

Bayesian Linear Regression

What is the posterior distribution of the linear regression parameters \beta_0, \beta_1, and \sigma? Use Bayes’ theorem (and condition on \mathbf{x}):

p(\beta_0, \beta_1, \sigma \mid \mathbf{y}, \mathbf{x}) = \frac{p(\mathbf{y} \mid \beta_0, \beta_1, \sigma, \mathbf{x}) \, p(\beta_0, \beta_1, \sigma \mid \mathbf{x})}{p(\mathbf{y} \mid \mathbf{x})}
  • Likelihood: p(\mathbf{y} \mid \beta_0, \beta_1, \sigma, \mathbf{x}) — probability of observing the data given parameters

  • Prior: p(\beta_0)\, p(\beta_1)\, p(\sigma) — treat parameters as independent, with no direct dependence on \mathbf{x}

  • Evidence: p(\mathbf{y} \mid \mathbf{x}) — marginal likelihood; MCMC sampling avoids computing it explicitly

Then define the prior and likelihood distributions:

\begin{align} \beta_0 &\sim N(\theta_0, \tau_0^2) \\ \beta_1 &\sim N(\theta_1, \tau_1^2) \\ \sigma &\sim \text{Exp}(1/l) \\ y \mid \beta_0, \beta_1, \sigma, x &\sim N(\beta_0 + \beta_1 x, \sigma^2) \end{align}

Probabilistic Graphical Model Notation

Compact graphical way to describe a model without the distributions of the individual variables. Every unobserved variable that has no arrows going into it needs a prior. Bayesian inference will give us probability distributions for the unobserved variables.


Implementation and interpretation (PyMC/Bambi)

Built-in Bambi Operators


Notebook Cell
import bambi as bmb
import pandas as pd
import arviz as az
Notebook Cell
data = pd.read_csv("bayesian-machine-learning/bodyfat.csv")
model = bmb.Model("BodyFat ~ 1 + BMI",
          data=data, family="gaussian")
trace = model.fit(draws=1000, tune=1000)
trace_summary(trace, hdi_prob=0.95)
Loading...
Loading...
Loading...
Notebook Cell
az.plot_posterior(trace, hdi_prob=0.95)
plt.show()
<Figure size 2208x552 with 3 Axes>
Notebook Cell
data.plot.scatter(x='BMI', y='BodyFat')
bmb.interpret.plot_predictions(
    model, trace, 'BMI', ax=plt.gca())
plt.show()
<Figure size 640x480 with 1 Axes>
Notebook Cell
# predict out-of-sample data. for in-sample remove `data=...`
predictions = model.predict(trace, data=pd.DataFrame({ 'BMI': [20, 30, 50] }), kind="pps", inplace=False)
uncertainty_summary(predictions.posterior, family='normal', mu='BodyFat_mean', sigma='BodyFat_sigma')
Loading...
Notebook Cell
predictions.posterior_predictive.BodyFat.var().values # double check posterior predictive variance
array(463.07425281)

Robust Linear Regression

Motivation: Outliers can have a big impact on the resulting regression line.

Reason: The Gaussian likelihood does not like points far away from its center.

Solution: Can be mitigated using a Student’s t-distribution with heavier tails. Bayesians often prefer to change the likelihood instead of the data because it’s more reproducible.

model_robust = bmb.Model("BodyFat ~ BMI",
                    data=data, family="t")
trace_robust = model_robust.fit(draws=1000, tune=1000)
trace_summary(trace_robust, hdi_prob=0.95)
Output
Loading...
Loading...
Loading...

Gaussian Processes (GPs)

Intuition and definition

Gaussian Processes are prior distributions over functions: f(x) \sim \mathcal{GP}(m(x), k(x,x')). A Gaussian Process is fully characterized by a mean function m(x) = E[f(x)] and a covariance function k(x,x') = \mathrm{cov}[f(x), f(x')].

The covariance function k(x,x') must be symmetric and positive semi-definite: for any finite set of inputs, the matrix K must produce non-negative quadratic forms z^\top K z \ge 0. If using a “kernel” that is not positive semi-definite (e.g. K(x,x') = 2 + |x-x'|), you can obtain negative predictive variances in the formula v - k^\top C^{-1}k. Such a function is not a valid covariance function, because a covariance matrix must be positive semi-definite, implying predictive variance \ge 0.
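This failure mode is easy to exhibit numerically: with the invalid “kernel” K(x,x') = 2 + |x-x'| and two noise-free training points, the predictive-variance formula v - k^\top C^{-1} k goes negative (hand-rolled 2×2 example):

```python
def bad_kernel(x1, x2):
    # NOT positive semi-definite -> not a valid covariance function
    return 2 + abs(x1 - x2)

X = [0.0, 1.0]   # two training inputs
x_star = 0.5     # test input

# 2x2 "covariance" matrix and its closed-form inverse
a, b = bad_kernel(X[0], X[0]), bad_kernel(X[0], X[1])
det = a * a - b * b
C_inv = [[a / det, -b / det], [-b / det, a / det]]

k = [bad_kernel(x_star, X[0]), bad_kernel(x_star, X[1])]
v = bad_kernel(x_star, x_star)

# Predictive "variance": v - k^T C^{-1} k
quad = sum(k[i] * C_inv[i][j] * k[j] for i in range(2) for j in range(2))
var_star = v - quad
print(f"predictive variance = {var_star:.2f}")  # -0.50: negative, so the kernel is invalid
```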

Key intuition about Gaussian Processes: if inputs are close to each other, the outputs should behave similarly (i.e. be more strongly correlated). This means the covariance between outputs is given in terms of the inputs.

Conditioning rule (“what’s the prediction given the training data?”): p(f_* \mid f) = \mathcal{N}\left(f_* \mid K_{f_* f} K_{ff}^{-1} f,\; K_{f_* f_*} - K_{f_* f} K_{ff}^{-1} K_{f_* f}^T\right)

Posterior distribution for f_* given y, including observation noise:

\begin{align} p(f_* \mid y) &= \mathcal{N}(f_* \mid \mu_*, \sigma_*^2) \\ \mu_* &= K_{f_* f}(K_{ff} + \sigma_{\text{obs}}^2 I)^{-1} y \\ \sigma_*^2 &= K_{f_* f_*} - K_{f_* f}(K_{ff} + \sigma_{\text{obs}}^2 I)^{-1} K_{f_* f}^T \end{align}

Inverting the covariance matrix of the noisy observations C = (K_{ff} + \sigma^2_{\text{obs}}I) \in \mathbb{R}^{n \times n} is the most computationally expensive step, resulting in a complexity of O(N^3). Therefore, large datasets are challenging for standard GPs.

GP regression: predictive mean and variance

Notebook Cell
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.model_selection import train_test_split

length_scale = 300.0
variance = 100.0
sigma_obs = 80  # observation noise std

df = pd.read_csv("bayesian-machine-learning/credit_data.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df[['Balance']],
    df[['Limit']],
    test_size=0.2,      # 20% test, 80% train
    random_state=42,    # for reproducibility
    shuffle=True        # default; set False if time series
)

kernel = variance * RBF(length_scale=length_scale) + WhiteKernel(noise_level=sigma_obs**2)
gpr = GaussianProcessRegressor(kernel=kernel, alpha=0.0, optimizer=None, normalize_y=False)
gpr.fit(X_train, y_train)

X_plot = np.linspace(X_test["Balance"].min(), X_test["Balance"].max(), 400).reshape(-1, 1)
y_mean, y_std = gpr.predict(X_plot, return_std=True)

# ---- Plot ----
plt.figure(figsize=(8, 5))

# Training and test points
plt.scatter(X_train["Balance"], y_train, color="k", alpha=0.6, label="train")
plt.scatter(X_test["Balance"],  y_test,  color="gray", alpha=0.6, label="test")

# GP mean and 95% interval
plt.plot(X_plot[:, 0], y_mean, color="C0", lw=2, label="GP mean")
plt.fill_between(
    X_plot[:, 0],
    y_mean - 1.96 * y_std,
    y_mean + 1.96 * y_std,
    color="C0",
    alpha=0.2,
    label="95% CI"
)

plt.xlabel("Balance")
plt.ylabel("Limit")
plt.title("Gaussian Process Regression")
plt.legend()
plt.show()
<Figure size 800x500 with 1 Axes>

Predictive distribution for response yy_*:

\mathbb E[y_* \mid x_*,\text{data}] = k^\top C^{-1} y \quad \text{Var}(y_* \mid x_*,\text{data}) = v - k^\top C^{-1}k
  • Training outputs y \in \mathbb R^N.

  • Covariance matrix for training outputs (including noise): C = K_{ff} + \sigma_{\text{obs}}^2 I = k(X, X) + \sigma_{\text{obs}}^2 I

  • Covariances between the test point and training points: k = k(x_*, X) \in \mathbb R^N

  • Prior variance at the test point (for y_*): v = k(x_*,x_*) + \sigma_{\text{obs}}^2
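The predictive formulas above can be checked by hand for two training points with an RBF kernel (pure-Python sketch with a closed-form 2×2 inverse; the data and hyperparameters are illustrative):

```python
import math

def rbf(x1, x2, alpha=1.0, ell=1.0):
    """Squared-exponential kernel with output variance alpha and length scale ell."""
    return alpha * math.exp(-((x1 - x2) ** 2) / (2 * ell ** 2))

X, y = [0.0, 1.0], [1.0, 2.0]   # two training points
x_star = 0.5
sigma_obs = 0.1                  # observation noise std

# C = K_ff + sigma_obs^2 I (2x2), inverted in closed form
a = rbf(X[0], X[0]) + sigma_obs ** 2
b = rbf(X[0], X[1])
det = a * a - b * b
C_inv = [[a / det, -b / det], [-b / det, a / det]]

k = [rbf(x_star, X[0]), rbf(x_star, X[1])]

# Predictive mean: k^T C^{-1} y
mu = sum(k[i] * sum(C_inv[i][j] * y[j] for j in range(2)) for i in range(2))

# Predictive variance for y*: v - k^T C^{-1} k, with v = k(x*, x*) + sigma_obs^2
v = rbf(x_star, x_star) + sigma_obs ** 2
quad = sum(k[i] * C_inv[i][j] * k[j] for i in range(2) for j in range(2))
var = v - quad

print(f"mean = {mu:.3f}, variance = {var:.4f}")
```

The mean lands between the two training outputs and the variance is small and non-negative, as a valid PSD kernel guarantees.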

Connection to kernels and PSD requirement

Common kernels and hyperparameters:

  • Squared Exponential / RBF: k(x,x') = \alpha \exp\left(-\frac{\|x-x'\|^2}{2\ell^2}\right) (\alpha: output variance; \ell: length scale, controlling smoothness)

  • Matérn: controls differentiability via \nu.

  • Rational Quadratic: a mixture over length scales.

Bayesian Networks and Causal Inference

Structural Causal Models (SCMs) and Bayesian Networks

SCM formalism (from BN lecture notes):

  • Endogenous (observed) variables V: determined by structural equations.

  • Exogenous (unobserved) variables U: external noise.

  • Structural equations: each variable V_i is a function of its parents (endogenous causes) and noise (exogenous causes):

V_i = f_i(\text{Pa}_i, U_i)
  • Joint distribution defined via P(U) and the structural equations.

A Bayesian Network (BN) is a DAG over variables V such that the joint distribution factorizes as:

p(v_1,\dots,v_n) = \prod_{i=1}^n p(v_i \mid \text{Pa}_i).
  • Graph encodes conditional independencies.

  • Each node has a Conditional Probability Table (CPT) or parameterized distribution.

Basic graph motifs and conditional independence

Three canonical patterns:

  1. Chain (serial): X \to Z \to Y

    • X and Y are correlated marginally.

    • Conditional independence given the middle node: X \perp Y \mid Z.

  2. Fork (common cause): X \leftarrow Z \to Y

    • X and Y are correlated marginally (through the common cause Z).

    • Conditional independence given the cause: X \perp Y \mid Z.

  3. Collider (v‑structure): X \to Z \leftarrow Y

    • X and Y are independent marginally.

    • Conditioning on the collider (or its descendants) induces dependence (explaining away).

    • Explaining away: in the collider X \to Z \leftarrow Y, observing the effect Z makes the causes X and Y compete.

      • Example: “Smoker” and “Covid” both cause “Hospitalization”; given hospitalization, learning that the patient is a smoker reduces belief that Covid is the cause, and vice versa.
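Explaining away can be verified by enumerating a small joint distribution for the smoker/Covid/hospitalization example (the CPT numbers below are made up for illustration):

```python
from itertools import product

# Hypothetical CPTs: S and C are independent causes of H
P_S = {1: 0.3, 0: 0.7}
P_C = {1: 0.1, 0: 0.9}
P_H1 = {(0, 0): 0.01, (1, 0): 0.20, (0, 1): 0.50, (1, 1): 0.60}  # P(H=1 | S, C)

def joint(s, c, h):
    ph1 = P_H1[(s, c)]
    return P_S[s] * P_C[c] * (ph1 if h == 1 else 1 - ph1)

# P(C=1 | H=1): marginalize over S
num = sum(joint(s, 1, 1) for s in (0, 1))
den = sum(joint(s, c, 1) for s, c in product((0, 1), repeat=2))
p_c_given_h = num / den

# P(C=1 | H=1, S=1): knowing the patient smokes makes Covid less needed
num_s = joint(1, 1, 1)
den_s = sum(joint(1, c, 1) for c in (0, 1))
p_c_given_hs = num_s / den_s

print(f"P(C=1 | H=1)      = {p_c_given_h:.3f}")
print(f"P(C=1 | H=1, S=1) = {p_c_given_hs:.3f}  (smaller: S explains H away)")
```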

Computing Probabilities from Joint Tables

Given the full joint distribution table P(H,S,C):

\begin{array}{c|c|c|c} s & c & h & \mathbb{P}(H = h, S = s, C = c) \\ \hline 0 & 0 & 0 & 0.75597 \\ 0 & 0 & 1 & 0.03993 \\ \dots & \dots & \dots & \dots \\ \end{array}

Marginal: P(C=1) = \sum_{s,h} P(C=1, H=h, S=s) = P(C=1, H=0, S=0) + P(C=1, H=0, S=1) + \dots

Conditional: P(C=1 \mid H=1) = \dfrac{\sum_s P(C=1,H=1,S=s)}{\sum_{c,s}P(C=c,H=1,S=s)}

Example: P(C = 1 \mid H = 1, S = 0) = \dfrac{P(C = 1, H = 1, S = 0)}{P(H = 1, S = 0)} = \dfrac{P(C = 1, H = 1, S = 0)}{\sum_c P(C = c, H = 1, S = 0)}
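These queries become mechanical once the joint table is stored in a dictionary (sketch with a hypothetical complete table; only the first two entries match the excerpt above):

```python
from itertools import product

# Hypothetical full joint P(S, C, H); the first two entries follow the table excerpt
P = {
    (0, 0, 0): 0.75597, (0, 0, 1): 0.03993,
    (0, 1, 0): 0.02430, (0, 1, 1): 0.02970,
    (1, 0, 0): 0.09003, (1, 0, 1): 0.02307,
    (1, 1, 0): 0.01400, (1, 1, 1): 0.02300,
}
assert abs(sum(P.values()) - 1.0) < 1e-9  # a joint table must sum to 1

# Marginal: P(C=1) = sum over s and h
p_c1 = sum(P[(s, 1, h)] for s, h in product((0, 1), repeat=2))

# Conditional: P(C=1 | H=1) = P(C=1, H=1) / P(H=1)
p_c1_h1 = sum(P[(s, 1, 1)] for s in (0, 1))
p_h1 = sum(P[(s, c, 1)] for s, c in product((0, 1), repeat=2))

print(f"P(C=1)       = {p_c1:.4f}")
print(f"P(C=1 | H=1) = {p_c1_h1 / p_h1:.4f}")
```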

Interventions and the do‑operator

Conditioning: P(Y\mid X=x) describes correlations in observed data; X may itself be caused by other variables (confounded).

Intervention: P(Y\mid \text{do}(X=x)) describes what happens if we force X to x (surgery on the graph):

  • All incoming edges into XX are cut.

  • XX is set externally.

  • The rest of the graph and conditional distributions remain unchanged.

For a treatment X, outcome Y, and a set of observed confounders Z that blocks all backdoor paths (backdoor criterion), the adjustment formula is:

P(Y\mid \text{do}(X=x)) = \sum_z P(Y\mid X=x, Z=z)\,P(Z=z).
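A toy binary example shows how the adjustment formula differs from naive conditioning when Z confounds X (the numbers are made up):

```python
# Hypothetical confounded model: Z -> X, Z -> Y, X -> Y
P_Z = {0: 0.6, 1: 0.4}
P_X1_given_Z = {0: 0.2, 1: 0.8}                    # P(X=1 | Z)
P_Y1_given_XZ = {(0, 0): 0.1, (0, 1): 0.5,
                 (1, 0): 0.3, (1, 1): 0.7}         # P(Y=1 | X, Z)

# Backdoor adjustment: P(Y=1 | do(X=1)) = sum_z P(Y=1 | X=1, z) P(z)
p_do = sum(P_Y1_given_XZ[(1, z)] * P_Z[z] for z in (0, 1))

# Naive observational conditional P(Y=1 | X=1) from the joint
p_xy = sum(P_Z[z] * P_X1_given_Z[z] * P_Y1_given_XZ[(1, z)] for z in (0, 1))
p_x = sum(P_Z[z] * P_X1_given_Z[z] for z in (0, 1))
p_cond = p_xy / p_x

print(f"P(Y=1 | do(X=1)) = {p_do:.3f}")   # interventional
print(f"P(Y=1 | X=1)     = {p_cond:.3f}") # confounded observational estimate
```

The two quantities differ because conditioning on X = 1 also tells us something about Z, while the intervention cuts that backdoor path.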

Expectation Maximization (EM)

The EM algorithm iteratively maximizes the log-likelihood of the observed data. H is hidden, E = e is observed.

  1. Initialize \vartheta randomly

  2. Repeat until convergence:

E-step:

  • Compute q(h) = p(H = h \mid E = e; \vartheta) for each h (probabilistic inference)

  • Create fully-observed weighted examples: (h, e) with weight q(h)

M-step:

  • Perform maximum likelihood estimation (count and normalize) on the weighted examples to update \vartheta
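The E- and M-steps above can be sketched for the classic two-coin mixture: each session's coin identity H is hidden, and we alternate posterior responsibilities (E-step) with weighted count-and-normalize maximum likelihood (M-step). The data and initialization below are made up:

```python
def em_two_coins(heads, flips, theta_a, theta_b, iters=50):
    """EM for a mixture of two coins with equal mixture weights."""
    for _ in range(iters):
        # E-step: responsibility q(A) for each session via Bayes' rule
        stats = []  # (q_a, heads, flips) per session
        for h, n in zip(heads, flips):
            like_a = theta_a ** h * (1 - theta_a) ** (n - h)
            like_b = theta_b ** h * (1 - theta_b) ** (n - h)
            q_a = like_a / (like_a + like_b)
            stats.append((q_a, h, n))
        # M-step: weighted count-and-normalize
        theta_a = sum(q * h for q, h, n in stats) / sum(q * n for q, h, n in stats)
        theta_b = sum((1 - q) * h for q, h, n in stats) / sum((1 - q) * n for q, h, n in stats)
    return theta_a, theta_b

heads = [9, 8, 2, 1, 8]   # heads per 10-flip session (hypothetical data)
flips = [10] * 5
theta_a, theta_b = em_two_coins(heads, flips, theta_a=0.6, theta_b=0.4)
print(f"theta_A = {theta_a:.3f}, theta_B = {theta_b:.3f}")
```

The estimates separate toward the biases of the high-heads sessions (9, 8, 8) and the low-heads sessions (2, 1), even though no session is labeled.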