Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Advanced Statistical Data Analysis

Authors
Affiliations
ZHAW School of Engineering
ZHAW School of Engineering
Notebook Cell
# Install BiocManager (if not already installed)
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Install graph and RBGL from Bioconductor
BiocManager::install(c("graph", "RBGL", "Rgraphviz"))

install.packages("gRbase")
install.packages("pcalg")

library(gRbase)

set.seed(123)

The downloaded binary packages are in
	/var/folders/mg/f7gk10vj1_153rz1zkvd13_00000gn/T//Rtmp2VjdEg/downloaded_packages

The downloaded binary packages are in
	/var/folders/mg/f7gk10vj1_153rz1zkvd13_00000gn/T//Rtmp2VjdEg/downloaded_packages

Overview

The course focuses on regression modeling as a primary tool for explaining relationships between:

  • Response variable (output/dependent variable): The variable we want to explain or predict

  • Explanatory variables (inputs/independent variables/predictors): Variables that explain the response

Mathematical Formulation:

YfX(1),X(2),,X(m)+EY \approx f\left\langle X^{(1)}, X^{(2)}, \ldots, X^{(m)}\right\rangle + E

where ff\left\langle \cdot \right\rangle is the systematic component and EE is the random error term.

Symbols

SymbolMeaningExample
YYResponse variableYiY_i
XXExplanatory variableXjX_j
β\betaRegression coefficientβ0\beta_0, β1\beta_1
μ\muExpectationμ=E[Y]\mu = \mathbb{E}[Y]
σ2\sigma^2VarianceVar(Y)=σ2\text{Var}(Y) = \sigma^2
ϕ\phiDispersion parameterϕV(μ)\phi V(\mu)
η\etaLinear predictorη=XTβ\eta = \underline{X}^T \underline{\beta}
g()g(\cdot)Link functiong(μ)=ηg(\mu) = \eta
P()P(\cdot)ProbabilityP(YX)P(Y | X)
do(X=x)\text{do}(X=x)InterventionP(Ydo(X=x))P(Y | \text{do}(X=x))

Advanced Regression Modelling

Multiple Linear Regression

Objectives of Regression Analysis

  1. General description of data structure.

  2. Assessment of the effect of explanatory variables on the response.

  3. Prediction of future observations.

Error Assumptions

The standard assumptions for the error terms Ei\mathcal{E}_i are:

  • Stochastically independent.

  • Expectation zero and constant variance σ2\sigma^2 (homoscedasticity).

  • Normally (Gaussian) distributed: EiN(0,σ2)\mathcal{E}_i \sim \mathcal{N}(0, \sigma^2).

Matrix Representation

To simplify notation, the regression equation Definition 2 is written in matrix form:

Y=Xβ+E\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\mathcal{E}}

where:

  • Y\mathbf{Y} is an n×1n \times 1 vector of responses.

  • X\mathbf{X} is an n×pn \times p matrix of explanatory variables (including a column of 1s for the intercept).

  • β\boldsymbol{\beta} is a p×1p \times 1 vector of unknown coefficients (p=m+1p = m+1).

  • E\boldsymbol{\mathcal{E}} is an n×1n \times 1 vector of unobserved random variables.

Tukey’s First-Aid Transformations

Standard recommendations used to linearize relationships and stabilize variance when there is no specific domain theory to guide variable transformation. These should be applied to both explanatory variables and responses unless a valid reason exists to do otherwise:

Data TypeRecommended Transformation
Concentrations and Amountslog(x)\log(x)
Count Datax\sqrt{x}
Counted Fractions / Sharesx~=logit(x)=log(x+0.0051.01x)\tilde{x} = \text{logit}(x) = \log\left(\frac{x + 0.005}{1.01 - x}\right)[1]

Model Fitting and Diagnostics

Least Squares Estimation

The coefficients β\boldsymbol{\beta} are estimated by minimizing the sum of squared residuals.

The OLS estimator is given by:

β^=(XTX)1XTY\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}
Model Adequacy (Residual Analysis)

Model adequacy is checked using diagnostic plots:

source("advanced-statistical-data-analysis/RFn_Plot-lmSim.R")
df <- read.table("advanced-statistical-data-analysis/Softdrink.dat", header=TRUE)

# Tukey's First-Aid Transformations:
df.tk <- data.frame(
  Time=log(df$Time),
  volume=sqrt(df$volume), # := number of cases of product stocked => count data
  distance=log(df$distance),
  location=df$location) # := categorical variable (text data)

# Linear Regression:
df.tk.lm1 <- lm(Time ~ volume + distance, data=df.tk)

# Residual Analysis:
par(mfrow=c(2,4))
plot(df.tk.lm1)
plot.lmSim(df.tk.lm1, SEED=4711)
Plot with title “”
  • Tukey-Anscombe Plot: Residuals vs. Fitted values to check for non-linearity or heteroscedasticity.

  • Normal Q-Q Plot: To check the normality assumption of errors.

  • Scale-Location Plot: To check for constant variance.

  • Residuals vs. Leverage: To identify influential observations (Cook’s Distance).

Model Formulation:

Yi=β0+β1xi(1)+β2xi(2)++βmxi(m)+Ei,i=1,,nY_i = \beta_0 + \beta_1 x_i^{(1)} + \beta_2 x_i^{(2)} + \cdots + \beta_m x_i^{(m)} + E_i, \quad i = 1, \ldots, n

Assumptions:

  1. EiN(0,σ2)E_i \sim \mathcal{N}(0, \sigma^2) (normally distributed errors)

  2. E[Ei]=0\mathbb{E}[E_i] = 0 (zero expectation)

  3. Var(Ei)=σ2\text{Var}(E_i) = \sigma^2 (homoscedasticity)

  4. EiE_i are stochastically independent

  5. No perfect multicollinearity

Matrix Notation:

Y=Xβ+E\underline{Y} = \boldsymbol{X} \underline{\beta} + \underline{E}
Y=Xβ+EY = X\beta + E

Gauss-Markov Theorem: Under assumptions 1-4 (without normality), OLS is BLUE (Best Linear Unbiased Estimator).

Properties:

  • Unbiased: E[β^]=β\mathbb{E}[\hat{\underline{\beta}}] = \underline{\beta}

  • Consistent: β^Pβ\hat{\underline{\beta}} \xrightarrow{P} \underline{\beta}

  • Asymptotic Normality: β^N(β,σ2(XTX)1)\hat{\underline{\beta}} \sim \mathcal{N}(\underline{\beta}, \sigma^2 (\boldsymbol{X}^T \boldsymbol{X})^{-1})

Advanced Topics in Linear Regression Modelling

Learning Objectives

  • Weighted Least Squares (for non-constant variance)

  • Robust Fitting (to handle outliers and leverage points)

  • Fitting Smooth Functions (using LOESS and splines)

  • Additive Regression Models

  • Model Building strategies

Problem: OLS is not robust

Measuring Robustness

How robust an estimator is, can be investigated by two simple measures:

  • influence function and gross error sensitivity: The (gross error) sensitivity is based on the influence function and measures the maximum effect of a single observation on the estimated value

  • breakdown point: returns the minimum proportion of data that can be altered without causing completely unreliable estimates

Hence an estimator with good robustness properties has:

  • a bounded gross error sensitivity and

  • a breakdown point of (about) 0.5

Robust Regression: From OLS to MM-estimator

EstimatorPurposeProblem AddressedWeights
OLS (Ordinary Least Squares)Standard regressionNone (assumes ideal conditions)All weights = 1
WLS (Weighted Least Squares)Efficiency improvementHeteroscedasticity (non-constant variance)Known from variance structure
M-estimator (Max Likelihood)RobustnessOutliers in responseData-driven from residuals
S-estimator (Scale, σ\sigma)Robust scaleOutliers + leverage pointsImplicit in scale estimation
MM-estimator (Modified M-estimator)Best of bothOutliers + leverage + efficiencyTwo-step: S then M

We need S-estimators, M-estimators, and MM-estimators because real-world data often violates the assumptions of OLS regression. Therefore, we need robust estimators:

  1. OLS has 0% breakdown point - even one bad data point can ruin your analysis

  2. M-estimators alone can’t handle leverage points - outliers in predictors are still problematic

  3. S-estimators are too computationally expensive for routine use

  4. MM-estimators give us the best of both worlds - robustness AND practicality

Use MM-estimators as your default robust regression method. They provide protection against both outliers and leverage points while maintaining good statistical efficiency.

library(robustbase)
source("advanced-statistical-data-analysis/RFn_Plot-lmSim.R")
df <- read.table("advanced-statistical-data-analysis/Synthetic.dat", header=TRUE)

# Robust fit:
df.rlm <- lmrob(Y ~ x1 + x2, data=df, method="MM")

# Residual Analysis:
par(mfrow=c(2,4))
plot(df.tk.lm1)
plot.lmSim(df.tk.lm1, SEED=4711)
Plot with title “”

Logistic Regression (Binary Regression)

Theoretical Background

Logistic Regression Model:

P(Yi=1Xi)=πi=11+eηi,ηi=XiTβP(Y_i = 1 | \underline{X}_i) = \pi_i = \frac{1}{1 + e^{-\eta_i}}, \quad \eta_i = \underline{X}_i^T \underline{\beta}

Logit Link:

logπi1πi=ηi\log\left\langle \frac{\pi_i}{1 - \pi_i} \right\rangle = \eta_i

Odds Ratio:

OR=eβj=Odds(Y=1Xj=x+1)Odds(Y=1Xj=x)\text{OR} = e^{\beta_j} = \frac{\text{Odds}(Y=1 | X_j = x+1)}{\text{Odds}(Y=1 | X_j = x)}

Asymptotic Properties:

  • β^N(β,I1)\hat{\underline{\beta}} \sim \mathcal{N}(\underline{\beta}, \mathcal{I}^{-1})

  • Wald tests valid for large samples

df <- read.table("advanced-statistical-data-analysis/turbines.dat", header = TRUE)

# glm format: glm(x ~ y, family=…, data=…)
df.glm <- glm(cbind(Fissures, Turbines-Fissures) ~ Hours, family=binomial, data=df)
coef(df.glm)

df.p <- data.frame(Hours=seq(100, 10000, by=100))
p.y <- predict(df.glm, newdata=N, type="response")
sunflowerplot(x=df$Hours, y=df$Fissures/df$Turbines, las=1, number=df$Turbines, xlim=c(0, 10000), ylim=c(0,1))
lines(p$Hours, p.y, col=3)
Loading...
plot without title

Generalized Linear Models (GLM) - A Unifying Model Family

Learning Objectives

  • Identify members of the exponential family

  • Know the two basic elements that define the GLM

  • Fit GLMs in R and know the algorithm underlying it

  • Interpret R output of a GLM fit

Exponential Family

f(yi;μi,ϕ)=expyib(μi)c(μi)ϕwi+d(yi;ϕ,wi)f(y_i; \mu_i, \phi) = \exp \left\langle \frac{y_i b(\mu_i) - c(\mu_i)}{\phi} w_i + d(y_i; \phi, w_i) \right\rangle
  • The dispersion parameter ϕ\phi is connected to the variance

  • The value wiw_i is a fixed known number (weighting)

  • The function d(yi;ϕ,wi)d(y_i; \phi, w_i) scales the probability to 1.

  • The functions b(μi)b(\mu_i) and c(μi)c(\mu_i) determine the distribution.

Expected Value:

μi=E[Yi]=c(μi)b(μi)\mu_i = \mathbb{E}[Y_i] = \frac{c'(\mu_i)}{b'(\mu_i)}

where b(μ)b'(\mu) and c(μ)c'(\mu) are the first derivatives with respect to μ\mu.

Variance:

Var(Yi)=ϕwiV(μi),V(μi)=1b(μi)\text{Var}(Y_i) = \frac{\phi}{w_i} V(\mu_i), \quad V(\mu_i) = \frac{1}{b'(\mu_i)}

Common Distributions

DistributionRange of YYEY=μ\mathbb{E}\langle Y\rangle = \muvarY\text{var}\langle Y\ranglebμb\langle \mu\rangleVμV\langle \mu\rangleϕ\phiww
Gaussian μ,σ2\langle \mu, \sigma^2 \rangle(,+)(-\infty, +\infty)μ\muσ2\sigma^2μ\mu1σ2\sigma^21
Binomial m,π\langle m, \pi \rangle{0,1,2,,m}\{0,1,2,\ldots,m\}π\piπ(1π)m\frac{\pi(1-\pi)}{m}logμ1μ\log\left\langle \frac{\mu}{1-\mu} \right\rangleμ(1μ)\mu(1-\mu)1mm
Poisson λ\langle \lambda \rangle{0,1,2,}\{0,1,2,\ldots\}λ\lambdaλ\lambdalogμ\log\langle \mu \rangleμ\mu11
Gamma α,β\langle \alpha, \beta \rangle(0,)(0, \infty)αβ\frac{\alpha}{\beta}αβ2\frac{\alpha}{\beta^2}1μ-\frac{1}{\mu}μ2\mu^21α\frac{1}{\alpha}1
Inverse Gaussian(0,)(0, \infty)μ\muμ3λ\frac{\mu^3}{\lambda}1μ2-\frac{1}{\mu^2}μ3\mu^31λ\frac{1}{\lambda}1
  • Distribution: The probability distribution of the response variable YY;Defines the random component of the GLM

  • Range of YY: The support (set of possible values) for the response variable; Determines which distribution is appropriate for your data

  • EY=μ\mathbb{E}\langle Y\rangle = \mu: The expected value (mean) of the response; The target of your regression model

  • varY\text{var}\langle Y\rangle: The variance of the response variable; Determines precision of estimates

  • bμb\langle \mu\rangle: The canonical link function from the exponential family density; η=b(μ)\eta = b'(\mu)

  • VμV\langle \mu\rangle: The variance function; Connects mean to variance: var(Y)=ϕwV(μ)var(Y) = \frac{\phi}{w} V(\mu)

  • ϕ\phi: The dispersion parameter; Scales the variance: larger ϕ\phi means more spread

  • ww: The prior weight for each observation; Known constant for each observation (e.g., mm for binomial)

Normal/Gaussian: N(μ,σ2)\mathcal{N}(\mu, \sigma^2)

  • E[Y]=μ\mathbb{E}[Y] = \mu

  • Var(Y)=σ2Var(Y) = \sigma^2

glm(y ~ x1 + x2, family = gaussian)
# or simply: lm(y ~ x1 + x2)

Binomial(m,π)\text{Binomial}(m, \pi)

  • E[Y]=mπ\mathbb{E}[Y] = m\pi

  • Var(Y)=mπ(1π)Var(Y) = m\pi(1-\pi)

# Binary data
glm(y ~ x1 + x2, family = binomial)

# Aggregated data (successes and failures)
glm(cbind(success, failure) ~ x1 + x2, family = binomial)

Poisson(λ)\text{Poisson}(\lambda)

  • E[Y]=λ\mathbb{E}[Y] = \lambda

  • Var(Y)=λVar(Y) = \lambda

glm(y ~ x1 + x2, family = poisson)

Gamma(α,β)\text{Gamma}(\alpha, \beta)

  • E[Y]=α/β\mathbb{E}[Y] = \alpha/\beta

  • Var(Y)=α/β2Var(Y) = \alpha/\beta^2

glm(y ~ x1 + x2, family = Gamma)

Inverse Gaussian: IG(μ,λ)IG(\mu, \lambda)

  • E[Y]=μ\mathbb{E}[Y] = \mu

  • Var(Y)=μ3/λVar(Y) = \mu^3 / \lambda

glm(y ~ x1 + x2, family = inverse.gaussian)

Distribution Decision Tree

  • Check variance: For Poisson, verify mean ≈ variance. If variance > mean, consider Negative Binomial (not in exponential family)

  • Check range: Ensure response values are valid for the distribution (e.g., Poisson requires y ≥ 0)

  • Check skewness: Right-skewed data may need Gamma or Inverse Gaussian

  • Check zeros: Excess zeros may require zero-inflated models

GLM Structure

The Generalized Linear Model (GLM) is defined by two fundamental elements:

  1. Distributional Element
    The distribution of the response YiY_i, given the explanatory variables xi\underline{x}_i, belongs to the two-parameter exponential family with:

  • Expectation: EYixi=μi\mathbb{E}\langle Y_i \mid \underline{x}_i \rangle = \mu_i

  • Dispersion: ϕ\phi The responses YiY_i, i=1,,ni = 1, \ldots, n, are independently distributed.

  1. Structural Element
    The expectation μi\mu_i is related to the linear predictor ηi=xiTβ\eta_i = \underline{x}_i^T \underline{\beta} of the explanatory variables by applying a (possibly nonlinear) function gg\left\langle \cdot \right\rangle on μi\mu_i:

    gμi=ηi=xiTβ.g\left\langle \mu_i \right\rangle = \eta_i = \underline{x}_i^T \underline{\beta}.

    The function gg\left\langle \cdot \right\rangle is called the link function.

The estimator for β\underline{\beta} is derived based on the maximum likelihood principle:

  • It is defined by a nonlinear system of equations

  • It can be solved most efficiently by the IRLS algorithm

This two-element structure (distributional + structural) is what unifies various regression models (linear, logistic, Poisson, etc.) under the GLM framework.

IRLS Algorithm for GLM Fitting

The Iteratively Reweighted Least Squares (IRLS) algorithm fits Generalized Linear Models through the following iterative steps:

  1. Initialize
    Put μ^i=Yi\hat{\mu}_i = Y_i

  2. Ensure Valid Weights
    If necessary, modify μ^i\hat{\mu}_i so that the weights w~i\tilde{w}_i below are non-zero (e.g., as done by the empirical logits)

  3. Calculate Weights

    w~iwiVμ^i(gμ^i)2\tilde{w}_i \coloneqq \frac{w_i}{V\left\langle \hat{\mu}_i \right\rangle \left( g'\left\langle \hat{\mu}_i \right\rangle \right)^2}
  4. Form Adjusted Response

    zixiTβ^+(Yiμ^i)gμ^iz_i \coloneqq \underline{x}_i^T \hat{\underline{\beta}} + \left( Y_i - \hat{\mu}_i \right) g'\left\langle \hat{\mu}_i \right\rangle
  5. Weighted Least Squares
    Regress w~izi\sqrt{\tilde{w}_i} z_i on w~ixi\sqrt{\tilde{w}_i} \underline{x}_i by least squares to obtain β^\hat{\underline{\beta}}. That is, solve the system of linear equations:

    i=1nw~i(zixiTβ^)xi=0for β^.\sum_{i=1}^n \tilde{w}_i \left( z_i - \underline{x}_i^T \hat{\underline{\beta}} \right) \underline{x}_i = \underline{0} \quad \text{for } \hat{\underline{\beta}}.
  6. Update Fitted Values
    Calculate the fitted values:

    μ^i=g1xiTβ^\hat{\mu}_i = g^{-1}\left\langle \underline{x}_i^T \hat{\underline{\beta}} \right\rangle
  7. Iterate
    Return to step 2 and iterate until convergence

The dispersion parameter ϕ\phi must be estimated as well:

  • It can also be done by maximizing the likelihood.

  • As we know from multiple linear regression, we rather use an unbiased estimator for estimating the dispersion parameter ϕ=σ2\phi = \sigma^2

  • In the case of binomial and Poisson distribution: ϕ=1\phi = 1

  • For other distributions: see table above

Key R Code Snippets

Week 6: Inference, Model Simplification, Variable Selection

Theoretical Background

Deviance:

D(y,μ^)=2i(yi;yi)(yi;μ^i)D(\underline{y}, \hat{\underline{\mu}}) = 2 \sum_i \left\langle \ell(y_i; y_i) - \ell(y_i; \hat{\mu}_i) \right\rangle

Deviance Test:

ΔD=D0D1χdf1df02\Delta D = D_0 - D_1 \sim \chi^2_{df_1 - df_0}

Overdispersion:

ϕ^=Residual Deviancenp\hat{\phi} = \frac{\text{Residual Deviance}}{n - p}

Confidence Intervals:

  • Wald: β^j±zα/2SE(β^j)\hat{\beta}_j \pm z_{\alpha/2} \cdot \text{SE}(\hat{\beta}_j)

  • Deviance-based: More accurate for small samples

  • Bootstrap: Nonparametric

Learning Objectives

  • Know what deviances are in GLM

  • Know what overdispersion is and can identify it

  • Know when to apply Wald-type vs deviance-based confidence intervals

  • Apply methods in statistical data analysis using R

Key R Code Snippets

model <- glm(y ~ x1 + x2, family = poisson, data = mydata)
null_deviance <- model$null.deviance
resid_deviance <- model$deviance
anova(model_simple, model_complex, test = "LRT")
model_poisson <- glm(y ~ x1 + x2, family = poisson, data = mydata)
dispersion <- model_poisson$deviance / model_poisson$df.residual
if (dispersion > 1.5) {
  model_quasi <- glm(y ~ x1 + x2, family = quasipoisson, data = mydata)
  library(MASS)
  model_nb <- glm.nb(y ~ x1 + x2, data = mydata)
}
AIC(model1, model2, model3)
BIC(model1, model2, model3)
step(model1, direction = "both")

Week 7: Diagnostics / Model Adequacy Checking

Theoretical Background

Diagnostic Plots:

  1. Residuals vs Fitted: Check linearity and homoscedasticity

  2. Normal Q-Q: Check normality

  3. Scale-Location: Check constant variance

  4. Leverage: Identify influential points

Multicollinearity:

  • High VIF (> 5 or 10) indicates problematic collinearity

  • Effects: Unstable estimates, inflated standard errors

Learning Objectives

  • Understand how AIC is generalized to GLMs

  • Check model adequacy and determine which assumptions are violated

  • Find appropriate transformations of predictors in a data-driven manner

  • Apply methods in statistical data analysis using R

Key R Code Snippets

AIC(model1, model2)
par(mfrow = c(2, 2))
plot(model_poisson, which = 1:4)
library(halfnorm)
halfnorm(residuals(model_poisson, type = "deviance"))
library(car)
cooks.distance(model1)
hat.values(model1)
dfbetas(model1)

Week 8: Some Extensions of GLM

Theoretical Background

Rate Models:

E[Yi]=μi=tiλi\mathbb{E}[Y_i] = \mu_i = t_i \lambda_i

For Poisson with log link:

log(μi)=log(ti)+log(λi)=xiTβ\log(\mu_i) = \log(t_i) + \log(\lambda_i) = \underline{x}_i^T \underline{\beta}

Quasi Models:

Var(Yi)=ϕV(μi)\text{Var}(Y_i) = \phi V(\mu_i)
  • Consistent β^\hat{\underline{\beta}} even if variance function misspecified

  • Standard errors adjusted by ϕ^\hat{\phi}

Instrumental Variables:

  1. First Stage: X=π0+π1Z+UX = \pi_0 + \pi_1 Z + U

  2. Second Stage: Y=β0+β1X^+EY = \beta_0 + \beta_1 \hat{X} + E

Learning Objectives

  • Know what a rate model is and how to analyse it with GLM

  • Know how a quasi model extends a GLM and when to apply it

  • Fit methods to data, interpret results, make inference and predictions

Key R Code Snippets

model_rate <- glm(y ~ x1 + x2, family = poisson, offset = log(t), data = mydata)
model_quasipoisson <- glm(y ~ x1 + x2, family = quasipoisson, data = mydata)
model_quasibinomial <- glm(cbind(Y, m-Y) ~ x1 + x2, family = quasibinomial, data = mydata)
library(AER)
model_iv <- ivreg(Y ~ X + W | Z + W, data = mydata)
summary(model_iv)

Causality

Introduction

Theoretical Background

The Ladder of Causation (Pearl):

  1. Association (Seeing): Observing patterns

  2. Intervention (Doing): Estimating effects of actions

  3. Counterfactuals (Imagining): Reasoning about hypotheticals

Notation:

  • P(YX=x)P(Y | X = x): Conditional probability

  • P(Ydo(X=x))P(Y | \text{do}(X = x)): Causal effect

Causal Graphical Models:

  • Nodes: Variables

  • Edges: Causal relationships

  • DAG: Directed Acyclic Graph (no cycles)

Graphical Structures:

  1. Chain: XZYX → Z → Y

  • XX and YY are conditionally independent given Z; (X ⁣ ⁣ ⁣YZ)(X \perp\!\!\!\perp Y \mid Z)

  1. Fork: XZYX ← Z → Y

  • XX and YY are conditionally independent given Z; (X ⁣ ⁣ ⁣YZ)(X \perp\!\!\!\perp Y \mid Z)

  1. Collider: XZYX → Z ← Y

  • XX and YY are independent (X ⁣ ⁣ ⁣Y)(X \perp\!\!\!\perp Y), but are conditionally dependent on Z and any descendants of Z; (X⊥̸ ⁣ ⁣ ⁣YZ)(X \not\perp\!\!\!\perp Y \mid Z)

X ⁣ ⁣ ⁣YZ if all paths between X and Y are blocked by ZX \perp\!\!\!\perp Y \mid Z \text{ if all paths between X and Y are blocked by Z}

Learning Objectives

  • Understand when and why causal reasoning is important

  • Be familiar with causal graphical models

  • Understand the difference between association, interventions, and counterfactuals

  • Know the difference between experimental and observational data

Intervention changes distribution

Intervention changes distribution 1Intervention changes distribution 2

Key R Code Snippets

library(gRbase)
g <- dag(
  c("A"),
  c("B", "A", "E"), # = root node + all incoming node-edges
  c("C", "A", "B", "D"),
  c("D", "B"),
  c("E")
)

cat("Children of B = ", children("B", g), "\n")
cat("Parents of B = ", parents("B", g), "\n")
cat("Ancestors of B = ", ancestralSet("B", g), "\n")
Children of B =  C D 
Parents of B =  A E 
Ancestors of B =  A B E 
Plot with title “”
m <- as(g, "matrix") # to matrix from graph
m
g1 <- as(m, "igraph") # to igraph from matrix
Loading...
# Intervention on B (i.e. remove edges):
g2 <- removeEdge("A", "B", removeEdge("E", "B", g1))
plot(g2)
Plot with title “”

Simpson’s Paradox & Graphical Models

Theoretical Background

Simpson’s Paradox: A trend appears in different groups but disappears or reverses when combined.

Factorization:

  • Chain: P(X,Z,Y)=P(X)P(ZX)P(YZ)P(X,Z,Y) = P(X)P(Z|X)P(Y|Z)

  • Fork: P(X,Z,Y)=P(Z)P(XZ)P(YZ)P(X,Z,Y) = P(Z)P(X|Z)P(Y|Z)

  • Collider: P(X,Z,Y)=P(X)P(Y)P(ZX,Y)P(X,Z,Y) = P(X)P(Y)P(Z|X,Y)

Bayesian Network Factorization:

P(X1,,Xn)=i=1nP(XiParents(Xi))P(X_1, \ldots, X_n) = \prod_{i=1}^n P(X_i | \text{Parents}(X_i))

Learning Objectives

  • Understand Simpson’s paradox and how to analyze it

  • Express joint distribution by factorization

  • Understand (conditional) independence

  • Identify (conditional) independences for chain, fork, and collider

Key R Code Snippets

library(gRbase)
library(gRain)

graph.toys <- dag(c("Gender"), c("Toy","Gender"))
load("advanced-statistical-data-analysis/toys.rda")
gn <- grain(graph.toys, data = toys) # combine graph and data to a Bayesian network
# Conditional probability: from data
toys_girls <- toys[toys$Gender== "girl",]
cat("P(Toy = car | Gender = girl) =", sum(toys_girls== "car") / nrow(toys_girls))

# Conditional probability: from the Bayesian network
querygrain(gn, nodes=c("Toy", "Gender"), type="conditional")
P(Toy = car | Gender = girl) = 0.3915344
Loading...
# Joint probability: from data
n_car_girl <- sum(toys$Toy== "car" & toys$Gender== "girl")
cat("P(Toy = car, Gender = girl) =", n_car_girl/nrow(toys))

# Joint probability: from the Bayesian network
querygrain(gn, nodes=c("Toy", "Gender"), type="joint")
P(Toy = car, Gender = girl) = 0.1749409
Loading...
# marginal probabilities
querygrain(gn, nodes = "Gender", type = "marginal")
querygrain(gn, nodes = "Toy", type = "marginal")
Loading...
Loading...
# Manually using a Conditional Probability Table:
gender <- cptable("Gender", values = c(0.55, 0.45), levels = c("boy","girl"))
toy <- cptable("Toy", values = c(0.67, 0.33), levels = c("car","doll"))

toy_gender <- cptable(c("Toy", "Gender"),
  # car:boy = 0.89, doll:boy = 0.10, car:girl = 0.39, doll:girl = 0.60
  values = c(0.89, 0.10, 0.39, 0.60), levels = c("car","doll"))

# Building the Bayesian Network
cpt <- compileCPT(list(gender, toy_gender))
bn <- grain(cpt, compile = FALSE)
plot(bn)

querygrain(bn, nodes=c("Toy", "Gender"), type="conditional")
Loading...
Plot with title “”

D-Separation & Causal Effect Estimation

Theoretical Background

D-Separation:

X and Y are d-separated given Z if all paths between X and Y are blocked by Z if and only if:

  1. a middle node of a chain (X → B → Y) or a fork (X ← B → Y) is conditioned on (i.e. it is in Z)

  2. a middle node of a collider (X → B ← Y) is not conditioned on (i.e. is not in Z), and no descendant of B is in Z.

If Z blocks every path between two nodes X and Y, then X and Y are d-separated conditioning on Z and thus are conditionally independent on Z.

Adjustment Formula:

P(Ydo(X=x))=zP(YX=x,Z=z)P(Z=z)P(Y | \text{do}(X = x)) = \sum_z P(Y | X = x, Z = z) P(Z = z)

Backdoor Criterion: Z satisfies backdoor criterion if:

  1. Z blocks all backdoor paths from X to Y

  2. Z does not contain any descendants of X

Learning Objectives

  • Know D-Separation

  • Estimate causal effect using the adjustment formula

  • Define adjustment sets with the backdoor criterion

library(gRbase)
# List all nodes with their parents
g <- dag("A",c("B","A","E"),c("C","A","B","D"),c("D","B"),c("E"))
d.separates <- function(a,b,c,dag){
    separates(a,b,c, moralize(ancestralGraph(union(union(a,b),c),dag)))
}
plot(g)
cat("Are E and D d-separated by Z = {B}?", d.separates("E", "D", c("B"), g), "\n")
cat("Are E and C d-separated by Z = {A}?", d.separates("E", "C", c("A"), g), "\n")
cat("Are A and D d-separated by Z = {B, C}?", d.separates("A", "D", c("B", "C"), g), "\n")
Are E and D d-separated by Z = {B}? TRUE 
Are E and C d-separated by Z = {A}? FALSE 
Are A and D d-separated by Z = {B, C}? FALSE 
Plot with title “”

Structural Causal Models

Theoretical Background

Linear SCM:

X=fX(EX)W=5X+10Z+EWY=12W+EY\begin{aligned} X &= f_X(E_X) \\ W &= 5X + 10Z + E_W \\ Y &= 12W + E_Y \end{aligned}

Direct vs. Total Effects:

  • Total: Effect through all paths

  • Direct: Effect through direct path only

Learning Objectives

  • Be familiar with (linear) structural causal models

  • Understand direct and total causal effects

  • Estimate direct and total causal effect from linear SCM

Key R Code Snippets

library(gRbase)
library(igraph)

set.seed(253)

g <- dag(
  c("Study", "Teacher", "Prior", "Attendance"),
  c("Attendance", "Prior", "Teacher"),
  c("Exam", "Study", "Prior")
)
E(g)$weight <- c(-5, -3, -3, -2, 4, 8, 3)
plot(g, edge.label = E(g)$weight)

n <- 1000000  # Sample size

# Simulate data from the SCM (using the DAG's coefficients)
prior      <- rnorm(n)
teacher    <- rnorm(n)
attendance <- -2 * prior + 4 * teacher + rnorm(n)
study      <- -3 * prior - 5 * teacher + -3 * attendance + rnorm(n)
exam       <- 3 * prior + 8 * study + rnorm(n)

# Estimate total causal effect of attendance on exam
fit <- lm(exam ~ attendance + prior + teacher)
coef(fit)["attendance"]  # Returns the estimated effect (-24 for large n)
Loading...
Plot with title “”

Advanced Topics

Theoretical Background

Instrumental Variables:

  1. Relevance: Cov(Z,X)0\text{Cov}(Z, X) \neq 0

  2. Exclusion: Z has no direct effect on Y except through X

  3. Exogeneity: Cov(Z,E)=0\text{Cov}(Z, E) = 0

Two-Stage Least Squares:

  1. First stage: X=π0+π1Z+UX = \pi_0 + \pi_1 Z + U

  2. Second stage: Y=β0+β1X^+EY = \beta_0 + \beta_1 \hat{X} + E

Counterfactuals: Pearl’s three-step method:

  1. Abduction: Use the observed information to determine the values of the noise variables.

  2. Action: Modify the causal model M by removing the structural equation for the variable X and replacing with X = x.

  3. Prediction: Use the modified model MxM_x to compute the counterfactual.

library(gRain)

g <- dag(
  c("Temp"),
  c("Adv", "Temp"),
  c("Sales", "Temp", "Adv")
)

load("advanced-statistical-data-analysis/ice_cream.rda")
lm.adv <- lm(Adv ~ Temp, data = ice_cream)
lm.sales <- lm(Sales ~ Temp + Adv, data = ice_cream)
print(lm.adv)
print(lm.sales)

E(g)$weight <- c(1.74, 22.293, 2.032)
plot(g, edge.label = E(g)$weight)

# look at the observation with the highest daily temperature:
ice_cream[ice_cream$Temp == max(ice_cream$Temp), ]
day19 <- ice_cream[19,]

# Counterfactuals:

# 1. Abduction:
Es <- day19$Sales - (22.293 * day19$Temp + 2.032 * day19$Adv)
Ea <- day19$Adv - (1.74 * day19$Temp)
Et <- day19$Temp

# how many ice creams would have been sold on that day, if
# temperature would have been 30? -> do(Temp = 30):

# 2. Action:
temp_c1 <- 30
adv_c1 <- 1.74 * temp_c1 + Ea

# 3. Prediction:
sales_c1 <- 22.293 * temp_c1 + 2.032 * adv_c1 + Es
cat("if the temperature would have been", temp_c1, "then", sales_c1, "ice creams would have been sold", "\n")

# do(Adv = 0):
# how many ice creams would have been sold on that day, if
# there would have been no advertisement? -> do(Adv = 0):

# 2. Action:
temp_c2 <- day19$Temp
adv_c2 <- 0

# 3. Prediction:
sales_c2 <- 22.293 * temp_c2 + 2.032 * adv_c2 + Es
cat("without advertisement on that day,", sales_c2, "ice creams would have been sold", "\n")

# Use linear regression to predict the average number of ice creams sold for the counterfactuals
x0 <- data.frame(
  Temp = c(30, day19$Temp),
  Adv  = c(day19$Adv, 0))
cat("Average number of ice creams sold for days with given temperature and advertisement: ", predict(lm.sales, newdata = x0), "\n")

# The prediction describes the average number of ice creams. In contrast,
# the counterfactual describes how many ice creams would have been sold on the
# specific day (corresponding to data point 19) if the temperature
# and advertising had been different.

Call:
lm(formula = Adv ~ Temp, data = ice_cream)

Coefficients:
(Intercept)         Temp  
     110.94         1.74  


Call:
lm(formula = Sales ~ Temp + Adv, data = ice_cream)

Coefficients:
(Intercept)         Temp          Adv  
   -258.252       22.293        2.032  

Loading...
if the temperature would have been 30 then 779.1024 ice creams would have been sold 
without advertisement on that day, 530.5143 ice creams would have been sold 
Average number of ice creams sold for days with given temperature and advertisement:  764.971 501.93 
Plot with title “”

Markov Equivalence:

  • Chains and Forks are Markov equivalent, i.e. their graphs imply the same conditional independencies.

  • Colliders encode a different independence.

→ Colliders are identifiable from data, whereas fork and chains are not
→ All graphs where we can flip the arrows without introducing new colliders and keeping all existing colliders are Markow Equivalent.

Markov Equivalence can be represented by a CPDAG.

Completed Partially DAG (CPDAG)

represents the Markov equivalence class of DAGs

  • Directed edge: causal direction is clear (all graphs agree)

  • Undirected edge: causal direction is ambiguous (graphs disagree)

  • Encodes what can be identified from data alone

CPDAG

Learning Objectives

  • Understand instrumental variables

  • Be familiar with counterfactual reasoning

  • Know the three steps in computing counterfactuals

  • Understand Markov Equivalence

Key R Code Snippets

library(AER)
model_iv <- ivreg(Y ~ X + W | Z + W, data = mydata)
summary(model_iv)
first_stage <- lm(X ~ Z + W, data = mydata)
summary(first_stage)
library(pcalg)
dag1 <- dag(c("X", "Z", "Y"), list(c("X", "Z"), c("Z", "Y")))
dag2 <- dag(c("X", "Z", "Y"), list(c("Z", "X"), c("Z", "Y")))
meq(dag1, dag2)

Causal Structure Learning

Theoretical Background

Assumptions of Independence-Based Methods:

  • Causal sufficiency, i.e., no unobserved confounding variables.

  • Acyclicity: No directed cycles in the graph

  • Markov: Graph → Data, i.e., if X and Y are d-separated by Z in the graph, then X and Y are conditionally independent given Z.

  • Faithfulness: Data → Graph, i.e., if X and Y are conditionally independent given Z, then X and Y are d-separated by Z in the graph.

\to What about non-Gaussian structural equations? LinGAM to the rescue!

4 Steps of Causal Inference

  1. DAG: Create a causal model using expert knowledge.

  2. Identification: Determine whether and how the causal effect can be identified from the observational data.

  3. Estimation: Estimate the causal effect from the data.

  4. Validity: Test the estimated causal effect:

MethodIdeaGood if ...
Random CauseAdd random covariateEstimate stable
PlaceboRandomize treatment XEstimate 0\approx 0
Data Subsetcross-validation / bootstrapEstimates similar

Key R Code Snippets

# PC Algorithm:
library(pcalg)

# Example data generation
set.seed(239)
N <- 500000
X <- rnorm(N)
Z <- rnorm(N)
W <- 5 * X + 10 * Z + rnorm(N)
Y <- 12 * W + rnorm(N)
V <- 8 * W + rnorm(N)
dat <- cbind(X, Z, W, Y, V)

suffStat <- list(C = cor(dat), n = nrow(dat))
pc.fit <- pc(
    suffStat,
    indepTest = gaussCItest,
    alpha = 0.01,
    labels = colnames(dat),
    verbose = FALSE)
summary(pc.fit)  # There are 19 + 29 + 12 + 4 tests!!
plot(pc.fit, main = "") # Needs a title!
Object of class 'pcAlgo', from Call:
pc(suffStat = suffStat, indepTest = gaussCItest, alpha = 0.01, 
    labels = colnames(dat), verbose = FALSE)

Nmb. edgetests during skeleton estimation:
===========================================
Max. order of algorithm:  3 
Number of edgetests from m = 0 up to m = 3 :  19 29 12 4

Graphical properties of skeleton:
=================================
Max. number of neighbours:  2 at node(s) 3 
Avg. number of neighbours:  0.8 

Adjacency Matrix G:
  X Z W Y V
X . . 1 . .
Z . . 1 . .
W . . . 1 1
Y . . . . .
V . . . . .
Plot with title “”
# LiNGAM Algorithm:
library(pcalg)

# Linear SCM with uniform distributed noise
set.seed(145)
N <- 2000
x1 <- runif(N,-.5,.5)
x2 <- runif(N,-.3,3)
x3 <- x1 + x2 + runif(N,-.7,.7)
x4 <- 2*x1 + runif(N,-.5,.5)
x5 <- 3*x3 + runif(N,-1,1)
data <- cbind(x1,x2,x3,x4,x5)

# Apply LiNGAM
fit.ex2 <- lingam(data)

# Transpose the matrix to correct edge directions
adj_matrix <- t(fit.ex2$Bpruned)

# Convert to igraph and plot
g <- graph_from_adjacency_matrix(
  adj_matrix,
  mode = "directed",
  weighted = TRUE
)
plot(g, edge.label = round(E(g)$weight, 2))  # Show rounded weights
Plot with title “”
Footnotes
  1. Modified logit transformation to avoid divison by zero for full range of percentage (0-100%). Unmodified: logit(p)=log(p1p)logit(p) = \log\left(\frac{p}{1 - p}\right)