Advanced Statistical Data Analysis

Overview¶

The course focuses on regression modeling as a primary tool for explaining relationships between:

Response variable (output/dependent variable): The variable we want to explain or predict
Explanatory variables (inputs/independent variables/predictors): Variables that explain the response

Mathematical Formulation:

Y \approx f\left\langle X^{(1)}, X^{(2)}, \ldots, X^{(m)}\right\rangle + E

(1)

where $f\left\langle \cdot \right\rangle$ is the systematic component and $E$ is the random error term.

Symbols¶

Symbol	Meaning	Example
$Y$	Response variable	$Y_i$
$X$	Explanatory variable	$X_j$
$\beta$	Regression coefficient	$\beta_0$ , $\beta_1$
$\mu$	Expectation	$\mu = \mathbb{E}[Y]$
$\sigma^2$	Variance	$\text{Var}(Y) = \sigma^2$
$\phi$	Dispersion parameter	$\phi V(\mu)$
$\eta$	Linear predictor	$\eta = \underline{X}^T \underline{\beta}$
$g(\cdot)$	Link function	$g(\mu) = \eta$
$P(\cdot)$	Probability	$P(Y \| X)$
$\text{do}(X=x)$	Intervention	$P(Y \| \text{do}(X=x))$

Advanced Regression Modelling¶

Multiple Linear Regression¶

Objectives of Regression Analysis¶

General description of data structure.
Assessment of the effect of explanatory variables on the response.
Prediction of future observations.

Error Assumptions¶

The standard assumptions for the error terms $\mathcal{E}_i$ are:

Stochastically independent.
Expectation zero and constant variance $\sigma^2$ (homoscedasticity).
Normally (Gaussian) distributed: $\mathcal{E}_i \sim \mathcal{N}(0, \sigma^2)$ .

Matrix Representation¶

To simplify notation, the regression equation Definition 2 is written in matrix form:

\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\mathcal{E}}

(3)

where:

$\mathbf{Y}$ is an $n \times 1$ vector of responses.
$\mathbf{X}$ is an $n \times p$ matrix of explanatory variables (including a column of 1s for the intercept).
$\boldsymbol{\beta}$ is a $p \times 1$ vector of unknown coefficients ( $p = m+1$ ).
$\boldsymbol{\mathcal{E}}$ is an $n \times 1$ vector of unobserved random variables.

Tukey’s First-Aid Transformations¶

Standard recommendations used to linearize relationships and stabilize variance when there is no specific domain theory to guide variable transformation. These should be applied to both explanatory variables and responses unless a valid reason exists to do otherwise:

Data Type	Recommended Transformation
Concentrations and Amounts	$\log(x)$
Count Data	$\sqrt{x}$
Counted Fractions / Shares	$\tilde{x} = \text{logit}(x) = \log\left(\frac{x + 0.005}{1.01 - x}\right)$ ^[1]

Model Fitting and Diagnostics¶

Least Squares Estimation¶

The coefficients $\boldsymbol{\beta}$ are estimated by minimizing the sum of squared residuals.

The OLS estimator is given by:

\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}

(4)

Model Adequacy (Residual Analysis)¶

Model adequacy is checked using diagnostic plots:

source("advanced-statistical-data-analysis/RFn_Plot-lmSim.R")
df <- read.table("advanced-statistical-data-analysis/Softdrink.dat", header=TRUE)

# Tukey's First-Aid Transformations:
df.tk <- data.frame(
  Time=log(df$Time),
  volume=sqrt(df$volume), # := number of cases of product stocked => count data
  distance=log(df$distance),
  location=df$location) # := categorical variable (text data)

# Linear Regression:
df.tk.lm1 <- lm(Time ~ volume + distance, data=df.tk)

# Residual Analysis:
par(mfrow=c(2,4))
plot(df.tk.lm1)
plot.lmSim(df.tk.lm1, SEED=4711)

Tukey-Anscombe Plot: Residuals vs. Fitted values to check for non-linearity or heteroscedasticity.
Normal Q-Q Plot: To check the normality assumption of errors.
Scale-Location Plot: To check for constant variance.
Residuals vs. Leverage: To identify influential observations (Cook’s Distance).

Model Formulation:

Y_i = \beta_0 + \beta_1 x_i^{(1)} + \beta_2 x_i^{(2)} + \cdots + \beta_m x_i^{(m)} + E_i, \quad i = 1, \ldots, n

(5)

Assumptions:

$E_i \sim \mathcal{N}(0, \sigma^2)$ (normally distributed errors)
$\mathbb{E}[E_i] = 0$ (zero expectation)
$\text{Var}(E_i) = \sigma^2$ (homoscedasticity)
$E_i$ are stochastically independent
No perfect multicollinearity

Matrix Notation:

\underline{Y} = \boldsymbol{X} \underline{\beta} + \underline{E}

(6)

Y = X\beta + E

(7)

Gauss-Markov Theorem: Under assumptions 1-4 (without normality), OLS is BLUE (Best Linear Unbiased Estimator).

Properties:

Unbiased: $\mathbb{E}[\hat{\underline{\beta}}] = \underline{\beta}$
Consistent: $\hat{\underline{\beta}} \xrightarrow{P} \underline{\beta}$
Asymptotic Normality: $\hat{\underline{\beta}} \sim \mathcal{N}(\underline{\beta}, \sigma^2 (\boldsymbol{X}^T \boldsymbol{X})^{-1})$

Advanced Topics in Linear Regression Modelling¶

Learning Objectives¶

Weighted Least Squares (for non-constant variance)
Robust Fitting (to handle outliers and leverage points)
Fitting Smooth Functions (using LOESS and splines)
Additive Regression Models
Model Building strategies

Problem: OLS is not robust¶

Measuring Robustness

How robust an estimator is, can be investigated by two simple measures:

influence function and gross error sensitivity: The (gross error) sensitivity is based on the influence function and measures the maximum effect of a single observation on the estimated value
breakdown point: returns the minimum proportion of data that can be altered without causing completely unreliable estimates

Hence an estimator with good robustness properties has:

a bounded gross error sensitivity and
a breakdown point of (about) 0.5

Robust Regression: From OLS to MM-estimator

Estimator	Purpose	Problem Addressed	Weights
OLS (Ordinary Least Squares)	Standard regression	None (assumes ideal conditions)	All weights = 1
WLS (Weighted Least Squares)	Efficiency improvement	Heteroscedasticity (non-constant variance)	Known from variance structure
M-estimator (Max Likelihood)	Robustness	Outliers in response	Data-driven from residuals
S-estimator (Scale, $\sigma$ )	Robust scale	Outliers + leverage points	Implicit in scale estimation
MM-estimator (Modified M-estimator)	Best of both	Outliers + leverage + efficiency	Two-step: S then M

We need S-estimators, M-estimators, and MM-estimators because real-world data often violates the assumptions of OLS regression. Therefore, we need robust estimators:

OLS has 0% breakdown point - even one bad data point can ruin your analysis
M-estimators alone can’t handle leverage points - outliers in predictors are still problematic
S-estimators are too computationally expensive for routine use
MM-estimators give us the best of both worlds - robustness AND practicality

Use MM-estimators as your default robust regression method. They provide protection against both outliers and leverage points while maintaining good statistical efficiency.

library(robustbase)
source("advanced-statistical-data-analysis/RFn_Plot-lmSim.R")
df <- read.table("advanced-statistical-data-analysis/Synthetic.dat", header=TRUE)

# Robust fit:
df.rlm <- lmrob(Y ~ x1 + x2, data=df, method="MM")

# Residual Analysis:
par(mfrow=c(2,4))
plot(df.tk.lm1)
plot.lmSim(df.tk.lm1, SEED=4711)

Logistic Regression (Binary Regression)¶

Theoretical Background¶

Logistic Regression Model:

P(Y_i = 1 | \underline{X}_i) = \pi_i = \frac{1}{1 + e^{-\eta_i}}, \quad \eta_i = \underline{X}_i^T \underline{\beta}

(9)

Logit Link:

\log\left\langle \frac{\pi_i}{1 - \pi_i} \right\rangle = \eta_i

(10)

Odds Ratio:

\text{OR} = e^{\beta_j} = \frac{\text{Odds}(Y=1 | X_j = x+1)}{\text{Odds}(Y=1 | X_j = x)}

(11)

Asymptotic Properties:

$\hat{\underline{\beta}} \sim \mathcal{N}(\underline{\beta}, \mathcal{I}^{-1})$
Wald tests valid for large samples

df <- read.table("advanced-statistical-data-analysis/turbines.dat", header = TRUE)

# glm format: glm(x ~ y, family=…, data=…)
df.glm <- glm(cbind(Fissures, Turbines-Fissures) ~ Hours, family=binomial, data=df)
coef(df.glm)

df.p <- data.frame(Hours=seq(100, 10000, by=100))
p.y <- predict(df.glm, newdata=N, type="response")
sunflowerplot(x=df$Hours, y=df$Fissures/df$Turbines, las=1, number=df$Turbines, xlim=c(0, 10000), ylim=c(0,1))
lines(p$Hours, p.y, col=3)

Loading...

Generalized Linear Models (GLM) - A Unifying Model Family¶

Learning Objectives¶

Identify members of the exponential family
Know the two basic elements that define the GLM
Fit GLMs in R and know the algorithm underlying it
Interpret R output of a GLM fit

Exponential Family¶

f(y_i; \mu_i, \phi) = \exp \left\langle \frac{y_i b(\mu_i) - c(\mu_i)}{\phi} w_i + d(y_i; \phi, w_i) \right\rangle

(12)

The dispersion parameter $\phi$ is connected to the variance
The value $w_i$ is a fixed known number (weighting)
The function $d(y_i; \phi, w_i)$ scales the probability to 1.
The functions $b(\mu_i)$ and $c(\mu_i)$ determine the distribution.

Expected Value:

\mu_i = \mathbb{E}[Y_i] = \frac{c'(\mu_i)}{b'(\mu_i)}

(13)

where $b'(\mu)$ and $c'(\mu)$ are the first derivatives with respect to $\mu$ .

Variance:

\text{Var}(Y_i) = \frac{\phi}{w_i} V(\mu_i), \quad V(\mu_i) = \frac{1}{b'(\mu_i)}

(14)

Common Distributions¶

Distribution	Range of $Y$	$\mathbb{E}\langle Y\rangle = \mu$	$\text{var}\langle Y\rangle$	$b\langle \mu\rangle$	$V\langle \mu\rangle$	$\phi$	$w$
Gaussian $\langle \mu, \sigma^2 \rangle$	$(-\infty, +\infty)$	$\mu$	$\sigma^2$	$\mu$	1	$\sigma^2$	1
Binomial $\langle m, \pi \rangle$	$\{0,1,2,\ldots,m\}$	$\pi$	$\frac{\pi(1-\pi)}{m}$	$\log\left\langle \frac{\mu}{1-\mu} \right\rangle$	$\mu(1-\mu)$	1	$m$
Poisson $\langle \lambda \rangle$	$\{0,1,2,\ldots\}$	$\lambda$	$\lambda$	$\log\langle \mu \rangle$	$\mu$	1	1
Gamma $\langle \alpha, \beta \rangle$	$(0, \infty)$	$\frac{\alpha}{\beta}$	$\frac{\alpha}{\beta^2}$	$-\frac{1}{\mu}$	$\mu^2$	$\frac{1}{\alpha}$	1
Inverse Gaussian	$(0, \infty)$	$\mu$	$\frac{\mu^3}{\lambda}$	$-\frac{1}{\mu^2}$	$\mu^3$	$\frac{1}{\lambda}$	1

Distribution: The probability distribution of the response variable $Y$ ;Defines the random component of the GLM
Range of $Y$ : The support (set of possible values) for the response variable; Determines which distribution is appropriate for your data
$\mathbb{E}\langle Y\rangle = \mu$ : The expected value (mean) of the response; The target of your regression model
$\text{var}\langle Y\rangle$ : The variance of the response variable; Determines precision of estimates
$b\langle \mu\rangle$ : The canonical link function from the exponential family density; $\eta = b'(\mu)$
$V\langle \mu\rangle$ : The variance function; Connects mean to variance: $var(Y) = \frac{\phi}{w} V(\mu)$
$\phi$ : The dispersion parameter; Scales the variance: larger $\phi$ means more spread
$w$ : The prior weight for each observation; Known constant for each observation (e.g., $m$ for binomial)

Normal/Gaussian: $\mathcal{N}(\mu, \sigma^2)$

$\mathbb{E}[Y] = \mu$
$Var(Y) = \sigma^2$

glm(y ~ x1 + x2, family = gaussian)
# or simply: lm(y ~ x1 + x2)

$\text{Binomial}(m, \pi)$

$\mathbb{E}[Y] = m\pi$
$Var(Y) = m\pi(1-\pi)$

# Binary data
glm(y ~ x1 + x2, family = binomial)

# Aggregated data (successes and failures)
glm(cbind(success, failure) ~ x1 + x2, family = binomial)

$\text{Poisson}(\lambda)$

$\mathbb{E}[Y] = \lambda$
$Var(Y) = \lambda$

glm(y ~ x1 + x2, family = poisson)

$\text{Gamma}(\alpha, \beta)$

$\mathbb{E}[Y] = \alpha/\beta$
$Var(Y) = \alpha/\beta^2$

glm(y ~ x1 + x2, family = Gamma)

Inverse Gaussian: $IG(\mu, \lambda)$

$\mathbb{E}[Y] = \mu$
$Var(Y) = \mu^3 / \lambda$

glm(y ~ x1 + x2, family = inverse.gaussian)

Distribution Decision Tree¶

Check variance: For Poisson, verify mean ≈ variance. If variance > mean, consider Negative Binomial (not in exponential family)
Check range: Ensure response values are valid for the distribution (e.g., Poisson requires y ≥ 0)
Check skewness: Right-skewed data may need Gamma or Inverse Gaussian
Check zeros: Excess zeros may require zero-inflated models

GLM Structure¶

The Generalized Linear Model (GLM) is defined by two fundamental elements:

Distributional Element
The distribution of the response $Y_i$ , given the explanatory variables $\underline{x}_i$ , belongs to the two-parameter exponential family with:

Expectation: $\mathbb{E}\langle Y_i \mid \underline{x}_i \rangle = \mu_i$
Dispersion: $\phi$ The responses $Y_i$ , $i = 1, \ldots, n$ , are independently distributed.

Structural Element
The expectation $\mu_i$ is related to the linear predictor $\eta_i = \underline{x}_i^T \underline{\beta}$ of the explanatory variables by applying a (possibly nonlinear) function $g\left\langle \cdot \right\rangle$ on $\mu_i$ :
$g\left\langle \mu_i \right\rangle = \eta_i = \underline{x}_i^T \underline{\beta}.$
(15)
The function $g\left\langle \cdot \right\rangle$ is called the link function.

The estimator for $\underline{\beta}$ is derived based on the maximum likelihood principle:

It is defined by a nonlinear system of equations
It can be solved most efficiently by the IRLS algorithm

This two-element structure (distributional + structural) is what unifies various regression models (linear, logistic, Poisson, etc.) under the GLM framework.

IRLS Algorithm for GLM Fitting¶

The Iteratively Reweighted Least Squares (IRLS) algorithm fits Generalized Linear Models through the following iterative steps:

Initialize
Put $\hat{\mu}_i = Y_i$
Ensure Valid Weights
If necessary, modify $\hat{\mu}_i$ so that the weights $\tilde{w}_i$ below are non-zero (e.g., as done by the empirical logits)
Calculate Weights
$\tilde{w}_i \coloneqq \frac{w_i}{V\left\langle \hat{\mu}_i \right\rangle \left( g'\left\langle \hat{\mu}_i \right\rangle \right)^2}$
(16)
Form Adjusted Response
$z_i \coloneqq \underline{x}_i^T \hat{\underline{\beta}} + \left( Y_i - \hat{\mu}_i \right) g'\left\langle \hat{\mu}_i \right\rangle$
(17)
Weighted Least Squares
Regress $\sqrt{\tilde{w}_i} z_i$ on $\sqrt{\tilde{w}_i} \underline{x}_i$ by least squares to obtain $\hat{\underline{\beta}}$ . That is, solve the system of linear equations:
$\sum_{i=1}^n \tilde{w}_i \left( z_i - \underline{x}_i^T \hat{\underline{\beta}} \right) \underline{x}_i = \underline{0} \quad \text{for } \hat{\underline{\beta}}.$
(18)
Update Fitted Values
Calculate the fitted values:
$\hat{\mu}_i = g^{-1}\left\langle \underline{x}_i^T \hat{\underline{\beta}} \right\rangle$
(19)
Iterate
Return to step 2 and iterate until convergence

The dispersion parameter $\phi$ must be estimated as well:

It can also be done by maximizing the likelihood.
As we know from multiple linear regression, we rather use an unbiased estimator for estimating the dispersion parameter $\phi = \sigma^2$
In the case of binomial and Poisson distribution: $\phi = 1$
For other distributions: see table above

Key R Code Snippets¶

Week 6: Inference, Model Simplification, Variable Selection¶

Theoretical Background¶

Deviance:

D(\underline{y}, \hat{\underline{\mu}}) = 2 \sum_i \left\langle \ell(y_i; y_i) - \ell(y_i; \hat{\mu}_i) \right\rangle

(20)

Deviance Test:

\Delta D = D_0 - D_1 \sim \chi^2_{df_1 - df_0}

(21)

Overdispersion:

\hat{\phi} = \frac{\text{Residual Deviance}}{n - p}

(22)

Confidence Intervals:

Wald: $\hat{\beta}_j \pm z_{\alpha/2} \cdot \text{SE}(\hat{\beta}_j)$
Deviance-based: More accurate for small samples
Bootstrap: Nonparametric

Learning Objectives¶

Know what deviances are in GLM
Know what overdispersion is and can identify it
Know when to apply Wald-type vs deviance-based confidence intervals
Apply methods in statistical data analysis using R

Key R Code Snippets¶

model <- glm(y ~ x1 + x2, family = poisson, data = mydata)
null_deviance <- model$null.deviance
resid_deviance <- model$deviance
anova(model_simple, model_complex, test = "LRT")

model_poisson <- glm(y ~ x1 + x2, family = poisson, data = mydata)
dispersion <- model_poisson$deviance / model_poisson$df.residual
if (dispersion > 1.5) {
  model_quasi <- glm(y ~ x1 + x2, family = quasipoisson, data = mydata)
  library(MASS)
  model_nb <- glm.nb(y ~ x1 + x2, data = mydata)
}

AIC(model1, model2, model3)
BIC(model1, model2, model3)
step(model1, direction = "both")

Week 7: Diagnostics / Model Adequacy Checking¶

Theoretical Background¶

Diagnostic Plots:

Residuals vs Fitted: Check linearity and homoscedasticity
Normal Q-Q: Check normality
Scale-Location: Check constant variance
Leverage: Identify influential points

Multicollinearity:

High VIF (> 5 or 10) indicates problematic collinearity
Effects: Unstable estimates, inflated standard errors

Learning Objectives¶

Understand how AIC is generalized to GLMs
Check model adequacy and determine which assumptions are violated
Find appropriate transformations of predictors in a data-driven manner
Apply methods in statistical data analysis using R

Key R Code Snippets¶

AIC(model1, model2)

par(mfrow = c(2, 2))
plot(model_poisson, which = 1:4)
library(halfnorm)
halfnorm(residuals(model_poisson, type = "deviance"))

library(car)
cooks.distance(model1)
hat.values(model1)
dfbetas(model1)

Week 8: Some Extensions of GLM¶

Theoretical Background¶

Rate Models:

\mathbb{E}[Y_i] = \mu_i = t_i \lambda_i

(23)

For Poisson with log link:

\log(\mu_i) = \log(t_i) + \log(\lambda_i) = \underline{x}_i^T \underline{\beta}

(24)

Quasi Models:

\text{Var}(Y_i) = \phi V(\mu_i)

(25)

Consistent $\hat{\underline{\beta}}$ even if variance function misspecified
Standard errors adjusted by $\hat{\phi}$

Instrumental Variables:

First Stage: $X = \pi_0 + \pi_1 Z + U$
Second Stage: $Y = \beta_0 + \beta_1 \hat{X} + E$

Learning Objectives¶

Know what a rate model is and how to analyse it with GLM
Know how a quasi model extends a GLM and when to apply it
Fit methods to data, interpret results, make inference and predictions

Key R Code Snippets¶

model_rate <- glm(y ~ x1 + x2, family = poisson, offset = log(t), data = mydata)

model_quasipoisson <- glm(y ~ x1 + x2, family = quasipoisson, data = mydata)
model_quasibinomial <- glm(cbind(Y, m-Y) ~ x1 + x2, family = quasibinomial, data = mydata)

library(AER)
model_iv <- ivreg(Y ~ X + W | Z + W, data = mydata)
summary(model_iv)

Causality¶

Introduction¶

Theoretical Background¶

The Ladder of Causation (Pearl):

Association (Seeing): Observing patterns
Intervention (Doing): Estimating effects of actions
Counterfactuals (Imagining): Reasoning about hypotheticals

Notation:

$P(Y | X = x)$ : Conditional probability
$P(Y | \text{do}(X = x))$ : Causal effect

Causal Graphical Models:

Nodes: Variables
Edges: Causal relationships
DAG: Directed Acyclic Graph (no cycles)

Graphical Structures:

Chain: $X → Z → Y$

$X$ and $Y$ are conditionally independent given Z; $(X \perp\!\!\!\perp Y \mid Z)$

Fork: $X ← Z → Y$

$X$ and $Y$ are conditionally independent given Z; $(X \perp\!\!\!\perp Y \mid Z)$

Collider: $X → Z ← Y$

$X$ and $Y$ are independent $(X \perp\!\!\!\perp Y)$ , but are conditionally dependent on Z and any descendants of Z; $(X \not\perp\!\!\!\perp Y \mid Z)$

X \perp\!\!\!\perp Y \mid Z \text{ if all paths between X and Y are blocked by Z}

(26)

Learning Objectives¶

Understand when and why causal reasoning is important
Be familiar with causal graphical models
Understand the difference between association, interventions, and counterfactuals
Know the difference between experimental and observational data

Intervention changes distribution¶

Key R Code Snippets¶

library(gRbase)
g <- dag(
  c("A"),
  c("B", "A", "E"), # = root node + all incoming node-edges
  c("C", "A", "B", "D"),
  c("D", "B"),
  c("E")
)

cat("Children of B = ", children("B", g), "\n")
cat("Parents of B = ", parents("B", g), "\n")
cat("Ancestors of B = ", ancestralSet("B", g), "\n")

Children of B =  C D 
Parents of B =  A E 
Ancestors of B =  A B E

m <- as(g, "matrix") # to matrix from graph
m
g1 <- as(m, "igraph") # to igraph from matrix

Loading...

# Intervention on B (i.e. remove edges):
g2 <- removeEdge("A", "B", removeEdge("E", "B", g1))
plot(g2)

Simpson’s Paradox & Graphical Models¶

Theoretical Background¶

Simpson’s Paradox: A trend appears in different groups but disappears or reverses when combined.

Factorization:

Chain: $P(X,Z,Y) = P(X)P(Z|X)P(Y|Z)$
Fork: $P(X,Z,Y) = P(Z)P(X|Z)P(Y|Z)$
Collider: $P(X,Z,Y) = P(X)P(Y)P(Z|X,Y)$

Bayesian Network Factorization:

P(X_1, \ldots, X_n) = \prod_{i=1}^n P(X_i | \text{Parents}(X_i))

(27)

Learning Objectives¶

Understand Simpson’s paradox and how to analyze it
Express joint distribution by factorization
Understand (conditional) independence
Identify (conditional) independences for chain, fork, and collider

Key R Code Snippets¶

library(gRbase)
library(gRain)

graph.toys <- dag(c("Gender"), c("Toy","Gender"))
load("advanced-statistical-data-analysis/toys.rda")
gn <- grain(graph.toys, data = toys) # combine graph and data to a Bayesian network

# Conditional probability: from data
toys_girls <- toys[toys$Gender== "girl",]
cat("P(Toy = car | Gender = girl) =", sum(toys_girls== "car") / nrow(toys_girls))

# Conditional probability: from the Bayesian network
querygrain(gn, nodes=c("Toy", "Gender"), type="conditional")

P(Toy = car | Gender = girl) = 0.3915344

Loading...

# Joint probability: from data
n_car_girl <- sum(toys$Toy== "car" & toys$Gender== "girl")
cat("P(Toy = car, Gender = girl) =", n_car_girl/nrow(toys))

# Joint probability: from the Bayesian network
querygrain(gn, nodes=c("Toy", "Gender"), type="joint")

P(Toy = car, Gender = girl) = 0.1749409

Loading...

# marginal probabilities
querygrain(gn, nodes = "Gender", type = "marginal")
querygrain(gn, nodes = "Toy", type = "marginal")

Loading...

# Manually using a Conditional Probability Table:
gender <- cptable("Gender", values = c(0.55, 0.45), levels = c("boy","girl"))
toy <- cptable("Toy", values = c(0.67, 0.33), levels = c("car","doll"))

toy_gender <- cptable(c("Toy", "Gender"),
  # car:boy = 0.89, doll:boy = 0.10, car:girl = 0.39, doll:girl = 0.60
  values = c(0.89, 0.10, 0.39, 0.60), levels = c("car","doll"))

# Building the Bayesian Network
cpt <- compileCPT(list(gender, toy_gender))
bn <- grain(cpt, compile = FALSE)
plot(bn)

querygrain(bn, nodes=c("Toy", "Gender"), type="conditional")

Loading...

D-Separation & Causal Effect Estimation¶

Theoretical Background¶

D-Separation:

X and Y are d-separated given Z if all paths between X and Y are blocked by Z if and only if:

a middle node of a chain (X → B → Y) or a fork (X ← B → Y) is conditioned on (i.e. it is in Z)
a middle node of a collider (X → B ← Y) is not conditioned on (i.e. is not in Z), and no descendant of B is in Z.

If Z blocks every path between two nodes X and Y, then X and Y are d-separated conditioning on Z and thus are conditionally independent on Z.

Adjustment Formula:

P(Y | \text{do}(X = x)) = \sum_z P(Y | X = x, Z = z) P(Z = z)

(28)

Backdoor Criterion: Z satisfies backdoor criterion if:

Z blocks all backdoor paths from X to Y
Z does not contain any descendants of X

Learning Objectives¶

Know D-Separation
Estimate causal effect using the adjustment formula
Define adjustment sets with the backdoor criterion

library(gRbase)
# List all nodes with their parents
g <- dag("A",c("B","A","E"),c("C","A","B","D"),c("D","B"),c("E"))
d.separates <- function(a,b,c,dag){
    separates(a,b,c, moralize(ancestralGraph(union(union(a,b),c),dag)))
}
plot(g)
cat("Are E and D d-separated by Z = {B}?", d.separates("E", "D", c("B"), g), "\n")
cat("Are E and C d-separated by Z = {A}?", d.separates("E", "C", c("A"), g), "\n")
cat("Are A and D d-separated by Z = {B, C}?", d.separates("A", "D", c("B", "C"), g), "\n")

Are E and D d-separated by Z = {B}? TRUE 
Are E and C d-separated by Z = {A}? FALSE 
Are A and D d-separated by Z = {B, C}? FALSE

Structural Causal Models¶

Theoretical Background¶

Linear SCM:

\begin{aligned} X &= f_X(E_X) \\ W &= 5X + 10Z + E_W \\ Y &= 12W + E_Y \end{aligned}

(30)

Direct vs. Total Effects:

Total: Effect through all paths
Direct: Effect through direct path only

Learning Objectives¶

Be familiar with (linear) structural causal models
Understand direct and total causal effects
Estimate direct and total causal effect from linear SCM

Key R Code Snippets¶

library(gRbase)
library(igraph)

set.seed(253)

g <- dag(
  c("Study", "Teacher", "Prior", "Attendance"),
  c("Attendance", "Prior", "Teacher"),
  c("Exam", "Study", "Prior")
)
E(g)$weight <- c(-5, -3, -3, -2, 4, 8, 3)
plot(g, edge.label = E(g)$weight)

n <- 1000000  # Sample size

# Simulate data from the SCM (using the DAG's coefficients)
prior      <- rnorm(n)
teacher    <- rnorm(n)
attendance <- -2 * prior + 4 * teacher + rnorm(n)
study      <- -3 * prior - 5 * teacher + -3 * attendance + rnorm(n)
exam       <- 3 * prior + 8 * study + rnorm(n)

# Estimate total causal effect of attendance on exam
fit <- lm(exam ~ attendance + prior + teacher)
coef(fit)["attendance"]  # Returns the estimated effect (-24 for large n)

Loading...

Advanced Topics¶

Theoretical Background¶

Instrumental Variables:

Relevance: $\text{Cov}(Z, X) \neq 0$
Exclusion: Z has no direct effect on Y except through X
Exogeneity: $\text{Cov}(Z, E) = 0$

Two-Stage Least Squares:

First stage: $X = \pi_0 + \pi_1 Z + U$
Second stage: $Y = \beta_0 + \beta_1 \hat{X} + E$

Counterfactuals: Pearl’s three-step method:

Abduction: Use the observed information to determine the values of the noise variables.
Action: Modify the causal model M by removing the structural equation for the variable X and replacing with X = x.
Prediction: Use the modified model $M_x$ to compute the counterfactual.

library(gRain)

g <- dag(
  c("Temp"),
  c("Adv", "Temp"),
  c("Sales", "Temp", "Adv")
)

load("advanced-statistical-data-analysis/ice_cream.rda")
lm.adv <- lm(Adv ~ Temp, data = ice_cream)
lm.sales <- lm(Sales ~ Temp + Adv, data = ice_cream)
print(lm.adv)
print(lm.sales)

E(g)$weight <- c(1.74, 22.293, 2.032)
plot(g, edge.label = E(g)$weight)

# look at the observation with the highest daily temperature:
ice_cream[ice_cream$Temp == max(ice_cream$Temp), ]
day19 <- ice_cream[19,]

# Counterfactuals:

# 1. Abduction:
Es <- day19$Sales - (22.293 * day19$Temp + 2.032 * day19$Adv)
Ea <- day19$Adv - (1.74 * day19$Temp)
Et <- day19$Temp

# how many ice creams would have been sold on that day, if
# temperature would have been 30? -> do(Temp = 30):

# 2. Action:
temp_c1 <- 30
adv_c1 <- 1.74 * temp_c1 + Ea

# 3. Prediction:
sales_c1 <- 22.293 * temp_c1 + 2.032 * adv_c1 + Es
cat("if the temperature would have been", temp_c1, "then", sales_c1, "ice creams would have been sold", "\n")

# do(Adv = 0):
# how many ice creams would have been sold on that day, if
# there would have been no advertisement? -> do(Adv = 0):

# 2. Action:
temp_c2 <- day19$Temp
adv_c2 <- 0

# 3. Prediction:
sales_c2 <- 22.293 * temp_c2 + 2.032 * adv_c2 + Es
cat("without advertisement on that day,", sales_c2, "ice creams would have been sold", "\n")

# Use linear regression to predict the average number of ice creams sold for the counterfactuals
x0 <- data.frame(
  Temp = c(30, day19$Temp),
  Adv  = c(day19$Adv, 0))
cat("Average number of ice creams sold for days with given temperature and advertisement: ", predict(lm.sales, newdata = x0), "\n")

# The prediction describes the average number of ice creams. In contrast,
# the counterfactual describes how many ice creams would have been sold on the
# specific day (corresponding to data point 19) if the temperature
# and advertising had been different.


Call:
lm(formula = Adv ~ Temp, data = ice_cream)

Coefficients:
(Intercept)         Temp  
     110.94         1.74  


Call:
lm(formula = Sales ~ Temp + Adv, data = ice_cream)

Coefficients:
(Intercept)         Temp          Adv  
   -258.252       22.293        2.032

Loading...

if the temperature would have been 30 then 779.1024 ice creams would have been sold 
without advertisement on that day, 530.5143 ice creams would have been sold 
Average number of ice creams sold for days with given temperature and advertisement:  764.971 501.93

Markov Equivalence:

Chains and Forks are Markov equivalent, i.e. their graphs imply the same conditional independencies.
Colliders encode a different independence.

→ Colliders are identifiable from data, whereas fork and chains are not
→ All graphs where we can flip the arrows without introducing new colliders and keeping all existing colliders are Markow Equivalent.

Markov Equivalence can be represented by a CPDAG.

Completed Partially DAG (CPDAG)

represents the Markov equivalence class of DAGs

Directed edge: causal direction is clear (all graphs agree)
Undirected edge: causal direction is ambiguous (graphs disagree)
Encodes what can be identified from data alone

Learning Objectives¶

Understand instrumental variables
Be familiar with counterfactual reasoning
Know the three steps in computing counterfactuals
Understand Markov Equivalence

Key R Code Snippets¶

library(AER)
model_iv <- ivreg(Y ~ X + W | Z + W, data = mydata)
summary(model_iv)
first_stage <- lm(X ~ Z + W, data = mydata)
summary(first_stage)

library(pcalg)
dag1 <- dag(c("X", "Z", "Y"), list(c("X", "Z"), c("Z", "Y")))
dag2 <- dag(c("X", "Z", "Y"), list(c("Z", "X"), c("Z", "Y")))
meq(dag1, dag2)

Causal Structure Learning¶

Theoretical Background¶

Assumptions of Independence-Based Methods:

Causal sufficiency, i.e., no unobserved confounding variables.
Acyclicity: No directed cycles in the graph
Markov: Graph → Data, i.e., if X and Y are d-separated by Z in the graph, then X and Y are conditionally independent given Z.
Faithfulness: Data → Graph, i.e., if X and Y are conditionally independent given Z, then X and Y are d-separated by Z in the graph.

$\to$ What about non-Gaussian structural equations? LinGAM to the rescue!

4 Steps of Causal Inference¶

DAG: Create a causal model using expert knowledge.
Identification: Determine whether and how the causal effect can be identified from the observational data.
Estimation: Estimate the causal effect from the data.
Validity: Test the estimated causal effect:

Method	Idea	Good if ...
Random Cause	Add random covariate	Estimate stable
Placebo	Randomize treatment X	Estimate $\approx 0$
Data Subset	cross-validation / bootstrap	Estimates similar

Key R Code Snippets¶

# PC Algorithm:
library(pcalg)

# Example data generation
set.seed(239)
N <- 500000
X <- rnorm(N)
Z <- rnorm(N)
W <- 5 * X + 10 * Z + rnorm(N)
Y <- 12 * W + rnorm(N)
V <- 8 * W + rnorm(N)
dat <- cbind(X, Z, W, Y, V)

suffStat <- list(C = cor(dat), n = nrow(dat))
pc.fit <- pc(
    suffStat,
    indepTest = gaussCItest,
    alpha = 0.01,
    labels = colnames(dat),
    verbose = FALSE)
summary(pc.fit)  # There are 19 + 29 + 12 + 4 tests!!
plot(pc.fit, main = "") # Needs a title!

Object of class 'pcAlgo', from Call:
pc(suffStat = suffStat, indepTest = gaussCItest, alpha = 0.01, 
    labels = colnames(dat), verbose = FALSE)

Nmb. edgetests during skeleton estimation:
===========================================
Max. order of algorithm:  3 
Number of edgetests from m = 0 up to m = 3 :  19 29 12 4

Graphical properties of skeleton:
=================================
Max. number of neighbours:  2 at node(s) 3 
Avg. number of neighbours:  0.8 

Adjacency Matrix G:
  X Z W Y V
X . . 1 . .
Z . . 1 . .
W . . . 1 1
Y . . . . .
V . . . . .

# LiNGAM Algorithm:
library(pcalg)

# Linear SCM with uniform distributed noise
set.seed(145)
N <- 2000
x1 <- runif(N,-.5,.5)
x2 <- runif(N,-.3,3)
x3 <- x1 + x2 + runif(N,-.7,.7)
x4 <- 2*x1 + runif(N,-.5,.5)
x5 <- 3*x3 + runif(N,-1,1)
data <- cbind(x1,x2,x3,x4,x5)

# Apply LiNGAM
fit.ex2 <- lingam(data)

# Transpose the matrix to correct edge directions
adj_matrix <- t(fit.ex2$Bpruned)

# Convert to igraph and plot
g <- graph_from_adjacency_matrix(
  adj_matrix,
  mode = "directed",
  weighted = TRUE
)
plot(g, edge.label = round(E(g)$weight, 2))  # Show rounded weights