# Advanced Statistical Data Analysis

Lecture Notes
## Review of Multiple Linear Regression

### Initial Remarks
Regression analysis is used to model the relationship between a response variable and one or more explanatory variables, where the relationship is obscured by random noise.
### Objectives of Regression Analysis

1. General description of the data structure.
2. Assessment of the effect of the explanatory variables on the response.
3. Prediction of future observations.
### Error Assumptions

The standard assumptions for the error terms $\varepsilon_i$ are:

- Stochastically independent.
- Expectation zero and constant variance (homoscedasticity): $E[\varepsilon_i] = 0$ and $\operatorname{Var}(\varepsilon_i) = \sigma^2$.
- Normally (Gaussian) distributed: $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$.
### Matrix Representation

To simplify notation, the regression equation (Definition 1) is written in matrix form:

$$Y = X\beta + \varepsilon,$$

where:

- $Y$ is an $n \times 1$ vector of responses.
- $X$ is an $n \times (p+1)$ matrix of explanatory variables (including a column of 1s for the intercept).
- $\beta$ is a $(p+1) \times 1$ vector of unknown coefficients $(\beta_0, \beta_1, \dots, \beta_p)$.
- $\varepsilon$ is an $n \times 1$ vector of unobserved random variables (the errors).
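As a minimal numerical sketch of this matrix form (the dimensions, coefficient values, and variable names below are illustrative assumptions, not taken from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 50, 2                          # n observations, p explanatory variables
x = rng.uniform(0.0, 10.0, size=(n, p))
X = np.column_stack([np.ones(n), x])  # design matrix: column of 1s for the intercept, then the variables
beta = np.array([1.0, 2.0, -0.5])     # p + 1 unknown coefficients (illustrative values)
eps = rng.normal(0.0, 1.0, size=n)    # i.i.d. Gaussian errors, mean zero, constant variance

y = X @ beta + eps                    # Y = X beta + epsilon as one matrix expression

print(X.shape, beta.shape, y.shape)   # (50, 3) (3,) (50,)
```

The single matrix product replaces writing out $p+1$ coefficient terms per observation.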
### Tukey’s First-Aid Transformations
Standard recommendations used to linearize relationships and stabilize variance when there is no specific domain theory to guide variable transformation. These should be applied to both explanatory variables and responses unless a valid reason exists to do otherwise:
| Data Type | Recommended Transformation |
|---|---|
| Concentrations and amounts | $\log(x)$ |
| Count data | $\sqrt{x}$ |
| Counted fractions / shares | $\arcsin\left(\sqrt{x}\right)$ |
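A short sketch of applying these first-aid transformations with NumPy (the data values here are invented for illustration):

```python
import numpy as np

amounts = np.array([0.1, 1.0, 10.0, 100.0])   # concentrations / amounts
counts = np.array([0, 1, 4, 9, 100])          # count data
shares = np.array([0.05, 0.25, 0.5, 0.95])    # counted fractions in [0, 1]

t_amounts = np.log(amounts)            # log for amounts and concentrations
t_counts = np.sqrt(counts)             # square root for counts
t_shares = np.arcsin(np.sqrt(shares))  # arcsine of the square root for fractions

print(t_counts)  # [ 0.  1.  2.  3. 10.]
```

Note that the log requires strictly positive values; zeros in amount data are usually handled by a small shift before transforming.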
## Model Fitting and Diagnostics

### Least Squares Estimation

The coefficients are estimated by minimizing the sum of squared residuals:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \left(y_i - x_i^\top \beta\right)^2.$$

The OLS estimator is given by:

$$\hat{\beta} = (X^\top X)^{-1} X^\top Y.$$
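A sketch of computing this estimator with NumPy (the simulated data and true coefficients are assumptions for illustration). Solving the normal equations directly and using NumPy's built-in least-squares solver give the same result:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.uniform(0.0, 5.0, n)])  # intercept + one variable
beta_true = np.array([2.0, 0.7])                             # illustrative true coefficients
y = X @ beta_true + rng.normal(0.0, 0.3, n)

# Normal equations: solve (X'X) beta_hat = X'y rather than inverting X'X explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq solves the same least-squares problem, more stably for ill-conditioned X
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(beta_hat, beta_lstsq)
```

In practice the solver form is preferred over forming $(X^\top X)^{-1}$ explicitly, for numerical stability.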
### Model Adequacy (Residual Analysis)

Model adequacy is checked using diagnostic plots:

- Tukey-Anscombe plot: residuals vs. fitted values, to check for non-linearity or heteroscedasticity.
- Normal Q-Q plot: to check the normality assumption of the errors.
- Scale-location plot: to check for constant variance.
- Residuals vs. leverage: to identify influential observations (Cook’s distance).
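The quantities behind these plots can be computed directly. A minimal sketch on simulated data; the formulas for leverage, standardized residuals, and Cook's distance are the standard ones, not specific to these notes:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
X = np.column_stack([np.ones(n), rng.uniform(0.0, 5.0, n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, 1.0, n)

p = X.shape[1]                               # number of estimated coefficients
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
fitted = X @ beta_hat                        # x-axis of the Tukey-Anscombe plot
resid = y - fitted                           # y-axis of the Tukey-Anscombe plot

H = X @ np.linalg.solve(X.T @ X, X.T)        # hat matrix
leverage = np.diag(H)                        # h_ii, x-axis of the residuals-vs-leverage plot
sigma2 = resid @ resid / (n - p)             # unbiased estimate of the error variance
std_resid = resid / np.sqrt(sigma2 * (1.0 - leverage))      # standardized residuals
cooks_d = std_resid**2 * leverage / (p * (1.0 - leverage))  # Cook's distance per observation
```

Observations with both large standardized residuals and high leverage dominate `cooks_d` and warrant a closer look.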