Machine Learning

Authors
Affiliations
ZHAW School of Engineering

Summary and Key Takeaways

Core Principles

  1. Generalization: The ultimate goal is to create models that perform well on unseen data, not just the training data.

  2. Bias-Variance Tradeoff: Every model makes a tradeoff between underfitting (high bias) and overfitting (high variance).

  3. No Free Lunch: No single algorithm works best for all problems. Choose based on your data and problem characteristics.

  4. Feature Engineering: The quality of your features often matters more than the choice of algorithm.

  5. Evaluation: Always use proper evaluation techniques (train-test split, cross-validation) to assess model performance.

Practical Skills Acquired

Python Libraries Used

| Library | Purpose | Key Functions/Classes |
|---|---|---|
| numpy | Numerical computing | array, linspace, random, etc. |
| pandas | Data manipulation | DataFrame, Series, read_csv, etc. |
| matplotlib | Visualization | pyplot, figure, scatter, plot, etc. |
| seaborn | Statistical visualization | heatmap, boxplot, pairplot, etc. |
| scikit-learn | Machine learning | All ML algorithms, preprocessing, metrics |
| tensorflow/keras | Deep learning | Sequential, Dense, Conv2D, etc. |
| librosa | Audio processing | load, stft, mfcc, etc. |

Quick Reference

Common Preprocessing Steps

# 1. Load data
import pandas as pd
df = pd.read_csv('data.csv')

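# 2. Handle missing values
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())  # Numerical: column mean
df = df.fillna(df.mode().iloc[0])  # Categorical: most frequent value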

# 3. Encode categorical variables
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(drop='first', sparse_output=False)
X_encoded = encoder.fit_transform(df[['categorical_col']])

# 4. Scale numerical features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df.select_dtypes(include=['float64', 'int64']))

# 5. Train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Common Model Evaluation

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report,
    mean_squared_error, mean_absolute_error, r2_score
)

# Classification metrics
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")
print(f"F1: {f1_score(y_test, y_pred):.4f}")
print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred)}")
print(f"Classification Report:\n{classification_report(y_test, y_pred)}")

# Regression metrics
print(f"MSE: {mean_squared_error(y_test, y_pred):.4f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.4f}")
print(f"R²: {r2_score(y_test, y_pred):.4f}")

Common Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from scipy.stats import randint

# Grid Search
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20], 'min_samples_split': [2, 5, 10]}
grid_search = GridSearchCV(estimator, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.4f}")

# Random Search
param_dist = {'n_estimators': randint(50, 200), 'max_depth': [None] + list(randint(5, 50).rvs(10)), 'min_samples_split': randint(2, 20)}
random_search = RandomizedSearchCV(estimator, param_dist, n_iter=20, cv=5)
random_search.fit(X_train, y_train)

Introduction to Machine Learning

Educational Objectives

Key Concepts

ML Paradigms

Supervised Learning

Data: $(x, y)$, where $x$ is the input and $y$ is the label
Goal: Learn a function that maps $x \rightarrow y$
Example: Classifying apples vs. oranges

Unsupervised Learning

Data: $x$ (no labels)
Goal: Learn underlying structure in data
Example: Grouping similar items together

Reinforcement Learning

Data: State-action pairs
Goal: Maximize future rewards over time
Example: Learning to navigate an environment

Data Types

| Type | Description | Example |
|---|---|---|
| Numerical | Continuous or discrete numbers | Age, temperature |
| Categorical | Finite set of categories | Color, gender |
| Ordinal | Categories with order | Rating (1-5 stars) |
| Text | Natural language | Product reviews |
| Image | Pixel arrays | Photographs |
| Audio | Sound waveforms | Speech recordings |

Data Preprocessing Pipeline

Practical Example: Data Visualization

# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load data
data = pd.read_csv('data.csv')

# Basic visualization
plt.figure(figsize=(10, 6))
sns.boxplot(data=data.select_dtypes(include=['float64', 'int64']))
plt.title('Feature Distribution')
plt.xticks(rotation=45)
plt.show()

# Correlation matrix
plt.figure(figsize=(12, 8))
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm', center=0)  # numeric columns only
plt.title('Feature Correlation Matrix')
plt.show()

Supervised Learning & k-Nearest Neighbors

Educational Objectives

Key Concepts

k-Nearest Neighbors Algorithm

kNN is a simple, instance-based learning algorithm:

  1. Store all training data

  2. Calculate distance between new point and all training points

  3. Find k nearest neighbors

  4. Predict based on majority vote (classification) or average (regression)

$$
\text{Distance metrics:}\\
\text{Euclidean: } d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}\\
\text{Manhattan: } d(x, y) = \sum_{i=1}^{n} |x_i - y_i|\\
\text{Minkowski: } d(x, y) = \left(\sum_{i=1}^{n} |x_i - y_i|^p\right)^{1/p}
$$
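The four algorithm steps and the distance formulas above can be sketched in a few lines of NumPy. This is a minimal illustration, not the scikit-learn implementation; the helper names `minkowski_distance` and `knn_predict` and the toy data are made up for this example.

```python
import numpy as np
from collections import Counter

def minkowski_distance(a, b, p=2):
    """Minkowski distance; p=2 gives Euclidean, p=1 gives Manhattan."""
    return np.sum(np.abs(a - b) ** p, axis=-1) ** (1.0 / p)

def knn_predict(X_train, y_train, x_new, k=3, p=2):
    """Classify x_new by majority vote among its k nearest training points."""
    distances = minkowski_distance(X_train, x_new, p=p)  # distance to every stored point
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())           # count labels among the neighbors
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated groups
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.5, 0.5]), k=3))
print(knn_predict(X_toy, y_toy, np.array([5.5, 5.5]), k=3))
```

For regression, the majority vote would be replaced by the mean of the neighbors' target values.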

Choosing k

Distance Measures

| Measure | Formula | When to Use |
|---|---|---|
| Euclidean | $\sqrt{\sum (x_i - y_i)^2}$ | General purpose |
| Manhattan | $\sum \lvert x_i - y_i \rvert$ | High-dimensional data |
| Cosine | $1 - \frac{x \cdot y}{\lVert x \rVert \lVert y \rVert}$ | Text data |
| Hamming | Count of differing positions | Categorical data |

Performance Measures

Classification Metrics
  • Accuracy: $\frac{TP + TN}{TP + TN + FP + FN}$

  • Precision: $\frac{TP}{TP + FP}$ (How many selected are correct?)

  • Recall: $\frac{TP}{TP + FN}$ (How many actual positives found?)

  • F1 Score: $2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$

  • Confusion Matrix: Visualizes TP, TN, FP, FN
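As a worked example, the formulas above can be evaluated directly from confusion-matrix counts. The counts below are hypothetical, and the helper `classification_metrics` is an illustrative name; the sketch assumes none of the denominators is zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts for a binary classifier evaluated on 100 samples
acc, prec, rec, f1 = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(f"Accuracy={acc:.3f}, Precision={prec:.3f}, Recall={rec:.3f}, F1={f1:.3f}")
```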

Regression Metrics
  • MSE: Mean Squared Error - sensitive to outliers

  • RMSE: Root Mean Squared Error - same units as target

  • MAE: Mean Absolute Error - robust to outliers

  • R²: Coefficient of determination - proportion of target variance explained by the model

Practical Example: kNN Implementation

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler

# Load and prepare data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features (important for distance-based algorithms)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train kNN classifier
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X_train_scaled, y_train)

# Make predictions
y_pred = knn.predict(X_test_scaled)

# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred))

# Find optimal k using cross-validation
from sklearn.model_selection import GridSearchCV
param_grid = {'n_neighbors': range(1, 21)}
grid_search = GridSearchCV(knn, param_grid, cv=5)
grid_search.fit(X_train_scaled, y_train)
print(f"Best k: {grid_search.best_params_['n_neighbors']}")

Model Selection, Bias-Variance Tradeoff & Regularization

Educational Objectives

Key Concepts

Bias-Variance Tradeoff

$$\text{Generalization Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$

Regularization Techniques

| Technique | Formula | Effect |
|---|---|---|
| Lasso (L1) | $\lambda \sum \lvert w_i \rvert$ | Feature selection, sparse weights |
| Ridge (L2) | $\lambda \sum w_i^2$ | Prevents large weights |
| Elastic Net | $\lambda_1 \sum \lvert w_i \rvert + \lambda_2 \sum w_i^2$ | Combines L1 and L2 |

Model Evaluation

Train-Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
K-Fold Cross-Validation
from sklearn.model_selection import cross_val_score, KFold
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
print(f"Mean CV Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
Learning Curves
from sklearn.model_selection import learning_curve
import numpy as np
import matplotlib.pyplot as plt
train_sizes, train_scores, test_scores = learning_curve(
    model, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 10)
)
plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_scores.mean(1), 'o-', label='Training score')
plt.plot(train_sizes, test_scores.mean(1), 'o-', label='Cross-validation score')
plt.xlabel('Training examples')
plt.ylabel('Score')
plt.legend()
plt.title('Learning Curve')
plt.show()

Stochastic Gradient Descent (SGD)

Update rule: $w_{t+1} = w_t - \eta \nabla L(w_t)$
from sklearn.linear_model import SGDClassifier, SGDRegressor
sgd_clf = SGDClassifier(loss='log_loss', penalty='l2', alpha=0.0001, max_iter=1000, random_state=42)
sgd_reg = SGDRegressor(penalty='l2', alpha=0.0001, max_iter=1000, random_state=42)
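To make the update rule concrete, here is a minimal NumPy sketch of per-sample SGD for least-squares regression. The synthetic data, the learning rate `eta = 0.01` and the number of epochs are assumptions chosen for illustration; this is not how `SGDRegressor` is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (hypothetical): y = 2*x1 - 3*x2 + small noise
X_demo = rng.normal(size=(200, 2))
true_w = np.array([2.0, -3.0])
y_demo = X_demo @ true_w + rng.normal(scale=0.01, size=200)

w = np.zeros(2)   # initial weights
eta = 0.01        # learning rate

for epoch in range(50):
    for i in rng.permutation(len(X_demo)):     # visit samples in random order
        x_i, y_i = X_demo[i], y_demo[i]
        grad = 2 * (x_i @ w - y_i) * x_i       # gradient of the squared error for one sample
        w = w - eta * grad                     # w_{t+1} = w_t - eta * grad L(w_t)

print(w)
```

Each step uses the gradient of a single sample rather than the full dataset, which is what makes the method stochastic.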

Feature Engineering

Educational Objectives

Key Concepts

Feature Engineering Pipeline

Data Cleaning

import pandas as pd
import numpy as np

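# Handle missing values
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())  # Numerical: column mean
df = df.fillna(df.mode().iloc[0])  # Categorical: most frequent value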

# Handle outliers using IQR
def remove_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]

Feature Generation Techniques

Numerical Features
  • Binning: Convert continuous to categorical

  • Polynomial: $x, x^2, x^3$ for non-linear relationships

  • Interaction: $x_1 \times x_2$ for feature combinations

  • Log Transform: $\log(x)$ for skewed distributions

  • Scaling: Standardize or normalize features

Categorical Features
  • One-Hot Encoding: Create binary columns for each category

  • Label Encoding: Convert categories to integers

  • Target Encoding: Replace categories with target mean

  • Frequency Encoding: Replace with frequency of category
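Frequency and target encoding have no code example elsewhere in this section, so the sketch below shows one way to compute them with pandas. The DataFrame and its column names (`city`, `target`) are hypothetical. In practice the target means should be computed on the training split only, otherwise the encoding leaks label information.

```python
import pandas as pd

# Hypothetical data: one categorical feature and a binary target
df_demo = pd.DataFrame({
    'city': ['A', 'B', 'A', 'C', 'B', 'A'],
    'target': [1, 0, 1, 0, 1, 0],
})

# Frequency encoding: replace each category with its relative frequency
freq = df_demo['city'].value_counts(normalize=True)
df_demo['city_freq'] = df_demo['city'].map(freq)

# Target encoding: replace each category with the mean target of that category
target_mean = df_demo.groupby('city')['target'].mean()
df_demo['city_target'] = df_demo['city'].map(target_mean)

print(df_demo)
```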

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, OneHotEncoder, PolynomialFeatures

# Standard scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Min-max scaling
minmax = MinMaxScaler()
X_minmax = minmax.fit_transform(X)

# Robust scaling
robust = RobustScaler()
X_robust = robust.fit_transform(X)

# One-hot encoding
encoder = OneHotEncoder(drop='first', sparse_output=False)
X_encoded = encoder.fit_transform(X_categorical)

# Polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

Text Feature Extraction

Bag of Words (BoW)
  • Counts word occurrences

  • Ignores grammar and word order

  • Simple and effective baseline

TF-IDF
  • Term Frequency-Inverse Document Frequency

  • Weights words by importance

  • Rare words get higher weights

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Bag of Words
bow = CountVectorizer(max_features=1000, stop_words='english', ngram_range=(1, 2))
X_bow = bow.fit_transform(texts)

# TF-IDF
tfidf = TfidfVectorizer(max_features=1000, stop_words='english', ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(texts)

Audio Feature Extraction

STFT (Short-Time Fourier Transform)
  • Converts audio to time-frequency representation

  • Captures frequency content over time

  • Useful for speech and music analysis

MFCC (Mel-Frequency Cepstral Coefficients)
  • Represents spectral envelope of sound

  • Mimics human auditory system

  • Long-established standard feature for speech recognition

import librosa
import numpy as np

# Load audio file
y, sr = librosa.load('audio.wav', sr=22050)

# Extract STFT
stft = librosa.stft(y, n_fft=2048, hop_length=512)
stft_magnitude = np.abs(stft)

# Extract MFCC
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
mfcc_mean = np.mean(mfccs, axis=1)
mfcc_std = np.std(mfccs, axis=1)

Feature Selection Methods

| Method | Description | When to Use |
|---|---|---|
| Variance Threshold | Remove features with low variance | Initial filtering |
| Univariate Selection | Select best features based on statistical tests | Quick feature reduction |
| RFE | Remove features iteratively based on model weights | Model-based selection |
| Model-based Ranking | Use feature importance from models | Tree-based models |

from sklearn.feature_selection import VarianceThreshold, SelectKBest, RFE, f_classif
from sklearn.linear_model import LogisticRegression  # used as the RFE estimator below

# Variance threshold
selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)

# Select top k features
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

# Recursive Feature Elimination
estimator = LogisticRegression(max_iter=1000)
selector = RFE(estimator, n_features_to_select=5)
X_selected = selector.fit_transform(X, y)

Linear Models & Logistic Regression

Educational Objectives

Key Concepts

Probability Theory Basics

$$E[X] = \sum_x x \, P(X = x) \text{ (discrete)}, \quad E[X] = \int x f_X(x) \, dx \text{ (continuous)}$$

$$\mathrm{Var}(X) = E[(X - E[X])^2] = E[X^2] - E[X]^2$$
Bayes’ Theorem
$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}$$

Fundamental for Bayesian approaches to ML
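A short worked example of Bayes' theorem, using made-up numbers for a diagnostic test: a prior of 1%, a true-positive rate of 95% and a false-positive rate of 5%. The function name `posterior` is illustrative.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A|B) via Bayes' theorem, with P(B) expanded by the law of total probability."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)  # P(B)
    return likelihood * prior / evidence

# Hypothetical numbers: P(A) = 0.01, P(B|A) = 0.95, P(B|not A) = 0.05
p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(f"P(A|B) = {p:.4f}")
```

Even with an accurate test, the posterior stays well below 50% here because the prior is so small.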

Common Distributions
  • Bernoulli: Binary outcomes (p)

  • Gaussian: Continuous, symmetric (μ, σ²)

  • Multinomial: Multiple categories

  • Poisson: Count data (λ)

Loss Function Design

Maximum Likelihood Estimation (MLE)
$$L(\theta) = P(X \mid \theta) = \prod_{i=1}^n P(x_i \mid \theta)$$

$$\ell(\theta) = \sum_{i=1}^n \log P(x_i \mid \theta)$$

$$\hat{\theta} = \arg\max_{\theta} \ell(\theta)$$
Negative Log-Likelihood (NLL)
$$\text{NLL} = -\ell(\theta) = -\sum_{i=1}^n \log P(x_i \mid \theta)$$
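As an illustration of MLE via the negative log-likelihood, the sketch below evaluates the NLL of a Bernoulli model on a grid of candidate parameters and picks the minimizer. The ten coin-flip observations are invented for the example; for a Bernoulli model the minimizer is expected to coincide with the sample mean.

```python
import numpy as np

# Hypothetical binary observations: 7 ones out of 10
x = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

def bernoulli_nll(p, x):
    """Negative log-likelihood of i.i.d. Bernoulli(p) observations."""
    return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

# Evaluate the NLL on a grid of candidate parameters and pick the minimizer
grid = np.linspace(0.01, 0.99, 99)
nll = np.array([bernoulli_nll(p, x) for p in grid])
p_hat = grid[np.argmin(nll)]

print(f"MLE estimate: {p_hat:.2f}, sample mean: {x.mean():.2f}")
```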

Logistic Regression

Sigmoid Function
$$\sigma(z) = \frac{1}{1 + e^{-z}}, \quad z = w_0 + w_1 x_1 + \dots + w_d x_d$$
Binary Cross-Entropy Loss
$$\text{BCE} = -\frac{1}{N} \sum_{i=1}^N \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$
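The sigmoid and the binary cross-entropy above translate almost directly into NumPy. This is a minimal sketch; the clipping constant `eps` is an assumption added to avoid `log(0)` and is not part of the formula.

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping any real number to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean BCE; probabilities are clipped so that log(0) never occurs."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1.0, 0.0, 1.0])
confident = sigmoid(np.array([4.0, -4.0, 4.0]))   # confident and correct predictions
uncertain = sigmoid(np.array([0.0, 0.0, 0.0]))    # all probabilities equal 0.5

print(binary_cross_entropy(y_true, confident))
print(binary_cross_entropy(y_true, uncertain))
```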
Implementation
import pandas as pd  # needed for the coefficient table below
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score

# Basic logistic regression
model = LogisticRegression(penalty='l2', C=1.0, solver='lbfgs', max_iter=1000, random_state=42)
model.fit(X_train, y_train)

# Logistic regression with cross-validated regularization
model_cv = LogisticRegressionCV(Cs=[0.001, 0.01, 0.1, 1, 10, 100], cv=5, penalty='l2', solver='lbfgs', max_iter=1000, random_state=42)
model_cv.fit(X_train, y_train)

# Get coefficients
feature_importance = pd.DataFrame({'Feature': X.columns, 'Coefficient': model.coef_[0]}).sort_values('Coefficient', ascending=False)

# Predictions and evaluation
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")

Neural Networks & Deep Learning

Educational Objectives

Key Concepts

Neural Network Architecture

Activation Functions

ReLU

Function: $f(x) = \max(0, x)$
Pros: Mitigates vanishing gradients, computationally efficient
Cons: Neurons can "die" (output zero and stop learning) when their inputs stay negative

Sigmoid

Function: $f(x) = \frac{1}{1 + e^{-x}}$
Pros: Outputs between 0 and 1
Cons: Vanishing gradients

Softmax

Function: $f(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$
Use: Multi-class classification output
Property: Outputs sum to 1
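ReLU and softmax can be sketched in NumPy as follows. Subtracting the maximum before exponentiating is a common numerical-stability trick and an addition to the plain formula; it does not change the result.

```python
import numpy as np

def relu(x):
    """Element-wise max(0, x)."""
    return np.maximum(0, x)

def softmax(x):
    """Softmax along the last axis, shifted by the maximum for numerical stability."""
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])   # hypothetical raw network outputs
probs = softmax(logits)

print(relu(np.array([-1.0, 0.0, 2.0])))
print(probs, probs.sum())
```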

Loss Functions

| Loss Function | Formula | Use Case |
|---|---|---|
| MSE | $\frac{1}{n}\sum (y_i - \hat{y}_i)^2$ | Regression |
| Binary Cross-Entropy | $-\sum [y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)]$ | Binary classification |
| Categorical Cross-Entropy | $-\sum y_i \log(\hat{y}_i)$ | Multi-class classification |

Backpropagation

  1. Forward pass: Compute predictions and loss

  2. Backward pass: Compute gradients using chain rule

  3. Update weights: Adjust weights using gradients
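The three steps can be traced by hand on a tiny network. The sketch below assumes one hidden ReLU layer, a linear output and MSE loss; the layer sizes, random data and learning rate are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))            # 4 samples, 3 features (random toy data)
y = rng.normal(size=(4, 1))            # regression targets
W1 = rng.normal(size=(3, 5)) * 0.5     # input -> hidden weights
W2 = rng.normal(size=(5, 1)) * 0.5     # hidden -> output weights

def forward(W1, W2):
    h_pre = x @ W1                     # hidden pre-activations
    h = np.maximum(0, h_pre)           # ReLU
    y_hat = h @ W2                     # linear output layer
    loss = np.mean((y_hat - y) ** 2)   # MSE loss
    return h_pre, h, y_hat, loss

# 1. Forward pass: compute predictions and loss
h_pre, h, y_hat, loss = forward(W1, W2)

# 2. Backward pass: apply the chain rule layer by layer
d_y_hat = 2 * (y_hat - y) / y.size     # dL/dy_hat
dW2 = h.T @ d_y_hat                    # dL/dW2
d_h = d_y_hat @ W2.T                   # dL/dh
d_h_pre = d_h * (h_pre > 0)            # ReLU passes gradient only where input > 0
dW1 = x.T @ d_h_pre                    # dL/dW1

# 3. Update weights: one gradient descent step
lr = 0.01
W1_new = W1 - lr * dW1
W2_new = W2 - lr * dW2
new_loss = forward(W1_new, W2_new)[3]

print(f"Loss before: {loss:.4f}, after one step: {new_loss:.4f}")
```

Frameworks such as Keras derive the backward pass automatically, so this manual version is only meant to show what happens under the hood.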

Optimization Techniques

Standard SGD

Update: $w_{t+1} = w_t - \eta \nabla L(w_t)$

SGD with Momentum

Update: $v_{t+1} = \mu v_t - \eta \nabla L(w_t)$
$w_{t+1} = w_t + v_{t+1}$

AdaGrad

Adaptive learning rates for each parameter

Adam

Combines momentum and adaptive learning rates
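The plain SGD and momentum updates above can be written as two small functions. The sketch minimizes the toy loss $L(w) = w^2$ (gradient $2w$); the values `eta=0.1` and `mu=0.9` are assumptions chosen for illustration.

```python
import numpy as np

def sgd_step(w, grad, eta):
    """Plain SGD: w_{t+1} = w_t - eta * grad."""
    return w - eta * grad

def momentum_step(w, v, grad, eta, mu):
    """SGD with momentum: v_{t+1} = mu * v_t - eta * grad, w_{t+1} = w_t + v_{t+1}."""
    v_new = mu * v - eta * grad
    return w + v_new, v_new

# Minimize L(w) = w^2, whose gradient is 2w, starting from w = 1
w, v = np.array([1.0]), np.array([0.0])
for _ in range(200):
    w, v = momentum_step(w, v, grad=2 * w, eta=0.1, mu=0.9)

print(w)
```

The velocity `v` accumulates past gradients, so consecutive steps in the same direction build up speed while oscillating directions partly cancel.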

Regularization Techniques

L1/L2 Regularization
  • L1: $\lambda \sum |w_i|$ - Encourages sparsity

  • L2: $\lambda \sum w_i^2$ - Prevents large weights

  • Elastic Net: Combination of both

Dropout

  • Randomly deactivate neurons during training

  • Prevents co-adaptation of neurons

  • Typical rate: 0.2-0.5 for hidden layers
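A minimal sketch of dropout in NumPy, assuming the "inverted" variant in which surviving activations are scaled by `1 / (1 - rate)` during training so that nothing needs to change at inference time. The function name and the all-ones input are illustrative.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return activations                       # inference: pass through unchanged
    mask = rng.random(activations.shape) >= rate  # True for units that are kept
    return activations * mask / (1.0 - rate)      # rescale to keep the expected value

rng = np.random.default_rng(0)
activations = np.ones((100, 100))
dropped = dropout(activations, rate=0.5, rng=rng)

print(f"Fraction zeroed: {(dropped == 0).mean():.3f}, mean activation: {dropped.mean():.3f}")
```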

Practical Example: Neural Network with Keras

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.callbacks import EarlyStopping
import matplotlib.pyplot as plt  # used for the training-history plots below

# Define a simple neural network
model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(input_dim,)),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(32, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.2),
    layers.Dense(1, activation='sigmoid')  # Binary classification
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Early stopping
early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

# Train the model
history = model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=100, batch_size=32, callbacks=[early_stopping], verbose=1)

# Plot training history
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.legend()
plt.title('Accuracy over epochs')

plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.legend()
plt.title('Loss over epochs')
plt.show()

# Evaluate
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {accuracy:.4f}")

Convolutional Neural Networks (CNNs)

Educational Objectives

Key Concepts

CNN Architecture Components

Convolution Operation

$$(I * K)(i, j) = \sum_m \sum_n I(i+m, j+n) K(m, n)$$
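The sum above can be implemented with two explicit loops. This sketch assumes a single-channel image, stride 1 and no padding ("valid" mode); the function name `conv2d_valid`, the image and the kernel are toy values.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image and sum element-wise products (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # sum_m sum_n I(i+m, j+n) * K(m, n)
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "image"
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])                   # toy 2x2 diagonal-difference filter

feature_map = conv2d_valid(image, kernel)
print(feature_map)
```

A 4x4 input with a 2x2 kernel yields a 3x3 feature map; `padding='same'` in Keras would instead pad the input so the output keeps the input size.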

Pooling Operations

Max Pooling

  • Takes maximum value in each window

  • Preserves most prominent features

  • Reduces spatial dimensions

Average Pooling

  • Takes average value in each window

  • Smoother than max pooling

  • Less sensitive to outliers
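Both pooling variants can be sketched with a reshape trick in NumPy. The sketch assumes non-overlapping square windows on a single-channel input; `pool2d` is an illustrative helper, not a Keras function.

```python
import numpy as np

def pool2d(x, size=2, mode='max'):
    """Non-overlapping pooling with a square window of the given size."""
    h, w = x.shape
    x = x[:h - h % size, :w - w % size]                  # drop rows/cols that do not fill a window
    blocks = x.reshape(h // size, size, w // size, size)  # split into size x size blocks
    if mode == 'max':
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

x = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 10, 13, 14],
              [11, 12, 15, 16]], dtype=float)

print(pool2d(x, mode='max'))
print(pool2d(x, mode='mean'))
```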

Practical Example: CNN for Fashion MNIST

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np

# Load Fashion MNIST dataset
(x_train, y_train), (x_test, y_test) = keras.datasets.fashion_mnist.load_data()

# Preprocess data
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
x_train = np.expand_dims(x_train, -1)  # Shape: (60000, 28, 28, 1)
x_test = np.expand_dims(x_test, -1)    # Shape: (10000, 28, 28, 1)

# Convert labels to one-hot encoding
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)

# Define CNN model
model = keras.Sequential([
    layers.Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1), padding='same'),
    layers.BatchNormalization(),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Dropout(0.3),
    layers.Conv2D(64, kernel_size=(3, 3), activation='relu', padding='same'),
    layers.BatchNormalization(),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Dropout(0.4),
    layers.Conv2D(128, kernel_size=(3, 3), activation='relu', padding='same'),
    layers.BatchNormalization(),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Dropout(0.5),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax')  # 10 classes
])

# Compile model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train model
history = model.fit(x_train, y_train, batch_size=128, epochs=30, validation_split=0.2)

# Evaluate
score = model.evaluate(x_test, y_test, verbose=0)
print(f'Test loss: {score[0]:.4f}')
print(f'Test accuracy: {score[1]:.4f}')

Support Vector Machines (SVM)

Educational Objectives

Key Concepts

SVM Evolution

Linear SVM

For linearly separable data, SVM finds the hyperplane that maximizes the margin:

$$\text{Hyperplane: } w^T x + b = 0, \quad \text{Margin: } \frac{2}{\|w\|}$$

Subject to: $y_i (w^T x_i + b) \geq 1$ for all $i$

Soft Margin SVM

Allows some misclassifications to handle non-separable data:

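$$\text{Minimize: } \frac{1}{2} \|w\|^2 + C \sum_{i=1}^n \xi_i, \quad \text{Subject to: } y_i (w^T x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$

Where:

  • $\xi_i$: slack variable measuring how far sample $i$ violates the margin

  • $C$: regularization parameter trading margin width against margin violations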

Kernel Trick

Enables SVM to handle non-linear decision boundaries:

$$K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$$

Common kernel functions:

| Kernel | Function | When to Use |
|---|---|---|
| Linear | $K(x_i, x_j) = x_i^T x_j$ | Linearly separable data |
| Polynomial | $K(x_i, x_j) = (\gamma x_i^T x_j + r)^d$ | Polynomial relationships |
| RBF/Gaussian | $K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$ | General non-linear problems |
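The three kernels can be written out directly, and the kernel trick can be checked numerically for one simple case. The sketch below assumes 2-D inputs and uses the explicit feature map $\phi(x) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)$, which corresponds to the polynomial kernel with $\gamma = 1$, $r = 0$, $d = 2$. Function names and sample vectors are illustrative.

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, gamma=1.0, r=1.0, d=2):
    return (gamma * (x @ z) + r) ** d

def rbf_kernel(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel with gamma=1, r=0 (2-D input)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

print(linear_kernel(x, z))
print(polynomial_kernel(x, z))
print(rbf_kernel(x, z))

# Kernel trick: the kernel value equals the inner product in feature space
print(polynomial_kernel(x, z, gamma=1.0, r=0.0, d=2), phi(x) @ phi(z))
```

The kernel computes the feature-space inner product without ever forming $\phi(x)$, which matters when the feature space is very high-dimensional (or infinite-dimensional, as for the RBF kernel).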

Practical Example: SVM with scikit-learn

from sklearn.svm import SVC, SVR
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, accuracy_score

# Scale features (critical for SVM)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Basic SVM classifier
svm = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)
svm.fit(X_train_scaled, y_train)

# Predictions
y_pred = svm.predict(X_test_scaled)

# Evaluation
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred))

# Get support vectors
print(f"Number of support vectors: {svm.n_support_}")

# Hyperparameter tuning
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1], 'kernel': ['linear', 'rbf', 'poly']}
grid_search = GridSearchCV(SVC(random_state=42), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")

Gaussian Processes

Educational Objectives

Key Concepts

Bayesian Learning

Maximum Likelihood

Approach: Find parameters that maximize likelihood of observed data
Formula: $\hat{\theta}_{MLE} = \arg\max_{\theta} P(X \mid \theta)$
Property: Point estimate, no uncertainty quantification

Bayesian Posterior

Approach: Compute probability distribution over parameters given data
Formula: $P(\theta \mid X) = \frac{P(X \mid \theta) P(\theta)}{P(X)}$
Property: Full distribution, quantifies uncertainty

Gaussian Distribution

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Properties:

Gaussian Process

A Gaussian Process (GP) is a collection of random variables, any finite number of which have a (multivariate) Gaussian distribution.

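$$f: X \rightarrow \mathbb{R}, \quad f \sim \mathcal{GP}(m(x), k(x, x'))$$

Where:

  • $m(x)$: mean function, $m(x) = \mathbb{E}[f(x)]$

  • $k(x, x')$: covariance (kernel) function between the function values at $x$ and $x'$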

Common Kernel Functions

| Kernel | Formula | Properties |
|---|---|---|
| RBF | $k(x, x') = \exp\left(-\frac{\lVert x - x' \rVert^2}{2\ell^2}\right)$ | Smooth, infinitely differentiable |
| Linear | $k(x, x') = x^T x'$ | Linear functions |
| Polynomial | $k(x, x') = (x^T x' + c)^d$ | Polynomial functions |

Practical Example: Gaussian Process Regression

import numpy as np
import matplotlib.pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

# Generate data
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])

# Define kernel
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)

# Create and fit GP
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10)
gp.fit(X, y)

# Make predictions
X_test = np.linspace(0, 10, 500).reshape(-1, 1)
y_pred, y_std = gp.predict(X_test, return_std=True)

# Plot results
plt.figure(figsize=(10, 6))
plt.scatter(X, y, c='k', label='Data')
plt.plot(X_test, y_pred, 'b-', label='GP Mean')
plt.fill_between(X_test.ravel(), y_pred - 1.96 * y_std, y_pred + 1.96 * y_std, alpha=0.2, color='blue', label='95% Confidence Interval')
plt.legend()
plt.title('Gaussian Process Regression')
plt.xlabel('X')
plt.ylabel('y')
plt.show()

# Sample functions from the GP posterior
# (gp has already been fitted above, so sample_y draws functions conditioned on the data)
X_sample = np.linspace(0, 10, 100).reshape(-1, 1)
y_samples = gp.sample_y(X_sample, n_samples=5)

plt.figure(figsize=(10, 6))
for i in range(5):
    plt.plot(X_sample, y_samples[:, i], lw=2, label=f'Sample {i+1}')
plt.title('Functions Sampled from GP Posterior')
plt.legend()
plt.show()

Dimensionality Reduction

Educational Objectives

Key Concepts

Curse of Dimensionality

Manifold Hypothesis

Principal Component Analysis (PCA)

PCA finds orthogonal directions (principal components) that maximize variance. Steps:

  1. Center the data: $X_{centered} = X - \bar{X}$

  2. Compute covariance matrix: $\Sigma = \frac{1}{n} X_{centered}^T X_{centered}$

  3. Eigendecomposition: $\Sigma = V \Lambda V^T$

  4. Select top $k$ eigenvectors
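The four steps map one-to-one onto NumPy calls. The sketch below uses synthetic 3-D data with very different variances per axis (an assumption made so that two components capture almost everything) and follows the $\frac{1}{n}$ covariance convention used above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 3 features with standard deviations 3, 1 and 0.1
X_demo = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.1])

# 1. Center the data
X_centered = X_demo - X_demo.mean(axis=0)

# 2. Covariance matrix (1/n convention)
cov = X_centered.T @ X_centered / len(X_centered)

# 3. Eigendecomposition (eigh is meant for symmetric matrices, returns ascending eigenvalues)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]                 # sort by descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project onto the top-k eigenvectors
k = 2
X_proj = X_centered @ eigvecs[:, :k]
explained = eigvals[:k].sum() / eigvals.sum()

print(f"Projected shape: {X_proj.shape}, explained variance ratio: {explained:.4f}")
```

The variance of each projected column equals the corresponding eigenvalue, which is why the eigenvalues are used to report explained variance.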

Practical Example: PCA for Visualization

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Plot
plt.figure(figsize=(10, 8))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', alpha=0.6)
plt.colorbar(label='Class')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA: 2D Visualization')
plt.show()

# Explained variance
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total explained variance: {sum(pca.explained_variance_ratio_):.4f}")

Kernel PCA

from sklearn.decomposition import KernelPCA

# Kernel PCA with RBF kernel
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=0.04, fit_inverse_transform=True)
X_kpca = kpca.fit_transform(X_scaled)

# Plot
plt.figure(figsize=(10, 8))
plt.scatter(X_kpca[:, 0], X_kpca[:, 1], c=y, cmap='viridis', alpha=0.6)
plt.colorbar(label='Class')
plt.xlabel('Kernel PC 1')
plt.ylabel('Kernel PC 2')
plt.title('Kernel PCA: 2D Visualization')
plt.show()

Manifold Learning Techniques

MDS
  • Preserves pairwise distances

  • Linear technique in its classical form

  • Good for visualization

t-SNE
  • Preserves local structure

  • Non-linear technique

  • Excellent for visualization

  • Computationally expensive

LLE
  • Preserves local linear relationships

  • Non-linear technique

  • Good for manifold learning

Isomap
  • Preserves geodesic distances

  • Non-linear technique

  • Uses neighborhood graph

from sklearn.manifold import TSNE

# t-SNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42)  # default of 1000 iterations; the n_iter argument was renamed in newer scikit-learn releases
X_tsne = tsne.fit_transform(X_scaled[:1000])  # Use subset for speed

plt.figure(figsize=(10, 8))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y[:1000], cmap='viridis', alpha=0.6)
plt.colorbar(label='Class')
plt.title('t-SNE: 2D Visualization')
plt.show()

Cluster Analysis

Educational Objectives

Key Concepts

Types of Clustering

Partitioning
  • k-Means: Partitions data into k clusters

  • k-Medoids: Uses actual data points as centers

  • Fuzzy c-Means: Soft clustering (probabilistic)

Hierarchical
  • Agglomerative: Bottom-up

  • Divisive: Top-down

  • Dendrogram: Visual representation

Density-Based
  • DBSCAN: Density-based spatial clustering

  • OPTICS: Similar to DBSCAN but handles clusters of varying density

  • HDBSCAN: Hierarchical DBSCAN

k-Means Algorithm

  1. Initialize k cluster centers randomly

  2. Assign each point to nearest cluster center

  3. Recalculate cluster centers as mean of assigned points

  4. Repeat steps 2-3 until convergence

$$\text{Objective: Minimize WCSS} = \sum_{i=1}^k \sum_{x \in C_i} \|x - \mu_i\|^2$$
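The algorithm steps and the WCSS objective can be sketched from scratch in NumPy. This is a bare-bones version with plain random initialization (not k-means++), and the six toy points are made up for illustration; `kmeans` here is a hypothetical helper, not the scikit-learn class.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic k-means (Lloyd's algorithm) with random initialization from the data points."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]        # 1. initialize centers
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                              # 2. assign to nearest center
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])                                                         # 3. recompute centers
        if np.allclose(new_centers, centers):                      # 4. stop at convergence
            break
        centers = new_centers
    wcss = np.sum((X - centers[labels]) ** 2)                      # within-cluster sum of squares
    return labels, centers, wcss

# Toy data (hypothetical): two well-separated groups of three points
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])

labels, centers, wcss = kmeans(X_toy, k=2)
print(labels, wcss)
```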

Practical Example: k-Means Clustering

from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Standardize data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# k-Means clustering
kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X_scaled)

# Plot clusters
plt.figure(figsize=(10, 8))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap='viridis', alpha=0.6)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='red', marker='X', label='Centroids')
plt.legend()
plt.title('k-Means Clustering')
plt.show()

# Evaluate clustering
print(f"Silhouette Score: {silhouette_score(X_scaled, clusters):.4f}")
print(f"Davies-Bouldin Score: {davies_bouldin_score(X_scaled, clusters):.4f}")

# Elbow method
wcss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)

plt.figure(figsize=(8, 4))
plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.title('Elbow Method')
plt.show()

Hierarchical Clustering Example

from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

# Perform hierarchical clustering
Z = linkage(X_scaled, method='ward')

# Plot dendrogram
plt.figure(figsize=(12, 6))
dendrogram(Z, truncate_mode='level', p=12)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()

# Cut dendrogram
clusters = fcluster(Z, t=10, criterion='distance')

DBSCAN Example

# DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5, metric='euclidean')
clusters = dbscan.fit_predict(X_scaled)

# Plot
plt.figure(figsize=(10, 8))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap='viridis', alpha=0.6)
plt.title('DBSCAN Clustering')
plt.show()

# Count clusters and noise
n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
n_noise = list(clusters).count(-1)
print(f"Number of clusters: {n_clusters}")
print(f"Number of noise points: {n_noise}")

Gaussian Mixture Models & Expectation-Maximization

Educational Objectives

Key Concepts

Gaussian Mixture Model

A probabilistic model that assumes data is generated from a mixture of several Gaussian distributions:

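$$P(x) = \sum_{k=1}^K \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

Where:

  • $\pi_k$: mixing weight of component $k$ (the weights are non-negative and sum to 1)

  • $\mu_k$: mean of component $k$

  • $\Sigma_k$: covariance matrix of component $k$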

Expectation-Maximization (EM) Algorithm

  1. E-step (Expectation): Compute posterior probabilities (responsibilities)

  2. M-step (Maximization): Update parameters using current responsibilities

  3. Repeat until convergence

$$\text{E-step: } \gamma_{nk} = \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$
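A minimal EM loop for a one-dimensional, two-component mixture illustrates both steps. The E-step follows the responsibility formula above with scalar variances in place of $\Sigma_k$; the M-step uses the standard closed-form GMM updates, which are not spelled out in this summary. The synthetic data and the initial parameters are assumptions.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """Density of a univariate Gaussian."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def e_step(x, weights, means, variances):
    """Responsibilities gamma_nk: posterior probability of component k for sample n."""
    weighted = weights[None, :] * gaussian_pdf(x[:, None], means[None, :], variances[None, :])
    return weighted / weighted.sum(axis=1, keepdims=True)

def m_step(x, resp):
    """Re-estimate weights, means and variances from the responsibilities."""
    nk = resp.sum(axis=0)                                   # effective samples per component
    weights = nk / len(x)
    means = (resp * x[:, None]).sum(axis=0) / nk
    variances = (resp * (x[:, None] - means[None, :]) ** 2).sum(axis=0) / nk
    return weights, means, variances

# Synthetic 1-D data: two clusters centered at -2 and 3
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(3.0, 0.5, 200)])

# Rough initial guesses
weights = np.array([0.5, 0.5])
means = np.array([-1.0, 1.0])
variances = np.array([1.0, 1.0])

for _ in range(50):
    resp = e_step(x, weights, means, variances)
    weights, means, variances = m_step(x, resp)

print(weights, means, variances)
```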

GMM vs. k-Means

| Aspect | GMM | k-Means |
| --- | --- | --- |
| Clustering | Soft (probabilistic) | Hard (deterministic) |
| Cluster Shape | Elliptical | Spherical |
| Covariance | Can be different | Same (identity) |
| Probabilistic | Yes | No |

Practical Example: GMM with scikit-learn

import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

# Fit GMM
n_components = 3
gmm = GaussianMixture(n_components=n_components, covariance_type='full', random_state=42)
gmm.fit(X_scaled)

# Predict cluster assignments (hard clustering)
clusters = gmm.predict(X_scaled)

# Get probabilities (soft clustering)
probabilities = gmm.predict_proba(X_scaled)

# Plot clusters with uncertainty
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap='viridis', alpha=0.6)
plt.title('GMM Clustering')
plt.colorbar(scatter, label='Cluster')
plt.show()

# Print model parameters
print(f"Means:\n{gmm.means_}")
print(f"Covariances:\n{gmm.covariances_}")
print(f"Weights:\n{gmm.weights_}")

# Calculate AIC and BIC
aic = gmm.aic(X_scaled)
bic = gmm.bic(X_scaled)
print(f"AIC: {aic}, BIC: {bic}")

# Find optimal number of components using BIC (lower is better)
n_components_range = range(1, 11)
bic_scores = []
for n in n_components_range:
    gmm = GaussianMixture(n_components=n, random_state=42)
    gmm.fit(X_scaled)
    bic_scores.append(gmm.bic(X_scaled))

plt.figure(figsize=(8, 4))
plt.plot(n_components_range, bic_scores, marker='o')
plt.xlabel('Number of components')
plt.ylabel('BIC')
plt.title('BIC for Model Selection')
plt.show()

Reinforcement Learning

Educational Objectives

Key Concepts

Markov Decision Process (MDP)

A framework for modeling decision-making situations:

$$\text{MDP} = (S, A, P, R, \gamma)$$

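Where:

  • $S$ is the set of states

  • $A$ is the set of actions

  • $P$ is the state transition probability function

  • $R$ is the reward function

  • $\gamma \in [0, 1]$ is the discount factor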

Markov Property

$$P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} \mid s_t, a_t)$$

Value Functions

$$\begin{aligned}
V^\pi(s) &= \mathbb{E}_\pi \left[ \sum_{k=0}^\infty \gamma^k R_{t+k+1} \mid S_t = s \right] \\
Q^\pi(s,a) &= \mathbb{E}_\pi \left[ \sum_{k=0}^\infty \gamma^k R_{t+k+1} \mid S_t = s, A_t = a \right]
\end{aligned}$$

Bellman Equation

$$\begin{aligned}
V^\pi(s) &= \mathbb{E}_\pi \left[ R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s \right] \\
Q^\pi(s,a) &= \mathbb{E}_\pi \left[ R_{t+1} + \gamma Q^\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \right]
\end{aligned}$$

Optimal Policy

$$\pi^*(s) = \arg\max_a Q^*(s,a), \quad \text{where } Q^*(s,a) = \max_\pi Q^\pi(s,a)$$

Dynamic Programming Methods

Value Iteration
$$V_{k+1}(s) = \max_a \mathbb{E} \left[ R_{t+1} + \gamma V_k(S_{t+1}) \mid S_t = s, A_t = a \right]$$
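
The update can be sketched on a tiny, hypothetical three-state chain MDP. The transition table `P`, the helper names `q_values` and `value_iteration`, and the stopping tolerance are assumptions made for this illustration.

```python
import numpy as np

# Hypothetical chain MDP: P[s][a] = list of (probability, next_state, reward)
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 1.0)]},   # moving right from state 1 pays +1
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},   # absorbing terminal state
}
gamma = 0.9

def q_values(V, s):
    """Expected return of each action in state s under the value estimate V."""
    return np.array([
        sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
        for a in sorted(P[s])
    ])

def value_iteration(tol=1e-8):
    V = np.zeros(len(P))
    while True:
        # Bellman optimality backup for every state
        V_new = np.array([q_values(V, s).max() for s in P])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

V_star = value_iteration()
greedy_policy = [int(np.argmax(q_values(V_star, s))) for s in P]
print("V*:", V_star, "greedy policy:", greedy_policy)
```

For this chain the values converge to 0.9, 1.0 and 0.0, and the greedy policy moves right (action 1) in both non-terminal states.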
Policy Iteration
  1. Policy Evaluation: Compute $V^\pi$ for the current policy

  2. Policy Improvement: Update the policy to be greedy with respect to $V^\pi$

  3. Repeat both steps until the policy no longer changes
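
A minimal sketch of policy iteration on a hypothetical three-state chain MDP, stored as arrays. The array names `T` (transition probabilities) and `R` (expected rewards) are assumptions for this example, and evaluating the policy by solving a linear system is one possible choice; iterative evaluation works as well.

```python
import numpy as np

# Hypothetical MDP: T[s, a, s'] = transition probability, R[s, a] = expected reward
T = np.zeros((3, 2, 3))
T[0, 0, 0] = T[0, 1, 1] = 1.0
T[1, 0, 0] = T[1, 1, 2] = 1.0
T[2, :, 2] = 1.0                        # state 2 is absorbing
R = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
gamma = 0.9
n_states = T.shape[0]

policy = np.zeros(n_states, dtype=int)  # start with "always take action 0"
while True:
    # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly
    P_pi = T[np.arange(n_states), policy]        # shape (S, S)
    R_pi = R[np.arange(n_states), policy]        # shape (S,)
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

    # Policy improvement: act greedily with respect to V
    Q = R + gamma * T @ V                        # shape (S, A)
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):       # stable policy: stop
        break
    policy = new_policy

print("policy:", policy, "V:", V)
```

On this example the loop stops after a few rounds with the policy that moves right in both non-terminal states.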

Temporal Difference Learning

$$\text{TD(0): } V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$
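
As an illustration, the TD(0) update can be applied to a simple random walk: five non-terminal states in a row, a uniformly random policy, and a reward of 1 only when the walk exits on the right. The environment, step size and episode count are assumptions chosen for this sketch; for this task the true state values are known to be 1/6, 2/6, ..., 5/6.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states = 7                   # states 0 and 6 are terminal, 1..5 are non-terminal
V = np.zeros(n_states)         # terminal values stay at 0
alpha, gamma = 0.02, 1.0

for _ in range(5000):
    s = 3                      # every episode starts in the middle
    while s not in (0, 6):
        s_next = s + rng.choice([-1, 1])           # random policy: step left or right
        r = 1.0 if s_next == 6 else 0.0            # reward only at the right exit
        V[s] += alpha * (r + gamma * V[s_next] - V[s])   # TD(0) update
        s = s_next

true_V = np.arange(1, 6) / 6
print("estimated:", V[1:6].round(2))
print("true:     ", true_V.round(2))
```

The estimates fluctuate around the true values; a smaller `alpha` reduces the noise at the cost of slower learning.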

Q-Learning

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$

Exploration vs. Exploitation

Exploration
  • Try different actions to discover better strategies

  • Needed to find optimal policy

  • Can lead to suboptimal short-term rewards

Exploitation
  • Use known best action to maximize immediate reward

  • Can miss better long-term strategies

  • Suboptimal in the long run if overused

Common exploration strategies:
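
Two widely used strategies are ε-greedy (pick a random action with probability ε, otherwise the greedy one) and softmax or Boltzmann exploration (sample actions with probability proportional to the exponentiated Q-values). The sketch below shows one possible implementation of each; the function names and the `temperature` parameter are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit

def softmax_action(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    z = (q_values - np.max(q_values)) / temperature   # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(q_values), p=probs))

q = np.array([0.1, 0.5, 0.2])
print("epsilon-greedy:", [epsilon_greedy(q, epsilon=0.1) for _ in range(10)])
print("softmax:       ", [softmax_action(q, temperature=0.5) for _ in range(10)])
```

A high temperature makes softmax selection nearly uniform, while a temperature close to zero makes it nearly greedy.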

On-Policy vs. Off-Policy

| Aspect | On-Policy | Off-Policy |
| --- | --- | --- |
| Definition | Learns about and improves the same policy | Learns about one policy while following another |
| Example | SARSA | Q-Learning |
| Advantage | Directly learns the policy being followed | Can learn optimal policy while following exploratory policy |
| Disadvantage | Must balance exploration/exploitation | More complex, can be unstable |
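
The difference shows up in the update target. In the sketch below (function names are assumptions for this example), Q-learning bootstraps from the best action in the next state, whereas SARSA bootstraps from the action the behavior policy actually takes next.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy: the target uses the greedy action in s_next."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: the target uses the action a_next actually chosen in s_next."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

Q_ql = np.array([[0.0, 0.0], [1.0, 2.0]])
Q_sarsa = Q_ql.copy()

# Same transition for both; the behavior policy happens to pick the non-greedy a_next = 0
q_learning_update(Q_ql, s=0, a=0, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
sarsa_update(Q_sarsa, s=0, a=0, r=1.0, s_next=1, a_next=0, alpha=0.5, gamma=0.9)
print(Q_ql[0, 0], Q_sarsa[0, 0])
```

Here Q-learning moves the estimate to about 1.4, while SARSA moves it to about 0.95, because the exploratory next action has a lower value than the greedy one.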

Practical Example: Q-Learning for Frozen Lake

import gymnasium as gym  # maintained successor of OpenAI Gym; gym >= 0.26 exposes the same API
import numpy as np

# Create FrozenLake environment
env = gym.make('FrozenLake-v1', is_slippery=False)

# Initialize Q-table
Q = np.zeros((env.observation_space.n, env.action_space.n))

# Hyperparameters
alpha = 0.8  # Learning rate
gamma = 0.95  # Discount factor
epsilon = 0.1  # Exploration rate
episodes = 1000

# Q-learning algorithm
for episode in range(episodes):
    state = env.reset()[0]
    done = False
    
    while not done:
        # Epsilon-greedy action selection
        if np.random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()  # Explore
        else:
            action = np.argmax(Q[state, :])  # Exploit
        
        # Take action
        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated  # also end the episode at the time limit
        
        # Q-learning update
        best_next_action = np.argmax(Q[next_state, :])
        td_target = reward + gamma * Q[next_state, best_next_action]
        td_error = td_target - Q[state, action]
        Q[state, action] += alpha * td_error
        
        # Update state
        state = next_state

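# Test the learned policy (greedy, no exploration)
state = env.reset()[0]
done = False
while not done:
    action = np.argmax(Q[state, :])
    state, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated  # avoids an endless loop if the policy gets stuck
    env.render()  # only draws if the env was created with a render_mode

print(f"Final reward: {reward}")

env.close()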

Ensemble Methods

Educational Objectives

Key Concepts

Ensemble Learning

Ensemble learning combines multiple models so that their joint prediction is more accurate or more robust than that of any single model.

Types of Ensembles

Bagging (Bootstrap Aggregating)
  • Principle: Reduce variance by averaging multiple models

  • Method: Train models on different bootstrap samples

  • Example: Random Forest

  • Effect: Reduces variance and thereby helps limit overfitting

Boosting
  • Principle: Reduce bias by sequentially correcting errors

  • Method: Train models sequentially, each focusing on previous errors

  • Examples: AdaBoost, Gradient Boosting, XGBoost

  • Effect: Reduces bias, improves accuracy

Random Forest

An ensemble of decision trees trained on bootstrap samples with feature subsampling:

  1. Create bootstrap samples (with replacement)

  2. For each sample, train a decision tree that considers only a random subset of features at each split

  3. Aggregate predictions from all trees (majority vote for classification, average for regression)

$$\text{Feature subsampling: } m \approx \sqrt{d} \text{ for classification, } m \approx d/3 \text{ for regression}$$
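
The three steps can be sketched by hand with scikit-learn decision trees. This is an illustration on synthetic data, not the internals of `RandomForestClassifier` (which averages class probabilities rather than counting hard votes); the variable names, the number of trees and the dataset are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (labels 0/1)
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
trees = []
for i in range(25):
    # Step 1: bootstrap sample, drawn with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # Step 2: tree that considers about sqrt(d) random features at each split
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=i)
    trees.append(tree.fit(X_tr[idx], y_tr[idx]))

# Step 3: majority vote over all trees
votes = np.stack([t.predict(X_te) for t in trees])    # shape (n_trees, n_test)
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)
accuracy = (y_pred == y_te).mean()
print(f"Hand-rolled forest accuracy: {accuracy:.3f}")
```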

Boosting Methods

AdaBoost
  • Idea: Give more weight to misclassified samples

  • Algorithm: Sequentially train models, reweighting data

  • Weight Update: Increase weights for misclassified samples
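
For reference, one round $t$ of the classic discrete AdaBoost for labels $y_i \in \{-1, +1\}$ can be written as follows. The notation ($w_i$ for sample weights, $h_t$ for the weak learner, $\alpha_t$ for its vote, $Z_t$ for the normalizer) is introduced here for illustration; scikit-learn's `AdaBoostClassifier` uses a multi-class variant (SAMME), so its internals differ in detail.

$$\begin{aligned}
\epsilon_t &= \sum_{i} w_i \, \mathbb{1}\left[ h_t(x_i) \neq y_i \right] && \text{(weighted error, with } \textstyle\sum_i w_i = 1\text{)} \\
\alpha_t &= \tfrac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t} && \text{(vote of weak learner } h_t\text{)} \\
w_i &\leftarrow \frac{w_i \exp\left( -\alpha_t \, y_i \, h_t(x_i) \right)}{Z_t} && \text{(}Z_t \text{ renormalizes the weights to sum to 1)}
\end{aligned}$$

Misclassified samples have $y_i h_t(x_i) = -1$, so their weights grow by a factor $e^{\alpha_t}$, while correctly classified samples shrink by $e^{-\alpha_t}$. The final classifier is $H(x) = \operatorname{sign}\left( \sum_t \alpha_t h_t(x) \right)$.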

Gradient Boosting
  • Idea: Fit new models to residual errors

  • Algorithm: Each new model corrects errors of previous ensemble

  • Loss: Minimizes loss function (e.g., MSE, log-loss)
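
For squared-error loss, the idea reduces to repeatedly fitting a small tree to the current residuals and adding a damped version of its prediction to the ensemble. The sketch below illustrates this on synthetic regression data; the data, tree depth, learning rate and number of rounds are assumptions for this example, and it is not the implementation used by `GradientBoostingClassifier`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem: noisy sine curve
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate, n_rounds = 0.1, 100
prediction = np.full_like(y_demo, y_demo.mean())    # initial model: constant mean
initial_mse = np.mean((y_demo - prediction) ** 2)

trees = []
for _ in range(n_rounds):
    residuals = y_demo - prediction                 # negative gradient of 1/2 * squared error
    tree = DecisionTreeRegressor(max_depth=2).fit(X_demo, residuals)
    prediction += learning_rate * tree.predict(X_demo)   # damped correction step
    trees.append(tree)

final_mse = np.mean((y_demo - prediction) ** 2)
print(f"Training MSE: {initial_mse:.3f} -> {final_mse:.3f}")
```

To predict on new data, the same sum is evaluated: the initial constant plus `learning_rate` times the prediction of every stored tree.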

Practical Example: Random Forest

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score

# Random Forest
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features='sqrt',
    bootstrap=True,
    random_state=42
)
rf.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)
print(f"Random Forest Accuracy: {accuracy_score(y_test, y_pred_rf):.4f}")

# Feature importance
importances = rf.feature_importances_
feature_importance = pd.DataFrame({'Feature': X.columns, 'Importance': importances}).sort_values('Importance', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importance)
plt.title('Random Forest Feature Importance')
plt.show()

Gradient Boosting Example

# Gradient Boosting
gb = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    min_samples_split=2,
    min_samples_leaf=1,
    random_state=42
)
gb.fit(X_train, y_train)

y_pred_gb = gb.predict(X_test)
print(f"Gradient Boosting Accuracy: {accuracy_score(y_test, y_pred_gb):.4f}")

# Plot feature importance
feature_importance_gb = pd.DataFrame({'Feature': X.columns, 'Importance': gb.feature_importances_}).sort_values('Importance', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importance_gb)
plt.title('Gradient Boosting Feature Importance')
plt.show()

AdaBoost Example

# AdaBoost
ada = AdaBoostClassifier(n_estimators=100, learning_rate=1.0, random_state=42)
ada.fit(X_train, y_train)
y_pred_ada = ada.predict(X_test)
print(f"AdaBoost Accuracy: {accuracy_score(y_test, y_pred_ada):.4f}")

Generative AI and Wrap-Up

Educational Objectives

Key Concepts

Generative AI Overview

Generative AI models learn to generate new data that resembles the training data:

Generative Adversarial Networks (GANs)
  • Idea: Two neural networks compete (generator vs. discriminator)

  • Training: Generator tries to fool the discriminator, while the discriminator learns to tell real from generated samples

  • Applications: Image generation, style transfer

Variational Autoencoders (VAEs)
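  • Idea: Learn a latent-variable probability distribution of the data with an encoder and a decoder

  • Training: Maximize the evidence lower bound (ELBO), a tractable lower bound on the data likelihood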

  • Applications: Image generation, anomaly detection

Large Language Models (LLMs)
  • Idea: Predict next token in sequence

  • Training: Self-supervised on vast text data

  • Applications: Text generation, translation, coding
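
The next-token idea can be illustrated at toy scale with a character-level bigram model: count which character follows which, normalize the counts into probabilities, and sample one character at a time. This is only a sketch of the autoregressive principle; real LLMs use neural networks over subword tokens, and the training text and add-one smoothing here are assumptions for this example.

```python
import numpy as np

text = "the cat sat on the mat. the cat ate."
vocab = sorted(set(text))
idx = {ch: i for i, ch in enumerate(vocab)}

# "Training": count character bigrams, with add-one smoothing
counts = np.ones((len(vocab), len(vocab)))
for current, following in zip(text, text[1:]):
    counts[idx[current], idx[following]] += 1
probs = counts / counts.sum(axis=1, keepdims=True)   # row i holds P(next | current = vocab[i])

# "Generation": sample one character at a time, conditioned on the previous one
rng = np.random.default_rng(0)
ch, out = "t", ["t"]
for _ in range(30):
    ch = vocab[rng.choice(len(vocab), p=probs[idx[ch]])]
    out.append(ch)
generated = "".join(out)
print(generated)
```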

Generative Model Types

| Model | Approach | Training | Applications |
| --- | --- | --- | --- |
| GAN | Adversarial | Minimax game | Images, audio |
| VAE | Probabilistic | ELBO maximization (approximate maximum likelihood) | Images, data generation |
| Autoregressive | Sequential | Next token prediction | Text, audio, video |
| Diffusion | Iterative denoising | Noise removal | Images, audio |