Statistical Learning and Linear Regression Essentials
Introduction to Statistical Learning
Key Notation
- Y: Response (dependent/output) variable - quantitative or qualitative
- X₁,...,Xₚ: Predictors (features/covariates/independent variables)
- p: Number of predictors
- n: Number of observations
- i: Subject index (i = 1,...,n)
- j: Variable index (j = 1,...,p)
- Data: (Yᵢ, Xᵢ), i = 1,...,n
Types of Learning
- Supervised learning: Have both Y and X (prediction or inference) - PRIMARY FOCUS
- Unsupervised learning: Have X but no Y
Model for Data
Yᵢ = f(Xᵢ) + εᵢ, i = 1,...,n
Assumptions: E(εᵢ) = 0, var(εᵢ) = σ², εᵢ mutually independent
Properties: E(Yᵢ|Xᵢ) = f(Xᵢ), var(Yᵢ|Xᵢ) = σ²
Prediction vs Inference
| Prediction | Inference |
|---|---|
| ˆY = ˆf(X) | Understand relationship between Y and X |
| ˆf can be black box | Need exact form of ˆf |
| Only care about accuracy | Which predictors associated? Linear/nonlinear? |
| Use highly non-linear methods | Use linear models or extensions |
Bias-Variance Tradeoff
- Bias(ˆθ): E(ˆθ) - θ (should be small in absolute value)
- var(ˆθ): E{ˆθ - E(ˆθ)}² (should be small)
- MSE(ˆθ) = Bias²(ˆθ) + var(ˆθ)
For prediction at point x₀:
E(ˆY₀ - Y₀)² = MSE{ˆf(x₀)} + σ² = (Bias{ˆf(x₀)})² + var{ˆf(x₀)} + σ²
- Reducible error: MSE{ˆf(x₀)} - can be reduced with a better model
- Irreducible error: σ² - inherent random error (see the simulation sketch below)
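A minimal Monte Carlo sketch of this decomposition. Everything here is illustrative, not from the notes: the true function f(x) = sin(2x), σ = 0.5, and the deliberately underfit linear working model are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_true(x):                      # hypothetical true regression function
    return np.sin(2 * x)

sigma, x0, B = 0.5, 1.0, 2000       # noise SD, prediction point, no. of training sets

preds = np.empty(B)
for b in range(B):
    x = rng.uniform(0, 3, 50)                      # fresh training set each round
    y = f_true(x) + rng.normal(0, sigma, 50)
    beta = np.polyfit(x, y, 1)                     # deliberately simple (biased) fit
    preds[b] = np.polyval(beta, x0)

bias2 = (preds.mean() - f_true(x0)) ** 2           # (Bias{ˆf(x₀)})²
var = preds.var()                                  # var{ˆf(x₀)}
print(f"Bias² ≈ {bias2:.4f}, Var ≈ {var:.4f}, "
      f"E(ˆY₀ - Y₀)² ≈ {bias2 + var + sigma**2:.4f}")
```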
Model Flexibility Tradeoffs
- As flexibility ↑: Bias ↓, Variance ↑, Interpretability ↓
- Training MSE: Decreasing function of flexibility
- Test MSE: U-shaped function of flexibility
- Overfitting: Small training MSE but large test MSE
Parametric vs Nonparametric Methods
| Parametric | Nonparametric |
|---|---|
| Assumes functional form for f | No assumptions about functional form |
| Estimate parameters (e.g., β's) | Estimate f by getting close to data |
| Easy to fit and interpret | Can fit wide range of shapes |
| May be poor approximation if assumption wrong | Requires large n |
Classification (Qualitative Y)
- Bayes classifier: Predicts class that maximizes P(Y = c|X = x)
- Bayes error rate: 1 - E{max_c P(Y = c|X)} - irreducible lower bound
- Training error rate: (1/n)∑I(yᵢ ≠ ˆyᵢ)
- Test error rate: Ave I(y₀ ≠ ˆy₀)
K-Nearest Neighbors (KNN) Classifier
ˆP(Y = c|X = x₀) = (1/K)∑_{i∈𝒩₀} I(yᵢ = c), where 𝒩₀ = the K nearest neighbors of x₀
- K controls flexibility: larger K → less flexible (higher bias, lower variance)
- K = 1 gives training error rate = 0
- Distance measure: Usually Euclidean
- Nonparametric method
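A from-scratch sketch of the probability estimate above. The helper name knn_predict and the toy two-class Gaussian data are hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, x0, K=5):
    """Estimate ˆP(Y = c | X = x₀) and predict the majority class."""
    d = np.linalg.norm(X_train - x0, axis=1)       # Euclidean distances to x₀
    nbrs = np.argsort(d)[:K]                       # indices of the K nearest points
    classes, counts = np.unique(y_train[nbrs], return_counts=True)
    probs = counts / K                             # (1/K)·∑ I(yᵢ = c)
    return classes[np.argmax(probs)], dict(zip(classes, probs))

# toy two-class data in 2-D (hypothetical)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
print(knn_predict(X, y, np.array([1.0, 1.0]), K=5))
```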
Linear Regression
Simple Linear Regression
Model: Yᵢ = β₀ + β₁xᵢ + εᵢ, εᵢ ~ i.i.d. N(0, σ²)
f(x) = β₀ + β₁x
Least Squares Estimates:
ˆβ₁ = r·Sᵧ/Sₓ (r = sample correlation; Sₓ, Sᵧ = sample standard deviations)
ˆβ₀ = Ȳ - ˆβ₁x̄
Properties:
- Fitted line passes through (x̄, Ȳ)
- ∑eᵢ = 0
- Average of fitted values = Ȳ
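A short sketch of these formulas that also checks the ∑eᵢ = 0 property. The data and the helper name simple_ls are hypothetical.

```python
import numpy as np

def simple_ls(x, y):
    """Least-squares fit via ˆβ₁ = r·Sᵧ/Sₓ and ˆβ₀ = Ȳ - ˆβ₁x̄."""
    r = np.corrcoef(x, y)[0, 1]
    b1 = r * y.std(ddof=1) / x.std(ddof=1)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 30)
y = 1.5 + 2.0 * x + rng.normal(0, 1, 30)           # hypothetical data
b0, b1 = simple_ls(x, y)
e = y - (b0 + b1 * x)                              # residuals
print(b0, b1, e.sum())                             # e.sum() ≈ 0, as noted above
```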
Multiple Linear Regression
Model: Y = Xβ + ε
ˆβ = (XᵀX)⁻¹XᵀY
var(ˆβ) = σ²(XᵀX)⁻¹
ˆσ² = SS_ERR/(n - p - 1)
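A direct translation of these matrix formulas into code. The helper name ols_fit is hypothetical, and X is assumed to carry a leading column of ones.

```python
import numpy as np

def ols_fit(X, y):
    """OLS via the normal equations; X must include a column of 1s."""
    XtX_inv = np.linalg.inv(X.T @ X)               # (XᵀX)⁻¹
    beta = XtX_inv @ X.T @ y                       # ˆβ = (XᵀX)⁻¹XᵀY
    resid = y - X @ beta
    n, k = X.shape                                 # k = p + 1
    sigma2 = resid @ resid / (n - k)               # ˆσ² = SS_ERR/(n - p - 1)
    var_beta = sigma2 * XtX_inv                    # estimate of var(ˆβ)
    return beta, var_beta, sigma2
```

In practice np.linalg.lstsq (or a QR factorization) is numerically safer than forming (XᵀX)⁻¹ explicitly; the inverse is used here only to mirror the formulas.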
ANOVA Table
| Source | SS | df | MS | F |
|---|---|---|---|---|
| Model | SS_REG | p | MS_REG | MS_REG/MS_ERR |
| Error | SS_ERR | n-p-1 | MS_ERR = ˆσ² | |
| Total | SS_TOT | n-1 | | |
R² and Adjusted R²
R² = SS_REG/SS_TOT = 1 - SS_ERR/SS_TOT
R²_adj = 1 - [SS_ERR/(n-p-1)]/[SS_TOT/(n-1)] = 1 - ˆσ²/[SS_TOT/(n-1)]
- R² never decreases when a predictor is added, even a useless one
- Adjusted R² increases only if the new predictor reduces SS_ERR enough to offset the lost degree of freedom
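The same sums of squares in code; the helper name r_squared is hypothetical.

```python
import numpy as np

def r_squared(y, fitted, p):
    """R² and adjusted R² from the sums of squares above (p = no. of predictors)."""
    n = len(y)
    ss_err = np.sum((y - fitted) ** 2)             # SS_ERR
    ss_tot = np.sum((y - y.mean()) ** 2)           # SS_TOT
    r2 = 1 - ss_err / ss_tot
    r2_adj = 1 - (ss_err / (n - p - 1)) / (ss_tot / (n - 1))
    return r2, r2_adj
```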
Hypothesis Testing
Test for single coefficient (H₀: βⱼ = 0):
t = ˆβⱼ/SE(ˆβⱼ) ~ t_{n-p-1}
Model significance test (H₀: β₁ = ... = β_p = 0):
F = MS_REG/MS_ERR ~ F_{p, n-p-1}
Partial F-test for nested models (q = number of predictors dropped from the full model):
F = [SS_ERR(reduced) - SS_ERR(full)]/q ÷ MS_ERR(full) ~ F_{q, n-p-1}
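A sketch of the single-coefficient t-tests, reusing the normal-equations algebra above. The helper name coef_t_tests is hypothetical; scipy supplies the t distribution.

```python
import numpy as np
from scipy import stats

def coef_t_tests(X, y):
    """t statistics and two-sided p-values for H₀: βⱼ = 0 (X includes a 1s column)."""
    n, k = X.shape                                 # k = p + 1
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)               # ˆσ²
    se = np.sqrt(sigma2 * np.diag(XtX_inv))        # SE(ˆβⱼ)
    t = beta / se
    return t, 2 * stats.t.sf(np.abs(t), df=n - k)  # reference dist: t_{n-p-1}
```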
Interaction Terms
f(X) = β₀ + β₁X₁ + β₂X₂ + β₃X₁X₂ = β₀ + (β₁ + β₃X₂)X₁ + β₂X₂
- Effect of X₁ on Y depends on the level of X₂
- Hierarchical principle: if an interaction is included, its main effects should also be included
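A quick illustration with hypothetical simulated data: the interaction enters the design matrix as a product column, and least squares recovers all four coefficients.

```python
import numpy as np

rng = np.random.default_rng(2)
X1, X2 = rng.normal(size=100), rng.normal(size=100)
X = np.column_stack([np.ones(100), X1, X2, X1 * X2])   # [1, X₁, X₂, X₁X₂]
y = 1 + 2*X1 - X2 + 0.5*X1*X2 + rng.normal(0, 1, 100)  # hypothetical truth
beta = np.linalg.lstsq(X, y, rcond=None)[0]            # ≈ (β₀, β₁, β₂, β₃)
print(beta)
```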
Categorical Predictors
- For a predictor with C categories, use C-1 indicator variables (dummy variables)
- Base/reference category: all indicators = 0
- β₀ = mean for base category
- βⱼ = difference in means for category j vs base
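A minimal dummy-coding sketch. The helper name dummy_code is hypothetical, and the first level in sorted order is taken as the base category.

```python
import numpy as np

def dummy_code(categories):
    """C-1 indicator columns; the first sorted level is the base (all zeros)."""
    levels = sorted(set(categories))               # base category = levels[0]
    return np.column_stack([[c == lvl for c in categories]
                            for lvl in levels[1:]]).astype(float)

D = dummy_code(["a", "b", "c", "a", "b"])
print(D)   # rows for "a" are all zeros (base category)
```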
Classification Methods
Bayes Theorem for Classification
pₖ(x) = P(Y = k|X = x) = πₖ·fₖ(x)/∑πₗ·fₗ(x)
- pₖ(x): Posterior probability
- πₖ: Prior probability (prevalence)
- fₖ(x): Class-conditional distribution of X|Y = k
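A sketch of this posterior computation for univariate Gaussian class densities; the Gaussian assumption, the helper name posterior, and all numbers are illustrative.

```python
import numpy as np
from scipy import stats

def posterior(x, priors, means, sds):
    """pₖ(x) = πₖ·fₖ(x) / ∑ₗ πₗ·fₗ(x) with univariate Gaussian fₖ (assumed)."""
    f = np.array([stats.norm.pdf(x, m, s) for m, s in zip(means, sds)])
    num = np.asarray(priors) * f                   # πₖ·fₖ(x)
    return num / num.sum()                         # normalize over classes

print(posterior(1.2, priors=[0.6, 0.4], means=[0.0, 2.0], sds=[1.0, 1.0]))
```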
Linear Discriminant Analysis (LDA)
Assumption: X|Y = k ~ N(μₖ, Σ) with common covariance matrix
Decision boundary: Linear in x
Quadratic Discriminant Analysis (QDA)
Assumption: X|Y = k ~ N(μₖ, Σₖ) - class-specific covariance matrices
Decision boundary: Quadratic in x
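One way to fit both, assuming scikit-learn is available; the simulated two-class data are hypothetical.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(3)
X = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), 50),
               rng.multivariate_normal([2, 2], np.eye(2), 50)])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis().fit(X, y)       # pooled Σ → linear boundary
qda = QuadraticDiscriminantAnalysis().fit(X, y)    # per-class Σₖ → quadratic boundary
print(lda.predict([[1.0, 1.0]]), qda.predict([[1.0, 1.0]]))
```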
Logistic Regression
logit{p(x)} = log[p(x)/(1-p(x))] = xᵀβ
p(x) = exp(xᵀβ)/[1 + exp(xᵀβ)]
- βⱼ = change in log-odds when Xⱼ increases by 1 unit, other predictors held fixed
- Odds ratio = exp(βⱼ)
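A sketch of the inverse-logit and odds-ratio calculations; the coefficients below are hypothetical, not fitted.

```python
import numpy as np

def inv_logit(x, beta):
    """p(x) = exp(xᵀβ)/[1 + exp(xᵀβ)], written in a numerically stable form."""
    return 1.0 / (1.0 + np.exp(-(x @ beta)))

beta = np.array([-1.0, 0.8])                   # hypothetical (intercept, slope)
print(inv_logit(np.array([1.0, 2.0]), beta))   # P(Y = 1) at X₁ = 2
print(np.exp(beta[1]))                         # odds ratio per 1-unit increase in X₁
```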
Resampling Methods
Validation Set Approach
- Randomly split data into training and validation sets
- Error estimate is highly variable across splits and tends to overestimate the test error, since the model is fit on only part of the data
Cross-Validation
- LOOCV: Approximately unbiased, computationally intensive
- k-Fold CV: Intermediate bias, lower variance than LOOCV
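A plain k-fold CV sketch; the helper name kfold_cv_mse is hypothetical, and setting k = n gives LOOCV.

```python
import numpy as np

def kfold_cv_mse(X, y, fit, predict, k=5, seed=0):
    """k-fold CV estimate of test MSE for a user-supplied fit/predict pair."""
    idx = np.random.default_rng(seed).permutation(len(y))   # shuffle once
    errs = []
    for fold in np.array_split(idx, k):                     # k held-out folds
        train = np.setdiff1d(idx, fold)
        model = fit(X[train], y[train])
        errs.append(np.mean((y[fold] - predict(model, X[fold])) ** 2))
    return float(np.mean(errs))                             # average fold MSE
```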
Bootstrap
- Purpose: Estimate standard errors, construct confidence intervals
- Sample WITH REPLACEMENT from original data
- Number of resamples: B ≥ 500 for bias/variance estimates; B ≥ 2000 for confidence intervals
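A bootstrap standard-error sketch following these points; the helper name bootstrap_se and the exponential sample are hypothetical.

```python
import numpy as np

def bootstrap_se(data, stat, B=2000, seed=0):
    """SE of a statistic: B resamples of size n drawn WITH replacement."""
    rng = np.random.default_rng(seed)
    n = len(data)
    reps = np.array([stat(data[rng.integers(0, n, n)]) for _ in range(B)])
    return reps.std(ddof=1)                        # SD of bootstrap replicates

x = np.random.default_rng(4).exponential(size=100) # hypothetical sample
print(bootstrap_se(x, np.median))                  # SE of the sample median
```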