Statistical Learning and Linear Regression Essentials

Posted by Anonymous and classified in Mathematics


Introduction to Statistical Learning

Key Notation

  • Y: Response (dependent/output) variable - quantitative or qualitative
  • X₁,...,Xₚ: Predictors (features/covariates/independent variables)
  • p: Number of predictors
  • n: Number of observations
  • i: Subject index (i = 1,...,n)
  • j: Variable index (j = 1,...,p)
  • Data: (Yᵢ, Xᵢ), i = 1,...,n

Types of Learning

  • Supervised learning: Have both Y and X (prediction or inference) - PRIMARY FOCUS
  • Unsupervised learning: Have X but no Y

Model for Data

Yᵢ = f(Xᵢ) + εᵢ, i = 1,...,n
Assumptions: E(εᵢ) = 0, var(εᵢ) = σ², εᵢ are mutually independent
Properties: E(Yᵢ|Xᵢ) = f(Xᵢ), var(Yᵢ|Xᵢ) = σ²

Prediction vs Inference

  • Prediction: ˆY = ˆf(X); ˆf can be a black box; only prediction accuracy matters; highly non-linear methods are fine
  • Inference: understand the relationship between Y and X; need the exact form of ˆf; ask which predictors are associated and whether effects are linear or nonlinear; use linear models or their extensions

Bias-Variance Tradeoff

  • Bias(ˆθ): E(ˆθ) - θ (should be small in absolute value)
  • var(ˆθ): E{ˆθ - E(ˆθ)}² (should be small)
  • MSE(ˆθ) = Bias²(ˆθ) + var(ˆθ)

For prediction at point x₀:

E(ˆY₀ - Y₀)² = MSE{ˆf(x₀)} + σ²
               = (Bias{ˆf(x₀)})² + var{ˆf(x₀)} + σ²
  • Reducible error: MSE{ˆf(x₀)} - can be reduced with better model
  • Irreducible error: σ² - inherent random error
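
The identity MSE(ˆθ) = Bias²(ˆθ) + var(ˆθ) can be checked numerically. A minimal numpy sketch (illustrative; the shrinkage estimator and seed are arbitrary choices, not from the notes):

```python
import numpy as np

# Monte Carlo check of MSE(theta_hat) = Bias^2 + var for a deliberately
# biased estimator of a mean: theta_hat = (1/(n+1)) * sum(Y_i).
rng = np.random.default_rng(0)
theta, n, reps = 2.0, 10, 200_000
samples = rng.normal(theta, 1.0, size=(reps, n))
theta_hat = samples.sum(axis=1) / (n + 1)   # shrunken (biased) sample mean

bias = theta_hat.mean() - theta             # approx -theta/(n+1)
var = theta_hat.var()                       # ddof=0, so identity is exact in-sample
mse = np.mean((theta_hat - theta) ** 2)
```

With the population-style variance (ddof=0), the decomposition holds exactly for the simulated draws, not just in expectation.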

Model Flexibility Tradeoffs

  • As flexibility ↑: Bias ↓, Variance ↑, Interpretability ↓
  • Training MSE: Decreasing function of flexibility
  • Test MSE: U-shaped function of flexibility
  • Overfitting: Small training MSE but large test MSE

Parametric vs Nonparametric Methods

  • Parametric: assumes a functional form for f; estimates parameters (e.g., the β's); easy to fit and interpret; may approximate f poorly if the assumed form is wrong
  • Nonparametric: no assumption about the functional form; estimates f by getting close to the data; can fit a wide range of shapes; requires large n

Classification (Qualitative Y)

  • Bayes classifier: Predicts class that maximizes P(Y = c|X = x)
  • Bayes error rate: 1 - E{max_c P(Y = c|X)} - irreducible lower bound
  • Training error rate: (1/n)∑I(yᵢ ≠ ˆyᵢ)
  • Test error rate: Ave I(y₀ ≠ ˆy₀)

K-Nearest Neighbors (KNN) Classifier

ˆP(Y = c|X = x₀) = (1/K)∑_{i∈𝒩₀} I(yᵢ = c)
  • K controls flexibility: Larger K → less flexible (higher bias, lower variance)
  • K = 1 gives training error rate = 0
  • Distance measure: Usually Euclidean
  • Nonparametric method
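
A minimal from-scratch sketch of the KNN rule (toy data and the function name are illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, x0, k):
    """Majority vote among the k Euclidean-nearest training points to x0."""
    dist = np.linalg.norm(X_train - x0, axis=1)
    nearest = np.argsort(dist)[:k]
    classes, counts = np.unique(y_train[nearest], return_counts=True)
    return classes[np.argmax(counts)]

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# K = 1: each training point is its own nearest neighbor -> training error 0
train_preds = [knn_predict(X, y, xi, k=1) for xi in X]
```

Note how K = 1 reproduces every training label exactly, illustrating why training error alone cannot guide the choice of K.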

Linear Regression

Simple Linear Regression

Model: Yᵢ = β₀ + β₁xᵢ + εᵢ, εᵢ ~ i.i.d. N(0, σ²)
f(x) = β₀ + β₁x

Least Squares Estimates:

ˆβ₁ = r·Sᵧ/Sₓ
ˆβ₀ = Ȳ - ˆβ₁x̄

Properties:

  • Fitted line passes through (x̄, Ȳ)
  • ∑eᵢ = 0
  • Average of fitted values = Ȳ
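
These estimates and properties are easy to verify numerically. A sketch with simulated data (the true coefficients and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 1, 50)

r = np.corrcoef(x, y)[0, 1]
b1 = r * y.std() / x.std()        # beta1_hat = r * S_y / S_x
b0 = y.mean() - b1 * x.mean()     # beta0_hat = Ybar - beta1_hat * xbar

e = y - (b0 + b1 * x)             # residuals
```

By construction the line passes through (x̄, Ȳ), and the residuals sum to zero.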

Multiple Linear Regression

Model: Y = Xβ + ε
ˆβ = (XᵀX)⁻¹XᵀY
var(ˆβ) = σ²(XᵀX)⁻¹
ˆσ² = SS_ERR/(n - p - 1)
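
A sketch of ˆβ = (XᵀX)⁻¹XᵀY and ˆσ² on simulated data, solving the normal equations with a linear solve rather than an explicit inverse (design and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + p predictors
beta = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta + rng.normal(0, 0.1, n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p - 1)       # SS_ERR / (n - p - 1)
```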

ANOVA Table

Source   SS       df         MS             F
Model    SS_REG   p          MS_REG         MS_REG/MS_ERR
Error    SS_ERR   n - p - 1  MS_ERR = ˆσ²
Total    SS_TOT   n - 1

R² and Adjusted R²

R² = SS_REG/SS_TOT = 1 - SS_ERR/SS_TOT
R²_adj = 1 - [SS_ERR/(n-p-1)]/[SS_TOT/(n-1)] = 1 - ˆσ²/[SS_TOT/(n-1)]
  • R² increases even when useless predictors are added
  • Adjusted R² rewards adding a predictor only if it reduces the error SS considerably
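
A numerical sketch of both statistics, showing that plain R² cannot decrease when a pure-noise predictor is added (data, seed, and helper names are illustrative):

```python
import numpy as np

def ols_fitted(X, y):
    """Fitted values from least squares."""
    return X @ np.linalg.lstsq(X, y, rcond=None)[0]

def r2_stats(y, fitted, p):
    n = len(y)
    ss_tot = np.sum((y - y.mean()) ** 2)
    ss_err = np.sum((y - fitted) ** 2)
    r2 = 1 - ss_err / ss_tot
    r2_adj = 1 - (ss_err / (n - p - 1)) / (ss_tot / (n - 1))
    return r2, r2_adj

rng = np.random.default_rng(3)
n = 60
x = rng.normal(size=n)
junk = rng.normal(size=n)                 # predictor unrelated to y
y = 1.0 + 2.0 * x + rng.normal(size=n)

r2_small, adj_small = r2_stats(y, ols_fitted(np.column_stack([np.ones(n), x]), y), p=1)
r2_big, adj_big = r2_stats(y, ols_fitted(np.column_stack([np.ones(n), x, junk]), y), p=2)
```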

Hypothesis Testing

Test for single coefficient (H₀: βⱼ = 0):

t = ˆβⱼ/SE(ˆβⱼ) ~ t_{n-p-1}

Model significance test (H₀: β₁ = ... = β_p = 0):

F = MS_REG/MS_ERR ~ F_{p, n-p-1}

Partial F-test for nested models:

F = {[SS_ERR(reduced) - SS_ERR(full)]/q} / MS_ERR(full) ~ F_{q, n-p-1}, where q = number of coefficients set to zero in the reduced model
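
A numerical sketch of the partial F-statistic (here the reduced model is intercept-only, so q = 2; the data are simulated so that only x1 matters):

```python
import numpy as np

def ss_err(X, y):
    """Error sum of squares from a least-squares fit."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta
    return r @ r

rng = np.random.default_rng(4)
n = 60
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)   # x2 is truly irrelevant

full = np.column_stack([np.ones(n), x1, x2])
reduced = np.column_stack([np.ones(n)])   # drops x1 and x2
p, q = 2, 2
F = ((ss_err(reduced, y) - ss_err(full, y)) / q) / (ss_err(full, y) / (n - p - 1))
```

Because x1 has a strong true effect, F lands far out in the right tail of F_{2,57}.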

Interaction Terms

f(X) = β₀ + β₁X₁ + β₂X₂ + β₃X₁X₂
     = β₀ + (β₁ + β₃X₂)X₁ + β₂X₂
  • Effect of X₁ depends on X₂
  • Hierarchical principle: If interaction included, main effects should also be included

Categorical Predictors

  • Use C-1 indicator variables (dummy variables)
  • Base/reference category: all indicators = 0
  • β₀ = mean for base category
  • βⱼ = difference in means for category j vs base
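
A tiny sketch with three categories (A as base, indicators for B and C; the data are made up so the group means are obvious):

```python
import numpy as np

# Categories: A (base), B, C with C-1 = 2 indicator variables.
y = np.array([1.0, 2.0, 3.0,    # A: mean 2
              4.0, 6.0,         # B: mean 5
              7.0, 9.0, 8.0])   # C: mean 8
d_b = np.array([0, 0, 0, 1, 1, 0, 0, 0])
d_c = np.array([0, 0, 0, 0, 0, 1, 1, 1])
X = np.column_stack([np.ones(8), d_b, d_c])
b0, b_b, b_c = np.linalg.lstsq(X, y, rcond=None)[0]
```

Least squares recovers the group means exactly: b0 is the base-category mean, and each indicator coefficient is that category's mean minus the base mean.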

Classification Methods

Bayes Theorem for Classification

pₖ(x) = P(Y = k|X = x) = πₖ·fₖ(x)/∑ₗ πₗ·fₗ(x)
  • pₖ(x): Posterior probability
  • πₖ: Prior probability (prevalence)
  • fₖ(x): Class-conditional distribution of X|Y = k

Linear Discriminant Analysis (LDA)

Assumption: X|Y = k ~ N(μₖ, Σ) with common covariance matrix

Decision boundary: Linear in x

Quadratic Discriminant Analysis (QDA)

Assumption: X|Y = k ~ N(μₖ, Σₖ) - class-specific covariance matrices

Decision boundary: Quadratic in x

Logistic Regression

logit{p(x)} = log[p(x)/(1-p(x))] = xᵀβ
p(x) = exp(xᵀβ)/[1 + exp(xᵀβ)]
  • βⱼ = change in log-odds when Xⱼ increases by 1 unit
  • Odds ratio = exp(βⱼ)
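
The odds-ratio interpretation follows directly from the logit form; a minimal check (coefficient values are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def odds(p):
    return p / (1 - p)

beta0, beta1 = -1.0, 0.8
p1 = sigmoid(beta0 + beta1 * 2.0)
p2 = sigmoid(beta0 + beta1 * 3.0)   # X increased by one unit
ratio = odds(p2) / odds(p1)         # equals exp(beta1) exactly
```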

Resampling Methods

Validation Set Approach

  • Randomly split data into training and validation sets
  • High variability; may overestimate test error

Cross-Validation

  • LOOCV: Approximately unbiased, computationally intensive
  • k-Fold CV: Intermediate bias, lower variance than LOOCV
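
A sketch of k-fold CV estimating the test MSE of simple linear regression (simulated data with irreducible σ² = 1, so the CV estimate should sit near 1):

```python
import numpy as np

def kfold_cv_mse(x, y, k, rng):
    """Estimate test MSE of simple linear regression by k-fold CV."""
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        b1, b0 = np.polyfit(x[train], y[train], 1)   # slope, intercept
        errs.append(np.mean((y[test] - (b0 + b1 * x[test])) ** 2))
    return float(np.mean(errs))

rng = np.random.default_rng(6)
x = rng.uniform(0, 5, 200)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, 200)
cv_mse = kfold_cv_mse(x, y, k=10, rng=rng)
```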

Bootstrap

  • Purpose: Estimate standard errors, construct confidence intervals
  • Sample WITH REPLACEMENT from original data
  • B ≥ 500 for bias/variance; B ≥ 2000 for CIs
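
A sketch of the bootstrap SE of the sample mean, compared with the analytic formula s/√n (sample size and B are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(10, 2, size=100)
B = 2000
# Resample WITH REPLACEMENT, recompute the statistic each time
boot_means = np.array([rng.choice(data, size=len(data), replace=True).mean()
                       for _ in range(B)])
se_boot = boot_means.std(ddof=1)
se_formula = data.std(ddof=1) / np.sqrt(len(data))   # analytic SE of the mean
```

The bootstrap SE tracks the analytic value closely; its advantage is that it applies equally to statistics with no closed-form SE.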
