Statistical Learning and Linear Regression Essentials
Introduction to Statistical Learning
Key Notation
- Y: Response (dependent/output) variable - quantitative or qualitative
- X₁,...,Xₚ: Predictors (features/covariates/independent variables)
- p: Number of predictors
- n: Number of observations
- i: Subject index (i = 1,...,n)
- j: Variable index (j = 1,...,p)
- Data: (Yᵢ, Xᵢ), i = 1,...,n
Types of Learning
- Supervised learning: Have both Y and X (prediction or inference) - PRIMARY FOCUS
- Unsupervised learning: Have X but no Y
Model for Data
Yᵢ = f(Xᵢ) + εᵢ, i = 1,...,n
Assumptions: E(εᵢ) = 0, var(εᵢ) = σ², εᵢ mutually independent
Properties: E(Yᵢ|Xᵢ) = f(Xᵢ), var(Yᵢ|Xᵢ) = σ²
Prediction vs Inference
| Prediction | Inference |
|---|---|
| ˆY = ˆf(X) | Understand relationship between Y and X |
| ˆf can be black box | Need exact form of ˆf |
| Only care about accuracy | Which predictors associated? Linear/nonlinear? |
| Use highly non-linear methods | Use linear models or extensions |
Bias-Variance Tradeoff
- Bias(ˆθ): E(ˆθ) - θ (should be small in absolute value)
- var(ˆθ): E{ˆθ - E(ˆθ)}² (should be small)
- MSE(ˆθ) = Bias²(ˆθ) + var(ˆθ)
For prediction at point x₀:
E(ˆY₀ - Y₀)² = MSE{ˆf(x₀)} + σ² = (Bias{ˆf(x₀)})² + var{ˆf(x₀)} + σ²
- Reducible error: MSE{ˆf(x₀)} - can be reduced with a better model
- Irreducible error: σ² - inherent random error (see the simulation sketch below)
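A minimal Monte Carlo sketch of this decomposition. Everything here is illustrative, not from the notes: the true function f(x) = sin(2x), σ = 0.5, and the deliberately underfit linear working model are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_true(x):                      # hypothetical true regression function
    return np.sin(2 * x)

sigma, x0, B = 0.5, 1.0, 2000       # noise SD, prediction point, no. of training sets

preds = np.empty(B)
for b in range(B):
    x = rng.uniform(0, 3, 50)                      # fresh training set each round
    y = f_true(x) + rng.normal(0, sigma, 50)
    beta = np.polyfit(x, y, 1)                     # deliberately simple (biased) fit
    preds[b] = np.polyval(beta, x0)

bias2 = (preds.mean() - f_true(x0)) ** 2           # (Bias{ˆf(x₀)})²
var = preds.var()                                  # var{ˆf(x₀)}
print(f"Bias² ≈ {bias2:.4f}, Var ≈ {var:.4f}, "
      f"E(ˆY₀ - Y₀)² ≈ {bias2 + var + sigma**2:.4f}")
```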
Model Flexibility Tradeoffs
- As flexibility ↑: Bias ↓, Variance ↑, Interpretability ↓
- Training MSE: Decreasing function of flexibility
- Test MSE: U-shaped function of flexibility
- Overfitting: Small training MSE but large test MSE
Parametric vs Nonparametric Methods
| Parametric | Nonparametric |
|---|---|
| Assumes functional form for f | No assumptions about functional form |
| Estimate parameters (e.g., β's) | Estimate f by getting close to data |
| Easy to fit and interpret | Can fit wide range of shapes |
| May be poor approximation if assumption wrong | Requires large n |
Classification (Qualitative Y)
- Bayes classifier: Predicts class that maximizes P(Y = c|X = x)
- Bayes error rate: 1 - E{max_c P(Y = c|X)} - irreducible lower bound
- Training error rate: (1/n)∑I(yᵢ ≠ ˆyᵢ)
- Test error rate: Ave I(y₀ ≠ ˆy₀)
K-Nearest Neighbors (KNN) Classifier
ˆP(Y = c|X = x₀) = (1/K)∑_{i∈𝒩₀} I(yᵢ = c), where 𝒩₀ = the K nearest neighbors of x₀
- K controls flexibility: larger K → less flexible (higher bias, lower variance)
- K = 1 gives training error rate = 0
- Distance measure: Usually Euclidean
- Nonparametric method
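A from-scratch sketch of the probability estimate above. The helper name knn_predict and the toy two-class Gaussian data are hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, x0, K=5):
    """Estimate ˆP(Y = c | X = x₀) and predict the majority class."""
    d = np.linalg.norm(X_train - x0, axis=1)       # Euclidean distances to x₀
    nbrs = np.argsort(d)[:K]                       # indices of the K nearest points
    classes, counts = np.unique(y_train[nbrs], return_counts=True)
    probs = counts / K                             # (1/K)·∑ I(yᵢ = c)
    return classes[np.argmax(probs)], dict(zip(classes, probs))

# toy two-class data in 2-D (hypothetical)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
print(knn_predict(X, y, np.array([1.0, 1.0]), K=5))
```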
Linear Regression
Simple Linear Regression
Model: Yᵢ = β₀ + β₁xᵢ + εᵢ, εᵢ ~ i.i.d. N(0, σ²)
f(x) = β₀ + β₁x
Least Squares Estimates:
ˆβ₁ = r·Sᵧ/Sₓ (r = sample correlation; Sₓ, Sᵧ = sample standard deviations)
ˆβ₀ = Ȳ - ˆβ₁x̄
Properties:
- Fitted line passes through (x̄, Ȳ)
- ∑eᵢ = 0
- Average of fitted values = Ȳ
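A short sketch of these formulas that also checks the ∑eᵢ = 0 property. The data and the helper name simple_ls are hypothetical.

```python
import numpy as np

def simple_ls(x, y):
    """Least-squares fit via ˆβ₁ = r·Sᵧ/Sₓ and ˆβ₀ = Ȳ - ˆβ₁x̄."""
    r = np.corrcoef(x, y)[0, 1]
    b1 = r * y.std(ddof=1) / x.std(ddof=1)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 30)
y = 1.5 + 2.0 * x + rng.normal(0, 1, 30)           # hypothetical data
b0, b1 = simple_ls(x, y)
e = y - (b0 + b1 * x)                              # residuals
print(b0, b1, e.sum())                             # e.sum() ≈ 0, as noted above
```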
Multiple Linear Regression
Model: Y = Xβ + ε
ˆβ = (XᵀX)⁻¹XᵀY
var(ˆβ) = σ²(XᵀX)⁻¹
ˆσ² = SS_ERR/(n - p - 1)
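A direct translation of these matrix formulas into code. The helper name ols_fit is hypothetical, and X is assumed to carry a leading column of ones.

```python
import numpy as np

def ols_fit(X, y):
    """OLS via the normal equations; X must include a column of 1s."""
    XtX_inv = np.linalg.inv(X.T @ X)               # (XᵀX)⁻¹
    beta = XtX_inv @ X.T @ y                       # ˆβ = (XᵀX)⁻¹XᵀY
    resid = y - X @ beta
    n, k = X.shape                                 # k = p + 1
    sigma2 = resid @ resid / (n - k)               # ˆσ² = SS_ERR/(n - p - 1)
    var_beta = sigma2 * XtX_inv                    # estimate of var(ˆβ)
    return beta, var_beta, sigma2
```

In practice np.linalg.lstsq (or a QR factorization) is numerically safer than forming (XᵀX)⁻¹ explicitly; the inverse is used here only to mirror the formulas.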
ANOVA Table
| Source | SS | df | MS | F |
|---|---|---|---|---|
| Model | SS_REG | p | MS_REG | MS_REG/MS_ERR |
| Error | SS_ERR | n-p-1 | MS_ERR = ˆσ² | |
| Total | SS_TOT | n-1 | | |
R² and Adjusted R²
R² = SS_REG/SS_TOT = 1 - SS_ERR/SS_TOT
R²_adj = 1 - [SS_ERR/(n-p-1)]/[SS_TOT/(n-1)] = 1 - ˆσ²/[SS_TOT/(n-1)]
- R² never decreases when a predictor is added, even a useless one
- Adjusted R² increases only if the new predictor reduces SS_ERR enough to offset the lost degree of freedom
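The same sums of squares in code; the helper name r_squared is hypothetical.

```python
import numpy as np

def r_squared(y, fitted, p):
    """R² and adjusted R² from the sums of squares above (p = no. of predictors)."""
    n = len(y)
    ss_err = np.sum((y - fitted) ** 2)             # SS_ERR
    ss_tot = np.sum((y - y.mean()) ** 2)           # SS_TOT
    r2 = 1 - ss_err / ss_tot
    r2_adj = 1 - (ss_err / (n - p - 1)) / (ss_tot / (n - 1))
    return r2, r2_adj
```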
Hypothesis Testing
Test for single coefficient (H₀: βⱼ = 0):
t = ˆβⱼ/SE(ˆβⱼ) ~ t_{n-p-1}
Model significance test (H₀: β₁ = ... = β_p = 0):
F = MS_REG/MS_ERR ~ F_{p, n-p-1}
Partial F-test for nested models (q = number of predictors dropped from the full model):
F = [SS_ERR(reduced) - SS_ERR(full)]/q ÷ MS_ERR(full) ~ F_{q, n-p-1}
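A sketch of the single-coefficient t-tests, reusing the normal-equations algebra above. The helper name coef_t_tests is hypothetical; scipy supplies the t distribution.

```python
import numpy as np
from scipy import stats

def coef_t_tests(X, y):
    """t statistics and two-sided p-values for H₀: βⱼ = 0 (X includes a 1s column)."""
    n, k = X.shape                                 # k = p + 1
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)               # ˆσ²
    se = np.sqrt(sigma2 * np.diag(XtX_inv))        # SE(ˆβⱼ)
    t = beta / se
    return t, 2 * stats.t.sf(np.abs(t), df=n - k)  # reference dist: t_{n-p-1}
```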
Interaction Terms
f(X) = β₀ + β₁X₁ + β₂X₂ + β₃X₁X₂ = β₀ + (β₁ + β₃X₂)X₁ + β₂X₂
- Effect of X₁ on Y depends on the level of X₂
- Hierarchical principle: if an interaction is included, its main effects should also be included
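A quick illustration with hypothetical simulated data: the interaction enters the design matrix as a product column, and least squares recovers all four coefficients.

```python
import numpy as np

rng = np.random.default_rng(2)
X1, X2 = rng.normal(size=100), rng.normal(size=100)
X = np.column_stack([np.ones(100), X1, X2, X1 * X2])   # [1, X₁, X₂, X₁X₂]
y = 1 + 2*X1 - X2 + 0.5*X1*X2 + rng.normal(0, 1, 100)  # hypothetical truth
beta = np.linalg.lstsq(X, y, rcond=None)[0]            # ≈ (β₀, β₁, β₂, β₃)
print(beta)
```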
Categorical Predictors
- For a predictor with C categories, use C-1 indicator variables (dummy variables)
- Base/reference category: all indicators = 0
- β₀ = mean for base category
- βⱼ = difference in means for category j vs base
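A minimal dummy-coding sketch. The helper name dummy_code is hypothetical, and the first level in sorted order is taken as the base category.

```python
import numpy as np

def dummy_code(categories):
    """C-1 indicator columns; the first sorted level is the base (all zeros)."""
    levels = sorted(set(categories))               # base category = levels[0]
    return np.column_stack([[c == lvl for c in categories]
                            for lvl in levels[1:]]).astype(float)

D = dummy_code(["a", "b", "c", "a", "b"])
print(D)   # rows for "a" are all zeros (base category)
```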
Classification Methods
Bayes Theorem for Classification
pₖ(x) = P(Y = k|X = x) = πₖ·fₖ(x)/∑πₗ·fₗ(x)
- pₖ(x): Posterior probability
- πₖ: Prior probability (prevalence)
- fₖ(x): Class-conditional distribution of X|Y = k
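A sketch of this posterior computation for univariate Gaussian class densities; the Gaussian assumption, the helper name posterior, and all numbers are illustrative.

```python
import numpy as np
from scipy import stats

def posterior(x, priors, means, sds):
    """pₖ(x) = πₖ·fₖ(x) / ∑ₗ πₗ·fₗ(x) with univariate Gaussian fₖ (assumed)."""
    f = np.array([stats.norm.pdf(x, m, s) for m, s in zip(means, sds)])
    num = np.asarray(priors) * f                   # πₖ·fₖ(x)
    return num / num.sum()                         # normalize over classes

print(posterior(1.2, priors=[0.6, 0.4], means=[0.0, 2.0], sds=[1.0, 1.0]))
```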
Linear Discriminant Analysis (LDA)
Assumption: X|Y = k ~ N(μₖ, Σ) with common covariance matrix
Decision boundary: Linear in x
Quadratic Discriminant Analysis (QDA)
Assumption: X|Y = k ~ N(μₖ, Σₖ) - class-specific covariance matrices
Decision boundary: Quadratic in x
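One way to fit both, assuming scikit-learn is available; the simulated two-class data are hypothetical.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(3)
X = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), 50),
               rng.multivariate_normal([2, 2], np.eye(2), 50)])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis().fit(X, y)       # pooled Σ → linear boundary
qda = QuadraticDiscriminantAnalysis().fit(X, y)    # per-class Σₖ → quadratic boundary
print(lda.predict([[1.0, 1.0]]), qda.predict([[1.0, 1.0]]))
```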
Logistic Regression
logit{p(x)} = log[p(x)/(1-p(x))] = xᵀβ
p(x) = exp(xᵀβ)/[1 + exp(xᵀβ)]
- βⱼ = change in log-odds when Xⱼ increases by 1 unit, other predictors held fixed
- Odds ratio = exp(βⱼ)
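A sketch of the inverse-logit and odds-ratio calculations; the coefficients below are hypothetical, not fitted.

```python
import numpy as np

def inv_logit(x, beta):
    """p(x) = exp(xᵀβ)/[1 + exp(xᵀβ)], written in a numerically stable form."""
    return 1.0 / (1.0 + np.exp(-(x @ beta)))

beta = np.array([-1.0, 0.8])                   # hypothetical (intercept, slope)
print(inv_logit(np.array([1.0, 2.0]), beta))   # P(Y = 1) at X₁ = 2
print(np.exp(beta[1]))                         # odds ratio per 1-unit increase in X₁
```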
Resampling Methods
Validation Set Approach
- Randomly split data into training and validation sets
- Error estimate is highly variable across splits and tends to overestimate the test error, since the model is fit on only part of the data
Cross-Validation
- LOOCV: Approximately unbiased, computationally intensive
- k-Fold CV: Intermediate bias, lower variance than LOOCV
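A plain k-fold CV sketch; the helper name kfold_cv_mse is hypothetical, and setting k = n gives LOOCV.

```python
import numpy as np

def kfold_cv_mse(X, y, fit, predict, k=5, seed=0):
    """k-fold CV estimate of test MSE for a user-supplied fit/predict pair."""
    idx = np.random.default_rng(seed).permutation(len(y))   # shuffle once
    errs = []
    for fold in np.array_split(idx, k):                     # k held-out folds
        train = np.setdiff1d(idx, fold)
        model = fit(X[train], y[train])
        errs.append(np.mean((y[fold] - predict(model, X[fold])) ** 2))
    return float(np.mean(errs))                             # average fold MSE
```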
Bootstrap
- Purpose: Estimate standard errors, construct confidence intervals
- Sample WITH REPLACEMENT from original data
- Number of resamples: B ≥ 500 for bias/variance estimates; B ≥ 2000 for confidence intervals
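A bootstrap standard-error sketch following these points; the helper name bootstrap_se and the exponential sample are hypothetical.

```python
import numpy as np

def bootstrap_se(data, stat, B=2000, seed=0):
    """SE of a statistic: B resamples of size n drawn WITH replacement."""
    rng = np.random.default_rng(seed)
    n = len(data)
    reps = np.array([stat(data[rng.integers(0, n, n)]) for _ in range(B)])
    return reps.std(ddof=1)                        # SD of bootstrap replicates

x = np.random.default_rng(4).exponential(size=100) # hypothetical sample
print(bootstrap_se(x, np.median))                  # SE of the sample median
```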