Machine Learning Fundamentals: Algorithms and Techniques
ML → SUBSET OF AI; allows computers to learn from data without being explicitly programmed.
BASIC MATH INFO
- matrix multiplication: if A is of size m x n, B is of size n x p, then AB size = m x p
- col of A must = row of B
- each row A * each column B
- ie A of shape(m, n) * B of shape(n, p) = AB of shape(m, p)
- to find logbase2 (n):
$$ \log_2(n) = \frac{\ln(n)}{\ln(2)} $$
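A quick check of the change-of-base formula in Python (math.log2 is included only for comparison):

```python
import math

n = 8
print(math.log(n) / math.log(2))   # change-of-base: ln(n) / ln(2) = 3.0
print(math.log2(n))                # direct helper gives the same value
```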
DEFINITIONS
- Supervised Learning: Models learn from labeled data → learn a hypothesis function that approximates the target function
- classification: goal = categorise input into classes
- predicting if patient has a disease given symptoms (yes vs no)
- identify type of fruit given images of fruit (apple, pear..)
- regression: predict continuous numerical values based on input data
- predict house price given house size
- Unsupervised Learning: Models find patterns in unlabeled data (clustering)
- model groups diff fruits which have same colour together
- Overfitting: Model performs well on training data but poorly on test data.
- High variance: Model is too complex → Overfitting
- Underfitting: Model is too simple to capture patterns in data.
- High bias: Model is too simple → Underfitting.
PIPELINE
- data preprocessing → missing data, encode categorical variables (one-hot, ordinal etc)
- feature engineering → select relevant + remove irrelevant features, design & create new features
- model selection
- model training → model learns to map input to output, parameters adjusted to minimise pred errors
- model eval
REPRESENTING DATA
- tabular
- one-hot encoding: e.g. "science", "arts", "hybrid" → [1,0,0], [0,1,0], [0,0,1]
- no. of elements in list = no. of labels
- ordinal encoding: e.g. "science", "arts", "hybrid" → 0, 2, 1
- image / video
- pixel grid representation
- flat b&w → 1 channel with 0s and 1s
- grayscale image → 1 channel on a scale of 256 possible levels
- varying shades of grey; 0 to 255 inclusive
- RGB → combining red, blue, green channels with 256 possible levels
- video: sequence of images → repeated pixel grid rep for each image
- word / sentence
words: one-hot encoding → no. of words = no. of elems
- "ume" → [1, 0, 0]
- "is" → [0, 1, 0]
- "cute" → [0, 0, 1]
sentence: sequence of words →
- list of one-hot encoded words, eg. "ume is cute" → [ [1,0,0], [0,1,0], [0,0,1] ]
- OR bag of words: initialise counter list where each element corresponds to a word → count frequency of word in sentence
- e.g. "ume is so so cute cute cute" →

| ume | is | so | cute |
|---|---|---|---|
| 1 | 1 | 2 | 3 |
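A minimal sketch of the word encodings above in plain Python (the vocabulary and sentence are the toy examples from these notes):

```python
# one-hot encoding and bag of words for a toy vocabulary
vocab = ["ume", "is", "so", "cute"]

def one_hot(word):
    # one element per vocabulary word; 1 at the word's index
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

sentence = "ume is so so cute cute cute".split()

sequence = [one_hot(w) for w in sentence]   # sentence as a sequence of one-hot vectors
bag = [sentence.count(w) for w in vocab]    # bag of words: frequency of each vocab word
print(bag)                                  # [1, 1, 2, 3]
```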
HANDLING DATA ISSUES
class imbalance → model may learn that predicting dominant class = fewer mistakes
- sol: data resampling
- oversampling: duplicates of minority class
- undersampling: subsample of majority class
missing values
- sol1: remove all datapoints w/ missing values (not recommended)
- sol2: replace missing values w/ mean or mode of the column
- implementation: sklearn.impute's SimpleImputer(strategy = "mean" or "most_frequent") (see the sketch below)
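A short sketch of the mean-imputation fix with scikit-learn (the column values are made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [2.0], [np.nan], [4.0]])   # toy column with one missing value

# strategy="mean" replaces NaN with the column mean; "most_frequent" would use the mode
imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(X).ravel())         # [1.  2.  2.333...  4.]
```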
outliers
- sol1: z-score method
- calculate z-scores: sklearn.preprocessing StandardScaler() → fit_transform(data)
- find datapoints where | z | > threshold, ie data[np.abs(z_scores) > threshold]
- sol2: IQR method
- arrange data in ascending order → calculate IQR (Q3 - Q1): np.percentile(data, 75) - np.percentile(data, 25)
- lower bound = Q1 - 1.5 * IQR
- upper bound = Q3 + 1.5 * IQR
- point > upper bound OR point < lower bound → outlier
- (see the sketch below for both methods)
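A sketch of both outlier checks with numpy/scikit-learn (data and threshold are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([1.0, 2.0, 2.5, 3.0, 2.2, 15.0])   # made-up data with one obvious outlier

# z-score method
z_scores = StandardScaler().fit_transform(data.reshape(-1, 1)).ravel()
threshold = 2                                      # manually chosen threshold
print(data[np.abs(z_scores) > threshold])          # [15.]

# IQR method
q1, q3 = np.percentile(data, 25), np.percentile(data, 75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data[(data < lower) | (data > upper)])       # [15.]
```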
features w/ diff scales → model may give more importance to features w/ larger range
- sol1: z-score standardisation
$$ Z = \frac{X - \mu}{\sigma} $$
- sol2: min-max scaling
$$ X' = \frac{X - min(X)}{max(X) - min(X)} $$
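Both rescaling fixes in scikit-learn (toy feature values, made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [5.0], [10.0]])      # made-up feature with a wide range

X_std = StandardScaler().fit_transform(X)  # z-score standardisation: (X - mean) / std
X_mm = MinMaxScaler().fit_transform(X)     # min-max scaling: (X - min) / (max - min)
print(X_std.ravel(), X_mm.ravel())
```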
K-NN
- SUPERVISED; for classification
- best for datasets w/ small no. of attrs
- input: continuous/discrete/binary, output: array of label
formula + steps
$$ euclidean(x_a, x_b) = \sqrt{\sum_{i=1}^n (x_{ai} - x_{bi})^2} $$
$$ manhattan(x_a, x_b) = \sum_{i=1}^n | x_{ai} - x_{bi}| $$
- calculate distances of data points from the given point using manhattan/euclidean distance
- n → no. of features
- x_a & x_b → 2 data points
- identify k nearest neighbours based on distance
- classify the data point based on majority vote among the k neighbors
choosing k
$$ k = \sqrt{m} $$
- k too large → decision boundary overly smooth (ie linear) → underfit
- k too small → decision boundary overly complex → overfit
limitations
requires feature scaling (see common data issues + fixes)
curse of dimensionality → becomes more difficult to distinguish points based on distance for higher dimensions.
- mean nearest-farthest distance ratio:
$$ \frac{1}{m}\sum_{i=1}^m\frac{\text{dist to nearest pt.}}{\text{dist to farthest pt.}} $$
- 0 ≤ ratio ≤ 1
- more dimensions → ratio closer to 1 (nearest and farthest points become nearly equidistant) → less accurate in classification
IMPLEMENTATION
- use sklearn.neighbors KNeighborsClassifier (see the sketch below)
- euclidean dist: np.linalg.norm(x_a - x_b), assuming x_a and x_b are arrays of the two points' features [x1, x2, ... xn]
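A minimal k-NN sketch with scikit-learn (the 2-feature dataset and labels are made up):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)   # k = 3, euclidean distance by default
knn.fit(X, y)
print(knn.predict([[2, 2], [8, 7]]))        # expected: [0 1]

# manual euclidean distance between two points
print(np.linalg.norm(X[0] - X[3]))
```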
DECISION TREES
- Supervised learning, used for classification
- Input: binary, discrete, or discretized continuous (e.g., range 0-10, 10-20…)
- Output: single decision value
building the tree
- get root node
- Compute entropy of data and all attributes
- Compute remainder (or info gain) for each attribute
- Choose attribute with smallest remainder (or highest info gain) as root
- split dataset by root attribute’s values
- If an attribute value has >1 outcome, repeat the process using the next best attribute (smallest remainder)
- To find the next best attribute:
- Compute remainder within each subset filtered by the previous best attribute
- If two attributes have the same remainder: choose the one that comes earlier in the breaking order
- repeat until…
- No attributes remain OR
- All samples in a subset belong to one class
ENTROPY (I)
- measures the uncertainty or unpredictability of the outcomes of a decision.
- Max uncertainty: I(y) = 1 (random)
- No uncertainty: I(y) = 0 (deterministic)
formula
$$ I = -\sum_{i=1}^{v} p_i \log_2 (p_i) $$
- p_i = probability of each of the v outcomes (e.g., "yes", "no")
entropy of entire dataset
$$ p = \frac{n(\text{samples w/ label } y)}{n(\text{total samples})} $$
entropy of subset (for one attribute)
$$ p = \frac{n(\text{samples w/ attr } x\text{, label } y)}{n(\text{samples w/ attr } x)} $$
REMAINDER
- uncertainty remaining after partitioning the dataset based on an attr, OR
- weighted average of the entropy of the subsets after splitting the dataset based on said attr
- smaller remainder → better attr
formula
$$ remainder(A)=\sum_{i=1}^{v} W_i \times I_i $$
- v = number of outcomes of attribute A
- I_i = entropy of the i-th subset
- W_i = weight of the i-th subset
weight calc
$$ W(A)=\frac{n(\text{samples in subset})}{n(\text{total samples})} $$
Information Gain (IG)
- Reduction in uncertainty after splitting on an attribute
- Higher IG → better attribute → should be chosen earlier
formula
$$ IG(A)=I(data)−remainder(A) $$
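A small numpy sketch of entropy, remainder, and information gain on a made-up attribute/label pair:

```python
import numpy as np

def entropy(labels):
    # I = -sum(p_i * log2(p_i)) over the outcome probabilities
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def remainder(attr_values, labels):
    # weighted average entropy of the subsets created by splitting on the attribute
    rem = 0.0
    for v in np.unique(attr_values):
        subset = labels[attr_values == v]
        rem += (len(subset) / len(labels)) * entropy(subset)
    return rem

windy = np.array([0, 0, 1, 1, 1, 0])   # made-up attribute values
play  = np.array([1, 1, 0, 0, 1, 1])   # made-up labels

print(entropy(play))                             # ≈ 0.918
print(remainder(windy, play))                    # ≈ 0.459
print(entropy(play) - remainder(windy, play))    # IG ≈ 0.459
```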
pruning (to prevent overfit)
- Max-depth: Limit tree depth, use majority vote at max depth
- Min-sample leaf: Set minimum leaf sample size, use majority vote if fewer
GRADIENT DESCENT
iteratively updates weights to minimise loss by following the gradient.
- GLOBAL MIN THEOREM: graph of loss function against weight must have only one minimum, which = global minimum.
$$ w_j \leftarrow w_j - \gamma\frac{\partial J}{\partial w_j} $$
- γ → learning rate
- stops when ∂J / ∂w = 0 (converges)
- basically means wj = wj - learning rate * grad of loss curve at wj
- loss function: MSE for linear regression, BCE loss for logistic regression
gradient calc
$$ \frac{\partial J_{MSE}(w)}{\partial w_j}=\frac{\partial J_{BCE}(w)}{\partial w_j}=\frac{1}{m}\sum_{i=1}^m (\hat{y}^i - y^i)\,{x_j}^i $$
determining optimal weight & learning rate manually
- optimal weight (w): find derivative of cost function → set d(cost)/d(w) = 0, solve for w
- optimal lr:
- let error be e(t) = w(t) - w* → substitute into grad update eqn
- rearrange eqn such that e appears on both sides (LHS e(t + 1), RHS e(t))
- find value of lr whereby error gets smaller, ie | e(t + 1) | < | e(t) |
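A minimal sketch of these update steps for 1-feature linear regression (made-up data, learning rate chosen arbitrarily):

```python
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0])   # made-up data where y ≈ 2x + 1
y = np.array([1.0, 3.0, 5.0, 7.0])
m = len(X)

w0, w1 = 0.0, 0.0                    # initial weights
lr = 0.1                             # learning rate (gamma)

for _ in range(1000):
    y_pred = w0 + w1 * X
    grad_w0 = (1 / m) * np.sum(y_pred - y)          # gradient wrt bias (x_0 = 1)
    grad_w1 = (1 / m) * np.sum((y_pred - y) * X)    # gradient wrt w1
    w0 -= lr * grad_w0
    w1 -= lr * grad_w1

print(w0, w1)                        # should approach 1 and 2
```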
LINEAR REG
- SUPERVISED; used for: predicting continuous values by fitting a straight line to data
model
1 feature 1 label
$$ h_w(x)=w_0+w_1x $$
multiple features
$$ h_w(x)=w_0 + w^Tx, \text{ where } w^Tx = w_1x_1 + w_2x_2 + \dots + w_nx_n $$
key terms
- ŷ → predicted value
- x = [x1,x2,...,xn] → input features
- w = [w1,w2,...,wn] → weights (parameters)
- w0 → y-intercept / bias
- i → index of a data point (runs from 1 to m = no. of samples; data.shape[0])
- j → index of a feature (runs from 1 to n = no. of features; data.shape[1] - 1)
cost function (MSE)
- to measure ‘goodness’ of our model
$$ J(MSE) = \frac{1}{2m} ∑_{i=1}^m(ŷ_i-y_i)^2 $$
- m → no. of training data samples
- note: loss function measures loss of SINGLE data point, ie (ŷ-y)^2 (squared error)
- cost function measures AVERAGE loss across entire training set
variants
- Mini-Batch: Uses a subset of data → faster + cheaper, good for large datasets
- Stochastic: Uses 1 data point per update → fast but may overfit
non-linear relationships
transform into linear by adding polynomial features (polynomial regression)
- polynom reg: approximates any continuous function on a closed and bounded interval (min max known) to any degree of accuracy (Weierstrass Approximation Theorem)
$$ h_w(x) = w_0 + w_1x + w_2x^2 $$
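A polynomial-regression sketch with scikit-learn (data values are made up and roughly follow y = x²):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([0.1, 1.1, 3.9, 9.2, 15.8])

# add polynomial features (x, x^2), then fit an ordinary linear regression
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
model = LinearRegression().fit(X_poly, y)
print(model.intercept_, model.coef_)   # ≈ w0, [w1, w2]
```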
LOGISTIC REG
- SUPERVISED; used for classification w/ continuous inputs
- usually the decision boundary is linear (or can be transformed to linear).
- used MAINLY for binary classification (yes/no, t/f, 0/1)
- predicts the probability of a given input belonging to a class
- threshold determines the final classification.
- eg if "fat" = 1, "not fat" = 0, and threshold = 0.5, a predicted ŷ of 0.8 is classified as "fat" since 0.8 ≥ 0.5
- Default threshold (sklearn): 0.5.
model (sigmoid function)
$$ h_w(x) = \frac{1}{1 + e^{-w^Tx}}, \newline {{where \; w^Tx = w_0 + w_1x_1+...+w_nx_n}}
$$
- sigma(x) = hw(x) = ŷ
loss function (bce cross entropy)
$$ BCE(y, ŷ) = \begin{cases} -\log (ŷ)& \text{if y = 1}\\ -\log(1-ŷ) & \text{if y = 0}\\ \end{cases}
\newline OR \newline = -y\log(ŷ) - (1-y)\log(1-ŷ) $$
- ŷ = hw(x)
cost function (bce cross entropy)
$$ J(BCE) = \frac{1}{m} \sum_{i=1}^m-y \log(ŷ)\newline-(1-y)\log(1-ŷ) $$
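A numpy sketch of the sigmoid model and BCE loss for one made-up data point:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = np.array([-1.0, 0.5, 0.5])   # made-up weights [w0, w1, w2]
x = np.array([1.0, 2.0, 3.0])    # [1 (bias input), x1, x2]
y = 1                            # true label

y_hat = sigmoid(w @ x)                                    # predicted probability of class 1
bce = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)    # BCE loss for this point
label = int(y_hat >= 0.5)                                 # default threshold 0.5
print(y_hat, bce, label)
```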
common issues & fixes
non-linear decision boundary
- add polynomial features equal to no. of features in dataset
- e.g. transform hw(x) = w0 + w1x1 + w2x2 into circular boundary
$$ h_w(x) = w_0 + 0x_1 + 0x_2 + x^2_1 + x^2_2 $$
- note: coefficients may not necessarily be 0 and 1; depends on the question
multi-class classification
- one vs all: fit 1 classifier per class
- fit against all classes → repeat for every class (ie train another model) → choose class w/ highest probability when it is fit against all classes
- one vs one: fit 1 classifier per class pair → higher probability = ‘win’
- choose class w/ highest no. of 'wins'
REGULARIZATION
- ways to prevent overfitting:
- reduce no. of features → transform from high to low degree polynomial
- regularization → reduce MAGNITUDE of weights (wj)
(L2/ridge) regularized cost function
$$ J(w) = [J_{BCE}(w) \;or\;J_{MSE}(w)] + \lambda \sum_{j=1}^n w^2_j \newline $$
- logistic → J(BCE); linear → J(MSE)
- λ → regularization parameter
regularized loss gradient
$$ \frac{\partial J_{MSE}(w)}{\partial w_j} \text{ or } \frac{\partial J_{BCE}(w)}{\partial w_j}=\frac{1}{m}\sum_{i=1}^m (\hat{y}^i - y^i){x_j}^i + 2\lambda w_j $$
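In scikit-learn, L2 regularization is exposed through hyperparameters rather than the λ written above (Ridge's alpha, LogisticRegression's C = 1/λ, up to constant factors); a small sketch on made-up data:

```python
import numpy as np
from sklearn.linear_model import Ridge, LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])   # made-up data
y_reg = np.array([1.0, 3.0, 5.0, 7.0])
y_clf = np.array([0, 0, 1, 1])

ridge = Ridge(alpha=1.0).fit(X, y_reg)              # L2-regularized linear regression
logreg = LogisticRegression(C=0.5).fit(X, y_clf)    # L2 by default; smaller C = stronger penalty
print(ridge.coef_, logreg.coef_)
```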
CLUSTERING
— unsupervised learning method: Find groupings based on datapoint similarities (NO LABELS)
k-means
centroid
- mean of all data points
- m = no. of datapoints IN THE CLUSTER
$$ centroid\ (\mu) = \frac{1}{m}{\sum_{i=1}^m x_i} $$
how to find centroid?
- steps
- initialise k centroids →
- for datapoint i in range(0, m), where m = no. of datapoints, group datapoint i to the nearest centroid →
- update centroids (mean of all datapoints)
- loop stops when?: centroids don't change anymore
- initialisation methods:
- k-means: randomly initialise k centroids
- k-means++: random initialization of the first centroid → select the remaining centroids with a probability proportional to the squared distance (x) of each data point from its nearest already chosen centroid.
- P(x) = squared dist of current point from its nearest chosen centroid / total squared dist of all points from their nearest chosen centroids
evaluate performance (J)
- calculate MEAN EUCLIDEAN DISTANCE J of EACH datapoint from its centroid → the smaller the value, the better
determining k
- elbow method (heuristic; some graphs of J against k may not have a clear elbow)
- set k to the ‘elbow’ → the point whereby increasing no. Of clusters (k) does not lead to a drastic improvement in clustering performance (j)
- application dependent method: aka changes based on diff cases
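A k-means sketch with scikit-learn (made-up 2D points; sklearn reports inertia, a sum of squared distances, which plays the same "smaller is better" role as J above):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)   # k-means++ init is the default
print(kmeans.cluster_centers_)   # centroids
print(kmeans.labels_)            # cluster assignment of each point
print(kmeans.inertia_)           # total squared distance to centroids (lower = better)
```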
hierarchical
- steps
- initially every point is a data cluster
- in a loop, calculate distances of each data point from all other datapoints using linkage
- find min(dist) among all data points → merge together
- repeat until all forms 1 cluster
- after merging, to compare the merged cluster [A, B] with another point C, take the nearest distance (under single linkage)
- e.g. if dist(A, C) > dist(B, C), distance is taken as dist(B, C)
- visualise: dendrogram → draw from bottom up until there is 1 link for all
- draw a horizontal line, no. Of intersections between line & dendrogram = no. Of clusters
- types of linkages (ie what distance to use):
- single → dist btw points closest to one another in cluster
- average → ave. of distances of each pair of points
- complete → dist btw points farthest from one another in cluster
- centroid → centroid of cluster
differences
- implementation: from sklearn.cluster import KMeans for k-means, but import AgglomerativeClustering for hierarchical
- K-means uses the loss function (i.e., goodness value) and minimises it, whereas hierarchical clustering combines pairs of clusters that are close together according to the linkage method.
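The corresponding hierarchical call, on the same made-up points as the k-means sketch above:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])

# linkage can be "single", "average", "complete", or "ward"
agg = AgglomerativeClustering(n_clusters=2, linkage="single").fit(X)
print(agg.labels_)
```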
MODEL EVAL
regression
- MSE
- MAE → mean of | ŷ - y | (ie (1/m) Σ | ŷ - y |)
classification metrics
basically confusion matrix
| | predicted +ve | predicted -ve |
|---|---|---|
| actual +ve | TP | FN |
| actual -ve | FP | TN |
FOR ALL METRICS BELOW: btw. 0 & 1, the higher the better!!
| Metric | Formula | Definition / Use Case |
|---|---|---|
| Accuracy (A) | (TP + TN) / Total samples | True predictions / total no. of samples |
| Precision (P) | TP / (TP + FP) | True positives / all predicted positives; use if you want to minimise false positives (e.g. spam detection) |
| Recall (R) | TP / (TP + FN) | True positives / all actual positives; use if you want to minimise false negatives (e.g. disease detection/prediction) |
| F1 Score | 2 / ( (1 / P) + (1 / R) ) | Harmonic mean of precision and recall |
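The same metrics via scikit-learn on made-up labels (note sklearn's confusion_matrix orders classes [0, 1], so TN sits top-left rather than TP as in the table above):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))   # rows = actual, columns = predicted
print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(f1_score(y_true, y_pred))
```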
hyperparameter tuning
- hyperparameter → external configuration parameters of a ML model not learned from training data
- learning rate, regularisation parameter
- to find best hyperparameter:
- grid / exhaustive search → test all hyperparameters
- random search → select hyperparameters at random to test
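A grid-search sketch with scikit-learn (toy data; RandomizedSearchCV offers the same interface for random search):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

# exhaustively test every value in the hyperparameter grid with 3-fold cross-validation
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3]}, cv=3)
grid.fit(X, y)
print(grid.best_params_)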
model performance measurement
- training:
- if finding best hyperparameter: train a new model w/ each hyperparameter val on the same dataset
- if finding best model: select diff models (k-nn, logistic reg, etc) to train on the same dataset
- validating:
- use subset of training sample → find the model/hyperparameter that produces the LEAST ERROR
- testing
NEURAL NETWORKS
Perceptrons
Perceptron model:
Input (w0 + w1x1 + w2x2 + ⋯ + wnxn) → Activation function (g) → Output (ŷ)
Formula:
$$ \hat{y} = g\left( \sum_{i=0}^n w_i x_i \right) $$
perceptron (or any neuron) output without activation = linear regression, forming a straight-line decision boundary:
$$ \hat{y} = g(w^Tx) = w^Tx = \newline w_0 + w_1x_1 + \dots + w_nx_n $$
- applies to the weighted sum of all inputs, not to individual input features.
- original uses a step function (outputs -1 or 1).
Perceptron Update Algorithm
Initialize weights randomly, predict ŷ
If ŷ ≠ y, update weights:
$$ w_j \leftarrow w_j + \gamma (y - \hat{y}) x_j $$
Repeat until all points are classified correctly or max steps reached.
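A small perceptron-update sketch on AND-gate data (0/1 step output rather than the original -1/1 convention; data and learning rate are made up):

```python
import numpy as np

X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])   # x0 = 1 is the bias input
y = np.array([0, 0, 0, 1])                                   # AND labels
w = np.zeros(3)
gamma = 0.1

def step(z):
    # step activation: 1 if z >= 0, else 0
    return 1 if z >= 0 else 0

for _ in range(20):                        # max steps
    for xi, yi in zip(X, y):
        y_hat = step(w @ xi)
        if y_hat != yi:                    # update only misclassified points
            w = w + gamma * (yi - y_hat) * xi

print(w)
```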
Logic Gates
true = 1, false = 0
- NOT: opposite (True if 0, False if 1)
- AND: only true if both inputs are 1.
- NAND: only true if not both inputs are 1.
- OR: true if at least one input is 1.
- NOR: only true if both inputs are 0.
- XNOR: true if inputs are the same. (1,1) / (0,0) → 1
- XOR: true if inputs differ (0,1) / (1,0) → 1
Neural Networks
Formula:
$$ \hat{y} = g(w^T x) $$
Single Layer: No hidden layers; gates like AND, OR, NAND.
- Multi-Layer: Input → Hidden layers → Output (e.g., XOR, XNOR).
Forward Propagation: Data flows from input → hidden layers → output.
- Activation function used at each layer (except for output layer when doing regression)
Weight Matrix: Rows = number of inputs, Columns = number of outputs for next layer.
- Example: 3x5 matrix → next layer has 5 neurons.
- Evaluating Mathematical Reps
- start with inner () first
- e.g. NOT(AND(x1, x2)) → if x1 = 0, x2 = 1, then AND(0, 1) = 0, so NOT(0) → 1
Activation Functions
Softmax → used for multiclass classification (for C no. Of classes)
- Softmax(dim=1) → returns list of probabilities of each element (each class) being the class that input row belongs to. index of element = class label
$$ \hat{y}^i = \frac{e^{z_i}}{\sum_{j=1}^C e^{z_j}} $$
Sigmoid: For binary classification (logistic regression), handles non-linear boundaries if input features transformed.
No activation: for regression.
$$ g(x) = x $$
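A numpy softmax sketch for made-up scores of 3 classes:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtracting the max is a standard numerical-stability trick
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])   # made-up raw scores for 3 classes
probs = softmax(z)
print(probs, probs.argmax())    # probabilities sum to 1; argmax = predicted class label
```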
PYTORCH
import torch.nn as nn
- containers
nn.Module → use as a base class when defining a neural network class
nn.Sequential → directly define the model by calling linear & activation funcs
- layers
nn.Linear(input_size, output_size) → single layer NN; no activation function - must be used for EVERY layer before calling activation classes in pytorch
- activation
nn.Sigmoid, nn.ReLU, nn.Softmax
BUILDING NN
using .Module
- must have __init__ & forward functions; the super().__init__() line MUST be included
- output size of layer t → input size of layer (t+1)
- each layer defined by 1 linear layer + 1 activation function
e.g. NN with one hidden layer for REGRESSION (i.e. last layer has no activation function)
class NNRegressor(torch.nn.Module):
    def __init__(self, input_size, output_size):
        super().__init__()
        # must add this ^ so that the NN can be initialised properly
        self.linear1 = torch.nn.Linear(input_size, output_size)
        self.linear2 = torch.nn.Linear(output_size, 1)
        # output ŷ = 1; output size of 1st layer = input size of 2nd layer
        self.activate = torch.nn.ReLU()
        # (or any other activation func)
    def forward(self, x):
        f1 = self.linear1(x)
        a1 = self.activate(f1)
        f2 = self.linear2(a1)
        return f2
model1 = NNRegressor(4, 8)
# 4 features (x_n), 8 neurons to map to initially
using .Sequential
- sequentially define each layer w/ input+ output sizes; call activation function in between if needed
- output size of layer t = input size of the next layer (t + 1)
model2 = torch.nn.Sequential(
    torch.nn.Linear(5, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 1)
)
TRAINING NN (GRAD DESCENT)
- build model (see above) & define loss function
- multi-classification → use nn.CrossEntropyLoss()
- binary classification → use nn.BCELoss()
- regression → use nn.MSELoss()
- initialise optimiser → retrieve all weights in model using model.parameters()
optimizer = torch.optim.SGD(model.parameters(), learning_rate)
- for every epoch (loop), initialise all gradients of weights to 0 (optimizer.zero_grad())
- compute ŷ (y_pred = model(x)) and loss (loss_function(y_pred, y))
- compute gradient (loss.backward()) → update weights (optimizer.step())
- (see the full training-loop sketch below)
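Putting the steps together, a minimal training-loop sketch (random made-up regression data, sizes chosen arbitrarily):

```python
import torch
import torch.nn as nn

X = torch.randn(100, 4)   # made-up data: 100 samples, 4 features
y = torch.randn(100, 1)

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
loss_fn = nn.MSELoss()                                     # regression loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # weights retrieved via model.parameters()

for epoch in range(100):
    optimizer.zero_grad()        # reset gradients of all weights to 0
    y_pred = model(X)            # forward pass: compute ŷ
    loss = loss_fn(y_pred, y)    # compute loss
    loss.backward()              # compute gradients
    optimizer.step()             # update weights
```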
CNN
- usually used for object detection in images (cancer cells, classifying image types, etc)
- fully connected layer → kernel “views” all pixels (’square’) of input image to determine output
- convolution → allows us to view local regions of an image
- NOTE: when building a CNN, if you want to transition to a FULLY CONNECTED LAYER (conv2d → maxpool → linear), must flatten inputs first
- input size = channels * height * width
torch.flatten(x, start_dim=1, end_dim=-1) (or nn.Flatten(start_dim=1, end_dim=-1))
components
- W/H = N: dimensions
- K: kernel dimension (for W/H respectively)
- sum of element wise multiplication with local region → feature map value
- P: padding
- S: stride → stride of ‘n’ is inclusive of current 1st elem
- e.g. if top left starts from A, stride = 2 means 2nd region should start from C
pooling
- downsamples feature maps → i.E. Reduces dimensions
- types: average, max, sum
- pytorch methods for pooling:
torch.nn.MaxPool2d, torch.nn.AvgPool1d, torch.nn.AdaptiveMaxPool2d
- output shape: (initial h or w) / pooling factor
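A minimal CNN sketch tying the pieces together (the 1-channel 28x28 input and 10 classes are assumptions; conv output H/W follows (N - K + 2P)/S + 1):

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1),  # 28x28 -> 28x28
    nn.ReLU(),
    nn.MaxPool2d(2),                # downsample: 28x28 -> 14x14
    nn.Flatten(start_dim=1),        # flatten before the fully connected layer
    nn.Linear(8 * 14 * 14, 10),     # input size = channels * height * width
)

x = torch.randn(16, 1, 28, 28)      # batch of 16 made-up images
print(cnn(x).shape)                 # torch.Size([16, 10])
```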
BASIC MATH
MATRIX MULTIPLICATION: A size(m x n) multiplied by B size(n x p) → AB size(m x p).
- col of A must = row of B; element = sum(row A * column B). For example, A of shape(m, n) * B of shape(n, p) = AB of shape(m, p)
TO FIND LOGBASE2 (N): log2(n) = ln(n) / ln(2)
WHAT IS ML
SUPERVISED LEARNING: models learn from labeled data → learn a hypothesis function that approximates the target function
CLASSIFICATION: output = discrete label (str, int, bool, etc)
REGRESSION: output = continuous num value
UNSUPERVISED LEARNING: models find patterns in unlabeled data (clustering); only features [x1, x2, ... xn]
OVERFITTING: Model performs well on training data but poorly on test data.
- due to HIGH VARIANCE: Model is too complex → Overfitting
UNDERFITTING: Model is too simple to capture patterns in data.
- due to HIGH BIAS: Model is too simple → Underfitting.
PREPROCESS
REPRESENTING DATA
tabular: one-hot encoding: e.g. "science", "arts", "hybrid" → [1,0,0], [0,1,0], [0,0,1]; ordinal encoding: e.g. "science", "arts", "hybrid" → 0, 2, 1
image / video: pixel grid representation (CNN)
word / sentence: one-hot encoding; n(words) = n(elems); bag of words
CLASS IMBALANCE → model may learn that predicting dominant class = fewer mistakes
OVERSAMPLING: duplicates of minority class
UNDERSAMPLING: subsample of majority class
MISSING VALUES
Remove missing data OR replace w/ the mean or median of the column (numerical data) OR replace w/ the most frequent value / mode of the label (categorical variables).
sklearn.impute's SimpleImputer(strategy = "mean" or "most_frequent" or "median")
FINDING OUTLIERS
Z-SCORE METHOD: calculate the z-score, then flag data points where the absolute z-score exceeds a threshold (manually set)
IQR METHOD Arrange data in ascending order → calculate quartiles: Q1 (25th percentile) & Q3 (75th percentile) → Compute IQR: IQR = Q3 – Q1 → Determine Bounds: Lower Bound: Q1 – 1.5 * IQR; Upper Bound: Q3 + 1.5 * IQR. Data points outside the bounds are outliers.
FEATURES W/ DIFF SCALES → model may give more importance to features w/ larger range
z-score standardisation: Z = (X - μ) / σ
min-max scaling: X' = (X - min(X)) / (max(X) - min(X))
FEATURE ENG
create new features based on existing ones; select/identify relevant feats.
non-linear decision boundary: add polynomial features equal to no. of features in dataset. E.g. transform hw(x) = w0 + w1x1 + w2x2 into a circular boundary:
hw(x) = w0 + 0·w1x1 + 0·w2x2 + w3x1^2 + w4x2^2
MODEL SELECT + TRAINING
K-NN
SUPERVISED; for classification; best for datasets w/ small no. of attrs; input: continuous/discrete/binary, output: array of labels
STEPS:
- calculate distances of data points from given using manhattan/euclidean distance
- identify k nearest neighbours based on distance
- classify the data point based on majority vote among k neighbors
CHOOSING K: K = SQRT(M):
k too large → decision boundary overly smooth (ie linear) → underfit
k too small → decision boundary overly complex → overfit
LIMITATIONS
requires feature scaling (see common data issues + fixes)
curse of dimensionality: higher dimensions = points are more spread out = becomes more difficult to distinguish based on distance
- mean nearest-farthest distance ratio:
DECISION TREES
takes attribute values as input (discrete) and returns a single output value (Bool)
TRAINING: for each value of a chosen attribute, use DTL again on corresponding subset of examples. Terminate when all samples in a subset have same category / Used all the features
ENTROPY (I):
uncertainty or unpredictability of the outcomes of a decision.
Max uncertainty: I(y) = 1 (random)
No uncertainty: I(y) = 0 (deterministic)
REMAINDER:
uncertainty remaining after partitioning the dataset based on an attr / weighted average of the entropy of the subsets after splitting the dataset based on attr.
WEIGHT: n(samples w/ attr value) / n(total no. Of samples)
INFORMATION GAIN (IG): Entropy(data) - remainder(A)
Reduction in uncertainty after splitting on an attribute. Higher IG → better attr → chosen earlier
PRUNING (to prevent overfit)
Max-depth: Limit tree depth, use majority vote at max depth
Min-sample leaf: Set min leaf sample size, use majority vote if < min
GRAD DESCENT
GLOBAL MIN THEOREM: graph of loss function against weight must have only 1 minimum which = global min.
STEPS: Initialize parameters --> Compute loss J(w0, w1, …) and their gradients ∂J/∂wj --> Update parameters with the rule wj ← wj - γ ∂J/∂wj
DETERMINE OPT WEIGHT: find derivative of cost function → set d(cost)/d(w) = 0, solve for w
DETERMINE OPT LR: let error be e(t) = w(t) - w* → substitute into grad update eqn --> rearrange eqn such that you can see e(t) on both sides (LHS e(t + 1), RHS e(t)) --> find value of lr whereby error gets smaller, ie | e(t + 1) | < | e(t) |
LINEAR REG
hw(x) = w0 + w1x1 + w2x2 + ⋯ + wdxd where w0, …, wd are parameters. w0 = bias / intercept;
i → index of a data point (runs from 1 to m = no. of samples; data.shape[0]);
j → index of a feature (runs from 1 to n = no. of features; data.shape[1] - 1)
COST FUNCTION (MSE): J = (1 / 2m) * Σ (ŷ_i - y_i)^2
VARIANTS
Mini-Batch: Uses a subset of data → faster + cheaper, good for large n
Stochastic: Uses 1 data point per update → fast but may overfit, unlikely to hit global min
POLYNOM REG: h_w(x) = w0 + w1x + w2x^2 ...
approximates any continuous function on a closed & bounded interval (min-max known) to any degree of accuracy
LOGISTIC REG
SUPERVISED; predicts the probability of a given input belonging to a class; the threshold determines the final classification (default = 0.5)
- used for classification w/ continuous inputs.
- decision boundary is linear (or can be transformed to linear). MAINLY for binary classification (yes/no, t/f, 0/1)
MODEL (SIGMOID FUNCTION): h_w(x) = 1 / (1 + e^(-w^T x))
LOSS FUNCTION (BCE): -y log(ŷ) - (1 - y) log(1 - ŷ); COST FUNCTION: mean of the BCE loss over all m samples
MULTI-CLASS CLASSIFICATION
1 vs all: fit 1 classifier per class. Fit against all classes → repeat for every class (train another model) → choose class w highest probability
1 vs 1: fit 1 classifier per class pair → higher probability = ‘win’ → choose class with highest no. Of ‘wins’
KMEANS CLUSTERING
CENTROID: mean of all data points in cluster per feature. Format (x_1, x_2, x_3, ..., x_n)
TRAINING LOOP: Initialize k centroids → Assign points to the nearest centroid (Euclidean distance) → Update centroids (mean of assigned points) → Repeat until centroids stabilise / max iter
INIT METHODS: K-Means - randomly initialize; K-Means++ - pick one centroid randomly, then select the others with probability proportional to the squared (euclidean) dist of each pt. from its nearest already chosen centroid
why is random bad?: assigning samples to random clusters causes each cluster's centroid to be close to mean of dataset
EVAL: calculate the mean euclidean distance J of each datapoint from its centroid → the smaller, the better
DETERMINE K: elbows method - plot graph of loss against k → set k to the ‘elbow’. The point whereby increasing no. Of clusters (k) does not lead to a drastic improvement in clustering performance (j)