Machine Learning Fundamentals: Algorithms and Techniques
ML → SUBSET OF AI; allows computers to learn from data without being explicitly programmed.
BASIC MATH INFO
- matrix multiplication: if A is of size m x n, B is of size n x p, then AB size = m x p
- col of A must = row of B
- each row A * each column B
- ie A of shape(m, n) * B of shape(n, p) = AB of shape(m, p)
- to find logbase2 (n):
$$ \log_2(n) = \frac{\ln(n)}{\ln(2)} $$
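A quick check of the change-of-base formula in Python (math.log2 is included only for comparison):

```python
import math

n = 8
print(math.log(n) / math.log(2))   # change-of-base: ln(n) / ln(2) = 3.0
print(math.log2(n))                # direct helper gives the same value
```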
DEFINITIONS
- Supervised Learning: Models learn from labeled data → learn a hypothesis function that approximates the target function
- classification: goal = categorise input into classes
- predicting if patient has a disease given symptoms (yes vs no)
- identify type of fruit given images of fruit (apple, pear..)
- regression: predict continuous numerical values based on input data
- predict house price given house size
- Unsupervised Learning: Models find patterns in unlabeled data (clustering)
- model groups diff fruits which have same colour together
- Overfitting: Model performs well on training data but poorly on test data.
- High variance: Model is too complex → Overfitting
- Underfitting: Model is too simple to capture patterns in data.
- High bias: Model is too simple → Underfitting.
PIPELINE
- data preprocessing → missing data, encode categorical variables (one-hot, ordinal etc)
- feature engineering → select relevant + remove irrelevant features, design & create new features
- model selection
- model training → model learns to map input to output, parameters adjusted to minimise pred errors
- model eval
REPRESENTING DATA
- tabular
- one-hot encoding: e.g. "science", "arts", "hybrid" → [1,0,0], [0,1,0], [0,0,1]
- no. of elements in list = no. of labels
- ordinal encoding: e.g. "science", "arts", "hybrid" → 0, 2, 1
- image / video
- pixel grid representation
- flat b&w → 1 channel with 0s and 1s
- grayscale image → 1 channel on a scale of 256 possible levels
- varying shades of grey; 0 to 255 inclusive
- RGB → combining red, blue, green channels with 256 possible levels
- video: sequence of images → repeated pixel grid rep for each image
- word / sentence
words: one-hot encoding → no. of words = no. of elems
- "ume" → [1, 0, 0]
- "is" → [0, 1, 0]
- "cute" → [0, 0, 1]
sentence: sequence of words →
- list of one-hot encoded words, eg. "ume is cute" → [ [1,0,0], [0,1,0], [0,0,1] ]
- OR bag of words: initialise counter list where each element corresponds to a word → count frequency of word in sentence
- e.g. "ume is so so cute cute cute" →

| ume | is | so | cute |
|---|---|---|---|
| 1 | 1 | 2 | 3 |
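A minimal sketch of the word encodings above in plain Python (the vocabulary and sentence are the toy examples from these notes):

```python
# one-hot encoding and bag of words for a toy vocabulary
vocab = ["ume", "is", "so", "cute"]

def one_hot(word):
    # one element per vocabulary word; 1 at the word's index
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

sentence = "ume is so so cute cute cute".split()

sequence = [one_hot(w) for w in sentence]   # sentence as a sequence of one-hot vectors
bag = [sentence.count(w) for w in vocab]    # bag of words: frequency of each vocab word
print(bag)                                  # [1, 1, 2, 3]
```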
HANDLING DATA ISSUES
class imbalance → model may learn that predicting dominant class = fewer mistakes
- sol: data resampling
- oversampling: duplicates of minority class
- undersampling: subsample of majority class
missing values
- sol1: remove all datapoints w/ missing values (not recommended)
- sol2: replace missing values w/ mean or mode of the column
- implementation: sklearn.impute's SimpleImputer(strategy = "mean" or "most_frequent") (see the sketch below)
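A short sketch of the mean-imputation fix with scikit-learn (the column values are made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [2.0], [np.nan], [4.0]])   # toy column with one missing value

# strategy="mean" replaces NaN with the column mean; "most_frequent" would use the mode
imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(X).ravel())         # [1.  2.  2.333...  4.]
```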
outliers
- sol1: z-score method
- calculate z-scores: sklearn.preprocessing StandardScaler() → fit_transform(data)
- find datapoints where | z | > threshold, ie data[np.abs(z_scores) > threshold]
- sol2: IQR method
- arrange data in ascending order → calculate IQR (Q3 - Q1): np.percentile(data, 75) - np.percentile(data, 25)
- lower bound = Q1 - 1.5 * IQR
- upper bound = Q3 + 1.5 * IQR
- point > upper bound OR point < lower bound → outlier
- (see the sketch below for both methods)
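A sketch of both outlier checks with numpy/scikit-learn (data and threshold are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([1.0, 2.0, 2.5, 3.0, 2.2, 15.0])   # made-up data with one obvious outlier

# z-score method
z_scores = StandardScaler().fit_transform(data.reshape(-1, 1)).ravel()
threshold = 2                                      # manually chosen threshold
print(data[np.abs(z_scores) > threshold])          # [15.]

# IQR method
q1, q3 = np.percentile(data, 25), np.percentile(data, 75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data[(data < lower) | (data > upper)])       # [15.]
```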
features w/ diff scales → model may give more importance to features w/ larger range
- sol1: z-score standardisation
$$ Z = \frac{X - \mu}{\sigma} $$
- sol2: min-max scaling
$$ X' = \frac{X - min(X)}{max(X) - min(X)} $$
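Both rescaling fixes in scikit-learn (toy feature values, made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [5.0], [10.0]])      # made-up feature with a wide range

X_std = StandardScaler().fit_transform(X)  # z-score standardisation: (X - mean) / std
X_mm = MinMaxScaler().fit_transform(X)     # min-max scaling: (X - min) / (max - min)
print(X_std.ravel(), X_mm.ravel())
```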
K-NN
- SUPERVISED; for classification
- best for datasets w/ small no. of attrs
- input: continuous/discrete/binary, output: array of label
formula + steps
$$ euclidean(x_a, x_b) = \sqrt{\sum_{i=1}^n (x_{ai} - x_{bi})^2} $$
$$ manhattan(x_a, x_b) = \sum_{i=1}^n | x_{ai} - x_{bi}| $$
- calculate distances of data points from the given point using manhattan/euclidean distance
- n → no. of features
- x_a & x_b → 2 data points
- identify k nearest neighbours based on distance
- classify the data point based on majority vote among the k neighbors
choosing k
$$ k = \sqrt{m} $$
- k too large → decision boundary overly smooth (ie linear) → underfit
- k too small → decision boundary overly complex → overfit
limitations
requires feature scaling (see common data issues + fixes)
curse of dimensionality → becomes more difficult to distinguish points based on distance for higher dimensions.
- mean nearest-farthest distance ratio:
$$ \frac{1}{m}\sum_{i=1}^m\frac{\text{dist to nearest pt.}}{\text{dist to farthest pt.}} $$
- 0 ≤ ratio ≤ 1
- more dimensions → ratio closer to 1 (nearest and farthest points become nearly equidistant) → less accurate in classification
IMPLEMENTATION
- use sklearn.neighbors KNeighborsClassifier (see the sketch below)
- euclidean dist: np.linalg.norm(x_a - x_b), assuming x_a and x_b are arrays of the two points' features [x1, x2, ... xn]
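A minimal k-NN sketch with scikit-learn (the 2-feature dataset and labels are made up):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)   # k = 3, euclidean distance by default
knn.fit(X, y)
print(knn.predict([[2, 2], [8, 7]]))        # expected: [0 1]

# manual euclidean distance between two points
print(np.linalg.norm(X[0] - X[3]))
```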
DECISION TREES
- Supervised learning, used for classification
- Input: binary, discrete, or discretized continuous (e.g., range 0-10, 10-20…)
- Output: single decision value
building the tree
- get root node
- Compute entropy of data and all attributes
- Compute remainder (or info gain) for each attribute
- Choose attribute with smallest remainder (or highest info gain) as root
- split dataset by root attribute’s values
- If an attribute value has >1 outcome, repeat the process using the next best attribute (smallest remainder)
- To find the next best attribute:
- Compute remainder within each subset filtered by the previous best attribute
- If two attributes have the same remainder: choose the one that comes earlier in the breaking order
- repeat until…
- No attributes remain OR
- All samples in a subset belong to one class
ENTROPY (I)
- measures the uncertainty or unpredictability of the outcomes of a decision.
- Max uncertainty: I(y) = 1 (random)
- No uncertainty: I(y) = 0 (deterministic)
formula
$$ I = -\sum_{i=1}^{v} p_i \log_2 (p_i) $$
- p_i = probability of each of the v outcomes (e.g., "yes", "no")
entropy of entire dataset
$$ p = \frac{n(\text{samples w/ label } y)}{n(\text{total samples})} $$
entropy of subset (for one attribute)
$$ p = \frac{n(\text{samples w/ attr } x\text{, label } y)}{n(\text{samples w/ attr } x)} $$
REMAINDER
- uncertainty remaining after partitioning the dataset based on an attr, OR
- weighted average of the entropy of the subsets after splitting the dataset based on said attr
- smaller remainder → better attr
formula
$$ remainder(A)=\sum_{i=1}^{v} W_i \times I_i $$
- v = number of outcomes of attribute A
- I_i = entropy of the i-th subset
- W_i = weight of the i-th subset
weight calc
$$ W(A)=\frac{n(\text{samples in subset})}{n(\text{total samples})} $$
Information Gain (IG)
- Reduction in uncertainty after splitting on an attribute
- Higher IG → better attribute → should be chosen earlier
formula
$$ IG(A)=I(data)−remainder(A) $$
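A small numpy sketch of entropy, remainder, and information gain on a made-up attribute/label pair:

```python
import numpy as np

def entropy(labels):
    # I = -sum(p_i * log2(p_i)) over the outcome probabilities
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def remainder(attr_values, labels):
    # weighted average entropy of the subsets created by splitting on the attribute
    rem = 0.0
    for v in np.unique(attr_values):
        subset = labels[attr_values == v]
        rem += (len(subset) / len(labels)) * entropy(subset)
    return rem

windy = np.array([0, 0, 1, 1, 1, 0])   # made-up attribute values
play  = np.array([1, 1, 0, 0, 1, 1])   # made-up labels

print(entropy(play))                             # ≈ 0.918
print(remainder(windy, play))                    # ≈ 0.459
print(entropy(play) - remainder(windy, play))    # IG ≈ 0.459
```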
pruning (to prevent overfit)
- Max-depth: Limit tree depth, use majority vote at max depth
- Min-sample leaf: Set minimum leaf sample size, use majority vote if fewer
GRADIENT DESCENT
iteratively updates weights to minimise loss by following the gradient.
- GLOBAL MIN THEOREM: graph of loss function against weight must have only one minimum, which = global minimum.
$$ w_j \leftarrow w_j - \gamma\frac{\partial J}{\partial w_j} $$
- γ → learning rate
- stops when ∂J / ∂w = 0 (converges)
- basically means wj = wj - learning rate * grad of loss curve at wj
- loss function: MSE for linear regression, BCE loss for logistic regression
gradient calc
$$ \frac{\partial J_{MSE}(w)}{\partial w_j}=\frac{\partial J_{BCE}(w)}{\partial w_j}=\frac{1}{m}\sum_{i=1}^m (\hat{y}^i - y^i)\,{x_j}^i $$
determining optimal weight & learning rate manually
- optimal weight (w): find derivative of cost function → set d(cost)/d(w) = 0, solve for w
- optimal lr:
- let error be e(t) = w(t) - w* → substitute into grad update eqn
- rearrange eqn such that e appears on both sides (LHS e(t + 1), RHS e(t))
- find value of lr whereby error gets smaller, ie | e(t + 1) | < | e(t) |
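A minimal sketch of these update steps for 1-feature linear regression (made-up data, learning rate chosen arbitrarily):

```python
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0])   # made-up data where y ≈ 2x + 1
y = np.array([1.0, 3.0, 5.0, 7.0])
m = len(X)

w0, w1 = 0.0, 0.0                    # initial weights
lr = 0.1                             # learning rate (gamma)

for _ in range(1000):
    y_pred = w0 + w1 * X
    grad_w0 = (1 / m) * np.sum(y_pred - y)          # gradient wrt bias (x_0 = 1)
    grad_w1 = (1 / m) * np.sum((y_pred - y) * X)    # gradient wrt w1
    w0 -= lr * grad_w0
    w1 -= lr * grad_w1

print(w0, w1)                        # should approach 1 and 2
```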
LINEAR REG
- SUPERVISED; used for: predicting continuous values by fitting a straight line to data
model
1 feature 1 label
$$ h_w(x)=w_0+w_1x $$
multiple features
$$ h_w(x)=w_0 + w^Tx, \text{ where } w^Tx = w_1x_1 + w_2x_2 + \dots + w_nx_n $$
key terms
- ŷ → predicted value
- x = [x1,x2,...,xn] → input features
- w = [w1,w2,...,wn] → weights (parameters)
- w0 → y-intercept / bias
- i → index of a data point (runs from 1 to m = no. of samples; data.shape[0])
- j → index of a feature (runs from 1 to n = no. of features; data.shape[1] - 1)
cost function (MSE)
- to measure ‘goodness’ of our model
$$ J(MSE) = \frac{1}{2m} ∑_{i=1}^m(ŷ_i-y_i)^2 $$
- m → no. of training data samples
- note: loss function measures loss of SINGLE data point, ie (ŷ-y)^2 (squared error)
- cost function measures AVERAGE loss across entire training set
variants
- Mini-Batch: Uses a subset of data → faster + cheaper, good for large datasets
- Stochastic: Uses 1 data point per update → fast but may overfit
non-linear relationships
transform into linear by adding polynomial features (polynomial regression)
- polynom reg: approximates any continuous function on a closed and bounded interval (min max known) to any degree of accuracy (Weierstrass Approximation Theorem)
$$ h_w(x) = w_0 + w_1x + w_2x^2 $$
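A polynomial-regression sketch with scikit-learn (data values are made up and roughly follow y = x²):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([0.1, 1.1, 3.9, 9.2, 15.8])

# add polynomial features (x, x^2), then fit an ordinary linear regression
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
model = LinearRegression().fit(X_poly, y)
print(model.intercept_, model.coef_)   # ≈ w0, [w1, w2]
```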
LOGISTIC REG
- SUPERVISED; used for classification w/ continuous inputs
- usually the decision boundary is linear (or can be transformed to linear).
- used MAINLY for binary classification (yes/no, t/f, 0/1)
- predicts the probability of a given input belonging to a class
- threshold determines the final classification.
- eg if "fat" = 1, "not fat" = 0, and threshold = 0.5, a predicted ŷ of 0.8 is classified as "fat" since 0.8 ≥ 0.5
- Default threshold (sklearn): 0.5.
model (sigmoid function)
$$ h_w(x) = \frac{1}{1 + e^{-w^Tx}}, \newline {{where \; w^Tx = w_0 + w_1x_1+...+w_nx_n}}
$$
- sigma(x) = hw(x) = ŷ
loss function (bce cross entropy)
$$ BCE(y, ŷ) = \begin{cases} -\log (ŷ)& \text{if y = 1}\\ -\log(1-ŷ) & \text{if y = 0}\\ \end{cases}
\newline OR \newline = -y\log(ŷ) - (1-y)\log(1-ŷ) $$
- ŷ = hw(x)
cost function (bce cross entropy)
$$ J(BCE) = \frac{1}{m} \sum_{i=1}^m-y \log(ŷ)\newline-(1-y)\log(1-ŷ) $$
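A numpy sketch of the sigmoid model and BCE loss for one made-up data point:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = np.array([-1.0, 0.5, 0.5])   # made-up weights [w0, w1, w2]
x = np.array([1.0, 2.0, 3.0])    # [1 (bias input), x1, x2]
y = 1                            # true label

y_hat = sigmoid(w @ x)                                    # predicted probability of class 1
bce = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)    # BCE loss for this point
label = int(y_hat >= 0.5)                                 # default threshold 0.5
print(y_hat, bce, label)
```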
common issues & fixes
non-linear decision boundary
- add polynomial features equal to no. of features in dataset
- e.g. transform hw(x) = w0 + w1x1 + w2x2 into circular boundary
$$ h_w(x) = w_0 + 0x_1 + 0x_2 + x^2_1 + x^2_2 $$
- note: coefficients may not necessarily be 0 and 1; depends on the question
multi-class classification
- one vs all: fit 1 classifier per class
- fit against all classes → repeat for every class (ie train another model) → choose class w/ highest probability when it is fit against all classes
- one vs one: fit 1 classifier per class pair → higher probability = ‘win’
- choose class w/ highest no. of 'wins'
REGULARIZATION
- ways to prevent overfitting:
- reduce no. of features → transform from high to low degree polynomial
- regularization → reduce MAGNITUDE of weights (wj)
(L2/ridge) regularized cost function
$$ J(w) = [J_{BCE}(w) \;or\;J_{MSE}(w)] + \lambda \sum_{j=1}^n w^2_j \newline $$
- logistic → J(BCE); linear → J(MSE)
- λ → regularization parameter
regularized loss gradient
$$ \frac{\partial J_{MSE}(w)}{\partial w_j} \text{ or } \frac{\partial J_{BCE}(w)}{\partial w_j}=\frac{1}{m}\sum_{i=1}^m (\hat{y}^i - y^i){x_j}^i + 2\lambda w_j $$
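In scikit-learn, L2 regularization is exposed through hyperparameters rather than the λ written above (Ridge's alpha, LogisticRegression's C = 1/λ, up to constant factors); a small sketch on made-up data:

```python
import numpy as np
from sklearn.linear_model import Ridge, LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])   # made-up data
y_reg = np.array([1.0, 3.0, 5.0, 7.0])
y_clf = np.array([0, 0, 1, 1])

ridge = Ridge(alpha=1.0).fit(X, y_reg)              # L2-regularized linear regression
logreg = LogisticRegression(C=0.5).fit(X, y_clf)    # L2 by default; smaller C = stronger penalty
print(ridge.coef_, logreg.coef_)
```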
CLUSTERING
— unsupervised learning method: Find groupings based on datapoint similarities (NO LABELS)
k-means
centroid
- mean of all data points
- m = no. of datapoints IN THE CLUSTER
$$ centroid\ (\mu) = \frac{1}{m}{\sum_{i=1}^m x_i} $$
how to find centroid?
- steps
- initialise k centroids →
- for datapoint i in range(0, m), where m = no. of datapoints, group datapoint i to the nearest centroid →
- update centroids (mean of all datapoints)
- loop stops when?: centroids don't change anymore
- initialisation methods:
- k-means: randomly initialise k centroids
- k-means++: random initialization of the first centroid → select the remaining centroids with a probability proportional to the squared distance (x) of each data point from its nearest already chosen centroid.
- P(x) = squared dist of current point from its nearest chosen centroid / total squared dist of all points from their nearest chosen centroids
evaluate performance (J)
- calculate MEAN EUCLIDEAN DISTANCE J of EACH datapoint from its centroid → the smaller the value, the better
determining k
- elbow method (heuristic; some graphs of J against k may not have a clear elbow)
- set k to the ‘elbow’ → the point whereby increasing no. Of clusters (k) does not lead to a drastic improvement in clustering performance (j)
- application dependent method: aka changes based on diff cases
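A k-means sketch with scikit-learn (made-up 2D points; sklearn reports inertia, a sum of squared distances, which plays the same "smaller is better" role as J above):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)   # k-means++ init is the default
print(kmeans.cluster_centers_)   # centroids
print(kmeans.labels_)            # cluster assignment of each point
print(kmeans.inertia_)           # total squared distance to centroids (lower = better)
```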
hierarchical
- steps
- initially every point is a data cluster
- in a loop, calculate distances of each data point from all other datapoints using linkage
- find min(dist) among all data points → merge together
- repeat until all forms 1 cluster
- after merging, to compare the merged cluster [A, B] with another point C, take the nearest distance (under single linkage)
- e.g. if dist(A, C) > dist(B, C), distance is taken as dist(B, C)
- visualise: dendrogram → draw from bottom up until there is 1 link for all
- draw a horizontal line, no. Of intersections between line & dendrogram = no. Of clusters
- types of linkages (ie what distance to use):
- single → dist btw points closest to one another in cluster
- average → ave. of distances of each pair of points
- complete → dist btw points farthest from one another in cluster
- centroid → centroid of cluster
differences
- implementation: from sklearn.cluster import KMeans for k-means, but import AgglomerativeClustering for hierarchical
- K-means uses the loss function (i.e., goodness value) and minimises it, whereas hierarchical clustering combines pairs of clusters that are close together according to the linkage method.
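The corresponding hierarchical call, on the same made-up points as the k-means sketch above:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])

# linkage can be "single", "average", "complete", or "ward"
agg = AgglomerativeClustering(n_clusters=2, linkage="single").fit(X)
print(agg.labels_)
```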
MODEL EVAL
regression
- MSE
- MAE → mean of | ŷ - y | (ie (1/m) Σ | ŷ - y |)
classification metrics
basically confusion matrix
| | predicted +ve | predicted -ve |
|---|---|---|
| actual +ve | TP | FN |
| actual -ve | FP | TN |
FOR ALL METRICS BELOW: btw. 0 & 1, the higher the better!!
| Metric | Formula | Definition / Use Case |
|---|---|---|
| Accuracy (A) | (TP + TN) / Total samples | True predictions / total no. of samples |
| Precision (P) | TP / (TP + FP) | True positives / all predicted positives; use if you want to minimise false positives (e.g. spam detection) |
| Recall (R) | TP / (TP + FN) | True positives / all actual positives; use if you want to minimise false negatives (e.g. disease detection/prediction) |
| F1 Score | 2 / ( (1 / P) + (1 / R) ) | Harmonic mean of precision and recall |
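The same metrics via scikit-learn on made-up labels (note sklearn's confusion_matrix orders classes [0, 1], so TN sits top-left rather than TP as in the table above):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))   # rows = actual, columns = predicted
print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(f1_score(y_true, y_pred))
```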
hyperparameter tuning
- hyperparameter → external configuration parameters of a ML model not learned from training data
- learning rate, regularisation parameter
- to find best hyperparameter:
- grid / exhaustive search → test all hyperparameters
- random search → select hyperparameters at random to test
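A grid-search sketch with scikit-learn (toy data; RandomizedSearchCV offers the same interface for random search):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

# exhaustively test every value in the hyperparameter grid with 3-fold cross-validation
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3]}, cv=3)
grid.fit(X, y)
print(grid.best_params_)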
model performance measurement
- training:
- if finding best hyperparameter: train a new model w/ each hyperparameter val on the same dataset
- if finding best model: select diff models (k-nn, logistic reg, etc) to train on the same dataset
- validating:
- use subset of training sample → find the model/hyperparameter that produces the LEAST ERROR
- testing
NEURAL NETWORKS
Perceptrons
Perceptron model:
Input (w0 + w1x1 + w2x2 + ⋯ + wnxn) → Activation function (g) → Output (ŷ)
Formula:
$$ \hat{y} = g\left( \sum_{i=0}^n w_i x_i \right) $$
perceptron (or any neuron) output without activation = linear regression, forming a straight-line decision boundary:
$$ \hat{y} = g(w^Tx) = w^Tx = \newline w_0 + w_1x_1 + \dots + w_nx_n $$
- applies to the weighted sum of all inputs, not to individual input features.
- original uses a step function (outputs -1 or 1).
Perceptron Update Algorithm
Initialize weights randomly, predict ŷ
If ŷ ≠ y, update weights:
$$ w_j \leftarrow w_j + \gamma (y - \hat{y}) x_j $$
Repeat until all points are classified correctly or max steps reached.
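A small perceptron-update sketch on AND-gate data (0/1 step output rather than the original -1/1 convention; data and learning rate are made up):

```python
import numpy as np

X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])   # x0 = 1 is the bias input
y = np.array([0, 0, 0, 1])                                   # AND labels
w = np.zeros(3)
gamma = 0.1

def step(z):
    # step activation: 1 if z >= 0, else 0
    return 1 if z >= 0 else 0

for _ in range(20):                        # max steps
    for xi, yi in zip(X, y):
        y_hat = step(w @ xi)
        if y_hat != yi:                    # update only misclassified points
            w = w + gamma * (yi - y_hat) * xi

print(w)
```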
Logic Gates
true = 1, false = 0
- NOT: opposite (True if 0, False if 1)
- AND: only true if both inputs are 1.
- NAND: only true if not both inputs are 1.
- OR: true if at least one input is 1.
- NOR: only true if both inputs are 0.
- XNOR: true if inputs are the same. (1,1) / (0,0) → 1
- XOR: true if inputs differ (0,1) / (1,0) → 1
Neural Networks
Formula:
$$ \hat{y} = g(w^T x) $$
Single Layer: No hidden layers; gates like AND, OR, NAND.
- Multi-Layer: Input → Hidden layers → Output (e.g., XOR, XNOR).
Forward Propagation: Data flows from input → hidden layers → output.
- Activation function used at each layer (except for output layer when doing regression)
Weight Matrix: Rows = number of inputs, Columns = number of outputs for next layer.
- Example: 3x5 matrix → next layer has 5 neurons.
- Evaluating Mathematical Reps
- start with inner () first
- e.g. NOT(AND(x1, x2)) → if x1 = 0, x2 = 1, then AND(0, 1) = 0, so NOT(0) → 1
Activation Functions
Softmax → used for multiclass classification (for C no. Of classes)
- Softmax(dim=1) → returns list of probabilities of each element (each class) being the class that input row belongs to. index of element = class label
$$ \hat{y}^i = \frac{e^{z_i}}{\sum_{j=1}^C e^{z_j}} $$
Sigmoid: For binary classification (logistic regression), handles non-linear boundaries if input features transformed.
No activation: for regression.
$$ g(x) = x $$
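A numpy softmax sketch for made-up scores of 3 classes:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtracting the max is a standard numerical-stability trick
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])   # made-up raw scores for 3 classes
probs = softmax(z)
print(probs, probs.argmax())    # probabilities sum to 1; argmax = predicted class label
```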
PYTORCH
import torch.nn as nn
- containers
nn.Module → use as a base class when defining a neural network class
nn.Sequential → directly define the model by calling linear & activation funcs
- layers
nn.Linear(input_size, output_size) → single layer NN; no activation function - must be used for EVERY layer before calling activation classes in pytorch
- activation
nn.Sigmoid, nn.ReLU, nn.Softmax
BUILDING NN
using .Module
- must have __init__ & forward functions; the super().__init__() line MUST be included
- output size of layer t → input size of layer (t+1)
- each layer defined by 1 linear layer + 1 activation function
e.g. NN with one hidden layer for REGRESSION (i.e. last layer has no activation function)
class NNRegressor(torch.nn.Module):
    def __init__(self, input_size, output_size):
        super().__init__()
        # must add this ^ so that the NN can be initialised properly
        self.linear1 = torch.nn.Linear(input_size, output_size)
        self.linear2 = torch.nn.Linear(output_size, 1)
        # output ŷ = 1; output size of 1st layer = input size of 2nd layer
        self.activate = torch.nn.ReLU()
        # (or any other activation func)
    def forward(self, x):
        f1 = self.linear1(x)
        a1 = self.activate(f1)
        f2 = self.linear2(a1)
        return f2
model1 = NNRegressor(4, 8)
# 4 features (x_n), 8 neurons to map to initially
using .Sequential
- sequentially define each layer w/ input+ output sizes; call activation function in between if needed
- output size of layer t = input size of the next layer (t + 1)
model2 = torch.nn.Sequential(
    torch.nn.Linear(5, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 1)
)
TRAINING NN (GRAD DESCENT)
- build model (see above) & define loss function
- multi-classification → use nn.CrossEntropyLoss()
- binary classification → use nn.BCELoss()
- regression → use nn.MSELoss()
- initialise optimiser → retrieve all weights in model using model.parameters()
optimizer = torch.optim.SGD(model.parameters(), learning_rate)
- for every epoch (loop), initialise all gradients of weights to 0 (optimizer.zero_grad())
- compute ŷ (y_pred = model(x)) and loss (loss_function(y_pred, y))
- compute gradient (loss.backward()) → update weights (optimizer.step())
- (see the full training-loop sketch below)
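Putting the steps together, a minimal training-loop sketch (random made-up regression data, sizes chosen arbitrarily):

```python
import torch
import torch.nn as nn

X = torch.randn(100, 4)   # made-up data: 100 samples, 4 features
y = torch.randn(100, 1)

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
loss_fn = nn.MSELoss()                                     # regression loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # weights retrieved via model.parameters()

for epoch in range(100):
    optimizer.zero_grad()        # reset gradients of all weights to 0
    y_pred = model(X)            # forward pass: compute ŷ
    loss = loss_fn(y_pred, y)    # compute loss
    loss.backward()              # compute gradients
    optimizer.step()             # update weights
```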
CNN
- usually used for object detection in images (cancer cells, classifying image types, etc)
- fully connected layer → kernel “views” all pixels (’square’) of input image to determine output
- convolution → allows us to view local regions of an image
- NOTE: when building a CNN, if you want to transition to a FULLY CONNECTED LAYER (conv2d → maxpool → linear), must flatten inputs first
- input size = channels * height * width
torch.flatten(x, start_dim=1, end_dim=-1) (or nn.Flatten(start_dim=1, end_dim=-1))
components
- W/H = N: dimensions
- K: kernel dimension (for W/H respectively)
- sum of element wise multiplication with local region → feature map value
- P: padding
- S: stride → stride of ‘n’ is inclusive of current 1st elem
- e.g. if top left starts from A, stride = 2 means 2nd region should start from C
pooling
- downsamples feature maps → i.E. Reduces dimensions
- types: average, max, sum
- pytorch methods for pooling:
torch.nn.MaxPool2d, torch.nn.AvgPool1d, torch.nn.AdaptiveMaxPool2d
- output shape: (initial h or w) / pooling factor
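A minimal CNN sketch tying the pieces together (the 1-channel 28x28 input and 10 classes are assumptions; conv output H/W follows (N - K + 2P)/S + 1):

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1),  # 28x28 -> 28x28
    nn.ReLU(),
    nn.MaxPool2d(2),                # downsample: 28x28 -> 14x14
    nn.Flatten(start_dim=1),        # flatten before the fully connected layer
    nn.Linear(8 * 14 * 14, 10),     # input size = channels * height * width
)

x = torch.randn(16, 1, 28, 28)      # batch of 16 made-up images
print(cnn(x).shape)                 # torch.Size([16, 10])
```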
BASIC MATH
MATRIX MULTIPLICATION: A size(m x n) multiplied by B size(n x p) → AB size(m x p).
- col of A must = row of B; element = sum(row A * column B). For example, A of shape(m, n) * B of shape(n, p) = AB of shape(m, p)
TO FIND LOGBASE2 (N): log2(n) = ln(n) / ln(2)
WHAT IS ML
SUPERVISED LEARNING: models learn from labeled data → learn a hypothesis function that approximates the target function
CLASSIFICATION: output = discrete label (str, int, bool, etc)
REGRESSION: output = continuous num value
UNSUPERVISED LEARNING: models find patterns in unlabeled data (clustering); only features [x1, x2, ... xn]
OVERFITTING: Model performs well on training data but poorly on test data.
- due to HIGH VARIANCE: Model is too complex → Overfitting
UNDERFITTING: Model is too simple to capture patterns in data.
- due to HIGH BIAS: Model is too simple → Underfitting.
PREPROCESS
REPRESENTING DATA
tabular: one-hot encoding: e.g. "science", "arts", "hybrid" → [1,0,0], [0,1,0], [0,0,1]; ordinal encoding: e.g. "science", "arts", "hybrid" → 0, 2, 1
image / video: pixel grid representation (CNN)
word / sentence: one-hot encoding; n(words) = n(elems); bag of words
CLASS IMBALANCE → model may learn that predicting dominant class = fewer mistakes
OVERSAMPLING: duplicates of minority class
UNDERSAMPLING: subsample of majority class
MISSING VALUES
Remove missing data OR replace w/ the mean or median of the column (numerical data) OR replace w/ the most frequent value / mode of the label (categorical variables).
sklearn.impute's SimpleImputer(strategy = "mean" or "most_frequent" or "median")
FINDING OUTLIERS
Z-SCORE METHOD: calculate the z-score, then flag data points where the absolute z-score exceeds a threshold (manually set)
IQR METHOD Arrange data in ascending order → calculate quartiles: Q1 (25th percentile) & Q3 (75th percentile) → Compute IQR: IQR = Q3 – Q1 → Determine Bounds: Lower Bound: Q1 – 1.5 * IQR; Upper Bound: Q3 + 1.5 * IQR. Data points outside the bounds are outliers.
FEATURES W/ DIFF SCALES → model may give more importance to features w/ larger range
z-score standardisation: Z = (X - μ) / σ
min-max scaling: X' = (X - min(X)) / (max(X) - min(X))
FEATURE ENG
create new features based on existing ones; select/identify relevant feats.
non-linear decision boundary: add polynomial features equal to no. of features in dataset. E.g. transform hw(x) = w0 + w1x1 + w2x2 into a circular boundary:
hw(x) = w0 + 0·w1x1 + 0·w2x2 + w3x1^2 + w4x2^2
MODEL SELECT + TRAINING
K-NN
SUPERVISED; for classification; best for datasets w/ small no. of attrs; input: continuous/discrete/binary, output: array of labels
STEPS:
- calculate distances of data points from given using manhattan/euclidean distance
- identify k nearest neighbours based on distance
- classify the data point based on majority vote among k neighbors
CHOOSING K: K = SQRT(M):
k too large → decision boundary overly smooth (ie linear) → underfit
k too small → decision boundary overly complex → overfit
LIMITATIONS
requires feature scaling (see common data issues + fixes)
curse of dimensionality: higher dimensions = points are more spread out = becomes more difficult to distinguish based on distance
- mean nearest-farthest distance ratio:
DECISION TREES
takes attribute values as input (discrete) and returns a single output value (Bool)
TRAINING: for each value of a chosen attribute, use DTL again on corresponding subset of examples. Terminate when all samples in a subset have same category / Used all the features
ENTROPY (I):
uncertainty or unpredictability of the outcomes of a decision.
Max uncertainty: I(y) = 1 (random)
No uncertainty: I(y) = 0 (deterministic)
REMAINDER:
uncertainty remaining after partitioning the dataset based on an attr / weighted average of the entropy of the subsets after splitting the dataset based on attr.
WEIGHT: n(samples w/ attr value) / n(total no. Of samples)
INFORMATION GAIN (IG): Entropy(data) - remainder(A)
Reduction in uncertainty after splitting on an attribute. Higher IG → better attr → chosen earlier
PRUNING (to prevent overfit)
Max-depth: Limit tree depth, use majority vote at max depth
Min-sample leaf: Set min leaf sample size, use majority vote if < min
GRAD DESCENT
GLOBAL MIN THEOREM: graph of loss function against weight must have only 1 minimum which = global min.
STEPS: Initialize parameters --> Compute loss J(w0, w1, …) and their gradients ∂J/∂wj --> Update parameters with the rule wj ← wj - γ ∂J/∂wj
DETERMINE OPT WEIGHT: find derivative of cost function → set d(cost)/d(w) = 0, solve for w
DETERMINE OPT LR: let error be e(t) = w(t) - w* → substitute into grad update eqn --> rearrange eqn such that you can see e(t) on both sides (LHS e(t + 1), RHS e(t)) --> find value of lr whereby error gets smaller, ie | e(t + 1) | < | e(t) |
LINEAR REG
hw(x) = w0 + w1x1 + w2x2 + ⋯ + wdxd where w0, …, wd are parameters. w0 = bias / intercept;
i → index of a data point (runs from 1 to m = no. of samples; data.shape[0]);
j → index of a feature (runs from 1 to n = no. of features; data.shape[1] - 1)
COST FUNCTION (MSE): J = (1 / 2m) * Σ (ŷ_i - y_i)^2
VARIANTS
Mini-Batch: Uses a subset of data → faster + cheaper, good for large n
Stochastic: Uses 1 data point per update → fast but may overfit, unlikely to hit global min
POLYNOM REG: h_w(x) = w0 + w1x + w2x^2 ...
approximates any continuous function on a closed & bounded interval (min-max known) to any degree of accuracy
LOGISTIC REG
SUPERVISED; predicts the probability of a given input belonging to a class; the threshold determines the final classification (default = 0.5)
- used for classification w/ continuous inputs.
- decision boundary is linear (or can be transformed to linear). MAINLY for binary classification (yes/no, t/f, 0/1)
MODEL (SIGMOID FUNCTION): h_w(x) = 1 / (1 + e^(-w^T x))
LOSS FUNCTION (BCE): -y log(ŷ) - (1 - y) log(1 - ŷ); COST FUNCTION: mean of the BCE loss over all m samples
MULTI-CLASS CLASSIFICATION
1 vs all: fit 1 classifier per class. Fit against all classes → repeat for every class (train another model) → choose class w highest probability
1 vs 1: fit 1 classifier per class pair → higher probability = ‘win’ → choose class with highest no. Of ‘wins’
KMEANS CLUSTERING
CENTROID: mean of all data points in cluster per feature. Format (x_1, x_2, x_3, ..., x_n)
TRAINING LOOP: Initialize k centroids → Assign points to the nearest centroid (Euclidean distance) → Update centroids (mean of assigned points) → Repeat until centroids stabilise / max iter
INIT METHODS: K-Means - randomly initialize; K-Means++ - pick one centroid randomly, then select the others with probability proportional to the squared (euclidean) dist of each pt. from its nearest already chosen centroid
why is random bad?: assigning samples to random clusters causes each cluster's centroid to be close to mean of dataset
EVAL: calculate the mean euclidean distance J of each datapoint from its centroid → the smaller, the better
DETERMINE K: elbows method - plot graph of loss against k → set k to the ‘elbow’. The point whereby increasing no. Of clusters (k) does not lead to a drastic improvement in clustering performance (j)