Machine Learning Fundamentals: Algorithms and Techniques



ML → SUBSET OF AI; allows computers to learn from data without being explicitly programmed.


BASIC MATH INFO

  • matrix multiplication: if A is of size m x n and B is of size n x p, then AB has size m x p
    • cols of A must = rows of B
    • each element of AB = (row of A) · (column of B)
    • ie A of shape (m, n) * B of shape (n, p) = AB of shape (m, p)
  • to find log base 2 of n (sketch below):

$$ \log_2(n) = \frac{\ln(n)}{\ln(2)} $$
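
a minimal numpy sketch of the shape rule and the log2 change-of-base trick (the array values are made up for illustration):

import numpy as np

# shape rule: (m, n) @ (n, p) -> (m, p); cols of A must equal rows of B
A = np.ones((2, 3))           # m=2, n=3
B = np.ones((3, 4))           # n=3, p=4
print((A @ B).shape)          # (2, 4)

# log base 2 via change of base: log2(n) = ln(n) / ln(2)
n = 8
print(np.log(n) / np.log(2))  # 3.0 (same as np.log2(n))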


DEFINITIONS

  • Supervised Learning: Models learn from labeled data → learn a hypothesis function that approximates the target function
    • classification: goal = categorise input into classes
      • predicting if patient has a disease given symptoms (yes vs no)
      • identify type of fruit given images of fruit (apple, pear..)
    • regression: predict continuous numerical values based on input data
      • predict house price given house size
  • Unsupervised Learning: Models find patterns in unlabeled data (e.g. clustering); only features [x1, x2, ... xn], no labels
    • e.g. the model groups different fruits with the same colour together
  • Overfitting: Model performs well on training data but poorly on test data.
    • High variance: Model is too complex → Overfitting
  • Underfitting: Model is too simple to capture patterns in data.
    • High bias: Model is too simple → Underfitting

PIPELINE

  • data preprocessing → handle missing data, encode categorical variables (one-hot, ordinal, etc.)
  • feature engineering → select relevant + remove irrelevant features, design & create new features
  • model selection
  • model training → model learns to map input to output, parameters adjusted to minimise pred errors
  • model eval

REPRESENTING DATA

  • tabular
    • one-hot encoding: e.g. "science", "arts", "hybrid" → [1,0,0], [0,1,0], [0,0,1]
      • no. of elements in list = no. of labels
    • ordinal encoding: e.g. "science", "arts", "hybrid" → 0, 2, 1
  • image / video
    • pixel grid representation
      • flat b&w → 1 channel with 0s and 1s
      • grayscale image → 1 channel on a scale of 256 possible levels
      • varying shades of grey; 0 to 255 inclusive
    • RGB → combining red, green, and blue channels, each with 256 possible levels
    • video: sequence of images → repeated pixel grid rep for each image
  • word / sentence
    • words: one-hot encoding → no. of words = no. of elems

      • "ume" → [1, 0, 0]
      • "is" → [0, 1, 0]
      • "cute" → [0, 0, 1]
    • sentence: sequence of words →

      • list of one-hot encoded words, e.g. "ume is cute" → [ [1,0,0], [0,1,0], [0,0,1] ]
      • OR bag of words: initialise a counter list where each element corresponds to a word → count the frequency of each word in the sentence (see the sketch below)
      • e.g. "ume is so so cute cute cute" → counts: ume: 1, is: 1, so: 2, cute: 3
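
a small plain-Python sketch of the word encodings above; the vocabulary order ["ume", "is", "so", "cute"] is an assumption:

# one-hot encode a small vocabulary, then build a bag-of-words count vector
vocab = ["ume", "is", "so", "cute"]          # assumed word order

def one_hot(word):
    # no. of elements in the vector = no. of words in the vocabulary
    return [1 if w == word else 0 for w in vocab]

print(one_hot("ume"))   # [1, 0, 0, 0]

sentence = "ume is so so cute cute cute".split()
bag = [sentence.count(w) for w in vocab]
print(bag)              # [1, 1, 2, 3]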

HANDLING DATA ISSUES

  • class imbalance → model may learn that predicting the dominant class = fewer mistakes

    • sol: data resampling
      • oversampling: duplicates of minority class
      • undersampling: subsample of majority class
  • missing values

    • sol1: remove all datapoints w/ missing values (not recommended)
    • sol2: replace missing values w/ the mean, median, or mode of the column
      • implementation: sklearn.impute's SimpleImputer(strategy = "mean", "median", or "most_frequent")
  • outliers

    • sol1: z-score method
      • calculate z-scores: sklearn.preprocessing's StandardScaler() → fit_transform(data)
      • find datapoints where | z | > threshold, ie data[np.abs(z_scores) > threshold]
    • sol2: IQR method
      • arrange data in ascending order → calculate IQR (Q3 - Q1) = np.percentile(data, 75) - np.percentile(data, 25)
      • lower bound = Q1 - 1.5 * IQR
      • upper bound = Q3 + 1.5 * IQR
      • point > upper bound OR point < lower bound → outlier
  • features w/ diff scales → model may give more importance to features w/ larger ranges (see the combined sketch after this list)

    • sol1: z-score standardisation

    $$ Z = \frac{X - \mu}{\sigma} $$

    • sol2: min-max scaling

    $$ X' = \frac{X - min(X)}{max(X) - min(X)} $$
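
a combined sketch of the fixes above (SimpleImputer for missing values, z-scores + IQR for outliers, min-max scaling); the 1-column toy data is made up:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# toy 1-feature column with one missing value and one outlier
x = np.array([[1.0], [2.0], [np.nan], [3.0], [2.0], [1.0], [2.0], [3.0], [2.0], [50.0]])

# missing values: fill w/ the column mean (or "median" / "most_frequent")
x = SimpleImputer(strategy="mean").fit_transform(x)

# z-score method: standardise, then flag |z| > threshold
z = StandardScaler().fit_transform(x)
print(x[np.abs(z.ravel()) > 2])                        # flags the 50.0 outlier

# IQR method
q1, q3 = np.percentile(x, 25), np.percentile(x, 75)
iqr = q3 - q1
print(x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)])  # points outside the bounds

# min-max scaling to [0, 1]
print(MinMaxScaler().fit_transform(x).ravel())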



K-NN

  • SUPERVISED; for classification
  • best for datasets w/ a small no. of attrs
  • input: continuous/discrete/binary features; output: predicted class label (an array of labels when predicting for several points)

formula + steps

$$ euclidean(x_a, x_b) = \sqrt{\sum_{i=1}^n (x_{ai} - x_{bi})^2} $$

$$ manhattan(x_a, x_b) = \sum_{i=1}^n | x_{ai} - x_{bi}| $$

  1. calculate distances of all data points from the given point using manhattan/euclidean distance
    • n → no. Of features
    • x_a & x_b → 2 data points
  2. identify k nearest neighbours based on distance
  3. classify the data point based on majority vote among the k neighbors

choosing k

$$ k = \sqrt{m} $$

  • k too large → decision boundary overly smooth (ie linear) → underfit
  • k too small → decision boundary overly complex → overfit

limitations

  • requires feature scaling (see common data issues + fixes)

  • curse of dimensionality → becomes more difficult to distinguish points based on distance for higher dimensions.

    • mean nearest-farthest distance ratio:

    $$ \frac{1}{m}\sum_{i=1}^m\frac{\text{dist to nearest pt.}}{\text{dist to farthest pt.}} $$

    • 0 ≤ ratio ≤ 1
    • more dimensions → ratio closer to 1 → nearest and farthest neighbours look equally far → less accurate classification

IMPLEMENTATION

  • use sklearn.neighbors KNeighborsClassifier (see the sketch below)
  • euclidean dist: np.linalg.norm(x_a - x_b)
    • assuming x_a and x_b are arrays containing the [x1, x2, ... xn] features of two datapoints
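
a minimal sketch with made-up 2-feature data, using KNeighborsClassifier and np.linalg.norm as noted above:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# toy training set: 2 features, binary labels
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

k = int(round(np.sqrt(len(X))))                  # k = sqrt(m) heuristic -> 2
knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean").fit(X, y)
print(knn.predict([[2, 2], [7, 9]]))             # -> [0 1]

# euclidean distance between two points
print(np.linalg.norm(np.array([1, 1]) - np.array([2, 2])))   # sqrt(2) ≈ 1.414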

DECISION TREES

  • Supervised learning, used for classification
  • Input: binary, discrete, or discretized continuous attributes (e.g. ranges 0-10, 10-20, ...)
  • Output: a single decision value (e.g. a Bool label)

building the tree

  1. get root node
    • Compute entropy of data and all attributes
    • Compute remainder (or info gain) for each attribute
    • Choose attribute with smallest remainder (or highest info gain) as root
  2. split dataset by root attribute’s values
    • If an attribute value has >1 outcome, repeat the process using the next best attribute (smallest remainder)
    • To find the next best attribute:
      • Compute remainder within each subset filtered by the previous best attribute
      • If two attributes have the same remainder: choose the one that comes earlier in the tie-breaking order
  3. repeat until…
    • No attributes remain OR
    • All samples in a subset belong to one class

ENTROPY (I)

  • measures the uncertainty or unpredictability of the outcomes of a decision.
  • Max uncertainty: I(y) = 1 (random)
  • No uncertainty: I(y) = 0 (deterministic)

formula

$$ I = -\sum_{i=1}^{v} p_i \log_2 (p_i) $$

  • p_i = probability of outcome i, out of the v possible outcomes (e.g., "yes", "no")

entropy of entire dataset

$$ p = \frac{n(\text{samples w/ label y})}{n(\text{total samples})} $$

entropy of subset (for one attribute)

$$ p = \frac{n(\text{samples w/ attr x, label y})}{n(\text{samples of attr x})} $$

REMAINDER

  • uncertainty remaining after partitioning the dataset based on an attr, OR
  • weighted average of the entropy of the subsets after splitting the dataset based on said attr
  • smaller remainder → better attr

formula

$$ remainder(A)=\sum_{i=1}^{v}W(A_i)\times I(A_i) $$

  • v = number of outcomes (values) of attribute A
  • I(A_i) = entropy of the subset with the i-th value of A
  • W(A_i) = weight of that subset

weight calc

$$ W(A)=\frac{n(\text{samples in subset})}{n(\text{total samples})} $$

Information Gain (IG)

  • Reduction in uncertainty after splitting on an attribute
  • Higher IG → better attribute → should be chosen earlier

formula

$$ IG(A)=I(data)−remainder(A) $$
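
a small sketch computing entropy, remainder, and information gain with log base 2; the toy attribute/label arrays are made up:

import numpy as np

def entropy(labels):
    # I = -sum(p_i * log2(p_i)) over the outcome probabilities p_i
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(attr_values, labels):
    # remainder(A) = sum over attribute values of W(A_i) * I(A_i)
    remainder = 0.0
    for v in np.unique(attr_values):
        subset = labels[attr_values == v]
        remainder += (len(subset) / len(labels)) * entropy(subset)
    return entropy(labels) - remainder     # IG(A) = I(data) - remainder(A)

# toy example: attribute "outlook" vs label "play"
outlook = np.array(["sunny", "sunny", "rain", "rain"])
play    = np.array(["no", "yes", "yes", "yes"])
print(entropy(play), info_gain(outlook, play))   # ≈ 0.811, 0.311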

pruning (to prevent overfit)

  • Max-depth: Limit tree depth, use majority vote at max depth
  • Min-sample leaf: Set minimum leaf sample size, use majority vote if fewer

GRADIENT DESCENT

iteratively updates weights to minimise the loss by following the (negative) gradient (a worked sketch follows at the end of this section).

  • steps: initialise parameters → compute the loss J(w_0, w_1, ...) and its gradients ∂J/∂w_j → update the parameters with the rule below
  • GLOBAL MIN THEOREM: the graph of the loss function against the weight must have only one minimum, which = the global minimum.

$$ w_j \leftarrow w_j - \gamma\frac{\partial J}{\partial w_j} $$

  • γ → learning rate
  • stops when  ∂J / ∂w = 0 (converges)
  • basically means wj = wj - learning rate * grad of loss curve at wj
  • loss function: MSE for linear regression, BCE loss for logistic regression

gradient calc

$$ \frac{∂J_{MSE}(w)}{∂w_j}=\frac{∂J_{BCE}(w)}{∂w_j}=\frac{1}{m}∑_{i=1}^m (ŷ^i - y^i){x_j}^i $$

determining optimal weight & learning rate manually

  • optimal weight (w*): find the derivative of the cost function → set d(cost)/d(w) = 0, solve for w
  • optimal lr:
    • let the error be e(t) = w(t) - w* → substitute into the gradient update eqn
    • rearrange the eqn so that the error appears on both sides (LHS e(t + 1), RHS e(t))
    • find the values of lr for which the error shrinks, ie | e(t + 1) | < | e(t) |
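
a worked sketch of batch gradient descent for linear regression with the MSE gradient above; the toy data (y = 2x) and the learning rate are assumptions:

import numpy as np

# toy data: y = 2x, one feature plus a bias column of ones
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # column 0 is the bias term x_0 = 1
y = np.array([2.0, 4.0, 6.0])
w = np.zeros(2)
gamma = 0.1                                           # learning rate

for _ in range(1000):
    y_hat = X @ w                                     # h_w(x) = w^T x
    grad = (X.T @ (y_hat - y)) / len(y)               # (1/m) * sum (ŷ - y) x_j
    w = w - gamma * grad                              # w_j <- w_j - γ ∂J/∂w_j
print(w)                                              # ≈ [0, 2]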

LINEAR REG

  • SUPERVISED; used for predicting continuous values by fitting a straight line to the data

model

  • 1 feature 1 label

    $$ h_w(x) = w_0 + w_1x $$

  • multiple features

$$ h_w(x) = w_0 + w^Tx, \text{ where } w^Tx = w_1x_1 + w_2x_2 + \dots + w_nx_n $$

key terms

  • ŷ → predicted value
  • x = [x1,x2,...,xn] → input features
  • w = [w1,w2,...,wn] → weights (parameters)
  • w0 → y-intercept / bias
  • i → index of a data point (runs from 1 to m = no. of samples; data.shape[0])
  • j → index of a feature (runs from 1 to n = no. of features; data.shape[1] - 1)

cost function (MSE)

  • to measure ‘goodness’ of our model

$$ J(MSE) = \frac{1}{2m} ∑_{i=1}^m(ŷ_i-y_i)^2 $$

  • m → no. of training data samples
  • note: loss function measures loss of SINGLE data point, ie (ŷ-y)^2 (squared error)
    • cost function measures AVERAGE loss across entire training set

variants

  • Mini-Batch: Uses a subset of the data per update → faster + cheaper, good for large datasets (large m)
  • Stochastic: Uses 1 data point per update → fast but noisy; may overfit and is unlikely to land exactly on the global min

non-linear r/s

  • transform into linear by adding polynomial features (polynomial regression; see the sketch below)

    • polynomial regression can approximate any continuous function on a closed and bounded interval (min and max known) to any degree of accuracy (Weierstrass Approximation Theorem)

    $$ h_w(x) = w_0 + w_1x + w_2x^2 + \dots $$
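
a minimal sketch of polynomial regression with sklearn's PolynomialFeatures + LinearRegression; the toy quadratic data is made up:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# toy non-linear relationship: y = 1 + 2x + 3x^2
x = np.linspace(0, 5, 20).reshape(-1, 1)
y = 1 + 2 * x.ravel() + 3 * x.ravel() ** 2

# add polynomial features, then fit an ordinary linear regression on them
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
model = LinearRegression().fit(X_poly, y)
print(model.intercept_, model.coef_)    # ≈ 1, [2, 3]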


LOGISTIC REG

  • SUPERVISED; used for classification w/ continuous inputs
    • usually the decision boundary is linear (or can be transformed to linear).
    • used MAINLY for binary classification (yes/no, t/f, 0/1)
  • predicts the probability of a given input belonging to a class
    • threshold determines the final classification.
    • eg if "fat" = 1, "not fat" = 0, and threshold = 0.5, a predicted probability of 0.8 is classified as "fat" since 0.8 ≥ 0.5
    • Default threshold (e.g. in sklearn): 0.5.

model (sigmoid function)

$$ h_w(x) = \frac{1}{1 + e^{-w^Tx}}, \newline {{where \; w^Tx = w_0 + w_1x_1+...+w_nx_n}}

$$

  • σ(w^Tx) = h_w(x) = ŷ

loss function (BCE, binary cross-entropy)

$$ BCE(y, ŷ) = \begin{cases} -\log (ŷ)& \text{if y = 1}\\ -\log(1-ŷ) & \text{if y = 0}\\ \end{cases}

\newline OR \newline = -y\log(ŷ) - (1-y)\log(1-ŷ) $$

  • ŷ = hw(x)

cost function (BCE, binary cross-entropy)

$$ J(BCE) = \frac{1}{m} \sum_{i=1}^m-y \log(ŷ)\newline-(1-y)\log(1-ŷ) $$
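
a small numpy sketch of the sigmoid model, the BCE cost, and the 0.5 threshold; the w^T x values are made up:

import numpy as np

def sigmoid(z):
    # h_w(x) = 1 / (1 + e^{-w^T x})
    return 1.0 / (1.0 + np.exp(-z))

def bce_cost(y, y_hat):
    # J(BCE) = (1/m) * sum( -y*log(ŷ) - (1-y)*log(1-ŷ) )
    return np.mean(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))

y     = np.array([1, 0, 1])
y_hat = sigmoid(np.array([2.0, -1.0, 0.5]))     # toy values of w^T x
print(y_hat, bce_cost(y, y_hat))

# threshold (default 0.5) turns probabilities into class labels
print((y_hat >= 0.5).astype(int))               # -> [1 0 1]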

common issues & fixes

  • non-linear decision boundary

    • add polynomial features built from the existing features in the dataset
    • e.g. transform h_w(x) = w_0 + w_1x_1 + w_2x_2 into a circular boundary:

    $$ h_w(x) = w_0 + w_1x_1 + w_2x_2 + w_3x_1^2 + w_4x_2^2 $$

    • note: for a circle centred at the origin, the linear coefficients are 0 and the squared coefficients are 1; the actual values depend on the qn

multi-class classification

  • one vs all: fit 1 classifier per class
    • fit each class against all the other classes → repeat for every class (ie train one model per class) → choose the class whose classifier gives the highest probability
  • one vs one: fit 1 classifier per class pair → higher probability = 'win'
    • choose the class w/ the highest no. of 'wins'

REGULARIZATION

  • ways to prevent overfitting:
    • reduce no. Of features → transform from high to low degree polynomial
    • regularization → reduce MAGNITUDE of weights (wj)

(L2/ridge) regularized cost function

$$ J(w) = [J_{BCE}(w) \;or\; J_{MSE}(w)] + \lambda \sum_{j=1}^n w^2_j $$

  • logistic → J(BCE); linear → J(MSE)
  • λ → regularization parameter

regularized loss gradient

$$ \frac{∂J_{MSE}(w)}{∂w_j}=\frac{∂J_{BCE}(w)}{∂w_j}=\frac{1}{m}∑_{i=1}^m (ŷ^i - y^i){x_j}^i + 2\lambda w_j $$

  • the penalty term is per-weight (the 2 comes from differentiating λw_j²), not a sum over all weights
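
a minimal numpy sketch of adding the L2 penalty to an MSE cost and gradient; leaving the bias w_0 out of the penalty is a common convention assumed here, and the toy data is made up:

import numpy as np

def ridge_cost_and_grad(X, y, w, lam):
    y_hat = X @ w
    cost = np.mean((y_hat - y) ** 2) / 2                 # J(MSE)
    grad = (X.T @ (y_hat - y)) / len(y)                  # unregularized gradient
    cost += lam * np.sum(w[1:] ** 2)                     # + λ Σ w_j²  (bias w_0 left out)
    grad[1:] += 2 * lam * w[1:]                          # + 2λ w_j
    return cost, grad

X = np.array([[1.0, 3.0], [1.0, 4.0], [1.0, 5.0]])       # bias column + 1 feature
y = np.array([2.0, 3.0, 4.0])
print(ridge_cost_and_grad(X, y, np.array([0.5, 0.5]), lam=0.1))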


CLUSTERING

unsupervised learning method: Find groupings based on datapoint similarities (NO LABELS)

k-means

centroid

  • mean of all data points in the cluster, computed per feature → format (x1, x2, ... xn)
  • m = no. of datapoints IN THE CLUSTER

$$ centroid (u) = \frac{1}{m}{\sum_{i=1}^m x_i} $$

how to find centroid?

  • steps
    1. initialise k centroids →
    2. for datapoint i in range(0, m), where m = no. Of datapoints, group datapoint i to the nearest centroid →
    3. update centroids (mean of all datapoints)
  • loop stops when?: centroids don't change anymore (or max iterations reached)
  • initialisation methods:
    • k-means: randomly initialise k centroids
      • why is purely random assignment bad?: assigning samples to random clusters makes each cluster's centroid land close to the mean of the whole dataset
    • k-means++: random initialization of the first centroid → select the remaining centroids with a probability proportional to the squared distance of each data point from its nearest already chosen centroid.
      • P(x) = squared dist of the point from its nearest chosen centroid / total squared dist of all points from their nearest chosen centroids

evaluate performance (J)

  • calculate MEAN EUCLIDEAN DISTANCE J of EACH datapoint from its centroid → the smaller the value, the better

determining k

  • elbow method (heuristic; some graphs of J against k may not have a clear elbow) → see the sketch below
    • set k to the 'elbow' → the point where increasing the no. of clusters (k) no longer leads to a drastic improvement in clustering performance (J)
  • application-dependent method: k is chosen based on the specific use case
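
a sketch of the elbow method with sklearn's KMeans on two made-up blobs; note sklearn reports inertia_ (sum of squared distances) rather than the mean distance J above, but the elbow shape is the same:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# two toy blobs of points
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

# elbow method: compare the goodness value for increasing k
for k in range(1, 6):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)     # drops sharply until the 'elbow' at k = 2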

hierarchical

  • steps
    1. initially every point is a data cluster
    2. in a loop, calculate distances of each data point from all other datapoints using linkage
    3. find min(dist) among all data points → merge tgt
    4. repeat until all forms 1 cluster
  • after merging, to compare the merged cluster [A,B] with another point C, the distance used depends on the linkage; e.g. with single linkage, if dist(A, C) > dist(B, C), the distance is taken as dist(B, C)
  • visualise: dendrogram → draw from the bottom up until there is 1 link joining everything
    • draw a horizontal line; no. of intersections between the line & the dendrogram = no. of clusters
  • types of linkages (ie which distance to use):
    • single → dist btw the points closest to one another in the two clusters
    • average → average of the distances of all point pairs
    • complete → dist btw the points farthest from one another in the two clusters
    • centroid → dist btw the centroids of the clusters

differences

  • implementation: from sklearn.cluster import KMeans for k-means, but import AgglomerativeClustering for hierarchical (see the sketch below)
  • K-means uses the loss function (i.e., goodness value J) and minimises it, whereas hierarchical clustering merges the pair of clusters that are closest together according to the linkage method.
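
a minimal sketch of hierarchical clustering on made-up points, using scipy for the linkage/dendrogram side and sklearn's AgglomerativeClustering as noted above:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import AgglomerativeClustering

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 10]])   # toy points

# scipy: build the merge tree with single linkage, then cut it into 3 clusters
Z = linkage(X, method="single")            # also: "average", "complete", "centroid"
print(fcluster(Z, t=3, criterion="maxclust"))

# sklearn equivalent (no centroid linkage here; its default is "ward")
print(AgglomerativeClustering(n_clusters=3, linkage="single").fit_predict(X))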

MODEL EVAL

regression

  • MSE
  • MAE → mean absolute error, ie (1/m) Σ | ŷ - y |

classification metrics

basically the confusion matrix:

|            | predicted +ve | predicted -ve |
|------------|---------------|---------------|
| actual +ve | TP            | FN            |
| actual -ve | FP            | TN            |

FOR ALL METRICS BELOW: btw. 0 & 1, the higher the better!!

| Metric        | Formula                   | Definition / Use Case |
|---------------|---------------------------|-----------------------|
| Accuracy (A)  | (TP + TN) / total samples | True predictions / total no. of samples |
| Precision (P) | TP / (TP + FP)            | True positives / all predicted positives; use if u want to minimise false positives (e.g. spam detection) |
| Recall (R)    | TP / (TP + FN)            | True positives / all actual positives; use if u want to minimise false negatives (e.g. disease detection/prediction) |
| F1 Score      | 2 / ( (1 / P) + (1 / R) ) | Harmonic mean of precision and recall |
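
a small sketch of these metrics with sklearn on made-up predictions; note sklearn's confusion_matrix orders the labels [0, 1], so TN appears top-left rather than TP:

from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]     # toy labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))   # rows = actual, cols = predicted
print(accuracy_score(y_true, y_pred))     # (TP + TN) / total samples
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(f1_score(y_true, y_pred))           # 2 / ((1/P) + (1/R))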

hyperparameter tuning

  • hyperparameter → external configuration parameters of a ML model not learned from training data
    • learning rate, regularisation parameter
  • to find the best hyperparameter (see the sketch below):
    • grid / exhaustive search → test every combination of hyperparameter values
    • random search → randomly sample hyperparameter combinations to test
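
a minimal sketch of grid search and random search with sklearn; the model (KNN), the grid values, and the iris dataset are illustrative choices:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# exhaustive search: try every value in the grid, score each by cross-validation
grid = GridSearchCV(KNeighborsClassifier(), param_grid={"n_neighbors": [1, 3, 5, 7, 9]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)

# random search: sample a fixed number of hyperparameter values instead
rand = RandomizedSearchCV(KNeighborsClassifier(),
                          param_distributions={"n_neighbors": list(range(1, 20))},
                          n_iter=5, cv=5, random_state=0)
rand.fit(X, y)
print(rand.best_params_)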

model performance measurement

  1. training:
    • if finding the best hyperparameter: train a new model w/ each hyperparameter value on the same dataset
    • if finding the best model: select diff models (k-nn, logistic reg, etc) to train on the same dataset
  2. validating:
    • use a held-out subset of the training samples → find the model/hyperparameter that produces the LEAST ERROR
  3. testing

NEURAL NETWORKS

Perceptrons

  • Perceptron model:

    • Input (w_0 + w_1x_1 + w_2x_2 + ⋯ + w_nx_n) → Activation function (g) → Output (ŷ)

    • Formula:

      $$ \hat{y} = g\left( \sum_{i=0}^n w_i x_i \right) $$

  • perceptron (or any neuron) output without activation = linear regression, forming a straight-line decision boundary:

$$ \hat{y} = g(w^Tx) = w^Tx = \newline w_0 + w_1x_1 + \dots + w_nx_n $$

  • the activation function applies to the weighted sum of all inputs, not to individual input features.
  • the original perceptron uses a step function as its activation (outputs -1 or 1).

Perceptron Update Algorithm

  1. Initialize weights randomly, predict ŷ

  2. If ŷ ≠ y, update weights (see the sketch below):

    $$ w_j = w_j + \gamma (y - \hat{y}) x_j $$

  3. Repeat until all points are classified correctly or max steps reached.
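
a small numpy sketch of the update rule learning an AND gate (ties in with the logic gates below); using a 0/1 step activation instead of -1/1 is an assumption made for simplicity:

import numpy as np

# AND gate data: inputs prefixed with x_0 = 1 for the bias weight w_0
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
y = np.array([0, 0, 0, 1])
w = np.zeros(3)
gamma = 0.1

def step(z):
    return (z >= 0).astype(int)              # step activation (0/1 here instead of -1/1)

for _ in range(20):                          # repeat until correct or max steps reached
    for xi, yi in zip(X, y):
        y_hat = step(xi @ w)
        w = w + gamma * (yi - y_hat) * xi    # only changes w when ŷ ≠ y
print(w, step(X @ w))                        # learned weights; predictions -> [0 0 0 1]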

Logic Gates

true = 1, false = 0

  • NOT: opposite (True if 0, False if 1)
  • AND: only true if both inputs are 1.
  • NAND: only true if not both inputs are 1.
  • OR: true if at least one input is 1.
  • NOR: only true if both inputs are 0.
  • XNOR: true if inputs are the same. (1,1) / (0,0) → 1
  • XOR: true if inputs differ (0,1) / (1,0) → 1

Neural Networks

  • Formula:

    $$ \hat{y} = g(w^T x) $$

  • Single Layer: No hidden layers; gates like AND, OR, NAND.

  • Multi-Layer: Input → Hidden layers → Output (e.g., XOR, XNOR).
  • Forward Propagation: Data flows from input → hidden layers → output.

    • Activation function used at each layer (except for output layer when doing regression)
  • Weight Matrix: Rows = number of inputs, Columns = number of outputs for the next layer.

    • Example: a 3x5 weight matrix maps 3 inputs to 5 outputs → the next layer receives 5 values.
  • Evaluating Mathematical Reps
    • start with the inner () first
    • e.g. NOT(AND(x1, x2)) → if x1 = 0, x2 = 1, then AND(0, 1) = 0 → NOT(0) → 1

Activation Functions

  • Softmax → used for multiclass classification (for C no. of classes)

    • nn.Softmax(dim=1) → returns a list of probabilities, one per element (class), of the input row belonging to that class; the index of the element = class label

    $$ \hat{y}^i = \frac{e^{z_i}}{\sum_{j=1}^C e^{z_j}} $$

  • Sigmoid: For binary classification (logistic regression), handles non-linear boundaries if input features transformed.

  • No activation: for regression.

    $$ g(x) = x $$

PYTORCH

import torch.nn as nn
  • containers
    • nn.Module → use as input when defining neural network class
    • nn.Sequential → directly define class by calling linear & activation funcs
  • layers
    • nn.Linear(input_size, output_size) → single layer NN; no activation function
    • must be used for EVERY layer before calling activation classes in pytorch
  • activation
    • nn.Sigmoid
    • nn.ReLU
    • nn.Softmax

BUILDING NN

using nn.Module

  • must have __init__ & forward functions; the super().__init__() line MUST be included
  • output size of layer t → input size of layer (t+1)
  • each layer defined by 1 linear layer + 1 activation function

e.g. NN with one hidden layer for REGRESSION (i.e. the last layer has no activation function)

import torch

class NNRegressor(torch.nn.Module):

    def __init__(self, input_size, hidden_size):
        super().__init__()
        # must add this ^ so that the NN can be initialised properly
        self.linear1 = torch.nn.Linear(input_size, hidden_size)
        self.linear2 = torch.nn.Linear(hidden_size, 1)
        # output ŷ has size 1; output size of 1st layer = input size of 2nd layer
        self.activate = torch.nn.ReLU()
        # (or any other activation func)

    def forward(self, x):
        f1 = self.linear1(x)
        a1 = self.activate(f1)
        f2 = self.linear2(a1)
        return f2


model1 = NNRegressor(4, 8)
# 4 features (x_n), 8 hidden neurons to map to initially

using nn.Sequential

  • sequentially define each layer w/ input + output sizes; call the activation function in between if needed
    • output size of layer t = input size of the next layer (t + 1)
model2 = torch.nn.Sequential(
          torch.nn.Linear(5, 8),
          torch.nn.ReLU(),
          torch.nn.Linear(8, 1),
          )

TRAINING NN (GRAD DESCENT)

  1. build model (see above) & define the loss function; the full loop is sketched below
    • multi-class classification → use nn.CrossEntropyLoss()
    • binary classification → use nn.BCELoss()
    • regression → use nn.MSELoss()
    • the loss object is then called as loss_function(y_pred, y)
  2. initialise optimiser → retrieve all weights in the model using model.parameters()
    • optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
  3. for every epoch (loop), reset all weight gradients to 0 (optimizer.zero_grad())
  4. compute ŷ (y_pred = model(x)) and the loss (loss_function(y_pred, y))
  5. compute gradients (loss.backward()) → update weights (optimizer.step())
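
a minimal end-to-end sketch of steps 1-5 for regression; the Sequential model shape and the random toy data are assumptions:

import torch

torch.manual_seed(0)
x = torch.randn(100, 5)
y = x.sum(dim=1, keepdim=True)                      # illustrative target

model = torch.nn.Sequential(torch.nn.Linear(5, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1))
loss_function = torch.nn.MSELoss()                   # regression
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(100):
    optimizer.zero_grad()            # 3. reset gradients
    y_pred = model(x)                # 4. compute ŷ
    loss = loss_function(y_pred, y)  #    and the loss
    loss.backward()                  # 5. compute gradients
    optimizer.step()                 #    update weights
print(loss.item())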

CNN

  • usually used for object detection / classification in images (cancer cells, classifying image types, etc)
  • fully connected layer → each output "views" all pixels of the input image to determine its value
  • convolution → allows us to view local regions of an image
  • NOTE: when building a CNN, if u want to transition to a FULLY CONNECTED LAYER (conv2d → maxpool → linear), must flatten the inputs first (see the sketch at the end of this section)
    • input size = channels * height * width
    • torch.flatten(x, start_dim=1, end_dim=-1) → keeps the batch dimension

components

  • W/H = N: dimensions
  • K: kernel dimension (for W/H respectively)
    • sum of element wise multiplication with local region → feature map value
  • P: padding
  • S: stride → a stride of 'n' is counted from the current region's 1st element
    • e.g. if the top left starts from column A, stride = 2 means the 2nd region starts from column C

pooling

  • downsamples feature maps → i.e. reduces dimensions
    • types: average, max, sum
    • pytorch methods for pooling: torch.nn.MaxPool2d, torch.nn.AvgPool1d, torch.nn.AdaptiveMaxPool2d
  • output shape: (initial h or w) / pooling factor
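
a minimal sketch of a conv → pool → flatten → linear transition on a made-up RGB input; the shape comments use the usual convolution arithmetic (N - K + 2P)/S + 1, which is assumed here rather than stated above:

import torch

x = torch.randn(1, 3, 32, 32)                 # (batch, channels, H, W) toy RGB image

conv = torch.nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1, stride=1)
pool = torch.nn.MaxPool2d(kernel_size=2)      # downsample by pooling factor 2

f = conv(x)                                   # H/W: (32 - 3 + 2*1)/1 + 1 = 32 -> (1, 8, 32, 32)
p = pool(f)                                   # 32 / 2 = 16                 -> (1, 8, 16, 16)

# transition to a fully connected layer: flatten everything except the batch dim
flat = torch.flatten(p, start_dim=1)          # input size = channels * height * width = 8*16*16
out = torch.nn.Linear(8 * 16 * 16, 10)(flat)  # e.g. 10 classes
print(f.shape, p.shape, flat.shape, out.shape)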

