Machine Learning Algorithms: Comprehensive Definitions


Support Vector Machines (SVM)

A support vector machine is a supervised method for classification or regression that seeks a boundary in a high-dimensional space which separates classes with the widest possible margin. The training process involves choosing a boundary that maximizes the distance to the nearest training points, known as support vectors. When data are not perfectly separable, slack variables can be introduced to allow some misclassifications or margin violations while balancing margin maximization and classification accuracy. A kernel is a special function that effectively maps data into higher-dimensional spaces without doing the mapping explicitly; it lets the support vector machine handle nonlinear relationships by measuring similarity between data points in a transformed space.
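
As an illustration, here is a minimal sketch using scikit-learn's SVC with an RBF kernel; the two-moons dataset and the C and gamma settings are arbitrary choices for the example, not part of the definition above.

```python
# Minimal SVM sketch with scikit-learn (illustrative data and parameters).
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-moons: not linearly separable in the original space.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# The RBF kernel maps points implicitly into a higher-dimensional space;
# C balances a wide margin against misclassifications (slack).
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)

print("number of support vectors:", clf.support_vectors_.shape[0])
print("training accuracy:", clf.score(X, y))
```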

Feedforward Neural Networks with Error Backpropagation

A feedforward neural network is a layered model of neurons connected from an input layer through hidden layers to an output layer. Each neuron computes a weighted sum of its inputs plus a bias term, then applies a nonlinear activation. Training happens through an iterative cycle called backpropagation where the output error is computed with a loss function, then gradients are propagated backward to update each connection weight. This process is repeated over many passes through the training data.
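
The following NumPy sketch trains a tiny network on the XOR problem with manual backpropagation; the hidden layer size, learning rate, and number of epochs are illustrative choices.

```python
# Minimal feedforward network trained with backpropagation (NumPy sketch).
import numpy as np

rng = np.random.default_rng(0)

# Toy XOR problem: 2 inputs, 1 output.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer with 4 units; weights initialised randomly, biases at zero.
W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)
lr = 1.0

for epoch in range(10000):
    # Forward pass: weighted sums plus bias, then nonlinear activation.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: propagate the squared-error gradient to every weight.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

print(out.round(3).ravel())  # predictions move toward [0, 1, 1, 0]
```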

Self-Organizing Maps (SOM)

A self-organizing map is an unsupervised neural model that projects high-dimensional inputs onto a usually two-dimensional grid of neurons. Each neuron has a weight vector of the same dimension as the inputs, and when an input is presented, the neuron whose weight is closest is chosen as the best matching unit. That unit’s weight and the weights of its neighbors are then nudged toward the input in proportion to a learning rate and a neighborhood function. Over time, the map organizes into clusters that preserve topological relationships of the input space.
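
A minimal training loop for a small map is sketched below, assuming random toy inputs, a 10x10 grid, and a Gaussian neighborhood with linearly decaying learning rate and radius; all of these sizes and schedules are illustrative.

```python
# Minimal self-organizing map update loop (NumPy sketch, illustrative sizes).
import numpy as np

rng = np.random.default_rng(0)
grid_h, grid_w, dim = 10, 10, 3            # 10x10 map of 3-dimensional weights
weights = rng.random((grid_h, grid_w, dim))
data = rng.random((500, dim))              # toy inputs in [0, 1]^3

# Grid coordinates of every neuron, used by the neighborhood function.
coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w),
                              indexing="ij"), axis=-1)

n_steps = 2000
for t in range(n_steps):
    x = data[rng.integers(len(data))]
    # Best matching unit: the neuron whose weight vector is closest to the input.
    dists = np.linalg.norm(weights - x, axis=-1)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)

    # Learning rate and neighborhood radius both decay over time.
    lr = 0.5 * (1 - t / n_steps)
    sigma = 3.0 * (1 - t / n_steps) + 0.5
    grid_dist2 = ((coords - np.array(bmu)) ** 2).sum(axis=-1)
    neighborhood = np.exp(-grid_dist2 / (2 * sigma ** 2))

    # Nudge the BMU and its neighbors toward the input.
    weights += lr * neighborhood[..., None] * (x - weights)
```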

Linear Classifiers

A linear classifier uses a linear combination of input features to separate data into classes. It creates a boundary in the feature space and assigns labels based on which side of the boundary a point lies. Training typically involves defining a loss function that measures prediction error and applying an optimization process such as gradient descent to adjust the weight vector and any bias term for minimal classification error.

K-Means Clustering

K-means is an unsupervised algorithm that groups data into a chosen number of clusters by assigning each point to the nearest cluster center and then recalculating the cluster centers as the means of their assigned points. The process alternates between assignment and update steps until it stabilizes. Deciding how many clusters to use can be guided by methods like the elbow method, silhouette scores, or information criteria. Alternatives include hierarchical clustering, density-based methods like DBSCAN, and model-based methods such as Gaussian mixtures.
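
A compact NumPy implementation of the assignment/update loop (Lloyd's algorithm) is sketched below; the synthetic data, k = 3, and the iteration cap are illustrative.

```python
# Minimal k-means (Lloyd's algorithm) in NumPy; data and k are illustrative.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise centers by picking k distinct data points at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Update step: recompute each center as the mean of its points.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, (100, 2)) for m in (0, 3, 6)])
centers, labels = kmeans(X, k=3)
print(centers.round(2))
```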

Fisher Linear Discriminant Analysis (FLDA)

Fisher linear discriminant analysis is a supervised method that projects data onto a direction that best separates the classes. It seeks a projection that maximizes the ratio of the distance between class means to the spread within each class. For more than two classes, the method can be extended by maximizing overall class separability in a subspace.
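
For the two-class case, the Fisher direction can be computed directly as the within-class scatter matrix applied inversely to the difference of class means; the Gaussian toy data below is an illustrative assumption.

```python
# Two-class Fisher discriminant direction (NumPy sketch on synthetic data).
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.normal([0, 0], 1.0, (100, 2))   # class 0
X1 = rng.normal([3, 2], 1.0, (100, 2))   # class 1

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
# Within-class scatter: summed covariance of each class around its own mean.
Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)

# The Fisher direction maximizes between-class over within-class spread.
w = np.linalg.solve(Sw, m1 - m0)
w /= np.linalg.norm(w)

# Projected class means should be well separated along w.
print("projected means:", (X0 @ w).mean(), (X1 @ w).mean())
```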

Independent Component Analysis (ICA)

Independent component analysis is an unsupervised method that decomposes mixed signals into statistically independent source components. It assumes signals are combined linearly and that the source signals have non-Gaussian distributions. It is often used in problems like source separation and denoising when sources are independent.
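
The classic demonstration is blind source separation; the sketch below uses scikit-learn's FastICA on two synthetic non-Gaussian sources mixed with an arbitrary example mixing matrix.

```python
# Blind source separation with FastICA (scikit-learn sketch, synthetic signals).
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
# Two non-Gaussian sources: a sine wave and a square wave.
s1 = np.sin(2 * t)
s2 = np.sign(np.sin(3 * t))
S = np.c_[s1, s2]

# Mix the sources linearly (the mixing matrix is an arbitrary example).
A = np.array([[1.0, 0.5], [0.5, 1.0]])
X = S @ A.T

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)   # estimated independent components
print(S_est.shape)             # (2000, 2): sources up to order and scale
```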

K-Nearest Neighbors (KNN)

K-nearest neighbors is a non-parametric approach that uses the closest training examples to classify a new point. Distances between the new point and existing training points are computed, and the label of the new point is inferred from the majority label of its nearest neighbors. The choice of how many neighbors to use affects model complexity, with smaller values capturing finer details and larger values smoothing predictions.
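
The effect of the neighbor count can be seen in a short scikit-learn sketch; the iris dataset, split ratio, and the values of k tried are illustrative.

```python
# k-nearest neighbors classification (scikit-learn sketch on the iris data).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Small k follows local detail; larger k gives smoother decision regions.
for k in (1, 5, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, knn.score(X_test, y_test))
```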

Filter vs. Wrapper Approaches to Dimension Reduction

Filter approaches select features based on data-level statistics or correlations without referencing a specific model. Wrapper approaches evaluate subsets of features by training a chosen model and measuring its performance with metrics such as accuracy, precision, or recall. Filters are usually faster and simpler, while wrappers can be more accurate but demand more computational work.

Procedures to Construct Subsets in Dimension Reduction

Filter methods often rank each feature using scores like correlation, mutual information, or other relevance criteria, then choose the top subset. Wrapper methods use model-based searches like forward selection, backward elimination, or heuristic search. Filter steps happen independently of a predictive model, while wrappers actively use a model in the subset search process.
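
Both styles are available in scikit-learn; the sketch below contrasts a mutual-information filter with a forward-selection wrapper around a logistic regression. The dataset sizes, the choice of 5 features, and the base model are illustrative assumptions.

```python
# Filter vs. wrapper feature selection (scikit-learn sketch, illustrative sizes).
from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectKBest, SequentialFeatureSelector,
                                        mutual_info_classif)
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# Filter: rank features by mutual information with the label, keep the top 5.
filt = SelectKBest(mutual_info_classif, k=5).fit(X, y)
print("filter picks:", filt.get_support(indices=True))

# Wrapper: forward selection that repeatedly refits a model on candidate subsets.
wrap = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                 n_features_to_select=5,
                                 direction="forward").fit(X, y)
print("wrapper picks:", wrap.get_support(indices=True))
```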

Principal Component Analysis (PCA) vs. ICA

Principal component analysis identifies the directions of greatest variance in data, called principal components, and projects data onto these directions. It uses the covariance of the data to find uncorrelated directions that capture the most variance. ICA aims for statistical independence among components, whereas PCA focuses on uncorrelated directions based on variance.

PCA for Dimension Reduction

Principal component analysis involves centering the data, computing its covariance, and extracting directions that correspond to the largest variations. To reduce dimension, keep only those directions with the highest variance and discard the rest. Data points are then mapped into this lower-dimensional subspace while retaining most of the variance.
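
These steps translate directly into a short NumPy sketch; the correlated toy data and the choice of two retained components are illustrative.

```python
# PCA by eigendecomposition of the covariance matrix (NumPy sketch).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated toy data

# Center the data, compute its covariance, and extract eigenvectors.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)           # ascending eigenvalues

# Keep the k directions with the largest variance and project onto them.
k = 2
order = np.argsort(eigvals)[::-1][:k]
components = eigvecs[:, order]                    # shape (5, k)
X_reduced = Xc @ components                       # shape (200, k)
print(X_reduced.shape)
```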

Determining Significant Components in PCA

Principal component analysis reveals how much variance each direction explains. The number of significant components is usually chosen by examining how much cumulative variance is captured or by looking at a plot of variances to see where they start dropping rapidly. In practice, a threshold of explained variance or a visual elbow can guide how many components to keep.
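
A cumulative-variance threshold is easy to apply in code; the digits dataset and the 95% cutoff below are illustrative choices, not a universal rule.

```python
# Choosing the number of components from cumulative explained variance
# (scikit-learn sketch; the 95% threshold is an illustrative choice).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
pca = PCA().fit(X)

cumvar = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.searchsorted(cumvar, 0.95)) + 1
print("components needed for 95% of the variance:", n_keep)
```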

Training the Bayesian Classifier

A Bayesian classifier applies the rule of Bayes to compute how likely each class is for a given input based on class priors and likelihoods. Training involves estimating the prior probability of each class from the data and modeling how likely each feature is under each class assumption. Under ideal conditions with correct assumptions, the Bayesian classifier achieves theoretically minimal error, though real data or naive independence assumptions may reduce its practical performance.
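
With Gaussian class-conditional likelihoods, training and prediction look like the following scikit-learn sketch; the iris dataset and the default split are illustrative.

```python
# Gaussian naive Bayes classifier (scikit-learn sketch on the iris data).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fitting estimates class priors and per-class feature means and variances;
# prediction picks the class with the highest posterior probability.
nb = GaussianNB().fit(X_train, y_train)
print(nb.class_prior_)               # estimated priors
print(nb.score(X_test, y_test))      # held-out accuracy
```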

Creating Decision Trees

A decision tree is grown by splitting data at nodes using the feature that yields the greatest improvement in class purity, based on measures like information gain or Gini impurity. Each split results in branches, and the process continues until stopping criteria such as maximum depth or minimum node size are met. Pruning can be used afterward to control overfitting.
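
A short scikit-learn sketch of tree induction follows; the entropy criterion, depth limit, and minimum leaf size are illustrative settings for controlling growth.

```python
# Decision tree induction (scikit-learn sketch; the limits are illustrative).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Splits are chosen by information gain (entropy criterion); limiting depth
# and minimum leaf size are simple ways to restrain overfitting.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                              min_samples_leaf=5, random_state=0).fit(X, y)
print(export_text(tree))   # text view of the learned splits
```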

Cross-Validation vs. Bootstrap Methods

Cross-validation repeatedly splits the dataset into training and validation sets to get a more stable performance estimate, as with methods like k-fold or leave-one-out. The bootstrap method repeatedly samples from the dataset with replacement to form training sets and assesses performance on the leftover out-of-bag instances. Cross-validation partitions data without replacement, while bootstrapping involves sampling with replacement to estimate variability in model estimates.
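
Both estimates can be produced side by side; the sketch below uses 5-fold cross-validation and 20 bootstrap resamples with out-of-bag scoring, where the dataset, model, and repetition counts are illustrative.

```python
# Cross-validation and a bootstrap out-of-bag estimate (illustrative sketch).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# k-fold cross-validation: partitions the data without replacement.
print("5-fold CV accuracy:", cross_val_score(model, X, y, cv=5).mean())

# Bootstrap: sample with replacement, evaluate on the out-of-bag points.
rng = np.random.default_rng(0)
scores = []
for _ in range(20):
    idx = rng.integers(0, len(X), size=len(X))
    oob = np.setdiff1d(np.arange(len(X)), idx)
    scores.append(model.fit(X[idx], y[idx]).score(X[oob], y[oob]))
print("bootstrap OOB accuracy:", np.mean(scores))
```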

Convolutional Neural Networks (CNN)

Convolutional networks are specialized models commonly used for images or similar grid-like data. They apply small learnable filters that slide across the spatial arrangement of the input to extract local features, followed by pooling layers to reduce spatial resolution. These learned features feed into fully connected layers for classification or regression, and the entire model is trained by backpropagation.
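
A minimal PyTorch sketch shows the usual convolution, pooling, and fully connected stages; the layer sizes, the 28x28 grayscale input, and the ten output classes are illustrative assumptions.

```python
# Minimal convolutional network (PyTorch sketch; layer sizes are illustrative).
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        # Convolutional filters slide over the image to extract local features;
        # pooling reduces spatial resolution between stages.
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # A fully connected head turns the feature maps into class scores.
        self.classifier = nn.Linear(32 * 7 * 7, n_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

# One forward/backward step on a dummy batch of 28x28 grayscale images.
model = SmallCNN()
images = torch.randn(8, 1, 28, 28)
labels = torch.randint(0, 10, (8,))
loss = nn.CrossEntropyLoss()(model(images), labels)
loss.backward()   # gradients for all filters and weights via backpropagation
print(loss.item())
```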

Hierarchical Clustering Methods

Hierarchical clustering builds a hierarchy of merges or splits among data points. In agglomerative clustering, each point starts in its own group, and the closest groups are merged step by step until all points form one group, while divisive clustering does the opposite. Linkage strategies such as single, complete, or average define how distances between clusters are measured, and the result is visualized in a dendrogram.
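
The agglomerative variant is available in SciPy; the sketch below builds an average-linkage hierarchy on two synthetic blobs and cuts it into two flat clusters, all of which are illustrative choices.

```python
# Agglomerative clustering with SciPy (sketch; linkage choice is illustrative).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(4, 0.5, (30, 2))])

# Average linkage: cluster distance = mean pairwise distance between clusters.
Z = linkage(X, method="average")

# Cut the dendrogram to obtain a flat clustering with two groups.
labels = fcluster(Z, t=2, criterion="maxclust")
print(np.bincount(labels))   # sizes of the resulting clusters
```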

Confusion Matrix

A confusion matrix is a table that lays out actual class labels against predicted class labels for classification. It helps measure metrics like accuracy, precision, and recall by examining counts of correct and incorrect classifications.
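
For example, with a small set of toy labels, the matrix and the derived metrics can be computed as follows; the label vectors are made up for illustration.

```python
# Confusion matrix and derived metrics (scikit-learn sketch, toy labels).
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))        # rows: actual, columns: predicted
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
```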

Receiver Operating Characteristic (ROC) Curve

A receiver operating characteristic curve plots the true positive rate against the false positive rate as the classification threshold shifts. It helps analyze trade-offs between sensitivity and specificity, and its area under the curve value is an overall measure of model discriminative capability.
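
A ROC curve is computed from predicted scores rather than hard labels; in the sketch below the dataset and classifier are illustrative stand-ins.

```python
# ROC curve and AUC from predicted scores (scikit-learn sketch).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = LogisticRegression(max_iter=5000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)   # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_te, scores))
```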

Methods to Select Training and Test Sets

Commonly, data is split randomly or by stratified random methods to maintain class proportions. Cross-validation creates multiple splits to provide more reliable estimates of model performance.

PCA vs. ICA: Key Differences

Principal component analysis focuses on capturing maximum variance along orthogonal directions, which are uncorrelated but not necessarily independent. Independent component analysis attempts to uncover statistically independent signals that need not be orthogonal.

PCA vs. LDA: Key Differences

Principal component analysis is unsupervised and targets directions of greatest data variance, whereas linear discriminant analysis is supervised and aims to maximize class separation based on labeled data. LDA looks for directions that best separate known classes.

Feature Normalization Techniques

Features are often normalized by subtracting the mean and dividing by the standard deviation (standardization) or by rescaling them to a fixed range, such as [0, 1], using each feature's minimum and maximum (min-max scaling). These scalings help align features that differ widely in scale or units.
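
Both scalings are one-liners in scikit-learn; the toy matrix below, with one small-scale and one large-scale feature, is an illustrative example.

```python
# Standardization and min-max scaling (scikit-learn sketch, toy matrix).
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 1000.0]])

# Zero mean, unit standard deviation per feature.
print(StandardScaler().fit_transform(X))
# Rescale each feature to the [0, 1] range using its min and max.
print(MinMaxScaler().fit_transform(X))
```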

Error Function and Estimation

The error function measures how far the predictions deviate from the true labels; common examples include cross-entropy loss and the misclassification count. Error is typically estimated by applying the model to a held-out set or through a cross-validation approach.

Filter vs. Wrapper Methods: Key Differences

Filter methods rank or score features independently of any model, while wrapper methods evaluate subsets of features using a chosen predictive model. Filters are faster and less specific, whereas wrappers tend to be more computationally heavy but more tailored to the actual model performance.

Classification Using the Bayesian Classifier

A Bayesian classifier applies the rule of Bayes to each class based on how likely the feature values are under that class and the overall chance of that class. The prediction is the class with the greatest posterior probability.

Classification Using a Linear Classifier

A linear classifier combines features in a linear form to assign a label by deciding which side of a separating boundary the data belongs to. It is trained by adjusting weights to separate the classes in the best possible way under a chosen loss function.

Classification Using Decision Trees

Decision trees split data by choosing a feature and threshold that yields the most homogeneous child nodes. Each branch then continues splitting until pure leaves or until stopping conditions are reached, resulting in a flowchart-like classification process.

Singular Value Decomposition (SVD)

Singular value decomposition factors a matrix into two matrices with orthonormal columns and a diagonal matrix of non-negative singular values. It is used for dimensionality reduction, noise suppression, and low-rank approximations in various data processing tasks.
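
A rank-k approximation keeps only the largest singular values, as in the NumPy sketch below; the random matrix and the rank of 2 are illustrative.

```python
# Low-rank approximation with the SVD (NumPy sketch; the rank is illustrative).
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values for a rank-k approximation.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.linalg.norm(A - A_k))   # approximation error in the Frobenius norm
```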

Measures Used to Evaluate Features

Measures include correlation with class labels, mutual information, and statistical tests for relevance. These measures guide which features contribute significantly to the prediction goal.

Entropy and Mutual Information

Entropy indicates the degree of uncertainty in a variable, while mutual information shows how much knowing one variable decreases uncertainty about another. They are used for feature selection to discover which features provide the most information about the target.
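
Both quantities can be computed from a joint probability table; the binary distribution below is an illustrative example, and the identity I(X;Y) = H(X) + H(Y) - H(X,Y) is used.

```python
# Entropy and mutual information from a joint distribution (NumPy sketch).
import numpy as np

# Illustrative joint probability table P(X, Y) for two binary variables.
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px = pxy.sum(axis=1)   # marginal of X
py = pxy.sum(axis=0)   # marginal of Y

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# I(X; Y) = H(X) + H(Y) - H(X, Y): how much Y reduces uncertainty about X.
mi = entropy(px) + entropy(py) - entropy(pxy.ravel())
print("H(X) =", entropy(px), " I(X;Y) =", mi)
```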

Statistical Independence and Consistency

Two variables are statistically independent if knowledge of one gives no information about the other, and consistency refers to an estimator converging to the true parameter value as the amount of data grows. Both concepts help determine whether features are redundant or estimates are reliable.

Common Clustering Methods

Common approaches include partition-based methods such as k-means, hierarchical methods that build clusters via merges or splits, density-based methods like DBSCAN, and model-based methods such as Gaussian mixture modeling. Each type suits different data shapes and cluster assumptions.
