Machine Learning Algorithms: Comparison and Best Practices
Supervised Classification
Logistic Regression (LR)
- Type: Classification (binary by default; extends to multiclass via one-vs-rest or softmax)
- Scaling: YES (StandardScaler)
- Outliers: NOT robust
- Categorical Variables: NO (encode first)
- Idea: Sigmoid function → probability 0–1 → if ≥ 0.5 → class 1
- Advantages: Fast, simple, interpretable, outputs probabilities
- Disadvantages: Assumes a linear decision boundary, so it underfits non-linear data unless features are transformed
- Metrics: Accuracy, Precision, Recall, F1, Confusion Matrix
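The sigmoid-then-threshold idea above can be sketched in a few lines. This is a minimal illustration, not a fitted model: the `weights` and `bias` values are hypothetical, chosen only to show the mechanics, and `predict` is a helper name introduced here.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(features, weights, bias, threshold=0.5):
    """Return (probability of class 1, predicted class) for one sample."""
    # Linear score: bias + w1*x1 + w2*x2 + ...
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return p, int(p >= threshold)

# Hypothetical coefficients, as if already learned on scaled features.
weights = [1.2, -0.7]
bias = -0.3

prob, label = predict([1.0, 0.5], weights, bias)
print(f"P(class 1) = {prob:.3f}, predicted class = {label}")
```

Raising `threshold` above 0.5 trades recall for precision, which is why the probability output is often more useful than the hard label.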
Decision Trees (DT)
- Type: Classification + Regression
- Scaling: NO (never needs it)
- Outliers: Robust
- Categorical Variables: YES in principle (some libraries, e.g. scikit-learn, still need them encoded as numbers)
- Idea: IF-ELSE splits by feature → leaf = final prediction
- Advantages: Interpretable, no scaling, handles any data type, fast
- Disadvantages: Overfits easily, sensitive to small changes
- Metrics: Accuracy, Confusion Matrix (Gini impurity is the split criterion, not an evaluation metric)
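To make the Gini criterion concrete, the sketch below computes the impurity of a node and the weighted impurity of a candidate split. The labels are made-up toy data and `split_gini` is a helper name introduced for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions (0 = pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Toy node with a 50/50 class mix.
parent = ["yes", "yes", "yes", "no", "no", "no"]

# A split that separates the classes perfectly vs. one that barely helps.
pure_split = split_gini(["yes", "yes", "yes"], ["no", "no", "no"])
mixed_split = split_gini(["yes", "no", "yes"], ["no", "yes", "no"])

print(gini(parent), pure_split, mixed_split)
```

A tree grower would evaluate every candidate split this way and keep the one with the lowest weighted impurity.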
Random Forest (RF)
- Type: Classification + Regression
- Scaling: NO
- Outliers: Robust
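- Categorical Variables: YES in principle (library-dependent; often still need encoding)
- Idea: Many trees trained on bootstrap samples (BAGGING) → majority vote (classification) or average (regression) → final prediction
- Advantages: Reduces overfitting, stable, gives feature importance, can handle missing data (implementation-dependent)
- Disadvantages: Slower than a single tree, harder to interpret, many hyperparameters
- Metrics: Accuracy, OOB error (feature importance is a diagnostic, not a score)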
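The bagging and voting steps can be sketched without training any trees. In this rough illustration the per-tree predictions in `tree_votes` are hypothetical, and the helper names are introduced here; the point is only to show where the bootstrap sample, the out-of-bag (OOB) rows, and the majority vote come from.

```python
import random
from collections import Counter

def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement: the 'bag' for one tree."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def out_of_bag(n_rows, bag):
    """Rows never drawn into the bag; they can score that tree (OOB error)."""
    drawn = set(bag)
    return [i for i in range(n_rows) if i not in drawn]

def majority_vote(votes):
    """Most common class among the trees' predictions."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_rows = 10
bag = bootstrap_indices(n_rows, rng)
oob = out_of_bag(n_rows, bag)
print("bag:", sorted(bag), "out-of-bag:", oob)

# Hypothetical predictions from five trees for a single sample.
tree_votes = ["spam", "ham", "spam", "spam", "ham"]
final = majority_vote(tree_votes)
print("forest prediction:", final)
```

For regression, the vote would be replaced by the mean of the trees' numeric predictions.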
Supervised Regression
Linear Regression
- Type: Regression (continuous Y only)
- Scaling: Optional for plain least squares; recommended with regularization (Ridge, Lasso) or gradient descent
- Outliers: NOT robust (outliers distort the fitted line)
- Categorical Variables: NO (encode first)
- Idea: y = b0 + b1·x1 + b2·x2 + ... → predicts a continuous number
- Advantages: Simple, fast, interpretable, coefficients show feature impact
- Disadvantages: Assumes linearity, underfits complex patterns, sensitive to outliers
- Metrics: MAE, MSE, RMSE, R²
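The four metrics follow directly from the prediction errors. The sketch below uses hypothetical coefficients (`b0`, `coefs`) and made-up targets to show how each number is computed; `regression_metrics` is a helper name introduced here.

```python
import math

def predict(x, b0, coefs):
    """y = b0 + b1*x1 + b2*x2 + ..."""
    return b0 + sum(b * xi for b, xi in zip(coefs, x))

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 for paired true/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)               # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)   # total variation
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse),
            "R2": 1.0 - ss_res / ss_tot}

# Hypothetical fitted model and toy data with a single feature.
b0, coefs = 1.0, [2.0]
X = [[1.0], [2.0], [3.0], [4.0]]
y_true = [3.0, 5.5, 6.5, 9.0]

y_pred = [predict(x, b0, coefs) for x in X]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

MSE and RMSE square the errors, so they penalize large misses (and outliers) more heavily than MAE does.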
K-Nearest Neighbors (KNN)
- Type: Classification + Regression
- Scaling: YES (distance-based — MUST scale)
- Outliers: NOT robust
- Categorical Variables: NO
- Idea: Classification: majority vote of K neighbors; Regression: average of K neighbors
- Advantages: Simple, no training phase, works with small data
- Disadvantages: Slow on large data, sensitive to outliers
- Metrics: Accuracy, F1 (class); MAE, MSE, R² (regression)
- Note: K too low = overfit/noisy; K too high = underfit. K = 5 is a common starting point; tune it with cross-validation.
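A small sketch of KNN classification shows why scaling matters: without it, the feature with the larger range dominates the distance. The `[age, income]` rows are made-up toy data, and `fit_scaler`, `scale`, and `knn_classify` are helper names introduced for illustration.

```python
import math
from collections import Counter

def fit_scaler(rows):
    """Per-column mean and standard deviation (like a standard scaler)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return means, stds

def scale(row, means, stds):
    return [(v - m) / s for v, m, s in zip(row, means, stds)]

def knn_classify(train_X, train_y, query, k=5):
    """Majority vote among the k training points closest to the query."""
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy data: [age, income]; income would dominate the distance if unscaled.
raw_X = [[25, 30000], [30, 32000], [28, 31000],
         [55, 90000], [60, 95000], [58, 88000]]
y = ["A", "A", "A", "B", "B", "B"]

means, stds = fit_scaler(raw_X)
X = [scale(row, means, stds) for row in raw_X]
query = scale([27, 31500], means, stds)  # scale the query with the same stats

prediction = knn_classify(X, y, query, k=3)
print("predicted class:", prediction)
```

For regression, the vote would be replaced by the average target value of the k neighbors.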
Unsupervised Learning
K-Means (KM)
- Type: Clustering (NO target variable)
- Scaling: YES (distance-based)
- Outliers: NOT robust (shift centroids)
- Categorical Variables: NO
- Idea: K centroids → assign each point to nearest → update centroids → repeat until stable
- Inertia: Sum of squared distances from each point to its nearest centroid (lower = more compact); used in the Elbow method
- Elbow Method: Plot inertia vs K → pick K where curve bends
- Advantages: Fast, simple, scalable
- Disadvantages: Must set K, assumes spherical clusters, random initialization leads to different results
- Metrics: Inertia, Silhouette score
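The assign-update loop and the inertia used by the Elbow method can be sketched as follows. This is a bare-bones version with plain random initialization and toy 2-D points; the function name `kmeans` and its parameters are introduced here for illustration.

```python
import math
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Basic K-Means: returns (centroids, inertia)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # stable, so stop early
            break
        centroids = new_centroids
    inertia = sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)
    return centroids, inertia

# Toy data: two well-separated blobs.
points = [[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
          [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]]

# Elbow method: compute inertia for several K and look for the bend.
inertia_by_k = {k: kmeans(points, k)[1] for k in (1, 2, 3)}
for k, value in inertia_by_k.items():
    print(f"K={k}: inertia={value:.2f}")
```

On this data the inertia drops sharply from K=1 to K=2 and only slightly afterwards, which is the bend the Elbow method looks for. Changing `seed` changes the starting centroids, which is the source of run-to-run variation on harder data.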
Hierarchical Clustering (HC)
- Type: Clustering (NO target variable)
- Scaling: YES (distance-based)
- Outliers: Depends on linkage
- Categorical Variables: NO
- Idea: Agglomerative: start with N single-point clusters → repeatedly merge the closest pair → dendrogram shows the merge history
- Dendrogram: Tree diagram; cut horizontally → vertical lines crossed = number of clusters
- Advantages: No K needed in advance, deterministic, shows full merge history
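- Linkage Types:
  - Single: Distance between the two nearest points; prone to chaining, useful for spotting outliers
  - Complete: Distance between the two farthest points; avoids chains
  - Average: Mean of all pairwise distances between the two clusters (not centroid distance); less sensitive to outliers
  - Ward: Merges the pair that least increases within-cluster variance; a good general choice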
- Disadvantages: Slow on large data, linkage choice matters significantly
- Metrics: Dendrogram, Silhouette score, Elbow (distortion)
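The linkage types differ only in how they measure the distance between two clusters. The sketch below computes each one for two toy clusters; the points are made up, and the Ward value uses the standard merge-cost formula (increase in within-cluster sum of squares).

```python
import math
from itertools import product

def pairwise(a, b):
    """All point-to-point distances between cluster a and cluster b."""
    return [math.dist(p, q) for p, q in product(a, b)]

def single_linkage(a, b):
    return min(pairwise(a, b))        # nearest pair

def complete_linkage(a, b):
    return max(pairwise(a, b))        # farthest pair

def average_linkage(a, b):
    d = pairwise(a, b)
    return sum(d) / len(d)            # mean of all pairs

def centroid(cluster):
    return [sum(dim) / len(cluster) for dim in zip(*cluster)]

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if a and b are merged."""
    gap = math.dist(centroid(a), centroid(b))
    return len(a) * len(b) / (len(a) + len(b)) * gap ** 2

cluster_a = [[0.0, 0.0], [1.0, 0.0]]
cluster_b = [[4.0, 0.0], [6.0, 0.0]]

for name, fn in [("single", single_linkage), ("complete", complete_linkage),
                 ("average", average_linkage), ("ward", ward_cost)]:
    print(f"{name}: {fn(cluster_a, cluster_b)}")
```

At each agglomerative step, the pair of clusters with the smallest value under the chosen linkage is merged.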
Association Rules (APr)
- Note: High Confidence + Lift ≈ 1 means the consequent is equally common with or without the antecedent, rendering the rule useless.
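The note above is easy to verify numerically. In this sketch the transactions are made-up toy baskets, constructed so that the rule milk → bread has high confidence but a lift of about 1; `support` and `rule_metrics` are helper names introduced for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    sup_a = support(transactions, antecedent)
    sup_c = support(transactions, consequent)
    sup_both = support(transactions, set(antecedent) | set(consequent))
    confidence = sup_both / sup_a   # P(consequent | antecedent)
    lift = confidence / sup_c       # confidence relative to the baseline rate
    return {"support": sup_both, "confidence": confidence, "lift": lift}

# Toy baskets: bread is in 8 of 10 baskets regardless of milk.
transactions = [
    {"milk", "bread"}, {"milk", "bread"}, {"milk", "bread"}, {"milk", "bread"},
    {"milk"},
    {"bread"}, {"bread"}, {"bread"}, {"bread"},
    {"eggs"},
]

rule = rule_metrics(transactions, {"milk"}, {"bread"})
print(rule)
```

Here confidence is 0.8, which looks strong, but bread already appears in 80% of all baskets, so the lift is about 1 and milk tells us nothing new about bread.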