Mastering Topic Modeling, Anomaly Detection, and PageRank

Posted by Anonymous and classified in Computers

Written on in English with a size of 12.48 KB

Topic Modeling: LSI vs LDA

Latent Semantic Indexing (LSI)

Use LSI for semantic similarity, retrieval, short/sparse documents, and synonymy problems. It uses Singular Value Decomposition (SVD) to find latent concept axes from term-document co-occurrence. Concepts are mathematical directions, not clean probability-based topics. It is better when the goal is to “find similar documents.”

Latent Dirichlet Allocation (LDA)

Use LDA for discovering hidden themes and topic percentages. In this model, each document is a mixture of topics, and each topic is a distribution over words. It is better when the goal is to determine “what themes exist in this corpus?”

Key Distinction

  • LSI finds latent concept dimensions.
  • LDA explicitly models probabilistic topics.

The LDA Process

  1. Pick the number of topics K.
  2. Randomly assign each word to a topic.
  3. Reassign each word by checking:
    • How common the topic is in that document.
    • How common the word is in that topic overall.
  4. Repeat until topics stabilize enough to interpret.

Gibbs Sampling Idea

P(topic k | word, document) is proportional to:
(topic count in document + alpha) × (word count in topic + beta)

Hyperparameters: Alpha and Beta

Alpha:

  • Controls how many topics appear in each document.
  • High alpha: Documents mix many topics.
  • Low alpha: Documents are dominated by fewer topics.

Beta:

  • Controls how many words appear in each topic.
  • High beta: Topics are broad and use many words evenly.
  • Low beta: Topics are sharp and focused on fewer strong words.

Corpus Strategy

If you have no clue about the corpus:

  • Do not guess high or low.
  • Start with default/neutral priors.
  • Inspect the output.
  • Tune based on identified problems.

Parameter Diagnosis

  • Documents evenly split across many topics: Alpha may be too high.
  • Documents forced into one topic even when mixed: Alpha may be too low.
  • Topics vague and share many words: Beta may be too high or preprocessing is weak.
  • Topics too narrow/repetitive: Beta may be too low.

Major LDA Traps

  • Topics may not match human categories.
  • Short documents give weak evidence.
  • Random initialization can change results.
  • Corpus-level patterns can conflict with document-level meaning.
  • Bag-of-words loses word order.
  • Choosing K is difficult.

Anomaly Detection

Defining Anomalies

  • An anomaly is a point that is unusual relative to its context.
  • An anomaly does not automatically mean fraud, error, or bad behavior.

KNN Anomaly Detection

The scoring methods include:

  • score(x) = average distance to k nearest neighbors.
  • score(x) = distance to the k-th nearest neighbor.

Interpretation

  • Large score: Isolated, likely an anomaly.
  • Small score: Dense region, likely normal.

Effects of K

  • Small k: Local, sensitive to noise.
  • Large k: Broader and more stable, but may miss local anomalies.

KNN Traps

  • Must scale features.
  • Large numeric ranges dominate Euclidean distance.
  • High dimensions make distance less useful.
  • Irrelevant features hurt performance.

DBSCAN: Density-Based Clustering

  • eps: The radius around a point.
  • MinPts: Minimum neighbors needed inside eps.
  • Core point: At least MinPts neighbors within eps.
  • Border point: Not enough neighbors itself, but inside eps of a core point.
  • Noise/outlier: Neither a core nor a border point.

DBSCAN Parameter Diagnosis

  • Almost everything is noise: eps is too small, MinPts is too high, or data is high-dimensional and sparse.
  • Clusters merge: eps is too large.
  • Random noise becomes clusters: MinPts is too low.
  • Small real cluster disappears: MinPts is too high or there is a density mismatch.
  • Border point assigned oddly: Border points can be ambiguous near multiple core regions.

Choosing Eps

  • Use a k-distance graph.
  • Set k = MinPts - 1.
  • X-axis: Sorted points/ranks (not a real feature).
  • Y-axis: Distance to the k-th nearest neighbor.
  • Choose eps at the “elbow” or “knee” of the graph.

K-Distance Graph Interpretation

  • Flat left region: Dense cluster points.
  • Knee: Transition from dense points to sparse/noise points.
  • Steep right region: Noise/outliers.

The High-Dimensional Trap

  • In high dimensions, points tend to be farther apart.
  • Distances also become less contrastive.
  • Nearest and farthest points can both look similarly far.
  • Consequently, eps becomes hard to choose.
  • Raising eps alone can create fake or merged clusters.

MinPts Guidelines

  • 2-3 dimensions: About 4–6.
  • 5-10 dimensions: About 2 × D.
  • More than 10 dimensions: About 3 × D.

K-means vs DBSCAN

  • K-means is centroid-based; DBSCAN is density-based.
  • K-means forces every point into a cluster; DBSCAN can label points as noise.
  • K-means works best for compact, round-ish clusters; DBSCAN can find irregular shapes.
  • K-means does not naturally detect outliers and can change depending on random centroid initialization.

When to Use K-means

  • You need exactly K groups.
  • Clusters are roughly compact and center-based.
  • K is given by the application (e.g., 3 customer segments).

When to Use DBSCAN

  • You do not know the number of clusters.
  • Clusters may have irregular shapes.
  • You want sparse points labeled as noise/outliers.
  • Dense regions are separated by sparse regions.

Link Analysis and PageRank

Fundamentals of Link Analysis

  • Nodes: Pages, users, or papers.
  • Edges: Links, follows, or citations.
  • A link is treated as an endorsement or quality signal.

Simple In-Degree

score(page) = number of incoming links

Problem: This is easy to manipulate with many fake, low-quality pages.

PageRank Improvement

  • Looks at both the quantity and quality of incoming links.
  • A link from an important page matters more.
  • If an important page links to many pages, each target receives less “rank.”

Simplified PageRank Formula

PR(u) = Σ [PR(v) / |out(v)|] for all v in in(u).

In plain English: The new rank of page u equals the sum of rank shares from pages linking to u. Each linking page splits its rank across all its outgoing links.

Modified PageRank Formula

PR(p_i) = (1-d)/N + d × (incoming-link-share + sink-share)

Expanded:
PR(p_i) = (1-d)/N + d × (Σ [PR(p_j) / |out(p_j)|] + Σ [PR(p_k) / N])

Variables Defined

  • N: Number of pages.
  • d: Damping factor (usually 0.85).
  • 1-d: Teleport probability.
  • (1-d)/N: Teleportation share every page gets.
  • in(p_i): Pages linking to p_i.
  • |out(p_j)|: Number of outgoing links from linking page p_j.
  • D: Sink/dangling pages with no outgoing links.
  • sink-share: Total sink PageRank redistributed evenly to all N pages.

Formula in words: New PageRank = guaranteed teleportation + (damping × rank from incoming links) + redistributed sink rank.

PageRank Iteration Checklist

  1. Initialize every page: PR0 = 1/N.
  2. Use previous iteration values to compute the next iteration.
  3. For incoming links, add the previous PR of the linking page divided by its number of outlinks.
  4. Identify sink nodes.
  5. Add the sink total divided by N to every page inside the parentheses.
  6. Do not divide by out(A)=0 for a sink.
  7. Check that all PageRanks sum to 1.

Damping and Sinks

  • Damping: With probability d, a surfer follows a link; with probability 1-d, they teleport to any page. This prevents rank from getting trapped.
  • Sink Node: A page with no outgoing links. Rank flows in but cannot flow out. Fix: Redistribute sink rank evenly to all pages, or treat the sink as linking to all pages.

PageRank Traps

  • Fake incoming links may not help if those pages have low rank.
  • One link from an important page can matter significantly.
  • A high-rank page with many outgoing links gives less to each target page.
  • A page with no incoming links can still get a non-zero rank due to teleportation.
  • PageRank is query-independent; it measures authority, not relevance.
  • Search ranking combines both relevance and authority.

High-Risk Exam Traps

  • LSI vs LDA: LSI finds concept axes for similarity; LDA finds probabilistic topics for interpretation.
  • Alpha/Beta: These are controls/priors, not outputs. Without context, use defaults and tune after inspection.
  • Apple Ambiguity: If “apple” usually appears in tech documents, corpus-level evidence may push the model toward tech even when a specific document refers to fruit.
  • Bag-of-Words: “Dog bites man” and “man bites dog” look identical because word order is ignored.
  • Short Documents: LDA has weak evidence on short documents. Consider LSI, embeddings, aggregation, or fewer topics.
  • KNN Anomaly: Being far from neighbors means a point is unusual, but unusual does not automatically mean fraud.
  • DBSCAN High Dimensions: Points become farther apart and distances become similarly large, making eps hard to choose.
  • Raising Eps: This can reduce noise but can also merge unrelated points into fake clusters.
  • K-means: Centroid-based, not density-based. It forces every point into a cluster and has no natural outlier label.
  • Border Points: DBSCAN ambiguity usually affects border points located near multiple core regions.
  • PageRank: Importance is not the same as relevance. Link value depends on the linking page's rank and its number of outgoing links.

Summary of Use Cases

  • LDA: Use when labels are missing and you want topic proportions, but watch for short documents and interpretability issues.
  • LSI: Use when the goal is semantic similarity, retrieval, or handling short, sparse text.
  • DBSCAN: Use when dense normal groups are separated by sparse regions and outliers should remain as noise.
  • KNN Anomaly: Use when isolated points should receive an anomaly score and features can be scaled carefully.
  • PageRank: Remember that a page is not important just because many pages link to it; it matters who links to it and how many outgoing links they have.

Related entries: