Document Similarity Metrics: Jaccard, Cosine, and Hamming Calculations
Document Similarity Metrics: Jaccard, Cosine, and Hamming Calculations (Q99)
Documents (after lowercasing, tokenizing into words, and removing punctuation):
- D1: “the night is dark and the moon is red”
- D2: “the moon in the night is red”
- D3: “i can see moon is red the night is dark”
Step A — Construct Word Sets / Vectors (Unified Vocabulary)
Vocabulary (unique words across D1–D3): {the, night, is, dark, and, moon, red, in, i, can, see} (11 words)
i) Jaccard Similarity (Set of Words)
Set(D1) = {the, night, is, dark, and, moon, red}
Set(D2) = {the, moon, in, night, is, red}
Set(D3) = {i, can, see, moon, is, red, the, night, dark}
Compute pairwise Jaccard:
- D1 ∩ D2 = {the, night, is, moon, red} → size 5; D1 ∪ D2 size = 8 → J(D1,D2)=5/8=0.625
- D1 ∩ D3 = {the, night, is, dark, moon, red} → size 6; union size = 10 → J(D1,D3)=6/10=0.6
- D2 ∩ D3 = {the, moon, is, red, night} → size 5; union size = 10 → J(D2,D3)=5/10=0.5
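The three pairwise Jaccard values can be checked with a few lines of Python (a minimal sketch; tokenization is a plain whitespace split after lowercasing):

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

docs = {
    "D1": "the night is dark and the moon is red",
    "D2": "the moon in the night is red",
    "D3": "i can see moon is red the night is dark",
}
sets = {name: set(text.split()) for name, text in docs.items()}

print(jaccard(sets["D1"], sets["D2"]))  # 5/8  = 0.625
print(jaccard(sets["D1"], sets["D3"]))  # 6/10 = 0.6
print(jaccard(sets["D2"], sets["D3"]))  # 5/10 = 0.5
```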
ii) Cosine Similarity (Binary Presence Vectors)
Use binary presence vectors (1 if the term occurs, 0 otherwise); note that raw term frequencies would differ, since "the" occurs twice in D1 and D2 and "is" twice in D1 and D3. Order: the, night, is, dark, and, moon, red, in, i, can, see:
- D1 = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0] (Norm √7)
- D2 = [1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0] (Norm √6)
- D3 = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1] (Norm √9)
Compute cosine (dot product / product of norms):
- Cos(D1,D2) = (1+1+1+0+0+1+1) / (√7 × √6) = 5 / (2.6458 × 2.4495) ≈ 0.7715
- Cos(D1,D3) = (1+1+1+1+0+1+1) / (√7 × √9) = 6 / (2.6458 × 3) ≈ 0.7559
- Cos(D2,D3) = (1+1+1+0+0+1+1) / (√6 × √9) = 5 / (2.4495 × 3) ≈ 0.6804
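These values can be verified with a short Python sketch (names like `cosine` and `vecs` are illustrative):

```python
import math

vocab = ["the", "night", "is", "dark", "and", "moon", "red", "in", "i", "can", "see"]
docs = {
    "D1": "the night is dark and the moon is red",
    "D2": "the moon in the night is red",
    "D3": "i can see moon is red the night is dark",
}
# Binary presence vectors in vocabulary order.
vecs = {n: [1 if w in t.split() else 0 for w in vocab] for n, t in docs.items()}

def cosine(u, v):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

print(round(cosine(vecs["D1"], vecs["D2"]), 4))  # 0.7715
print(round(cosine(vecs["D1"], vecs["D3"]), 4))  # 0.7559
print(round(cosine(vecs["D2"], vecs["D3"]), 4))  # 0.6804
```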
iii) Hamming Distance (Binary Presence Vectors)
Use presence (1)/absence (0) for the 11 vocabulary terms:
- D1: [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
- D2: [1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0]
- D3: [1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
Hamming distances (count differing positions out of 11):
- H(D1,D2): Differ at positions corresponding to dark (1 vs 0), and (1 vs 0), in (0 vs 1) → H = 3.
- H(D1,D3): Differ at positions corresponding to and (1 vs 0), i (0 vs 1), can (0 vs 1), see (0 vs 1) → H = 4.
- H(D2,D3): Differ at positions corresponding to dark (0 vs 1), in (1 vs 0), i (0 vs 1), can (0 vs 1), see (0 vs 1) → H = 5.
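The three Hamming distances follow directly by counting differing positions (a minimal sketch):

```python
def hamming(u, v):
    """Number of positions where two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))

d1 = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
d2 = [1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0]
d3 = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1]

print(hamming(d1, d2), hamming(d1, d3), hamming(d2, d3))  # 3 4 5
```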
Summary of Results
- Jaccard: D1–D2 = 0.625 (highest), D1–D3 = 0.600, D2–D3 = 0.500.
- Cosine (Binary): D1–D2 ≈ 0.772, D1–D3 ≈ 0.756, D2–D3 ≈ 0.680.
- Hamming (Binary): D1–D2 = 3, D1–D3 = 4, D2–D3 = 5.
K-Means Clustering Analysis (Q102)
Given points: A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9).
Initial centroids: C₁ = A1(2,10), C₂ = B1(5,8), C₃ = C1(1,2).
Iteration 1: Assignment
- Cluster C₁ (2,10): members → A1
- Cluster C₂ (5,8): members → A3, B1, B2, B3, C2 (each point is nearest to this centroid)
- Cluster C₃ (1,2): members → A2, C1
Iteration 1: Recompute Centroids
- C₁ = mean of {A1} = (2.0, 10.0) (Unchanged)
- C₂ = mean of {A3(8,4), B1(5,8), B2(7,5), B3(6,4), C2(4,9)} = (6.0, 6.0)
- C₃ = mean of {A2(2,5), C1(1,2)} = (1.5, 3.5)
Iteration 2: Reassignment
- C₁ (2.0,10.0): members → A1, C2
- C₂ (6.0,6.0): members → A3, B1, B2, B3
- C₃ (1.5,3.5): members → A2, C1
Iteration 2: Recompute Centroids
- C₁ = mean(A1(2,10), C2(4,9)) = (3.0, 9.5)
- C₂ = mean(A3,B1,B2,B3) = (6.5, 5.25)
- C₃ = (1.5, 3.5) (Unchanged)
Iteration 3: Final Stable Clusters (Convergence)
Iteration 3 produces the final assignment, and Iteration 4 confirms the centroids no longer move. The final stable clusters are:
- Cluster 1 (Centroid ≈ (3.667, 9.0)): {A1(2,10), B1(5,8), C2(4,9)}
- Cluster 2 (Centroid ≈ (7.0, 4.333)): {A3(8,4), B2(7,5), B3(6,4)}
- Cluster 3 (Centroid = (1.5,3.5)): {A2(2,5), C1(1,2)}
Note: showing the Euclidean distance calculation for each assignment step is required for full marks.
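The iterations above can be reproduced with a short Python sketch (not part of the original question; plain k-means with Euclidean distance, ties broken by centroid order):

```python
import math

points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "B1": (5, 8),
          "B2": (7, 5), "B3": (6, 4), "C1": (1, 2), "C2": (4, 9)}
centroids = [(2, 10), (5, 8), (1, 2)]  # initial centroids: A1, B1, C1

for _ in range(10):  # more than enough iterations to converge
    # Assign: each point goes to its nearest centroid.
    clusters = [[] for _ in centroids]
    for name, p in points.items():
        nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(name)
    # Update: each centroid becomes the mean of its assigned points.
    new = [tuple(sum(points[n][d] for n in cl) / len(cl) for d in (0, 1))
           for cl in clusters]
    if new == centroids:  # converged: no centroid moved
        break
    centroids = new

print(clusters)   # [['A1', 'B1', 'C2'], ['A3', 'B2', 'B3'], ['A2', 'C1']]
print(centroids)  # ≈ (3.667, 9.0), (7.0, 4.333), (1.5, 3.5)
```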
K-Means Algorithm Flow Chart Explanation (Q92)
- Input: Dataset $D$, choose number of clusters $k$.
- Initialize: Select $k$ initial centroids (randomly or using a heuristic).
- Assign: For each data point, compute its distance to every centroid and assign it to the nearest cluster.
- Update: Recompute each centroid as the mean (average) of all points currently assigned to that cluster.
- Convergence Check: If the centroids have not changed significantly (or the maximum iteration limit has been reached), stop; otherwise, return to the Assign step.
- Output: Final centroids and cluster assignments.
Notes: The algorithm repeats until centroids stabilize. It is sensitive to initialization; using techniques like k-means++ or multiple random runs improves result quality.
Data Mining Fundamentals (Q94)
- Definition: Automated extraction of meaningful patterns, trends, and anomalies from large datasets.
- Applications: Market-basket analysis, fraud detection, customer segmentation, predictive maintenance, healthcare prognosis.
- Typical Tasks: Classification, clustering, association rule mining, outlier detection.
- Benefits: Enhanced decision support, personalization, and cost reduction.
- Problems/Challenges: Handling noisy/incomplete data, scalability issues, privacy concerns, and model interpretability.
- Deployment Issues: Integration with enterprise systems and continuous model monitoring for drift.
PageRank for Website Ranking Improvement (Q95)
- PageRank Definition: A link-analysis algorithm that scores web pages based on the quantity and quality of inbound links.
- Core Idea: A page is considered important if important pages link to it. The calculation incorporates a damping factor (typically 0.85).
- Computation: Performed iteratively using an eigenvector or Markov process until the scores converge.
- To Improve Ranking: Acquire high-quality backlinks from authoritative sites, optimize internal linking structure, and ensure pages are easily crawlable (remove broken links).
- Content & Relevance: PageRank must be combined with query relevance signals and strong on-page SEO (high-quality content, optimized metadata).
- Caveat: PageRank is only one signal; modern search engines rely on hundreds of ranking features.
Shingling: Definition, Disadvantages, and Remedies (Q96)
- Shingling: The process of breaking a document into overlapping $k$-grams (word or character shingles) to capture local sequence information.
- Use: Documents are converted into sets of shingles; similarity is then computed using the Jaccard index on these sets.
- Disadvantages: Leads to very high dimensionality, resulting in expensive storage and comparison operations. It is also sensitive to minor edits (e.g., punctuation changes).
- Remedy for Size: Apply MinHash to compress the large shingle sets into small, fixed-size signatures while preserving similarity estimates.
- Remedy for Speed: Use Locality Sensitive Hashing (LSH) to quickly identify candidate pairs that are likely to be similar, avoiding all-pairs comparison.
- Practical Tip: Choose the shingle size ($k$) carefully; larger $k$ reduces false positives but increases sensitivity to small changes.
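The MinHash remedy can be sketched in a few lines of Python. This is illustrative only: the salted use of Python's built-in `hash` stands in for a family of random hash functions, and `signature`/`estimate_jaccard` are made-up names, not a library API.

```python
import random

def signature(shingles, salts):
    """MinHash signature: one minimum hash value per salted hash function."""
    return [min(hash((salt, s)) for s in shingles) for salt in salts]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of agreeing signature positions ≈ Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

random.seed(0)
salts = [random.getrandbits(32) for _ in range(200)]  # 200 hash functions

a = {"the night", "night is", "is dark"}
b = {"the night", "night is", "is red"}
sa, sb = signature(a, salts), signature(b, salts)

print(estimate_jaccard(sa, sa))  # identical sets → 1.0
print(estimate_jaccard(sa, sb))  # close to the true Jaccard 2/4 = 0.5
```

Each set is compressed to 200 integers regardless of how many shingles it contains, which is exactly the storage saving MinHash provides.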
Data Stream Management Systems (DSMS) and Sensor Data (Q97)
- DSMS Definition: A system designed to process continuous, unbounded streams of data in real time using continuous queries.
- Characteristics: Requires low latency processing, supports sliding or tumbling window semantics, and performs incremental aggregation.
- Operators: Includes standard operators like filters, joins, and aggregates applied over defined time windows.
- Relation to Sensors: Sensors generate continuous data streams (e.g., temperature, pressure); the DSMS is the engine that ingests and analyzes this data immediately.
- Benefit: DSMS enables in-network filtering and aggregation, significantly reducing transmission load and providing real-time alerts from sensor readings.
- Challenges: Handling out-of-order data arrival, managing resource constraints on processing nodes, and ensuring fault tolerance.
Market-Basket Model and Apriori Algorithm (Q100)
- Market-Basket Model: A transactional dataset where each record (basket) is a set of items purchased together.
- Goal: Discover frequent itemsets and generate actionable association rules (e.g., {bread} → {butter}).
- Apriori Key Idea: The Apriori property states that any subset of a frequent itemset must also be frequent (downward closure).
- Algorithm Steps:
- Scan DB to find frequent 1-itemsets (support ≥ minsup).
- Generate candidate $k$-itemsets from frequent $(k-1)$-itemsets (join and prune steps).
- Scan DB to count supports for candidates and prune non-frequent ones. Repeat until no new frequent itemsets are found.
- Rule Generation: From each frequent itemset, generate all possible association rules and test them against a minimum confidence threshold.
- Use in Supermarkets: Identify frequently co-purchased items to optimize product placement, bundling, and promotional strategies.
- Practical Note: Apriori can be computationally expensive due to multiple database passes; modern systems use optimizations like FP-growth or sampling.
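The algorithm steps above can be sketched as a minimal, unoptimized Apriori in Python (the `apriori` function and the toy baskets are illustrative, not from the original question):

```python
from itertools import combinations

def apriori(baskets, minsup):
    """Level-wise frequent-itemset mining with downward-closure pruning."""
    items = {i for b in baskets for i in b}
    freq = {}
    current = [frozenset([i]) for i in sorted(items)]  # candidate 1-itemsets
    k = 1
    while current:
        # Scan the DB to count support for each candidate.
        counts = {c: sum(c <= b for b in baskets) for c in current}
        frequent = {c: n for c, n in counts.items() if n >= minsup}
        freq.update(frequent)
        k += 1
        # Join frequent (k-1)-itemsets, prune candidates with an
        # infrequent (k-1)-subset (the Apriori property).
        current = list({a | b for a, b in combinations(list(frequent), 2)
                        if len(a | b) == k
                        and all(frozenset(sub) in frequent
                                for sub in combinations(a | b, k - 1))})
    return freq

baskets = [frozenset(b) for b in [{"bread", "butter", "milk"}, {"bread", "butter"},
                                  {"bread", "milk"}, {"butter", "milk"}]]
result = apriori(baskets, minsup=2)
print(result[frozenset({"bread", "butter"})])  # 2
```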
Hamming Distance Calculations
Hamming Distance: "1011101" and "1001001" (Q86)
Compare bitwise:
1 0 1 1 1 0 1
1 0 0 1 0 0 1
Differences occur at positions 3 (1 vs 0) and 5 (1 vs 0). → Hamming distance = 2.
Hamming Distance: X = 0101010001, Y = 0100010000 (Q88)
Compare bitwise:
0 1 0 1 0 1 0 0 0 1
0 1 0 0 0 1 0 0 0 0
Differences occur at positions 4 (1 vs 0) and 10 (1 vs 0). → Hamming distance = 2.
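Both bit-string comparisons can be checked with a one-line Python helper (a sketch; `hamming` is an illustrative name):

```python
def hamming(x: str, y: str) -> int:
    """Hamming distance between two equal-length bit strings."""
    assert len(x) == len(y), "Hamming distance requires equal lengths"
    return sum(a != b for a, b in zip(x, y))

print(hamming("1011101", "1001001"))        # 2 (Q86)
print(hamming("0101010001", "0100010000"))  # 2 (Q88)
```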
Edit Distance Calculation (Q87)
Calculate the Edit Distance (Levenshtein) between s = "pepperatcoding" and t = "pepcodings".
- Observe: "pepperatcoding" → Keep "pep".
- Delete "perat" (5 deletions). The string becomes "pepcoding".
- Insert final "s" to match "pepcodings" (1 insertion).
Total minimal edits = 5 deletions + 1 insertion = 6 operations. → Edit distance = 6.
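The result can be confirmed with the standard Levenshtein dynamic-programming recurrence (a minimal two-row sketch in Python):

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(t) + 1))  # distances from s[:0] to every prefix of t
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution / match
        prev = curr
    return prev[-1]

print(levenshtein("pepperatcoding", "pepcodings"))  # 6
```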
Components for Finding Document Similarity (Q93)
- Preprocessing: Tokenization, stop-word removal, and optional stemming or lemmatization.
- Representation: Converting text into numerical vectors (e.g., TF, TF-IDF, or MinHash signatures for shingles).
- Similarity Measure: Applying a metric such as Cosine, Jaccard, or Euclidean distance to the representations.
Shingling Concept Example (Q101)
- Shingling Definition: Breaking text into overlapping $k$-grams (word or character sequences) to create a set representation of the document structure.
- Example (k=2 words): “Data Science is fun” → Shingles = { “data science”, “science is”, “is fun” }.
- Application: These shingle sets are used to compute Jaccard similarity for measuring document closeness.
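The shingle set in the example above comes from a simple sliding window over words, which can be sketched as:

```python
def word_shingles(text: str, k: int = 2) -> set:
    """Overlapping k-word shingles after lowercasing."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

print(word_shingles("Data Science is fun"))
# → {'data science', 'science is', 'is fun'} (set print order may vary)
```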
Shingling and Similarity for Documents D1–D4 (Q104)
Docs: D1: “I am Sam.”; D2: “Sam I am.”; D3: “I do not like green eggs and ham.”; D4: “I do not like them, Sam I am.”
Procedure (k=2 word shingles, normalized):
- D1 → { “i am”, “am sam” } (Size 2)
- D2 → { “sam i”, “i am” } (Size 2)
- D3 → { “i do”, “do not”, “not like”, “like green”, “green eggs”, “eggs and”, “and ham” } (Size 7)
- D4 → { “i do”, “do not”, “not like”, “like them”, “them sam”, “sam i”, “i am” } (Size 7)
Compute pairwise Jaccard (|intersection| / |union|):
- D1 vs D2: Intersection { “i am” } → 1 / 3 ≈ 0.333
- D2 vs D4: Intersection { “sam i”, “i am” } → 2 / 7 ≈ 0.286
- D3 vs D4: Intersection { “i do”, “do not”, “not like” } → 3 / 11 ≈ 0.273
Result: D1 and D2 show the highest similarity due to sharing the exact phrase “i am”.
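The pairwise values above can be verified with a short Python sketch (punctuation is stripped with a simple `strip(".,")`; `shingles` and `jaccard` are illustrative names):

```python
def shingles(text: str, k: int = 2) -> set:
    """Normalized k-word shingles: lowercase, punctuation stripped."""
    words = [w.strip(".,").lower() for w in text.split()]
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

docs = {
    "D1": "I am Sam.",
    "D2": "Sam I am.",
    "D3": "I do not like green eggs and ham.",
    "D4": "I do not like them, Sam I am.",
}
sets = {n: shingles(t) for n, t in docs.items()}

print(round(jaccard(sets["D1"], sets["D2"]), 3))  # 1/3  ≈ 0.333
print(round(jaccard(sets["D2"], sets["D4"]), 3))  # 2/7  ≈ 0.286
print(round(jaccard(sets["D3"], sets["D4"]), 3))  # 3/11 ≈ 0.273
```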
Apriori vs. PCY Algorithm Comparison (Q105)
- Market-Basket Model: A transactional dataset where each transaction is a set of items; used for finding frequent itemsets and association rules.
- Apriori: Iteratively generates candidate $k$-itemsets from frequent $(k-1)$-itemsets and requires multiple full database scans to count supports, relying on the downward closure property for pruning.
- PCY Algorithm: Improves Apriori by using a hash table in the first pass to count pairs and a bitmap to filter out unlikely candidate pairs, significantly reducing the number of pairs considered in subsequent passes.
- Efficiency Comparison: PCY (and its successors) is generally more efficient in practice for large datasets because hashing and bitmap compression drastically reduce candidate generation overhead and I/O.
- When to Prefer Apriori: For smaller datasets where implementation simplicity outweighs the performance gains of hashing techniques.
- Modern Practice: For very large-scale mining, algorithms like FP-growth or distributed methods (like SON) are often preferred over basic Apriori or PCY.