Document Similarity Metrics: Jaccard, Cosine, and Hamming Calculations

Posted by Anonymous and classified in Mathematics


Document Similarity Metrics: Jaccard, Cosine, and Hamming Calculations (Q99)

Documents (after lowercasing and tokenizing words, removing punctuation):

  • D1: “the night is dark and the moon is red”
  • D2: “the moon in the night is red”
  • D3: “i can see moon is red the night is dark”

Step A — Construct Word Sets / Vectors (Unified Vocabulary)

Vocabulary (unique words across D1–D3): {the, night, is, dark, and, moon, red, in, i, can, see} (11 words)

i) Jaccard Similarity (Set of Words)

Set(D1) = {the, night, is, dark, and, moon, red}

Set(D2) = {the, moon, in, night, is, red}

Set(D3) = {i, can, see, moon, is, red, the, night, dark}

Compute pairwise Jaccard:

  • D1 ∩ D2 = {the, night, is, moon, red} → size 5; D1 ∪ D2 size = 8 → J(D1,D2) = 5/8 = 0.625
  • D1 ∩ D3 = {the, night, is, dark, moon, red} → size 6; union size = 10 → J(D1,D3) = 6/10 = 0.6
  • D2 ∩ D3 = {the, moon, is, red, night} → size 5; union size = 10 → J(D2,D3) = 5/10 = 0.5
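The three pairwise scores can be checked with a short Python sketch using built-in set operations:

```python
def jaccard(a, b):
    """Jaccard similarity: |A ∩ B| / |A ∪ B| for two sets."""
    return len(a & b) / len(a | b)

d1 = set("the night is dark and the moon is red".split())
d2 = set("the moon in the night is red".split())
d3 = set("i can see moon is red the night is dark".split())

print(round(jaccard(d1, d2), 4))  # 0.625
print(round(jaccard(d1, d3), 4))  # 0.6
print(round(jaccard(d2, d3), 4))  # 0.5
```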

ii) Cosine Similarity (Binary Presence Vectors)

Use binary presence vectors (1 if the word occurs, 0 otherwise); raw term frequencies would differ here (e.g., “the” and “is” each appear twice in D1). Order: the, night, is, dark, and, moon, red, in, i, can, see:

  • V(D1) = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0] (Norm √7)
  • V(D2) = [1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0] (Norm √6)
  • V(D3) = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1] (Norm √9 = 3)

Compute cosine (dot product / product of norms):

  • Cos(D1,D2) = (1+1+1+0+0+1+1) / (√7 × √6) = 5 / (2.6458 × 2.4495) ≈ 0.7715
  • Cos(D1,D3) = (1+1+1+1+0+1+1) / (√7 × √9) = 6 / (2.6458 × 3) ≈ 0.7559
  • Cos(D2,D3) = (1+1+1+0+0+1+1) / (√6 × √9) = 5 / (2.4495 × 3) ≈ 0.6804
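The same cosine values follow from a direct implementation of dot product over norms, applied to the presence vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) *
                  math.sqrt(sum(y * y for y in v)))

# Presence vectors over [the, night, is, dark, and, moon, red, in, i, can, see]
d1 = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
d2 = [1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0]
d3 = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1]

print(round(cosine(d1, d2), 4))  # 0.7715
print(round(cosine(d1, d3), 4))  # 0.7559
print(round(cosine(d2, d3), 4))  # 0.6804
```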

iii) Hamming Distance (Binary Presence Vectors)

Use presence (1)/absence (0) for the 11 vocabulary terms:

  • D1: [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
  • D2: [1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0]
  • D3: [1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1]

Hamming distances (count differing positions out of 11):

  • H(D1,D2): Differ at positions corresponding to dark (1 vs 0), and (1 vs 0), in (0 vs 1) → H = 3.
  • H(D1,D3): Differ at positions corresponding to and (1 vs 0), i (0 vs 1), can (0 vs 1), see (0 vs 1) → H = 4.
  • H(D2,D3): Differ at positions corresponding to dark (0 vs 1), in (1 vs 0), i (0 vs 1), can (0 vs 1), see (0 vs 1) → H = 5.
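Counting differing positions can be done in one line per pair:

```python
def hamming(u, v):
    """Count the positions at which two equal-length vectors differ."""
    return sum(1 for x, y in zip(u, v) if x != y)

d1 = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
d2 = [1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0]
d3 = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1]

print(hamming(d1, d2), hamming(d1, d3), hamming(d2, d3))  # 3 4 5
```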

Summary of Results

  • Jaccard: D1–D2 = 0.625 (highest), D1–D3 = 0.6, D2–D3 = 0.5.
  • Cosine (Binary): D1–D2 ≈ 0.772, D1–D3 ≈ 0.756, D2–D3 ≈ 0.680.
  • Hamming (Binary): D1–D2 = 3, D1–D3 = 4, D2–D3 = 5.

K-Means Clustering Analysis (Q102)

Given points: A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9).

Initial centroids: C₁ = A1(2,10), C₂ = B1(5,8), C₃ = C1(1,2).

Iteration 1: Assignment

  • Cluster C₁ (2,10): members → A1
  • Cluster C₂ (5,8): members → A3, B1, B2, B3, C2
  • Cluster C₃ (1,2): members → A2, C1

Iteration 1: Recompute Centroids

  • C₁ = mean of {A1} = (2.0, 10.0) (Unchanged)
  • C₂ = mean of {A3(8,4), B1(5,8), B2(7,5), B3(6,4), C2(4,9)} = (6.0, 6.0)
  • C₃ = mean of {A2(2,5), C1(1,2)} = (1.5, 3.5)

Iteration 2: Reassignment

  • C₁ (2.0,10.0): members → A1, C2
  • C₂ (6.0,6.0): members → A3, B1, B2, B3
  • C₃ (1.5,3.5): members → A2, C1

Iteration 2: Recompute Centroids

  • C₁ = mean(A1(2,10), C2(4,9)) = (3.0, 9.5)
  • C₂ = mean(A3,B1,B2,B3) = (6.5, 5.25)
  • C₃ = (1.5, 3.5) (Unchanged)

Iteration 3: Final Stable Clusters (Convergence)

Iteration 3 reassigns B1 from Cluster 2 to Cluster 1; after recomputing centroids, Iteration 4 produces no further changes, so the algorithm converges. The final stable clusters are:

  • Cluster 1 (Centroid ≈ (3.667, 9.0)): {A1(2,10), B1(5,8), C2(4,9)}
  • Cluster 2 (Centroid ≈ (7.0, 4.333)): {A3(8,4), B2(7,5), B3(6,4)}
  • Cluster 3 (Centroid = (1.5,3.5)): {A2(2,5), C1(1,2)}

Note: Showing the Euclidean distance calculations at each assignment step is required for full marks.
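The full iteration trace above can be reproduced with a minimal sketch of Lloyd's algorithm. Assumptions: Euclidean distance, no cluster ever becomes empty, and exact float equality as the convergence test, which suffices for this small example:

```python
import math

def kmeans(points, centroids, max_iter=10):
    """Plain Lloyd's algorithm: assign to nearest centroid, then
    move each centroid to the mean of its cluster, until stable."""
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)),
                    key=lambda i: math.dist(p, centroids[i]))
            clusters[i].append(p)
        # Update step: centroid = mean of assigned points.
        new = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
               for c in clusters]
        if new == centroids:  # converged
            return new, clusters
        centroids = new
    return centroids, clusters

points = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]
centroids, clusters = kmeans(points, [(2, 10), (5, 8), (1, 2)])
print([(round(x, 3), round(y, 3)) for x, y in centroids])
# [(3.667, 9.0), (7.0, 4.333), (1.5, 3.5)]
```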


K-Means Algorithm Flow Chart Explanation (Q92)

  1. Input: Dataset $D$, choose number of clusters $k$.
  2. Initialize: Select $k$ initial centroids (randomly or using a heuristic).
  3. Assign: For each data point, compute its distance to every centroid and assign it to the nearest cluster.
  4. Update: Recompute each centroid as the mean (average) of all points currently assigned to that cluster.
  5. Convergence Check: If the centroids have not changed significantly (or the maximum iteration limit has been reached), stop; otherwise, return to Step 3.
  6. Output: Final centroids and cluster assignments.

Notes: The algorithm repeats until centroids stabilize. It is sensitive to initialization; using techniques like k-means++ or multiple random runs improves result quality.


Data Mining Fundamentals (Q94)

  1. Definition: Automated extraction of meaningful patterns, trends, and anomalies from large datasets.
  2. Applications: Market-basket analysis, fraud detection, customer segmentation, predictive maintenance, healthcare prognosis.
  3. Typical Tasks: Classification, clustering, association rule mining, outlier detection.
  4. Benefits: Enhanced decision support, personalization, and cost reduction.
  5. Problems/Challenges: Handling noisy/incomplete data, scalability issues, privacy concerns, and model interpretability.
  6. Deployment Issues: Integration with enterprise systems and continuous model monitoring for drift.

PageRank for Website Ranking Improvement (Q95)

  1. PageRank Definition: A link-analysis algorithm that scores web pages based on the quantity and quality of inbound links.
  2. Core Idea: A page is considered important if important pages link to it. The calculation incorporates a damping factor (typically 0.85).
  3. Computation: Performed iteratively using an eigenvector or Markov process until the scores converge.
  4. To Improve Ranking: Acquire high-quality backlinks from authoritative sites, optimize internal linking structure, and ensure pages are easily crawlable (remove broken links).
  5. Content & Relevance: PageRank must be combined with query relevance signals and strong on-page SEO (high-quality content, optimized metadata).
  6. Caveat: PageRank is only one signal; modern search engines rely on hundreds of ranking features.
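A minimal sketch of the iterative computation in point 3, on a hypothetical three-page link graph (the page names and damping factor are illustrative; dangling pages and personalization are ignored for brevity):

```python
def pagerank(links, d=0.85, iters=50):
    """Iterative PageRank on a dict {page: [outlinks]}.
    Each iteration distributes d * rank(p) / outdegree(p) along
    every outlink, plus a uniform (1 - d) / n teleport share."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            for q in outs:
                new[q] += d * rank[p] / len(outs)
        rank = new
    return rank

# Hypothetical graph: A and B both link to C; C links back to A.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
print(max(ranks, key=ranks.get))  # C accumulates the most rank
```

Note how B, with no inbound links, keeps only the teleport share, while C is lifted by two inbound links, one of them from the well-ranked A.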

Shingling: Definition, Disadvantages, and Remedies (Q96)

  1. Shingling: The process of breaking a document into overlapping $k$-grams (word or character shingles) to capture local sequence information.
  2. Use: Documents are converted into sets of shingles; similarity is then computed using the Jaccard index on these sets.
  3. Disadvantages: Leads to very high dimensionality, resulting in expensive storage and comparison operations. It is also sensitive to minor edits (e.g., punctuation changes).
  4. Remedy for Size: Apply MinHash to compress the large shingle sets into small, fixed-size signatures while preserving similarity estimates.
  5. Remedy for Speed: Use Locality Sensitive Hashing (LSH) to quickly identify candidate pairs that are likely to be similar, avoiding all-pairs comparison.
  6. Practical Tip: Choose the shingle size ($k$) carefully; larger $k$ reduces false positives but increases sensitivity to small changes.
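The MinHash remedy in point 4 can be sketched as follows; salted CRC32 stands in for a family of independent hash functions (a classroom simplification), and the shingle sets are illustrative:

```python
import zlib

def minhash_signature(shingles, n_hashes=100):
    """MinHash signature: for each of n hash functions, keep the
    minimum hash value over the shingle set. The fraction of
    matching signature positions estimates the Jaccard similarity."""
    return [min(zlib.crc32(f"{i}:{s}".encode()) for s in shingles)
            for i in range(n_hashes)]

a = {"the night", "night is", "is dark"}
b = {"the night", "night is", "is red"}

sig_a, sig_b = minhash_signature(a), minhash_signature(b)
est = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
print(est)  # close to the true Jaccard similarity, 2/4 = 0.5
```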

Data Stream Management Systems (DSMS) and Sensor Data (Q97)

  1. DSMS Definition: A system designed to process continuous, unbounded streams of data in real time using continuous queries.
  2. Characteristics: Requires low latency processing, supports sliding or tumbling window semantics, and performs incremental aggregation.
  3. Operators: Includes standard operators like filters, joins, and aggregates applied over defined time windows.
  4. Relation to Sensors: Sensors generate continuous data streams (e.g., temperature, pressure); the DSMS is the engine that ingests and analyzes this data immediately.
  5. Benefit: DSMS enables in-network filtering and aggregation, significantly reducing transmission load and providing real-time alerts from sensor readings.
  6. Challenges: Handling out-of-order data arrival, managing resource constraints on processing nodes, and ensuring fault tolerance.
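The tumbling-window semantics in point 2 can be illustrated with a small offline sketch (a real DSMS evaluates this incrementally over an unbounded stream; the sensor readings below are made up):

```python
from statistics import mean

def tumbling_windows(readings, size):
    """Group (timestamp, value) readings into fixed, non-overlapping
    windows of `size` time units and aggregate each window."""
    windows = {}
    for t, v in readings:
        windows.setdefault(t // size, []).append(v)
    return {w: round(mean(vs), 2) for w, vs in sorted(windows.items())}

# Hypothetical sensor stream: (seconds since start, temperature in °C).
stream = [(0, 20.0), (3, 21.0), (5, 22.0), (8, 23.0), (11, 24.0)]
print(tumbling_windows(stream, size=5))  # {0: 20.5, 1: 22.5, 2: 24.0}
```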

Market-Basket Model and Apriori Algorithm (Q100)

  1. Market-Basket Model: A transactional dataset where each record (basket) is a set of items purchased together.
  2. Goal: Discover frequent itemsets and generate actionable association rules (e.g., {bread} → {butter}).
  3. Apriori Key Idea: The Apriori property states that any subset of a frequent itemset must also be frequent (downward closure).
  4. Algorithm Steps:
    1. Scan DB to find frequent 1-itemsets (support ≥ minsup).
    2. Generate candidate $k$-itemsets from frequent $(k-1)$-itemsets (join and prune steps).
    3. Scan DB to count supports for candidates and prune non-frequent ones. Repeat until no new frequent itemsets are found.
  5. Rule Generation: From each frequent itemset, generate all possible association rules and test them against a minimum confidence threshold.
  6. Use in Supermarkets: Identify frequently co-purchased items to optimize product placement, bundling, and promotional strategies.
  7. Practical Note: Apriori can be computationally expensive due to multiple database passes; modern systems use optimizations like FP-growth or sampling.
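The algorithm steps above can be sketched in a few lines; the bread/butter/milk baskets and minsup = 2 are illustrative:

```python
from itertools import combinations

def apriori(baskets, minsup):
    """Frequent-itemset mining via level-wise candidate generation."""
    items = {i for b in baskets for i in b}
    freq = {}
    current = [frozenset([i]) for i in sorted(items)]
    k = 1
    while current:
        # One DB scan: count the support of every size-k candidate.
        counts = {c: sum(c <= b for b in baskets) for c in current}
        level = {c: n for c, n in counts.items() if n >= minsup}
        freq.update(level)
        # Join + prune: size-(k+1) candidates whose k-subsets are all frequent.
        keys = list(level)
        current = list({a | b for i, a in enumerate(keys) for b in keys[i + 1:]
                        if len(a | b) == k + 1
                        and all(frozenset(s) in level
                                for s in combinations(a | b, k))})
        k += 1
    return freq

baskets = [frozenset(b) for b in [{"bread", "butter"},
                                  {"bread", "butter", "milk"},
                                  {"bread", "milk"}, {"butter", "milk"}]]
freq_sets = apriori(baskets, minsup=2)
print(len(freq_sets))  # 3 frequent items + 3 frequent pairs = 6
```

Here {bread, butter, milk} is generated as a candidate (all its pairs are frequent) but pruned after counting, since it appears in only one basket.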

Hamming Distance Calculations

Hamming Distance: "1011101" and "1001001" (Q86)

Compare bitwise:

1 0 1 1 1 0 1
1 0 0 1 0 0 1

Differences occur at positions 3 (1 vs 0) and 5 (1 vs 0). → Hamming distance = 2.

Hamming Distance: X = 0101010001, Y = 0100010000 (Q88)

Compare bitwise:

0 1 0 1 0 1 0 0 0 1
0 1 0 0 0 1 0 0 0 0

Differences occur at positions 4 (1 vs 0) and 10 (1 vs 0). → Hamming distance = 2.
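Both bitwise comparisons reduce to counting mismatched positions:

```python
def hamming_str(x, y):
    """Number of positions at which two equal-length bit strings differ."""
    assert len(x) == len(y), "Hamming distance needs equal-length strings"
    return sum(a != b for a, b in zip(x, y))

print(hamming_str("1011101", "1001001"))        # 2
print(hamming_str("0101010001", "0100010000"))  # 2
```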



Edit Distance Calculation (Q87)

Calculate the Edit Distance (Levenshtein) between s = "pepperatcoding" and t = "pepcodings".

  1. Observe: "pepperatcoding" → Keep "pep".
  2. Delete "perat" (5 deletions). The string becomes "pepcoding".
  3. Insert final "s" to match "pepcodings" (1 insertion).

Total minimal edits = 5 deletions + 1 insertion = 6 operations. → Edit distance = 6.
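The step-by-step argument can be confirmed with the standard dynamic-programming implementation of Levenshtein distance:

```python
def levenshtein(s, t):
    """Classic DP: d[i][j] = edit distance between s[:i] and t[:j]."""
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = i                              # i deletions
    for j in range(len(t) + 1):
        d[0][j] = j                              # j insertions
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # delete
                          d[i][j - 1] + 1,       # insert
                          d[i - 1][j - 1] + cost)  # substitute / match
    return d[-1][-1]

print(levenshtein("pepperatcoding", "pepcodings"))  # 6
```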


Components for Finding Document Similarity (Q93)

  1. Preprocessing: Tokenization, stop-word removal, and optional stemming or lemmatization.
  2. Representation: Converting text into numerical vectors (e.g., TF, TF-IDF, or MinHash signatures for shingles).
  3. Similarity Measure: Applying a metric such as Cosine, Jaccard, or Euclidean distance to the representations.



Shingling Concept Example (Q101)

  1. Shingling Definition: Breaking text into overlapping $k$-grams (word or character sequences) to create a set representation of the document structure.
  2. Example (k=2 words): “Data Science is fun” → Shingles = { “data science”, “science is”, “is fun” }.
  3. Application: These shingle sets are used to compute Jaccard similarity for measuring document closeness.

Shingling and Similarity for Documents D1–D4 (Q104)

Docs: D1: “I am Sam.”; D2: “Sam I am.”; D3: “I do not like green eggs and ham.”; D4: “I do not like them, Sam I am.”

Procedure (k=2 word shingles, normalized):

  • D1 → { “i am”, “am sam” } (Size 2)
  • D2 → { “sam i”, “i am” } (Size 2)
  • D3 → { “i do”, “do not”, “not like”, “like green”, “green eggs”, “eggs and”, “and ham” } (Size 7)
  • D4 → { “i do”, “do not”, “not like”, “like them”, “them sam”, “sam i”, “i am” } (Size 7)

Compute pairwise Jaccard (|intersection| / |union|):

  • D1 vs D2: Intersection { “i am” } → 1 / 3 ≈ 0.333
  • D2 vs D4: Intersection { “sam i”, “i am” } → 2 / 7 ≈ 0.286
  • D3 vs D4: Intersection { “i do”, “do not”, “not like” } → 3 / 11 ≈ 0.273

Result: D1 and D2 show the highest similarity due to sharing the exact phrase “i am”.
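The shingling-plus-Jaccard procedure for D1–D4 can be reproduced end to end (normalization here means lowercasing and stripping punctuation):

```python
import re

def shingles(text, k=2):
    """Set of overlapping k-word shingles from normalized text."""
    words = re.findall(r"[a-z]+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

docs = {"D1": "I am Sam.", "D2": "Sam I am.",
        "D3": "I do not like green eggs and ham.",
        "D4": "I do not like them, Sam I am."}
s = {name: shingles(t) for name, t in docs.items()}

print(round(jaccard(s["D1"], s["D2"]), 3))  # 0.333
print(round(jaccard(s["D2"], s["D4"]), 3))  # 0.286
print(round(jaccard(s["D3"], s["D4"]), 3))  # 0.273
```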


Apriori vs. PCY Algorithm Comparison (Q105)

  1. Market-Basket Model: A transactional dataset where each transaction is a set of items; used for finding frequent itemsets and association rules.
  2. Apriori: Iteratively generates candidate $k$-itemsets from frequent $(k-1)$-itemsets and requires multiple full database scans to count supports, relying on the downward closure property for pruning.
  3. PCY Algorithm: Improves Apriori by using a hash table in the first pass to count pairs and a bitmap to filter out unlikely candidate pairs, significantly reducing the number of pairs considered in subsequent passes.
  4. Efficiency Comparison: PCY (and its successors) is generally more efficient in practice for large datasets because hashing and bitmap compression drastically reduce candidate generation overhead and I/O.
  5. When to Prefer Apriori: For smaller datasets where implementation simplicity outweighs the performance gains of hashing techniques.
  6. Modern Practice: For very large-scale mining, algorithms like FP-growth or distributed methods (like SON) are often preferred over basic Apriori or PCY.
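The hash-table-and-bitmap idea in point 3 can be sketched as a first pass that filters candidate pairs; the baskets, the bucket count of 101, and the CRC32 stand-in hash are all illustrative choices:

```python
from itertools import combinations
import zlib

def bucket_of(pair, n_buckets):
    """Deterministic stand-in hash mapping an item pair to a bucket."""
    return zlib.crc32("|".join(pair).encode()) % n_buckets

def pcy_candidate_pairs(baskets, minsup, n_buckets=101):
    """PCY pass 1: count single items AND hash every pair in every
    basket into a bucket. A pair survives to pass 2 only if both its
    items are frequent and its bucket count reached minsup (bitmap)."""
    item_count = {}
    bucket = [0] * n_buckets
    for b in baskets:
        for item in b:
            item_count[item] = item_count.get(item, 0) + 1
        for pair in combinations(sorted(b), 2):
            bucket[bucket_of(pair, n_buckets)] += 1
    frequent = {i for i, n in item_count.items() if n >= minsup}
    bitmap = [n >= minsup for n in bucket]  # one bit per bucket in practice
    return [p for p in combinations(sorted(frequent), 2)
            if bitmap[bucket_of(p, n_buckets)]]

baskets = [{"bread", "butter"}, {"bread", "butter", "milk"},
           {"bread", "milk"}, {"butter", "jam"}]
cands = pcy_candidate_pairs(baskets, minsup=2)
print(cands)
```

Pairs that pass the bitmap test are only candidates: hash-bucket collisions can admit false positives, so pass 2 still counts their true supports.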
