Document Similarity Metrics: Jaccard, Cosine, and Hamming Calculations

Posted by Anonymous and classified in Mathematics


Document Similarity Metrics: Jaccard, Cosine, and Hamming Calculations (Q99)

Documents (after lowercasing and tokenizing words, removing punctuation):

  • D1: “the night is dark and the moon is red”
  • D2: “the moon in the night is red”
  • D3: “i can see moon is red the night is dark”

Step A — Construct Word Sets / Vectors (Unified Vocabulary)

Vocabulary (unique words across D1–D3): {the, night, is, dark, and, moon, red, in, i, can, see} (11 words)

i) Jaccard Similarity (Set of Words)

Set(D1) = {the, night, is, dark, and, moon, red}

Set(D2) = {the, moon, in, night, is, red}

Set(D3) = {i, can, see, moon, is, red, the, night, dark}

Compute pairwise Jaccard:

  • D1 ∩ D2 = {the, night, is, moon, red} → size 5; D1 ∪ D2 size = 8 → J(D1,D2) = 5/8 = 0.625
  • D1 ∩ D3 = {the, night, is, dark, moon, red} → size 6; union size = 10 → J(D1,D3) = 6/10 = 0.6
  • D2 ∩ D3 = {the, moon, is, red, night} → size 5; union size = 10 → J(D2,D3) = 5/10 = 0.5
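The three pairwise scores can be checked with a short Python sketch using built-in set operations:

```python
def jaccard(a, b):
    """Jaccard similarity: |A ∩ B| / |A ∪ B| for two sets."""
    return len(a & b) / len(a | b)

d1 = set("the night is dark and the moon is red".split())
d2 = set("the moon in the night is red".split())
d3 = set("i can see moon is red the night is dark".split())

print(round(jaccard(d1, d2), 4))  # 0.625
print(round(jaccard(d1, d3), 4))  # 0.6
print(round(jaccard(d2, d3), 4))  # 0.5
```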

ii) Cosine Similarity (Binary Presence Vectors)

Use binary presence vectors (1 if the word occurs, 0 otherwise); raw term frequencies would differ here (e.g., “the” and “is” each appear twice in D1). Order: the, night, is, dark, and, moon, red, in, i, can, see:

  • V(D1) = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0] (Norm √7)
  • V(D2) = [1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0] (Norm √6)
  • V(D3) = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1] (Norm √9 = 3)

Compute cosine (dot product / product of norms):

  • Cos(D1,D2) = (1+1+1+0+0+1+1) / (√7 × √6) = 5 / (2.6458 × 2.4495) ≈ 0.7715
  • Cos(D1,D3) = (1+1+1+1+0+1+1) / (√7 × √9) = 6 / (2.6458 × 3) ≈ 0.7559
  • Cos(D2,D3) = (1+1+1+0+0+1+1) / (√6 × √9) = 5 / (2.4495 × 3) ≈ 0.6804
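The same cosine values follow from a direct implementation of dot product over norms, applied to the presence vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) *
                  math.sqrt(sum(y * y for y in v)))

# Presence vectors over [the, night, is, dark, and, moon, red, in, i, can, see]
d1 = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
d2 = [1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0]
d3 = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1]

print(round(cosine(d1, d2), 4))  # 0.7715
print(round(cosine(d1, d3), 4))  # 0.7559
print(round(cosine(d2, d3), 4))  # 0.6804
```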

iii) Hamming Distance (Binary Presence Vectors)

Use presence (1)/absence (0) for the 11 vocabulary terms:

  • D1: [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
  • D2: [1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0]
  • D3: [1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1]

Hamming distances (count differing positions out of 11):

  • H(D1,D2): Differ at positions corresponding to dark (1 vs 0), and (1 vs 0), in (0 vs 1) → H = 3.
  • H(D1,D3): Differ at positions corresponding to and (1 vs 0), i (0 vs 1), can (0 vs 1), see (0 vs 1) → H = 4.
  • H(D2,D3): Differ at positions corresponding to dark (0 vs 1), in (1 vs 0), i (0 vs 1), can (0 vs 1), see (0 vs 1) → H = 5.
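Counting differing positions can be done in one line per pair:

```python
def hamming(u, v):
    """Count the positions at which two equal-length vectors differ."""
    return sum(1 for x, y in zip(u, v) if x != y)

d1 = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
d2 = [1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0]
d3 = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1]

print(hamming(d1, d2), hamming(d1, d3), hamming(d2, d3))  # 3 4 5
```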

Summary of Results

  • Jaccard: D1–D2 = 0.625 (highest), D1–D3 = 0.6, D2–D3 = 0.5.
  • Cosine (Binary): D1–D2 ≈ 0.772, D1–D3 ≈ 0.756, D2–D3 ≈ 0.680.
  • Hamming (Binary): D1–D2 = 3, D1–D3 = 4, D2–D3 = 5.

K-Means Clustering Analysis (Q102)

Given points: A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9).

Initial centroids: C₁ = A1(2,10), C₂ = B1(5,8), C₃ = C1(1,2).

Iteration 1: Assignment

  • Cluster C₁ (2,10): members → A1
  • Cluster C₂ (5,8): members → A3, B1, B2, B3, C2
  • Cluster C₃ (1,2): members → A2, C1

Iteration 1: Recompute Centroids

  • C₁ = mean of {A1} = (2.0, 10.0) (Unchanged)
  • C₂ = mean of {A3(8,4), B1(5,8), B2(7,5), B3(6,4), C2(4,9)} = (6.0, 6.0)
  • C₃ = mean of {A2(2,5), C1(1,2)} = (1.5, 3.5)

Iteration 2: Reassignment

  • C₁ (2.0,10.0): members → A1, C2
  • C₂ (6.0,6.0): members → A3, B1, B2, B3
  • C₃ (1.5,3.5): members → A2, C1

Iteration 2: Recompute Centroids

  • C₁ = mean(A1(2,10), C2(4,9)) = (3.0, 9.5)
  • C₂ = mean(A3,B1,B2,B3) = (6.5, 5.25)
  • C₃ = (1.5, 3.5) (Unchanged)

Iteration 3: Final Stable Clusters (Convergence)

Iteration 3 reassigns B1 from Cluster 2 to Cluster 1; after recomputing centroids, Iteration 4 produces no further changes, so the algorithm converges. The final stable clusters are:

  • Cluster 1 (Centroid ≈ (3.667, 9.0)): {A1(2,10), B1(5,8), C2(4,9)}
  • Cluster 2 (Centroid ≈ (7.0, 4.333)): {A3(8,4), B2(7,5), B3(6,4)}
  • Cluster 3 (Centroid = (1.5,3.5)): {A2(2,5), C1(1,2)}

Note: Showing the Euclidean distance calculations at each assignment step is required for full marks.
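The full iteration trace above can be reproduced with a minimal sketch of Lloyd's algorithm. Assumptions: Euclidean distance, no cluster ever becomes empty, and exact float equality as the convergence test, which suffices for this small example:

```python
import math

def kmeans(points, centroids, max_iter=10):
    """Plain Lloyd's algorithm: assign to nearest centroid, then
    move each centroid to the mean of its cluster, until stable."""
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)),
                    key=lambda i: math.dist(p, centroids[i]))
            clusters[i].append(p)
        # Update step: centroid = mean of assigned points.
        new = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
               for c in clusters]
        if new == centroids:  # converged
            return new, clusters
        centroids = new
    return centroids, clusters

points = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]
centroids, clusters = kmeans(points, [(2, 10), (5, 8), (1, 2)])
print([(round(x, 3), round(y, 3)) for x, y in centroids])
# [(3.667, 9.0), (7.0, 4.333), (1.5, 3.5)]
```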


K-Means Algorithm Flow Chart Explanation (Q92)

  1. Input: Dataset $D$, choose number of clusters $k$.
  2. Initialize: Select $k$ initial centroids (randomly or using a heuristic).
  3. Assign: For each data point, compute its distance to every centroid and assign it to the nearest cluster.
  4. Update: Recompute each centroid as the mean (average) of all points currently assigned to that cluster.
  5. Convergence Check: If the centroids have not changed significantly (or the maximum iteration limit has been reached), stop; otherwise, return to Step 3.
  6. Output: Final centroids and cluster assignments.

Notes: The algorithm repeats until centroids stabilize. It is sensitive to initialization; using techniques like k-means++ or multiple random runs improves result quality.


Data Mining Fundamentals (Q94)

  1. Definition: Automated extraction of meaningful patterns, trends, and anomalies from large datasets.
  2. Applications: Market-basket analysis, fraud detection, customer segmentation, predictive maintenance, healthcare prognosis.
  3. Typical Tasks: Classification, clustering, association rule mining, outlier detection.
  4. Benefits: Enhanced decision support, personalization, and cost reduction.
  5. Problems/Challenges: Handling noisy/incomplete data, scalability issues, privacy concerns, and model interpretability.
  6. Deployment Issues: Integration with enterprise systems and continuous model monitoring for drift.

PageRank for Website Ranking Improvement (Q95)

  1. PageRank Definition: A link-analysis algorithm that scores web pages based on the quantity and quality of inbound links.
  2. Core Idea: A page is considered important if important pages link to it. The calculation incorporates a damping factor (typically 0.85).
  3. Computation: Performed iteratively using an eigenvector or Markov process until the scores converge.
  4. To Improve Ranking: Acquire high-quality backlinks from authoritative sites, optimize internal linking structure, and ensure pages are easily crawlable (remove broken links).
  5. Content & Relevance: PageRank must be combined with query relevance signals and strong on-page SEO (high-quality content, optimized metadata).
  6. Caveat: PageRank is only one signal; modern search engines rely on hundreds of ranking features.
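A minimal sketch of the iterative computation in point 3, on a hypothetical three-page link graph (the page names and damping factor are illustrative; dangling pages and personalization are ignored for brevity):

```python
def pagerank(links, d=0.85, iters=50):
    """Iterative PageRank on a dict {page: [outlinks]}.
    Each iteration distributes d * rank(p) / outdegree(p) along
    every outlink, plus a uniform (1 - d) / n teleport share."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            for q in outs:
                new[q] += d * rank[p] / len(outs)
        rank = new
    return rank

# Hypothetical graph: A and B both link to C; C links back to A.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
print(max(ranks, key=ranks.get))  # C accumulates the most rank
```

Note how B, with no inbound links, keeps only the teleport share, while C is lifted by two inbound links, one of them from the well-ranked A.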

Shingling: Definition, Disadvantages, and Remedies (Q96)

  1. Shingling: The process of breaking a document into overlapping $k$-grams (word or character shingles) to capture local sequence information.
  2. Use: Documents are converted into sets of shingles; similarity is then computed using the Jaccard index on these sets.
  3. Disadvantages: Leads to very high dimensionality, resulting in expensive storage and comparison operations. It is also sensitive to minor edits (e.g., punctuation changes).
  4. Remedy for Size: Apply MinHash to compress the large shingle sets into small, fixed-size signatures while preserving similarity estimates.
  5. Remedy for Speed: Use Locality Sensitive Hashing (LSH) to quickly identify candidate pairs that are likely to be similar, avoiding all-pairs comparison.
  6. Practical Tip: Choose the shingle size ($k$) carefully; larger $k$ reduces false positives but increases sensitivity to small changes.
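The MinHash remedy in point 4 can be sketched as follows; salted CRC32 stands in for a family of independent hash functions (a classroom simplification), and the shingle sets are illustrative:

```python
import zlib

def minhash_signature(shingles, n_hashes=100):
    """MinHash signature: for each of n hash functions, keep the
    minimum hash value over the shingle set. The fraction of
    matching signature positions estimates the Jaccard similarity."""
    return [min(zlib.crc32(f"{i}:{s}".encode()) for s in shingles)
            for i in range(n_hashes)]

a = {"the night", "night is", "is dark"}
b = {"the night", "night is", "is red"}

sig_a, sig_b = minhash_signature(a), minhash_signature(b)
est = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
print(est)  # close to the true Jaccard similarity, 2/4 = 0.5
```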

Data Stream Management Systems (DSMS) and Sensor Data (Q97)

  1. DSMS Definition: A system designed to process continuous, unbounded streams of data in real time using continuous queries.
  2. Characteristics: Requires low latency processing, supports sliding or tumbling window semantics, and performs incremental aggregation.
  3. Operators: Includes standard operators like filters, joins, and aggregates applied over defined time windows.
  4. Relation to Sensors: Sensors generate continuous data streams (e.g., temperature, pressure); the DSMS is the engine that ingests and analyzes this data immediately.
  5. Benefit: DSMS enables in-network filtering and aggregation, significantly reducing transmission load and providing real-time alerts from sensor readings.
  6. Challenges: Handling out-of-order data arrival, managing resource constraints on processing nodes, and ensuring fault tolerance.
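The tumbling-window semantics in point 2 can be illustrated with a small offline sketch (a real DSMS evaluates this incrementally over an unbounded stream; the sensor readings below are made up):

```python
from statistics import mean

def tumbling_windows(readings, size):
    """Group (timestamp, value) readings into fixed, non-overlapping
    windows of `size` time units and aggregate each window."""
    windows = {}
    for t, v in readings:
        windows.setdefault(t // size, []).append(v)
    return {w: round(mean(vs), 2) for w, vs in sorted(windows.items())}

# Hypothetical sensor stream: (seconds since start, temperature in °C).
stream = [(0, 20.0), (3, 21.0), (5, 22.0), (8, 23.0), (11, 24.0)]
print(tumbling_windows(stream, size=5))  # {0: 20.5, 1: 22.5, 2: 24.0}
```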

Market-Basket Model and Apriori Algorithm (Q100)

  1. Market-Basket Model: A transactional dataset where each record (basket) is a set of items purchased together.
  2. Goal: Discover frequent itemsets and generate actionable association rules (e.g., {bread} → {butter}).
  3. Apriori Key Idea: The Apriori property states that any subset of a frequent itemset must also be frequent (downward closure).
  4. Algorithm Steps:
    1. Scan DB to find frequent 1-itemsets (support ≥ minsup).
    2. Generate candidate $k$-itemsets from frequent $(k-1)$-itemsets (join and prune steps).
    3. Scan DB to count supports for candidates and prune non-frequent ones. Repeat until no new frequent itemsets are found.
  5. Rule Generation: From each frequent itemset, generate all possible association rules and test them against a minimum confidence threshold.
  6. Use in Supermarkets: Identify frequently co-purchased items to optimize product placement, bundling, and promotional strategies.
  7. Practical Note: Apriori can be computationally expensive due to multiple database passes; modern systems use optimizations like FP-growth or sampling.
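The algorithm steps above can be sketched in a few lines; the bread/butter/milk baskets and minsup = 2 are illustrative:

```python
from itertools import combinations

def apriori(baskets, minsup):
    """Frequent-itemset mining via level-wise candidate generation."""
    items = {i for b in baskets for i in b}
    freq = {}
    current = [frozenset([i]) for i in sorted(items)]
    k = 1
    while current:
        # One DB scan: count the support of every size-k candidate.
        counts = {c: sum(c <= b for b in baskets) for c in current}
        level = {c: n for c, n in counts.items() if n >= minsup}
        freq.update(level)
        # Join + prune: size-(k+1) candidates whose k-subsets are all frequent.
        keys = list(level)
        current = list({a | b for i, a in enumerate(keys) for b in keys[i + 1:]
                        if len(a | b) == k + 1
                        and all(frozenset(s) in level
                                for s in combinations(a | b, k))})
        k += 1
    return freq

baskets = [frozenset(b) for b in [{"bread", "butter"},
                                  {"bread", "butter", "milk"},
                                  {"bread", "milk"}, {"butter", "milk"}]]
freq_sets = apriori(baskets, minsup=2)
print(len(freq_sets))  # 3 frequent items + 3 frequent pairs = 6
```

Here {bread, butter, milk} is generated as a candidate (all its pairs are frequent) but pruned after counting, since it appears in only one basket.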

Hamming Distance Calculations

Hamming Distance: "1011101" and "1001001" (Q86)

Compare bitwise:

1 0 1 1 1 0 1
1 0 0 1 0 0 1

Differences occur at positions 3 (1 vs 0) and 5 (1 vs 0). → Hamming distance = 2.

Hamming Distance: X = 0101010001, Y = 0100010000 (Q88)

Compare bitwise:

0 1 0 1 0 1 0 0 0 1
0 1 0 0 0 1 0 0 0 0

Differences occur at positions 4 (1 vs 0) and 10 (1 vs 0). → Hamming distance = 2.
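Both bitwise comparisons reduce to counting mismatched positions:

```python
def hamming_str(x, y):
    """Number of positions at which two equal-length bit strings differ."""
    assert len(x) == len(y), "Hamming distance needs equal-length strings"
    return sum(a != b for a, b in zip(x, y))

print(hamming_str("1011101", "1001001"))        # 2
print(hamming_str("0101010001", "0100010000"))  # 2
```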



Edit Distance Calculation (Q87)

Calculate the Edit Distance (Levenshtein) between s = "pepperatcoding" and t = "pepcodings".

  1. Observe: "pepperatcoding" → Keep "pep".
  2. Delete "perat" (5 deletions). The string becomes "pepcoding".
  3. Insert final "s" to match "pepcodings" (1 insertion).

Total minimal edits = 5 deletions + 1 insertion = 6 operations. → Edit distance = 6.
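The step-by-step argument can be confirmed with the standard dynamic-programming implementation of Levenshtein distance:

```python
def levenshtein(s, t):
    """Classic DP: d[i][j] = edit distance between s[:i] and t[:j]."""
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = i                              # i deletions
    for j in range(len(t) + 1):
        d[0][j] = j                              # j insertions
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # delete
                          d[i][j - 1] + 1,       # insert
                          d[i - 1][j - 1] + cost)  # substitute / match
    return d[-1][-1]

print(levenshtein("pepperatcoding", "pepcodings"))  # 6
```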


Components for Finding Document Similarity (Q93)

  1. Preprocessing: Tokenization, stop-word removal, and optional stemming or lemmatization.
  2. Representation: Converting text into numerical vectors (e.g., TF, TF-IDF, or MinHash signatures for shingles).
  3. Similarity Measure: Applying a metric such as Cosine, Jaccard, or Euclidean distance to the representations.



Shingling Concept Example (Q101)

  1. Shingling Definition: Breaking text into overlapping $k$-grams (word or character sequences) to create a set representation of the document structure.
  2. Example (k=2 words): “Data Science is fun” → Shingles = { “data science”, “science is”, “is fun” }.
  3. Application: These shingle sets are used to compute Jaccard similarity for measuring document closeness.

Shingling and Similarity for Documents D1–D4 (Q104)

Docs: D1: “I am Sam.”; D2: “Sam I am.”; D3: “I do not like green eggs and ham.”; D4: “I do not like them, Sam I am.”

Procedure (k=2 word shingles, normalized):

  • D1 → { “i am”, “am sam” } (Size 2)
  • D2 → { “sam i”, “i am” } (Size 2)
  • D3 → { “i do”, “do not”, “not like”, “like green”, “green eggs”, “eggs and”, “and ham” } (Size 7)
  • D4 → { “i do”, “do not”, “not like”, “like them”, “them sam”, “sam i”, “i am” } (Size 7)

Compute pairwise Jaccard (|intersection| / |union|):

  • D1 vs D2: Intersection { “i am” } → 1 / 3 ≈ 0.333
  • D2 vs D4: Intersection { “sam i”, “i am” } → 2 / 7 ≈ 0.286
  • D3 vs D4: Intersection { “i do”, “do not”, “not like” } → 3 / 11 ≈ 0.273

Result: D1 and D2 show the highest similarity due to sharing the exact phrase “i am”.
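The shingling-plus-Jaccard procedure for D1–D4 can be reproduced end to end (normalization here means lowercasing and stripping punctuation):

```python
import re

def shingles(text, k=2):
    """Set of overlapping k-word shingles from normalized text."""
    words = re.findall(r"[a-z]+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

docs = {"D1": "I am Sam.", "D2": "Sam I am.",
        "D3": "I do not like green eggs and ham.",
        "D4": "I do not like them, Sam I am."}
s = {name: shingles(t) for name, t in docs.items()}

print(round(jaccard(s["D1"], s["D2"]), 3))  # 0.333
print(round(jaccard(s["D2"], s["D4"]), 3))  # 0.286
print(round(jaccard(s["D3"], s["D4"]), 3))  # 0.273
```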


Apriori vs. PCY Algorithm Comparison (Q105)

  1. Market-Basket Model: A transactional dataset where each transaction is a set of items; used for finding frequent itemsets and association rules.
  2. Apriori: Iteratively generates candidate $k$-itemsets from frequent $(k-1)$-itemsets and requires multiple full database scans to count supports, relying on the downward closure property for pruning.
  3. PCY Algorithm: Improves Apriori by using a hash table in the first pass to count pairs and a bitmap to filter out unlikely candidate pairs, significantly reducing the number of pairs considered in subsequent passes.
  4. Efficiency Comparison: PCY (and its successors) is generally more efficient in practice for large datasets because hashing and bitmap compression drastically reduce candidate generation overhead and I/O.
  5. When to Prefer Apriori: For smaller datasets where implementation simplicity outweighs the performance gains of hashing techniques.
  6. Modern Practice: For very large-scale mining, algorithms like FP-growth or distributed methods (like SON) are often preferred over basic Apriori or PCY.
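The hash-table-and-bitmap idea in point 3 can be sketched as a first pass that filters candidate pairs; the baskets, the bucket count of 101, and the CRC32 stand-in hash are all illustrative choices:

```python
from itertools import combinations
import zlib

def bucket_of(pair, n_buckets):
    """Deterministic stand-in hash mapping an item pair to a bucket."""
    return zlib.crc32("|".join(pair).encode()) % n_buckets

def pcy_candidate_pairs(baskets, minsup, n_buckets=101):
    """PCY pass 1: count single items AND hash every pair in every
    basket into a bucket. A pair survives to pass 2 only if both its
    items are frequent and its bucket count reached minsup (bitmap)."""
    item_count = {}
    bucket = [0] * n_buckets
    for b in baskets:
        for item in b:
            item_count[item] = item_count.get(item, 0) + 1
        for pair in combinations(sorted(b), 2):
            bucket[bucket_of(pair, n_buckets)] += 1
    frequent = {i for i, n in item_count.items() if n >= minsup}
    bitmap = [n >= minsup for n in bucket]  # one bit per bucket in practice
    return [p for p in combinations(sorted(frequent), 2)
            if bitmap[bucket_of(p, n_buckets)]]

baskets = [{"bread", "butter"}, {"bread", "butter", "milk"},
           {"bread", "milk"}, {"butter", "jam"}]
cands = pcy_candidate_pairs(baskets, minsup=2)
print(cands)
```

Pairs that pass the bitmap test are only candidates: hash-bucket collisions can admit false positives, so pass 2 still counts their true supports.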
