Data Mining and Knowledge Discovery: Core Concepts
Multilevel Association Rules
Hierarchy Matters: Items in databases are structured from general to specific (e.g., Food → Dairy → Milk → Amul Milk).
Low-Level Items: Rarely appear, so have low support.
Support & Confidence: Support decreases as items become more specific; confidence often remains similar across levels of the hierarchy.
Approaches: Uniform Minimum Support uses the same support for all levels, but may miss specific patterns. Reduced Minimum Support assigns different supports; higher levels get a higher threshold, while lower levels get a smaller threshold.
Search Strategies: Independent Search mines levels separately. Level-Cross Filtering explores child nodes only if the parent is frequent. Controlled Level-Cross Filtering balances exploration and efficiency.
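The reduced-minimum-support idea with level-cross filtering can be sketched as follows; the hierarchy, support values, and thresholds are made-up illustrative numbers, not from any real dataset.

```python
# A minimal sketch of reduced minimum support with level-cross filtering:
# each level has its own (shrinking) threshold, and a child item is
# examined only if its parent is frequent. All values are hypothetical.

hierarchy = {
    "Dairy": ["Milk", "Cheese"],          # level 1 -> level 2
    "Milk": ["Amul Milk", "Toned Milk"],  # level 2 -> level 3
}

support = {  # fraction of transactions containing each item (illustrative)
    "Dairy": 0.40, "Milk": 0.25, "Cheese": 0.10,
    "Amul Milk": 0.08, "Toned Milk": 0.03,
}

min_support = {1: 0.30, 2: 0.15, 3: 0.05}  # threshold shrinks at lower levels

def frequent(item, level):
    return support.get(item, 0.0) >= min_support[level]

# Level-cross filtering: descend into children only if the parent is frequent.
def mine(item, level, found):
    if not frequent(item, level):
        return
    found.append((item, level))
    for child in hierarchy.get(item, []):
        mine(child, level + 1, found)

found = []
mine("Dairy", 1, found)
print(found)  # Cheese and Toned Milk are pruned by their level thresholds
```

Note how "Cheese" is never expanded: once a node fails its level's threshold, its whole subtree is skipped, which is exactly the efficiency gain of level-cross filtering.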
Multidimensional Association Rules
Multiple Dimensions: Involves two or more attributes (e.g., age, buys, gender), not just one.
Types of Rules: Single-dimensional rules involve one predicate (buys(X, "Butter") → buys(X, "Milk")). Inter-dimensional rules involve multiple distinct predicates with no repeats (gender(X, "Male") & salary(X, "High") → buys(X, "Computer")). Hybrid-dimensional rules combine both: a predicate is repeated alongside other predicates (age(X, "20-29") & buys(X, "Laptop") → buys(X, "Printer")).
Types of Attributes: Categorical attributes are unordered (e.g., color, brand). Quantitative attributes are ordered or numeric (e.g., age, income).
Techniques: Static Discretization converts numeric data to ranges (e.g., age: 20-30). Quantitative Rules associate numeric data with categorical data. The ARCS System clusters quantitative data into bins to find patterns. Distance Association uses similarity or distance between data points.
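Static discretization can be sketched in a few lines: a numeric attribute (age) is mapped onto predefined ranges before mining. The bin edges here are illustrative choices.

```python
# A minimal sketch of static discretization: converting a numeric
# attribute into categorical ranges. Bin edges are illustrative.

bins = [(20, 30), (30, 40), (40, 50)]

def discretize(age):
    for lo, hi in bins:
        if lo <= age < hi:
            return f"{lo}-{hi}"
    return "other"

ages = [22, 35, 41, 55]
print([discretize(a) for a in ages])  # ['20-30', '30-40', '40-50', 'other']
```

After this step, the discretized values can be treated like any categorical attribute in rule mining.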
Data Processing and Cleaning Techniques
Data preprocessing prepares raw, incomplete, and inconsistent data for analysis. Its purpose is to enhance quality, consistency, and accuracy so analysis yields reliable results.
1. Data Cleaning
Goal: Detect, correct, or remove errors and inconsistencies to improve data quality.
Major Problems Addressed: Missing values, noisy data, duplicate records, and inconsistent data formats.
- a) Handling Missing Values: Ignoring tuples, manual filling, replacing with mean/median/mode, or using predictive models like regression.
- b) Handling Noisy Data: Binning, regression to fit data points to a line, or clustering to treat non-clustered data as outliers.
- c) Removing Duplicate Data: Compare all records and keep only one version.
- d) Correcting Inconsistent Data: Convert formats to one standard, standardize measurement units, and unify naming conventions.
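Two of the cleaning steps above — mean imputation for a missing value, and smoothing noisy data by bin means — can be sketched like this; the input list is made up for illustration.

```python
# A minimal sketch of two cleaning steps: filling a missing value with
# the mean, and smoothing noisy data by equal-size bin means.

values = [4, 8, None, 15, 21, 25]  # None marks a missing value

known = [v for v in values if v is not None]
mean = sum(known) / len(known)
filled = [mean if v is None else v for v in values]

# Smoothing by bin means: split sorted data into equal-size bins and
# replace each value with its bin's mean.
def smooth_by_bin_means(data, bin_size):
    data = sorted(data)
    out = []
    for i in range(0, len(data), bin_size):
        chunk = data[i:i + bin_size]
        m = sum(chunk) / len(chunk)
        out.extend([m] * len(chunk))
    return out

print(filled)
print(smooth_by_bin_means(filled, 3))
```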
2. Data Integration: Combine data from multiple sources into one dataset, resolving conflicts in naming, units, and redundancies.
3. Data Transformation: Convert data into a format suitable for analysis. Includes normalization (scaling), aggregation (combining values), and generalization (replacing low-level data with high-level concepts).
4. Data Reduction: Minimizes data volume while keeping important information to improve efficiency. Techniques include dimensionality reduction, data cube aggregation, and numerosity reduction.
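The normalization mentioned under Data Transformation is often min-max scaling into [0, 1]; a minimal sketch with made-up income values:

```python
# A minimal sketch of min-max normalization, a common transformation
# step: scale values linearly into [new_min, new_max].

def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

incomes = [30000, 45000, 60000, 90000]
print(min_max(incomes))  # [0.0, 0.25, 0.5, 1.0]
```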
Web Mining Fundamentals
Web mining is the process of discovering and extracting useful information or patterns from large volumes of web data.
Aim: Turn raw web data (pages, hyperlinks, server logs) into usable knowledge.
Combines Technologies: Data mining, machine learning, artificial intelligence, statistics, and information retrieval.
Main Techniques: Automatically extract both structured data (tables, link graphs) and unstructured data (free text, multimedia).
Three Key Types of Web Mining: Web Content Mining, Web Structure Mining, and Web Usage Mining.
Web Content Mining
Extracts meaningful information from the actual content of web pages (text, images, audio, video, tables, metadata). Its objective is to analyze, filter, and retrieve relevant data from large, unstructured web sets.
Features: Discovers valuable information from news, blogs, and social media; handles both structured and unstructured data; organizes information for relevance; and improves SEO and recommendation systems.
Types:
- Unstructured Data Mining: Uses Natural Language Processing (NLP) for summarization and sentiment analysis, and Information Extraction (IE) to find names, dates, and locations.
- Structured Data Mining: Extracts data from tables, metadata, and database entries.
Techniques: Text mining, keyword extraction, topic modeling, multimedia mining, and web crawling.
Web Structure Mining
A branch of web mining that analyzes the structure (links and connections) of websites using graph theory to find patterns based on hyperlinks.
Key Techniques:
- Web Graph: A directed graph where nodes represent webpages and edges represent hyperlinks.
- PageRank Algorithm: Used by Google to rank webpages based on the number and quality of incoming links.
- Hubs and Authorities Model: Hubs are pages that link to many good authorities; authorities are pages linked to by many good hubs. The two scores reinforce each other.
Applications: Social network analysis, website relevance, website completeness, and search engine optimization.
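The PageRank idea can be sketched as power iteration on a tiny made-up web graph; the damping factor 0.85 is the commonly cited value, and the graph itself is hypothetical.

```python
# A minimal sketch of PageRank: a page's rank is split among the pages
# it links to, with a damping factor for random jumps. Graph is made up.

graph = {  # page -> pages it links to (hypothetical)
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def pagerank(graph, damping=0.85, iterations=50):
    n = len(graph)
    rank = {p: 1.0 / n for p in graph}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in graph}
        for page, links in graph.items():
            share = damping * rank[page] / len(links)
            for target in links:
                new[target] += share
        rank = new
    return rank

ranks = pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Here page C ends up ranked highest: it receives links from both A and B, matching the intuition that rank depends on the number and quality of incoming links.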
ETL: Extract, Transform, Load
ETL prepares raw data for analysis by cleaning, structuring, and integrating it from various sources into a target system.
- Extraction: Gather raw data from structured (SQL), semi-structured (JSON/XML), and unstructured (logs/files) sources.
- Transformation: Convert data into a consistent format through filtering, sorting, aggregating, and applying business rules.
- Loading: Move transformed data into a data warehouse or data lake using full or incremental loads.
Importance: Ensures data quality, enables business intelligence, and enhances decision-making speed.
Knowledge Discovery in Databases (KDD)
KDD is the step-by-step process of extracting valid, novel, and potentially useful knowledge from large datasets.
- Data Selection: Pick relevant data for the analysis.
- Data Preprocessing: Remove noise, fix missing values, and remove duplicates.
- Data Transformation: Merge and convert data into a suitable format.
- Data Mining: Apply algorithms for classification, clustering, or association rule mining.
- Pattern Evaluation: Identify significant patterns using support, confidence, and lift.
- Knowledge Representation: Present patterns visually using charts or decision trees.
- Knowledge Deployment: Integrate discovered knowledge into real-world systems.
Decision Tree Classification
A supervised learning technique that identifies patterns by splitting data into smaller branches, creating a tree-like model.
Structure: Root node (starting point), branches (decision paths), internal nodes (decisions), and leaf nodes (final outcomes).
Splitting Criteria:
- Gini Impurity: Measures how often a randomly chosen sample would be misclassified if it were labeled according to the class distribution at the node.
- Entropy and Information Gain: Entropy measures randomness; Information Gain measures the reduction in entropy after a split.
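Both splitting criteria can be computed directly from class counts; the counts below are illustrative.

```python
import math

# A minimal sketch of the two splitting criteria above, computed from
# class counts at a node. Counts are hypothetical.

def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def information_gain(parent, children):
    total = sum(sum(c) for c in children)
    weighted = sum(sum(c) / total * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = [5, 5]            # 5 positive, 5 negative samples at the node
split = [[4, 1], [1, 4]]   # class counts in the two child nodes
print(gini(parent))                     # 0.5 (maximally mixed)
print(information_gain(parent, split))  # entropy drop from the split
```

A split with higher information gain (larger reduction in entropy) is preferred when growing the tree.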
HITS Algorithm
The Hyperlink Induced Topic Search (HITS) algorithm ranks webpages based on link structures. It identifies Hubs (pages that point to many relevant pages) and Authorities (pages pointed to by many hubs).
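The HITS update rules can be sketched on a tiny made-up link graph: authority scores are summed from incoming hubs, hub scores from the authorities they point to, with normalization each round.

```python
import math

# A minimal sketch of HITS: alternately update authority scores (from
# incoming hub scores) and hub scores (from outgoing authority scores),
# normalizing each iteration. The link graph is hypothetical.

links = {"A": ["B", "C"], "B": ["C"], "C": []}
pages = list(links)

hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(20):
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    hub = {p: sum(auth[t] for t in links[p]) for p in pages}
    for scores in (auth, hub):
        norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
        for p in scores:
            scores[p] /= norm

print(max(auth, key=auth.get), max(hub, key=hub.get))
```

In this toy graph, C (pointed to by A and B) emerges as the top authority and A (linking to both) as the top hub.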
DBSCAN Clustering
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) clusters data points based on density. It handles arbitrary shapes and detects noise.
Key Parameters: ε (eps) (neighborhood radius) and MinPts (minimum neighbors to form a cluster).
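A minimal 1-D sketch of the DBSCAN idea, using the two parameters above; eps, MinPts, and the points are illustrative, and real implementations work in higher dimensions with spatial indexes.

```python
# A minimal sketch of DBSCAN on 1-D points: core points (enough
# neighbors within eps) seed clusters and are expanded; points that
# never join a cluster are labeled noise (-1). Values are illustrative.

def dbscan(points, eps, min_pts):
    labels = {}   # point index -> cluster id (-1 = noise)
    cluster = -1

    def neighbors(i):
        return [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]

    for i in range(len(points)):
        if i in labels:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:   # not a core point: mark as noise for now
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels.get(j) == -1:    # noise reached from a core point
                labels[j] = cluster    # becomes a border point
            if j in labels:
                continue
            labels[j] = cluster
            more = neighbors(j)
            if len(more) >= min_pts:   # j is also a core point: expand
                queue.extend(more)
    return labels

points = [1.0, 1.2, 1.4, 8.0, 8.1, 25.0]
labels = dbscan(points, eps=0.5, min_pts=2)
print(labels)  # two dense clusters; the isolated 25.0 is noise
```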
Data Integration Approaches
1. Tight Coupling: Data is physically gathered and stored centrally in a data warehouse. Offers high data quality but can be expensive and inflexible.
2. Loose Coupling: Data remains in source systems; integration occurs virtually on-demand. Offers real-time access but can be slower for complex queries.
CLARANS in Web Mining
CLARANS (Clustering Large Applications based on Randomized Search) is an advanced clustering algorithm that improves on k-medoids by using randomized searching. It scales well for large, dynamic web data.
Market-Basket Analysis
Used to find associations between products bought together. It utilizes metrics like Support (frequency), Confidence (likelihood of the consequent given the antecedent), and Lift (strength of the association relative to independence; lift > 1 indicates positive correlation). Common algorithms include Apriori and FP-Growth.
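The three metrics can be computed directly from transactions; the tiny transaction set below is made up for illustration.

```python
# A minimal sketch of support, confidence, and lift for the rule
# milk -> bread, on a hypothetical transaction set.

transactions = [
    {"milk", "bread"},
    {"milk", "butter"},
    {"milk", "bread", "butter"},
    {"bread"},
]
n = len(transactions)

def support(items):
    # Fraction of transactions containing all the given items.
    return sum(items <= t for t in transactions) / n

sup_milk_bread = support({"milk", "bread"})       # 2/4 = 0.5
confidence = sup_milk_bread / support({"milk"})   # 0.5 / 0.75
lift = confidence / support({"bread"})            # vs. bread's base rate

print(sup_milk_bread, confidence, lift)
```

Here lift comes out below 1, meaning buying milk actually makes bread slightly *less* likely than its base rate in this toy data.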
OLAP Operations
Online Analytical Processing (OLAP) allows multidimensional analysis of data. Key operations include:
- Roll-up: Summarize data.
- Drill-down: Move from summary to detail.
- Slice: Select a single dimension.
- Dice: Select a subcube with multiple dimensions.
- Pivot: Reorient the cube.
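Roll-up and slice can be sketched on a tiny fact table of (city, quarter, sales); the rows are illustrative.

```python
# A minimal sketch of two OLAP operations on a hypothetical fact table.

facts = [
    ("Delhi", "Q1", 100), ("Delhi", "Q2", 120),
    ("Mumbai", "Q1", 90), ("Mumbai", "Q2", 150),
]

# Roll-up: summarize sales from (city, quarter) up to city totals.
rollup = {}
for city, quarter, sales in facts:
    rollup[city] = rollup.get(city, 0) + sales

# Slice: fix one dimension (quarter = "Q1") and keep the rest.
slice_q1 = [(c, s) for c, q, s in facts if q == "Q1"]

print(rollup)    # {'Delhi': 220, 'Mumbai': 240}
print(slice_q1)  # [('Delhi', 100), ('Mumbai', 90)]
```

Drill-down is simply the inverse of the roll-up: returning from city totals to the per-quarter detail rows.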
Hierarchical Clustering
Builds a hierarchy of clusters using Agglomerative (bottom-up) or Divisive (top-down) methods. Results are visualized in a dendrogram.
Data Warehouse Features
- Central repository, supports decision-making, stores historical data, integrated and consistent, subject-oriented, non-volatile, and time-variant.
Performance Formulas
Accuracy: (TP+TN) / (TP+TN+FP+FN)
Precision: TP / (TP+FP)
Recall: TP / (TP+FN)
F1-Score: 2 * (Precision * Recall) / (Precision + Recall)
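The four formulas above can be computed from confusion-matrix counts; the counts here are hypothetical.

```python
# A minimal sketch computing the four performance formulas from
# hypothetical confusion-matrix counts.

tp, tn, fp, fn = 40, 45, 5, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```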
FP-Tree (Frequent Pattern Growth)
Mines frequent itemsets without generating candidate sets. It compresses transactional data into an FP-Tree and recursively mines conditional FP-trees to extract patterns.
Naive Bayesian Classification
A probabilistic classifier based on Bayes' Theorem, assuming feature independence. It is simple, fast, and effective for large datasets.
K-Medoids Clustering (PAM)
Partitions data into k clusters using actual data points (medoids) as centers, making it more robust to outliers than K-means.
BIRCH Algorithm
Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is designed for large datasets. It uses a CF Tree to store compact summaries of data, allowing for incremental and efficient clustering.
Classification Issues
Common challenges include overfitting (model too complex), underfitting (model too simple), class imbalance, high dimensionality, and noise.
Data Sampling Techniques
Probability Sampling: Simple random, systematic, stratified, cluster, and multi-stage sampling.
Non-Probability Sampling: Convenience, judgmental, quota, and snowball sampling.
Outlier Detection
Outliers are data points that deviate significantly from the rest. Types include Global (point), Collective (group behavior), and Contextual (deviates only in specific conditions).