Data Science, Machine Learning, and AI Concepts
Data Science, Machine Learning, and Artificial Intelligence
| Aspect | Data Science | Machine Learning (ML) | Artificial Intelligence (AI) |
|---|---|---|---|
| Definition | A field that deals with extracting insights from structured and unstructured data. | A subset of AI that enables systems to learn from data without explicit programming. | A broad field that aims to create intelligent systems that mimic human cognition. |
| Scope | Involves data collection, cleaning, analysis, visualization, and predictive modeling. | Focuses on developing models that can make predictions or decisions based on data. | Encompasses various technologies, including ML, robotics, and expert systems. |
| Key areas | Data wrangling, statistics, data visualization, and predictive analytics. | Supervised, unsupervised, and reinforcement learning. | Natural language processing (NLP), computer vision, robotics, and expert systems. |
| Goal | Extract meaningful insights and patterns from data. | Develop models that improve performance with experience. | Create intelligent systems capable of decision-making and problem-solving. |
| Skills required | Statistics, programming (Python, R), data visualization. | Mathematics, ML algorithms, programming (Python, TensorFlow, Scikit-learn). | Cognitive science, ML, robotics, knowledge representation. |
Bayes' Theorem in Probability Theory
Bayes' theorem is a fundamental principle in probability theory that updates the probability of an event based on new evidence. It is stated as:
P(A∣B) = P(B∣A) · P(A) / P(B)
where:
P(A∣B) → Probability of event A occurring given that B has occurred (Posterior Probability).
P(B∣A) → Probability of event B occurring given that A has occurred (Likelihood).
P(A) → Prior probability of event A occurring before considering evidence B.
P(B) → Total probability of event B occurring (Marginal Probability).
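As a rough illustration, the sketch below applies the relation P(A∣B) = P(B∣A) · P(A) / P(B) to a diagnostic-test scenario. The numbers (1% prevalence, 95% sensitivity, 5% false-positive rate) and the function name `posterior` are assumptions made up for this example, not values from the text. The marginal P(B) is expanded with the law of total probability.

```python
def posterior(prior_a: float, p_b_given_a: float, p_b_given_not_a: float) -> float:
    """Return P(A|B) via Bayes' theorem.

    P(B) is computed with the law of total probability:
    P(B) = P(B|A) * P(A) + P(B|not A) * (1 - P(A))
    """
    marginal_b = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / marginal_b


# Hypothetical inputs: A = "has the condition", B = "test is positive".
p = posterior(prior_a=0.01, p_b_given_a=0.95, p_b_given_not_a=0.05)
print(round(p, 3))  # prints 0.161
```

Even with a fairly accurate test, the posterior stays low here because the prior P(A) is small.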
Population vs. Sample
A population includes all possible data points, whereas a sample is a smaller portion used to make generalizations about the population. Sampling is crucial in data science because analyzing an entire population is often impractical.
| Aspect | Population | Sample |
|---|---|---|
| Definition | The entire set of individuals, objects, or data points under study. | A subset of the population selected for analysis. |
| Size | Usually large or infinite. | Smaller and manageable. |
| Accuracy | Describes the population exactly, but is often impractical to obtain. | Subject to sampling error, but easier to collect and analyze. |
| Example | All customers of Amazon. | A group of 1,000 randomly selected Amazon customers. |
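The sketch below shows how a random sample can approximate a population statistic. The population is synthetic (100,000 values drawn from a normal distribution with an assumed mean of 50 and standard deviation of 10), so the data and variable names are illustrative only.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Synthetic "population": hypothetical values, not real customer data.
population = [random.gauss(50, 10) for _ in range(100_000)]

# Simple random sample of 1,000 items, drawn without replacement.
sample = random.sample(population, 1000)

pop_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

print(f"population mean: {pop_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
print(f"difference:      {abs(pop_mean - sample_mean):.2f}")
```

The sample mean will generally differ slightly from the population mean. That gap is the sampling error, and it tends to shrink as the sample size grows.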
Data Science vs. Information Science
Data Science is an interdisciplinary field that combines statistics, programming, and domain expertise to extract meaningful insights from structured and unstructured data. It involves processes such as data collection, cleaning, analysis, visualization, and predictive modeling using machine learning and AI techniques.
| Aspect | Data Science | Information Science |
|---|---|---|
| Definition | The study of data to extract insights and make data-driven decisions using statistical and computational techniques. | The study of how information is collected, stored, retrieved, and communicated effectively. |
| Focus | Focuses on data analysis, machine learning, and predictive modeling. | Focuses on information management, retrieval, and human interaction with information. |
| Key components | Data collection, statistics, programming, ML, AI, data visualization. | Library science, knowledge organization, information retrieval, data management. |
| Nature of work | Deals with raw data and builds models to find patterns. | Deals with information and ensures efficient organization and retrieval. |
| Applications | Business analytics, AI development, fraud detection, healthcare, finance. | Library science, digital archiving, search engines, database management, UX design. |
Measures of Distribution Shape
1. Skewness:
Skewness measures the asymmetry of a data distribution. It indicates whether data is symmetrically distributed around the mean or skewed to one side.
- Positive Skew (Right-Skewed, Skewness>0): Tail is longer on the right side. Example: Income distribution (a few people earn significantly more).
- Negative Skew (Left-Skewed, Skewness<0): Tail is longer on the left side. Example: Exam scores (many students score high, few score very low).
- Zero Skew (Symmetric, Skewness=0): Data is evenly distributed around the mean. Example: Normal distribution (bell curve).
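The three cases above can be checked with a small sketch. It uses the population (moment-based) definition of skewness, the third central moment divided by the cube of the standard deviation. The function name `skewness` and the toy datasets are assumptions for illustration.

```python
import statistics


def skewness(data):
    """Moment-based skewness: m3 / m2**1.5 (population form, no bias correction)."""
    n = len(data)
    mean = statistics.fmean(data)
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5


right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]   # one large value stretches the right tail
left_skewed = [-x for x in right_skewed]   # mirror image: long left tail
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed) > 0)  # True
print(skewness(left_skewed) < 0)   # True
print(skewness(symmetric) == 0)    # True
```

Statistical packages often apply a small-sample bias correction, so their values may differ slightly from this uncorrected form.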
2. Kurtosis:
Kurtosis measures the tailedness (how extreme values or outliers are distributed) of a dataset compared to a normal distribution.
- Leptokurtic (High Kurtosis, Kurtosis>3): Heavy tails (more extreme outliers). Example: Stock market crashes.
- Mesokurtic (Normal Kurtosis, Kurtosis=3): Follows a normal distribution. Example: Ideal bell curve.
- Platykurtic (Low Kurtosis, Kurtosis<3): Light tails (fewer extreme outliers). Example: Uniform distributions (data is evenly spread).
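A minimal sketch of the kurtosis calculation follows, using the convention in the list above where a normal distribution scores 3 (the fourth central moment divided by the squared variance). The function name `kurtosis` and the toy datasets are assumptions for illustration. Note that some libraries report excess kurtosis (this value minus 3) by default.

```python
import statistics


def kurtosis(data):
    """Moment-based kurtosis: m4 / m2**2 (a normal distribution gives about 3)."""
    n = len(data)
    mean = statistics.fmean(data)
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in data) / n  # fourth central moment
    return m4 / m2 ** 2


# Evenly spread values approximate a uniform distribution (platykurtic).
evenly_spread = list(range(1, 101))

# Mostly identical values plus two extreme outliers (leptokurtic).
heavy_tailed = [0.0] * 98 + [-10.0, 10.0]

print(kurtosis(evenly_spread) < 3)  # True: light tails
print(kurtosis(heavy_tailed) > 3)   # True: heavy tails
```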
Degrees of Freedom (DoF)
Degrees of Freedom (DoF) refers to the number of independent values that can vary in a statistical calculation without violating any constraints. It is crucial in hypothesis testing, regression analysis, and statistical modeling.
Formula for Degrees of Freedom: DoF = Number of Observations − Number of Constraints
- Determines the reliability of statistical tests (e.g., t-test, chi-square test).
- Affects critical values in probability distributions.
- Helps in assessing model complexity (a model with many parameters leaves few residual degrees of freedom, which increases the risk of overfitting).
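A common place where degrees of freedom appear is the sample variance. Once the sample mean is fixed, only n − 1 of the deviations can vary freely, so the sum of squares is divided by n − 1 rather than n. The sketch below uses made-up data to show this and cross-checks the result against `statistics.variance` from the standard library.

```python
import statistics

data = [4.0, 7.0, 6.0, 3.0, 5.0]  # hypothetical observations
n = len(data)

mean = sum(data) / n
sum_of_squares = sum((x - mean) ** 2 for x in data)

# One constraint (the estimated mean) is subtracted from the observation count.
dof = n - 1
sample_variance = sum_of_squares / dof

print(dof)                        # prints 4
print(sample_variance)            # prints 2.5
print(statistics.variance(data))  # prints 2.5 (also divides by n - 1)
```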
Data Discretization Techniques
Data Discretization is the process of converting continuous data (numeric values) into discrete categories or intervals. It is commonly used in data mining, machine learning, and statistical analysis to improve model interpretability and efficiency.
For example, instead of using continuous age values (e.g., 23, 35, 42), we can categorize them as: Young (0-25), Middle-aged (26-50), Senior (51+).
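The age grouping above can be sketched with fixed thresholds. The helper name `age_group` is hypothetical; the boundaries (25 and 50 as inclusive upper edges) come from the example ranges in the text.

```python
from bisect import bisect_left

UPPER_EDGES = [25, 50]  # inclusive upper bounds for "Young" and "Middle-aged"
LABELS = ["Young", "Middle-aged", "Senior"]


def age_group(age: int) -> str:
    """Map a numeric age to its category using the fixed thresholds."""
    # bisect_left keeps an age equal to an edge (e.g. 25) in the lower group.
    return LABELS[bisect_left(UPPER_EDGES, age)]


for age in (23, 35, 42):
    print(age, age_group(age))
# prints:
# 23 Young
# 35 Middle-aged
# 42 Middle-aged
```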
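- Binning (Equal-Width & Equal-Frequency): Divides data into a fixed number of bins (intervals). Equal-width bins span ranges of the same size, while equal-frequency bins each hold roughly the same number of data points.
- Top-Down Splitting (Recursive Splitting): A hierarchical approach that starts with the full value range and recursively splits it into smaller intervals.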
- Clustering-Based Discretization: Uses clustering algorithms (e.g., k-means) to group data into clusters and assign each cluster to a category.
- Supervised Discretization: Uses a target variable (label) to determine bin boundaries.
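The two binning variants from the list above can be sketched in plain Python. The function names and the sample ages are assumptions for illustration, and each function returns a zero-based bin index per value. Clustering-based and supervised discretization are not shown here.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:  # all values identical: everything falls in bin 0
        return [0] * len(values)
    # The maximum value would land on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)  # rank-based bin index
    return bins


ages = [23, 35, 42, 18, 67, 51, 29, 74, 60]  # hypothetical data

print(equal_width_bins(ages, 3))      # prints [0, 0, 1, 0, 2, 1, 0, 2, 2]
print(equal_frequency_bins(ages, 3))  # prints [0, 1, 1, 0, 2, 1, 0, 2, 2]
```

The value 35 lands in different bins under the two schemes: equal-width boundaries depend only on the minimum and maximum, whereas equal-frequency boundaries depend on how the values are ranked.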