Foundations of Data Analytics, Databases, and Statistics
Introduction to Data Analytics
Types of Data Analytics
- Descriptive: Analyzes past trends.
- Predictive: Applies past trends to current data to forecast future outcomes.
- Prescriptive: Suggests actions and outlines potential impacts.
Project Stages
- Problem Specification:
- Understand the Problem Statement.
- Define the Project Scope.
- Data Gathering & Preprocessing:
- Define a system for data collection.
- Clean the data through preprocessing.
- Descriptive Analytics:
- Perform Exploratory Data Analysis (EDA).
- Get a basic understanding of the dataset.
- Answer initial assumptions about the data.
- Machine Learning:
- Apply correct ML models depending on the scope/nature of the data and project.
- Train the ML model and assess its performance.
- Deployment:
- Consult with project stakeholders on the suitability of the model.
- Deploy the model for live usage.
Tools
- R
- Python
- Weka
- RapidMiner
- Orange
- Tableau
- Power BI
- IBM SPSS Modeler
- SAS Data Mining
- Oracle Data Mining
Major Steps in Data Analytics
Data Cleaning & Integration (From DB) -> Data Mining -> Pattern Evaluation -> Knowledge Discovery
Step 1: Data Preprocessing (60% Effort)
- Data Integration: Combining data from multiple sources.
- Data Cleaning: Handling noisy values/outliers.
- Data Filling: Addressing missing values.
- Data Selection: Selecting relevant attributes & objects.
- Data Reduction & Transformation: Simplifying or changing data structure.
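A minimal pandas sketch of the integration, cleaning, filling, and selection steps above; the file names and the machine_id/temperature columns are hypothetical, invented for illustration:

import pandas as pd

# Hypothetical file and column names, for illustration only.
df = pd.read_csv("sensor_readings.csv")

# Data Integration: combine with a second (hypothetical) source.
machines = pd.read_csv("machines.csv")
df = df.merge(machines, on="machine_id", how="left")

# Data Cleaning: treat extreme outliers by clipping to the 1st-99th percentiles.
low, high = df["temperature"].quantile([0.01, 0.99])
df["temperature"] = df["temperature"].clip(low, high)

# Data Filling: replace missing values with the column median.
df["temperature"] = df["temperature"].fillna(df["temperature"].median())

# Data Selection: keep only the attributes relevant to the analysis.
df = df[["machine_id", "temperature"]]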
Step 2: Data Mining
- Choose the function of data mining.
- Choose the mining algorithm.
- Search for patterns of interest.
Step 3: Knowledge Discovery
- Pattern Evaluation: Generalization, Re-mining.
- Knowledge Representation: Visualization.
- Decision Making: Use extracted knowledge.
Typical Data Analysis System
Data Analytics: Multiple Disciplines
- Data Processing
- Statistics
- Visualization
- Database Technology
- High-Performance Computing
- Machine Learning
- Application Development
Case Study: SMRT Circle Line
Marey Chart: Used in transport to visualize timetables of buses/trains.
Statistical Analytics Fundamentals
- The science of collection, presentation, analysis, and reasonable interpretation of data.
- Statistics presents a scientific method for gaining insight into data.
- Can provide graphical presentation, numerical summarization, make inferences, and predict variable relationships.
Statistical Methods
- Descriptive:
- Univariate: Shape, center, spread, relative positions.
- Bivariate: Correlation, regression/prediction.
- Multivariate: Multiple Regression.
- Inferential:
- Applied to means: t-test, dependent-groups t-test, ANOVA.
Variable Types and Forms
Variable Type
- Independent: Systematically varied by the researcher.
- Dependent: Observed values that depend on the effect of the independent variable.
Variable Forms
- Discrete: Finite set of values (e.g., yes/no, good/bad).
- Continuous: Any value on a continuous scale (e.g., height, weight, length, time).
Data Types
- Scale:
- Continuous: Measurements, can take any value.
- Discrete: Counts/integers.
- Categorical:
- Ordinal: Has an obvious order.
- Nominal: Has no meaningful order.
Statistical Analysis Overview
- Provides a compact description for a subset of data.
- Provides mean, standard deviation, and other statistical features.
- Includes correlation coefficient, multivariate visualization techniques, and functional relationships between variables (regression).
Statistical Description: Numerical Data
Focuses on Center, Variability, and Shape.
Statistical Description: Categorical Data
Focuses on Frequency, Percentage, and Proportion for each category.
Choosing the Right Graph for Data
- Table: Can display a large amount of data.
- Graph/Chart: Visual, easier to detect patterns.
- Label the Components of the Graphic: Interpretation adds meaning to the data by making connections and comparisons.
Note: Service data are good at tracking processes and identifying concerns but do not show causality.
Supervised Learning Basics
- Uses labeled data.
- Models relationships and dependencies between the target prediction output and input features.
Statistical Models in Supervised Learning
- Regression models: Linear, Non-Linear.
- Time Series Models: ARMA, Moving Average.
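As a minimal illustration of a linear regression model, a least-squares fit with NumPy; the data points are invented:

import numpy as np

# Toy data: input feature vs. target output, invented for illustration.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit y = a*x + b by least squares.
a, b = np.polyfit(x, y, deg=1)
print(f"slope={a:.2f}, intercept={b:.2f}")  # slope≈1.96, intercept≈0.14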
Databases and Data Warehouses
Evolution of Database Technology
Database Types
SQL (Relational)
Relational (table-based) databases, structured data.
NoSQL
Handles unstructured data, suitable for large amounts of data (e.g., documents, videos, audio files). Example: MongoDB.
Object-Oriented Database Management Systems (OODBMS)
Align database operations with object-oriented programming practices; not widely adopted.
Graph Database
Handles numerous relationships that traditional databases cannot easily manage; real-world objects often have complex relationships.
Consists of a collection of nodes and edges, where edges represent the relationships between nodes.
Relationships between data are the priority; querying relationships is fast because they are stored directly in the database.
Can model any kind of relationship (e.g., a system of roads, network of devices, a population's medical history).
SQL Databases Explained
Databases are made up of entities and relationships.
Entity Relationship Diagrams (ERDs)
Visually represent the components that make up a database.
Attributes of an Entity
Example: Entity = Customer, Attributes = CustomerID, Name, Phone, Email, etc.
Database Model Stages
- Conceptual Database Model:
- Represents the entities that make up the database.
- Includes entities and their general relationships but does not include entity attributes.
- Independent of database type (Relational, NoSQL, Graph).
- Logical Database Model:
- Includes entity attributes.
- Physical Database Model:
- Specifies each attribute's data type (e.g., integer, floating point, string).
- Specifies fields that will be primary and foreign keys (creates the relationships between entities).
Primary Keys
- A primary key uniquely identifies each record.
- Cannot be NULL and must be unique.
Foreign Keys
- Entities have relationships with one another; related tables must have common fields for such relationships.
- Databases use foreign keys to enforce referential integrity, preventing a foreign key field from being assigned a value that has no corresponding primary key value.
Microsoft SQL Commands
- Create:
CREATE TABLE Customer (CustomerID int PRIMARY KEY, Name varchar(255), Phone varchar(64));
- Insert:
INSERT INTO Customer (CustomerID, Name, Phone) VALUES (111, 'John', '12345678');
Note: If an INSERT query does not specify values for each field, SQL will assign NULL (or the column's default, if one is defined) to the remaining fields.
- Update:
UPDATE Customer SET Phone = '98765432' WHERE Name = 'John';
- Select:
SELECT * FROM Customer;
SELECT * FROM Customer ORDER BY CustomerID;
SELECT [CustomerID], [Name], [Phone] FROM [Customer].[dbo].[Customer] WHERE CustomerID = 113;
- Delete:
DELETE FROM [Customer].[dbo].[Customer] WHERE CustomerID = 113;
SELECT * FROM [Customer].[dbo].[Customer];
- Drop:
DROP DATABASE DatabaseName;
DROP TABLE TableName;
Note: When using DROP, SQL will delete the object regardless of the data it contains.
- Foreign Keys (Join Example):
SELECT O.OrderID, C.CustomerName, O.OrderDate FROM Orders O INNER JOIN Customers C ON O.CustomerID=C.CustomerID;
SQL Database Summary
- Structure: Tables (relational model).
- Storage: Centralized.
- Scale: Vertical (better machine, bigger hard disk).
- Access: Raw SQL, Direct DB Connection, Object-Relational Mappers (ORMs).
Databases and Data Warehouses (Continued)
Online Transaction Processing (OLTP)
Relates to applications performing database transactions, like e-commerce sites tracking product sales or production systems tracking machines, products, quality, etc.
Online Analytical Processing (OLAP)
Includes reporting, dashboards, and decision support tools.
Note: Normally, OLTP databases are the data source for OLAP operations.
OLTP vs. OLAP Comparison
Data Warehouse Concepts
- A warehouse designed to optimize analytic operations.
- Companies usually have databases to store transactional data (like product sales, inventory levels), but these are not optimized for high-performance data analytics, decision support, or dashboard performance.
- It's normal to have databases at many locations. To analyze company-wide operations, data must be gathered from each database, focusing on interesting records.
- Flow: Operational Data & Flat Files & SAP Data -> Data Warehouse (Collect, Analyze, View/Present Data) -> Local Clients, Defined Queries, World Wide Web.
- A collection of data to support management's decision-making process. Characteristics:
1. Subject-Oriented Data Warehouse
- Organized around major subjects like customer, product, sales.
- Focused on modeling and analysis for decision-makers, not daily operations or transaction processing.
- Provides a simple, concise view around particular subject issues by excluding data not useful in the decision support process.
2. Integrated Data Warehouse
- Constructed by integrating multiple heterogeneous data sources (e.g., relational databases, flat files, online transaction records) into a single, unified database.
- Data cleaning and integration are applied to ensure consistency in naming conventions, encoding structures, and attribute meanings among different data sources. Data is converted when moved to the warehouse.
3. Time-Variant Data Warehouse
- Provides access to a greater volume of detailed information over a longer period, with data associated with a point in time.
- The time horizon is significantly longer than that of an operational database (e.g., past 5-10 years vs. current value data).
- Every key structure contains an element of time, explicitly or implicitly, unlike operational data, which may or may not contain a time element.
4. Non-Volatile Data Warehouse
- Physically separate storage of data from the operational environment.
- Updates from operational DB data may not occur in the data warehouse immediately.
- Does not require transaction processing, recovery, and concurrency control mechanisms.
- Only needs two operations for data access: initial loading of data and access of data.
Database vs. Data Warehouse
- Database:
- Structured with a defined schema.
- Items organized as a set of tables with columns and rows.
- Columns include attributes; tables indicate an object or entity.
- Designed as transactional; not designed to perform data analytics efficiently.
- Data Warehouse:
- Built on top of several databases and used for business intelligence.
- Consumes data from operational databases and creates an optimized layer for data analytics.
- Schema is typically predefined (schema-on-write), although some platforms can derive it on import or query.
Datamarts Explained
Contains data for a specific business group.
Data Warehouse Design Considerations
- What reports are required?
- What and where are the sources of data?
- How frequently will new data be collected and loaded?
- How many users will simultaneously access the data warehouse?
- What access will different user types (finance, sales, operations) have?
- How will the Extract, Transform, and Load (ETL) operations be performed?
- What data marts are required?
- What changes are expected in the future for reporting, analytics, decision support tools, and dashboards?
- How will the data warehouse be backed up and with what frequency?
- What factors influence data quality?
- Who owns the data, and who are the primary stakeholders?
- Who will test the reports, data analytics, and dashboards for quality assurance?
Multi-Tiered Architecture
Types of Data Warehouses
- Enterprise Warehouse: Collects all information about subjects spanning the entire organization.
- Data Mart: A subset of corporate-wide data for a specific group of users (e.g., marketing data mart). Scope confined to specific groups.
- Virtual Warehouse: A set of views over operational databases; only some of the possible summary views may be materialized.
Conceptual Modeling of Data Warehouses
Modeling involves Facts and Dimensions.
- Star Schema: A central fact table connected to a set of dimension tables.
- Snowflake Schema: A refinement of the star schema where some dimensional hierarchies are normalized into smaller dimension tables, forming a snowflake shape.
- Fact Constellation (Galaxy Schema): Multiple fact tables share dimension tables, viewed as a collection of stars.
NoSQL Databases Introduction
- Structured data: Has a schema, fits in fixed fields in a table.
- Unstructured data: No fixed schema or structure.
- Needed to store documents, videos, and less 'structured' data.
- Needed to scale database storage and processing power up and down based on demand.
- Addresses the continued need for higher performance and reliability with unstructured data.
- NoSQL databases do not use SQL as their primary query mechanism.
- Store data in a less structured way, often using JSON objects.
JSON (JavaScript Object Notation)
- Self-describing, human-readable, language-independent, lightweight.
- Used between web browsers, web servers, and mobile applications.
- Developers store JSON Objects in text-based files, using the .json file extension.
- MongoDB uses collections (similar to tables) that store documents (similar to records).
- MongoDB provides queries to create and manage databases and collections and perform CRUD operations on documents (see the sketch after this list).
- Scaling can be done vertically (adding RAM & CPU to a server) or horizontally (adding more servers).
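A minimal sketch of MongoDB CRUD operations via the pymongo driver; the connection string, database name (shop), and collection name (customers) are placeholders:

from pymongo import MongoClient

# Placeholder connection string; point this at a real MongoDB instance.
client = MongoClient("mongodb://localhost:27017")
collection = client["shop"]["customers"]  # database "shop", collection "customers"

# Create: insert a JSON-like document (a Python dict).
collection.insert_one({"CustomerID": 111, "Name": "John", "Phone": "12345678"})

# Read: find one document matching a filter.
doc = collection.find_one({"CustomerID": 111})

# Update: modify the first matching document.
collection.update_one({"CustomerID": 111}, {"$set": {"Phone": "98765432"}})

# Delete: remove the first matching document.
collection.delete_one({"CustomerID": 111})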
NoSQL Database Pros and Cons
- Pros:
- Flexibility, scalability, schemaless, fast writes.
- Developers don't need to know the data structure ahead of time.
- Ideal for unstructured/semi-structured data, content management.
- Rapid prototyping and collecting high-traffic data.
- Cons:
- Querying is largely limited to a single document; relationships and cross-references are not enforced, and documents/collections cannot easily be joined in a single query.
- Lack of database enforcement requires developer discipline for application-level enforcement.
- Potential for data duplication.
NoSQL Database Summary
- Structure: Tables, Documents, Graphs.
- Storage: Distributed (data spread across partitions, e.g., by hashing keys).
- Scale: Horizontal (Add more Partitions/Servers).
- Access: Representational State Transfer (REST) API, CRUD in vendor-specific language.
Data Lakes Explained
- A centralized repository for structured and unstructured data storage.
- Used to store raw data as-is, without any predefined structure (schema).
- No need to perform transformation jobs before storage.
- Can store many types of data such as images, text files, videos.
- Processing is done by data experts; schema is defined on read (schema-on-read).
- Data lake files reside on specialized file systems like Hadoop Distributed File System (HDFS), Amazon S3, Microsoft Azure Data Lake Storage.
Data Lake vs. Data Warehouse Comparison
Fundamentals of Data Visualization
Importance of Visualization
Visualization is the representation of an object with the goal of improving communication.
Visualization Tools
- Excel
- Tableau
- Google Charts (integrate charts into HTML)
- Programming Tools like Python, R
Factors to consider when choosing:
- Price
- Learning Curve
- Support for web-based charts & dashboards
The Science of Visualization
Use the correct chart for the right purpose.
The Art of Visualization
- Use color effectively.
- Integrate complementary fonts.
- Highlight points of interest to communicate a message.
Visualization Best Practices
- Begin with the end goal in mind.
- Keep the audience in mind.
- Use the correct chart for the right purpose.
- Use color effectively.
- Use text labels to meaningfully describe data.
- Use axis values consistently.
- Focus on actionable content.
- Keep it simple.
- Build user trust.
- Make sure chart axis labels are easy to read.
- Use meaningful chart titles that begin to tell a story.
- Sort data appropriately.
- Draw the viewer's focus to key points.
Common Chart Types
1. Time-Based Charts
Represent how one or more sets of values change over time.
- Line Chart: Compare productivity for one week.
- Multi-Line Chart: Compare productivity from three similar machines.
- Area Chart: Represents a cumulative total over time.
2. Category-Based Charts
To compare one or more categories of values.
- Vertical/Horizontal Bar Chart: Compare productivity over 12 months.
- Column Chart: Similar to a vertical bar chart.
- Clustered/Grouped Bar or Column Chart: Compare productivity of 3 machines over 12 months.
- Radar Chart: Compare multiple quantitative variables.
- Combo Chart: Combine different chart types (e.g., bar and line).
- Difference Chart: Show differences between series.
- Waterfall Chart: Show daily total hours and its components; analyze productivity losses.
3. Composition Charts
Represent how one or more values relate to a larger whole.
- Pie Chart: Compare production volume for different customers.
Note: Viewers may find it hard to accurately determine the area of slices.
- Donut Chart: Addresses the pie chart's issue by using arcs instead of slices.
- Sunburst Chart: Visualize hierarchical data.
- Stacked Bar/Column Chart: Show composition (e.g., compare total production volume for different customers to understand different machines' contributions).
- Stacked Area Chart: Show composition over time.
4. Distribution Charts
Show how data points are distributed (e.g., Histogram, Box Plot).
5. Correlation Charts
Show the relationship between variables (e.g., Scatter Plot).
6. Dashboards
A collection of visualizations displayed together to provide an overview.
7. GeoCharts
Integrates maps with data values.
Data Presentation Methods
Graphical Presentation
- Look for overall patterns and striking deviations from that pattern.
- Overall pattern usually described by shape, center, and spread of the data.
- Outlier: An individual value that falls outside the overall pattern.
Numerical Presentation
Using summary statistics (mean, median, standard deviation, etc.).
Categorical vs. Numerical Variables Charts
- Categorical: Bar Diagrams, Pie Charts.
- Numerical: Histogram, Stem-and-Leaf Plot, Box Plot.
Gauge Charts
Provides an at-a-glance representation of a value, often used within a KPI dashboard.
Overall Equipment Effectiveness (OEE)
- One of the most important metrics for highlighting manufacturing productivity.
- Management gains important insight into how to systematically improve the manufacturing process.
- A good metric for identifying losses, eliminating waste, benchmarking progress, and improving the productivity of manufacturing equipment.
- Availability:
Availability = 990 / 1390 ≈ 0.71 or 71%
- Performance:
Performance = 12000 / 14850 ≈ 0.81 or 81%
- Quality:
Quality = 10000 / 12000 ≈ 0.83 or 83%
- OEE Calculation:
OEE = Availability * Performance * Quality
OEE = 0.71 * 0.81 * 0.83 ≈ 0.48 or 48%
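A small Python sketch that reproduces this calculation. The interpretation of the raw figures (run time vs. planned time, actual vs. maximum output, good vs. total units) follows the standard OEE definitions and is assumed here, since the section gives only the numbers:

def oee(run_time, planned_time, actual_output, max_output, good_units, total_units):
    # Standard OEE factors (interpretation of the inputs assumed, see above).
    availability = run_time / planned_time    # 990 / 1390    ≈ 0.71
    performance = actual_output / max_output  # 12000 / 14850 ≈ 0.81
    quality = good_units / total_units        # 10000 / 12000 ≈ 0.83
    return availability * performance * quality

print(round(oee(990, 1390, 12000, 14850, 10000, 12000), 2))  # 0.48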
Tableau File Types
Tableau Workbook File (.twb)
- XML documents.
- Contains information about sheets, dashboards, and stories.
- References a data source file (e.g., Excel); the workbook remains linked to the source but does not contain it.
Tableau Packaged Workbook (.twbx)
- A package of files compressed together.
- Includes data source files, the TWB, and other files used (e.g., images).
- Bigger file size than TWB.
- The original source file can be extracted if needed.
Core Concepts in Statistical Analysis
Understanding Data in Statistics
Examples: Answers to questionnaires, measurements from experiments, production data, quality inspection results, monitoring through sensors.
Variables in Statistics
- Concepts that are observable and measurable for an individual or entity.
- Can be categorical or quantitative (scale).
- Examples: Color classification, loudness, level of satisfaction/agreement, amount of time spent, media choices.
Variable Types
- Independent: Those that are systematically controlled and varied by the researcher.
- Dependent: Changes in response to the independent variable.
Variable Forms
- Discrete: Only include a finite set of values (e.g., yes/no, satisfied/not satisfied).
- Continuous: Takes on any value on a continuous scale (e.g., Height, weight, length, time).
Scales of Measurement
- Identity: Each value has a unique meaning.
- Magnitude: Values have an ordered relationship to one another (specific order).
- Equal Intervals: Data points along the scale are equal; the difference between data points is the same.
- Minimum Value of Zero (True Zero): The scale has a true zero point where zero means the absence of the quantity (e.g., weight). If zero is an arbitrary point that does not mean absence (e.g., 0 °C), the scale has no true zero.
Samples and Populations
- Sample data 'represents' the whole population.
- Used to estimate parameters of a population.
- Statistics are calculated using sample data.
- Parameters are characteristics of population data.
- Sample Mean & Sample Standard Deviation estimate Population Mean & Population Standard Deviation.
Sampling Methods
- Probability Sampling: Every member of the population has a known, non-zero probability of selection.
- Non-Probability Sampling: Units of the sample are selected based on personal judgment or convenience. The probability of any particular member being chosen is unknown.
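A minimal sketch of simple random sampling (one probability sampling method), using a hypothetical population and the standard library:

import random

random.seed(42)  # make the example reproducible

population = list(range(1, 1001))         # hypothetical population of 1,000 members
sample = random.sample(population, k=50)  # every member equally likely to be chosen

print(len(sample))  # 50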
Statistical Description of Data
Describing Categorical Data
Described by frequency, percentage, proportion of each category.
Describing Numerical Data
Described by Center, Variability, and Shape.
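A brief pandas sketch of both kinds of description, with invented example data:

import pandas as pd

# Categorical data: describe by frequency and proportion.
colors = pd.Series(["red", "blue", "red", "green", "red", "blue"])
print(colors.value_counts())                # frequency of each category
print(colors.value_counts(normalize=True))  # proportion of each category

# Numerical data: describe center, variability, and shape indicators.
heights = pd.Series([160, 165, 170, 172, 180, 185])
print(heights.describe())  # count, mean, std, min, quartiles, max
print(heights.skew())      # sign indicates direction of skew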
Histogram Charts
- Analysts can estimate future events based on the chart's probability distribution.
- Bins: Group the data points that fall within a specified range of values.
- Bin width must be chosen carefully: too large can hide meaningful detail; too small will clutter the chart.
- Approach to choose the number of bins: take the square root of the number of data samples (e.g., 100 data points ≈ 10 bins); see the sketch after this list.
- Use when:
- Data is numerical.
- You want to see the data distribution to determine if it's approximately normal.
- To communicate data distribution easily and quickly.
- Example: Analyzing material inspection quality from a supplier; monitoring if a process change has occurred over time.
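A short matplotlib sketch of the square-root rule for the bin count; the data is randomly generated purely for illustration:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=100)  # 100 invented measurements

bins = int(np.sqrt(len(data)))  # square-root rule: 100 points -> 10 bins
plt.hist(data, bins=bins, edgecolor="black")
plt.xlabel("Measurement value")
plt.ylabel("Frequency")
plt.title("Histogram with sqrt-rule bin count")
plt.show()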
Measures of Central Tendency
These three characteristic numbers can be identified for each distribution:
- Mean: Arithmetic average.
- Median: Midpoint of the distribution (50th percentile).
- Mode: Value that occurs most often.
Mean vs. Median
- The median is less sensitive to outliers (extreme values) than the mean.
- The median is often a better measure for highly skewed distributions (e.g., family income) than the mean.
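A tiny example of this robustness, with invented income figures (in $1000s):

import statistics

incomes = [30, 32, 35, 38, 40]     # invented for illustration
print(statistics.mean(incomes))    # 35
print(statistics.median(incomes))  # 35

incomes.append(500)                # one extreme outlier
print(statistics.mean(incomes))    # 112.5 -- pulled sharply upward
print(statistics.median(incomes))  # 36.5  -- barely moves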
Standard Deviation Explained
- A measure of how much individual data points differ from the mean (average spread).
- Large SD: Data is very spread out.
- Small SD: Little variation from the mean; data points are close to the mean.
Assessing Normality
Visual inspection (e.g., histogram shape) and statistical tests can be used to assess if data follows a normal distribution.
Measures of Dispersion (Spread)
- Range: Distance between the smallest and largest value in the set.
- Variance: Describes the variation in a set of scores; higher variance means greater variability/spread. A variance of 0 means all values are identical; a large variance means values are far from the mean and from each other. Used in t-tests, ANOVA, and regression.
- Standard Deviation: Another method of describing variation. Square root of the variance. Higher standard deviation means greater variability/spread. Expressed in the same units as the original data.
Variability Measures: Quartiles
Quartiles
- Data can be divided into four regions covering the total range of observed values. The cut points are known as quartiles (Q1, Q2, Q3).
- Q1: 25th percentile.
- Q2: 50th percentile (Median).
- Q3: 75th percentile.
- Formula hint: the position of quartile Qq is the q(n+1)/4-th ordered observation, where q is the desired quartile (1, 2, or 3) and n is the number of observations.
- Interquartile Range (IQR): Q3 - Q1. Represents the spread of the middle 50% of the data.
The Five-Number Summary
Provides a concise summary of distribution:
- Minimum (Smallest observation)
- First Quartile (Q1)
- Median (Q2)
- Third Quartile (Q3)
- Maximum (Largest observation)
Written in order from smallest to largest.
Box Plots
- A graph of the five-number summary.
- A central box spans the quartiles (Q1 to Q3).
- A line within the box marks the median (Q2).
- Lines (whiskers) extend above and below the box to mark the smallest and largest observations within a certain range (often 1.5 * IQR).
- Outliers can be plotted individually outside the whiskers.
Example: Data: 0 2 5 6 7 9 10 11 13 15 16 16 17 19 19 19 20
- Total Numbers (n) = 17
- Median (Q2) = 13
- Mean = 12
- Q1 (Middle value between min and median) = 6.5
- Q3 (Middle value between median and max) = 18
- IQR = Q3 - Q1 = 18 - 6.5 = 11.5
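The worked example can be reproduced with Python's statistics module; note that quartile conventions differ between tools, and the default method here happens to match this example:

import statistics

data = [0, 2, 5, 6, 7, 9, 10, 11, 13, 15, 16, 16, 17, 19, 19, 19, 20]

q1, q2, q3 = statistics.quantiles(data, n=4)  # default 'exclusive' method
print(statistics.mean(data))  # 12
print(q1, q2, q3)             # 6.5 13.0 18.0
print(q3 - q1)                # IQR = 11.5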
Choosing Summary Statistics
For Central Tendency & Spread:
- Scale Data:
- Normally distributed: Mean + Standard Deviation
- Skewed Data: Median + Interquartile Range (IQR)
- Categorical Data:
- Ordinal: Median + Interquartile Range (IQR)
- Nominal: Mode (+ Frequency/Proportion)
Choosing Appropriate Graphs
- Only 1 Variable (Univariate):
- Scale: Histogram, Box Plot
- Categorical: Pie Chart, Bar Chart
- 2 Variables (Bivariate):
- Scale vs. Scale: Scatter Plot
- Scale vs. Categorical: Box Plot (side-by-side), Confidence Interval Plot
- Categorical vs. Categorical: Stacked Bar Chart, Multiple Bar Chart, Contingency Table
Area Under the Normal Distribution Curve
- The total area under a normal curve is 1 (or 100%).
- The area represents probability or proportion.
- Z-score: Measures how many standard deviations a value is from the mean.
- Formula: Z = (Value - Mean) / Standard Deviation
- Example: Mean = 2 hours, Standard Deviation = 0.5 hours. What is the proportion between 2 and 2.5 hours?
- Z for 2 hours = (2 - 2) / 0.5 = 0
- Z for 2.5 hours = (2.5 - 2) / 0.5 = 1
- The area between Z=0 and Z=1 is approximately 0.34 or 34%.
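This calculation can be checked with the standard library's NormalDist:

from statistics import NormalDist

dist = NormalDist(mu=2, sigma=0.5)  # mean 2 hours, SD 0.5 hours

# Proportion of values between 2 and 2.5 hours (i.e., between Z=0 and Z=1).
proportion = dist.cdf(2.5) - dist.cdf(2.0)
print(round(proportion, 4))  # 0.3413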
Feature Scaling: Data Normalization
- Data normalization is often necessary for Machine Learning (ML) models.
- Multiple features spanning varying degrees of magnitude, range, and units can be an obstacle for algorithms sensitive to feature scales.
- Normalization (Min-Max Scaling): A scaling technique where values are shifted and rescaled to end up in a range of 0 and 1.
- Formula: Normalized Score = (X - Xmin) / (Xmax - Xmin)
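A minimal sketch of this formula, with invented values:

def min_max_scale(values):
    # Rescale values to the [0, 1] range.
    x_min, x_max = min(values), max(values)
    return [(x - x_min) / (x_max - x_min) for x in values]

print(min_max_scale([10, 20, 25, 40]))  # [0.0, 0.33..., 0.5, 1.0]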
Feature Scaling: Data Standardization
- Values are centered around the mean with a unit standard deviation.
- The mean of the attribute becomes 0, and the resultant distribution has a unit standard deviation (SD=1).
- Uses Z-scores.
- The z-score is a signed number: positive (> 0) means the value is above the mean/average; negative (< 0) means it is below.
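A matching sketch of standardization, again with invented values; the population standard deviation (pstdev) is used here, though a sample SD is also common:

import statistics

def standardize(values):
    # Center on the mean and scale to unit standard deviation.
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD; use stdev() for a sample
    return [(x - mean) / sd for x in values]

scores = standardize([10, 20, 25, 40])
print([round(z, 2) for z in scores])  # approx. [-1.27, -0.35, 0.12, 1.5]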
Steps to Analyze and Interpret Data
- Understand the data type (Scale, Categorical - Nominal/Ordinal).
- If data type is scale, use appropriate statistical analysis methods (Mean, Mode, Median, Plot Histogram).
- If data appears normally distributed, calculate Range, Variance, and Standard Deviation. If skewed, use Median and IQR.
- If comparing multiple variables with different scales, consider normalization or standardization.
- Interpretation:
- The process of drawing inferences from analysis results.
- Inferences lead to managerial implications and decisions.
- From a management perspective, the qualitative meaning of data and their managerial implications are crucial aspects of interpretation.