Foundations of Data Analytics, Databases, and Statistics
Introduction to Data Analytics
Types of Data Analytics
- Descriptive: Analyzes past trends.
- Predictive: Applies past trends to current data to forecast future outcomes.
- Prescriptive: Suggests actions and outlines potential impacts.
Project Stages
- Problem Specification:
- Understand the Problem Statement.
- Define the Project Scope.
- Data Gathering & Preprocessing:
- Define a system for data collection.
- Clean the data through preprocessing.
- Descriptive Analytics:
- Perform Exploratory Data Analysis (EDA).
- Get a basic understanding of the dataset.
- Answer initial assumptions about the data.
- Machine Learning:
- Apply correct ML models depending on the scope/nature of the data and project.
- Train the ML model and assess its performance.
- Deployment:
- Consult with project stakeholders on the suitability of the model.
- Deploy the model for live usage.
Tools
- R
- Python
- Weka
- RapidMiner
- Orange
- Tableau
- Power BI
- IBM SPSS Modeler
- SAS Data Mining
- Oracle Data Mining
Major Steps in Data Analytics
Data Cleaning & Integration (From DB) -> Data Mining -> Pattern Evaluation -> Knowledge Discovery
Step 1: Data Preprocessing (60% Effort)
- Data Integration: Combining data from multiple sources.
- Data Cleaning: Handling noisy values/outliers.
- Data Filling: Addressing missing values.
- Data Selection: Selecting relevant attributes & objects.
- Data Reduction & Transformation: Simplifying or changing data structure.
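A minimal pandas sketch of the integration, cleaning, filling, and selection steps above; the file names and the machine_id/temperature columns are hypothetical, invented for illustration:

import pandas as pd

# Hypothetical file and column names, for illustration only.
df = pd.read_csv("sensor_readings.csv")

# Data Integration: combine with a second (hypothetical) source.
machines = pd.read_csv("machines.csv")
df = df.merge(machines, on="machine_id", how="left")

# Data Cleaning: treat extreme outliers by clipping to the 1st-99th percentiles.
low, high = df["temperature"].quantile([0.01, 0.99])
df["temperature"] = df["temperature"].clip(low, high)

# Data Filling: replace missing values with the column median.
df["temperature"] = df["temperature"].fillna(df["temperature"].median())

# Data Selection: keep only the attributes relevant to the analysis.
df = df[["machine_id", "temperature"]]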
Step 2: Data Mining
- Choose the function of data mining.
- Choose the mining algorithm.
- Search for patterns of interest.
Step 3: Knowledge Discovery
- Pattern Evaluation: Generalization, Re-mining.
- Knowledge Representation: Visualization.
- Decision Making: Use extracted knowledge.
Typical Data Analysis System
Data Analytics: Multiple Disciplines
- Data Processing
- Statistics
- Visualization
- Database Technology
- High-Performance Computing
- Machine Learning
- Application Development
Case Study: SMRT Circle Line
Marey Chart: Used in transport to visualize timetables of buses/trains.
Statistical Analytics Fundamentals
- The science of collection, presentation, analysis, and reasonable interpretation of data.
- Statistics presents a scientific method for gaining insight into data.
- Can provide graphical presentation, numerical summarization, make inferences, and predict variable relationships.
Statistical Methods
- Descriptive:
- Univariate: Shape, center, spread, relative positions.
- Bivariate: Correlation, regression/prediction.
- Multivariate: Multiple Regression.
- Inferential:
- Applied to means: t-test, dependent-groups t-test, ANOVA.
Variable Types and Forms
Variable Type
- Independent: Systematically varied by the researcher.
- Dependent: Observed values that depend on the effect of the independent variable.
Variable Forms
- Discrete: Finite set of values (e.g., yes/no, good/bad).
- Continuous: Any value on a continuous scale (e.g., height, weight, length, time).
Data Types
- Scale:
- Continuous: Measurements, can take any value.
- Discrete: Counts/integers.
- Categorical:
- Ordinal: Has an obvious order.
- Nominal: Has no meaningful order.
Statistical Analysis Overview
- Provides a compact description for a subset of data.
- Provides mean, standard deviation, and other statistical features.
- Includes correlation coefficient, multivariate visualization techniques, and functional relationships between variables (regression).
Statistical Description: Numerical Data
Focuses on Center, Variability, and Shape.
Statistical Description: Categorical Data
Focuses on Frequency, Percentage, and Proportion for each category.
Choosing the Right Graph for Data
- Table: Can display a large amount of data.
- Graph/Chart: Visual, easier to detect patterns.
- Label the Components of the Graphic: Interpretation adds meaning to the data by making connections and comparisons.
Note: Service data are good at tracking processes and identifying concerns but do not show causality.
Supervised Learning Basics
- Uses labeled data.
- Models relationships and dependencies between the target prediction output and input features.
Statistical Models in Supervised Learning
- Regression models: Linear, Non-Linear.
- Time Series Models: ARMA, Moving Average.
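As a minimal illustration of a linear regression model, a least-squares fit with NumPy; the data points are invented:

import numpy as np

# Toy data: input feature vs. target output, invented for illustration.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit y = a*x + b by least squares.
a, b = np.polyfit(x, y, deg=1)
print(f"slope={a:.2f}, intercept={b:.2f}")  # slope≈1.96, intercept≈0.14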
Databases and Data Warehouses
Evolution of Database Technology
Database Types
SQL (Relational)
Relational (table-based) databases, structured data.
NoSQL
Handles unstructured data, suitable for large amounts of data (e.g., documents, videos, audio files). Example: MongoDB.
Object-Oriented Database Management Systems (OODBMS)
Align database operations with object-oriented programming practices; not widely adopted.
Graph Database
Handles numerous relationships that traditional databases cannot easily manage; real-world objects often have complex relationships.
Consists of a collection of nodes and edges, where edges represent the relationships between nodes.
Relationships between data are the priority; querying relationships is fast because they are stored directly in the database.
Can model any kind of relationship (e.g., a system of roads, network of devices, a population's medical history).
SQL Databases Explained
Databases are made up of entities and relationships.
Entity Relationship Diagrams (ERDs)
Visually represent the components that make up a database.
Attributes of an Entity
Example: Entity = Customer, Attributes = CustomerID, Name, Phone, Email, etc.
Database Model Stages
- Conceptual Database Model:
- Represents the entities that make up the database.
- Includes entities and their general relationships but does not include entity attributes.
- Independent of database type (Relational, NoSQL, Graph).
- Logical Database Model:
- Includes entity attributes.
- Physical Database Model:
- Specifies each attribute's data type (e.g., integer, floating point, string).
- Specifies fields that will be primary and foreign keys (creates the relationships between entities).
Primary Keys
- A primary key uniquely identifies each record.
- Cannot be NULL and must be unique.
Foreign Keys
- Entities have relationships with one another; related tables must have common fields for such relationships.
- Databases use foreign keys to enforce referential integrity, preventing a foreign key field from being assigned a value that has no corresponding primary key value.
Microsoft SQL Commands
- Create:
CREATE TABLE Customer (CustomerID int PRIMARY KEY, Name varchar(255), Phone varchar(64));
- Insert:
INSERT INTO Customer (CustomerID, Name, Phone) VALUES (111, 'John', '12345678');
Note: If an INSERT query does not specify values for each field, SQL will assign NULL (or the column's default, if one is defined) to the remaining fields.
- Update:
UPDATE Customer SET Phone = '98765432' WHERE Name = 'John';
- Select:
SELECT * FROM Customer;
SELECT * FROM Customer ORDER BY CustomerID;
SELECT [CustomerID], [Name], [Phone] FROM [Customer].[dbo].[Customer] WHERE CustomerID = 113;
- Delete:
DELETE FROM [Customer].[dbo].[Customer] WHERE CustomerID = 113;
SELECT * FROM [Customer].[dbo].[Customer];
- Drop:
DROP DATABASE DatabaseName;
DROP TABLE TableName;
Note: When using DROP, SQL will delete the object regardless of the data it contains.
- Foreign Keys (Join Example):
SELECT O.OrderID, C.CustomerName, O.OrderDate FROM Orders O INNER JOIN Customers C ON O.CustomerID=C.CustomerID;
SQL Database Summary
- Structure: Tables (relational model).
- Storage: Centralized.
- Scale: Vertical (better machine, bigger hard disk).
- Access: Raw SQL, Direct DB Connection, Object-Relational Mappers (ORMs).
Databases and Data Warehouses (Continued)
Online Transaction Processing (OLTP)
Relates to applications performing database transactions, like e-commerce sites tracking product sales or production systems tracking machines, products, quality, etc.
Online Analytical Processing (OLAP)
Includes reporting, dashboards, and decision support tools.
Note: Normally, OLTP databases are the data source for OLAP operations.
OLTP vs. OLAP Comparison
Data Warehouse Concepts
- A warehouse designed to optimize analytic operations.
- Companies usually have databases to store transactional data (like product sales, inventory levels), but these are not optimized for high-performance data analytics, decision support, or dashboard performance.
- It's normal to have databases at many locations. To analyze company-wide operations, data must be gathered from each database, focusing on interesting records.
- Flow: Operational Data & Flat Files & SAP Data -> Data Warehouse (Collect, Analyze, View/Present Data) -> Local Clients, Defined Queries, World Wide Web.
- A collection of data to support management's decision-making process. Characteristics:
1. Subject-Oriented Data Warehouse
- Organized around major subjects like customer, product, sales.
- Focused on modeling and analysis for decision-makers, not daily operations or transaction processing.
- Provides a simple, concise view around particular subject issues by excluding data not useful in the decision support process.
2. Integrated Data Warehouse
- Constructed by integrating multiple heterogeneous data sources (e.g., relational databases, flat files, online transaction records) into a single, unified database.
- Data cleaning and integration are applied to ensure consistency in naming conventions, encoding structures, and attribute meanings among different data sources. Data is converted when moved to the warehouse.
3. Time-Variant Data Warehouse
- Provides access to a greater volume of detailed information over a longer period, with data associated with a point in time.
- The time horizon is significantly longer than that of an operational database (e.g., past 5-10 years vs. current value data).
- Every key structure contains an element of time, explicitly or implicitly, unlike operational data, which may or may not contain a time element.
4. Non-Volatile Data Warehouse
- Physically separate storage of data from the operational environment.
- Updates from operational DB data may not occur in the data warehouse immediately.
- Does not require transaction processing, recovery, and concurrency control mechanisms.
- Only needs two operations for data access: initial loading of data and access of data.
Database vs. Data Warehouse
- Database:
- Structured with a defined schema.
- Items organized as a set of tables with columns and rows.
- Columns include attributes; tables indicate an object or entity.
- Designed as transactional; not designed to perform data analytics efficiently.
- Data Warehouse:
- Built on top of several databases and used for business intelligence.
- Consumes data from operational databases and creates an optimized layer for data analytics.
- Schema is typically predefined (schema-on-write), although some platforms can derive it on import or query.
Datamarts Explained
Contains data for a specific business group.
Data Warehouse Design Considerations
- What reports are required?
- What and where are the sources of data?
- How frequently will new data be collected and loaded?
- How many users will simultaneously access the data warehouse?
- What access will different user types (finance, sales, operations) have?
- How will the Extract, Transform, and Load (ETL) operations be performed?
- What data marts are required?
- What changes are expected in the future for reporting, analytics, decision support tools, and dashboards?
- How will the data warehouse be backed up and with what frequency?
- What factors influence data quality?
- Who owns the data, and who are the primary stakeholders?
- Who will test the reports, data analytics, and dashboards for quality assurance?
Multi-Tiered Architecture
Types of Data Warehouses
- Enterprise Warehouse: Collects all information about subjects spanning the entire organization.
- Data Mart: A subset of corporate-wide data for a specific group of users (e.g., marketing data mart). Scope confined to specific groups.
- Virtual Warehouse: A set of views over operational databases; only some of the possible summary views may be materialized.
Conceptual Modeling of Data Warehouses
Modeling involves Facts and Dimensions.
- Star Schema: A central fact table connected to a set of dimension tables.
- Snowflake Schema: A refinement of the star schema where some dimensional hierarchies are normalized into smaller dimension tables, forming a snowflake shape.
- Fact Constellation (Galaxy Schema): Multiple fact tables share dimension tables, viewed as a collection of stars.
NoSQL Databases Introduction
- Structured data: Has a schema, fits in fixed fields in a table.
- Unstructured data: No fixed schema or structure.
- Needed to store documents, videos, and less 'structured' data.
- Needed to scale database storage and processing power up and down based on demand.
- Addresses the continued need for higher performance and reliability with unstructured data.
- NoSQL databases do not use SQL as their primary query mechanism.
- Store data in a less structured way, often using JSON objects.
JSON (JavaScript Object Notation)
- Self-describing, human-readable, language-independent, lightweight.
- Used between web browsers, web servers, and mobile applications.
- Developers store JSON Objects in text-based files, using the .json file extension.
- MongoDB uses collections (similar to tables) that store documents (similar to records).
- MongoDB provides queries to create and manage databases and collections and perform CRUD operations on documents (see the sketch after this list).
- Scaling can be done vertically (adding RAM & CPU to a server) or horizontally (adding more servers).
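A minimal sketch of MongoDB CRUD operations via the pymongo driver; the connection string, database name (shop), and collection name (customers) are placeholders:

from pymongo import MongoClient

# Placeholder connection string; point this at a real MongoDB instance.
client = MongoClient("mongodb://localhost:27017")
collection = client["shop"]["customers"]  # database "shop", collection "customers"

# Create: insert a JSON-like document (a Python dict).
collection.insert_one({"CustomerID": 111, "Name": "John", "Phone": "12345678"})

# Read: find one document matching a filter.
doc = collection.find_one({"CustomerID": 111})

# Update: modify the first matching document.
collection.update_one({"CustomerID": 111}, {"$set": {"Phone": "98765432"}})

# Delete: remove the first matching document.
collection.delete_one({"CustomerID": 111})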
NoSQL Database Pros and Cons
- Pros:
- Flexibility, scalability, schemaless, fast writes.
- Developers don't need to know the data structure ahead of time.
- Ideal for unstructured/semi-structured data, content management.
- Rapid prototyping and collecting high-traffic data.
- Cons:
- Querying is largely limited to a single document; relationships and cross-references are not enforced, and documents/collections cannot easily be joined in a single query.
- Lack of database enforcement requires developer discipline for application-level enforcement.
- Potential for data duplication.
NoSQL Database Summary
- Structure: Tables, Documents, Graphs.
- Storage: Distributed (data spread across partitions, e.g., by hashing keys).
- Scale: Horizontal (Add more Partitions/Servers).
- Access: Representational State Transfer (REST) API, CRUD in vendor-specific language.
Data Lakes Explained
- A centralized repository for structured and unstructured data storage.
- Used to store raw data as-is, without any predefined structure (schema).
- No need to perform transformation jobs before storage.
- Can store many types of data such as images, text files, videos.
- Processing is done by data experts; schema is defined on read (schema-on-read).
- Data lake files reside on specialized file systems like Hadoop Distributed File System (HDFS), Amazon S3, Microsoft Azure Data Lake Storage.
Data Lake vs. Data Warehouse Comparison
Fundamentals of Data Visualization
Importance of Visualization
Visualization is the representation of an object with the goal of improving communication.
Visualization Tools
- Excel
- Tableau
- Google Charts (integrate charts into HTML)
- Programming Tools like Python, R
Factors to consider when choosing:
- Price
- Learning Curve
- Support for web-based charts & dashboards
The Science of Visualization
Use the correct chart for the right purpose.
The Art of Visualization
- Use color effectively.
- Integrate complementary fonts.
- Highlight points of interest to communicate a message.
Visualization Best Practices
- Begin with the end goal in mind.
- Keep the audience in mind.
- Use the correct chart for the right purpose.
- Use color effectively.
- Use text labels to meaningfully describe data.
- Use axis values consistently.
- Focus on actionable content.
- Keep it simple.
- Build user trust.
- Make sure chart axis labels are easy to read.
- Use meaningful chart titles that begin to tell a story.
- Sort data appropriately.
- Draw the viewer's focus to key points.
Common Chart Types
1. Time-Based Charts
Represent how one or more sets of values change over time.
- Line Chart: Compare productivity for one week.
- Multi-Line Chart: Compare productivity from three similar machines.
- Area Chart: Represents a cumulative total over time.
2. Category-Based Charts
To compare one or more categories of values.
- Vertical/Horizontal Bar Chart: Compare productivity over 12 months.
- Column Chart: Similar to a vertical bar chart.
- Clustered/Grouped Bar or Column Chart: Compare productivity of 3 machines over 12 months.
- Radar Chart: Compare multiple quantitative variables.
- Combo Chart: Combine different chart types (e.g., bar and line).
- Difference Chart: Show differences between series.
- Waterfall Chart: Show daily total hours and its components; analyze productivity losses.
3. Composition Charts
Represent how one or more values relate to a larger whole.
- Pie Chart: Compare production volume for different customers.
Note: Viewers may find it hard to accurately determine the area of slices.
- Donut Chart: Addresses the pie chart's issue by using arcs instead of slices.
- Sunburst Chart: Visualize hierarchical data.
- Stacked Bar/Column Chart: Show composition (e.g., compare total production volume for different customers to understand different machines' contributions).
- Stacked Area Chart: Show composition over time.
4. Distribution Charts
Show how data points are distributed (e.g., Histogram, Box Plot).
5. Correlation Charts
Show the relationship between variables (e.g., Scatter Plot).
6. Dashboards
A collection of visualizations displayed together to provide an overview.
7. GeoCharts
Integrates maps with data values.
Data Presentation Methods
Graphical Presentation
- Look for overall patterns and striking deviations from that pattern.
- Overall pattern usually described by shape, center, and spread of the data.
- Outlier: An individual value that falls outside the overall pattern.
Numerical Presentation
Using summary statistics (mean, median, standard deviation, etc.).
Categorical vs. Numerical Variables Charts
- Categorical: Bar Diagrams, Pie Charts.
- Numerical: Histogram, Stem-and-Leaf Plot, Box Plot.
Gauge Charts
Provides an at-a-glance representation of a value, often used within a KPI dashboard.
Overall Equipment Effectiveness (OEE)
- One of the most important metrics for highlighting manufacturing productivity.
- Management gains important insight into how to systematically improve the manufacturing process.
- A good metric for identifying losses, eliminating waste, benchmarking progress, and improving the productivity of manufacturing equipment.
- Availability:
Availability = 990 / 1390 ≈ 0.71 or 71%
- Performance:
Performance = 12000 / 14850 ≈ 0.81 or 81%
- Quality:
Quality = 10000 / 12000 ≈ 0.83 or 83%
- OEE Calculation:
OEE = Availability * Performance * Quality
OEE = 0.71 * 0.81 * 0.83 ≈ 0.48 or 48%
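A small Python sketch that reproduces this calculation. The interpretation of the raw figures (run time vs. planned time, actual vs. maximum output, good vs. total units) follows the standard OEE definitions and is assumed here, since the section gives only the numbers:

def oee(run_time, planned_time, actual_output, max_output, good_units, total_units):
    # Standard OEE factors (interpretation of the inputs assumed, see above).
    availability = run_time / planned_time    # 990 / 1390    ≈ 0.71
    performance = actual_output / max_output  # 12000 / 14850 ≈ 0.81
    quality = good_units / total_units        # 10000 / 12000 ≈ 0.83
    return availability * performance * quality

print(round(oee(990, 1390, 12000, 14850, 10000, 12000), 2))  # 0.48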
Tableau File Types
Tableau Workbook File (.twb)
- XML documents.
- Contains information about sheets, dashboards, and stories.
- References a data source file (e.g., Excel); the workbook remains linked to the source but does not contain it.
Tableau Packaged Workbook (.twbx)
- A package of files compressed together.
- Includes data source files, the TWB, and other files used (e.g., images).
- Bigger file size than TWB.
- The original source file can be extracted if needed.
Core Concepts in Statistical Analysis
Understanding Data in Statistics
Examples: Answers to questionnaires, measurements from experiments, production data, quality inspection results, monitoring through sensors.
Variables in Statistics
- Concepts that are observable and measurable for an individual or entity.
- Can be categorical or quantitative (scale).
- Examples: Color classification, loudness, level of satisfaction/agreement, amount of time spent, media choices.
Variable Types
- Independent: Those that are systematically controlled and varied by the researcher.
- Dependent: Changes in response to the independent variable.
Variable Forms
- Discrete: Only include a finite set of values (e.g., yes/no, satisfied/not satisfied).
- Continuous: Takes on any value on a continuous scale (e.g., Height, weight, length, time).
Scales of Measurement
- Identity: Each value has a unique meaning.
- Magnitude: Values have an ordered relationship to one another (specific order).
- Equal Intervals: Data points along the scale are equal; the difference between data points is the same.
- Minimum Value of Zero (True Zero): The scale has a true zero point where zero means the absence of the quantity (e.g., weight). If zero is an arbitrary point that does not mean absence (e.g., 0 °C), the scale has no true zero.
Samples and Populations
- Sample data 'represents' the whole population.
- Used to estimate parameters of a population.
- Statistics are calculated using sample data.
- Parameters are characteristics of population data.
- Sample Mean & Sample Standard Deviation estimate Population Mean & Population Standard Deviation.
Sampling Methods
- Probability Sampling: Every member of the population has a known, non-zero probability of selection.
- Non-Probability Sampling: Units of the sample are selected based on personal judgment or convenience. The probability of any particular member being chosen is unknown.
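A minimal sketch of simple random sampling (one probability sampling method), using a hypothetical population and the standard library:

import random

random.seed(42)  # make the example reproducible

population = list(range(1, 1001))         # hypothetical population of 1,000 members
sample = random.sample(population, k=50)  # every member equally likely to be chosen

print(len(sample))  # 50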
Statistical Description of Data
Describing Categorical Data
Described by frequency, percentage, proportion of each category.
Describing Numerical Data
Described by Center, Variability, and Shape.
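A brief pandas sketch of both kinds of description, with invented example data:

import pandas as pd

# Categorical data: describe by frequency and proportion.
colors = pd.Series(["red", "blue", "red", "green", "red", "blue"])
print(colors.value_counts())                # frequency of each category
print(colors.value_counts(normalize=True))  # proportion of each category

# Numerical data: describe center, variability, and shape indicators.
heights = pd.Series([160, 165, 170, 172, 180, 185])
print(heights.describe())  # count, mean, std, min, quartiles, max
print(heights.skew())      # sign indicates direction of skew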
Histogram Charts
- Analysts can estimate future events based on the chart's probability distribution.
- Bins: Group the data points that fall within a specified range of values.
- Bin width must be chosen carefully: too large can hide meaningful detail; too small will clutter the chart.
- Approach to choose the number of bins: take the square root of the number of data samples (e.g., 100 data points ≈ 10 bins); see the sketch after this list.
- Use when:
- Data is numerical.
- You want to see the data distribution to determine if it's approximately normal.
- To communicate data distribution easily and quickly.
- Example: Analyzing material inspection quality from a supplier; monitoring if a process change has occurred over time.
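A short matplotlib sketch of the square-root rule for the bin count; the data is randomly generated purely for illustration:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=100)  # 100 invented measurements

bins = int(np.sqrt(len(data)))  # square-root rule: 100 points -> 10 bins
plt.hist(data, bins=bins, edgecolor="black")
plt.xlabel("Measurement value")
plt.ylabel("Frequency")
plt.title("Histogram with sqrt-rule bin count")
plt.show()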
Measures of Central Tendency
These three characteristic numbers can be identified for each distribution:
- Mean: Arithmetic average.
- Median: Midpoint of the distribution (50th percentile).
- Mode: Value that occurs most often.
Mean vs. Median
- The median is less sensitive to outliers (extreme values) than the mean.
- The median is often a better measure for highly skewed distributions (e.g., family income) than the mean.
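A tiny example of this robustness, with invented income figures (in $1000s):

import statistics

incomes = [30, 32, 35, 38, 40]     # invented for illustration
print(statistics.mean(incomes))    # 35
print(statistics.median(incomes))  # 35

incomes.append(500)                # one extreme outlier
print(statistics.mean(incomes))    # 112.5 -- pulled sharply upward
print(statistics.median(incomes))  # 36.5  -- barely moves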
Standard Deviation Explained
- A measure of how much individual data points differ from the mean (average spread).
- Large SD: Data is very spread out.
- Small SD: Little variation from the mean; data points are close to the mean.
Assessing Normality
Visual inspection (e.g., histogram shape) and statistical tests can be used to assess if data follows a normal distribution.
Measures of Dispersion (Spread)
- Range: Distance between the smallest and largest value in the set.
- Variance: Describes the variation in a set of scores; higher variance means greater variability/spread. A variance of 0 means all values are identical; a large variance means values are far from the mean and from each other. Used in t-tests, ANOVA, and regression.
- Standard Deviation: Another method of describing variation. Square root of the variance. Higher standard deviation means greater variability/spread. Expressed in the same units as the original data.
Variability Measures: Quartiles
Quartiles
- Data can be divided into four regions covering the total range of observed values. The cut points are known as quartiles (Q1, Q2, Q3).
- Q1: 25th percentile.
- Q2: 50th percentile (Median).
- Q3: 75th percentile.
- Formula hint: the position of quartile Qq is the q(n+1)/4-th ordered observation, where q is the desired quartile (1, 2, or 3) and n is the number of observations.
- Interquartile Range (IQR): Q3 - Q1. Represents the spread of the middle 50% of the data.
The Five-Number Summary
Provides a concise summary of distribution:
- Minimum (Smallest observation)
- First Quartile (Q1)
- Median (Q2)
- Third Quartile (Q3)
- Maximum (Largest observation)
Written in order from smallest to largest.
Box Plots
- A graph of the five-number summary.
- A central box spans the quartiles (Q1 to Q3).
- A line within the box marks the median (Q2).
- Lines (whiskers) extend above and below the box to mark the smallest and largest observations within a certain range (often 1.5 * IQR).
- Outliers can be plotted individually outside the whiskers.
Example: Data: 0 2 5 6 7 9 10 11 13 15 16 16 17 19 19 19 20
- Total Numbers (n) = 17
- Median (Q2) = 13
- Mean = 12
- Q1 (Middle value between min and median) = 6.5
- Q3 (Middle value between median and max) = 18
- IQR = Q3 - Q1 = 18 - 6.5 = 11.5
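The worked example can be reproduced with Python's statistics module; note that quartile conventions differ between tools, and the default method here happens to match this example:

import statistics

data = [0, 2, 5, 6, 7, 9, 10, 11, 13, 15, 16, 16, 17, 19, 19, 19, 20]

q1, q2, q3 = statistics.quantiles(data, n=4)  # default 'exclusive' method
print(statistics.mean(data))  # 12
print(q1, q2, q3)             # 6.5 13.0 18.0
print(q3 - q1)                # IQR = 11.5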
Choosing Summary Statistics
For Central Tendency & Spread:
- Scale Data:
- Normally distributed: Mean + Standard Deviation
- Skewed Data: Median + Interquartile Range (IQR)
- Categorical Data:
- Ordinal: Median + Interquartile Range (IQR)
- Nominal: Mode (+ Frequency/Proportion)
Choosing Appropriate Graphs
- Only 1 Variable (Univariate):
- Scale: Histogram, Box Plot
- Categorical: Pie Chart, Bar Chart
- 2 Variables (Bivariate):
- Scale vs. Scale: Scatter Plot
- Scale vs. Categorical: Box Plot (side-by-side), Confidence Interval Plot
- Categorical vs. Categorical: Stacked Bar Chart, Multiple Bar Chart, Contingency Table
Area Under the Normal Distribution Curve
- The total area under a normal curve is 1 (or 100%).
- The area represents probability or proportion.
- Z-score: Measures how many standard deviations a value is from the mean.
- Formula: Z = (Value - Mean) / Standard Deviation
- Example: Mean = 2 hours, Standard Deviation = 0.5 hours. What is the proportion between 2 and 2.5 hours?
- Z for 2 hours = (2 - 2) / 0.5 = 0
- Z for 2.5 hours = (2.5 - 2) / 0.5 = 1
- The area between Z=0 and Z=1 is approximately 0.34 or 34%.
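This calculation can be checked with the standard library's NormalDist:

from statistics import NormalDist

dist = NormalDist(mu=2, sigma=0.5)  # mean 2 hours, SD 0.5 hours

# Proportion of values between 2 and 2.5 hours (i.e., between Z=0 and Z=1).
proportion = dist.cdf(2.5) - dist.cdf(2.0)
print(round(proportion, 4))  # 0.3413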
Feature Scaling: Data Normalization
- Data normalization is often necessary for Machine Learning (ML) models.
- Multiple features spanning varying degrees of magnitude, range, and units can be an obstacle for algorithms sensitive to feature scales.
- Normalization (Min-Max Scaling): A scaling technique where values are shifted and rescaled to end up in a range of 0 and 1.
- Formula: Normalized Score = (X - Xmin) / (Xmax - Xmin)
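A minimal sketch of this formula, with invented values:

def min_max_scale(values):
    # Rescale values to the [0, 1] range.
    x_min, x_max = min(values), max(values)
    return [(x - x_min) / (x_max - x_min) for x in values]

print(min_max_scale([10, 20, 25, 40]))  # [0.0, 0.33..., 0.5, 1.0]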
Feature Scaling: Data Standardization
- Values are centered around the mean with a unit standard deviation.
- The mean of the attribute becomes 0, and the resultant distribution has a unit standard deviation (SD=1).
- Uses Z-scores.
- The z-score is a signed number: positive (> 0) means the value is above the mean/average; negative (< 0) means it is below.
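A matching sketch of standardization, again with invented values; the population standard deviation (pstdev) is used here, though a sample SD is also common:

import statistics

def standardize(values):
    # Center on the mean and scale to unit standard deviation.
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD; use stdev() for a sample
    return [(x - mean) / sd for x in values]

scores = standardize([10, 20, 25, 40])
print([round(z, 2) for z in scores])  # approx. [-1.27, -0.35, 0.12, 1.5]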
Steps to Analyze and Interpret Data
- Understand the data type (Scale, Categorical - Nominal/Ordinal).
- If data type is scale, use appropriate statistical analysis methods (Mean, Mode, Median, Plot Histogram).
- If data appears normally distributed, calculate Range, Variance, and Standard Deviation. If skewed, use Median and IQR.
- If comparing multiple variables with different scales, consider normalization or standardization.
- Interpretation:
- The process of drawing inferences from analysis results.
- Inferences lead to managerial implications and decisions.
- From a management perspective, the qualitative meaning of data and their managerial implications are crucial aspects of interpretation.