Data Warehousing and Business Intelligence Essentials


Data Warehouse

A Data Warehouse (DW) is a centralized, integrated repository designed for analysis rather than transaction processing. It consolidates data from multiple OLTP systems to support reporting, dashboards, and advanced analytics, serving as a single source of truth for strategic decision support.

Key Properties

The data is curated to be subject-oriented, integrated, time-variant, and non-volatile, ensuring data quality and consistency. Unlike OLTP, the DW is built for read-heavy analytics and historical trends, providing the governance needed for managers to trust the KPIs.

Process Context

Data flows via ETL (Extract-Transform-Load) into the DW, typically in scheduled batches (e.g., nightly). This process standardizes and cleanses data, loading it into dimensional models (like star/snowflake schemas) to facilitate self-service analytics and what-if analysis.

Example and Benefit

In a retail scenario, merging POS, e-commerce, and inventory data allows managers to spot a weekly out-of-stock pattern. Adjusting replenishment schedules decreases the stock-out rate and increases gross margin, converting raw data into actionable insights.


Master Data Management (MDM)

Master Data Management (MDM) is the process that governs and maintains authoritative master records (such as Customer, Product, or Supplier) across an organization's various systems. Its fundamental goal is to ensure consistency and avoid duplicates by creating a golden record or single source of truth. This is achieved by applying survivorship rules and through data stewardship, ensuring that critical business entities are defined accurately and uniformly throughout the enterprise.

In the context of Business Intelligence, MDM is crucial because it provides the standard identifiers and business rules (such as naming conventions, codes, and hierarchies) necessary for ETL processes to effectively conform incoming data. By consolidating and standardizing attributes before they are loaded into the Data Warehouse, MDM prevents the Garbage In, Garbage Out (GIGO) effect and ensures that conformed dimensions are consistent. This rigor drastically reduces the manual reconciliation effort between data silos and guarantees that decision-makers can fully trust the KPIs and trends shown in reports.

A clear example is a company implementing Customer MDM to merge conflicting records like "ACME Ltd.," "Acme," and "ACME INC" into a single CustomerID with a canonical name and address. This de-duplication process improves business efficiency by eliminating rework and offering a unified view of the customer. As a result, the company can accurately track churn and revenue, reduce billing errors, and achieve more effective campaign targeting by eliminating duplicate mailings.
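The merge described above can be sketched in a few lines of Python. This is a minimal illustration, not a real MDM tool: `normalize_name`, the "most complete record wins" survivorship rule, and the sample records are all invented for the example, and production matching engines use far richer fuzzy-matching logic.

```python
import re

def normalize_name(name: str) -> str:
    """Strip punctuation and common legal suffixes so name variants share a match key."""
    cleaned = re.sub(r"[^\w\s]", "", name).upper()
    for suffix in ("LTD", "INC", "CORP"):
        cleaned = re.sub(rf"\b{suffix}\b", "", cleaned)
    return cleaned.strip()

def merge_records(records):
    """Group records by normalized name; survivorship rule (illustrative):
    the most complete record (fewest empty fields) becomes the golden record."""
    golden = {}
    for rec in records:
        key = normalize_name(rec["name"])
        completeness = sum(1 for v in rec.values() if v)
        if key not in golden or completeness > golden[key][0]:
            golden[key] = (completeness, rec)
    return {key: rec for key, (_, rec) in golden.items()}

records = [
    {"name": "ACME Ltd.", "address": ""},
    {"name": "Acme", "address": "1 Main St"},
    {"name": "ACME INC", "address": ""},
]
golden = merge_records(records)
# All three variants collapse into a single golden record,
# and the record carrying an address survives.
```

The key design point is separating *matching* (the normalized key) from *survivorship* (which attribute values win), which is exactly the split MDM stewardship rules formalize.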


ETL Process

ETL is the critical process that moves data from operational systems (OLTP) to the Data Warehouse (DW) after cleaning and integrating it. It is essential to guarantee a single version of the truth and avoid analysis based on inconsistent or low-quality data.

  • Extract: Pulls data from heterogeneous sources into a staging area, often filtering by timestamp so that only new or changed rows are captured (incremental extraction).
  • Transform: Applies business rules to ensure quality: standardizing formats, validating values, deduplicating records, and assigning conformed codes/IDs so facts and dimensions are structured correctly.
  • Load: Loads the curated data into the DW while maintaining referential integrity and audit/logging metrics, typically on a schedule (nightly or several times per day).

Synergy of the Steps

Extract gathers the scattered raw material; Transform fixes quality and aligns semantic meanings; Load delivers the query-ready final product. This sequence converts siloed data into a consistent historical/time-variant store optimized for OLAP analysis.

ETL is vital to avoid GIGO by ensuring:

  • Quality: Catches null/duplicate values before reports.
  • Consistency: Applies uniform rules so KPIs align across the enterprise.
  • Usability: Formats data for fast BI queries.

Example: A retail system extracts sales from POS and E-commerce. In transformation, it unifies currencies, cleans codes, and deduplicates customers. It loads data into Fact_Sales and dimensions. Result: The margin KPI is reliable and consistent, improving decisions regarding stock-outs.
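The retail example above can be sketched as three small Python functions over in-memory lists. This is a toy pipeline for illustration only: the source rows, field names, and the fixed USD-to-EUR rate are all assumptions, and a real ETL job would read from databases and write to DW tables.

```python
RATE_USD_TO_EUR = 0.9  # assumed fixed rate, purely illustrative

pos_rows = [{"sku": " A1 ", "amount": 100.0, "currency": "EUR"}]
ecom_rows = [
    {"sku": "a1", "amount": 50.0, "currency": "USD"},
    {"sku": "a1", "amount": 50.0, "currency": "USD"},  # duplicate record
]

def extract():
    # Extract: pull raw rows from heterogeneous sources into one staging list.
    return pos_rows + ecom_rows

def transform(staged):
    # Transform: standardize codes, unify currency, drop exact duplicates.
    seen, clean = set(), []
    for row in staged:
        sku = row["sku"].strip().upper()
        amount = row["amount"] * (RATE_USD_TO_EUR if row["currency"] == "USD" else 1.0)
        key = (sku, amount)
        if key not in seen:
            seen.add(key)
            clean.append({"sku": sku, "amount_eur": round(amount, 2)})
    return clean

def load(clean, fact_sales):
    # Load: append curated rows into the (here, in-memory) fact table.
    fact_sales.extend(clean)
    return fact_sales

fact_sales = load(transform(extract()), [])
```

Note how each quality guarantee maps to one step: deduplication and currency unification happen in Transform, so only conformed rows ever reach `fact_sales`.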


Normalization: 2NF vs 3NF

Normalization is the process of organizing relational tables to minimize redundancy and prevent anomalies (specifically insert, update, and delete anomalies). By restructuring data, it ensures data consistency and accuracy, creating a well-organized database that is easier to maintain and query without errors.

Second Normal Form (2NF)

A table is in 2NF if it is in 1NF and every non-key attribute is fully functionally dependent on the whole primary key. This means there must be no partial dependencies, where a field depends on only part of a composite key.

Third Normal Form (3NF)

A table is in 3NF if it is in 2NF and contains no transitive dependencies (where a non-key attribute depends on another non-key attribute). In short, every non-key attribute must depend on the key, the whole key, and nothing but the key.

Normalization Example

Consider a table Enrollments(StudentID, CourseID, Semester, Grade, StudentName, CourseTitle, DeptName, DeptOffice) with a composite PK (StudentID, CourseID, Semester).

  • 2NF Violation: StudentName depends only on StudentID, and CourseTitle only on CourseID (partial dependencies).
  • 3NF Violation: DeptOffice depends on DeptName, not on the primary key (a transitive dependency).

Resolution:

  • 2NF: Create Students(StudentID, StudentName) and Courses(CourseID, CourseTitle, DeptName), keeping Enrollments(StudentID, CourseID, Semester, Grade).
  • 3NF: Move department details to Departments(DeptName, DeptOffice), keeping DeptName as a Foreign Key in Courses(CourseID, CourseTitle, DeptName).
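The dependencies in this example can be checked mechanically. Below is a small Python sketch over invented sample rows; note that data can only *refute* a functional dependency (by showing two different right-hand values for one left-hand value), never prove it holds in general.

```python
rows = [
    {"StudentID": 1, "CourseID": "DB101", "Semester": "S1", "Grade": "A",
     "StudentName": "Ana", "CourseTitle": "Databases",
     "DeptName": "CS", "DeptOffice": "B-204"},
    {"StudentID": 2, "CourseID": "DB101", "Semester": "S1", "Grade": "B",
     "StudentName": "Luis", "CourseTitle": "Databases",
     "DeptName": "CS", "DeptOffice": "B-204"},
]

def functionally_determines(rows, lhs, rhs):
    """True if, in these rows, each lhs value maps to exactly one rhs value."""
    mapping = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        val = tuple(row[c] for c in rhs)
        if mapping.setdefault(key, val) != val:
            return False  # same lhs, different rhs: dependency violated
    return True

# Partial dependency behind the 2NF violation: StudentID alone -> StudentName.
partial = functionally_determines(rows, ["StudentID"], ["StudentName"])
# Transitive dependency behind the 3NF violation: DeptName -> DeptOffice.
transitive = functionally_determines(rows, ["DeptName"], ["DeptOffice"])
```

Both checks come back true on this sample, matching the violations listed above and motivating the decomposition into Students, Courses, and Departments.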

Integrity Rules

Integrity rules are essential for data quality:

  • Entity Integrity: Ensures that Primary Keys (PK) are unique and NOT NULL, so every row is reliably identifiable.
  • Referential Integrity: Ensures that every Foreign Key (FK) matches a valid parent PK (or is NULL), preventing orphan rows.


Normalization vs. Denormalization

Normalization: The process of minimizing redundancy to ensure data integrity. It splits data into many related tables, serving as the standard for OLTP (Transactional) systems. It optimizes write performance but slows down analysis due to complex joins.

Denormalization: The process of intentionally adding redundancy to optimize read performance. It combines tables into a flat structure (like a Star Schema), forming the basis of Data Warehousing (OLAP). It optimizes read performance for dashboards by eliminating joins, but increases storage usage and complicates data updates.


OLTP vs OLAP

OLTP processes day-to-day transactions with high concurrency, whereas OLAP performs analytical queries on integrated, historical/time-variant data within a DW.

  • Purpose: OLTP runs the business (operational updates, ACID compliance). OLAP analyzes the business (strategic planning, dashboards, slice-and-dice).
  • Data Organization: OLTP uses a highly normalized schema. OLAP uses dimensional models (star schemas) with facts & dimensions.
  • Response Time: OLTP requires milliseconds. OLAP requires seconds to minutes.
  • Freshness: OLTP is real-time. OLAP is periodic (e.g., nightly updates via ETL).


SCD1 vs SCD2

Slowly Changing Dimension (SCD) techniques manage how changes to dimension attributes are recorded over time. SCD2 preserves full history by inserting a new row for each change (creating a new version), which ensures auditability and accurate historical reporting, whereas SCD1 simply overwrites the old value.


Comparison

  • SCD2 (History kept): Pros: full historical analysis, time-travel reporting, auditability. Cons: more rows, complex ETL, complex joins.
  • SCD1 (Overwrite): Pros: simple ETL, smaller dimensions, straightforward joins. Cons: loses history, weak audit trail.

A surrogate key is a DW-generated, meaningless identifier that uniquely identifies each dimension row/version. In star schemas, the surrogate key appears in the fact table as a foreign key, ensuring facts join to the correct historical version of the dimension.
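A minimal SCD2 sketch in Python: changing a customer's city closes the current version and inserts a new row with a fresh surrogate key. The column names (`customer_sk`, `valid_from`, `is_current`) and sample data are illustrative assumptions, not a standard API.

```python
from datetime import date

dim_customer = [
    {"customer_sk": 1, "customer_id": "C001", "city": "Madrid",
     "valid_from": date(2023, 1, 1), "valid_to": None, "is_current": True},
]
next_sk = 2  # next surrogate key the DW would hand out

def apply_scd2(dim, customer_id, new_city, change_date):
    global next_sk
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            if row["city"] == new_city:
                return  # no change, nothing to version
            row["valid_to"] = change_date    # close the old version
            row["is_current"] = False
    dim.append({"customer_sk": next_sk, "customer_id": customer_id,
                "city": new_city, "valid_from": change_date,
                "valid_to": None, "is_current": True})
    next_sk += 1

apply_scd2(dim_customer, "C001", "Barcelona", date(2024, 6, 1))
# The dimension now holds two versions of C001; fact rows that carry
# customer_sk=1 still join to the Madrid-era attributes.
```

An SCD1 overwrite would instead mutate `city` in place on the single existing row, which is why it loses history.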


OLAP Cube & Multidimensional Model

An OLAP Cube is a multidimensional data structure optimized for very fast data analysis. Unlike relational tables, a cube organizes data into Dimensions (context like Time, Product, Location) and Measures (quantitative facts like Sales Amount). This structure pre-aggregates data, allowing Business Intelligence tools to retrieve complex calculations instantly.

Key Concepts

  • Dimensions: The edges of the cube that provide context.
  • Measures: The values inside the cells derived from the Fact Table.
  • Slice: Filtering the cube to a single dimension value (e.g., "Sales for 2024").
  • Dice: Selecting a sub-cube by filtering on multiple dimensions (e.g., "Laptops in Spain during Q1").
  • Drill-Down/Up: Navigating hierarchies from summary to detail or vice versa.
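Slice, dice, and drill-up can be illustrated on a tiny in-memory "cube" held as fact rows; the dimension values and sales figures below are invented, and a real OLAP engine would answer these from pre-aggregated storage rather than filtering lists.

```python
from collections import defaultdict

facts = [
    {"year": 2024, "quarter": "Q1", "product": "Laptop", "country": "Spain",  "sales": 100},
    {"year": 2024, "quarter": "Q1", "product": "Phone",  "country": "Spain",  "sales": 80},
    {"year": 2024, "quarter": "Q2", "product": "Laptop", "country": "France", "sales": 60},
    {"year": 2023, "quarter": "Q4", "product": "Laptop", "country": "Spain",  "sales": 40},
]

# Slice: fix a single dimension value ("Sales for 2024").
slice_2024 = [f for f in facts if f["year"] == 2024]

# Dice: filter on multiple dimensions ("Laptops in Spain during Q1").
dice = [f for f in facts
        if f["product"] == "Laptop" and f["country"] == "Spain" and f["quarter"] == "Q1"]

# Drill-up: aggregate the quarter level up to the year level.
by_year = defaultdict(int)
for f in facts:
    by_year[f["year"]] += f["sales"]
```

Each operation maps onto the cube vocabulary directly: slice and dice restrict dimensions, while drill-up/down moves along a hierarchy by re-aggregating the measure.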


Analytics Maturity Levels

Descriptive Analytics: Answers "What happened?" by summarizing historical data via dashboards and KPIs.

Predictive Analytics: Answers "What will happen?" using statistical models and machine learning to identify patterns and estimate future outcomes.

Prescriptive Analytics: Answers "What should we do?" using optimization and simulation to recommend specific actions.

Retail Scenario

  • Descriptive: Dashboard shows a 15% drop in "Winter Jacket" sales.
  • Predictive: Forecast model predicts low demand for 10 days due to a warm front.
  • Prescriptive: System suggests transferring stock to a high-demand warehouse or applying a discount.


Data Governance

Data Governance is the management of the availability, usability, integrity, and security of enterprise data. To succeed, organizations must focus on Critical Data Elements (CDEs) rather than attempting to govern everything at once; that prioritization then guides the related disciplines:

  • ETL & Data Quality: Dictates which pipelines require cleaning rules.
  • MDM: Selects which entities are mastered first.
  • BI & Analytics: Guarantees that KPIs on executive dashboards are based on trusted data.


Dependent vs. Independent Data Marts

A Dependent Data Mart is sourced directly from the Enterprise Data Warehouse (EDW), ensuring it uses cleaned, integrated data. An Independent Data Mart is built directly from operational sources, bypassing the central ETL and EDW, which creates data silos.

Dependent marts align with Data Governance and MDM, guaranteeing a single version of the truth. Independent marts fragment data quality and lead to conflicting reports across departments.


Inmon vs. Kimball

  • Bill Inmon (Top-Down): Views the DW as a centralized, normalized (3NF) repository. It provides a robust, flexible foundation for long-term enterprise reporting but has a slower time-to-value.
  • Ralph Kimball (Bottom-Up): Views the DW as a union of dimensional data marts. It focuses on delivering business value quickly via star schemas, offering fast ROI and agility.


Star Schema vs. Snowflake Schema

Star Schema: Connects a central fact table to multiple denormalized dimension tables. It is the preferred choice for OLAP because it requires fewer joins, making it faster for BI tools.

Snowflake Schema: An extension where dimension tables are normalized (split into sub-tables). This removes redundancy but creates a complex web of tables that slows down dashboards due to complex joins.
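The single-join access pattern that makes the star schema fast can be sketched with plain dictionaries standing in for dimension tables keyed by surrogate key; the table names, keys, and figures are illustrative assumptions.

```python
# Denormalized dimensions, keyed by surrogate key.
dim_product = {1: {"name": "Laptop", "category": "Electronics"}}
dim_date = {20240101: {"year": 2024, "month": 1}}

# Fact rows carry only surrogate keys plus measures.
fact_sales = [{"product_sk": 1, "date_sk": 20240101, "amount": 999.0}]

# One lookup (join) per dimension produces a report-ready row.
report = [
    {"product": dim_product[f["product_sk"]]["name"],
     "year": dim_date[f["date_sk"]]["year"],
     "amount": f["amount"]}
    for f in fact_sales
]
```

In a snowflake schema, `dim_product` would itself reference a separate category table, adding another hop per attribute; that extra chain of lookups is exactly the join overhead the star schema avoids.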



Data Warehousing Implementation Issues

  • Wrong Sponsorship: Requires an Executive Sponsor with authority.
  • Unrealistic Expectations: Manage expectations to avoid the project being labeled as Vaporware.
  • Politically Naive Behavior: Position the DW as a tool for future decision making, not a critique of past decisions.
  • Loading Irrelevant Data: Only load data that serves specific business needs to ensure performance.
  • Confusing OLTP and DW Design: A DW must be multidimensional and denormalized, not normalized like an OLTP system.
  • Tech-Oriented Management: Success depends on meeting business requirements.
  • Scalability Planning: Always plan for extra capacity.


The GIGO Effect

GIGO (Garbage In, Garbage Out) states that the quality of the output is determined by the quality of the input. If source data is flawed, analytics and dashboards will be worthless.

GIGO occurs when the ETL process fails to perform adequate data profiling and cleansing. The only cure is strict data governance, implementing data quality policies at the source and using ETL logic to trap invalid records before they contaminate the fact tables.
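A quality gate of this kind can be sketched as a simple validation filter in the ETL flow; the rules and sample rows below are illustrative, and real pipelines typically route rejects to a quarantine table for steward review rather than a Python list.

```python
incoming = [
    {"order_id": "O1", "amount": 120.0},
    {"order_id": None, "amount": 50.0},     # null key: garbage
    {"order_id": "O2", "amount": -9999.0},  # implausible value: garbage
]

def validate(row):
    """Return True only for rows that are safe to load into the fact table."""
    return row["order_id"] is not None and row["amount"] >= 0

fact_rows = [r for r in incoming if validate(r)]          # loaded
rejects = [r for r in incoming if not validate(r)]        # quarantined for review
```

The point is that invalid records are trapped *before* the load step, so the fact tables, and every KPI computed from them, never see the garbage.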
