Dimensional Modeling Shawn Nesser Microsoft Topics Introduction Basic
Dimensional Modeling Shawn Nesser Microsoft
Topics Introduction Basic Concepts Slowly Changing Dimensions Conformed Dimensions Advanced Concepts Summary
OLTP v. Data Warehouse Differences between transactional processing (OLTP) and data warehousing understood Transactional Processing Data Warehousing Represents current state Preserves history Predictable usage Highly unpredictable usage Optimized to get data “in” Deals with small number of records Optimized for machine processing Optimized to get data “out” Deals with millions of records Optimized for human processing
Different Designs for Different Problems Transaction processing systems prefer normalized (3 NF) structure Optimized for single record data entry and retrieval Very well defined process and access path Pre-supposes the types of work being performed Ship Type Contract Shipper Ship Product Order Item Contact Location Product Line Sales Order Customer Location Product Group Contract Type Customer Contact Data warehousing requires a different approach Optimized for query (discovery / exploration) performance Simple, understandable and memorable Flexible
Agenda Introduction Basic Concepts Star Schema (Fact and Dimensions Tables) Surrogate Keys Hierarchies / Drill Down Factless Fact Tables Slowly Changing Dimensions Conformed Dimensions Advanced Concepts Summary
Dimensional Model / Star Schema What is a Star Schema? Single data (fact) table surrounded by multiple descriptive (dimension) tables
Dimensional Model – Why do it? Easier to understand navigate Better performance due to fewer joins Extensible to handle change New dimensions New attributes New facts New aggregates New data sources Flexibility to support ad hoc queries / exploration Recommended by most data access tools Dim 1 Key Dim 3 Key Dim 1 Table Dim 3 Table Dim 1 Key Dim 2 Key Dim 3 Key Dim 4 Key Dim 2 Table Fact 1 Fact 2 Fact 3 Dim 4 Key Dim 4 Table
Dimension Table Characteristics Report labels and query constraints “By” words “Where” clauses Descriptive attributes Verbose; Minimal or no codes End user familiar terminology Hierarchical relationships Product Key Product Description SKU # Size Flavor Formula Brand Description Category Description Supports “drill down” into more detail Dimension tables store redundant, descriptive data in order to improve ease-of-use, support single table browsing, and reduce table joins
Product Dimension Table Sample Rows Product Key Product Description SKU # Size Class Formula Brand Description Category Description Product Key Product Description Brand Description Size Class 0001 Cheerios 10 oz Family Cheerios 0002 Cheerios 24 oz Family Cheerios 0003 Total 10 oz Health Total 0004 Total 24 oz Health Total
Fact Table Characteristics Facts Metrics result from business processes Rapidly changing Usually numeric and additive Tend to have huge numbers of records Contains lowest level of grain possible Typically built around single business process Orders, shipments, claims, deposits, sales transactions Multi-part key Foreign keys to dimension tables Date is always a key Resolve many-to-many relationships Types of fact tables Transactional Periodic Snapshots (points in time) Accumulating Snapshots Date Key Product Key Store Key Promotion Key Total $ Sales Total Unit Sales Promoted $ Sales Promoted Unit Sales
Example DIM_PRODUCT DIM_LOCATION PRODUCT_KEY PRODUCT_CODE PRODUCT_CNAME PRODUCT_CAT PRODUCT_SUBCAT PRODUCT_DEPT PRODUCT_BRAND … EFF_FROM_DT EFF_TO_DT CURRENT_FLAG LOCATION_KEY LOCATION_NAME LOCATION_TYPE LOCATION_DESC … EFF_FROM_DT EFF_TO_DT CURRENT_FLAG FACT_SALES PRODUCT_KEY LOCATION_KEY DATE_KEY … SUM_QUANTITY SUM_SALES SUM_COST SUM_MARGIN SUM_TAX PROMO_QUANTITY PROMO_SALES … select location_type, sum(sum_sales) from dim_product, dim_location, fact_sales where fact_sales. product_key = dim_product_key and fact_sales. location_key = dim_location_key and product_cat = ‘LAPTOPS’ group by location_type Sales of: LAPTOPS Location Type Store Internet DIM_TIME DATE_KEY DAY_DATE DAY_IN_YEAR MONTH_NUM QTR_NUM HALF_NUM YEAR_NUM … Sales $150, 000 $320, 000
Data Warehouse Keys Recommend surrogate keys Integer, non-meaningful, sequence number Don’t use operational ID’s or meaningful keys Benefits Isolate warehouse keys from operational system changes Improve performance Support integration from multiple data sources Enable tracking of dimension changes (slowly changing dimensions) Can record “Not applicable” or “TBD” values
Multiple Hierarchies in Dimensions Dimension tables can represent multiple hierarchical roll-ups For example, Store dimension could have the following hierarchies: Physical Geography: Zip, City, County, State, Country Sales Organization: District, Region, Zone Distribution Roll-up: DC Center, DC Region Store Key Store Description Store Type Zip City State Sales District Sales Region Sales Zone DC Center DC Region
Factless Fact Tables How does the data warehouse know that you had a promotion if no promoted products sell? Factless fact table Records either the coverage or occurrence of an event Promotion Event Coverage Table Date Key Store Key Date Desc. Store Desc. Date Key Item Key Store Key Promo Key Item Desc. Promo Counter Promo Key Promo Desc.
Agenda Introduction Basic Concepts Slowly Changing Dimensions Conformed Dimensions Advanced Concepts Summary
Slowly Changing Dimensions Dimensional attributes evolve over time Customers move, get married, have children, adjust salaries You may want to track the history of these changes as it could impact your analysis capabilities. For example: How does getting married impact the customer’s buying behaviors? What impact did the sales reorganization have on sales force productivity? For every dimensional attribute, you need to identify a “change” strategy Type 1: Store only the current value Type 2: Create a dimension record for each value (with or without date stamps) Type 3: Create an attribute in the dimension record for previous value Hybrid
Options for Slowly Changing Dimensions Type 1: Overwrite the Changed Attribute Situation: Tracking the change history of this attribute has no analytic value Original Record Item Key 12345 Item Desc Sim City 3000 Dept Educational SW Updated Record Item Key 12345 Item Desc Sim City 3000 Dept Strategy SW
Type I SCDs Overwrite history Prior state of table STATUS_KEY STATUS_CODE STATUS_DESC 1 OPEN Open Operational System New description read from system STATUS_CODE OPEN STATUS_DESC Open Orders Post-update state of table STATUS_KEY STATUS_CODE STATUS_DESC 1 OPEN Open Orders
Options for Slowly Changing Dimensions Type 2: Add a New Dimension Record Situation: Maintains history as new dimension record (need to access two records to determine the change date) Original Record Item Key 12345 Item Desc Sim City 3000 Dept Educational SW Updated Record Item Key 13456 Item Desc Sim City 3000 Dept Strategy SW
Type II SCDs Keep history Prior state of table STATUS_KEY STATUS_CODE STATUS_DESC 1 OPEN Open EFF_FROM_DT EFF_TO_DT 01/01/1970 12/31/2999 Operational System New description read from system STATUS_CODE OPEN STATUS_DESC Open Orders 1 st January 2005 Post-update state of table STATUS_KEY STATUS_CODE STATUS_DESC EFF_FROM_DT EFF_TO_DT 1 OPEN Open 01/01/1970 12/31/2004 2 OPEN Open Orders 01/01/2005 12/31/2999 New columns added to enable key lookup during fact loading
Issues with Type II SCDs Even simple name changes cause problems DEPT_KEY DESCR DEPTID EFF_FROM_DT 1 U. S. Ops. 000001 01/01/1900 2 Eastern Region 000002 01/01/1900 3 East 01/03/2005 000002 Change of name from ‘Eastern Region’ to ‘East’ Report for January 2005 Department Eastern Region Month Invoice $ Jan $200, 000 Report for March 2005 Department East Month Invoice $ Mar $300, 000 Report for all 2005 Department Eastern Region East Year Invoice $ 2005 $200, 000 2005 $300, 000
Options for Slowly Changing Dimensions Type 3: Add a “Prior” Attribute Situation: Maintains history as new dimensional attribute Original Record Item Key 12345 Item Desc Sim City 3000 Dept Educational SW Updated Record Item Key 12345 Item Desc Sim City 3000 Dept Strategy SW Prior Dept Educational SW
Type III SCDs Keep one version of history Prior state of table STATUS_KEY STATUS_CODE STATUS_DESC 1 OPEN Open PREV_STATUS_DESC Open Operational System New description read from system STATUS_CODE OPEN STATUS_DESC Open Orders 1 st January 2005 Post-update state of table STATUS_KEY STATUS_CODE STATUS_DESC 1 OPEN Open Orders PREV_STATUS_DESC Open
Options for Slowly Changing Dimensions Hybrid Approach Use Type 2 to track changes as they occur Include “current” Type 3 attribute, treated as Type 1 Item Key 12345 13456 14567 Item Desc Sim City 3000 “As Was” Dept Strategy SW Simulation SW “Current” Dept Simulation SW
Type II / Type III Hybrid SCDs Keep history + keep current on each row Prior state of table STATUS_KEY STATUS_CODE CURR_STATUS HIST_STATUS 1 OPEN Open EFF_FROM_DT EFF_TO_DT 01/01/1970 12/31/2999 Operational System New description read from system STATUS_CODE OPEN STATUS_DESC Open Orders 1 st January 2005 Post-update state of table STATUS_KEY STATUS_CODE CURR_STATUS HIST_STATUS EFF_FROM_DT EFF_TO_DT 1 OPEN Open Orders Open 01/01/1970 12/31/2004 2 OPEN Open Orders 01/01/2005 12/31/2999
Type II / Type III Hybrid SCDs select product_name, product_cname, quarter, sum(sum_sales) from dim_product, dim_time, fact_sales where dim_product_key = fact_sales. product_key and dim_time. date_key = fact_sales. date_key group by product_name, product_cname, quarter Name Marathon Snickers Skittles Quarter Invoice $ Q 1 $500, 000 Q 1 $250, 000 Q 2 $300, 000 Using historical name Name Snickers Skittles Quarter Invoice $ Q 1 $750, 000 Q 2 $300, 000 Using current name
Agenda Introduction Basic Concepts Slowly Changing Dimensions Conformed Dimensions Advanced Concepts Summary
4 Step Dimensional Modeling Design Process 1. Choose the Business Process § Examples: invoices, orders, shipments, claims, deposits § A set of related fact and dimension tables for each process 2. Identify the Grain § Select the lowest possible granularity (transaction line item, call detail record) 3. Identify the Dimensions § Conformed dimensions 4. Identify the Facts
Value Chain Design Implications Separate fact tables for each business process Each business process represented by one fact table since the grain, dimensions and facts are unique for each business process The Conformed Dimensions enable the linkage between the different business processes (stars) Date Dimension Inventory Facts Date Key Item Key Store Key Item Dimension Store Dimension Promo Dimension Sales Facts Date Key Item Key Store Key Promo Key
Data Warehouse Enterprise Bus Architecture Purchase Orders Store Inventory Store Sales Date Item Store Promo Distribution Shipper Center Vendor
Data Warehouse Bus Matrix Business processes and shared dimensions Date Item Store Prom o X X Dist Ctr Shipp er Store Sales X X Store Inventory Store Deliveries Dist Ctr Inventory Dist Ctr Delivery Purchase Orders X X X X X Vend or X X
Agenda Introduction Basic Concepts Slowly Changing Dimensions Conformed Dimensions Advanced Concepts Summary
Advanced Dimensional Modeling Some questions … What does it mean to “conform” dimension tables? Should I treat very large dimensions differently from other dimensions?
Conformed Dimensions Separate fact tables for each business process Each business process represented by one fact table since the grain, dimensions and facts are unique for each business process The conformed dimensions enable the linkage between the different business processes (stars) Date Dimension Inventory Facts Date Key Product Key Store Key Product Dimension Location Dimension Promo Dimension Sales Facts Date Key Product Key Store Key Promo Key
Conforming Multiple Dimension Tables All ‘Product’ tables should conform Conform = built from a common source e. g. DIM_PRODUCT DIM_GL_PRODUCT_KEY GL_PRODUCT_KEY PRODUCT_CODE PRODUCT_NAME PRODUCT_CAT … GL_PRODUCT_DESC … Account Dimension GL Product Dimension Cost Center Dimension Product Dimension GL Facts Account Key GL Product Key Cost Center Key Period Activity $ AR Facts Account Key Cost Center Key Product Key Line Amount $ Line Number
Conforming Multiple Dimension Tables select gl_product, sum(line_amt) from dim_product, fact_inv_ln where fact_inv_ln. product_key = dim_product_key and gl_product in (‘License’, ‘Services’) group by gl_product Invoice Sales GL Product License Services select gl_product, sum(period_act) from dim_gl_product, fact_gl where fact_gl. gl_product_key = dim_product. gl_product_key and gl_product in (‘License’, ‘Services’) group by gl_product GL Period Activity Invoice $ $550, 000 $125, 000 GL Product License Services Period Activity $550, 000 $125, 000 ‘Reconciliation’ GL Product License Services Invoice $ Period Activity $550, 000, 000 $125, 000, 000
Very Large Customer Dimensions The process of de-normalization naturally leads to a single customer dimension table Potential issues Tracking change on some attributes = several million rows of data Each customer query hits a very large table How often do you need address information? Conclusion Non-optimal query performance Customer table is best split according to use Demographic queries Current customer queries Historical customer queries DIM_CUSTOMER_KEY CUST_NUMBER CUST_NAME CUST_TYPE CUST_STATUS GENDER NATIONALITY CHILDREN INCOME_BAND ADDR_LINE 1 ADDR_LINE 2 ADDR_LINE 3 …
Very-Large-Customer (VLC) Dimensional Model DIM_CST Type I/III SCD Current attributes MDIM_CST_DEMOG Mini-dimension Demographic attributes DIM_CST_EXTEND Type II SCD Changing attributes 20 million 100 million DIM_CST_EXTEND CUSTOMER_KEY CUSTOMER_NAME CUSTOMER_TYPE CUSTOMER_STATUS CUR_CST_DEMOG_KEY CST_PRODUCTN_ID CST_EXTEND_KEY FCT_SALES CUSTOMER_KEY CST_EXTEND_KEY CST_DEMOG_KEY DATE_KEY … MDIM_CST_DEMOG_KEY BANDED_INCOME CHILDREN AGE_BAND GENDER HOME_TYPE HOME_STATUS NATIONALITY MARITAL_STATUS … max 2 million SUM_QUANTITY SUM_SALES SUM_COST SUM_MARGIN SUM_TAX … >1 billion CUSTOMER_KEY ADDR_LINE 1 ADDR_LINE 2 ADDR_LINE 3 ADDR_CITY … EFF_FROM_DT EFF_TO_DT CURRENT_FLAG DIM_TIME DATE_KEY DAY_DATE DAY_IN_YEAR MONTH_NUM QTR_NUM HALF_NUM YEAR_NUM …
Using the VLC Dimensional Model Using the mini-dimension Using the ‘extension’ dimension select gender, sum(sum_sales) from mdim_cst_demog, fact_sales where sales. cst_demog_key = demog. cst_demog_key and gender = ‘Male’ group by gender select state, sum(sum_sales) from dim_cst_extend, fct_sales where sales. cst_extend_key = extend. cst_extend_key and state in (‘CA’, ‘FL’) group by state Sales by Gender Sales by State Gender Male Invoice $ $1, 250, 000 Retrieve large number of rows for each index value Constrain small table join to big table State CA FL Invoice $ $1, 500, 000 $750, 000 Retrieve small number of rows for each index value Constrain big table join to bigger table
Agenda Introduction Basic Concepts Slowly Changing Dimensions Conformed Dimensions Advanced Concepts Summary
Key Dimensional Modeling Benefit: Extensibility Get the grain right upfront, and everything else takes care of itself Atomic data is the most naturally dimensional data For example, POS business process data mart Natural grain is the individual POS transactions Date Dimension Date Key (PK) Day of Week Number Month Number Store Dimension Store Key (PK) Store ID Store Name Address District Region Product Dimension Product Key (PK) Description Brand Category Flavor Size POS Transaction Fact Date Key (FK) Product Key (FK) Promotion Dimension Promotion Key Store Key (FK) (PK) Promotion Key (FK) Promotion Name POS Transaction # Price Treatment (DD) Ad Treatment Dollars Display Treatment Units Coupon Costs Treatment
Key Dimensional Modeling Benefit: Extensibility If you get the grain right, then the data model is easily extensible In our POS example, we can extend for: Date Dimension Date Key (PK) Day of Week Number Month Number Product Dimension Product Key (PK) Description Brand Category Flavor Size POS Transaction Fact Date Key (FK) Store Dimension Promotion Dimension Product Key (FK) Promotion Key Store Key (PK) Store Key (FK) (PK) Store ID Promotion Key (FK) Promotion Name Store Name Customer Key (FK) Price Treatment Customer Address Clerk Key (FK) Ad Treatment Dimension (to District POS Transaction # Display Treatment support new loyalty Region (DD) Coupon Dollars card) Treatment Units Customer Clerk Dimension Customer. Dimension Key (PK) Clerk Dimension Costs Clerk Key (PK) Customer Name Clerk Name (to track individual Customer City Clerk Age Band clerk performance) Customer State Clerk Years of Customer Age Service Band Clerk Education Customer LTV Estimate
Dimensional Models for User Access Dimensional models required for end user access Focusing on data presentation area, not data staging area Data Presentation Data Staging Data Mart Data Staging Area Data Mart End User Data Access Tools Enterprise Data Warehouse Characteristics of a dimensional model Optimized for query (discovery / exploration) performance Simple, understandable and memorable Flexible
Additional Resources on Dimensional Modeling “The Data Warehouse Toolkit” second edition “The Microsoft Data Warehouse Toolkit” Kimball University Design Tips www. ralphkimball. com Kimball’s Data Warehouse Designer column www. intelligententerprise. com The Data Warehouse Institute www. dw-institute. com
- Slides: 44