Data Warehouse Design ROLAP or Star Schema Design

  • Slides: 110
Download presentation
Data Warehouse Design ROLAP or Star Schema Design

Data Warehouse Design ROLAP or Star Schema Design

The Dimension Model 1. . * Has parameter 2. . * Fact Dimension When

The Dimension Model 1. . * Has parameter 2. . * Fact Dimension When facts and dimensions are expressed in Relational form, then we get a star schema Multidimensional data structure, then we get a cube Information obtained from requirements engineering has to be converted into data warehouse design: star schema/cube

Design Issues Identify and represent different kinds of facts and dimensions Detail the measures

Design Issues Identify and represent different kinds of facts and dimensions Detail the measures and data types represented in facts Detail the properties of dimensions. Design “good” facts and dimensions

Basic Design Basic Dimension Design: Fact: Dimension : : 0. . *: 1 Fact

Basic Design Basic Dimension Design: Fact: Dimension : : 0. . *: 1 Fact Design

Basic Dimension Design Why 2. . *? Surrogate Key of dimension parameter itself 1.

Basic Dimension Design Why 2. . *? Surrogate Key of dimension parameter itself 1. . 1 Has property 2. . * Dimension Attribute 2. . * Composed of Simple 1. . 1 Simple Dimension Composed Dimension Surrogate Why 2. . *? No recursion: separate, distinct attributes needed Natural Key Property

Basic Dimension Design Steps in basic dimension design Make a dimension for each category

Basic Dimension Design Steps in basic dimension design Make a dimension for each category Introduce Surrogate key Introduce dimension parameter key Identify component dimensions of a dimension Identify the attributes of the dimension and of all component dimensions

Example Requirements engineering of a retail store has thrown up the need to maintain

Example Requirements engineering of a retail store has thrown up the need to maintain sales data productwise. (Assumption: consider only those aspects of a product that affect sales) Product is a dimension Introduce the surrogate and product keys Identify the component dimensions Brand of product Category of product Department of product Identify attributes Product: Pnumber, Pid, Product description Brand: Brand Name etc. Product Sales Product Pnumber (surrogate key) Pid (product key) Product description Brand name Category name Department name

Example Product dimension as a relational table (dimension in a star schema) Pnumber Pid

Example Product dimension as a relational table (dimension in a star schema) Pnumber Pid Description Brand_name Category_ Department name _name 1 WB 01 White bread Harvest Bread Bakery 2 BB 01 Brown bread Le Bon Bread Bakery 3 MI 01 Full milk Amul Milk Dairy 4 MI 02 Toned Milk Mother Dairy Milk Dairy 5 CH 01 Processed Cheese Amul Cheese Dairy

Exercise Requirements engineering of a retail store has thrown up the need to maintain

Exercise Requirements engineering of a retail store has thrown up the need to maintain sales data product-wise as well as department-wise Build the star schema and elaborate the main and component dimensions only. Thereafter show its sample relational table. Product Sales Department has Surrogate and user key Component Dimensions Logistics Features Customer Friendliness Department Deptid Deptkey Deptname Floor located on Fire escape flag Access(lift, stairs, escalator) Area Refrigerated flag Layout display Promotions display

Basic Dimension Design: Dimension Components Brand has Brandid Brand name Brand type (MNC, Local)

Basic Dimension Design: Dimension Components Brand has Brandid Brand name Brand type (MNC, Local) Product Category has its own dimension attributes Pnumber Pid Product description The table of Product will not be normalized since it contains non-atomic attributes, Brand Category Brandid Brand name Brand type Separating Brand from Product and Category from Brand shall result in normalized dimension tables Category id Category name Sub-category Product Dimension Brand Dimension Category Dimension

Dimension Design Why 2. . *? Surrogate Key of dimension parameter itself 1. .

Dimension Design Why 2. . *? Surrogate Key of dimension parameter itself 1. . 1 Has property 2. . * Dimension Simple Dimension Attribute 2. . * Composed of 1. . 1 Composed Dimension Simple composed Why 2. . *? No recursion: separate, distinct attributes needed Property

Dimension Design: Composed Attributes Make a dimension for each category Introduce Surrogate key Introduce

Dimension Design: Composed Attributes Make a dimension for each category Introduce Surrogate key Introduce dimension parameter key Identify component dimensions of a dimension Identify the attributes of the dimension and of all component dimensions Flatten out all composed attributes recursively till simple attributes are reached Make combinations of flattened attributes and select the meaningful ones Introduce all the attributes of main dimension For un-normalized dimension: Introduce all attributes of all components For normalized dimension: keep component dimensions separate but link to main

Example Requirements engineering of a retail store has thrown up the need to maintain

Example Requirements engineering of a retail store has thrown up the need to maintain sales data product-wise as well as department-wise. It is also required to look at sales datewise Build the star schema and elaborate the date dimension ent m t r epa Product D Sales Date

The Date Dimension What is the date? 1/12/2015 Flatten the structure into day, month,

The Date Dimension What is the date? 1/12/2015 Flatten the structure into day, month, year Components of Date Georgian Calendar having attributes Day Week Month Quarter Half-year Fiscal Calendar Holiday flag Weekday flag Date Surrogate Key Day of week Day of Calendar Month Day of Year Month of Year Day of Fiscal week Day of Fiscal Month Day of Fiscal year Month of Fiscal Year Calendar Week of Month Calendar Week of Year Week of Fiscal Year Month of Fiscal Calendar Quarter Fiscal Quarter Calendar half-Year Fiscal Half year Calendar Year Fiscal year

The Date Dimension: Role of Data Type Several possible data types of attributes Day

The Date Dimension: Role of Data Type Several possible data types of attributes Day (1 -7, 1 -365, 1 -31) Week (Mon-Sun, 1 -52, 1 -5) Month (1 -12, Jan-Dec, Apr-Mar) Quarter Half-year Fiscal Calendar Holiday flag (0/1, Y/N, Holiday/Noholiday) Weekday flag Date Surrogate Key Day Number of week Day Number of Calendar Month Day Number of Year Month Number of Year Day Number of Fiscal week Day Number of Fiscal Month Day Number of Fiscal year Month Number of Fiscal Year Calendar Week Number of Month Calendar Week Number of Year Week Number of Fiscal Year Month Number of Fiscal Calendar Quarter Number Fiscal Quarter Number Calendar half-Year Number Fiscal Half year Number Calendar Year Fiscal year

The Date Dimension: Combining Attributes Results may be desired for first quarter of fiscal/calendar

The Date Dimension: Combining Attributes Results may be desired for first quarter of fiscal/calendar year Quarter number- Calendar Year: YYYY-Q Quarter Number- Fiscal Year: FYFY- FQ Month and Year combination: YYYY-MM Maintain all interesting combinations Date Surrogate Key AS BEFORE Quarter number- Calendar Year Quarter Number- Fiscal Year Month and Year OBSERVATION: Dimensions and their tables can be very wide due to Flattening taking account of data type Combining Compared to Facts, Dimension tables have fewer rows

Exercise Requirements engineering of a retail store has thrown up the need to maintain

Exercise Requirements engineering of a retail store has thrown up the need to maintain sales data product-wise as well as department-wise. It is also required to look at sales datewise and customer-wise Build the star schema and elaborate the customer dimension.

Basic Dimensions Design: Summary • Dimensions are non-additive and textual in nature • Dimensions

Basic Dimensions Design: Summary • Dimensions are non-additive and textual in nature • Dimensions have component dimensions • Dimension instances: Fact instances : : 1: N (will see later M: N) • Dimensions have attributes: single valued (will see multi-valued later) • Dimension attributes are atomic ① components may be repeating groups but flattened ② dimension attributes may be multi-valued • The dimension design process considers ① flattening repeating groups ② formulating dimension attributes with different data types ③ combining attributes into meaningful query terms

Fact Design

Fact Design

Fact Design. . * 1 by e 0 M * . . ed r

Fact Design. . * 1 by e 0 M * . . ed r u as Fact 1. . * H as p ara Type me te r 2. . * Dimension Measure Nonadditive Additive Semi-additive Atomic Aggregate Facts must be “computable” along all or subset of dimensions

Fact Design: Facts and Measures A fact is a collection of measures Requirements engineering

Fact Design: Facts and Measures A fact is a collection of measures Requirements engineering of a retail store has thrown up the need to maintain sales product-wise and shop-wise How do we measure sales? amount sold, sales value, cost value, profit value A Fact called sales Is ‘selling rate’ a measure? Is ‘discount offered’ a measure? Requirements engineering of a retail store has thrown up the need to keep data on quality of experience of customers for different shops How do we measure “quality of experience’? Enumeration(excellent, good, average, fair, poor) Rate on scale of 1 to 10, 1 being poor and 10 being excellent DANGER

Fact Design: Nature of Measures A measure is normally additive or semi-additive. Facts with

Fact Design: Nature of Measures A measure is normally additive or semi-additive. Facts with non-additive measures exist. Non-additive measures: textual, Boolean, percentages, ratios Text disguised as numeric: Rated quality of experience on a scale of 1 to 10 Not continuously valued: what does 1. 5 mean? Additive and Semi-additive facts must be continuously valued: numeric Additive measures are amenable to computation over all its dimensions sales amount of each sales transaction, dimensions are product, date marks of students of each subject, dimensions are all subjects, semester Semi-additive: computation over a subset of its dimensions Quantity on hand additive over products and product warehouses non-additive over dates

Fact Design: Atomicity Data Warehouse must contain atomic facts and, possibly, physically stored aggregates

Fact Design: Atomicity Data Warehouse must contain atomic facts and, possibly, physically stored aggregates resulting from these View stored aggregates without atomic facts suspiciously Requirements engineering of a retail store has thrown up the need to keep data on quality of experience of customers for different shops Good proposals 1. Store fact(amount sold, sales value, cost value, profit value) 2. Store Fact(amount sold, sales value, cost value, profit value) Fact(total amount sold over all products shop-wise) Questionable Proposal Fact(total amount sold over all products shop-wise)

Fact Design: Type of Fact Atomic Transaction Aggregate Accumulating Snapshot Periodic Snapshot Factless Conformed

Fact Design: Type of Fact Atomic Transaction Aggregate Accumulating Snapshot Periodic Snapshot Factless Conformed

Transaction Fact Type Fact type: Transaction Type, Periodic Snapshot Type, Accumulating Snapshot A transaction

Transaction Fact Type Fact type: Transaction Type, Periodic Snapshot Type, Accumulating Snapshot A transaction is basic to a business Buy, sell, reserve, cancel A record of a transaction is atomic, a fact Information Requirement A retail store has to determine the amount of commission it has to pay to different credit card companies. The payment has to be done at the end of each day. Transaction Fact(Transaction_id, Cash payment amount, Card payment amount) NOTE: fact kept for each credit card company

Transaction Fact Type Sale Fact Transaction_id CCcompany_id(FK) Date_id (FK) Cash payment amount Card payment

Transaction Fact Type Sale Fact Transaction_id CCcompany_id(FK) Date_id (FK) Cash payment amount Card payment amount CCcompany Dimension Date Dimension Sum up card payments at the end of the day and compute the commission to be paid • Computed daily: no record of total card payment amount as at the end of the day is maintained • Number of facts is equal to the number of sales made • To filter out those facts that have card payment component

Periodic Snapshot Fact Type Periodic snapshots generate facts at periodic intervals when status is

Periodic Snapshot Fact Type Periodic snapshots generate facts at periodic intervals when status is picked up cumulative performance at determined periodicity Each snapshot is a record of what happened over the interval Evolving status during the interval is not interesting End of interval status affects decision making Decision: Negotiate with credit card companies to get good commission rates. Information Requirement Total card payment amounts and commissions paid by credit card company by week Snapshot Fact: Commission(Commission_id, Card amount, Commission paid) Note: One fact per credit card company per week Total in the week

Periodic Snapshot Fact Type Star Schema Commission Fact Commission_id CCcompany_id(FK) Date_id (FK) Card amount

Periodic Snapshot Fact Type Star Schema Commission Fact Commission_id CCcompany_id(FK) Date_id (FK) Card amount Commission amount CCcompany Dimension Weekly performance is stored even though derivable from transaction data Need for co-existing transaction and periodic snapshot Week dimension is overlaid on basic date dimension to build multiple calendars

Periodic Snapshot Fact Type May contain fact types that are semi-additive Can be additive

Periodic Snapshot Fact Type May contain fact types that are semi-additive Can be additive on some but all dimensions Let bank balance be the aggregated fact with month and account as dimensions Bank balance at end of year is not sum of bank balances of the twelve months

Accumulating Snapshots Generate a fact as consisting of Intermediate data about a non-instantaneous transaction

Accumulating Snapshots Generate a fact as consisting of Intermediate data about a non-instantaneous transaction or a business process Performance over a period of time represented as discrete data The period of time is known as a “life”. We accumulate measures over the life. As each new measure of the fact becomes known, an update of the fact occurs Information Requirement Keep data of progress of order. Order occurs at various moments, order placement time, order review time, order acceptance time, under process time, shipped time, received time, paid time Accumulating Snapshot for behaviour of commission paid: One fact for the entire life (Order_id, Placed date, review date, acceptance date, under-process date, shipped date, received date, paid date)

Accumulating Snapshots Star Schema AC_Commission Fact Ac_id CCcompany_id(FK) Date_id (FK) Placed date review date

Accumulating Snapshots Star Schema AC_Commission Fact Ac_id CCcompany_id(FK) Date_id (FK) Placed date review date acceptance date under-process date Shipped date received date paid date CCcompany Dimension Year Dimension Need for co-existing transaction and accumulating snapshot. Periodic snapshot if needed Year dimension is overlaid on basic date dimension to build multiple calendars

Fact Design: Fact Type Comparison Characteristic Transaction Periodic Snapshot Accumulating Snapshot Time Period A

Fact Design: Fact Type Comparison Characteristic Transaction Periodic Snapshot Accumulating Snapshot Time Period A moment in time Fixed interval Variable time Grain Fact per occurrence Fact period Fact per life Load time Insert and Update Date Occurrence date End of period Multiple dates Fact Transaction record Performance in period Performance over life time

Exercises In the exercises that follow determine the facts and measures that the data

Exercises In the exercises that follow determine the facts and measures that the data warehouse shall contain and then draw the star schema. a. The branch of a bank wishes to determine the amount of cash it should ask the treasury to send to it next day morning. The heuristic followed is that cash withdrawn on the previous day shall be drawn in the next day as well. b. A tourist office received requests for reserving different tourist packages from various customers. In order to eliminate non-profit making reservations, it keep track of the amount of revenue generated by each package on a monthly basis. c. Student performance in a course is the sum of all the marks obtained in the various evaluation instruments like assignments, quizzes, projects, examinations etc. Student performance is kept student-wise, semester-wise, and course-wise.

Constellations This schema contains multiple fact tables that share a subset of the set

Constellations This schema contains multiple fact tables that share a subset of the set of dimensions For each star schema or snowflake schema it is possible to construct a fact constellation schema: Separate measures of a fact into separate facts if even one of the following holds: they are not updated at the same time (not same transaction) have different dimensions associated with them have different granularity D 1 D 3 F 1 F 2 D 2

Using Constellations: Granularity Some facts are associated with a given dimension level and other

Using Constellations: Granularity Some facts are associated with a given dimension level and other facts with a deeper dimension level. Related facts at different granularity levels Obtain data that carries no meaning in a report Keep as a constellation Sale Fact CCcompany Dimension Commission Fact Date Dimension

Using Constellations: Time Difference Two measures for an order: order quantity and shipped quantity

Using Constellations: Time Difference Two measures for an order: order quantity and shipped quantity Different update times of the two quantities: Time of order is different from time of shipping Introduce zeroes or NULLS that show up in queries. Keep as a constellation Customer Order Fact Day Shipping Fact Product Dimension

Using Constellations: Different Dimensions The shipped quantity has dimension Transporter An ordered quantity but

Using Constellations: Different Dimensions The shipped quantity has dimension Transporter An ordered quantity but not shipped needs transporter_id as FK Need: Introduce a zero/NULL Transporter that shows up in queries Order Fact Surrogate Customer_key(FK) Day_Key(FK) Product_Key(FK) Transporter_key(FK) Ordered quantity Shipped quantity Customer Dimension Day Dimension Product Dimension Transporter Dimension

Using Constellations: Different Dimensions Customer Dimension Order Fact Surrogate Customer_key(FK) Day_Key(FK) Product_Key(FK) Ordered quantity

Using Constellations: Different Dimensions Customer Dimension Order Fact Surrogate Customer_key(FK) Day_Key(FK) Product_Key(FK) Ordered quantity Day Dimension Product Dimension Surrogate Customer_key(FK) Day_Key(FK) Product_Key(FK) Transporter_key(FK) Shipped quantity Transporter Dimension

Fact-less (Measureless) Facts that contain no measures Existence of row is the fact Record

Fact-less (Measureless) Facts that contain no measures Existence of row is the fact Record of attendance of student can be kept as measurable facts: Date Course Hour Attendance Date key Course Key Hour key Teacher Key Present (Y/N) Teacher

Fact-less (Measureless) Facts Alternatively Keep information of those present only Date Course Hour Present

Fact-less (Measureless) Facts Alternatively Keep information of those present only Date Course Hour Present Date key Course Key Hour key Teacher Key Fact contains no measure, presence of row is enough Teacher

Fact-less (Measureless) Facts How to find those who are absent Date Course Hour Absent

Fact-less (Measureless) Facts How to find those who are absent Date Course Hour Absent Date key Course Key Hour key Teacher Key Present Fact-less facts represent a relationship between the dimensions. Teacher

Exercise Requirements Engineering throws up the need to monitor the stock in hand. A

Exercise Requirements Engineering throws up the need to monitor the stock in hand. A sales fact keeps track of the stock level before sales and after sales, besides having the sales transaction information. The organization is also interested in knowing which product did not sell on a given day. Draw a schema for the foregoing.

Advanced Dimension Design Universal Dimension Assumption: All dimension attributes are available in one universal

Advanced Dimension Design Universal Dimension Assumption: All dimension attributes are available in one universal dimension Universal Fact Assumption: Factors already considered in Fact Design

Dimensions: Grouping Attributes What should be the width of a dimension? Physical issues: Limitations

Dimensions: Grouping Attributes What should be the width of a dimension? Physical issues: Limitations on number of attributes in relational systems block sizes, buffer sizes MD Modeling Issues why not one catch-all dimension? why not several small dimensions? Factors Relationships between facts and dimensions Cardinality of the relationship

Dimension Attribute Groups: Relationships Explicit fact-dimension relationship calls for separate dimensions Remove the attributes

Dimension Attribute Groups: Relationships Explicit fact-dimension relationship calls for separate dimensions Remove the attributes group involved in relationship into separate dimension Implicit relationship between dimension attributes: put attributes in same dimension Affinity Test: Relationship between attributes independent of fact “ product has brand” exists whether a sales has occurred or not Browse-ability Test: Are attributes retrieved together in browse query? what are the products offered by a brand?

Dimension Groupings: Alternatives to Splitting When relationships cannot be determined and the dimension width

Dimension Groupings: Alternatives to Splitting When relationships cannot be determined and the dimension width is very large • Split the dimension D into D 1 and D 2, each with its own surrogate key D 1_SK and D 2_SK • D 1: D 2: : 1: 1. Therefore the split row has same common surrogate key • Include D 1_SK and D 2_SK in the fact Relocate Free form Text to an outrigger dimension Relocate such data to another dimension linked to the main dimension Identify sub-types Those groups of attributes that provide values to subset if dimension instances Publisher applies to the book category of product but not to TV sets

Dimension Groupings: Alternatives to Splitting Make Mini-Dimensions: Controlled width but reduced browse-ability Similar to

Dimension Groupings: Alternatives to Splitting Make Mini-Dimensions: Controlled width but reduced browse-ability Similar to junk dimensions can be arbitrary collection of attributes grouped to reduce width • Remove the relatively more changeable attributes and group into another dimension • Pre-compute all possibilities and make these as dimension instances Policy Key Policy number Policy holder address Policy coverage key Family_size Spouse coverage Children coverage PAYMENT FACT Policy Key Policy coverage key Payment amount

Dimension Roles A given fact instance is associated with several instances of the same

Dimension Roles A given fact instance is associated with several instances of the same dimension Introduce several foreign keys in the fact, one for each association Admission Application Fact Application key Employee key_receiver Employee key_processor Employee key_approver Define Views/Use aliases Admission Application Fact Employee key Employee name Receiver Processor Approver Employee

Kinds of Dimensions Dimension Casual Degenerate Role Playing Junk Causal Behavioural Conformed

Kinds of Dimensions Dimension Casual Degenerate Role Playing Junk Causal Behavioural Conformed

Causal Dimensions Change the normal behaviour of business events Promotions affect buying/selling; reservations/cancellations The

Causal Dimensions Change the normal behaviour of business events Promotions affect buying/selling; reservations/cancellations The changed behaviour has an impact on decisions about introducing these or not; continuing with such schemes or not etc. Requirements: For a retail shop keep sales promotion-wise. Date Dimension Sale Fact Transaction_id CCcompany_id(FK) Date_id (FK) Cash payment amount Card payment amount Promotion Dimension Start_Date End_Date Ad_Type Display_Type Offer description

Degenerate Dimensions Dimension has no content except for primary key invoice_number They are placed

Degenerate Dimensions Dimension has no content except for primary key invoice_number They are placed directly in the FACT table Sale Fact Transaction_id CCcompany_id(FK) Date_id (FK) Cash payment amount Card payment amount

Role Playing Dimensions A single physical dimension can be referenced multiple times in a

Role Playing Dimensions A single physical dimension can be referenced multiple times in a fact table, with each reference linking to a logically distinct role for the dimension. Billing date to customer and Shipping date to customer Order address of customer, Ship_to address Each will be represented by a separate key in the FACT table Requirement: A customer order address and telephone number are different from delivery address and telephone number Delivery Address Dimension Delivery Telephone Dimension Sale Fact Transaction_id CCcompany_id(FK) Date_id (FK) Cash payment amount Card payment amount Order Address Dimension Order telephone Dimension

Junk Dimensions Some information in source systems is useful for query constraints flags: scholarship

Junk Dimensions Some information in source systems is useful for query constraints flags: scholarship [Yes/No] codes: product code [5 digit numeric] free form text Group all such into one ‘Junk Dimension’

Behavioural Dimension Information obtained from facts becomes a dimension What is the average discount

Behavioural Dimension Information obtained from facts becomes a dimension What is the average discount offered to customers who have bought goods more than Rs. 100, 000? BI is a two step process Step 1: Obtain information Sum the sales for every customer Step 2: Use obtained information as a parameter to do further querying Find Average Discount for Customers obtained Disadvantage: Expensive to execute and difficult for decision-makers to write Convert to expense at ETL time

Designing Behavioural Dimensions Form categories of customers Above 100, 000 Between 50, 000 and

Designing Behavioural Dimensions Form categories of customers Above 100, 000 Between 50, 000 and 99, 999 Less than 50, 000 Introduce customer category as an attribute in the customer dimension Introduce attribute What is the largest salary increase of employees after their last promotion? Introduce last promotion date as a dimension attribute in employee dimension Historical fact as dimension attribute Which department has sold more than the annual sales of last year Keep last year annual sales as attribute in department dimension

Dimension Type Summary Dimension Type Meaning Casual Parameters of business events Causal provide insight

Dimension Type Summary Dimension Type Meaning Casual Parameters of business events Causal provide insight into special causes of business events Degenerate identify control numbers of operational systems, collect measures together Multiple (Date, Products, . . ) Multiple date stamps, product stamps Role playing Single dimension appears many time in a fact Junk Correlated behaviours and flags Behavioural When facts become dimensions

Bridges

Bridges

Multi-valued Dimension Attributes A dimension attribute is multi-valued if it has more than one

Multi-valued Dimension Attributes A dimension attribute is multi-valued if it has more than one value at a given time Requirement: Maintain bank account balance customer-wise. In joint accounts there are more than one account holders. Is customer a role playing dimension? No, several customers play the same role Introduce a bridge dimension, customer group, between customer and account balance fact Account Balance Fact Transaction_Id Account Group Key(FK) Date_id (FK) Debit amount Credit amount Customer Group Bridge Account Group_key Customer Key Operating instruction Offer description Customer Dimension Customer key Customer data

Facts: Dimensions : : M: N A fact (instance of Sales) is associated with

Facts: Dimensions : : M: N A fact (instance of Sales) is associated with more than one instance of sales representative Sale Fact Transaction_id CCcompany_id(FK) Date_id (FK) Cash payment amount Card payment amount Sales Rep_id(FK) 1. . * Sales Rep Dimension Need multiple keys for the multiple sales reps

Facts: Dimensions : : M: N Sale Fact Transaction_id CCcompany_id(FK) Date_id (FK) Cash payment

Facts: Dimensions : : M: N Sale Fact Transaction_id CCcompany_id(FK) Date_id (FK) Cash payment amount Card payment amount Sales Rep_id 1(FK) Sales Rep_id 2 (FK) Sales Rep 1 Dimension Sales Rep 2 Dimension Two difficulties if maximum number of sales representatives in not known, then ? sum multiple counting what is the total sold by a representative? A given representative may appear as rep 1 or rep 2: counted twice in

Facts: Dimensions : : M: N Introduce a bridge Sale Fact Transaction_id CCcompany_id(FK) Date_id

Facts: Dimensions : : M: N Introduce a bridge Sale Fact Transaction_id CCcompany_id(FK) Date_id (FK) Cash payment amount Card payment amount Sales group id(FK) Sales Rep Group Sales group id Sales Rep_id (FK) Sales Rep Dimension Sales Rep_id (PK) Name

Dimension Hierarchies Attribute Hierarchies Recursive Hierarchies

Dimension Hierarchies Attribute Hierarchies Recursive Hierarchies

Attribute Hierarchies Attributes or groups of attributes of a dimension exhibit a parent-child relationship

Attribute Hierarchies Attributes or groups of attributes of a dimension exhibit a parent-child relationship Multiple Hierarchies possible Must document all these and make them explicit in the design Product Biscuits Cakes Bakery Spices Pre-cooked ITC Pnumber (surrogate key) Pid (product key) Product description Products Categories Brand name Category name Departments Bakery

Recursive Hierarchies When dimensions/attributes are related to themselves Parts are composed of parts Employees

Recursive Hierarchies When dimensions/attributes are related to themselves Parts are composed of parts Employees are managed by Employees Companies have subsidiary companies Company dimension Company bridge Company dimension Company bridge

Designing to Provide Operational Support Changes in Values of Dimension Attributes Changes in Hierarchies

Designing to Provide Operational Support Changes in Values of Dimension Attributes Changes in Hierarchies Erroneous and NULL Values Snowflake Schemas

Updates of the Dimension Table What changes? The surrogate key does not change Changes

Updates of the Dimension Table What changes? The surrogate key does not change Changes in other attributes may occur with low frequency or rapidly NOTE: Changes in source (OLTP) systems overwrite old values Overwriting in dimension tables is not always desirable product category changes in the source analysis under the old and the new category is still required

Handling Changes The structure of the table allows only three possibilities • Type I

Handling Changes The structure of the table allows only three possibilities • Type I Change (overwrite: lose earlier information) • Type II Change (new record: keep old record) • Type III Change (new attribute: keep old information) Mix of approaches is possible • Hybrid Approach – Predictable changes with multiple version overlays – Unpredictable changes with single version overlays

Slowly Changing Attribute Values: Type 1: Correction of an error in source systems Typing

Slowly Changing Attribute Values: Type 1: Correction of an error in source systems Typing error in student name: must overwrite in source system Handling the change in DW Overwrite the attribute value in the appropriate row Most Common Change Do not preserve old value Surrogate key is not affected Metallic paint Group 1234567 R 16 White Paint Group Item 111

Slowly Changing Dimension Tables: Type I Change: Problem Loss of old value means that

Slowly Changing Dimension Tables: Type I Change: Problem Loss of old value means that it is difficult to compare behaviour before change and after change 12345 Intellikidz 1 Education ABC 922 Z 12345 Intellikidz 1 Strategy ABC 922 Z History of attribute changes is lost If there is any increase in sale of Intellikidz 1 due to repositioning then impossible to know. Aggregates over department have to be rebuilt

Slowly Changing Dimension Tables: Type 2: A change to preserve old data scholarship earned:

Slowly Changing Dimension Tables: Type 2: A change to preserve old data scholarship earned: yes from no or vice versa in the source system Analyze performance of scholarship holders and others need both old status and new status Handling in DW Introduce another row with new data Next Most Common Change Add an ’effective from date’ attribute No changes in original row New surrogate key for the row Cannot be implemented without the help of Surrogate Keys!!

Slowly Changing Dimension Tables: Type 2 Fact Table SK Dimension Table 1234567 …. 1234567

Slowly Changing Dimension Tables: Type 2 Fact Table SK Dimension Table 1234567 …. 1234567 Ram Goyal Single CUST 11111 …. 1234600 Ram Goyal Married CUST 11111 1234600 …. . Product Department NK 1234567 Intellikidz 1 Education ABC 922 Z 2014 -01 -01 2017 -06 -11 1234600 Intellikidz 1 Strategy ABC 922 Z 2017 -06 -11 2999 -12 -31

Slowly Changing Dimension Tables Type 3: Soft tentative change, ‘what if’ change. What if

Slowly Changing Dimension Tables Type 3: Soft tentative change, ‘what if’ change. What if we change a salesperson from one territory to another Change territory temporarily Analyze performance in both territories if result encouraging then change territory Also used for mapping between old territory and new territory to study change Handling in DW Add ‘old’ attribute, ‘effective date’ attribute in dimension table Copy current attribute value in old attribute Least Common Introduce new value in current attribute Change queries to use old and new attributes Change

Slowly Changing Attribute Values: Type 3 Add a dimension column Alternate Reality Both current

Slowly Changing Attribute Values: Type 3 Add a dimension column Alternate Reality Both current & prior values can be regarded as true at the same time New and historical fact data can be seen either with the new or prior attribute values SK 12345 Product Department NK Intellikidz 1 Education ABC 922 Z Product New_Dept NK Strategy ABC 922 Z Old_Dept Intellikidz 1 Education

Slowly Changing Dimension Tables Type III Change: Problems ¡ Good for handling predictable changes

Slowly Changing Dimension Tables Type III Change: Problems ¡ Good for handling predictable changes ¡ Can lead to lot of wastage of space ¡ Myriad of unpredictable changes ¡ Cannot track the impact of numerous intermediate attribute values

Changing Dimension Tables: Comparison Type I ¡ Fast & Easy to implement ¡ Attribute

Changing Dimension Tables: Comparison Type I ¡ Fast & Easy to implement ¡ Attribute value always reflect the latest assignment ¡ No history of prior attribute values Type II Change: Disadvantages ¡ Dimension table can become big ¡ Does not allow association of the new attribute value with old fact history & vice-versa ¡ Automatically partitions history in the fact table ¡ Tracks as many dimension changes as required ¡ No need to rebuild aggregates ¡ When we constraint on Dept=Strategy, we will not see Intellikidz 1 facts from before the change date

Rapidly Changing Dimension Tables Type 2 changes add rows, Type 3 changes add attributes

Rapidly Changing Dimension Tables Type 2 changes add rows, Type 3 changes add attributes Transition to large dimension tables occurs Break dimension table into slowly changing and rapidly changing parts Name DOB Address Phone Scholarship SGPA CGPA Regn. data Name DOB Address Phone Slowly changing and Name SGPA CGPA Regn. data Relatively rapidly changing

Rapidly Changing Dimension Tables Solution: Mini- Dimension Aka slowly changing Type IV Basic Student

Rapidly Changing Dimension Tables Solution: Mini- Dimension Aka slowly changing Type IV Basic Student Fact Table SK FK 1 FK 2 Measures Name DOB Address Phone Grade Data Name SGPA CGPA

Erroneous and NULL Values

Erroneous and NULL Values

NULL Values in Dimensions NULLs in dimensions need special handling in queries SELECT ……….

NULL Values in Dimensions NULLs in dimensions need special handling in queries SELECT ………. FROM ……. where status= “Exempt” OR status IS NULL //include facts if unknown where status= “Exempt” OR status IS NOT NULL // do not include NULL does not intuitively mean much to business users either in reports or in queries Rule: No analysis parameter can be NULL. Replace NULL by special value 0 for numeric columns n/a for text columns very far future value for dates: 31/12/9999 Queries are formulated using standard relational operators and values: avoid IS/IS Not NULL

NULL Values in Facts: NULL values in surrogate and fact key cannot occur NULL

NULL Values in Facts: NULL values in surrogate and fact key cannot occur NULL values in measures: essentially for business logic to handle NULL does not affect result when computing functions like SUM, AVG Presence of NULLs in query results may confuse the decision maker Often replaced by 0 during ETL process NULL values may also occur in Foreign Keys

Erroneous Values FK value in fact does not match any value in corresponding dimension

Erroneous Values FK value in fact does not match any value in corresponding dimension Erroneous value in dimension: update to the correct value (type 1 change) Erroneous value in fact: treat similar to type 2 change Introduce special row in dimension table Link all erroneous facts to special row Surrogate Salesperson address 0 erroneous n/a Link to appropriate row when correct value is known: need ‘non-surrogate fact key’ attribute in fact table

NULL Values in Foreign Keys of Facts The model allows optional dimensions: A fact

NULL Values in Foreign Keys of Facts The model allows optional dimensions: A fact instance may have no dimension sales may occur without a salesperson being involved There is unknown dimension sales occurs with salesperson but the salesperson is not known Treat similar to type 2 change: Introduce special rows in dimension tables Surrogate Row_type Salesrep_name designation 0 No salesrep n/a 1 Unknown salesrep n/a 2 Incorrect salesrep n/a 33 Salesrep Anjana executive NOTE: when data becomes known, fact_id (not surrogate) attribute needed for linking

NULL Values in Foreign Keys of Facts Late Arriving Dimension data During loading facts

NULL Values in Foreign Keys of Facts Late Arriving Dimension data During loading facts are loaded before dimensions Treat similar to type 2 changes: Introduce special row in dimension(s) Surrogate Row tyoe 0 No salesrep 1 salesrep 2 Salesperson designation n/a n/a Unknown n/a NOTE: surrogate key values 1 and 2 represent the same thing. When dimension data becomes known Non-surrogate fact_id attribute is needed for proper linking Row with 2 can be deleted why complicate: just do not introduce row with 2

NULL Values in Foreign Keys of Facts Future Events Accumulating snapshots requiring dates: date

NULL Values in Foreign Keys of Facts Future Events Accumulating snapshots requiring dates: date will be known later Teat similar to type 2 change Surrogate Row type Date 3 Future 31/12/9999 27 Day 12/3/2019 NOTE: when data arrives, fact_id attribute is needed for proper linking

Snowflake Schema Snowflaking is a storage/performance issue

Snowflake Schema Snowflaking is a storage/performance issue

When To Snowflakes are normalized dimensions: dimensions in 1 NF no repeating groups no

When To Snowflakes are normalized dimensions: dimensions in 1 NF no repeating groups no multi-valued attributes Trade-off Benefits Saving in storage space total bytes being removed per row of dimension number of rows of repeating group width of dimension depth of dimension Easier to update Drawbacks less intuitive schema, too many dimensions, end-users may be put-off Browsing through the columns of a dimension is difficulty Increased joins: loss of query performance

When To Snowflake: Size of Table A dimension table is 500, 000 rows deep

When To Snowflake: Size of Table A dimension table is 500, 000 rows deep and occupies 200 MB; the fact table is 20 GB (20000 MB). There is a 20 byte repeating group in the dimension table and unique rows of repeating group is 100. How much is the saving due to normalization? 1 Foreign key in fact+ surrogate key in dimension need 4 bytes each. New row of dimension = 20 -4 = 16 bytes Net saving in dimension = (16 * 500, 000) approx. = 8 MB Expenditure in repeating group = 100 * 24 = 2, 400 B (negligible) Saving percentage of dimension = 8/200 = 4% Percentage saved of total size= 8/(20000 + 200) = 0. 04% Savings increase with size of repeating group : repeating group occupies 40 B Saving/row = 40 -4=36 bytes Net saving = 36*500, 000 = 18 MB = 9% and 0. 09% respectively Savings increase with depth of dimension table Let rows double, size of dimension = 40 MB, repeating group size = 20 MB Net savings : 16 * 1000, 000 = 16 MB =8% and 0. 08% respectively

When To Snowflake when dimension depth is very large, the total sizes of the

When To Snowflake when dimension depth is very large, the total sizes of the repeating groups being removed is very high. Making a snowflake due to storage saving is a physical design consideration not a logical model issue NO REAL BENEFIT BY SNOWFLAKING SMALL WIDTH AND DEPTH DOMENSIONS Query performance is prime issue in DW. Snowflake only rarely

When To Snowflake: Other Considerations Refresh time of repeating group is different from other

When To Snowflake: Other Considerations Refresh time of repeating group is different from other dimension attributes Granularity of the repeating group is different from dimension Querying of repeating group occurs with high frequency independently of dimension Customer Dimension Sales Fact PAN City class key(FK) Address Circle State PIN Others City Class Repeating Group City class key City code Class description Population range Cost of living Pollution index Quality of life Others

Conforming Facts and Dimensions

Conforming Facts and Dimensions

Integration Needs Conforming Dimensions Customer Dim Day Dim Returned items fact Order fact Product

Integration Needs Conforming Dimensions Customer Dim Day Dim Returned items fact Order fact Product Dim Salesrep Dim Query to obtain quantity of order by product and quantity returned by product Merge the results to obtain percent returned

Differences Prohibiting Operation The names of the attributes of dimension are different The formats/types

Differences Prohibiting Operation The names of the attributes of dimension are different The formats/types pf attributes are different Lower case/Upper case/sentence case coding schemes : code is 12345 in one and 123 -45 in another different classifications may mean one row in one but multiple rows in the other: a change was treated as type 1 change in one but type 2 in the other 12345 Conformance requires Same Structure Same content Box 12345 Box, Type A

Conformed Dimensions Shared Dimension tables: allow analysis on the shared dimensions Customer Dim Order

Conformed Dimensions Shared Dimension tables: allow analysis on the shared dimensions Customer Dim Order fact Day Dim Delivery Fact Transporter Product Dim Shared dimension tables Customer, Day, and Product If held on separate physical platforms, the three dimension tables must be updated by the same ETL process

Conformed Dimensions Conformed Rollups dimension attributes of one are subset of the other common

Conformed Dimensions Conformed Rollups dimension attributes of one are subset of the other common attributes share the same structure and content Customer Dim Order fact Day Dim Holiday flag Month number Month Dim Month number Sales Target Smaller dimension, Month dim is called conformed rollup, the larger Day Dim is called base dimension Use base dimension as a source to populate the conformed rollup dimension

Conformed Dimensions Conformed Degenerate Dimensions natural key of product occurs in more than one

Conformed Dimensions Conformed Degenerate Dimensions natural key of product occurs in more than one fact table ensure same content and structure

Conformance Bus Documentation of dimensional conformance helps in aggregate design Identification of aggregate candidates

Conformance Bus Documentation of dimensional conformance helps in aggregate design Identification of aggregate candidates for drill across Ensures conformance of base dimensions and roll-up dimensions nt or y Sa les Two dimensions conform if all conditions hold: In ve Non-key attributes of one are a subset of the non-key attributes of the other t uc e Tim Sa les pe rso n od Pr Cu sto m er Data types of the common attributes are the same in the subset of common attributes: All combinations of values present in the more detailed dimension are present in the other

Conformance Matrix A structure for representing the conformance Business Processes Dimensions Customer Inventory Product

Conformance Matrix A structure for representing the conformance Business Processes Dimensions Customer Inventory Product Salesperson Time Warehouse X X X Sales X X Delivery X X X Shipper X X For integration of Sales and Delivery Customer, Product, Time must conform For integration of Inventory and Delivery Product, Time, and Warehouse must conform X X

Augmenting the Matrix Augment with type of fact Fact type Multiple fact types for

Augmenting the Matrix Augment with type of fact Fact type Multiple fact types for a process Customer Product Sales person Time Warehouse Inventory Transacti on X X X Inventory Snapshot X X X Sales Transacti on X X Delivery Transacti on X X X Shipper X X

Multiple Fact Types and Conformity The subset dimension is called an aggregation point Pid

Multiple Fact Types and Conformity The subset dimension is called an aggregation point Pid (sk) Sku(nku) Pdesc Brandcode Brand mgr Category code Brandid (sk) Brandcode (nk) Brand mgr Category code Categoryid (sk) Category code (nk) Aggregate Point: Brand Aggregate Point Category of Brand of Product Aggregate point must contain all relevant attributes of the base dimensions and have same slow changing properties Mark base dimensions as Type 1/2 changing and, if present in aggregate point then it is also marked as being of same type

Computing Fact Sizes Product type Region City Sales Product name Rupees Assume one sale

Computing Fact Sizes Product type Region City Sales Product name Rupees Assume one sale fact per product, per city, per month Year Season Month Let there be 5000 products 60 months 50 cities Number of sales facts = 5000*60*50= 15000000

Sparse Facts Not all 5000 products may be sold each month in each city

Sparse Facts Not all 5000 products may be sold each month in each city Assume that 3000 products are sold each month in each city Number of sales facts = 3000 * 60 * 50 = 9000000 Approximately 60% of the cube is occupied and 40% is empty

Aggregation Aggregates are pre-calculated summaries along dimension hierarchies derived from basic facts. We need

Aggregation Aggregates are pre-calculated summaries along dimension hierarchies derived from basic facts. We need the total sales for each region, product wise and month-wise Number of products = 5000 Number of regions = 5 Number of months = 60 Aggregation is performed in order to speed up common queries Total number of facts = 5000*5*60 = 1500000 Space-time tradeoff if the frequency of use is high then pay the storage expense Aggregation guideline if the number of facts summarised is more than 10, then do aggregation

Aggregation Year Region Product type Season Month City Three-way aggregation product name No aggregation

Aggregation Year Region Product type Season Month City Three-way aggregation product name No aggregation One-way aggregation Two-way aggregation When aggregation is done by rising along n-dimensions then n-way aggregation is said to be performed

Sparsity and Aggregation As the amount of aggregation increases sparsity decreases One-way aggregation on

Sparsity and Aggregation As the amount of aggregation increases sparsity decreases One-way aggregation on regions results in 1. 5 M facts The probability of all 5000 products being sold in a month in a region is higher than of all 5000 being sold in a city Two-way aggregation on regions and season results in 0. 5 M facts The probability of all 5000 products being sold in a month in a region is higher than of all 5000 being sold in a region

Aggregation An aggregate forms its own star schema Seasonal Sales Rupees Region City Product

Aggregation An aggregate forms its own star schema Seasonal Sales Rupees Region City Product type Product name

Design Rules for Aggregate Schemas Create a separate star for each aggregation If same

Design Rules for Aggregate Schemas Create a separate star for each aggregation If same fact is used then double counting may occur Introduce a n extra attribute per aggregate point: level attribute Pid Level Pname Brandid Brandname Categoryid Category name 100 Product Tape GS 01 Grip set CS 01 Packing 101 Product Wonder tape GS 01 Grip Set CS 01 Packing 107 Brand Not applicable Grip set CS 01 Packing 108 Category Not applicable CS 01 Packing Not checking for level will result in multiple rows being counted (2 extra for product level, 3 extra for Brand Category levels respectively)

Design Rules for Aggregate Schemas Attributes of Aggregate Dimensions have same names as base

Design Rules for Aggregate Schemas Attributes of Aggregate Dimensions have same names as base dimensions: establish naming conventions to distinguish between these Always source aggregate dimensions from base dimensions The aggregate dimension conforms to the base dimensions • A perfect subset, except for keys • Values of attributes must be identical to those in the base • There must be exactly one row for each combination of attribute values in the base dimensions

Visible Summarizations/Invisible Aggregates Since each aggregate is a separate star, we get base and

Visible Summarizations/Invisible Aggregates Since each aggregate is a separate star, we get base and aggregate stars Both produce the same answer to a query Aggregate Schemas are made invisible to the user and queries are redirected to the aggregate schema Visible Summarizations: pre-joined aggregate Eliminate the need for doing a join by keeping a pre-joined relation Pid brand code Category code Region Quantity sold Order value Design Rule: A separate relation for each pre-joined Suffer from double counting if there is no level attribute or of query is formulated qithout using it in the WHERE clause

More Visible Aggregates Derived Tables Merged Fact Table Pivoted Fact Table Sliced Fact Tables

More Visible Aggregates Derived Tables Merged Fact Table Pivoted Fact Table Sliced Fact Tables with New facts

Issues in Aggregation 1 Aggregates on Periodic Snapshots cannot be made invisible Accumulating Snapshots

Issues in Aggregation 1 Aggregates on Periodic Snapshots cannot be made invisible Accumulating Snapshots are hard to build since there are several events in each 2 When to pre-aggregate? a) Pre-aggregate nothing: Pick from transactions b) Pre-aggregate nothing, but save the materialized view Save the first time the aggregate is built c) Pre-aggregate everything (static) Build everything at ETL time d) Pre-aggregate selectively, based on known statistics of use