Data Warehouse Design ROLAP or Star Schema Design














































































































- Slides: 110
Data Warehouse Design ROLAP or Star Schema Design
The Dimension Model 1. . * Has parameter 2. . * Fact Dimension When facts and dimensions are expressed in Relational form, then we get a star schema Multidimensional data structure, then we get a cube Information obtained from requirements engineering has to be converted into data warehouse design: star schema/cube
Design Issues Identify and represent different kinds of facts and dimensions Detail the measures and data types represented in facts Detail the properties of dimensions. Design “good” facts and dimensions
Basic Design Basic Dimension Design: Fact: Dimension : : 0. . *: 1 Fact Design
Basic Dimension Design Why 2. . *? Surrogate Key of dimension parameter itself 1. . 1 Has property 2. . * Dimension Attribute 2. . * Composed of Simple 1. . 1 Simple Dimension Composed Dimension Surrogate Why 2. . *? No recursion: separate, distinct attributes needed Natural Key Property
Basic Dimension Design Steps in basic dimension design Make a dimension for each category Introduce Surrogate key Introduce dimension parameter key Identify component dimensions of a dimension Identify the attributes of the dimension and of all component dimensions
Example Requirements engineering of a retail store has thrown up the need to maintain sales data productwise. (Assumption: consider only those aspects of a product that affect sales) Product is a dimension Introduce the surrogate and product keys Identify the component dimensions Brand of product Category of product Department of product Identify attributes Product: Pnumber, Pid, Product description Brand: Brand Name etc. Product Sales Product Pnumber (surrogate key) Pid (product key) Product description Brand name Category name Department name
Example Product dimension as a relational table (dimension in a star schema) Pnumber Pid Description Brand_name Category_ Department name _name 1 WB 01 White bread Harvest Bread Bakery 2 BB 01 Brown bread Le Bon Bread Bakery 3 MI 01 Full milk Amul Milk Dairy 4 MI 02 Toned Milk Mother Dairy Milk Dairy 5 CH 01 Processed Cheese Amul Cheese Dairy
Exercise Requirements engineering of a retail store has thrown up the need to maintain sales data product-wise as well as department-wise Build the star schema and elaborate the main and component dimensions only. Thereafter show its sample relational table. Product Sales Department has Surrogate and user key Component Dimensions Logistics Features Customer Friendliness Department Deptid Deptkey Deptname Floor located on Fire escape flag Access(lift, stairs, escalator) Area Refrigerated flag Layout display Promotions display
Basic Dimension Design: Dimension Components Brand has Brandid Brand name Brand type (MNC, Local) Product Category has its own dimension attributes Pnumber Pid Product description The table of Product will not be normalized since it contains non-atomic attributes, Brand Category Brandid Brand name Brand type Separating Brand from Product and Category from Brand shall result in normalized dimension tables Category id Category name Sub-category Product Dimension Brand Dimension Category Dimension
Dimension Design Why 2. . *? Surrogate Key of dimension parameter itself 1. . 1 Has property 2. . * Dimension Simple Dimension Attribute 2. . * Composed of 1. . 1 Composed Dimension Simple composed Why 2. . *? No recursion: separate, distinct attributes needed Property
Dimension Design: Composed Attributes Make a dimension for each category Introduce Surrogate key Introduce dimension parameter key Identify component dimensions of a dimension Identify the attributes of the dimension and of all component dimensions Flatten out all composed attributes recursively till simple attributes are reached Make combinations of flattened attributes and select the meaningful ones Introduce all the attributes of main dimension For un-normalized dimension: Introduce all attributes of all components For normalized dimension: keep component dimensions separate but link to main
Example Requirements engineering of a retail store has thrown up the need to maintain sales data product-wise as well as department-wise. It is also required to look at sales datewise Build the star schema and elaborate the date dimension ent m t r epa Product D Sales Date
The Date Dimension What is the date? 1/12/2015 Flatten the structure into day, month, year Components of Date Georgian Calendar having attributes Day Week Month Quarter Half-year Fiscal Calendar Holiday flag Weekday flag Date Surrogate Key Day of week Day of Calendar Month Day of Year Month of Year Day of Fiscal week Day of Fiscal Month Day of Fiscal year Month of Fiscal Year Calendar Week of Month Calendar Week of Year Week of Fiscal Year Month of Fiscal Calendar Quarter Fiscal Quarter Calendar half-Year Fiscal Half year Calendar Year Fiscal year
The Date Dimension: Role of Data Type Several possible data types of attributes Day (1 -7, 1 -365, 1 -31) Week (Mon-Sun, 1 -52, 1 -5) Month (1 -12, Jan-Dec, Apr-Mar) Quarter Half-year Fiscal Calendar Holiday flag (0/1, Y/N, Holiday/Noholiday) Weekday flag Date Surrogate Key Day Number of week Day Number of Calendar Month Day Number of Year Month Number of Year Day Number of Fiscal week Day Number of Fiscal Month Day Number of Fiscal year Month Number of Fiscal Year Calendar Week Number of Month Calendar Week Number of Year Week Number of Fiscal Year Month Number of Fiscal Calendar Quarter Number Fiscal Quarter Number Calendar half-Year Number Fiscal Half year Number Calendar Year Fiscal year
The Date Dimension: Combining Attributes Results may be desired for first quarter of fiscal/calendar year Quarter number- Calendar Year: YYYY-Q Quarter Number- Fiscal Year: FYFY- FQ Month and Year combination: YYYY-MM Maintain all interesting combinations Date Surrogate Key AS BEFORE Quarter number- Calendar Year Quarter Number- Fiscal Year Month and Year OBSERVATION: Dimensions and their tables can be very wide due to Flattening taking account of data type Combining Compared to Facts, Dimension tables have fewer rows
Exercise Requirements engineering of a retail store has thrown up the need to maintain sales data product-wise as well as department-wise. It is also required to look at sales datewise and customer-wise Build the star schema and elaborate the customer dimension.
Basic Dimensions Design: Summary • Dimensions are non-additive and textual in nature • Dimensions have component dimensions • Dimension instances: Fact instances : : 1: N (will see later M: N) • Dimensions have attributes: single valued (will see multi-valued later) • Dimension attributes are atomic ① components may be repeating groups but flattened ② dimension attributes may be multi-valued • The dimension design process considers ① flattening repeating groups ② formulating dimension attributes with different data types ③ combining attributes into meaningful query terms
Fact Design
Fact Design. . * 1 by e 0 M * . . ed r u as Fact 1. . * H as p ara Type me te r 2. . * Dimension Measure Nonadditive Additive Semi-additive Atomic Aggregate Facts must be “computable” along all or subset of dimensions
Fact Design: Facts and Measures A fact is a collection of measures Requirements engineering of a retail store has thrown up the need to maintain sales product-wise and shop-wise How do we measure sales? amount sold, sales value, cost value, profit value A Fact called sales Is ‘selling rate’ a measure? Is ‘discount offered’ a measure? Requirements engineering of a retail store has thrown up the need to keep data on quality of experience of customers for different shops How do we measure “quality of experience’? Enumeration(excellent, good, average, fair, poor) Rate on scale of 1 to 10, 1 being poor and 10 being excellent DANGER
Fact Design: Nature of Measures A measure is normally additive or semi-additive. Facts with non-additive measures exist. Non-additive measures: textual, Boolean, percentages, ratios Text disguised as numeric: Rated quality of experience on a scale of 1 to 10 Not continuously valued: what does 1. 5 mean? Additive and Semi-additive facts must be continuously valued: numeric Additive measures are amenable to computation over all its dimensions sales amount of each sales transaction, dimensions are product, date marks of students of each subject, dimensions are all subjects, semester Semi-additive: computation over a subset of its dimensions Quantity on hand additive over products and product warehouses non-additive over dates
Fact Design: Atomicity Data Warehouse must contain atomic facts and, possibly, physically stored aggregates resulting from these View stored aggregates without atomic facts suspiciously Requirements engineering of a retail store has thrown up the need to keep data on quality of experience of customers for different shops Good proposals 1. Store fact(amount sold, sales value, cost value, profit value) 2. Store Fact(amount sold, sales value, cost value, profit value) Fact(total amount sold over all products shop-wise) Questionable Proposal Fact(total amount sold over all products shop-wise)
Fact Design: Type of Fact Atomic Transaction Aggregate Accumulating Snapshot Periodic Snapshot Factless Conformed
Transaction Fact Type Fact type: Transaction Type, Periodic Snapshot Type, Accumulating Snapshot A transaction is basic to a business Buy, sell, reserve, cancel A record of a transaction is atomic, a fact Information Requirement A retail store has to determine the amount of commission it has to pay to different credit card companies. The payment has to be done at the end of each day. Transaction Fact(Transaction_id, Cash payment amount, Card payment amount) NOTE: fact kept for each credit card company
Transaction Fact Type Sale Fact Transaction_id CCcompany_id(FK) Date_id (FK) Cash payment amount Card payment amount CCcompany Dimension Date Dimension Sum up card payments at the end of the day and compute the commission to be paid • Computed daily: no record of total card payment amount as at the end of the day is maintained • Number of facts is equal to the number of sales made • To filter out those facts that have card payment component
Periodic Snapshot Fact Type Periodic snapshots generate facts at periodic intervals when status is picked up cumulative performance at determined periodicity Each snapshot is a record of what happened over the interval Evolving status during the interval is not interesting End of interval status affects decision making Decision: Negotiate with credit card companies to get good commission rates. Information Requirement Total card payment amounts and commissions paid by credit card company by week Snapshot Fact: Commission(Commission_id, Card amount, Commission paid) Note: One fact per credit card company per week Total in the week
Periodic Snapshot Fact Type Star Schema Commission Fact Commission_id CCcompany_id(FK) Date_id (FK) Card amount Commission amount CCcompany Dimension Weekly performance is stored even though derivable from transaction data Need for co-existing transaction and periodic snapshot Week dimension is overlaid on basic date dimension to build multiple calendars
Periodic Snapshot Fact Type May contain fact types that are semi-additive Can be additive on some but all dimensions Let bank balance be the aggregated fact with month and account as dimensions Bank balance at end of year is not sum of bank balances of the twelve months
Accumulating Snapshots Generate a fact as consisting of Intermediate data about a non-instantaneous transaction or a business process Performance over a period of time represented as discrete data The period of time is known as a “life”. We accumulate measures over the life. As each new measure of the fact becomes known, an update of the fact occurs Information Requirement Keep data of progress of order. Order occurs at various moments, order placement time, order review time, order acceptance time, under process time, shipped time, received time, paid time Accumulating Snapshot for behaviour of commission paid: One fact for the entire life (Order_id, Placed date, review date, acceptance date, under-process date, shipped date, received date, paid date)
Accumulating Snapshots Star Schema AC_Commission Fact Ac_id CCcompany_id(FK) Date_id (FK) Placed date review date acceptance date under-process date Shipped date received date paid date CCcompany Dimension Year Dimension Need for co-existing transaction and accumulating snapshot. Periodic snapshot if needed Year dimension is overlaid on basic date dimension to build multiple calendars
Fact Design: Fact Type Comparison Characteristic Transaction Periodic Snapshot Accumulating Snapshot Time Period A moment in time Fixed interval Variable time Grain Fact per occurrence Fact period Fact per life Load time Insert and Update Date Occurrence date End of period Multiple dates Fact Transaction record Performance in period Performance over life time
Exercises In the exercises that follow determine the facts and measures that the data warehouse shall contain and then draw the star schema. a. The branch of a bank wishes to determine the amount of cash it should ask the treasury to send to it next day morning. The heuristic followed is that cash withdrawn on the previous day shall be drawn in the next day as well. b. A tourist office received requests for reserving different tourist packages from various customers. In order to eliminate non-profit making reservations, it keep track of the amount of revenue generated by each package on a monthly basis. c. Student performance in a course is the sum of all the marks obtained in the various evaluation instruments like assignments, quizzes, projects, examinations etc. Student performance is kept student-wise, semester-wise, and course-wise.
Constellations This schema contains multiple fact tables that share a subset of the set of dimensions For each star schema or snowflake schema it is possible to construct a fact constellation schema: Separate measures of a fact into separate facts if even one of the following holds: they are not updated at the same time (not same transaction) have different dimensions associated with them have different granularity D 1 D 3 F 1 F 2 D 2
Using Constellations: Granularity Some facts are associated with a given dimension level and other facts with a deeper dimension level. Related facts at different granularity levels Obtain data that carries no meaning in a report Keep as a constellation Sale Fact CCcompany Dimension Commission Fact Date Dimension
Using Constellations: Time Difference Two measures for an order: order quantity and shipped quantity Different update times of the two quantities: Time of order is different from time of shipping Introduce zeroes or NULLS that show up in queries. Keep as a constellation Customer Order Fact Day Shipping Fact Product Dimension
Using Constellations: Different Dimensions The shipped quantity has dimension Transporter An ordered quantity but not shipped needs transporter_id as FK Need: Introduce a zero/NULL Transporter that shows up in queries Order Fact Surrogate Customer_key(FK) Day_Key(FK) Product_Key(FK) Transporter_key(FK) Ordered quantity Shipped quantity Customer Dimension Day Dimension Product Dimension Transporter Dimension
Using Constellations: Different Dimensions Customer Dimension Order Fact Surrogate Customer_key(FK) Day_Key(FK) Product_Key(FK) Ordered quantity Day Dimension Product Dimension Surrogate Customer_key(FK) Day_Key(FK) Product_Key(FK) Transporter_key(FK) Shipped quantity Transporter Dimension
Fact-less (Measureless) Facts that contain no measures Existence of row is the fact Record of attendance of student can be kept as measurable facts: Date Course Hour Attendance Date key Course Key Hour key Teacher Key Present (Y/N) Teacher
Fact-less (Measureless) Facts Alternatively Keep information of those present only Date Course Hour Present Date key Course Key Hour key Teacher Key Fact contains no measure, presence of row is enough Teacher
Fact-less (Measureless) Facts How to find those who are absent Date Course Hour Absent Date key Course Key Hour key Teacher Key Present Fact-less facts represent a relationship between the dimensions. Teacher
Exercise Requirements Engineering throws up the need to monitor the stock in hand. A sales fact keeps track of the stock level before sales and after sales, besides having the sales transaction information. The organization is also interested in knowing which product did not sell on a given day. Draw a schema for the foregoing.
Advanced Dimension Design Universal Dimension Assumption: All dimension attributes are available in one universal dimension Universal Fact Assumption: Factors already considered in Fact Design
Dimensions: Grouping Attributes What should be the width of a dimension? Physical issues: Limitations on number of attributes in relational systems block sizes, buffer sizes MD Modeling Issues why not one catch-all dimension? why not several small dimensions? Factors Relationships between facts and dimensions Cardinality of the relationship
Dimension Attribute Groups: Relationships Explicit fact-dimension relationship calls for separate dimensions Remove the attributes group involved in relationship into separate dimension Implicit relationship between dimension attributes: put attributes in same dimension Affinity Test: Relationship between attributes independent of fact “ product has brand” exists whether a sales has occurred or not Browse-ability Test: Are attributes retrieved together in browse query? what are the products offered by a brand?
Dimension Groupings: Alternatives to Splitting When relationships cannot be determined and the dimension width is very large • Split the dimension D into D 1 and D 2, each with its own surrogate key D 1_SK and D 2_SK • D 1: D 2: : 1: 1. Therefore the split row has same common surrogate key • Include D 1_SK and D 2_SK in the fact Relocate Free form Text to an outrigger dimension Relocate such data to another dimension linked to the main dimension Identify sub-types Those groups of attributes that provide values to subset if dimension instances Publisher applies to the book category of product but not to TV sets
Dimension Groupings: Alternatives to Splitting Make Mini-Dimensions: Controlled width but reduced browse-ability Similar to junk dimensions can be arbitrary collection of attributes grouped to reduce width • Remove the relatively more changeable attributes and group into another dimension • Pre-compute all possibilities and make these as dimension instances Policy Key Policy number Policy holder address Policy coverage key Family_size Spouse coverage Children coverage PAYMENT FACT Policy Key Policy coverage key Payment amount
Dimension Roles A given fact instance is associated with several instances of the same dimension Introduce several foreign keys in the fact, one for each association Admission Application Fact Application key Employee key_receiver Employee key_processor Employee key_approver Define Views/Use aliases Admission Application Fact Employee key Employee name Receiver Processor Approver Employee
Kinds of Dimensions Dimension Casual Degenerate Role Playing Junk Causal Behavioural Conformed
Causal Dimensions Change the normal behaviour of business events Promotions affect buying/selling; reservations/cancellations The changed behaviour has an impact on decisions about introducing these or not; continuing with such schemes or not etc. Requirements: For a retail shop keep sales promotion-wise. Date Dimension Sale Fact Transaction_id CCcompany_id(FK) Date_id (FK) Cash payment amount Card payment amount Promotion Dimension Start_Date End_Date Ad_Type Display_Type Offer description
Degenerate Dimensions Dimension has no content except for primary key invoice_number They are placed directly in the FACT table Sale Fact Transaction_id CCcompany_id(FK) Date_id (FK) Cash payment amount Card payment amount
Role Playing Dimensions A single physical dimension can be referenced multiple times in a fact table, with each reference linking to a logically distinct role for the dimension. Billing date to customer and Shipping date to customer Order address of customer, Ship_to address Each will be represented by a separate key in the FACT table Requirement: A customer order address and telephone number are different from delivery address and telephone number Delivery Address Dimension Delivery Telephone Dimension Sale Fact Transaction_id CCcompany_id(FK) Date_id (FK) Cash payment amount Card payment amount Order Address Dimension Order telephone Dimension
Junk Dimensions Some information in source systems is useful for query constraints flags: scholarship [Yes/No] codes: product code [5 digit numeric] free form text Group all such into one ‘Junk Dimension’
Behavioural Dimension Information obtained from facts becomes a dimension What is the average discount offered to customers who have bought goods more than Rs. 100, 000? BI is a two step process Step 1: Obtain information Sum the sales for every customer Step 2: Use obtained information as a parameter to do further querying Find Average Discount for Customers obtained Disadvantage: Expensive to execute and difficult for decision-makers to write Convert to expense at ETL time
Designing Behavioural Dimensions Form categories of customers Above 100, 000 Between 50, 000 and 99, 999 Less than 50, 000 Introduce customer category as an attribute in the customer dimension Introduce attribute What is the largest salary increase of employees after their last promotion? Introduce last promotion date as a dimension attribute in employee dimension Historical fact as dimension attribute Which department has sold more than the annual sales of last year Keep last year annual sales as attribute in department dimension
Dimension Type Summary Dimension Type Meaning Casual Parameters of business events Causal provide insight into special causes of business events Degenerate identify control numbers of operational systems, collect measures together Multiple (Date, Products, . . ) Multiple date stamps, product stamps Role playing Single dimension appears many time in a fact Junk Correlated behaviours and flags Behavioural When facts become dimensions
Bridges
Multi-valued Dimension Attributes A dimension attribute is multi-valued if it has more than one value at a given time Requirement: Maintain bank account balance customer-wise. In joint accounts there are more than one account holders. Is customer a role playing dimension? No, several customers play the same role Introduce a bridge dimension, customer group, between customer and account balance fact Account Balance Fact Transaction_Id Account Group Key(FK) Date_id (FK) Debit amount Credit amount Customer Group Bridge Account Group_key Customer Key Operating instruction Offer description Customer Dimension Customer key Customer data
Facts: Dimensions : : M: N A fact (instance of Sales) is associated with more than one instance of sales representative Sale Fact Transaction_id CCcompany_id(FK) Date_id (FK) Cash payment amount Card payment amount Sales Rep_id(FK) 1. . * Sales Rep Dimension Need multiple keys for the multiple sales reps
Facts: Dimensions : : M: N Sale Fact Transaction_id CCcompany_id(FK) Date_id (FK) Cash payment amount Card payment amount Sales Rep_id 1(FK) Sales Rep_id 2 (FK) Sales Rep 1 Dimension Sales Rep 2 Dimension Two difficulties if maximum number of sales representatives in not known, then ? sum multiple counting what is the total sold by a representative? A given representative may appear as rep 1 or rep 2: counted twice in
Facts: Dimensions : : M: N Introduce a bridge Sale Fact Transaction_id CCcompany_id(FK) Date_id (FK) Cash payment amount Card payment amount Sales group id(FK) Sales Rep Group Sales group id Sales Rep_id (FK) Sales Rep Dimension Sales Rep_id (PK) Name
Dimension Hierarchies Attribute Hierarchies Recursive Hierarchies
Attribute Hierarchies Attributes or groups of attributes of a dimension exhibit a parent-child relationship Multiple Hierarchies possible Must document all these and make them explicit in the design Product Biscuits Cakes Bakery Spices Pre-cooked ITC Pnumber (surrogate key) Pid (product key) Product description Products Categories Brand name Category name Departments Bakery
Recursive Hierarchies When dimensions/attributes are related to themselves Parts are composed of parts Employees are managed by Employees Companies have subsidiary companies Company dimension Company bridge Company dimension Company bridge
Designing to Provide Operational Support Changes in Values of Dimension Attributes Changes in Hierarchies Erroneous and NULL Values Snowflake Schemas
Updates of the Dimension Table What changes? The surrogate key does not change Changes in other attributes may occur with low frequency or rapidly NOTE: Changes in source (OLTP) systems overwrite old values Overwriting in dimension tables is not always desirable product category changes in the source analysis under the old and the new category is still required
Handling Changes The structure of the table allows only three possibilities • Type I Change (overwrite: lose earlier information) • Type II Change (new record: keep old record) • Type III Change (new attribute: keep old information) Mix of approaches is possible • Hybrid Approach – Predictable changes with multiple version overlays – Unpredictable changes with single version overlays
Slowly Changing Attribute Values: Type 1: Correction of an error in source systems Typing error in student name: must overwrite in source system Handling the change in DW Overwrite the attribute value in the appropriate row Most Common Change Do not preserve old value Surrogate key is not affected Metallic paint Group 1234567 R 16 White Paint Group Item 111
Slowly Changing Dimension Tables: Type I Change: Problem Loss of old value means that it is difficult to compare behaviour before change and after change 12345 Intellikidz 1 Education ABC 922 Z 12345 Intellikidz 1 Strategy ABC 922 Z History of attribute changes is lost If there is any increase in sale of Intellikidz 1 due to repositioning then impossible to know. Aggregates over department have to be rebuilt
Slowly Changing Dimension Tables: Type 2: A change to preserve old data scholarship earned: yes from no or vice versa in the source system Analyze performance of scholarship holders and others need both old status and new status Handling in DW Introduce another row with new data Next Most Common Change Add an ’effective from date’ attribute No changes in original row New surrogate key for the row Cannot be implemented without the help of Surrogate Keys!!
Slowly Changing Dimension Tables: Type 2 Fact Table SK Dimension Table 1234567 …. 1234567 Ram Goyal Single CUST 11111 …. 1234600 Ram Goyal Married CUST 11111 1234600 …. . Product Department NK 1234567 Intellikidz 1 Education ABC 922 Z 2014 -01 -01 2017 -06 -11 1234600 Intellikidz 1 Strategy ABC 922 Z 2017 -06 -11 2999 -12 -31
Slowly Changing Dimension Tables Type 3: Soft tentative change, ‘what if’ change. What if we change a salesperson from one territory to another Change territory temporarily Analyze performance in both territories if result encouraging then change territory Also used for mapping between old territory and new territory to study change Handling in DW Add ‘old’ attribute, ‘effective date’ attribute in dimension table Copy current attribute value in old attribute Least Common Introduce new value in current attribute Change queries to use old and new attributes Change
Slowly Changing Attribute Values: Type 3 Add a dimension column Alternate Reality Both current & prior values can be regarded as true at the same time New and historical fact data can be seen either with the new or prior attribute values SK 12345 Product Department NK Intellikidz 1 Education ABC 922 Z Product New_Dept NK Strategy ABC 922 Z Old_Dept Intellikidz 1 Education
Slowly Changing Dimension Tables Type III Change: Problems ¡ Good for handling predictable changes ¡ Can lead to lot of wastage of space ¡ Myriad of unpredictable changes ¡ Cannot track the impact of numerous intermediate attribute values
Changing Dimension Tables: Comparison Type I ¡ Fast & Easy to implement ¡ Attribute value always reflect the latest assignment ¡ No history of prior attribute values Type II Change: Disadvantages ¡ Dimension table can become big ¡ Does not allow association of the new attribute value with old fact history & vice-versa ¡ Automatically partitions history in the fact table ¡ Tracks as many dimension changes as required ¡ No need to rebuild aggregates ¡ When we constraint on Dept=Strategy, we will not see Intellikidz 1 facts from before the change date
Rapidly Changing Dimension Tables Type 2 changes add rows, Type 3 changes add attributes Transition to large dimension tables occurs Break dimension table into slowly changing and rapidly changing parts Name DOB Address Phone Scholarship SGPA CGPA Regn. data Name DOB Address Phone Slowly changing and Name SGPA CGPA Regn. data Relatively rapidly changing
Rapidly Changing Dimension Tables Solution: Mini- Dimension Aka slowly changing Type IV Basic Student Fact Table SK FK 1 FK 2 Measures Name DOB Address Phone Grade Data Name SGPA CGPA
Erroneous and NULL Values
NULL Values in Dimensions NULLs in dimensions need special handling in queries SELECT ………. FROM ……. where status= “Exempt” OR status IS NULL //include facts if unknown where status= “Exempt” OR status IS NOT NULL // do not include NULL does not intuitively mean much to business users either in reports or in queries Rule: No analysis parameter can be NULL. Replace NULL by special value 0 for numeric columns n/a for text columns very far future value for dates: 31/12/9999 Queries are formulated using standard relational operators and values: avoid IS/IS Not NULL
NULL Values in Facts: NULL values in surrogate and fact key cannot occur NULL values in measures: essentially for business logic to handle NULL does not affect result when computing functions like SUM, AVG Presence of NULLs in query results may confuse the decision maker Often replaced by 0 during ETL process NULL values may also occur in Foreign Keys
Erroneous Values FK value in fact does not match any value in corresponding dimension Erroneous value in dimension: update to the correct value (type 1 change) Erroneous value in fact: treat similar to type 2 change Introduce special row in dimension table Link all erroneous facts to special row Surrogate Salesperson address 0 erroneous n/a Link to appropriate row when correct value is known: need ‘non-surrogate fact key’ attribute in fact table
NULL Values in Foreign Keys of Facts The model allows optional dimensions: A fact instance may have no dimension sales may occur without a salesperson being involved There is unknown dimension sales occurs with salesperson but the salesperson is not known Treat similar to type 2 change: Introduce special rows in dimension tables Surrogate Row_type Salesrep_name designation 0 No salesrep n/a 1 Unknown salesrep n/a 2 Incorrect salesrep n/a 33 Salesrep Anjana executive NOTE: when data becomes known, fact_id (not surrogate) attribute needed for linking
NULL Values in Foreign Keys of Facts Late Arriving Dimension data During loading facts are loaded before dimensions Treat similar to type 2 changes: Introduce special row in dimension(s) Surrogate Row tyoe 0 No salesrep 1 salesrep 2 Salesperson designation n/a n/a Unknown n/a NOTE: surrogate key values 1 and 2 represent the same thing. When dimension data becomes known Non-surrogate fact_id attribute is needed for proper linking Row with 2 can be deleted why complicate: just do not introduce row with 2
NULL Values in Foreign Keys of Facts Future Events Accumulating snapshots requiring dates: date will be known later Teat similar to type 2 change Surrogate Row type Date 3 Future 31/12/9999 27 Day 12/3/2019 NOTE: when data arrives, fact_id attribute is needed for proper linking
Snowflake Schema Snowflaking is a storage/performance issue
When To Snowflakes are normalized dimensions: dimensions in 1 NF no repeating groups no multi-valued attributes Trade-off Benefits Saving in storage space total bytes being removed per row of dimension number of rows of repeating group width of dimension depth of dimension Easier to update Drawbacks less intuitive schema, too many dimensions, end-users may be put-off Browsing through the columns of a dimension is difficulty Increased joins: loss of query performance
When To Snowflake: Size of Table A dimension table is 500, 000 rows deep and occupies 200 MB; the fact table is 20 GB (20000 MB). There is a 20 byte repeating group in the dimension table and unique rows of repeating group is 100. How much is the saving due to normalization? 1 Foreign key in fact+ surrogate key in dimension need 4 bytes each. New row of dimension = 20 -4 = 16 bytes Net saving in dimension = (16 * 500, 000) approx. = 8 MB Expenditure in repeating group = 100 * 24 = 2, 400 B (negligible) Saving percentage of dimension = 8/200 = 4% Percentage saved of total size= 8/(20000 + 200) = 0. 04% Savings increase with size of repeating group : repeating group occupies 40 B Saving/row = 40 -4=36 bytes Net saving = 36*500, 000 = 18 MB = 9% and 0. 09% respectively Savings increase with depth of dimension table Let rows double, size of dimension = 40 MB, repeating group size = 20 MB Net savings : 16 * 1000, 000 = 16 MB =8% and 0. 08% respectively
When To Snowflake when dimension depth is very large, the total sizes of the repeating groups being removed is very high. Making a snowflake due to storage saving is a physical design consideration not a logical model issue NO REAL BENEFIT BY SNOWFLAKING SMALL WIDTH AND DEPTH DOMENSIONS Query performance is prime issue in DW. Snowflake only rarely
When To Snowflake: Other Considerations Refresh time of repeating group is different from other dimension attributes Granularity of the repeating group is different from dimension Querying of repeating group occurs with high frequency independently of dimension Customer Dimension Sales Fact PAN City class key(FK) Address Circle State PIN Others City Class Repeating Group City class key City code Class description Population range Cost of living Pollution index Quality of life Others
Conforming Facts and Dimensions
Integration Needs Conforming Dimensions Customer Dim Day Dim Returned items fact Order fact Product Dim Salesrep Dim Query to obtain quantity of order by product and quantity returned by product Merge the results to obtain percent returned
Differences Prohibiting Operation The names of the attributes of dimension are different The formats/types pf attributes are different Lower case/Upper case/sentence case coding schemes : code is 12345 in one and 123 -45 in another different classifications may mean one row in one but multiple rows in the other: a change was treated as type 1 change in one but type 2 in the other 12345 Conformance requires Same Structure Same content Box 12345 Box, Type A
Conformed Dimensions Shared Dimension tables: allow analysis on the shared dimensions Customer Dim Order fact Day Dim Delivery Fact Transporter Product Dim Shared dimension tables Customer, Day, and Product If held on separate physical platforms, the three dimension tables must be updated by the same ETL process
Conformed Dimensions Conformed Rollups dimension attributes of one are subset of the other common attributes share the same structure and content Customer Dim Order fact Day Dim Holiday flag Month number Month Dim Month number Sales Target Smaller dimension, Month dim is called conformed rollup, the larger Day Dim is called base dimension Use base dimension as a source to populate the conformed rollup dimension
Conformed Dimensions Conformed Degenerate Dimensions natural key of product occurs in more than one fact table ensure same content and structure
Conformance Bus Documentation of dimensional conformance helps in aggregate design Identification of aggregate candidates for drill across Ensures conformance of base dimensions and roll-up dimensions nt or y Sa les Two dimensions conform if all conditions hold: In ve Non-key attributes of one are a subset of the non-key attributes of the other t uc e Tim Sa les pe rso n od Pr Cu sto m er Data types of the common attributes are the same in the subset of common attributes: All combinations of values present in the more detailed dimension are present in the other
Conformance Matrix A structure for representing the conformance Business Processes Dimensions Customer Inventory Product Salesperson Time Warehouse X X X Sales X X Delivery X X X Shipper X X For integration of Sales and Delivery Customer, Product, Time must conform For integration of Inventory and Delivery Product, Time, and Warehouse must conform X X
Augmenting the Matrix Augment with type of fact Fact type Multiple fact types for a process Customer Product Sales person Time Warehouse Inventory Transacti on X X X Inventory Snapshot X X X Sales Transacti on X X Delivery Transacti on X X X Shipper X X
Multiple Fact Types and Conformity The subset dimension is called an aggregation point Pid (sk) Sku(nku) Pdesc Brandcode Brand mgr Category code Brandid (sk) Brandcode (nk) Brand mgr Category code Categoryid (sk) Category code (nk) Aggregate Point: Brand Aggregate Point Category of Brand of Product Aggregate point must contain all relevant attributes of the base dimensions and have same slow changing properties Mark base dimensions as Type 1/2 changing and, if present in aggregate point then it is also marked as being of same type
Computing Fact Sizes Product type Region City Sales Product name Rupees Assume one sale fact per product, per city, per month Year Season Month Let there be 5000 products 60 months 50 cities Number of sales facts = 5000*60*50= 15000000
Sparse Facts Not all 5000 products may be sold each month in each city Assume that 3000 products are sold each month in each city Number of sales facts = 3000 * 60 * 50 = 9000000 Approximately 60% of the cube is occupied and 40% is empty
Aggregation Aggregates are pre-calculated summaries along dimension hierarchies derived from basic facts. We need the total sales for each region, product wise and month-wise Number of products = 5000 Number of regions = 5 Number of months = 60 Aggregation is performed in order to speed up common queries Total number of facts = 5000*5*60 = 1500000 Space-time tradeoff if the frequency of use is high then pay the storage expense Aggregation guideline if the number of facts summarised is more than 10, then do aggregation
Aggregation Year Region Product type Season Month City Three-way aggregation product name No aggregation One-way aggregation Two-way aggregation When aggregation is done by rising along n-dimensions then n-way aggregation is said to be performed
Sparsity and Aggregation As the amount of aggregation increases sparsity decreases One-way aggregation on regions results in 1. 5 M facts The probability of all 5000 products being sold in a month in a region is higher than of all 5000 being sold in a city Two-way aggregation on regions and season results in 0. 5 M facts The probability of all 5000 products being sold in a month in a region is higher than of all 5000 being sold in a region
Aggregation An aggregate forms its own star schema Seasonal Sales Rupees Region City Product type Product name
Design Rules for Aggregate Schemas Create a separate star for each aggregation If same fact is used then double counting may occur Introduce a n extra attribute per aggregate point: level attribute Pid Level Pname Brandid Brandname Categoryid Category name 100 Product Tape GS 01 Grip set CS 01 Packing 101 Product Wonder tape GS 01 Grip Set CS 01 Packing 107 Brand Not applicable Grip set CS 01 Packing 108 Category Not applicable CS 01 Packing Not checking for level will result in multiple rows being counted (2 extra for product level, 3 extra for Brand Category levels respectively)
Design Rules for Aggregate Schemas Attributes of Aggregate Dimensions have same names as base dimensions: establish naming conventions to distinguish between these Always source aggregate dimensions from base dimensions The aggregate dimension conforms to the base dimensions • A perfect subset, except for keys • Values of attributes must be identical to those in the base • There must be exactly one row for each combination of attribute values in the base dimensions
Visible Summarizations/Invisible Aggregates Since each aggregate is a separate star, we get base and aggregate stars Both produce the same answer to a query Aggregate Schemas are made invisible to the user and queries are redirected to the aggregate schema Visible Summarizations: pre-joined aggregate Eliminate the need for doing a join by keeping a pre-joined relation Pid brand code Category code Region Quantity sold Order value Design Rule: A separate relation for each pre-joined Suffer from double counting if there is no level attribute or of query is formulated qithout using it in the WHERE clause
More Visible Aggregates Derived Tables Merged Fact Table Pivoted Fact Table Sliced Fact Tables with New facts
Issues in Aggregation 1 Aggregates on Periodic Snapshots cannot be made invisible Accumulating Snapshots are hard to build since there are several events in each 2 When to pre-aggregate? a) Pre-aggregate nothing: Pick from transactions b) Pre-aggregate nothing, but save the materialized view Save the first time the aggregate is built c) Pre-aggregate everything (static) Build everything at ETL time d) Pre-aggregate selectively, based on known statistics of use