Analysis of Additivity in OLAP Systems

John Horner and Il-Yeol Song
john.horner@drexel.edu
College of Information Science & Technology
Drexel University
Philadelphia, PA 19104 USA

Peter P. Chen
Department of Computer Science
Louisiana State University
Baton Rouge, LA 70803

Online Analytical Processing (OLAP) Systems
• Historical, integrated, relatively static data
• Orders of magnitude larger than transactional systems
• Used for strategic decision making
• Query outputs are nearly always aggregated sets of base data
• Effective summarizability is of paramount concern

Structure
• Facts are measures of interest
• Dimensions are attributes used to identify, select, group, and aggregate measures of interest
• Attributes that are used to aggregate measures are labeled classification attributes, and are typically conceptualized as hierarchies

Operations
• Roll-up increases the level of aggregation along one or more classification hierarchies
• Drill-down decreases the level of aggregation along one or more classification hierarchies
• Slice-Dice selects and projects the data
• Pivoting reorients the multi-dimensional data view to allow exchanging facts for dimensions symmetrically
• Merging performs a union of separate roll-up operations
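
Roll-up and drill-down can be sketched as the same summation run at different levels of a hierarchy. The facts, the city -> state hierarchy, and the function name `roll_up` below are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical base facts: (city, state, sales amount).
FACTS = [
    ("Philadelphia", "PA", 100),
    ("Pittsburgh", "PA", 50),
    ("Baton Rouge", "LA", 70),
]

def roll_up(facts, level):
    """Sum sales at the requested level of the city -> state hierarchy."""
    index = {"city": 0, "state": 1}[level]
    totals = defaultdict(int)
    for row in facts:
        totals[row[index]] += row[2]
    return dict(totals)

by_city = roll_up(FACTS, "city")    # finer level (the drill-down view)
by_state = roll_up(FACTS, "state")  # coarser level (the roll-up view)
```

Because this hypothetical measure is additive along the hierarchy, both views sum to the same grand total.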

Additivity
• The ability to use the aggregate summation operator to accurately summarize data is known as additivity
• A measure is additive along a dimension if the sum operator can be used to meaningfully aggregate values along all hierarchies in that dimension
• Fully-additive measures are additive across all dimensions
• Semi-additive measures are only additive across certain dimensions
• Non-additive measures are not additive across any dimension

Additivity Example

              Customer
Date          100001   100002   100003   100004   TOTAL
1/1/2000         500      700     9890      600   ADDITIVE
2/1/2000         800      450    10050      200   …
3/1/2000         980      900     8700      800   …
4/1/2000         400      360     7800      750   …
…                  …        …
TOTAL         NONADDITIVE   …   …

Classification Examples
1.0 Non-Additive
  1.1 Fractions
    1.1.1 Ratios: GMROI, Profitability ratios
    1.1.2 Percentages: Profit margin percent, Return percentage
  1.2 Measurements of intensity: Temperature, Blood pressure
  1.3 Average/Maximum/Minimum
    1.3.1 Averages: Grade point average, Temperature
    1.3.2 Maximums: Temperature, Hourly hospital admissions, Electricity usage, Blood pressure
    1.3.3 Minimums: Temperature, Hourly hospital admissions, Electricity usage, Blood pressure
  1.4 Measurements of direction: Wind direction, Cartographic bearings, Geometric angles
  1.5 Identification attributes
    1.5.1 Codes: Zip code, ISBN, ISSN, Area code, Phone number, Barcode
    1.5.2 Sequence numbers: Surrogate key, Order number, Transaction number, Invoice number
2.0 Semi-Additive
  2.1 Dirty data: Missing data, Duplicate data, Incorrect data
  2.2 Changing data: Area codes, Department names, Customer address
  2.3 Temporally non-additive: Account balances, Quantity on hand, Quantity sold
  2.4 Categorically non-additive: Basket counts, Quantity on hand, Quantity sold

Non-Additive Measures
• Ratios and Percentages
• Measures of Intensity
• Average / Maximum / Minimum
• Measures of Direction
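
A minimal sketch of why a percentage is non-additive: summing per-store profit margins gives a number with no meaning, while the overall margin has to be recomputed from the additive components. The store rows and the helper `margin_pct` are hypothetical.

```python
# Hypothetical rows: (store, revenue, profit).
ROWS = [("A", 1000.0, 100.0), ("B", 200.0, 100.0)]

def margin_pct(revenue, profit):
    """Profit margin as a percentage of revenue."""
    return 100.0 * profit / revenue

# Summing the ratio itself: 10% + 50% = 60%, which describes nothing.
summed_margins = sum(margin_pct(rev, prof) for _, rev, prof in ROWS)

# Summing the additive components first, then taking the ratio once.
total_revenue = sum(rev for _, rev, _ in ROWS)
total_profit = sum(prof for _, _, prof in ROWS)
overall_margin = margin_pct(total_revenue, total_profit)
```

The same pattern (aggregate the numerator and denominator separately, then divide) is one common way to handle ratios; it is shown here as an illustration, not as a method stated in the slides.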

Semi-Additive Facts
• Dirty Data
• Changing Data
• Temporally Non-Additive
• Categorically Non-Additive
• Not Mutually Exclusive
  – e.g. measures can be both temporally and categorically non-additive

Causes of Dirty Data
[Figure: customers as stored in the database compared with actual customers. Labels: CustomerID, arbitrary missing-data value, customer who pre-dates the system.]
• Summing measures associated with dirty data can result in inaccurate summaries if not all instances are counted, if instances are counted multiple times, or if instances are counted in the wrong group

Rolling-up Dirty Data
[Figure: transactions rolled up along a classification hierarchy. Labels: anomaly will disappear when rolled up to the zip code level; anomaly will disappear when rolled up to the state level; anomaly will disappear when rolled up to the country level.]
• As measures are rolled up further along hierarchies, certain inaccurate values will be merged into the appropriate groups

Hierarchy Completeness
• In a complete hierarchy, all instances belong to one higher-level instance, which consists of those instances only
• Complete hierarchy (top): the country consists of only the provinces listed
• Incomplete hierarchy (bottom): not all customers in the city are stored in the data warehouse, or not all customers in the data warehouse have a city listed
[Figure: Country - Province (Pro1, Pro2, Pro3), marked complete; City - Customer (Cust1, Cust2, Custn, Custx), marked incomplete.]

Example of Additivity Problems Associated with Incomplete Hierarchies

CustID   City         SalesAmt
1        Washington   100
2        New York     200
999      Unknown      100
4        New York     150
5        Washington   150
6        Washington   150
999      Unknown      100
Total

Summary
City         Sales
Washington   400
New York     350
Total        750
Unknown      200
             950

• If sales are rolled up to the city level, but not all customers have a city stored in the database, then the summary will not accurately portray the sales grouped by city.
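
The roll-up on this slide can be reproduced with a short sketch. The rows are taken from the slide's table; the variable names are illustrative.

```python
from collections import defaultdict

# Rows from the slide: (customer id, city, sales amount).
SALES = [
    (1, "Washington", 100), (2, "New York", 200), (999, "Unknown", 100),
    (4, "New York", 150), (5, "Washington", 150), (6, "Washington", 150),
    (999, "Unknown", 100),
]

# Roll sales up from customer to city.
by_city = defaultdict(int)
for _customer, city, amount in SALES:
    by_city[city] += amount

grand_total = sum(by_city.values())
# Sales that can actually be attributed to a named city.
known_total = grand_total - by_city["Unknown"]
```

The gap between `known_total` and `grand_total` is the amount that a city-level summary cannot place in any city.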

Changing Data
• It is important to track merges, splits, and overlapping hierarchies, especially those that affect classification hierarchies, as the characteristics of the data and environment change

Changing Data Example

Year   City           Area Code   Population
1990   Philadelphia   215
2000   Philadelphia   610
2000   Philadelphia   215         150
2000   Philadelphia   484         100

• Area code 215 split into 3 area codes. Looking at the population trend in the 215 area code would show a decrease, when in fact the population in the area originally covered by the 215 area code has doubled.

Temporally Non-Additive
• Measures that cannot be meaningfully added across different time periods are temporally non-additive
• Examples
  – Account balances
  – Quantity on hand

Temporally Non-Additive Example

Date          100001   100002   100003   100004   TOTAL
1/1/2000         500      700     9890      600   …
2/1/2000         800      450    10050      200   …
3/1/2000         980      900     8700      800   …
4/1/2000         400      360     7800      750   …
…                  …        …
TOTAL         NON-ADDITIVE   …   …   …

Temporally Non-Additive SQL

    Select sum(balance), CustomerID
    From AccountFact
    Group by CustomerID;

    Select sum(balance), date
    From AccountFact
    Group by date;

• Must group by the time interval of the snapshot, as the second query does; the first query adds balances from different snapshots together
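
The two queries can be tried against an in-memory SQLite table. This is a sketch under assumptions: the table layout, the column name `snapshot_date`, and the ISO date strings are hypothetical, and only the first two customers and first two snapshots from the example table are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE AccountFact "
    "(snapshot_date TEXT, CustomerID INTEGER, balance INTEGER)"
)
conn.executemany(
    "INSERT INTO AccountFact VALUES (?, ?, ?)",
    [
        ("2000-01-01", 100001, 500), ("2000-01-01", 100002, 700),
        ("2000-02-01", 100001, 800), ("2000-02-01", 100002, 450),
    ],
)

# Grouped by snapshot date: total balance held on each date (meaningful).
per_date = dict(conn.execute(
    "SELECT snapshot_date, SUM(balance) FROM AccountFact "
    "GROUP BY snapshot_date"
))

# Grouped by customer only: adds balances from different snapshots,
# giving an amount the customer never actually held.
per_customer = dict(conn.execute(
    "SELECT CustomerID, SUM(balance) FROM AccountFact GROUP BY CustomerID"
))
conn.close()
```

For customer 100001 the second query returns the sum of the two snapshot balances, which exceeds either balance on its own.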

Categorically Non-Additive
• Measures that cannot meaningfully be summed across different types of items can be considered categorically non-additive
• Examples
  – Basket counts
  – Quantity on hand

Categorically Non-Additive Example

Date       Customer   Item ID   Product Name         …   Basket Count
1/1/2000   1          10001     X Brand Soup         …   5
1/1/2000   1          10002     Y Brand Soup         …   2
1/1/2000   2          12510     Z Brand Television   …   1
1/1/2000   3          10001     X Brand Soup         …   4
…          …          …
TOTAL      …          …                                  NONADDITIVE

Categorically Non-Additive SQL

    Select sum(BasketCount)
    From SalesFact;

    Select sum(BasketCount), ProductName
    From SalesFact
    Group by ProductName;

• Must group by an attribute in the product family hierarchy, as the second query does; the first query adds counts of unlike items together
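
A sketch of the same contrast in plain Python, using the four rows shown in the example table. The tuple layout and variable names are illustrative.

```python
from collections import defaultdict

# Rows from the slide: (date, customer, item id, product name, basket count).
SALES_FACT = [
    ("1/1/2000", 1, 10001, "X Brand Soup", 5),
    ("1/1/2000", 1, 10002, "Y Brand Soup", 2),
    ("1/1/2000", 2, 12510, "Z Brand Television", 1),
    ("1/1/2000", 3, 10001, "X Brand Soup", 4),
]

# Grouped by product name: each total counts one kind of item.
per_product = defaultdict(int)
for _date, _customer, _item, name, count in SALES_FACT:
    per_product[name] += count

# Ungrouped sum: mixes cans of soup with televisions.
ungrouped_total = sum(row[4] for row in SALES_FACT)
```

The ungrouped total is arithmetically valid but says nothing useful, because the units being added are different kinds of things.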

Others’ Suggestions
• The distinction between meaningful and meaningless aggregation data should be stored in an appendix
• Data should be normalized into a General Multidimensional Normal Form (GMNF), whereby aggregation anomalies are avoided through a conceptual modeling approach that emphasizes sorting out dimensions, dimensional hierarchies, and which measures belong where
• We need to rigorously classify hierarchies and detailed characteristics of hierarchies, such as completeness and multiplicity
• Conceptual models should explicitly depict hierarchies and aggregation constraints along hierarchies, and a fact glossary should be developed describing how each fact was derived from an ER model
• Sources cited for the suggestions above: Golfarelli and Rizzi (1998); Hüsemann et al. (2000); Pourabbas and Rafanelli (1999)
• Slowly Changing Dimensions (Kimball and Ross, 2002)
  – Type 1: simply overwriting data
  – Type 2: storing the new data instance in a new row, but with a common field to link the dimensions as being the same
  – Type 3: adding a new attribute to the dimension table to store both the new and old values

Our Suggestions
• No simple solution
  – Can’t always eliminate potential inaccuracies
  – Categorically non-additive data
  – Glossaries may be ignored
  – Conceptual models may be overly complex
  – This doesn’t mean that we shouldn’t have glossaries and include constraints in conceptual models
• Online Summarizability Constraints
  – Imagine the abundance of update anomalies in transactional systems if possible violations were only stored in glossaries or conceptual models
• Where measures are imprecise, queries should show error bounds

Hierarchies
• Strict - each object at a lower level belongs to only one value at a higher level
• Non-strict - can be thought of as a many-to-many relationship between a higher level of the hierarchy and the lower level
• Complete - all members belong to one higher-class object, which consists of those members only
• Incomplete - not complete
• Multiple path - lower object splits into two distinct higher-level objects
• Alternate path - multiple path hierarchy that joins again at a higher level

Hierarchy Strictness
• In strict hierarchies, lower-level instances in the hierarchy belong to only one higher-level instance
[Figure: a strict hierarchy of persons P1 to P5 under departments D1 and D2, and a non-strict hierarchy of projects Pr1 to Pr5 under departments D1 and D2.]

Example of Additivity Problems Associated with Non-Strict Hierarchies

Project   Dollars
1          10000
2          15000
3         120000
4          50000
5          30000
Total     225000

Denormalized Fact Table
Dept   Project   Dollars
1      1          10000
1      2          15000
1      3         120000
2      3         120000
2      4          50000
2      5          30000
Total            345000
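
The two totals on this slide differ by 120000, the dollar amount of project 3. The sketch below assumes that project 3 belongs to both departments, which is what makes the hierarchy non-strict; that mapping is an inference from the totals, and the names are illustrative.

```python
# Project dollars from the slide.
PROJECT_DOLLARS = {1: 10000, 2: 15000, 3: 120000, 4: 50000, 5: 30000}

# (department, project) pairs. Assumption: project 3 is shared by
# departments 1 and 2, so it appears twice after denormalization.
DEPT_PROJECTS = [(1, 1), (1, 2), (1, 3), (2, 3), (2, 4), (2, 5)]

# Summing the project table counts each project once.
true_total = sum(PROJECT_DOLLARS.values())

# Summing the denormalized table counts project 3 once per department.
denormalized_total = sum(PROJECT_DOLLARS[p] for _dept, p in DEPT_PROJECTS)

# Counting each distinct project once recovers the correct total.
distinct_projects = {p for _dept, p in DEPT_PROJECTS}
deduplicated_total = sum(PROJECT_DOLLARS[p] for p in distinct_projects)
```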

Alternate and Multiple Path Hierarchies
[Figure: a. alternate path classification hierarchy; b. multiple path classification hierarchy. Levels shown: Store, City, AreaCode, ZipCode, County, State, Country; Date, Week, DayOfWeek, Month, Quarter, Year.]
• Inaccurate summaries can result from merging aggregates from multiple paths of a hierarchy.

Example of Problems Associated with Merging Multiple Path Hierarchies

Multiple path hierarchy: Person to Department, Person to Project

Person   Dept   Project   Hours
1        1      1          40
2        1      2         100
3        2      2          50
4        2      2          50
5        2      2          40
6        2      2          80

Department 1: 140 hrs; Project 2: 320 hrs; merged total: 460 hrs (should be 360 hrs)

• Adding the hours of all the people in Department 1 to the hours of all the people who worked on Project 2 results in an inaccurate summary, because Person 2 is counted twice.
• The summary would be accurate if each project mapped directly to one department.
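
The double counting can be reproduced in a few lines. The sketch assumes Person 2 belongs to Department 1 and worked on Project 2, as the slide's text describes; the tuple layout and variable names are illustrative.

```python
# (person, department, project, hours). Person 2's department and project
# follow the slide's description of the double-counted person.
HOURS = [
    (1, 1, 1, 40), (2, 1, 2, 100), (3, 2, 2, 50),
    (4, 2, 2, 50), (5, 2, 2, 40), (6, 2, 2, 80),
]

dept1_rows = [row for row in HOURS if row[1] == 1]
proj2_rows = [row for row in HOURS if row[2] == 2]

dept1_hours = sum(row[3] for row in dept1_rows)
proj2_hours = sum(row[3] for row in proj2_rows)

# Naive merge: add the two roll-ups, counting Person 2 in both.
naive_merge = dept1_hours + proj2_hours

# Merge at the person level first, so each person is counted once.
hours_by_person = {row[0]: row[3] for row in dept1_rows + proj2_rows}
correct_merge = sum(hours_by_person.values())
```

Keying on the person works here because each person has exactly one row; with several rows per person the merge would need to deduplicate on the full row instead.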


Conclusions
• Recognizing whether measures are fully-, semi-, or non-additive is essential to identifying and resolving potentially inaccurate summaries in OLAP systems
• Non-additive measures cannot be aggregated using the sum operator
• Semi-additive measures can sometimes be aggregated using the sum operator, but at other times cannot
• Therefore, semi-additive attributes pose the highest risk for unrecognized inaccurate summaries
• There are several reasons why data could be semi-additive
  – Adding different types of items together
  – Adding measures multiple times in the same summary
  – Not including all instances when aggregating measures
  – Including measures in the wrong groups
• Metadata could be used to alert analysts to potentially inaccurate queries

References
• Golfarelli, M., Maio, D., and Rizzi, S. (1998). Conceptual Design of Data Warehouses from E/R Schemes. Proceedings of the Thirty-First Hawaii International Conference, 6-9 Jan. 1998, 7, 334–343.
• Hüsemann, B., Lechtenbörger, J., and Vossen, G. (2000). Conceptual data warehouse design. Proc. International Workshop on Design and Management of Data Warehouses, 2000.
• Kimball, R. and Ross, M. (2002). The Data Warehouse Toolkit: Second Edition. John Wiley and Sons, Inc.
• Pourabbas, E. and Rafanelli, M. (1999). Characterizations of hierarchies and some operators in OLAP environments. Proceedings of the 2nd ACM international workshop on Data warehousing and OLAP. Kansas City, Missouri. 54–59.
• Shoshani, A. (1997). OLAP and statistical databases: Similarities and differences. Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems. Tucson, Arizona. 185–196. ACM Press, New York, NY.