An overview of Data Warehousing and OLAP Technology

  • Slides: 21
Download presentation
An overview of Data Warehousing and OLAP Technology VLDB Conference 1996 Surajit Chaudhuri Umeshwar

An overview of Data Warehousing and OLAP Technology VLDB Conference 1996 Surajit Chaudhuri Umeshwar Dayal Microsoft Reserch, Redmond HP Labs, Palo Alto Presenter : Ha Dong hun

Contents l What is a Data Warehouse ? l OLTP vs OLAP l Back

Contents l What is a Data Warehouse ? l OLTP vs OLAP l Back End Tools l Conceptual Model and Front End Tools l Data Warehousing Architecture l Database Design Methodology Database Lab. 2/21

What is a Data Warehouse ? (1/5) l A data warehouse is a “subject-oriented,

What is a Data Warehouse ? (1/5) l A data warehouse is a “subject-oriented, integrated, timevariant, and nonvolatile collection of data in support of management’s decision-making process. ” - W. H. Inmon l In simplest terms Data Warehouse can be defined as collection of Data marts l A data warehousing is a collection of decision support technologies, aimed at enabling the knowledge worker to make better and faster decisions Database Lab. 3/21

What is a Data Warehouse ? (2/5) l Characteristic ¡Subject Oriented l Organized around

What is a Data Warehouse ? (2/5) l Characteristic ¡Subject Oriented l Organized around major subjects, such as customer, product, sales l Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing l Provide a simple and concise view around particular subject issues by excluding useless data in the decision support process ¡Integrated ¡Time Variant ¡Nonvolatile Database Lab. 4/21

What is a Data Warehouse ? (3/5) l Characteristic ¡Subject Oriented ¡ Integrated l

What is a Data Warehouse ? (3/5) l Characteristic ¡Subject Oriented ¡ Integrated l Constructed by integrating multiple, heterogeneous data sources (e. g. , relational database, flat files, on-line transaction records) l Data cleaning and data integration techniques are applied e. g. , Hotel price : currency, tax, breakfast cost, etc. l When data is moved to the data warehouse, it is converted ¡Time Variant ¡Nonvolatile Database Lab. 5/21

What is a Data Warehouse ? (4/5) l Characteristic ¡Subject Oriented ¡Integrated ¡Time Variant

What is a Data Warehouse ? (4/5) l Characteristic ¡Subject Oriented ¡Integrated ¡Time Variant l The time horizon for the data warehouse is significantly longer than that of operational systems • Operational database : current value data • Data warehouse data : provide information from a historical perspective (e. g. , past 5 ~ 10 years) l Every key structure in the data warehouse contain an element of time, explicitly or implicitly. ¡Nonvolatile Database Lab. 6/21

What is a Data Warehouse ? (5/5) l Characteristic ¡Subject Oriented ¡Integrated ¡Time Variant

What is a Data Warehouse ? (5/5) l Characteristic ¡Subject Oriented ¡Integrated ¡Time Variant ¡Nonvolatile l A physically separate store of data transformed from the operational environment l Operational update of data does not occur in the data warehouse environment • Does not require transaction processing, recovery and concurrency control mechanisms • Requires only two operations in data processing : initial loading of data and access of data Database Lab. 7/21

OLTP vs. OLAP (1/2) l OLTP (On-line transaction processing) ¡Major task of traditional relational

OLTP vs. OLAP (1/2) l OLTP (On-line transaction processing) ¡Major task of traditional relational DBMS ¡Day-to-day operations : banking, payroll, accounting, etc. l OLAP (On-line analytical processing) ¡Major task of data warehouse system ¡Data analysis and decision making Database Lab. 8/21

OLTP vs. OLAP (2/2) Features OLTP OLAP Characteristic Operational processing Informational processing User DBA,

OLTP vs. OLAP (2/2) Features OLTP OLAP Characteristic Operational processing Informational processing User DBA, db professional Knowledge worker Function Day-to-day operations Long-term informational requirements, decision support Data Current Historical View Detailed, flat relational Summarized, multidimensional Access Read/write Mostly read Focus Data in Information out DB size 100 MB ~ GB 100 GB ~ TB Priority High performance High Flexibility, end-user autonomy Metric Transaction throughput Query throughput, response time Database Lab. 9/21

Back End Tools l Back End Tools ¡ Data extraction l From external sources

Back End Tools l Back End Tools ¡ Data extraction l From external sources is usually implemented via gateways and standard interfaces. ¡ Data Cleaning l Detect errors in the data and rectify them when possible ¡ Load l Sort, summarize, consolidate, compute views, check integrity and build indices and partitions ¡ Refresh l Propagate the updates from the data sources to the warehouse Database Lab. 10/21

Conceptual Model 1 2 3 4 sum Pr od TV PC MP 3 sum

Conceptual Model 1 2 3 4 sum Pr od TV PC MP 3 sum Total annual sales of TV in KOREA U. S. A JAPAN Country uc t Date sum ALL Database Lab. 11/21

es ) Front End Tools (1/4) Korea 2000 USA Q 1 3000 Lo ca

es ) Front End Tools (1/4) Korea 2000 USA Q 1 3000 Lo ca tio n( c Seoul 1560 Busan 440 New york 2270 Chicago Time(quarters) L (c oc Time(quarters) oun atio tr n ie s) ¡ Rollup (Drill-up) ¡ Drill-down (Roll-down) ¡ Slice and dice iti l Front End Tools Q 1 730 Q 2 Q 3 PC TV MP 3 Product Q 2 Q 3 PC TV MP 3 Roll-up on location from cities to countries Product Database Lab. 12/21

es ) Front End Tools (2/4) Time(months) Lo ca t io n (c iti

es ) Front End Tools (2/4) Time(months) Lo ca t io n (c iti Seoul Busan New york Chicago Jan. 130 Feb. 340 Mar. 260 PC Lo ca tio n( c Seoul 1560 Busan 440 New york 2270 Chicago Time(quarters) es ) ¡ Roll-up (Drill-up) ¡ Drill-down (Roll-down) ¡ Slice and dice iti l Front End Tools Q 1 730 Q 2 Q 3 PC TV MP 3 Product TV MP 3 Drill down on time from quarters to months Product Database Lab. 13/21

es ) Front End Tools (3/4) Seoul Lo ca tio n( c Seoul 1560

es ) Front End Tools (3/4) Seoul Lo ca tio n( c Seoul 1560 Busan 440 New york 2270 Chicago Time(quarters) Location(cities) ¡ Rollup (Drill-up) ¡ Drill-down (Roll-down) ¡ Slice and dice iti l Front End Tools Q 1 730 440 610 PC TV MP 3 Q 2 Q 3 Product Busan New york Slice for time Chicago 730 PC 440 610 TV MP 3 = “Q 1” Product Database Lab. 14/21

es ) Front End Tools (4/4) Seoul 1560 Chicago Q 1 Lo ca tio

es ) Front End Tools (4/4) Seoul 1560 Chicago Q 1 Lo ca tio n( c Seoul 1560 Busan 440 New york 2270 Chicago Time(quarters) Lo (c cat Time iti io es n (quarters) ) ¡ Rollup (Drill-up) ¡ Drill-down (Roll-down) ¡ Slice and dice iti l Front End Tools Q 1 730 Q 2 Q 3 PC TV MP 3 Product 730 Q 2 PC MP 3 Product Database Lab. Dice for (Location =“Seoul” or “Chicago”) and (time=“Q 1” or “Q 2”) and (product=“PC” or “MP 3”) 15/21

Data Warehousing Architecture External Sources Operational Databases Monitoring & Administering Tier 1 : Data

Data Warehousing Architecture External Sources Operational Databases Monitoring & Administering Tier 1 : Data Warehouse Server Metadata Repository Extract Clean Transform Load Refresh Data Warehouse Tier 2 : OLAP Server Serve Tier 3 : Clients Analysis Query Reporting Data mining Data Marts Data Sources Database Lab. Data Storage OLAP Engine Front-End Tools 16/21

Design Methodology - Star Schema Time dimension table Time_key day_week month quarter year Sales

Design Methodology - Star Schema Time dimension table Time_key day_week month quarter year Sales Fact Table Branch dimension table Branch_key Branch_name Branch_type Time_key Product_key Location_key Units_sold Dollars_sold Avg_sales Database Lab. Product dimension table Product_key Product_name brand type supplier_type Location dimension table Location_key street city province country 17/21

Design Methodology - Snowflake Schema Time dimension table Time_key day_week month quarter year Sales

Design Methodology - Snowflake Schema Time dimension table Time_key day_week month quarter year Sales Fact Table Branch dimension table Branch_key Branch_name Branch_type Time_key Product_key Location_key Units_sold Dollars_sold Avg_sales Database Lab. Product dimension table Supplier dimension table Product_key Product_name brand type supplier_key Supplier_type Location dimension table Location_key street City_key City dimension table City_key city province country 18/21

Design Methodology - Fact Constellation Time dimension table Time_key day_week month quarter year Branch

Design Methodology - Fact Constellation Time dimension table Time_key day_week month quarter year Branch dimension table Branch_key Branch_name Branch_type Sales fact table Time_key Product_key Branch_key Product dimension table Product_key Product_name brand type supplier_type Shipping fact table Time_key Product_key From_location Units_sold To_location Dollars_sold Location dimension table Shipper_key name Type Location_key Shipper_key Location_key Avg_sales Shipper dimension table Units_shipped Location_key street city province country Database Lab. 19/21

Summary l Data warehouse ¡A subject-oriented, integrated, time-variant, and nonvolatile collection of data in

Summary l Data warehouse ¡A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process l Architecture of Date warehouse ¡Consisting of Warehouse servers, front end and back end tools l OLAP operations : roll-up, drill-down, slice, dice l Multidimensional model of Data warehouse ¡ Multidimensional data cube ¡ Star Schema ¡ Snowflake Schema Database Lab. 20/21

Q&A Database Lab. 21/21

Q&A Database Lab. 21/21