Chapter 2 DATA WAREHOUSING FUNDAMENTALS of DATABASE SYSTEMS

Chapter 2: DATA WAREHOUSING FUNDAMENTALS of DATABASE SYSTEMS, Fifth Edition Fundamentals of Database Systems, Fifth Edition 1

Introduction What product promotions have the biggest impact on revenue? What impact will new products/services have on revenue and margins? Fundamentals of Database Systems, Fifth Edition Who are my customers and what products are they buying? Which customers are most likely to go to the competition ? 2

Introduction (cont. ) § There is a great need for tools that provide decision makers with information to make decisions quickly and reliably based on historical data. § The above functionality is achieved by § data warehousing § it characterized by subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions. Fundamentals of Database Systems, Fifth Edition 3

Introduction (cont. ) § online analytical processing (OLAP) § A term used to describe the analysis of complex data from the data warehouse. § and data mining. § The process of knowledge discovery Fundamentals of Database Systems, Fifth Edition 4

Characteristics of Data Warehouses. Subject oriented § Organized around major subjects, such as product, sales. § Focusing on the modelling and analysis of data for decision makers, not on daily operations or transaction processing. § Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision process. Fundamentals of Database Systems, Fifth Edition 5

Characteristics of Data Warehousesintegrated § Constructed by integrating multiple, heterogeneous data sources. § Data cleaning and data integration techniques are applied. Fundamentals of Database Systems, Fifth Edition 6

Characteristics of Data Warehouses. Time Variant § Data warehouse data : provide information from a historical perspective (e. g. , past 5 -10 years) § Every data in the data warehouse contains an element of time. Fundamentals of Database Systems, Fifth Edition 7

Characteristics of Data Warehouses. Non Volatile § Operational update of data doesn’t occur in the data warehouse environment. § Doesn't require transaction processing, recovery, and concurrency control mechanism. § Require only two operations in data accessing § Initial loading of data and quering. Fundamentals of Database Systems, Fifth Edition 8

Data Warehouse vs. operational databases DW Traditional DB Large amount of data from multiple sources It is a transactional (relational, objectthat may include different DB models or oriented , network , hierarchical) files acquired from independent systems and platforms. Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing. Optimizes for retrieval. Focusing on daily operations or transaction processing Optimizes for routine transaction processing Provide information from a historical perspective (e. g. , past 5 -10 years). Current value data. It is nonvolatile. In traditional DB , transactions are the agent of change to the database. Supports DSS, Data Mining and OLAP. Supports OLTP. Fundamentals of Database Systems, Fifth Edition 9

OLTP vs. OLAP OLTP OLAP User Clerk, IT Professional. Decision-makers, analysts. Function Day to day operations. Decision support. Application-oriented (E-R based) Subject-oriented (Star, snowflake) Data Current. Historical. View Detailed. Summarized. Read/write. Read Mostly. Tens. Millions. #Users Thousands. Hundreds. Db size 100 MB-GB. 100 GB-TB. DB Design Access # Records accessed Fundamentals of Database Systems, Fifth Edition 10

What is a Data Warehouse? A Practitioners Viewpoint §“A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand use it in a business context. ” Barry Devlin, IBM Consultant Fundamentals of Database Systems, Fifth Edition 11

What is a Data Warehouse? Data source in Chicago Data source in New York client Clean Integrate Transform Load Refresh Data warehouse Query and analysis tools client Data source in Taranto Fundamentals of Database Systems, Fifth Edition 12

3 -D data cube Fundamentals of Database Systems, Fifth Edition 13

Q 1 1000 Q 2 Q 3 Q 4 Ave Sales Net Sales Units Price Dollars s Measures Dimension Grapes Cherries Melons Apples P Di rod m u en ct si s on Time Dimension Example of Querying a Cube

From table and spreadsheet to data cubes § A data warehouse is based on a multidimensional data model which views data in the form of data cube. § A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions. § Dimension tables contains descriptions about the subject of the business. § such as item (item_name, brand, type) or time (day, week, month, quarter, year) Fundamentals of Database Systems, Fifth Edition 15

From table and spreadsheet to data cubes (cont. ) § Fact table contain a factual or quantitative data § Fact table also contains measures (such as dollars_sold) and keys to each of the related dimension tables. Fundamentals of Database Systems, Fifth Edition 16

4 -D Data cube Fundamentals of Database Systems, Fifth Edition 17

Cube: a lattice of cuboids 0 -D (apex) cuboids 1 -D cuboids 2 -D cuboids 3 -D cuboids 4 -D (base) cuboids Fundamentals of Database Systems, Fifth Edition 18

Conceptual Modeling of Data Warehouses § Modeling data warehouses: dimensions & measures § Star schema: a fact table in the middle connected to a set of dimension tables. § Snowflake schema: a refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension table, forming a shape similar to snowflake. Fundamentals of Database Systems, Fifth Edition 19

Conceptual Modeling of Data Warehouses (cont. ) § Fact constellations: multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation Fundamentals of Database Systems, Fifth Edition 20

Example of Star Schema time item Time_key item_key Day_of_the_week Month Quarter year Sales Fact Table Item_name Time_key brand Item_key type Branch_key Supplier_type Location_key Units_sold Dollars_sold branch Avg_sales branch_key location_key street Branch_name city Branch_type State_or_province country Measures Fundamentals of Database Systems, Fifth Edition 21

Example of Snowflake Schema time item Time_key Supplier item_key Day_of_the_week Month Quarter year Sales Fact Table Item_name Supplier_key Time_key brand Supplier_type Item_key type Branch_key Supplier_type Location_key Units_sold branch_key location Dollars_sold location_key Avg_sales street City_key Branch_name Branch_type city City_key city State_or_province Measures Fundamentals of Database Systems, Fifth Edition country 22

Example of Fact Constellation time item Time_key item_key Day_of_the_week Month Quarter year Sales Fact Table Item_name Time_key brand Item_key type Branch_key Supplier_type Avg_sales branch_key Item_key location Units_sold branch Time_key Shipper_key From_location Location_key Dollars_sold Shipping fact table To_location_key Dollars_cost street Units_shipped City_key Branch_name shipper Branch_type Shipper_key Measures Cshipper_name Location_key Shipper_type Fundamentals of Database Systems, Fifth Edition 23

Cube definition syntax in DMQL Fundamentals of Database Systems, Fifth Edition 24

Defining star schema in DMQL Fundamentals of Database Systems, Fifth Edition 25

Defining snowflake in DMQL Fundamentals of Database Systems, Fifth Edition 26

Defining fact constellation in DMQL Fundamentals of Database Systems, Fifth Edition 27

Measure of Data Cube: three categories § Distributive: if the result derived by applying the function to n aggregated values is the same as that derived by applying the function on all the data without portioning. § E. g. , count(), min() Fundamentals of Database Systems, Fifth Edition 28

Measure of Data Cube: three categories (cont. ) § Algebraic: if it can be computed by an algebraic function with M arguments ( where M is abounded integer), each of which is obtained by applying a distributive aggregated function § E. g. , avg() § Holistic: if there is no constant bound on the storage size needed to describe a sub aggregate § Mode(), rank() Fundamentals of Database Systems, Fifth Edition 29

Typical OLAP operations § Roll up ( drill-up) summarize data. § By climbing up hierarchy § Drill down ( roll down): reverse of roll-up § From higher level summary to lower level summary or detailed data. § Slice and dice: project and select Fundamentals of Database Systems, Fifth Edition 30

Typical OLAP operations (cont. ) § Pivot ( rotate) § Reorient the cub, visualization, 3 D to series of 2 D planes § Other operations: § Drill across: involving ( across) more than one fact table § Drill through: through the bottom level of the cube to its back-end relational tables (using sql) Fundamentals of Database Systems, Fifth Edition 31

Design of Data Warehouse: A Business Analysis Framework § Four views regarding the design of data warehouse § Top down view: allow selection of the relevant information necessary for the data warehouse § Data source view: exposes the information being captured, stored, and managed by operational systems § Data warehouse view: consists of the fact table and dimension table Fundamentals of Database Systems, Fifth Edition 32

Design of Data Warehouse: A Business Analysis Framework (cont. ) § Business query view: see perspectives of data in the warehouse from the view of end-user Fundamentals of Database Systems, Fifth Edition 33

Data Warehouse Design Process § Top-down, bottom-up approaches or combination of both § Top-down: starts with overall design and planning § Bottom-up: starts with experiments and prototypes § From software engineering point of view § Waterfall: structure and systematic analysis at each step before proceeding to next. Fundamentals of Database Systems, Fifth Edition 34

Data Warehouse Design Process (cont. ) § Spiral : rapid generation of increasingly function systems, quick turn around. Fundamentals of Database Systems, Fifth Edition 35

Data Warehouse Design Process (cont. ) § Typical data warehouse design process: § Choose a business process to model. E. g. , orders, invoice, etc § Choose the grain (atomic level of data) of the business process § Choose the dimension that will apply to each fact table record § Choose measure that will populate each fact table record Fundamentals of Database Systems, Fifth Edition 36

Three Data Warehouse Models § Enterprise warehouse § Collect all of the organization about subjects spanning the entire organization § Data Mart: § A subset of corporate- wide data that is of value to specific group of users. § Virtual warehouse § Set of views over operational databases Fundamentals of Database Systems, Fifth Edition 37

Data Warehouse Back-End Tools and Utilities § Data extraction § Data cleaning § Data transformation § Load § refresh Fundamentals of Database Systems, Fifth Edition 38
- Slides: 38