Data Warehousing and OLAP Technology for Data Mining

  • Slides: 32
Download presentation
Data Warehousing and OLAP Technology for Data Mining SLIDES FROM THE BOOK : Data

Data Warehousing and OLAP Technology for Data Mining SLIDES FROM THE BOOK : Data Mining: Concepts and Techniques by Jiawei Han 2/19/2021 Data Mining: Concepts and Techniques 1

Data Warehousing and OLAP Technology for Data Mining n What is a data warehouse?

Data Warehousing and OLAP Technology for Data Mining n What is a data warehouse? n A multi-dimensional data model n Data warehouse architecture n Data warehouse implementation n Further development of data cube technology n From data warehousing to data mining 2/19/2021 Data Mining: Concepts and Techniques 2

What is Data Warehouse? n n n 2/19/2021 Defined in many different ways, but

What is Data Warehouse? n n n 2/19/2021 Defined in many different ways, but not rigorously. n A decision support database that is maintained separately from the organization’s operational database n Support information processing by providing a solid platform of consolidated, historical data for analysis. “A data warehouse is a subject-oriented, integrated, time -variant, and nonvolatile collection of data in support of management’s decision-making process. ”—W. H. Inmon Data warehousing: n The process of constructing and using data warehouses Data Mining: Concepts and Techniques 3

Data Warehouse—Subject-Oriented n Organized around major subjects, such as customer, product, sales. n Focusing

Data Warehouse—Subject-Oriented n Organized around major subjects, such as customer, product, sales. n Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing. n Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process. 2/19/2021 Data Mining: Concepts and Techniques 4

Data Warehouse—Integrated n n Constructed by integrating multiple, heterogeneous data sources n relational databases,

Data Warehouse—Integrated n n Constructed by integrating multiple, heterogeneous data sources n relational databases, flat files, on-line transaction records Data cleaning and data integration techniques are applied. n Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources n n 2/19/2021 E. g. , Hotel price: currency, tax, breakfast covered, etc. When data is moved to the warehouse, it is converted. Data Mining: Concepts and Techniques 5

Data Warehouse—Time Variant n The time horizon for the data warehouse is significantly longer

Data Warehouse—Time Variant n The time horizon for the data warehouse is significantly longer than that of operational systems. n n n Data warehouse data: provide information from a historical perspective (e. g. , past 5 -10 years) Every key structure in the data warehouse n n 2/19/2021 Operational database: current value data. Contains an element of time, explicitly or implicitly But the key of operational data may or may not contain “time element”. Data Mining: Concepts and Techniques 6

Data Warehouse—Non-Volatile n A physically separate store of data transformed from the operational environment.

Data Warehouse—Non-Volatile n A physically separate store of data transformed from the operational environment. n Operational update of data does not occur in the data warehouse environment. n Does not require transaction processing, recovery, and concurrency control mechanisms n Requires only two operations in data accessing: n 2/19/2021 initial loading of data and access of data. Data Mining: Concepts and Techniques 7

Data Warehouse vs. Heterogeneous DBMS n Traditional heterogeneous DB integration: n Build wrappers/mediators on

Data Warehouse vs. Heterogeneous DBMS n Traditional heterogeneous DB integration: n Build wrappers/mediators on top of heterogeneous databases n Query driven approach n n n Complex information filtering, compete for resources Data warehouse: update-driven, high performance n 2/19/2021 When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for individual heterogeneous sites involved, and the results are integrated into a global answer set Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis Data Mining: Concepts and Techniques 8

Data Warehouse vs. Operational DBMS n OLTP (on-line transaction processing) n n 2/19/2021 Major

Data Warehouse vs. Operational DBMS n OLTP (on-line transaction processing) n n 2/19/2021 Major task of traditional relational DBMS Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc. OLAP (on-line analytical processing) n Major task of data warehouse system n Data analysis and decision making Distinct features (OLTP vs. OLAP): n User and system orientation: customer vs. market n Data contents: current, detailed vs. historical, consolidated n Database design: ER + application vs. star + subject n View: current, local vs. evolutionary, integrated n Access patterns: update vs. read-only but complex queries Data Mining: Concepts and Techniques 9

OLTP vs. OLAP 2/19/2021 Data Mining: Concepts and Techniques 10

OLTP vs. OLAP 2/19/2021 Data Mining: Concepts and Techniques 10

Why Separate Data Warehouse? n n High performance for both systems n DBMS— tuned

Why Separate Data Warehouse? n n High performance for both systems n DBMS— tuned for OLTP: access methods, indexing, concurrency control, recovery n Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation. Different functions and different data: n missing data: Decision support requires historical data which operational DBs do not typically maintain n data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources n data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled 2/19/2021 Data Mining: Concepts and Techniques 11

Chapter 2: Data Warehousing and OLAP Technology for Data Mining n What is a

Chapter 2: Data Warehousing and OLAP Technology for Data Mining n What is a data warehouse? n A multi-dimensional data model n Data warehouse architecture n Data warehouse implementation n Further development of data cube technology n From data warehousing to data mining 2/19/2021 Data Mining: Concepts and Techniques 12

From Tables and Spreadsheets to Data Cubes n n A data warehouse is based

From Tables and Spreadsheets to Data Cubes n n A data warehouse is based on a multidimensional data model which views data in the form of a data cube A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions n n n 2/19/2021 Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year) Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0 -D cuboid, which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube. Data Mining: Concepts and Techniques 13

Cube: A Lattice of Cuboids all time, item 0 -D(apex) cuboid item time, location

Cube: A Lattice of Cuboids all time, item 0 -D(apex) cuboid item time, location item, location time, supplier time, item, location supplier location, supplier item, supplier time, location, supplier time, item, supplier 1 -D cuboids 2 -D cuboids 3 -D cuboids item, location, supplier 4 -D(base) cuboid time, item, location, supplier 2/19/2021 Data Mining: Concepts and Techniques 14

Conceptual Modeling of Data Warehouses n Modeling data warehouses: dimensions & measures n Star

Conceptual Modeling of Data Warehouses n Modeling data warehouses: dimensions & measures n Star schema: A fact table in the middle connected to a set of dimension tables n Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake n Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation 2/19/2021 Data Mining: Concepts and Techniques 15

Example of Star Schema time item time_key day_of_the_week month quarter year Sales Fact Table

Example of Star Schema time item time_key day_of_the_week month quarter year Sales Fact Table time_key item_key branch_key branch_name branch_type location_key units_sold dollars_sold avg_sales item_key item_name brand type supplier_type location_key street city province_or_state country Measures 2/19/2021 Data Mining: Concepts and Techniques 16

Example of Snowflake Schema time_key day_of_the_week month quarter year item Sales Fact Table time_key

Example of Snowflake Schema time_key day_of_the_week month quarter year item Sales Fact Table time_key item_key branch location_key branch_name branch_type units_sold dollars_sold avg_sales Measures 2/19/2021 Data Mining: Concepts and Techniques item_key item_name brand type supplier_key supplier_type location_key street city_key city province_or_state country 17

Example of Fact Constellation time_key day_of_the_week month quarter year item Sales Fact Table time_key

Example of Fact Constellation time_key day_of_the_week month quarter year item Sales Fact Table time_key item_key item_name brand type supplier_type location_key branch_name branch_type units_sold dollars_sold avg_sales item_key shipper_key location to_location_key street city province_or_state country dollars_cost Measures 2/19/2021 time_key from_location branch_key branch Shipping Fact Table Data Mining: Concepts and Techniques units_shipped shipper_key shipper_name location_key shipper_type 18

Measures: Three Categories n distributive: if the result derived by applying the function to

Measures: Three Categories n distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning. n n algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function. n n E. g. , count(), sum(), min(), max(). E. g. , avg(), min_N(), standard_deviation(). holistic: if there is no constant bound on the storage size needed to describe a subaggregate. n 2/19/2021 E. g. , median(), mode(), rank(). Data Mining: Concepts and Techniques 19

A Concept Hierarchy: Dimension (location) all Europe region country city office 2/19/2021 Germany Frankfurt

A Concept Hierarchy: Dimension (location) all Europe region country city office 2/19/2021 Germany Frankfurt . . Spain North_America Canada Vancouver. . . L. Chan . . . Data Mining: Concepts and Techniques . . . Mexico Toronto M. Wind 20

View of Warehouses and Hierarchies Specification of hierarchies n Schema hierarchy day < {month

View of Warehouses and Hierarchies Specification of hierarchies n Schema hierarchy day < {month < quarter; week} < year n Set_grouping hierarchy {1. . 10} < inexpensive 2/19/2021 Data Mining: Concepts and Techniques 21

Multidimensional Data Sales volume as a function of product, month, and region Dimensions: Product,

Multidimensional Data Sales volume as a function of product, month, and region Dimensions: Product, Location, Time Hierarchical summarization paths gi on n Re Industry Region Year Product Category Country Quarter Product City Office Month Week Day Month 2/19/2021 Data Mining: Concepts and Techniques 22

2 Qtr 3 Qtr 4 Qtr Total annual sales sum of TV in U.

2 Qtr 3 Qtr 4 Qtr Total annual sales sum of TV in U. S. A. od TV PC VCR sum 1 Qtr Date Pr U. S. A Canada Mexico Country uc t A Sample Data Cube sum 2/19/2021 Data Mining: Concepts and Techniques 23

Cuboids Corresponding to the Cube all 0 -D(apex) cuboid product, date country product, country

Cuboids Corresponding to the Cube all 0 -D(apex) cuboid product, date country product, country 1 -D cuboids date, country 2 -D cuboids product, date, country 2/19/2021 Data Mining: Concepts and Techniques 3 -D(base) cuboid 24

Browsing a Data Cube n n n 2/19/2021 Visualization OLAP capabilities Interactive manipulation Data

Browsing a Data Cube n n n 2/19/2021 Visualization OLAP capabilities Interactive manipulation Data Mining: Concepts and Techniques 25

Typical OLAP Operations n Roll up (drill-up): summarize data n n Drill down (roll

Typical OLAP Operations n Roll up (drill-up): summarize data n n Drill down (roll down): reverse of roll-up n n reorient the cube, visualization, 3 D to series of 2 D planes. Other operations n n 2/19/2021 project and select Pivot (rotate): n n from higher level summary to lower level summary or detailed data, or introducing new dimensions Slice and dice: n n by climbing up hierarchy or by dimension reduction drill across: involving (across) more than one fact table drill through: through the bottom level of the cube to its backend relational tables (using SQL) Data Mining: Concepts and Techniques 26

Chapter 2: Data Warehousing and OLAP Technology for Data Mining n What is a

Chapter 2: Data Warehousing and OLAP Technology for Data Mining n What is a data warehouse? n A multi-dimensional data model n Data warehouse architecture n Data warehouse implementation n Further development of data cube technology n From data warehousing to data mining 2/19/2021 Data Mining: Concepts and Techniques 27

Design of a Data Warehouse: A Business Analysis Framework n Four views regarding the

Design of a Data Warehouse: A Business Analysis Framework n Four views regarding the design of a data warehouse n Top-down view n n Data source view n n consists of fact tables and dimension tables Business query view n 2/19/2021 exposes the information being captured, stored, and managed by operational systems Data warehouse view n n allows selection of the relevant information necessary for the data warehouse sees the perspectives of data in the warehouse from the view of end-user Data Mining: Concepts and Techniques 28

Multi-Tiered Architecture other Metadata sources Operational DBs Extract Transform Load Refresh Monitor & Integrator

Multi-Tiered Architecture other Metadata sources Operational DBs Extract Transform Load Refresh Monitor & Integrator Data Warehouse OLAP Server Serve Analysis Query Reports Data mining Data Marts Data Sources 2/19/2021 Data Storage OLAP Engine Front-End Tools Data Mining: Concepts and Techniques 29

Three Data Warehouse Models n n Enterprise warehouse n collects all of the information

Three Data Warehouse Models n n Enterprise warehouse n collects all of the information about subjects spanning the entire organization Data Mart n a subset of corporate-wide data that is of value to a specific groups of users. Its scope is confined to specific, selected groups, such as marketing data mart n n Independent vs. dependent (directly from warehouse) data mart Virtual warehouse n A set of views over operational databases n Only some of the possible summary views may be materialized 2/19/2021 Data Mining: Concepts and Techniques 30

Data Warehouse Development: A Recommended Approach Multi-Tier Data Warehouse Distributed Data Marts Data Mart

Data Warehouse Development: A Recommended Approach Multi-Tier Data Warehouse Distributed Data Marts Data Mart Model refinement Enterprise Data Warehouse Model refinement Define a high-level corporate data model 2/19/2021 Data Mining: Concepts and Techniques 31

OLAP Server Architectures n n Relational OLAP (ROLAP) n Use relational or extended-relational DBMS

OLAP Server Architectures n n Relational OLAP (ROLAP) n Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middle ware to support missing pieces n Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services n greater scalability Multidimensional OLAP (MOLAP) n Array-based multidimensional storage engine (sparse matrix techniques) n fast indexing to pre-computed summarized data Hybrid OLAP (HOLAP) n User flexibility, e. g. , low level: relational, high-level: array Specialized SQL servers n specialized support for SQL queries over star/snowflake schemas 2/19/2021 Data Mining: Concepts and Techniques 32