Data Warehousing and OLAP Hector GarciaMolina Stanford University

  • Slides: 58
Download presentation
Data Warehousing and OLAP Hector Garcia-Molina Stanford University 1

Data Warehousing and OLAP Hector Garcia-Molina Stanford University 1

Warehousing Growing industry: $8 billion in 1998 l Range from desktop to huge: l

Warehousing Growing industry: $8 billion in 1998 l Range from desktop to huge: l u Walmart: 900 -CPU, 2, 700 disk, 23 TB Teradata system l Lots of buzzwords, hype u slice 2 & dice, rollup, MOLAP, pivot, . . . Hector Garcia Molina: Data Warehousing and OLAP

Outline What is a data warehouse? l Why a warehouse? l Models & operations

Outline What is a data warehouse? l Why a warehouse? l Models & operations l Implementing a warehouse l Future directions l 3 Hector Garcia Molina: Data Warehousing and OLAP

What is a Warehouse? l Collection of diverse data u subject oriented u aimed

What is a Warehouse? l Collection of diverse data u subject oriented u aimed at executive, decision maker u often a copy of operational data u with value-added data (e. g. , summaries, history) u integrated u time-varying u non-volatile more 4 Hector Garcia Molina: Data Warehousing and OLAP

What is a Warehouse? l Collection of tools u gathering data u cleansing, integrating,

What is a Warehouse? l Collection of tools u gathering data u cleansing, integrating, . . . u querying, reporting, analysis u data mining u monitoring, administering warehouse 5 Hector Garcia Molina: Data Warehousing and OLAP

Warehouse Architecture Client Query & Analysis Metadata Warehouse Integration Source 6 Source Hector Garcia

Warehouse Architecture Client Query & Analysis Metadata Warehouse Integration Source 6 Source Hector Garcia Molina: Data Warehousing and OLAP

Why a Warehouse? l Two Approaches: u Query-Driven (Lazy) u Warehouse (Eager) ? Source

Why a Warehouse? l Two Approaches: u Query-Driven (Lazy) u Warehouse (Eager) ? Source 7 Source Hector Garcia Molina: Data Warehousing and OLAP

Query-Driven Approach Client Mediator Wrapper Source 8 Wrapper Source Hector Garcia Molina: Data Warehousing

Query-Driven Approach Client Mediator Wrapper Source 8 Wrapper Source Hector Garcia Molina: Data Warehousing and OLAP

Advantages of Warehousing High query performance l Queries not visible outside warehouse l Local

Advantages of Warehousing High query performance l Queries not visible outside warehouse l Local processing at sources unaffected l Can operate when sources unavailable l Can query data not stored in a DBMS l Extra information at warehouse l u Modify, summarize (store aggregates) u Add historical information 9 Hector Garcia Molina: Data Warehousing and OLAP

Advantages of Query-Driven l No need to copy data u less storage u no

Advantages of Query-Driven l No need to copy data u less storage u no need to purchase data More up-to-date data l Query needs can be unknown l Only query interface needed at sources l May be less draining on sources l 10 Hector Garcia Molina: Data Warehousing and OLAP

OLTP vs. OLAP l OLTP: On Line Transaction Processing u Describes l OLAP: On

OLTP vs. OLAP l OLTP: On Line Transaction Processing u Describes l OLAP: On Line Analytical Processing u Describes 11 processing at operational sites processing at warehouse Hector Garcia Molina: Data Warehousing and OLAP

OLTP vs. OLAP OLTP l l l l 12 Mostly updates Many small transactions

OLTP vs. OLAP OLTP l l l l 12 Mostly updates Many small transactions Mb-Tb of data Raw data Clerical users Up-to-date data Consistency, recoverability critical OLAP l l l Mostly reads Queries long, complex Gb-Tb of data Summarized, consolidated data Decision-makers, analysts as users Hector Garcia Molina: Data Warehousing and OLAP

Data Marts Smaller warehouses l Spans part of organization l u e. g. ,

Data Marts Smaller warehouses l Spans part of organization l u e. g. , l Do not require enterprise-wide consensus u but 13 marketing (customers, products, sales) long term integration problems? Hector Garcia Molina: Data Warehousing and OLAP

Warehouse Models & Operators l Data Models u relations u stars & snowflakes u

Warehouse Models & Operators l Data Models u relations u stars & snowflakes u cubes l Operators u slice & dice u roll-up, drill down u pivoting u other 14 Hector Garcia Molina: Data Warehousing and OLAP

Star 15 Hector Garcia Molina: Data Warehousing and OLAP

Star 15 Hector Garcia Molina: Data Warehousing and OLAP

Star Schema 16 Hector Garcia Molina: Data Warehousing and OLAP

Star Schema 16 Hector Garcia Molina: Data Warehousing and OLAP

Terms Fact table l Dimension tables l Measures l 17 Hector Garcia Molina: Data

Terms Fact table l Dimension tables l Measures l 17 Hector Garcia Molina: Data Warehousing and OLAP

Dimension Hierarchies s. Type store city region èsnowflake schema èconstellations 18 Hector Garcia Molina:

Dimension Hierarchies s. Type store city region èsnowflake schema èconstellations 18 Hector Garcia Molina: Data Warehousing and OLAP

Cube Fact table view: Multi-dimensional cube: dimensions = 2 19 Hector Garcia Molina: Data

Cube Fact table view: Multi-dimensional cube: dimensions = 2 19 Hector Garcia Molina: Data Warehousing and OLAP

3 -D Cube Fact table view: Multi-dimensional cube: day 2 day 1 dimensions =

3 -D Cube Fact table view: Multi-dimensional cube: day 2 day 1 dimensions = 3 20 Hector Garcia Molina: Data Warehousing and OLAP

ROLAP vs. MOLAP ROLAP: Relational On-Line Analytical Processing l MOLAP: Multi-Dimensional On-Line Analytical Processing

ROLAP vs. MOLAP ROLAP: Relational On-Line Analytical Processing l MOLAP: Multi-Dimensional On-Line Analytical Processing l 21 Hector Garcia Molina: Data Warehousing and OLAP

Aggregates • Add up amounts for day 1 • In SQL: SELECT sum(amt) FROM

Aggregates • Add up amounts for day 1 • In SQL: SELECT sum(amt) FROM SALE WHERE date = 1 81 22 Hector Garcia Molina: Data Warehousing and OLAP

Aggregates • Add up amounts by day • In SQL: SELECT date, sum(amt) FROM

Aggregates • Add up amounts by day • In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date 23 Hector Garcia Molina: Data Warehousing and OLAP

Another Example • Add up amounts by day, product • In SQL: SELECT date,

Another Example • Add up amounts by day, product • In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date, prod. Id rollup drill-down 24 Hector Garcia Molina: Data Warehousing and OLAP

Aggregates Operators: sum, count, max, min, median, ave l “Having” clause l Using dimension

Aggregates Operators: sum, count, max, min, median, ave l “Having” clause l Using dimension hierarchy l u average by region (within store) u maximum by month (within date) 25 Hector Garcia Molina: Data Warehousing and OLAP

Cube Aggregation Example: computing sums. . . day 2 day 1 129 rollup drill-down

Cube Aggregation Example: computing sums. . . day 2 day 1 129 rollup drill-down 26 Hector Garcia Molina: Data Warehousing and OLAP

Cube Operators day 2 . . . day 1 sale(c 1, *, *) 129

Cube Operators day 2 . . . day 1 sale(c 1, *, *) 129 sale(c 2, p 2, *) 27 sale(*, *, *) Hector Garcia Molina: Data Warehousing and OLAP

Extended Cube * day 2 day 1 28 sale(*, p 2, *) Hector Garcia

Extended Cube * day 2 day 1 28 sale(*, p 2, *) Hector Garcia Molina: Data Warehousing and OLAP

Aggregation Using Hierarchies day 2 day 1 customer region country (customer c 1 in

Aggregation Using Hierarchies day 2 day 1 customer region country (customer c 1 in Region A; customers c 2, c 3 in Region B) 29 Hector Garcia Molina: Data Warehousing and OLAP

Pivoting Fact table view: Multi-dimensional cube: day 2 day 1 30 Hector Garcia Molina:

Pivoting Fact table view: Multi-dimensional cube: day 2 day 1 30 Hector Garcia Molina: Data Warehousing and OLAP

Implementing a Warehouse Monitoring: Sending data from sources l Integrating: Loading, cleansing, . .

Implementing a Warehouse Monitoring: Sending data from sources l Integrating: Loading, cleansing, . . . l Processing: Query processing, indexing, . . . l Managing: Metadata, Design, . . . l 31 Hector Garcia Molina: Data Warehousing and OLAP

Monitoring Source Types: relational, flat file, IMS, VSAM, IDMS, WWW, news-wire, … l Incremental

Monitoring Source Types: relational, flat file, IMS, VSAM, IDMS, WWW, news-wire, … l Incremental vs. Refresh l new 32 Hector Garcia Molina: Data Warehousing and OLAP

Periodic snapshots l Database triggers l Log shipping l Data shipping (replication service) l

Periodic snapshots l Database triggers l Log shipping l Data shipping (replication service) l Transaction shipping l Polling (queries to source) l Screen scraping l Application level monitoring l 33 è Advantages & Disadvantages!! Monitoring Techniques Hector Garcia Molina: Data Warehousing and OLAP

Monitoring Issues l Frequency u periodic: daily, weekly, … u triggered: on “big” change,

Monitoring Issues l Frequency u periodic: daily, weekly, … u triggered: on “big” change, lots of changes, . . . l Data transformation u convert data to uniformat u remove & add fields (e. g. , add date to get history) Standards (e. g. , ODBC) l Gateways l 34 Hector Garcia Molina: Data Warehousing and OLAP

Integration Data Cleaning l Data Loading l Derived Data l Client Query & Analysis

Integration Data Cleaning l Data Loading l Derived Data l Client Query & Analysis Metadata Warehouse Integration Source 35 Source Hector Garcia Molina: Data Warehousing and OLAP

Data Cleaning l l l Migration (e. g. , yen ðdollars) Scrubbing: use domain-specific

Data Cleaning l l l Migration (e. g. , yen ðdollars) Scrubbing: use domain-specific knowledge (e. g. , social security numbers) Fusion (e. g. , mail list, customer merging) billing DB service DB l 36 customer 1(Joe) merged_customer(Joe) customer 2(Joe) Auditing: discover rules & relationships (like data mining) Hector Garcia Molina: Data Warehousing and OLAP

Loading Data Incremental vs. refresh l Off-line vs. on-line l Frequency of loading l

Loading Data Incremental vs. refresh l Off-line vs. on-line l Frequency of loading l u At l 37 night, 1 x a week/month, continuously Parallel/Partitioned load Hector Garcia Molina: Data Warehousing and OLAP

Derived Data l Derived Warehouse Data u indexes u aggregates u materialized views (next

Derived Data l Derived Warehouse Data u indexes u aggregates u materialized views (next slide) When to update derived data? l Incremental vs. refresh l 38 Hector Garcia Molina: Data Warehousing and OLAP

Materialized Views l Define new warehouse relations using SQL expressions does not exist at

Materialized Views l Define new warehouse relations using SQL expressions does not exist at any source 39 Hector Garcia Molina: Data Warehousing and OLAP

Processing ROLAP servers vs. MOLAP servers l Index Structures l What to Materialize? l

Processing ROLAP servers vs. MOLAP servers l Index Structures l What to Materialize? l Algorithms l Client Query & Analysis Metadata Warehouse Integration Source 40 Source Hector Garcia Molina: Data Warehousing and OLAP

ROLAP Server l Relational OLAP Server tools utilities ROLAP server Special indices, tuning; Schema

ROLAP Server l Relational OLAP Server tools utilities ROLAP server Special indices, tuning; Schema is “denormalized” relational DBMS 41 Hector Garcia Molina: Data Warehousing and OLAP

MOLAP Server Multi-Dimensional OLAP Server Ci ty l M. D. tools Product B A

MOLAP Server Multi-Dimensional OLAP Server Ci ty l M. D. tools Product B A milk soda eggs soap 1 utilities 42 multidimensional server Sales 2 3 4 Date could also sit on relational DBMS Hector Garcia Molina: Data Warehousing and OLAP

Index Structures l Traditional Access Methods u B-trees, l hash tables, R-trees, grids, …

Index Structures l Traditional Access Methods u B-trees, l hash tables, R-trees, grids, … Popular in Warehouses u inverted lists u bit map indexes u join indexes u text indexes 43 Hector Garcia Molina: Data Warehousing and OLAP

. . . Inverted Lists age index 44 inverted lists data records Hector Garcia

. . . Inverted Lists age index 44 inverted lists data records Hector Garcia Molina: Data Warehousing and OLAP

Using Inverted Lists l Query: u Get people with age = 20 and name

Using Inverted Lists l Query: u Get people with age = 20 and name = “fred” List for age = 20: r 4, r 18, r 34, r 35 l List for name = “fred”: r 18, r 52 l Answer is intersection: r 18 l 45 Hector Garcia Molina: Data Warehousing and OLAP

. . . Bit Maps age index 46 bit maps data records Hector Garcia

. . . Bit Maps age index 46 bit maps data records Hector Garcia Molina: Data Warehousing and OLAP

Using Bit Maps l Query: u Get people with age = 20 and name

Using Bit Maps l Query: u Get people with age = 20 and name = “fred” List for age = 20: 1101100000 l List for name = “fred”: 0100000001 l Answer is intersection: 0100000 l Good if domain cardinality small l Bit vectors can be compressed l 47 Hector Garcia Molina: Data Warehousing and OLAP

Join • “Combine” SALE, PRODUCT relations • In SQL: SELECT * FROM SALE, PRODUCT

Join • “Combine” SALE, PRODUCT relations • In SQL: SELECT * FROM SALE, PRODUCT 48 Hector Garcia Molina: Data Warehousing and OLAP

Join Indexes join index 49 Hector Garcia Molina: Data Warehousing and OLAP

Join Indexes join index 49 Hector Garcia Molina: Data Warehousing and OLAP

What to Materialize? Store in warehouse results useful for common queries l Example: total

What to Materialize? Store in warehouse results useful for common queries l Example: total sales l day 2 . . . day 1 129 materialize 50 Hector Garcia Molina: Data Warehousing and OLAP

Materialization Factors Type/frequency of queries l Query response time l Storage cost l Update

Materialization Factors Type/frequency of queries l Query response time l Storage cost l Update cost l 51 Hector Garcia Molina: Data Warehousing and OLAP

Cube Aggregates Lattice 129 city, product day 2 day 1 52 all product date

Cube Aggregates Lattice 129 city, product day 2 day 1 52 all product date city, date product, date city, product, date use greedy algorithm to decide what to materialize Hector Garcia Molina: Data Warehousing and OLAP

Dimension Hierarchies all state city 53 Hector Garcia Molina: Data Warehousing and OLAP

Dimension Hierarchies all state city 53 Hector Garcia Molina: Data Warehousing and OLAP

Dimension Hierarchies all city, product city, date city, product, date state, date state, product,

Dimension Hierarchies all city, product city, date city, product, date state, date state, product, date not all arcs shown. . . 54 Hector Garcia Molina: Data Warehousing and OLAP

Interesting Hierarchy all years weeks quarters months conceptual dimension table days 55 Hector Garcia

Interesting Hierarchy all years weeks quarters months conceptual dimension table days 55 Hector Garcia Molina: Data Warehousing and OLAP

Design What data is needed? l Where does it come from? l How to

Design What data is needed? l Where does it come from? l How to clean data? l How to represent in warehouse (schema)? l What to summarize? l What to materialize? l What to index? l 56 Hector Garcia Molina: Data Warehousing and OLAP

Tools l Development u l Planning & Analysis u l measure traffic (sources, warehouse,

Tools l Development u l Planning & Analysis u l measure traffic (sources, warehouse, clients) Workflow Management u 57 performance monitoring, usage patterns, exception reporting System & Network Management u l what-if scenarios (schema changes, refresh rates), capacity planning Warehouse Management u l design & edit: schemas, views, scripts, rules, queries, reports “reliable scripts” for cleaning & analyzing data Hector Garcia Molina: Data Warehousing and OLAP

Current State of Industry l Extraction and integration done off-line u Usually l in

Current State of Industry l Extraction and integration done off-line u Usually l in large, time-consuming, batches Everything copied at warehouse u Not selective about what is stored u Query benefit vs storage & update cost l Query optimization aimed at OLTP u High throughput instead of fast response u Process whole query before displaying anything 58 Hector Garcia Molina: Data Warehousing and OLAP