Data Warehousing Shivnath Babu Warehousing Growing industry 8
Data Warehousing Shivnath Babu
Warehousing Growing industry: $8 billion way back in 1998 l Range from desktop to huge: l u Walmart: 900 -CPU, 2, 700 disk, 23 TB Teradata system l Lots of buzzwords, hype u slice & dice, rollup, MOLAP, pivot, . . . 2
What is a Warehouse? l Collection of diverse data u subject oriented u aimed at executive, decision maker u often a copy of operational data u with value-added data (e. g. , summaries, history) u integrated u time-varying u non-volatile more 3
What is a Warehouse? l Collection of tools u gathering data u cleansing, integrating, . . . u querying, reporting, analysis u data mining u monitoring, administering warehouse 4
Warehouse Architecture Client Query & Analysis Metadata Warehouse Integration Source 5
Motivating Examples Forecasting l Comparing performance of units l Monitoring, detecting fraud l Visualization l 6
Why a Warehouse? l Two Approaches: u Query-Driven (Lazy) u Warehouse (Eager) ? Source 7
Query-Driven Approach Client Mediator Wrapper Source 8
Advantages of Warehousing High query performance l Queries not visible outside warehouse l Local processing at sources unaffected l Can operate when sources unavailable l Can query data not stored in a DBMS l Extra information at warehouse l u Modify, summarize (store aggregates) u Add historical information 9
Advantages of Query-Driven l No need to copy data u less storage u no need to purchase data More up-to-date data l Query needs can be unknown l Only query interface needed at sources l May be less draining on sources l 10
OLTP vs. OLAP l OLTP: On Line Transaction Processing u Describes l processing at operational sites OLAP: On Line Analytical Processing u Describes processing at warehouse 11
OLTP vs. OLAP OLTP l l l l Mostly updates Many small transactions Mb-Gb of data Raw data Clerical users Up-to-date data Consistency, recoverability critical OLAP l l l Mostly reads Queries long, complex Gb-Tb of data Summarized, consolidated data Decision-makers, analysts as users 12
Data Marts Smaller warehouses l Spans part of organization l u e. g. , l marketing (customers, products, sales) Do not require enterprise-wide consensus u but long term integration problems? 13
Warehouse Models & Operators l Data Models u relations u stars & snowflakes u cubes l Operators u slice & dice u roll-up, drill down u pivoting u other 14
Star 15
Star Schema 16
Terms Fact table l Dimension tables l Measures l 17
Dimension Hierarchies s. Type store city region èsnowflake schema èconstellations 18
Cube Fact table view: Multi-dimensional cube: dimensions = 2 19
3 -D Cube Fact table view: Multi-dimensional cube: day 2 day 1 dimensions = 3 20
ROLAP vs. MOLAP ROLAP: Relational On-Line Analytical Processing l MOLAP: Multi-Dimensional On-Line Analytical Processing l 21
Aggregates • Add up amounts for day 1 • In SQL: SELECT sum(amt) FROM SALE WHERE date = 1 81 22
Aggregates • Add up amounts by day • In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date 23
Another Example • Add up amounts by day, product • In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date, prod. Id rollup drill-down 24
Aggregates Operators: sum, count, max, min, median, ave l “Having” clause l Using dimension hierarchy l u average by region (within store) u maximum by month (within date) 25
Cube Aggregation Example: computing sums. . . day 2 day 1 129 rollup drill-down 26
Cube Operators day 2 . . . day 1 sale(c 1, *, *) 129 sale(c 2, p 2, *) sale(*, *, *) 27
Aggregation Using Hierarchies day 2 day 1 customer region country (customer c 1 in Region A; customers c 2, c 3 in Region B) 29
Pivoting Fact table view: Multi-dimensional cube: day 2 day 1 Pivot turns unique values from one column into unique columns in the output 30
Derived Data l Derived Warehouse Data u indexes u aggregates u materialized views (next slide) When to update derived data? l Incremental vs. refresh l 31
Materialized Views l Define new warehouse relations using SQL expressions does not exist at any source 32
- Slides: 31