Advanced Database Topics Copyright Ellis Cohen 2002 2005
Advanced Database Topics Copyright © Ellis Cohen 2002 -2005 Data Warehousing These slides are licensed under a Creative Commons Attribution-Non. Commercial-Share. Alike 2. 5 License. For more information on how you may use them, please see http: //www. openlineconsult. com/db © Ellis Cohen, 2002 -2005
Topics Overview Star Schema: Fact & Dimension Tables The Star Schema & Denormalization Viewing The Data Cube Drill Down & Rollup Cross Tabulations Data Visualization Trend & Rank Analysis ETL: Extraction, Transformation & Loading Materialized Views & Query Rewriting Indexing for Data Warehouses © Ellis Cohen, 2002 -2005 2
Operational vs Analytical DBs Operational Database Data needed and updated constantly to directly support business operations Focus on OLTP (on-line transaction processing): Transactional access & modification of relatively small # of data points at a time Analytical Database: Data Warehouse & Data Mart Copious amounts of relatively static data, culled & integrated across enterprise, cleansed & summarized, maintained historically, used for decision support and business intelligence (BI) Focus on OLAP (on-line analytical processing): Querying large amounts of data, scheduled modifications © Ellis Cohen, 2002 -2005 3
Operational vs Analytical DBs Operational Warehouse Usage Transactional (OLTP) Analytical (OLAP) Organized for Modifications Queries Modifications Continual Periodic Queries Narrow-scope Low-complexity Broad-scope High-complexity Database Relational/ Dimensional Normalized Denormalized Aggregated & Derived Data © Ellis Cohen, 2002 -2005 4
Central Data Warehouse (from Oracle 9 i Data Warehousing Guide) © Ellis Cohen, 2002 -2005 5
Warehouse Questions How many red Bally shoes did we sell by region in the third quarter of each of the last 5 years? What are the top 25 selling products by category and region for this past quarter? What percent of the market do we own for each product we make? Which of our customer's zipcodes were responsible for the top 10% of total sales over the last year. © Ellis Cohen, 2002 -2005 6
Star Schema: Fact & Dimension Tables © Ellis Cohen, 2002 -2005 7
Star Schema Stores (Dimension) Products (Dimension) storid … prodid … Daily. Sales (Fact) Data Warehouses are organized using Star Schema models storid prodid date price units foreign key Measures what each fact measures A Star Schema has a central fact table, with a composite primary key, which references multiple Dimension tables © Ellis Cohen, 2002 -2005 8
Subjects (Facts) & Dimensions Instead of thinking about entities & relationships, design a data warehouse by thinking about Subjects (represented by fact tables) Sales, Distribution, Purchases Dimensions (represented by dimension tables) How to uniquely identify the facts about each subject – Sales: Product, Stores, Dates (maybe also Employee, Customer: depends what you want to analyze) – Distribution: Warehouses, Products, Stores, Dates (maybe Employees & Trucks) – Purchases: Products, Vendors, Dates (maybe also Employees) © Ellis Cohen, 2002 -2005 9
Fact & Dimension Tables Fact Tables Composite primary key • identify dimensions • uniquely identify each fact (or measurement) Additional attributes: measures • what is measured about each fact Dimension Tables Primary key Surrogate key uniquely identifies each dimension value Additional attributes Properties of each dimension value © Ellis Cohen, 2002 -2005 10
Dimensions & Granularity Dimensions have different levels of granularity Products Stores Districts Regions Product. Types Sub. Categories Manufacturers Categories © Ellis Cohen, 2002 -2005 11
Snowflake Schema (with Normalized Dimensions) Stores (Dimension) storid stornam city state distid Districts distid distnam distarea regid Regions regid regnam Daily. Sales (Fact) storid prodid date price units Sub. Categories subcatid subnam subdescr catid Categories catid catnam catdescr © Ellis Cohen, 2002 -2005 Products (Dimension) prodid color size prodtyp Product. Types prodtyp prodnam prodescr subcatid manfid Manufacturers manfid manfnam 12
Typical Warehouse Query How many red Bally shoes did we sell in each region in 2002? SELECT r. regnam as region, sum(f. units) as sumunits FROM Daily. Sales f NATURAL JOIN Stores NATURAL JOIN Districts NATURAL JOIN Regions r NATURAL JOIN Products p NATURAL JOIN Product. Types NATURAL JOIN Sub. Categorie s NATURAL JOIN Manufacturers m WHERE to_char(f. date, 'YYYY') = '2002' AND p. color = 'red' AND m. manfnam = 'Bally' AND s. subnam = 'Shoe' GROUP BY r. regnam © Ellis Cohen, 2002 -2005 13
Aggregate Functions AVG: Average COUNT: Count MIN: Minimum Value MAX: Maximum Value STDDEV: Standard Deviation (and STDDEV_POP & STDDEV_SAMP) SUM: Sum VARIANCE: Variance (and VAR_POP & VAR_SAMP) © Ellis Cohen, 2002 -2005 14
The Star Schema & Denormalization © Ellis Cohen, 2002 -2005 15
Snowflake Schema is Normalized Snowflake Schema has normalized dimension tables • Each dimension is represented by multiple sub-dimension tables at different levels of granularity (Product, Product. Type, Category, etc. ) • Each sub-dimension table has attributes appropriate to the level of granularity – Product: color, size – Product. Type: prodnam, prodescr – etc. © Ellis Cohen, 2002 -2005 16
Denormalization Products (Dimension) prodid color size prodtyp Sub. Categories subcatid subnam subdescr catid Categories catid catnam catdescr Product. Types prodtyp prodnam prodescr subcatid manfid Manufacturers manfid manfnam © Ellis Cohen, 2002 -2005 Products (Dimension) prodid color size prodtyp prodnam prodescr manfid manfnam subcatid subnam subdescr catid catnam catdescr Why is there redundancy here? 17
Star Schema is Denormalized The Star Schema has denormalized dimension tables • Each dimension by joining together the sub -dimension table to form a single dimension table • The dimension table has attributes at different levels of granularity • The dimension tables contain lots of redundancy, but queries use far fewer joins • Does not dramatically impact space: dimension tables usually < 1% size of fact table (but some descriptions may need to be stored separately) © Ellis Cohen, 2002 -2005 18
Star Schema (Fully Denormalized Dimensions) Stores (Dimension) storid stornam city state distid distnam distarea regid regnam Why should this be replaced by a dateid? Daily. Sales (Fact) storid prodid date price units Maybe catdescr not included here if it is a GIF or a 4000 byte description © Ellis Cohen, 2002 -2005 Products (Dimension) prodid color size prodtyp prodnam prodescr manfid manfnam subcatid subnam subdescr catid catnam catdescr 19
Schema Types Snowflake Schema Fact table with fully normalized dimension tables Star Schema Fact table with fully de-normalized dimension tables Starflake Schema Fact table with fully de-normalized dimension and (as needed) sub-dimension tables Constellation Schema Multiple fact tables with shared dimension tables © Ellis Cohen, 2002 -2005 20
Query with Denormalized Schema How many red Bally shoes did we sell in each region in 2002? SELECT s. regnam as region, sum(f. units) as sumunits FROM Daily. Sales f NATURAL JOIN Stores s NATURAL JOIN Products p WHERE to_char(f. date, 'YYYY') = '2002' AND p. color = 'red' AND p. manfnam = 'Bally' AND p. subnam = 'Shoe' Costly GROUP BY s. regnam © Ellis Cohen, 2002 -2005 21
Typical Date Dimension Attributes It is common and almost always more efficient to treat Dates as a dimension with a number of attributes Field Example Value Year 2005 Month Feb Quarter 1 Day. Of. Month 12 Day. Of. Year 43 Week. Of. Year 7 Day. Of. Week Sat Note: Quarter is less granular than Month Also, Day. Of. Year, Week. Of. Year & Day. Of. Week can be derived form the other fields Requires Month + Year to identify a month within a year. Might want to add a single Month. Yr field to represent the pair © Ellis Cohen, 2002 -2005 22
Extended Date Dimension Hierarchy Date (e. g. Feb 12, 2005) Day. Of. Week (e. g. Sat) Day. Of. Year Day. Of. Month (e. g. 43) (e. g. 12) Week. Yr (e. g. 2005 Wk 7) Month. Yr (e. g. Feb 2005) Month Week. Of. Year (e. g. 7) Quarter. Yr (e. g. Feb) (e. g. 2005 Q 1) Year (e. g 2005) © Ellis Cohen, 2002 -2005 Quarter (e. g. 1) 23
Star Schema with Date Dimension Stores (Dimension) storid stornam city state Dates distid (Dimension) distnam dateid distarea date regid dayofweek regnam dayofmonth dayofyear weekyr weekofyear monthyr month quarteryr quarter year Daily. Sales (Fact) storid prodid dateid price units In general, represent dates by a Dates dimension table © Ellis Cohen, 2002 -2005 Products (Dimension) prodid color size prodtyp prodnam prodescr manfid manfnam subcatid subnam subdescr catid catnam catdescr 24
Query using Dates Dimension How many red Bally shoes did we sell in each region in 2002? SELECT s. regnam as region, sum(f. units) as sumunits FROM Daily. Sales f NATURAL JOIN Stores s NATURAL JOIN Products p NATURAL JOIN Dates d WHERE d. year = 2002 AND p. color = 'red' AND p. manfnam = 'Bally' AND p. subnam = 'Shoe' GROUP BY s. regnam Needs an extra join, but simpler query, Executes faster if Dates is indexed by year © Ellis Cohen, 2002 -2005 25
More Complex Query How many red Bally shoes did we sell by region in the third quarter of each of the last 5 years? SELECT s. regnam as region, d. quarteryr, sum(f. units) as sumunits FROM Daily. Sales f NATURAL JOIN Stores s NATURAL JOIN Products p NATURAL JOIN Dates d WHERE p. color = 'red' AND p. manfnam = 'Bally' AND p. subnam = 'Shoe' GROUP BY s. regnam, d. quarteryr, d. quarter, d. year HAVING d. quarter = 3 AND d. year BETWEEN 1998 and 2002 © Ellis Cohen, 2002 -2005 26
The M: N Mapping Problem Products Product. Types Sub. Categories Daily. Sales (Fact) storid prodid dateid price units Manufacturers Categories Suppose a product type may have multiple associated subcategories. What do we do? © Ellis Cohen, 2002 -2005 Products (Dimension) prodid color size prodtyp prodnam prodescr manfid manfnam subcatid subnam subdescr catid catnam catdescr 27
M: N Mappings Daily. Sales (Fact) storid prodid dateid price units OK to keep the M: N bridge table Can't be a foreign key constraint, since prodtyp is not unique in Product Prod. Cat. Map Sub. Categories subcatid subnam subdescr catid catnam catdescr Products (Dimension) prodid color size prodtyp prodnam prodescr manfid manfnam prodtyp subcatid A product type can have more than subcategory © Ellis Cohen, 2002 -2005 28
Non-1 NF Denormalization Daily. Sales (Fact) Products (Dimension) prodid color size prodtyp prodnam prodescr manfid manfnam { subcatid } storid prodid dateid price units Sub. Categories Another reasonable approach (esp if DB support for lists) subcatid subnam subdescr catid catnam catdescr Represent a list of subcategories by • A (non-standard) list datatype • Delimited string – e. g. |314|209|812| © Ellis Cohen, 2002 -2005 29
Full Denormalization Daily. Sales (Fact) storid prodid dateid price units UGLY! BAD! Don't do this! Complicates joins SELECT prodtyp, sum(units) FROM Daily. Sales NATURAL JOIN Product GROUP BY prodtyp is not correct because of duplication. Must write WITH Just. Product AS (SELECT DISTINCT prodid, prodtyp FROM Product) SELECT prodtyp, sum(units) FROM Daily. Sales NATURAL JOIN Just. Product GROUP BY prodtyp © Ellis Cohen, 2002 -2005 Product (Dimension) prodid color size prodtyp prodnam prodescr manfid manfnam subcatid subnam subdescr catid catnam catdescr 30
Limits to Denormalization Stores (Dimension) storid … Store. Promos storid startdateid enddateid promonam discount Daily. Sales (Fact) storid prodid dateid price units Dates (Dimension) dateid … © Ellis Cohen, 2002 -2005 Products (Dimension) prodid … Can't denormalize Store. Promos Unless you replace Store & Date with a single Store. Date dimension with storeid & dateid as primary keys: Way too big 31
Viewing The Data Cube © Ellis Cohen, 2002 -2005 32
Data Cube Representation Sales Cube Stores dimension Sales of Beanie Babies in Pittsburgh Store Today All Sales (of all products over time) in NYC Store Sales of Beanie Babies in Pittsburgh Store Yesterday Dates dimension Pgh NYC Products dimension © Ellis Cohen, 2002 -2005 33
Data Cube Characteristics Each axis represents a dimension – Elements along axis are at lowest granularity for that dimension Measures are the data within the cells at intersections of the cube – Information about the topic of the cube – e. g. units & price for each sales fact (i. e. sales in a store of a product on a date) © Ellis Cohen, 2002 -2005 34
Data Cube Views Slice View data relative to a point in one or more dimensions View sales today (for each store & each product category) View Bally shoe sales at the NYC store (for each date) Dice View data relative to (sets of) ranges in one or more dimensions View sales for the last 4 days (for each store & each product category) View sales for each type of shoes at all the NY and NJ stores for each of the last 10 quarters © Ellis Cohen, 2002 -2005 35
MDDB: Multi. Dimensional Data. Base Knows about Fact & Dimension Tables Uses direct (n dimensional) hypercube representation to provide fast access to fact elements in query Supports sparse representations – The Pittsburgh store doesn't sell lingerie – The Cape Cod store is not open in the winter – Baked Beanie Babies are only sold in the NE region Uses specialized query language e. g. MDX (used by Microsoft OLAP Server) w basic data types: cube, slice, dice © Ellis Cohen, 2002 -2005 36
Choosing a View Store Customers Month. Yr Products 1997 Q 1 Educ Level City State Brand Quarter. Yr Minor. Sub. Category City Store Type CA Minor. Category Year State Major. Sub. Category Drink Country Detailed Dicing Dimension Major. Category Country Slicing Dimensions © Ellis Cohen, 2002 -2005 37
Slicing & Dicing Examples use dynasight, www. arcplan. com Detailed Dicing Dimension Base Measures Derived Measures © Ellis Cohen, 2002 -2005 Slicing Attributes 38
Slice, Dice & Chart Slicing Attributes Charted Measures Different Dicing Dimension Measures © Ellis Cohen, 2002 -2005 39
Drill Down & Roll Up © Ellis Cohen, 2002 -2005 40
Slicing & Dicing Drill Down Detailed Dicing Dimension Measures © Ellis Cohen, 2002 -2005 Slicing Attributes 41
Drill Down Re-Slice © Ellis Cohen, 2002 -2005 42
Uniform Drill Down & Rollup Uniform Drill Down Uniformly drill down to a certain level Uniform Rollup Compute Aggregate Values at that level and all higher levels Can be computed with a single SELECT statement using the ROLLUP grouping function Non-uniform rollups (previous slide) require UNIONs © Ellis Cohen, 2002 -2005 43
Ordinary Group Aggregation SELECT c. country, c. state, c. city, sum(f. sale) as Store. Sales, sum(f. cost) as Store. Cost, Store. Sales - Store. Cost as Store. Net, 100* Store. Cost / Store. Sales as Pct. Cost FROM Facts f NATURAL JOIN Customers c NATURAL JOIN Stores s NATURAL JOIN Dates d NATURAL JOIN Products p WHERE p. Major. Category = 'Drink' AND d. Quarter. Yr = '1997 Q 1' AND s. country = 'USA' Note: Constrain Store, not Customer AND s. State = 'CA' GROUP BY c. country, c. state, c. city © Ellis Cohen, 2002 -2005 44
Aggregate Query Results Per-City Rollups in CA Per-City Rollups in OR Country USA CA USA OR … … CANADA State City Store. Sales Altadena 96 … Arcadia 64 … … Woodland … … Beaverton 12 … Corvalis 21 … … Woodburn 4 … … BC Victoria … … … That's fine, but it does NOT give us • Aggregate store sales for CA, OR, BC, etc • Aggregate store sales for USA, CANADA • Aggregate store sales overall © Ellis Cohen, 2002 -2005 45
Rollup Query Results Rollup ALL USA Rollup CA Rollup Per-City Rollups in CA OR Rollup Per-City Rollups in OR Canada Rollup BC Rollup Country State City Store. Sales NULL 6310 … USA NULL 4310 … USA CA NULL 3310 … USA CA Altadena 96 … USA CA Arcadia 64 … USA CA … … … USA CA Woodland … … USA OR NULL 1000 … USA OR Beaverton 12 … USA OR Corvalis 21 … USA OR … … … USA OR Woodburn 4 … … … CANADA NULL … … CANADA BC Victoria … … © Ellis Cohen, 2002 -2005 … … 46
Rollup using Union SELECT c. country, c. state, c. city, … GROUP BY c. country, c. state, c. city UNION SELECT c. country, c. state, NULL AS city, … GROUP BY c. country, c. state UNION SELECT c. country, NULL AS state, NULL AS city, … GROUP BY c. country UNION SELECT NULL AS country, NULL AS state, NULL AS city, … © Ellis Cohen, 2002 -2005 47
GROUP BY ROLLUP SELECT c. country, c. state, c. city, sum(f. sale) as Store. Sales, sum(f. cost) as Store. Cost, Store. Sales - Store. Cost as Store. Net, 100* Store. Cost / Store. Sales as Pct. Cost FROM Facts f NATURAL JOIN Customers c NATURAL JOIN Stores s NATURAL JOIN Dates d NATURAL JOIN Products p WHERE p. Major. Category = 'Drink' AND d. Quarter. Yr = '1997 Q 1' AND s. country = 'USA' Note: Constrain Store, not Customer AND s. State = 'CA' GROUP BY ROLLUP( c. country, c. state, c. city ) © Ellis Cohen, 2002 -2005 48
Cross Dimension Rollups SELECT c. state, d. Quarter. Yr, sum(f. cost) as Store. Cost FROM Fact f NATURAL JOIN Customers c NATURAL JOIN Stores s NATURAL JOIN Dates d NATURAL JOIN Products p WHERE p. Major. Category = 'Drink' AND d. year = 1997 AND s. country = 'USA' AND s. State = 'CA' GROUP BY ROLLUP( c. state, d. Quarter. Yr ) © Ellis Cohen, 2002 -2005 49
Cross Dimension Rollup Results Rollup ALL CA Rollup Per-Qtr Rollups in CA OR Rollup Per-Qtr Rollups in OR WA Rollup State Quarter NULL CANULL 63, 530 CA 1997 Q 1 14, 431 CA 1997 Q 2 15, 332 CA 1997 Q 3 15, 673 CA 1997 Q 4 18, 094 OR NULL OR 1997 Q 1 OR 1997 Q 2 OR 1997 Q 3 OR 1997 Q 4 WA NULL WA 1997 Q 1 … Store. Sales 225, 627 56, 773 16, 081 12, 679 14, 274 13, 739 105, 324 25, 240 © Ellis Cohen, 2002 -2005 50
Cross Tabulations © Ellis Cohen, 2002 -2005 51
Cross Tab View of Rollup 1997 Q 1 1997 Q 2 1997 Q 3 1997 Q 4 ? 225, 627 CA 63, 530 14, 431 15, 332 15, 673 18, 094 OR 56, 773 16, 081 12, 679 14, 274 13, 739 WA 105, 324 25, 240 24, 953 25, 958 29, 173 © Ellis Cohen, 2002 -2005 52
Cross Tab with Sums © Ellis Cohen, 2002 -2005 53
GROUP BY CUBE SELECT c. state, d. Quarter. Yr, sum(f. cost) as Store. Cost FROM Fact f NATURAL JOIN Customers c NATURAL JOIN Stores s NATURAL JOIN Dates d NATURAL JOIN Products p WHERE p. Major. Category = 'Drink' AND d. Year = 1997 AND s. country = 'USA' AND s. State = 'CA' GROUP BY CUBE( c. state, d. Quarter. Yr ) © Ellis Cohen, 2002 -2005 54
Cube Results Rollup ALL Qtr Rollups CA Rollup Per-Qtr Rollups in CA OR Rollup Per-Qtr Rollups in OR WA Rollup State Quarter NULL 1997 Q 1 NULL 1997 Q 2 NULL 1997 Q 3 NULL 1997 Q 4 CANULL 63, 530 CA 1997 Q 1 14, 431 CA 1997 Q 2 15, 332 CA 1997 Q 3 15, 673 CA 1997 Q 4 18, 094 OR NULL OR 1997 Q 1 OR 1997 Q 2 OR 1997 Q 3 OR 1997 Q 4 WA NULL WA 1997 Q 1 … Store. Sales 225, 627 55, 752 52, 964 55, 905 61, 006 © Ellis Cohen, 2002 -2005 56, 773 16, 081 12, 679 14, 274 13, 739 105, 324 25, 240 55
Detailed Cross Tab © Ellis Cohen, 2002 -2005 56
Data Visualization © Ellis Cohen, 2002 -2005 57
Charting Visualizations All: 1 dimension, 1 measure © Ellis Cohen, 2002 -2005 58
Volume Visualization Clustered data: 3 dimensions, 1 measure shown using color © Ellis Cohen, 2002 -2005 59
Colored Sphere Visualization White: colored measure unknown Sparse data: 3 dimensions, 2 measures: pt size & color © Ellis Cohen, 2002 -2005 60
Vector Glyph Visualization 2 dimensions, 4 measures: <x, y, z> & color © Ellis Cohen, 2002 -2005 61
Dimensional Stacking 4 (2+2)dimensions, 1 binary measure • could use color for continuous measure • could chart: 3 (2+1) dimensions, 1 measure © Ellis Cohen, 2002 -2005 62
Visualization Issues Dimensions How many dimension attributes How dimension attributes are represented Measures How many simultaneous measures How measures are represented Spatial, Color (hue/brightness/…), Texture, Audio, other sensory Transformations Measures & Dimension attributes 1 variable: sqr, sqrt, log, exp, 1/x N variables: linear combinations Drill Up , Drill Down, Pivot Interactivity & Immersiveness © Ellis Cohen, 2002 -2005 63
Trend & Rank Analysis © Ellis Cohen, 2002 -2005 64
Trend Example Month Year Total Smoothed Total Jan 1994 200 Feb 1994 344 272 Mar 1994 401 315 Apr 1994 443 347 May 1994 360 387 Jun 1994 402 Jul 1994 389 399 Aug 1994 451 401 Window In addition to calculating the total # of units sold by month we want to smooth that over the preceding 3 months © Ellis Cohen, 2002 -2005 65
Trend Analysis When you build a result set you may want to define a field that depends on a group of related rows in the same result set (the window) This is particularly useful for analyzing trends SELECT d. month, d. year, sum(f. units) as totunits, {moving average of totunits over 3 months preceding} as movavg FROM Daily. Sales f NATURAL JOIN Dates d GROUP BY d. year, d. month window © Ellis Cohen, 2002 -2005 66
Trends in Oracle SQL WITH Monthly. Units AS ( SELECT d. month, d. year, sum(f. units) as totunits FROM Daily. Sales f NATURAL JOIN Dates d GROUP BY d. year, d. month ) SELECT month, year, totunits, avg(totunit) OVER ( ORDER BY year, month Window ROWS 3 PRECEDING ) AS movavg FROM Monthly. Units © Ellis Cohen, 2002 -2005 67
Trends in SQL 99 WITH Monthly. Units AS ( SELECT d. month, d. year, sum(f. units) as totunits FROM Daily. Sales f NATURAL JOIN Product p GROUP BY d. year, d. month ) SELECT month, year, totunits, avg(totunits) OVER w AS movavg FROM Monthly. Units WINDOW w AS ( ORDER BY year, month ROWS BETWEEN 3 PRECEDING AND CURRENT ROW ) © Ellis Cohen, 2002 -2005 68
Trends using Subqueries WITH Monthly. Units AS ( SELECT d. month, d. year, 12*d. year + d. month as mknt, sum(f. units) as totunits, FROM Daily. Sales f NATURAL JOIN Product p GROUP BY d. year, d. month) SELECT m. month, m. year, m. totunits, (SELECT avg(mm. totunits) FROM Monthly. Units mm WHERE mm. mknt BETWEEN m. mknt – 3 AND m. mknt) AS movavg FROM Monthly. Units m ORDER BY m. year, m. month © Ellis Cohen, 2002 -2005 69
Rank Example Month Year Total Rank Jan 1994 200 8 Feb 1994 344 7 Mar 1994 401 4 Apr 1994 443 2 May 1994 360 6 Jun 1994 404 3 Jul 1994 389 5 Aug 1994 451 1 Window In addition to calculating the total # of units sold by month we want to rank it with respect to all the months © Ellis Cohen, 2002 -2005 70
Ranking in Oracle SQL WITH Monthly. Units AS ( SELECT d. month, d. year, sum(f. units) as totunits FROM Daily. Sales f NATURAL JOIN Product p GROUP BY d. year, d. month ) SELECT month, year, totunits, rank() OVER ( ORDER BY totunit DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS ranktotal FROM Monthly. Units The ugly idiom for 'ALL ROWS' © Ellis Cohen, 2002 -2005 71
Analytical Functions Ranking Functions RANK, DENSE_RANK, CUME_DIST, PERCENT_RANK, NTILE, ROW_NUMBER Rank current row within the window Inverse Percentile Functions PERCENTILE_CONT (continuous) PERCENTILE_DISC (discrete) Histogram Support WIDTH_BUCKET Lag/Lead Functions LAG, LEAD Return data from a row at a specified offset from the current row within the window © Ellis Cohen, 2002 -2005 72
ETL: Extraction, Transformation & Loading © Ellis Cohen, 2002 -2005 73
ETL: Extraction, Transformation & Loading Extraction Loading Transformation 80% of total cost of building warehouse © Ellis Cohen, 2002 -2005 74
Extraction Sources Multiple DB's Flat Files External Data Sources • e. g. Census, Geographic, Weather, Financial, Unemployment Data • Standard DB/Spreadsheet format or semistructured data from the web Frequency Periodic (hourly, daily, weekly, …) Triggered • Single event • #, sequence, pattern of events Mechanisms Snapshots / Materialized Views / Replication Database Triggers Process Logs Query Sources (full vs incremental) © Ellis Cohen, 2002 -2005 75
Transformation Cleaning Scrubbing Filtering Conformance Integration Renaming Fusion & Merging Determine Surrogate Keys Timestamping Summarization Schema Organization Dimension Tables Pre-Aggregation via Materialized Views Derivation © Ellis Cohen, 2002 -2005 76
(Transformation) Cleaning Scrubbing Use domain-specific knowledge e. g. SS#, phone-number, zipcode Filtering Check for inconsistent data Use data validation rules Conformance Map similarly typed data to standard representation Convert units (inch => cm, $ => euro) scale (mm => cm) formats (string => integer, string with/wo $) © Ellis Cohen, 2002 -2005 77
(Transformation) Integration Renaming Resolve name conflicts Fusion - e. g. merge – properties in city db – properties in developer lists Determine Surrogate Keys Do not use keys from operational data as primary key in warehouse data Timestamping Add timestamps to fact data where missing to enable historical queries Reorganization & Evolution Support Data Reorganization & Schema Evolution Summarization Summarize original operational data and combine into less detailed tables © Ellis Cohen, 2002 -2005 78
Integration (Data Reorganization) What do we do when attributes change? Suppose districts are reorganized and a store is now part of a different district Consistently changing mapping of store to district – – Allows new and old data to be compared reasonably by district But causes incorrect comparisons by district among older data alone Solutions 1. Keep fields for both old and new mapping -- in fact, potentially a separate field for each reorganization 2. Add effective date to store dimension. Have multiple rows for same store - each with different effective date © Ellis Cohen, 2002 -2005 79
(Integration) Summarization Point. Of. Sale. Terminals Daily. Sales (Fact) posid postyp storid loc Customer. Transaction transid custid empid posid time Item. Purchase transid lineno prodid price units storid prodid date price units Might build different fact tables for different purposes: e. g. ones involving Customers ones involving Store Locations Tradeoff Smaller Fact Tables vs. Missed Relationships © Ellis Cohen, 2002 -2005 80
Loading Alternatives – Incremental vs Full Refresh: most data is incrementally added to the warehouse – Off-line vs on-line – Frequency • Nightly • Weekly • Monthly – All-at-once vs Staged What indices to create or drop? What statistics to collect (& use)? © Ellis Cohen, 2002 -2005 81
Constellation Schema Data warehouses often are designed as constellations • Multiple fact tables • Shared/related dimension tables Examples – Sales: store, product, date – Distribution: distributor, store, product, carrier, period – Advertising: store, medium, product, period Query across same or related dimensions – Compare advertising and sales by store within various periods © Ellis Cohen, 2002 -2005 82
Data Marts Store different fact tables (or different groups of fact tables) in separate data marts © Ellis Cohen, 2002 -2005 83
Data Mart Architectures Subset of Data Warehouse Meets needs of subgroup of users • Top-down: – Extracted from Data Warehouse – Problem: early availability • Bottom-up: – Built directly from staging area – Can be combined to form warehouse – Problem: Conformance. ETL tool must provide metadata • Hybrid: – Some data marts built directly from staging area – Others extracted from Data Warehouse © Ellis Cohen, 2002 -2005 84
Metadata Management Identify & define each attribute – Source(s) – Transformation(s) applied – How aggregated – Description of what it represents – Relationships to other attributes – History © Ellis Cohen, 2002 -2005 85
Materialized Views & Query Rewriting © Ellis Cohen, 2002 -2005 86
What is an Ordinary View A view is not a table A view does not hold data A view is just a description used in expanding queries which refer to the view! © Ellis Cohen, 2002 -2005 87
View Expansion Suppose we define CREATE VIEW Hi. Emps AS SELECT * FROM Emps WHERE sal > 1500 and then execute the query SELECT ename, job FROM Hi. Emps The database engine automatically expands this into SELECT ename, job FROM Emps WHERE sal > 1500 © Ellis Cohen, 2002 -2005 88
Motivating Materialized Views Suppose a view is • Used frequently in an application • Somewhat expensive to compute • Based on tables that change infrequently It would be useful to • Store the contents of the view in a table • Use the table for queries • Arrange to update the table (automatically) when the base tables change [or, perhaps less frequently, if the view does not need to be perfectly up-to-date] © Ellis Cohen, 2002 -2005 89
Example Star Schema Stores (Dimension) storid stornam city state Dates distid (Dimension) distnam dateid distarea date regid dayofweek regnam dayofmonth dayofyear weekyr weekofyear monthyr month quarteryr quarter year Daily. Sales (Fact) storid prodid dateid price units © Ellis Cohen, 2002 -2005 Products (Dimension) prodid color size prodtyp prodnam prodescr manfid manfnam subcatid subnam subdescr catid catnam catdescr 90
Materialized Views Materialized views actually hold data CREATE MATERIALIZED VIEW Prod. Dist. Yr. Sum AS SELECT p. prodtyp, s. distid, d. year, sum(f. units) as totunits FROM Daily. Sales f NATURAL JOIN Stores s NATURAL JOIN Products p NATURAL JOIN Dates d GROUP BY p. prodtyp, s. distid, d. year A materialized view is • Like a table, in that it actually stores the result of the query • Like a view, in that it is possible to arrange for it be automatically updated when the underlying base data changes © Ellis Cohen, 2002 -2005 91
Updating Materialized Views During the loading phase new data is incrementally added to data warehouse tables Materialized Views (which are defined as part of architecting the data warehouse) are either – Recalculated from scratch based on the new base table contents – Incrementally updated based on incremental changes to the base tables. How is Prod. Dist. Sum. Yr incrementally updated when a new day's worth of data is added? © Ellis Cohen, 2002 -2005 92
Using Materialized Views Instead of writing, SELECT p. prodtyp, s. distid, sum(f. units) as totunits FROM Daily. Sales f NATURAL JOIN Stores s NATURAL JOIN Products p NATURAL JOIN Dates d WHERE d. year = 2002 GROUP BY p. prodtyp, s. distid just write SELECT prodtyp, distid, totunits FROM Prod. Dist. Yr. Sum WHERE year = 2002 Because Prod. Dist. Yr. Sum is a materialized view, the database engine does NOT expand it, but just uses its materialized data © Ellis Cohen, 2002 -2005 93
Aggregating Materialized Views Instead of writing, SELECT p. prodtyp, s. distid, sum(f. units) as totunits FROM Daily. Sales f NATURAL JOIN Stores s NATURAL JOIN Products p GROUP BY p. prodtyp, s. distid just write SELECT prodtyp, distid, sum(totunits) AS totunits_overalltime FROM Prod. Dist. Yr. Sum GROUP BY prodtyp, distid © Ellis Cohen, 2002 -2005 94
Architecting Materialized Views Which views should be materialized? Many possible combinations: – – – district, region district/week, district/month, … region/week, region/month, … district/category, district/manufacturer, … category/week, category/month, … category/district/week, … Design balances – Cost of precalculating & storing view – Cost of calculating on the fly A heuristic optimization problem – Uses statistics of queries – Uses size of each combination – e. g. Benefit Per Unit Space (BPUS) © Ellis Cohen, 2002 -2005 95
Materialized View Evolution Problem As the data warehouse evolves, the set of materialized views needs to change. But, if the DW design already includes 1000 analysis queries, they would need to be rewritten to use the new set of materialized views. This is expensive! © Ellis Cohen, 2002 -2005 96
Query Rewriting Systems (like Oracle) that support query rewriting • Can automatically rewrite queries to use available materialized views (this can be complicated!) • Allow a subset of materialized views to be marked for use in query rewriting Query rewriting is the opposite of view expansion! If the data warehouse does not support query rewriting, the ETL tool could do it instead! © Ellis Cohen, 2002 -2005 97
Stores x Sub. Categories CREATE MATERIALIZED VIEW Store. Subcat. Sum AS SELECT p. storid, s. subcatid, sum(f. units) as totunits FROM Daily. Sales f NATURAL JOIN Store s NATURAL JOIN Product p Products GROUP BY p. storid, s. subcatid Stores Product. Types Districts Sub. Categories Regions Categories © Ellis Cohen, 2002 -2005 98
Districts x Product. Types CREATE MATERIALIZED VIEW Dist. Prodtyp. Sum AS SELECT p. prodtyp, s. distid, sum(f. units) as totunits FROM Daily. Sales f NATURAL JOIN Stores s NATURAL JOIN Products p GROUP BY p. prodtyp, s. distid Products Stores Product. Types Districts Sub. Categories Regions Categories © Ellis Cohen, 2002 -2005 99
Multiple Materialized View Alternatives SELECT p. catid, s. regid, sum(f. units) as totunits FROM Daily. Sales f NATURAL JOIN Stores s NATURAL JOIN Products p Product GROUP BY p. catid, s. regid Should the optimizer rewrite this query in terms of Store. Subcat. Sum or Dist. Prodtyp. Sum ? Stores Product. Types Districts Sub. Categories Regions Categories In general, how good is the optimizer? Can it discover unions, etc. © Ellis Cohen, 2002 -2005 100
Automatic Result Caching Database can (potentially) – cache the results of any query automatically as a materialized view – use the query history to automatically define new materialized views Then, based on their size & usage statistics, the DB can automatically determine – whether to discard any of these views after a while – whether to discard or update any of these views when their underlying base tables are updated © Ellis Cohen, 2002 -2005 101
Materialized View References Stores (Dimension) storid stornam … city state distid distnam distarea Not regid Foreign Key Constraints regnam When dimensions are denormalized, primary keys of materialized do not refer to unique dimension attributes Daily. Sales (Fact) storid prodid dateid price units Dist. Prodtyp Sum prodtyp distid totunits This requires using DISTINCT queries © Ellis Cohen, 2002 -2005 Products (Dimension) prodid color size prodtyp prodnam prodescr manfid manfnam subcatid subnam subdescr catid catnam catdescr 102
Queries Requiring DISTINCT SELECT p. prodtyp, s. distnam, sum(f. units) as totunits FROM Daily. Sales f NATURAL JOIN Stores s NATURAL JOIN Products p GROUP BY p. prodtyp, s. distid, s. distnam rewritten as SELECT DISTINCT prodtyp, distnam, totunits FROM Dist. Prod. Type. Sum NATURAL JOIN Stores DISTINCT is only needed because distid is not unique in Stores Fix by adding a denormalized subdimension table for Districts © Ellis Cohen, 2002 -2005 103
Starflake Schema (Fully Denormalized Dimensions and Sub. Dimensions as needed) Store (Dimension) storid stornam city state distid distnam distarea regid regnam Districts (Sub. Dimension) distid distnam distarea regid regnam Daily. Sales (Fact) storid prodid dateid price units … Dist. Prodtyp Sum prodtyp distid totunits Products (Dimension) prodid color size prodtyp prodnam prodescr manfid manfnam subcatid subnam subdescr catid catnam catdescr Product. Types (Sub. Dimension) prodtyp prodnam prodescr manfid manfnam subcatid subnam subdescr catid catnam Keep catdescr Denormalized © Ellis Cohen, 2002 -2005 104
Indexing for Data Warehouses © Ellis Cohen, 2002 -2005 105
Implementing Warehouse Queries Products (Dimension) SELECT sum(f. units) as totunits FROM Daily. Sales f NATURAL JOIN Products p WHERE p. catid = 5 Assume that • Products is indexed by catid • Daily Sales is indexed by prodid For a specific value of catid Get all rowids in Products with that catid Extract the prodid's Get all rowids in Daily. Sales with those prodid's Extract the units from the rows & sum Daily. Sales (Fact) storid prodid dateid price units Index by prodid © Ellis Cohen, 2002 -2005 prodid color size prodtyp prodnam prodescr manfid manfnam subcatid subnam subdescr catid catnam catdescr Index by catid 106
Using Join Indexing Products (Dimension) SELECT sum(f. units) as totunits FROM Daily. Sales f NATURAL JOIN Products p WHERE p. catid = 5 A join index is an index on one table based (in part) on values of fields of other tables Assume that • Daily Sales is indexed by Products. catid For a specific value of catid Get all rowids in Daily. Sales with that catid Extract the units from the rows & sum Daily. Sales (Fact) storid prodid dateid price units prodid color size prodtyp prodnam prodescr manfid manfnam subcatid subnam subdescr catid catnam catdescr Index by Product. catid © Ellis Cohen, 2002 -2005 107
Multi-Table Joins SELECT sum(f. units) as totunits FROM Daily. Sales f NATURAL JOIN Products p NATURAL JOIN Dates d WHERE p. catid = 5 AND d. year = 2002 Assume that • Daily Sales is indexed by Product. catid • Daily Sales is indexed by Dates. year For a specific value of catid Get all rowids in Daily. Sales with that catid For a specific year Get all rowids in Daily. Sales with that year Intersect the two lists of rowids (lots!) Extract the units from the rows & sum © Ellis Cohen, 2002 -2005 Daily. Sales (Fact) storid prodid dateid price units Index by Product. catid Index by Dates. year 108
Multi-Table Join Indexes SELECT sum(f. units) as totunits FROM Daily. Sales f NATURAL JOIN Product p NATURAL JOIN Dates d WHERE p. catid = 5 AND d. year = 2002 Assume that • Daily Sales is indexed by ( Product. catid, Dates. year ) For a distinct pair of (catid, year) Get all rowids in Daily. Sales with that catid & year Extract the units from the rows & sum Since issue as for aggregates. Lots of possible different combinations of multi-table join indices: Which ones are worth building? © Ellis Cohen, 2002 -2005 Daily. Sales (Fact) storid prodid dateid price units Index by (Product. catid, Dates. year) 109
Bitmap Indexing Product Bitmap Index prodid … catid … 603942 … 5 … 603947 … 2 … 1 603950 … 2 … 1 603951 … 2 … 1 603964 … 3 … 603968 … 5 … 1 1 1 … 1 2 3 4 5 © Ellis Cohen, 2002 -2005 110
Using Bitmap Indices SELECT min(size), max(size) FROM Product WHERE catid = 5 Assume that • Product has bitmap index on catid (faster than doing full scan or using B+ tree) Implement Query By Scan all tuples in Products with the bit set for catid: 5 Extract the size from each tuple and compute min(size) and max(size) © Ellis Cohen, 2002 -2005 Product (Dimension) prodid color size prodtyp prodnam prodescr manfid manfnam subcatid subnam subdescr catid catnam catdescr Bitmap Index by category 111
Bitmap Intersection SELECT min(size), max(size) FROM Product WHERE catid = 5 AND color = 'fuschia' Assume that • Product has bitmap indices on catid and on color Implement Query By Construct the bit vector which has bits set for both catid: 5 and color: fuschia (very fast!) Scan all tuples in Product with the bit set in the resulting bit vector Extract the size from each tuple and compute min(size) and max(size) © Ellis Cohen, 2002 -2005 Product (Dimension) prodid color size prodtyp prodnam prodescr manfid manfnam subcatid subnam subdescr catid catnam catdescr Bitmap Indices by category & by color 112
Bitmap Join Indexing SELECT sum(f. units) as totunits FROM Daily. Sales f NATURAL JOIN Product p WHERE p. catid = 5 Assume that • Daily Sales has bitmap index on Product. catid For a specific value of catid Get all rowids in Daily. Sales with that catid Extract the units from the rows & sum © Ellis Cohen, 2002 -2005 Daily. Sales (Fact) storid prodid dateid price units Bitmap Index by Product. catid 113
Multi-Table Joins with Bitmap Indices SELECT sum(f. units) as totunits FROM Daily. Sales f NATURAL JOIN Product p NATURAL JOIN Store s WHERE p. catid = 5 AND s. city = ‘Boston’ Assume that • Daily Sales has bitmap index on Product. catid • Daily Sales has bitmap index on Store. city Implement Query By Construct the bit vector which has bits set for both category: 5 and city: Boston (very fast!) Get the rowids of all rows with bit set in the resulting bit vector Extract the units from the rows & sum © Ellis Cohen, 2002 -2005 Daily. Sales (Fact) storid prodid dateid price units Bitmap Index by Product. catid Bitmap Index by Store. city 114
Indexing vs Materialization Indexing Less space Usable with different kinds of aggregation and analysis operations More opportunities for combining Materialized Views (esp. Aggregates) Avoid recomputation, esp. recalculation of aggregates © Ellis Cohen, 2002 -2005 115
- Slides: 115