Theory Practice Methodology of Relational Database Design and
Theory, Practice & Methodology of Relational Database Design and Programming Copyright © Ellis Cohen 2002 -2006 SQL for Data Warehouses These slides are licensed under a Creative Commons Attribution-Non. Commercial-Share. Alike 2. 5 License. For more information on how you may use them, please see http: //www. openlineconsult. com/db © Ellis Cohen, 2003 -2006
Topics Overview The Star Schema Viewing The Data Cube Drill Down & Rollup Cross Tabulations Data Visualization Trend & Rank Analysis Materialized Views & Query Rewriting Indexing for Data Warehouses © Ellis Cohen, 2003 -2006 2
The Star Schema © Ellis Cohen, 2003 -2006 3
Star Schema Stores (Dimension) Products (Dimension) storid … prodid … Daily. Sales (Fact) Data Warehouses are organized using Star Schema models storid prodid date price units foreign key Measures what each fact measures A Star Schema has a central fact table, with a composite primary key, which references multiple Dimension tables © Ellis Cohen, 2003 -2006 4
Subjects (Facts) & Dimensions Instead of thinking about entities & relationships, design a data warehouse by thinking about Subjects (represented by fact tables) Sales, Distribution, Purchases Dimensions (represented by dimension tables) How to uniquely identify the facts about each subject – Sales: Product, Stores, Dates (maybe also Employee, Customer: depends what you want to analyze) – Distribution: Warehouses, Products, Stores, Dates (maybe Employees & Trucks) – Purchases: Products, Vendors, Dates (maybe also Employees) © Ellis Cohen, 2003 -2006 5
Fact & Dimension Tables Fact Tables Composite primary key • identify dimensions • uniquely identify each fact (or measurement) Additional attributes: measures • what is measured about each fact Dimension Tables Primary key Surrogate key uniquely identifies each dimension value Additional attributes Properties of each dimension value © Ellis Cohen, 2003 -2006 6
Dimensions & Granularity Dimensions have different levels of granularity Products Stores Districts Regions Product. Types Sub. Categories Manufacturers Categories © Ellis Cohen, 2003 -2006 7
Star Schema is Denormalized The Star Schema has denormalized dimension tables • Each dimension by joining together the sub -dimension table to form a single dimension table • The dimension table has attributes at different levels of granularity • The dimension tables contain lots of redundancy, but queries use far fewer joins • Does not dramatically impact space: dimension tables usually < 1% size of fact table (but some descriptions may need to be stored separately) © Ellis Cohen, 2003 -2006 8
Star Schema with Date Dimension Stores (Dimension) storid stornam city state Dates distid (Dimension) distnam dateid distarea date regid dayofweek regnam dayofmonth dayofyear weekyr weekofyear monthyr month quarteryr quarter year Daily. Sales (Fact) storid prodid dateid price units In general, represent dates by a Dates dimension table © Ellis Cohen, 2003 -2006 Products (Dimension) prodid color size prodtyp prodnam prodescr manfid manfnam subcatid subnam subdescr catid catnam catdescr 9
Query using Dates Dimension How many red Bally shoes did we sell in each region in 2002? SELECT s. regnam as region, sum(f. units) as sumunits FROM Daily. Sales f NATURAL JOIN Stores s NATURAL JOIN Products p NATURAL JOIN Dates d WHERE d. year = 2002 AND p. color = 'red' AND p. manfnam = 'Bally' AND p. subnam = 'Shoe' GROUP BY s. regnam Needs an extra join, but simpler query, Executes faster if Dates is indexed by year © Ellis Cohen, 2003 -2006 10
Viewing The Data Cube © Ellis Cohen, 2003 -2006 11
Data Cube Representation Sales Cube Stores dimension Sales of Beanie Babies in Pittsburgh Store Today All Sales (of all products over time) in NYC Store Sales of Beanie Babies in Pittsburgh Store Yesterday Dates dimension Pgh NYC Products dimension © Ellis Cohen, 2003 -2006 12
Data Cube Characteristics Each axis represents a dimension – Elements along axis are at lowest granularity for that dimension Measures are the data within the cells at intersections of the cube – Information about the topic of the cube – e. g. units & price for each sales fact (i. e. sales in a store of a product on a date) © Ellis Cohen, 2003 -2006 13
Data Cube Views Slice View data relative to a point in one or more dimensions View sales today (for each store & each product category) View Bally shoe sales at the NYC store (for each date) Dice View data relative to (sets of) ranges in one or more dimensions View sales for the last 4 days (for each store & each product category) View sales for each type of shoes at all the NY and NJ stores for each of the last 10 quarters © Ellis Cohen, 2003 -2006 14
MDDB: Multi. Dimensional Data. Base Knows about Fact & Dimension Tables Uses direct (n dimensional) hypercube representation to provide fast access to fact elements in query Supports sparse representations – The Pittsburgh store doesn't sell lingerie – The Cape Cod store is not open in the winter – Baked Beanie Babies are only sold in the NE region Uses specialized query language e. g. MDX (used by Microsoft OLAP Server) w basic data types: cube, slice, dice © Ellis Cohen, 2003 -2006 15
Visualization Issues Dimensions How many dimension attributes How dimension attributes are represented Measures How many simultaneous measures How measures are represented Spatial, Color (hue/brightness/…), Texture, Audio, other sensory Transformations Measures & Dimension attributes 1 variable: sqr, sqrt, log, exp, 1/x N variables: linear combinations Drill Up , Drill Down, Pivot Interactivity & Immersiveness © Ellis Cohen, 2003 -2006 16
Choosing a View Store Customers Month. Yr Products 1997 Q 1 Educ Level City State Brand Quarter. Yr Minor. Sub. Category City Store Type CA Minor. Category Year State Major. Sub. Category Drink Country Detailed Dicing Dimension Major. Category Country Slicing Dimensions © Ellis Cohen, 2003 -2006 17
Slicing & Dicing Examples use dynasight, www. arcplan. com Detailed Dicing Dimension Base Measures Derived Measures © Ellis Cohen, 2003 -2006 Slicing Attributes 18
Slice, Dice & Chart Slicing Attributes Charted Measures Different Dicing Dimension Measures © Ellis Cohen, 2003 -2006 19
Drill Down & Roll Up © Ellis Cohen, 2003 -2006 20
Slicing & Dicing Drill Down Detailed Dicing Dimension Measures © Ellis Cohen, 2003 -2006 Slicing Attributes 21
Drill Down Re-Slice © Ellis Cohen, 2003 -2006 22
Uniform Drill Down & Rollup Uniform Drill Down Uniformly drill down to a certain level Uniform Rollup Compute Aggregate Values at that level and all higher levels Can be computed with a single SELECT statement using the ROLLUP grouping function Non-uniform rollups (previous slide) require UNIONs © Ellis Cohen, 2003 -2006 23
Ordinary Group Aggregation SELECT c. country, c. state, c. city, sum(f. sale) as Store. Sales, sum(f. cost) as Store. Cost, Store. Sales - Store. Cost as Store. Net, 100* Store. Cost / Store. Sales as Pct. Cost FROM Facts f NATURAL JOIN Customers c NATURAL JOIN Stores s NATURAL JOIN Dates d NATURAL JOIN Products p WHERE p. Major. Category = 'Drink' AND d. Quarter. Yr = '1997 Q 1' AND s. country = 'USA' Note: Constrain Store, not Customer AND s. State = 'CA' GROUP BY c. country, c. state, c. city © Ellis Cohen, 2003 -2006 24
Aggregate Query Results Per-City Rollups in CA Per-City Rollups in OR Country USA CA USA OR … … CANADA State City Store. Sales Altadena 96 … Arcadia 64 … … Woodland … … Beaverton 12 … Corvalis 21 … … Woodburn 4 … … BC Victoria … … … That's fine, but it does NOT give us • Aggregate store sales for CA, OR, BC, etc • Aggregate store sales for USA, CANADA • Aggregate store sales overall © Ellis Cohen, 2003 -2006 25
Rollup Query Results Rollup ALL USA Rollup CA Rollup Per-City Rollups in CA OR Rollup Per-City Rollups in OR Canada Rollup BC Rollup Country NULL USA CA USA CA USA OR USA OR … … CANADA … State City Store. Sales NULL 6310 … NULL 4310 … NULL 3310 … Altadena 96 … Arcadia 64 … … Woodland … … NULL 1000 … Beaverton 12 … Corvalis 21 … … Woodburn 4 … … NULL … … BC Victoria … © Ellis Cohen, 2003 -2006 … … 26
Rollup using Union SELECT c. country, c. state, c. city, … GROUP BY c. country, c. state, c. city UNION SELECT c. country, c. state, NULL AS city, … GROUP BY c. country, c. state UNION SELECT c. country, NULL AS state, NULL AS city, … GROUP BY c. country UNION SELECT NULL AS country, NULL AS state, NULL AS city, … © Ellis Cohen, 2003 -2006 27
GROUP BY ROLLUP SELECT c. country, c. state, c. city, sum(f. sale) as Store. Sales, sum(f. cost) as Store. Cost, Store. Sales - Store. Cost as Store. Net, 100* Store. Cost / Store. Sales as Pct. Cost FROM Facts f NATURAL JOIN Customers c NATURAL JOIN Stores s NATURAL JOIN Dates d NATURAL JOIN Products p WHERE p. Major. Category = 'Drink' AND d. Quarter. Yr = '1997 Q 1' AND s. country = 'USA' Note: Constrain Store, not Customer AND s. State = 'CA' GROUP BY ROLLUP( c. country, c. state, c. city ) © Ellis Cohen, 2003 -2006 28
Cross Dimension Rollups SELECT c. state, d. Quarter. Yr, sum(f. cost) as Store. Cost FROM Fact f NATURAL JOIN Customers c NATURAL JOIN Stores s NATURAL JOIN Dates d NATURAL JOIN Products p WHERE p. Major. Category = 'Drink' AND d. year = 1997 AND s. country = 'USA' AND s. State = 'CA' GROUP BY ROLLUP( c. state, d. Quarter. Yr ) © Ellis Cohen, 2003 -2006 29
Cross Dimension Rollup Results Rollup ALL CA Rollup Per-Qtr Rollups in CA OR Rollup Per-Qtr Rollups in OR WA Rollup State Quarter NULL CANULL 63, 530 CA 1997 Q 1 14, 431 CA 1997 Q 2 15, 332 CA 1997 Q 3 15, 673 CA 1997 Q 4 18, 094 OR NULL OR 1997 Q 1 OR 1997 Q 2 OR 1997 Q 3 OR 1997 Q 4 WA NULL WA 1997 Q 1 … Store. Sales 225, 627 56, 773 16, 081 12, 679 14, 274 13, 739 105, 324 25, 240 © Ellis Cohen, 2003 -2006 30
Cross Tabulations © Ellis Cohen, 2003 -2006 31
Cross Tab View of Rollup 1997 Q 1 1997 Q 2 1997 Q 3 1997 Q 4 ? 225, 627 CA 63, 530 14, 431 15, 332 15, 673 18, 094 OR 56, 773 16, 081 12, 679 14, 274 13, 739 WA 105, 324 25, 240 24, 953 25, 958 29, 173 © Ellis Cohen, 2003 -2006 32
Cross Tab with Sums © Ellis Cohen, 2003 -2006 33
GROUP BY CUBE SELECT c. state, d. Quarter. Yr, sum(f. cost) as Store. Cost FROM Fact f NATURAL JOIN Customers c NATURAL JOIN Stores s NATURAL JOIN Dates d NATURAL JOIN Products p WHERE p. Major. Category = 'Drink' AND d. Year = 1997 AND s. country = 'USA' AND s. State = 'CA' GROUP BY CUBE( c. state, d. Quarter. Yr ) © Ellis Cohen, 2003 -2006 34
Cube Results Rollup ALL Qtr Rollups CA Rollup Per-Qtr Rollups in CA OR Rollup Per-Qtr Rollups in OR WA Rollup State Quarter NULL 1997 Q 1 NULL 1997 Q 2 NULL 1997 Q 3 NULL 1997 Q 4 CANULL 63, 530 CA 1997 Q 1 14, 431 CA 1997 Q 2 15, 332 CA 1997 Q 3 15, 673 CA 1997 Q 4 18, 094 OR NULL OR 1997 Q 1 OR 1997 Q 2 OR 1997 Q 3 OR 1997 Q 4 WA NULL WA 1997 Q 1 … Store. Sales 225, 627 55, 752 52, 964 55, 905 61, 006 © Ellis Cohen, 2003 -2006 56, 773 16, 081 12, 679 14, 274 13, 739 105, 324 25, 240 35
Detailed Cross Tab © Ellis Cohen, 2003 -2006 36
Trend & Rank Analysis © Ellis Cohen, 2003 -2006 37
Trend Example Month Year Total Smoothed Total Jan Feb Mar Apr May Jun Jul Aug 1994 1994 200 344 401 443 360 404 389 451 200 272 315 347 387 402 399 401 Window In addition to calculating the total # of units sold by month we want to smooth that over the preceding 3 months © Ellis Cohen, 2003 -2006 38
Trend Analysis When you build a result set you may want to define a field that depends on a group of related rows in the same result set (the window) This is particularly useful for analyzing trends SELECT d. month, d. year, sum(f. units) as totunits, {moving average of totunits over 3 months preceding} as movavg FROM Daily. Sales f NATURAL JOIN Dates d GROUP BY d. year, d. month window © Ellis Cohen, 2003 -2006 39
Trends in Oracle SQL WITH Monthly. Units AS ( SELECT d. month, d. year, sum(f. units) as totunits FROM Daily. Sales f NATURAL JOIN Dates d GROUP BY d. year, d. month ) SELECT month, year, totunits, avg(totunit) OVER ( ORDER BY year, month Window ROWS 3 PRECEDING ) AS movavg FROM Monthly. Units © Ellis Cohen, 2003 -2006 40
Trends in SQL 99 WITH Monthly. Units AS ( SELECT d. month, d. year, sum(f. units) as totunits FROM Daily. Sales f NATURAL JOIN Product p GROUP BY d. year, d. month ) SELECT month, year, totunits, avg(totunits) OVER w AS movavg FROM Monthly. Units WINDOW w AS ( ORDER BY year, month ROWS BETWEEN 3 PRECEDING AND CURRENT ROW ) © Ellis Cohen, 2003 -2006 41
Trends using Subqueries WITH Monthly. Units AS ( SELECT d. month, d. year, 12*d. year + d. month as mknt, sum(f. units) as totunits, FROM Daily. Sales f NATURAL JOIN Product p GROUP BY d. year, d. month) SELECT m. month, m. year, m. totunits, (SELECT avg(mm. totunits) FROM Monthly. Units mm WHERE mm. mknt BETWEEN m. mknt – 3 AND m. mknt) AS movavg FROM Monthly. Units m ORDER BY m. year, m. month © Ellis Cohen, 2003 -2006 42
Rank Example Month Year Total Rank Jan Feb Mar Apr May Jun Jul Aug 1994 1994 200 344 401 443 360 404 389 451 8 7 4 2 6 3 5 1 Window In addition to calculating the total # of units sold by month we want to rank it with respect to all the months © Ellis Cohen, 2003 -2006 43
Ranking in Oracle SQL WITH Monthly. Units AS ( SELECT d. month, d. year, sum(f. units) as totunits FROM Daily. Sales f NATURAL JOIN Product p GROUP BY d. year, d. month ) SELECT month, year, totunits, rank() OVER ( ORDER BY totunit DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS ranktotal FROM Monthly. Units The ugly idiom for 'ALL ROWS' © Ellis Cohen, 2003 -2006 44
Analytical Functions Ranking Functions RANK, DENSE_RANK, CUME_DIST, PERCENT_RANK, NTILE, ROW_NUMBER Rank current row within the window Inverse Percentile Functions PERCENTILE_CONT (continuous) PERCENTILE_DISC (discrete) Histogram Support WIDTH_BUCKET Lag/Lead Functions LAG, LEAD Return data from a row at a specified offset from the current row within the window © Ellis Cohen, 2003 -2006 45
Materialized Views & Query Rewriting © Ellis Cohen, 2003 -2006 46
What is an Ordinary View A view is not a table A view does not hold data A view is just a description used in expanding queries which refer to the view! © Ellis Cohen, 2003 -2006 47
View Expansion Suppose we define CREATE VIEW Hi. Emps AS SELECT * FROM Emps WHERE sal > 1500 and then execute the query SELECT ename, job FROM Hi. Emps The database engine automatically expands this into SELECT ename, job FROM Emps WHERE sal > 1500 © Ellis Cohen, 2003 -2006 48
Motivating Materialized Views Suppose a view is • Used frequently in an application • Somewhat expensive to compute • Based on tables that change infrequently It would be useful to • Store the contents of the view in a table • Use the table for queries • Arrange to update the table (automatically) when the base tables change [or, perhaps less frequently, if the view does not need to be perfectly up-to-date] © Ellis Cohen, 2003 -2006 49
Example Star Schema Stores (Dimension) storid stornam city state Dates distid (Dimension) distnam dateid distarea date regid dayofweek regnam dayofmonth dayofyear weekyr weekofyear monthyr month quarteryr quarter year Daily. Sales (Fact) storid prodid dateid price units © Ellis Cohen, 2003 -2006 Products (Dimension) prodid color size prodtyp prodnam prodescr manfid manfnam subcatid subnam subdescr catid catnam catdescr 50
Materialized Views Materialized views actually hold data CREATE MATERIALIZED VIEW Prod. Dist. Yr. Sum AS SELECT p. prodtyp, s. distid, d. year, sum(f. units) as totunits FROM Daily. Sales f NATURAL JOIN Stores s NATURAL JOIN Products p NATURAL JOIN Dates d GROUP BY p. prodtyp, s. distid, d. year A materialized view is • Like a table, in that it actually stores the result of the query • Like a view, in that it is possible to arrange for it be automatically updated when the underlying base data changes © Ellis Cohen, 2003 -2006 51
Updating Materialized Views During the loading phase new data is incrementally added to data warehouse tables Materialized Views (which are defined as part of architecting the data warehouse) are either – Recalculated from scratch based on the new base table contents – Incrementally updated based on incremental changes to the base tables. How is Prod. Dist. Sum. Yr incrementally updated when a new day's worth of data is added? © Ellis Cohen, 2003 -2006 52
Using Materialized Views Instead of writing, SELECT p. prodtyp, s. distid, sum(f. units) as totunits FROM Daily. Sales f NATURAL JOIN Stores s NATURAL JOIN Products p NATURAL JOIN Dates d WHERE d. year = 2002 GROUP BY p. prodtyp, s. distid just write SELECT prodtyp, distid, totunits FROM Prod. Dist. Yr. Sum WHERE year = 2002 Because Prod. Dist. Yr. Sum is a materialized view, the database engine does NOT expand it, but just uses its materialized data © Ellis Cohen, 2003 -2006 53
Aggregating Materialized Views Instead of writing, SELECT p. prodtyp, s. distid, sum(f. units) as totunits FROM Daily. Sales f NATURAL JOIN Stores s NATURAL JOIN Products p GROUP BY p. prodtyp, s. distid just write SELECT prodtyp, distid, sum(totunits) AS totunits_overalltime FROM Prod. Dist. Yr. Sum GROUP BY prodtyp, distid © Ellis Cohen, 2003 -2006 54
Architecting Materialized Views Which views should be materialized? Many possible combinations: – – – district, region district/week, district/month, … region/week, region/month, … district/category, district/manufacturer, … category/week, category/month, … category/district/week, … Design balances – Cost of precalculating & storing view – Cost of calculating on the fly A heuristic optimization problem – Uses statistics of queries – Uses size of each combination – e. g. Benefit Per Unit Space (BPUS) © Ellis Cohen, 2003 -2006 55
Materialized View Evolution Problem As the data warehouse evolves, the set of materialized views needs to change. But, if the DW design already includes 1000 analysis queries, they would need to be rewritten to use the new set of materialized views. This is expensive! © Ellis Cohen, 2003 -2006 56
Query Rewriting Systems (like Oracle) that support query rewriting • Can automatically rewrite queries to use available materialized views (this can be complicated!) • Allow a subset of materialized views to be marked for use in query rewriting Query rewriting is the opposite of view expansion! If the data warehouse does not support query rewriting, the ETL tool could do it instead! © Ellis Cohen, 2003 -2006 57
Stores x Sub. Categories CREATE MATERIALIZED VIEW Store. Subcat. Sum AS SELECT p. storid, s. subcatid, sum(f. units) as totunits FROM Daily. Sales f NATURAL JOIN Store s NATURAL JOIN Product p Products GROUP BY p. storid, s. subcatid Stores Product. Types Districts Sub. Categories Regions Categories © Ellis Cohen, 2003 -2006 58
Districts x Product. Types CREATE MATERIALIZED VIEW Dist. Prodtyp. Sum AS SELECT p. prodtyp, s. distid, sum(f. units) as totunits FROM Daily. Sales f NATURAL JOIN Stores s NATURAL JOIN Products p GROUP BY p. prodtyp, s. distid Products Stores Product. Types Districts Sub. Categories Regions Categories © Ellis Cohen, 2003 -2006 59
Multiple Materialized View Alternatives SELECT p. catid, s. regid, sum(f. units) as totunits FROM Daily. Sales f NATURAL JOIN Stores s NATURAL JOIN Products p Product GROUP BY p. catid, s. regid Should the optimizer rewrite this query in terms of Store. Subcat. Sum or Dist. Prodtyp. Sum ? Stores Product. Types Districts Sub. Categories Regions Categories In general, how good is the optimizer? Can it discover unions, etc. © Ellis Cohen, 2003 -2006 60
Automatic Result Caching Database can (potentially) – cache the results of any query automatically as a materialized view – use the query history to automatically define new materialized views Then, based on their size & usage statistics, the DB can automatically determine – whether to discard any of these views after a while – whether to discard or update any of these views when their underlying base tables are updated © Ellis Cohen, 2003 -2006 61
Materialized View References Stores (Dimension) storid stornam … city state distid distnam distarea Not regid Foreign Key Constraints regnam When dimensions are denormalized, primary keys of materialized do not refer to unique dimension attributes Daily. Sales (Fact) storid prodid dateid price units Dist. Prodtyp Sum prodtyp distid totunits This requires using DISTINCT queries © Ellis Cohen, 2003 -2006 Products (Dimension) prodid color size prodtyp prodnam prodescr manfid manfnam subcatid subnam subdescr catid catnam catdescr 62
Queries Requiring DISTINCT SELECT p. prodtyp, s. distnam, sum(f. units) as totunits FROM Daily. Sales f NATURAL JOIN Stores s NATURAL JOIN Products p GROUP BY p. prodtyp, s. distid, s. distnam rewritten as SELECT DISTINCT prodtyp, distnam, totunits FROM Dist. Prod. Type. Sum NATURAL JOIN Stores DISTINCT is only needed because distid is not unique in Stores Fix by adding a denormalized subdimension table for Districts © Ellis Cohen, 2003 -2006 63
Starflake Schema (Fully Denormalized Dimensions and Sub. Dimensions as needed) Store (Dimension) storid stornam city state distid distnam distarea regid regnam Districts (Sub. Dimension) distid distnam distarea regid regnam Daily. Sales (Fact) storid prodid dateid price units … Dist. Prodtyp Sum prodtyp distid totunits Products (Dimension) prodid color size prodtyp prodnam prodescr manfid manfnam subcatid subnam subdescr catid catnam catdescr Product. Types (Sub. Dimension) prodtyp prodnam prodescr manfid manfnam subcatid subnam subdescr catid catnam Keep catdescr Denormalized © Ellis Cohen, 2003 -2006 64
Indexing for Data Warehouses © Ellis Cohen, 2003 -2006 65
Implementing Warehouse Queries Products (Dimension) SELECT sum(f. units) as totunits FROM Daily. Sales f NATURAL JOIN Products p WHERE p. catid = 5 Assume that • Products is indexed by catid • Daily Sales is indexed by prodid For a specific value of catid Get all rowids in Products with that catid Extract the prodid's Get all rowids in Daily. Sales with those prodid's Extract the units from the rows & sum Daily. Sales (Fact) storid prodid dateid price units Index by prodid © Ellis Cohen, 2003 -2006 prodid color size prodtyp prodnam prodescr manfid manfnam subcatid subnam subdescr catid catnam catdescr Index by catid 66
Using Join Indexing Products (Dimension) SELECT sum(f. units) as totunits FROM Daily. Sales f NATURAL JOIN Products p WHERE p. catid = 5 A join index is an index on one table based (in part) on values of fields of other tables Assume that • Daily Sales is indexed by Products. catid For a specific value of catid Get all rowids in Daily. Sales with that catid Extract the units from the rows & sum Daily. Sales (Fact) storid prodid dateid price units prodid color size prodtyp prodnam prodescr manfid manfnam subcatid subnam subdescr catid catnam catdescr Index by Product. catid © Ellis Cohen, 2003 -2006 67
Multi-Table Joins SELECT sum(f. units) as totunits FROM Daily. Sales f NATURAL JOIN Products p NATURAL JOIN Dates d WHERE p. catid = 5 AND d. year = 2002 Assume that • Daily Sales is indexed by Product. catid • Daily Sales is indexed by Dates. year For a specific value of catid Get all rowids in Daily. Sales with that catid For a specific year Get all rowids in Daily. Sales with that year Intersect the two lists of rowids (lots!) Extract the units from the rows & sum © Ellis Cohen, 2003 -2006 Daily. Sales (Fact) storid prodid dateid price units Index by Product. catid Index by Dates. year 68
Multi-Table Join Indexes SELECT sum(f. units) as totunits FROM Daily. Sales f NATURAL JOIN Product p NATURAL JOIN Dates d WHERE p. catid = 5 AND d. year = 2002 Assume that • Daily Sales is indexed by ( Product. catid, Dates. year ) For a distinct pair of (catid, year) Get all rowids in Daily. Sales with that catid & year Extract the units from the rows & sum Since issue as for aggregates. Lots of possible different combinations of multi-table join indices: Which ones are worth building? © Ellis Cohen, 2003 -2006 Daily. Sales (Fact) storid prodid dateid price units Index by (Product. catid, Dates. year) 69
Bitmap Indexing Product prodid … catid … 603942 603947 603950 603951 603964 603968 … … … … 5 2 2 2 3 5 … … … Product Bitmap Index 1 1 1 1 2 3 4 5 © Ellis Cohen, 2003 -2006 70
Using Bitmap Indices SELECT min(size), max(size) FROM Product WHERE catid = 5 Assume that • Product has bitmap index on catid (faster than doing full scan or using B+ tree) Implement Query By Scan all tuples in Products with the bit set for catid: 5 Extract the size from each tuple and compute min(size) and max(size) © Ellis Cohen, 2003 -2006 Product (Dimension) prodid color size prodtyp prodnam prodescr manfid manfnam subcatid subnam subdescr catid catnam catdescr Bitmap Index by category 71
Bitmap Intersection SELECT min(size), max(size) FROM Product WHERE catid = 5 AND color = 'fuschia' Assume that • Product has bitmap indices on catid and on color Implement Query By Construct the bit vector which has bits set for both catid: 5 and color: fuschia (very fast!) Scan all tuples in Product with the bit set in the resulting bit vector Extract the size from each tuple and compute min(size) and max(size) © Ellis Cohen, 2003 -2006 Product (Dimension) prodid color size prodtyp prodnam prodescr manfid manfnam subcatid subnam subdescr catid catnam catdescr Bitmap Indices by category & by color 72
Bitmap Join Indexing SELECT sum(f. units) as totunits FROM Daily. Sales f NATURAL JOIN Product p WHERE p. catid = 5 Assume that • Daily Sales has bitmap index on Product. catid For a specific value of catid Get all rowids in Daily. Sales with that catid Extract the units from the rows & sum © Ellis Cohen, 2003 -2006 Daily. Sales (Fact) storid prodid dateid price units Bitmap Index by Product. catid 73
Multi-Table Joins with Bitmap Indices SELECT sum(f. units) as totunits FROM Daily. Sales f NATURAL JOIN Product p NATURAL JOIN Store s WHERE p. catid = 5 AND s. city = ‘Boston’ Assume that • Daily Sales has bitmap index on Product. catid • Daily Sales has bitmap index on Store. city Implement Query By Construct the bit vector which has bits set for both category: 5 and city: Boston (very fast!) Get the rowids of all rows with bit set in the resulting bit vector Extract the units from the rows & sum © Ellis Cohen, 2003 -2006 Daily. Sales (Fact) storid prodid dateid price units Bitmap Index by Product. catid Bitmap Index by Store. city 74
Indexing vs Materialization Indexing Less space Usable with different kinds of aggregation and analysis operations More opportunities for combining Materialized Views (esp. Aggregates) Avoid recomputation, esp. recalculation of aggregates © Ellis Cohen, 2003 -2006 75
- Slides: 75