Database Management Systems Chapter 8 Data Warehouses and
Database Management Systems Chapter 8 Data Warehouses and Data Mining Jerry Post Copyright © 2003 1
D A T A B A S E Sequential Storage and Indexes ² We picture tables as simple rows and columns, but they cannot be stored this way. ª It takes too many operations to find an item. ª Insertions require reading and rewriting the entire table. ID Last. Name First. Name Date. Hired 1 Reeves Keith 1/29/98 2 Gibson Bill 3/31/98 3 Reasoner Katy 2/17/98 4 Hopkins Alan 2/8/98 5 James Leisha 1/6/98 6 Eaton Anissa 8/23/98 7 Farris Dustin 3/28/98 8 Carpenter Carlos 12/29/98 9 O'Connor Jessica 7/23/98 10 Shields Howard 7/13/98 2
D A T A B A S E Operations on Sequential Tables ² Read entire table ª Easy and fast ² Sequential retrieval ª Easy and fast for one order. ² Random Read/Sequential ª ª Very weak Probability of any row = 1/N Sequential retrieval 1, 000 rows means 500, 000 retrievals per lookup! Row A B C D E … Prob. 1/N 1/N 1/N # Reads 1 2 3 4 5 i ² Delete ª Easy ² Insert/Modify ª Very weak 3
D A T A B A S E Insert into Sequential Table ² Insert Inez: ª ª Find insert location. Copy top to new file. At insert location, add row. Copy rest of file. ID 8 6 7 2 Last. Name Carpenter Eaton Farris Gibson First. Name Date. Hired Carlos 12/29/98 Anissa 8/23/98 Dustin 3/28/98 Bill 3/31/98 11 Inez Maria 1/15/99 5 9 3 1 10 James O'Connor Reasoner Reeves Shields Leisha Jessica Katy Keith Howard 1/6/98 7/23/98 2/17/98 1/29/98 7/13/98 ID 8 6 7 2 4 5 9 3 1 10 Last. Name Carpenter Eaton Farris Gibson Hopkins James O'Connor Reasoner Reeves Shields First. Name Date. Hired Carlos 12/29/98 Anissa 8/23/98 Dustin 3/28/98 Bill 3/31/98 Alan 2/8/98 Leisha 1/6/98 Jessica 7/23/98 Katy 2/17/98 Keith 1/29/98 Howard 7/13/98 4
D A T A B A S E Binary Search ² Given a sorted list of names. ² How do you find Jones. ² Sequential search ª Jones = 10 lookups ª Average = 15/2 = 7. 5 lookups ª Min = 1, Max = 14 ² Binary search ª ª ª Find midpoint (14 / 2) = 7 Jones > Goetz Jones < Kalida Jones > Inez Jones = Jones (4 lookups) ² Max = log 2 (N) Adams Brown Cadiz Dorfmann Eaton Farris 1 Goetz Hanson 3 Inez 4 Jones 2 Kalida Lomax Miranda Norman 14 entries ª N = 1000 Max = 10 ª N = 1, 000 Max = 20 5
D A T A B A S E Indexed Sequential Storage ² Common uses Address ª Large tables. ª Need many sequential lists. ª Some random search--with one or two key columns. ª Mostly replaced by B+-Tree. ID 1 2 3 4 5 6 7 8 9 10 Pointer A 11 A 22 A 32 A 47 A 58 A 63 A 67 A 78 A 83 Last. Name Pointer Carpenter A 67 Eaton A 58 Farris A 63 Gibson A 22 Hopkins A 42 James A 47 O'Connor A 78 Reasoner A 32 Reeves A 11 Shields A 83 A 11 A 22 A 32 A 47 A 58 A 63 A 67 A 78 A 83 ID 1 2 3 4 5 6 7 8 9 10 Last. Name Reeves Gibson Reasoner Hopkins James Eaton Farris Carpenter O'Connor Shields First. Name Date. Hired Keith 1/29/98 Bill 3/31/98 Katy 2/17/98 Alan 2/8/98 Leisha 1/6/98 Anissa 8/23/98 Dustin 3/28/98 Carlos 12/29/98 Jessica 7/23/98 Howard 7/13/98 Indexed for ID and Last. Name 7
D A T A B A S E Index Options: Bitmaps and Statistics ² Bitmap index ª A compressed index designed for non-primary key columns. Bit-wise operations can be used to quickly match WHERE criteria. ² Analyze statistics ª By collecting statistics about the actual data within the index, the DBMS can optimize the search path. For example, if it knows that only a few rows match one of your search conditions in a table, it can apply that condition first, reducing the amount of work needed to join tables. 10
D A T A B A S E Problems with Indexes ² Each index must be updated when rows are inserted, deleted or modified. ² Changing one row of data in a table with many indexes can result in considerable time and resources to update all of the indexes. ² Steps to improve performance ª Index primary keys ª Index common join columns (usually primary keys) ª Index columns that are searched regularly ª Use a performance analyzer 11
D A T A B A S E Data Warehouse Predefined reports Interactive data analysis Operations data Daily data transfer OLTP Database 3 NF tables Data warehouse Star configuration Flat files 12
D A T A B A S E Data Warehouse Goals ² Existing databases optimized for Online Transaction Processing (OLTP) ² Online Analytical Processing (OLAP) requires fast retrievals, and only bulk writes. ² Different goals require different storage, so build separate dta warehouse to use for queries. ² Extraction, Transformation, Transportation (ETT) ² Data analysis ª Ad hoc queries ª Statistical analysis ª Data mining (specialized automated tools) 13
D A T A B A S E Extraction, Transformation, and Transportation (ETT) Customers Convert Client to Customer Apply standard product numbers Convert currencies Fix region codes Transaction data from diverse systems. Data warehouse: All data must be consistent. 14
D A T A B A S E OLTP v. OLAP 15
D A T A B A S E Multidimensional Cube y r go e t a C Pet Store Item Sales Amount = Quantity*Sale Price Customer Location Time Sale Date 16
D A T A B A S E Sales Date: Time Hierarchy Year Levels Quarter Roll-up To get higher-level totals Month Week Drill-down To get lower-level details Day 17
D A T A B A S E Star Design Dimension Tables Products Sales Date Fact Table Sales Quantity Amount=Sale. Price*Quantity Customer Location 18
D A T A B A S E Snowflake Design Merchandise Sale Item. ID Description Quantity. On. Hand List. Price Category Sale. ID Sale. Date Employee. ID Customer. ID Sales. Tax OLAPItems Sale. ID Item. ID Quantity Sale. Price Amount City. ID Zip. Code City State Customer. ID Phone First. Name Last. Name Address Zip. Code City. ID Dimension tables can join to other dimension tables. 19
D A T A B A S E OLAP Computation Issues Compute Quantity*Price in base query, then add to get $23. 00 If you use Calculated Measure in the Cube, it will add first and multiply second to get $45. 00, which is wrong. 20
D A T A B A S E OLAP Data Browsing 21
D A T A B A S E Microsoft Pivot Table 22
D A T A B A S E Category Month Amount OLAP in SQL 99 Bird 1 $135. 00 Bird 2 $45. 00 GROUP BY two columns Bird 3 $202. 50 Gives you totals for each month within each category. Bird 6 $67. 50 Bird 7 $90. 00 Bird 9 $67. 50 Cat 1 $396. 00 Cat 2 $113. 85 Cat 3 $443. 70 Cat 4 $2. 25 You do not get superaggregate totals for the category, or the month, or the overall total. SELECT Category, Month(Sale. Date) AS Month, Sum(Quantity*Sale. Price) AS Amount FROM Sale INNER JOIN (Merchandise INNER JOIN Sale. Item ON Merchandise. Item. ID = Sale. Item. ID) ON Sale. ID = Sale. Item. Sale. ID GROUP BY Category, Month(Sale. Date); 23
D A T A B A S E SQL ROLLUP SELECT Category, Month…, Sum … FROM … GROUP BY ROLLUP (Category, Month. . . ) Category Month Amount Bird … Bird Cat … (null) 1 2 135. 00 45. 00 (null) 1 2 607. 50 396. 00 113. 85 (null) 1293. 30 (null) 8451. 79 24
D A T A B A S E Missing Values Cause Problems If there are missing values in the groups, it can be difficult to identify the super-aggregate rows. Category Month Amount Bird … Bird Cat … (null) 1 2 135. 00 45. 00 (null) 1 2 32. 00 607. 50 396. 00 113. 85 (null) 1293. 30 (null) 8451. 79 Missing date Super-aggregate 25
D A T A B A S E GROUPING Function SELECT Category, Month…, Sum …, GROUPING (Category) AS Gc, GROUPING (Month) AS Gm FROM … GROUP BY ROLLUP (Category, Month. . . ) Category Month Amount Bird 1 135. 00 Bird 2 45. 00 … Bird (null) 32. 00 Bird (null) 607. 50 Cat 1 396. 00 Cat 2 113. 85 … Cat (null) 1293. 30 … (null) 8451. 79 Gc 0 0 Gm 0 0 0 1 0 1 1 26
D A T A B A S E CUBE Option SELECT Category, Month, Sum, GROUPING (Category) AS Gc, GROUPING (Month) AS Gm FROM … GROUP BY CUBE (Category, Month. . . ) Category Month Bird 1 Bird 2 … Bird (null) Cat 1 Cat 2 … Cat (null) 1 (null) 2 (null) 3 … (null) Amount 135. 00 45. 00 Gc 0 0 Gm 0 0 32. 00 607. 50 396. 00 113. 85 0 1 0 0 0 1293. 30 1358. 8 1508. 94 2362. 68 1 0 0 1 1 1 8451. 79 1 1 27
D A T A B A S E GROUPING SETS: Hiding Details SELECT Category, Month, Sum FROM … GROUP BY GROUPING SETS ( ROLLUP (Category), ROLLUP (Month), () ) Category Month Bird (null) Cat (null) … (null) 1 (null) 2 (null) 3 … (null) Amount 607. 50 1293. 30 1358. 8 1508. 94 2362. 68 8451. 79 28
D A T A B A S E SQL OLAP Analytical Functions VAR_POP VAR_SAMP STDDEV_POP STDEV_SAMP COVAR_POP COVAR_SAMP CORR REGR_R 2 REGR_SLOPE REGR_INTERCEPT variance standard deviation covariance correlation regression r-square regression data (many) 29
D A T A B A S E SQL RANK Functions SELECT Employee, Sales. Value RANK() OVER (ORDER BY Sales. Value DESC) AS rank DENSE_RANK() OVER (ORDER BY Sales. Value DESC) AS dense FROM Sales ORDER BY Sales. Value DESC, Employee; Employee Sales. Value rank dense Jones 18, 000 1 1 Smith 16, 000 2 2 Black 16, 000 2 2 White 14, 000 4 3 DENSE_RANK does not skip numbers 30
D A T A B A S E SQL OLAP Windows SELECT Category, Sale. Month, Month. Amount, AVG(Month. Amount) OVER (PARTITION BY Category ORDER BY Sale. Month ASC ROWS 2 PRECEDING) AS MA FROM qry. OLAPSQL 99 ORDER BY Sale. Month ASC; Category Bird … Cat Cat … Sale. Month 200101 200102 200103 200104 Month. Amount 1500. 00 1700. 00 2000. 00 2500. 00 200101 200102 200103 200104 4000. 00 5000. 00 6000. 00 7000. 00 MA 1600. 00 1850. 00 4500. 00 5500. 00 31
D A T A B A S E Ranges: OVER SELECT Sale. Date, Value SUM(Value) OVER (ORDER BY Sale. Date) AS running_sum, SUM(Value) OVER (ORDER BY Sale. Date RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_sum 2, SUM (Value) OVER (ORDER BY Sale. Date RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS remaining_sum; FROM … Sum 1 computes total from beginning through current row. Sum 2 does the same thing, but more explicitly lists the rows. Sum 3 computes total from current row through end of query. 32
D A T A B A S E LAG and LEAD Functions LAG or LEAD: (Column, # rows, default) SELECT Sale. Date, Value, LAG (Value 1, 0) OVER (ORDER BY Sale. Date) AS prior_day LEAD (Value 1, 0) OVER (ORDER BY Sale. Date) AS next_day FROM … ORDER BY Sale. Date 1/1/2003 1/2/2003 1/3/2003 … 1/31/2003 Value prior_day 1000 0 1500 1000 2000 1500 3200 next_day 1500 2000 2300 Prior is 0 from default value 0 Not part of standard yet? But are in SQL Server and Oracle. 33
D A T A B A S E Data Mining ² Goal: To discover unknown relationships in the data that can be used to make better decisions. Transactions and operations Reports Specific ad hoc questions Queries Aggregate, compare, drill down OLAP Databases Unknown relationships Data Mining 34
D A T A B A S E Exploratory Analysis ² Data Mining usually works autonomously. ª Supervised/directed ª Unsupervised ª Often called a bottom-up approach that scans the data to find relationships ² Some statistical routines, but they are not sufficient ª Statistics relies on averages ª Sometimes the important data lies in more detailed pairs 35
D A T A B A S E Common Techniques ² Classification/Prediction/Regression ² Association Rules/Market Basket Analysis ² Clustering ª Data points ª Hierarchies ² Neural Networks ² Deviation Detection ² Sequential Analysis ª Time series events ª Websites ² Textual Analysis ² Spatial/Geographic Analysis 36
D A T A B A S E Classification Examples ² Examples ª Which borrowers/loans are most likely to be successful? ª Which customers are most likely to want a new item? ª Which companies are likely to file bankruptcy? ª Which workers are likely to quit in the next six months? ª Which startup companies are likely to succeed? ª Which tax returns are fraudulent? 37
D A T A B A S E Classification Process ² Clearly identify the outcome/dependent variable. ² Identify potential variables that might affect the outcome. ª Supervised (modeler chooses) ª Unsupervised (system scans all/most) ² Use sample data to test and validate the model. ² System creates weights that link independent variables to outcome. Income Married Credit History Job Stability Success 50000 Yes Good Yes 25000 Yes Bad No 75000 No Good No 38
D A T A B A S E Classification Techniques ² Regression ² Bayesian Networks ² Decision Trees (hierarchical) ² Neural Networks ² Genetic Algorithms ² Complications ª Some methods require categorical data ª Data size is still a problem 39
D A T A B A S E Association/Market Basket ² Examples ª What items are customers likely to buy together? ª What Web pages are closely related? ª Others? ² Classic (early) example: ª Analysis of convenience store data showed customers often buy diapers and beer together. ª Importance: Consider putting the two together to increase crossselling. 40
D A T A B A S E Association Details (two items) ² Rule evaluation (A implies B) ª Support for the rule is measured by the percentage of all transactions containing both items: P(A ∩ B) ª Confidence of the rule is measured by the transactions with A that also contain B: P(B | A) ª Lift is the potential gain attributed to the rule—the effect compared to other baskets without the effect. If it is greater than 1, the effect is positive: © P(A ∩ B) / ( P(A) P(B) ) © P(B|A)/P(B) ² Example: Diapers implies Beer ª Support: P(D ∩ B) =. 6 ª Confidence: P(B|D) =. 857 ª Lift: P(B|D) / P(B) = 1. 714 P(D) =. 7 P(B) =. 5 = P(D ∩ B)/P(D) =. 6/. 7 =. 857 /. 5 41
D A T A B A S E Association Challenges ² If an item is rarely purchased, any other item bought with it seems important. So combine items into categories. Item Freq. 1 “ nails 2% Hardware 15% 2” nails 1% Dim. Lumber 20% 3” nails 1% Plywood 15% 4” nails 2% Finish lumber 15% Lumber 50% ² Some relationships are obvious. ª Burger and fries. ² Some relationships are meaningless. ª Hardware store found that toilet rings sell well only when a new store first opens. But what does it mean? 42
D A T A B A S E Cluster Analysis ² Examples ª Are there groups of customers? (If so, we can cross-sell. ) ª Do the locations for our stores have elements in common? (So we can search for similar clusters for new locations. ) ª Do our employees (by department? ) have common characteristics? (So we can hire similar, or dissimilar, people. ) ² Problem: Many dimensions and large datasets Large intercluster distance Small intracluster distance 43
D A T A B A S E Geographic/Location ² Examples ª Customer location and sales comparisons ª Factory sites and cost ª Environmental effects ² Challenge: Map data, multiple overlays 44
- Slides: 41