Mining for Queries Florin Rusu Mentors Vijayshankar Raman


![Exploratory Analysis l Hypothesis-driven Online aggregation [HHW 97] l Data cube exploration l l Exploratory Analysis l Hypothesis-driven Online aggregation [HHW 97] l Data cube exploration l l](https://slidetodoc.com/presentation_image_h2/60b7d6c304748d7ab928380d01215529/image-3.jpg)

















![Implementation l Use Blink [RSQ 08] as query engine l Use queries extensively Function Implementation l Use Blink [RSQ 08] as query engine l Use queries extensively Function](https://slidetodoc.com/presentation_image_h2/60b7d6c304748d7ab928380d01215529/image-21.jpg)

- Slides: 22

Mining for Queries Florin Rusu Mentors: Vijayshankar Raman, Lin Qiao, Peter Haas Manager: Guy Lohman

Motivation l Hardware trends Multi-core processors l Memory capacity l l Query execution improvement [RSQ 08] Parallel, in-memory databases l Compression, scans l l Goal l Use the query power 2
![Exploratory Analysis l Hypothesisdriven Online aggregation HHW 97 l Data cube exploration l l Exploratory Analysis l Hypothesis-driven Online aggregation [HHW 97] l Data cube exploration l l](https://slidetodoc.com/presentation_image_h2/60b7d6c304748d7ab928380d01215529/image-3.jpg)
Exploratory Analysis l Hypothesis-driven Online aggregation [HHW 97] l Data cube exploration l l Discovery-driven l Materialized data cube [SAM 98] l Goals l Automatic exploratory analysis 3

Cell Phone Call Data Warehouse Table Calls Latitude Longitude Call Time Call Duration Call Type Call Status 38 N 123 W 9: 20 10: 03 International OK 38 N 122 W 20: 48 20: 20 Long Drop 38 N 120 W 16: 22 0: 48 Local OK … … … 38 N 70 W 12: 29 0: 32 Cell Drop 4

Interesting Query (1) What are the time intervals for which the duration of long distance calls is significantly high? 5

Interesting Query (2) What are the areas with large fractions of dropped calls? Non-axes aligned hyper-rectangles 6

Query Pattern Example SELECT SUM(Call Duration) FROM Calls WHERE (16: 00 < Call Time < 22: 00) AND (10 < 0. 65*Latitude-0. 35*Longitude < 100) GROUP BY Call Type General pattern SELECT AGG(G) FROM T WHERE L 1 < P 1 < U 1 AND … AND Ln < Pn < Un GROUP BY G 7

Problem Given a data warehouse find the most interesting regions in the attribute space (P 1, …, Pn) according to a function F and that have sufficient support l Searching problem over hyper-rectangular regions in the attribute space l argmax. P 1, …, Pn F(P 1, …, Pn, G) COUNT(P 1, …, Pn) > S l Function F l Value one group / Average value other groups 8

Exhaustive Search l Try l all axes-aligned hyper-rectangles 10 attributes with 10 values in the domain give 1020 hyper-rectangles l Alternatives Run independent queries l Pre-computed prefix-sum array [HAMS 97] l l Top-down l search Iceberg pruning on support [BR 99] 9

Incremental Approach l Solve for each group separately and combine the results l F is assumed locally smooth l Bottom-Up search l l Find local maxima points for F Extend region around local maxima Verify function F Verify support 10

Axes-Aligned Find local maxima points for F 11

Axes-Aligned For each local maxima 12

Axes-Aligned Extend along axes 13

Axes-Aligned Maximal region 14

Principal Component Analysis l Find a new base with vectors corresponding to variance along axes in the original data l Steps Mean-center data along each dimension l Compute covariance matrix l Find eigen-vectors and eigen-values l 15

Non-Axes-Aligned Find local maxima points for F 16

Non-Axes-Aligned For each local maxima 17

Non-Axes-Aligned Extend along eigen-vectors computed by PCA 18

Non-Axes-Aligned Maximal region 19

Query Parameters l Aggregate l fraction AGG(g) / ΣAGG(G) l Support l COUNT(g) l Density l COUNT DISTINCT(g) / TOTAL POINTS 20
![Implementation l Use Blink RSQ 08 as query engine l Use queries extensively Function Implementation l Use Blink [RSQ 08] as query engine l Use queries extensively Function](https://slidetodoc.com/presentation_image_h2/60b7d6c304748d7ab928380d01215529/image-21.jpg)
Implementation l Use Blink [RSQ 08] as query engine l Use queries extensively Function evaluation l Region expansion l Overlap multiple queries l 21

Questions Comments Suggestions 22