Mining for Queries Florin Rusu Mentors Vijayshankar Raman

  • Slides: 22
Download presentation
Mining for Queries Florin Rusu Mentors: Vijayshankar Raman, Lin Qiao, Peter Haas Manager: Guy

Mining for Queries Florin Rusu Mentors: Vijayshankar Raman, Lin Qiao, Peter Haas Manager: Guy Lohman

Motivation l Hardware trends Multi-core processors l Memory capacity l l Query execution improvement

Motivation l Hardware trends Multi-core processors l Memory capacity l l Query execution improvement [RSQ 08] Parallel, in-memory databases l Compression, scans l l Goal l Use the query power 2

Exploratory Analysis l Hypothesis-driven Online aggregation [HHW 97] l Data cube exploration l l

Exploratory Analysis l Hypothesis-driven Online aggregation [HHW 97] l Data cube exploration l l Discovery-driven l Materialized data cube [SAM 98] l Goals l Automatic exploratory analysis 3

Cell Phone Call Data Warehouse Table Calls Latitude Longitude Call Time Call Duration Call

Cell Phone Call Data Warehouse Table Calls Latitude Longitude Call Time Call Duration Call Type Call Status 38 N 123 W 9: 20 10: 03 International OK 38 N 122 W 20: 48 20: 20 Long Drop 38 N 120 W 16: 22 0: 48 Local OK … … … 38 N 70 W 12: 29 0: 32 Cell Drop 4

Interesting Query (1) What are the time intervals for which the duration of long

Interesting Query (1) What are the time intervals for which the duration of long distance calls is significantly high? 5

Interesting Query (2) What are the areas with large fractions of dropped calls? Non-axes

Interesting Query (2) What are the areas with large fractions of dropped calls? Non-axes aligned hyper-rectangles 6

Query Pattern Example SELECT SUM(Call Duration) FROM Calls WHERE (16: 00 < Call Time

Query Pattern Example SELECT SUM(Call Duration) FROM Calls WHERE (16: 00 < Call Time < 22: 00) AND (10 < 0. 65*Latitude-0. 35*Longitude < 100) GROUP BY Call Type General pattern SELECT AGG(G) FROM T WHERE L 1 < P 1 < U 1 AND … AND Ln < Pn < Un GROUP BY G 7

Problem Given a data warehouse find the most interesting regions in the attribute space

Problem Given a data warehouse find the most interesting regions in the attribute space (P 1, …, Pn) according to a function F and that have sufficient support l Searching problem over hyper-rectangular regions in the attribute space l argmax. P 1, …, Pn F(P 1, …, Pn, G) COUNT(P 1, …, Pn) > S l Function F l Value one group / Average value other groups 8

Exhaustive Search l Try l all axes-aligned hyper-rectangles 10 attributes with 10 values in

Exhaustive Search l Try l all axes-aligned hyper-rectangles 10 attributes with 10 values in the domain give 1020 hyper-rectangles l Alternatives Run independent queries l Pre-computed prefix-sum array [HAMS 97] l l Top-down l search Iceberg pruning on support [BR 99] 9

Incremental Approach l Solve for each group separately and combine the results l F

Incremental Approach l Solve for each group separately and combine the results l F is assumed locally smooth l Bottom-Up search l l Find local maxima points for F Extend region around local maxima Verify function F Verify support 10

Axes-Aligned Find local maxima points for F 11

Axes-Aligned Find local maxima points for F 11

Axes-Aligned For each local maxima 12

Axes-Aligned For each local maxima 12

Axes-Aligned Extend along axes 13

Axes-Aligned Extend along axes 13

Axes-Aligned Maximal region 14

Axes-Aligned Maximal region 14

Principal Component Analysis l Find a new base with vectors corresponding to variance along

Principal Component Analysis l Find a new base with vectors corresponding to variance along axes in the original data l Steps Mean-center data along each dimension l Compute covariance matrix l Find eigen-vectors and eigen-values l 15

Non-Axes-Aligned Find local maxima points for F 16

Non-Axes-Aligned Find local maxima points for F 16

Non-Axes-Aligned For each local maxima 17

Non-Axes-Aligned For each local maxima 17

Non-Axes-Aligned Extend along eigen-vectors computed by PCA 18

Non-Axes-Aligned Extend along eigen-vectors computed by PCA 18

Non-Axes-Aligned Maximal region 19

Non-Axes-Aligned Maximal region 19

Query Parameters l Aggregate l fraction AGG(g) / ΣAGG(G) l Support l COUNT(g) l

Query Parameters l Aggregate l fraction AGG(g) / ΣAGG(G) l Support l COUNT(g) l Density l COUNT DISTINCT(g) / TOTAL POINTS 20

Implementation l Use Blink [RSQ 08] as query engine l Use queries extensively Function

Implementation l Use Blink [RSQ 08] as query engine l Use queries extensively Function evaluation l Region expansion l Overlap multiple queries l 21

Questions Comments Suggestions 22

Questions Comments Suggestions 22