Topk Query Processing and Optimization 198 541 slides

Outline Motivation with Examples n Query Model n Top-k Selection n Top-k Join n

Motivation n n Information Retrieval/ Database systems integration IR: uncertainty and ranking for effective

Motivation n Applications n n n New requirements n n Multimedia search by contents

Query Example 1: Ranking in Multimedia Retrieval Color Histogram Edge Histogram Texture Video Database

Example 2 SELECT h. id , s. name FROM houses h , schools s

Example 2 (Cont’d) ID Location Price ID Location Tuition 1 3 150000 1 2

Outline Motivation with Examples n Theoretical Foundation of Rank Aggregation n Query Model n

Query Model: Top-k Selection n One relation R with: n attributes A 1, …

Query Model: Top-k Join n m Relations R 1, …. . , Rm |

Top-k Selection Queries n Approaches: Filter/Restart method n Rank aggregation from multiple lists n

Top-k Selection Filter/Restart Method [Carey and Kossman SIGMOD’ 97] [Donjerkovic and Ramakrishnan VLDB’ 99]

Top-k Selection Rank Aggregation from Multiple Lists L 1. ID = L 2. ID

Top-k Selection Rank Aggregation from Multiple Lists n Assumptions: n n n Differ in

Top-k Selection Rank Aggregation from Multiple Lists FA, TA, Quick-combine, Multi-Step Sorted Access Only

Top-k Selection Rank Aggregation from Multiple Lists n Sorted + Random Access Available n

Top-k Selection Rank Aggregation from Multiple Lists TA [Fagin et al. PODS’ 01], Quick-combine

Top-k Selection Rank Aggregation from Multiple Lists n No Random Access available n The

Top-k Selection Rank Aggregation from Multiple Lists NRA [Fagin et al. PODS’ 01], Stream-combine

Top-k Selection Rank Aggregation from Multiple Lists n At least one attribute with sorted-access

Top-k Selection Rank Aggregation from Multiple Lists Upper [Bruno et al. ICDE’ 02] MPro

Top-k Selection Materialized Views (PREFER) [Hristidis et al. SIGMOD’ 01] n Ranking function m

Top-k Selection Materialized Views (PREFER) n The prefix: n ID 1 2 3 4

Top-k Selection Materialized Views (PREFER) n Multiple views n n For each query q,

Top-k Join n m Relations R 1, …. . , Rm | Ri has:

Supporting Ranking in Relational Databases Approaches: n Map to a multi-dimensional range query (Filter/Restart)

Ranking in Relational Databases Range queries [Bruno et al. TODS’ 02] n Filter/Restart Method

Ranking in Relational Databases Range queries n Using histograms n n Create a small

Supporting Ranking in Relational Databases As a Query Operator User-defined Function (+) Under the

Summary n Wide applicability of ranking queries as essential functionality in many applications warrants

Summary n Rank Aggregation has solid theoretical background from voting and social selection n

References (1) n [ACE+03] Walid G. Aref, Ann C. Catlin, Ahmed K. Elmagarmid, J.

References (2) n [CH 02] Kevin Chen. Chuan Chang and Seung won Hwang. Minimal

References (3) n [GBK 01] Ulrich G˜untzer, Wolf. Tilo Balke, and Werner Kießling. Towards

References (4) n [LCI+05] Chengkai Li, Kevin. C. -C. Chang, Ihab F. Ilyas, and

Slides: 39

Download presentation

Top-k Query Processing and Optimization 198: 541 (slides courtesy of Ihab F. Ilyas and Walid G. Aref)

Outline Motivation with Examples n Query Model n Top-k Selection n Top-k Join n Ranking in Relational Query Engines n Summary n Open Research Topics n

Motivation n n Information Retrieval/ Database systems integration IR: uncertainty and ranking for effective retrieval Database Systems: advanced data management Supporting new data types is certainly not enough True integration includes (among other things): n n n Indexing Query processing and optimization Query Language

Motivation n Applications n n n New requirements n n Multimedia search by contents (multi-features/examples) Middleware Search engines Data mining Multi-criteria ranking Rank aggregation from external sources Joining ranked infinite streams Most applications are interested in the top-k results

Query Example 1: Ranking in Multimedia Retrieval Color Histogram Edge Histogram Texture Video Database Color Histogram Edge Histogram Texture

Example 2 SELECT h. id , s. name FROM houses h , schools s WHERE h. location = s. location ORDER BY h. price+10 x s. tuition STOP AFTER 10 RANK( ) OVER in SQL 99

Example 2 (Cont’d) ID Location Price ID Location Tuition 1 3 150000 1 2 3 4 5 6 90, 000 111, 000 118, 000 125, 000 154, 000 1 2 3 4 5 6 7 8 3000 3500 6000 6200 7000 7900 8200 1 4 15200 2 2 014500 Lafayette W. Lafayette Indianapolis Kokomo Lafayette Kokomo …… Houses Indianapolis W. Lafayette Indianapolis Kokomo Schools 3 1 014100 0

Outline Motivation with Examples n Theoretical Foundation of Rank Aggregation n Query Model n Top-k Selection n Top-k Join n Ranking in Relational Query Engines n Summary n Open Research Topics n

Query Model: Top-k Selection n One relation R with: n attributes A 1, … , An n m scores s 1, …. , sm n Scores are expressions over the attributes n Example: s 1 = A 1 and s 2=A 2+A 3 n n n A combining function F to compute total score An example template: SELECT some_attributes FROM R WHERE selection_condition ORDER BY F(s 1, …, sm) STOP AFTER k

Query Model: Top-k Join n m Relations R 1, …. . , Rm | Ri has: n n n attributes score attribute, si (can be an expression over other attributes) A global score for a join result is computed as F(s 1, …. , sm) An example template: SELECT some_attributes FROM R 1, …. . , Rm WHERE join_condition ORDER BY F(s 1, …. . , sm) STOP AFTER k

Top-k Selection Queries n Approaches: Filter/Restart method n Rank aggregation from multiple lists n Using indexes and materialized views n

Top-k Selection Filter/Restart Method [Carey and Kossman SIGMOD’ 97] [Donjerkovic and Ramakrishnan VLDB’ 99] [Bruno et al. TODS’ 02] [Chaudhuri et al. TKDE’ 04] n n n Map the top-k query to a selection predicate x > t For multi-criteria ranking (multiple scoring attributes) a range query t 11< x 1 <t 12 and t 21 < x 2 < t 22 Estimate the cut-off t based on k and the data distribution

Top-k Selection Rank Aggregation from Multiple Lists L 1. ID = L 2. ID ID S 1 S 2 ID S 2 15 12 3 24 45 50 40 20 40 30 10 20 50 10 30 40 50 20 10 3 2 1 4 5 50 40 30 20 10

Top-k Selection Rank Aggregation from Multiple Lists n Assumptions: n n n Differ in the access capabilities on the lists n n Sorted Access Only Sorted + Random Access Only ! Differ in pipeline support n n Sorted Inputs on the individual scores The combining function is monotone The output can serve as input to another instance Most Algorithms can be cast as a specialization of the A* algorithm

Top-k Selection Rank Aggregation from Multiple Lists FA, TA, Quick-combine, Multi-Step Sorted Access Only Random Access/ BOTH Probes NRA, Stream-combine Only MPro, Upper, Pick

Top-k Selection Rank Aggregation from Multiple Lists n Sorted + Random Access Available n n n Go in the lists in parallel Keep track of all “seen” objects Update scores of objects at each step Maintain a threshold T: an upper-bound for all the unknown scores An object is qualified as a top-k if n The object’s combined score is known and is greater than T

Top-k Selection Rank Aggregation from Multiple Lists TA [Fagin et al. PODS’ 01], Quick-combine [Güntzer et al. VLDB’ 00], Multi-step [Nepal and Ramakrishna ICDE’ 99] Algorithm Sketch: F= S 1 + S 2 ID S 1 ID S 2 5 1 3 2 4 50 40 30 20 10 3 2 1 4 5 50 40 30 20 10 Random Access T == 80 100 T 3: 3: (80) 1: 5: (70) (60) 5: (60) 2: (60) Buffer

Top-k Selection Rank Aggregation from Multiple Lists n No Random Access available n The combined score of an object is two parts: n n “Seen”: From lists where we encountered the object “Unseen”: Upper bound of all missing scores Sounds familiar? A* search An object is qualified as a top-k if n The object’s lower-bound score is greater than the upperbound score of all other objects

Top-k Selection Rank Aggregation from Multiple Lists NRA [Fagin et al. PODS’ 01], Stream-combine [Güntzer et al. ITCC’ 00] F= S 1 + S 2 Algorithm Sketch: ID S 1 ID S 2 5 1 3 2 4 50 40 30 20 10 3 2 1 4 5 50 40 30 20 10 5: 3: (50 (80 – 100) 90) 80) 3: 1: (50 (70 – 100) 90) 70) 1: (50 5: (40 – 80) 2: (40 – 70) 80) Buffer

Top-k Selection Rank Aggregation from Multiple Lists n At least one attribute with sorted-access n Some with no Sorted Access (Probe Attributes) n Schedule the Probe Attributes using statistics in ascending order of their “probing” cost

Top-k Selection Rank Aggregation from Multiple Lists Upper [Bruno et al. ICDE’ 02] MPro [Chang and Hwang SIGMOD’ 02], Algorithm Sketch: ID a b c d e s 0. 90 0. 80 0. 70 0. 60 0. 50 p 1. 85. 78. 75. 90. 70 p 2 Min(s, p 1, p 2). 75. 90. 78. 20. 90. 60. 80. 50 Upper-bound of the unseen scores Uunseen= 0. 9 0. 8 0. 7 a: 0. 85 a: 0. 9 b: 0. 8 b: 0. 78 b: 0. 8 a: 0. 75 c: 0. 7 Candidates Queue

Top-k Selection Materialized Views (PREFER) [Hristidis et al. SIGMOD’ 01] n Ranking function m f v = å vi. Score i i =1 n Materialize a view that ranks the entire input relation on fv | v =(v 1, v 2, . . . , vm) (predefined weights) n For an input query weights q =(q 1, q 2, …, qm) n n n Get a prefix of the view based on v and q Sort the prefix on fq The top results are guaranteed to be in the prefix

Top-k Selection Materialized Views (PREFER) n The prefix: n ID 1 2 3 4 5 6 Determine a threshold Tv, q| "t Î R, f v (t ) < Tv , q Þ f q (t ) < f q (tv 1 ) A 1 10 20 17 15 5 15 A 2 17 20 18 10 10 10 A 3 20 11 12 8 12 5 fv(t) 16. 8 16. 4 15. 4 10. 2 9. 8 9 fq(t) 17. 2 17. 3 16. 1 9. 9 10. 1 9 v =(0. 2, 0. 4) q =(0. 1, 0. 6, 0. 3) Maximize fv(t) while maintaining inequality Tv, q = 14. 26

Top-k Selection Materialized Views (PREFER) n Multiple views n n For each query q, a view v that covers q with some prefix constraint 10 to 100 views is a typical number to cover the space of queries Heuristic to cover the maximum number of queries using n views View selection for an input query n n n Store view coverage in an R-tree A query is a point in the space Get the view that contains the query

Outline Motivation with Examples n Theoretical Foundation of Rank Aggregation n Query Model n Top-k Selection n Top-k Join n Ranking in Relational Query Engines n Summary n Open Research Topics n

Top-k Join n m Relations R 1, …. . , Rm | Ri has: n n n attributes score attribute, si (can be an expression over other attributes) A global score for a join result is computed as F(s 1, …. , sm) An example template: SELECT some_attributes FROM R 1, …. . , Rm WHERE join_condition ORDER BY F(s 1, …. . , sm) STOP AFTER k

Outline Motivation with Examples n Theoretical Foundation of Rank Aggregation n Query Model n Top-k Selection n Top-k Join n Ranking in Relational Query Engines n Summary n Open Research Topics n

Supporting Ranking in Relational Databases Approaches: n Map to a multi-dimensional range query (Filter/Restart) n User defined function n Core implementation as a query operator n New Algebra n The notion of ranked relations n Algorithms implement the ranking process

Ranking in Relational Databases Range queries [Bruno et al. TODS’ 02] n Filter/Restart Method Given a top-k selection query q over a relation R n Use multidimensional histogram H to estimate a search distance dq n The region (q, dq) contains is expected to include the top-k answers n Perform a range query on (q, dq) n Return top-k answers of the results n If #results < k, RESTART n

Ranking in Relational Databases Range queries n Using histograms n n Create a small synthetic Relation S consistent with the histogram on R Compute the Dist(q, t) for every tuple t in S dq is the maximum distance between q and the top-k tuples in S Building S n n No restarts: dq is large enough more results need to be filtered Restarts: restarts possible less filtering

Supporting Ranking in Relational Databases As a Query Operator User-defined Function (+) Under the optimizer's control (+)Can be shuffled with other operators in a query evaluation plan for better performance (+)General enough and highly applicable (-)Changes to query engine (+)Easy to implement and ready-to-go solution (-)Implementation outside the SQL engine lose efforts of the query optimizer

Outline Motivation with Examples n Theoretical Foundation of Rank Aggregation n Query Model n Top-k Selection n Top-k Join n Ranking in Relational Query Engines n Summary n Open Research Topics n

Summary n Wide applicability of ranking queries as essential functionality in many applications warrants efficient handling by database systems n One step towards integrating IR effective retrieval and DB effective handling of data n Defined two flavors: n n n Top-k Selection Queries Top-k Join Queries Basic Techniques: n n n Filter / Restart Rank Aggregation Using Indexes and Materialized Views

Summary n Rank Aggregation has solid theoretical background from voting and social selection n Many rank-aggregation algorithms available with similar core idea n Challenges n n Unified framework for rank-aware query processing Integrating rank-aggregation as a basic operation in practical relational database systems

References (1) n [ACE+03] Walid G. Aref, Ann C. Catlin, Ahmed K. Elmagarmid, J. Fan, Moustafa A. Hammad, Ihab F. Ilyas, Mirette Marzouk, Sunil Prabhakar, and X. Zhu. VDBMS: A testbed facility for research in video database bench marking. ACM Multimedia Systems Journal, Special Issue on Multimedia Document Management Systems, 2003. n [ASC 02] Sihem Amer-Yahia, Sung. Ran Cho, Divesh Srivastava, Tree Pattern Relaxation, In EDBT, 2002 n [BCG 02] Nicolas Bruno, Surajit Chaudhuri, and Luis Gravano. Top-k selection queries over relational databases: Mapping strategies and performance evaluation. TODS, 27(2), 2002. n [BGM 02] Nicolas Bruno, Luis Gravano, Amélie Marian: Evaluating Top-k Queries over Web. Accessible Databases. In ICDE, 2002 n [CBC+00] Yuan. Chi Chang, Lawrence Bergman, Vittorio Castelli, Chung. Sheng Li, Ming. Ling Lo, and John R. Smith. The onion technique: indexing for linear optimization queries. In SIGMOD, 2000. n [CK 97] Michael J. Carey and Donald Kossmann, On saying ``Enough already !” in SQL, SIGMOD, 1997 Tucson, Arizona

References (2) n [CH 02] Kevin Chen. Chuan Chang and Seung won Hwang. Minimal probing: supporting expensive predicates for top-k queries. In SIGMOD, 2002. n [Con 85] M. J. Condorcet. Essai sur l'application de l'analyse a la probabilit e des decisions rendues a la puralite des voix, 1785. n [DKN+01] Cynthia Dwork, S. Ravi Kumar, Moni Naor, and D. Sivakumar. Rank aggregation methods for the web. In World Wide Web, 2001. n [DR 99] Donko Donjerkovic, Raghu Ramakrishnan: Probabilistic Optimization of Top N Queries. In VLDB 1999 n [Fag 99] Ronald Fagin. Combining fuzzy information from multiple systems. Journal of Computer and System Sciences (JCSS), 58(1), Feb 1999. n [FLN 01] Ronald Fagin, Amnon Lotem, and Moni Naor. Optimal aggregation algorithms for middleware. In PODS, Santa Barbara, California, May 2001. n [GBK 00] Ulrich G˜untzer, Wolf. Tilo Balke, and Werner Kießling. Optimizing multifeature queries for image databases. In VLDB, September 10 --14, Cairo, Egypt, 2000.

References (3) n [GBK 01] Ulrich G˜untzer, Wolf. Tilo Balke, and Werner Kießling. Towards efficient multifeature queries in heterogeneous environments. In ITCC, 2001. n [HGP 03] Vagelis Hristidis, Luis Gravano, and Yannis Papakonstantinou. Efficient IR-style keyword search over relational databases. In VLDB, Berlin, Germany, 2003. n [HKP 01] Vagelis Hristidis, Nick Koudas, and Yannis Papakonstantinou. PREFER: A system for the efficient execution of multiparametric ranked queries. In SIGMOD, Santa Barbara, California, 2001 n [IAE 02] Ihab F. Ilyas, Walid G. Aref, and Ahmed K. Elmagarmid. Joining Ranked Inputs in Practice. In VLDB, Honk-Kong, China, 2002. n [IAE 03] Ihab F. Ilyas, Walid G. Aref, and Ahmed K. Elmagarmid. Supporting top-k join queries in relational databases. In VLDB, Berlin, Germany, 2003. n [ISA+04] Ihab F. Ilyas, Rahul Shah, Walid G. Aref, Jeff Vitter, and Ahmed K. Elmagarmid. Rank-aware Query Optimization. SIGMOD, Paris, France, 2004

References (4) n [LCI+05] Chengkai Li, Kevin. C. -C. Chang, Ihab F. Ilyas, and Sumin Song Rank. SQL: Query n [NCS+01] Apostol Natsev, Yuan. Chi Chang, John R. Smith, Chung. Sheng Li, and Jeffrey Scott Vitter. Supporting incremental join queries on ranked inputs. In VLDB, Rome, Italy, 2001. n [NR 99] Surya Nepal and M. V. Ramakrishna. Query processing issues in image (multimedia) databases. In ICDE, Sydney, Australia, 1999. n [RGM 03] Sriram Raghavan and Hector Garcia. Molina. Complex queries over web repositories. In VLDB, Berlin, Germany, 2003. n [TPK+03] Panayiotis Tsaparas, Themistoklis Palpanas, Yannis Kotidis, Nick Koudas, and Divesh Srivastava. Ranked join indices, ICDE 2003. n [UF 01] Tolga Urhan and Michael J. Franklin. Dynamic pipeline scheduling for improving interactive query performance. In VLDB, Roma, Italy, 2001. Algebra and Optimization for Relational Top-k Queries. In Proceedings of the 2005 ACM SIGMOD Conference on Management of Data, Baltimore, Maryland (To Appear)