Best-Effort Top-k Query Processing Under Budgetary Constraints
Michal Shmueli-Scheuer (IBM Haifa Research Lab and UCI), Yosi Mass, Haggai Roitman, Chen Li, Ralf Schenkel, Gerhard Weikum

Motivating Example
• Mediation Systems: achieve high query throughput.
• Mobile Applications: highly impatient users, need fast results.
• Online Analytics (e.g. logs): achieve high query throughput.
[Figure: applications send top-k queries to an engine and receive top-k results]

Traditional top-k query
• Pre-computed lists over multiple attributes.
• Combine scores by some monotonic aggregation function.
• Two access modes:
– sorted access (Cs)
– random access (Cr)
• Objective: compute the k objects with the highest scores.
[Figure: m sorted lists R1, R2, …, Rm over n objects, each holding (object, score) entries]

NRA algorithm (Fagin et al.)
• f = SUM
• Top-2, as [worst score, best score]: a [0.9, 1.77], d [0.87, 1.77]
• Stop when mink > best score of the candidates.
[Figure: sorted lists R1 = (a 0.9, b 0.6, c 0.5, …, d 0.4) and R2 = (d 0.87, a 0.85, f, …, c 0.2); highi marks the current scan position in list i]

NRA algorithm (Fagin et al.)
• Top-2, as [worst score, best score]: a [1.75, 1.75], d [0.87, 1.47]
• Candidates: b [0.6, 1.45]
• Stop when mink > best score of the candidates.
[Figure: sorted lists R1 = (a 0.9, b 0.6, c 0.5, …, d 0.4) and R2 = (d 0.87, a 0.85, f 0.25, …, c 0.2); highi marks the current scan position in list i]

NRA algorithm (Fagin et al.)
• Top-2, as [worst score, best score]: a [1.75, 1.75], d [0.87, 1.37]
• Candidates: b [0.6, 0.85], c [0.5, 0.75], f [0.25, 0.75]
• Stop when mink > best score of the candidates.
[Figure: sorted lists R1 = (a 0.9, b 0.6, c 0.5, …, d 0.4) and R2 = (d 0.87, a 0.85, f 0.25, …, c 0.2); highi marks the current scan position in list i]
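
The three NRA slides can be replayed with a small sketch. This is a minimal illustration, not the authors' implementation: it assumes f = SUM, round-robin sorted accesses, and lists cut down to the entries visible on the slides (the elided "…" entries are simply left out). The function name `nra` and its return shape are hypothetical.

```python
def nra(lists, k):
    """NRA-style top-k with f = SUM over lists sorted by descending score.

    Returns (top_k, depth): top_k holds (object, worst, best) tuples and
    depth is the number of round-robin rounds of sorted accesses performed.
    """
    seen = {}                     # object -> {list index: score read so far}
    top = []
    max_depth = max(len(lst) for lst in lists)
    for depth in range(1, max_depth + 1):
        # One round of sorted accesses: read entry number `depth` of each list.
        for i, lst in enumerate(lists):
            if depth <= len(lst):
                obj, score = lst[depth - 1]
                seen.setdefault(obj, {})[i] = score
        # high[i] bounds every score not yet read from list i.
        high = [lst[depth - 1][1] if depth <= len(lst) else 0.0 for lst in lists]
        bounds = []
        for obj, scores in seen.items():
            worst = sum(scores.values())
            best = worst + sum(h for i, h in enumerate(high) if i not in scores)
            bounds.append((obj, worst, best))
        bounds.sort(key=lambda b: (-b[1], b[0]))
        top, candidates = bounds[:k], bounds[k:]
        if len(top) == k:
            mink = top[-1][1]
            # Best score any object outside the top-k could still reach,
            # including objects not seen in any list yet (bounded by sum(high)).
            threshold = max([b[2] for b in candidates] + [sum(high)])
            if mink > threshold:
                return top, depth
    return top, max_depth


# Entries as visible on the slides; elided entries are omitted (assumption).
R1 = [("a", 0.9), ("b", 0.6), ("c", 0.5), ("d", 0.4)]
R2 = [("d", 0.87), ("a", 0.85), ("f", 0.25), ("c", 0.2)]
top2, rounds = nra([R1, R2], k=2)
```

On this data the loop stops after three rounds with a and d in the top-2, which matches the intervals on the third slide: mink = 0.87 exceeds the best score 0.85 of candidate b.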

Top-k with Budget Constraints
• Access costs: sorted access cost Cs = 1, random access cost Cr = 3; f = SUM.
• R1: a 1.0, b 0.9, c 0.85, d 0.8, e 0.7, t 0.6, f 0.4, …
• R2: s 0.95, u 0.93, t 0.92, d 0.9, x 0.5, y 0.4, z 0.2, …
• Top-2: d 1.7, t 1.52
• NRA: 12 Cs = 12; precision = 0.5
• TA: 7 Cs + 7 Cr = 28; precision = 0
• Budget = 10?
• Given budget B, maximize result quality.
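
A rough sketch of what a budget cut-off does to an NRA-style scan on this example. It is an illustration under assumptions: the lists are the entries as read from the slide's figure, the scan is plain round-robin, the answer is ranked by worst score, and all function names are hypothetical.

```python
def exact_top_k(lists, k):
    """Top-k by full SUM score, i.e. the answer for an unlimited budget."""
    totals = {}
    for lst in lists:
        for obj, score in lst:
            totals[obj] = totals.get(obj, 0.0) + score
    return sorted(totals, key=lambda o: (-totals[o], o))[:k]


def budgeted_sorted_scan(lists, k, budget, cs=1):
    """Round-robin sorted accesses until the budget is spent; rank by worst score."""
    worst = {}
    spent, depth = 0, 0
    while True:
        progressed = False
        for lst in lists:
            if depth < len(lst) and spent + cs <= budget:
                obj, score = lst[depth]
                worst[obj] = worst.get(obj, 0.0) + score
                spent += cs
                progressed = True
        if not progressed:
            break
        depth += 1
    return sorted(worst, key=lambda o: (-worst[o], o))[:k]


def precision(result, r_exact):
    """Fraction of the exact top-k that appears in the returned result."""
    return len(set(result) & set(r_exact)) / len(r_exact)


# List contents as read from the slide (assumption about the garbled figure).
R1 = [("a", 1.0), ("b", 0.9), ("c", 0.85), ("d", 0.8), ("e", 0.7), ("t", 0.6), ("f", 0.4)]
R2 = [("s", 0.95), ("u", 0.93), ("t", 0.92), ("d", 0.9), ("x", 0.5), ("y", 0.4), ("z", 0.2)]
r_exact = exact_top_k([R1, R2], 2)
p10 = precision(budgeted_sorted_scan([R1, R2], 2, budget=10), r_exact)
p12 = precision(budgeted_sorted_scan([R1, R2], 2, budget=12), r_exact)
```

With a budget of 10 the scan has not yet reached t in R1, so only d is found and precision is 0.5; with 12 sorted accesses both d and t are found.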

Contributions
• Sorted Accesses
– Efficient Plan
– Solution with Adaptive
• Sorted and Random Accesses
– Efficient Plan
– Solution with Adaptive
• Experiments

Results Under Limited Budget
[Figure: the k results for an unlimited budget compared with the results for a limited budget]

Efficient Plan - Sorted Accesses
• Assume that we know the k results for unlimited budget (REXACT).
• Interesting positions: where the k objects appear in the lists.
• Plan – {L1, 4} {L2, 2}
[Figure: L1 = (o8, o1, o6, o5) with scores SL1 and L2 = (o2, o4, o5, o3, o1) with scores SL2; Top-2 = {o1, o5}; position markers P1, P2 and Q1, Q2]

Efficient Plan - Sorted Accesses
• Goal: find a plan t that stays within the budget and has optimal precision.
• Plans for B = 5: Plan {L1, 2} {L2, 3}, denoted as ROPT.
[Figure: L1 = (o8, o1, o6, o5) with scores SL1 and L2 = (o2, o4, o5, o3, o1) with scores SL2; position markers P1, P2 and Q1, Q2]
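
The idea of restricting plans to interesting positions can be sketched as follows. This is a hedged illustration only: the slide gives symbolic scores (SL1, SL2), so the numeric scores below are invented; results are ranked by worst score; and ties between equally precise plans are broken in favour of the plan that uses more budget. All three choices, and the function names, are assumptions.

```python
from itertools import product


def worst_score_topk(lists, depths, k):
    """Rank the objects seen within the given depths by the sum of seen scores."""
    worst = {}
    for lst, d in zip(lists, depths):
        for obj, score in lst[:d]:
            worst[obj] = worst.get(obj, 0.0) + score
    return sorted(worst, key=lambda o: (-worst[o], o))[:k]


def interesting_positions(lst, r_exact):
    """Depth 0 plus every depth at which an object of REXACT appears."""
    return [0] + [p for p, (obj, _) in enumerate(lst, start=1) if obj in r_exact]


def best_plan(lists, k, budget, cs=1):
    """Search only plans whose depths are interesting positions."""
    r_exact = set(worst_score_topk(lists, [len(lst) for lst in lists], k))
    choices = [interesting_positions(lst, r_exact) for lst in lists]
    best = None
    for depths in product(*choices):
        cost = cs * sum(depths)
        if cost > budget:
            continue
        prec = len(r_exact & set(worst_score_topk(lists, depths, k))) / k
        # Tie-break: among equally precise plans prefer the costlier one.
        if best is None or (prec, cost) > (best[1], best[2]):
            best = (depths, prec, cost)
    return best


# Object order follows the slide; the numeric scores are hypothetical.
L1 = [("o8", 0.9), ("o1", 0.85), ("o6", 0.4), ("o5", 0.3)]
L2 = [("o2", 0.8), ("o4", 0.75), ("o5", 0.7), ("o3", 0.2), ("o1", 0.1)]
plan, prec, cost = best_plan([L1, L2], k=2, budget=5)
```

With these invented scores the exact top-2 is {o1, o5} and the search returns depths (2, 3), the same shape as the plan {L1, 2} {L2, 3} on the slide. Only 6 of the 21 feasible depth pairs need to be evaluated, because the other depths stop at positions where no REXACT object appears.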

Sorted Accesses
• Observations: prefer high scores.
[Figure: lists L1, L2, L3 with entries (O1, SL1), (O2, SL1), (O1, SL2), (O2, SL2), (O2, SL3)]

Observations – contd.
• Prefer large score reductions.
[Figure: example with title=“war” and description=“weapon”]

Score Utilities
• Score gain: [formula image]
• Score reduction: [formula image]
[Figure: example list (o2, 1), (o4, 0.9), (o5, 0.8), (o3, 0.7), (o1, 0.6), with y = 3]

Optimization Problem
• Bi-objective optimization problem: util(Li, x) = α * gain + (1 - α) * reduction
• Heuristics (where m is the number of lists):
– Fair Heuristic
– Rank Heuristic
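
A small sketch of how such a weighted utility could steer the choice of the next list to read. The slide's formulas for gain and reduction are not preserved, so the definitions below are assumptions: gain is taken as the sum of the next x scores of a list, and reduction as the drop of the list's upper bound over those x entries. The second list and all function names are hypothetical.

```python
def score_gain(lst, pos, x):
    """Assumed gain: sum of the scores of the next x entries after position pos."""
    return sum(score for _, score in lst[pos:pos + x])


def score_reduction(lst, pos, x):
    """Assumed reduction: drop of the list's upper bound after reading x more entries."""
    window = lst[pos:pos + x]
    if not window:
        return 0.0
    # Before anything is read, use the first score as the current upper bound.
    current_high = lst[pos - 1][1] if pos > 0 else window[0][1]
    return current_high - window[-1][1]


def util(lst, pos, x, alpha):
    """Weighted combination: alpha * gain + (1 - alpha) * reduction."""
    return alpha * score_gain(lst, pos, x) + (1 - alpha) * score_reduction(lst, pos, x)


def pick_list(lists, positions, x, alpha):
    """Index of the list whose next batch of x sorted accesses has the highest utility."""
    return max(range(len(lists)), key=lambda i: util(lists[i], positions[i], x, alpha))


flat = [("o2", 1.0), ("o4", 0.9), ("o5", 0.8), ("o3", 0.7), ("o1", 0.6)]
steep = [("o7", 0.5), ("o9", 0.1), ("o6", 0.05)]   # hypothetical second list
choice_gain_only = pick_list([flat, steep], [0, 0], x=2, alpha=1.0)
choice_reduction_only = pick_list([flat, steep], [0, 0], x=2, alpha=0.0)
```

With alpha = 1 the high-scoring but flat list wins, and with alpha = 0 the list with the steep score drop wins, which mirrors the two observations "prefer high scores" and "prefer large score reductions".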

Adaptive
[Figure: weight on gain (α) and weight on reduction (1 - α) plotted over time]

Adaptive
• d(o4) = 0.8 - 0.6 = 0.2
• Theobald et al. VLDB 04
[Figure: top-k queue o1 [ws, bs], o2 [ws, bs], o3 [0.8, bs]; candidates o4 [0.6, bs], o6 [ws, bs]; lists L1 and L3 with scan positions hight1 and hight2]

Adaptive
[Figure: TREC query, k=100]

Efficient Plan - Random Accesses
• Observations:
– Random accesses always occur after the sorted accesses have finished.
– schedule 1: {SA……RA……SA….}
– schedule 2: {SA……RA….}
– precision(schedule 1) = precision(schedule 2)

Observations – contd.
• Random accesses are only useful for objects in REXACT.
[Figure: top-k queue o1, o2, o3 and candidates o4, o5, each with a [ws, bs] interval; list L2 with entries o2, o5, o1; o5 is not in REXACT; outcomes labelled "Precision reduced" and "Precision remains the same"]
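
Random Accesses
• When to switch from SA to RA?
– Gathering with Sorted (α): with too little gathering there are not enough good candidates, and RA is wasted.
– Probing with Random (1 - α): with too little probing there are not enough RAs to prune the candidates.
[Figure: timeline split into a gathering phase and a probing phase]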

Random Accesses
• Switch from Sorted to Random: R = (1 - α) * S, S + R > B
– S: total cost of sorted accesses.
– R: total cost of random accesses.
• Which items to access?
– Maximize expected score.

Experimental Data
• TREC Terabyte
– 25M webpages
– 50 queries with an average length of 3 words.
• IMDB
– 375,000 movies
– 20 queries, each with 4 attributes: {Title, Genre, Actors, Description}
• Synthetic data
– Zipf, #lists = [2, 6], #objects = [10000, 1000000]
• Aggregate function: Sum

Evaluation Methods
• Percentage of optimal precision
• SME
[Figure: result sets Ropt, Ralg and Rexact]
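
One plausible reading of "percentage of optimal precision" is sketched below. The slide's formula is not preserved, so this is an assumption: precision of a result is its overlap with Rexact, and the metric relates the precision of Ralg to that of Ropt. Function names are hypothetical.

```python
def precision(result, r_exact):
    """Fraction of the exact top-k objects that the result contains."""
    return len(set(result) & set(r_exact)) / len(r_exact)


def pct_of_optimal_precision(r_alg, r_opt, r_exact):
    """Precision of the algorithm's result relative to the best budgeted plan, in percent."""
    opt = precision(r_opt, r_exact)
    if opt == 0:
        return 0.0
    return 100.0 * precision(r_alg, r_exact) / opt


r_exact = ["a", "b", "c", "d"]
r_opt = ["a", "b", "x", "y"]     # best achievable within the budget (made-up data)
r_alg = ["a", "w", "x", "y"]     # what a heuristic returned (made-up data)
pct = pct_of_optimal_precision(r_alg, r_opt, r_exact)
```

Here Ropt finds two of the four exact results and Ralg finds one, so the heuristic reaches 50% of the optimal precision.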

Results - Sorted Accesses
• Less budget, more improvement.
[Figure: TREC, k=100]

Varied k
• Lower k, more improvement.
[Figure: IMDB, B=400]

Number of Lists
• More lists, more improvement.
[Figure: Zipf, k=100, B=4000]

Results - Random Accesses
[Figures: TREC, k=100, Cr=10 and TREC, k=100, Cr=100]

Related Works
• Minimize budget for optimal results:
– The algorithm computes the exact results with minimum cost. (Bast et al. VLDB 06, Bruno et al. ICDE 02, Chang et al. SIGMOD 02)
– Dual problem.
• Anytime top-k:
– The algorithm collects statistics during processing, which can be used to provide probabilistic guarantees at any time during processing. (Arai et al. VLDB 07)
– Does not perform any optimization.
• Approximate top-k:
– Approximate results with probabilistic guarantees. (Theobald et al. VLDB 04, Fagin et al. 2001)

Conclusions
• First attempt to deal with budget constraints.
• For SA only, average precision is around 70%.
• There is a tradeoff between RAs and SAs: when the cost of an RA is relatively low, RA schedules improve.

Thank You!

Top-k query
• Given a set of n objects and m scoring lists sorted in decreasing order, find the top-k objects according to a scoring function f.
• top-k: a set T of k objects such that f(rj1, …, rjm) ≤ f(ri1, …, rim) for every object Xi in T and every object Xj not in T.
• Assumption: the scoring function f is monotone.
– f(r1, …, rm) ≤ f(r1’, …, rm’) if ri ≤ ri’ for all i
• Two access modes:
– sorted access (Cs)
– random access (Cr)
• Objective: compute the top-k with the minimum cost.
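
The definition above can be checked directly on a tiny instance. This is a brute-force sketch for illustration only: it uses the three objects whose scores in both lists are visible in the NRA example (a, c, d), f defaults to SUM, and the function names are hypothetical.

```python
def top_k(scores, k, f=sum):
    """Brute-force top-k: rank every object by f over its m local scores."""
    return sorted(scores, key=lambda o: (-f(scores[o]), o))[:k]


def is_valid_top_k(scores, T, f=sum):
    """Check the definition: no object outside T scores higher than one inside T."""
    lowest_inside = min(f(scores[o]) for o in T)
    highest_outside = max(
        (f(scores[o]) for o in scores if o not in T), default=float("-inf")
    )
    return highest_outside <= lowest_inside


# object -> (score in R1, score in R2), taken from the NRA example slides
scores = {"a": (0.9, 0.85), "c": (0.5, 0.2), "d": (0.4, 0.87)}
best_two = top_k(scores, 2)
```

Any monotone aggregate can be passed as `f`; for instance `min` or `max` also satisfy the monotonicity assumption and yield the same top-2 on this data.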

Sorted Accesses
• Observations:
– An object with high scores has a higher potential to be part of the top-k.
– An object with “mediocre” scores does not help.
• Prefer high scores.
[Figure: lists L1, L2, L3 with entries (O1, SL1), (O1, SL2), (O1, SL3)]

Example
[Figure: wireless zone with query Q; one region labelled "useless"]

Applications
• Mobile Applications
– Highly impatient users, need fast results.
• Mediation Systems
– Achieve high query throughput.
• Online analytics (e.g. logs)
– Achieve high query throughput.

Motivating Example
• Query throughput: given #queries per time unit, allocate time for each query.
[Figure: user query, mediator engine, servers]

Terminology
1. Sorted Access
2. Random Access
3. highi
4. Top-k queue
5. Candidates queue
6. mink
7. worstScore(d)
8. bestScore(d)

Efficient Offline Solution - Sorted
• Goal: find a trace t that stays within the budget and has optimal precision; its result is denoted as ROPT.
• Traces for B = 5, as (depth in L1, depth in L2): t1 (0, 5), t2 (1, 4), t3 (2, 3), t4 (3, 2), t5 (4, 1), t6 (5, 0)
[Figure: L1 = (o8, o1, o6, o5) with scores SL1 and L2 = (o2, o4, o5, o3, o1) with scores SL2; position markers P1, P2]

Efficient Offline Solution - Sorted
• Goal: find a trace t that stays within the budget and has optimal precision.
• Traces for B = 5, as (depth in L1, depth in L2): t1 (0, 5), t2 (1, 4), t3 (2, 3), t4 (3, 2), t5 (4, 1), t6 (5, 0)
• Feasible for k up to 100, and m up to 10.
[Figure: L1 = (o8, o1, o6, o5) with scores SL1 and L2 = (o2, o4, o5, o3, o1) with scores SL2; position markers P1, P2]

Efficient Offline Solution - Sorted
• Proof (by contradiction):
– Assume that t does not exist, and choose a trace s that is within the budget and has optimal precision. Let s' be the trace whose depth s'i in each list is the largest position in Pi that is less than or equal to si.
– By construction, the score of any object in S is the same as in S'.

Fair Heuristic
• Assume budget = b.
• Runs in batches.

Efficient Offline Solution - Random
• Budget for RAs = (B - |t| * Cs)
• best(o) - mink, where best(o) = worst(o) + RA
[Figure: top-k queue and candidate objects o1, o5, o2, o8, o7, o3, o9, o4, …, o10, …, o14, … with scores S; d and Rexact marked]
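
The budget split above can be sketched in a few lines. This is an illustrative assumption, not the slide's exact procedure: the leftover budget B - |t| * Cs is converted into a number of random accesses, and candidates are probed in decreasing order of best(o) - mink. The candidate intervals and function name are hypothetical.

```python
def random_access_schedule(candidates, mink, budget, sorted_accesses, cs, cr):
    """Pick the candidates to probe with the budget left after the sorted phase.

    candidates: dict object -> (worst, best) score interval.
    Returns (ra_budget, objects to probe, most promising first).
    """
    ra_budget = budget - sorted_accesses * cs          # B - |t| * Cs
    n_probes = max(0, int(ra_budget // cr))
    # Only candidates whose best score can still exceed mink are worth probing.
    promising = [
        (best - mink, obj)
        for obj, (worst, best) in candidates.items()
        if best > mink
    ]
    promising.sort(key=lambda t: (-t[0], t[1]))
    return ra_budget, [obj for _, obj in promising[:n_probes]]


# Hypothetical candidate intervals [worst, best]
candidates = {
    "o2": (0.5, 1.4),
    "o8": (0.6, 1.2),
    "o7": (0.4, 0.9),
    "o3": (0.3, 1.0),
    "o9": (0.2, 1.5),
}
ra_budget, probes = random_access_schedule(
    candidates, mink=0.95, budget=17, sorted_accesses=8, cs=1, cr=3
)
```

With B = 17, eight sorted accesses at Cs = 1 and Cr = 3, nine budget units remain, enough for three probes; o7 is skipped entirely because its best score cannot exceed mink.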

Motivation
• Many applications work in budget-constrained environments, yet they still wish to perform top-k queries.
• Budget-aware query processing.
[Figure: user query, mediator engine, servers]

Future work
• Different access costs for different lists
• Time-aware top-k
• Top-k with budget constraints for P2P