Rank Aggregation

Rank Aggregation: Settings
• Multiple items
  – Web pages, cars, apartments, …
• Multiple scores for each item
  – By different reviewers or users, or according to different features
• Some aggregation function on the scores
  – Sum, average, max, …
• Goal: compute the top-k items

Rank Aggregation: Example

Model    Price rank  Comfort rank  Beauty rank  Total rank (min)  Total rank (avg)
Honda    9           7             3            3                 6.333
Volvo    3           10            8            3                 7
Subaru   9           5             4            4                 6

Naïve Algorithm
• Compute the aggregated rank for all items
• Find the best one, then the second best, … up to the k-th best
• Good for small-scale problems
• Not feasible at web scale
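
The naïve approach can be sketched in a few lines of Python. This is only an illustration: the function name `naive_top_k`, the helper `average`, and the convention that a higher score is better are assumptions, and the data is taken from the car example.

```python
import heapq

def naive_top_k(scores, agg, k):
    """Aggregate every item's scores, then pick the k best (assumes higher is better)."""
    totals = {item: agg(vals) for item, vals in scores.items()}
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])

def average(vals):
    return sum(vals) / len(vals)

# Price, comfort and beauty values from the car example.
cars = {"Honda": [9, 7, 3], "Volvo": [3, 10, 8], "Subaru": [9, 5, 4]}
top2 = naive_top_k(cars, average, 2)
print(top2)
```

Every item is aggregated before anything is returned, which is why the cost grows with the total number of items rather than with k.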

Can we do any better?
• An assumption to help us: each individual list comes sorted
  – Reasonable for search engines, user rankings, …
• Another assumption: the aggregation function is monotone
• Now can we do any better?
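
For reference, monotonicity can be stated as follows. The notation is an assumption chosen for this sketch: t is the aggregation function, m is the number of lists, and x_i, x'_i are the scores of two items in list i.

```latex
x_i \le x'_i \ \text{for all } i = 1, \dots, m
\quad \Longrightarrow \quad
t(x_1, \dots, x_m) \le t(x'_1, \dots, x'_m)
```

Sum, average, max and min all satisfy this condition.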

Fagin's Algorithm (FA)
• Do sorted access on all lists in parallel
• For every item seen, do random access to the other lists to fetch all of its values
• Stop when at least k items have been seen (under sorted access) in all lists
• Sort the seen items by aggregated score and output the top k
• Why is this enough?
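
The steps above can be sketched in Python. This is a minimal sketch under stated assumptions, not a reference implementation: the name `fagin_top_k` is hypothetical, each list holds `(item, score)` pairs sorted by descending score, every item appears in every list, and random access is simulated with a dictionary per list. For simplicity the random accesses are done after the sorted phase ends.

```python
def fagin_top_k(lists, agg, k):
    """Sketch of FA. Returns (top-k list, depth reached under sorted access)."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]   # simulated random access
    seen_in = {}                            # item -> indices of lists where it was seen
    depth = 0
    while depth < len(lists[0]):
        # One round of sorted access on all lists in parallel.
        for i, lst in enumerate(lists):
            item, _ = lst[depth]
            seen_in.setdefault(item, set()).add(i)
        depth += 1
        # Stop once at least k items were seen under sorted access in all lists.
        if sum(1 for s in seen_in.values() if len(s) == m) >= k:
            break
    # Random access: fetch all values of every seen item and aggregate.
    totals = {it: agg([lookup[i][it] for i in range(m)]) for it in seen_in}
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k], depth

def average(vals):
    return sum(vals) / len(vals)

beauty = [("A", 9), ("B", 9), ("C", 3), ("D", 3)]
comfort = [("B", 10), ("C", 5), ("A", 4), ("D", 1)]
fa_top3, fa_depth = fagin_top_k([beauty, comfort], average, 3)
print(fa_top3, fa_depth)
```

On these two lists with k = 3 the sketch stops after three rounds of sorted access and never reads the rows for D.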

Example (top-3)

Beauty        Comfort
Item Score    Item Score
A    9        B    10
B    9        C    5
C    3        A    4
D    3        D    1

Average (filled in one item per step)
Item Score
A    6.5
B    9.5
C    4

How do we know not to look further?

Complexity
• Probabilistic analysis on the order of items can be used to show better bounds (with good probability)
• Can we do even better?

Cost model
• This is a very simple setting, so we can define a finer cost model than worst-case complexity
• In a web context it is important to do so
  – Since the scale is huge
• We associate some cost Cs with every sorted access and some cost Cr with every random access
• Denote the cost of algorithm A on input instance I by cost(A, I)
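
One natural way to write this cost down is shown below. The symbols s and r are assumptions introduced here: they stand for the number of sorted accesses and random accesses that algorithm A performs on instance I.

```latex
\mathrm{cost}(A, I) = s \cdot C_s + r \cdot C_r
```

For example, a run that makes 6 sorted accesses and 3 random accesses costs 6 C_s + 3 C_r.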

Instance-optimality
• An algorithm A is instance-optimal if, for every input instance I and every algorithm A', cost(A, I) = O(cost(A', I))
• A very strong notion
• But we can realize it here!

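Threshold Algorithm (TA)
• Idea: sometimes we can stop before seeing k objects in every list
• Use a threshold on how good the score of an unseen object can be
• The threshold is the aggregation of the lowest (most recent) score seen so far under sorted access in each list
• Stop as soon as k of the seen objects have an aggregated score of at least the threshold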
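
A minimal Python sketch of this idea follows. The name `threshold_top_k` is hypothetical, the input layout (lists of `(item, score)` pairs sorted by descending score, every item present in every list) is an assumption, and random access is simulated with one dictionary per list.

```python
import heapq

def threshold_top_k(lists, agg, k):
    """Sketch of TA. Returns (top-k list, depth reached under sorted access)."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]   # simulated random access
    totals = {}                             # item -> aggregated score
    n = len(lists[0])
    for depth in range(n):
        last = []                           # score just read in each list
        for lst in lists:
            item, score = lst[depth]
            last.append(score)
            if item not in totals:
                # Random access to the other lists for the full score.
                totals[item] = agg([lookup[j][item] for j in range(m)])
        # No unseen item can score higher than this threshold.
        threshold = agg(last)
        top = heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top, depth + 1
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1]), n

def average(vals):
    return sum(vals) / len(vals)

beauty = [("A", 9), ("B", 9), ("C", 3), ("D", 3)]
comfort = [("B", 10), ("C", 5), ("A", 4), ("D", 1)]
ta_top1, ta_depth1 = threshold_top_k([beauty, comfort], average, 1)
ta_top3, ta_depth3 = threshold_top_k([beauty, comfort], average, 3)
print(ta_top1, ta_depth1)
print(ta_top3, ta_depth3)
```

With k = 1 the sketch halts after the first row, because B already scores 9.5 and the threshold is 9.5. The rule "k items seen in all lists" would need a second row here, since B appears in the Beauty list only at depth 2.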

Example

Beauty        Comfort
Item Score    Item Score
A    9        B    10
B    9        C    5
C    3        A    4
D    3        D    1

Average and threshold T, step by step
• After row 1: T = (9 + 10) / 2 = 9.5; seen so far: A 6.5, B 9.5
• After row 2: T = (9 + 5) / 2 = 7; seen so far: A 6.5, B 9.5, C 4
• After row 3: T = (3 + 4) / 2 = 3.5; all three seen items score at least T, so we stop

One step less!

Theorem • Assume that the aggregation function t is monotone. Let D be the class of all databases. Let A be the class of all algorithms that correctly find the top k answers for t for every database and that do not make wild guesses. Then TA is instance optimal over A and D

Proof
• Assume that algorithm A halts at depth d (that is, if d_i is the number of objects seen under sorted access to list i, then d = max_i d_i)
• Let a be the number of distinct objects that A sees (some possibly multiple times). In particular, a ≥ d
• Since A makes no wild guesses and sees a distinct objects, it must make at least a sorted accesses

• Claim: TA halts on a database D by depth a + k
• Note that for each choice of d', TA sees at least d' objects by depth d'
  – By depth d' it has made m·d' sorted accesses, and each object is accessed at most m times under sorted access
• If there are at most k objects that A does not see, then TA halts by depth a + k (after having seen every object), and we are done

• Now assume that there are at least k + 1 objects that A does not see
• Let Y be the output set of A
• Since Y is of size k, there is some object V that A does not see and that is not in Y
• Let t be the threshold value when algorithm A halts
  – I.e., the aggregation of the lowest scores A has observed under sorted access in each list

• Call an object R big if it has grade at least t, otherwise small
• Claim: every R in Y is big
  – Proof: add to the database another item whose grade in each list i is the lowest grade A saw in that list, so that its overall grade is t. A does not see this item and thus does not output it; by correctness of A, every item in Y must have grade at least t, and the claim follows
• Now TA will have seen all elements of Y by depth d and will halt
  – d ≤ a, and so we are done

Restricted Sorted Access
• Some rankings are not available in sorted order
  – E.g., distances from a map site
• Then we can revise TA to do sorted access only on the lists where it is possible
• And it is still instance-optimal! (Against algorithms that work under the same restrictions, of course)

No Random Access
• Maintain lower and upper bounds for every item (worst and best grades)
• Best is the aggregation of the scores we have seen for the item, with each missing score replaced by the lowest score seen so far in that list
• Worst is the aggregation of the scores we have seen for the item, with each missing score replaced by zero
• Keep in the list the items with the top-k "worst" grades
  – Break ties by "best" grades
• Halt if we have k items in the list and the best grade of every item outside the list is no higher than the k-th worst grade in the list
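
The bookkeeping above can be sketched in Python as follows. This is a simplified illustration, not a tuned implementation: the name `nra_top_k` is hypothetical, scores are assumed to be non-negative (so zero is a valid lower bound), every item is assumed to appear in every list, and the bounds of all seen items are recomputed at every depth for clarity.

```python
def nra_top_k(lists, agg, k):
    """Sketch of the no-random-access variant.
    Returns ([(item, worst, best), ...], depth reached under sorted access)."""
    m, n = len(lists), len(lists[0])
    known = {}                                  # item -> {list index: score seen}
    for depth in range(n):
        last = []                               # lowest score seen so far per list
        for i, lst in enumerate(lists):
            item, score = lst[depth]
            known.setdefault(item, {})[i] = score
            last.append(score)
        # Worst: missing scores become 0. Best: missing scores become last[i].
        worst = {it: agg([s.get(i, 0) for i in range(m)]) for it, s in known.items()}
        best = {it: agg([s.get(i, last[i]) for i in range(m)]) for it, s in known.items()}
        # Rank by worst grade, ties broken by best grade.
        ranked = sorted(known, key=lambda it: (worst[it], best[it]), reverse=True)
        top, rest = ranked[:k], ranked[k:]
        if len(top) < k:
            continue
        kth_worst = worst[top[-1]]
        # An item not seen anywhere yet can score at most agg(last).
        unseen_ok = len(known) == n or agg(last) <= kth_worst
        if unseen_ok and all(best[it] <= kth_worst for it in rest):
            return [(it, worst[it], best[it]) for it in top], depth + 1
    return [(it, worst[it], best[it]) for it in top], n

def average(vals):
    return sum(vals) / len(vals)

beauty = [("A", 9), ("B", 9), ("C", 3), ("D", 3)]
comfort = [("B", 10), ("C", 5), ("A", 4), ("D", 1)]
nra_top3, nra_depth = nra_top_k([beauty, comfort], average, 3)
print(nra_top3, nra_depth)
```

After the first row, A has bounds 4.5 and 9.5 and B has bounds 5 and 9.5. With k = 3 the sketch halts after the third row, when the best possible score of the unseen item is 3.5 and the third-best worst grade is 4.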

Example

Beauty        Comfort
Item Score    Item Score
A    9        B    10
B    9        C    5
C    3        A    4
D    3        D    1

Average (bounds on the score S tighten as more rows are read)
• Row 1, Beauty: A 4.5 < S < 9.5
• Row 1, Comfort: B 5 < S < 9.5
• Row 2, Beauty: B 9.5
• Row 2, Comfort: C 2.5 < S < 7
• Row 3: A 6.5, C 4
• D has not been seen: Score(D) ≤ (3 + 4) / 2 = 3.5, which is below C's score of 4