Top-k and Clustering with Noisy Comparisons 1 Volume 39, Number 4, December 2014. Presented 2015/6/15 by B4 Kosaka. (64 slides)

Main author 2 Susan Davidson, Ph.D. at the Department of Computer and Information Science, University of Pennsylvania. She mainly researches databases, bioinformatics, workflows, and provenance.

Introduction 3 Max/top-k and clustering/group-by. The implementation of these operations involves value comparisons ("Is a > b?") for max/top-k, and type comparisons ("Are a and b of the same type?") for clustering.

Introduction 4 As an example, consider PhotoDB: each photo is an element; its type is the person appearing in the photo; its value is the age of the person in the photo (or the date when the photo was taken). The number of clusters J is the number of distinct people in the PhotoDB database, and the clusters are not necessarily balanced.

Introduction 5 Using PhotoDB, they wish to group the photos and find the most recent photo. Using an SQL-like syntax and assuming that PhotoDB has a single attribute photo, this query could be represented as follows.

Introduction 6 PhotoDB. Q: Group the photos of individual players (group-by queries, using the "Name" attribute). Q: Find their most recent photos (max/top-k queries, using the "Date" attribute). What if the name/date is missing? Image processing? Photo forensics?

Introduction 7 Ask the "Crowd"!!

Introduction 8 Crowdsourcing: using human intelligence to do tasks that are harder to automate. A recent topic of interest in the database and other research communities, with many crowdsourcing platforms.

Introduction 9 The authors' approach: model such functions as oracles whose answers may be erroneous, and which are used by the system in one or more rounds of interaction.

Introduction 10 Error models. (1) Constant error model (Feige et al. [1994]): each type or value comparison is answered correctly with a constant probability > 1/2. (2) Variable error model (the proposed model): the probability of error varies with each question.

Introduction 11 Variable error model. The error probability of a value comparison is at most 1/f(Δ), where Δ is the distance between the two elements in the sorted order and f is (1) strictly monotone (x1 > x2 implies f(x1) > f(x2)) and (2) satisfies f(Δ) ≥ 2. Examples: f(Δ) = e^Δ, f(Δ) = Δ + 1, f(Δ) = log Δ + 2. With f(Δ) = e^Δ, "older than the first photo?" is hard for adjacent photos (Δ = 1, error probability ≤ 1/e) and easier for farther ones (Δ = 2, error probability ≤ 1/e²).
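A minimal sketch of a value-comparison oracle in the variable error model (assumptions: Python, ranks stand in for elements, rank 1 is the largest, and the slide's example f(Δ) = e^Δ is used as the default):

```python
import math
import random

def noisy_value_comparison(i, j, f=lambda d: math.exp(d)):
    """Answer "is x_i > x_j?" for ranks i, j (rank 1 is the largest).
    The answer is wrong with probability 1/f(delta), where delta is the
    distance between the two elements in the sorted order."""
    delta = abs(i - j)
    correct = i < j                      # smaller rank index = larger value
    p_error = min(0.5, 1.0 / f(delta))   # error shrinks as delta grows
    return correct if random.random() > p_error else not correct
```

With f(Δ) = e^Δ, adjacent elements (Δ = 1) are misordered with probability about 1/e ≈ 0.37, while Δ = 5 is almost always answered correctly.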

Introduction 12 How to assign a cost per question. (1) Fixed cost: asking one question has a cost of 1 and asking N questions has a cost of N. (2) Concave cost function: the cost of asking N questions followed by M questions is more expensive than asking N + M questions at once.
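As a toy illustration of the concave-cost property (assumption: g(x) = √x is one concrete concave cost function, not taken from the paper):

```python
import math

def batch_cost(batch_sizes, g=math.sqrt):
    """Total price of asking questions in the given batches under a
    nonnegative monotone concave cost function g (sqrt is an assumption)."""
    return sum(g(b) for b in batch_sizes)

sequential = batch_cost([100, 50])  # ask 100 questions, then 50 more
one_shot = batch_cost([150])        # ask all 150 at once: cheaper
```

Here g(100) + g(50) = 10 + 7.07... exceeds g(150) = 12.24..., matching the slide's claim that batching is cheaper.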

Introduction 13 The authors' goal: minimize the total number of comparisons while outputting the correct answer (exact top-k or clusters) with probability 1 − δ, for a given constant δ > 0.

Preliminaries 14 Top-k and group-by database queries. The goal of a simple top-k query is to find the maximum or the top-k elements having the highest values; the goal of a simple group-by query is to group together those elements having the same type. (Sample queries are shown on the slide.)

15 Questions Asked to the Comparison Oracle. A comparison oracle is used to compute the functions "Most-recent(photo)" or "GROUP BY Person(photo)" in the preceding queries. This is done by posing questions to the oracle that ask it to compare the types or values of two elements; the answers to these questions are always "yes" or "no".

16 Questions Asked to the Comparison Oracle. Independence assumption: they assume that the answers to two different questions asked of the oracle are mutually independent. This simulates a crowdsourcing setting.

Preliminaries 17 Error model. For the max/top-k problem, the authors study the effect of a more refined variable error model for value questions, in which the probability of error decreases when the two elements being compared are far apart in the total order on values.

Preliminaries 18 Error model. A function f : N → R≥0 is monotone if f(n1) ≥ f(n2) for all n1 ≥ n2, and strictly monotone if f(n1) > f(n2) for all n1 > n2. In the variable error model, given two distinct elements xi, xj with xi > xj, the probability of error is at most 1/f(Δ), where Δ is the distance between xi and xj in the sorted order, f is a strictly growing function, f(1) ≥ 2 + ε′, and ε′ > 0 is a constant.

Preliminaries 19 Error model. The condition f(1) ≥ 2 + ε′ ensures that, even if xi and xj are consecutive elements in the total order (that is, j = i + 1), the probability of error is ≤ 1/2 − ε for some constant ε > 0. When f(Δ) = 2 + ε′ for all inputs Δ, the variable error model coincides with the constant error model for value questions.

Preliminaries 20 Error model. There is a natural "value-based" alternative to the "ranking-based" variable error model described previously. In this value-based variable error model, the rank distance Δ in inequality (1) is replaced by the difference between the two elements' actual values.

Preliminaries 21 Problem statements. The problems studied in this article are: (i) max and top-k; (ii) clustering; (iii) clustering with correlated types and values; (iv) max and top-k for concave cost functions.

MAX AND TOP-K 23 Finding max. Input: n elements of the same type; only value comparisons are used. Output: the max/top-k elements. The authors focus on max; the algorithm for max extends to top-k.

MAX AND TOP-K 24 Finding max. If all answers are correct, n − 1 questions are necessary and sufficient to find x1. If each value question is answered correctly with probability ≥ 1/2 + ε (constant error model), this bound is tight as stated in Theorem 4.1.

MAX AND TOP-K 25 Finding max. Moreover, the proof of the lower bound shows a stronger result when δ + ε ≤ 1/2. Theorem 4.2 also holds in the variable error model.

MAX AND TOP-K 26 The authors' results for max (Δ = distance in sorted order; required confidence 1 − δ for a constant δ; exact answer):
- Exact comparisons: upper bound n − 1; lower bound n − 1.
- Constant error model (Feige et al. '94; error probability < 1/2): upper bound O(n); lower bound Ω(n), and ≥ (1 + c)n for some c > 0 when high success probability is required under high error.
- Variable error model (this paper; error probability < 1/f(Δ)): n + o(n) comparisons for any strictly growing function f.

MAX AND TOP-K 27 Tournament tree for max. With exact comparisons, a binary tree structure is not necessary. With noisy comparisons: repeat each comparison and take a majority vote; in the constant error model this gives a Θ(n) algorithm (Feige et al. '94). The authors' goal: total number of comparisons = n + o(n).

MAX AND TOP-K 28 The authors' algorithm for max. Key idea: start with a random permutation of the elements at the leaves of a binary tournament tree (n leaves). Lower levels: just one comparison at each internal node, no majority vote. Upper levels (Y nodes): use Feige's algorithm, at cost Θ(Y). The max does not lose in the upper or lower levels with high probability, and the total number of comparisons is n + Θ(Y) = n + o(n).
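A runnable sketch of this tournament in Python (assumptions: the majority-vote loop stands in for Feige et al.'s subroutine, and `reps` and `lower_levels` are illustrative parameters rather than the paper's exact choices):

```python
import random

def find_max(n, oracle, lower_levels, reps=31):
    """Single-elimination tournament over indices 0..n-1.
    Lower levels spend one noisy comparison per match; upper levels
    repeat each match `reps` times and take a majority vote."""
    idx = list(range(n))
    random.shuffle(idx)              # key idea: random leaf permutation
    level = 0
    while len(idx) > 1:
        winners = []
        for a, b in zip(idx[0::2], idx[1::2]):
            if level < lower_levels:
                a_wins = oracle(a, b)                      # one comparison
            else:
                votes = sum(oracle(a, b) for _ in range(reps))
                a_wins = votes > reps // 2                 # majority vote
            winners.append(a if a_wins else b)
        if len(idx) % 2:
            winners.append(idx[-1])  # bye for an unpaired element
        idx = winners
        level += 1
    return idx[0]
```

With an oracle that answers "is a larger than b?" correctly with probability 0.8, the true max survives the single lower-level comparison most of the time and then wins the repeated upper-level matches with high probability.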

MAX AND TOP-K 29 Analysis. The number of nodes at level L is 2^(log n − L), for L = 1 to log n; summing one comparison per node over the levels gives the total number of comparisons performed by the algorithm.

MAX AND TOP-K 30 Analysis. They analyze the upper log(n/X) levels and the lower log X levels separately: (i) in the upper levels, they use the algorithm from Feige et al. [1994], which returns the maximum element with probability ≥ 1 − δ; (ii) in the lower levels, they show that x1 does not lose any comparison with probability ≥ 1 − 5δ, even though only one comparison is performed at each internal node.

MAX AND TOP-K 31 Analysis. Therefore, by the union bound, the maximum element x1 is returned with probability ≥ 1 − 6δ.

MAX AND TOP-K 32 Analysis of the upper levels. Each internal node in levels l = 1 to log n uses S_l = 2^(l−1) × O((1/ε²) log(1/δ)) comparisons, with N_L = S_(L − log X); the total number of comparisons over the upper levels follows by summing.

MAX AND TOP-K 33 Analysis of the upper levels. Therefore, given constants ε, δ > 0, the maximum element can be found in the upper levels with probability ≥ 1 − δ.

MAX AND TOP-K 34 Analysis of the lower levels. The number of comparisons in the lower log X levels is bounded by n. There exists a value of X such that the cost of the upper levels is o(n) for any strictly growing function f, and the maximum element does not lose any comparison in the lower levels with probability ≥ 1 − 5δ.

MAX AND TOP-K 35 Extension to top-k. The algorithm given by Feige et al. uses O(n log(min(k, n − k)/δ)) comparisons to find the k-th largest element with probability ≥ 1 − δ. The following corollary to Theorem 4.2 solves the top-k problem with high probability.

CLUSTERING 37 The authors prove a theorem that gives a bound on the number of type questions that are necessary and sufficient to find the exact J clusters. Note that J is not known a priori.

CLUSTERING 38 The authors' algorithm: 1. Take the first remaining element as the head of a new cluster. 2. Compare the type of the head with each remaining element. 3. Move all elements of the same type into the head's cluster (C1, C2, C3, ...). 4. Repeat the same operation on the elements left over.

CLUSTERING 39 Proof of the upper bound. Each type question is repeated O((1/ε²) log(n/δ)) times, against ≤ n elements, over J cluster heads. Upper bound: J × n × O((1/ε²) log(n/δ)) = O(nJ log n).
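A Python sketch of this clustering loop (the `reps` parameter is illustrative; the paper repeats each type question O((1/ε²) log(n/δ)) times):

```python
def cluster(elements, same_type, reps=1):
    """Repeatedly take the first remaining element as a cluster head,
    ask the (possibly noisy) type question against every other remaining
    element `reps` times, and keep the majority answer."""
    remaining = list(elements)
    clusters = []
    while remaining:
        head, rest = remaining[0], remaining[1:]
        group, leftover = [head], []
        for z in rest:
            votes = sum(same_type(head, z) for _ in range(reps))
            (group if votes > reps // 2 else leftover).append(z)
        clusters.append(group)
        remaining = leftover
    return clusters
```

With an exact type oracle (reps=1), each pass peels off one complete cluster, so J passes suffice.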

Clustering With Correlated Types and Values 42 To cluster n elements into J clusters, on the order of nJ questions are necessary and sufficient. However, types and values can be correlated in some scenarios, and elements of the same type can form contiguous blocks in the sorted order according to the values. They formalize this idea by assuming at most α changes in types between any two elements of the same type.

Clustering With Correlated Types and Values 43 Here, this bound improves to O(n log J) when α is small and both type and value questions are asked. Note that both value and type questions are answered correctly with probability ≥ 1/2 + ε, for a given constant ε > 0. In general, the following theorem holds.

Clustering With Correlated Types and Values 44 Clustering under full correlation. Here, they give an algorithm that improves the number of value and type questions from O(n log n) to O(n log J), where J is the number of clusters and is typically much smaller than n.

Clustering With Correlated Types and Values 45 Algorithm 3 for Full Correlation

Clustering With Correlated Types and Values 46 Analysis. The following lemma bounds the total number of type and value questions.

Clustering With Correlated Types and Values 47 Analysis: handling erroneous answers to type and value questions. When type and value comparisons are correct, cn log J questions suffice for some constant c. When the comparisons are erroneous, but correct answers are returned with probability ≥ 1/2 + ε for constant ε > 0, the algorithm is adapted as on the next slide.

Clustering With Correlated Types and Values 48 In this case, each type or value comparison performed by Algorithm 3 between two elements is repeated O((1/ε²) log(n/δ)) times and a majority vote is taken, either to decide whether they have the same type or to order them according to their values.
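Sketched in Python (the hidden constant in the repetition count is unspecified in the slides; c = 1 below is an assumption):

```python
import math
import random

def majority_of_repeats(ask, eps, n, delta, c=1.0):
    """Boost a noisy yes/no question that is answered correctly with
    probability >= 1/2 + eps by asking it O((1/eps^2) log(n/delta))
    times and taking the majority vote."""
    reps = max(1, math.ceil(c / eps ** 2 * math.log(n / delta)))
    reps += 1 - reps % 2                 # make the count odd to avoid ties
    yes_votes = sum(ask() for _ in range(reps))
    return yes_votes > reps // 2
```

By a Chernoff bound, the majority is wrong with probability exponentially small in reps · ε², which is what lets a δ/n failure budget be spread over all comparisons.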

Clustering With Correlated Types and Values 49 Moreover, they abort the algorithm after comparing cn² pairs of elements, which bounds the total failure probability over the cn log J ≤ cn² comparisons. The expected number of questions asked by the algorithm is O(n log J) × O((1/ε²) log(n/δ)).

Clustering With Correlated Types and Values 50 Extension to partial correlation. For arbitrary α, there are at most α changes in types between any two elements of the same type. Partial correlation means that elements of the same type form almost contiguous blocks in the sorted order. In the example of figure (b), with elements of types A, B, and C interleaved along the sorted order, there are at most α − 1 = 3 changes between any two elements of the same type.

Clustering With Correlated Types and Values 51 Algorithm for partial correlation. To group elements of the same type, consider the list of remaining elements L returned by Algorithm 3: 1. While L is not empty, select the first element y in L. 2. For the next α elements z in L, check whether y and z have the same type; for every z with type(y) = type(z), set link(z) = y. 3. Delete these elements from L. 4. Repeat the procedure with the remaining elements of L (in order).
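A Python sketch of these four steps (the names `link` and `same_type` are illustrative; type answers are assumed exact here, with noise handled by repetition as on slide 48):

```python
def group_partial(L, same_type, alpha):
    """Group a sorted remainder list L in which elements of the same type
    lie within alpha positions of each other (partial correlation).
    Returns link[z] = the representative (head) chosen for z."""
    link = {}
    L = list(L)
    while L:
        y = L[0]
        window, rest = L[1:1 + alpha], L[1 + alpha:]
        kept = []
        for z in window:
            if same_type(y, z):
                link[z] = y          # same type: link z to the head y
            else:
                kept.append(z)       # different type: keep for a later pass
        link[y] = y
        L = kept + rest
    return link
```

Each pass removes y and all of its type-mates inside the α-window, so the number of passes is bounded by the number of almost-contiguous blocks.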

Clustering With Correlated Types and Values 52 The number of consecutive blocks for general α is ≤ αJ, so Algorithm 3 asks O(n log(αJ)) questions. The additional step to group elements of the same type needs J iterations. O(n log(αJ) + αJ) comparisons suffice when the answers to type and value questions are exact.

MAX AND TOP-K FOR CONCAVE COST FUNCTIONS 53 So far the authors have used the fixed-cost model, in which each question incurs unit cost, so the total cost is the number of comparisons performed. Here, they introduce nonnegative monotone concave functions as the cost function.

MAX AND TOP-K FOR CONCAVE COST FUNCTIONS 54 Therefore, if many questions are asked together, one pays less than when asking the questions one by one. This is reasonable in a crowdsourced setting.

MAX AND TOP-K FOR CONCAVE COST FUNCTIONS 55 Optimization problem. Under the fixed-cost model, the goal was to minimize the total number of comparisons performed. Now, instead of minimizing the number of comparisons, the goal is to minimize the total cost under g. They denote the cost of an optimal algorithm by OPT.

MAX AND TOP-K FOR CONCAVE COST FUNCTIONS 56 Finding an optimal algorithm for an arbitrary concave cost function g is nontrivial even for the simple problem of finding the max, and even if there are no comparison errors. An algorithm is a μ(n)-approximation, for some nondecreasing function μ, if for every input of size n it finds the solution with a cost ≤ μ(n) × OPT.

MAX AND TOP-K FOR CONCAVE COST FUNCTIONS 57 In the unit-cost model with no comparison errors, n − 1 comparisons are necessary and sufficient to find the max of n elements, so the cost incurred is also n − 1. Hence g(n − 1) serves as a lower bound for OPT.

MAX AND TOP-K FOR CONCAVE COST FUNCTIONS 58 No-comparison-error model. Theorem 7.2: algorithms exist that give better than a log n approximation for any concave cost function g. The authors prove this by giving two algorithms (Algorithms 4 and 5) that achieve the desired approximation.

MAX AND TOP-K FOR CONCAVE COST FUNCTIONS 59 Algorithm 4. Let B_h be the number of children of a node in level h (h ≥ 1) and N_h the total number of nodes in level h (h ≥ 0), with N_0 = n/2, B_1 = 2, and B_(h+1) = B_h². The max of the B_h elements from level h − 1 is propagated to level h by performing all C(B_h, 2) pairwise comparisons at level h − 1.
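A sketch of Algorithm 4's structure in Python (comparisons are exact here; `max` per group stands in for the winner of the C(B_h, 2) round-robin, whose comparison count is tallied per batch):

```python
def max_doubling_fanout(values):
    """Tournament whose fan-out squares at every level
    (B_1 = 2, B_{h+1} = B_h ** 2), so only O(log log n) batches are
    needed; returns the max and the comparison count of each batch."""
    level, batches, b = list(values), [], 2
    while len(level) > 1:
        groups = [level[i:i + b] for i in range(0, len(level), b)]
        # a round-robin inside each group costs C(|g|, 2) comparisons
        batches.append(sum(len(g) * (len(g) - 1) // 2 for g in groups))
        level = [max(g) for g in groups]   # group winner moves up
        b *= b                             # B_{h+1} = B_h ** 2
    return level[0], batches
```

Because the fan-out squares, 256 elements need only 4 batches (group sizes 2, 4, 16, 256), which is what makes the per-batch concave pricing cheap.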

MAX AND TOP-K FOR CONCAVE COST FUNCTIONS 60 Algorithm 4

MAX AND TOP-K FOR CONCAVE COST FUNCTIONS 61 The following lemma gives the required bound for Theorem 7.2. Note that, to prove the theorem, it suffices to show that the max can be found in O(log n) rounds, where in each round ≤ n comparisons are performed. Since g is subadditive, g(n − 1) + g(1) ≥ g(n), and therefore OPT ≥ g(n − 1) ≥ g(n) − const.

MAX AND TOP-K FOR CONCAVE COST FUNCTIONS 62 Pippenger's algorithm. Algorithm 5 is based on Pippenger [1987], which uses a fixed-cost model and gives upper bounds on the number of comparisons for finding the max given a bound on the number of rounds. That paper proves the following theorem.

MAX AND TOP-K FOR CONCAVE COST FUNCTIONS 63 Algorithm 5. Algorithm 5 combines Pippenger's algorithm with the standard tournament algorithm (a balanced binary comparison tree) to find the max. The following lemma gives the required bound for Theorem 7.2.

MAX AND TOP-K FOR CONCAVE COST FUNCTIONS 64 Extension to top-k. The k-th largest element x_k can be found with an expected cost of O(log n) · OPT. The top-k elements can then be found by simply scanning the remaining n − 1 elements and keeping those greater than x_k.
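The scan step, sketched in Python (finding x_k is mocked by sorting here, as a stand-in for the paper's expected-cost O(log n) · OPT subroutine; values are assumed distinct):

```python
def top_k(values, k):
    """Find the top-k: locate the k-th largest x_k (mocked via sorting,
    a stand-in for the paper's subroutine), then scan the elements and
    keep those at least as large as x_k."""
    xk = sorted(values, reverse=True)[k - 1]   # stand-in for the subroutine
    return [v for v in values if v >= xk]
```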

MAX AND TOP-K FOR CONCAVE COST FUNCTIONS 65 Constant error model. Two main issues prevent a similar approximation for an arbitrary concave cost function g: (1) as argued earlier, payments must now be made level by level, for every batch of questions asked; (2) if each of N comparisons in a level is repeated M times, the repetitions must not be performed in the same batch, so that M independent answers are obtained.

MAX AND TOP-K FOR CONCAVE COST FUNCTIONS 66 Although MN comparisons are performed in that level, the oracle must be called M times with batches of size N, so the payment is M·g(N), which can be much larger than g(MN). The authors prove the following theorem.

Conclusion 67

68 Thank you for listening.