Correlation Clustering Nikhil Bansal Joint Work with Avrim Blum and Shuchi Chawla

Introduction

Previous Approaches
Documents mapped to points:
         Word 1  Word 2  Word 3
Doc 1      1       0       1
Doc 2      0       1       1
Doc 1 -> (1, 0, 1)
Clustering then works with the distances among the points.
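
As a small illustration of the mapping above, the sketch below turns the two table rows into vectors and measures the distance between them. The slide does not say which distance is used, so the Euclidean metric here is an assumption.

```python
import math

# Word-presence vectors taken from the table: one coordinate per word.
docs = {
    "Doc 1": (1, 0, 1),
    "Doc 2": (0, 1, 1),
}

def euclidean(p, q):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

distance = euclidean(docs["Doc 1"], docs["Doc 2"])
print(distance)
```

The two documents differ in Word 1 and Word 2 and share Word 3, so only two coordinates contribute to the distance.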

Previous Approaches
k-min clustering, k-min sum, k-median, …
Approximation algorithms, matrix methods, AI techniques, …
k-min clustering: minimize the maximum cluster diameter
k-min sum: minimize the sum of distances within clusters
(Figure: example clustering with k = 3.)
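
The two objectives named above can be evaluated for any fixed clustering. The sketch below is only an evaluation of the objectives, not an optimization algorithm; the point coordinates and function names are made up for illustration.

```python
import math
from itertools import combinations

def max_diameter(clusters):
    """Largest distance between two points that share a cluster."""
    return max(
        (math.dist(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )

def within_cluster_sum(clusters):
    """Sum of distances over all pairs of points that share a cluster."""
    return sum(math.dist(p, q) for c in clusters for p, q in combinations(c, 2))

# Hypothetical points grouped into k = 3 clusters.
clusters = [[(0, 0), (0, 1)], [(5, 5), (5, 7)], [(9, 0)]]
singletons = [[p] for c in clusters for p in c]

print(max_diameter(clusters), within_cluster_sum(clusters))
print(max_diameter(singletons), within_cluster_sum(singletons))
```

Putting every point in its own cluster drives both objectives to zero, which is why these formulations need k to be fixed in advance.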

Some Limitations
1) Have to specify "k"
If k is not restricted, it is best to put each vertex in its own cluster, since the within-cluster distance objectives then drop to zero.

Some Limitations
2) Restrictions on edge weights
Edge weights must form a metric.

Some Limitations
3) No clean notion of clustering quality
E.g., minimize the sum of distances within clusters: what really is my cluster quality?

Outline
Introduction
Our Approach + Problem Formulation
Approximating Agreements
Approximating Disagreements
Conclusion

Our Approach
Classifier: takes 2 documents and returns a weight W in [-1, +1] indicating their similarity
+1: Similar
-1: Dissimilar
In this talk, W = -1 or +1.
Our Goal: Find a clustering which agrees with this labeling.

A Disagreement
+1: Similar, -1: Dissimilar
Disagreement: a -1 edge within a cluster, or a +1 edge crossing between clusters
(Figure: an example clustering in which 2 edges have disagreements.)
Our Goal: Minimize the number of disagreements.
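
The definition of a disagreement translates directly into a counting routine. The sketch below assumes a hypothetical input format: edge labels in a dict keyed by `frozenset({u, v})`, and a clustering given as a dict from vertex to cluster id. The example graph is invented for illustration and is not the one in the figure.

```python
def count_disagreements(labels, cluster_of):
    """Count -1 edges inside a cluster plus +1 edges between clusters."""
    bad = 0
    for edge, w in labels.items():
        u, v = tuple(edge)
        same = cluster_of[u] == cluster_of[v]
        if (w == -1 and same) or (w == +1 and not same):
            bad += 1
    return bad

# Hypothetical 4-vertex instance with every pair labelled.
labels = {
    frozenset(("a", "b")): +1,
    frozenset(("c", "d")): -1,
    frozenset(("a", "c")): -1,
    frozenset(("a", "d")): -1,
    frozenset(("b", "c")): +1,
    frozenset(("b", "d")): -1,
}
clustering = {"a": 0, "b": 0, "c": 1, "d": 1}

print(count_disagreements(labels, clustering))
```

Here the -1 edge c-d sits inside a cluster and the +1 edge b-c crosses clusters, so this clustering has two disagreements.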

Comparison:
1) Clean notion of clustering quality
Quality = number of disagreements

Comparison:
2) Do not have to specify "k"
k is determined by the edge labels.

Comparison:
3) Arbitrary edge weights
No metric. No dependence.

A Closer Look
Goal: Given a graph with +1, -1 edges, cluster to minimize disagreements.
Question: Can we always avoid disagreements?
Answer: No. In a triangle with edges +1, +1, -1, any clustering has at least 1 disagreement.

Minimizing Disagreements
Triangle with edges +1, +1, -1: 1 disagreement.
Minimizing disagreements is NP-hard, so we will look for approximation algorithms.
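
The triangle claim can be checked by brute force. The sketch below enumerates every clustering of three vertices and takes the minimum number of disagreements; the vertex names and the dict-of-frozensets label format are assumptions made for the example. Exhaustive search like this is exponential and only sensible for tiny instances.

```python
from itertools import product

vertices = ["a", "b", "c"]
labels = {
    frozenset(("a", "b")): +1,
    frozenset(("b", "c")): +1,
    frozenset(("a", "c")): -1,
}

def count_disagreements(labels, cluster_of):
    """An edge disagrees when 'label is +1' and 'same cluster' differ."""
    bad = 0
    for edge, w in labels.items():
        u, v = tuple(edge)
        same = cluster_of[u] == cluster_of[v]
        if (w == +1) != same:
            bad += 1
    return bad

# Assigning ids 0..n-1 to each vertex covers every partition (with repeats).
best = min(
    count_disagreements(labels, dict(zip(vertices, ids)))
    for ids in product(range(len(vertices)), repeat=len(vertices))
)
print(best)
```

Zero is impossible because the two +1 edges force all three vertices together while the -1 edge forces a and c apart.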

Agreements vs. Disagreements
Observation: Agreements + Disagreements = total number of edges (n(n-1)/2 on a complete graph)
Minimizing disagreements ⇔ Maximizing agreements
Very different in terms of approximation:
Opt: 1 disagreement; We: n disagreements
Disagreements: ratio n
Agreements: ratio ≈ 1

Maximizing Agreements
A 2-approximation is easy.
Algorithm: If #(+1 edges) > #(-1 edges), put all points in a single cluster. Else, put each point in its own cluster.
Proof: Opt's agreements are at most the total number of edges; we agree on at least half of the edges.
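
A minimal sketch of this majority rule, assuming the same hypothetical input format as elsewhere (labels in a dict keyed by `frozenset({u, v})`); the function names and the example graph are illustrative.

```python
def count_agreements(labels, cluster_of):
    """Count +1 edges inside a cluster plus -1 edges between clusters."""
    good = 0
    for edge, w in labels.items():
        u, v = tuple(edge)
        same = cluster_of[u] == cluster_of[v]
        if (w == +1) == same:
            good += 1
    return good

def majority_clustering(vertices, labels):
    """One big cluster if +1 edges are the majority, else all singletons."""
    plus = sum(1 for w in labels.values() if w == +1)
    minus = len(labels) - plus
    if plus > minus:
        return {v: 0 for v in vertices}
    return {v: i for i, v in enumerate(vertices)}

# Hypothetical instance: 2 positive edges, 4 negative edges.
vertices = [0, 1, 2, 3]
labels = {
    frozenset((0, 1)): +1, frozenset((0, 2)): -1, frozenset((0, 3)): -1,
    frozenset((1, 2)): -1, frozenset((1, 3)): +1, frozenset((2, 3)): -1,
}
clustering = majority_clustering(vertices, labels)
agreements = count_agreements(labels, clustering)
print(agreements, "of", len(labels))
```

A single cluster agrees with exactly the +1 edges and all singletons agree with exactly the -1 edges, so the better of the two always reaches at least half of all edges.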

Our Result
A PTAS for max. agreements: (1+ε)-approximation, Time = n^O(poly(1/ε))

Our Result
An O(1)-approximation for minimizing disagreements

Approximation for Disagreements
To prove: Dalg ≤ c · Dopt
Dalg: our disagreements
Dopt: Opt's disagreements
Roadmap:
1) Notation
2) Show existence of Opt(δ)
3) Describe the algorithm
4) Show our clustering is close to Opt(δ)

Notation
+1: Similar, -1: Dissimilar
Given a clustering, a vertex v in cluster C is δ-good if it has few disagreements:
Within C: fewer than δ|C|
Outside C: fewer than δ|C|
A δ-bad vertex has ≥ δ|C| disagreements.
Cluster C is δ-clean if all v ∈ C are δ-good.

Existence of Opt(δ)
Main Idea: Opt -> Opt(δ):
1) All "non-singleton" clusters are δ-clean
2) At most a constant factor worse than Opt: Dopt(δ) = O(1/δ²) · Dopt

Transforming OPT to OPT(δ)
An imaginary procedure applied to Opt.
(Figure: the optimum clustering, with clusters C1 and C2.)

Transforming OPT to OPT(δ)
Identify the δ/3-bad vertices.
(Figure: the optimum clustering, with the δ/3-bad vertices of C1 and C2 marked.)

Transforming OPT to OPT(δ)
1) Move the δ/3-bad vertices out of their clusters
2) If "many" (≥ δ/3 fraction) of a cluster's vertices are δ/3-bad, "split" the cluster into singletons
(Figure: going from the optimum clustering to OPT(δ), C1 is split, while in C2 only the bad vertices move out.)

Transforming OPT to OPT(δ)
Disagreements of OPT(δ):
Split (C1): earlier ≥ (δ/3)²|C1|² disagreements; splitting adds ≤ |C1|²/2
Do not split (C2): earlier, each moved vertex had ≥ (δ/3)|C2| disagreements; moving adds ≤ |C2| for each
So total disagreements increase by a factor of O(1/δ²).
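
A rough version of this accounting, using only the bounds stated on the slide. The constants are illustrative: double counting of edges between two bad vertices is ignored, and δ ≤ 1 is assumed.

```latex
% Split case (cluster C_1): added disagreements versus those already present.
\[
  \frac{\text{added}}{\text{earlier}}
  \;\le\; \frac{|C_1|^2/2}{(\delta/3)^2\,|C_1|^2}
  \;=\; \frac{9}{2\delta^2}
\]
% Non-split case (cluster C_2): the same comparison for each moved vertex.
\[
  \frac{\text{added}}{\text{earlier}}
  \;\le\; \frac{|C_2|}{(\delta/3)\,|C_2|}
  \;=\; \frac{3}{\delta}
\]
% Both ratios are O(1/\delta^2) when \delta \le 1, which gives
\[
  D_{\mathrm{opt}(\delta)} \;=\; O\!\left(1/\delta^2\right) D_{\mathrm{opt}}
\]
```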

Transforming OPT to OPT(δ)
"Non-singleton" clusters are δ-clean: a vertex that was δ/3-good earlier is still δ-good in OPT(δ).

Main Result
Opt(δ): δ-clean
Clustering produced by the algorithm: 11δ-clean

The algorithm
Input: Graph G
Output: A clustering of G
1) Pick an arbitrary v ∈ G; let C = +1 neighbors of v
2) Vertex Removal Phase: remove bad vertices from C
3) Vertex Addition Phase: add good vertices to C
4) Repeat on G - C
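
The four steps can be sketched as follows. This is a sketch under stated assumptions rather than the authors' exact procedure: it borrows the 3δ and 7δ thresholds from the step-by-step slides, treats v as one of its own +1 neighbors, always takes the smallest remaining vertex as the "arbitrary" v, and makes all remaining vertices singletons once no v yields a non-empty cluster. All function names are hypothetical.

```python
from itertools import combinations

def is_good(v, cluster, remaining, labels, delta):
    """delta-good test for v with respect to `cluster` (v counts as its own neighbor)."""
    n_plus = {v} | {u for u in remaining
                    if u != v and labels[frozenset((u, v))] == +1}
    inside = len(n_plus & cluster)
    outside = len(n_plus - cluster)
    return inside > (1 - delta) * len(cluster) and outside < delta * len(cluster)

def grow_cluster(v, remaining, labels, delta):
    # Step 1: start from v and its +1 neighbors.
    cluster = {v} | {u for u in remaining
                     if u != v and labels[frozenset((u, v))] == +1}
    # Step 2: remove 3*delta-bad vertices one at a time.
    while True:
        bad = [x for x in cluster
               if not is_good(x, cluster, remaining, labels, 3 * delta)]
        if not bad:
            break
        cluster.remove(bad[0])
    # Step 3: add every vertex that is 7*delta-good with respect to the cluster.
    extra = {y for y in remaining
             if is_good(y, cluster, remaining, labels, 7 * delta)}
    return cluster | extra

def cluster_graph(vertices, labels, delta):
    remaining = set(vertices)
    clusters = []
    while remaining:
        for v in sorted(remaining):
            c = grow_cluster(v, remaining, labels, delta)
            if c:
                break
        else:
            # Assumption: no start vertex works, so the rest become singletons.
            clusters.extend({u} for u in remaining)
            break
        clusters.append(c)
        remaining -= c  # Step 4: repeat on the rest of the graph.
    return clusters

# Hypothetical instance: two cliques of 4, +1 inside and -1 across.
vertices = range(8)
labels = {frozenset((u, v)): (+1 if u // 4 == v // 4 else -1)
          for u, v in combinations(vertices, 2)}
clusters = cluster_graph(vertices, labels, delta=1 / 44)
print(clusters)
```

On this noise-free input the removal phase deletes nothing and the addition phase adds nothing new, so the two cliques come back as the two clusters.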

Step 1
Choose v; C = +1 neighbors of v
(Figure: v and its neighborhood C drawn over clusters C1 and C2.)

Step 2
Vertex Removal Phase: if x is 3δ-bad, C = C - {x}
1) No vertex in C1 is removed
2) All vertices in C2 are removed

Step 3
Vertex Addition Phase: add 7δ-good vertices to C
1) All remaining vertices in C1 will be added
2) None in C2 are added
3) Cluster C is 11δ-clean

Case 2: v is a singleton in OPT(δ)
Choose v; C = +1 neighbors of v
The same idea works.

Main Result
Opt(δ): δ-clean
Algorithm: 11δ-clean

Our Disagreements
Disagreements:
1) Involving singletons
2) In non-singletons
The algorithm's clusters are 11δ-clean.
Type 1 ≤ Dopt(δ)
(Figure: clusters C1 and C2 of Opt(δ) compared with the algorithm's clusters.)

Disagreements in Non-Singletons
Lemma: If δ < 1/4, disagreements in δ-clean clusterings are ≤ 8 · Dopt
Erroneous triangle: a triangle with edges +1, +1, -1
Disagreements of OPT ≥ number of edge-disjoint erroneous triangles
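
The lower bound from erroneous triangles can be illustrated with a greedy packing: every clustering has at least one disagreement on each erroneous triangle, so any set of edge-disjoint ones lower-bounds Opt's disagreements. The sketch below is not the counting argument used in the proof; it just collects triangles greedily, and the function names and example graph are assumptions.

```python
from itertools import combinations

def erroneous_triangles(vertices, labels):
    """Yield vertex triples whose edges are labelled +1, +1, -1 in some order."""
    for a, b, c in combinations(vertices, 3):
        signs = [labels[frozenset(e)] for e in ((a, b), (b, c), (a, c))]
        if signs.count(+1) == 2 and signs.count(-1) == 1:
            yield (a, b, c)

def greedy_disjoint_lower_bound(vertices, labels):
    """Greedily pack edge-disjoint erroneous triangles and return their count."""
    used = set()
    count = 0
    for a, b, c in erroneous_triangles(vertices, labels):
        edges = {frozenset((a, b)), frozenset((b, c)), frozenset((a, c))}
        if used.isdisjoint(edges):
            used |= edges
            count += 1
    return count

# Hypothetical instance: 4 vertices, all pairs +1 except the pair (0, 1).
vertices = [0, 1, 2, 3]
labels = {frozenset(p): +1 for p in combinations(vertices, 2)}
labels[frozenset((0, 1))] = -1

print(list(erroneous_triangles(vertices, labels)))
print(greedy_disjoint_lower_bound(vertices, labels))
```

Both erroneous triangles here share the -1 edge, so only one can be packed; the bound of 1 is matched by putting all four vertices in a single cluster.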

Disagreements in Non-Singletons
Lemma: If δ < 1/4, errors in δ-clean clusterings are ≤ 8 · Dopt
Proof idea: for each disagreement, find an edge-disjoint erroneous triangle.
For a -1 edge inside a δ-clean cluster C, there are lots of candidate triangles (≥ ½|C|), so they cannot all be used up.

Disagreements in Non-Singletons
Lemma: If δ < 1/4, disagreements in δ-clean clusterings are ≤ 8 · Dopt
For a +1 edge between two δ-clean clusters, an identical argument works.

Putting it all together
Choose δ < 1/44, so that the algorithm's 11δ-clean clusters satisfy 11δ < 1/4 and the lemma applies.
1) Errors involving singletons = O(Dopt(δ)) = O(Dopt)
2) Errors in non-singleton clusters = O(Dopt)
Total errors = O(Dopt)
Thus we have an O(1) approximation!

Conclusion
Defined the notion of correlation clustering
Natural measures of quality: agreements and disagreements
Obtained provably good algorithms for these

Future Work
Performance in practice
More heuristics
Improve theoretical bounds
More notions of clustering which have provably good algorithms

Thank You!
Paper: http://www.cs.cmu.edu/~nikhil
Comments: nikhil@cs.cmu.edu

Notation: δ-good vertex
Let N+(v) = positive neighbors of v
Given a clustering, a vertex v in cluster C is δ-good if
|N+(v) ∩ C| > (1 - δ)|C|
|N+(v) \ C| < δ|C|
Cluster C is δ-clean if all v ∈ C are δ-good.
A δ-bad vertex has ≥ δ|C| disagreements.
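
The two inequalities can be checked directly. The sketch below assumes that N+(v) contains v itself (otherwise no vertex of a small, perfectly consistent cluster could pass the first test) and that labels are stored in a dict keyed by `frozenset({u, v})`; the function names are hypothetical.

```python
from itertools import combinations

def positive_neighborhood(v, vertices, labels):
    """N+(v): v together with every vertex joined to v by a +1 edge (assumption)."""
    return {v} | {u for u in vertices
                  if u != v and labels[frozenset((u, v))] == +1}

def is_good(v, cluster, vertices, labels, delta):
    n_plus = positive_neighborhood(v, vertices, labels)
    inside = len(n_plus & cluster)    # +1 neighbors inside the cluster
    outside = len(n_plus - cluster)   # +1 neighbors outside the cluster
    return inside > (1 - delta) * len(cluster) and outside < delta * len(cluster)

def is_clean(cluster, vertices, labels, delta):
    return all(is_good(v, cluster, vertices, labels, delta) for v in cluster)

# Hypothetical instance: two cliques of 4, +1 inside and -1 across.
vertices = range(8)
labels = {frozenset((u, v)): (+1 if u // 4 == v // 4 else -1)
          for u, v in combinations(vertices, 2)}
cluster = {0, 1, 2, 3}
clean_before = is_clean(cluster, vertices, labels, 0.2)

# Flip one crossing edge to +1: vertex 0 now has a +1 neighbor outside.
labels[frozenset((0, 4))] = +1
clean_after = is_clean(cluster, vertices, labels, 0.2)
print(clean_before, clean_after)
```

With δ = 0.2 and |C| = 4, a single stray +1 edge already exceeds the allowance δ|C| = 0.8, so vertex 0 becomes δ-bad and the cluster stops being δ-clean.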

Recap: Opt -> Opt(δ)
Opt(δ) has δ-clean clusters.

The algorithm
Input: Graph G
Output: A cluster L
1) Pick an arbitrary v ∈ G; let L = N+(v)
2) Vertex Removal Phase
3) Vertex Addition Phase: Y = {y | y is 7δ-good wrt L}
4) Return L = L ∪ Y and repeat on G - L

Step 2
Vertex Removal Phase: if x is 3δ-bad, L = L - {x}
Subcase 1) All vertices except those in C1 are removed
Important observation: there must be a lot of vertices from C1 in L = N+(v)

Step 3
Vertex Addition Phase: add 7δ-good vertices wrt L
1) All remaining vertices in C1 will be added
Can show again that the clustering is 11δ-clean

Step 3
Vertex Addition Phase: add 7δ-good vertices wrt L
1) All in C1 are added
2) None in C2 are added
3) Not many singletons are added either
⇒ Our cluster is 11δ-clean

Case 2:
Vertex Removal Phase: if x is 3δ-bad, L = L - {x}
Subcase 1: All vertices in C1 and C2 are removed
Can show that none from C1 and C2 will be added in the Addition Phase either

Case 2: v is a singleton in Opt(δ)
(Figure: v, L = N+(v), and clusters C1 and C2.)

Comparing Disagreements
Disagreements:
1) Involving singletons
2) In non-singletons
The algorithm's clusters are 11δ-clean.