Correlation Clustering Nikhil Bansal Joint Work with Avrim Blum and Shuchi Chawla
Clustering Say we want to cluster n objects of some kind (documents, images, text strings) But we don’t have a meaningful way to project into Euclidean space. Idea of Cohen, Mc. Callum, Richman: use past data to train up f(x, y)=same/different. Then run f on all pairs and try to find most consistent clustering. 2
The problem (figure: records Harry Bovik, Harry B., H. Bovik, Tom X., with edges labeled +: same, -: different). Train up f(x, y) = same/different, then run f on all pairs. A clustering is totally consistent if + edges lie inside clusters and - edges go between clusters; an edge that violates this is a disagreement. Goal: find the most consistent clustering.
The problem. Given a complete graph on n vertices, with each edge labeled + or -, the goal is to find a partition of the vertices as consistent as possible with the edge labels: maximize #(agreements) or minimize #(disagreements). There is no k: the number of clusters could be anything.
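The objective is easy to state in code. The sketch below uses our own encoding (edge labels in a dict keyed by unordered vertex pairs, not anything from the talk) to count the disagreements of a candidate clustering:

```python
def disagreements(sign, clustering):
    """sign maps frozenset({u, v}) -> '+' or '-' for every pair;
    clustering maps each vertex to a cluster id.
    A '+' edge disagrees if its endpoints land in different clusters;
    a '-' edge disagrees if they land in the same cluster."""
    bad = 0
    for edge, s in sign.items():
        u, v = edge
        same = clustering[u] == clustering[v]
        if (s == '+') != same:
            bad += 1
    return bad
```

Maximizing agreements and minimizing disagreements have the same optimum, but approximating them behaves very differently, which is why the two results below use different techniques.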
The problem, two views. Noise removal: there is some true clustering, but some edges are incorrect; we still want to do well. Agnostic learning: there is no inherent clustering; try to find the best representation by a hypothesis class of limited representational power. E.g., research communities via a collaboration graph.
Our results: a constant-factor approximation for minimizing disagreements; a PTAS for maximizing agreements; results for the random-noise case.
PTAS for maximizing agreements. It is easy to get ½ of the edges right, so the goal is an additive approximation of εn². Standard approach: draw a small sample, guess a partition of the sample, compute the partition of the remainder from it. Can do this directly, or plug into the general property tester of [GGR]. Running time is doubly exponential in 1/ε, or singly exponential with a bad exponent.
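A toy rendition of the sampling scheme, in our own encoding: one small sample, brute force over all labelings of it, and each remaining vertex joining the sample cluster it agrees with most. The real PTAS uses sample sizes and repetition driven by ε; this sketch only illustrates the "guess partition of sample, compute partition of remainder" shape.

```python
import itertools
import random

def sample_and_extend(vertices, sign, k=3, seed=0):
    """sign maps frozenset({u, v}) -> '+' or '-' for every pair."""
    def cost(clu):  # total disagreements of a full clustering
        return sum(1 for e, s in sign.items()
                   if (s == '+') != (len({clu[x] for x in e}) == 1))
    sample = random.Random(seed).sample(sorted(vertices), min(k, len(vertices)))
    rest = [v for v in vertices if v not in sample]
    best, best_cost = None, float('inf')
    # Guess every labeling of the sample (crude stand-in for "guess partition").
    for labels in itertools.product(range(len(sample)), repeat=len(sample)):
        clu = dict(zip(sample, labels))
        for i, v in enumerate(rest):
            def agree(c):  # agreements with the sample if v joins cluster c
                return sum(1 for s in sample
                           if (clu[s] == c) == (sign[frozenset((v, s))] == '+'))
            choices = sorted(set(labels)) + [len(sample) + i]  # last: stay alone
            clu[v] = max(choices, key=agree)
        c = cost(clu)
        if c < best_cost:
            best, best_cost = dict(clu), c
    return best, best_cost
```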
Minimizing disagreements. Goal: get a constant-factor approximation. Problem: even if we could find a cluster as good as the best cluster of OPT, we'd be headed toward O(log n) [a set-cover-like analysis]. We need a way of lower-bounding D_OPT, the number of disagreements of the optimal clustering.
Lower-bounding idea: bad triangles. Consider a "bad triangle": two + edges and one - edge. Any clustering has to disagree with at least one of these three edges.
If there are several edge-disjoint bad triangles, a clustering must make a mistake on each one (figure: vertices 1-5 with edge-disjoint bad triangles (1, 2, 3) and (3, 4, 5)). Hence D_OPT ≥ #{edge-disjoint bad triangles}. (Not tight.) How can we use this?
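The bound is easy to compute greedily. This sketch (again in our pair-keyed encoding) collects edge-disjoint bad triangles in an arbitrary scan order, which already yields a valid, if not maximal, lower bound:

```python
from itertools import combinations

def bad_triangle_bound(vertices, sign):
    """Greedily collect edge-disjoint bad triangles (two '+' edges and
    one '-' edge); the count lower-bounds D_OPT, since every clustering
    disagrees with at least one edge of each such triangle.
    sign maps frozenset({u, v}) -> '+' or '-'."""
    used, count = set(), 0
    for tri in combinations(sorted(vertices), 3):
        edges = [frozenset(e) for e in combinations(tri, 2)]
        if any(e in used for e in edges):
            continue  # keep the collected triangles edge-disjoint
        if sorted(sign[e] for e in edges) == ['+', '+', '-']:
            used.update(edges)
            count += 1
    return count
```

On the five-vertex example from the slide, the scan finds (1, 2, 3) and (3, 4, 5), giving a bound of 2.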
δ-clean clusters. Given a clustering, call a vertex v δ-good if it has few disagreements with respect to its cluster C = C(v): |N−(v) ∩ C| < δ|C| and |N+(v) \ C| < δ|C| (+: similar, −: dissimilar). Essentially, N+(v) ≈ C(v). A cluster C is δ-clean if all v ∈ C are δ-good.
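The definition transcribes directly to code (our encoding; `C` is v's cluster as a set):

```python
def is_delta_good(v, C, vertices, sign, delta):
    """v is delta-good w.r.t. its cluster C if it has fewer than delta*|C|
    '-' neighbors inside C and fewer than delta*|C| '+' neighbors outside C.
    sign maps frozenset({u, v}) -> '+' or '-'."""
    minus_inside = sum(1 for u in C if u != v
                       and sign[frozenset((u, v))] == '-')
    plus_outside = sum(1 for u in vertices if u not in C
                       and sign[frozenset((u, v))] == '+')
    return minus_inside < delta * len(C) and plus_outside < delta * len(C)

def is_delta_clean(C, vertices, sign, delta):
    """A cluster is delta-clean if every one of its vertices is delta-good."""
    return all(is_delta_good(v, C, vertices, sign, delta) for v in C)
```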
Observation: any δ-clean clustering is an 8-approximation, for δ < 1/4. Idea: charge mistakes to bad triangles. Intuitively, for each wrong − edge (u, v) inside a cluster there are enough choices of a third vertex w (with + edges to both u and v) that we can find edge-disjoint bad triangles. A similar argument handles + edges between two clusters.
General structure of the argument. Since any δ-clean clustering is an 8-approximation for δ < 1/4, it would suffice to produce a δ-clean clustering for some δ < 1/4. Bad news: that may not be possible! Approach: 1) show there is a clustering OPT(δ) of a special form (clusters either δ-clean or singletons) with not many more mistakes than OPT; 2) produce something "close" to OPT(δ).
Existence of OPT(δ). Start from OPT (figure: clusters C1, C2) and identify the δ/3-bad vertices. 1) Move the δ/3-bad vertices out. 2) If "many" (a ≥ δ/3 fraction) of a cluster's vertices are δ/3-bad, split the cluster into singletons. The result OPT(δ) consists of singletons and δ-clean clusters, with D_OPT(δ) = O(1) · D_OPT.
Main result. The algorithm produces an 11δ-clean clustering, with the guarantee that the non-singleton clusters are 11δ-clean and the singletons are a subset of the singletons of OPT(δ). Choose δ = 1/44, so the non-singletons are 1/4-clean. Mistakes among non-singletons are bounded by edge-disjoint bad triangles; mistakes involving singletons, by those of OPT(δ). Approximation ratio: 9/δ² + 8.
Open problems. How about a small constant? Is D_OPT ≤ 2 · #{edge-disjoint bad triangles}? Is D_OPT ≤ 2 · #{fractional edge-disjoint bad triangles}? Extend to {−1, 0, +1} weights: approximation to within a constant, or even a log factor? Extend to the weighted case? Related: the clique partitioning problem and cutting-plane algorithms [Grötschel et al.].
Nice features of the formulation: There is no k (OPT can have anywhere from 1 to n clusters). If a perfect solution exists, it is easy to find: take C(v) = N+(v). It is easy to get agreement on ½ of the edges.
Algorithm: 1. Pick a vertex v and let C(v) = N+(v). 2. Modify C(v): (a) remove 3δ-bad vertices from C(v); (b) add 7δ-good vertices into C(v). 3. Output C(v) as a cluster, delete it, and repeat, until done or until the above only produces empty clusters. 4. Output the nodes left as singletons.
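The four steps transcribe to the sketch below. The thresholds (3δ-bad removal, 7δ-good addition, δ = 1/44) follow the slides, but the encoding, the scan order for picking v, and the handling of empty clusters are our own simplifications, not the paper's exact statement.

```python
def cautious_cluster(vertices, sign, delta=1/44):
    """sign maps frozenset({u, v}) -> '+' or '-'; returns a list of clusters."""
    def b_delta_good(x, C, active, b):
        # x is (b*delta)-good w.r.t. C: few '-' neighbors inside C and
        # few '+' neighbors outside C, among the still-active vertices.
        minus_in = sum(1 for u in C if u != x
                       and sign[frozenset((u, x))] == '-')
        plus_out = sum(1 for u in active if u not in C and u != x
                       and sign[frozenset((u, x))] == '+')
        return minus_in < b * delta * len(C) and plus_out < b * delta * len(C)

    active, clusters = set(vertices), []
    while active:
        made = False
        for v in sorted(active):                       # step 1: pick v, C = N+(v)
            C = {v} | {u for u in active if u != v
                       and sign[frozenset((u, v))] == '+'}
            C -= {x for x in C if not b_delta_good(x, C, active, 3)}      # 2(a)
            C |= {y for y in active - C if b_delta_good(y, C, active, 7)}  # 2(b)
            if C:
                clusters.append(C)                     # step 3: delete C, repeat
                active -= C
                made = True
                break
        if not made:
            break                                      # only empty clusters left
    clusters.extend({u} for u in active)               # step 4: singletons
    return clusters
```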
Step 1: choose v, and let C = the + neighbors of v (figure: candidate cluster C against OPT clusters C1 and C2).
Step 2, vertex removal phase: if x is 3δ-bad, set C = C − {x}. Claims: 1) no vertex in C1 is removed; 2) all vertices in C2 are removed.
Step 3, vertex addition phase: add 7δ-good vertices to C. Claims: 1) all remaining vertices of C1 will be added; 2) none in C2 are added; 3) the resulting cluster C is 11δ-clean.
Case 2: v is a singleton in OPT(δ). Choose v and C = the + neighbors of v; the same idea works.