CSCI B 609 Foundations of Data Science Lecture
CSCI B 609: “Foundations of Data Science” Lecture 24: Correlation Clustering • Slides at http://grigory.us/data-science-class-f17.html • With Shuchi Chawla (University of Wisconsin, Madison), Konstantin Makarychev (Microsoft Research), Tselil Schramm (University of California, Berkeley)
Correlation Clustering • Inspired by machine learning at • Practice: [Cohen, McCallum ’01; Cohen, Richman ’02] • Theory: [Bansal, Blum, Chawla ’04]
Correlation Clustering: Example • Minimize # of incorrectly classified pairs: # covered non-edges + # non-covered edges • 4 incorrectly classified = 1 covered non-edge + 3 non-covered edges • Min-CSP, but # labels is unbounded
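The objective above can be sketched in a few lines of Python. This is a minimal illustration, assuming a pair is "covered" when both endpoints land in the same cluster; the function name `count_mistakes`, the vertex numbering 0..n-1, and the small example graph are hypothetical and are not the graph drawn on the slide.

```python
from itertools import combinations

def count_mistakes(n, edges, cluster_of):
    """Return (# covered non-edges, # non-covered edges) for vertices 0..n-1.

    edges      -- iterable of (u, v) pairs marked "similar"
    cluster_of -- cluster_of[v] is the cluster label of vertex v
    """
    edge_set = {frozenset(e) for e in edges}
    covered_non_edges = 0
    non_covered_edges = 0
    for u, v in combinations(range(n), 2):
        same_cluster = cluster_of[u] == cluster_of[v]
        is_edge = frozenset((u, v)) in edge_set
        if same_cluster and not is_edge:
            covered_non_edges += 1      # dissimilar pair placed together
        elif not same_cluster and is_edge:
            non_covered_edges += 1      # similar pair split apart
    return covered_non_edges, non_covered_edges

# Hypothetical example: path 0-1-2-3 clustered as {0, 1, 2}, {3}.
path_edges = [(0, 1), (1, 2), (2, 3)]
print(count_mistakes(4, path_edges, [0, 0, 0, 1]))
```

The number of labels is not fixed in advance: any list of labels is a valid clustering, which is what makes the problem a Min-CSP with an unbounded label set.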
Approximating Correlation Clustering •
Correlation Clustering • One of the most successful clustering methods: • Only uses qualitative information about similarities • # of clusters unspecified (selected to best fit data) • Applications: document/image deduplication (data from crowds or black-box machine learning) • NP-hard [Bansal, Blum, Chawla ’04], admits simple approximation algorithms with good provable guarantees • Agnostic learning problem
Correlation Clustering • More: • Survey [Wirth] • KDD’14 tutorial: “Correlation Clustering: From Theory to Practice” [Bonchi, Garcia-Soriano, Liberty] http://francescobonchi.com/CCtuto_kdd14.pdf • Wikipedia article: http://en.wikipedia.org/wiki/Correlation_clustering
Data-Based Randomized Pivoting •
Data-Based Randomized Pivoting • 8 incorrectly classified = 2 covered non-edges + 6 non-covered edges
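The algorithm description on these slides did not survive extraction. The sketch below assumes they refer to the standard randomized pivoting scheme of Ailon, Charikar, and Newman: pick a uniformly random unclustered vertex as the pivot, form a cluster from the pivot and its unclustered neighbors, and repeat. The name `pivot_clustering` and the example graph are hypothetical.

```python
import random

def pivot_clustering(vertices, edges, rng=None):
    """Randomized pivoting: cluster each random pivot with its remaining neighbors."""
    rng = rng or random.Random()
    neighbors = {v: set() for v in vertices}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    remaining = set(vertices)
    clusters = []
    while remaining:
        # sorted() gives a reproducible order for seeded runs
        pivot = rng.choice(sorted(remaining))
        cluster = {pivot} | (neighbors[pivot] & remaining)
        clusters.append(cluster)
        remaining -= cluster
    return clusters

# Hypothetical example: a triangle {0, 1, 2} plus a separate edge {3, 4}.
example_edges = [(0, 1), (1, 2), (0, 2), (3, 4)]
print(pivot_clustering(range(5), example_edges, random.Random(0)))
```

When the graph is a disjoint union of cliques, every choice of pivots recovers the cliques exactly; mistakes can only arise where the graph contains bad triangles.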
Data-Based Randomized Pivoting • Def: “Bad triangle” = triple of vertices with exactly two edges among its three pairs • Any clustering makes at least one mistake on the pairs of a bad triangle
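To see why a bad triangle forces a mistake: if both of its edges are covered, all three vertices share a cluster, so the non-edge is covered too; otherwise at least one edge is non-covered. The sketch below enumerates bad triangles by brute force. The name `bad_triangles` and the example graph are hypothetical, and the cubic-time enumeration is meant only for small illustrations.

```python
from itertools import combinations

def bad_triangles(n, edges):
    """List all vertex triples of 0..n-1 with exactly two edges among their three pairs."""
    edge_set = {frozenset(e) for e in edges}
    result = []
    for triple in combinations(range(n), 3):
        num_edges = sum(frozenset(pair) in edge_set
                        for pair in combinations(triple, 2))
        if num_edges == 2:
            result.append(triple)
    return result

# Hypothetical example: the path 0-1-2-3 has two bad triangles.
print(bad_triangles(4, [(0, 1), (1, 2), (2, 3)]))  # [(0, 1, 2), (1, 2, 3)]
```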
Linear Program for Bad Triangles •
Analysis •
Integer Program
Linear Program •
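The formulas on the program slides did not survive extraction. For reference, the standard formulation from the correlation clustering literature is given below; its notation is an assumption and may differ from the slide. Here $x_{uv} = 0$ means $u$ and $v$ share a cluster and $x_{uv} = 1$ means they are separated, and $E$ is the set of edges (similar pairs).

```latex
\begin{align*}
\text{minimize}   \quad & \sum_{(u,v) \in E} x_{uv} \;+\; \sum_{(u,v) \notin E} (1 - x_{uv}) \\
\text{subject to} \quad & x_{uv} \le x_{uw} + x_{wv} && \text{for all } u, v, w \\
                        & x_{uv} \in \{0, 1\}        && \text{(integer program)} \\
                        & x_{uv} \in [0, 1]          && \text{(linear programming relaxation)}
\end{align*}
```

The triangle inequality encodes transitivity: if $u, w$ and $w, v$ are each in a common cluster, then so are $u, v$.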
Integrality Gap • … …
• Can the LP be rounded optimally?
LP-based Pivoting Algorithm [ACN] •
LP-based Pivoting Algorithm [ACN] … …
LP-based Pivoting Algorithm … …
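The details of these slides were lost in extraction. The sketch below assumes the LP-based pivoting of Ailon, Charikar, and Newman [ACN]: given a fractional LP solution x (with x = 0 meaning "same cluster"), pick a random pivot and add every remaining vertex u to its cluster independently with probability 1 - x[pivot, u]. The name `lp_pivot_clustering` and the dictionary keyed by `frozenset` pairs are hypothetical choices for illustration.

```python
import random

def lp_pivot_clustering(vertices, x, rng=None):
    """LP-based pivoting: u joins the pivot's cluster with probability 1 - x[{pivot, u}].

    x -- dict mapping frozenset({u, v}) to an LP distance in [0, 1]
    """
    rng = rng or random.Random()
    remaining = set(vertices)
    clusters = []
    while remaining:
        pivot = rng.choice(sorted(remaining))
        cluster = {pivot}
        for u in remaining - {pivot}:
            # random() lies in [0, 1), so x = 0 always joins and x = 1 never joins
            if rng.random() < 1.0 - x[frozenset((pivot, u))]:
                cluster.add(u)
        clusters.append(cluster)
        remaining -= cluster
    return clusters

# Hypothetical integral solution encoding the clusters {0, 1} and {2}.
x_example = {frozenset((0, 1)): 0.0, frozenset((0, 2)): 1.0, frozenset((1, 2)): 1.0}
print(lp_pivot_clustering(range(3), x_example, random.Random(0)))
```

On an integral solution the rounding is deterministic and returns the encoded clustering; randomness matters only for fractional values of x.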
Our (Data + LP)-Based Pivoting • {
Our (Data + LP)-Based Pivoting • [Figure: plot with both axes ranging from 0 to 1]
Analysis •
Analysis •
Triangle-Based Analysis: Algorithm • {
Triangle-Based Analysis: LP • {
Triangle-Based Analysis •
Triangle-Based Analysis •
Our Results: Complete Graphs • 2.06-approximation for complete graphs • Can be derandomized (previous: [Hegde, Jain, Williamson, van Zuylen ’08]) • Also works for real weights satisfying probability constraints
Our Results: Triangle Inequalities •
Thanks! •
- Slides: 30