Information Bottleneck presented by Boris Epshtein & Lena Gorelick Advanced Topics in Computer and Human Vision Spring 2004
Agenda • Motivation • Information Theory - Basic Definitions • Rate Distortion Theory – Blahut-Arimoto algorithm • Information Bottleneck Principle • IB algorithms – i. IB – d. IB – a. IB • Application
Motivation Clustering Problem
Motivation • “Hard” Clustering – partitioning of the input data into several exhaustive and mutually exclusive clusters • Each cluster is represented by a centroid
Motivation • “Good” clustering – should group similar data points together and dissimilar points apart • Quality of partition – average distortion between the data points and corresponding representatives (cluster centroids)
Motivation • “Soft” Clustering – each data point is assigned to all clusters with some normalized probability • Goal – minimize expected distortion between the data points and cluster centroids
Motivation… Complexity-Precision Trade-off • Too simple a model – poor precision • Higher precision requires a more complex model • Too complex a model – overfitting
Motivation… Complexity-Precision Trade-off • Too complex a model – can lead to overfitting (poor generalization) – is hard to learn • Too simple a model – cannot capture the real structure of the data • Examples of approaches: – SRM Structural Risk Minimization – MDL Minimum Description Length – Rate Distortion Theory
Definitions… Entropy • The measure of uncertainty about the random variable X: H(X) = -Σ_x p(x) log p(x)
Definitions… Entropy - Example – Fair Coin: p(heads) = p(tails) = 1/2, H(X) = 1 bit – Unfair Coin: e.g. p(heads) = 0.9, p(tails) = 0.1, H(X) ≈ 0.47 bit
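The coin numbers can be verified directly; a minimal sketch (the `entropy` helper and the 0.9/0.1 bias are our illustration, not from the slides):

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

fair = [0.5, 0.5]      # fair coin: maximal uncertainty
unfair = [0.9, 0.1]    # unfair coin: outcome is easier to guess

print(entropy(fair))   # 1.0 bit
print(entropy(unfair)) # ~0.469 bit
```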
Definitions… Entropy - Illustration (plot of H vs. p for a binary variable: highest, 1 bit, at p = 1/2; lowest, 0, at p = 0 or p = 1)
Definitions… Conditional Entropy • The measure of uncertainty about the random variable X given the value of the variable Y: H(X|Y) = -Σ_{x,y} p(x,y) log p(x|y)
Definitions… Conditional Entropy Example
Definitions… Mutual Information • The reduction in uncertainty of X due to the knowledge of Y: I(X;Y) = H(X) - H(X|Y) = Σ_{x,y} p(x,y) log [ p(x,y) / (p(x)p(y)) ] – Nonnegative – Symmetric – Convex w.r.t. p(y|x) for a fixed p(x)
Definitions… Mutual Information - Example
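A mutual-information computation can be reproduced on a small joint distribution; a sketch with a hypothetical joint table of our own choosing:

```python
import math

def mutual_information(p_xy):
    """I(X;Y) in bits from a joint distribution given as a nested list."""
    px = [sum(row) for row in p_xy]
    py = [sum(col) for col in zip(*p_xy)]
    return sum(p * math.log2(p / (px[i] * py[j]))
               for i, row in enumerate(p_xy)
               for j, p in enumerate(row) if p > 0)

# Hypothetical joint distribution of two correlated binary variables
p_xy = [[0.4, 0.1],
        [0.1, 0.4]]
print(mutual_information(p_xy))    # positive: knowing X reduces uncertainty about Y

# Independent variables carry zero mutual information
p_indep = [[0.25, 0.25],
           [0.25, 0.25]]
print(mutual_information(p_indep)) # 0.0
```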
Definitions… Kullback-Leibler Distance • A distance between distributions p and q over the same alphabet: D_KL(p||q) = Σ_x p(x) log [ p(x) / q(x) ] – Nonnegative – Asymmetric: D_KL(p||q) ≠ D_KL(q||p) in general
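A quick numeric illustration of nonnegativity and asymmetry (the distributions `p` and `q` are our own toy choice):

```python
import math

def kl(p, q):
    """Kullback-Leibler distance D(p||q) in bits, over the same alphabet."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]
print(kl(p, q))  # != kl(q, p): the "distance" is asymmetric
print(kl(q, p))
print(kl(p, p))  # 0.0: nonnegative, zero only when the distributions match
```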
Rate Distortion Theory Introduction • Goal: obtain compact clustering of the data with minimal expected distortion • Distortion measure is a part of the problem setup • The clustering and its quality depend on the choice of the distortion measure
Rate Distortion Theory • Obtain a compact clustering T of the data X with minimal expected distortion, given a fixed set of representatives Cover & Thomas
Rate Distortion Theory - Intuition • T = X (every point is its own cluster) – zero distortion – not compact • |T| = 1 (a single cluster) – high distortion – very compact
Rate Distortion Theory – Cont. • The quality of clustering is determined by – Complexity, measured by I(X;T) (a.k.a. Rate) – Distortion, measured by E[d(x,t)]
Rate Distortion Plane (illustration: axes are rate I(X;T) vs. distortion Ed(X,T); D marks the distortion constraint; the extremes are minimal distortion and maximal compression)
Rate Distortion Function • Let D be an upper bound constraint on the expected distortion: E[d(x,t)] ≤ D • Higher values of D mean a more relaxed distortion constraint, so stronger compression levels are attainable • Given the distortion constraint, find the most compact model (with smallest complexity I(X;T))
Rate Distortion Function • Given – Set of points X with prior p(x) – Set of representatives T – Distortion measure d(x,t) • Find – The most compact soft clustering p(t|x) of the points of X that satisfies the distortion constraint • Rate Distortion Function: R(D) = min_{p(t|x): E[d(x,t)] ≤ D} I(X;T)
Rate Distortion Function • Lagrangian: F[p(t|x)] = I(X;T) + β·E[d(x,t)] – Complexity term: I(X;T) – Distortion term: E[d(x,t)] – Lagrange multiplier: β • Minimize F!
Rate Distortion Curve (illustration: R(D) traces the optimal trade-off between rate I(X;T) and distortion Ed(X,T), from minimal distortion to maximal compression)
Rate Distortion Function • Minimize F[p(t|x)] = I(X;T) + β·E[d(x,t)] subject to the normalization Σ_t p(t|x) = 1 • The minimum is attained when p(t|x) = p(t) exp(-β d(x,t)) / Z(x,β), where Z(x,β) = Σ_t p(t) exp(-β d(x,t)) is the normalization (partition) function
Solution - Analysis • Solution: p(t|x) = p(t) exp(-β d(x,t)) / Z(x,β), with d(x,t) and β known • The solution is implicit: p(t) = Σ_x p(x) p(t|x) depends on p(t|x) itself
Solution - Analysis • For a fixed t: when x is similar to t, the distortion d(x,t) is small, so closer points are attached to t with higher probability
Solution - Analysis • β → 0 reduces the influence of the distortion: for a fixed t, p(t|x) → p(t), which does not depend on x – maximal compression, effectively a single cluster • β → ∞: for a fixed x, most of the conditional probability goes to the t with smallest distortion – hard clustering
Solution - Analysis • Intermediate β: soft clustering with intermediate complexity • Varying β traces the rate-distortion curve
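The effect of varying β can be seen on a single point; a toy sketch assuming a uniform p(t) and two representatives with hypothetical distortions:

```python
import math

def assignment(beta, d=(0.1, 0.4)):
    """p(t|x) for one point x, uniform p(t) (which cancels out),
    and two representatives with distortions d."""
    w = [math.exp(-beta * di) for di in d]
    z = sum(w)
    return [wi / z for wi in w]

print(assignment(0.0))    # [0.5, 0.5]: uniform assignment, maximal compression
print(assignment(100.0))  # ~[1.0, 0.0]: hard assignment to the nearest representative
```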
Blahut–Arimoto Algorithm • Input: p(x), d(x,t), β • Randomly init p(t) • Iterate until convergence: – p(t|x) ← p(t) exp(-β d(x,t)) / Z(x,β) – p(t) ← Σ_x p(x) p(t|x) • Optimizes a convex function over a convex set, so the minimum is global
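The alternating updates can be sketched on a toy 1-D problem with squared-distance distortion (the data, β, and cluster count are our assumptions):

```python
import math
import random

def blahut_arimoto(p_x, d, beta, n_clusters, n_iter=200, seed=0):
    """Alternating minimization of I(X;T) + beta * E[d(x,t)].

    p_x : prior over the data points
    d   : d[x][t], distortion between point x and representative t
    Returns the soft assignment p(t|x) and the cluster marginal p(t).
    """
    rng = random.Random(seed)
    p_t = [rng.random() for _ in range(n_clusters)]
    s = sum(p_t)
    p_t = [v / s for v in p_t]                       # random init of p(t)
    for _ in range(n_iter):
        # p(t|x) <- p(t) exp(-beta d(x,t)) / Z(x, beta)
        p_t_given_x = []
        for x in range(len(p_x)):
            w = [p_t[t] * math.exp(-beta * d[x][t]) for t in range(n_clusters)]
            z = sum(w)
            p_t_given_x.append([wi / z for wi in w])
        # p(t) <- sum_x p(x) p(t|x)
        p_t = [sum(p_x[x] * p_t_given_x[x][t] for x in range(len(p_x)))
               for t in range(n_clusters)]
    return p_t_given_x, p_t

# Toy problem: 4 points on a line, 2 fixed representatives, squared distortion
xs, ts = [0.0, 0.1, 0.9, 1.0], [0.0, 1.0]
d = [[(x - t) ** 2 for t in ts] for x in xs]
p_x = [0.25] * 4
p_t_given_x, p_t = blahut_arimoto(p_x, d, beta=50.0, n_clusters=2)
# With a large beta the clustering is nearly hard: each point assigns
# almost all of its probability to the nearest representative.
```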
Blahut-Arimoto Algorithm Advantages: • Obtains compact clustering of the data with minimal expected distortion • Optimal clustering given fixed set of representatives
Blahut-Arimoto Algorithm Drawbacks: • Distortion measure is a part of the problem setup – Hard to obtain for some problems – Equivalent to determining relevant features • Fixed set of representatives • Slow convergence
Rate Distortion Theory – Additional Insights – Another problem would be to find the optimal representatives given the clustering. – Joint optimization of clustering and representatives doesn’t have a unique solution (as in EM or K-means).
Information Bottleneck • Copes with the drawbacks of Rate Distortion approach • Compress the data while preserving “important” (relevant) information • It is often easier to define what information is important than to define a distortion measure. • Replace the distortion upper bound constraint by a lower bound constraint over the relevant information Tishby, Pereira & Bialek, 1999
Information Bottleneck-Example • Given: Documents containing Words, the Topics, and the joint prior p(word, topic)
Information Bottleneck-Example • Obtain: a partitioning of Words into Clusters, characterized by I(Word; Topic), I(Word; Cluster) and I(Cluster; Topic)
Information Bottleneck-Example • Extreme case 1: I(Word; Cluster) = 0 – Very Compact; I(Cluster; Topic) = 0 – Not Informative
Information Bottleneck-Example • Extreme case 2: I(Word; Cluster) = max – Not Compact; I(Cluster; Topic) = max – Very Informative • Goal: minimize I(Word; Cluster) & maximize I(Cluster; Topic)
Information Bottleneck (illustration: words (X) → clusters (T) → topics (Y); Compactness is I(X;T), Relevant Information is I(T;Y))
Relevance Compression Curve (illustration: compression I(X;T) vs. relevant information I(T;Y); D marks the relevance constraint; the extremes are maximal compression and maximal relevant information)
Relevance Compression Function • Let D be the minimal allowed value of I(T;Y): I(T;Y) ≥ D • Smaller D means a more relaxed relevant-information constraint, so stronger compression levels are attainable • Given the relevant-information constraint, find the most compact model (with smallest I(X;T))
Relevance Compression Function • Lagrangian: L[p(t|x)] = I(X;T) - β·I(T;Y) – Compression term: I(X;T) – Relevance term: I(T;Y) – Lagrange multiplier: β • Minimize L!
Relevance Compression Curve (illustration: the optimal trade-off between maximal compression and maximal relevant information)
Relevance Compression Function • Minimize L[p(t|x)] = I(X;T) - β·I(T;Y) subject to the normalization Σ_t p(t|x) = 1 • The minimum is attained when: – p(t|x) = p(t) exp(-β D_KL[p(y|x) || p(y|t)]) / Z(x,β) – p(y|t) = Σ_x p(y|x) p(x|t) – p(t) = Σ_x p(t|x) p(x)
Solution - Analysis • Solution: p(t|x) = p(t) exp(-β D_KL[p(y|x) || p(y|t)]) / Z(x,β), with p(y|x) and β known • The solution is implicit: p(t) and p(y|t) depend on p(t|x)
Solution - Analysis • The KL distance emerges as the effective distortion measure from the IB principle • For a fixed t: when p(y|x) is similar to p(y|t), the KL distance is small, so such points x are attached to t with higher probability • The optimization is also over the cluster representatives p(y|t)
Solution - Analysis • β → 0 reduces the influence of the KL distance: for a fixed t, p(t|x) → p(t), which does not depend on x – maximal compression, a single cluster • β → ∞: for a fixed x, most of the conditional probability goes to the t with smallest KL distance (hard mapping)
Relevance Compression Curve (illustration: hard mappings appear at the high-β end of the curve, between maximal compression and maximal relevant information)
Iterative Optimization Algorithm (i. IB) • Input: p(x,y), β, |T| • Randomly init p(t|x) Pereira, Tishby & Lee, 1993; Tishby, Pereira & Bialek, 2001
Iterative Optimization Algorithm (i. IB) • Iterate until convergence: – p(cluster) ← Σ_word p(word) p(cluster | word) – p(topic | cluster) ← Σ_word p(topic | word) p(word | cluster) – p(cluster | word) ← p(cluster) exp(-β D_KL[p(topic|word) || p(topic|cluster)]) / Z Pereira, Tishby & Lee, 1993
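The three updates above can be written as one loop; a minimal sketch on a hypothetical 4-word, 2-topic joint (the toy data, β, and initialization are our choices, and the iteration only reaches a local minimum):

```python
import math
import random

def iib(p_xy, beta, n_clusters, n_iter=300, seed=0):
    """Iterative IB: alternate the three self-consistent updates.

    p_xy : joint distribution p(x, y) as a nested list (|X| x |Y|)
    Returns p(t|x), p(t), p(y|t).
    """
    rng = random.Random(seed)
    n_x, n_y = len(p_xy), len(p_xy[0])
    p_x = [sum(row) for row in p_xy]
    p_y_x = [[p_xy[x][y] / p_x[x] for y in range(n_y)] for x in range(n_x)]
    # random init of p(t|x)
    p_t_given_x = []
    for _ in range(n_x):
        w = [rng.random() for _ in range(n_clusters)]
        s = sum(w)
        p_t_given_x.append([v / s for v in w])
    for _ in range(n_iter):
        # p(t) <- sum_x p(x) p(t|x)
        p_t = [sum(p_x[x] * p_t_given_x[x][t] for x in range(n_x))
               for t in range(n_clusters)]
        # p(y|t) <- sum_x p(y|x) p(x|t)
        p_y_t = [[sum(p_t_given_x[x][t] * p_xy[x][y] for x in range(n_x)) / p_t[t]
                  for y in range(n_y)] for t in range(n_clusters)]
        # p(t|x) <- p(t) exp(-beta KL[p(y|x) || p(y|t)]) / Z(x, beta)
        for x in range(n_x):
            w = []
            for t in range(n_clusters):
                kl = sum(p_y_x[x][y] * math.log(p_y_x[x][y] / p_y_t[t][y])
                         for y in range(n_y) if p_y_x[x][y] > 0)
                w.append(p_t[t] * math.exp(-beta * kl))
            z = sum(w)
            p_t_given_x[x] = [wi / z for wi in w]
    return p_t_given_x, p_t, p_y_t

# Toy joint: words 0,1 favor topic 0; words 2,3 favor topic 1
p_xy = [[0.20, 0.05],
        [0.20, 0.05],
        [0.05, 0.20],
        [0.05, 0.20]]
p_t_given_x, p_t, p_y_t = iib(p_xy, beta=10.0, n_clusters=2)
# Words with the same topic profile end up in the same cluster.
```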
i. IB simulation • Given: – 300 instances of X with prior p(x) – Binary relevant variable Y – Joint prior p(x,y) – β • Obtain: – Optimal clustering p(t|x) (with minimal I(X;T))
i. IB simulation… X points and their priors
i. IB simulation… Given a point x, p(y|x) is given by the color of the point on the map
i. IB simulation… Single Cluster – Maximal Compression
i. IB simulation… (intermediate frames: as β increases, the clusters successively split and the soft assignments gradually harden)
i. IB simulation… Hard Clustering – Maximal Relevant Information
Iterative Optimization Algorithm (i. IB) • Optimizes a non-convex functional over 3 convex sets – {p(t|x)}, {p(t)}, {p(y|t)} – so the minimum found is only local • Analogy to K-means or EM
“Semantic change” in the clustering solution
Iterative Optimization Algorithm (i. IB) Advantages: • Defining relevant variable is often easier and more intuitive than defining distortion measure • Finds local minimum
Iterative Optimization Algorithm (i. IB) Drawbacks: • Finds local minimum (suboptimal solutions) • Need to specify the parameters • Slow convergence • Large data sample is required
Deterministic Annealing-like algorithm (d. IB) • Iteratively increase the parameter β, adapting the solution from the previous value of β to the new one • Track the changes in the solution as the system shifts its preference from compression to relevance • Tries to reconstruct the relevance-compression curve Slonim, Friedman & Tishby, 2002
Deterministic Annealing-like algorithm (d. IB) • Solution from the previous step: p(t|x), p(t), p(y|t)
Deterministic Annealing-like algorithm (d. IB) • Duplicate each cluster into two copies
Deterministic Annealing-like algorithm (d. IB) • Apply a small perturbation to the duplicated clusters
Deterministic Annealing-like algorithm (d. IB) Apply i. IB using the duplicated cluster set as initialization
Deterministic Annealing-like algorithm (d. IB) • If the two copies of a cluster turn out different, keep the split; otherwise keep the old (unsplit) cluster
Deterministic Annealing-like algorithm (d. IB) • Illustration: which clusters split at which values of β
Deterministic Annealing-like algorithm (d. IB) Advantages: • Finds a local minimum (suboptimal solutions) • Speeds up convergence by adapting the previous solution
Deterministic Annealing-like algorithm (d. IB) Drawbacks: • Need to specify and tune several parameters: – perturbation size – step size for β (splits might be “skipped”) – similarity threshold for splitting – may need to vary parameters during the process • Finds a local minimum (suboptimal solutions) • Large data sample is required
Agglomerative Algorithm (a. IB) • Find a hierarchical clustering tree in a greedy bottom-up fashion • Results in a different tree for each β • Each tree gives a range of clustering solutions at different resolutions (same β, different resolutions) Slonim & Tishby, 1999
Agglomerative Algorithm (a. IB) • Fix β • Start with T = X: every point in its own cluster
Agglomerative Algorithm (a. IB) • For each pair (t_i, t_j): compute the new I(T;Y) after a tentative merge • Merge the pair t_i and t_j that produces the smallest decrease in I(T;Y)
Agglomerative Algorithm (a. IB) • Continue merging until a single cluster is left
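The greedy bottom-up procedure can be sketched from the merge rule: the cost of a merge is the resulting loss in I(T;Y), which reduces to a prior-weighted Jensen-Shannon divergence between the two clusters' p(y|t). A minimal sketch (the toy joint is our own test data):

```python
import math

def kl(p, q):
    """D(p||q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def merge_cost(p_i, p_j, py_i, py_j):
    """Loss in I(T;Y) caused by merging clusters i and j:
    (p_i + p_j) * weighted Jensen-Shannon divergence of their p(y|t)."""
    w = p_i + p_j
    pi, pj = p_i / w, p_j / w
    py_m = [pi * a + pj * b for a, b in zip(py_i, py_j)]
    return w * (pi * kl(py_i, py_m) + pj * kl(py_j, py_m))

def aib(p_xy):
    """Greedy agglomerative IB: start from singletons, repeatedly merge
    the pair with the smallest loss in relevant information I(T;Y)."""
    p_t = [sum(row) for row in p_xy]
    p_y_t = [[v / p_t[i] for v in row] for i, row in enumerate(p_xy)]
    clusters = [[i] for i in range(len(p_xy))]
    history = [list(map(tuple, clusters))]   # one entry per tree level
    while len(clusters) > 1:
        _, i, j = min((merge_cost(p_t[i], p_t[j], p_y_t[i], p_y_t[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        # merge cluster j into cluster i
        w = p_t[i] + p_t[j]
        p_y_t[i] = [(p_t[i] * a + p_t[j] * b) / w
                    for a, b in zip(p_y_t[i], p_y_t[j])]
        p_t[i] = w
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j], p_t[j], p_y_t[j]
        history.append(list(map(tuple, clusters)))
    return history

# Toy joint: words 0,1 share one topic profile; words 2,3 share another
p_xy = [[0.20, 0.05],
        [0.20, 0.05],
        [0.05, 0.20],
        [0.05, 0.20]]
history = aib(p_xy)
# The first (zero-cost) merges join words with identical topic profiles.
```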
Agglomerative Algorithm (a. IB) (illustration: the sequence of merges forms a hierarchical clustering tree)
Agglomerative Algorithm (a. IB) Advantages: • Non-parametric • Full hierarchy of clusters for each β • Simple
Agglomerative Algorithm (a. IB) Drawbacks: • Greedy – is not guaranteed to extract even locally minimal solutions along the tree • Large data sample is required
Applications… Unsupervised Clustering of Images • Modeling assumption: for a fixed image, colors and their spatial distribution are generated by a mixture of Gaussians in a 5-dim (color + position) space Shiri Gordon et al., 2003
Applications… Unsupervised Clustering of Images • Mixture of Gaussians model: apply an EM procedure to estimate the mixture parameters Shiri Gordon et al., 2003
Applications… Unsupervised Clustering of Images • Assume a uniform prior over images • Calculate the conditional distribution induced by each image’s mixture model • Apply the a. IB algorithm Shiri Gordon et al., 2003
Applications… Unsupervised Clustering of Images (result figures: example clustering trees of an image database) Shiri Gordon et al., 2003
Summary • Rate Distortion Theory – Blahut-Arimoto algorithm • Information Bottleneck Principle • IB algorithms – i. IB – d. IB – a. IB • Application
Thank you
Blahut-Arimoto algorithm • Minimum distance between two convex sets of distributions A and B, found by alternating minimization • When does it converge to the global minimum? – When A and B are convex, plus some requirements on the distance measure Csiszar & Tusnady, 1984
Blahut-Arimoto algorithm • Reformulate the rate-distortion minimization as a distance minimization between the two convex sets A and B
Blahut-Arimoto algorithm (illustration: alternating projections between the sets A and B)
Rate Distortion Theory - Intuition • T = X – zero distortion – not compact – I(X;T) = H(X) • |T| = 1 – high distortion – very compact – I(X;T) = 0
Information Bottleneck - cont’d • Assume the Markov relation T ↔ X ↔ Y: – T is a compressed representation of X, thus independent of Y if X is given – Information processing inequality: I(T;Y) ≤ I(X;Y)
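The inequality can be checked numerically on a random Markov chain T ↔ X ↔ Y (all sizes and distributions below are arbitrary test data):

```python
import math
import random

def mi(p_ab):
    """Mutual information in bits from a joint distribution (nested list)."""
    pa = [sum(r) for r in p_ab]
    pb = [sum(c) for c in zip(*p_ab)]
    return sum(p * math.log2(p / (pa[i] * pb[j]))
               for i, r in enumerate(p_ab) for j, p in enumerate(r) if p > 0)

rng = random.Random(1)

def rand_dist(n):
    w = [rng.random() for _ in range(n)]
    s = sum(w)
    return [v / s for v in w]

# Random Markov chain T <- X -> Y (equivalently T - X - Y)
n_x, n_t, n_y = 5, 3, 4
p_x = rand_dist(n_x)
p_t_x = [rand_dist(n_t) for _ in range(n_x)]   # p(t|x)
p_y_x = [rand_dist(n_y) for _ in range(n_x)]   # p(y|x)

p_xy = [[p_x[x] * p_y_x[x][y] for y in range(n_y)] for x in range(n_x)]
p_ty = [[sum(p_x[x] * p_t_x[x][t] * p_y_x[x][y] for x in range(n_x))
         for y in range(n_y)] for t in range(n_t)]

# Compressing X into T can only lose information about Y
assert mi(p_ty) <= mi(p_xy) + 1e-12
```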