Bi-Clustering
COMP 790-90 Seminar, Spring 2008
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Coherent Cluster
Want to accommodate noise but not outliers.
Coherent Cluster
• Coherent cluster
  - Subspace clustering
• Pair-wise disparity
  - For a $2 \times 2$ (sub)matrix consisting of objects {x, y} and attributes {a, b}:
    $D = |(d_{xa} - d_{ya}) - (d_{xb} - d_{yb})|$
    where $d_{xa} - d_{ya}$ is the mutual bias of attribute a and $d_{xb} - d_{yb}$ is the mutual bias of attribute b.
[Figure: objects x, y, and z plotted over attributes a and b, with d_xa, d_ya, d_xb, and d_yb marked.]
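The pair-wise disparity can be sketched in a few lines. This is a minimal illustration that assumes D is the absolute difference between the two mutual biases; the function name `d_value` and the sample numbers are hypothetical.

```python
def d_value(d_xa, d_xb, d_ya, d_yb):
    """Disparity of the 2x2 submatrix on objects {x, y} and attributes {a, b}."""
    bias_a = d_xa - d_ya  # mutual bias of attribute a
    bias_b = d_xb - d_yb  # mutual bias of attribute b
    return abs(bias_a - bias_b)


# Hypothetical values: y is x shifted down by 2 on both attributes.
print(d_value(5.0, 9.0, 3.0, 7.0))  # 0.0
# Hypothetical values: the shift differs by 1 between the two attributes.
print(d_value(5.0, 9.0, 3.0, 6.0))  # 1.0
```

A pure shift between the two objects gives D = 0, which is why objects that are far apart in value can still be coherent.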
Coherent Cluster
- A $2 \times 2$ (sub)matrix is a $\delta$-coherent cluster if its D value is less than or equal to $\delta$.
- An $m \times n$ matrix X is a $\delta$-coherent cluster if every $2 \times 2$ submatrix of X is a $\delta$-coherent cluster.
  - A $\delta$-coherent cluster is a maximum $\delta$-coherent cluster if it is not a submatrix of any other $\delta$-coherent cluster.
- Objective: given a data matrix and a threshold $\delta$, find all maximum $\delta$-coherent clusters.
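The definition can be checked directly by testing every 2 x 2 submatrix. The sketch below is a brute-force check, not the mining algorithm; it assumes D is the absolute difference of the two mutual biases, and the name `is_delta_coherent` is hypothetical. The sample rows are o0, o1, o2 from the example tables later in the deck.

```python
from itertools import combinations


def is_delta_coherent(matrix, delta):
    """True if every 2x2 submatrix of `matrix` has D value <= delta."""
    rows = range(len(matrix))
    cols = range(len(matrix[0]))
    for x, y in combinations(rows, 2):
        for a, b in combinations(cols, 2):
            bias_a = matrix[x][a] - matrix[y][a]
            bias_b = matrix[x][b] - matrix[y][b]
            if abs(bias_a - bias_b) > delta:
                return False
    return True


# Columns a0, a1 of objects o0, o1, o2: coherent for delta = 1.
print(is_delta_coherent([[1, 4], [2, 5], [3, 6]], 1))           # True
# Adding column a2 breaks coherence between o0 and o1.
print(is_delta_coherent([[1, 4, 2], [2, 5, 5], [3, 6, 5]], 1))  # False
```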
Coherent Cluster
• Challenges:
  - Finding subspace clusters based on distance is already a difficult task because of the curse of dimensionality.
    - The (sub)set of objects and the (sub)set of attributes that form a cluster are unknown in advance and may not be adjacent to each other in the data matrix.
  - The actual values of the objects in a coherent cluster may be far apart from each other.
    - Each object or attribute in a coherent cluster may bear some relative bias (which is unknown in advance), and such bias may be local to the coherent cluster.
Coherent Cluster
1. Compute the maximum coherent attribute sets for each pair of objects.
2. Two-way pruning.
3. Construct the lexicographical tree.
4. Post-order traverse the tree to find maximum coherent clusters.
Coherent Cluster
• Observation: given a pair of objects {o1, o2} and a (sub)set of attributes {a1, a2, …, ak}, the $2 \times k$ submatrix is a $\delta$-coherent cluster iff the mutual biases $(d_{o_1 a_i} - d_{o_2 a_i})$, taken over all attributes $a_i$, differ from each other by no more than $\delta$.
[Figure: o1 and o2 plotted over attributes a1 to a5; their mutual biases are 3, 2, 3.5, 2, and 2.5, all within the range [2, 3.5].]
If $\delta = 1.5$, then {a1, a2, a3, a4, a5} is a coherent attribute set (CAS) of (o1, o2).
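For a single pair of objects, the observation reduces to a range test on the mutual biases. A minimal sketch, assuming the five biases read off the slide's figure are 3, 2, 3.5, 2, and 2.5 for a1 to a5; the function name is hypothetical.

```python
def is_coherent_attribute_set(biases, delta):
    """All mutual biases differ pairwise by at most delta iff max - min <= delta."""
    return max(biases) - min(biases) <= delta


# Mutual biases of (o1, o2) on a1..a5, as read from the slide (assumed mapping).
biases = [3, 2, 3.5, 2, 2.5]
print(is_coherent_attribute_set(biases, 1.5))  # True: range is [2, 3.5]
print(is_coherent_attribute_set(biases, 1.0))  # False: range exceeds 1
```

Checking max minus min replaces the k(k-1)/2 pairwise comparisons with a single pass over the biases.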
Coherent Cluster
• Observation: given a subset of objects {o1, o2, …, ol} and a subset of attributes {a1, a2, …, ak}, the $l \times k$ submatrix is a $\delta$-coherent cluster iff {a1, a2, …, ak} is a coherent attribute set for every pair of objects (oi, oj), where $1 \le i, j \le l$.
[Figure: data matrix over attributes a1 to a7 and objects o1 to o6.]
Coherent Cluster
• Strategy: find the maximum coherent attribute sets for each pair of objects with respect to the given threshold $\delta$.
[Figure: objects r1 and r2 plotted over attributes a1 to a5, with $\delta = 1$. Mutual biases in attribute order a1, a2, a3, a4, a5: 3, 2, 3.5, 2, 2.5. Sorted by bias, the order is a2, a4, a5, a1, a3: 2, 2, 2.5, 3, 3.5.]
The maximum coherent attribute sets define the search space for maximum coherent clusters.
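One way to obtain the maximum coherent attribute sets of an object pair is to sort the attributes by mutual bias and slide a window of width delta over the sorted list. The sketch below assumes the bias values read from the slide's figure; the function name `maximal_coherent_sets` is hypothetical, and this is an illustration rather than the authors' implementation.

```python
def maximal_coherent_sets(bias_by_attr, delta):
    """Return every maximal attribute set whose biases span at most delta."""
    items = sorted(bias_by_attr.items(), key=lambda kv: kv[1])
    result = []
    end = 0       # exclusive right edge of the current window
    prev_end = 0  # right edge of the last window that was reported
    for start in range(len(items)):
        end = max(end, start + 1)
        while end < len(items) and items[end][1] - items[start][1] <= delta:
            end += 1
        # The window is maximal only if it reaches further right than the last one.
        if end > prev_end:
            result.append(sorted(name for name, _ in items[start:end]))
            prev_end = end
    return result


# Mutual biases of (r1, r2), as read from the slide (assumed mapping).
biases = {"a1": 3, "a2": 2, "a3": 3.5, "a4": 2, "a5": 2.5}
print(maximal_coherent_sets(biases, 1))
# [['a1', 'a2', 'a4', 'a5'], ['a1', 'a3', 'a5']]
```

With delta = 1.5 the same call returns a single set containing all five attributes.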
Two-Way Pruning
delta = 1, nc = 3, nr = 3

      a0    a1    a2
o0     1     4     2
o1     2     5     5
o2     3     6     5
o3     4   200     7
o4   300     7     6

MCAS (maximum coherent attribute sets per object pair):
(o0, o2) → (a0, a1, a2)
(o1, o2) → (a0, a1, a2)

MCOS (maximum coherent object sets per attribute pair):
(a0, a1) → (o0, o1, o2)
(a0, a2) → (o1, o2, o3)
(a1, a2) → (o1, o2, o4)
(a1, a2) → (o0, o2, o4)
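The MCAS and MCOS lists on this slide can be reproduced with the same sliding-window idea applied in both directions: over attributes for each object pair, and over objects for each attribute pair. This sketch only computes the two lists; the pruning step that cross-checks them is not shown. The helper names are hypothetical, and nc and nr are assumed to be the minimum numbers of attributes and objects.

```python
from itertools import combinations

ATTRS = ["a0", "a1", "a2"]
DATA = {
    "o0": [1, 4, 2],
    "o1": [2, 5, 5],
    "o2": [3, 6, 5],
    "o3": [4, 200, 7],
    "o4": [300, 7, 6],
}


def maximal_windows(value_by_key, delta, min_size):
    """Maximal key sets whose values span at most delta, with at least min_size keys."""
    items = sorted(value_by_key.items(), key=lambda kv: kv[1])
    result, end, prev_end = [], 0, 0
    for start in range(len(items)):
        end = max(end, start + 1)
        while end < len(items) and items[end][1] - items[start][1] <= delta:
            end += 1
        if end > prev_end:
            prev_end = end
            if end - start >= min_size:
                result.append(sorted(k for k, _ in items[start:end]))
    return result


def mcas(delta, nc):
    """Maximum coherent attribute sets for every pair of objects."""
    out = {}
    for x, y in combinations(sorted(DATA), 2):
        biases = {a: DATA[x][i] - DATA[y][i] for i, a in enumerate(ATTRS)}
        sets = maximal_windows(biases, delta, nc)
        if sets:
            out[(x, y)] = sets
    return out


def mcos(delta, nr):
    """Maximum coherent object sets for every pair of attributes."""
    out = {}
    for i, j in combinations(range(len(ATTRS)), 2):
        diffs = {o: row[i] - row[j] for o, row in DATA.items()}
        sets = maximal_windows(diffs, delta, nr)
        if sets:
            out[(ATTRS[i], ATTRS[j])] = sets
    return out


print(mcas(delta=1, nc=3))
print(mcos(delta=1, nr=3))
```

Object pairs such as (o0, o1) drop out of the MCAS list because their largest coherent attribute set has only two attributes, fewer than nc.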
Coherent Cluster
• Strategy: group object pairs by their CAS and, for each group, find the maximum clique(s).
• Implementation: use a lexicographical tree to organize the object pairs and to generate all maximum coherent clusters with a single post-order traversal of the tree.
[Figure: data matrix with axes labeled attributes and objects.]
Assume $\delta = 1$.

      a0    a1    a2    a3
o0     1     4     2     5
o1     2     5     5     8
o2     3     6     5     7
o3     4    20     7     2
o4    30     7     6     6

Maximum coherent attribute sets per object pair:
(o0, o1) : {a0, a1}, {a2, a3}
(o0, o2) : {a0, a1, a2, a3}
(o0, o4) : {a1, a2}
(o1, o2) : {a0, a1, a2}, {a2, a3}
(o1, o3) : {a0, a2}
(o1, o4) : {a1, a2}
(o2, o3) : {a0, a2}
(o2, o4) : {a1, a2}

Object pairs grouped by attribute set:
{a0, a1} : (o0, o1), (o1, o2), (o0, o2)
{a0, a2} : (o1, o3), (o2, o3), (o1, o2), (o0, o2)
{a1, a2} : (o0, o4), (o1, o4), (o2, o4), (o1, o2), (o0, o2)
{a2, a3} : (o0, o1), (o1, o2), (o0, o2)
{a0, a1, a2} : (o1, o2), (o0, o2)
{a0, a1, a2, a3} : (o0, o2)

[Figure: lexicographical tree over attributes a0 to a3, with each object pair attached to the nodes of its attribute sets.]
Resulting maximum coherent clusters (objects × attributes):
{o0, o2} × {a0, a1, a2, a3}
{o1, o2} × {a0, a1, a2}
{o0, o1, o2} × {a0, a1}
{o1, o2, o3} × {a0, a2}
{o0, o2, o4} × {a1, a2}
{o1, o2, o4} × {a1, a2}
{o0, o1, o2} × {a2, a3}

[Figure: the lexicographical tree after the post-order traversal, with object pairs propagated from each attribute set to its subsets.]
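The clusters in this example can be cross-checked by exhaustive enumeration. The sketch below is not the lexicographical-tree algorithm: it simply tests every combination of at least two objects and two attributes and keeps those not contained in a larger coherent cluster. It assumes delta = 1 and that the a3 value of o4 is 6; all function names are hypothetical.

```python
from itertools import combinations

ATTRS = ["a0", "a1", "a2", "a3"]
DATA = {
    "o0": [1, 4, 2, 5],
    "o1": [2, 5, 5, 8],
    "o2": [3, 6, 5, 7],
    "o3": [4, 20, 7, 2],
    "o4": [30, 7, 6, 6],  # last value assumed from the slide residue
}


def coherent(objs, attr_idx, delta):
    """Every object pair must have mutual biases within delta over attr_idx."""
    for x, y in combinations(objs, 2):
        biases = [DATA[x][i] - DATA[y][i] for i in attr_idx]
        if max(biases) - min(biases) > delta:
            return False
    return True


def subsets(items, min_size):
    for k in range(min_size, len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


def maximum_coherent_clusters(delta):
    found = [
        (objs, attrs)
        for objs in subsets(sorted(DATA), 2)
        for attrs in subsets(range(len(ATTRS)), 2)
        if coherent(sorted(objs), sorted(attrs), delta)
    ]
    # Keep only clusters that are not a submatrix of another coherent cluster.
    maximal = [
        (o, t) for (o, t) in found
        if not any((o, t) != (p, u) and o <= p and t <= u for (p, u) in found)
    ]
    return {(tuple(sorted(o)), tuple(ATTRS[i] for i in sorted(t))) for o, t in maximal}


for objs, attrs in sorted(maximum_coherent_clusters(1)):
    print(objs, attrs)
```

The enumeration is exponential in both dimensions, so it is only useful for checking small examples like this one.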
Coherent Cluster
• High expressive power
  - The coherent cluster can capture many interesting and meaningful patterns overlooked by previous clustering methods.
• Efficient and highly scalable
• Wide applications
  - Gene expression analysis
  - Collaborative filtering
[Figure: subspace cluster compared with coherent cluster.]
Remark
• Compared with Bicluster
  - Separates noise from outliers well
  - No random data insertion or replacement
  - Produces the optimal solution
Definition of OP-Cluster
• Let I be a subset of genes in the database. Let J be a subset of conditions. We say <I, J> forms an Order Preserving Cluster (OP-Cluster) if one of the following relationships exists for any pair of conditions.
[Figure: expression levels plotted over conditions A1, A2, A3, and A4.]
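A simplified reading of the definition is that, for every pair of conditions in J, all genes in I agree on which of the two has the higher expression level. The sketch below checks exactly that; it treats ties as a third relation and ignores any grouping of similar values, which is an assumption. The function name and the expression values are hypothetical.

```python
from itertools import combinations


def is_op_cluster(matrix, genes, conditions):
    """True if all genes share the same relation (<, >, =) on every condition pair."""
    for c1, c2 in combinations(conditions, 2):
        relations = set()
        for g in genes:
            a, b = matrix[g][c1], matrix[g][c2]
            relations.add("<" if a < b else ">" if a > b else "=")
        if len(relations) != 1:
            return False
    return True


# Hypothetical expression levels.
expr = {
    "g1": {"A1": 2, "A2": 7, "A3": 4, "A4": 9},
    "g2": {"A1": 10, "A2": 40, "A3": 15, "A4": 80},
    "g3": {"A1": 5, "A2": 1, "A3": 3, "A4": 8},
}
conds = ["A1", "A2", "A3", "A4"]
print(is_op_cluster(expr, ["g1", "g2"], conds))               # True
print(is_op_cluster(expr, ["g1", "g2", "g3"], conds))         # False
print(is_op_cluster(expr, ["g1", "g2", "g3"], ["A3", "A4"]))  # True
```

Note that g1 and g2 differ widely in magnitude yet still form an OP-Cluster, since only the ordering of conditions matters.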
Problem Statement
• Given a gene expression matrix, our goal is to find all the statistically significant OP-Clusters. Significance is ensured by the minimum size thresholds nc and nr.
Conversion to Sequence Mining Problem
[Figure: expression levels of a gene plotted over conditions A1, A2, A3, and A4, together with the resulting sequence of conditions.]
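The conversion can be sketched as sorting the condition labels of each gene by expression level, so that each row of the matrix becomes one sequence. This assumes ascending order and no special handling of ties; the function name and the values are hypothetical.

```python
def to_sequence(levels):
    """Order condition labels by increasing expression level."""
    return [cond for cond, _ in sorted(levels.items(), key=lambda kv: kv[1])]


# Hypothetical expression levels for one gene.
gene = {"A1": 2, "A2": 7, "A3": 4, "A4": 9}
print(to_sequence(gene))  # ['A1', 'A3', 'A2', 'A4']
```

After this step, genes that belong to the same OP-Cluster share a common subsequence of conditions, which is what turns the task into sequence mining.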
Mining OP-Clusters: A naïve approach
• A naïve approach
  - Enumerate all possible subsequences in a prefix tree.
  - For each subsequence, collect all genes that contain it.
• Challenge:
  - The total number of distinct subsequences grows very quickly with the number of items.
[Figure: a complete prefix tree with 4 items {a, b, c, d}.]
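To get a feel for the size of a complete prefix tree, one can count its nodes. The sketch assumes that every node below the root is an ordered arrangement of distinct items, so the count for n items is the sum of n!/(n-k)! for k from 1 to n; the function name is hypothetical.

```python
import math


def count_subsequences(n):
    """Number of non-empty ordered arrangements of distinct items drawn from n items."""
    return sum(math.perm(n, k) for k in range(1, n + 1))


print(count_subsequences(4))   # 64
print(count_subsequences(10))  # 9864100
```

Even ten conditions already give close to ten million candidate subsequences, which is why the tree has to be restricted to subsequences that actually occur in the data.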
Mining OP-Clusters: Prefix Tree
Goal: build a compact prefix tree that includes only the subsequences that occur in the original database.
Input sequences: g1 = adbc, g2 = abdc, g3 = badc
Strategies:
1. Depth-first traversal.
2. Suffix concatenation: visit only subsequences that exist in the input sequences.
3. Apriori property: visit only subsequences that are sufficiently supported, in order to derive longer subsequences.
[Figure: prefix tree below Root; each node is labeled with an item and the ids of the genes that support the path to it, for example a: 1, 2, 3.]
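The depth-first traversal and the Apriori property can be illustrated on the three sequences from this slide. The sketch below grows each pattern by appending one item at a time and stops as soon as the support drops below a threshold; it does not implement the suffix concatenation named in the slide. The names `mine`, `is_subsequence`, and `min_support` are hypothetical.

```python
def is_subsequence(pattern, sequence):
    """True if the items of `pattern` occur in `sequence` in the same order."""
    remaining = iter(sequence)
    return all(symbol in remaining for symbol in pattern)


def mine(db, min_support):
    """Depth-first search for subsequences supported by at least min_support genes."""
    items = sorted(set().union(*db.values()))
    found = {}

    def grow(prefix, genes):
        for item in items:
            if item in prefix:
                continue
            candidate = prefix + item
            # Apriori: only genes that support the prefix can support its extension.
            support = [g for g in genes if is_subsequence(candidate, db[g])]
            if len(support) >= min_support:
                found[candidate] = support
                grow(candidate, support)

    grow("", sorted(db))
    return found


db = {"g1": "adbc", "g2": "abdc", "g3": "badc"}
for pattern, genes in sorted(mine(db, min_support=3).items()):
    print(pattern, genes)
```

With a support threshold of 3, the longest pattern shared by all three genes is adc; a pattern such as ab is dropped because g3 places b before a.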
References
• J. Yang, W. Wang, H. Wang, P. Yu, Delta-cluster: capturing subspace correlation in a large data set, Proceedings of the 18th IEEE International Conference on Data Engineering (ICDE), pp. 517-528, 2002.
• H. Wang, W. Wang, J. Yang, P. Yu, Clustering by pattern similarity in large data sets, Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 2002.
• S. Yoon, C. Nardini, L. Benini, G. De Micheli, Enhanced pClustering and its applications to gene expression data, Bioinformatics and Bioengineering, 2004.
• J. Liu and W. Wang, OP-Cluster: clustering by tendency in high dimensional space, ICDM'03.