- Slides: 10
Information Organization: Clustering

Clustering
- Unsupervised Learning
- Similar items are grouped together into clusters.

Objects:
  A = (blue, circle, small)
  B = (red, square, large)
  C = (blue, square, large)
  D = (red, circle, small)
  E = (red, circle, small)
  F = (blue, circle, small)

Object-Attribute matrix:
                  A  B  C  D  E  F
  t1 (1=blue)     1  0  1  0  0  1
  t2 (1=circle)   1  0  0  1  1  1
  t3 (1=small)    1  0  0  1  1  1

Object-Object Association matrix (number of matching attribute values):
      B  C  D  E  F
  A   0  1  2  2  3
  B      2  1  1  0
  C         0  0  1
  D            3  2
  E               2

Search Engine
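
The association counts on this slide can be reproduced by counting, for each pair of objects, the attributes on which the two agree. The following is a minimal sketch under that reading; the function name `matches` and the dictionary layout are illustrative choices, not part of the slides.

```python
from itertools import combinations

# Objects from the slide, described as (colour, shape, size).
objects = {
    "A": ("blue", "circle", "small"),
    "B": ("red", "square", "large"),
    "C": ("blue", "square", "large"),
    "D": ("red", "circle", "small"),
    "E": ("red", "circle", "small"),
    "F": ("blue", "circle", "small"),
}

def matches(x, y):
    """Number of attributes on which two objects have the same value."""
    return sum(a == b for a, b in zip(x, y))

# Upper triangle of the object-object association matrix.
association = {
    (p, q): matches(objects[p], objects[q])
    for p, q in combinations(sorted(objects), 2)
}

for (p, q), score in association.items():
    print(p, q, score)
```

Note that this count also rewards agreement on "red", "square" and "large", which is why B and C score 2 even though neither has any attribute set to 1.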

Clustering: Procedure
1. Construct an Object-Attribute matrix
   - array element = presence/absence, occurrence frequency, or measure of importance (e.g. tf*idf, term relevance weight)
2. Compute measure of association between objects
   - e.g. Dice, Jaccard, Cosine similarity
3. Construct the Object-Object Association matrix
   - array element = similarity measure
4. Identify the pairwise associations above a given threshold
   - two objects are related if the strength of association (e.g. similarity) is greater than or equal to some threshold value
5. Apply clustering criterion (rules for combining related objects)
   - Single Link Clustering Criterion
     - Object is related to at least one member of an existing cluster
     - i.e. similarity to the closest element in a cluster >= threshold
   - Complete Link Clustering Criterion
     - Object is related to all the members of an existing cluster
     - i.e. similarity to the farthest element in a cluster >= threshold

Object-Attribute matrix (step 1):
                  A  B  C  D  E  F
  t1 (1=blue)     1  0  1  0  0  1
  t2 (1=circle)   1  0  0  1  1  1
  t3 (1=small)    1  0  0  1  1  1

Object-Object Association matrix (step 3):
      B  C  D  E  F
  A   0  1  2  2  3
  B      2  1  1  0
  C         0  0  1
  D            3  2
  E               2
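
Steps 4 and 5 can be sketched as two small functions: one that lists the related pairs, and one that checks whether an object may join an existing cluster under either criterion. This is only an illustration using the A-F association counts from the slide; the threshold of 2 and the names `related_pairs` and `joins` are assumptions made for the example.

```python
# Association counts for objects A-F (upper triangle of the slide's matrix).
SIM = {
    ("A", "B"): 0, ("A", "C"): 1, ("A", "D"): 2, ("A", "E"): 2, ("A", "F"): 3,
    ("B", "C"): 2, ("B", "D"): 1, ("B", "E"): 1, ("B", "F"): 0,
    ("C", "D"): 0, ("C", "E"): 0, ("C", "F"): 1,
    ("D", "E"): 3, ("D", "F"): 2,
    ("E", "F"): 2,
}

def sim(x, y):
    """Symmetric lookup into the upper-triangle table."""
    return SIM[(x, y)] if (x, y) in SIM else SIM[(y, x)]

def related_pairs(threshold):
    """Step 4: pairs whose association is >= threshold."""
    return sorted(pair for pair, s in SIM.items() if s >= threshold)

def joins(obj, cluster, threshold, criterion):
    """Step 5: may `obj` join `cluster` under the given criterion?"""
    scores = [sim(obj, member) for member in cluster]
    if criterion == "single":      # the closest member decides
        return max(scores) >= threshold
    if criterion == "complete":    # the farthest member decides
        return min(scores) >= threshold
    raise ValueError(f"unknown criterion: {criterion}")

THRESHOLD = 2  # hypothetical value, chosen only for this illustration

print(related_pairs(THRESHOLD))
print(joins("C", ["B", "D"], THRESHOLD, "single"))
print(joins("C", ["B", "D"], THRESHOLD, "complete"))
```

With the cluster (B, D), object C is admitted under single link because it is related to B, but rejected under complete link because it is unrelated to D.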

Clustering: Example

Object-Attribute array:
        D1  D2  D3  D4  D5  D6
  t1     1   3   1   0   1   1
  t2     0   1   2   1   0   0
  t3     1   0   3   0   1   0
  t4     0   1   0   2   2   0
  t5     0   0   1   1   0   3
  t6     0   1   0   2   0   ?
  (the last two t6 entries are missing in the source; the similarity values imply 0 for D5 and a nonzero count for D6)

Object-Object Association array:
        D2   D3   D4   D5   D6
  D1   2/6  4/6   0   4/5  2/5
  D2        4/8  6/8  4/7  4/7
  D3             4/8  4/7  4/7
  D4                  2/7  4/7
  D5                       2/6

Threshold = 0.6
- Single Link Clusters: C1 = (D1, D3, D5), C2 = (D2, D4)
- Complete Link Clusters: C1 = (D1, D3) or (D1, D5), C2 = (D2, D4)
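
The slide does not name the association measure, but the listed fractions are consistent with the Dice coefficient, 2|X ∩ Y| / (|X| + |Y|), computed on term presence/absence. The sketch below works under that assumption. The last two entries of the t6 row are not legible in the source, so 0 and 1 are assumed here because they reproduce the listed similarities.

```python
from fractions import Fraction
from itertools import combinations

# Object-attribute array: one row per term, one column per document.
# ASSUMPTION: the last two t6 entries (D5, D6) are taken to be 0 and 1.
TERMS = {
    "t1": [1, 3, 1, 0, 1, 1],
    "t2": [0, 1, 2, 1, 0, 0],
    "t3": [1, 0, 3, 0, 1, 0],
    "t4": [0, 1, 0, 2, 2, 0],
    "t5": [0, 0, 1, 1, 0, 3],
    "t6": [0, 1, 0, 2, 0, 1],
}
DOCS = ["D1", "D2", "D3", "D4", "D5", "D6"]

# Presence/absence view: the set of terms that occur in each document.
term_sets = {
    doc: {t for t, row in TERMS.items() if row[i] > 0}
    for i, doc in enumerate(DOCS)
}

def dice(x, y):
    """Dice coefficient 2|X & Y| / (|X| + |Y|) on two term sets."""
    return Fraction(2 * len(x & y), len(x) + len(y))

similarity = {
    (a, b): dice(term_sets[a], term_sets[b])
    for a, b in combinations(DOCS, 2)
}

THRESHOLD = Fraction(6, 10)
related = [pair for pair, s in similarity.items() if s >= THRESHOLD]
print(related)
```

The three related pairs explain the clusters on the slide: D1 links to both D3 and D5, so single link merges all three, while D3 and D5 are not related to each other (4/7 < 0.6), so complete link can keep only one of the two pairs.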

Clustering: Problem

Document-Document Similarity Matrix:
        D2    D3    D4    D5    D6    D7
  D1   0.40  0.33  0.57  0.10  0.57  0.50
  D2         0.40  0.33  0.10  0.33  0.67
  D3               0.29  0.11  0.29  0.50
  D4                     0.11  0.25  0.40
  D5                           0.00   ?
  D6                                 0.40
  (only one of the two D5 values, 0.00, survives in the source; it is shown under D6 by assumption)

Threshold = 0.67
- Single & Complete Link: C1 = (D2, D7)

Threshold = 0.5
- Single Link: C1 = (D1, D2, D3, D4, D6, D7)
- Complete Link: C1 = (D1, D4) or (D1, D6) or (D1, D7), C2 = (D2, D7) or (D3, D7)

Threshold = 0.57
- Single Link: C1 = (D1, D4, D6), C2 = (D2, D7)
- Complete Link: C1 = (D1, D4) or (D1, D6), C2 = (D2, D7)
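
How strongly the result depends on the threshold can be checked by treating single-link clusters as the connected components of the graph whose edges are the pairs at or above the threshold. The sketch below takes that view. Row D5 is incomplete in the source, so both of its entries are assumed to be 0.00 here; `single_link_clusters` is an illustrative name.

```python
# Upper triangle of the slide's document-document similarity matrix.
# ASSUMPTION: both D5 entries are taken to be 0.00 (one is missing in the source).
ROWS = {
    "D1": [0.40, 0.33, 0.57, 0.10, 0.57, 0.50],
    "D2": [0.40, 0.33, 0.10, 0.33, 0.67],
    "D3": [0.29, 0.11, 0.29, 0.50],
    "D4": [0.11, 0.25, 0.40],
    "D5": [0.00, 0.00],
    "D6": [0.40],
}
DOCS = ["D1", "D2", "D3", "D4", "D5", "D6", "D7"]

def single_link_clusters(threshold):
    """Connected components of the graph whose edges are pairs >= threshold."""
    parent = {d: d for d in DOCS}

    def find(d):
        # Follow parent links up to the representative of d's component.
        while parent[d] != d:
            d = parent[d]
        return d

    for i, doc in enumerate(DOCS[:-1]):
        for other, s in zip(DOCS[i + 1:], ROWS[doc]):
            if s >= threshold:
                parent[find(other)] = find(doc)

    groups = {}
    for d in DOCS:
        groups.setdefault(find(d), []).append(d)
    # Singletons are left out, as on the slide.
    return sorted(g for g in groups.values() if len(g) > 1)

for t in (0.67, 0.57, 0.5):
    print(t, single_link_clusters(t))
```

Lowering the threshold from 0.57 to 0.5 adds only two related pairs (D1-D7 and D3-D7), yet that is enough to chain six of the seven documents into one single-link cluster.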

Clustering Types
- Hierarchical vs. Flat
  - Hierarchical: induce a hierarchy of clusters of decreasing generality (less efficient than flat)
  - Flat (Non-hierarchical): all clusters are at the same level
- Hard vs. soft
  - Hard: assign each item to a cluster (binary decision)
  - Soft: assign each item a probability of belonging to a cluster
- Disjunctive vs. non-disjunctive
  - Disjunctive: an item must belong to only one cluster
  - Non-disjunctive: an item can be part of more than one cluster
- Iterative
  1. Start with an initial set of clusters
  2. Reassign items to improve clusters
  3. Repeat step 2 until convergence
- Linkage vs. Non-linkage
  - Linkage: link together similar items to identify clusters
  - Non-linkage: e.g. iterative clustering

Clustering: Examples
- hard, non-hierarchical, disjunctive
- hard, hierarchical, disjunctive
- hard, non-hierarchical, non-disjunctive
- soft, non-hierarchical, non-disjunctive:
          C1     C2     C3
    D1   0.123  0.543  0.231
    D2   0.434  0.232  0.434
    D3   0.013  0.512   ...
    D4   0.444  0.277  0.435
    (one D3 value is missing in the source)

Clustering: Hierarchical vs Non-Hierarchical
- Hierarchical
  - Agglomerative (Bottom-Up)
    - Start at the bottom and merge a pair of clusters into a single cluster
    - Algorithm
      1. create doc-doc similarity matrix (initially, each document is its own cluster)
      2. combine the two most similar clusters into one
      3. update doc-doc similarity matrix
      4. go to 2 and repeat until there is only one cluster left
    - Linkage Methods
      - Single-linkage: proximity to the closest element in another cluster (max. similarity)
      - Complete-linkage: proximity to the most distant element (min. similarity)
      - Mean: proximity to the mean (centroid)
  - Divisive (top-down)
    - Start at the top and split one cluster into two new clusters
      - split the cluster to produce two new clusters with the largest dissimilarity
- Non-Hierarchical
  - find the best grouping of items into k clusters
  - e.g. k-means clustering
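
The agglomerative algorithm can be sketched in a few lines. The version below is a simplified illustration, not an efficient implementation: instead of updating a stored matrix in step 3, it recomputes cluster-to-cluster similarity on demand, and it covers only single and complete linkage. The four-document similarity table is hypothetical.

```python
def agglomerative(sim, items, linkage="single"):
    """Merge the two most similar clusters until one is left; return the merge log."""
    # Single linkage uses the closest pair of elements (max similarity),
    # complete linkage the most distant pair (min similarity).
    pick = max if linkage == "single" else min

    def cluster_sim(c1, c2):
        return pick(sim[frozenset((a, b))] for a in c1 for b in c2)

    clusters = [(item,) for item in items]  # step 1: each item is its own cluster
    merges = []
    while len(clusters) > 1:
        # Step 2: find the most similar pair of clusters.
        score, c1, c2 = max(
            ((cluster_sim(c1, c2), c1, c2)
             for i, c1 in enumerate(clusters) for c2 in clusters[i + 1:]),
            key=lambda t: t[0],
        )
        # Step 3 is implicit: cluster_sim recomputes similarities as needed.
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        merges.append((c1, c2, score))
    return merges

# Hypothetical similarities between four documents (not from the slides).
SIM = {
    frozenset(pair): s
    for pair, s in {
        ("d1", "d2"): 0.9, ("d2", "d3"): 0.8, ("d3", "d4"): 0.7,
        ("d1", "d3"): 0.2, ("d1", "d4"): 0.1, ("d2", "d4"): 0.3,
    }.items()
}
ITEMS = ["d1", "d2", "d3", "d4"]

single = agglomerative(SIM, ITEMS, "single")
complete = agglomerative(SIM, ITEMS, "complete")
for c1, c2, score in single:
    print("single  ", c1, c2, score)
for c1, c2, score in complete:
    print("complete", c1, c2, score)
```

On this toy data the two linkage methods agree on the first merge but then diverge: single linkage chains d3 onto (d1, d2) through its strong link to d2, whereas complete linkage pairs d3 with d4 first.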

k-means clustering
- Features
  - Iterative, Hard, Flat (non-hierarchical), non-linkage
  - n items are assigned to k clusters so that the average distance to the cluster mean is minimized
    - Uses Euclidean distance
- Algorithm
  1. Select k (number of clusters)
  2. Select k initial cluster centers c1, …, ck
     a. randomly assign each item to a cluster
     b. calculate the centroid for each cluster
  3. For each item
     a. calculate the distance to each cluster centroid
     b. assign the item to the closest cluster
  4. Go to 2.b and repeat until convergence
- Issues
  - must choose k, the number of clusters
  - must initialize the clusters
    - e.g. random selection of k documents as clusters
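
The algorithm steps translate fairly directly into code. The following is a minimal sketch in plain Python rather than a tuned implementation. Two details are assumptions not found on the slide: an optional `assignment` argument that fixes the initial clusters (handy for reproducible runs), and re-seeding an empty cluster with a randomly chosen item.

```python
import math
import random

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def kmeans(items, k, assignment=None, seed=0, max_iter=100):
    rng = random.Random(seed)
    if assignment is None:
        # Step 2a: randomly assign each item to a cluster.
        assignment = [rng.randrange(k) for _ in items]
    assignment = list(assignment)
    centers = []
    for _ in range(max_iter):
        # Step 2b: centroid of each cluster.
        centers = []
        for c in range(k):
            members = [p for p, a in zip(items, assignment) if a == c]
            # ASSUMPTION: an empty cluster is re-seeded with a random item.
            centers.append(centroid(members) if members else rng.choice(items))
        # Step 3: assign every item to the closest centroid (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in items
        ]
        # Step 4: stop once no item changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment, centers

items = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(items, k=2, assignment=[0, 1, 0, 1, 0, 1])
print(labels)
print(centers)
```

Starting from a deliberately mixed initial assignment, the two well-separated groups are recovered after one reassignment pass. On less clean data the outcome can depend on the initialization, which is the issue the slide raises.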

Automatic Thesaurus
- Typical Use
  - Query Refinement
    - heteronym: words that are spelled the same way but differ in pronunciation (e.g. bow)
    - homonym: words that are pronounced or spelled the same way but have distinctly different meanings
      - homograph: spelled the same but differ in meaning (e.g. fair, bank); different concept
      - homophone: pronounced the same but differ in meaning (e.g. bare and bear)
    - polyseme: words with multiple related meanings (e.g. mole, branch, bank); similar concept
  - Query Expansion
    - synonym: e.g. bank → financial institution
    - hypernym: e.g. car → vehicle
    - hyponym: e.g. car → SUV, van, sedan
- Term Clustering Example
  - Build Term vectors
    - rows in the inverted index
  - Cluster by Term-Term similarity
  - Premise
    - terms are related if they often appear in the same document (Term-Term Co-occurrence)
    - Problems
      - A very frequent term will co-occur with everything
      - Very general terms will co-occur with other general terms
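
The term clustering premise, and its first problem, can be seen on a toy collection. In the sketch below the four documents are hypothetical; the term vectors are the rows of a small inverted index, and co-occurrence is simply the number of documents two terms share.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy collection (not from the slides): doc id -> terms.
docs = {
    1: ["bank", "loan", "the"],
    2: ["bank", "river", "the"],
    3: ["loan", "interest", "the"],
    4: ["bank", "loan", "interest", "the"],
}

# Term vectors = rows of the inverted index: term -> set of doc ids.
index = defaultdict(set)
for doc_id, terms in docs.items():
    for term in terms:
        index[term].add(doc_id)

def cooccurrence(t1, t2):
    """Number of documents that contain both terms."""
    return len(index[t1] & index[t2])

for t1, t2 in combinations(sorted(index), 2):
    print(f"{t1:8} {t2:8} {cooccurrence(t1, t2)}")
```

Here "the" occurs in every document, so its co-occurrence with any other term equals that term's document frequency. Its score with "bank" (3) is higher than the score between the topically related "bank" and "loan" (2).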