Data Mining Algorithms and Principles Mining Text Data

  • Slides: 104
Data Mining: Algorithms and Principles
Mining Text Data
©Jiawei Han, Department of Computer Science, University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
Additional contributors: Rob McCann, Chao Liu, Bin Tan, Xiao Hu, in the Spring 2004 course
1/8/2022 Data Mining: Principles and Algorithms 1

Mining Text Data
  • Introduction: text mining, natural language processing, and information extraction
  • Text categorization
  • Text classification methods
  • Text cluster analysis
  • Summary
  • Research problems in text mining

Mining Text Data: An Introduction
Data mining / knowledge discovery applies to several kinds of data:
  • Structured data: HomeLoan (Loanee: Frank Rizzo; Lender: MWF; Agency: Lake View; Amount: $200,000; Term: 15 years)
  • Multimedia: Loans($200K, [map], ...)
  • Free text: "Frank Rizzo bought his home from Lake View Real Estate in 1992. He paid $200,000 under a 15-year loan from MW Financial."
  • Hypertext: <a href>Frank Rizzo</a> bought <a href>this home</a> from <a href>Lake View Real Estate</a> in <b>1992</b>. <p>...

Bag-of-Tokens Approaches
Documents are turned into token sets by feature extraction.
  • Document: "Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or …"
  • Token counts: nation – 5, civil – 1, war – 2, men – 2, died – 4, people – 5, Liberty – 1, God – 1, …
  • Loses all order-specific information!
  • Severely limits context!
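A minimal sketch of the bag-of-tokens idea, assuming a simple lowercase regular-expression tokenizer (the slide does not say how tokens are split). The function name `bag_of_tokens` is hypothetical. Counts here cover only the quoted excerpt, so they differ from the slide's counts for the full speech.

```python
import re
from collections import Counter

def bag_of_tokens(text):
    """Lowercase the text, split it into word tokens, and count them."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

excerpt = (
    "Four score and seven years ago our fathers brought forth on this "
    "continent, a new nation, conceived in Liberty, and dedicated to the "
    "proposition that all men are created equal. Now we are engaged in a "
    "great civil war, testing whether that nation, or"
)
bag = bag_of_tokens(excerpt)

# Word order is discarded: two different sentences give the same bag.
same_bag = bag_of_tokens("dog chases boy") == bag_of_tokens("boy chases dog")
```

The last line illustrates the slide's warning: once text is reduced to counts, "dog chases boy" and "boy chases dog" are indistinguishable.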

Natural Language Processing
Example sentence: "A dog is chasing a boy on the playground"
  • Lexical analysis (part-of-speech tagging): A/Det dog/Noun is/Aux chasing/Verb a/Det boy/Noun on/Prep the/Det playground/Noun
  • Syntactic analysis (parsing): Noun Phrase, Complex Verb, Prep Phrase, Verb Phrase, Sentence
  • Semantic analysis: Dog(d1). Boy(b1). Playground(p1). Chasing(d1, b1, p1).
  • Inference: Scared(x) if Chasing(_, x, _). Therefore Scared(b1).
  • Pragmatic analysis (speech act): a person saying this may be reminding another person to get the dog back…
(Taken from ChengXiang Zhai, CS 397cxz, Fall 2003)

General NLP: Too Difficult!
  • Word-level ambiguity
    – "design" can be a noun or a verb (ambiguous POS)
    – "root" has multiple meanings (ambiguous sense)
  • Syntactic ambiguity
    – "natural language processing" (modification)
    – "A man saw a boy with a telescope." (PP attachment)
  • Anaphora resolution
    – "John persuaded Bill to buy a TV for himself." (himself = John or Bill?)
  • Presupposition
    – "He has quit smoking." implies that he smoked before.
Humans rely on context to interpret (when possible). This context may extend beyond a given document!
(Taken from ChengXiang Zhai, CS 397cxz, Fall 2003)

Shallow Linguistics
Progress on useful sub-goals:
  • English lexicon
  • Part-of-speech tagging
  • Word sense disambiguation
  • Phrase detection / parsing

WordNet
An extensive lexical network for the English language
  • Contains over 138,838 words.
  • Several graphs, one for each part of speech.
  • Synsets (synonym sets), each defining a semantic sense.
  • Relationship information (antonym, hyponym, meronym, …)
  • Downloadable for free (UNIX, Windows)
  • Expanding to other languages (Global WordNet Association)
  • Funded >$3 million, mainly by government (translation interest)
  • Founder George Miller, National Medal of Science, 1991.
Figure: synonym and antonym links among the words moist, watery, damp, wet, dry, parched, anhydrous, arid.

Part-of-Speech Tagging
  • Training data (annotated text): This/Det sentence/N serves/V1 as/P an/Det example/N of/P annotated/V2 text/N …
  • Input to the POS tagger: "This is a new sentence."
  • Output: This/Det is/Aux a/Det new/Adj sentence/N
Pick the most likely tag sequence:
  • Independent assignment (most common tag)
  • Partial dependency (HMM)
(Adapted from ChengXiang Zhai, CS 397cxz, Fall 2003)
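The "independent assignment (most common tag)" strategy can be sketched as follows. This is an illustrative baseline only, not the tagger from the lecture: the function names, the default tag for unseen words, and the second training sentence are assumptions made up for the example.

```python
from collections import Counter, defaultdict

def train_most_common_tag(tagged_sentences):
    """Remember, for each word, the tag it carried most often in training."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_words(words, model, default="N"):
    """Tag each word independently; unseen words get the (assumed) default tag."""
    return [(w, model.get(w.lower(), default)) for w in words]

# Toy training data: the first sentence follows the slide, the second is hypothetical.
training = [
    [("This", "Det"), ("sentence", "N"), ("serves", "V1"), ("as", "P"),
     ("an", "Det"), ("example", "N"), ("of", "P"), ("annotated", "V2"),
     ("text", "N")],
    [("This", "Det"), ("is", "Aux"), ("a", "Det"), ("new", "Adj"),
     ("example", "N")],
]
model = train_most_common_tag(training)
tagged = tag_words(["This", "is", "a", "new", "sentence"], model)
```

Because each word is tagged in isolation, this baseline cannot resolve words such as "design" whose tag depends on context; that is the gap the HMM variant addresses.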

Word Sense Disambiguation
Example: "The difficulties of computational linguistics are rooted in ambiguity." (which sense of "rooted"?)
Tags around the target word: N Aux V P N
Supervised learning
Features:
  • Neighboring POS tags (N Aux V P N)
  • Neighboring words (linguistics are rooted in ambiguity)
  • Stemmed form (root)
  • Dictionary/thesaurus entries of neighboring words
  • High co-occurrence words (plant, tree, origin, …)
  • Other senses of the word within the discourse
Algorithms:
  • Rule-based learning (e.g., IG guided)
  • Statistical learning (e.g., Naïve Bayes)
  • Unsupervised learning (e.g., nearest neighbor)

Parsing
Choose the most likely parse tree…
  • Grammar (probabilistic CFG) rules: S → NP VP; NP → Det BNP; NP → NP PP; BNP → N; VP → V; VP → Aux V NP; VP → VP PP; PP → P NP; …
  • Lexicon: V → chasing; Aux → is; N → dog; N → boy; N → playground; Det → the; Det → a; P → on; …
  • Rule probabilities shown in the figure include 1.0, 0.3, 0.4, 0.3 (grammar) and 0.01, 0.003 (lexicon).
  • Figure: two candidate parse trees for "A dog is chasing a boy on the playground", with probabilities 0.000015 and 0.000011.
(Adapted from ChengXiang Zhai, CS 397cxz, Fall 2003)

Obstacles
  • Ambiguity: "A man saw a boy with a telescope."
  • Computational intensity: imposes a context horizon.
Text mining NLP approach:
  1. Locate promising fragments using fast IR methods (bag-of-tokens).
  2. Only apply slow NLP techniques to promising fragments.

Summary: Shallow NLP
However, shallow NLP techniques are feasible and useful:
  • Lexicon: machine-understandable linguistic knowledge
    – Possible senses, definitions, synonyms, antonyms, type-of, etc.
  • POS tagging: limits ambiguity (word/POS), entity extraction
    – "... research interests include text mining as well as bioinformatics." (text mining tagged as NP, bioinformatics as N)
  • WSD: stem/synonym/hyponym matches (doc and query)
    – Query: "Foreign cars"; Document: "I'm selling a 1976 Jaguar…"
  • Parsing: logical view of information (inference? translation?)
    – "A man saw a boy with a telescope."
Even without complete NLP, any additional knowledge extracted from text data can only be beneficial. Ingenuity will determine the applications.

References for Introduction
  1. C. D. Manning and H. Schütze, "Foundations of Statistical Natural Language Processing", MIT Press, 1999.
  2. S. Russell and P. Norvig, "Artificial Intelligence: A Modern Approach", Prentice Hall, 1995.
  3. S. Chakrabarti, "Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data", Morgan Kaufmann, 2002.
  4. G. Miller, R. Beckwith, C. Fellbaum, D. Gross, K. Miller, and R. Tengi, "Five Papers on WordNet", Princeton University, August 1993.
  5. C. Zhai, Introduction to NLP, Lecture Notes for CS 397cxz, UIUC, Fall 2003.
  6. M. Hearst, "Untangling Text Data Mining", ACL'99, invited paper. http://www.sims.berkeley.edu/~hearst/papers/acl99-tdm.html
  7. R. Sproat, Introduction to Computational Linguistics, LING 306, UIUC, Fall 2003.
  8. A Road Map to Text Mining and Web Mining, University of Texas resource page. http://www.cs.utexas.edu/users/pebronia/text-mining/
  9. Computational Linguistics and Text Mining Group, IBM Research. http://www.research.ibm.com/dssgrp/

Text Categorization
  • Pre-given categories and labeled document examples (categories may form a hierarchy)
  • Classify new documents
  • A standard classification (supervised learning) problem
Figure: a categorization system assigns incoming documents to categories such as Sports, Business, Education, Science, …

Applications
  • News article classification
  • Automatic email filtering
  • Webpage classification
  • Word sense disambiguation
  • ……

Categorization Methods
  • Manual: typically rule-based
    – Does not scale up (labor-intensive, rule inconsistency)
    – May be appropriate for special data in a particular domain
  • Automatic: typically exploiting machine learning techniques
    – Vector space model based: prototype-based (Rocchio), K-nearest neighbor (KNN), decision tree (learn rules), neural networks (learn non-linear classifier), support vector machines (SVM)
    – Probabilistic or generative model based: Naïve Bayes classifier

Vector Space Model
  • Represent a doc by a term vector
    – Term: basic concept, e.g., word or phrase
    – Each term defines one dimension
    – N terms define an N-dimensional space
    – Each element of the vector corresponds to a term weight
    – E.g., d = (x1, …, xN), where xi is the "importance" of term i
  • A new document is assigned to the most likely category based on vector similarity.

VS Model: Illustration
Figure: categories C1, C2, C3 and a new doc plotted in a term space with axes Java, Microsoft, and Starbucks.

What VS Model Does Not Specify
  • How to select terms to capture "basic concepts"
    – Word stopping, e.g., "a", "the", "always", "along"
    – Word stemming, e.g., "computer", "computing", "computerize" => "compute"
    – Latent semantic indexing
  • How to assign weights
    – Not all words are equally important: some are more indicative than others, e.g., "algebra" vs. "science"
  • How to measure the similarity

How to Assign Weights
Two-fold heuristics based on frequency:
  • TF (term frequency)
    – More frequent within a document => more relevant to semantics
    – E.g., "query" vs. "commercial"
  • IDF (inverse document frequency)
    – Less frequent among documents => more discriminative
    – E.g., "algebra" vs. "science"

TF Weighting
  • Weighting: more frequent => more relevant to topic
    – E.g., "query" vs. "commercial"
    – Raw TF = f(t, d): how many times term t appears in doc d
  • Normalization: document length varies => relative frequency preferred
    – E.g., maximum frequency normalization

IDF Weighting
  • Idea: less frequent among documents => more discriminative
  • Formula, in terms of:
    – n: total number of docs
    – k: number of docs in which term t appears (the DF, document frequency)
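The formula itself did not survive in the slide text. As an illustration only, one widely used variant is IDF(t) = 1 + log(n / k); the sketch below assumes that form (and the natural logarithm), so treat both the formula and the function name `idf` as assumptions rather than the lecture's definition.

```python
import math

def idf(n_docs, doc_freq):
    """Assumed variant: IDF(t) = 1 + log(n / k), with n docs and k docs containing t."""
    if not 0 < doc_freq <= n_docs:
        raise ValueError("doc_freq must be between 1 and n_docs")
    return 1.0 + math.log(n_docs / doc_freq)

# A term found in every document gets the minimum weight; rarer terms score higher.
common = idf(100, 100)
rare = idf(100, 1)
```

Whatever the exact variant, the behavior the slide asks for is the same: the weight decreases as the term appears in more documents.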

TF-IDF Weighting
  • TF-IDF weighting: weight(t, d) = TF(t, d) * IDF(t)
    – Frequent within doc => high TF => high weight
    – Selective among docs => high IDF => high weight
  • Recall the VS model
    – Each selected term represents one dimension
    – Each doc is represented by a feature vector
    – The coordinate of document d along term t is the TF-IDF weight
    – This is more reasonable than using raw counts
  • Just for illustration …
    – Many more complex and more effective weighting variants exist in practice
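A small sketch of weight(t, d) = TF(t, d) * IDF(t) follows. It assumes maximum-frequency normalization for TF and the variant IDF(t) = 1 + log(n / k); both choices, the function name `tf_idf_vectors`, and the requirement that documents be non-empty are assumptions for illustration. The three toy documents reuse the wording of the lecture's example documents.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs: list of non-empty token lists. Returns one {term: weight} dict per doc."""
    n = len(docs)
    # Document frequency: in how many docs each term appears at least once.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        max_count = max(counts.values())
        vectors.append({
            term: (count / max_count) * (1.0 + math.log(n / df[term]))
            for term, count in counts.items()
        })
    return vectors

docs = [
    "text mining search engine text".split(),
    "travel text map travel".split(),
    "government president congress".split(),
]
vectors = tf_idf_vectors(docs)
```

Each dictionary is a sparse feature vector: terms that do not occur in a document are simply absent, which corresponds to a zero coordinate.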

How to Measure Similarity?
  • Given two documents, each represented as a term-weight vector
  • Similarity definitions
    – Dot product
    – Normalized dot product (or cosine)
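The two similarity definitions can be written down directly. The sketch below assumes documents are sparse `{term: weight}` dictionaries (a representation chosen for the example) and returns 0.0 for the cosine when either vector is all zeros, which is a convention rather than something the slide specifies.

```python
import math

def dot(u, v):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    return sum(weight * v[term] for term, weight in u.items() if term in v)

def cosine(u, v):
    """Normalized dot product; defined here as 0.0 if either vector is empty."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)

a = {"text": 2.0, "mining": 1.0}
b = {"text": 3.0}
```

The cosine removes the effect of vector length, so a long document and a short one on the same topic can still score as highly similar.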

Illustrative Example
Documents:
  • doc1: text mining search engine text
  • doc2: travel text map travel
  • doc3: government president congress ……
To whom is newdoc more similar?

Term counts, with TF-IDF weights in parentheses:

  term        IDF (faked)   doc1     doc2     doc3     newdoc
  text        2.4           2(4.8)   1(2.4)            1(2.4)
  mining      4.5           1(4.5)                     1(4.5)
  travel      2.8                    2(5.6)
  map         3.3                    1(3.3)
  search      2.1           1(2.1)
  engine      5.4           1(5.4)
  govern      2.2                             1(2.2)
  president   3.2                             1(3.2)
  congress    4.3                             1(4.3)

  Sim(newdoc, doc1) = 4.8*2.4 + 4.5*4.5
  Sim(newdoc, doc2) = 2.4*2.4
  Sim(newdoc, doc3) = 0
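The example's arithmetic can be checked with a few lines of code. This sketch uses the term counts and the faked IDF values from the example and the plain dot product as the similarity; the variable and function names are made up for illustration.

```python
# Faked IDF values and raw term counts taken from the example.
idf = {"text": 2.4, "mining": 4.5, "travel": 2.8, "map": 3.3, "search": 2.1,
       "engine": 5.4, "govern": 2.2, "president": 3.2, "congress": 4.3}
counts = {
    "doc1": {"text": 2, "mining": 1, "search": 1, "engine": 1},
    "doc2": {"text": 1, "travel": 2, "map": 1},
    "doc3": {"govern": 1, "president": 1, "congress": 1},
}
newdoc = {"text": 1, "mining": 1}

def weigh(term_counts):
    """Weight = raw count * IDF, as in the example's parenthesized values."""
    return {term: n * idf[term] for term, n in term_counts.items()}

def sim(u, v):
    """Dot product over the terms the two vectors share."""
    return sum(w * v[term] for term, w in u.items() if term in v)

new_vec = weigh(newdoc)
scores = {name: sim(new_vec, weigh(c)) for name, c in counts.items()}
best = max(scores, key=scores.get)
```

The scores work out to 4.8*2.4 + 4.5*4.5 = 31.77 for doc1, 2.4*2.4 = 5.76 for doc2, and 0 for doc3, so newdoc is most similar to doc1.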

VS Model-Based Classifiers
  • What do we have so far?
    – A feature space with a similarity measure
    – This is a classic supervised learning problem: search for an approximation to the classification hyperplane
  • VS model based classifiers
    – K-NN
    – Decision tree based
    – Neural networks
    – Support vector machine

Probabilistic Model
Main ideas:
  • Category C is modeled as a probability distribution over pre-defined random events
  • Random events model the process of generating documents
  • Therefore, how likely a document d is to belong to category C is measured by the probability that category C generates d.

Quick Revisit of Bayes' Rule
  • Category hypothesis space: H = {C1, …, Cn}
  • One document: D
  • Terms labeled in the formula: the posterior probability of Ci and the document model for category C
  • As we want to pick the most likely category C*, we can drop p(D)
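The formula did not survive in the slide text; the standard statement of Bayes' rule in this notation, and the decision rule that follows from dropping p(D), can be written as follows (the name "category prior" for p(C_i) is borrowed from the parameter-estimation slide):

```latex
\[
  \underbrace{P(C_i \mid D)}_{\text{posterior probability of } C_i}
  = \frac{\overbrace{P(D \mid C_i)}^{\text{document model for } C_i}\;
          \overbrace{P(C_i)}^{\text{category prior}}}{P(D)}
\]
\[
  C^{*} = \arg\max_{C \in H} P(C \mid D)
        = \arg\max_{C \in H} P(D \mid C)\, P(C)
\]
```

The second equality holds because P(D) is the same for every category, so it does not affect which category attains the maximum.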

Probabilistic Model
  • Multi-Bernoulli
    – Event: word presence or absence
    – D = (x1, …, x|V|), xi = 1 for presence of word wi; xi = 0 for absence
    – Parameters: {p(wi=1|C), p(wi=0|C)}, with p(wi=1|C) + p(wi=0|C) = 1
  • Multinomial (language model)
    – Event: word selection/sampling
    – D = (n1, …, n|V|), ni: frequency of word wi; n = n1 + … + n|V|
    – Parameters: {p(wi|C)}, with p(w1|C) + … + p(w|V||C) = 1

Parameter Estimation
  • Training examples: E(C1), E(C2), …, E(Ck) for categories C1, C2, …, Ck
  • Vocabulary: V = {w1, …, w|V|}
  • Quantities to estimate:
    – Category prior
    – Multi-Bernoulli doc model
    – Multinomial doc model

Classification of New Document
  • Multi-Bernoulli
  • Multinomial
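The estimation and classification formulas are missing from the slide text. As a hedged illustration of the multinomial case, the sketch below estimates the category prior from document counts and p(w|C) from word counts with add-one smoothing, then picks the category maximizing log p(C) + Σ log p(w|C). The smoothing choice, the handling of unseen words, the function names, and the toy training set are all assumptions, not the lecture's definitions.

```python
import math
from collections import Counter

def train_multinomial_nb(labeled_docs):
    """labeled_docs: list of (tokens, category). Returns priors, word counts, vocabulary."""
    doc_counts = Counter()
    word_counts = {}
    vocab = set()
    for tokens, category in labeled_docs:
        doc_counts[category] += 1
        word_counts.setdefault(category, Counter()).update(tokens)
        vocab.update(tokens)
    priors = {c: doc_counts[c] / len(labeled_docs) for c in doc_counts}
    return priors, word_counts, vocab

def classify(tokens, priors, word_counts, vocab):
    """Return the category with the highest log posterior (up to the constant p(D))."""
    best_category, best_score = None, -math.inf
    for category, prior in priors.items():
        total = sum(word_counts[category].values())
        score = math.log(prior)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored (assumption)
                # Add-one smoothed estimate of p(w|C) (assumption).
                score += math.log((word_counts[category][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Hypothetical toy training set.
training = [
    ("game team score".split(), "Sports"),
    ("team coach win".split(), "Sports"),
    ("stock market profit".split(), "Business"),
    ("market trade price".split(), "Business"),
]
priors, word_counts, vocab = train_multinomial_nb(training)
label = classify("team win game".split(), priors, word_counts, vocab)
```

Working in log space avoids numeric underflow when a document contains many words.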

Categorization Methods
  • Vector space model
    – K-NN
    – Decision tree
    – Neural network
    – Support vector machine
  • Probabilistic model
    – Naïve Bayes classifier
  • Many, many others and variants exist [F.S. 02]
    – E.g., Bim, Nb, Ind, Swap-1, LLSF, Widrow-Hoff, Rocchio, Gis-W, ……

Evaluations
  • Effectiveness measure
    – Classic: precision and recall
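The definitions of precision and recall are not legible in the slide text; the standard ones are precision = TP / (TP + FP) and recall = TP / (TP + FN). A minimal sketch for a single category follows, where `predicted` is the set of documents the classifier assigned to the category and `actual` is the set that truly belongs to it (both names, and returning 0.0 for an empty denominator, are assumptions of the example).

```python
def precision_recall(predicted, actual):
    """Precision = TP / |predicted|, recall = TP / |actual| for one category."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Documents 2 and 4 are correctly assigned; 1 and 3 are false positives; 5 is missed.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 4, 5})
```

Here precision is 2/4 and recall is 2/3, showing that the two measures can diverge for the same classifier output.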

Evaluation (cont.)
  • Benchmarks
    – Classic: Reuters collection, a set of newswire stories classified under categories related to economics
  • Effectiveness
    – Difficulties of strict comparison: different parameter settings; different "split" (or selection) between training and testing; various optimizations ……
    – However, some results are widely recognized:
      Best: boosting-based committee classifier and SVM
      Worst: Naïve Bayes classifier
  • Need to consider other factors, especially efficiency

Summary: Text Categorization
  • Wide application domain
  • Comparable effectiveness to professionals
    – Manual TC is not 100% accurate and is unlikely to improve substantially.
    – Automatic TC (ATC) is growing at a steady pace
  • Prospects and extensions
    – Very noisy text, such as text from OCR
    – Speech transcripts

References
  • Fabrizio Sebastiani, "Machine Learning in Automated Text Categorization", ACM Computing Surveys, Vol. 34, No. 1, March 2002.
  • Soumen Chakrabarti, "Data mining for hypertext: A tutorial survey", ACM SIGKDD Explorations, 2000.
  • Cleverdon, "Optimizing convenient online access to bibliographic databases", Information Services and Use, 4(1), 37-47, 1984.
  • Yiming Yang, "An evaluation of statistical approaches to text categorization", Journal of Information Retrieval, 1:67-88, 1999.
  • Yiming Yang and Xin Liu, "A re-examination of text categorization methods", Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99), pp. 42-49, 1999.

Research Problems in Text Mining
  • Google: is it too easy or too shallow?
  • How to find pages that approximately match sophisticated documents, incorporating user profiles or preferences?
  • Looking back at Google: inverted indices
  • Construction of indices for sophisticated documents, incorporating user profiles or preferences
  • Similarity search of such pages using such indices

Text Mining – Clustering
CS 412, Spring 2004
Xiao Hu

Outline
  • Introduction
  • Agglomerative clustering
  • K-means
  • The EM algorithm
  • Partial supervision
  • Summary

Examples of Doc/Term Clustering
  • Clustering of retrieval results
  • Clustering of documents in a collection
  • Term clustering to define a "concept" or "theme"
  • Automatic construction of hyperlinks
  • ……
Very useful for text mining and exploratory text analysis.

Clustering Algorithms
  • Structure
    – Hierarchical clustering
    – Flat clustering
  • Assignment
    – Hard clustering: each object is assigned to one cluster completely
    – Soft clustering: each object is assigned to multiple clusters with memberships
  • Overlap
    – Disjunctive clustering: objects can belong to multiple clusters

Text Clustering Algorithms
  • Similarity-based (need a similarity function)
    – Construct a partition
    – Agglomerative, bottom up
    – K-means
    – Typically "hard" clustering
  • Model-based (latent models, probabilistic or algebraic)
    – First compute the model
    – Clusters are obtained easily after having a model
    – EM algorithm
    – Typically "soft" clustering

Agglomerative Clustering
  • Given a similarity function to measure similarity between two objects
  • Gradually group similar objects together in a bottom-up fashion
    – At each step, merge the two most similar clusters
  • Stop when some stopping criterion is met
    – # of groups; # of levels; …
  • Variations: different ways to compute group similarity based on individual object similarity

How to Compute Group Similarity?
Three popular methods. Given two groups g1 and g2:
  • Single-link algorithm: s(g1, g2) = similarity of the closest pair
  • Complete-link algorithm: s(g1, g2) = similarity of the farthest pair
  • Average-link algorithm: s(g1, g2) = average similarity of all pairs
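The three group-similarity definitions translate directly into code. In this sketch a "pair" is one object from g1 and one from g2; since higher similarity means closer, the closest pair has the maximum similarity and the farthest pair the minimum. The function names and the toy similarity `neg_distance` (negative distance between numbers) are assumptions chosen for the example.

```python
from itertools import product

def pair_sims(g1, g2, sim):
    """Similarities of all cross-group pairs (one object from each group)."""
    return [sim(a, b) for a, b in product(g1, g2)]

def single_link(g1, g2, sim):
    return max(pair_sims(g1, g2, sim))   # closest pair = most similar pair

def complete_link(g1, g2, sim):
    return min(pair_sims(g1, g2, sim))   # farthest pair = least similar pair

def average_link(g1, g2, sim):
    sims = pair_sims(g1, g2, sim)
    return sum(sims) / len(sims)

def neg_distance(a, b):
    """Toy similarity for 1-D points: larger (less negative) means closer."""
    return -abs(a - b)

g1, g2 = [0, 1], [3, 5]
results = (single_link(g1, g2, neg_distance),
           complete_link(g1, g2, neg_distance),
           average_link(g1, g2, neg_distance))
```

For these groups the closest pair is (1, 3) and the farthest is (0, 5), so the three measures give -2, -5, and -3.5 respectively.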

Three Methods Illustrated
Figure: two groups g1 and g2, with the pairs used by the single-link, complete-link, and average-link algorithms marked.

Comparison of the Three Methods
  • Single-link
    – "Loose" clusters, locally coherent
    – Individual decision, sensitive to outliers
  • Complete-link
    – "Tight" clusters, globally coherent
    – Individual decision, sensitive to outliers
  • Average-link
    – "In between"
    – Group decision, insensitive to outliers
  • Which one is the best? Depends on the application!
    – Usually, global coherence is preferred

K-Means Clustering
  • Similarity-based
    – Usually Euclidean distance
  • Hard clustering algorithm
  • Clusters are defined by the centroids of their members
  • Requires setting the # of clusters and the initial cluster centers

K-Means Clustering (cont.)
  Select k initial centers
  while stopping criterion is not true do
      for all clusters do
          assign to the cluster every object that is closest to its center
      end
      for all means do
          recompute the mean as the centroid of the cluster's members
      end
  end
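The loop above can be sketched in a few lines. This version assumes Euclidean distance, takes the initial centers as an argument, stops when assignments no longer change (or after `max_iters` rounds), and leaves a center untouched if its cluster becomes empty; these choices and the function name `kmeans` are assumptions, since the slide leaves the stopping criterion open.

```python
import math

def kmeans(points, centers, max_iters=100):
    """points, centers: lists of equal-length numeric tuples. Returns (centers, assignment)."""
    centers = [tuple(c) for c in centers]
    assignment = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # stopping criterion: no point changed cluster
        assignment = new_assignment
        # Update step: each center becomes the centroid of its members.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centers, assignment

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, assignment = kmeans(points, centers=[(0, 0), (10, 10)])
```

The result depends on the initial centers, which is why the slide lists them, together with k, as something the user must set.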

Hierarchical vs. K-Means
  • Hierarchical
    – Provides more information
    – Less efficient: O(n^2) or O(n^3)
    – Preferable for detailed data analysis
  • K-means
    – Simple
    – Efficient: O(n)
    – Used as preprocessing
    – # of clusters is preset
  • Both
    – Similarity-based
    – Sensitive to similarity measures
    – Hard clustering; sometimes soft clustering is needed, e.g., part-of-speech

EM Algorithm: General
  • Data: X (observed) + H (hidden)
  • Model: p(X, H|θ), with parameter θ
  • "Incomplete" likelihood: L(θ) = log p(X|θ)
  • "Complete" likelihood: Lc(θ) = log p(X, H|θ)
  • Goal: find the parameter θ that maximizes Lc(θ)
  • Expectation: if we knew the value of θ, we could compute the expected value of the hidden structure of the model
  • Maximization: if we knew the expected value of the hidden structure of the model, we could compute the maximum likelihood value of θ

EM Algorithm: General (cont.)
EM tries to iteratively maximize the complete likelihood. Starting with an initial guess θ(0):
  1. E-step: compute the expectation of the complete likelihood (the Q-function)
  2. M-step: compute θ(n) by maximizing the Q-function
  3. Stop when Lc(θ) converges
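The Q-function is named but not written out in the slide text. In the usual formulation, with X the observed data, H the hidden data, and θ^(n-1) the previous estimate, the two steps read as follows (a standard textbook form, given here as a reminder rather than as the slide's exact notation):

```latex
\[
  \text{E-step:}\quad
  Q\bigl(\theta;\theta^{(n-1)}\bigr)
  = E_{H \mid X,\,\theta^{(n-1)}}\bigl[L_c(\theta)\bigr]
  = \sum_{H} p\bigl(H \mid X, \theta^{(n-1)}\bigr)\,\log p(X, H \mid \theta)
\]
\[
  \text{M-step:}\quad
  \theta^{(n)} = \arg\max_{\theta}\; Q\bigl(\theta;\theta^{(n-1)}\bigr)
\]
```

The expectation is taken over the hidden data only, using the previous parameter estimate, which is what makes each M-step a tractable maximization.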

Mixture Model for Clustering
Figure: three component densities P(X|Cluster1), P(X|Cluster2), P(X|Cluster3).
P(X) = P(Cluster1)P(X|Cluster1) + P(Cluster2)P(X|Cluster2) + P(Cluster3)P(X|Cluster3)

Simple Unigram Mixture Model
  • Model/topic 1, p(w|θ1), mixing weight λ = 0.7:
    text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001, …
  • Model/topic 2, p(w|θ2), mixing weight 1 - λ = 0.3:
    food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, …
  • Mixture: p(w|θ1, θ2) = λ p(w|θ1) + (1 - λ) p(w|θ2)
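A short sketch of the mixture formula with the slide's numbers follows. The slide elides most word probabilities, so the dictionaries below contain only the listed entries, and treating an unlisted word as probability 0.0 is a simplification made for this example; `mixture_prob` is a hypothetical name.

```python
def mixture_prob(word, theta1, theta2, lam):
    """p(w) = lam * p(w|theta1) + (1 - lam) * p(w|theta2)."""
    # Unlisted words default to 0.0 here only because the slide elides them.
    return lam * theta1.get(word, 0.0) + (1 - lam) * theta2.get(word, 0.0)

theta1 = {"text": 0.2, "mining": 0.1, "association": 0.01,
          "clustering": 0.02, "food": 0.00001}
theta2 = {"food": 0.25, "nutrition": 0.1, "healthy": 0.05, "diet": 0.02}

p_food = mixture_prob("food", theta1, theta2, lam=0.7)
```

For "food" this gives 0.7 * 0.00001 + 0.3 * 0.25 = 0.075007: almost all of the probability comes from topic 2, even though topic 1 has the larger mixing weight.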

EM in Text Clustering
  • Estimating a mixture of probability distributions
  • An observed document d is generated by several underlying causes C1, C2, …, Ck
    – p(d|C1, C2, …, Ck) = p(C1)p(d|C1) + p(C2)p(d|C2) + … + p(Ck)p(d|Ck)
    – p(d|Ci) is computed from terms
  • Let θ be all parameters: p(Ci), p(w|Ci), 1 <= i <= k
  • Hidden data z ∈ {0, 1}: zij = 1 iff di is in cluster j

EM in Text Clustering
  • Data: D = {d1, …, dn}
  • Clusters/groups: c1, c2, …, ck
  • Hidden variables: z11, …, z1k for document d1, through zn1, …, znk for document dn; zij ∈ {0, 1}, zij = 1 iff di is in cluster j
  • Incomplete likelihood and complete likelihood are defined over D and the hidden variables
  • E-step: compute E_{z|θold}[Lc(θ|D)]
    – Compute p(zij|di, θold)
  • M-step: θ = argmax_θ E_{z|θold}[Lc(θ|D)]
    – Compute expected counts for estimating θ

EM Updating Formula
  • Parameters: θ = ({p(Ci)}, {p(wj|Ci)})
  • Initialization: randomly set θ0
  • Repeat until convergence
    – E-step
    – M-step
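The update formulas are not legible in the slide text. The sketch below shows one common way to run EM for a mixture of multinomials over word counts: the E-step computes p(z_ij | d_i, θ) in log space, and the M-step re-estimates p(C_j) and p(w|C_j) from expected counts. The small smoothing constant, the fixed number of iterations in place of a convergence test, and all names are assumptions for illustration.

```python
import math
import random

def em_multinomial_mixture(docs, k, vocab_size, iters=50, seed=0, smooth=1e-2):
    """docs: list of word-count lists of length vocab_size.
    Returns (priors, word_probs, resp) where resp[i][j] approximates p(z_ij = 1 | d_i)."""
    rng = random.Random(seed)
    # Random initialization of theta(0).
    priors = [1.0 / k] * k
    word_probs = []
    for _ in range(k):
        raw = [rng.random() + 0.1 for _ in range(vocab_size)]
        total = sum(raw)
        word_probs.append([r / total for r in raw])
    resp = []
    for _ in range(iters):
        # E-step: responsibilities p(z_ij | d_i, theta_old), normalized per document.
        resp = []
        for counts in docs:
            logs = [
                math.log(priors[j])
                + sum(c * math.log(word_probs[j][w]) for w, c in enumerate(counts))
                for j in range(k)
            ]
            top = max(logs)
            weights = [math.exp(v - top) for v in logs]
            norm = sum(weights)
            resp.append([wt / norm for wt in weights])
        # M-step: re-estimate parameters from (smoothed) expected counts.
        for j in range(k):
            expected_docs = sum(r[j] for r in resp)
            priors[j] = (expected_docs + smooth) / (len(docs) + k * smooth)
            expected = [
                smooth + sum(r[j] * counts[w] for r, counts in zip(resp, docs))
                for w in range(vocab_size)
            ]
            total = sum(expected)
            word_probs[j] = [e / total for e in expected]
    return priors, word_probs, resp

# Toy data: two documents use words 0-1, two use words 2-3.
docs = [[5, 5, 0, 0], [4, 6, 0, 0], [0, 0, 5, 5], [0, 0, 6, 4]]
priors, word_probs, resp = em_multinomial_mixture(docs, k=2, vocab_size=4)
```

Each row of `resp` is a soft cluster membership for one document, which is the sense in which EM clustering is "soft".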

Properties of the EM Algorithm
  • Hill-climbing approach
    – Only guarantees a local maximum
  • Sensitive to the parameter initialization
    – Use the results of another clustering algorithm to initialize the parameters
  • Convergence rate can be very slow

Partially Supervised Clustering
  • Traditionally, clustering is regarded as an unsupervised approach
    – No way to use domain knowledge
  • All objects were unlabeled => classification not possible.
  • Only a small amount of knowledge was added => semi-supervised clustering
  • The clustering accuracy was greatly improved (from ~50% accuracy to 100%).

Domain Knowledge
  • A small amount of labeled data
  • Instance-level constraints
    – Must-link: two objects must be put in the same cluster.
    – Cannot-link: two objects must not be put in the same cluster.
    – Objects that are similar/dissimilar to each other.
  • Existing taxonomy: some classification info.
  • Feedback: good cluster, bad cluster
  • Others
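Instance-level constraints are easy to represent as pairs of object identifiers. The sketch below only checks a finished cluster assignment against must-link and cannot-link pairs; it is an illustration with hypothetical names, not a constrained clustering algorithm.

```python
def violated_constraints(assignment, must_link, cannot_link):
    """assignment: {object_id: cluster_id}. Returns the list of violated constraints."""
    violations = []
    for a, b in must_link:
        if assignment[a] != assignment[b]:
            violations.append(("must-link", a, b))
    for a, b in cannot_link:
        if assignment[a] == assignment[b]:
            violations.append(("cannot-link", a, b))
    return violations

# Hypothetical assignment of four documents to two clusters.
assignment = {"d1": 0, "d2": 0, "d3": 1, "d4": 1}
violations = violated_constraints(
    assignment,
    must_link=[("d1", "d2"), ("d2", "d3")],
    cannot_link=[("d3", "d4")],
)
```

A clustering algorithm could use such a check either to reject candidate assignments outright or to add a penalty to its objective function for each violation.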

Supervision: When?
  • At the beginning of clustering
  • During cluster validation, to guide the next round of clustering
  • When the algorithm requests
    – The user may give a "don't know" response.

Supervision: How?
  • Guide the formation of seed clusters.
  • Force/recommend some objects to be put in the same cluster/different clusters.
  • Modify the objective function.
  • Modify the similarity function.
  • Modify the distance matrix.

An Example: EM Algorithm
  • Docs: D = E(C1) ∪ … ∪ E(Ck) ∪ U, where U is the unlabeled set and E(Ci) are the labeled sets
  • Parameters: θ = ({p(Ci)}, {p(wj|Ci)})
  • Initialization: randomly set θ0
  • Repeat until convergence
    – E-step (only applied to di in U)
      Essentially, set p(zij) = 1 for all di in E(Cj)!
    – M-step (pool real counts from E and expected counts from U)

Summary
  • Introduction
  • Agglomerative clustering
    – Compute group similarity
  • K-means
    – A flat clustering algorithm vs. hierarchical
  • The EM algorithm
    – Mixture model
  • Partial supervision
    – Make use of domain knowledge

References
  • C. Zhai, Lecture for CS 397-CXZ, Intro Text Info. Systems.
  • Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, May 1999.
  • Michael W. Berry, Survey of Text Mining: Clustering, Classification, and Retrieval, Springer Verlag, September 2003.
  • Charu C. Aggarwal, et al., On Using Partial Supervision for Text Categorization, IEEE Transactions on Knowledge and Data Engineering 16 (02), p. 145-288, February 1, 2004.
  • Hanjoon Kim, et al., A Semi-Supervised Document Clustering Technique for Information Organization, CIKM 2000.

www.cs.uiuc.edu/~hanj
Thank you!!!