Exploiting Wikipedia as External Knowledge for Document Clustering

Outline • Introduction • Framework • Concept Mapping: Exact and Relatedness Match • Category

Introduction • Problems of BOW-based clustering: – Ignores the relationships among words – If

Introduction • Two issues for enhancing text clustering by leveraging ontology semantics: – An

Wikipedia as Ontology The free encyclopedia that anyone can edit • Wikipedia is a

Wikipedia as Ontology • Unlike other standard ontologies, such as Word. Net and Mesh,

Wikipedia as Ontology • Moreover, Wikipedia has a well-formed structure – Each article only

Wikipedia Article that describes the Concept Artificial intelligence

AI is redirected to its equivalent concept Artificial Intelligence

The concept Artificial Intelligence belongs to four categories: Artificial intelligence, Cybernetics, Formal sciences &

The different meanings that Artificial intelligence may refer to are listed in its disambiguation

The Framework of leveraging Wikipedia for document clustering Similarity metric between two documents: ·SIM(d

Concept Mapping • Match documents to Wikipedia concepts. • Result: A document-concept Matrix •

1. Concept Mapping ---Exact Match d 1 Artificial Intelligence, or AI for short, is

Concept Mapping Schemes: Exact Match – Each document is scanned to find Wikipedia concepts

Concept Mapping Schemes: Exact Match – A dictionary is constructed, with each entry corresponding

Concept Mapping Schemes: Exact Match – High efficiency – Low recall. Only the concepts

2. Concept Mapping –Relatedness Match d 1 Wikipedia collection Artificial Intelligence, or AI for

Concept Mapping Schemes: Relatedness Match Step 1: A Wikipedia term-concept matrix is constructed from

Concept Mapping Schemes: Relatedness Match Step 2: The word-concept matrix is used as a

Concept Mapping Schemes: Relatedness Match – More time consuming – Helpful for identify relevant

Category Mapping Wikipedia Category CAT 1 C 2 1 1 0 1 Wikipedia Concept

3. Category Mapping Concept Mapping (Exact Match or Relatedness Match) The hierarchical categorization system

Category Mapping • Exact match: – A document-category frequency matrix is first derived from

Category Mapping • Relatedness Match – The document-category matrix is generated by replacing each

Document Clustering • Agglomerative Clustering • Partitional Clustering and respectively indicate the importance of

Experiments • Wikipedia data: 911, 028 articles and about 29000 categories after pre-processing and

Experiments • Clustering Schemes under Comparison Notation Explanation Word Clustering solely based on word

Experiments • Parameter Setting (α and β that indicate the importance of concept vector

Agglomerative clustering results 20 News Group NMI Match Scheme Word (Base. Line) F-Score EM

Agglomerative clustering results LA Times NMI Match Scheme Word (Base. Line) F-Score EM RM

Agglomerative clustering results TDT 2 NMI Match Scheme Word (Base. Line) F-Score EM RM

Partitional Clustering Results 20 News Group NMI Match Scheme Word (Base. Line) F-Score EM

Partitional Clustering Results LA Times NMI Match Scheme Word (Base. Line) F-Score EM RM

Partitional Clustering Results TDT 2 NMI Match Scheme Word (Base. Line) F-Score EM RM

Conclusions 1. 2. 3. 4. 5. Category information is most useful for improving clustering

Future Work • Introduce sense disambiguation functions into the concept mapping process by leveraging

Slides: 41

Download presentation

Exploiting Wikipedia as External Knowledge for Document Clustering Xiaohua Hu 1, Xiaodan Zhang 1, Caimei Lu 1, E. K. Park 2, Xiaohua Zhou 1 1 College of Information Science and Technology, Drexel University, Philadelphia, PA 19104, USA 2 School of Computing and Engineering, University of Missouri at Kansas City, MO 64110, USA

Outline • Introduction • Framework • Concept Mapping: Exact and Relatedness Match • Category Mapping • Clustering • Experiments & Results • Conclusions & Future work

Introduction • Problems of BOW-based clustering: – Ignores the relationships among words – If two documents use different collections of core words to represent the same topic, they would be assigned to different clusters. • Solution: enrich document representation with the background knowledge represented by an ontology.

Introduction • Two issues for enhancing text clustering by leveraging ontology semantics: – An ontology which can cover the topical domain of individual document collections as completely as possible. – A proper matching method which can enrich the document representation by fully leveraging ontology terms and relations without introducing more noise. • This paper aims to address both issues.

Wikipedia as Ontology The free encyclopedia that anyone can edit • Wikipedia is a free, multilingual encyclopedia project supported by the non-profit Wikimedia Foundation. • Wikipedia's articles have been written collaboratively by volunteers around the world. • Almost all of its articles can be edited by anyone who can access the Wikipedia website. ----http: //en. wikipedia. org/wiki/Wikipeida

Wikipedia as Ontology • Unlike other standard ontologies, such as Word. Net and Mesh, Wikipedia itself is not a structured thesaurus. • However, it is more… – Comprehensive: it contains 12 million articles (2. 8 million in the English Wikipedia) – Accurate : A study by Giles (2005) found Wikipedia can compete with Encyclopædia Britannica in accuracy*. – Up to date: Current and emerging concepts are absorbed timely. * Giles, J. 2005. Internet encyclopaedias go head to head. Nature 438: 900– 901.

Wikipedia as Ontology • Moreover, Wikipedia has a well-formed structure – Each article only describes a single concept. – The title of the article is a short and well-formed phrase like a term in a traditional thesaurus.

Wikipedia Article that describes the Concept Artificial intelligence

AI is redirected to its equivalent concept Artificial Intelligence

Wikipedia as Ontology • Moreover, Wikipedia has a well-formed structure – Each article only describes a single concept – The title of the article is a short and well-formed phrase like a term in a traditional thesaurus. – Equivalent concepts are grouped together by redirected links. – It contains a hierarchical categorization system, in which each article belongs to at least one category.

The concept Artificial Intelligence belongs to four categories: Artificial intelligence, Cybernetics, Formal sciences & Technology in society

The different meanings that Artificial intelligence may refer to are listed in its disambiguation page.

The Framework of leveraging Wikipedia for document clustering Similarity metric between two documents: ·SIM(d 1, d 2)= Sim(W 1, W 2)+ α∙Sim(C 1, C 2) + β∙Sim(Cat 1, Cat 2)

Concept Mapping • Match documents to Wikipedia concepts. • Result: A document-concept Matrix • Matching Schemes: C 1 Documents – Exact Match – Relatedness Match Wikipedia Concept D 1 D 2 C 2 w 11 w 12 w 21 w 22

1. Concept Mapping ---Exact Match d 1 Artificial Intelligence, or AI for short, is a combination of computer science, physiology, and philosophy. … document-concept Vector Exact match Wikipedia concept dictionary artificial Intelligence, AI … computer science … physiology … philosophy … Wiki Concepts C artificial … computer … physiology … philosophy … Intelligence science d 1 2 … 1 … 1 …

Concept Mapping Schemes: Exact Match – Each document is scanned to find Wikipedia concepts (article titles). – The searched Wikipedia concepts are used to comprise the concept vector of the corresponding document. – Synonymous phrases to the same concept are grouped together through the redirect links in Wikipedia.

Concept Mapping Schemes: Exact Match – A dictionary is constructed, with each entry corresponding to a topic covered by Wikipedia. – Each entry includes not only the preferred Wikipedia concept which is used as the title of the article, but also all the redirected concepts representing the same topic. – Based on the dictionary, both preferred concepts and redirected concepts are retrieved from documents. – Only preferred concepts are used to build the concept vector for each document. The weight of each preferred concept equals to the frequency of itself plus the frequencies of all the redirected concepts appearing in a document. – The document-concept TFIDF matrix is further calculated based on the document-concept frequency matrix.

Concept Mapping Schemes: Exact Match – High efficiency – Low recall. Only the concepts which explicitly appear in a document are extracted and used to construct the concept vector of the document.

2. Concept Mapping –Relatedness Match d 1 Wikipedia collection Artificial Intelligence, or AI for short, is a combination of computer science, physiology, and philosophy. … Wikipedia term concept matrix term vector of document d 1 w 1 d 1 tfidf 11 terms W w 2 … tfidf 12 Wiki article terms W w. N tfidf 1 N w 1 w 2 … w. N document-concept Vector c 1 d 1 r 11 Wiki concepts C c 2 … r 12 c. M r 1 N c 1 tfidf 11 tfidf 21 … tfidf. N 1 Wiki concepts C c 2 … tfidf 12 … tfidf 22 … … … tfidf. N 2 … c. M tfidf 1 N tfidf 2 N … tfidf. NM

Concept Mapping Schemes: Relatedness Match Step 1: A Wikipedia term-concept matrix is constructed from Wikipedia article collection. – – The values in the matrix are TFIDF scores, which denote the relatedness of each term to each Wikipedia concept article. For each word, only top k concepts with highest TFIDF scores are selected. (In this study, k = 5)

Concept Mapping Schemes: Relatedness Match Step 2: The word-concept matrix is used as a bridge to build Document- Wikipedia concept Matrix. The relatedness of a Wikipedia concept given document is calculated as: to a : the TFIDF score of word in document ; representing the importance of to : the TFIDF score of word in concept ; representing the importance of to o For each document, only the top M concepts with highest relatedness score are selected. (In this experiment, M is set to 200)

Concept Mapping Schemes: Relatedness Match – More time consuming – Helpful for identify relevant Wikipedia concepts which are not explicitly present in a document. – Especially useful when Wikipedia concepts have less coverage for a dataset.

Category Mapping Wikipedia Category CAT 1 C 2 1 1 0 1 Wikipedia Concept C 1 Wikipedia Category C 2 Documents D 1 w 12 D 2 w 21 w 22 CAT 1 Documents Wikipedia Concepts C 1 CAT 2 D 1 D 2 CAT 2 v 11 v 12 v 21 v 22

3. Category Mapping Concept Mapping (Exact Match or Relatedness Match) The hierarchical categorization system of Wikipedia Wiki Concept-Category Matrix document-concept Vector c 1 d 1 r 11 Wiki concepts C c 2 … r 12 c. M Wiki Concepts C r 1 N c 1 c 2 … c. N document-category Vector Wiki categories CAT cat 1 cat 2 … catk d 1 v 12 v 1 k cat 1 1 0 … 0 Wiki categories CAT cat 2 … catk 0 … 0 0 … 1 … … … 1 … 0

Category Mapping • Exact match: – A document-category frequency matrix is first derived from the document-concept frequency matrix by replacing each concept with its corresponding categories. – If a category is mapped to a document through more than one concept, the sum of the frequencies of these concepts is the category’s frequency. – Based on the generated document-category frequency matrix, we further derive the documentcategory TFIDF matrix

Category Mapping • Relatedness Match – The document-category matrix is generated by replacing each concept in the document-concept matrix with its corresponding categories, which share the same normalized relatedness score as the concept. – If a category is mapped to a document through more than one concept, its relatedness score to the document is the sum of the scores of all these concepts.

Document Clustering • Agglomerative Clustering • Partitional Clustering and respectively indicate the importance of concept vector and category vector in measuring the similarity between two documents.

Experiments • Wikipedia data: 911, 028 articles and about 29000 categories after pre-processing and filtering. • Clustering dataset: TDT 2, LA Times (from TREC), and 20 -newsgroups (20 NG). • Evaluation Metrics: – Purity – F-score – NMI (normalized mutual information )

Experiments • Clustering Schemes under Comparison Notation Explanation Word Clustering solely based on word vector Concept Clustering solely based on concept vector Category Clustering solely based on category vector Word_Concept Clustering based on the linear combination of word vector and concept vector Word_Category Clustering based on the linear combination of word vector and category vector Concept_Category Clustering based on the linear combination of concept vector and category vector Word_Concept_Category Clustering based on the linear combination of word vector, concept vector and category vector

Experiments • Parameter Setting (α and β that indicate the importance of concept vector and category vector in measuring the similarity between two documents. ) – For Word_Concept scheme, β is set to zero and α is set to 0. 1, 0. 2, ∙∙∙, 1. 0 respectively. The average result of the ten runs is used as the final clustering results for Word_Concept scheme. – For Word_Category scheme, α is set to zero and β is set to 0. 1, 0. 2, ∙∙∙, 1. 0 respectively. The average result of the ten runs is used as the final clustering results for Word_Category scheme. – For Word_Concept_Category scheme, α is set to the value which produces best results for Word_Concept based clustering, and β is set to the value that generates best results for Word_Category based clustering.

Agglomerative clustering results 20 News Group NMI Match Scheme Word (Base. Line) F-Score EM RM 0. 144 EM RM 0. 146 Purity EM RM 0. 132 Concept 0. 134 0. 157 0. 102 0. 062 0. 103 0. 081 Cateogory 0. 111 0. 160* 0. 128 0. 111 0. 143 0. 114 Concept_Category 0. 131 0. 148 0. 146 0. 084 0. 160 0. 095 Word_Concept 0. 144 0. 150 0. 153 0. 168 0. 136 0. 148 Word_Category 0. 166* 0. 171* 0. 189* 0. 201* 0. 180* Word_Concept_Category 0. 166* 0. 154 0. 196* 0. 195* 0. 206* 0. 165 * indicates the improvement is significant according to the paired-sample T-test at the level of p<0. 05.

Agglomerative clustering results LA Times NMI Match Scheme Word (Base. Line) F-Score EM RM 0. 048 EM RM 0. 066 Purity EM RM 0. 124 Concept 0. 060 0. 073 0. 057 0. 044 0. 120 0. 113 Cateogory 0. 071* 0. 053 0. 174 0. 054 0. 111 0. 118 Concept_Category 0. 073 0. 054 0. 177 0. 054 0. 202 0. 118 Word_Concept 0. 051 0. 052 0. 064 0. 072 0. 124 0. 128 Word_Category 0. 101* 0. 049 0. 210* 0. 097* 0. 238* 0. 142* Word_Concept_Category 0. 103* 0. 052 0. 204* 0. 100* 0. 232* 0. 144* * indicates the improvement is significant according to the paired-sample T-test at the level of p<0. 05.

Agglomerative clustering results TDT 2 NMI Match Scheme Word (Base. Line) F-Score EM RM 0. 537 EM RM 0. 622 Purity EM RM 0. 600 Concept 0. 296 0. 372 0. 398 0. 483 0. 368 0. 463 Cateogory 0. 577* 0. 448 0. 637 0. 539 0. 649* 0. 549 Concept_Category 0. 581* 0. 444 0. 656 0. 543 0. 659* 0. 560 Word_Concept 0. 563 0. 609 0. 637 0. 689* 0. 620 0. 678* Word_Category 0. 695* 0. 660* 0. 754* 0. 721* 0. 769* 0. 737* Word_Concept_Category 0. 675* 0. 661* 0. 734* 0. 726* 0. 751* 0. 747* * indicates the improvement is significant according to the paired-sample T-test at the level of p<0. 05.

Partitional Clustering Results 20 News Group NMI Match Scheme Word (Base. Line) F-Score EM RM 0. 390 EM RM 0. 382 Purity EM RM 0. 411 Concept 0. 288 0. 313 0. 312 0. 372 0. 302 0. 401 Cateogory 0. 291 0. 326 0. 332 0. 383 0. 341 0. 391 Concept_Category 0. 287 0. 322 0. 333 0. 354 0. 317 0. 388 Word_Concept 0. 390 0. 383 0. 380 0. 382 0. 411 Word_Category 0. 409* 0. 429* 0. 402* 0. 412* 0. 430* 0. 442* Word_Concept_Category 0. 398 0. 412 0. 400* 0. 418* 0. 429* 0. 442* * indicates the improvement is significant according to the paired-sample T-test at the level of p<0. 05.

Partitional Clustering Results LA Times NMI Match Scheme Word (Base. Line) F-Score EM RM 0. 188 EM RM 0. 317 Purity EM RM 0. 328 Concept 0. 186 0. 082 0. 312 0. 253 0. 333 0. 251 Cateogory 0. 185 0. 097 0. 315 0. 241 0. 327 0. 249 Concept_Category 0. 190 0. 112 0. 310 0. 242 0. 329 0. 245 Word_Concept 0. 159 0. 128 0. 292 0. 264 0. 304 0. 275 Word_Category 0. 194 0. 179 0. 325 0. 312 0. 335 0. 322 Word_Concept_Category 0. 189 0. 140 0. 319 0. 276 0. 330 0. 286 * indicates the improvement is significant according to the paired-sample T-test at the level of p<0. 05.

Partitional Clustering Results TDT 2 NMI Match Scheme Word (Base. Line) F-Score EM RM 0. 790 EM RM 0. 825 Purity EM RM 0. 848 Concept 0. 556 0. 447 0. 622 0. 522 0. 647 0. 544 Cateogory 0. 577 0. 448 0. 637 0. 539 0. 649 0. 549 Concept_Category 0. 543 0. 442 0. 630 0. 523 0. 643 0. 545 Word_Concept 0. 787 0. 766 0. 815 0. 792 0. 840 0. 819 Word_Category 0. 804 0. 737 0. 830 0. 720 0. 854 0. 763 Word_Concept_Category 0. 802 0. 804 0. 833 0. 846 0. 854 0. 876 * indicates the improvement is significant according to the paired-sample T-test at the level of p<0. 05.

Conclusions 1. 2. 3. 4. 5. Category information is most useful for improving clustering results. In both agglomerative clustering and partitional clustering, combining category information with document content information generates the best results in most cases. Clustering based on all three document vectors (word vector, concept vector, category vector) also gets significantly better results than the baseline, but does not outperform clustering based only on word vector and category vector. Concept information is not as useful as category information for improving clustering performance due to the noisy information it contains and sense ambiguity problem. The effect of category and concept information on k-means clustering is not as significant as it on agglomerative clustering. But, in most cases, Word_Category based clustering still achieves best performance among all clustering schemes. The effect of the two mapping schemes depends on the dataset, quality metric and clustering approach.

Future Work • Introduce sense disambiguation functions into the concept mapping process by leveraging Wikipedia disambiguation pages. • Explore how to utilize the link structure among Wikipedia concepts for document clustering.