Wikitology: A Wikipedia Derived Knowledge Base Zareen Syed Advisor: Dr. Tim Finin February 6th, 2009
Outline • Introduction and Motivation • Related Work • Proposed Work • Timeline • Work Progress • Conclusion 10/31/2020 Page 2
Introduction • Wikipedia • Encyclopedia • Developed collaboratively • Freely available online • Millions of articles • English Wikipedia (2,723,767 articles) • Multiple languages (more than 260) • Structured and unstructured content
Introduction • Wikipedia Content and Organization • Article Text • Categories and Category Hierarchy • Inter-article Links • Info-boxes • Disambiguation Pages • Redirection Pages • Talk Pages • History Pages • Meta-data
Motivation Challenges • Human Understandable Content (not machine readable) • How to make it more structured and organized to improve machine readability • How to automatically exploit the knowledge in Wikipedia to solve some real world problems
Thesis Statement • We can exploit Wikipedia and other related knowledge sources to automatically create knowledge about the world supporting a set of common use cases such as: • Concept Prediction • Information Retrieval • Information Extraction
Proposed Contributions • Developing a Novel Hybrid Knowledge Base composed of structured, semi-structured and unstructured information extracted from Wikipedia and other related sources • Developing Novel Application Specific Algorithms for exploiting the hybrid knowledge base • Task Based Evaluation of the system on common use-cases such as Concept Prediction, Information Retrieval and Information Extraction
Related Work • Information Extraction • Relation extraction [35] • Co-reference resolution [25] • Named Entity Classification [52] • Natural Language Processing • Automatic word sense disambiguation [27] • Searching synonyms [28]
Related Work • Information Retrieval • Text categorization [24] • Computing semantic relatedness [30, 31, 32] • Predicting document topics [26] • Search Engine [69] • Semantic Web • DBpedia [46] • Semantic MediaWiki [46] • Linked Open Data Project [23] • Freebase [22]
Proposed Work • Refining, Enriching and Exploiting Structured Content in Wikipedia • Integrating other related knowledge sources • Developing application specific algorithms • Developing a dynamic and scalable architecture
Issues • Single document in too many categories: • George W. Bush is included in about 30 categories • Links between articles belonging to very different categories: • John F. Kennedy has a link to “coincidence theory”, which belongs to the Mathematical Analysis/Topology/Fixed Points categories • Number of articles within a category: • Some categories are under-represented, whereas others have many articles • Administrative categories, e.g.: • Clean up from Sep 2006 • Articles with unsourced statements • Links to words in an article: • e.g., if the phrase “United States” appears in the document, it might be linked to the page on “United States”
Issues • Category Hierarchy: • Multiple Parents (Thesaurus) • Noisy • “Animals” category defined in the sub-tree rooted at “People” • Loose subsumption • Geography -> Geography by place -> Regions of Asia -> Middle East -> Dances of Middle East • Events -> Events by year -> Lists of leaders by year
Refining, Enriching and Exploiting Structured Content in Wikipedia: Category Hierarchy • Filtering out Administrative Categories • Algorithms for Selecting and Ranking Categories • Inferring and Labeling Semantic Relations between Categories • Refining Subsumption (Taxonomy) • Instance-of Relation • Using Information in Wikipedia Lists_of_Topics • Using Specific Administrative Categories [Annotation: Done]
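Filtering out administrative categories can be sketched as simple title matching. This is only an illustration: the two patterns below are assumptions generalized from the two example categories named in these slides ("Clean up from Sep 2006", "Articles with unsourced statements"), and the function names are hypothetical, not part of Wikitology.

```python
import re

# Hypothetical patterns, generalized from the two administrative
# categories given as examples in the slides.
ADMIN_PATTERNS = [
    re.compile(r"^clean ?up\b", re.IGNORECASE),
    re.compile(r"^articles with\b", re.IGNORECASE),
]


def is_administrative(category):
    """Return True if the category title matches an administrative pattern."""
    title = category.replace("_", " ").strip()
    return any(pattern.search(title) for pattern in ADMIN_PATTERNS)


def filter_categories(categories):
    """Keep only the categories that do not look administrative."""
    return [c for c in categories if not is_administrative(c)]
```

A real filter would likely need a much longer pattern list or a curated set of administrative category names.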
Refining, Enriching and Exploiting Structured Content in Wikipedia: Inter-Article Links • Problem: Links do not necessarily imply semantic relatedness • Links to locations, term definitions, dates, entities • Possible solutions: • Classifying Link Types • Introducing Link Weights [Annotation: Done]
Refining, Enriching and Exploiting Structured Content in Wikipedia • Redirection Pages • Disambiguation Pages
Proposed Work • Exploring • Other structured content • Talk pages, user pages, history pages and meta-data • Other structured resources • Integrating structured information from other sources like DBpedia and Freebase in Wikitology • How and When to employ reasoning over the RDF triples
Proposed Work • Developing Novel Application Specific Algorithms on top of the hybrid Wikitology Knowledge Base for applications such as • Concept Prediction • Information Retrieval • Information Extraction
Proposed Work • Evaluation • Main Approaches to Evaluating Ontologies • Gold Standard Evaluation (Comparison to an existing Ontology) • Criteria based Evaluation (By humans) • Task based Evaluation (Application based) • Comparison with Source of Data (Data driven) • Using a Reasoning Engine • Our Approach to Evaluation • Task based Evaluation (Application based)
Wikitology Overview [Architecture diagram; component labels: Articles, IR Index, Category Links, Hierarchical Graph, Page Links, Graph, RDF Reasoner, Triple Store, Relational Database, Wikitology Code, Application Specific Algorithms]
Time Line
No. | Milestone | Expected Completion Date
1 | Enriching Wikitology by extracting additional information from Wikipedia | May, 2009
2 | Studying other related knowledge sources in detail such as Freebase, DBpedia, YAGO etc. | May, 2009
3 | Incorporating additional knowledge sources to enrich Wikitology | May, 2009
4 | Working on techniques to improve applications in Information Retrieval and Information Extraction using additional features generated from Wikitology | Dec, 2009
5 | Evaluating the Wikitology knowledge base | May, 2010
6 | Thesis write up | Aug, 2010
Work Done • Case Study 1: Concept Prediction • Case Study 2: Document Expansion for Information Retrieval • Case Study 3: Named Entity Classification • Case Study 4: Co-reference Resolution • Case Study 5: Concept Based Features for Information Retrieval (In Progress)
Case Study 1: Concept Prediction [2] • Problem: Predict the individual document topics as well as concepts common to a set of documents • Approach: • Hybrid knowledge base: Wikitology 1.0 • Algorithms for selecting and aggregating terms
Wikitology 1.0 • Wikipedia as an Ontology • Each article is a concept in the ontology • Terms linked via Wikipedia’s category system and inter-article links • It’s a consensus ontology created, kept current and maintained by a diverse community • Overall content quality is high • Terms have unique IDs (URLs) and are “self-describing” for people
Wikitology 1.0 • Structured Data • Specialized Concepts (article titles) • Generalized Concepts (category titles) • Inter-category and Inter-article links as relations between concepts • Article-Category links as relations between specialized and generalized concepts • Un-Structured Data • Article Text (a way to map ontology terms to free text) • Algorithms to select, rank and aggregate concepts using the hybrid knowledge base
Method 1: Using Wikipedia Article Text and Categories to Predict Concepts [Diagram: the input query doc(s) are matched to similar Wikipedia articles by cosine similarity (example scores 0.2, 0.8, 0.1, 0.2, 0.3); the similar articles are mapped onto the Wikipedia category graph; the output ranks categories by (1) links and (2) cosine similarity]
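Method 1 can be sketched as follows: find the Wikipedia articles most similar to the query, then rank their categories either by how many of those articles link to each category or by the summed cosine scores. This is a minimal sketch under stated assumptions: raw term counts stand in for whatever term weighting the IR index uses, and the function names, the `top_n` cutoff, and the toy data shapes are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_categories(query_text, articles, article_categories, top_n=3):
    """Return two category rankings: by link count and by summed cosine score.

    articles: title -> article text
    article_categories: title -> list of category titles
    """
    query = Counter(query_text.lower().split())
    scored = [(cosine(query, Counter(text.lower().split())), title)
              for title, text in articles.items()]
    # Keep the top_n most similar articles that have a non-zero score.
    similar = [(s, t) for s, t in sorted(scored, reverse=True)[:top_n] if s > 0]

    links = defaultdict(int)      # number of similar articles in the category
    scores = defaultdict(float)   # summed cosine score of those articles
    for score, title in similar:
        for category in article_categories.get(title, []):
            links[category] += 1
            scores[category] += score

    by_links = sorted(links, key=lambda c: (-links[c], c))
    by_score = sorted(scores, key=lambda c: (-scores[c], c))
    return by_links, by_score
```

The two returned lists correspond to the two ranking options on the slide; ties are broken alphabetically only to keep the output deterministic.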
Method 2: Using Spreading Activation on Category Links Graph to Get Aggregated Concepts [Diagram: the input query doc(s) are matched to similar Wikipedia articles by cosine similarity (example scores 0.2, 0.8, 0.1, 0.2, 0.3); spreading activation, with a node input function and a node output function, runs over the Wikipedia category graph; the output is the concepts ranked by final activation score]
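The slide does not spell out the node input and output functions, so the following is only a generic spreading-activation sketch. It assumes the input function sums incoming activation, the output function splits a decayed share of a node's activation evenly among its neighbours, and the number of pulses is fixed; all names and defaults are hypothetical.

```python
from collections import defaultdict


def spread_activation(graph, seeds, pulses=2, decay=0.5):
    """Spread activation from seed nodes and rank nodes by final activation.

    graph: node -> list of neighbour nodes (e.g. category -> parent categories)
    seeds: node -> initial activation (e.g. cosine score of a similar article)
    """
    activation = defaultdict(float, seeds)
    frontier = dict(seeds)
    for _ in range(pulses):
        incoming = defaultdict(float)
        for node, value in frontier.items():
            neighbours = graph.get(node, [])
            if not neighbours:
                continue
            # Assumed output function: decayed activation split evenly.
            out = decay * value / len(neighbours)
            for neighbour in neighbours:
                # Assumed input function: sum of incoming activation.
                incoming[neighbour] += out
        for node, value in incoming.items():
            activation[node] += value
        frontier = incoming
    return sorted(activation.items(), key=lambda kv: (-kv[1], kv[0]))
```

With category-to-parent edges, activation accumulates on categories shared by several seeds, which is one way to surface the aggregated concepts the slide refers to.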
Method 3: Using Spreading Activation on Article Links Graph • Input: query doc(s), matched to similar Wikipedia articles • Threshold: ignore spreading activation to articles with a cosine similarity score below 0.4 • Edge weights: cosine similarity between linked articles • Spreading activation over the Wikipedia article links graph, with a node input function and a node output function • Output: concepts ranked by final activation score
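The threshold and edge weights of Method 3 can be illustrated with a weighted variant of spreading activation. The 0.4 threshold and the use of cosine similarity as edge weight come from the slide; the function name, the single-pulse default, and the choice to multiply activation by the edge weight are assumptions made for this sketch.

```python
def spread_weighted(edges, seeds, threshold=0.4, pulses=1):
    """Spread activation over weighted article links, skipping weak edges.

    edges: article -> {linked article: cosine similarity of the two articles}
    seeds: article -> initial activation
    """
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(pulses):
        incoming = {}
        for node, value in frontier.items():
            for neighbour, weight in edges.get(node, {}).items():
                if weight < threshold:
                    continue  # link too weak to carry activation
                # Assumed output function: activation scaled by edge weight.
                incoming[neighbour] = incoming.get(neighbour, 0.0) + value * weight
        for neighbour, value in incoming.items():
            activation[neighbour] = activation.get(neighbour, 0.0) + value
        frontier = incoming
    return sorted(activation.items(), key=lambda kv: (-kv[1], kv[0]))
```

The threshold keeps activation from leaking through links to dates, locations, or other pages that are linked but not semantically close.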
Wikitology 1.0 • The system was evaluated by predicting the categories and article links of existing Wikipedia articles and comparing them with the ground truth • The Wikitology 1.0 system predicted document topics and common concepts with high accuracy when the article concepts were well represented within Wikipedia
Case Study 2: Document Expansion with Wikipedia Derived Ontology Terms [21]*
Preliminary work with TREC documents
IR Effectiveness Using Wikipedia Concepts
Run | MAP | P@10
base | 0.2076 | 0.4207
base + rf | 0.2470 | 0.4480
concepts + rf | 0.2400 | 0.4553
Doc: FT921-4598 (3/9/92) ... Alan Turing, described as a brilliant mathematician and a key figure in the breaking of the Nazis' Enigma codes. Prof IJ Good says it is as well that British security was unaware of Turing's homosexuality, otherwise he might have been fired 'and we might have lost the war'. In 1950 Turing wrote the seminal paper 'Computing Machinery And Intelligence', but in 1954 killed himself ...
Wikipedia-derived terms: Turing_machine, Turing_test, Church_Turing_thesis, Halting_problem, Computable_number, Bombe, Alan_Turing, Recursion_theory, Formal_methods, Computational_models, Theory_of_computation, Theoretical_computer_science, Artificial_Intelligence
* In collaboration with Paul McNamee, Johns Hopkins University Applied Physics Laboratory
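For readers unfamiliar with the two metrics in the table, the sketch below shows the standard definitions of P@10 and MAP. It is illustrative only: the function names are hypothetical, and it is not the evaluation code used to produce the TREC figures above.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked list, set of relevant docs), one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

MAP rewards placing relevant documents early across the whole ranking, while P@10 looks only at the first ten results, which is why the two columns can favor different runs.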
Case Study 3: Named Entity Classification • Semi-automated generation of training data • Persons, Locations and Events • Experimenting with different feature sets • Inter-article link labeling [Figure: Results showing accuracy obtained using different feature sets]
Case Study 4: Cross Document Entity Co-reference Resolution [21]* • Problem: Determine whether various named people, organizations or relations from different documents refer to the same object in the world • For example, does the “Condoleezza Rice” mentioned in one document refer to the same person as the “Secretary Rice” from another? * In collaboration with Johns Hopkins University Human Language Technology Center of Excellence
Entity Document (EDOC)
<DOC>
<DOCNO>ABC19980430.1830.0091.LDC2000T44-E2</DOCNO>
<TEXT>
Webb Hubbell
PER
Individual
NAM: "Hubbell" "Hubbells" "Webb Hubbell" "Webb_Hubbell"
NOM: "Mr." "friend" "income"
PRO: "he" "him" "his"
, . abc's accountant after again ago alleges alone also and arranged attorney avoid been before being betray but came can cat charges cheating circle clearly close concluded conspiracy cooperate counsel's department did disgrace do dog dollars earned eightynine enough evasion feel financial firm first four friends going got grand happening has he help him his hope house hubbells hundred hush income increase independent indicted indictment inner investigating jackie_judd jail jordan judd jury justice kantor ken knew lady late law left lie little make many mickey mid money mr my nineteen nineties ninetyfour nothing now office others paying peter_jennings president's pressured probe prosecutors questions reported reveal rock saddened said schemed seen seven since starr statement such taxes tell them they thousand time today ultimately vernon washington webb_hubbell were what's whether which whitewater why wife years
</TEXT>
</DOC>
Entity documents capture information about entities extracted from documents, including mention strings, type and subtype, and text surrounding the mentions.
Wikitology 2.0 • Enhancements • Structured Data • Specialized Concepts (article titles) • Generalized Concepts (category titles) • Inter-category and Inter-article links as relations between concepts • Article-Category links as relations between specialized and generalized concepts • YAGO types (to identify entity type) • Table with Disambiguation set (to identify highly confused entities) • Aliases using Redirect pages • Un-Structured Data • Article Text • Redirect titles (added to article text)
Wikitology 2.0 Data Structures • Lucene Index • Concept Title + Redirected Titles (field) • Article Text + Redirected Titles (field) • RDF field with Entity Type (YAGO type) • Graphs • Category links graph • Article-Category links • Tables • Disambiguation Set derived from disambiguation pages
Wikitology 2.0 • Custom query front end; parts of the EDOC are matched against Wikitology index fields: • The EDOC’s name mention strings -> Wikitology’s title field (slightly higher weight to the longest mention, i.e., “Webb Hubbell”) • The EDOC type -> RDF field (YAGO type) • Name mention strings + contextual text -> text field (Wikitology article contents)
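The mapping from EDOC parts to index fields can be sketched as building a fielded query string. The `field:"phrase"` and `^boost` notation follows Lucene's classic query parser syntax; everything else is an assumption for illustration: the field names `title`, `type` and `text`, the boost value 2.0, passing the EDOC type through unchanged rather than mapping it to a YAGO type, and the function name itself. Escaping of special query characters is omitted.

```python
def build_query(name_mentions, entity_type, context_words, longest_boost=2.0):
    """Build a fielded query string from the parts of an entity document.

    name_mentions: NAM mention strings from the EDOC
    entity_type: EDOC type, e.g. "PER" (mapping to a YAGO type is not shown)
    context_words: words surrounding the mentions
    """
    longest = max(name_mentions, key=len)
    clauses = []
    for mention in name_mentions:
        # The longest mention gets a slightly higher weight.
        boost = f"^{longest_boost}" if mention == longest else ""
        clauses.append(f'title:"{mention}"{boost}')
    clauses.append(f"type:{entity_type}")
    text_terms = [f'"{m}"' for m in name_mentions] + list(context_words)
    clauses.append("text:(" + " ".join(text_terms) + ")")
    return " ".join(clauses)
```

For the Webb Hubbell EDOC, a call such as `build_query(["Hubbell", "Webb Hubbell"], "PER", ["whitewater", "starr"])` would boost the title clause for "Webb Hubbell" over the one for "Hubbell".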
Wikitology Features
Article Vector for ABC19980430.1830.0091.LDC2000T44-E2
1.0000 Webster_Hubbell
0.3794 Hubbell_Trading_Post_National_Historic_Site
0.3770 United_States_v._Hubbell
0.2263 Hubbell_Center
0.2221 Whitewater_controversy
Category Vector for ABC19980430.1830.0091.LDC2000T44-E2
0.2037 Clinton_administration_controversies
0.2037 American_political_scandals
0.2009 Living_people
0.1667 1949_births
0.1667 People_from_Arkansas
0.1667 Arkansas_politicians
0.1667 American_tax_evaders
0.1667 Arkansas_lawyers
Each entity document is tagged by Wikitology, producing vectors of article and category tags. Note the clear match with a known person in Wikipedia.
Features Derived from Wikitology 2.0
Name | Range | Type | Description
APL20WAS | {0,1} | sim | 1 if the top article tags for the two entities are identical, 0 otherwise
APL21WCS | {0,1} | sim | 1 if the top category tags for the two entities are identical, 0 otherwise
APL22WAM | [0..1] | sim | The cosine similarity of the medium length article vectors (N=5) for the two entities
APL23WcM | [0..1] | sim | The cosine similarity of the medium length category vectors (N=4) for the two entities
APL24WAL | [0..1] | sim | The cosine similarity of the long length article vectors (N=8) for the two entities
APL31WAS2 | [0..1] | sim | Match of the entities' top Wikitology article tag, weighted by avg(score1, score2)
APL32WCS2 | [0..1] | sim | Match of the entities' top Wikitology category tag, weighted by avg(score1, score2)
APL26WDP | {0,1} | dissim | 1 if both entities are of type PER and their top article tags are different, 0 otherwise
APL27WDD | {0,1} | dissim | 1 if the two top article tags are members of the same disambiguation set, 0 otherwise
APL28WDO | {0,1} | dissim | 1 if both entities are of type ORG and their top article tags are different, 0 otherwise
APL29WDP2 | [0..1] | dissim | Match if both entities are of type PER and their top article tags are different, weighted by 1-abs(score1-score2), 0 otherwise
APL30WDP2 | [0..1] | dissim | Match if both entities are of type ORG and their top article matches are different organizations, weighted by 1-abs(score1-score2), 0 otherwise
Twelve features were computed for each pair of entities using Wikitology, seven aimed at measuring their similarity and five for measuring their dissimilarity.
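A few of these pairwise features can be sketched directly from their descriptions. The sketch below computes WAS, WAM, WcM and WDP only; it assumes each entity is represented as a dict with a `type` and non-empty `articles` and `categories` tag-to-score vectors, which is a hypothetical layout chosen for illustration rather than the format used in the actual system.

```python
import math


def top_n(vector, n):
    """Keep the n highest-scoring tags of a tag -> score vector."""
    ranked = sorted(vector.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:n])


def cosine(a, b):
    """Cosine similarity between two sparse tag -> score vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pair_features(e1, e2):
    """Compute a subset of the pairwise features for two tagged entities."""
    top1 = max(e1["articles"], key=e1["articles"].get)
    top2 = max(e2["articles"], key=e2["articles"].get)
    same_top = top1 == top2
    both_per = e1["type"] == "PER" and e2["type"] == "PER"
    return {
        "WAS": 1 if same_top else 0,
        "WAM": cosine(top_n(e1["articles"], 5), top_n(e2["articles"], 5)),
        "WcM": cosine(top_n(e1["categories"], 4), top_n(e2["categories"], 4)),
        "WDP": 1 if both_per and not same_top else 0,
    }
```

The remaining features follow the same pattern, differing only in vector length, entity type, or the weighting applied to the match.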
Evaluation results for cross-document entity coreference task using Wikitology features
Match | TP rate | FP rate | Precision | Recall | F-Measure
yes | .722 | .001 | .966 | .722 | .826
no | .999 | .278 | .999 | .994
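The F-measure reported for the "yes" class can be checked from its precision and recall with the standard harmonic-mean formula. The function name and the `beta` parameter are illustrative; only the .966 and .722 values come from the slide.

```python
def f_measure(precision, recall, beta=1.0):
    """F-measure; with beta=1 this is the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall of the "yes" class as reported on the slide.
yes_f = f_measure(0.966, 0.722)
```

Rounded to three decimals, `yes_f` agrees with the .826 shown for the "yes" class.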
Case Study 5: Feature Generation to Improve Information Retrieval Performance* • Incorporating Generalized Concept Features in the MORAG [69] search engine * Work being done during an internship at RiverGlass Company
MORAG Search Engine • Concept features generated using Wikipedia (ESA) • Feature selection using pseudo-relevance feedback • Merged ranking of concept scores and BOW scores [Figure: Incorporating Wikitology-based features in the MORAG search engine]
Thesis Statement • We can exploit Wikipedia and other related knowledge sources to automatically create knowledge about the world supporting a set of common use cases such as: • Concept Prediction • Information Retrieval • Information Extraction
Proposed Contributions 1. Developing a Novel Hybrid Knowledge base composed of structured and un-structured information extracted from Wikipedia and other related sources • Wikitology 1.0 • Wikitology 2.0
Proposed Contributions 2. Developing Novel Application Specific Algorithms for exploiting the hybrid knowledge base • Methods for Concept Prediction • Ranking methods and Spreading Activation • Co-reference Resolution • Novel Entity representation and Hybrid Querying • Information Retrieval • Document Expansion, Generalized Concept Features augmentation
Proposed Contributions 3. Task Based Evaluation of the system on common use-cases such as Concept Prediction, Information Retrieval and Information Extraction • Metrics: • Precision and Recall
The End • Thank you • Questions?