Document Categorization and Related Concepts Prediction using Wikipedia

Slides: 1

Document Categorization and Related Concepts Prediction using Wikipedia Articles, Category Network and Page Links Graph

Zareen Saba Syed
zarsyed1@umbc.edu

Problem Statement:
To categorize documents and predict the concepts related to them using a general ontology.

Objectives:
1. To investigate the use of the Wikipedia category network as a general ontology to categorize documents.
2. To investigate the use of Wikipedia articles as concepts and article links as relations between concepts for concept prediction.

Use Cases:
1. Improved Information Retrieval
Categorizing corpus documents based on a general, user-developed folksonomy (the Wikipedia category network) would improve information retrieval tasks for common users.
2. Business Intelligence and Advertising
Knowing which web pages a user has looked at can give an idea of the user's general interests and aid in targeting.
3. Enterprise Content Management
Organizing documents using Wikipedia concepts and ontology can help improve existing content management systems.
4. Aid in User Collaborations
Information about the articles a user has accessed can help direct that user to others with similar interests and aid collaboration.

Wikipedia, The Free Encyclopedia:
Wikipedia is a free online encyclopedia with an exponential growth rate and has developed into probably the largest freely available knowledge base. Its size and coverage have reached a point where it may be used to identify the topics discussed in a document. Research has shown that a simple algorithm using only Wikipedia categories and document titles can characterize documents quite well.

[Figure: Wikipedia Category Network is a Thesaurus etc.]

Methodology:
Given a set of documents, the Wikipedia article index will be used to find the top n similar documents. These top n similar documents will serve as the initial set of activated nodes for spreading activation on the Wikipedia Category Graph and the Wikipedia Page Links Graph. The output will be the title of the highest activated node, i.e., the "Category" in the case of the Wikipedia Category Graph and the "Related Concept" in the case of the Wikipedia Page Links Graph.

Prediction Algorithm:
[Flowchart; box labels recovered at this point: Input Documents, Wikipedia Articles Index, N Matching Documents]

Spreading Activation:
Spreading activation is a technique for retrieving relevant information that is associated with information already known to be relevant. It is based on the Spreading Activation Model, which in turn is based on the idea of how human memory operates.

Node Input Function: [equation image not preserved]
where
O_i : Output of node i connected to node j
W_ij : Weight on edge from node i to node j

Node Output Function: [equation image not preserved]
where
A_j : Activation of node j
k : Pulse number
D_j : Out degree of node j

Edge Weights for Wikipedia Category Links Graph:
For the Wikipedia category links graph, unit edge weights have been used.

Edge Weights for Wikipedia Page Links Graph:
Wikipedia articles may be heavily linked, and may contain links to pages that are not relevant to the topic of the article. For example, an article in which the name of a country appears may have that name linked to the Wikipedia page for that country, and an article that mentions a term may link to a page defining that term; such links may not be directly related to the title or concept of the original article. Therefore, we have used the Lucene similarity score between each pair of linked articles as the edge weight for spreading activation, and also to filter out links whose similarity is below a threshold.
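The node input and output functions and the edge-weight filtering described above can be sketched in Python. The equations themselves are not present in the text, so the forms used here are assumptions chosen to match the listed symbols: input I_j = sum over i of O_i * W_ij, and output O_j = A_j / (D_j * k). The function names, the example category names, the pulse count, and the choice to count only unfiltered links in the out degree are all hypothetical illustrations, not the author's implementation.

```python
from collections import defaultdict


def spread_activation(edges, seeds, pulses=2, threshold=0.0):
    """Run a fixed number of spreading-activation pulses.

    edges: dict mapping node -> {successor: edge weight} (directed graph)
    seeds: dict mapping initially activated node -> initial activation
    threshold: links with weight below this value are ignored
    Returns a dict mapping node -> final activation.
    """
    activation = defaultdict(float, seeds)
    frontier = set(seeds)
    for k in range(1, pulses + 1):
        incoming = defaultdict(float)
        for i in frontier:
            # Keep only links that pass the similarity threshold.
            links = {j: w for j, w in edges.get(i, {}).items() if w >= threshold}
            if not links:
                continue
            # Assumed output function: O_i = A_i / (D_i * k),
            # with D_i taken as the number of links kept after filtering.
            output = activation[i] / (len(links) * k)
            for j, w in links.items():
                # Assumed input function: I_j = sum_i O_i * W_ij
                incoming[j] += output * w
        for j, value in incoming.items():
            activation[j] += value
        frontier = set(incoming)
    return dict(activation)


def predict(edges, seeds, pulses=2, threshold=0.0):
    """Return the highest activated node that was not a seed (sketch choice)."""
    activation = spread_activation(edges, seeds, pulses, threshold)
    candidates = {n: a for n, a in activation.items() if n not in seeds}
    return max(candidates, key=candidates.get) if candidates else None


# Hypothetical category graph with unit edge weights.
category_edges = {
    "Machine learning": {"Artificial intelligence": 1.0},
    "Data mining": {"Artificial intelligence": 1.0, "Databases": 1.0},
    "Artificial intelligence": {"Computer science": 1.0},
    "Databases": {"Computer science": 1.0},
}
category_seeds = {"Machine learning": 1.0, "Data mining": 1.0}
print(predict(category_edges, category_seeds))
```

With unit weights, as in the category links graph, the threshold has no effect; for the page links graph, the same function can be called with similarity scores as weights and a nonzero threshold.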
[Figure: Wikipedia Article Network resembles the WWW Network]

[Flowcharts (Prediction Algorithm; Calculating Edge Weights). Box labels: N Matching Documents; Activate Category Nodes; Spreading Activation; Category Links Graph; Map Predicted Node to Category Title; Get Category Title; Display Predicted Category; Activate Document Nodes; Compute Edge Weights; Page Links Graph; Map Predicted Node to Page Title; Get Page Title; Display Predicted Concept; Wikipedia Articles Index; Wikipedia Lucene Index; Wikipedia Database]

Opportunities for Parallelism:
Expensive Computations
Computing edge weights dynamically, using the Lucene similarity score between each pair of linked documents, is a computationally expensive process. One approach is to run the spreading activation algorithm in parallel: each node that gets activated could dynamically compute its edge weights and activate its successors in parallel. Secondly, spreading activation involves matrix operations, which could be done more efficiently on a Cell processor.
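The first idea above, computing an activated node's edge weights concurrently, could look roughly like the following Python sketch. It is an illustration under stated assumptions: the Lucene similarity score is replaced by a toy cosine similarity over word counts, and the article texts, link lists, and function names are hypothetical. Threads are used only to show the structure; for pure Python scoring code a process pool or native code would likely be needed to see a real speedup.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def cosine_similarity(text_a, text_b):
    """Toy stand-in for the Lucene similarity score (an assumption)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def edge_weights(node, links, texts, threshold=0.0, max_workers=4):
    """Score all out-links of `node` concurrently and drop weak links.

    links: dict mapping article title -> list of linked article titles
    texts: dict mapping article title -> article text
    Returns {linked title: similarity} for scores at or above threshold.
    """
    targets = links.get(node, [])

    def score(target):
        return cosine_similarity(texts[node], texts[target])

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(score, targets))
    return {t: s for t, s in zip(targets, scores) if s >= threshold}


# Hypothetical articles: the link to a country page is off-topic.
texts = {
    "Spreading activation": "spreading activation network memory model",
    "Semantic network": "semantic network memory model graph",
    "France": "country europe paris",
}
links = {"Spreading activation": ["Semantic network", "France"]}
print(edge_weights("Spreading activation", links, texts, threshold=0.1))
```

The returned dictionary has the same shape as one node's entry in a weighted adjacency map, so the weights can be computed lazily when a node is first activated instead of for the whole page links graph up front.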