Extracting Named Entities and Synonyms from Wikipedia 2010

Extracting Named Entities and Synonyms from Wikipedia
2010 24th IEEE International Conference on Advanced Information Networking and Applications
2015/05/12 M1 ikuta

INTRODUCTION
• In many search domains, both contents and searches are frequently tied to named entities.
  o Example: a news archive
• Named entity (NE)
  o A single entity can be referred to in more than one way.
    Ø Using an abbreviation: United States → US
    Ø Different ways of referring to a person: Barack Obama → President of the United States

INTRODUCTION
• To improve search quality:
  1. Recognize named entities (NEs).
  2. Determine possible synonyms for each NE.
• In this paper…
  Ø The authors explore the idea of using Wikipedia content to automatically generate a dictionary of NEs and synonyms that all refer to the same entity.

INTRODUCTION
Main contributions
1) An approach for improved Wikipedia-based NE recognition that also implicitly categorizes the found entities
2) Discovery of synonyms of the NEs
3) An overview of how the synonyms have been used in the authors' system to improve search quality
4) A study of the quality of NE extraction and synonym discovery

RELATED WORK
• NE recognition is nothing new.
  o Traditionally, the focus has been on recognizing NEs embedded in text.
  o Most approaches do not take the additional semantic information into account.
  o Continuously updating a dynamic dictionary based on Wikipedia can be too time-consuming.

PRELIMINARIES
There are four Wikipedia features:
• internal links
• redirects
• disambiguations
• categories

PRELIMINARIES
• Internal links
  o Link words in one article to another article.
  o Example: [[path (graph theory)|path]], where "path (graph theory)" is the destination and "path" is the caption.
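The destination/caption split of an internal link can be illustrated with a small parser. This is a minimal sketch, not the authors' implementation: the regular expression and the function name `parse_links` are assumptions made for illustration, and real wikitext has cases (nested templates, file links) that this ignores.

```python
import re

# Assumed pattern: [[destination]] or [[destination|caption]].
LINK_RE = re.compile(r"\[\[\s*([^\[\]|]+?)\s*(?:\|\s*([^\[\]|]*?)\s*)?\]\]")

def parse_links(wikitext):
    """Return (destination, caption) pairs for every internal link found."""
    pairs = []
    for match in LINK_RE.finditer(wikitext):
        destination = match.group(1)
        # A link without an explicit caption displays its destination.
        caption = match.group(2) if match.group(2) else destination
        pairs.append((destination, caption))
    return pairs

print(parse_links("A [[path (graph theory)|path]] in a [[graph]]."))
```

Each pair keeps both the target article and the text an editor chose to display, which is the raw material for synonym discovery later in the deck.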

PRELIMINARIES
• Redirects
  o Similar to links, except that they cannot include alternative text.
  o A redirect can only point to one specific article.

PRELIMINARIES
• Disambiguations
  o Wikipedia uses disambiguation pages to resolve conflicts between terms that have multiple senses.

PRELIMINARIES
• Categories
  o Categorization is used to group one or more articles together.
  o Every article should be a member of at least one category.
  o The categorization system is flexible: it is not limited to a tree structure but is instead a directed cyclic graph.
  o As a result, it is difficult to determine which category is the parent and which is the subcategory.

PRELIMINARIES
Generic Named Entity Recognition
• All article titles have their first letter capitalized, even if they are common nouns rather than proper nouns.
  ⇓
• Use the capitalization of words to find entities
  Ø with a more sophisticated approach

PRELIMINARIES
Generic Named Entity Recognition
Based on the following heuristics:
• If the title has multiple words and every word is capitalized, except prepositions, determiners, conjunctions, relative pronouns or negations, consider it an entity.
  Ø United States
• If the title is a single word with multiple capital letters, consider it an entity.
  Ø US
• If at least 75% of the occurrences of the title in the article text itself are capitalized, consider it an entity.
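The three heuristics above can be sketched as a single function. This is only an illustration under stated assumptions: `FUNCTION_WORDS` is a tiny made-up stand-in for the real list of prepositions, determiners, conjunctions, relative pronouns and negations, the parameter name `alpha` for the 75% threshold is an assumption, and the occurrence count does not discount sentence-initial capitalization.

```python
import re

# Assumption: a small illustrative word list; the slides do not give the real one.
FUNCTION_WORDS = {"a", "an", "and", "for", "in", "not", "of", "on",
                  "or", "that", "the", "to", "which", "who"}

def is_entity(title, article_text, alpha=0.75):
    """Apply the three capitalization heuristics to an article title."""
    words = title.split()
    if not words:
        return False
    if len(words) > 1:
        # Heuristic 1: every word except function words is capitalized.
        content = [w for w in words if w.lower() not in FUNCTION_WORDS]
        if content and all(w[0].isupper() for w in content):
            return True
    elif sum(c.isupper() for c in title) > 1:
        # Heuristic 2: a single word with multiple capital letters.
        return True
    # Heuristic 3: share of capitalized occurrences in the article text.
    pattern = r"\b" + re.escape(title) + r"\b"
    occurrences = re.findall(pattern, article_text, flags=re.IGNORECASE)
    if not occurrences:
        return False
    capitalized = sum(1 for o in occurrences if o[0].isupper())
    return capitalized / len(occurrences) >= alpha

print(is_entity("United States", ""))
print(is_entity("US", ""))
print(is_entity("Dog", "The dog barked. A dog ran. Dog food."))
```

Lowering `alpha` accepts more titles as entities and raising it accepts fewer, which is the trade-off the evaluation slides explore for different values of α.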

IMPROVING NAMED-ENTITY RECOGNITION
• Alternative to the capitalization requirement
  → Wikipedia categories
• The authors selected three categories of entities in a news context
  Ø People, organizations and companies
• But Wikipedia categories form a directed cyclic graph
  Ø It is difficult to determine which category is the parent category
  → Use the fact that category names often follow certain patterns

IMPROVING NAMED-ENTITY RECOGNITION
Patterns used for category matching
• Easy to find entities
  Ø These pages are watched more carefully than other pages.

IMPROVING NAMED-ENTITY RECOGNITION
Patterns used for category matching
• Pattern matching is used to identify categories.
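Category matching by name pattern might look like the sketch below. The patterns in `CATEGORY_PATTERNS` are hypothetical examples invented for illustration, since the pattern table from the slides is not reproduced here; only the three target classes (people, organizations, companies) come from the deck.

```python
import re

# Hypothetical patterns, NOT the authors' list.
CATEGORY_PATTERNS = {
    "person": [re.compile(r"^\d{4} births$"), re.compile(r"^Living people$")],
    "organization": [re.compile(r"^Organi[sz]ations (based|established) in .+$")],
    "company": [re.compile(r"^Companies (based|established) in .+$")],
}

def classify_by_categories(categories):
    """Return the entity classes whose patterns match any category name."""
    found = set()
    for name in categories:
        for label, patterns in CATEGORY_PATTERNS.items():
            if any(p.match(name) for p in patterns):
                found.add(label)
    return found

print(classify_by_categories(["1961 births", "Living people"]))
print(classify_by_categories(["Companies based in Oslo"]))
```

Matching on names avoids walking the category graph, so the parent/subcategory ambiguity of a cyclic graph never has to be resolved.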

SYNONYM EXTRACTION
• Sources: internal links, redirects, disambiguation pages
• To extract all the possible synonyms, collect all the links and redirects together with their destination and caption.
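Collecting candidate synonyms then amounts to grouping captions and redirect titles by destination. The sketch below assumes the input is already available as simple tuples; the function name `collect_synonyms` and the tuple layouts are illustrative assumptions, and disambiguation pages are left out for brevity.

```python
from collections import Counter, defaultdict

def collect_synonyms(link_pairs, redirect_pairs):
    """Group candidate synonyms by the article they point to.

    link_pairs:     iterable of (destination, caption)
    redirect_pairs: iterable of (redirect_title, destination)
    Returns a dict mapping destination -> Counter of candidate synonyms.
    """
    synonyms = defaultdict(Counter)
    for destination, caption in link_pairs:
        synonyms[destination][caption] += 1
    for redirect_title, destination in redirect_pairs:
        synonyms[destination][redirect_title] += 1
    return synonyms

links = [("United States", "US"), ("United States", "US"),
         ("United States", "America")]
redirects = [("USA", "United States")]
print(dict(collect_synonyms(links, redirects)["United States"]))
```

Keeping a count per candidate preserves the frequency information that the later filtering steps rely on.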

SYNONYM EXTRACTION
• Example of a synonym set
  o The extracted set contains a fair amount of noise.

SYNONYM EXTRACTION
→ To deal with the noise:
Given the set S of potential synonyms for an entity, for each s_i ∈ S:
1) Remove any suffix enclosed in parentheses and apply light stemming that strips any possessive form.
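Step 1 can be sketched as two regular-expression substitutions. This is a hedged illustration: the function name `normalize_synonym` is made up, and treating "light stemming" as stripping only a trailing English possessive ('s or s') is an assumption about what the authors do.

```python
import re

def normalize_synonym(synonym):
    """Drop a trailing parenthesized suffix and a trailing possessive."""
    # "Path (graph theory)" -> "Path"
    text = re.sub(r"\s*\([^()]*\)\s*$", "", synonym)
    # "Obama's" -> "Obama"
    text = re.sub(r"['’]s$", "", text)
    # "United States'" -> "United States"
    text = re.sub(r"(?<=s)['’]$", "", text)
    return text.strip()

print(normalize_synonym("Path (graph theory)"))
print(normalize_synonym("Obama's"))
```

After normalization, variants that differed only by a disambiguating suffix or a possessive collapse into the same string, so their counts can be merged.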

SYNONYM EXTRACTION
2) Classify the synonym as good or bad, and remove s_i from S if it turns out to be bad.
  • The algorithm uses the capitalization of link captions, without the article text.
    o It is similar to the algorithm described under Generic Named Entity Recognition.

SYNONYM EXTRACTION
3) Given freq(s_q) as the frequency of a synonym s_q and |S| as the number of items in S, remove s_i if it fails a frequency threshold controlled by the parameter β (β = 0.01 in the experiments).

EMPLOYING SYNONYMS IN SEARCH
External or internal
• Entity normalization before indexing
  o Translate synonyms into their main entity reference.
• Query expansion
  o Expand the query to include multiple synonyms.
  △ Submit individual queries and present the total/merged result set.
Details are given in [3].

EMPLOYING SYNONYMS IN SEARCH
Synonym selection
• A problem with query expansion
  o Popular entities have a very large number of synonyms with very small variations.
• The solution
  o Limit the expanded query to the top 5-10 synonyms.
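Limiting the expansion to the top synonyms can be sketched as follows. Ranking by frequency, the `OR` query syntax, and the function name `expand_query` are all assumptions for illustration; the slides only state that the expanded query is capped at the top 5-10 synonyms.

```python
def expand_query(entity, synonym_counts, k=5):
    """Build an OR query from the entity and its k most frequent synonyms."""
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(synonym_counts.items(), key=lambda item: (-item[1], item[0]))
    top = [name for name, _ in ranked if name != entity][:k]
    return " OR ".join(f'"{term}"' for term in [entity] + top)

counts = {"US": 10, "USA": 8, "America": 5, "U.S.": 3}
print(expand_query("United States", counts, k=2))
```

A small `k` keeps the query short for popular entities whose synonym sets are dominated by near-identical variants.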

EVALUATION
A. Evaluation Environment
• The goal of the experiments
  Ø Study the quality of entity recognition and synonym detection using the Wikipedia-based approaches described earlier.
• System overview

EVALUATION
A. Evaluation Environment
• The metrics used in the evaluation
  o Precision and recall
    Ø The reference set: the set of items that would be generated from the input set if the operation performed on the input set were perfect.
    Ø N: the size of the reference set
    Ø M: the size of the generated result set
    Ø C: the number of correct items in the result set
    Ø Precision = C / M, recall = C / N
  o F-Measure
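Under the definitions of N, M and C above, the metrics can be computed as in this sketch. The F-measure is assumed here to be the balanced harmonic mean of precision and recall, which the slide does not spell out; the function name and the example numbers are made up.

```python
def precision_recall_f(n_reference, m_generated, c_correct):
    """Return (precision, recall, F-measure) from the counts N, M and C."""
    precision = c_correct / m_generated if m_generated else 0.0
    recall = c_correct / n_reference if n_reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Assumption: balanced F-measure (harmonic mean of precision and recall).
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Made-up counts: N = 100 reference items, M = 80 generated, C = 60 correct.
print(precision_recall_f(100, 80, 60))
```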

EVALUATION
A. Evaluation Environment
Two-fold focus:
1. Automatic generation of an NE dictionary
2. Using the dictionary to better handle the occurrences of different synonyms
Subsets: randomly chosen and then manually classified.

EVALUATION
B. Named Entity Recognition Results
Generic recognition
• Precision, recall and F-measure of the recognized entities for different values of α

EVALUATION
B. Named Entity Recognition Results
NEs from categories
• Number of entities matching each of the patterns
• Number of unique entities per category

EVALUATION
B. Named Entity Recognition Results
NEs from categories
• Only a very small list of entries were not NEs.
• Non-entities tagged with entity categories

EVALUATION
B. Named Entity Recognition Results
NEs from categories
• Precision, recall and F-measure of the categorized entities

EVALUATION
B. Named Entity Recognition Results
NEs from categories
• Recall of the NE classification algorithm when used on the categories, for different values of α

EVALUATION
B. Named Entity Recognition Results
Observation
• Generic recognition
  o Precision: 80%, recall: 95%
• NEs from categories
  o Precision: 98%, recall: 99%, in addition to giving the entities grouped by category
→ A problem with generating too many entities
  o Only a fraction of them are actually news-relevant, and the irrelevant ones may become noise as they match the wrong person.

EVALUATION
C. Synonyms
Statistics from the synonym extraction
• On average, fewer synonyms were found for people than for the other categories.

EVALUATION
C. Synonyms
• Precision, recall and F-measure for the synonyms

EVALUATION
C. Synonyms
Observations
• Popular entities have a very large number of synonyms with very small differences.
  → Try to determine the quality of the entries the links come from, and use that to weight the synonyms.

CONCLUSION
• The paper presents approaches for using Wikipedia to automatically build a dictionary of NEs and their synonyms.
• The evaluation shows that Wikipedia is well suited as a data source for NE mining.
  o A large number of entities can be extracted with high precision.
  o Synonym extraction yielded many synonyms that were correct but would rarely be used in a search query, as they were very context-specific.
• Future work includes using additional Wikipedia structures and contents for improved NE recognition and categorization.