Named Entity Recognition in Query
Jiafeng Guo, Gu Xu, Xueqi Cheng, Hang Li (ACM SIGIR 2009)
Speaker: Yi-Lin, Hsu
Advisor: Dr. Koh, Jia-ling
Date: 2009/11/16

Outline
• Introduction to NERQ
• Problem
• Implementation
• WSLDA
• Experimental Results
• Conclusion and Future Work
2009/10/22

Introduction to NERQ
• Named entity recognition (NER) is a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories, such as the names of persons, organizations, and locations, expressions of time, quantities, monetary values, and percentages.

Introduction to NERQ
• NERQ involves two tasks:
  – 1. Detection of the named entity in a given query
  – 2. Classification of the named entity into predefined classes
  – Example: mining movie titles
  – Applications: Web search, etc.
• Challenges
  – Queries are usually very short
  – Queries are not necessarily in standard form

Query Data
• New data source for NER
  – About 70% of search queries contain named entities.
  – Rich context for determining the classes of entities.
• Query context
  – “harry potter walkthrough” → “harry potter cheats” (contexts in the same class)
• Wisdom of crowds
• Very large-scale data that keeps growing
• Frequent updates with emerging named entities

NERQ Problem
• A query containing one named entity is represented as a triple (e, t, c):
  – e: named entity
  – t: context of e, written in the form α#β, where # marks the position of e
  – c: class of e
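
The triple representation can be illustrated with a small sketch. The names `Triple` and `make_triple` are hypothetical and not taken from the paper; the matching rule (whole-word, case-insensitive) is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Triple:
    entity: str   # e: the named entity
    context: str  # t: the rest of the query, with '#' where e appeared
    label: str    # c: the class of e


def make_triple(query: str, entity: str, label: str) -> Optional[Triple]:
    """Return (e, t, c) if `entity` occurs in `query` as a whole-word span."""
    q_words = query.lower().split()
    e_words = entity.lower().split()
    n = len(e_words)
    if n == 0:
        return None
    for i in range(len(q_words) - n + 1):
        if q_words[i:i + n] == e_words:
            # Replace the entity span with the '#' placeholder.
            context = " ".join(q_words[:i] + ["#"] + q_words[i + n:])
            return Triple(" ".join(e_words), context, label)
    return None


triple = make_triple("harry potter walkthrough", "Harry Potter", "Game")
print(triple)
```

Here the query "harry potter walkthrough" yields the entity "harry potter" and the context "# walkthrough".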

Probabilistic Approach
• (e, t, c)* = argmax_(e, t, c) Pr(q, e, t, c)
             = argmax_(e, t, c) Pr(q | e, t, c) Pr(e, t, c)
             = argmax_(e, t, c) Pr(e, t, c)    (1)
• Pr(e, t, c) = Pr(e) Pr(c | e) Pr(t | e, c)
              = Pr(e) Pr(c | e) Pr(t | c)    (2)
  – The last step of (2) makes an assumption: Pr(t | e, c) = Pr(t | c), i.e. the context depends only on the class, not on the specific named entity.
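
As a worked example of Eq. (2), the sketch below multiplies the three factors for one candidate triple. The function name `joint_probability` and all probability values are made up for illustration; they are not results from the paper.

```python
def joint_probability(p_e, p_c_given_e, p_t_given_c, e, t, c):
    """Pr(e, t, c) = Pr(e) * Pr(c | e) * Pr(t | c), following Eq. (2)."""
    return (p_e.get(e, 0.0)
            * p_c_given_e.get(e, {}).get(c, 0.0)
            * p_t_given_c.get(c, {}).get(t, 0.0))


# Toy probability tables (assumed values, for illustration only).
p_e = {"harry potter": 0.02}
p_c_given_e = {"harry potter": {"Movie": 0.5, "Book": 0.3, "Game": 0.2}}
p_t_given_c = {
    "Movie": {"# trailer": 0.1},
    "Book": {"# author": 0.05},
    "Game": {"# walkthrough": 0.2, "# cheats": 0.1},
}

game_score = joint_probability(p_e, p_c_given_e, p_t_given_c,
                               "harry potter", "# walkthrough", "Game")
movie_score = joint_probability(p_e, p_c_given_e, p_t_given_c,
                                "harry potter", "# walkthrough", "Movie")
print(game_score, movie_score)
```

With these toy tables the "Game" reading scores 0.02 × 0.2 × 0.2, while the "Movie" reading scores zero because "# walkthrough" was never seen as a Movie context.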

Topic Model for NERQ
• Given training data T = {(e_i, t_i, c_i) | i = 1, ..., N}, the learning problem can be formalized as: [equation image not preserved]
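
One form of the learning problem that is consistent with Eq. (2) is maximum likelihood over the training triples. This is a reconstruction under the assumption that the triples in T are independent, not a copy of the original slide.

```latex
\max \prod_{i=1}^{N} \Pr(e_i, t_i, c_i)
  \;=\; \max \prod_{i=1}^{N} \Pr(e_i)\,\Pr(c_i \mid e_i)\,\Pr(t_i \mid c_i)
```

If the class of each training pair is not observed, the same factorization would be summed over classes, giving $\prod_{i=1}^{N} \Pr(e_i) \sum_{c} \Pr(c \mid e_i)\,\Pr(t_i \mid c)$, which has the shape of a topic model with entities as documents, contexts as words, and classes as topics.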

Implementation
• Offline Training
• Online Prediction

Offline Training
• Seeds: …, Harry Potter, …
• Scan the query log with the seed named entities and collect the queries that contain them.
• Query log: …, Harry Potter trail, Harry Potter walkthrough, Harry Potter cheats, …
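
The scan step can be sketched as follows. The function `collect_contexts` is a hypothetical helper, and whole-word, case-insensitive matching is an assumption; a real query log would need a far more efficient lookup.

```python
from collections import Counter, defaultdict


def collect_contexts(query_log, seeds):
    """Map each seed entity to a Counter of the contexts it appears in."""
    contexts = defaultdict(Counter)
    for query in query_log:
        q_words = query.lower().split()
        for seed in seeds:
            s_words = seed.lower().split()
            n = len(s_words)
            if n == 0:
                continue
            for i in range(len(q_words) - n + 1):
                if q_words[i:i + n] == s_words:
                    # Replace the seed span with the '#' placeholder.
                    context = " ".join(q_words[:i] + ["#"] + q_words[i + n:])
                    contexts[seed.lower()][context] += 1
                    break  # count each query at most once per seed
    return contexts


toy_log = [
    "harry potter trail",
    "harry potter walkthrough",
    "harry potter cheats",
    "Harry Potter cheats",
    "weather today",
]
seed_contexts = collect_contexts(toy_log, ["Harry Potter"])
print(dict(seed_contexts["harry potter"]))
```

Queries that do not contain any seed, such as "weather today", contribute nothing.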

Offline Training
• Pr(e): estimated from the total frequency of queries containing e in the query log
• Pr(c|e): estimated by WS-LDA
• Pr(t|c): fixed
[Figure: a query split into named entity (Harry Potter, New Moon), context (trails), and class (movie)]
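
A minimal sketch of the frequency-based estimate of Pr(e) follows. Normalizing by the summed frequency of all known entities is an assumption; the slide only says that Pr(e) comes from the total frequency of queries containing e. The counts are made up.

```python
from collections import Counter


def estimate_entity_prior(entity_query_counts):
    """Turn per-entity query frequencies into a distribution Pr(e)."""
    total = sum(entity_query_counts.values())
    if total == 0:
        return {}
    return {e: n / total for e, n in entity_query_counts.items()}


# Toy frequencies (assumed values, for illustration only).
counts = Counter({"harry potter": 300, "new moon": 100})
prior = estimate_entity_prior(counts)
print(prior)
```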

Online Prediction
• Query: harry potter trails
• Find the most likely triple (e, t, c) in G(q), the set of candidate triples generated from the query q
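
Online prediction can be sketched as enumerating candidate triples and keeping the best-scoring one. How G(q) is generated here (every contiguous word span as a candidate entity) is an assumption, the helper names are hypothetical, and the probability tables are toy values.

```python
def candidate_triples(query, classes):
    """Enumerate a candidate set G(q): each contiguous word span as e,
    the remainder with '#' as t, and each class as c."""
    words = query.lower().split()
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            entity = " ".join(words[start:end])
            context = " ".join(words[:start] + ["#"] + words[end:])
            for c in classes:
                yield entity, context, c


def predict(query, p_e, p_c_given_e, p_t_given_c):
    """Return the triple maximizing Pr(e) * Pr(c | e) * Pr(t | c), or None."""
    best, best_score = None, 0.0
    for e, t, c in candidate_triples(query, list(p_t_given_c)):
        score = (p_e.get(e, 0.0)
                 * p_c_given_e.get(e, {}).get(c, 0.0)
                 * p_t_given_c.get(c, {}).get(t, 0.0))
        if score > best_score:
            best, best_score = (e, t, c), score
    return best, best_score


# Toy probability tables (assumed values, for illustration only).
p_e = {"harry potter": 0.02, "harry": 0.01}
p_c_given_e = {
    "harry potter": {"Movie": 0.6, "Game": 0.4},
    "harry": {"Movie": 1.0},
}
p_t_given_c = {
    "Movie": {"# trails": 0.1, "# potter trails": 0.001},
    "Game": {"# walkthrough": 0.2},
}

best, best_score = predict("harry potter trails", p_e, p_c_given_e, p_t_given_c)
print(best, best_score)
```

With these toy tables the segmentation "harry potter" + "# trails" as a Movie outscores the shorter entity "harry".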

WSLDA
• Introduce weak supervision
  – Objective: LDA log likelihood (the LDA probability term) + soft constraints
  – Soft constraints: defined from the document probability on the i-th class and the document binary label on the i-th class
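
One plausible reading of the soft-constraint term is a sum, over classes, of the binary label times the share of the document assigned to that class. The sketch below assumes that reading; the function name, the use of hard topic assignments, and the example values are all assumptions rather than the slide's exact formula.

```python
def soft_constraint(word_topics, labels):
    """Sum over classes i of y_i * zbar_i.

    word_topics: class index assigned to each word of one document
    labels:      binary label y_i per class (1 if the document is labeled
                 with class i, else 0)
    zbar_i is taken to be the fraction of words assigned to class i.
    """
    if not word_topics:
        return 0.0
    n = len(word_topics)
    total = 0.0
    for i, y_i in enumerate(labels):
        zbar_i = sum(1 for z in word_topics if z == i) / n
        total += y_i * zbar_i
    return total


# A document with four words: two assigned to class 0, one each to classes 1 and 2.
# The document is labeled with classes 0 and 1 only.
constraint = soft_constraint([0, 0, 1, 2], [1, 1, 0])
print(constraint)
```

The term is largest when all of the document's probability mass falls on its labeled classes, which is what makes it a soft rather than a hard constraint.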

WSLDA
• Objective function: [equation image not preserved]
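
An objective of the form "LDA log likelihood + soft constraints" could be written as below. This is a reconstruction from the slide's description, not the original equation: the symbols $w$ (words of a document), $y$ (binary class labels), $\Theta$ (model parameters), $\bar{z}_i$ (document probability on the i-th class), and the weight $\lambda$ are assumed notation.

```latex
O(w \mid y, \Theta) \;=\; \log p(w \mid \Theta) \;+\; \lambda \, C(y, \Theta),
\qquad
C(y, \Theta) \;=\; \sum_{i=1}^{K} y_i \, \bar{z}_i
```

Setting $\lambda = 0$ recovers plain LDA, while a large $\lambda$ pushes each document toward its labeled classes.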

Experiments
• A real data set consisting of 6 billion queries
• 930 million unique queries
• Four semantic classes: “Movie”, “Game”, “Book”, and “Music”
• 4 human annotators
• 180 named entities were selected from the web sites of Amazon, GameSpot, and Lyrics
• 120 for training and 60 for testing
• Finally, we obtained 432,304 contexts and about 1.5 million named entities

Experiments
• Randomly sampled 400 queries from the recognition results (0.14 million) for evaluation.
• Example queries:
  – pics of fight club; braveheart quote; watch gladiator online; american beauty company; 12 angry men characters
  – mario kart guide; pc mass effect; crysis mods; mother teresa images; condemned screenshots
  – 4 minutes lyric; king kong; the black swan summary; blackwater novel; new moon
  – rehab the song; nineteen minutes synopsis; umbrella chords; all summer long video; girlfriend lyrics

Experiments
• The performance of NERQ is evaluated in terms of Top-N accuracy.
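
Top-N accuracy can be sketched as the fraction of queries whose correct triple appears among the N highest-ranked predictions. This definition and the function name `top_n_accuracy` are assumptions for illustration, and the data is a toy example.

```python
def top_n_accuracy(ranked_predictions, gold, n):
    """Fraction of queries whose gold triple is within the top-n predictions.

    ranked_predictions: one list of triples per query, best first
    gold:               the correct triple for each query
    """
    if not gold:
        return 0.0
    hits = sum(1 for preds, truth in zip(ranked_predictions, gold)
               if truth in preds[:n])
    return hits / len(gold)


# Toy predictions for two queries (assumed values, for illustration only).
ranked = [
    [("harry potter", "# trails", "Movie"), ("harry potter", "# trails", "Book")],
    [("new moon", "#", "Movie"), ("new moon", "#", "Book")],
]
gold = [("harry potter", "# trails", "Movie"), ("new moon", "#", "Book")]

top1 = top_n_accuracy(ranked, gold, 1)
top2 = top_n_accuracy(ranked, gold, 2)
print(top1, top2)
```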

Experiments
• We compared the WS-LDA approach with two baseline methods: Determ and LDA.
• Determ learns the contexts of a class by simply aggregating all the contexts of named entities belonging to that class.
• LDA and WS-LDA take a probabilistic approach.
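
The Determ baseline, as described above, can be sketched as a plain aggregation. The function name `determ_contexts` and the toy seeds are hypothetical; the point of the example is that an ambiguous entity leaks its contexts into every class it belongs to.

```python
from collections import defaultdict


def determ_contexts(seed_classes, entity_contexts):
    """Assign every context of a seed entity to all classes of that seed."""
    class_contexts = defaultdict(set)
    for entity, classes in seed_classes.items():
        for c in classes:
            class_contexts[c].update(entity_contexts.get(entity, ()))
    return dict(class_contexts)


# Toy seeds and contexts (assumed values, for illustration only).
seed_classes = {
    "harry potter": ["Movie", "Book", "Game"],
    "new moon": ["Movie", "Book"],
}
entity_contexts = {
    "harry potter": {"# walkthrough", "# trailer"},
    "new moon": {"# trailer", "# summary"},
}

by_class = determ_contexts(seed_classes, entity_contexts)
print(by_class)
```

Here "# trailer" ends up as a Game context only because "harry potter" is also a game, which is the kind of noise a probabilistic model can down-weight.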

Experiments
[Table: learned contexts for the Movie, Game, Book, and Music classes under Determ, LDA, and WS-LDA; contents not preserved]

• Table 5: Comparisons on Learned Named Entities of Each Class (P@N). Columns: Movie, Game, Book, Music, Average-Class. [table values not preserved]
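
P@N (precision at N) is conventionally the share of the top N ranked items that are correct. The sketch below uses that standard definition; the function name and the toy ranking are illustrative and do not reproduce any value from Table 5.

```python
def precision_at_n(ranked_items, relevant, n):
    """Share of the top-n ranked items that are in the relevant set."""
    if n <= 0:
        return 0.0
    top = ranked_items[:n]
    return sum(1 for item in top if item in relevant) / n


# Toy ranking of entities learned for the Movie class (assumed data).
ranked_movies = ["harry potter", "new moon", "seattle weather", "king kong"]
true_movies = {"harry potter", "new moon", "king kong"}

p_at_2 = precision_at_n(ranked_movies, true_movies, 2)
p_at_4 = precision_at_n(ranked_movies, true_movies, 4)
print(p_at_2, p_at_4)
```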

Experiments
• Comparisons between WS-LDA and LDA

Conclusion
• Formalized the problem of NERQ
• Proposed a novel method for NERQ
• Developed a new topic model called WS-LDA
• Future work:
  – We plan to add more classes and conduct further experiments.
  – The proposed method focuses on queries with a single named entity.
  – Some queries contain named entities outside the predefined classes (e.g. “american beauty company”).
  – Some contexts were not learned by our approach because they are uncommon (e.g. “lyrics for # by chris brown”).