A Framework for Robust Discovery of Entity Synonyms
KDD '12: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
2014/10/21 B4 ikuta
Introduction
“Entity Synonyms”
• Alternate strings that refer to the same named entity.
  “Canon EOS 400d Digital Camera” ➠ “canon rebel xti”, “canon kiss k”
  “Harry Potter and the Half Blood Prince” ➠ “harry potter 6”, “half blood prince”
• Critical for many applications.
  ーInformation retrieval.
  ーNamed entity recognition in documents.
Introduction
• Prior techniques suffer from several limitations.
✓ Click Similarity (ClickSim)
  T. Cheng, H. W. Lauw, and S. Paparizos. Entity synonyms for structured web search. TKDE, 2011.
✓ Document Similarity (DocSim)
  P. D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. CoRR, cs.LG/0212033, 2002.
✓ Distributional Similarity (DistSim)
  P. Pantel, E. Crestan, A. Borkovsky, A.-M. Popescu, and V. Vyas. Web-scale distributional similarity and entity set expansion. In EMNLP, 2009.
To overcome limitations… “a general framework” with two novel similarity functions
Review of State-of-the-art
■ Click Similarity (ClickSim)
  ーUses query click logs
■ Document Similarity (DocSim)
  ーBased on co-occurrence in web documents
[Figure: web documents containing “Microsoft excel” and “ms excel”]
Limitations of State-of-the-art
1) Click Log Sparsity
  ーMany true synonyms of an entity are tail queries, that is, rarely issued queries with few or no clicks.
  ーClickSim will miss these synonyms.
Limitations of State-of-the-art
2) Inability to distinguish entities of different classes
❖ By ClickSim
  [Figure: “microsoft excel” and “ms excel tutorial”]
❖ By DocSim
  [Figure: web documents containing “Microsoft excel” and “ms excel tutorial”]
DEFINITIONS AND FRAMEWORK
• The notion of a synonym similarity function
• The general framework for discovering synonyms
Synonym Relation and Properties
■ The synonym discovery problem
※ In this paper, the strings in the synonym relation are unambiguous.
Synonym Relation and Properties
The following properties should be true for judging
Synonym Similarity Function
The challenge is to check … the strings se and re alone are not enough…
➢ Auxiliary evidence for se and re
aux(s): the auxiliary evidence associated with a string s, for example:
• the set of documents clicked by users for the web search query s
• the set of documents in which s is mentioned.
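The click-based form of aux(s) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the log layout (a list of `(query, doc_id)` pairs), the function name `build_click_aux`, and the document ids are all assumptions made for the example.

```python
from collections import defaultdict

def build_click_aux(click_log):
    """Map each query string s to aux(s): the set of documents clicked for s.

    click_log is assumed to be an iterable of (query, doc_id) pairs.
    """
    aux = defaultdict(set)
    for query, doc_id in click_log:
        aux[query].add(doc_id)
    return dict(aux)

# Made-up toy log; document ids are placeholders.
click_log = [
    ("microsoft excel", "d1"),
    ("microsoft excel", "d2"),
    ("ms excel", "d1"),
    ("ms excel", "d2"),
    ("ms excel tutorial", "d3"),
]

aux = build_click_aux(click_log)
print(aux["ms excel"])
```

The document-based form of aux(s) would have the same shape, with the set holding the documents that mention s instead of the documents clicked for s.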
Synonym Similarity Function
✓ This function does not need to be symmetric.
Synonym Discovery Framework
Synonym Discovery Framework
In a threshold-based framework (with threshold θ), the following condition is met to ensure symmetry:
✓ An advantage of the framework: it can accommodate various similarity functions.
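One way to read the threshold condition is that a candidate is accepted only when the similarity clears θ in both directions, which makes the output symmetric even when the function is not. The sketch below assumes that reading; the slide's own formula is not reproduced here. The names `discover_synonyms` and `containment`, the toy evidence sets, and θ = 0.7 are all hypothetical.

```python
def discover_synonyms(reference, candidates, sim, theta):
    """Keep candidates whose similarity clears theta in BOTH directions.

    sim(a, b) is any (possibly asymmetric) similarity of a to b, so the
    framework can be reused with different similarity functions.
    """
    return [
        c for c in candidates
        if sim(c, reference) >= theta and sim(reference, c) >= theta
    ]

# Made-up auxiliary evidence: documents associated with each string.
AUX = {
    "microsoft excel": {"d1", "d2", "d3", "d4"},
    "ms excel": {"d1", "d2", "d3"},
    "excel": {f"d{i}" for i in range(1, 13)},
}

def containment(a, b):
    """Toy asymmetric similarity: share of a's evidence also seen for b."""
    if not AUX[a]:
        return 0.0
    return len(AUX[a] & AUX[b]) / len(AUX[a])

result = discover_synonyms(
    "microsoft excel", ["ms excel", "excel"], containment, theta=0.7
)
print(result)
```

Here "excel" passes in one direction only (all evidence for "microsoft excel" is also evidence for "excel", but not the reverse), so the two-way check rejects it.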
BASELINE SIMILARITY FUNCTIONS
Why are they inadequate?
• ClickSim and DocSim
Click Similarity
The strength of the relationship of se to re is formalized as follows:
○ symmetry property (satisfied)
☓ similarity property (not satisfied)
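The formula for this slide did not survive, so the following is only an assumed stand-in: a Jaccard overlap of clicked-document sets, chosen because it is symmetric and therefore consistent with the ○ mark above. The function name `click_sim` and the toy data are hypothetical. The sketch also shows the sparsity limitation: a tail query with no clicks scores zero regardless of its meaning.

```python
def click_sim(candidate, reference, aux):
    """Assumed Jaccard overlap between the clicked-document sets."""
    a = aux.get(candidate, set())
    b = aux.get(reference, set())
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Made-up evidence: aux maps a query to the documents clicked for it.
aux = {
    "microsoft excel": {"d1", "d2", "d3", "d4"},
    "ms excel": {"d1", "d2", "d3"},
    "ms spreadsheet": set(),  # tail query: no clicks recorded
}

head_score = click_sim("ms excel", "microsoft excel", aux)
tail_score = click_sim("ms spreadsheet", "microsoft excel", aux)
print(head_score, tail_score)
```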
Document Similarity
The strength of the relationship of se to re is formalized as follows:
☓ symmetry property (not satisfied)
☓ similarity property (not satisfied)
NOVEL SIMILARITY FUNCTIONS
Two novel similarity functions are proposed:
1) Pseudo Document Similarity
2) Query Context Similarity
They ensure the synonym properties and overcome the limitations.
Pseudo Document Similarity (PseudoDocSim for short)
• Addresses the sparsity problem.
[Figure: “microsoft spreadsheet” and “ms excel”]
Pseudo Document Similarity
The strength of the relationship of se to re is
Pseudo Document Similarity
There are two main benefits of using pseudo document similarity.
1) It harvests strictly more supporting evidence than ClickSim.
2) In contrast to DocSim, a pseudo document lets us focus on the essential parts of a document rather than its complete content.
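The idea can be sketched under stated assumptions, since the slide's formula is not available here. The sketch assumes that a document's pseudo document is the set of tokens from all queries that clicked on it, and that the score of se to re is the fraction of documents clicked for re whose pseudo document contains every token of se. Function names and the toy log are hypothetical.

```python
from collections import defaultdict

def build_pseudo_docs(click_log):
    """Assumed pseudo document: tokens of every query that clicked the doc."""
    pdocs = defaultdict(set)
    for query, doc_id in click_log:
        pdocs[doc_id].update(query.split())
    return pdocs

def pseudo_doc_sim(candidate, reference, click_log):
    """Share of docs clicked for `reference` whose pseudo doc covers `candidate`."""
    pdocs = build_pseudo_docs(click_log)
    ref_docs = {d for q, d in click_log if q == reference}
    if not ref_docs:
        return 0.0
    tokens = set(candidate.split())
    hits = sum(1 for d in ref_docs if tokens <= pdocs[d])
    return hits / len(ref_docs)

# Made-up log: "ms spreadsheet" is a tail query that never appears in it.
click_log = [
    ("microsoft excel", "d1"),
    ("microsoft excel", "d2"),
    ("ms excel", "d1"),
    ("microsoft spreadsheet", "d1"),
    ("microsoft spreadsheet", "d2"),
]

score = pseudo_doc_sim("ms spreadsheet", "microsoft excel", click_log)
print(score)
```

In this toy log the candidate has no clicks of its own, yet its tokens are covered by the pseudo document of d1, so it still collects evidence. That is the sense in which pseudo documents gather more support than click overlap alone.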
Query Context Similarity
The words that appear in the context of entity names in web search queries
➢ help distinguish between entities of different classes.
[Figure: context words “book”, “ppt”, “guide”, “download”, “help”, “training” for “microsoft excel” and “ms excel tutorial”]
aux(s): the set of contexts in web search queries
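A minimal sketch of context-based evidence follows. It assumes a context is whatever words remain in a query once the string s is removed, and it compares two context sets with Jaccard overlap; both choices are assumptions for illustration, as are the function names and the toy queries (including which context words go with which string).

```python
def contexts(s, queries):
    """aux(s): words left over in each query that contains s as a phrase."""
    s_tokens = s.split()
    n = len(s_tokens)
    found = set()
    for query in queries:
        q_tokens = query.split()
        for i in range(len(q_tokens) - n + 1):
            if q_tokens[i:i + n] == s_tokens:
                rest = q_tokens[:i] + q_tokens[i + n:]
                if rest:
                    found.add(" ".join(rest))
                break
    return found

def query_context_sim(a, b, queries):
    """Assumed Jaccard overlap between the two context sets."""
    ca, cb = contexts(a, queries), contexts(b, queries)
    union = ca | cb
    return len(ca & cb) / len(union) if union else 0.0

# Made-up query log.
queries = [
    "microsoft excel download", "microsoft excel help", "microsoft excel training",
    "ms excel download", "ms excel help", "ms excel training",
    "ms excel tutorial book", "ms excel tutorial ppt", "ms excel tutorial guide",
]

same_class = query_context_sim("microsoft excel", "ms excel", queries)
other_class = query_context_sim("microsoft excel", "ms excel tutorial", queries)
print(same_class, other_class)
```

In this toy log the software name and the tutorial share no contexts, so the score separates them even though their names overlap.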
Combining the Two Similarity Measures
The synonym discovery process in the framework can be stated as follows:
EFFICIENT AND SCALABLE ALGORITHMS
◆ System Architecture
The most expensive computation
PseudoDocSim Computation
1) How to reduce the computation cost of calculating PseudoDocSim (efficiency).
2) How to partition the task into subtasks (scalability).
Efficient Computation
Three algorithms to calculate PseudoDocSim:
① Baseline algorithm: does not exploit any overlap of tokens.
② DocIndex: exploits overlap on the document side but not on the candidate side.
③ DualIndex: exploits overlap on both the document and candidate sides.
• the set of distinct tokens in the set Se
• the set of distinct tokens in the set De
Partitioning the Task
To scale the computation of PseudoDocSim:
o The MapReduce framework
o Partition the computation into subtasks by entity
Formally, the Map and Reduce steps are (the key is re):
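Partitioning by entity can be imitated in plain Python, with the entity string re as the key. This is a single-process sketch, not the authors' Map and Reduce definitions: the record layout `(entity, candidate, doc_id, contained)` and the function names are hypothetical, and `contained` stands for a precomputed flag saying whether the document's pseudo document contains the candidate.

```python
from collections import defaultdict

def map_step(record):
    """Emit (key, value) with the entity string as the key."""
    entity, candidate, doc_id, contained = record
    yield entity, (candidate, contained)

def reduce_step(entity, values):
    """Per entity: fraction of documents that contain each candidate."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for candidate, contained in values:
        totals[candidate] += 1
        hits[candidate] += int(contained)
    return {c: hits[c] / totals[c] for c in totals}

def run(records):
    """Simulate the shuffle phase by grouping mapped values by key."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_step(record):
            groups[key].append(value)
    return {entity: reduce_step(entity, vals) for entity, vals in groups.items()}

# Made-up records for two entities.
records = [
    ("microsoft excel", "ms excel", "d1", True),
    ("microsoft excel", "ms excel", "d2", True),
    ("microsoft excel", "ms spreadsheet", "d1", True),
    ("microsoft excel", "ms spreadsheet", "d2", False),
    ("la police credit union", "la police cu", "d9", True),
]

scores = run(records)
print(scores["microsoft excel"])
```

Because every value for one entity reaches the same reducer, each entity can be scored independently of the others, which is what lets the work be split across machines.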
EXPERIMENTAL EVALUATION
Experimental Setting
• The query click log from “Bing” from 2009 to 2010.
• Two real-life datasets:
  o Local Business Names (Local): 937 local business names sampled from the “Bing” local catalog, e.g., la police credit union;
  o Software Names (Software): 10 software names, e.g., microsoft excel.
• Human experts judge whether each discovered synonym is a true synonym.
Quality Results
The results on the following 4 settings:
I. ClickSim: click similarity
II. DocSim: document similarity
III. PseudoDocSim: pseudo document similarity
IV. PseudoDocSim+QCSim: pseudo document similarity with context similarity
~4 synonyms per entity at ~80% precision for software entities
~12 synonyms per entity at ~85% precision for software entities
◆ How the various similarity features contribute to the output when a classifier is used.
Efficiency Results
Report the processing efficiency results
CONCLUSION
• In this paper, they proposed a general framework for discovering entity synonyms.
• They studied novel similarity functions that overcome the limitations of previously proposed functions.
• They developed efficient and scalable algorithms to generate such synonyms, and robust techniques to handle long entity names.
• Their experiments demonstrate the superior quality of their synonyms and the efficiency of their algorithms, as well as their impact on improving search.