Sectoral Operational Programme Increase of Economic Competitiveness Investments















- Slides: 15

Sectoral Operational Programme "Increase of Economic Competitiveness"
"Investments for your future"
General Word Sense Disambiguation System applied to Romanian and English Languages - SenDiS
Project co-financed by the European Regional Development Fund
Word sense disambiguation using lexicon nets
Alin Ştefănescu, Oana Șoica, Andrei Mincă & SenDiS team
June 27, 2013

Introduction Alin Ştefănescu

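The ambiguous hen
„Găina cea nouă ne ouă nouă ouă.“ (roughly: "The new hen lays us nine eggs."; "nouă" can mean "new", "nine" or "to us", and "ouă" can mean "lays" or "eggs")
Image from aliexpress.com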

Natural Language Processing (NLP)
- NLP develops systems that allow computers to communicate with people using everyday language.
- An important area: natural language understanding
- Subproblem: word sense disambiguation

NLP @ SOFTWIN Research
- NLP is an active research area at Softwin Research
  - biometrics is the other active area
  - previously, antivirus research in the same R&D department led to the creation of an award-winning, internationally certified internet security and antivirus software

NLP @ Softwin Research - SenDiS
- SenDiS project at Softwin Research: "A general Word Sense Disambiguation System applied to Romanian and English languages"
- 2010-2013
- co-financed through the Sectoral Operational Programme "Increase of Economic Competitiveness" (POS-CCE)
- team of 7-10 computer scientists and linguists
- method: use of structured linguistic knowledge encoded with Softwin's GRAALAN formalism
- previous projects: PALIROM & LINCOR (with collaborators from UB, ILIR, UPB etc.)

NLP system - GRAALAN
- SenDiS builds upon and further develops the NLP system GRAALAN at Softwin Research
  1. Linguistic theoretical background
  2. GRAALAN Grammar Abstract Language
  3. Linguistic tools
  4. Linguistic knowledge bases
  5. Linguistic applications

Word Sense Disambiguation (WSD)
- identify the meaning of words in context in a computational manner
- very difficult problem
- three main approaches:
  - supervised disambiguation
  - unsupervised disambiguation
  - knowledge-based disambiguation
Image: "Tower of Babel" by Brueghel

Dealing with ambiguity
GRAALAN knowledge bases can encode several types of ambiguities:
- multiword expression (MWE) ambiguity
- morphologic ambiguity (synthetic & analytic)
- lexical ambiguity (synthetic & analytic)
- morphemic ambiguity
- syntactic ambiguity

Lesk Algorithm - basic idea
- a simple and intuitive knowledge-based WSD approach
- computes the word overlap between the sense definitions (glosses) of the target words in the context
- for a two-word context $(w_1, w_2)$ with $S_1 \in \mathrm{Senses}(w_1)$ and $S_2 \in \mathrm{Senses}(w_2)$:
  $\mathrm{score}_{\mathrm{Lesk}}(S_1, S_2) = |\mathrm{gloss}(S_1) \cap \mathrm{gloss}(S_2)|$
- another variant, less computationally intensive, computes the word overlap between a word sense definition and the other context words:
  $\mathrm{score}_{\mathrm{LeskVar}}(S) = |\mathrm{context}(w) \cap \mathrm{gloss}(S)|$
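
A minimal Python sketch of the two scores on this slide, under some assumptions: glosses and contexts are plain strings, tokenisation is a simple lowercase word split, and no stop-word filtering or lemmatisation is done. The function names (`score_lesk`, `score_lesk_var`, `best_sense_pair`) and the toy English glosses for "pine" and "cone" are illustrative only and are not taken from the SenDiS system.

```python
import re

def tokens(text):
    """Lowercase a gloss or context and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def score_lesk(gloss1, gloss2):
    """Number of distinct words shared by two sense definitions."""
    return len(tokens(gloss1) & tokens(gloss2))

def score_lesk_var(context, gloss):
    """Variant: words shared by the context and one sense definition."""
    return len(tokens(context) & tokens(gloss))

def best_sense_pair(senses_w1, senses_w2):
    """Pick the (sense of w1, sense of w2) pair with the largest gloss overlap.

    Each argument maps a sense id to its gloss (hypothetical data layout).
    """
    pairs = ((s1, s2) for s1 in senses_w1 for s2 in senses_w2)
    return max(pairs, key=lambda p: score_lesk(senses_w1[p[0]], senses_w2[p[1]]))

# Toy glosses, for illustration only.
pine = {
    1: "kinds of evergreen tree with needle-shaped leaves",
    2: "waste away through sorrow or illness",
}
cone = {
    1: "solid body which narrows to a point",
    2: "something of this shape whether solid or hollow",
    3: "fruit of certain evergreen trees",
}

print(best_sense_pair(pine, cone))
print(score_lesk_var("the pine cone fell from the evergreen tree", pine[1]))
```

The pairwise score needs one comparison per sense pair, so its cost grows with the product of the sense counts of all context words; the variant compares each sense of one word against a fixed context, which is why it is the cheaper of the two.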

Our approach: Lesk Algorithm extended
The Lesk reasoning is extended: every annotated sense is expanded with its definition, whose words are themselves annotated with disambiguated senses, and so on recursively.
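
One possible reading of this recursive extension, sketched in Python. Everything here is an assumption made for illustration: sense ids are written as `lemma#index`, `tagged_gloss` maps each sense to the senses annotated in its definition, the expansion stops after a fixed `depth`, and the score is the size of the overlap between the two expanded sets. The slides do not specify these details, and the sense tags in the toy data (loosely modelled on the "radio" entry of the Romanian example in these slides) are hypothetical.

```python
def extended_signature(sense, tagged_gloss, depth):
    """Collect the senses reachable from `sense` through at most `depth` gloss levels."""
    signature = set()
    frontier = {sense}
    for _ in range(depth):
        # Senses used in the glosses of the current frontier, not collected before.
        frontier = {t for s in frontier for t in tagged_gloss.get(s, ())} - signature
        if not frontier:
            break
        signature |= frontier
    return signature

def score_extended(s1, s2, tagged_gloss, depth=2):
    """Overlap between the expanded sense sets of two candidate senses."""
    a = extended_signature(s1, tagged_gloss, depth)
    b = extended_signature(s2, tagged_gloss, depth)
    return len(a & b)

# Hypothetical sense-tagged glosses (the tags are assumptions, not SenDiS data).
tagged_gloss = {
    "radio#0": ["aparat#1", "receptie#2", "radiofonic#0", "radioreceptor#0"],
    "radioreceptor#0": ["aparat#1", "radiofonic#0", "radio#0"],
    "aparat#1": ["sistem#0", "energie#0"],
    "receptie#2": ["energie#0"],
    "receptie#1": ["hotel#0", "camera#0"],
}

print(score_extended("aparat#1", "receptie#2", tagged_gloss, depth=1))
print(score_extended("aparat#1", "receptie#1", tagged_gloss, depth=1))
```

Because the overlap is computed over sense ids and not over surface words, two definitions can only match on words that were disambiguated to the same sense, which is the point of extending the definitions with annotated senses.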

Lesk Algorithm extended - example
Generic example (Principle):
<lemma>… =
  Sense 1 : <word> <word>
  Sense 2 : <word> <word>
  Sense 3 : <word> <word>
<lemma>… =
  Sense 1 : <word>
  Sense 2 : <word>
  Sense 3 : <word>

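Lesk Algorithm extended - example
Romanian example (English translations in brackets):
"radio" =
  "0" : "Aparat de receptie radiofonica; radioreceptor." [Apparatus for radio reception; radio receiver.]
  "1" : "Instalatie de transmitere a sunetelor prin unde electromagnetice, cuprinzând aparatele de emisiune şi pe cele de receptie." [Installation for transmitting sound by electromagnetic waves, comprising the transmitting and the receiving apparatus.]
"aparat" =
  "0" : "Sistem de piese care serveste pentru o operatie mecanica, tehnica, stiintifica etc." [System of parts used for a mechanical, technical, scientific etc. operation.]
  "1" : "Sistem tehnic care transforma o forma de energie în alta." [Technical system that transforms one form of energy into another.]
  "2" : "Ansamblu de organe anatomice care servesc la îndeplinirea unei functiuni fundamentale." [Set of anatomical organs that serve to perform a fundamental function.]
  "3" : "Totalitatea serviciilor sau a personalului care asigura bunul mers al unei institutii sau al unui domeniu de activitate." [The totality of services or staff that ensure the smooth running of an institution or of a field of activity.]
  "4" : "Ansamblul mijloacelor care servesc pentru un anumit scop." [The set of means that serve a particular purpose.]
"receptie" =
  "0" : "Operatie de luare în primire a unui material sau a unei lucrari, pe baza verificarii lor cantitative şi calitative." [Operation of taking delivery of a material or a piece of work, based on its quantitative and qualitative verification.]
  "1" : "Serviciu într-o întreprindere hoteliera care evidenta persoanelor aflate în hotel, face repartizarea în camere a solicitatorilor etc." [Service in a hotel that keeps track of the persons staying in the hotel, assigns rooms to applicants etc.]
  "2" : "(Tehn) Primire a unei anumite forme de energie pentru a o transforma în alta forma de energie." [(Tech.) Receiving a given form of energy in order to transform it into another form of energy.]
  "3" : "Reuniune, banchet cu caracter festiv (În cercurile oficiale)." [Gathering or banquet of a festive character (in official circles).]
  "4" : "Primire, întâmpinare (cu caracter ceremonios) a unui oaspete." [Reception, (ceremonious) welcoming of a guest.]
"radiofonic" =
  "0" : "Care aparţine radiofoniei, privitor la radiofonie, care utilizeaza radiofonia." [Belonging to, relating to, or using radiophony.]
"radioreceptor" =
  "0" : "Aparat folosit pentru receptionarea undelor radiofonice (prin antene), pentru transformarea lor în semnale sonore şi transmiterea lor prin intermediul difuzoarelor; radio." [Apparatus used for receiving radio waves (through antennas), converting them into sound signals and transmitting them through loudspeakers; radio.]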

WSD using a specific lexicon network
Diagram: Word Sense nodes linked by the "gloss tagged" relation (defined by / defines); together these links form a LARGE lexicon net
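
The two directions of the "gloss tagged" relation can be pictured as a directed graph over word senses. The sketch below is only an illustration under stated assumptions: sense ids use a hypothetical `lemma#index` notation, the input maps each sense to the senses annotated in its definition, and `build_lexicon_net` is an invented helper name, not part of GRAALAN or SenDiS.

```python
from collections import defaultdict

def build_lexicon_net(tagged_gloss):
    """Return the two edge maps of the net: (defined_by, defines).

    defined_by[s] = senses that occur in the definition of s
    defines[t]    = senses whose definition uses t (the inverse relation)
    """
    defined_by = {sense: set(gloss) for sense, gloss in tagged_gloss.items()}
    defines = defaultdict(set)
    for sense, gloss_senses in defined_by.items():
        for used in gloss_senses:
            defines[used].add(sense)
    return defined_by, dict(defines)

# Hypothetical sense-tagged glosses, for illustration only.
tagged_gloss = {
    "radio#0": ["aparat#1", "receptie#2"],
    "aparat#1": ["sistem#0", "energie#0"],
    "receptie#2": ["energie#0"],
}

defined_by, defines = build_lexicon_net(tagged_gloss)
print(sorted(defines["energie#0"]))
```

Storing both directions makes it cheap to walk the net either from a sense to the senses that explain it or from a sense to every definition that relies on it.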

SenDiS - workflow