Definition Clustering, Sense Naming & Lexical Augmentation
Fabien JALABERT (jalabert@lirmm.fr), Mathieu LAFOURCADE (lafourcade@lirmm.fr)
Study context (1/2)
Natural Language Processing
• Lexical semantics
  - WSD
  - Document indexing
• Dictionary construction and vectorization
  - problem: extracting the definition meta-language
  - example: ‘cannibale’ = ‘qui mange l’Homme en parlant de l’Homme’ ("that eats humans, said of a human")
  - themes: homme, manger, rhétorique
• Multi-source approach
  - noise reduction
  - problem: the atomic element is the definition, not the sense (definition ≠ sense)
• Objectives
  - clustering definitions to obtain senses
  - naming these senses
Study context (2/2)
[Diagram: for a term T, the definitions supplied by each source of the multi-source base (def 1 to def 3 from Source 1, def 1 and def 2 from Source 2, def 1 and def 2 from Source 3) are clustered into senses (Sense 1, Sense 2, Sense 3). Sense naming then attaches a name to each sense, drawn from candidate terms t1 … tn. The resulting ‘acception’ (sense) base is re-injected as a new lexical source.]
Summary • Model, Construction, Organization • Definition Clustering • Sense Naming • Lexical Augmentation • Results
Conceptual Vector Model (1/2) [Salton; Deerwester; Chauché; Lafourcade]
• An idea = a vector
• A vector component = a primitive as defined in a thesaurus
  – Thesaurus Larousse: 873 concepts
  – Concepts are inter-related: generator space
• A definition → a vector
  Most activated primitives for ‘frégate’: (oiseau 6134) (transports maritimes et fluviaux 5644) (arme 4891) …
[Figure: the vector for ‘frégate’ drawn against the axes ‘transports maritimes et fluviaux’, ‘oiseau’ and ‘arme’]
Conceptual Vector Model (2/2)
Thematic distance = angle between two vectors
[Figure: two vectors x and y and the angle between them]
• Terms thematically close to ‘frégate’: (destroyer 0.2246) (youyou 0.2267) (voilier 0.2268) (contre-torpilleur 0.2274) (chlamydère 0.2276) (oiseau-jardinier 0.2295) (trois-mâts 0.233) …
• Terms thematically close to ‘frégate/oiseau/’: (oiseau-jardinier 0.1237) (plumeur 0.1319) (goglu 0.136) (travailleur 0.136) (chlamydère 0.1385) (penne 0.141) (Galliformes 0.1422) (agami 0.1428) …
• Terms thematically close to ‘frégate/bateau/’: (démâtage 0.1604) (dégréer 0.1676) (naval 0.1718) (bateau-piège 0.1774) (bateau-vanne 0.1821) (batelet 0.1824) …
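The thematic distance above can be illustrated with a minimal sketch. It assumes vectors are plain lists of non-negative activations; the function name `angular_distance` and the three-component toy vectors are illustrative assumptions (the model itself uses 873 components), and only the three ‘frégate’ activations come from the slides.

```python
import math

def angular_distance(x, y):
    """Angle (in radians) between two vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("the angle is undefined for a null vector")
    # Clamp to [-1, 1] to absorb floating-point rounding before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return math.acos(cosine)

# Toy 3-component space: (oiseau, transports maritimes et fluviaux, arme).
fregate = [6134, 5644, 4891]   # activations quoted on the slide
bird_like = [1, 0, 0]          # hypothetical vector, for illustration only
ship_like = [0, 1, 1]          # hypothetical vector, for illustration only

print(angular_distance(fregate, bird_like))
print(angular_distance(fregate, ship_like))
```

With non-negative components the angle always lies between 0 (same theme) and π/2 (no shared primitive), so a smaller value means thematically closer terms.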
SYGMART: Definition Vector Computation [Chauché]
[Figure: numbered morpho-syntactic parse trees (nodes PHAMBG, PH, GN, GV, GA) for the ambiguous phrase ‘la petite brise la glace’, with the lemmas le, petit, briser, brise, glace, glacer. The governor of each group is marked: GN – Gouv – adj, GV – Gouv, GN – Gouv – nf.]
Multi-Agent Organization: double loop [Lecerf; Schwab]
• Learning agents: Sygmart, computation of vectors from definitions, synonymy, antonymy, …
• Endogenous loop: within the agent
• Exogenous loop: with the other agents (society)
Clustering Objective Grouping definitions into senses
Clustering: Strategy (1/5)
• Deep analysis
  - several criteria
• No training (but enhancement through the exogenous loop)
• Frontier between senses and definitions
  - centroid approach
  - heuristics (preferences)
    - number of clusters = maximum number of definitions across the dictionaries
    - two definitions from the same source → two different clusters
Difficulty Clustering ‘botte’ 2/5
Clustering: Algorithm 1/2 (3/5)
• Source-by-source iteration until a minimum-value distribution is obtained
• Minimum-value source/cluster assignment computed from a distance matrix: Hungarian method, O(n³) [Kuhn; Ford, Fulkerson]
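The assignment step for one source can be sketched as follows. This is a textbook O(n³) formulation of the Hungarian method with potentials, not the authors' implementation; the function name `assign_definitions_to_clusters` and the distance values are assumptions made for illustration. Rows are the definitions of one source, columns are clusters, and each definition receives a distinct cluster so that the summed distance is minimal.

```python
def assign_definitions_to_clusters(cost):
    """Return, for each row (definition), the column (cluster) it is assigned to.

    cost[i][j] is the distance between definition i and cluster j.
    Requires at most as many definitions as clusters.
    """
    n, m = len(cost), len(cost[0])
    if n > m:
        raise ValueError("need at least as many clusters as definitions")
    INF = float("inf")
    u = [0.0] * (n + 1)          # row potentials
    v = [0.0] * (m + 1)          # column potentials
    p = [0] * (m + 1)            # p[j] = row matched to column j (0 = free)
    way = [0] * (m + 1)          # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:       # reached a free column: path found
                break
        while j0 != 0:           # flip the matching along the path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    return assignment

# Hypothetical distances: 2 definitions of one source, 2 clusters.
print(assign_definitions_to_clusters([[0.1, 0.2], [0.15, 0.9]]))
```

In this toy matrix a greedy choice would give the first definition its nearest cluster (0.1) and force the second into a cluster at distance 0.9; the optimal assignment swaps them for a total of 0.35. Because the matching is one-to-one, two definitions of the same source never share a cluster, which matches the heuristic stated for the strategy.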
Clustering: Algorithm 2/2 (4/5)
• For each criterion: one evaluation → one distance matrix
• Criteria
  - Comparison of the lexical contents of definitions (term frequency, co-occurrences, etc.)
  - Angular distance
  - Symbolic markers
    - morphology
    - etymology (‘avocat’: ‘ahuacatl’ / ‘advocatus’)
    - use (‘vieux’, ‘ancien’, ‘poétique’, …)
    - language level (‘argot’, ‘familier’, …)
    - domain (‘médecine’, ‘zoologie’, …)
Clustering: Results (5/5)
• Correct results in many cases: 90% for nouns, 70% for verbs; adjectives remain to be evaluated
• Problems with very strong polysemy
  - vagueness, continuity in meanings
  - support verbs: ‘prendre’, …
• To study: increasing the number of clusters (example: ‘botte’)
• Next step: we would like to designate meanings
Sense Naming Objective To give the system some capacity to « talk about a sense »
Sense Naming: Properties (1/10)
• Dictionary independent
• Interface (man-system & system-system)
• A new lexical source (looping)
• Semantic annotation
  - La frégate/vaisseau/ naviguait à travers les océans ("The frigate/vessel/ sailed across the oceans")
  - La frégate/oiseau/ planait à travers les nues en poussant son cri incomparable ("The frigate/bird/ glided through the skies, uttering its incomparable cry")
Sense Naming: Procedure (2/10)
1. Extraction
2. Validation (bijection) and dispatching of polysem bags
3. Evaluation of candidates: ordering and extracting the most appropriate ones
Sense Naming: Extraction (3/10)
• Extraction attached to a meaning
  – Morpho-syntactic analysis of the definition
  – Extraction of markers: « anc. », « méd. », …
  – Extraction from unstructured or semi-structured data (XML, …)
  Example: ‘frégate’: [nf] [ancien] Au XVe s., grande barque demi-pontée gréant deux voiles latines sur antenne et assurant la liaison entre les ports et les escadres de galère. [Club Internet]
• Extraction from polysem bags
  – Word lists (such as the synonym lists of the Université de Caen [Ploux, Victorri])
  Example: ‘botte’ = chaussure, bottillon, coup, attaque, amas, bouquet, …
Sense Naming: Validation (4/10)
Bijection: being able to re-associate the proper meaning
  ƒ : (term, sense) → (term, annotation)
  ƒ⁻¹ : (term, annotation) → (term, sense)
• A candidate associated with a sense should be closer to its own sense than to any other
• Unattached candidates are associated with the closest meaning
• A candidate should not be present in a competing definition
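The three validation rules can be sketched as a single check. This is a simplified reading under stated assumptions: distances are given as a dictionary from sense identifier to distance, definitions are plain strings matched by whitespace tokens, and the function name `validate_candidate`, the sense identifiers and all sample values are hypothetical.

```python
def validate_candidate(candidate, attached_sense, distances, definitions):
    """Return the sense the candidate may name, or None if it is rejected.

    candidate      -- candidate word
    attached_sense -- sense it was extracted from, or None if unattached
    distances      -- {sense: distance between the candidate and that sense}
    definitions    -- {sense: definition text}
    """
    if attached_sense is None:
        # Rule 2: unattached candidates go to the closest meaning.
        target = min(distances, key=distances.get)
    else:
        # Rule 1: must be strictly closer to its own sense than to any other.
        own = distances[attached_sense]
        if any(d <= own for s, d in distances.items() if s != attached_sense):
            return None
        target = attached_sense
    # Rule 3: must not appear in a competing definition.
    for sense, text in definitions.items():
        if sense != target and candidate.lower() in text.lower().split():
            return None
    return target

# Hypothetical data for two senses of 'frégate'.
definitions = {"w1": "grand oiseau de mer", "w2": "navire de guerre rapide"}
print(validate_candidate("oiseau", "w1", {"w1": 0.2, "w2": 0.9}, definitions))
print(validate_candidate("vaisseau", None, {"w1": 0.85, "w2": 0.3}, definitions))
print(validate_candidate("mer", None, {"w1": 0.6, "w2": 0.5}, definitions))
```

Rejecting a candidate that fails any rule is what keeps the annotation invertible: each accepted (term, annotation) pair points back to exactly one sense.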
Sense Naming: Evaluation (5/10)
• Extraction grade
• Evaluating the capacity to disambiguate (to distinguish a sense from all others)
• Evaluating the capacity to associate: cognitive cost reduction [Prince]
Sense Naming: Extraction grade (6/10)
‘frégate’: [nf] [ancien] Au XVe s., grande barque demi-pontée gréant deux voiles latines sur antenne et assurant la liaison entre les ports et les escadres de galère. [Club Internet]
[Figure: syntactic analysis of the definition (CC: ‘au XVe’; Sujet: ‘grande barque demi-pontée’; GV: ‘gréant’; COD: ‘deux voiles latines’; CC: ‘sur antennes’ …) with an extraction grade attached to the words. Grades as shown on the slide: XVe, grande, barque, demi-pontée: (6) (2) (1) (3) (1); gréant, voiles, latines, antenne: (4) (5) (6) (5) (7).]
Sense Naming: Disambiguation capacity 1/2 (7/10)
Candidate ‘vaisseau’ for ‘frégate’
• Senses of ‘frégate’: w.1 (oiseau), w.2 (navire ancien), w.3 (navire moderne)
• Senses of ‘vaisseau’: t.11 (navire), t.12 (sanguin)
• Distances shown in the diagram: 0.85, 0.8, 0.9, 1.2; d1 = 0.3, d2 = 0.4, d3 = 0.2
• Absolute margin: Ma = d2 − d1 = 0.1
• Relative margin: Mr = Ma / d1 = 0.1 / 0.3 = 0.33
• Risk of ‘non-sens’: Rns = d3 / Mr = 0.2 / 0.33 = 0.6
Sense Naming: Disambiguation capacity 2/2 (8/10)
Candidate ‘vaisseau’ for ‘frégate’ (repeated from the previous slide): Ma = 0.1, Mr = 0.33, Rns = 0.6
Candidate ‘voilier’ for ‘frégate’
• Senses of ‘frégate’: w.1 (oiseau), w.2 (navire ancien), w.3 (navire moderne)
• Senses of ‘voilier’: t.11 (oiseau), t.12 (navire)
• Distances shown in the diagram: 0.7, 0.72, 0.72, 0.3; d1 = 0.25, d2 = 0.29, d3 = 0.65
• Absolute margin: Ma = d2 − d1 = 0.04
• Relative margin: Mr = Ma / d1 = 0.04 / 0.25 = 0.16
• Risk of ‘non-sens’: Rns = d3 / Mr = 0.65 / 0.16 ≈ 4
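The arithmetic of the two examples can be reproduced with a short sketch. The formulas are read off the numbers on the slides (the absolute margin is taken as d2 minus d1 because that is what yields 0.1 and 0.04); the function name `disambiguation_scores` is an assumption, and d1, d2 and d3 are treated as opaque inputs since the slides only show them as labelled distances in the diagram.

```python
def disambiguation_scores(d1, d2, d3):
    """Return (absolute margin, relative margin, risk of 'non-sens')."""
    if d1 <= 0:
        raise ValueError("d1 must be positive to compute a relative margin")
    absolute_margin = d2 - d1
    if absolute_margin <= 0:
        raise ValueError("no positive margin: d2 must exceed d1")
    relative_margin = absolute_margin / d1
    risk_of_nonsense = d3 / relative_margin
    return absolute_margin, relative_margin, risk_of_nonsense

# 'vaisseau' as a name for a sense of 'frégate' (values from the slide).
print(disambiguation_scores(0.3, 0.4, 0.2))
# 'voilier' as a name for a sense of 'frégate' (values from the slide).
print(disambiguation_scores(0.25, 0.29, 0.65))
```

On these figures ‘vaisseau’ has the larger relative margin and the lower risk, whereas ‘voilier’, whose own senses include both a bird and a ship, separates the senses of ‘frégate’ poorly.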
Sense Naming: Cognitive cost (9/10)
Survey done for 13 terms totalling 38 definitions; 134 answers
- collocations (botte de paille, …)
- co-occurrences (Tintin, Milou) [Church; Daille; Véronis]
- synonyms and hypernyms (manger → se nourrir; mouche → insecte → animal)
- domain / context for technical terms (médecine, architecture, agriculture, sport, …)
Sense Naming: Results (10/10)
- the multi-criteria approach seems well suited
- easily extensible
- strong precision
- enhancement needed for meta-language processing
- criteria implementation (associative memory, lexical functions [Mel’čuk])
- synthesis grammar (botte/secret/ vs. botte/secrète/); example: ‘botte’
Useful for multilingual lexical databases [Schwab]
Lexical Augmentation
Multilingual lexical database: some terms are not lexicalized in some languages
Objective: lexicalize these terms
Lexical Augmentation: Papillon project (1/2) [Boitet; Mangeot-Lerebours; Sérasset; Lepage]
[Diagram: French and English terms linked through a central layer of acceptions]
• FRANÇAIS: abats de volaille, abats, abats de bœuf, abats de porc, déchet
• ACCEPTIONS: giblets, offal.1, offal.2, …
• ENGLISH: giblets, offal, beef offal, pork offal, refuse, scrap
Lexical Augmentation: Procedure (2/2)
• Extraction from definitions and sense names (dictionary glosses): abats = {‘porc’, ‘volaille’, ‘bœuf’, …}
• Pattern generation: ‘abats de volaille’, ‘abats en volaille’, …
• Pattern validation with co-occurrences: relative number of hits on Google
• Difficulties: ‘dog meat’ → ‘viande pour chien’ / ‘viande de chien’?
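Pattern generation and validation could look like the sketch below. It assumes hit counts have already been collected by some means and are passed in as a dictionary; the function name `best_pattern`, the preposition list and the counts in the example are placeholders for illustration, not measured values and not a real search API.

```python
def best_pattern(head, modifier, hit_counts, prepositions=("de", "en")):
    """Build 'head PREP modifier' patterns and rank them by relative hits.

    hit_counts maps a candidate phrase to its number of hits.
    Returns (best phrase, {phrase: share of hits}), or (None, {}) when
    none of the patterns was ever observed.
    """
    candidates = [f"{head} {prep} {modifier}" for prep in prepositions]
    total = sum(hit_counts.get(c, 0) for c in candidates)
    if total == 0:
        return None, {}
    shares = {c: hit_counts.get(c, 0) / total for c in candidates}
    best = max(shares, key=shares.get)
    return best, shares

# Placeholder counts, chosen only to exercise the function.
counts = {"abats de volaille": 90, "abats en volaille": 10}
print(best_pattern("abats", "volaille", counts))
```

Relative shares rather than raw counts are compared so that frequent and rare head words are judged on the same scale. The ‘dog meat’ difficulty shows the limit of the approach: ‘viande pour chien’ and ‘viande de chien’ are both attested but mean different things, so hit counts alone cannot pick the right one.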
Conclusion
Clustering
• Promising results
  - manual evaluation on 100 difficult terms: 70% proper clusters, 30% bad assignments
  - locutions
• Problem: increasing the number of clusters; maturing of the basic clusters
Sense Naming (complementary with conceptual vectors)
• Good precision
  - manual evaluation: 90% of pertinent terms
  - automatic evaluation: 70% (angular distance)
• Towards a synthesis grammar: botte/secret/ vs. botte/secrète/
Future work
• More criteria (associative memory, more lexical functions)
• Enhance definition analysis (meta-language)
Contribution
Theoretical
• formalization of the ‘disambiguation capacity’ and of the ‘risk of non-sens’
• formalization of annotation in lexical semantics
• proposal of a generic similarity measure between definitions
Practical
• implementation as agents: categorization, naming (Web services)
• lexical augmentation (in progress)
Dissemination
• a poster at RECITAL’2003 (Batz-sur-Mer, 10-14 June 2003)
• a paper at Papillon’2003 (Sapporo, 2-6 July 2003)
• submission to RFIA’2004
Thank you