English version TALN 2003 atelier TALN et multilinguisme
English version
TALN 2003 atelier: "TALN et multilinguisme"
A tool for endogenous and multilingual terminological extraction
Jacques Vergne
GREYC - Université de Caen
14/6/2003
http://www.info.unicaen.fr/~jvergne
© Jacques Vergne
frame application (1)
• news websites --- system ---> reviews of papers
• users: journalists, web users
  "what and whom are the papers of a given geographical or linguistic space speaking about?"
• inversion of the search-engine problem:
  - search engines: keywords (topics) ---> documents
  - here: search space ---> main topics of the news
• "front pages" of news websites ---> hyperlinks: URL and source code of the hyperlink "texts"
[figure: front page of Le Monde]
frame application (2)
• hyperlink "texts" of "front pages": an editorial choice made by the journalists of news websites
• hyperlink "texts" of "front pages" --extracting--> terms present on several sites
• ---> graph of terms
  - nodes = weighted terms (sites - documents)
  - links = weighted links between terms (co-occurrences of 2 terms in the same hyperlink text)
• the user navigates this graph to reach linked terms and papers
[figure: front page of Le Monde]
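The graph described above can be sketched as follows. This is a minimal illustration, not the tool's implementation: the function names, the `(site, url, text)` tuple layout, the example URLs and the example hyperlink texts are all hypothetical. Node weights are (number of sites, number of documents); link weights count the hyperlink texts in which two terms co-occur.

```python
from collections import defaultdict
from itertools import combinations

def contains_term(tokens, term):
    """True if the words of `term` appear contiguously in `tokens`."""
    t = term.lower().split()
    return any(tokens[i:i + len(t)] == t for i in range(len(tokens) - len(t) + 1))

def build_term_graph(links, terms):
    """links: (site, url, hyperlink text) tuples; terms: extracted terms."""
    sites = defaultdict(set)   # term -> sites where it appears
    docs = defaultdict(set)    # term -> linked documents where it appears
    edges = defaultdict(int)   # (term_a, term_b) -> co-occurrence count
    for site, url, text in links:
        tokens = text.lower().split()
        present = sorted(t for t in terms if contains_term(tokens, t))
        for t in present:
            sites[t].add(site)
            docs[t].add(url)
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    nodes = {t: (len(sites[t]), len(docs[t])) for t in sites}
    return nodes, dict(edges)

# Hypothetical input data, for illustration only.
links = [
    ("site-a", "http://a.example/1", "la santé des jeunes en milieu scolaire"),
    ("site-b", "http://b.example/7", "santé : une loi du gouvernement"),
    ("site-b", "http://b.example/9", "alcool et santé des jeunes"),
]
terms = ["santé", "jeunes", "gouvernement", "loi", "alcool"]
nodes, edges = build_term_graph(links, terms)
```

Navigating from a node then amounts to listing the edges that contain it, ordered by weight.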
user interface
[figure: navigating the graph of terms; terms shown: santé, jeunes, gouvernement, loi, santé des jeunes, milieu scolaire, école, suivi, été, alcool]
specifications of the tool
• hyperlink "texts" of "front pages" --extracting--> terms present on several sites
• a method able to locate rare function words and frequent content words (such as guerre or war)
• terms centered on content words
• in a multilingual corpus (15 000 to 30 000 words)
• of unknown alphabetical languages
• without parsing, dictionary, or stoplist
state of the art: repeated patterns
• methods of André Salem, Helena Ahonen, François Rousselot:
  - search for repeated patterns, using algorithms derived from the greedy algorithm (n-grams are searched from (n-1)-grams)
  - with, as input, the function words of the processed language (in a stopword list), so that they are not chosen as terms
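The idea of searching n-grams from (n-1)-grams can be illustrated with a generic sketch. This is not the algorithm of any of the cited authors; it only shows the pruning principle, under the assumption that an n-gram is kept as a candidate only when both its (n-1)-gram prefix and suffix are themselves repeated. The function name, parameters and example sentence are hypothetical, and no stopword list is applied here.

```python
from collections import Counter

def repeated_ngrams(tokens, max_n=3, min_count=2):
    """Return {n-gram tuple: count} for n-grams occurring at least min_count times."""
    result = {}
    previous = set()  # repeated (n-1)-grams found at the previous level
    for n in range(1, max_n + 1):
        counts = Counter()
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            # Prune: both sub-grams of size n-1 must already be repeated.
            if n == 1 or (gram[:-1] in previous and gram[1:] in previous):
                counts[gram] += 1
        previous = {g for g, c in counts.items() if c >= min_count}
        if not previous:
            break
        result.update({g: counts[g] for g in previous})
    return result

# Hypothetical example text.
tokens = "la guerre en Irak et la guerre en Europe".split()
patterns = repeated_ngrams(tokens)
```

Without a stopword list, patterns such as ("la",) and ("en",) are returned alongside the content-bearing ones, which is why these methods take the function words of the language as input.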
an endogenous tool
• term proposed by Didier Bourigault: computing prepositional and adjectival phrase attachments in a monolingual corpus, with a dictionary and parsing
• same generic meaning: using lexical distributional regularities in a corpus to process that same corpus
• but a different specific meaning: locating rare function words and frequent content words in a multilingual corpus, without parsing, dictionary, or stoplist
very general linguistic properties
• word frequency alone => silence on frequent content words
• Zipf: "the principle of least effort"
  - the more frequent a word is, the shorter it is
  - short and frequent words are function words
• Saussure: "dans la langue, il n'y a que des différences" ("in language, there are only differences")
• => using the differences of length and of frequency between 2 contiguous words
• no resource other than the processed corpus itself, without identifying the language
expected result
• text: a sequence of function words (f) and content words (C)
  Manifestazioni per la pace in tutto il mondo
• expected result:
  Manifestazioni (C) per (f) la (f) pace (C) in (f) tutto (C) il (f) mondo (C)
difference between 2 contiguous words
• difference criteria between 2 contiguous words:
  - difference of length, in number of letters: il mondo (2 letters - 5 letters)
  - difference of frequency in the corpus: il mondo (19 occurrences - 3 occurrences)
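The two criteria can be computed pair by pair, as in this minimal sketch. The function name is hypothetical; the frequencies of il and mondo are the ones given on the slide.

```python
def contiguous_differences(tokens, freq):
    """For each pair of contiguous words, return
    (left, right, length difference, frequency difference)."""
    return [
        (a, b, len(a) - len(b), freq[a] - freq[b])
        for a, b in zip(tokens, tokens[1:])
    ]

# Corpus frequencies taken from the slide: il (19), mondo (3).
freq = {"il": 19, "mondo": 3}
diffs = contiguous_differences(["il", "mondo"], freq)
# il is 3 letters shorter and 16 occurrences more frequent than mondo
```

A negative length difference combined with a positive frequency difference points to the left word being the function word of the pair; only the signs matter, so no threshold is needed.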
proposed solution: principle
• searching for 2 types of word sequences, in which 1 or 2 function words lie between 2 Content words:
  - sequence C f C: tutto il mondo
    examples of f: du, la, of, im, ne, il, le, lui, y, en
  - sequence C f f C: Manifestazioni per la pace
    examples of f f: de la, que des, n'a, of the, ist ein, is the, aus dem, a été, qui ne
proposed solution: process (0)
1) Extracting function words from the corpus
2) Generating candidate terms
proposed solution: process (1)
1) Extracting function words from the corpus
• segmenting the corpus at the boundaries of hyperlink texts and at punctuation marks --> segments
• for every segment, searching for C f C and C f f C sequences from the differences of length and frequency
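The segmentation step can be sketched as below. This is an assumption-laden illustration: the set of punctuation marks, the function name and the second example text are hypothetical, and elided forms such as l' or 's, which the tool treats as words, are not handled.

```python
import re

# Assumed punctuation set; the slides do not list the marks actually used.
PUNCTUATION = re.compile(r'[.,;:!?()\[\]"«»]+')

def segment(hyperlink_texts):
    """Split every hyperlink text at punctuation; each hyperlink text
    boundary is also a segment boundary. Returns lists of words."""
    segments = []
    for text in hyperlink_texts:
        for piece in PUNCTUATION.split(text):
            words = piece.split()
            if words:
                segments.append(words)
    return segments

segments = segment([
    "Manifestazioni per la pace in tutto il mondo",
    "Irak : la guerre, jour après jour",  # hypothetical hyperlink text
])
```

Segmenting first guarantees that a C f C or C f f C sequence never straddles two hyperlink texts or a punctuation mark.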
proposed solution: process (2)
• for every segment, searching for C f C and C f f C sequences:
  word: length / frequency => deduction
  - Manifestazioni: 14 (long) / 1 (rare) => Content
  - per: 3 (short) / 10 => function
  - la: 2 (short) / 207 (frequent) => function
  - pace: 4 (long) / 2 (rare) => Content
  - in: 2 (short) / 62 (frequent) => function
  - tutto: 5 (long) / 3 (rare) => Content
  - il: 2 (short) / 19 (frequent) => function
  - mondo: 5 (long) / 3 (rare) => Content
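A possible reading of this step is sketched below. The exact decision rule is not given in the slides, so the rule used here is an assumption: the 1 or 2 middle words are taken as function words when each is shorter than both neighbours and more frequent than both neighbours. The function name is hypothetical, and the frequencies are those read from the slide's frequency profile.

```python
def find_function_words(segment, freq):
    """Return the words found in the middle of C f C or C f f C sequences,
    using only relative length and frequency (no threshold)."""
    found = set()
    n = len(segment)
    for gap in (1, 2):                      # 1 or 2 function words
        for i in range(n - gap - 1):
            left, right = segment[i], segment[i + gap + 1]
            middle = segment[i + 1:i + gap + 1]
            shorter = all(len(m) < min(len(left), len(right)) for m in middle)
            more_frequent = all(freq[m] > max(freq[left], freq[right]) for m in middle)
            if shorter and more_frequent:
                found.update(middle)
    return found

sentence = "Manifestazioni per la pace in tutto il mondo".split()
# Frequencies as read from the slide (assumed where the extraction is ambiguous).
freq = {"Manifestazioni": 1, "per": 10, "la": 207, "pace": 2,
        "in": 62, "tutto": 3, "il": 19, "mondo": 3}
found = find_function_words(sentence, freq)
tags = "".join("f" if w in found else "C" for w in sentence)
```

With this rule, per and la are found through the C f f C window around them, while in and il are found through C f C windows.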
proposed solution: process (3)
2) Generating candidate terms
• according to the patterns C+ and C+ f+ C+:
  - C+: Manifestazioni, pace, tutto, mondo
  - C+ f+ C+: Manifestazioni per la pace in tutto il mondo
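Once every word carries a C or f tag, the two patterns can be matched on the tag string, as in this sketch. It assumes one tag character per word and emits overlapping C+ f+ C+ matches as separate candidates; how the tool itself treats chained matches is not stated in the slides. The function name is hypothetical.

```python
import re

def candidate_terms(words, tags):
    """words: list of words; tags: string with one 'C' or 'f' per word."""
    terms = []
    # Pattern C+ : maximal runs of Content words.
    for m in re.finditer(r"C+", tags):
        terms.append(" ".join(words[m.start():m.end()]))
    # Pattern C+ f+ C+ : the lookahead allows overlapping matches, and the
    # lookbehind forces each match to start at the beginning of a C run.
    for m in re.finditer(r"(?<!C)(?=(C+f+C+))", tags):
        terms.append(" ".join(words[m.start(1):m.end(1)]))
    return terms

words = "Manifestazioni per la pace in tutto il mondo".split()
tags = "CffCfCfC"  # tagging of this sentence: C f f C f C f C
terms = candidate_terms(words, tags)
```

Candidates produced this way always begin and end with a content word, which keeps the terms centered on content words.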
results (1) (15 March 2003)
search space 1:
• search space: 22 sites of French national and regional press, 17 sites of European press (Switzerland, Belgium, Germany, Italy, Spain, UK, Ireland), and 4 USA news websites; at least 2 sites for every language
• corpus: 84 Kb, 14 800 words
• candidate terms: 1566 occurrences of 584 candidate terms (from 42 to 2 occurrences / term)
• most frequent function words: de (340), la (207), l' (153), le (113), d' (107), à (107), du (103), et (99), des (88), en (87), les (84), a (82), un (80), Le (74), La (72), L' (62), in (62), une (56), Les (55), 's (55), to (53), pour (43), au (41), sur (41)
search space 2:
• search space: about 100 sites published by Google News, about half of which are USA sites (http://news.google.fr/news/)
• corpus: 163 Kb, 28 500 words
• candidate terms: 2435 occurrences of 820 candidate terms (from 47 to 2 occurrences / term)
• most frequent function words: to (327), in (280), of (237), the (230), 's (166), de (154), for (144), on (143), and (138), a (126), The (118), en (76), la (75), by (55), Al (53), with (52), is (41), A (38), from (36), at (34), i (34), 't (32), un (31), à (31)
results (2)
most frequent candidate terms:
• search space 1: article (42), guerre (21), Jean-Luc Lagardère (17), monde (12), Açores (11), Weitere Artikel (10), mort (10), Bagdad (8), empire (8), semaine (8), Lettre (7), Plan (7), fin (7), guerra (7), procès (7), réforme (7), sommet (7), Echos (6)
• search space 2: Läs mer (47), ÉÑ Ä (29), Laden (24), war (22), Kabul (20), Qaeda (20), China (18), Statement (17), Sep 12 (15), Pak (14), Press Secretary (13), Sep 11 (13), Northern Alliance (12), guerra (12), Irak (11), Kandahar (11)
silence on function words => noise on candidate terms:
• search space 1: 25/584 = 4.3% of the 584 extracted candidate terms
• search space 2: 22/820 = 2.7% of the 820 extracted candidate terms
• function words extracted as candidate terms (both search spaces): Was (5), If (2), Tutti (4), Mais (2), vous (3), Qu' (2), About (2), Wie (2), Alors (2), Wo (2), Ein (2), avant (2), Have (2), contra (2), could (2), plusieurs, This (12), won' (4), Where (2), enough (2), depuis (2), How (7), Alla (3), Why (2), only (2), encore (2), that (2), Don' (6), My (3), après (2), they (2), It (6), auf (3), down (2), when (2), faut (2), tout (2), mieux (2), tutto (2), Most (4), One (2), einer (2), which (2), contra (4), Wer (2), nous (2), now (2)
results (3)
noise on function words => silence on candidate terms:
• search space 1: 15/584 = 2.6% of the 584 extracted candidate terms
  War (9), paix (7), soir (7), war (7), aide (4), dimanche (4), Photo (3), baisse (3), Aide (2), Groupe (2), attendu (2), home (2), turn (2), voie (2), world (2)
• search space 2: 88/820 = 10.7% of the 820 extracted candidate terms
  News (77), New (43), news (23), killed (18), Home (17), Help (16), Free (10), Global (9), Air (8), help (8), make (8), First (7), Get (7), groups (7)
most frequent extracted terms (number of sites - number of documents):
• search space 1: guerre (12-24), Lagardère (11-16), Jean-Luc Lagardère (9-12), monde (8-13), 15 (7-10), 16 (7-9), Aznar (7-8), Açores (7-10), empire (7-8), semaine (7-8), Chirac (6-6), Premier ministre (6-7), fin (6-9), français (6-9), mort (6-10), pays (6-10), site (6-8), sommet (6-6)
• search space 2: Policy (19-23), U.S. (18-39), China (14-29), war (14-71), Special (12-24), This (12-24), United (12-18), Privacy Policy (11-11), Week (11-14), East (10-12), American (9-14), Information (9-13), Press (9-25), Saddam (9-13), Azores (8-8), How (8-10), Index (8-8), Middle East (8-8), Money (8-8)
results (4)
• are rare function words and frequent content words correctly located?
• the computation is founded on differences between words and not on absolute values (no threshold)
  => the detection of function words or content words is nearly independent of their frequency
  - content words: article (42), guerre (21), monde (12), mort (10), guerra (9)
  - function words: von (8), con (7), della (6), sous (5), vom (4), zum (3), einer (2), grâce (1)
• a single context is enough for detection, by means of adequate differences
conclusion
• original method which uses neither parsing, nor dictionary, nor stoplist
  - able to detect rare function words and frequent content words
  - in a multilingual corpus, in alphabetical languages that are unknown a priori, mixed in the corpus, and not identified in the computation
• computation independent of languages, insensitive to the addition of a new language, insensitive to different proportions between languages
• good quality of results, and a method well suited to the task
=> very general linguistic properties are exploited:
  - differences (or relative values)
  - optimisation of language: the more frequent a word is, the shorter it is
your questions?
to download
• you can download this presentation at http://www.info.unicaen.fr/~jvergne/TALN2003mult.JVergne_en.ppt
• also see my presentation at TALN 2002, "A method for top-down and determinist parsing of multilingual corpora", at http://www.info.unicaen.fr/~jvergne/TALN2002_JVergne_en.ppt
• also see the tutorial of Coling 2000, "Trends in Robust Parsing", at http://www.info.unicaen.fr/~jvergne/tutorial.Coling2000.html (presentation and references)