English version TALN 2003 atelier TALN et multilinguisme

TALN 2003 atelier: "TALN et multilinguisme" (English version)
A tool for endogenous and multilingual terminological extraction
Jacques Vergne, GREYC - Université de Caen
14/6/2003
http://www.info.unicaen.fr/~jvergne
© Jacques Vergne

Frame application (1)
• news websites --- system ---> press reviews
• users: journalists, web users
  "What, and whom, are the papers of a given geographical or linguistic space talking about?"
• inversion of the search-engine problem:
  - search engine: keywords (topics) ---> documents
  - this system: search space ---> main topics of the news
• "front pages" of news websites ---> hyperlinks: URL and source code of the hyperlink "texts"
(figure: front page of Le Monde)

Frame application (2)
• the hyperlink "texts" of "front pages" are an editorial choice made by the journalists of news websites
• hyperlink "texts" of "front pages" --extracting--> terms present on several sites
• ---> graph of terms
  - nodes = weighted terms (sites - documents)
  - links = weighted links between terms (co-occurrences of 2 terms in the same hyperlink text)
• the user navigates this graph to reach linked terms and papers
(figure: front page of Le Monde)
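
The graph of terms described above can be sketched as follows. This is only an illustration under stated assumptions: the input format (pairs of site name and hyperlink text), the function name `build_term_graph`, the restriction to single-word terms, and the sample links are all hypothetical and do not come from the presentation. Node weights follow the slide's "(sites - documents)" convention, and a link weight counts the hyperlink texts in which two terms co-occur.

```python
from collections import defaultdict
from itertools import combinations

def build_term_graph(links, terms):
    """links: iterable of (site, hyperlink_text); terms: set of single-word terms.

    Returns (nodes, edges):
      nodes[term] = (number of distinct sites, number of hyperlink texts)
      edges[(a, b)] = number of hyperlink texts containing both a and b
    """
    sites = defaultdict(set)
    docs = defaultdict(int)
    edges = defaultdict(int)
    for site, text in links:
        # Terms present in this hyperlink text, in a stable order.
        present = sorted(set(text.split()) & terms)
        for term in present:
            sites[term].add(site)
            docs[term] += 1
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    nodes = {term: (len(sites[term]), docs[term]) for term in docs}
    return nodes, dict(edges)

# Hypothetical hyperlink texts, for illustration only.
links = [
    ("site-a", "santé des jeunes"),
    ("site-b", "loi sur la santé"),
    ("site-a", "les jeunes et la loi"),
]
nodes, edges = build_term_graph(links, {"santé", "jeunes", "loi"})
print(nodes["santé"])              # (2, 2)
print(edges[("jeunes", "santé")])  # 1
```

Navigating the graph then amounts to looking up, for a chosen term, the links that contain it and the hyperlinks behind them.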

User interface
• navigating the graph of terms
(figure: graph of terms with the nodes santé, jeunes, gouvernement, loi, santé des jeunes, milieu scolaire, école, suivi, été, alcool)

Specifications of the tool
• hyperlink "texts" of "front pages" --extracting--> terms present on several sites
• a method able to locate rare function words and frequent content words (such as guerre or war)
• terms centred on content words
• in a multilingual corpus (15 000 to 30 000 words)
• of unknown alphabetic languages
• without parsing, dictionary, or stoplist

State of the art: repeated patterns
• methods of André Salem, Helena Ahonen, François Rousselot:
  - search for repeated patterns using algorithms derived from the greedy algorithm (n-grams are searched for from (n-1)-grams)
  - with, as input, the function words of the processed language (in a stopword list), so that they are not chosen as terms

An endogenous tool
• term proposed by Didier Bourigault: computing prepositional and adjectival phrase attachments in a monolingual corpus, with a dictionary and parsing
• same generic meaning: using lexical distributional regularities in a corpus to process that same corpus
• but a different specific meaning: locating rare function words and frequent content words in a multilingual corpus, without parsing, dictionary, or stoplist

Very general linguistic properties
• word frequency alone => silence on frequent content words
• Zipf: "the principle of least effort"
  - the more frequent a word is, the shorter it is
  - short and frequent words are function words
• Saussure: "dans la langue, il n'y a que des différences" ("in language, there are only differences")
• => using the differences of length and of frequency between 2 contiguous words
• no resource other than the processed corpus itself, and no language identification

Expected result
• text: a sequence of function words (f) and content words (C)
  Manifestazioni per la pace in tutto il mondo
• expected result:
  C              f   f  C    f  C     f  C
  Manifestazioni per la pace in tutto il mondo

Difference between 2 contiguous words
• difference criteria between 2 contiguous words:
  - difference of length, in number of letters: il mondo (2 letters - 5 letters)
  - difference of frequency in the corpus: il mondo (19 occurrences - 3 occurrences)
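
As a small worked example of these two criteria, the sketch below computes both differences for the pair "il mondo", using the figures quoted on the slide. The function names `contrast` and `function_then_content` are illustrative assumptions, as is the decision rule itself (the left word is treated as the function word when it is both shorter and more frequent).

```python
def contrast(left, right, freq):
    """Return (length difference, frequency difference) of two contiguous words."""
    return len(left) - len(right), freq[left] - freq[right]

def function_then_content(left, right, freq):
    """True when `left` is both shorter and more frequent than `right`."""
    d_len, d_freq = contrast(left, right, freq)
    return d_len < 0 and d_freq > 0

# Corpus frequencies quoted on the slide for this pair.
freq = {"il": 19, "mondo": 3}
print(contrast("il", "mondo", freq))               # (-3, 16)
print(function_then_content("il", "mondo", freq))  # True
```

Only the signs of the two differences are used, so no absolute threshold on length or frequency is needed.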

Proposed solution: principle
• searching for 2 types of word sequences, in which 1 or 2 function words lie between 2 content words:
  - CfC sequence: tutto il mondo (C f C)
    examples of f: du, la, of, im, ne, il, le, lui, y, en
  - CffC sequence: Manifestazioni per la pace (C f f C)
    examples of f f: de la, que des, n'a, of the, ist ein, is the, aus dem, a été, qui ne

Proposed solution: process (0)
1) Extracting function words from the corpus
2) Generating candidate terms

Proposed solution: process (1)
1) Extracting function words from the corpus
• segmenting the corpus at the boundaries of hyperlink texts and at punctuation marks --> segments
• for every segment, searching for CfC and CffC sequences from the differences of length and frequency
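
The segmentation step can be sketched as below. The presentation does not list the punctuation marks used, so the character set in `PUNCTUATION`, the whitespace tokenisation, and the second sample text are assumptions made for illustration; elided forms such as l' or d' are not handled here.

```python
import re

# Assumed set of punctuation marks that close a segment.
PUNCTUATION = re.compile(r'[.,;:!?()"«»]+')

def segment(link_texts):
    """Split each hyperlink text at punctuation; return a list of word lists."""
    segments = []
    for text in link_texts:  # each hyperlink text is itself a boundary
        for part in PUNCTUATION.split(text):
            words = part.split()
            if words:
                segments.append(words)
    return segments

texts = [
    "Manifestazioni per la pace in tutto il mondo",
    "Irak : la guerre, jour 3",  # hypothetical hyperlink text
]
for words in segment(texts):
    print(words)
```

Each resulting segment is then scanned independently, so a sequence never spans two hyperlink texts or crosses a punctuation mark.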

Proposed solution: process (2)
• for every segment, searching for CfC and CffC sequences:

  CffC sequence: Manifestazioni per la pace
    lengths profile:      14 > 3   and   2 < 4      (long, short ... short, long)
    frequencies profile:  1 < 10   and   207 > 2    (rare, frequent ... frequent, rare)
    deductions:           Content, function, function, Content

  CfC sequence: tutto il mondo
    lengths profile:      5 > 2 < 5      (long, short, long)
    frequencies profile:  3 < 19 > 3     (rare, frequent, rare)
    deductions:           Content, function, Content

  (the word "in", between the two sequences: 2 letters, 62 occurrences)
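
The search for CfC and CffC sequences can be sketched as a sliding window over each segment. This is a minimal reading of the slide, not the author's implementation: it assumes that a content word must be both strictly longer and strictly rarer than the adjacent function word, and that in a CffC window only the two outer boundaries are compared. The function names are hypothetical; the frequencies are the ones shown on the slides for this example.

```python
def content_next_to_function(content, function, freq):
    """Assumed test: the content word is longer and rarer than the function word."""
    return len(content) > len(function) and freq[content] < freq[function]

def find_function_words(segments, freq):
    """Collect the words found in the f positions of CfC and CffC windows."""
    found = set()
    for words in segments:
        for i in range(len(words) - 2):  # CfC windows
            a, b, c = words[i:i + 3]
            if (content_next_to_function(a, b, freq)
                    and content_next_to_function(c, b, freq)):
                found.add(b)
        for i in range(len(words) - 3):  # CffC windows
            a, b, c, d = words[i:i + 4]
            if (content_next_to_function(a, b, freq)
                    and content_next_to_function(d, c, freq)):
                found.update((b, c))
    return found

freq = {"Manifestazioni": 1, "per": 10, "la": 207, "pace": 2,
        "in": 62, "tutto": 3, "il": 19, "mondo": 3}
segment = "Manifestazioni per la pace in tutto il mondo".split()
print(sorted(find_function_words([segment], freq)))  # ['il', 'in', 'la', 'per']
```

Because the test compares neighbouring words only, a word can be detected as a function word from a single suitable context.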

Proposed solution: process (3)
2) Generating candidate terms
• according to patterns:
  - C+ : Manifestazioni, pace, tutto, mondo
  - C+ f+ C+ : Manifestazioni per la pace in tutto il mondo
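
One way to apply the two patterns is sketched below, assuming the function words have already been extracted. The slide shows the whole phrase for the C+ f+ C+ pattern; this sketch instead emits every individual C+ f+ C+ window, which is an assumption about how the pattern is applied, and the name `candidate_terms` is hypothetical.

```python
def candidate_terms(words, function_words):
    """Return (terms matching C+, terms matching C+ f+ C+) for one segment."""
    # Group the segment into maximal runs of content (C) or function (f) words.
    runs = []
    for word in words:
        kind = "f" if word in function_words else "C"
        if runs and runs[-1][0] == kind:
            runs[-1][1].append(word)
        else:
            runs.append((kind, [word]))
    simple = [" ".join(ws) for kind, ws in runs if kind == "C"]
    compound = []
    for i in range(len(runs) - 2):
        kinds = (runs[i][0], runs[i + 1][0], runs[i + 2][0])
        if kinds == ("C", "f", "C"):
            compound.append(" ".join(runs[i][1] + runs[i + 1][1] + runs[i + 2][1]))
    return simple, compound

words = "Manifestazioni per la pace in tutto il mondo".split()
simple, compound = candidate_terms(words, {"per", "la", "in", "il"})
print(simple)    # ['Manifestazioni', 'pace', 'tutto', 'mondo']
print(compound)  # ['Manifestazioni per la pace', 'pace in tutto', 'tutto il mondo']
```

Both patterns begin and end on a content word, which keeps the candidate terms centred on content words.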

Results (1) (15 March 2003)

search space 1:
• 22 sites of French national and regional press, 17 sites of European press (Switzerland, Belgium, Germany, Italy, Spain, UK, Ireland), and 4 USA news websites; at least 2 sites for every language
• corpus: 84 Kb, 14 800 words
• candidate terms: 1566 occurrences of 584 candidate terms (from 42 down to 2 occurrences per term)
• most frequent function words: de: 340, la: 207, l': 153, le: 113, d': 107, à: 107, du: 103, et: 99, des: 88, en: 87, les: 84, a: 82, un: 80, Le: 74, La: 72, L': 62, in: 62, une: 56, Les: 55, 's: 55, to: 53, pour: 43, au: 41, sur: 41

search space 2:
• about 100 sites published by Google News, about half of which are USA sites (http://news.google.fr/news/)
• corpus: 163 Kb, 28 500 words
• candidate terms: 2435 occurrences of 820 candidate terms (from 47 down to 2 occurrences per term)
• most frequent function words: to: 327, in: 280, of: 237, the: 230, 's: 166, de: 154, for: 144, on: 143, and: 138, a: 126, The: 118, en: 76, la: 75, by: 55, Al: 53, with: 52, is: 41, A: 38, from: 36, at: 34, i: 34, 't: 32, un: 31, à: 31

Results (2)
most frequent candidate terms; silence on function words => noise on candidate terms

search space 1:
• most frequent candidate terms: article: 42, guerre: 21, Jean-Luc Lagardère: 17, monde: 12, Açores: 11, Weitere Artikel: 10, mort: 10, Bagdad: 8, empire: 8, semaine: 8, Lettre: 7, Plan: 7, fin: 7, guerra: 7, procès: 7, réforme: 7, sommet: 7, Echos: 6
• noise (function words extracted as candidate terms): Was: 5, Tutti: 4, vous: 3, If: 2, Mais: 2, Qu': 2, About: 2, Wie: 2, Alors: 2, Wo: 2, Ein: 2, avant: 2, Have: 2, contra: 2, could: 2, plusieurs, depuis: 2, encore: 2, après: 2, faut: 2, tout: 2, mieux: 2, tutto: 2, einer: 2, nous: 2
• 25/584 = 4.3% of the 584 extracted candidate terms

search space 2:
• most frequent candidate terms: Läs mer: 47, ÉÑ Ä: 29, Laden: 24, war: 22, Kabul: 20, Qaeda: 20, China: 18, Statement: 17, Sep 12: 15, Pak: 14, Press Secretary: 13, Sep 11: 13, Northern Alliance: 12, guerra: 12, Irak: 11, Kandahar: 11
• noise (function words extracted as candidate terms): This: 12, How: 7, Don': 6, It: 6, Most: 4, contra: 4, won': 4, Alla: 3, My: 3, auf: 3, One: 2, Wer: 2, Where: 2, Why: 2, down: 2, enough: 2, only: 2, that: 2, they: 2, when: 2, which: 2, now: 2
• 22/820 = 2.7% of the 820 extracted candidate terms
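
The two noise rates quoted above follow directly from the counts on the slide, as this short check shows. The helper name `rate` is an illustrative assumption; only the four counts come from the presentation.

```python
def rate(errors, total):
    """Percentage of wrongly extracted items, rounded to one decimal place."""
    return round(100 * errors / total, 1)

print(rate(25, 584))  # 4.3  (search space 1)
print(rate(22, 820))  # 2.7  (search space 2)
```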

Results (3)
noise on function words => silence on candidate terms

search space 1:
• content words wrongly taken as function words: War: 9, paix: 7, soir: 7, war: 7, aide: 4, dimanche: 4, baisse: 3, Photo: 3, Aide: 2, attendu: 2, Groupe: 2, home: 2, turn: 2, voie: 2, world: 2
• 15/584 = 2.6% of the 584 extracted candidate terms
• most frequent extracted terms (number of sites - number of documents): guerre (12-24), Lagardère (11-16), Jean-Luc Lagardère (9-12), monde (8-13), 15 (7-10), 16 (7-9), Aznar (7-8), Açores (7-10), empire (7-8), semaine (7-8), Chirac (6-6), Premier ministre (6-7), fin (6-9), français (6-9), mort (6-10), pays (6-10), site (6-8), sommet (6-6)

search space 2:
• content words wrongly taken as function words: News: 77, New: 43, news: 23, killed: 18, Home: 17, Help: 16, Free: 10, Global: 9, Air: 8, help: 8, make: 8, First: 7, Get: 7, groups: 7
• 88/820 = 10.7% of the 820 extracted candidate terms
• most frequent extracted terms (number of sites - number of documents): Policy (19-23), U.S. (18-39), China (14-29), war (14-71), Special (12-24), This (12-24), United (12-18), Privacy Policy (11-11), Week (11-14), East (10-12), American (9-14), Information (9-13), Press (9-25), Saddam (9-13), Azores (8-8), How (8-10), Index (8-8), Middle East (8-8), Money (8-8)

Results (4)
• are rare function words and frequent content words correctly located?
• the computation is based on differences between words, not on absolute values (no threshold)
  => detection of function words and content words is nearly independent of their frequency
  - content words: article (42), guerre (21), monde (12), mort (10), guerra (9)
  - function words: von (8), con (7), della (6), sous (5), vom (4), zum (3), einer (2), grâce (1)
• a single context is enough for detection, provided its differences are adequate

Conclusion
• an original method that uses neither parsing, nor dictionary, nor stoplist
  - able to detect rare function words and frequent content words
  - in a multilingual corpus of alphabetic languages, unknown a priori, mixed in the corpus, and not identified during the computation
• computation independent of languages, insensitive to the addition of a new language and to differing proportions between languages
• good quality of results, and a method well suited to the task
  => very general linguistic properties are exploited:
  - differences (or relative values)
  - optimisation of language: the more frequent a word is, the shorter it is

Your questions?

To download
• you can download this presentation from http://www.info.unicaen.fr/~jvergne/TALN2003mult.JVergne_en.ppt
• see also my presentation at TALN 2002, "A method for top-down and determinist parsing of multilingual corpora", at http://www.info.unicaen.fr/~jvergne/TALN2002_JVergne_en.ppt
• see also the Coling 2000 tutorial "Trends in Robust Parsing" at http://www.info.unicaen.fr/~jvergne/tutorial.Coling2000.html (presentation and references)
