Extending WordNet with syntagmatic information
Luisa Bentivogli, Emanuele Pianta
ITC-irst
2nd GWC, January 20th-23rd 2004, Brno

Overview
• WordNet: paradigmatic vs syntagmatic information
• Recurrent Free Phrases
• Encoding RFPs through Phrasets and Syntagmatic Relations
• Finding RFPs in bilingual dictionaries and corpora
• Conclusions

Paradigmatic vs Syntagmatic
Example sentence: An international conference took place in Brno
• Paradigmatic relations (in absentia): alternatives to the words in the sentence
– international: national
– conference: meeting, symposium
– Brno: Prague, Czech Republic
• Syntagmatic relations (in presentia): relations among the words in the sentence
– free phrase
– multiword expression
– semantic restriction

Why is syntagmatic information useful?
• From a lexicographic point of view
– See the examples of usage in dictionaries (and in WN itself)
– Often a very short phrase
– Sometimes more useful than definitions
• From a computational point of view
– Statistics-oriented, corpus-based methods
– Crucial role of co-occurrence information
– Co-occurrence of words vs co-occurrence of meanings

Lexical units in WordNet
• Criterion for inclusion in synsets: only lexicalized concepts
• What counts as a lexical unit
– Simple words: {tree}
– Idioms
  • non-compositional meaning
  • {rollercoaster, big_dipper, ...}
– Restricted collocations
  • compositional, reduced substitution, no literal translation
  • {criminal_record, record} (Italian: precedenti penali)
– Named entities: {Praha, capital_of_the_Czech_Republic, …}

Problems with inclusion criteria - 1
• Artificial nodes: synsets with no lexical unit
– {social_group}
– {gruppo_sociale}
– Free combinations of words (Benson et al., 1986)
  • DEF: a combination of words following only the general rules of syntax
• Restricted collocations:
– reduced substitution, no literal translation, but compositional
– ex: circulatory system (*blood system, *circulation system)
– are they lexical units?
– should we include them in synsets?
• Can we “keep” the information currently contained in artificial nodes and restricted collocations without violating the criterion for inclusion in synsets?

Problems with inclusion criteria - 2
• A considerable number of expressions that are systematically used to express a concept are excluded from (Multi)WordNet because they are not lexical units
• Ex: “andare in bicicletta” [to bike]
– andare: to move by walking or using a means of locomotion
– in bicicletta: by bike
• Ex: “punta di freccia” [arrowhead]

Introducing Recurrent Free Phrases
• Recurrent free phrase (RFP): a free combination of words which is recurrently used to express a concept
• 1. Syntactically constrained: N|V|A|P phrases (cf. restricted collocations)
• 2. High frequency (“governo italiano”, Italian government)
• 3. High degree of association (“prima volta”, first time)
• 4. Salience:
– the intuition of the native-speaker lexicographer that a certain expression picks out a concept which is perceived as relevant and somehow unitary
– not necessarily related to frequency and word association
  • “vertice internazionale”, international summit (high salience)
  • “coscia destra”, right thigh

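Criteria 2 and 3 (frequency and degree of association) can be estimated from corpus counts. The sketch below is only an illustration: the slides do not name an association measure, so pointwise mutual information (PMI) is an assumption here, and the function name `bigram_pmi` and the toy token list are hypothetical.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Return {bigram: (frequency, PMI)} for adjacent word pairs.

    PMI is an assumed stand-in for "degree of association";
    the slides do not say which measure was used.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), freq in bigrams.items():
        p_pair = freq / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # PMI compares the observed pair probability with the one
        # expected if the two words were independent.
        scores[(w1, w2)] = (freq, math.log2(p_pair / (p_w1 * p_w2)))
    return scores

# Toy data, invented for illustration only.
tokens = "la prima volta il governo italiano la prima volta".split()
scores = bigram_pmi(tokens)
freq, pmi = scores[("prima", "volta")]
print(freq, round(pmi, 2))
```

Salience (criterion 4) is described as a lexicographer's intuition, so it is deliberately left out of this sketch.
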
The salience criterion
• Hypothesis:
– Salience is related to the amount of world knowledge attached to a certain phrase
– Such knowledge cannot be inferred from the meanings of the words composing the phrase
• Example:
– right hand (more salient)
– right thigh

Recurrent Free Phrases for NLP
• Knowledge-based word alignment of parallel corpora
– Ex: cornfield ~ campo di grano
• Word Sense Disambiguation
– campo: 12 senses in MWN
– grano: 9 senses
– both unambiguous in “campo di grano”

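To make the disambiguation argument concrete: if the two words are disambiguated independently, 12 senses of "campo" and 9 of "grano" give 12 × 9 = 108 candidate sense pairs, whereas a phrase-level lookup yields a single concept. The sketch below uses the sense counts from the slide; the lookup table, its identifiers and the function names are hypothetical.

```python
# Sense counts are taken from the slide; everything else is hypothetical.
SENSE_COUNTS = {"campo": 12, "grano": 9}

# Hypothetical phraset index: Italian phrase -> aligned English synset.
PHRASET_INDEX = {("campo", "di", "grano"): "cornfield"}

def sense_combinations(words):
    """Number of sense assignments if each word is disambiguated on its own."""
    total = 1
    for word in words:
        total *= SENSE_COUNTS.get(word, 1)
    return total

def lookup_phraset(tokens):
    """Return the concept for a listed RFP, or None if the phrase is unknown."""
    return PHRASET_INDEX.get(tuple(tokens))

print(sense_combinations(["campo", "grano"]))    # 12 * 9 = 108
print(lookup_phraset(["campo", "di", "grano"]))  # cornfield
```

The same index also serves the alignment use case: it links "campo di grano" directly to "cornfield".
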
Criteria for RFP selection
• RFPs expressing a concept which is not lexicalized in one language but is lexicalized in another (lexical gaps)
– Ex: andare in bicicletta [to bike]
• RFPs synonymous with a lexical unit in the same language
– Ex: strofinaccio dei piatti / canovaccio [dishcloth]
• RFPs that are frequent, cohesive and salient within a corpus taken as the reference corpus
– Ex: vertice internazionale [international summit]
• RFPs whose components are highly polysemous
– Ex: campo di grano [cornfield]

MultiWordNet
• MultiWordNet: Italian/English lexical database
• Princeton WordNet building criteria
• Strict alignment (see expand model)
• Explicit treatment of lexical gaps
• Italian (44,000 words) and
– Hebrew (University of Haifa, just started)
– Cf. Spanish WordNet (EuroWordNet)

Introducing Phrasets
• Phraset: a set of synonymous recurrent free phrases
– ENG-synset {cornfield}; ITA-synset {GAP}; ITA-phraset {campo_di_grano}
– ENG-synset {toilet_roll}; ITA-synset {GAP}; ITA-phraset {rotolo_di_carta_igienica}
– ENG-synset {dishcloth}; ITA-synset {canovaccio}; ITA-phraset {strofinaccio_dei_piatti, strofinaccio_da_cucina}

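One possible way to hold these three examples in memory is sketched below. It is not the MultiWordNet schema: the class name, field names and the use of `None` for a gap are assumptions, and the middle value of each example is read here as the Italian synset.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AlignedConcept:
    """One English synset with its Italian synset and phraset (hypothetical layout)."""
    eng_synset: List[str]
    ita_synset: Optional[List[str]]  # None marks a lexical gap (GAP)
    ita_phraset: List[str] = field(default_factory=list)

    def is_gap(self):
        """True when Italian has no lexical unit for the concept."""
        return self.ita_synset is None

    def italian_expressions(self):
        """Lexical units first, then recurrent free phrases."""
        return (self.ita_synset or []) + self.ita_phraset

concepts = [
    AlignedConcept(["cornfield"], None, ["campo_di_grano"]),
    AlignedConcept(["toilet_roll"], None, ["rotolo_di_carta_igienica"]),
    AlignedConcept(["dishcloth"], ["canovaccio"],
                   ["strofinaccio_dei_piatti", "strofinaccio_da_cucina"]),
]

for concept in concepts:
    print(concept.eng_synset, concept.is_gap(), concept.italian_expressions())
```

Keeping the phraset separate from the synset preserves the inclusion criterion: free phrases are recorded without being treated as lexical units.
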
RFPs vs definitions
RFPs are not definitions
• E-synset {tree -- a tall perennial woody plant having a main trunk …}
– I-synset {albero -- ogni pianta perenne con fusto legnoso ramificato} [any perennial plant with a branched woody stem]
– I-phraset {}
• E-synset {paperboy}
– I-synset {GAP -- ragazzo che recapita i giornali} [boy who delivers newspapers]
– I-phraset {ragazzo_dei_giornali}
• E-synset {straphanger}
– I-synset {GAP -- chi viaggia in piedi su mezzi pubblici reggendosi ad un sostegno} [someone who travels standing on public transport, holding on to a support]
– I-phraset {}

Synsets vs Phrasets
• Free combinations of words
– Recurrent Free Phrases: Phrasets
• Synsets
– Restricted collocations
– Named entities
– Idioms
– Simple words

Syntagmatic Relations in WN
• MEANING project: using the “involve” semantic relation to encode deep selectional restrictions
• Can RFPs be encoded through semantic relations?

Encoding “campagna antifumo” - 1
Through phrasets
• Synset: {campagna}; Phraset: {} [campaign]
– hypernym of:
• Synset: {GAP}; Phraset: {campagna_antifumo} [campaign against smoking]

Encoding “campagna antifumo” - 2
Through a semantic relation
• Synset: {campagna} [campaign]
– has_constraint:
• Synset: {antifumo} [against smoking]

Pros and cons of using semantic relations for encoding RFPs
• Smart and concise, but what about
– trigram RFPs?
– synonymous RFPs?
– RFPs that are translation equivalents of lexical units?
– restrictions on word order and word morphology?

Taking the best of both encodings
• Phrasets and lexical syntagmatic relations
– appezzamento (parcel): hypernym of campo (field)
– cereale (cereal): hypernym of frumento, grano (corn)
– GAP -- campo di grano (cornfield): composed-of (campo), composed-of (grano)

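The combined encoding can be pictured as a small labelled graph. The sketch below assumes the diagram reads as follows: "campo" has the hypernym "appezzamento", "grano" has the hypernym "cereale", and the phraset entry "campo di grano" is linked to both words by composed-of. Relation names come from the slide; the triple layout and helper names are hypothetical.

```python
# (source, relation, target) triples. The direction of each link is an
# assumption about how the slide's diagram should be read.
RELATIONS = [
    ("campo", "hypernym", "appezzamento"),
    ("grano", "hypernym", "cereale"),
    ("campo_di_grano", "composed-of", "campo"),
    ("campo_di_grano", "composed-of", "grano"),
]

def related(source, relation):
    """Targets reachable from `source` through one `relation` link."""
    return [t for s, r, t in RELATIONS if s == source and r == relation]

def component_hypernyms(phrase):
    """Hypernyms of the words that a phraset entry is composed of."""
    return [h for word in related(phrase, "composed-of")
              for h in related(word, "hypernym")]

print(related("campo_di_grano", "composed-of"))  # ['campo', 'grano']
print(component_hypernyms("campo_di_grano"))     # ['appezzamento', 'cereale']
```

Because composed-of points at specific words, the phrase keeps its own entry (so trigrams and synonymous RFPs remain representable) while still being connected to the rest of the network.
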
RFPs in Bilingual Dictionaries
• Collins bilingual dictionary (medium size)
• Italian translation equivalents (Bentivogli and Pianta, 2000)
– 92.2% correspond to lexical units
– 7.8% correspond to free combinations of words (lexical gaps)
• Manual check of 300 lexical gaps
– 67% correspond to RFPs
=> More than half of the synsets which are gaps in Italian potentially have an associated phraset

RFPs in corpora
• Correlation between RFPs and frequency?
• Analysis of a 32M-word corpus (Repubblica, 2000-2001)
• Standard n-gram analysis package (NSP)
• All bigrams including at least one stopword excluded
• 118,464 bigrams occurring more than 3 times
• Highest rank: 5,914 occurrences (“New York”)
• Rank 4: 31,453 bigrams
• 497 distinct ranks (frequency classes)

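The filtering steps listed above can be sketched in a few lines. This is a plain-Python stand-in, not the interface of NSP, which is the package actually used; the stopword list, the function name `candidate_bigrams` and the toy sentences are invented for illustration.

```python
from collections import Counter

# Tiny stand-in stopword list; a real run would use a full Italian list.
STOPWORDS = {"di", "il", "la", "a", "in", "e"}

def candidate_bigrams(sentences, min_count=4):
    """Count adjacent word pairs within each sentence.

    Pairs containing a stopword are dropped, and only pairs seen at least
    `min_count` times are kept (the slide keeps bigrams occurring more
    than 3 times).
    """
    counts = Counter(
        (w1, w2)
        for sentence in sentences
        for w1, w2 in zip(sentence, sentence[1:])
        if w1 not in STOPWORDS and w2 not in STOPWORDS
    )
    return {bigram: c for bigram, c in counts.items() if c >= min_count}

# Toy corpus, invented for illustration only.
sentences = (
    [["new", "york"]] * 5
    + [["campo", "di", "grano"]] * 5
    + [["prima", "volta"]] * 2
)
result = candidate_bigrams(sentences)
print(result)
```

In this toy run "prima volta" falls below the frequency cut-off, and both bigrams inside "campo di grano" are discarded because they contain the stopword "di".
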
RFPs in corpora cont.
• Lower ranks are systematically and densely populated
• Higher ranks are sparsely and poorly populated
• Rank groups
– A: 5,914-509 (100 bigrams)
– B: 505-257 (257)
– C: 256-129 (731)
– D: 128-65 (1,965)
– E: 64-33 (4,525)
– F: 32-17 (10,477)
– G: 16-9 (22,167)
– H: 8-5 (46,798)
– I: 4 (31,453)
• Manual check of 100 random bigrams from each rank group

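The grouping can be expressed as a lookup on lower frequency bounds. The bounds below are the ones listed on the slide; the function name `rank_group` is hypothetical, and treating any frequency between 506 and 508 as group B is an assumption, since the slide lists no bigrams with those frequencies.

```python
# Lower frequency bound of each rank group, as listed on the slide,
# ordered from the most frequent group to the least frequent one.
GROUP_LOWER_BOUNDS = [
    ("A", 509), ("B", 257), ("C", 129), ("D", 65), ("E", 33),
    ("F", 17), ("G", 9), ("H", 5), ("I", 4),
]

def rank_group(frequency):
    """Map a bigram frequency to its rank group, or None below the cut-off of 4."""
    for name, lower_bound in GROUP_LOWER_BOUNDS:
        if frequency >= lower_bound:
            return name
    return None

for freq in (5914, 505, 128, 8, 4, 3):
    print(freq, rank_group(freq))
```

From B down to H each group spans one doubling of frequency (for example 129-256, then 65-128), which is why the less frequent groups hold so many more bigrams.
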
RFPs in corpora cont.
Manual check of 100 random bigrams from each rank group

Rank group               A      B    C    D    E   F   G   H   I
Highest frequency        5,914  505  256  128  64  32  16  8   (4)
Lexical units            82     79   74   65   58  55  42  35  28
Recurrent free phrases   14     4    9    14   17  4   15  3   15
Other                    4      17   17   21   25  41  43  58  57

NB: similar results on trigrams

[Figure: correlation between the number of RFPs and frequency in a reference corpus]

Future work
• Better characterization and classification
• Correlation with association measures
• Evaluating RFPs for WSD

Conclusions
• WordNet is poor in syntagmatic information
• We introduced Recurrent Free Phrases, Phrasets, and syntagmatic lexical relations
• RFP: a free combination of words recurrently used to express a concept
• Criteria for their selection
• Bilingual dictionaries contain many RFPs
• Corpora: no clear correlation with frequency
• Useful for:
– lexicographic work
– Word Sense Disambiguation