Annotating language data
Tomaž Erjavec
Institut für Informationsverarbeitung, Geisteswissenschaftliche Fakultät, Karl-Franzens-Universität Graz
Lecture 4: Lexical Semantics
24. 11. 2006
Overview
1. Word senses
2. Word sense disambiguation
3. Semantic lexica
Word Senses
- Lexical semantics is the study of what the words of a language denote and how they do so; it is concerned with the meaning of each individual word.
- A word sense is one of the meanings of a word.
- A word is called ambiguous if it can be interpreted in more than one way, i.e., if it has multiple senses.
- Disambiguation determines the specific sense of an ambiguous word.
Homonymy and Polysemy
- A homonym is a word with multiple, unrelated meanings: it is spelled and pronounced the same as another word but has a different meaning.
  bank → financial institution
       → slope of land alongside a river
- A polyseme is a word with multiple, related meanings.
  school → I go to school every day. (institution)
         → The school has a blue facade. (building)
         → The school is on strike. (teachers)
- Regular polysemy: one word sense is derived from another in a systematic way that recurs across words, e.g. school / office.
Human Beings and Ambiguity
- What seems perfectly obvious to a human being is deeply ambiguous to the computer, and there is no easy way of resolving the ambiguity.
  - I paid the money into my bank account.
  - I watched the ducks on the river bank.
- Semantic priming (psycholinguistics): the response time for a word is reduced when it is presented with a semantically related word.
  doctor → nurse / butter (related: nurse; unrelated: butter)
- If an ambiguous prime such as bank is given, it turns out that all of its word senses are primed.
  bank → money / river
Disambiguation Cues
- Probability and prototypicality → default interpretation: corpus-related importance of word senses
- Internal text evidence: context, in particular collocations
- One sense per discourse
- Domain
- Real-world knowledge
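The first cue, a default interpretation derived from how often each sense occurs in a corpus, can be sketched as a most-frequent-sense baseline. This is only a minimal illustration: the sense labels and the tagged occurrences below are invented and are not taken from any real sense-tagged corpus.

```python
from collections import Counter

# Hypothetical sense-tagged occurrences; labels and frequencies are
# made up for illustration only.
TAGGED_OCCURRENCES = [
    ("bank", "financial_institution"),
    ("bank", "financial_institution"),
    ("bank", "river_slope"),
    ("bank", "financial_institution"),
    ("school", "institution"),
    ("school", "building"),
    ("school", "institution"),
]


def default_senses(tagged):
    """Map each word to the sense it was tagged with most often."""
    counts = {}
    for word, sense in tagged:
        counts.setdefault(word, Counter())[sense] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


DEFAULTS = default_senses(TAGGED_OCCURRENCES)
print(DEFAULTS["bank"])
```

Such a baseline ignores the context entirely, which is why the remaining cues (collocations, discourse, domain) are needed to override the default.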
Word Sense Disambiguation (WSD)
- WSD: associating a word in a text with a meaning (sense) that can be distinguished from the other meanings the word potentially has.
- Intermediate task: not an end in itself, but (arguably) necessary in most NLP tasks, such as machine translation, information retrieval and speech processing.
- Problems:
  1. Which are the senses?
  2. Which is the correct sense?
- Sources of information:
  1. Context of the word to be disambiguated (local, global)
  2. External knowledge sources (e.g. dictionary definitions)
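One simple way of combining the two sources of information, context and dictionary definitions, is to pick the sense whose definition shares the most words with the context (often called the simplified Lesk method). The sketch below is a toy illustration, not a method prescribed by the lecture: the sense labels, the glosses and the stop-word list are invented assumptions.

```python
# Toy sense inventory: glosses are paraphrased for illustration and are
# not copied from any particular dictionary.
SENSES = {
    "bank": {
        "financial_institution": (
            "an institution that accepts money deposits and keeps "
            "an account for each customer"
        ),
        "river_slope": "sloping land alongside a river or lake",
    }
}

# Hypothetical minimal stop-word list.
STOPWORDS = {"a", "an", "the", "i", "my", "on", "of", "or", "and",
             "that", "for", "each"}


def tokens(text):
    """Lower-case, strip punctuation, and drop stop words."""
    words = (w.strip(".,;:!?").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}


def simplified_lesk(word, sentence):
    """Return the sense whose gloss overlaps most with the context."""
    context = tokens(sentence) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context & tokens(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(simplified_lesk("bank", "I paid the money into my bank account."))
print(simplified_lesk("bank", "I watched the ducks on the river bank."))
```

When no gloss overlaps with the context, this sketch simply returns the first sense listed; a real system would fall back on a frequency-based default instead.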
Sense Inventory
- Word Sense Disambiguation needs a set of word senses to disambiguate between.
  - Word Sense Discrimination doesn’t.
- Sense inventories are found in dictionaries, thesauri or similar resources.
- The granularity and criteria for the set of senses differ (lumpers vs. splitters).
- There is no reason to expect a single set of word senses to be appropriate for different NLP applications.
Lexical Semantic Resources
- Sense inventory and organisation:
  - WordNet
- Sense annotation and semantic role annotation:
  - Prague Dependency Treebank
  - FrameNet
  - PropBank
  - OntoBank / OntoNotes
WordNet
- Online lexical reference system, freely available, also for download.
- The design is inspired by current psycholinguistic theories of human lexical memory.
- English nouns, verbs, adjectives and adverbs are organised into synonym sets (synsets). Each synset represents one underlying lexical concept.
- Different (paradigmatic) relations link the synonym sets.
- WordNet was developed by the Cognitive Science Laboratory at Princeton University under the direction of George A. Miller.
- WordNets now exist for many languages.
WordNet Synsets
- Synsets are sets of synonymous words (“literals”).
- Polysemous words appear in multiple synsets.
- Examples:
  - noun example:
    {coffee, java}
    {coffee, coffee tree}
    {coffee bean, coffee berry, coffee}
    {chocolate, coffee, deep brown, umber, burnt umber}
  - adjective example:
    {cold}
    {aloof, cold}
    {cold, dry, uncordial}
    {cold, unaffectionate, uncaring}
    {cold, old}
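The statement that polysemous words appear in multiple synsets can be made concrete with a small lookup table. The sketch below uses a few of the synsets listed on this slide; the identifiers (n1, a1, ...) are hypothetical and do not correspond to real WordNet identifiers, and the code does not use any WordNet library.

```python
from collections import defaultdict

# A few synsets from the slide; the keys are invented identifiers.
SYNSETS = {
    "n1": {"coffee", "java"},
    "n2": {"coffee", "coffee tree"},
    "n3": {"coffee bean", "coffee berry", "coffee"},
    "a1": {"cold"},
    "a2": {"aloof", "cold"},
    "a3": {"cold", "dry", "uncordial"},
}


def build_index(synsets):
    """Invert the table: map each literal to the synsets containing it."""
    index = defaultdict(set)
    for synset_id, literals in synsets.items():
        for literal in literals:
            index[literal].add(synset_id)
    return dict(index)


INDEX = build_index(SYNSETS)


def is_polysemous(word):
    """A word is polysemous here if it occurs in more than one synset."""
    return len(INDEX.get(word, ())) > 1


def synonyms(word):
    """Collect all other literals that share a synset with the word."""
    found = set()
    for synset_id in INDEX.get(word, ()):
        found |= SYNSETS[synset_id]
    return found - {word}


print(sorted(INDEX["coffee"]))
print(sorted(synonyms("cold")))
```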
More about synsets
Synsets also include:
- glosses (definitions)
- examples of usage, e.g. (n) glass (glassware collectively) "She collected old glass"
- recently added by ITC, Italy: semantic domains e.g.
WordNet Relations
- Within synsets:
  - Synonymy, such as {coffee, java}
- Between synsets / parts of synsets:
  - Antonymy: opposition, e.g. {cold} - {hot}
  - Hypernymy / Hyponymy: is-a relation, e.g. {coffee, java} - {beverage, drink, potable}
  - Meronymy / Holonymy: part-of relation, e.g. {coffee bean, coffee berry, coffee} - {coffee, coffee tree}
- Morphology:
  - Derivations: appealing - appealingness
WordNet Hierarchy
- Depending on the part of speech, different relations are defined for a word. For example, the core relation for nouns is hypernymy, while the core relation for adjectives is antonymy.
- Hypernymy imposes a hierarchical structure on the synsets. The most general synsets in the hierarchy consist of a number of pre-defined, disjoint top-level synsets:
  - nouns → {entity}, {abstraction}, {psychological feature}, …
  - verbs → {move}, {change}, {get}, {feel}, …
WordNet Hierarchy: Examples
{entity}
|
{object, inanimate object, physical object}
|
{substance, matter}
|
{food, nutrient}
|
{beverage, drink, potable}
|
{coffee, java}

{abstraction}
|
{attribute}
|
{property}
|
{visual property}
|
{color, coloring}
|
{brown, brownness}
|
{chocolate, coffee, deep brown, umber, burnt umber}
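The two chains on this slide can be read as a child-to-parent table, and walking that table upwards gives the hypernym path of a synset. The following is a minimal sketch under the simplifying assumption that every synset has exactly one hypernym (real WordNet allows several); the string labels are just the synsets as printed on the slide.

```python
# Child -> hypernym, copied from the two chains on the slide.
HYPERNYM = {
    "{coffee, java}": "{beverage, drink, potable}",
    "{beverage, drink, potable}": "{food, nutrient}",
    "{food, nutrient}": "{substance, matter}",
    "{substance, matter}": "{object, inanimate object, physical object}",
    "{object, inanimate object, physical object}": "{entity}",
    "{chocolate, coffee, deep brown, umber, burnt umber}": "{brown, brownness}",
    "{brown, brownness}": "{color, coloring}",
    "{color, coloring}": "{visual property}",
    "{visual property}": "{property}",
    "{property}": "{attribute}",
    "{attribute}": "{abstraction}",
}

COFFEE_DRINK = "{coffee, java}"
COFFEE_COLOUR = "{chocolate, coffee, deep brown, umber, burnt umber}"


def hypernym_path(synset):
    """Return the synset followed by all its ancestors up to the root."""
    path = [synset]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def is_a(synset, ancestor):
    """True if 'ancestor' lies strictly above 'synset' in the hierarchy."""
    return ancestor in hypernym_path(synset)[1:]


print(" < ".join(hypernym_path(COFFEE_DRINK)))
print(is_a(COFFEE_COLOUR, "{entity}"))
```

The two senses of coffee end in different top-level synsets, so in this fragment they share no common ancestor.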
WordNet Family
- Current status: WordNets for 38 languages
- WordNets in the world: http://www.globalwordnet.org/gwa/wordnet_table.htm
- Integration of WordNets into multilingual resources:
  - EuroWordNet: English, Dutch, Italian, Spanish, German, French, Czech and Estonian
  - BalkaNet: Bulgarian, Czech, Greek, Romanian, Turkish, Serbian
- An inter-lingual index connects the synsets of the WordNets ~ multilingual lexicon; machine translation
WordNet annotated corpora
- SemCor: created at Princeton University, a subset of the Brown corpus (700,000 words); 200,000 content words are WordNet sense-tagged
- MultiSemCor: created at ITC, Italy; consists of SemCor + its translation into Italian, which is also sense-tagged, http://multisemcor.itc.it/
- DSO Corpus of Sense-Tagged English (National University of Singapore)
- etc.
Thematic roles
A thematic role is the semantic relationship between a predicate (e.g. a verb) and an argument (e.g. a noun phrase) of a sentence.
- Agent: animate, volitional; initiates action
- Patient: animate or inanimate; undergoes (and is affected by) action
- Experiencer: animate; undergoes perceptual experience
- Theme: animate or inanimate; undergoes motion, or an action that does not affect it significantly
Examples:
  Anna prepared chicken for dinner.
  Anna baked a cake for her daughter.
  The storm frightened Anna.
  Anna sent Tim a letter.
Thematic roles (2)
- Recipient: generally animate; receives something
- Benefactive: generally animate; one who benefits from the event
- Goal: animate or inanimate; endpoint of the action
- Location: place where the event occurs
- Source: animate or inanimate; starting point of an action
- Instrument: often inanimate; used in an action
Examples:
  Tim kicked the ball to Bob.
  Anna baked a cake for her daughter.
  Anna put the book on the table.
  Anna and Tim met in Paris.
  Anna and Tim came from Berlin.
  Tim smashed the window with a hammer.
Prague Dependency Treebank
- Three-level annotation scenario:
  - 1. morphological level
  - 2. syntactic annotation at the analytical level
  - 3. linguistic meaning at the tectogrammatical level
- Corpus data: newspaper articles (60%), economic news and analyses (20%), popular science magazines (20%)
- 1 million tokens are annotated on the tectogrammatical level.
Tectogrammatical Level of the PDT
- Annotation: dependency, functor, ellipsis resolution, coreference, …
- 39 attributes
- Similar to the surface (analytical) level, but:
  - certain nodes deleted (auxiliaries, non-autosemantic words, punctuation)
  - some nodes added (based on word valency, mostly of verbs and nouns)
  - some ellipsis resolution
  - detailed dependency relation labels: functors
Tectogrammatical Functors (~ thematic roles)
- General functors, e.g.: actor/bearer, addressee, patient, origin, effect, cause, regard, concession, aim, manner, extent, substitution, accompaniment, locative, means, temporal, attitude, directional, benefactive, comparison
- Specific functors for dependents on nouns, e.g.: material, appurtenance, restrictive, descriptive, identity
- Subtle differentiation of syntactic relations, e.g.: temporal (before, after, on), accompaniment, regard, benefactive (for/against)
Tectogrammatical Example
- Example: dal mu knihu "(he) gave him a book"
- The “Obj” goes into ACT, PAT, ADDR, EFF or ORIG, depending on the governor’s valency frame.
Analytical vs. Tectogrammatical Level
FrameNet
- Frame-semantic descriptions for English verbs, nouns, and adjectives
- Aim: document the range of semantic and syntactic combinatory possibilities (valences) of each word in each of its senses
- Result: lexical database with
  - descriptions of the semantic frames
  - a representation of the valences for target words
  - a collection of annotated corpus attestations
- Current size: more than 6,100 lexical units annotated in more than 625 semantic frames, exemplified in more than 135,000 sentences
FrameNet Vocabulary
- Frame semantics, developed by Charles Fillmore:
  - a theory that relates linguistic semantics to encyclopaedic knowledge
  - describes the meaning of a word (sense) by characterising the essential background knowledge that is necessary to understand the word/sentence
- Frame: conceptual structure modelling a prototypical situation
- Frame-evoking element: word or expression that evokes the frame
- Frame elements (roles): participants and properties of the situation
FrameNet Example
- Frame: Transportation
  - Frame elements: mover, means, path
  - Scene: mover moves along path by means
- Frame: Driving
  - Inherit: Transportation
  - Frame elements: driver=mover, rider=mover, cargo=mover, vehicle=means
  - Scenes: driver starts vehicle, driver controls vehicle, driver stops vehicle
- Annotated corpus sentence:
  Now [D Tim] was driving [R his guest] [P to the station].
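The inheritance between the two frames and the bracketed annotation can be sketched in a few lines of code. This is an illustrative data structure only, not the FrameNet database format: the class, its field names and the reading of the labels D, R and P as driver, rider and path are assumptions made for this sketch.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Frame:
    name: str
    elements: list
    parent: Optional["Frame"] = None
    # Maps an element of this frame to the parent element it specialises.
    inherits_as: dict = field(default_factory=dict)

    def resolve(self, element):
        """Follow the mapping upwards to the most general element name."""
        frame = self
        while frame.parent is not None and element in frame.inherits_as:
            element = frame.inherits_as[element]
            frame = frame.parent
        return element


TRANSPORTATION = Frame("Transportation", ["mover", "means", "path"])
DRIVING = Frame(
    "Driving",
    ["driver", "rider", "cargo", "vehicle"],
    parent=TRANSPORTATION,
    inherits_as={"driver": "mover", "rider": "mover",
                 "cargo": "mover", "vehicle": "means"},
)

# Assumed expansion of the one-letter labels used in the sentence.
LABELS = {"D": "driver", "R": "rider", "P": "path"}
SENTENCE = "Now [D Tim] was driving [R his guest] [P to the station]."


def parse_annotation(text):
    """Return (label, span) pairs for every [LABEL span] bracket."""
    return re.findall(r"\[(\w+) ([^\]]+)\]", text)


for label, span in parse_annotation(SENTENCE):
    role = LABELS[label]
    print(f"{span!r}: {role} (Transportation: {DRIVING.resolve(role)})")
```

Elements without an entry in the mapping, such as path, are treated as inherited unchanged from the parent frame.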
FrameNet Languages
- English FrameNet: Berkeley
- German FrameNet: SALSA, Saarbrücken
- Spanish FrameNet: Barcelona
- Japanese FrameNet: Keio, Yokohama & Tokyo
- Issue: cross-lingual transfer of English FrameNet
German FrameNet: SALSA
- Annotation of the TIGER treebank with semantic roles
- Existing manual syntactic annotation of newspaper data: grammatical functions, syntactic categories, argument structure of syntactic heads
- Annotation procedure: all frame-evoking elements are annotated with their frames and roles → corpus-based. (In comparison: the English FrameNet annotates a selected set of prototypical examples for each frame → frame-based.)
- Current size: 476 German predicates with 18,500 instances and 628 different frames
TIGER/SALSA Example
Conclusions
- Introduced lexical semantics: word senses, word-sense disambiguation
- It is an open issue to what extent (and with how fine-grained senses) WSD is beneficial to (which) applications
- Some resources: WordNet, PDT, FrameNet
- Other semantic lexica and semantically annotated corpora exist: PropBank, OntoNotes, …