LIN 3098 Corpus Linguistics Lecture 4 Albert Gatt
In this lecture
- Levels of annotation
- Corpus typology
  - classification based on type and levels of annotation
  - multilingual corpora
LIN 3098 -- Corpus Linguistics
Part 1 Levels of corpus annotation (cont'd)
Levels of linguistic annotation
- part-of-speech (word-level)
- lemmatisation (word-level)
- parsing (phrase & sentence-level)
- semantics (multi-level)
  - semantic relationships between words and phrases
  - semantic features of words
- discourse features (supra-sentence level)
- phonetic transcription
- prosody
Lemmatisation
- Groups morphological variants of a word under the head word:
  - mexa (walk)
    - imxejt (I walked)
    - imxejna (we walked)
    - nimxu (we walk)
    - ...
  - Together, these form a lemma.
- Lemmatised corpora are increasingly common.
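The grouping idea can be sketched in a few lines of Python. This is only an illustration: the lookup table `FORM_TO_LEMMA` is hand-written for the Maltese forms listed on the slide and is not the output of a real lemmatiser, and the function name `group_by_lemma` is made up for this example.

```python
from collections import defaultdict

# Hypothetical, hand-written table mapping word forms to their head word.
# A real lemmatiser would derive this from morphological rules or a lexicon.
FORM_TO_LEMMA = {
    "mexa": "mexa",
    "imxejt": "mexa",
    "imxejna": "mexa",
    "nimxu": "mexa",
}

def group_by_lemma(tokens, table):
    """Collect the word forms seen in a text under their head word."""
    groups = defaultdict(list)
    for token in tokens:
        # Unknown forms are treated as their own lemma.
        lemma = table.get(token.lower(), token.lower())
        groups[lemma].append(token)
    return dict(groups)

tokens = ["imxejt", "nimxu", "imxejna", "imxejt"]
lemma_groups = group_by_lemma(tokens, FORM_TO_LEMMA)
print(lemma_groups)
```

Counting the entries under each head word gives lemma frequencies rather than word-form frequencies, which is what a lemmatised corpus makes possible.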
Lemmatisation example: the SUSANNE corpus
- Format: word + tag + lemma
    A05:0030.33  -  VVDv  said  say
  - A05:0030.33: location, in the form corpus file:sentence.word
  - VVDv: POS tag
  - said: actual word
  - say: head word (lemma)
- Every word in the corpus is on a separate line.
- Extremely useful for lexicography.
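A one-word-per-line format like this is easy to read programmatically. The sketch below assumes whitespace-separated fields in the order shown on the slide; the field names (`reference`, `status`, and so on) are labels chosen for this example, and treating the "-" column as a status field is an assumption, since the slide does not label it.

```python
from typing import NamedTuple

class SusanneToken(NamedTuple):
    reference: str  # corpus file:sentence.word, e.g. A05:0030.33
    status: str     # the "-" column; not labelled on the slide (assumed name)
    tag: str        # POS tag
    word: str       # word form as it appears in the text
    lemma: str      # head word

def parse_line(line):
    """Split one corpus line into its first five whitespace-separated fields."""
    reference, status, tag, word, lemma = line.split()[:5]
    return SusanneToken(reference, status, tag, word, lemma)

token = parse_line("A05:0030.33 - VVDv said say")
print(token.word, "->", token.lemma)
```

A lexicographer could apply `parse_line` to every line of a file and collect all word forms that share a lemma.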
Automatic morphological analysis
- For some languages, there are reasonably good lemmatisers/morphological analysers.
- Examples for English:
  - morpha: built at the University of Sussex
  - EngTwol: commercial, by LingSoft
EngTwol output
- undeniable:
  - "undeniable" <DER:ble> A ABS
    - derived with the -ble suffix (<DER:ble>)
    - adjective (A)
    - absolute (ABS) form
- This is a rule-based analyser. Others use corpus-derived statistical patterns.
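To show how such a reading can be unpacked, here is a minimal sketch that splits the analysis string above into its parts. It assumes the layout seen in this one example (quoted base form, derivation markers in angle brackets, then plain tags); `parse_reading` is a name invented for this illustration and is not part of the analyser itself.

```python
import re

def parse_reading(reading):
    """Split an analysis string into base form, derivation markers and tags."""
    match = re.match(r'"([^"]+)"\s*(.*)', reading.strip())
    if match is None:
        raise ValueError("expected a quoted base form at the start")
    base_form, rest = match.groups()
    # Markers such as <DER:ble> are written in angle brackets.
    derivations = re.findall(r"<([^>]+)>", rest)
    # Whatever remains once the markers are removed is a list of plain tags.
    tags = re.sub(r"<[^>]+>", " ", rest).split()
    return {"base": base_form, "derivations": derivations, "tags": tags}

analysis = parse_reading('"undeniable" <DER:ble> A ABS')
print(analysis)
```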
Semantic annotation I: Two types
- markup of semantic relations (e.g. predicate-argument structure)
  - currently used in parsed corpora, to mark up function-argument structures etc.
- markup of features of word meaning (mainly word senses)
  - has its origins in content analysis, which aims to draw conclusions about how prominent particular concepts are
  - now used in a lot of work on word sense disambiguation
Example of type 1 semantic markup (Penn Treebank)
    (S (NP-SBJ-1 Chris)
       (VP wants
           (S (NP-SBJ *-1)
              (VP to
                  (VP throw
                      (NP the ball))))))
- (NP-SBJ *-1): empty embedded subject, linked to NP subject no. 1
- Predicate-argument structure: wants(Chris, throw(Chris, ball))
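The link between the empty subject and its antecedent can be followed mechanically. The sketch below is a small hand-rolled reader for bracketed trees, not the Treebank's own tooling; it assumes that antecedents carry a numeric suffix on their label (as in NP-SBJ-1) and that traces are written `*-1`. All function names are invented for this example.

```python
import re

TREE = "(S (NP-SBJ-1 Chris) (VP wants (S (NP-SBJ *-1) (VP to (VP throw (NP the ball))))))"

def tokenize(text):
    """Split a bracketed tree into parentheses and symbols."""
    return re.findall(r"\(|\)|[^\s()]+", text)

def parse(tokens):
    """Read one (label child ...) node; children are nodes or plain words."""
    assert tokens.pop(0) == "("
    label = tokens.pop(0)
    children = []
    while tokens[0] != ")":
        children.append(parse(tokens) if tokens[0] == "(" else tokens.pop(0))
    tokens.pop(0)  # consume ")"
    return (label, children)

def leaves(node):
    """Return the words under a node, left to right."""
    words = []
    for child in node[1]:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words

def index_antecedents(node, table):
    """Record the words of every node whose label ends in a numeric index."""
    match = re.fullmatch(r".*-(\d+)", node[0])
    if match:
        table[match.group(1)] = " ".join(leaves(node))
    for child in node[1]:
        if isinstance(child, tuple):
            index_antecedents(child, table)
    return table

def resolve_traces(node, table):
    """Replace each trace *-N with the words of antecedent N."""
    children = []
    for child in node[1]:
        if isinstance(child, tuple):
            children.append(resolve_traces(child, table))
        else:
            match = re.fullmatch(r"\*-(\d+)", child)
            children.append(table[match.group(1)] if match else child)
    return (node[0], children)

tree = parse(tokenize(TREE))
antecedents = index_antecedents(tree, {})
resolved = resolve_traces(tree, antecedents)
print(antecedents)
print(" ".join(leaves(resolved)))
```

Once the trace is filled in, both "wants" and "throw" have "Chris" as subject, which is the information expressed by wants(Chris, throw(Chris, ball)).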
Semantic markup type 2: lexical features
- Most common type:
  - word-sense tagged corpora
- Main idea:
  - disambiguate a word in context by tagging its sense
- Often uses WordNet (Miller et al 1993)
  - WordNet is a lexical taxonomy which represents lexical relations among a large number of words,
    - including hyponymy (IS-A) relations etc.
  - For each entry, all the (supposed) senses of the word are given.
  - Main use: identify senses of words in context, mark them up with a pointer to a WordNet sense.
WordNet senses: Move (noun)
- (377) move -- (the act of deciding to do something; "he didn't make a move to help"; "his first move was to hire a lawyer")
- (70) move, relocation -- (the act of changing your residence or place of business; "they say that three moves equal one fire")
- (57) motion, movement, move, motility -- (a change of position that does not entail a change of location; "the reflex motion of his eyebrows revealed his surprise"; "movement is a sign of life"; "an impatient move of his hand"; "gastrointestinal motility")
- (30) motion, movement, move -- (the act of changing location from one place to another; "police controlled the motion of the crowd"; "the movement of people from the farms to the cities"; "his move put him directly in my path")
- (5) move -- ((game) a player's turn to take some action permitted by the rules of the game)
WordNet senses: Move (verb)
- (130) travel, go, move, locomote -- (change location; move, travel, or proceed; "How fast does your new car go?"; "We travelled from Rome to Naples by bus"; "The policemen went from door to door looking for the suspect"; "The soldiers moved towards the city in an attempt to take it before night fell")
- (60) move, displace -- (cause to move, both in a concrete and in an abstract sense; "Move those boxes into the corner, please"; "I'm moving my money to another bank"; "The director moved more responsibilities onto his new assistant")
- (52) move -- (move so as to change position, perform a nontranslational motion; "He moved his hand slightly to the right")
- (20) move -- (change residence, affiliation, or place of employment; "We moved from Idaho to Nebraska"; "The basketball player moved from one team to another")
Check it out!
- WordNet is freely available for download:
  - http://wordnet.princeton.edu/
Word sense annotation: other uses
- tagging words with their semantic field (Wilson 1996)
  - plant life
  - men’s clothing
  - …
- tagging words with their “emotional” content (Campbell & Pennebaker 2002), based on a dictionary:
  - social processes
  - negative emotions
- This approach underlies Pennebaker’s Linguistic Inquiry and Word Count (LIWC) system, which
  - analyses a text and comes up with a profile of its personal/emotional content
  - relates this to some features of its author (gender, age…)
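Dictionary-based tagging of this kind can be sketched as a simple lookup-and-count. The category names below come from the slide, but the word lists in `CATEGORY_LEXICON` are invented for illustration and are not taken from the LIWC dictionary; the function name `profile` is likewise hypothetical.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary: category -> words that signal it.
CATEGORY_LEXICON = {
    "social processes": {"talk", "friend", "we", "share"},
    "negative emotions": {"sad", "angry", "hate", "afraid"},
}

def profile(text, lexicon):
    """Return, per category, the share of words in the text that belong to it."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for word in words:
        for category, vocabulary in lexicon.items():
            if word in vocabulary:
                counts[category] += 1
    total = len(words)
    return {cat: (counts[cat] / total if total else 0.0) for cat in lexicon}

result = profile("We talk a lot, but I hate feeling sad.", CATEGORY_LEXICON)
print(result)
```

The resulting proportions form a crude "profile" of the text; relating such profiles to author features is a separate statistical step that this sketch does not attempt.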
Discourse annotation
- Most common:
  - text-level units such as paragraphs
- Less common:
  - anaphoric NPs and reference (cf. example from lecture 3)
- Even less common:
  - annotation of words which function as discourse cues (Stenstrom 1984):
    - apology (sorry), hedges (sort of), etc.
  - annotation of rhetorical structure
Discourse: Annotating rhetorical structure (I)
- Rhetorical Structure Theory (Mann and Thompson 1988):
  - views text as made up of “discourse units”
  - units stand in various rhetorical relations, which reflect their role in constructing an argument, a narrative, etc.
- CONCESSION/CONTRAST relation:
  - [Although Mr. Freeman is retiring,] [he will continue to work as a consultant for American Express on a project basis].
  - Second unit is the main one (nucleus)
  - First unit (satellite) “concedes” that what the main unit is saying is contradicted by another fact.
- A recent corpus (Marcu et al 2003) is annotated with this information.
Phonetic transcription
- Not many phonetically transcribed corpora exist.
  - The MARSEC corpus is one of the best known. It is a version of the Lancaster/IBM Spoken English Corpus.
  - There are, however, several databases of transcribed speech, mostly used for statistical speech technology applications (e.g. text-to-speech synthesis).
Annotating suprasegmentals
- Aim: capture suprasegmental features such as stress, intonation and pauses in speech.
- Some transcription systems exist:
  - ToBI (American)
  - Tonic Stress Marker (TSM; British)
  - These define ways of annotating suprasegmentals such as the start/end of a tone group, simultaneous speech, rise-fall tone, falling tone, etc.
Problem-oriented tagging
- If you’re interested in a particular problem, and no corpus exists, build your own!
- Many corpora define problem-specific annotation schemes.
Example: the TUNA Corpus
- Problem: How do people refer to objects using definite NPs?
  - Main interest: visual properties (colour, size etc.)
  - Focus: semantics of definite NPs, i.e. what people choose to include in their description.
- Method:
  - experiment to get people to describe objects, distinguishing them from other objects in the same visual “scene”
  - annotation of descriptions based on semantics
TUNA Corpus: description
    <DESCRIPTION NUM="SINGULAR">
      <ATTRIBUTE NAME="colour" VALUE="red"> red </ATTRIBUTE>
      <ATTRIBUTE NAME="type" VALUE="sofa"> sofa </ATTRIBUTE>
      <ATTRIBUTE NAME="size" VALUE="large"> bigger version </ATTRIBUTE>
    </DESCRIPTION>
Text of the description: Red sofa, bigger version.
Features of the corpus:
1. represents the “target” referent
2. also represents the “distractors” (from which the target must be distinguished)
3. semantically transparent: annotation goes beyond language
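Because the annotation is XML, the semantic content of a description can be pulled out with a standard parser. The sketch below uses Python's built-in `xml.etree.ElementTree` on the description shown above; the variable names are chosen for this example only.

```python
import xml.etree.ElementTree as ET

TUNA_DESCRIPTION = """
<DESCRIPTION NUM="SINGULAR">
  <ATTRIBUTE NAME="colour" VALUE="red"> red </ATTRIBUTE>
  <ATTRIBUTE NAME="type" VALUE="sofa"> sofa </ATTRIBUTE>
  <ATTRIBUTE NAME="size" VALUE="large"> bigger version </ATTRIBUTE>
</DESCRIPTION>
"""

root = ET.fromstring(TUNA_DESCRIPTION.strip())

# Semantic layer: which attribute was expressed, and with what value.
semantics = {a.get("NAME"): a.get("VALUE") for a in root.iter("ATTRIBUTE")}

# Surface layer: the words the speaker actually used for each attribute.
wording = {a.get("NAME"): a.text.strip() for a in root.iter("ATTRIBUTE")}

print(semantics)
print(wording)
```

Comparing the two dictionaries shows what "annotation goes beyond language" means in practice: the wording "bigger version" is mapped to the value "large" of the attribute "size".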
Part 2 Multilingual corpora
Why multilingual corpora?
- comparative studies
  - syntax
  - morphology
  - …
- the cornerstone of most research in machine translation (MT) nowadays
  - most MT systems are statistical, trained on large repositories of parallel (e.g. English-Chinese) text.
Parallel corpora
- Represent a text in its original language (L1), with a translation in another language (L2)
  - long history: Medieval polyglot bibles were among the first “parallel” corpora
- Alignment:
  - Many parallel corpora align L1 and L2 at sentence level, sometimes also at word level…
  - Sentence-level alignment can be achieved automatically with very high accuracy!
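To give a feel for how automatic sentence alignment can work, here is a deliberately simplified sketch based on sentence length: a sentence and its translation tend to have similar lengths, so a dynamic program can search for the pairing with the smallest total length mismatch. This is a toy heuristic, not the method used for any particular corpus; the function name, the allowed pairings (1-1, 1-0, 0-1, 2-1, 1-2) and the penalty values are all assumptions made for this illustration.

```python
def align_by_length(src, tgt, skip_penalty=50, merge_penalty=10):
    """Align two sentence lists, preferring pairings of similar character length."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    # (source sentences consumed, target sentences consumed) per step
    moves = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in moves:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in src[i:ni])
                tgt_len = sum(len(t) for t in tgt[j:nj])
                step = abs(src_len - tgt_len)
                if di == 0 or dj == 0:
                    step += skip_penalty   # sentence left without a partner
                elif di + dj == 3:
                    step += merge_penalty  # two sentences mapped onto one
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    # Walk back from the end to recover the chosen pairings.
    alignment = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        alignment.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return alignment[::-1]

l1 = ["Short one.", "Another bit.", "The end."]
l2 = ["Short one and another bit.", "The end!"]
pairs = align_by_length(l1, l2)
for source_side, target_side in pairs:
    print(source_side, "<->", target_side)
```

Practical aligners refine this idea with a probabilistic model of length ratios and, often, lexical cues; the sketch only shows the overall shape of the search.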
Example: SMULTRON corpus
- Developed and released in 2007-8
- Relatively small
- Aligned texts in English, Swedish and German
  - E.g. Sophie’s World is one of the texts
- Annotated with syntax, POS, morphology
- Comes with a tool to view parallel syntactic trees.
SMULTRON example: English (Sophie’s World)
    <s id="s3">
      <terminals>
        <t id="s3_1" word="Sophie" pos="NNP" morph="--"/>
        <t id="s3_2" word="Amundsen" pos="NNP" morph="--"/>
        <t id="s3_3" word="was" pos="VBD" morph="--"/>
        <t id="s3_4" word="on" pos="IN" morph="--"/>
        <t id="s3_5" word="her" pos="PRP$" morph="--"/>
        <t id="s3_6" word="way" pos="NN" morph="--"/>
        <t id="s3_7" word="home" pos="RB" morph="--"/>
        <t id="s3_8" word="from" pos="IN" morph="--"/>
        <t id="s3_9" word="school" pos="NN" morph="--"/>
        <t id="s3_10" word="." pos="." morph="--"/>
      </terminals>
    </s>
This shows terminal nodes only. The corpus also represents syntactic non-terminals (NP, VP etc.).
SMULTRON: Same sentence in German
    <s id="s3">
      <terminals>
        <t id="s3_1" word="Sofie" pos="NE" morph="FEM" lemma="Sofie"/>
        <t id="s3_2" word="Amundsen" pos="NE" morph="--" lemma="Amundsen"/>
        <t id="s3_3" word="war" pos="VAFIN" morph="--" lemma="sein"/>
        <t id="s3_4" word="auf" pos="APPR" morph="--" lemma="auf"/>
        <t id="s3_5" word="dem" pos="ART" morph="--" lemma="der"/>
        <t id="s3_6" word="Heimweg" pos="NN" morph="MASK" lemma="Heimweg"/>
        <t id="s3_7" word="von" pos="APPR" morph="--" lemma="von"/>
        <t id="s3_8" word="der" pos="ART" morph="--" lemma="die"/>
        <t id="s3_9" word="Schule" pos="NN" morph="FEM" lemma="Schul~e"/>
        <t id="s3_10" word="." pos="$." morph="--" lemma="--"/>
      </terminals>
    </s>
Note: richer morphology, representation of lemmas, …
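Terminal nodes in this format can be read with a standard XML parser. The sketch below uses Python's built-in `xml.etree.ElementTree` on a two-token excerpt of each sentence; the function name `read_terminals` is invented for this example, and the excerpts cover only the attributes shown on the slides.

```python
import xml.etree.ElementTree as ET

ENGLISH = """<s id="s3"><terminals>
<t id="s3_3" word="was" pos="VBD" morph="--"/>
<t id="s3_9" word="school" pos="NN" morph="--"/>
</terminals></s>"""

GERMAN = """<s id="s3"><terminals>
<t id="s3_3" word="war" pos="VAFIN" morph="--" lemma="sein"/>
<t id="s3_9" word="Schule" pos="NN" morph="FEM" lemma="Schul~e"/>
</terminals></s>"""

def read_terminals(xml_text):
    """Return one dict per <t> element; attributes that are absent come back as None."""
    root = ET.fromstring(xml_text)
    return [
        {
            "id": t.get("id"),
            "word": t.get("word"),
            "pos": t.get("pos"),
            "morph": t.get("morph"),
            "lemma": t.get("lemma"),
        }
        for t in root.iter("t")
    ]

english_tokens = read_terminals(ENGLISH)
german_tokens = read_terminals(GERMAN)
for token in german_tokens:
    print(token["word"], token["pos"], token["morph"], token["lemma"])
```

Reading both languages with the same function makes the difference in annotation depth visible: the German tokens carry gender and lemma information, while the English excerpt has no lemma attribute at all.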
Translation corpora
- Not parallel.
- Have different texts in two or more different languages, of the same genre.
- Examples:
  - PAROLE corpus is a translation corpus for EU languages
Why translation corpora?
- Parallel corpora, by definition, contain translated text (L2)
  - can give rise to errors
  - artificiality and translation quality can be an issue
    - e.g. McEnery & Wilson report a study on an English-Polish corpus. The Polish text reads “like a translation”
- Problem can be overcome if the texts used are professionally translated.
- Translation corpora have texts in two or more languages, “in the original”.
  - Data is more natural.
Summary
- We have now concluded our initial incursion into:
  - corpus construction
  - corpus annotation
  - corpus typology
- Next up:
  - using corpora for linguistic research