Treebanks and MWEs (Part 1)
Jan Hajič, Pavel Straňák, Jiří Mírovský
Institute of Formal and Applied Linguistics & LINDAT/CLARIN
School of Computer Science, Faculty of Mathematics and Physics
Charles University in Prague, Czech Republic
Outline
• Treebanks
  – Phrase- (constituency-) based: The Penn Treebank
  – Dependency: The Prague Dependency Treebanks
• The Penn Treebank (basics)
• The Prague Dependency Treebank
  – Layers of Annotation
    • Morphology
    • Syntax
    • Semantics
  – Valency
19. 1. 2015 PARSEME Training School Prague
THE PENN TREEBANK
Phrase- vs. Dependency-Based Treebanks
• The original: The Penn Treebank
  – Phrase-based style; good for parsing by CFG grammars
• Followers
  – Almost all Penn-based treebanks
    • Chinese, Arabic, Korean, …
  – Negra (German), many others
• Now: dependency parsing prevails
• Conversion from phrase-based treebanks
  – Might lose information, heads added "ad hoc"
• "Native" dependency treebanks: annotated as such
  – Considered "better"
  – Hindi/Urdu, TIGER (sort of); both styles manually annotated
  – PDT (of course) and similar ones
    » PDT-style treebanks: Danish, Croatian, Slovene, Greek, Latin
The Penn Treebank
• Published (first) in 1993, now LDC99T42 (www.ldc.upenn.edu)
  – First the Wall Street Journal part (1 mil. words, 2312 documents)
• Added other text types
  – ATIS corpus (dialogs, travel reservations)
  – Brown corpus annotated for syntax
  – Switchboard (spoken language, tel. conversations)
Penn Treebank Format
Sentence: Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.

( (S
    (NP-SBJ
      (NP (NNP Pierre) (NNP Vinken) )
      (, ,)
      (ADJP
        (NP (CD 61) (NNS years) )
        (JJ old) )
      (, ,) )
    (VP (MD will)
      (VP (VB join)
        (NP (DT the) (NN board) )
        (PP-CLR (IN as)
          (NP (DT a) (JJ nonexecutive) (NN director) ))
        (NP-TMP (NNP Nov.) (CD 29) )))
    (. .) ))

• "Preterminal" POS tag: e.g. NNS (noun, plural)
• Noun phrase label: NP
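The bracketed format is a plain nested-parenthesis notation, so it can be read with a few lines of code. The following is a minimal sketch in Python, not the official Penn Treebank tooling. The function names `tokenize`, `parse` and `tagged_words` are made up for this illustration, and the reader assumes well-formed input in which brackets inside the text are already escaped (as in the treebank itself).

```python
from typing import List, Tuple, Union

# A tree is either a word (str) or a list [label, child, child, ...].
Tree = Union[str, list]


def tokenize(text: str) -> List[str]:
    # Pad parentheses with spaces so that split() isolates them.
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens: List[str], pos: int = 0) -> Tuple[Tree, int]:
    """Parse one bracketed node starting at tokens[pos] == '('."""
    assert tokens[pos] == "("
    pos += 1
    node: list = []
    # The outermost bracket of a sentence carries no label;
    # such a node gets the empty string as its label.
    if tokens[pos] == "(":
        node.append("")
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
            node.append(child)
        else:
            node.append(tokens[pos])
            pos += 1
    return node, pos + 1


def tagged_words(tree: Tree) -> List[Tuple[str, str]]:
    """Collect (word, POS) pairs from preterminal nodes [POS, word]."""
    if len(tree) == 2 and isinstance(tree[1], str):
        return [(tree[1], tree[0])]
    pairs: List[Tuple[str, str]] = []
    for child in tree[1:]:
        if isinstance(child, list):
            pairs.extend(tagged_words(child))
    return pairs


example = "( (S (NP-SBJ (NNP Pierre) (NNP Vinken) ) (VP (MD will) (VP (VB join) (NP (DT the) (NN board) ))) (. .) ))"
tree, _ = parse(tokenize(example))
for word, pos_tag in tagged_words(tree):
    print(word, pos_tag)
```

Keeping the tree as nested lists makes the phrase labels (first element of each list) and the preterminal POS tags directly accessible without any extra classes.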
The Penn Treebank(s)
• Extensions
  – Annotation of named entities, co-reference (BBN)
  – Function labels (SBJ, OBJ, TMP, ...); cf. also previous slides
  – PropBank
    • Penn Treebank syntax + predicate-argument relations, added "frame files" (predicate dictionary)
      (S (NP-SBJ (PRP I)) (VP (VBD gave) (NP-DOBJ (PRP him)) (NP-IOBJ (DET the) (NN book))) ...)
      Roles: I = Arg0, gave = Pred, him = Arg1, book = Arg2
  – NomBank
    • Like PropBank, but for nouns and their "arguments"
• Other languages (Chinese, Arabic, ...)
THE PRAGUE DEPENDENCY TREEBANK
The Prague Dependency Treebanks: the Basics
• Original treebank: PDT 1.0, 2001 (morph., dep. syntax)
• First full release: PDT 2.0
  – http://ufal.mff.cuni.cz/pdt2.0
    • LDC2006T01, see http://www.ldc.upenn.edu
  – Now: PDT 3.0: http://ufal.mff.cuni.cz/pdt3.0
• Basic general features
  – Multilayered annotation, interlinked layers
  – Dependency-based syntax (both surface and deep)
  – Information structure of the sentence (topic/focus)
  – Grammatical and basic textual coreference
  – New: discourse relations, MWEs
• Languages: Czech, English (also parallel), Arabic
  – Student work on "samples": Indonesian, Urdu, Russian, …
  – Spoken: work started on Czech and English (non-parallel, dialogs)
The Prague Dependency Treebank
• Three basic layers of annotation
  – Morphemic layer
  – Surface syntax ("analytical") layer
  – "Tectogrammatical" layer: underlying syntax, semantic roles (valency), inf. structure, coreference
• Size
  – 830,000 words (tokens) = 50,000 sentences in 3,165 full documents (texts)
• Format
  – Prague Markup Language (XML-based)
  – Now also: .treex format
    • For smooth use in the Treex platform
    • http://ufal.mff.cuni.cz/treex
PDT (Czech) Data
• 4 sources:
  – Lidové noviny (daily newspaper, incl. extra sections)
  – DNES (Mladá fronta Dnes) (daily newspaper)
  – Vesmír (popular science magazine, monthly)
  – Českomoravský Profit (economic journal, weekly)
• Full articles selected
  – article ~ DOCUMENT (basic corpus unit)
• Time period: 1990-1995
• 1.8 million tokens (~110,000 sentences total)
PDT Annotation Layers
• L0 (w) Words (tokens)
  – automatic segmentation and markup only
• L1 (m) Morphology
  – Tag (full morphology, 13 categories), lemma
• L2 (a) Analytical layer (surface syntax)
  – Dependency, analytical dependency function
• L3 (t) Tectogrammatical layer ("deep" syntax)
  – Dependency, "functor", grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon
  – PDT 3.0: mass, clauses, formemes, discourse, ...
• Releases marked on the slide: PDT 1.0 (2001), PDT 2.0 (2006)
Morphological Attributes
Tag: 13 categories (15 positions, two of them reserved)
Ex.: nejnezajímavějším "(to) the most uninteresting"
Example tag: AAFP3----3N----

Position  Value  Meaning
1         A      Adjective
2         A      Regular
3         F      Feminine
4         P      Plural
5         3      Dative
6         -      no poss. gender
7         -      no poss. number
8         -      no person
9         -      no tense
10        3      superlative
11        N      negated
12        -      no voice
13        -      reserve 1
14        -      reserve 2
15        -      base variant

Lemma: POS-unique identifier
  books/verb -> book-1, went -> go, to/prep. -> to-1
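Because the tag is positional, decoding it is a matter of pairing each character with the category at that position. The sketch below assumes the 15-position tag `AAFP3----3N----` for the example word. The category names in `CATEGORIES` are labels chosen here to mirror the slot descriptions on the slide; they are not read from any PDT file, and `decode_tag` is a hypothetical helper.

```python
from typing import Dict

# One name per tag position (13 categories plus 2 reserved slots).
# The names are illustrative labels for the slots listed on the slide.
CATEGORIES = [
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense",
    "Grade", "Negation", "Voice", "Reserve1", "Reserve2", "Var",
]


def decode_tag(tag: str) -> Dict[str, str]:
    """Map a positional tag to {category: value}, skipping '-' (not applicable)."""
    if len(tag) != len(CATEGORIES):
        raise ValueError(f"expected {len(CATEGORIES)} positions, got {len(tag)}")
    return {name: value for name, value in zip(CATEGORIES, tag) if value != "-"}


print(decode_tag("AAFP3----3N----"))
```

A fixed-width tag like this can also be matched with simple wildcard patterns, for example "any adjective in the dative", without parsing a feature structure.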
Layer 2 (a-layer): Analytical Syntax
• Dependency + analytical function
[Figure: analytical dependency tree of the sentence below, with edges marked governor / dependent]
The influence of the Mexican crisis on Central and Eastern Europe has apparently been underestimated.
Analytical Syntax: Functions
• Main (for [main] semantic lexemes):
  • Pred, Sb, Obj, Adv, Atr, Atv(V), AuxV, Pnom
  • "Double" dependency: AtrAdv, AtrObj, Atr
• Special (function words, punctuation, ...):
  • Reflexives, particles: AuxT, AuxR, AuxO, AuxZ, AuxY
  • Prepositions/conjunctions: AuxP, AuxC
  • Punctuation, graphics: AuxX, AuxS, AuxG, AuxK
• Structural
  • Ellipsis: ExD; coordination etc.: Coord, Apos
Tectogrammatical Annotation
• Underlying (deep) syntax
• 5 sublayers (integrated and/or standoff annotation):
  – dependency structure, (detailed) functors
    • valency annotation
  – topic/focus and deep word order
  – coreference (mostly grammatical only)
  – discourse
  – all the rest (grammatemes):
    • detailed functors
    • underlying gender, number, mass nouns, ...
• Total: 39 attributes (vs. 5 at m-layer, 2 at a-layer)
Tectogrammatical vs. analytical syntax
• AR (analytical representation): all words
• TR (tectogrammatical representation): no function words
[Figure: AR and TR trees of the sentence below; callouts mark the predicate verb, a "Location" node, and the re-inserted elided actor of "making"]
In practice, that procedure will require making of certified copies.
Dependency Structure
• Similar to the surface (analytical) layer ... but:
  – certain nodes deleted
    • auxiliaries, non-autosemantic words, punctuation
    • (some) multiword expressions -> 1 node
  – some nodes added
    • based on word (mostly verb, noun) valency
    • some ellipsis resolution
  – detailed dependency relation labels (functors)
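The "certain nodes deleted" step can be pictured as hiding function-word nodes and re-attaching their children to the nearest remaining ancestor. The sketch below covers only that one step, under strong simplifying assumptions: it adds no nodes, resolves no ellipsis, merges no multiword expressions and assigns no functors. The `ANode` class, the `FUNCTION_AFUNS` set and the hand-built analytical tree are illustrations made up here, not PDT data or the PDT conversion procedure.

```python
from dataclasses import dataclass
from typing import Dict, List

# Analytical functions treated as "function word / punctuation" in this sketch.
FUNCTION_AFUNS = {"AuxV", "AuxP", "AuxC", "AuxX", "AuxG", "AuxK"}


@dataclass
class ANode:
    ord: int      # 1-based word position
    form: str
    afun: str     # analytical function
    parent: int   # ord of the governor, 0 = technical root


def content_skeleton(nodes: List[ANode]) -> Dict[int, int]:
    """Return {ord: head ord} for content nodes only.

    A content node whose governor is a function word is re-attached
    to the nearest ancestor that is itself a content node (or the root).
    """
    by_ord = {n.ord: n for n in nodes}

    def kept_ancestor(o: int) -> int:
        while o != 0 and by_ord[o].afun in FUNCTION_AFUNS:
            o = by_ord[o].parent
        return o

    return {
        n.ord: kept_ancestor(n.parent)
        for n in nodes
        if n.afun not in FUNCTION_AFUNS
    }


# Hand-built, simplified analytical tree (an assumption for illustration).
sentence = [
    ANode(1, "In", "AuxP", 7), ANode(2, "practice", "Adv", 1),
    ANode(3, ",", "AuxX", 1), ANode(4, "that", "Atr", 5),
    ANode(5, "procedure", "Sb", 7), ANode(6, "will", "AuxV", 7),
    ANode(7, "require", "Pred", 0), ANode(8, "making", "Obj", 7),
    ANode(9, "of", "AuxP", 8), ANode(10, "certified", "Atr", 11),
    ANode(11, "copies", "Atr", 9), ANode(12, ".", "AuxK", 0),
]
print(content_skeleton(sentence))
```

In this toy tree "practice" ends up directly under "require" and "copies" under "making", which is the effect of dropping the prepositions; the re-inserted actor of "making" would need the valency-based node insertion that this sketch leaves out.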
Tectogrammatical Functors
[Slide axis label: "syntactic" to semantic]
• "Actants": ACT, PAT, EFF, ADDR, ORIG
  – modify: verbs, nouns, adjectives
  – cannot repeat in a clause, usually obligatory
• Free modifications (~50), semantically defined
  – can repeat; optional, sometimes obligatory
  – Ex.: LOC, DIR1, ...; TWHEN, TTILL, ...; RSTR; BEN, ATT, ACMP, INTT, MANN; MAT, APP; ID, DPHR, CPHR, ... [slide callout: MWEs]
• Special
  – Coordination, Rhematizers, Foreign phrases (#Forn), ...
Deep Word Order, Topic/Focus
• Example:
  – Baker bakes rolls. vs. Baker(IC) bakes rolls. (IC = intonation center)
[Figure: analytical dep. tree of the example]
Deep Word Order, Topic/Focus
• Deep word order (d.w.o.):
  – from "old" information to the "new" one (left-to-right) at every level (head included)
  – projectivity by definition (almost ...)
    • i.e., partial level-based order -> total d.w.o.
• Topic/focus/contrastive topic
  – attribute of every node (t, f, c)
  – restricted by d.w.o. and other constraints
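Projectivity is a property of a dependency tree together with a linear order of its nodes: for every edge, all nodes lying between the governor and the dependent must be dominated by the governor. The following sketch checks this for a tree given as a list of head indices. The function name `is_projective` and the head-list encoding are choices made for this illustration, and the input is assumed to be a well-formed tree (no cycles).

```python
from typing import List


def is_projective(heads: List[int]) -> bool:
    """heads[i] is the 1-based governor of node i+1; 0 marks the root.

    Returns True if, for every edge governor -> dependent, each node
    positioned between the two is a descendant of the governor.
    """

    def dominated_by(node: int, ancestor: int) -> bool:
        # Walk up from node towards the root, looking for ancestor.
        while node != 0:
            if node == ancestor:
                return True
            node = heads[node - 1]
        return False

    for dep in range(1, len(heads) + 1):
        head = heads[dep - 1]
        if head == 0:
            continue  # the root edge cannot be crossed from outside
        lo, hi = sorted((head, dep))
        for between in range(lo + 1, hi):
            if not dominated_by(between, head):
                return False
    return True


# "Baker bakes rolls": both nouns depend on the verb.
print(is_projective([2, 0, 2]))
# A crossing configuration: node 2 depends on node 4 across the root (node 3).
print(is_projective([3, 4, 0, 3]))
```

If the deep word order is defined level by level (each node ordered among its sisters and its governor), the resulting total order satisfies this check by construction, which is what "projectivity by definition" refers to.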
Coreference
• Grammatical (easy)
  – relative clauses
    • which, who
      – Peter and Paul, who ...
  – control
    • infinitival constructions
      – John promised to go home
  – reflexive pronouns
    • {him, her, them}self(-ves)
      – Mary saw herself in ...
[Figure: tectogrammatical tree for the control example, with nodes promise (PRED), John (ACT), go (PAT), he (ACT), home (DIR3)]
Coreference
• Textual
  – Ex.: Peter moved to Iowa after he finished his Ph.D.
Grammatemes
• Detailed functors ("subfunctors")
  – needed for some functors:
    • TWHEN: before/after
    • LOC: next-to, behind, in-front-of, ...
    • also: ACMP, BEN, CPR, DIR1, DIR2, DIR3, EXT
• Lexical (underlying)
  – number (Sg/Pl), tense, modality, degree of comparison, mass noun (?); is_person_name, is_dsp_root, ...
[Slide callout: MWEs]
VALENCY IN PDT
Prague Dependency Treebank & Valency
• Valency in the PDT
  – Valency lexicon for PDT
  – General valency lexicon
• Valency in deep vs. surface syntax
  – Links between the layers w.r.t. valency
• Valency and word sense
  – Sense-disambiguated occurrences:
    • Links from data to the lexicon
• Valency in translation, text generation
Definition of Valency
• Ability ("desire") of words (verbs, nouns, adjectives) to combine with other units of meaning
• Properties of valency:
  – Specific for every word meaning (in general)
    • leave: sb left sth for sb vs. sb left from somewhere
    • similar to PropBank leave.02 vs. leave.01
  – Typically strongly correlates with surface form (Czech)
    • morphological case (~ ending), preposition + case, ...
  – Semantic constraints
Structure of Valency

Schema                                        Example
word (lemma)                                  vyměnit (to replace)
  – word sense group 1                        vyměnit 1
    • valency frame: slot 1, slot 2, slot 3   ACT    PAT    EFF
    • surface expression                      Nom.   Acc.   za+Acc.
  – word sense group 2                        vyměnit 2
    • ...                                     ...
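The lemma / sense / frame / slot hierarchy maps naturally onto nested records. The sketch below encodes the one frame given in the example (ACT, PAT, EFF with Nom., Acc., za+Acc.). The class names `Slot`, `Frame`, `Entry` and the method `surface_form` are hypothetical and do not reflect the actual PDT-Vallex file format; the second sense is left out because the slide does not spell it out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Slot:
    functor: str   # e.g. ACT, PAT, EFF
    surface: str   # required surface expression, e.g. "Acc." or "za+Acc."


@dataclass
class Frame:
    sense_id: str
    slots: List[Slot]


@dataclass
class Entry:
    lemma: str
    gloss: str
    frames: List[Frame] = field(default_factory=list)

    def surface_form(self, sense_id: str, functor: str) -> Optional[str]:
        """Look up how a given slot of a given sense is expressed on the surface."""
        for frame in self.frames:
            if frame.sense_id == sense_id:
                for slot in frame.slots:
                    if slot.functor == functor:
                        return slot.surface
        return None


vymenit = Entry(
    lemma="vyměnit",
    gloss="to replace",
    frames=[
        Frame("vyměnit1", [
            Slot("ACT", "Nom."),
            Slot("PAT", "Acc."),
            Slot("EFF", "za+Acc."),
        ]),
    ],
)

print(vymenit.surface_form("vyměnit1", "EFF"))
```

Keeping the surface expression on the slot, rather than on the frame as a whole, is what allows a corpus node with a given functor to be checked against, or generated from, its lexicon entry.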
PDT-Vallex Entry
• dosáhnout: "to reach", "to get [sb to do sth]"
• browser/user-formatted example: [screenshot not preserved]
MWEs in PDT-Vallex
• Types included:
  – Reflexive particle (se, si): smát_se (t_lemma)
    • smát se – to laugh
    • všimnout si – to notice
  – Idiomatic constructions: DPHR (argument)
    • dosáhnout svého – to achieve one's goals
    • běhá mi mráz po zádech – to give me the shivers
  – Light verb constructions (and similar): CPHR (argument)
    • uzavřít dohodu – to agree [on sth], strike an agreement, ...
    • vzbuzovat pochybnosti – to doubt, to raise doubts
Corpus ↔ Valency Lexicon
• Corpus: Sentence 2035, Sentence 15345, Sentence 51042
• Lexicon:
  ENTRY: uzavřít (to close)
    vf1: ACT(.1) CPHR({smlouva}.4)   ex.: u. dohodu (close a contract)
    vf2: ACT(.1) PAT(.4)             ex.: u. pokoj (close a room, house)
[Figure: links between the corpus sentences and the lexicon frames]
Valency & Text Generation
• Using valency for ...
  – ... getting the correct (lemma, tag) of verb arguments
• Example:
  – VALLEX entry: starat (se)   ACT(.1) PAT(o.[.4])
  – Tectogrammatical tree: starat_se "to take care of" (PRED), Martin (ACT), tygr "tiger" (PAT)
  – Generated (lemma, tag) sequence: Martin (case 1), se, starat (V...), o, tygr (case 4)
  – Surface sentence: Martin se stará o tygry. "Martin takes care of tigers."
PARALLEL TREEBANK CZ-EN
Parallel Czech-English Annotation
• English text → Czech text (human translation)
• Czech side (goal): manual annotation on all layers
• English side (goal):
  – Morphology and surface syntax: technical conversion
    • Penn Treebank style -> PDT analytic layer
  – Tectogrammatical annotation: manual annotation
    • (Slightly) different rules needed for English
• Alignment
  – Natural, sentence level only (now)
English Annotation: POS and Syntax
• Automatic conversion from Penn Treebank
  – PDT morphological layer
    • From POS tags
  – PDT analytic layer
    • From:
      – Penn Treebank syntactic structure
      – Non-terminal labels
      – Function tags (non-terminal "suffixes")
    • 2-step process
      – Head determination rules
      – Conversion to dependency + analytic function
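The first of the two steps, head determination followed by turning phrases into dependencies, can be sketched as follows. In each phrase one child is selected as the head by a per-label priority list, and the head words of the remaining children are attached to the head word of the selected child. The `HEAD_RULES` table here is a tiny made-up example, not the rule set used for the actual conversion, and the assignment of analytic functions (the second step) is not attempted.

```python
from typing import Dict, List, Tuple

# Hypothetical head rules: for each phrase label, child labels in priority order.
HEAD_RULES: Dict[str, List[str]] = {
    "S": ["VP"],
    "VP": ["VP", "VB", "VBD", "MD"],
    "NP": ["NN", "NNS", "NNP", "NP"],
    "PP": ["IN"],
    "ADJP": ["JJ"],
}


def base(label: str) -> str:
    # Strip function tags such as -SBJ or -TMP (non-terminal "suffixes").
    return label.split("-")[0]


def to_dependencies(tree: list) -> Tuple[List[str], List[int]]:
    """Convert a nested-list phrase tree [label, child, ...] to (words, heads).

    heads[i] is the 1-based governor of word i+1; 0 marks the root.
    """
    words: List[str] = []
    heads: List[int] = []

    def visit(node: list) -> int:
        """Return the 1-based index of the head word of this subtree."""
        label, children = node[0], node[1:]
        if len(children) == 1 and isinstance(children[0], str):
            words.append(children[0])   # preterminal: [POS, word]
            heads.append(0)
            return len(words)
        child_heads = [visit(child) for child in children]
        child_labels = [base(child[0]) for child in children]
        head_pos = len(children) - 1    # fallback: rightmost child
        for wanted in HEAD_RULES.get(base(label), []):
            matches = [i for i, lab in enumerate(child_labels) if lab == wanted]
            if matches:
                head_pos = matches[-1]  # rightmost child with that label
                break
        for i, h in enumerate(child_heads):
            if i != head_pos:
                heads[h - 1] = child_heads[head_pos]
        return child_heads[head_pos]

    visit(tree)
    return words, heads


phrase_tree = [
    "S",
    ["NP-SBJ", ["NNP", "Pierre"], ["NNP", "Vinken"]],
    ["VP", ["MD", "will"],
        ["VP", ["VB", "join"], ["NP", ["DT", "the"], ["NN", "board"]]]],
    [".", "."],
]
words, heads = to_dependencies(phrase_tree)
for i, (w, h) in enumerate(zip(words, heads), start=1):
    print(i, w, h)
```

Which child counts as the head is a design decision encoded in the rule table, which is one reason heads obtained by conversion are described as added "ad hoc" compared with natively annotated dependency treebanks.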
Czech-English Example
• Czech: Dicku Darmane, zavolejte do své kanceláře!
• English: Dick Darman, call your office!
SUMMARY OF PART 1/1
PDT Treebanks at UFAL (written language)
• Czech
  – Prague Dependency Treebank
    • Complex annotation, all levels, additional annotation
  – Translation of Penn Treebank
    • Tectogrammatical layer only, no t/f
      – Analytical, morphology: automatic tool
• English
  – Re-annotation of Penn Treebank
• Other languages
  – Arabic (own annotation)
  – Other: by conversion (HamleDT, 30 treebanks)
Prague Dependency Treebanks
Now also in the Universal Dependency format! https://github.com/UniversalDependencies
• Annotation:
  – 4 layers:
    • Words, lemmas/tags, surface dep. syntax, tectogrammatics
  – Tectogrammatical layer:
    • No function words, semantic relations
    • Valency/verb arguments (some MWE features)
      – Separate valency lexicon, fully linked from PDT nodes
    • Coreference, topic/focus, discourse
    • Links back to analytical layer (parsing!)
Pointers
• PDT 2.0 (the "Original"), newest version: PDT 3.0
  – http://ufal.mff.cuni.cz/pdt2.0
  – http://ufal.mff.cuni.cz/pdt3.0
• PCEDT
  – http://ufal.mff.cuni.cz/pcedt2.0/
• PEDT: English side of PCEDT, additional: NE, coreference
  – http://ufal.mff.cuni.cz/pedt2.0/
• PADT (Arabic, morphology + surface syntax)
  – http://ufal.mff.cuni.cz/padt
• Other corpora, PDT-Vallex, EngVallex:
  – Search at http://lindat.cz
• LDC catalog numbers:
  – LDC2006T01 (PDT 2.0), LDC2004T23 (PADT 1.0), LDC2004T25 (PEDT 1.0)
• CoNLL 2009 shared task (7 languages, surface syntax + predicate arguments only)
  – http://ufal.mff.cuni.cz/conll2009-st
• HamleDT 2.0 (30 treebanks in unified format)
  – http://ufal.mff.cuni.cz/hamledt