Syntactically annotated corpora of Estonian
Heli Uibo
Institute of Computer Science, University of Tartu
Heli.Uibo@ut.ee
Outline
• Who?
• Why?
• Three initiatives:
  – CG-corpus
  – Sofie Parallel Treebank
  – Arborest
• What next?
Who are we?
• Kaili Müürisep, Ph.D.
• Tiina Puolakainen, Ph.D.
• Mare Koit, Ph.D.
• Tiit Roosmaa, Ph.D.
• Kadri Muischnek, M.A.
• Heli Uibo, M.Sc.
• Andriela Rääbis, M.A.
• Heili Orav, M.A.
• Kaarel Kaljurand, M.Sc.
• + students of computational linguistics (experienced in shallow syntactic annotation of texts)
Why do we need syntactically annotated corpora?
• To evaluate language technology software (tools for information retrieval and extraction, automatic summarization, machine translation)
• To build a new, up-to-date description of Estonian syntax that takes real language usage into account
Three syntactically annotated corpora for Estonian
1. Constraint Grammar (CG) Corpus
· size: 200 000 running words ≈ 15 000 sentences
  – 184 000 words of Estonian original fiction
  – 10 000 words of newspaper texts
  – 6 000 words of legal texts
· shallow annotation, using Constraint Grammar: a syntactic function is determined for every word-form
Three syntactically annotated corpora for Estonian (2)
Two small-scale experimental treebanks:
2. Sofie Parallel Treebank – a Penn-style phrase structure treebank of 50 sentences
3. Arborest – a VISL-style hybrid treebank of 2500 sentences (first 149 sentences manually revised)
Constraint Grammar Corpus
· Has been built to train and test the Constraint Grammar shallow syntactic parser ESTCG
· Currently the precision of ESTCG is 76.4–79.2% and its recall is 95.5–96.9%.
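A minimal sketch of how precision and recall are commonly computed for a parser that may leave more than one candidate tag on a word. It assumes the usual definitions (recall: share of words whose correct tag is among the retained ones; precision: share of retained tags that are correct). The function name and the toy data are hypothetical and are not taken from the ESTCG evaluation.

```python
def precision_recall(predicted, gold):
    """predicted: one set of candidate tags per word; gold: one correct tag per word."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same words")
    # A word counts as correct if its gold tag survived among the candidates.
    correct = sum(1 for tags, g in zip(predicted, gold) if g in tags)
    emitted = sum(len(tags) for tags in predicted)
    precision = correct / emitted if emitted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy data (hypothetical): two words keep an extra, wrong candidate tag.
predicted = [{"@SUBJ"}, {"@+FMV"}, {"@NN>", "@OBJ"}, {"@AN>"}, {"@PRD", "@SUBJ"}]
gold = ["@SUBJ", "@+FMV", "@NN>", "@AN>", "@PRD"]
p, r = precision_recall(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f}")
```

Under these definitions, leftover ambiguity lowers precision without lowering recall, which is one plausible reading of the gap between the two figures.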
ESTCG: Syntactic tags
@SUBJ – subject
@OBJ – object
@PRD – predicative
@ADVL – adverbial
@+FMV, @-FMV, @+FCV, @-FCV – parts of the predicate
@AN> @<AN – adjective as attribute
@NN> @<NN – noun as attribute, apposition
@AD> @<AD – adverb as attribute
@Q> @<Q – complements of a quantifier
@P> @<P – complements of an adposition
...
CG-corpus: example
<s>
Mitmekesisus mitme_kesi=sus+0 //_S_ com sg nom #cap // **CLB @SUBJ
on ole+0 //_V_ main indic pres ps3 sg ps af #FinV #Intr // @+FMV
elu+0 //_S_ com sg gen // @NN>
vaieldamatu+0 //_A_ pos sg nom // @AN>
omapära oma_pära+0 //_S_ com sg nom // @PRD
$. . //_Z_ Fst //
</s>
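A minimal sketch of how one analysis from the example could be split into lemma, part of speech, morphological features and syntactic tags. It assumes the layout seen in the example (lemma, then `//_X_` with a one-letter part of speech, then features, then `//`, then the syntactic tags); the regular expression and the function name are hypothetical and not part of any ESTCG tool.

```python
import re

# Assumed layout: LEMMA //_POS_ feature ... // [other marks] @TAG ...
ANALYSIS = re.compile(
    r"^(?P<lemma>\S+)\s+//_(?P<pos>[A-Z])_\s*(?P<feats>.*?)\s*//\s*(?P<rest>.*)$"
)


def parse_analysis(line):
    """Split one analysis (without the leading word-form) into its parts."""
    m = ANALYSIS.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised analysis: {line!r}")
    feats = m.group("feats").split()
    rest = m.group("rest").split()
    return {
        "lemma": m.group("lemma"),
        "pos": m.group("pos"),
        # Items starting with '#' (e.g. #cap, #FinV) are kept apart from the morphology.
        "morph": [f for f in feats if not f.startswith("#")],
        "marks": [f for f in feats if f.startswith("#")],
        # Syntactic function tags start with '@'.
        "syntax": [t for t in rest if t.startswith("@")],
    }


verb = parse_analysis("ole+0 //_V_ main indic pres ps3 sg ps af #FinV #Intr // @+FMV")
noun = parse_analysis("mitme_kesi=sus+0 //_S_ com sg nom #cap // **CLB @SUBJ")
print(verb["pos"], verb["syntax"], noun["lemma"], noun["syntax"])
```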
CG-corpus: the process of extending the corpus
1) Input: morphologically hand-annotated text
2) Automatic syntactic analysis (ESTCG parser)
3) Hand-correcting – two linguists in parallel (annotation manual + GUI-based annotation tool)
4) Automatic comparison
5) Discussion of problematic cases
6) Creation of final version
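Step 4, the automatic comparison of the two hand-corrected versions, can be sketched as follows. This is an illustration only: the function name, the (word, tag) representation and the disagreement in the toy data are assumptions, not the project's actual comparison tool.

```python
def compare_annotations(version_a, version_b):
    """Return (position, word, tag_a, tag_b) for every word the annotators tagged differently."""
    if len(version_a) != len(version_b):
        raise ValueError("the two versions must contain the same number of words")
    disagreements = []
    for i, ((word_a, tag_a), (word_b, tag_b)) in enumerate(zip(version_a, version_b)):
        if word_a != word_b:
            raise ValueError(f"texts diverge at position {i}: {word_a!r} vs {word_b!r}")
        if tag_a != tag_b:
            disagreements.append((i, word_a, tag_a, tag_b))
    return disagreements


# Toy data: the second annotator's tag for the last word is an invented disagreement.
annotator_a = [("Mitmekesisus", "@SUBJ"), ("on", "@+FMV"), ("elu", "@NN>"),
               ("vaieldamatu", "@AN>"), ("omapära", "@PRD")]
annotator_b = [("Mitmekesisus", "@SUBJ"), ("on", "@+FMV"), ("elu", "@NN>"),
               ("vaieldamatu", "@AN>"), ("omapära", "@SUBJ")]

diffs = compare_annotations(annotator_a, annotator_b)
for position, word, tag_a, tag_b in diffs:
    print(f"{position}: {word}  A={tag_a}  B={tag_b}")
```

The resulting list is the kind of input that step 5, the discussion of problematic cases, would work from.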
Sofie Parallel Treebank
• Sofie Parallel Treebank is being developed within the Nordic Treebank Network, which is funded by the NorFA language technology program and joins 15 academic institutions from Sweden, Norway, Denmark, Finland, Estonia and Iceland.
• Material – the 1st chapter of Jostein Gaarder's novel "Sophie's World".
• Currently, the parallel treebank includes Swedish, German, Norwegian, Estonian and two versions of Danish, with 50-100 sentences for each language.
Sofie Parallel Treebank (cont'd)
• The syntactic structure represented in the trees of different languages is not uniform:
  – Danish: Discontinuous Grammar dependency treebank and VISL-style phrase structure treebank
  – Swedish: dependency treebank
  – German: NEGRA-style treebank
  – Norwegian: phrase structure treebank
  – Estonian: Penn-style phrase structure treebank
• The representation format of trees is TIGER XML.
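A minimal sketch of reading one sentence in a TIGER-XML-like layout (terminals `t`, nonterminals `nt`, and `edge` elements pointing at children by `idref`) and printing it as a bracketed tree. The fragment is hand-written for illustration: the ids, categories and part-of-speech labels are invented and are not taken from the Sofie treebank files.

```python
import xml.etree.ElementTree as ET

# Hand-written, illustrative fragment; labels are NOT from the actual treebank.
SAMPLE = """
<s id="s1">
  <graph root="s1_501">
    <terminals>
      <t id="s1_1" word="Her" pos="ADV"/>
      <t id="s1_2" word="begynte" pos="V"/>
      <t id="s1_3" word="den" pos="DET"/>
      <t id="s1_4" word="dype" pos="ADJ"/>
      <t id="s1_5" word="skogen" pos="N"/>
    </terminals>
    <nonterminals>
      <nt id="s1_500" cat="NP">
        <edge label="--" idref="s1_3"/>
        <edge label="--" idref="s1_4"/>
        <edge label="--" idref="s1_5"/>
      </nt>
      <nt id="s1_501" cat="S">
        <edge label="--" idref="s1_1"/>
        <edge label="--" idref="s1_2"/>
        <edge label="--" idref="s1_500"/>
      </nt>
    </nonterminals>
  </graph>
</s>
"""


def to_brackets(sentence):
    """Render the graph of one <s> element as a bracketed phrase structure string."""
    graph = sentence.find("graph")
    words = {t.get("id"): t for t in graph.iter("t")}
    nodes = {nt.get("id"): nt for nt in graph.iter("nt")}

    def render(node_id):
        if node_id in words:
            t = words[node_id]
            return f"({t.get('pos')} {t.get('word')})"
        nt = nodes[node_id]
        children = " ".join(render(e.get("idref")) for e in nt.findall("edge"))
        return f"({nt.get('cat')} {children})"

    return render(graph.get("root"))


tree = to_brackets(ET.fromstring(SAMPLE.strip()))
print(tree)
```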
Estonian part of Sofie treebank: how did we do it?
• Trees drawn on paper by K. Muischnek and H. Nigol
• "Electronic" trees drawn with the ANNOTATE tool, using the Penn treebank tagset, by H. Uibo and K. Kaljurand
• Database of trees exported from ANNOTATE in NEGRA format
• TigerRegistry and TigerSearch used to convert into TIGER XML
• Website of Sofie Parallel Treebank: http://omilia.uio.no/sofie
Sample trees from Sofie treebank
Her begynte den dype skogen.
Straks Sofie hadde lukket porten bak seg, åpnet hun konvolutten.
Sofie Parallel Treebank – example from the web interface
Sophie's father was the captain of a big oil tanker, and was away for most of the year.
Arborest
• Joint work with Dr. Eckhard Bick, University of Southern Denmark
• VISL-style experimental treebank
• Annotated for both function (S = subject, P = predicate, O = object, A = adverbial, STA = statement, QUE = question, etc.) and form (np, vp, pp, advp, adjp, fcl = finite clause, par = paratagma, etc.)
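A minimal sketch of a node that carries both a function and a form label, as in the annotation scheme described above. The class, the `=` depth markers (loosely modelled on VISL's vertical tree notation) and the English placeholder words are assumptions for illustration, not the Arborest data format.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    function: str                      # e.g. "S", "P", "O", "STA"
    form: str                          # e.g. "np", "vp", "fcl"
    text: Optional[str] = None         # placeholder words covered by a leaf
    children: List["Node"] = field(default_factory=list)

    def lines(self, depth: int = 0) -> Iterator[str]:
        """Yield one 'function:form' line per node, indented with '=' per level."""
        label = f"{'=' * depth}{self.function}:{self.form}"
        if self.text is not None:
            label += f" {self.text}"
        yield label
        for child in self.children:
            yield from child.lines(depth + 1)


# Placeholder sentence; words and analysis are invented for illustration.
tree = Node("STA", "fcl", children=[
    Node("S", "np", "she"),
    Node("P", "vp", "opened"),
    Node("O", "np", "the envelope"),
])
rendered = list(tree.lines())
print("\n".join(rendered))
```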
Arborest (cont'd)
• Automatically generated from a sample of the CG-corpus (2500 sentences) with CG→PSG rules
• 149 sentences revised
• 1/3 of sentences correct
• CG→PSG rules are being improved
Webpage: http://corp.hum.sdu.dk/arborest.html
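To give an idea of what a CG→PSG rule does, here is a toy sketch that groups premodifiers tagged `@AN>` or `@NN>` with the next word to their right, using the tagged words of the CG-corpus example sentence. It is a deliberately simplified stand-in written for this illustration; the actual CG→PSG rules used for Arborest are not shown in these slides and are certainly richer.

```python
PREMODIFIER_TAGS = {"@AN>", "@NN>"}  # attributes pointing rightwards to their head


def group_phrases(tokens):
    """tokens: list of (word, tag). Returns a list of (head_tag, [words in the group])."""
    phrases, pending = [], []
    for word, tag in tokens:
        if tag in PREMODIFIER_TAGS:
            # Hold the attribute until its head shows up to the right.
            pending.append(word)
        else:
            phrases.append((tag, pending + [word]))
            pending = []
    if pending:
        # Attributes with no head to their right are kept visible, without a tag.
        phrases.append((None, pending))
    return phrases


tokens = [("Mitmekesisus", "@SUBJ"), ("on", "@+FMV"), ("elu", "@NN>"),
          ("vaieldamatu", "@AN>"), ("omapära", "@PRD")]
phrases = group_phrases(tokens)
for head_tag, words in phrases:
    print(head_tag, " ".join(words))
```

In this toy version the three last words end up in one group headed by the predicative, which is the kind of flat-to-phrase step a semi-automatic treebank pipeline relies on.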
Arborest – sample tree
What next?
• To enlarge all three syntactically annotated corpora.
• To improve the CG-to-PSG rules, so that an Estonian treebank can be built semi-automatically with less effort.
• To create another, syntactic-semantic dependency treebank for Estonian, which will be semi-automatically generated from one of the existing experimental phrase structure treebanks.
→ How much semantic information can be derived from the syntactic dependency structure?