Zdeněk Žabokrtský
e-mail: zabokrtz@cs.felk.cvut.cz
Czech Technical University, Department of Computer Science
The following presentation can be downloaded from http://obelix.ijs.si/Zdenek.Zabokrtsky/PDT/

The Prague Dependency Treebank (PDT)
• long-term project aimed at complex annotation of a part of the Czech National Corpus with a rich annotation scheme
• Institute of Formal and Applied Linguistics
  – established in 1990 at the Faculty of Mathematics and Physics, Charles University, Prague
  – Jan Hajič, Eva Hajičová, Jarmila Panevová, Petr Sgall, …
  – http://ufal.ms.mff.cuni.cz

The Prague Dependency Treebank
• inspiration:
  – the Penn Treebank (the most widely used syntactically annotated corpus of English)
• motivation:
  – the treebank can be used for further linguistic research
  – training on annotated corpora gives more accurate results than unsupervised training on texts in their raw form

Source of the text data
• provided by the Institute of the Czech National Corpus (ICNC)
• text sample for PDT
  – 456 705 tokens (words and punctuation marks) in 26 610 sentences, divided into 576 files, 50 sentences per file
    • 40 % general newspaper articles (Lidové noviny, Mladá Fronta)
    • 20 % economic news and analysis (Českomoravský profit)
    • 20 % popular science magazine (Vesmír)
    • 20 % information technology texts
  – divided into
    • a training set (19 126 sentences)
    • a development test set (3 697 sentences)
    • a cross-evaluation test data set (3 787 sentences)

Institute of the Czech National Corpus
• founded in 1994 at the Faculty of Philosophy, Charles University
• head of the institute: prof. František Čermák
• 100 million words
• freely accessible: http://ucnk.ff.cuni.cz
  – query language CQP (Corpus Query Processor, developed at the University of Stuttgart)
  – regular expressions
  – examples of queries: disku[s|z]e, .+nést

CNC: query example
• query: .+nosit
• response:
    tačí se trochu vybavit , <nanosit> kupu listí a sena - já ho
    ie Každý mistr by se měl <honosit> nějakým rekordem či jedin
    anční tísni by měly dítě <donosit>. Bezvýhradná povinnost p
    í hladovění bude schopna <donosit> plod. Mimochodem i u sou
    evítané těhotenství tzv. <donosit> a dítěte se vzdát ve pros
    mž sedíme , nepostavil. <Vynosit> tuny kamení na zádech , t
    byl v nebezpečí a naděje <donosit> dítě žádná. Jeden večer
    6 - Živit mateř. mlékem <Nanosit> 57 - Ukončit létání 58 -
    odstatně větší a může se <honosit> řadou úctyhodných přívlas
    vy , v pokoji nekouřit , <nenosit> domů alkohol. Dodržovat
    ve městě , které se mělo <honosit> jen svým " dělnickým hnut
    ...

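
The query above can be read as an ordinary regular expression applied to whole word forms. The following is a minimal sketch in Python, assuming full-token matching and using the standard `re` module as a stand-in for CQP, whose exact semantics may differ. The helper name `match_tokens` is hypothetical, and the word list is taken from the hits in the response above plus two non-matching words added for contrast.

```python
import re

def match_tokens(pattern, tokens):
    """Return the tokens that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [t for t in tokens if compiled.fullmatch(t)]

# Word forms from the concordance above, plus "nosit" and "kupu" for contrast.
tokens = ["nanosit", "honosit", "donosit", "Vynosit", "nenosit", "nosit", "kupu"]

# ".+" requires at least one character before "nosit",
# so the bare verb "nosit" is not a hit.
hits = match_tokens(r".+nosit", tokens)
print(hits)
```

Because `.+` needs at least one leading character, only prefixed forms are returned, which agrees with the hits listed in the response.
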
Layered structure of PDT
[Diagram: raw text, morphologically tagged text, analytic tree structures (ATS), tectogrammatical tree structures (TGTS)]
• morphological level
  – full morphological tagging (word forms, lemmas, morphological tags)
• analytical level
  – surface syntax
  – syntactic annotation using dependency syntax (captures analytical functions such as Subject, Object, ...)
• tectogrammatical level
  – level of linguistic meaning (tectogrammatical functions such as Actor, Patient, ...)

The Morphological Level
• a tag and a lemma are assigned to each word form from the input text
• 3030 tags (Czech is an inflectionally rich language)
• 6 tag variables
  – number
  – case
  – gender
  – degree of comparison
  – person
  – negation
• example:
  – VPS3A: verb (indicative, present tense, singular, 3rd person, affirmative)

Morphological Analysis
• an automatic process:
  – input: word form
  – output: a set of possible lemmas, each lemma accompanied by a set of possible tags
• currently covers 720 000 Czech lemmas, based on 210 000 stems
• can recognize 20 million word forms
• output ambiguity:
  – up to 5 different lemmas for a given word form
  – up to 27 different tags for a given lemma
  – example: učení: NNS1A, NNS2A, NNS3A, ..., NNP5A

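
The input/output contract described above (word form in, set of lemmas out, each lemma with a set of tags) can be pictured as a nested mapping. The sketch below is only an illustration: a toy lookup table stands in for the real analyser, the names `ANALYSES`, `analyse` and `ambiguity` are hypothetical, and the entries are copied from the tagged sentence example in these slides (the tag list for "Ty" as lemma "ten" is partial because the slide elides part of it).

```python
# Toy stand-in for the morphological analyser: word form -> lemma -> set of tags.
ANALYSES = {
    "Ty": {
        "ty": {"PP2S1", "PP2S5"},
        "ten": {"PDFP1", "PDFP4", "PDIP1", "PDIP4", "PDMP4"},  # partial list
    },
    "to": {"ten": {"PDNS1", "PDNS4"}},
    "pak": {"pak": {"DB"}},
}

def analyse(form):
    """Return the lemma -> tags mapping for a word form (empty if unknown)."""
    return ANALYSES.get(form, {})

def ambiguity(form):
    """Return (number of lemmas, total number of lemma/tag readings)."""
    lemmas = analyse(form)
    return len(lemmas), sum(len(tags) for tags in lemmas.values())

for form in ("Ty", "to", "pak"):
    print(form, ambiguity(form))
```

A form such as "pak" has a single reading and needs no disambiguation, while "Ty" is ambiguous in both lemma and tag.
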
The whole process of morphological tagging
[Diagram: raw text is processed by the steps below into unambiguously tagged text]
• automatic morphological analysis
• manual disambiguation
  – 2 annotators
  – in the full text context
  – special software tool
• automatic comparison
• manual correction

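
The automatic comparison step can be sketched as a position-by-position diff of the two annotators' choices. This is a minimal illustration rather than the project's actual tool: the function name `compare_annotations` and the sample data are assumptions, with each annotation modelled as a list of (lemma, tag) pairs, one per token.

```python
def compare_annotations(first, second):
    """Return the token positions where the two annotators disagree."""
    if len(first) != len(second):
        raise ValueError("annotations must cover the same tokens")
    return [i for i, (a, b) in enumerate(zip(first, second)) if a != b]

# Hypothetical choices of two annotators for the tokens "mají" and "to".
annotator_1 = [("mít", "VPP3A"), ("ten", "PDNS1")]
annotator_2 = [("mít", "VPP3A"), ("ten", "PDNS4")]

# Positions listed here would go on to manual correction.
disagreements = compare_annotations(annotator_1, annotator_2)
print(disagreements)
```

Only the flagged positions need a human decision; tokens on which both annotators agree pass through unchanged.
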
Data Format
• Standard Generalized Markup Language (SGML)
• a sample of the DTD (Document Type Definition) related to the morphological level:

    <!ELEMENT MMl - O (#PCDATA & R? & E? & e? & T* & MMt*)
      -- lemma (base form), description see the l tag;
         machine assigned (by a morphological analysis program),
         NOT disambiguated -->
    <!ELEMENT MDl - O (#PCDATA & R? & E? & e? & T* & MDt*)
      -- lemma (base form), description see the l tag;
         machine assigned (by a tagger), disambiguated
         if more than 1: n-best -->
    ...
    <!ELEMENT MMt - O (#PCDATA)
      -- morphological tag(s) as assigned by morphology,
         NOT disambiguated -->
    <!ELEMENT MDt - O (#PCDATA)
      -- morphological tag(s) as assigned by machine, disambiguated,
         possibly also with weight/prob; if more than 1: n-best -->

Example of tagged sentence
• Ty mají pak někdy takovou publicitu, že to dotyčnou kancelář zlikviduje.

    <s id=cmpr9415:025-p19s2/bcc14zua.fs/#18>
    <f cap>Ty<MMl>ty<MMt>PP2S1<MMt>PP2S5<MMl>ten<MMt>... PDFP1<MMt>PDFP4<MMt>PDIP1<MMt>PDIP4<MMt>PDMP4<A>Sb<r>1<g>2
    <f>mají<MMl>mít<MMt>VPP3A<A>Pred<r>2<g>0
    <f>pak<MMl>pak<MMt>DB<A>Adv<r>3<g>2
    <f>někdy<MMl>někdy<MMt>DB<A>Adv<r>4<g>2
    <f>takovou<MMl>takový<MMt>AFS41A<MMt>AFS71A<A>Atr<r>5<g>6
    <f>publicitu<MMl>publicita<MMt>NFS4A<A>Obj<r>6<g>2
    <D>
    <d>,<MMl>,<MMt>ZIP<A>AuxX<r>7<g>8
    <f>že<MMl>že<MMt>JS<A>AuxC<r>8<g>6
    <f>to<MMl>ten<MMt>PDNS1<MMt>PDNS4<A>Sb<r>9<g>13
    <f>dotyčnou<MMl>dotyčný<MMt>AFS41A<MMt>AFS71A<A>Atr<r>10<g>11
    <f>kancelář<MMl>kancelář<MMt>NFS1A<MMt>NFS4A<A>Obj<r>11<g>13
    <f>prakticky<MMl>prakticky_^(*1ý)<MMt>DG1A<A>Adv<r>12<g>13
    <f>zlikviduje<MMl>zlikvidovat_:W<MMt>VPS3A<A>Obj<r>13<g>8
    <D>
    <d>.<MMl>.<MMt>ZIP<A>AuxK<r>14<g>0

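
A token line of the kind shown above can be split into its fields with a short parser. The sketch below is an illustration under stated assumptions, not the project's own tooling: `parse_token` is a hypothetical helper, spacing inside tags is taken as already normalized, every `<MMt>` is assumed to follow an `<MMl>`, and `<r>` and `<g>` are read here as the token's position and the position of its governing node.

```python
import re

# One field = an SGML-style start tag followed by text up to the next tag.
FIELD = re.compile(r"<([^>]+)>([^<]*)")

def parse_token(line):
    """Parse one token line into form, lemmas with tags, afun, ord and gov."""
    token = {"form": None, "lemmas": [], "afun": None, "ord": None, "gov": None}
    for name, value in FIELD.findall(line):
        name = name.split()[0]          # "f cap" -> "f"
        value = value.strip()
        if name in ("f", "d"):          # word form or punctuation mark
            token["form"] = value
        elif name == "MMl":             # start a new lemma with an empty tag list
            token["lemmas"].append((value, []))
        elif name == "MMt":             # tag belongs to the most recent lemma
            token["lemmas"][-1][1].append(value)
        elif name == "A":               # analytical function
            token["afun"] = value
        elif name == "r":
            token["ord"] = int(value)
        elif name == "g":
            token["gov"] = int(value)
    return token

tok = parse_token("<f>takovou<MMl>takový<MMt>AFS41A<MMt>AFS71A<A>Atr<r>5<g>6")
print(tok)
```

Keeping the tags grouped under their lemma preserves the ambiguity that the manual disambiguation step later resolves.
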
The Analytical Level
• the dependency structure was chosen to represent the syntactic relations within the sentence
• output of the analytical level: analytical tree structure (ATS)
  – oriented, acyclic graph with one entry node
  – every word form and punctuation mark is represented as a node
  – the nodes are annotated by attribute-value pairs
• new attribute: analytical function
  – determines the relation between the dependent node and its governing node
  – values: Sb, Obj, Adv, Atr, ...

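
The graph properties listed above (oriented, acyclic, one entry node) can be checked mechanically once each node's governor is known. The sketch below assumes that the `<r>` and `<g>` values from the tagged sentence example ("Ty mají pak někdy ...") give a node's position and its governor's position, with 0 standing for the single entry node. The function names `build_children` and `is_tree` are hypothetical.

```python
from collections import defaultdict

def build_children(edges):
    """Turn a node -> governor mapping into governor -> list of dependents."""
    children = defaultdict(list)
    for node, gov in edges.items():
        children[gov].append(node)
    return children

def is_tree(edges, root=0):
    """True if every node reaches the root by following governors, with no cycle."""
    for node in edges:
        seen = set()
        current = node
        while current != root:
            if current in seen or current not in edges:
                return False            # cycle, or governor missing from the graph
            seen.add(current)
            current = edges[current]
    return True

# node -> governor, copied from the <r>/<g> values of the example sentence.
edges = {1: 2, 2: 0, 3: 2, 4: 2, 5: 6, 6: 2, 7: 8, 8: 6,
         9: 13, 10: 11, 11: 13, 12: 13, 13: 8, 14: 0}

children = build_children(edges)
print(is_tree(edges), children[2])
```

Storing only the governor of each node is enough to recover the whole tree, which is why the annotation format needs just one extra attribute per token.
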
Example of ATS
• V návrzích na případné změny vycházejí ze svých většinou několikaletých podnikatelských zkušeností.

Selected attributes of ATS nodes

Selected values of the analytical function

Example of tagged sentence
• ... ve sledovaném období žádný okres nezlepšil svoji pozici ...

    <f>ve<MMl>v<MMt>RV4<MMt>RV6<A>AuxP<r>4<g>9
    <f>sledovaném<MMl>sledovaný_^(*2t)<MMt>AIS61A<MMt>AMS61A<MMt>ANS61A<A>Atr<r>5<g>6
    <f>období<MMl>období<MMt>NNP1A<MMt>NNP2A<MMt>NNP4A<MMt>NNP5A<MMt>NNS1A<MMt>NNS2A<MMt>NNS3A<MMt>NNS4A<MMt>NNS5A<MMt>NNS6A<A>Adv<r>6<g>4
    <f>žádný<MMl>žádný<MMt>PNFIS4<MMt>PNFYS1<MMt>PNFYS5<A>Atr<r>7<g>8
    <f>okres<MMl>okres<MMt>NIS1A<MMt>NIS4A<A>Sb<r>8<g>9
    <f>nezlepšil<MMl>zlepšit_:W<MMt>VRYSN<A>Pred_Co<r>9<g>11
    <f>pozici<MMl>pozice<MMt>NFS3A<MMt>NFS4A<MMt>NFS6A<A>Obj<r>10<g>9

The Tectogrammatical Level
• based on the framework of the Functional Generative Description as developed by Petr Sgall
• in comparison to the ATSs, the tectogrammatical tree structures (TGTSs) have the following characteristics:
  – only autosemantic words have their own node; function words (conjunctions, prepositions) are attached as indices to the autosemantic words to which they belong
  – nodes are added in case of clearly specified deletions on the surface level
  – analytical functions are replaced by tectogrammatical functions (functors), such as Actor, Patient, Addressee, ...

Example of TGTS
• Podle předběžných odhadů se totiž počítá, že do soukromého vlastnictví bude prodáno minimálně 10000 bytů.

Selected attributes of a TGTS node

Functors
• tectogrammatical counterparts of analytical functions
• about 40 functors in 2 groups:
  – actants
    • Actor, Patient, Addressee, Origin, Effect
  – free modifiers
    • LOC, DIR1, RSTR, TWHEN, TTILL, ...
• provide more detailed information about the relation to the governing node than the analytical function

Example of ATS ...
• Kdo chce investovat dvě stě tisíc korun do nového automobilu, nelekne se, že benzín byl změnou zákona trochu zdražen.

... and the corresponding TGTS

Tectogrammatical tagging
• 2 parallel streams starting from the ATS treebank
  – smaller set of fully tagged TGTSs
  – larger set of partially tagged TGTSs (only changes of tree structure, functor and TFA (topic-focus articulation) assignment)

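Problems of automatic functor assignment
• za roh (around the corner): DIR3
• za hodinu (in an hour): TWHEN
• za svobodu (for freedom): OBJ
• po otci
  – TWHEN (Přišel po otci. "He came after his father.")
  – NORM (Jmenuje se po otci. "He is named after his father.")
  – HER (Zdědil dům po otci. "He inherited the house from his father.")
  – ...
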
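
The difficulty shown above is that a preposition alone does not determine the functor. A minimal sketch that makes this explicit by grouping the slide's examples by preposition follows; the names `EXAMPLES` and `functors_by_preposition` are hypothetical, and the data are only the six examples listed here, not a real lexicon.

```python
from collections import defaultdict

# (preposition, example, functor) triples taken from the list above.
EXAMPLES = [
    ("za", "za roh", "DIR3"),
    ("za", "za hodinu", "TWHEN"),
    ("za", "za svobodu", "OBJ"),
    ("po", "Přišel po otci.", "TWHEN"),
    ("po", "Jmenuje se po otci.", "NORM"),
    ("po", "Zdědil dům po otci.", "HER"),
]

def functors_by_preposition(examples):
    """Collect the set of functors observed for each preposition."""
    table = defaultdict(set)
    for preposition, _example, functor in examples:
        table[preposition].add(functor)
    return dict(table)

table = functors_by_preposition(EXAMPLES)

# Prepositions with more than one observed functor cannot be resolved by lookup.
ambiguous = sorted(p for p, functors in table.items() if len(functors) > 1)
print(ambiguous)
```

Both prepositions map to three functors each, so an automatic procedure has to look at more context, such as the governing verb or the noun itself, to choose among them.
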
Summary
[Timeline diagram: Czech National Corpus, morphologically tagged corpus, ATS treebank, TGTS treebank; dates shown: September 1994, November 1996, March 2000]
• the current state of the art:
  – there are several manually annotated files of TGTSs
  – methods for automatic transformation from ATS into TGTS form are in development