Using Corpora for Language Research COGS 523 Lecture
- Slides: 36
Using Corpora for Language Research, COGS 523, Lecture 5
METU Turkish Corpus and METU-Sabancı Turkish Treebank: A Developer's Perspective
12.9.2021, COGS 523, Bilge Say
Related Readings
- Bilge Say, Deniz Zeyrek, Kemal Oflazer, Umut Özge. Development of a Corpus and a Treebank for Present-day Written Turkish. In Proceedings of the Eleventh International Conference of Turkish Linguistics, August 2002.
- Kemal Oflazer, Bilge Say, Dilek Zeynep Hakkani-Tür, Gökhan Tür. Building a Turkish Treebank. Invited chapter in Building and Exploiting Syntactically-annotated Corpora, Anne Abeillé (ed.), Kluwer Academic Publishers, 2003.
- Nart B. Atalay, Kemal Oflazer, Bilge Say. The Annotation Process in the Turkish Treebank. In Proceedings of the EACL Workshop on Linguistically Interpreted Corpora (LINC), April 13-14, 2003, Budapest, Hungary.
Acknowledgements
- Funding: METU-BAP, TÜBİTAK
- METU-Sabancı Treebank: joint work with Prof. Kemal Oflazer
- Main contributors: Umut Özge and Nart Bedin Atalay, METU; around 5 research assistants and 13 student annotators and trainees at various phases of the project. Various faculty members contributed ideas, especially at the initial stages.
- Agreements with 14 publishers (including 3 newspapers and 4 magazines)
Requirements for Corpora for Turkish?
- Incorporating many registers representatively
- Diachronic and synchronic
- Electronic
- Annotated with standard practices (typographically, morphosyntactically, semantically, prosodically...)
- Respecting copyright laws
- Accessible (free availability, support, etc.)
- Searchable
What is METU Turkish Corpus?
- A synchronic (1990+) corpus of written Turkish
- 2,000,000 words from 201 books, 87 journal issues, and issues of 3 daily newspapers, totaling 999 samples
- Various kinds of annotation (creation of a treebank as a separate subproject)
- Project: 1999-2003
Other Features of METU Turkish Corpus
- Permissions for each sample obtained from the publishers
- Opportunistic representativeness!
- Platform-independent; XML- and TEI-compliant annotation
- Accompanying query software
- Free for academic research purposes on signature of a user agreement
- http://www.ii.metu.edu.tr/~corpus/
Building the Corpus
- Text compilation (permissions, scanning if necessary, control)
- Computer-aided annotation (TEI-XCES for general and typographic annotation; an XML-compliant in-house scheme for the treebank)
- Control
- Query workbench development
Distribution of Text Types [figure]
Annotation of the Corpus
- Text Encoding Initiative (TEI) compliant
- XCES (XML-based Corpus Encoding Standard) compliant; XCES is a TEI application
- Compatible with major current corpora such as the British National Corpus
The TEI Structure - 1
[Diagram of the TEI hierarchy, showing the elements teiCorpus, teiHeader, TEI.2, text, front, body, back] (Burnard, 2001)
The TEI Structure - 2
- front, body, back
- divisions, e.g. <div1>
- components, e.g. <p>, <list>…
- phrase-level elements, e.g. <w>, <corr>…
(Burnard, 2001)
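A minimal Python sketch of the hierarchy named on these two slides. It assumes the usual TEI P4 nesting: a teiCorpus holds a corpus-level teiHeader and one TEI.2 per text, and each TEI.2 holds its own teiHeader and a text with front, body, and back. The function name build_tei_skeleton is hypothetical; only the element names come from the slides.

```python
import xml.etree.ElementTree as ET


def build_tei_skeleton(n_texts=1):
    """Build an empty teiCorpus tree with n_texts TEI.2 children."""
    corpus = ET.Element("teiCorpus")
    ET.SubElement(corpus, "teiHeader")  # corpus-level header
    for _ in range(n_texts):
        doc = ET.SubElement(corpus, "TEI.2")
        ET.SubElement(doc, "teiHeader")  # per-text header
        text = ET.SubElement(doc, "text")
        for part in ("front", "body", "back"):
            ET.SubElement(text, part)
    return corpus


skeleton = build_tei_skeleton(2)
# Pick out the TEI.2 children by tag, then list the parts of the first text.
documents = [el for el in skeleton if el.tag == "TEI.2"]
parts = [child.tag for child in documents[0].find("text")]
print(parts)
```

Divisions, components, and phrase-level elements would then be added inside front, body, and back.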
A Typical Header
<cesHeader>
  <fileDesc>
    <titleStmt>
      <h.title>00017113</h.title>
    </titleStmt>
    <extent>
      <wordCount>2008</wordCount>
      <byteCount>17929</byteCount>
    </extent>
    ...
A Typical Header (cont.)
<sourceDesc>
  <biblStruct>
    <analytic>
      <h.title>Anadolu Dağlarının 'Bitki Avcısı': Prof. Dr. Turhan BAYTOP</h.title>
      <h.author>Nalân MAHSERECİ</h.author>
    </analytic>
    <imprint>
      <publisher>Bilim ve Ütopya</publisher>
      <pubDate>Mart 2000</pubDate>
      <pubPlace>İstanbul</pubPlace>
    </imprint>
    <idno>1301-6717</idno>
  </biblStruct>
</sourceDesc>
A Typical Header (cont.)
<profileDesc>
  <textClass>
    <catRef>Makale</catRef>
  </textClass>
</profileDesc>
<revisionDesc>
  <change>
    <changeDate>12.10.2000</changeDate>
    <respname>Sedef</respname>
    <h.item>The header part was changed.</h.item>
  </change>
</revisionDesc>
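A header like this can be read with any XML parser. The sketch below uses Python's standard library on an abridged header assembled from the slides. The closing tags for fileDesc and cesHeader, and the placement of sourceDesc inside fileDesc, are assumptions, because the slides elide that part. The function name read_header and the keys of the returned dictionary are hypothetical.

```python
import xml.etree.ElementTree as ET

# Abridged header assembled from the slides. The closing </fileDesc> and
# </cesHeader> tags and the position of <sourceDesc> are assumptions.
SAMPLE_HEADER = """
<cesHeader>
  <fileDesc>
    <titleStmt><h.title>00017113</h.title></titleStmt>
    <extent><wordCount>2008</wordCount><byteCount>17929</byteCount></extent>
    <sourceDesc>
      <biblStruct>
        <imprint>
          <publisher>Bilim ve Ütopya</publisher>
          <pubDate>Mart 2000</pubDate>
        </imprint>
      </biblStruct>
    </sourceDesc>
  </fileDesc>
  <profileDesc><textClass><catRef>Makale</catRef></textClass></profileDesc>
</cesHeader>
"""


def read_header(xml_text):
    """Return a few bibliographic fields from an XCES-style header."""
    root = ET.fromstring(xml_text.strip())

    def first(tag):
        # Text of the first element with this tag, or None if absent.
        node = next(root.iter(tag), None)
        return node.text if node is not None else None

    return {
        "sample_id": first("h.title"),
        "words": int(first("wordCount")),
        "publisher": first("publisher"),
        "genre": first("catRef"),
    }


info = read_header(SAMPLE_HEADER)
print(info)
```

Fields such as genre and publisher are the kind of bibliographic information a query tool can filter on.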
A Typical Body
<text>
  <body>
    <p>Oktay biraz önce, <q>Hadi biz de Sitem'in yanına gidelim,</q> demişti. Sitem'in, kucağında Tomurcuk Beyle Yılanlı İncirlerden yana gittiğini o da görmüştü çünkü. Ben omuz silkmekle yetindim, Oktay da üstelemedi. Sitem ikimizin yüzüne karşı da görünmez kapılar kapamıştı. Benim de elinden kayıp gidivermemden korkan Oktay beni <hi>oyalamak</hi> için geçen yaz Giray Ağabeysiyle Kirazlı Yaylaya yaptıkları bir gezintiyi anlatmaya başladı.</p>
    <p>O gün ve sonrasında olanları elbet sana da anlatmışlardır, Dalya. Gene de o kargaşa, o şaşkınlık, o panik, o kafa karmaşası yaşanmadan bilinemez...</p>
  </body>
</text>
Entering XCES Annotations - 1 [screenshot]
Entering XCES Annotations - 2 [screenshot]
METU-Sabancı Treebank project
- Annotation of morphological and (surface) syntactic features in a dependency-inspired manner
- A subcorpus containing 7,300 annotated sentences and 65,000 words; initially, whole samples were selected from the main corpus. (Another version contains 5,600 sentences.)
- Genre distribution is proportional to that of the METU Corpus
Building the Treebank
- Morphological analysis of selected samples from the corpus
- Preprocessing of the collocations
- (Manual) disambiguation of the morphological parses
- Annotating with the dependency structure
- Control
Annotation - Lexical Level
- A word can be seen as a sequence of inflectional groups (IGs) of the form Lemma+Infl1^DB+Infl2^DB+…^DB+Infln
- evinizdekilerden (from the ones at your house): ev+Noun+A3sg+P2pl+Loc^DB+Adj^DB+Noun+A3pl+Pnon+Abl
- Each ^DB-delimited segment is one inflectional group.
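The IG representation can be unpacked mechanically. Here is a small Python sketch, assuming the analysis is a plain string in which ^DB marks the boundary between inflectional groups and + separates tags. The function name split_inflectional_groups is hypothetical.

```python
def split_inflectional_groups(analysis):
    """Split 'lemma+tags^DB+tags...' into a lemma and a list of IGs."""
    segments = analysis.split("^DB+")
    first = segments[0].split("+")
    lemma, first_tags = first[0], first[1:]
    # Every later segment is a list of tags with no lemma of its own.
    groups = [first_tags] + [seg.split("+") for seg in segments[1:]]
    return lemma, groups


# The slide's analysis of "evinizdekilerden" (from the ones at your house).
lemma, groups = split_inflectional_groups(
    "ev+Noun+A3sg+P2pl+Loc^DB+Adj^DB+Noun+A3pl+Pnon+Abl"
)
print(lemma, groups)
```

For this word the result is the lemma ev and three IGs: a locative noun, a derived adjective, and a derived ablative plural noun.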
Annotation - Syntactic Level
Bu çocuk okuldan erken geldi.
This child school+Abl early come+Past+3sg
"This child came from the school early."
Dependency labels in the diagram: Determiner (Bu → çocuk), Subject (çocuk → geldi), Abl. adj. (okuldan → geldi), Modifier (erken → geldi)
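One way to hold such an analysis in code is a list of dependent-to-head links. The Python sketch below encodes the example sentence. The head assignments are one reading of the labels on the slide's diagram, and the names Dep, root_of, and heads_to_the_right are hypothetical.

```python
from collections import namedtuple

# One link: index of the dependent word, index of its head, and the label.
Dep = namedtuple("Dep", "dependent head relation")

WORDS = ["Bu", "çocuk", "okuldan", "erken", "geldi"]
DEPS = [
    Dep(0, 1, "Determiner"),        # Bu -> çocuk
    Dep(1, 4, "Subject"),           # çocuk -> geldi
    Dep(2, 4, "Ablative Adjunct"),  # okuldan -> geldi
    Dep(3, 4, "Modifier"),          # erken -> geldi
]


def root_of(n_words, deps):
    """Return the index of the single word that has no head."""
    dependents = {d.dependent for d in deps}
    roots = [i for i in range(n_words) if i not in dependents]
    if len(roots) != 1:
        raise ValueError("expected exactly one root")
    return roots[0]


def heads_to_the_right(deps):
    """True if every head follows its dependent in the sentence."""
    return all(d.head > d.dependent for d in deps)


root = root_of(len(WORDS), DEPS)
print(WORDS[root], heads_to_the_right(DEPS))
```

In this example the verb geldi is the only word without a head, and every head stands to the right of its dependent.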
Annotation - Syntactic Level
A total of 20 syntactic tags, including:
- Sentence
- Object
- Subject
- Intensifier
- Modifier
- Determiner
- Question-Particle
- Relativizer
- Coordination
- Possessor
- Classifier
- Ablative Adjunct
- Dative Adjunct
- Locative Adjunct
- Instrumental Adjunct
- ...
Morphosyntactic processing
- Tokenized text is annotated (ambiguously) with all possible morphological analyses for each token.
- This also involves unknown-word processing.
- A constraint-based disambiguation module performs limited morphological disambiguation.
- Recognition and morphological annotation of collocations
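The two stages can be illustrated with a toy Python sketch: first every token receives all of its candidate analyses, then a constraint removes readings that the context rules out. Everything here is hypothetical: the three-word lexicon, the +Unknown fallback, and the single constraint are stand-ins chosen for illustration and are not the project's analyzer or rule set.

```python
# Toy lexicon: hypothetical entries, not output of the project's analyzer.
TOY_LEXICON = {
    "bu": ["bu+Det", "bu+Pron+A3sg+Pnon+Nom"],
    "çocuk": ["çocuk+Noun+A3sg+Pnon+Nom"],
    "geldi": ["gel+Verb+Pos+Past+A3sg"],
}


def analyze(tokens):
    """Attach every candidate analysis to each token (ambiguous output)."""
    # Tokens missing from the lexicon get a placeholder 'Unknown' reading.
    return [list(TOY_LEXICON.get(t.lower(), [t + "+Unknown"])) for t in tokens]


def prefer_det_before_noun(candidates):
    """Toy constraint: before an unambiguous noun, keep only Det readings."""
    result = [list(c) for c in candidates]
    for i in range(len(result) - 1):
        has_det = any("+Det" in a for a in result[i])
        next_is_noun = all("+Noun" in a for a in result[i + 1])
        if has_det and next_is_noun:
            result[i] = [a for a in result[i] if "+Det" in a]
    return result


resolved = prefer_det_before_noun(analyze(["Bu", "çocuk", "geldi"]))
unresolved = prefer_det_before_noun(analyze(["Bu", "geldi"]))
print(resolved[0], unresolved[0])
```

The second call leaves "Bu" with both readings, which mirrors the idea of limited disambiguation: whatever the constraints cannot settle is left for the human annotator.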
Automatic Dependency Annotation
- Try to get most of the "easy" relations right automatically, to help and speed up the human annotator
- The human annotator can override the selected dependency relation if it is not right.
- Pilot work was done, but the approach was not put into practice for the METU-Sabancı Treebank
Automatic Dependency Annotation
- A set of heuristic rules tentatively attaches some of the relations automatically:
  - Appropriately case-marked nouns attach to the immediately following unambiguous postposition as objects
  - Indefinite nominative nouns attach to the first verb to the right as objects
  - Adverbs and adjuncts attach to the first verb to the right as modifiers and adjuncts
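The three heuristics can be sketched in Python as follows. This is an illustration under stated assumptions, not the pilot system: tokens are assumed to be already disambiguated dictionaries, and the field names pos, case, definite, and governs, the function names, and the constructed example sentence are all hypothetical.

```python
def first_verb_right(tokens, i):
    """Index of the first verb after position i, or None."""
    for j in range(i + 1, len(tokens)):
        if tokens[j]["pos"] == "Verb":
            return j
    return None


def attach(tokens):
    """Return {dependent_index: (head_index, relation)} for the easy cases."""
    links = {}
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        case = tok.get("case")
        if (tok["pos"] == "Noun" and case is not None and nxt is not None
                and nxt["pos"] == "Postp" and nxt.get("governs") == case):
            # Rule 1: noun in the case that the next postposition requires.
            links[i] = (i + 1, "Object")
        elif tok["pos"] == "Noun" and case == "Nom" and not tok.get("definite", True):
            # Rule 2: indefinite nominative noun -> first verb to the right.
            j = first_verb_right(tokens, i)
            if j is not None:
                links[i] = (j, "Object")
        elif tok["pos"] == "Adverb":
            # Rule 3: adverb -> first verb to the right.
            j = first_verb_right(tokens, i)
            if j is not None:
                links[i] = (j, "Modifier")
    return links


# Constructed example: "Çocuk akşama kadar hızla kitap okudu"
# (roughly: the child quickly read books until evening).
SENTENCE = [
    {"form": "Çocuk", "pos": "Noun", "case": "Nom", "definite": True},
    {"form": "akşama", "pos": "Noun", "case": "Dat", "definite": True},
    {"form": "kadar", "pos": "Postp", "governs": "Dat"},
    {"form": "hızla", "pos": "Adverb"},
    {"form": "kitap", "pos": "Noun", "case": "Nom", "definite": False},
    {"form": "okudu", "pos": "Verb"},
]
links = attach(SENTENCE)
print(links)
```

The subject "Çocuk" and the postposition "kadar" receive no link, so they would remain for the human annotator, who can also override any link the rules proposed.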
The Annotation Tool
- The text thus processed can now be further annotated with an annotation tool
  - Visualization
  - Review selections (morphology/dependency) and override (for morphology) or annotate (for dependency)
- The output of the program is morphologically disambiguated, annotated text encoded according to the XML document and Turkish Treebank formats.
Annotating the Treebank - 1 [screenshot]
Annotating the Treebank - 2 [screenshot]
Corpus Query Workbench
- A user-friendly query engine for linguists
- Organization through sessions
- Boolean or regular expression queries
- Filtering queries through bibliographic constraints such as author, genre, year
- Treebank entries viewed through a graphical interface
- Printing and saving options for outputs and session queries
- Implemented in Java SE 1.4.1; compatible with Windows XP and Linux
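The combination of a regular expression query with bibliographic filtering can be sketched in a few lines of Python. This is not the workbench's code (which is written in Java). The in-memory samples, their ids, genres, and years, and the function name query are hypothetical, chosen only to show the idea.

```python
import re

# Hypothetical samples; the metadata values are made up for illustration.
SAMPLES = [
    {"id": "s1", "genre": "Makale", "year": 2000,
     "text": "Bu çocuk okuldan erken geldi."},
    {"id": "s2", "genre": "Roman", "year": 1995,
     "text": "Çocuk okula geç geldi."},
    {"id": "s3", "genre": "Makale", "year": 1992,
     "text": "Kitap masada duruyor."},
]


def query(samples, pattern, **constraints):
    """Return (sample id, matched string) pairs for samples passing all filters."""
    regex = re.compile(pattern)
    hits = []
    for sample in samples:
        # Keep the sample only if every bibliographic constraint matches.
        if all(sample.get(key) == value for key, value in constraints.items()):
            hits.extend((sample["id"], m.group(0)) for m in regex.finditer(sample["text"]))
    return hits


# All word forms starting with "okul", then the same query within one genre.
all_hits = query(SAMPLES, r"\bokul\w*")
article_hits = query(SAMPLES, r"\bokul\w*", genre="Makale")
print(all_hits, article_hits)
```

A prefix pattern such as okul\w* is a common way to collect the inflected forms of a stem in an agglutinative language, although it also matches unrelated words that share the prefix.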
Post-project developments
- About 100 user forms received
- Some uses (from a recent survey):
  - Word sense disambiguation
  - Coherence in Turkish texts
  - Subcategorization frame acquisition
  - Teaching Turkish or NLP
- CoNLL dependency task for the METU-Sabancı Treebank (~5000 sentences)
- Frequency lists available (due to Umut Özge and Serge Sharoff)
What would we have done differently?
- More funding, more interdisciplinary organization, less turnover...
- Approaching a corpus development project like a software engineering project:
  - Doing a pilot project
  - Better quality control, version control, and documentation control processes
  - More and better automatic text capture and annotation
Requests from Users
- Extend the size and variety of the corpus
- POS-tag the whole corpus
- Enable users to load their own corpora into the query tool
- Add statistical features to the query tools
- Add semantic annotation
- Treebank-specific ones:
  - 10,000; 7,000 or 5,000 sentences?
  - Detailed stylebook
  - LEM and MORPH fields
  - Better versioning; some entries are not XML-conformant
Requirements for future generations of Turkish corpora
- Turkish National Corpus (like ANC, BNC, or CNC)
  - Spoken part
  - Automatic tools
  - Diachronic part
  - Linguistically motivated morphological and syntactic annotation
  - Some motivation for text providers
  - Well-funded, well-organized project
  - Comparable corpora of Turkic languages
Lecture 6
- Bernardini et al. A Wacky Introduction.
- April 14: your tool evaluation presentations and reports. Only two weeks left!