Using Corpora for Language Research COGS 523 Lecture

Using Corpora for Language Research, COGS 523 - Lecture 5
METU Turkish Corpus and METU-Sabancı Turkish Treebank: A Developer’s Perspective
12.9.2021, COGS 523 - Bilge Say

Related Readings
  • Bilge Say, Deniz Zeyrek, Kemal Oflazer, Umut Özge. Development of a Corpus and a Treebank for Present-day Written Turkish. In Proceedings of the Eleventh International Conference of Turkish Linguistics, August 2002.
  • Kemal Oflazer, Bilge Say, Dilek Zeynep Hakkani-Tür, Gökhan Tür. Building a Turkish Treebank. Invited chapter in Building and Exploiting Syntactically-annotated Corpora, Anne Abeillé (ed.), Kluwer Academic Publishers, 2003.
  • Nart B. Atalay, Kemal Oflazer, Bilge Say. The Annotation Process in the Turkish Treebank. In Proceedings of the EACL Workshop on Linguistically Interpreted Corpora (LINC), April 13-14, 2003, Budapest, Hungary.

Acknowledgements
  • Funding: METU-BAP, TÜBİTAK
  • METU-Sabancı Treebank: joint work with Prof. Kemal Oflazer
  • Main contributors: Umut Özge and Nart Bedin Atalay, METU; around 5 research assistants and 13 student annotators and trainees at various phases of the project
  • Various faculty members contributed ideas, especially at the initial stages
  • Agreements with 14 publishers (incl. 3 newspapers and 4 magazines)

Requirements for Corpora for Turkish?
  • Incorporating many registers representatively
  • Diachronic and synchronic
  • Electronic
  • Annotated with standard practices (typographically, morphosyntactically, semantically, prosodically...)
  • Respecting copyright laws
  • Accessible (free availability, support, etc.)
  • Searchable

What is METU Turkish Corpus?
  • A synchronic (1990+) corpus of written Turkish
  • 2,000,000 words from 201 books, 87 journal issues and issues of 3 daily newspapers, totaling 999 samples
  • Various kinds of annotation (creation of a treebank as a separate subproject)
  • Project: 1999-2003

Other Features of METU Turkish Corpus
  • Permissions for each sample obtained from the publishers
  • Opportunistic representativeness!
  • Platform-independent; XML and TEI-compliant annotation
  • Accompanying query software
  • Free for academic research purposes on signature of a user agreement
  • http://www.ii.metu.edu.tr/~corpus/

Building the Corpus
  • Text compilation (permissions, scanning if necessary, control)
  • Computer-aided annotation (TEI-XCES for general-typographic; XML-compliant in-house scheme for the treebank)
  • Control
  • Query workbench development

[Figure: Distribution of Text Types]

Annotation of the Corpus
  • Text Encoding Initiative (TEI) compliant
  • XCES (XML-based Corpus Encoding Standard) compliant, a TEI application
  • Compliant with major current corpora such as the British National Corpus

The TEI Structure - 1
  teiCorpus
    teiHeader
    TEI.2
      text
        front
        body
        back
  (Burnard, 2001)

The TEI Structure - 2
  front, body and back are structured at three levels:
  • divisions, e.g. <div1>
  • components, e.g. <p>, <list>…
  • phrase-level elements, e.g. <w>, <corr>…
  (Burnard, 2001)

A Typical Header
  <cesHeader>
    <fileDesc>
      <titleStmt>
        <h.title>00017113</h.title>
      </titleStmt>
      <extent>
        <wordCount>2008</wordCount>
        <byteCount>17929</byteCount>
      </extent>
      ...

A Typical Header (cont.)
  <sourceDesc>
    <biblStruct>
      <analytic>
        <h.title>Anadolu Dağlarının 'Bitki Avcısı': Prof. Dr. Turhan BAYTOP</h.title>
        <h.author>Nalân MAHSERECİ</h.author>
      </analytic>
      <imprint>
        <publisher>Bilim ve Ütopya</publisher>
        <pubDate>Mart 2000</pubDate>
        <pubPlace>İstanbul</pubPlace>
      </imprint>
      <idno>1301-6717</idno>
    </biblStruct>
  </sourceDesc>

A Typical Header (cont.)
  <profileDesc>
    <textClass>
      <catRef>Makale</catRef>
    </textClass>
  </profileDesc>
  <revisionDesc>
    <change>
      <changeDate>12.10.2000</changeDate>
      <respname>Sedef</respname>
      <h.item>The header part was changed.</h.item>
    </change>
  </revisionDesc>
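The header fields shown on the three slides above can be read with any XML parser. The following is a minimal sketch in Python, not part of the project's own tooling. It assumes that <sourceDesc> sits inside <fileDesc> (the slides elide that part with "..."), and it adds the closing tags the slides leave out so that the sample parses; the function name read_header and the dictionary keys are hypothetical.

```python
import xml.etree.ElementTree as ET

# Header assembled from the slide fragments. The nesting of <sourceDesc>
# inside <fileDesc> and the closing </fileDesc></cesHeader> tags are
# assumptions made only so that this sample is well-formed.
HEADER = """<cesHeader>
  <fileDesc>
    <titleStmt><h.title>00017113</h.title></titleStmt>
    <extent><wordCount>2008</wordCount><byteCount>17929</byteCount></extent>
    <sourceDesc>
      <biblStruct>
        <analytic>
          <h.title>Anadolu Dağlarının 'Bitki Avcısı': Prof. Dr. Turhan BAYTOP</h.title>
          <h.author>Nalân MAHSERECİ</h.author>
        </analytic>
        <imprint>
          <publisher>Bilim ve Ütopya</publisher>
          <pubDate>Mart 2000</pubDate>
          <pubPlace>İstanbul</pubPlace>
        </imprint>
      </biblStruct>
    </sourceDesc>
  </fileDesc>
  <profileDesc><textClass><catRef>Makale</catRef></textClass></profileDesc>
</cesHeader>"""


def read_header(xml_text):
    """Collect a few bibliographic fields from a cesHeader string."""
    root = ET.fromstring(xml_text)

    def text_of(path):
        node = root.find(path)
        return node.text.strip() if node is not None and node.text else None

    return {
        "sample_id": text_of("fileDesc/titleStmt/h.title"),
        "word_count": int(text_of("fileDesc/extent/wordCount")),
        "author": text_of(".//analytic/h.author"),
        "publisher": text_of(".//imprint/publisher"),
        "genre": text_of("profileDesc/textClass/catRef"),
    }


meta = read_header(HEADER)
print(meta)
```

Fields like these are what a query tool would need in order to filter samples by author, genre or publication date.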

A Typical Body
  <text>
    <body>
      <p>Oktay biraz önce, <q>Hadi biz de Sitem'in yanına gidelim,</q> demişti. Sitem'in, kucağında Tomurcuk Beyle Yılanlı İncirlerden yana gittiğini o da görmüştü çünkü. Ben omuz silkmekle yetindim, Oktay da üstelemedi. Sitem ikimizin yüzüne karşı da görünmez kapılar kapamıştı. Benim de elinden kayıp gidivermemden korkan Oktay beni <hi>oyalamak</hi> için geçen yaz Giray Ağabeysiyle Kirazlı Yaylaya yaptıkları bir gezintiyi anlatmaya başladı.</p>
      <p>O gün ve sonrasında olanları elbet sana da anlatmışlardır, Dalya. Gene de o kargaşa, o şaşkınlık, o panik, o kafa karmaşası yaşanmadan bilinemez...</p>
    </body>
  </text>
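Because <q> and <hi> occur inside running text, a body like the one above is mixed content: plain text and markup interleave within a paragraph. The sketch below is one possible way to separate the two in Python, using an abridged copy of the sample; it is an illustration, not the project's software.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the body sample from the slide (sentences omitted).
BODY = """<text><body>
<p>Oktay biraz önce, <q>Hadi biz de Sitem'in yanına gidelim,</q> demişti. Benim de elinden kayıp gidivermemden korkan Oktay beni <hi>oyalamak</hi> için geçen yaz Giray Ağabeysiyle Kirazlı Yaylaya yaptıkları bir gezintiyi anlatmaya başladı.</p>
<p>O gün ve sonrasında olanları elbet sana da anlatmışlardır, Dalya.</p>
</body></text>"""

root = ET.fromstring(BODY)

# Plain text of each paragraph with all inline markup flattened away.
paragraphs = ["".join(p.itertext()).strip() for p in root.iter("p")]

# Inline annotations collected separately.
quotes = [q.text for q in root.iter("q")]
highlights = [h.text for h in root.iter("hi")]

print(len(paragraphs), quotes, highlights)
```

Keeping the flattened text and the inline annotations apart lets a word search run over the running text while quoted speech or highlighted words remain retrievable on their own.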

[Figure: Entering XCES Annotations - 1]

[Figure: Entering XCES Annotations - 2]

METU-Sabancı treebank project
  • Annotation of morphological and (surface) syntactic features in a dependency-inspired manner
  • A subcorpus containing 7,300 annotated sentences and 65,000 words: initially whole samples selected from the main corpus. (Another version contains 5,600 sentences.)
  • Genre distribution is proportional to that of the METU Corpus

Building the Treebank
  • Morphological analysis of selected samples from the corpus
  • Preprocessing of the collocations
  • (Manual) disambiguation of the morphological parses
  • Annotating with the dependency structure
  • Control

Annotation - Lexical Level
  • A word can be seen as a sequence of inflectional groups (IGs) of the form Lemma+Infl1^DB+Infl2^DB+…^DB+Infln
  • evinizdekilerden (from the ones at your house):
      ev+Noun+A3sg+P2pl+Loc^DB+Adj^DB+Noun+A3pl+Pnon+Abl
    Each segment delimited by ^DB is one inflectional group.
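The IG notation above can be unpacked mechanically by cutting the analysis at each ^DB (derivation boundary) marker. A minimal sketch, assuming analyses are given as plain strings in exactly the format shown on the slide; the function name is hypothetical.

```python
def split_inflectional_groups(analysis):
    """Split 'Lemma+Infl1^DB+Infl2...' into the lemma and a list of IGs.

    Each IG is returned as a list of its feature tags.
    """
    groups = analysis.split("^DB")
    first = groups[0].split("+")
    lemma, first_tags = first[0], first[1:]
    # Later groups start with a leading '+', which is dropped before splitting.
    rest = [g.lstrip("+").split("+") for g in groups[1:]]
    return lemma, [first_tags] + rest


lemma, igs = split_inflectional_groups(
    "ev+Noun+A3sg+P2pl+Loc^DB+Adj^DB+Noun+A3pl+Pnon+Abl"
)
print(lemma)
for ig in igs:
    print(ig)
```

For evinizdekilerden this gives the lemma ev and three IGs: a locative noun, a derived adjective, and a derived ablative plural noun.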

Annotation - Syntactic Level
  Bu çocuk okuldan erken geldi.
  This child school+Abl early come+Past+3sg
  "This child came from the school early."
  Dependency links:
  • Bu → çocuk: Determiner
  • çocuk → geldi: Subject
  • okuldan → geldi: Abl. adj
  • erken → geldi: Modifier
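One simple way to hold such an analysis in memory is a list of labelled arcs from dependent to head. The sketch below encodes the example sentence this way. The head assignments are a reading of the slide's diagram, and the root arc labelled "Sentence" with head index 0 is an assumption made for this illustration, not the treebank's documented encoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arc:
    dependent: int  # 1-based position of the dependent word
    head: int       # 1-based position of the head word; 0 marks the root
    relation: str


WORDS = ["Bu", "çocuk", "okuldan", "erken", "geldi"]
ARCS = [
    Arc(1, 2, "Determiner"),
    Arc(2, 5, "Subject"),
    Arc(3, 5, "Ablative Adjunct"),
    Arc(4, 5, "Modifier"),
    Arc(5, 0, "Sentence"),  # assumed root convention
]


def dependents_of(head, arcs):
    """Return (word, relation) pairs for everything attached to `head`."""
    return [(WORDS[a.dependent - 1], a.relation) for a in arcs if a.head == head]


def every_word_has_one_head(arcs, n_words):
    """Check that each word position appears exactly once as a dependent."""
    return sorted(a.dependent for a in arcs) == list(range(1, n_words + 1))


print(dependents_of(5, ARCS))
print(every_word_has_one_head(ARCS, len(WORDS)))
```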

Annotation - Syntactic Level
  • Sentence
  • Object
  • Subject
  • Intensifier
  • Modifier
  • Determiner
  • Question-Particle
  • Relativizer
  • Coordination
  • Possessor
  • Classifier
  • Ablative Adjunct
  • Dative Adjunct
  • Locative Adjunct
  • Instrumental Adjunct
  • ...
  Total of 20 syntactic tags

Morphosyntactic processing
  • Tokenized text is annotated (ambiguously) with all possible morphological analyses for each token.
      • This also involves unknown-word processing
  • A constraint-based disambiguation module performs limited morphological disambiguation.
  • Recognition and morphological annotation of collocations

Automatic Dependency Annotation
  • Try to get most of the “easy” relations right automatically to help and speed up the human annotator
  • The human annotator can override the selected dependency relation if it is not right.
  • Pilot work was done, but the approach was not put into practice in the METU-Sabancı treebank

Automatic Dependency Annotation
  • A set of heuristic rules tentatively attaches some of the relations automatically:
      • Appropriately case-marked nouns attach to the immediately following unambiguous postposition as objects
      • Indefinite nominative nouns attach to the first verb to the right as objects
      • Adverbs and adjuncts attach to the first verb to the right as modifiers and adjuncts
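The three heuristics above can be sketched as a single left-to-right pass over disambiguated tokens. This is only an illustration of the idea, not the pilot implementation: the Token fields, the coarse tag names, the postp_case table and the example sentence are all hypothetical, and treating case-marked nouns as adjuncts by case alone is a simplification.

```python
from dataclasses import dataclass

# Adjunct labels taken from the syntactic tag list; keyed by hypothetical case tags.
ADJUNCT_LABELS = {
    "Abl": "Ablative Adjunct",
    "Dat": "Dative Adjunct",
    "Loc": "Locative Adjunct",
    "Ins": "Instrumental Adjunct",
}


@dataclass
class Token:
    form: str
    pos: str                  # hypothetical coarse tag: Noun, Verb, Adv, Postp
    case: str = ""            # e.g. Nom, Abl; empty when not applicable
    indefinite: bool = False


def first_verb_right(tokens, i):
    """Index of the first verb after position i, or None."""
    for j in range(i + 1, len(tokens)):
        if tokens[j].pos == "Verb":
            return j
    return None


def attach_easy_relations(tokens, postp_case):
    """Return {dependent_index: (head_index, relation)} for the easy cases."""
    arcs = {}
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        # Rule 1: noun carrying the case the following postposition governs.
        if (tok.pos == "Noun" and nxt is not None and nxt.pos == "Postp"
                and postp_case.get(nxt.form) == tok.case):
            arcs[i] = (i + 1, "Object")
            continue
        verb = first_verb_right(tokens, i)
        if verb is None:
            continue
        # Rule 2: indefinite nominative noun -> object of the first verb.
        if tok.pos == "Noun" and tok.case == "Nom" and tok.indefinite:
            arcs[i] = (verb, "Object")
        # Rule 3: adverbs and case-marked adjuncts -> first verb to the right.
        elif tok.pos == "Adv":
            arcs[i] = (verb, "Modifier")
        elif tok.pos == "Noun" and tok.case in ADJUNCT_LABELS:
            arcs[i] = (verb, ADJUNCT_LABELS[tok.case])
    return arcs


# Invented example: "Çocuk okuldan sonra hemen kitap okudu."
sentence = [
    Token("Çocuk", "Noun", "Nom"),
    Token("okuldan", "Noun", "Abl"),
    Token("sonra", "Postp"),
    Token("hemen", "Adv"),
    Token("kitap", "Noun", "Nom", indefinite=True),
    Token("okudu", "Verb"),
]
arcs = attach_easy_relations(sentence, postp_case={"sonra": "Abl"})
print(arcs)
```

Words the rules do not cover, such as the subject here, are left unattached for the human annotator, which matches the stated aim of pre-filling only the easy relations.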

The Annotation Tool
  • The text thus processed can now be further annotated with an annotation tool
      • Visualization
      • Review selections (morph/dependency) and override (for morphology) or annotate (for dependency)
  • The output of the program is morphologically disambiguated, annotated text encoded according to the XML document and Turkish Treebank formats.

[Figure: Annotating the Treebank - 1]

[Figure: Annotating the Treebank - 2]

Corpus Query Workbench
  • A user-friendly query engine for linguists
  • Organization through sessions
  • Boolean or regular expression queries
  • Filtering queries through bibliographic constraints such as author, genre, year
  • Treebank entries viewed through a graphical interface
  • Printing and saving options for outputs and session queries
  • Implemented in Java SE 1.4.1, compatible with Windows XP/Linux
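The combination of regular expression queries with bibliographic filters can be illustrated in a few lines. The sketch below is in Python and is not the workbench's Java code; the Sample fields, the query function and the two toy samples are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Sample:
    sample_id: str
    author: str
    genre: str
    year: int
    text: str


def query(samples, pattern, **constraints):
    """Yield (sample_id, match) for regex hits in samples meeting all constraints.

    Constraints are exact-match filters on Sample fields, e.g. genre="Makale".
    """
    regex = re.compile(pattern)
    for s in samples:
        if all(getattr(s, field) == value for field, value in constraints.items()):
            for m in regex.finditer(s.text):
                yield s.sample_id, m.group(0)


# Toy data, invented for this example.
samples = [
    Sample("s1", "A", "Roman", 1995, "Bu çocuk okuldan erken geldi."),
    Sample("s2", "B", "Makale", 2000, "Çocuk okuldan sonra kitap okudu."),
]

all_hits = list(query(samples, r"\w+dan\b"))
article_hits = list(query(samples, r"\w+dan\b", genre="Makale"))
print(all_hits)
print(article_hits)
```

Applying the bibliographic filter before the regular expression avoids scanning samples that would be discarded anyway.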

Post-project developments
  • About 100 user forms received
  • Some uses (from a recent survey)
      • Word sense disambiguation
      • Coherence in Turkish texts
      • Subcategorization frame acquisition
      • Teaching Turkish or NLP
  • CoNLL dependency task for the METU-Sabancı Treebank (~5000 sentences)
  • Frequency lists available (due to Umut Özge and Serge Sharoff)

What would we have done differently?
  • More funding, more interdisciplinary organization, less turnover...
  • Approaching a corpus development project like a software engineering project...
      • Doing a pilot project
      • Better quality control, version control and documentation control processes
      • More and better automatic text capture and annotation

Requests from Users
  • Extend the size and variety of the corpus
  • POS tag the whole corpus
  • Enable users to load their own corpora into the query tool
  • Add statistical features to the query tools
  • Add semantic annotation
  • Treebank-specific ones:
      • 10,000; 7,000 or 5,000 sentences?
      • Detailed stylebook
      • LEM and MORPH fields
      • Better versioning; some entries do not conform to XML

Requirements for future generations of Turkish corpora
  • Turkish National Corpus (like ANC, BNC, or CNC)
      • Spoken part
      • Automatic tools
      • Diachronic part
      • Linguistically motivated morphological and syntactic annotation
      • Some motivation for text providers
      • Well-funded, well-organized project
  • Comparable corpora of Turkic languages

Lecture 6
  • Bernardini et al. A Wacky Introduction.
  • April 14: your tool evaluation presentations and reports. Only two weeks left!