Tools for Historical Corpus Research, and a Corpus of Latin

Barbara McGillivray, Oxford University Press
Adam Kilgarriff, Lexical Computing Ltd.

Outline
• Latin corpora
• Sketch Engine
• LatinISE: a Latin corpus for SkE
    • Collecting the texts
    • Metadata
    • Automatic annotation
    • Demo
• Conclusion

Latin corpora


Overview
• Index Thomisticus (1980) by R. Busa S.J.
    • First electronic corpus
    • 11 million words; lemmatized
• Digital editions
    • Perseus Digital Library (10 million words)
    • Corpus Grammaticorum Latinorum
    • Library of Latin Texts (50 million)
    • Musisque Deoque

Morphological annotation
• Manual
    • LASLA (1.5 million words)
• Automatic
    • Morpheus (Perseus)
    • CHLT-LEMLAT (ILC-CNR)
    • Words (W. Whitaker), Quick Latin

Treebanks
• Latin Dependency Treebank: 53,000 tokens
    • Caesar, Cicero, Jerome, Ovid, Petronius, Propertius, Sallust, Vergil
• Index Thomisticus Treebank: 100,000
    • Thomas Aquinas
• PROIEL Project: 100,000
    • Translations of the New Testament in Latin, Greek, Old Church Slavonic, Armenian, Gothic

Motivation
• Latin is still a less-resourced language
• Features of our corpus
    • Size: 13 million words
    • Provided with metadata
    • Automatically annotated
        • Lemmatized
        • Part-of-speech tagged
    • Included in a clever corpus query system

Sketch Engine


• Corpus query tool, since 2003
• Widely used by lexicographers
    • Commercial
        • OUP, Collins, Macmillan, Le Robert, Cornelsen, Shogakukan
    • National dictionary projects
        • Bulgaria, Czech Republic, Estonia, Netherlands, Slovakia, Slovenia
• Universities
    • Linguistics, language research, NLP, language teaching

44 languages and counting

Large corpora ready-to-use for:
Arabic, Bengali, Bulgarian, Chinese, Czech, Croatian, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Gujarati, Hebrew, Hindi, Indonesian, Irish, Italian, Japanese, Korean, Latin, Malayalam, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Setswana, Slovak, Slovene, Spanish, Swahili, Swedish, Tamil, Telugu, Thai, Turkish, Urdu, Vietnamese

• Handles large corpora
    • Largest to date: 8 billion words
    • Fast
• Web-based: no software to install
• Build ‘instant corpora’ from the web
• Load your own corpus
    • Quota of space on SkE server
• Word sketches
    • One-page, automatic accounts of a word’s grammatical and collocational behaviour
• Free 30-day trial: sketchengine.co.uk


Add your language/corpus?
• In your personal area, or maybe
• For all SkE users
    • Always interested in adding more resources
    • If it’s a corpus that others may want: quid pro quo: free use of tool
        • Contact: inquiries@sketchengine.co.uk

LatinISE: a Latin corpus in the Sketch Engine

Collecting the texts
• Three online digital libraries
    • LacusCurtius http://penelope.uchicago.edu/Thayer/I/Roman/home.html
    • IntraText http://www.intratext.com
    • Musisque Deoque http://www.mqdq.it
• From HTML to verticalised text
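
The step from HTML to verticalised text can be sketched roughly as follows. This is a minimal illustration, not the pipeline used for the corpus: the tokenizer is deliberately naive, and the names `TextExtractor`, `verticalise` and the document id are hypothetical. It strips the markup and writes one token per line inside a `<doc>` structure.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML page, skipping scripts and styles."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def verticalise(html, doc_id):
    """Return the page as vertical text: one token per line inside <doc> tags."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    # Naive tokenizer (an assumption): runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    lines = ['<doc id="%s">' % doc_id] + tokens + ["</doc>"]
    return "\n".join(lines)


sample = "<html><body><p>Gallia est omnis divisa in partes tres.</p></body></html>"
print(verticalise(sample, "caesar_bg_1"))
```

A real conversion would also need per-library handling of navigation text, notes and character encoding, which this sketch leaves out.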

Metadata
• Author; title
• Genre (prose or poetry)
• Era; date; century
    • Oldest: Senatus consultum de Bacchanalibus (186 B.C.)
    • Most recent: Congregazione per la Dottrina della Fede, Dominus Iesus (2000)
• Metadata used to delete duplicated texts
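
One way metadata could be used to delete duplicated texts is sketched below. The slides do not say which fields were compared, so matching on a normalised (author, title) pair is an assumption, and the record layout and the names `metadata_key` and `drop_duplicates` are hypothetical.

```python
import re


def metadata_key(record):
    """Normalise author and title so trivial spelling differences still match."""
    def norm(value):
        return re.sub(r"[^a-z0-9]+", " ", value.lower()).strip()
    return (norm(record["author"]), norm(record["title"]))


def drop_duplicates(records):
    """Keep the first record seen for each (author, title) key."""
    seen = set()
    unique = []
    for record in records:
        key = metadata_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Toy records, invented for illustration.
texts = [
    {"author": "Caesar", "title": "De Bello Gallico", "source": "library_a"},
    {"author": "CAESAR", "title": "De bello Gallico.", "source": "library_b"},
    {"author": "Vergilius", "title": "Aeneis", "source": "library_b"},
]
unique_texts = drop_duplicates(texts)
print([t["source"] for t in unique_texts])
```

Keeping the first occurrence means the order in which the libraries are processed decides which copy survives.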

Annotation
• Natural Language Processing
    • Lemmatization
        • PROIEL Project’s morphological analyser (Dag Haug)
        • Quick Latin
    • POS-tagging
        • TreeTagger (H. Schmid, IMS, University of Stuttgart)
• Advantages
    • Not prone to human errors, fast, less costly
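
The slides name two lemmatization resources and one tagger but do not say how their outputs were combined. The sketch below assumes a simple fallback: try a primary lexicon, then a secondary one, then keep the word form. The dictionaries stand in for the real tools, and the tags, lexicon entries and function names are invented for illustration.

```python
def lemmatize(token, primary, secondary):
    """Return the lemma from the primary lexicon, else the secondary, else the lowercased token."""
    form = token.lower()
    if form in primary:
        return primary[form]
    if form in secondary:
        return secondary[form]
    return form


def annotate(tokens, tags, primary, secondary):
    """Build one tab-separated 'word<TAB>tag<TAB>lemma' line per token."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    return [
        "\t".join((tok, tag, lemmatize(tok, primary, secondary)))
        for tok, tag in zip(tokens, tags)
    ]


# Toy lexicons and tags, invented for illustration only.
primary_lexicon = {"gallia": "Gallia", "est": "sum", "divisa": "divido"}
secondary_lexicon = {"omnis": "omnis", "partes": "pars"}
tokens = ["Gallia", "est", "omnis", "divisa", "in", "partes"]
tags = ["N", "V", "ADJ", "V", "PREP", "N"]
annotated = annotate(tokens, tags, primary_lexicon, secondary_lexicon)
for line in annotated:
    print(line)
```

Each output line carries the word form, its tag and its lemma, which is the kind of column layout a vertical corpus file uses for annotated tokens.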

The corpus in SkE

Subcorpora
• Early (VII-II cent. B.C.): 401,557
• Classical (I cent. B.C.): 2,275,030
• Post-classical (I-VI cent. A.D.): 6,080,181
• Medieval (VII-XIV cent. A.D.): 2,920,446
• Modern (XV-XXI cent. A.D.): 2,034,940
• Poetry: 3,818,603
• Prose: 9,935,401
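
As a quick check on these figures, the snippet below adds up the counts by era and by genre and prints each era's share. The numbers are copied from the slide; the variable names are only for illustration. Both totals come to roughly 13.7 million, in line with the corpus size given earlier, though they differ slightly from each other and the slides do not explain why.

```python
era_counts = {
    "Early": 401557,
    "Classical": 2275030,
    "Post-classical": 6080181,
    "Medieval": 2920446,
    "Modern": 2034940,
}
genre_counts = {"Poetry": 3818603, "Prose": 9935401}

era_total = sum(era_counts.values())
genre_total = sum(genre_counts.values())

# Print each era with its count and its percentage of the era total.
for era, count in era_counts.items():
    share = 100.0 * count / era_total
    print(f"{era:<15}{count:>12,}{share:>7.1f}%")

print(f"Total by era:   {era_total:,}")
print(f"Total by genre: {genre_total:,}")
```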

A first search

Cum (conjunction)

Cum (preposition)

Search a phrase

Magna pars vs. pars magna

Context: Dico/puto/credo quod
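
Searches like these match sequences of tokens by word form or lemma. In Sketch Engine they are written in its corpus query language, for instance something like `[lemma="dico|puto|credo"] [word="quod"]`, assuming the corpus exposes attributes named `word` and `lemma`. The Python sketch below imitates that idea on a toy list of (word, lemma) pairs. It illustrates sequence matching only, not how Sketch Engine is implemented, and all names and data in it are invented.

```python
def find_sequences(tokens, pattern):
    """Yield start indexes where consecutive tokens satisfy each predicate in pattern."""
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(test(tok) for test, tok in zip(pattern, window)):
            yield start


# Toy (word, lemma) pairs, invented for illustration.
tokens = [
    ("dico", "dico"), ("quod", "quod"), ("magna", "magnus"), ("pars", "pars"),
    ("credunt", "credo"), ("quod", "quod"), ("pars", "pars"), ("magna", "magnus"),
]

# A verb of saying or thinking (matched by lemma) followed by the word form "quod".
verb_then_quod = [
    lambda t: t[1] in {"dico", "puto", "credo"},
    lambda t: t[0] == "quod",
]
# The two word orders compared in the demo, matched by word form.
magna_pars = [lambda t: t[0] == "magna", lambda t: t[0] == "pars"]
pars_magna = [lambda t: t[0] == "pars", lambda t: t[0] == "magna"]

print(list(find_sequences(tokens, verb_then_quod)))
print(list(find_sequences(tokens, magna_pars)))
print(list(find_sequences(tokens, pars_magna)))
```

Matching on the lemma is what lets one pattern find inflected forms such as "credunt" as well as "credo".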


Conclusion


• A new large resource for a less-resourced language
• NLP tools on a dead language
• Advanced corpus queries with Sketch Engine
    • http://www.sketchengine.co.uk
• Future
    • Morphological tags (case, mood, voice, …)
    • Syntactic tags (Word Sketches)