Gigafida, Kres, ccGigafida and ccKres
Simon Krek
Centre for Language Resources and Technologies, University of Ljubljana
"Jožef Stefan" Institute, Artificial Intelligence Laboratory

Contents
• Centre for Language Resources and Technologies (University of Ljubljana)
• History (FIDA – FidaPLUS – Gigafida)
• From Gigafida 1.0 to Gigafida 2.0
  • Project plan
  • New corpus material
  • Processing tools and web concordancer
  • Corpus of standard Slovene
  • Availability
• Plans for the future
SlaviCorp, Prague, 25. 9. 2018

Centre for Language Resources and Technologies
• Part of the Network of Research and Infrastructural Centres at the University of Ljubljana
• Planned from 2012, established in 2015
• Faculty of Arts, Faculty of Computer and Information Science, Faculty of Social Sciences, Faculty of Education, Faculty of Electrical Engineering
• Aimed at scientific research and the development of language infrastructure for contemporary (standard) Slovene
• Language description, language standardization, language technologies, terminology, multilinguality, special needs
• Web page: https://www.cjvt.si/en/

History – FIDA and FidaPLUS
• FIDA (1997-2000)
  • Faculty of Arts (Uni-Lj), Jožef Stefan Institute, DZS publishing house, Amebis
  • 100 million words
  • Amebis rule-based tagger (≈85% accuracy)
  • Web concordancer (for partners)
• FidaPLUS (2004-2006)
  • Two applicative research projects
  • 620 million words
  • Available on the web (general access with registration)
  • Same tagger & web concordancer

History – Gigafida
• Communication in Slovene (2008-2013)
  • Web page: http://eng.slovenscina.eu/
  • Corpora (written, spoken, learners), lexicon, lexical database, tagger, parser, training corpus, web portals (concordancer etc.)
• Gigafida, Kres, ccGigafida, ccKres (2012)
  • Gigafida – 1.2 billion words
  • Kres – 100 million (balanced)
  • ccGigafida & ccKres (open licence) – 100/10 million words
  • Statistical tagger (≈92% accuracy)
  • New concordancer, focus on user-friendliness (surveys, log analysis etc.)

History – Gigafida/Kres (genre)
[Figure: breakdown by genre. Gigafida – 1.2 billion; Kres – 100 million]

Gigafida 2.0 – project plan
• Project
  • Upgrade of Gigafida, Kres, ccGigafida and ccKres (2014-2018)
  • Ministry of Culture
  • Centre for Language Resources and Technologies
• Plan
  • New material (1.5 B)
  • Processing (new tools)
  • Different focus (standard Slovene)
  • Availability (distribution)

Gigafida 2.0 – new material
• Solving two problems
  • Outdatedness
  • Underrepresentation
• Outdatedness
  • News corpus (http://newsfeed.ijs.si/)
    • news portals (rtvslo.si, 24ur.com, siol.net, žurnal24.si, sta.si)
    • daily newspapers (delo.si, dnevnik.si, vecer.si etc.)
  • From 2012 onwards, two batches (300 M words)
• Underrepresentation
  • textbooks and literature (high sales or borrowing in libraries) – 10 M

Gigafida 2.0 – processing & concordancer
• Tagging
  • Obeliks tagger (Gigafida 1.0)
  • Reldi tagger (94.27% accuracy)
  • Meta-tagger
    • Training corpus: 10,000 tokens with different decisions
• Tokenization
  • Rule-based (from Obeliks tool)
  • Obeliks4J (https://github.com/clarinsi/Obeliks4J)
• Deduplication
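The meta-tagger training corpus consists of tokens on which the two taggers made different decisions. A minimal sketch of how such tokens could be collected is given here. The function name, the data layout and the example tags are hypothetical and chosen only for illustration; this is not the code of the actual meta-tagger.

```python
from typing import List, NamedTuple


class Disagreement(NamedTuple):
    """One token on which the two taggers chose different tags."""
    index: int
    token: str
    tag_a: str
    tag_b: str


def find_disagreements(tokens: List[str],
                       tags_a: List[str],
                       tags_b: List[str]) -> List[Disagreement]:
    """Return every position where tagger A and tagger B disagree."""
    if not (len(tokens) == len(tags_a) == len(tags_b)):
        raise ValueError("tokens and both tag sequences must be aligned")
    return [
        Disagreement(i, tok, a, b)
        for i, (tok, a, b) in enumerate(zip(tokens, tags_a, tags_b))
        if a != b
    ]


# Illustrative input only: the tags below are made-up examples,
# not real output of Obeliks or Reldi.
tokens = ["Jezik", "je", "živ", "."]
obeliks_tags = ["Ncmsn", "Va-r3s-n", "Agpmsnn", "Z"]
reldi_tags = ["Ncmsn", "Va-r3s-n", "Agpmsny", "Z"]

sample = find_disagreements(tokens, obeliks_tags, reldi_tags)
for d in sample:
    print(d.index, d.token, d.tag_a, d.tag_b)
```

Each collected token would then be judged manually to decide which tagger, if either, was right.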

Gigafida 2.0 – tagging
• Meta-tagger (training corpus: 10,000 tokens with different decisions)
  • Both taggers are wrong or not enough context: 1,121
  • Same tag, different lemma: 858
  • Excluded: 1,979
  • Training set: 8,021
  • Obeliks is right: 2,654, 33.09%
  • Reldi is right: 5,367, 66.91%
• Improvement: 0.71 (tags) 0.73 (lemmas)
• Tool: https://github.com/clarinsi/meta-tagger
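The figures on this slide are mutually consistent, which the following short check makes explicit. It uses only the numbers given above; the variable names are ours.

```python
# Counts taken from the slide.
total_disagreements = 10_000
both_wrong_or_no_context = 1_121
same_tag_different_lemma = 858
obeliks_right = 2_654
reldi_right = 5_367

# Tokens removed before training, and what remains.
excluded = both_wrong_or_no_context + same_tag_different_lemma
training_set = total_disagreements - excluded

# The two "is right" counts should partition the training set.
assert obeliks_right + reldi_right == training_set

obeliks_share = round(100 * obeliks_right / training_set, 2)
reldi_share = round(100 * reldi_right / training_set, 2)

print(excluded, training_set, obeliks_share, reldi_share)
```

In other words, among the disagreements kept for training, Reldi was judged correct about twice as often as Obeliks.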

Gigafida 2.0 – deduplication
• Old material from FIDA and FidaPLUS
• Onion tool (http://corpus.tools/wiki/Onion) parameters
  • n-gram (n, default 7)
  • Percent of n-gram overlap (p, default 0.5)
• Parameters considered
  • n = 7, p = 0.7 (75.8%)
  • n = 9, p = 0.5 (75.1%), chosen
• Gigafida 1.1 DeDup
  • http://clarin.si/noske/
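To illustrate what the two parameters control, here is a simplified sketch of n-gram overlap deduplication. It is not Onion's implementation: whitespace tokenization, whole-text granularity, the strict "greater than" comparison and the function names are all assumptions made for this example.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """All distinct n-grams (as tuples) of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(texts: List[str], n: int = 9, p: float = 0.5) -> List[str]:
    """Keep a text unless more than a share p of its n-grams was already seen.

    Texts are processed in order; n-grams of kept texts are remembered.
    Texts shorter than n tokens have no n-grams and are always kept.
    """
    seen: Set[Tuple[str, ...]] = set()
    kept: List[str] = []
    for text in texts:
        grams = ngrams(text.split(), n)
        overlap = len(grams & seen) / len(grams) if grams else 0.0
        if overlap > p:
            continue  # treated as a near-duplicate of earlier material
        kept.append(text)
        seen |= grams
    return kept


# Toy example with n = 3 so that the short texts produce n-grams.
texts = ["a b c d e f", "a b c d e x", "u v w x y z"]
result = deduplicate(texts, n=3, p=0.5)
print(result)
```

In the toy example the second text shares three of its four 3-grams with the first (overlap 0.75), so it is dropped, while the third text shares none and is kept. A larger n makes matches stricter, and a larger p tolerates more repeated material before a text is removed.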

Gigafida 2.0 – web concordancer

Gigafida 2.0 – corpus of standard Slovene
• Three categories
  • (1) texts for which there is a high probability that the authors were required or wanted to write in standard Slovene – textbooks, literature, magazines, newspapers, legislation etc.
  • (2) media that intentionally use non-standard language, e.g. a newspaper published by the Slovenian minority in Italy (Novi Matajur)
  • (3) computer-mediated communication (CMC) typical of social media, forums etc.
• Gigafida 2.0 (standard) – Janes (CMC) – specialized (KAS, IMP etc.)

Gigafida 2.0 – availability
• Distributor: Centre for Language Resources and Technologies
• Access: CLARIN.SI repository
• ccGigafida & ccKres
  • Creative Commons (CC BY-SA)
• Gigafida & Kres
  • Standardized (online) contract
  • Authentication
  • Scrambling
  • January 2019

Plans for the future
• Development: Centre for Language Resources and Technologies
• Distribution: CLARIN.SI repository
• Research programme Uni-Lj (Dec 2018)
• Yearly upgrade?
• Neural tagger
• Ongoing projects
  • “New grammar” & KOLOS
  • ELEXIS (European Lexicographic Infrastructure)
• Annotation (syntactic treebank, SRL, NER, semantic types?)
• Extraction of data (n-grams, collocations, valency patterns etc.)