Gigafida, Kres, ccGigafida and ccKres
Simon Krek
Centre for Language Resources and Technologies, University of Ljubljana
"Jožef Stefan" Institute, Artificial Intelligence Laboratory

Contents
• Centre for Language Resources and Technologies (University of Ljubljana)
• History (FIDA – FidaPLUS – Gigafida)
• From Gigafida 1.0 to Gigafida 2.0
  • Project plan
  • New corpus material
  • Processing tools and web concordancer
  • Corpus of standard Slovene
  • Availability
• Plans for the future

SlaviCorp, Prague, 25. 9. 2018

Centre for Language Resources and Technologies
• Part of the Network of Research and Infrastructural Centres at the University of Ljubljana
• Planned from 2012, established in 2015
• Faculty of Arts, Faculty of Computer and Information Science, Faculty of Social Sciences, Faculty of Education, Faculty of Electrical Engineering
• Aimed at scientific research and the development of language infrastructure for contemporary (standard) Slovene
• Language description, language standardization, language technologies, terminology, multilinguality, special needs
• Web page: https://www.cjvt.si/en/

History – FIDA and FidaPLUS
• FIDA (1997-2000)
  • Faculty of Arts (Uni-Lj), Jožef Stefan Institute, DZS publishing house, Amebis
  • 100 million words
  • Amebis rule-based tagger (≈85% accuracy)
  • Web concordancer (for partners)
• FidaPLUS (2004-2006)
  • Two applied research projects
  • 620 million words
  • Available on the web (general access with registration)
  • Same tagger & web concordancer

History – Gigafida
• Communication in Slovene (2008-2013)
  • Web page: http://eng.slovenscina.eu/
  • Corpora (written, spoken, learners), lexicon, lexical database, tagger, parser, training corpus, web portals (concordancer etc.)
• Gigafida, Kres, ccGigafida, ccKres (2012)
  • Gigafida – 1.2 billion words
  • Kres – 100 million (balanced)
  • ccGigafida & ccKres (open licence) – 100/10 million words
  • Statistical tagger (≈92% accuracy)
  • New concordancer, focus on user-friendliness (surveys, log analysis etc.)

History – Gigafida/Kres (genre)
[Figure: genre composition of Gigafida (1.2 billion words) and Kres (100 million words)]

Gigafida 2.0 – project plan
• Project
  • Upgrade of Gigafida, Kres, ccGigafida and ccKres (2014-2018)
  • Ministry of Culture
  • Centre for Language Resources and Technologies
• Plan
  • New material (1.5 B)
  • Processing (new tools)
  • Different focus (standard Slovene)
  • Availability (distribution)

Gigafida 2.0 – new material
• Solving two problems
  • Outdatedness
  • Underrepresentation
• Outdatedness
  • News corpus (http://newsfeed.ijs.si/)
    • news portals (rtvslo.si, 24ur.com, siol.net, žurnal24.si, sta.si)
    • daily newspapers (delo.si, dnevnik.si, vecer.si etc.)
  • From 2012 onwards, two batches (300 M words)
• Underrepresentation
  • textbooks and literature (high sales or borrowing in libraries) – 10 M

Gigafida 2.0 – processing & concordancer
• Tagging
  • Obeliks tagger (Gigafida 1.0)
  • Reldi tagger (94.27% accuracy)
  • Meta-tagger
    • Training corpus: 10,000 tokens with different decisions
• Tokenization
  • Rule-based (from Obeliks tool)
  • Obeliks4J (https://github.com/clarinsi/Obeliks4J)
• Deduplication
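
The meta-tagger idea, choosing between two taggers on the tokens where their decisions differ, can be illustrated with a minimal sketch. This is not the clarinsi/meta-tagger implementation: the majority-vote lookup, the function names `train_meta_tagger` and `choose_tag`, the fallback to the second tagger, and the toy tags are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_meta_tagger(examples):
    """Learn, for each pair of conflicting tags, which tagger is right more often.

    `examples` is an iterable of (tag_a, tag_b, gold_tag) triples for tokens
    where tagger A and tagger B disagree (hypothetical input format).
    """
    votes = defaultdict(Counter)
    for tag_a, tag_b, gold in examples:
        if gold == tag_a:
            votes[(tag_a, tag_b)]["a"] += 1
        elif gold == tag_b:
            votes[(tag_a, tag_b)]["b"] += 1
        # Tokens where neither tagger is right carry no signal and are skipped.
    return {pair: counts.most_common(1)[0][0] for pair, counts in votes.items()}


def choose_tag(model, tag_a, tag_b, default="b"):
    """Return the agreed tag, or the tag of the tagger the model prefers."""
    if tag_a == tag_b:
        return tag_a
    winner = model.get((tag_a, tag_b), default)  # unseen pair: assumed fallback
    return tag_a if winner == "a" else tag_b


# Toy training data with illustrative tag strings.
model = train_meta_tagger([
    ("Ncmsn", "Ncmsa", "Ncmsa"),
    ("Ncmsn", "Ncmsa", "Ncmsa"),
    ("Ncmsn", "Ncmsa", "Ncmsn"),
    ("Agpmsnn", "Rgp", "Agpmsnn"),
])
print(choose_tag(model, "Ncmsn", "Ncmsa"))
```

A real meta-tagger would likely use richer features (context, word form) than the tag pair alone; the lookup table only shows where the disagreement-only training corpus enters the picture.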

Gigafida 2.0 – tagging
• Meta-tagger (training corpus: 10,000 tokens with different decisions)
  • Both taggers are wrong or not enough context: 1,121
  • Same tag, different lemma: 858
  • Excluded: 1,979
  • Training set: 8,021
  • Obeliks is right: 2,654, 33.09%
  • Reldi is right: 5,367, 66.91%
• Improvement: 0.71 (tags) 0.73 (lemmas)
• Tool: https://github.com/clarinsi/meta-tagger
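
The counts on this slide are internally consistent, which a short calculation makes explicit. The variable names are chosen here for illustration; the numbers are the ones given above.

```python
# Figures from the slide; variable names are illustrative.
disagreements = 10_000
both_wrong_or_no_context = 1_121
same_tag_different_lemma = 858

excluded = both_wrong_or_no_context + same_tag_different_lemma
training_set = disagreements - excluded

obeliks_right = 2_654
reldi_right = 5_367

# Shares are computed over the training set, not over all 10,000 tokens.
obeliks_share = round(100 * obeliks_right / training_set, 2)
reldi_share = round(100 * reldi_right / training_set, 2)

print(excluded, training_set, obeliks_share, reldi_share)
# prints: 1979 8021 33.09 66.91
```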

Gigafida 2.0 – deduplication
• Old material from FIDA and FidaPLUS
• Onion tool (http://corpus.tools/wiki/Onion) parameters
  • n-gram (n, default 7)
  • Percent of n-gram overlap (p, default 0.5)
• Parameters considered
  • n = 7, p = 0.7 (75.8%)
  • n = 9, p = 0.5 (75.1%), chosen
• Gigafida 1.1 DeDup
  • http://clarin.si/noske/
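
How the two parameters interact can be shown with a simplified sketch of n-gram overlap deduplication. This is not Onion's implementation: the function names, the document-level granularity, and the strict "greater than p" comparison are assumptions, and the defaults simply mirror the values named on the slide.

```python
def ngrams(tokens, n):
    """Return the set of token n-grams in a document."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(documents, n=7, p=0.5):
    """Keep a document unless more than a share p of its n-grams was already seen.

    `documents` is a list of token lists, processed in order, so the first
    copy of a duplicated text is the one that survives.
    """
    seen = set()
    kept = []
    for tokens in documents:
        grams = ngrams(tokens, n)
        if grams:
            overlap = len(grams & seen) / len(grams)
            if overlap > p:
                continue  # treated as a (near-)duplicate and dropped
        # Documents shorter than n tokens have no n-grams and are always kept.
        kept.append(tokens)
        seen |= grams
    return kept


a = "the quick brown fox jumps over the lazy dog near the river bank today".split()
b = a + ["again"]  # near-duplicate of a: 8 of its 9 7-grams are already seen
c = "completely different text about corpus linguistics and tagging of Slovene words".split()

print(len(deduplicate([a, b, c])))
```

A larger n makes each n-gram more specific, so accidental matches become rarer; a larger p requires more of a text to be repeated before it is removed.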

Gigafida 2.0 – web concordancer

Gigafida 2.0 – corpus of standard Slovene
• Three categories
  • (1) texts whose authors were, with a high degree of probability, required or wanted to write in standard Slovene – textbooks, literature, magazines, newspapers, legislation etc.
  • (2) media that intentionally use non-standard language, e.g. a newspaper published by the Slovenian minority in Italy (Novi Matajur)
  • (3) computer-mediated communication (CMC) typical of social media, forums etc.
• Gigafida 2.0 (standard) – Janes (CMC) – specialized (KAS, IMP etc.)

Gigafida 2.0 – availability
• Distributor: Centre for Language Resources and Technologies
• Access: repository CLARIN.SI
• ccGigafida & ccKres
  • Creative Commons (CC BY-SA)
• Gigafida & Kres
  • Standardized (online) contract
  • Authentication
  • Scrambling
  • January 2019

Plans for the future
• Development: Centre for Language Resources and Technologies
• Distribution: CLARIN.SI repository
• Research programme Uni-Lj (Dec 2018)
• Yearly upgrade?
• Neural tagger
• Ongoing projects
• “New grammar” & KOLOS
• ELEXIS (European Lexicographic Infrastructure)
• Annotation (syntactic treebank, SRL, NER, semantic types?)
• Extraction of data (n-grams, collocations, valency patterns etc.)