The Crbadn Project Corpus building for underresourced languages

The Crúbadán Project: Corpus building for under-resourced languages Kevin P. Scannell Presented by: Ben King

Problems with Minority Languages • Often no public funding available and no commercial interest in this work • Few native speakers with NLP training • Data are too scarce for “representativeness”

Overcoming these Problems An open source development model Volunteer labor by language enthusiasts Free, web-crawled corpora Language-independent tools (like the crawler) deployed for a large number of languages • Unsupervised machine learning algorithms • •

An Crúbadán • A web-crawler that builds corpora for minority languages • Phase 1 aims to collect text in all languages on the web • Phase 2 aims to produce NLP resources in these languages

Statistics • 1872 languages • 836, 000 documents • 1, 900, 000 words

Design Principles • Orthographies instead of languages • A corpus should consist of real running text – Not just word lists • It should be possible to collect all the Web text for small languages

The pipeline – step 1: finding languages • Lots of manual work – Web searching – Scanning and OCRing books – Monitor for new languages at • • Wikipedia UDHR Watchtower bible. is – Add language metadata • Expected polluting languages • Encodings

The pipeline – step 1: finding languages

The pipeline – step 2: building a query • Separate the known words into stopwords and other words • OR together content words • AND with a stopword

The pipeline – step 3: processing documents • Google returns many types of documents • Convert to plain text – pdf 2 text – Remove markup – Remove boilerplate – Convert to Unicode • Identify the document language • Update lexicon, lang-id trigrams

The pipeline – step 4: repeat • Repeat

Applications

Spell checkers • Most languages do not have a spell checker • Building a spell checker requires – List of roots – Basic morphological rules • A native speaker must identify roots and affixes • 28 new spellcheckers created

Indigenous Tweets and Language Revitalization

Indigenous Tweets and Language Revitalization • Indigenous Tweets project identifies and tracks Twitter accounts that tweet in minority languages • New usage of endangered languages online: – Gamilaraay (Australia, ~3 speakers), – Nawat (El Salvador, ~20 speakers), – Delaware (USA, ~78 speakers) • Encourages a younger generation to use endangered languages

Diacritic Restoration • Many languages with diacritics have developed an ASCII orthography on the Web • Examples – – – – Oll skulu vera fraels at hava sinar askodanir og bera taer fram uttan fordan → Øll skulu vera fræls at hava sínar áskoðanir og bera tær fram uttan forðan Ua noa i na kanaka apau ke kuokoa o ka manao a me ka hoike ana i ka manao → Ua noa i nā kānaka apau ke kūʻokoʻa o ka manaʻo a me ka hōʻike ʻana i ka manaʻo Moi nguoi deu co quyen tu do ngon luan va bay to quan diem → Mọi người đều có quyền tự do ngôn luận và bầy tỏ quan điểm Eni kookan lo ni eto si omi nira lati ni imoran ti o wu u, ki o si so iru imoran bee jade→ Ẹnì kọ ọ kan ló ní è tọ sí òmì nira láti ní ìmọ ràn tí ó wù ú, kí ó sì sọ irú ìmọ ràn bè è jáde • Released as an open-source Firefox add-on, accentuate. us – Supports 84 languages

Diacritic Restoration

Continuing problems • Variations in scripts – Example: Azerbaijani • ИНСАН ҺҮГУГЛАРЫ ҺАГГЫНДА ҮМУМИ БӘЈАННАМӘ • İNSAN HÜQUQLARI HAQQINDA ÜMUMİ BƏYANNAMƏ • INSAN HUQUQLARI HAQQINDA UMUMI BEYANNAME • Mixed-language documents • Manual intervention