Language Archiving at the MPI
Peter Wittenburg, MPI for Psycholinguistics, Nijmegen (NL)
DOBES Archive (DOkumentation BEdrohter Sprachen, Documentation of Endangered Languages), funded by the Volkswagen Foundation
Still a large variety of languages
• currently 6500 languages world-wide
• distribution by region (as listed on the slide): Africa, South/Southeast Asia, New Guinea, South America, North Asia, Central America, Pacific area, Australia, North America, Europe
• counts shown on the slide, in order: 1995, 1400, 1109, 419, 380, 300, 250, 209
Language endangerment
• 97% of the people use 4% of the languages
• 96% of the languages are spoken by 3% of the people
• approx. 6000 of the languages are spoken by about 200 million people
• on average: 30,000 speakers per language
• 50% have fewer than 10,000 speakers, 25% fewer than 1000
• for 50% the number of speakers is decreasing dramatically
• pessimistic view (according to Crystal):
  • 90% of the languages will be extinct around 2100
  • i.e. every second week a language becomes extinct
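The figures on this slide can be cross-checked with a little arithmetic. The sketch below is illustrative only: the function names and the 100-year horizon are assumptions, not taken from the slide.

```python
# Rough cross-check of the endangerment figures.
# Assumption: losses are spread evenly over the chosen number of years.

WEEKS_PER_YEAR = 365.25 / 7


def average_speakers(total_speakers: float, languages: int) -> float:
    """Mean number of speakers per language."""
    return total_speakers / languages


def weeks_per_lost_language(languages: int, fraction_lost: float, years: float) -> float:
    """Average number of weeks between two language losses."""
    lost = languages * fraction_lost
    return years * WEEKS_PER_YEAR / lost


if __name__ == "__main__":
    print(round(average_speakers(200e6, 6000)))
    for fraction in (0.5, 0.9):
        print(fraction, round(weeks_per_lost_language(6000, fraction, 100), 2))
```

Under these assumptions, 200 million speakers over 6000 languages gives roughly 33,000 speakers per language, in line with the "30,000" on the slide. A loss of half of 6000 languages over a century corresponds to about one language every 1.7 weeks, while a 90% loss corresponds to about one language per week.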
What can we do? Documentation + Revitalization
• 2000: DOBES Programme of the Volkswagen Foundation
• many other initiatives and institutions, all meant to be complementary
• the Volkswagen Foundation is primarily devoted to supporting research
• teams get funds for documentation (in general 3 years or more)
• there was a very intensive pilot phase full of useful discussions
• it was obvious that all teams (including the archiving team) felt the need to help the language communities
How to do a language documentation?
• based on N. Himmelmann, “Documentary and Descriptive Linguistics”
• Documentation: primary focus is on collection, transcription and translation of primary data (observations, elicitations, ...)
• Description: primary focus is on linguistic analysis and special phenomena
• the methods and the results are different:
  • Result. Collection: corpus of utterances, notes on observations, comments of involved persons. Analysis: descriptive statements illustrated by a few examples
  • Procedures. Collection: observation, elicitation, recording, transcription, translation. Analysis: phonetic, phonological, morphosyntactic, semantic analysis
  • Methodological issues. Collection: sampling, reliability, naturalness. Analysis: definition of terms and levels, justification, adequacy of analysis
How to do a language documentation?
• there is an overlap between the two poles, documentation and description
  • no interlinear description without a morphological analysis
• Documentation has to
  • deliver a comprehensive representation of the “linguistic practices and traditions”
  • document spoken language in its communicative and cultural context
  • cover observed linguistic practices and meta-knowledge
  • take a holistic view of language
  • be interesting for other disciplines, in particular the primary data
  • help the language community
• therefore a natural focus on audio and video recordings
DOBES language documentation
• language in its cultural context
• “theory-neutral” representation
• lots of multimedia (audio, video) recordings as the basis
• where possible, base everything on primary data
• linguistic goals
  • annotations (orthographic transcription, translation, ...)
  • morphological/syntactic analysis only for a small part
  • sketch grammar, limited topic-oriented lexicon
• ethnologists, musicologists and ethnobiologists are also involved
• in total about 3 years
• idea: later generations should be able to reproduce the language
• material could be extended later
Traditional annotation Text Annotation
Modern annotation Multimedia Annotation
DOBES Map
(Map labels: Hocank, Svan/Udi/Tsova-Tush, Sami, Nenets, Tofa, Chintang/Puma, Beaver, Wichita, Mawe/Bakairi/Katxuyana, Chontal, Salar/Monguor, Totoli, Lacandon, Tsafiki, Bora/Ocaina, Chipaya, Marquesan, Chaco Languages, Kuikuro, Trumai, Aweti, Akhoe Hai//om, !Xoo, Sri Lanka Malay, Semang, Waima’a, Teop, Saliba, Iwaidja, Jaminjung; Archive)
• 30 documentation teams (at MPI also 30 expeditions per year)
• 1 archiving team
Waima’a (East Timor)
• Mauricio Belo, Caisido village
• John Bowden, Australian National University
• John Hajek, University of Melbourne
• Nikolaus Himmelmann, Ruhr-Universität Bochum
Consonant inventory (columns: labial, (post-)alveolar, velar, glottal)
• Stops: voiceless unaspirated (p) t k '; voiceless aspirated (ph) th kh; voiceless ejective p' t' k'; voiced b d g
• Fricatives: plain (f) s; s'
• Nasals: plain m n; mh nh; m' n'
• Laterals: plain l; lh; l'
• Tap / trill: r rr
• Glides: plain w; wh; w'
Text sample
  la enen i at
  before PTL
  'Once upon a time'
  bu taha k’omu ruo bu wai-dura loo ligasaun ini
  HON mud ball and HON cricket make closeness RCP
  'A ball of mud and a cricket were friends'
  sire ruo laka khuu rahmhutu busa
  3p two go clean together garden
  'The two of them went to clean the gardens together'
Trumai (Amazonas)
• Stephen Levinson, MPI Nijmegen
• Raquel Guirardello-Damian, Museu Paraense Emílio Goeldi
• about 100 people
• about 51 speaking Trumai
Salar/Monguor (China)
(Photo captions)
• Salar villages along the Yellow River
• Salar children above Dashyinix village
• Shaman in Huzhu Mongghul county
• Drummers in the Nadun festival, Minhe county
• Painting the faces of possessed Wutu, Niandehu township
Tofa, Tozhu, Tsengel Tuva, Tuha (Siberia)
• David Harrison (Yale)
• Brian Donahoe (Manchester)
• Sven Grawunder (Halle)
• Language: its structure and sounds
• Oral folklore: texts, narratives and personal stories, belief systems, naming systems
• Music: singing and sound mimesis
• Traditional ecology: nomadism, pastoralism, hunting and reindeer herding
(Photo: shaman ceremony)
Language documentation for whom?
• for interested researchers
• for students and schools
• for journalists
• for the interested public
• for the language communities
• for future generations
For language communities
• language maintenance or even revitalization
• maintenance of the language, identity, self-consciousness
• creation of school and other educational material
• support for local/regional centers (create and download complete copies)
• improved access to archives
• communities show big interest in recordings, in particular video
For future generations
• in a future world of monocultures it will become important to know about earlier diversity
• as now, it will be important to know one's own roots
• it may be relevant to point to the different types of languages
• let’s be honest: we don’t know what future generations will do with the material
Why archives?
• many reasons
• Dietrich Schüller: 80% of our recordings about culture and languages are endangered!
  • storage is inadequate (media, formats, PCs, ...)
• selection of suitable technologies requires expert knowledge
• creation of redundant storage and migration is important
  • requires discipline and has to be independent of individual persons
• migration to new technologies can be very expensive
• only centres can do this
• AND: requires explicitness, in the end a viewable corpus
• international trend: DOBES, AILLA, ELAR, PARADISEC, LACITO, ...
What is a “modern” digital archive?
• traditional archives
  • focus on preserving the physical object
  • access not permitted
• digital archives
  • the physical object is almost irrelevant (tape, CD-ROM)
  • the content has to be preserved
• why this revolutionary change?
  • copies can be made without loss (let’s be careful with compression)
  • copies can be created at low cost
• modern digital archive
  • long-term preservation of the content (migration, distribution)
  • access to the content
  • enrichment without affecting the content
  • sensitive management of access
• DOBES has to be a living archive (interactive, expandable)
Long-term preservation
• can we guarantee survival of bit-streams? NO
• can we increase the chances of survival? YES
• our storage media are not adequate
(Figure: lifetime scale 0 / 250 / 500 / 1000 / 2000 years; labels: various e-media, clay tablets)
• how to do it
  • continuous migration (copies to new generations of media)
  • world-wide distribution (now within Germany/NL)
• the problem of interpretability is not solvable
• have to take care of ethical/legal aspects
• crucial for survival are maintenance costs
• all MPI material is available in 7 copies at different locations
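Migration and distribution only help if every copy is still bit-identical to the others. A minimal sketch of such a fixity check follows; it uses SHA-256 checksums, which is an assumption made for illustration and not a statement about how the MPI archive verifies its copies. The function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Iterable


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in chunks so that large media files fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_identical(paths: Iterable[Path]) -> bool:
    """True if all given copies have the same checksum (False for no copies)."""
    checksums = {sha256_of(path) for path in paths}
    return len(checksums) == 1
```

Running `copies_identical` over the copies held at the different locations after each migration step would flag any copy that has silently changed.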
Pillars of Digital Archives I
• strict separation of physical and logical access layers
• the physical domain is for system managers and archive managers, and it changes
• the logical domain (created by linguists) remains and is stable
• metadata is the glue and has to be maintained
(Diagram: system manager and the domain of physical resources; corpus manager, user and creator and the conceptual domain of resources)
Pillars of Digital Archives II
• archive organization
(Diagram: layer of language; layer of sessions; resources such as Song Book, Intro, Films, Notes, Lexicon, Video Recording, Sound Recording, Annotations)
Pillars of Digital Archives III
• separation between object and instance
• need unique resource IDs (URIDs) and a robust “resolving” mechanism
(Diagram: MPI Portal and XYZ Portal with their metadata, mapped via the URID Resolver to the MPI Repository and the GWDG Repository)
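The separation between object and instance can be pictured as a lookup table: metadata refers to a stable ID, and a resolver maps that ID to whichever physical copies currently exist. The sketch below is a toy model under that assumption. The class name, the ID syntax and the URLs are hypothetical and do not describe the actual resolver.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Resolver:
    """Maps a unique resource ID (object) to its current locations (instances)."""
    _instances: Dict[str, List[str]] = field(default_factory=dict)

    def register(self, urid: str, location: str) -> None:
        """Record one more physical copy of the object."""
        self._instances.setdefault(urid, []).append(location)

    def resolve(self, urid: str) -> List[str]:
        """Return all known locations; unknown IDs raise KeyError."""
        if urid not in self._instances:
            raise KeyError(f"unknown URID: {urid}")
        return list(self._instances[urid])


if __name__ == "__main__":
    resolver = Resolver()
    # Hypothetical ID and repository URLs, for illustration only.
    resolver.register("urid:example/0001", "https://mpi.example.org/store/a1.wav")
    resolver.register("urid:example/0001", "https://gwdg.example.org/copy/a1.wav")
    print(resolver.resolve("urid:example/0001"))
```

Because portals and metadata only ever store the ID, a repository can move or replace files without any metadata having to change.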
Pillars of Digital Archives IV
• need versioning
• nothing may be deleted, but annotations will be changed!
• the research world is dynamic: we want enrichment/extension
(Diagram: access lists such as userx=read, usery=read and userx=write, usery=read, linked to the URID Resolver)
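"Nothing may be deleted, but annotations will be changed" suggests an append-only history per resource. A minimal in-memory sketch of that idea follows; the class and method names are hypothetical and this is not the archive's versioning implementation.

```python
from typing import Dict, List, Optional


class VersionedStore:
    """Append-only store: every change adds a version, nothing is overwritten."""

    def __init__(self) -> None:
        self._history: Dict[str, List[str]] = {}

    def put(self, urid: str, content: str) -> int:
        """Store a new version and return its 1-based version number."""
        versions = self._history.setdefault(urid, [])
        versions.append(content)
        return len(versions)

    def get(self, urid: str, version: Optional[int] = None) -> str:
        """Return the requested version, or the latest one if none is given."""
        versions = self._history[urid]
        if version is None:
            return versions[-1]
        if not 1 <= version <= len(versions):
            raise IndexError(f"no version {version} for {urid}")
        return versions[version - 1]
```

An annotation file that is corrected later simply becomes version 2, while version 1 stays retrievable for anyone who cited it.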
Principles V – Authentication & Authorization
• authentication and authorization have to be separated
• URIDs are the central link to authorization information
• need to have space for policies, procedures, declarations etc.
• but the administrative effort has to be minimized!
(Diagram: access lists such as userx=read, usery=read and userx=write, usery=read, linked to the URID Resolver)
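Keeping authentication ("who is this?") apart from authorization ("what may this user do with this resource?") can be sketched with two independent functions, where the rights table is keyed by URID as on the slide. All names, IDs and credentials below are hypothetical, and the plain-text password table is a deliberate simplification.

```python
from typing import Dict, Optional

# Hypothetical credential store; a real system would never keep plain-text passwords.
USERS: Dict[str, str] = {"userx": "secret-x", "usery": "secret-y"}

# Rights are attached to the resource ID, not to the file location.
RIGHTS: Dict[str, Dict[str, str]] = {
    "urid:example/0001": {"userx": "write", "usery": "read"},
}

# "write" implies "read".
LEVELS = {"read": 1, "write": 2}


def authenticate(user: str, password: str) -> Optional[str]:
    """Return the user name if the credentials match, else None."""
    if user in USERS and USERS[user] == password:
        return user
    return None


def authorize(user: Optional[str], urid: str, action: str) -> bool:
    """Check whether an authenticated user may perform the action on the resource."""
    if user is None or action not in LEVELS:
        return False
    granted = RIGHTS.get(urid, {}).get(user)
    return granted is not None and LEVELS[action] <= LEVELS[granted]
```

With this split, the authentication step could be replaced by another mechanism without touching the per-resource rights.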
Principles VI – Formats
• only open, well-documented and widely used formats (encoding standards) should be used in the archive
• where possible, generic schemas should be the basis
• in DOBES, strong recommendations for a few archival formats
  • JPEG/TIFF/PNG, MPEG-2, linear PCM, Unicode, XML
  • plain text, HTML, (PDF) possible
• at the MPI less restrictive (therefore great danger with some types)
• for presentation purposes also MPEG-1/4, MP3, HTML
• as import formats a large variety (Shoebox, CHAT, Word, ...)
• conversion as much as possible towards generic files (LMF, EAF, ?)
• archived objects have to be stored in a neutral way and be accessible as individual objects
• no encapsulation for primary objects
• nevertheless: the MPI archive takes almost all data (even 16 mm films)
• but conversion can be very costly
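The three roles a format can play here (archival, presentation only, import only) can be written down as a small policy table. The sketch below is an illustration under stated assumptions: the mapping from file extensions to formats is a simplification introduced for this example (an extension does not prove the encoding), and none of the names come from the archive software.

```python
from pathlib import PurePath

# Assumed extension lists, loosely following the formats named on the slide.
ARCHIVAL = {".jpg", ".jpeg", ".tif", ".tiff", ".png", ".wav", ".xml", ".txt", ".html", ".pdf"}
PRESENTATION_ONLY = {".mp3", ".mp4"}
IMPORT_ONLY = {".doc", ".cha"}


def classify(filename: str) -> str:
    """Return 'archival', 'presentation', 'import' or 'unknown' for a file name."""
    suffix = PurePath(filename).suffix.lower()
    if suffix in ARCHIVAL:
        return "archival"
    if suffix in PRESENTATION_ONLY:
        return "presentation"
    if suffix in IMPORT_ONLY:
        return "import"
    return "unknown"


if __name__ == "__main__":
    for name in ("session1.WAV", "session1.mp3", "notes.doc", "field.xyz"):
        print(name, classify(name))
```

Files classified as "import" would be candidates for conversion towards a generic format before archiving, and "unknown" files would need a manual decision.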
MPI Archive – state
• more than 150,000 objects (in the online archive, ~1/3 of the data)
• in total more than 15 TB
• per year about 4 TB in addition
• several sub-archives (EL, SL, ESF, CGN, ...)
• MPI archive ingest is open for other people!
• completely structured by open XML files based on the IMDI schema
• a complete machinery is available
• we are working on URIDs and versioning at this moment
MPI Archive – Access
(Architecture diagram)
• Archive Utility Layer: User Authentication, Access Rights, Metadata Tools, Data Ingestion & Management, Ontological Knowledge
• Archive Access: Annotation Exploitation, Lexicon Exploitation, Text Exploitation
• The Archive: Domain of Registered Primary and Secondary Resources; Domain of Descriptive Metadata
• User Primary Resources: Texts, Images, Sound, Movies
• Archive Enrichment: Media Annotation, Lexical Encoding, Web Commentary
MPI Archive – Metadata and Simple Access
• metadata is open!
• what is minimal metadata? (ongoing discussion)
• IMDI Editor
• BatchModifier (to change lots of IMDI files)
• IMDI XML Browser (operates in the distributed XML domain)
• IMDI HTML browsing (on-the-fly transformation of XML)
• structured search in the XML and HTML domain
• unstructured search in the XML and HTML domain
• searchable via Google
• geographic browsing via Google Earth (work in progress)
• DC/OLAC bridge via OAI port (all IMDI metadata can be harvested)
• manuals and training courses
• direct access to simple objects via plug-ins
• complete sub-tree download
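Harvesting through an OAI port means sending OAI-PMH requests such as `ListRecords` to the archive's endpoint. The sketch below only builds the request URL. The `verb`, `metadataPrefix` and `set` parameters are part of OAI-PMH, whereas the endpoint address and the function name are placeholders, since the slide gives no URL.

```python
from urllib.parse import urlencode


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc", set_spec: str = "") -> str:
    """Build an OAI-PMH ListRecords request URL for a given endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optional selective harvesting of one set.
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder endpoint, not the real archive address.
    print(list_records_url("https://archive.example.org/oai"))
```

A harvester would fetch this URL, parse the XML response and follow any resumption token until the record list is complete.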
Geographic Browsing
MPI Archive – Upload Access
• two options
  • manual integration: exceptions are easy, but too many teams (~60)
  • LAMUS controlled integration: exceptions are difficult, users do it themselves (?)
• LAMUS features
  • web-based operation
  • request of a work space
  • specification of an accepted upload node (archive anchor)
  • extend and manipulate the corpus structure
  • upload metadata descriptions
  • upload any type of resources (configurable format control)
  • create a linked sub-archive in the workspace and integrate this into the archive
  • checks to guarantee consistency and format compliance
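One kind of consistency check on a workspace is to compare the resources that the metadata links to with the files that were actually uploaded. The sketch below shows that comparison only; it is a hypothetical illustration with invented names and is not how LAMUS itself is implemented.

```python
from typing import Dict, List, Set


def check_workspace(linked: Set[str], uploaded: Set[str]) -> Dict[str, List[str]]:
    """Compare metadata links with uploaded files.

    'missing'  : referenced in metadata but not uploaded
    'unlinked' : uploaded but not referenced by any metadata description
    """
    return {
        "missing": sorted(linked - uploaded),
        "unlinked": sorted(uploaded - linked),
    }


if __name__ == "__main__":
    report = check_workspace(
        linked={"story1.wav", "story1.eaf"},
        uploaded={"story1.wav", "photo7.jpg"},
    )
    print(report)
```

A workspace would only be integrated into the archive once both lists are empty and every file passes the format check.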
MPI Archive – Utilization Access
• tool is ANNEX
MPI Archive – Utilization Access
• tool is LEXUS
MPI Archive – Utilization Access
• problem
  • different structures and formats
  • different terminologies
• tools are ANNEX/LEXUS
MPI Archive – State of Access
• at this moment almost everything from DOBES is closed
• lots of requests by journalists
• the first 15 teams have to finish these months
• they are working hard and changing a lot until the last minute, of course
• expect some material to become open, but much will be handled on request
End
Mark Abley (Canadian):
Each time we lose a language the ghosts who made use of it cast a new bell. The voices magnify. Soon, listen, they’ll outpeal the tongues of earth.
Thanks for your attention.
Lots of differences
• differences at all linguistic layers
  • Phonemic
  • Prosody
  • Phonology
  • Morphology
  • Syntax
  • Semantics
  • Pragmatics
• reduced languages
  • whistling of Gomera fishermen
  • sign language of the Plains Indians
  • “computer” languages
  • ...
Sound Systems
• vowel distribution (28 languages)
(Figures: spectra and formants F1 to F5; formants over time)
• Rotokas (Papua New Guinea)
  • vowels a/e/i/o/u
  • 6 consonants p/t/k/v/r/g
• !Xoo (southern Africa)
  • 141 sounds incl. click sounds
Tone Systems
• Intonation
  • modulation of segmental information by prosody
  • stretches across phrases and sentences
• Tones: meaning of words
  • Swedish: 2 tones (anden – ándén)
  • German: aufbäumen – auf Bäumen
  • Mandarin Chinese: 4 tones
  • Cantonese: 9 tones
  • Vietnamese: 8 tones
  • some Southeast Asian languages: up to 15 tones
• Mandarin Chinese, the syllable "i" in 4 tones: 'clothes', 'to suspect', 'chair/armchair', 'meaning'
Morphosyntax
• rules for the generation of words and grammatical structures
• strictly isolating languages: one morpheme, one word
  • Chinese is an isolating language
• the other extreme are the polysynthetic languages
  • example from Yup’ik:
    tuntussurqatarniksaitengqiggtuq
    tuntu-ssur-qatar-ni-ksaite-ngqiggt-uq
    reindeer-hunt-FUT-say-NEG-again-3SG:IND
    'he had not yet said that he wanted to hunt reindeer'
• basic principle: a stem is inflected by many affixes (slide label: verb stem)
• unusual for us: isolated core morphemes cannot be interpreted; “ssur” uttered in isolation does not make sense
Dialog style
• the norms for expressing things/activities are different
• example from Kilivila (Trobriand Islands, New Guinea)
  Person: Ambeya
    'Where are you going?'
  Gunter (wants to say: I will wash myself): Bala bakakaya
    'I will go, I will take a bath'
  Host: Bila bikakaya bike’ita bisisu bipaisewa
    3.FUT-go 3.FUT-bathe 3.FUT-come.back 3.FUT-be 3.FUT-work
    'He will go, he will take a bath, he will come back, he will stay, he will work.'
    i.e. 'He will take a bath, come back again and work with us'
Pronouns
• in Kilivila
  • the inclusive and exclusive dual, e.g. exclusive "we two": myself and the other, except you
• in Paamese (Vanuatu archipelago)
  • in addition the paucal, “a few”
Spatial orientation
(Diagram labels: egocentric system: behind, above, right, below; absolute system: north, east, south, west)
• Herberger would use the egocentric system to describe the scene
• Aborigines would choose the absolute system, for us hardly possible: “the ball lies east of the player”
Awareness
• since 1866 efforts to preserve diversity in nature
• 1991–1993
  • problem in focus of the Linguistic Society of America
  • discussion at the International Congress of Linguists
  • working group (AG) for endangered languages in the German linguistic society
  • UNESCO project to create the red list
• 2000: DOBES programme of the Volkswagen Foundation
• within 2 decades broad awareness amongst linguists
• David Crystal on first-semester students:
  • 75% don’t know anything about the problem
  • most don’t see a problem
• how does this come about: attention for tigers etc. but not for languages?
Factors are known
• external factors
  • military suppression
  • religious conversion
  • economic dominance
  • cultural dominance
  • educational suppression
• internal factors
  • negative attitude towards one's own language
  • avoidance of discrimination
  • hope to earn (more) money
  • improvement of mobility
  • youngsters are trend followers
  • ...
MPI Archive – Content Overview
Columns: MD sessions | video files | audio files | photos | other media | textual files | sub-types
• MPI: 18524 | 14085 | 5131 | 7774 | 1315 | 13979 | 365 EAF, 2377 CHAT, 5580 MediaTagger, 3568 PlainText/Shoebox, 1589 others
• DOBES: 1396 | 1043 | 1250 | 63 | 20 | 205 | 46 EAF, 85 Shoebox, 72 others
• Dutch Spoken Corpus: 12767 MD sessions; 41832; to be converted to EAF
• Dutch Bilingual Database: 874 MD sessions; 191; 714; CHAT, EAF
• ECHO Sign Language: 168 MD sessions; 296; 181; EAF
• ESF corpus: 994 MD sessions; 546; 1775; in CHAT
• Total: 34723 MD sessions; 15970 video files; 58686 audio files; 7837 photos; further total 19339; 136,555 objects overall