Experiences from the Spoken Dutch Corpus Nelleke Oostdijk


Spoken Dutch Corpus – Corpus Gesproken Nederlands (CGN)
• Result of a joint Dutch-Flemish project (1 June 1998 until 1 Dec. 2003)
• Funded by the Dutch and Flemish governments and research foundations (NWO, FWO, AWI/EWI)
• Total budget: 4.6 million euro
• Intended to constitute a language database of contemporary standard Dutch as spoken by adults in the Netherlands and Flanders
• Intended use:
  - for research in various areas, e.g. linguistics, language and speech technology, business communication, language education
  - (indirectly) for business (SMEs) and teaching
  - culturally and historically, as a record of Dutch as spoken in the Low Countries around the year 2000

Outline
1. The Spoken Dutch Corpus: design and compilation
2. Experiences at the time
3. Reflections in retrospect
   While looking back on ambitions and experiences then (1998-2003), what would/might we do differently?
   - In view of developments since the corpus came into being
   - Considering the actual use of the corpus
   - In the light of user feedback
   - …
4. OpenCGN
5. Onwards...

Ambition
A Spoken Dutch Corpus that would be
• Comparable in size to e.g. the spoken part of the British National Corpus
• A plausible sample of standard Dutch as spoken in the Netherlands and Flanders
• Enriched with transcriptions and annotations that were theory-neutral to the extent possible
• State-of-the-art: transcriptions, annotations and file formats were to conform to national and international standards and guidelines or ‘best practices’

Corpus size targeted: 10 million words
Full corpus: 1,000 hours of recordings (~10 M words)
  + orthographic transcription
  + POS tagging, lemmatization and lexicon link-up
“Core” corpus: 100 hours, transcription and annotation as for full corpus
  + broad phonetic transcription
  + syntactic annotation
  + manually verified word-signal alignment
  + (25 hours) prosodic annotation

Prospective users: conflicting interests
Prospective user groups/stakeholders included
• Users from the fields of descriptive and theoretical linguistics and communication studies, incl. phoneticians, syntacticians, discourse specialists, …
• Users from the field of computational linguistics
• Users from the field of speech recognition
They had conflicting ideas about
- The content of the corpus, e.g.
  - Audio quality
  - Continuous speech vs words/phrases
  - Spoken vs read aloud
  - Situational contexts
- The transcriptions and annotations

Corpus design considerations
To be made in the light of the prospective users, the intended use, as well as budgetary and other constraints (e.g. duration of the project, availability of qualified personnel, availability of suitable tools, ethical standards, …)
As regards
• Corpus and subcorpus sizes
• Composition
• Sampling
• Formats
• Metadata
• Audio quality
• Transcriptions and annotations

Corpus design
Overall structure is defined on the basis of parameters that distinguish specific communicative / situational settings:
• Number of speakers: monologues, dialogues, multilogues
• Public (with or targeted at audience) or private
• Medium: radio, television
• Degree of preparedness
• Direct (‘face-to-face’) vs distanced (telephone)

Corpus design: composition (1)
dialogue / multilogue  8,110,000
  private  6,635,000
    spontaneous  6,635,000
      ‘direct’  3,460,000
        conversations (face-to-face)  3,000,000
        interviews  460,000
      ‘distanced’  3,175,000
        telephone conversations  3,000,000
        business transactions  175,000
  public  1,475,000
    broadcast  750,000
      more or less prepared  750,000
        interviews and discussions  750,000
    non-broadcast  725,000
      spontaneous  725,000
        discussions, debates, meetings  375,000
        lessons  350,000
monologue  1,890,000

Corpus design: composition (2)
dialogue / multilogue  8,110,000
monologue  1,890,000
  private  40,000
    more or less prepared  40,000
      descriptions of pictures  40,000
  public  1,850,000
    broadcast  950,000
      spontaneous  250,000
        commentary  250,000
      more or less prepared  700,000
        current affairs programmes  250,000
        news  250,000
        opinion programmes, commentaries  200,000
    non-broadcast  900,000
      more or less prepared  900,000
        lectures, speeches  275,000
        read aloud text  625,000 (+ 375,000)
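The subtotals in the two composition slides can be checked mechanically. The sketch below is not part of the original presentation: it encodes the leaf categories as a nested dictionary (the name `DESIGN` and the helper `total` are illustrative) and sums them. It assumes that face-to-face conversations and telephone conversations are 3,000,000 words each, which is the value implied by the ‘direct’ and ‘distanced’ subtotals, and it leaves out the additional (+ 375,000) read-aloud material.

```python
# Word targets per leaf category, taken from the two composition slides.
# Assumption: the two "conversations" leaves are 3,000,000 words each,
# as implied by the 'direct' (3,460,000) and 'distanced' (3,175,000) subtotals.
DESIGN = {
    "dialogue/multilogue": {
        "private": {
            "direct": {
                "conversations (face-to-face)": 3_000_000,
                "interviews": 460_000,
            },
            "distanced": {
                "telephone conversations": 3_000_000,
                "business transactions": 175_000,
            },
        },
        "public": {
            "broadcast": {"interviews and discussions": 750_000},
            "non-broadcast": {
                "discussions, debates, meetings": 375_000,
                "lessons": 350_000,
            },
        },
    },
    "monologue": {
        "private": {"descriptions of pictures": 40_000},
        "public": {
            "broadcast": {
                "commentary": 250_000,
                "current affairs programmes": 250_000,
                "news": 250_000,
                "opinion programmes, commentaries": 200_000,
            },
            "non-broadcast": {
                "lectures, speeches": 275_000,
                "read aloud text": 625_000,  # the extra (+ 375,000) is not counted
            },
        },
    },
}


def total(node):
    """Sum all leaf word counts below a node of the design tree."""
    if isinstance(node, int):
        return node
    return sum(total(child) for child in node.values())


if __name__ == "__main__":
    for name, subtree in DESIGN.items():
        print(f"{name}: {total(subtree):,}")
    print(f"all: {total(DESIGN):,}")
```

Under these assumptions the two branches sum to the 8,110,000 and 1,890,000 words given on the slides, and together to the 10 million word target.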

Metadata
• Information about the speakers, e.g.
  - Gender
  - Age
  - Regional background
  - Educational background
• Information about the samples, incl.
  - Recording conditions
  - Medium
  - Length
  - Number and IDs of speakers involved
  - Available transcriptions and annotations

Concerns
• Naturalness/spontaneity of the speech
• Quality of the audio recordings
• Coverage of various registers and speaker populations
• Richness of metadata
In view of practical complications, e.g.
• In order to allow for distribution of the corpus, speakers’ consent is required (as well as settlement of other IPR matters)
• In many situations you have only limited control over the recording conditions
• Not all recordings were made by the CGN project
• Speaker recruitment can be problematic, esp. for elderly people and people with lower education

Realization
• Joint Dutch-Flemish project
• Distributed project: involving several universities, carried out in several locations
• Multiple transcription and annotation layers
  - Later transcription and annotation layers benefit from preceding annotations
  - Interaction between annotation layers provides checks and balances (quality, consistency) during the creation of the corpus
  - Multiple layers allow you to make the most of the data when using the corpus (e.g. orthographic + phonetic transcription)
• Theory-neutral (to the extent possible)

Orthographic transcription
Importance
• Simplest access to speech data
• Available for all data
• Base for other transcriptions and annotations
Properties
• Verbatim transcription
• Minimal interpretation
• Alignment with speech signal (the marking of chunks facilitates subsequent word alignment)
• Transcription useful for broad range of researchers

Orthographic transcription: PRAAT

Orthographic transcription
Rules
• In principle conventional spelling
• Transcription of false starts, hesitations, grammatical errors, etc. (+ codes)
• No capital letters to signal the beginning of sentences
• Proper names (and parts of names) are capitalised
• Punctuation restricted to [.] [?] and […]
(Limited number of) codes, e.g.
• xxx for unintelligible speech
• ggg for speaker sounds
• *a for unfinished words
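As an illustration of how the three codes listed above could be handled when reading a transcription, here is a minimal sketch. It is not CGN tooling: the function name `classify_token` and the category labels are made up for this example, and it assumes that `*a` is written as a suffix attached to the broken-off word (e.g. `fiet*a`), which the slide itself does not show.

```python
def classify_token(token: str) -> str:
    """Classify one orthographic token using the codes named on the slide.

    Assumption: '*a' is attached to the end of an unfinished word.
    """
    if token == "xxx":
        return "unintelligible"
    if token == "ggg":
        return "speaker_sound"
    if token.endswith("*a"):
        return "unfinished_word"
    return "word"


if __name__ == "__main__":
    # Invented example utterance, not taken from the corpus.
    utterance = "ik heb xxx de fiet*a fiets ggg gepakt"
    for tok in utterance.split():
        print(tok, classify_token(tok))
```

Because the codes are a closed, small set, a plain string comparison is enough here; a real reader would need the full code inventory from the transcription protocol.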

Orthographic transcription
Problems experienced
• New words: dikkemurenkasteel, flunken, haatzaaiartikelen, downgeloade, geplaystationd
• +/- word separation, e.g. separable verbs: ervan uitgaan – er vanuit gaan – er van uitgaan
• Reduced forms
• Dialect words or words pronounced with a regional accent
• …
• Speaker identification

POS tagging and lemmatization
• Available for all data: contextually appropriate tag for each word form (token)
• POS tagging and lemmatization enable searches for word classes and lemmas and not only for word forms; e.g.
  - naar – ADJ or PREP; fiets – N or WW
  - lemma ‘fietsen’: fiets, fietsen, fietsten, gefietst
• Queries can remain underspecified for certain aspects; e.g. find all occurrences of naar as PREP followed by an N: naar huis, naar bed, naar keuze, naar school, …
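The kind of underspecified query described above can be sketched over a list of tagged tokens. This is only an illustration, not the COREX query interface: the `Token` type, the function names and the sample sentences are invented, and the tags are coarse word-class labels (ADJ, PREP, N, WW from the slide, plus PRON and ART as placeholders) rather than the full CGN tagset.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str
    tag: str    # coarse word class only; illustrative, not the 316-tag CGN set
    lemma: str


def find_bigrams(tokens, first_form, first_tag, second_tag):
    """Return (form, form) pairs where a token with the given form and tag
    is immediately followed by a token carrying second_tag."""
    hits = []
    for cur, nxt in zip(tokens, tokens[1:]):
        if cur.form == first_form and cur.tag == first_tag and nxt.tag == second_tag:
            hits.append((cur.form, nxt.form))
    return hits


def forms_of_lemma(tokens, lemma):
    """Return the distinct word forms annotated with the given lemma."""
    return sorted({t.form for t in tokens if t.lemma == lemma})


# Invented sample data.
TOKENS = [
    Token("ik", "PRON", "ik"), Token("ga", "WW", "gaan"),
    Token("naar", "PREP", "naar"), Token("huis", "N", "huis"),
    Token("wat", "PRON", "wat"), Token("een", "ART", "een"),
    Token("naar", "ADJ", "naar"), Token("verhaal", "N", "verhaal"),
    Token("hij", "PRON", "hij"), Token("fietst", "WW", "fietsen"),
    Token("naar", "PREP", "naar"), Token("school", "N", "school"),
    Token("gefietst", "WW", "fietsen"),
]

if __name__ == "__main__":
    print(find_bigrams(TOKENS, "naar", "PREP", "N"))
    print(forms_of_lemma(TOKENS, "fietsen"))
```

The query matches "naar huis" and "naar school" but not "naar verhaal", where naar is tagged ADJ; a search on word forms alone could not make that distinction.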

POS tagging and lemmatization
Procedure
• Automatically, using a tagger-lemmatiser
• Check, and possibly manually correct, output
Tagset
• Especially designed for CGN
• Conforms to EAGLES guidelines and ANS
• 316 tags (incl. tags for dialect words and * words from orthography)
Principles
• Word by word; e.g. hij belde hem op
• Form over function; e.g. ik heb haar maandag gezien

POS tagging and lemmatization
Problems during checking and correction
• Errors in orthographic transcription
• Easy to miss occasional errors
• When what is said deviates from what is prescribed by grammar
• Notoriously difficult cases, e.g.
  - Distinction ADJ – V (ingp/edp)
  - Idioms
  - Different tags for a word (token), depending on context

Lexicon link-up
Lemmatization of multi-word units
• multi-part proper names, e.g. Kim Clijsters, …
• separable verbs, e.g.
  - achteruitdeinzen – deinzde … achteruit
  - dichtmaken – maakte … dicht
  - navertellen – vertelde … na
• foreign multi-word expressions, e.g. pro forma, et cetera, chili con carne

Phonetic transcription
• Available for about 1 million words
• Broad phonetic: representation of the phonemes that are being pronounced using a limited set of symbols (e.g. no diacritic symbols)
• Results from the manual correction of automatically generated transcriptions:
  - Transcription of phoneme insertions, deletions and substitutions
  - No transcription of gradual processes such as degree of voicing
Symbol set: Dutch SAMPA set with extras
1. /J/ for <oranje>
2. /E:/, /Y:/, /O:/ for resp. <scène>, <freule> and <zone>
3. /E~/, /Y~/, /O~/, /A~/ for resp. <vaccin>, <parfum>, <congé> and <croissant>
Symbols under 2 and 3 can only be used in loan words

Phonetic transcription
General transcription rules
1. Make sure that there is a one-to-one relation with the orthography
2. Make sure that the transcription shows which phonemes were pronounced
3. If in doubt: do not change the automatic transcription
4. Note if words have to be inserted, removed or substituted
Specific transcription rules
• If two adjacent words share a phoneme, use an underscore: /Als_sInt kOmt/, /Ob_bAIEt/
• Use hyphens to link connecting phonemes with preceding and following words: /pApa-n-Em_mAma/, /du-w-@t/
• Use [] for untranscribable phonemes or words
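The underscore and hyphen conventions make it possible to check rule 1 (one-to-one relation with the orthography) automatically. The sketch below is one possible reading of the rules, not the project's own validation code. It assumes that a space or an underscore marks a word boundary, and that within a hyphen-joined group every second piece is a linking phoneme that belongs to neither word. The orthographic strings paired with the slide's transcriptions ("papa en mama", "doe het", "als Sint komt") are my own glosses.

```python
import re


def split_units(transcription: str):
    """Split a broad phonetic transcription into word units and linking phonemes.

    Assumptions (not stated in this form on the slide):
    - a space or an underscore marks a word boundary;
    - in a group such as 'pApa-n-Em', pieces at odd positions are linkers.
    """
    words, linkers = [], []
    for group in re.split(r"[ _]", transcription.strip("/")):
        parts = group.split("-")
        words.extend(parts[0::2])    # pieces 0, 2, 4, ... are words
        linkers.extend(parts[1::2])  # pieces 1, 3, ... are linking phonemes
    return words, linkers


def is_one_to_one(orthography: str, transcription: str) -> bool:
    """Check that the number of phonetic word units equals the number of words."""
    words, _ = split_units(transcription)
    return len(words) == len(orthography.split())


if __name__ == "__main__":
    print(split_units("/pApa-n-Em_mAma/"))
    print(is_one_to_one("papa en mama", "/pApa-n-Em_mAma/"))
    print(is_one_to_one("doe het", "/du-w-@t/"))
    print(is_one_to_one("als Sint komt", "/Als_sInt kOmt/"))
```

A count comparison only catches missing or extra word units; it says nothing about whether each unit is a plausible pronunciation of its word.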

Phonetic transcription
What transcribers find difficult
1. Syllable-final /r/: deleted or not?
2. Syllable-final /n/: deleted or not?
3. Plosives and fricatives: voiced or voiceless? (esp. distinction /G/ - /x/)
4. Plosives without release: voiced or voiceless?
5. Use of /S/, /Z/, /J/ is forgotten

Syntactic annotation
• Available for approx. 1 million words
• Dependency structure
• Semi-automatic annotation using @nnotate software (cf. NEGRA corpus)

Prosodic annotation
• Available for about 25 hours of speech (~250,000 words)
• 2 annotations per sample
• As much as possible theory-neutral
• Perception-based (cf. Portele & Heuft; Grover et al.)
Annotation of
1. prominent syllables, i.e. syllables that are stressed to make a word more important or to indicate contrast with another word
2. prosodic boundaries, strong and weak
3. abnormal lengthening of phonemes

COREX Corpus Exploitation Environment

Pronunciation variation: eigenlijk (EN: ‘actually’)
Canonical form vs actually observed pronunciations (from rather careful to highly sloppy)

Type             Total     NL     FL   FL norm. freq
E+G@l@k            930    283    647   1294
E+x@l@k            135    135      0      0
E+G@l@g            114     78     36     72
E+xl@k              86     86      0      0
E+Gl@k              80     79      1      2
E+l@k               65     33     32     64
E+k                 42     39      3      6
E+h@l@k             41      1     40     80
E+xk                39     39      0      0
E+G@k               31     30      1      2
E+x@k               30     30      0      0
(88 more)
All 99 variants   1921   1092    829
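In the table, the "FL norm. freq" column is always twice the raw Flemish count (647 becomes 1294, 36 becomes 72). The slide does not explain the factor; presumably it compensates for the Flemish part of the corpus being smaller than the Dutch part, but that reading is an assumption. The sketch below reproduces the column with that inferred factor and adds the Dutch share of each variant after normalisation. The names `FL_FACTOR`, `ROWS` and `normalise` are illustrative.

```python
# Factor inferred from the table (647 -> 1294); it is not stated on the slide.
FL_FACTOR = 2

# (variant, NL count, FL count) for a few rows of the 'eigenlijk' table.
ROWS = [
    ("E+G@l@k", 283, 647),
    ("E+G@l@g", 78, 36),
    ("E+l@k", 33, 32),
    ("E+h@l@k", 1, 40),
]


def normalise(nl: int, fl: int, factor: int = FL_FACTOR):
    """Return the normalised FL frequency and the NL share after normalisation."""
    fl_norm = fl * factor
    nl_share = nl / (nl + fl_norm)
    return fl_norm, nl_share


if __name__ == "__main__":
    for variant, nl, fl in ROWS:
        fl_norm, nl_share = normalise(nl, fl)
        print(f"{variant:10s} NL={nl:4d} FL norm.={fl_norm:5d} NL share={nl_share:.2f}")
```

Comparing NL counts with normalised FL counts, rather than raw counts, keeps a variant from looking typically Dutch merely because more Dutch speech was recorded.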

Pronunciation variation: regional differences
For example, between Flanders and the Netherlands: dat (EN: that), subordinating conjunction OR demonstrative pronoun

Type               Total       NL      FL   FL norm. freq
dAt               12,629    8,763   3,866   7,732
dA                 4,329    1,196   3,133   6,266
dAd                3,758    2,551   1,207   2,414
tAt                2,545    1,867     678   1,356
tA                   765      187     578   1,156
tAd                  711      501     210     420
d@t                  457      457       0       0
d@d                  249      249       0       0
(164 more)
All 172 variants  27,991   18,094   9,897

What CGN project has brought us: Tangible results
• A sizeable corpus of spoken Dutch (~800 hours recorded speech) with transcriptions, annotations, metadata and documentation
• Word frequency lists
• COREX: Corpus Exploitation environment
Available from the Dutch HLT Centre!

But also...
CGN has had enormous impact on the development of
• Standards (e.g. data formats, metadata specifications, definition of tagset, adaptation/validation of SAMPA for Dutch)
• Tools (e.g. PRAAT, tagger/lemmatizer, word alignment software, syntactic parser)
• Guidelines for e.g. handling IPR, various types of transcription and annotation
CGN has set an example for other projects that have benefitted from the pioneering work

Dutch language resources before the Spoken Dutch Corpus
Before 1998
- Data
  - Text collections held by the Institute for Dutch Lexicology (for lexicological and lexicographical purposes)
  - Corpus Uit den Boogaart (word frequencies)
  - Private collections of individual researchers (small, widely diverse, no IPR)
- Tools for Dutch: hardly any
1998-2003 Spoken Dutch Corpus project

Dutch language resources since the Spoken Dutch Corpus
1998-2003  Spoken Dutch Corpus project
           Influential as regards corpus design, standardization, IPR, tool development
2003-2004  E-lexicon
2005-2006  Dutch Language Corpus Initiative (D-Coi) project
2004-2008  Jasmin (children, non-natives, elderly people for HLT applications)
2006       CoDAS (Corpus of Dutch Aphasic Speech)
2008-2011  STEVIN Nederlandstalig Referentiecorpus (SoNaR) project
2009-2012  BasiLex (corpus of texts for Dutch school children)
2013-2015  BasiScript (corpus of texts written by Dutch school children)
2013-2014  CLARIN-NL Data curation service
2015       OpenCGN
2015-      CLARIAH, incl. CLARIAH DCS

Reflections in retrospect
While looking back on ambitions and experiences then (1998-2003), what would/might we do differently?
• In view of developments since the corpus came into being
• Considering the actual use of the corpus
• In the light of user feedback

Experiences from actual practice
• Use conventional spelling
• Mark chunks
• Conform to EAGLES guidelines
• Time spent on development of protocols
• Use tools to generate initial annotation and then do post-editing (e.g. POS tagging/lemmatization, phonetic transcription, syntactic annotation, word alignment)
• Timely compilation of the lexicon so that it could optimally support the various transcription and annotation processes was not achieved
• In view of the project duration some types of transcription or annotation had to be carried out in parallel
• Prosodic annotation not for non-specialists

Things we would do differently?
• In view of developments since the corpus came into being: availability of standards, tools and other resources, guidelines, …
• Considering the actual use of the corpus: what has the corpus been used for (and what not, although we had anticipated that)?
• In the light of user feedback: spontaneity lacking, corpus not suited for conversation studies, groups of speakers badly represented or not included at all, ...

OpenCGN (2015, ongoing)
Aim: To make SoNaR and CGN data available for exploitation within a single environment
Funding: CLARIN NL
Data curation includes:
• Audio: Conversion of WAV to MP3
• CGN XML converted to FoLiA
• Metadata: harmonize with SoNaR metadata
• Check to what extent current output of FROG conforms to tagset specified and documented in manual

The CGN lexicon
• Consists of a single-word and a multi-word lexicon:
  Single-word lexicon
  - 181,579 word form entries (type-word class pairs)
  - 229,104 entries (incl. syntactic patterns)
  Multi-word lexicon
  - 23,567 unique multi-word expressions
  - 18,593 unique multi-word lemmas
  - 53,704 multi-word entries
• Comprises all types of word forms that occur in the corpus
• Contains information about spelling, word class, lemma, canonical pronunciation, subcategorization, status, etc.

The CGN lexicon
Lexicon Status
  B   = southern Dutch
  INF = informal
  *d  = dialect
  *u  = (possibly deliberate) mispronunciation
  *v  = foreign word without loan word status
  *x  = unintelligible word
  *z  = word pronounced with a strong regional accent, transcribed in standard Dutch spelling
Corpus Status
  C = correct spelling of corpus type
  I = incorrect spelling of corpus type
  O = non-validated spelling of corpus type
  V = validated spelling of corpus type