
Text-to-Speech Synthesis
Bernd Möbius
Language Science and Technology, Saarland University
Lecture 7, June 25, 2020
Prediction of Prosody: Intonation
B. Möbius, Intonation Modeling

Prosody: Duration and intonation
Temporal and tonal structure in speech synthesis
  - all synthesis methods use models to predict duration and F0
  - models are trained on observed duration and F0 data
  - Unit Selection:
      - phone duration and phone-level F0 used in target specification
      - F0 smoothness considered in join costs

Prosodic features in TTS
Temporal and tonal features in corpus-based synthesis
  - symbolic representation of intonation (e.g. ToBI tones) as part of the target specification (of units and context)
  - acoustic duration of units as part of target specification
  - F0 continuity as dimension of concatenation cost
  - model-based prediction of F0 and duration to generate or modify prosody of concatenated utterance
Prosodic features in HMM synthesis: see lecture on July 9
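
The two cost terms named above (prosodic target specification, F0 continuity at the join) can be sketched as follows. This is a minimal illustration, not the cost function of any particular system: the dictionary keys, the weights, and the choice of semitones as the F0 distance are all assumptions made for the example.

```python
import math

def semitones(f0_a_hz, f0_b_hz):
    """Interval from f0_a_hz to f0_b_hz in semitones."""
    return 12.0 * math.log2(f0_b_hz / f0_a_hz)

def target_cost(candidate, target, w_dur=1.0, w_f0=1.0):
    """Mismatch between a candidate unit and the predicted prosodic target.

    Hypothetical keys: candidate has dur_ms and f0_mean_hz,
    target has dur_ms and f0_hz.
    """
    dur_term = abs(candidate["dur_ms"] - target["dur_ms"]) / target["dur_ms"]
    f0_term = abs(semitones(target["f0_hz"], candidate["f0_mean_hz"]))
    return w_dur * dur_term + w_f0 * f0_term

def join_cost(left, right, w_f0=1.0):
    """F0 discontinuity at a join; unvoiced edges (F0 <= 0) add no cost."""
    if left["f0_end_hz"] <= 0 or right["f0_start_hz"] <= 0:
        return 0.0
    return w_f0 * abs(semitones(left["f0_end_hz"], right["f0_start_hz"]))
```

With these definitions, a join from 100 Hz to 200 Hz costs one octave (12 semitones), while a join between units with matching edge F0 costs nothing.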

Intonation prediction
Task of intonation model in TTS
  - compute a continuous acoustic parameter (F0) from a symbolic representation of intonation inferred from text

Intonation

F0 as a complex phenomenon
Main problem for intonation models: linguistic, paralinguistic, extralinguistic factors expressed by F0
  - lexical tones
  - syllabic stress, word accent
  - stress groups, accent groups
  - prosodic phrasing
  - sentence mode
  - discourse intonation
  - pitch range, register
  - phonation type, voice quality
  - microprosody: intrinsic and coarticulatory F0

Intonation prediction
Task of intonation model in TTS
  - compute a continuous acoustic parameter (F0) from a symbolic representation of intonation inferred from text
Intonation models commonly applied in TTS systems:
  - phonological tone-sequence models (Pierrehumbert, ToBI)
  - acoustic-phonetic superposition models (Fujisaki)
  - acoustic stylization models (Tilt, PaIntE, INTSINT)
  - perception-based models (IPO)
  - function-oriented models (KIM)


Tone sequence model
Autosegmental-metrical theory of intonation [Pierrehumbert 1980]
  - intonation is represented by a sequence of high (H) and low (L) tones
  - H and L are members of a primary phonological contrast
  - hierarchy of intonational domains
      - IP: Intonation Phrase; boundary tones: H%, L%
      - ip: intermediate phrase; phrase tones: H-, L-
      - pw: prosodic word; pitch accents: H*, H*L, L*H, …

Pierrehumbert's model
Finite-state grammar of well-formed tone sequences
Example [adapted from Pierrehumbert 1980, p. 276]
  Text:    That's a remarkably clever suggestion.
  Tones:   %H H* H*L L- L%
  Domains: pw, ip, IP

Pierrehumbert's model
[Figure: finite-state graph of tone sequences across the domains pw, ip, IP]
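
The idea of a finite-state grammar over tone sequences can be sketched with a regular expression. This is a simplified illustration under stated assumptions: the tone inventory is restricted to the symbols shown on these slides (the full pitch accent inventory is larger), and the grammar assumed here is IP = optional initial boundary tone, one or more ip, final boundary tone, with ip = one or more pitch accents followed by a phrase tone. The function name `is_well_formed` is made up for the example.

```python
import re

# Inventory limited to the tones shown on the slides (an assumption).
PITCH_ACCENTS = ("H*L", "L*H", "H*")
PHRASE_TONES = ("H-", "L-")
FINAL_BOUNDARY = ("H%", "L%")
INITIAL_BOUNDARY = ("%H",)

def _alt(tones):
    """Non-capturing alternation over literal tone symbols."""
    return "(?:" + "|".join(re.escape(t) for t in tones) + ")"

# IP -> [initial boundary] ip+ final boundary
# ip -> pitch accent+ phrase tone
_IP = re.compile(
    rf"(?:{_alt(INITIAL_BOUNDARY)} )?"
    rf"(?:(?:{_alt(PITCH_ACCENTS)} )+{_alt(PHRASE_TONES)} )+"
    rf"{_alt(FINAL_BOUNDARY)}"
)

def is_well_formed(tones):
    """True if the list of tone symbols forms one complete IP."""
    return _IP.fullmatch(" ".join(tones)) is not None
```

Under these assumptions the example sequence `["%H", "H*", "H*L", "L-", "L%"]` is accepted, while a sequence without a phrase tone, such as `["H*", "L%"]`, is rejected.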

ToBI: Tones and Break Indices
Formalization of intonation model as transcription system [Silverman et al. 1992]
  - phonemic (= broad phonetic) transcription
  - originally designed for American English
  - limited applicability to other varieties/languages
      - language-specific inventory of phonological units
      - language-specific details of F0 contours
  - adapted to many languages (e.g. GToBI, JToBI, KToBI)
  - implemented in many TTS systems
      - abstract tonal representation converted to F0 contours by means of phonetic realization rules
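
One simple way to picture the conversion from abstract tones to an F0 contour is target assignment followed by interpolation. The sketch below is an illustration only, not the realization rules of any specific TTS system: it assumes the tones have already been reduced to timed H and L levels, maps them to the top and bottom of a hypothetical pitch range (90 to 180 Hz), and connects the targets linearly. Actual realization rules are language-specific and typically also handle effects such as downstep and alignment with the segmental string.

```python
def tone_targets(events, low_hz=90.0, high_hz=180.0):
    """Map (time_s, level) events, level in {"H", "L"}, to (time_s, f0_hz)."""
    level_hz = {"H": high_hz, "L": low_hz}
    return [(t, level_hz[level]) for t, level in events]

def interpolate(targets, step_s=0.01):
    """Piecewise-linear F0 contour sampled every step_s seconds.

    Expects at least two targets with strictly increasing times.
    """
    start = targets[0][0]
    n_frames = int(round((targets[-1][0] - start) / step_s)) + 1
    contour = []
    seg = 0
    for i in range(n_frames):
        t = start + i * step_s
        # Advance to the segment that contains time t.
        while seg < len(targets) - 2 and t > targets[seg + 1][0]:
            seg += 1
        (t0, f0), (t1, f1) = targets[seg], targets[seg + 1]
        frac = (t - t0) / (t1 - t0)
        contour.append(f0 + frac * (f1 - f0))
    return contour

# Usage: a low-high-low movement over one second.
contour = interpolate(tone_targets([(0.0, "L"), (0.5, "H"), (1.0, "L")]))
```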


Fujisaki's model [Fujisaki 1983, 1988; Möbius 1993]

Fujisaki's model
Properties
  - superpositional
  - physiological basis and interpretation of components and control parameters
  - linguistic interpretation of components
  - applied to many (typologically diverse) languages
Origins
  - Öhman and Lindqvist (1966), Öhman (1967)
  - Fujisaki et al. (1979), Fujisaki (1983, 1988)


Fujisaki's model: components [Möbius 1993]
Approximation of natural F0 by optimal parameter values within linguistic constraints (accents, phrase structure)
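
The superposition of components can be sketched in code using the commonly cited form of the model, which the slides do not spell out: ln F0(t) = ln Fb + sum of Ap * Gp(t - T0) over phrase commands + sum of Aa * [Ga(t - T1) - Ga(t - T2)] over accent commands, with Gp(t) = alpha^2 * t * exp(-alpha * t) and Ga(t) = min[1 - (1 + beta * t) * exp(-beta * t), gamma] for t >= 0, and both zero for t < 0. The default values for alpha, beta, and gamma below are illustrative assumptions, and the function names are invented for the example.

```python
import math

def phrase_response(t, alpha=2.0):
    """Gp(t): impulse response of the phrase control mechanism."""
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)

def accent_response(t, beta=20.0, gamma=0.9):
    """Ga(t): step response of the accent control mechanism, capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)

def fujisaki_f0(t, fb_hz, phrases, accents, alpha=2.0, beta=20.0, gamma=0.9):
    """F0 in Hz at time t (seconds).

    fb_hz:   baseline frequency Fb
    phrases: list of (T0, Ap) phrase commands (impulses)
    accents: list of (T1, T2, Aa) accent commands (onset, offset, amplitude)
    """
    ln_f0 = math.log(fb_hz)
    for t0, ap in phrases:
        ln_f0 += ap * phrase_response(t - t0, alpha)
    for t1, t2, aa in accents:
        ln_f0 += aa * (accent_response(t - t1, beta, gamma)
                       - accent_response(t - t2, beta, gamma))
    return math.exp(ln_f0)

# Usage: one phrase command and one accent command, sampled every 10 ms.
contour = [fujisaki_f0(i * 0.01, 100.0, [(0.0, 0.5)], [(0.3, 0.6, 0.4)])
           for i in range(150)]
```

Because the components add in the log domain, the contour equals Fb wherever no command is active, and each command scales F0 multiplicatively. Fitting a natural contour then amounts to searching for command timings and amplitudes that minimize the distance to the observed F0, subject to the linguistic constraints named above.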

Intonation models in TTS
Predicting tonal structure from text
  - sentence mode
  - location of accents and phrase boundaries
Why more would be better
  - prediction of accent position and accent type produces perceptually more adequate F0 contours
  - illustrated by manual mark-up
Why more is difficult
  - predicting accent type from text is difficult and unreliable
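
A very rough baseline for the first step (sentence mode, accent location, phrase boundaries) can be sketched as follows. The heuristics are assumptions chosen for illustration, not the method of these slides: sentence mode from the final punctuation mark, phrase boundaries at punctuation, and accents on every word that is not in a small hand-picked function-word list. In line with the point above, the sketch predicts only where accents fall, not their type.

```python
import re

# Small, hand-picked function-word list (hypothetical, for illustration).
FUNCTION_WORDS = {"a", "an", "the", "of", "to", "in", "on", "and", "or",
                  "is", "it", "that", "that's"}
PUNCTUATION = set(".,;:?!")

def predict_tonal_structure(sentence):
    """Return (sentence_mode, words) with accent and boundary flags per word."""
    mode = "question" if sentence.rstrip().endswith("?") else "statement"
    words = []
    for tok in re.findall(r"[\w']+|[.,;:?!]", sentence):
        if tok in PUNCTUATION:
            if words:
                words[-1]["boundary"] = True  # phrase boundary after last word
            continue
        words.append({"word": tok,
                      "accent": tok.lower() not in FUNCTION_WORDS,
                      "boundary": False})
    return mode, words

mode, words = predict_tonal_structure("That's a remarkably clever suggestion.")
```

For the example sentence this marks "remarkably", "clever", and "suggestion" as accented and places a boundary after "suggestion".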

Essential content: Intonation modeling
  - What is the task of the intonation prediction module in TTS?
  - Why are intonation and F0 prediction difficult?
  - Characterize the two major types of intonation models applied in TTS (Pierrehumbert's model / ToBI; Fujisaki's model).