Text-to-Speech Synthesis
Bernd Möbius
Language Science and Technology, Saarland University
Lecture 3, May 6, 2021
Formant Synthesis

Formant synthesis
- acoustic-parametric synthesis method
- modeling the acoustic properties of speech sounds
- based on acoustic theory of speech production [Fant 1960]
- source-filter model

Source-filter model of speech production

[Figure: source-filter model of speech production. Panels: glottal excitation; vocal tract: frequency response; sound spectrum]
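In the frequency domain, the source-filter model is commonly written as a product of spectra. A minimal formulation follows; the symbols are assumptions and do not come from the slides: U(f) for the spectrum of the glottal excitation, T(f) for the frequency response of the vocal tract, and S(f) for the spectrum of the resulting sound.

```latex
\begin{align*}
  S(f) &= U(f) \cdot T(f) \\
  20 \log_{10} |S(f)| &= 20 \log_{10} |U(f)| + 20 \log_{10} |T(f)|
\end{align*}
```

On a dB scale the two contributions therefore add. Lip radiation is left out of this sketch; fuller treatments include it as a further factor.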

Vocal tract as acoustic filter
- vocal tract geometry, determined by tongue position (and jaw opening and lip protrusion, not shown)

Vocal tract: acoustic tube model
[Figure: Clark et al., 2007a, p. 241]

Idealized simple tube model
- acoustic signals propagate as longitudinal waves in the vocal tract
- 2 physical parameters of acoustic waves:
  - sound pressure p: change of air pressure caused by the sound at the place of measurement
  - particle velocity v: speed of the air particles set in motion by the sound event (note: this is not the speed of sound c!)
- perfect reflection at the sound-hard (lossless) walls of the tube: v = 0 at the place of reflection
- (lossy) reflection at the sound-soft transition from the vocal tract to the free acoustic field (i.e. from lips to air): p = 0 at the place of radiation
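The two boundary conditions determine which standing waves fit into the tube. A brief derivation follows, assuming a uniform tube of length L that is closed at the glottis end (position 0) and open at the lips (position L). The symbol L, the index i, and the placement of the closed end at the glottis are assumptions made for this sketch.

```latex
\begin{align*}
  v(0) = 0,\quad p(L) = 0
  \;&\Rightarrow\; L = (2i - 1)\,\frac{\lambda_i}{4}, \qquad i = 1, 2, 3, \dots \\
  \;&\Rightarrow\; \lambda_i = \frac{4L}{2i - 1}, \qquad
  f_i = \frac{c}{\lambda_i} = \frac{(2i - 1)\,c}{4L}
\end{align*}
```

The tube thus holds an odd number of quarter wavelengths, which is why such a tube is often called a quarter-wavelength resonator.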

Sound pressure waves in vocal tract
[Figure with labels p = 0 and v = 0; Hess, ms.]

Computing formant frequencies
- resonance frequencies of the neutral vocal tract are computed as speed of sound divided by wavelength: f_i = c / λ_i
- frequencies of resonances/formants (with c = 340 m/s and a vocal tract length of 0.17 m):
    F1 = 340 / (4 * 0.17) = 340 / 0.68 = 500 Hz
    F2 = 340 / (4/3 * 0.17) = 3 * 340 / (4 * 0.17) = 1500 Hz
    F3 = 340 / (4/5 * 0.17) = 5 * 340 / (4 * 0.17) = 2500 Hz
- distribution of formant frequencies in the neutral vocal tract corresponds to the formants of the central vowel 'schwa' [ə]
- the simple tube model, with constant cross-section, is inadequate for computing formants of other vowels (cf. acoustic theory of vowel articulation [Ungeheuer 1962])
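The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch for a uniform tube only; the function name and its parameters are made up for illustration, and the defaults are the values used on the slide (340 m/s, 0.17 m).

```python
def uniform_tube_formants(length_m=0.17, speed_of_sound=340.0, count=3):
    """Resonances (Hz) of a uniform tube closed at one end, open at the other."""
    # The i-th resonance fits (2i - 1) quarter wavelengths into the tube,
    # so lambda_i = 4 * L / (2i - 1) and F_i = c / lambda_i.
    return [(2 * i - 1) * speed_of_sound / (4.0 * length_m)
            for i in range(1, count + 1)]


formants = uniform_tube_formants()
print([round(f) for f in formants])  # [500, 1500, 2500]
```

Calling the function with a shorter tube length yields proportionally higher formants, since every F_i scales with 1 / L.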

Tube model with varying cross-section
[Figure: Clark et al., 2007a, p. 246]

Acoustic theory of vowel articulation (not required)

Vowels (IPA)
[Figure: vowel chart with axes F2 and F1]

Vowels (German, [Pompino-Marschall 1995])

Vowels (German, F1/F2/F3 [Möbius 2001a])

Cascade vs. parallel resonators
[Figure: Allen et al. 1987]
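The difference between the two configurations can be sketched in code. The example below uses a standard two-pole digital resonator of the form y[n] = a*x[n] + b*y[n-1] + c*y[n-2], normalized to unity gain at 0 Hz. All function names, the sample rate, the bandwidths, and the branch gains are illustrative assumptions, not values from the slides, and the parallel branch is simplified to a plain weighted sum.

```python
import math


def resonator_coefficients(freq_hz, bandwidth_hz, sample_rate_hz):
    """Coefficients of a two-pole digital resonator with unity gain at 0 Hz."""
    t = 1.0 / sample_rate_hz
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * freq_hz * t))
    a = 1.0 - b - c
    return a, b, c


def resonate(signal, freq_hz, bandwidth_hz, sample_rate_hz):
    """Filter a signal with y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    a, b, c = resonator_coefficients(freq_hz, bandwidth_hz, sample_rate_hz)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def cascade(signal, formants, sample_rate_hz):
    """Chain the resonators: each one filters the output of the previous one."""
    for freq_hz, bandwidth_hz in formants:
        signal = resonate(signal, freq_hz, bandwidth_hz, sample_rate_hz)
    return signal


def parallel(signal, formants, gains, sample_rate_hz):
    """Feed the same input to every resonator and sum the weighted outputs."""
    out = [0.0] * len(signal)
    for (freq_hz, bandwidth_hz), gain in zip(formants, gains):
        branch = resonate(signal, freq_hz, bandwidth_hz, sample_rate_hz)
        out = [acc + gain * y for acc, y in zip(out, branch)]
    return out


SAMPLE_RATE = 10000  # Hz, assumed for this sketch
# Schwa-like formant frequencies (Hz) with assumed bandwidths (Hz).
FORMANTS = [(500, 60), (1500, 90), (2500, 120)]
impulse = [1.0] + [0.0] * 199

cascade_out = cascade(impulse, FORMANTS, SAMPLE_RATE)
parallel_out = parallel(impulse, FORMANTS, [1.0, 1.0, 1.0], SAMPLE_RATE)
print(len(cascade_out), len(parallel_out))
```

In the cascade the relative formant amplitudes follow from the formant frequencies and bandwidths alone, whereas the parallel arrangement needs an explicit gain per branch.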

Cascade/parallel resonators and voice source
[Figure: Allen et al. 1987]
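The excitation that drives the resonators can be sketched as well. A common simplification is an impulse train at the fundamental frequency for voiced sounds and random noise for unvoiced sounds. The function names, parameters, and defaults below are hypothetical, and real synthesizers shape the glottal pulses further.

```python
import random


def impulse_train(f0_hz, duration_s, sample_rate_hz, amplitude=1.0):
    """Voiced source: one impulse per glottal period, zeros in between."""
    n_samples = int(round(duration_s * sample_rate_hz))
    samples = [0.0] * n_samples
    period = sample_rate_hz / f0_hz  # period in samples, may be fractional
    position = 0.0
    while int(round(position)) < n_samples:
        samples[int(round(position))] = amplitude
        position += period
    return samples


def white_noise(duration_s, sample_rate_hz, amplitude=1.0, seed=0):
    """Unvoiced source: uniformly distributed noise, seeded for repeatability."""
    rng = random.Random(seed)
    n_samples = int(round(duration_s * sample_rate_hz))
    return [amplitude * rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


voiced = impulse_train(f0_hz=100, duration_s=1.0, sample_rate_hz=10000)
unvoiced = white_noise(duration_s=1.0, sample_rate_hz=10000, amplitude=0.1)
print(sum(1 for s in voiced if s != 0.0))  # 100 pulses in one second
```

Changing `f0_hz` over time would be the hook for prosody control, while `amplitude` plays the role of a source amplitude parameter.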

Klatt's formant synthesizer
[Figure: Klatt 1980]

Klatt parameter values
[Table: Allen et al. 1987]

IMSkpe: Klatt parameter editor
- GUI interactive tool for doing formant synthesis
- http://sourceforge.net/projects/imskpe/
- https://github.com/imskpe/
- (Andreas Madsack, IMS, Univ. Stuttgart)

Formant synthesis: Summary
- acoustic-parametric synthesis method
- modeling the acoustic properties of speech sounds
- based on acoustic theory of speech production [Fant 1960]
- source-filter model
- explicit control of voice source parameters and prosody
- fair approximation of formant structure of speech sounds
- extensive knowledge acquisition and rule building phases
- TTS systems: Klatt-Talk (MITalk, DECtalk), Delta, Infovox

Essential content
- Formant synthesis
  - architecture and functional principle of a formant synthesizer, here: Klatt synthesizer
  - relationship between a formant synthesizer and the source-filter model of speech production