Sub-Project I
Prosody, Tones and Text-To-Speech Synthesis
Sin-Horng Chen (PI), Chiu-yu Tseng (Co-PI), Yih-Ru Wang (Co-PI), Yuan-Fu Liao (Co-PI), Lin-shan Lee, Hsin-min Wang

Outline
■ Members
■ Theme of Sub-project I
■ Research Roadmap
■ Current Achievements
■ Research Infrastructure
■ Future Direction

Members
Sin-Horng Chen, Professor (PI), NCTU
Chiu-yu Tseng, Professor & Research Fellow (Co-PI), Academia Sinica
Yih-Ru Wang, Associate Professor (Co-PI), NCTU
Yuan-Fu Liao, Assistant Professor (Co-PI), NTUT
Lin-shan Lee, Professor, NTU
Hsin-min Wang, Associate Research Fellow, Academia Sinica

Theme of Sub-Project I
[Diagram: research themes and their applications. Labels on the diagram:]
Prosody Analysis and Modeling
  – Hierarchical modeling of fluent prosody
  – Latent factor-based pitch contour model: mean model, shape model
Tone Behavior and Modeling
  – Prosodic model-based tone recognizer
  – Tone sandhi
Applications in Text-to-speech Synthesis
  – High performance TTS
Applications in Speech/Speaker Recognition
  – Speaker recognition (plot labels: less breaks, more breaks, fast speakers, slow speakers)

Research Focus
■ How to analyze and model fluent speech prosody
  – Approach 1: Hierarchical modeling of fluent speech prosody
    • Develop a hierarchical prosody framework of fluent speech
    • Construct modular acoustic models for: (1) F0 contours, (2) duration patterns, (3) intensity distribution and (4) boundary breaks
  – Approach 2: Latent factor analysis-based modeling
    • Assume there are some latent affecting factors
    • Latent factor analysis for syllable duration, pitch contour, energy and inter-syllable coarticulation
    • Explore the relation between latent factors and syntactic information
■ How to integrate these two approaches and apply them to
  – Text-to-speech synthesis
  – Speech/tone/speaker recognition

Research Roadmap
[Diagram: research threads running from Current Achievements to Future Direction]
• COSPRO corpus/Toolkits
• Hierarchical modeling of fluent speech prosody
• Investigation in relation to prosody organization: F0 range and reset, naturalness and measurement, voice quality
• RNN/VQ-based prosodic modeling
• Latent factor analysis: duration, pitch mean, shape, inter-syllable coarticulation
• Automatic prosodic labeling
• Prosodic phrase analysis
• Model-based TTS
• Corpus-based TTS
• High performance TTS: Mandarin, Min-south, Hakka
• Tone modeling and recognition: MLP/RNN
• HMM
• Model-based tone recognizer
• Eigen prosody analysis-based speaker recognition
• Prosodic model-based speaker recognition
• Language model + pause, PM
• Prosodic cues-dependent LM

Hierarchical Prosody Framework of Fluent Speech (1/4)
■ Hierarchical framework of fluent speech prosody for multi-phrase speech paragraphs
  – Hierarchical cross-phrase patterns and contributions are found in all 4 acoustic dimensions.
  – Acoustic templates are derived for each prosody level
    • F0 template
    • Syllable duration templates and temporal allocation patterns
    • Intensity distribution patterns
    • Boundary break patterns

Hierarchical Prosody Framework of Fluent Speech (2/4)
■ The Prosody Hierarchy with Prosodic Boundaries
[Figure: prosody hierarchy tree. Node labels: Prosodic Group (B5), Breath Group (B4), Prosodic Phrase (B3; initial PP, middle PP, final PP), PW (B2).]

Hierarchical Prosody Framework of Fluent Speech (3/4)
■ F0 cadence of multi-phrase PG (Prosodic Phrase Group)
■ Syllable duration cadence of multi-phrase PG
[Figure: "Tide over Wave and Ripple". Labels: PG-initial PPh, PG-medial PPh, PG-final PPh; the PW level, the PPh level.]

Hierarchical Prosody Framework of Fluent Speech (4/4)
■ Duration re-synthesis, F054C
■ F0 re-synthesis, F054C
■ Cross-speaker synthesis: To manipulate Speaker A’s duration parameters with Speaker B’s
[Sample labels: Original, Modified]

Latent Factor Analysis-based Prosody Modeling (1/3)
■ Syllable Duration Model
  – Multiplicative model
  – Additive model
■ Relations between Prosodic State CFs of Initial/Final and Syllable Duration Models
[Statistics shown on the slide, two values per row:]
  mean: 42.3 frames, 43.9 frames
  variance: 180 frame², 2.52 frame²
  RMSE: 1.93 frames (5 ms/frame)
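
The slide names a multiplicative and an additive syllable duration model but the equations did not survive. The sketch below assumes the usual reading of those terms: each affecting factor either scales the base duration (multiplicative) or shifts it by an offset in frames (additive). The factor names and values are hypothetical and are not taken from the slides; only the 5 ms/frame conversion is.

```python
from math import prod

FRAME_MS = 5.0  # frame length stated on the slide: 5 ms/frame


def multiplicative_duration(base_frames, factors):
    """Assumed form: base duration scaled by every factor's multiplier."""
    return base_frames * prod(factors.values())


def additive_duration(base_frames, offsets):
    """Assumed form: base duration plus every factor's offset (in frames)."""
    return base_frames + sum(offsets.values())


def normalize_multiplicative(observed_frames, factors):
    """Invert the multiplicative model to remove the factors' influence."""
    return observed_frames / prod(factors.values())


def frames_to_ms(frames):
    """Convert a duration in frames to milliseconds."""
    return frames * FRAME_MS


# Hypothetical factor values for a single syllable (illustration only).
multipliers = {"tone": 1.10, "prosodic_state": 0.90, "base_syllable": 1.05}
offsets = {"tone": 2.0, "prosodic_state": -4.0, "base_syllable": 1.5}

observed = multiplicative_duration(42.3, multipliers)
recovered = normalize_multiplicative(observed, multipliers)
shifted = additive_duration(42.3, offsets)
rmse_ms = frames_to_ms(1.93)  # the slide's RMSE expressed in milliseconds
```

Under the 5 ms/frame setting, the reported RMSE of 1.93 frames corresponds to roughly 9.65 ms.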

Latent Factor Analysis-based Prosody Modeling (2/3)
■ Syllable Pitch Contour Model
  – Mean model
  – Shape model
■ The patterns of x-3 -3
■ Reconstructed pitch mean
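
The split into a mean model and a shape model can be illustrated with a deliberately simplified decomposition: a syllable's pitch contour is separated into its average level and a zero-mean residual shape, and the two parts are then modeled separately. This is only an assumption about what "mean" and "shape" refer to; the slides do not give the actual parameterization, and the F0 values are made up.

```python
def split_contour(f0):
    """Split F0 samples into a scalar mean and a zero-mean shape."""
    mean = sum(f0) / len(f0)
    shape = [value - mean for value in f0]
    return mean, shape


def rebuild_contour(mean, shape):
    """Recombine a (possibly re-modeled) mean with a shape."""
    return [mean + s for s in shape]


# Hypothetical F0 samples (Hz) for one syllable.
contour = [180.0, 190.0, 205.0, 195.0]
mean, shape = split_contour(contour)
rebuilt = rebuild_contour(mean, shape)
```

Keeping the two parts separate means a change to the pitch level, for example from a different prosodic state, can be applied without touching the tone-specific shape.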

Latent Factor Analysis-based Prosody Modeling (3/3)
■ Inter-syllable coarticulation pitch contour model
■ The relationship of syllable pitch contours and affecting factors
■ Reconstructed pitch contour

Mandarin/Taiwanese TTS
■ Block diagram of TTS system
■ TTS samples
[Sample table: columns Model-based TTS and Corpus-based TTS; rows female 1, female 2, female 3, female 4, female 5, Taiwanese]

Tone Behavior Modeling and Recognition with Inter-Syllabic Features
■ Gabor-IFAS-based pitch detection
■ Four inter-syllabic features
  – Ratio of duration of adjacent syllables
  – Averaged pitch value over a syllable
  – Maximum pitch difference within a syllable
  – Averaged slope of the pitch contour over a syllable
■ Context-dependent tone behavior modeling
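
The four features listed on this slide can be sketched directly from frame-level pitch values. The exact definitions are not given in the slides, so the code assumes that a syllable's duration is its number of pitch frames, that "maximum pitch difference" is max minus min, and that "averaged slope" is the mean frame-to-frame pitch change (in Hz per frame). The function names and pitch values are hypothetical.

```python
def duration_ratio(cur_f0, prev_f0):
    """Duration of the current syllable relative to the previous one."""
    return len(cur_f0) / len(prev_f0)


def mean_pitch(f0):
    """Averaged pitch value over a syllable."""
    return sum(f0) / len(f0)


def max_pitch_difference(f0):
    """Largest pitch excursion within a syllable."""
    return max(f0) - min(f0)


def mean_slope(f0):
    """Mean frame-to-frame pitch change; needs at least two frames."""
    diffs = [b - a for a, b in zip(f0, f0[1:])]
    return sum(diffs) / len(diffs)


def inter_syllabic_features(prev_f0, cur_f0):
    """Collect the four features for the current syllable."""
    return {
        "duration_ratio": duration_ratio(cur_f0, prev_f0),
        "mean_pitch": mean_pitch(cur_f0),
        "max_pitch_difference": max_pitch_difference(cur_f0),
        "mean_slope": mean_slope(cur_f0),
    }


# Hypothetical pitch frames (Hz) for two adjacent syllables.
prev_syllable = [200.0, 210.0, 220.0]
cur_syllable = [220.0, 200.0, 180.0, 160.0]
features = inter_syllabic_features(prev_syllable, cur_syllable)
```

A steadily falling contour like the second syllable gives a negative mean slope and a large pitch difference, which is the kind of cue a tone model could use to separate falling from level tones.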

Eigen-Prosody Analysis-based Robust Speaker Recognition
■ Use latent semantic analysis (LSA) to efficiently extract useful speaker cues to resist handset mismatch from few training/test data
  – Step 1: Automatic prosodic state labeling and speaker-keyword statistics
  – Step 2: Eigen-prosody space construction using latent semantic analysis
[Diagram: prosodic features -> VQ-based prosody modeling -> prosody state labeling -> sequences of prosody states -> prosody keyword parsing (with a prosody keywords dictionary) -> co-occurrence matrix A (speakers and prosody keywords) -> eigen-prosody analysis (SVD, A = U S V^T) -> eigen-prosody space, reduced from the high-dimensional prosody space. Plot labels in the eigen-prosody space: fast speakers, slow speakers, less breaks, more breaks.]
■ Experimental results on HTIMIT corpus
  – Ten different handsets
  – 302 speakers
  – 7/3 utterances for training/test respectively
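
The two steps on this slide can be sketched in plain Python. The sketch assumes that a "prosody keyword" is an n-gram of prosody-state labels and that rows of the co-occurrence matrix are speakers; neither detail is confirmed by the slides. To stay dependency-free it uses power iteration to find only the leading direction of the eigen-prosody space, standing in for the full SVD named on the slide. Speaker names and state sequences are invented.

```python
from collections import Counter
from math import sqrt


def keyword_counts(states, n=2):
    """Count n-grams of prosody-state labels (assumed 'prosody keywords')."""
    return Counter(tuple(states[i:i + n]) for i in range(len(states) - n + 1))


def cooccurrence_matrix(speaker_states, n=2):
    """Build a speakers-by-keywords count matrix."""
    counts = {spk: keyword_counts(seq, n) for spk, seq in speaker_states.items()}
    vocab = sorted({kw for c in counts.values() for kw in c})
    speakers = sorted(counts)
    matrix = [[counts[spk][kw] for kw in vocab] for spk in speakers]
    return speakers, vocab, matrix


def leading_direction(matrix, iters=200):
    """Power iteration on A^T A: dominant right singular vector of A."""
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0 / sqrt(cols)] * cols
    for _ in range(iters):
        av = [sum(a * b for a, b in zip(row, v)) for row in matrix]
        atav = [sum(matrix[r][c] * av[r] for r in range(rows)) for c in range(cols)]
        norm = sqrt(sum(x * x for x in atav))
        if norm == 0.0:
            break
        v = [x / norm for x in atav]
    return v


def project(matrix, v):
    """Coordinate of each speaker along the direction v."""
    return [sum(a * b for a, b in zip(row, v)) for row in matrix]


# Hypothetical prosody-state sequences for three speakers.
speaker_states = {
    "spk_a": [0, 1, 0, 1, 0],
    "spk_b": [2, 2, 2, 2, 2],
    "spk_c": [0, 1, 0, 1, 2],
}
speakers, vocab, matrix = cooccurrence_matrix(speaker_states)
direction = leading_direction(matrix)
coords = project(matrix, direction)
```

With a full SVD, keeping the top few singular vectors instead of one would give each speaker a short coordinate vector in the eigen-prosody space.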

Research Infrastructure (1/2)
■ Sinica COSPRO and Toolkits: http://www.myet.com/COSPRO/
  – 9 sets of Mandarin Chinese fluent speech corpora collected
  – Platform developed
  – Each corpus was designed to bring out different prosody features involved in fluent speech.
  – Annotation processes include labeling and tagging perceived units and boundaries in fluent speech, especially the ultimate unit, the multiple-phrase speech paragraph.
  – Framework constructed to bring out speech paragraphs and cross-phrase prosodic relationships characteristic of narrative or discourse organization.

Research Infrastructure (2/2)
■ Tree-Bank Speech Database
  – Uttered by a single female speaker
  – Short paragraphs, 110,000 syllables
  – Sentence-based syntactic tree annotated manually
  – Pitch contour and syllable segmentation corrected manually

Future Direction (1/5)
■ Automatic prosodic labeling of Mandarin speech corpus
■ Analysis of prosodic phrase structure
■ Model-based tone recognition
■ High performance TTS
■ Speech recognition/language modeling using prosodic cues
■ Prosodic modeling-based robust speaker recognition

Future Direction (2/5)
■ Automatic prosodic labeling of Mandarin speech corpus
  – Goal: To construct a prosody-syntax model by exploiting the relationship between prosodic features and linguistic features, and to use it for automatic labeling of various acoustic cues:
    • Prosodic phrase boundary detection
    • Inter-syllable/inter-word coarticulation classification
    • Full/half/sandhi tone labeling for Tone 3
    • Syllable pronunciation clustering
    • Homograph determination
    • The grouping of monosyllabic words with their neighboring words

Future Direction (3/5)
■ Analysis of prosodic phrase structure
  – 4-level prosody hierarchy: PW, PPh, BG, PG
  – Issues to be studied
    • Detection and classification of prosodic phrases
    • Relation between syntactic phrase structure and prosodic phrase structure
    • Other affecting factors: speaking rate, speaking style, emotion type, spontaneity of speech
■ Model-based tone recognition
  – Current approach
    • Acoustic feature normalization
    • Context-dependent tone modeling
  – Main idea: Use the above statistics-based prosody models to compensate for the effects of various affecting factors on syllable pitch contour, duration, and energy contour

Future Direction (4/5)
■ High performance TTS
  – Apply the sophisticated prosody models
    • Modular model of fluent speech prosody
    • Latent factor analysis-based modeling
  – Main idea: With important prosodic cues properly labeled, the search for an optimal synthesis unit sequence in a large database can be more efficient.
    • Consider both linguistic information and acoustic cues
    • Treat monosyllabic words specially
  – Use the above prosody-syntax models to assist in the generation of prosodic information
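
The search for an optimal unit sequence can be sketched as a standard dynamic-programming (Viterbi) unit selection. The slides do not describe the actual search, so everything here is an assumption made for illustration: the `Unit` fields, the rule that candidates are first pruned by prosodic break label, and both cost functions. The pruning step is where labeled prosodic cues would shrink the search space.

```python
from typing import NamedTuple


class Unit(NamedTuple):
    name: str
    break_label: str  # prosodic break label attached to the unit
    pitch: float      # mean pitch in Hz


def prune(candidates, targets):
    """Keep only units whose break label matches the target's label."""
    return [[u for u in cands if u.break_label == label]
            for cands, (label, _) in zip(candidates, targets)]


def select_units(candidates, target_cost, join_cost):
    """Viterbi search for the sequence with the lowest total cost."""
    best = [[(target_cost(0, u), None) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(p, u), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(i, u), back))
        best.append(row)
    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


# Hypothetical targets (break label, desired pitch) and unit database.
targets = [("B2", 200.0), ("B3", 180.0)]
database = [
    [Unit("u0a", "B2", 195.0), Unit("u0b", "B3", 200.0), Unit("u0c", "B2", 230.0)],
    [Unit("u1a", "B3", 185.0), Unit("u1b", "B3", 150.0)],
]
pruned = prune(database, targets)
path, total = select_units(
    pruned,
    target_cost=lambda i, u: abs(u.pitch - targets[i][1]),
    join_cost=lambda p, u: 0.1 * abs(p.pitch - u.pitch),
)
```

The Viterbi step costs time proportional to the product of adjacent candidate-list sizes, so discarding label-mismatched units before the search reduces work directly.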

Future Direction (5/5)
■ Speech recognition/language modeling using prosodic cues
  – Automatic prosodic state labeling
  – Prosodic state-dependent acoustic modeling
  – Prosodic state-dependent language modeling
■ Prosodic modeling-based robust speaker recognition
  – Automatic prosodic cue labeling
  – N-gram language model to learn the prosodic behavior of speakers
  – Applying principal component analysis (PCA) to the N-gram statistics to find a compact prosodic speaker space
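
The N-gram idea for speaker recognition can be sketched as one bigram model per speaker, trained on that speaker's prosodic-state sequence and used to score a test sequence. This is a minimal illustration under assumptions not stated in the slides: bigrams rather than a general N-gram, add-one smoothing, and invented state sequences. The PCA step is not covered.

```python
from collections import Counter
from math import log


def train_bigram(states, num_states):
    """Return a smoothed log-probability function for state transitions."""
    pair_counts = Counter(zip(states, states[1:]))
    context_counts = Counter(states[:-1])

    def logprob(prev, cur):
        # Add-one smoothing keeps unseen transitions from scoring -inf.
        return log((pair_counts[(prev, cur)] + 1)
                   / (context_counts[prev] + num_states))

    return logprob


def score(logprob, states):
    """Log-likelihood of a prosodic-state sequence under one model."""
    return sum(logprob(p, c) for p, c in zip(states, states[1:]))


def identify(models, states):
    """Pick the speaker whose model gives the highest log-likelihood."""
    return max(models, key=lambda spk: score(models[spk], states))


# Hypothetical training sequences over three prosodic states (0, 1, 2).
models = {
    "spk_a": train_bigram([0, 1, 0, 1, 0, 1, 0, 1], num_states=3),
    "spk_b": train_bigram([0, 0, 0, 2, 2, 2, 0, 0], num_states=3),
}
guess_1 = identify(models, [0, 1, 0, 1])
guess_2 = identify(models, [2, 2, 0, 0])
```

In this toy setup the speaker who alternates states and the speaker who holds states are told apart purely by their transition habits, which is the behavior the N-gram model is meant to capture.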