Predictions of Mandarin Syllable Duration and Boundary Pauses

Predictions of Mandarin Syllable Duration and Boundary Pauses from HPG Prosody Structure
Chun-Hsiang Chang, Institute of Linguistics, Academia Sinica
lawrence@phslab.sinica.edu.tw
10/27/2020 Ne. GSST 2007

Outline
Purposes:
1. How HPG Governs and Constrains Boundary Breaks and Boundary Effects in Fluent Speech
2. How to Utilize HPG Structure to Predict Boundary and Boundary Breaks
3. How to Further Improve Constructed Prediction Model
Topics:
- Pause Features and Cues for Prosodic Boundaries
- Pause Prediction Model
- Duration Features and Cues for Prosodic Boundaries
- Duration Prediction Model
Conclusion

References
- Tseng Chiu-yu and Lee Yeh-lin (2004). “Speech rate and prosody units: Evidence of interaction from Mandarin Chinese”, Proceedings of the International Conference on Speech Prosody 2004 (Mar. 23-26, 2004), Nara, Japan, pp. 251-254.
- Tseng Chiu-yu, Pin Shao-huang, Lee Yeh-lin, Wang Hsin-min and Chen Yong-cheng (2005). “Fluent speech prosody: framework and modeling”, Speech Communication, Vol. 46, issues 3-4 (July 2005), Special Issue on Quantitative Prosody Modeling for Natural Speech Description and Generation, pp. 284-309.
- Tseng Chiu-yu and Fu Bau-Ling (2005). “Duration, Intensity and Pause Predictions in Relation to Prosody Organization”, Proceedings of Interspeech 2005 (September 4-8, 2005), Lisbon, Portugal, pp. 1405-1408.

Sinica COSPRO 05 (Mandarin Chinese Continuous Speech Prosody Corpus)
http://www.myet.com/cospro
- The speech data consisted of readings of 26 paragraphs of text (11592 syllables in total), ranging from 85 to 981 characters per paragraph, by two speakers.
- One female (F051P) and one male (M051P) radio announcer, both under 35 years of age, read the text at a normal speaking rate of 200 ms/syllable.
- All labeling was spot-checked by trained transcribers.

How Important Are Boundary Breaks in Fluent Speech?
[Conditions compared: original; all breaks removed; longest and shortest breaks swapped]

Distribution of Pauses as Boundary Breaks

Nonzero Pauses in B1
[Diagram labels: PW; preceding syllable; B1; following syllable]
- Significant implications for synthesis and recognition

Distributions of B3 and Punctuation Marks in Text
- Significant implications for prosody prediction in unlimited TTS

Revised Pause Model
- Ynor = f(PW length, PW sequence) + Delta 1 (where the calculation of f(PW) is constrained at the B2 level)
[Diagram: a PPh spanning PW_I ... PW_M ... PW_F; break labels B2, B2, B3]

Revised Pause Model (cont.)
- Delta 1 = f(PPh marks, PPh length, PPh sequence) + Delta 2
- Delta 2 = f(BG IMF, PPh length, PPh sequence) + Delta 3
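The layered structure of the pause model (each layer explains what it can, and passes its residue to the next layer up) can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the slides do not give the form of f(), so each layer is modeled here as a simple group mean; the data values, the key layouts, and the reading of "PPh marks" as a mark label attached to each PPh are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fit_layer(targets, keys):
    """Model one layer as the mean target per key; return (table, residues)."""
    groups = defaultdict(list)
    for y, key in zip(targets, keys):
        groups[key].append(y)
    table = {key: mean(values) for key, values in groups.items()}
    residues = [y - table[key] for y, key in zip(targets, keys)]
    return table, residues


# Hypothetical toy data: one normalized pause value after each PW.
pauses = [-0.5, 0.0, 1.5, -1.0, 0.5, 2.5]
# Assumed PW-layer key: (PW length in syllables, PW position within its PPh).
pw_keys = [(2, 1), (2, 2), (2, 3), (2, 1), (2, 2), (2, 3)]
# Assumed PPh-layer key: (PPh mark, PPh length in PWs, PPh position).
pph_keys = [("comma", 3, 1)] * 3 + [("period", 3, 2)] * 3

pw_table, delta1 = fit_layer(pauses, pw_keys)    # Ynor = f(PW) + Delta 1
pph_table, delta2 = fit_layer(delta1, pph_keys)  # Delta 1 = f(PPh) + Delta 2

# Prediction from the two fitted layers; Delta 2 is what remains unexplained.
predicted = [pw_table[a] + pph_table[b] for a, b in zip(pw_keys, pph_keys)]
```

A BG layer would be added the same way, by calling fit_layer on delta2 with a BG-level key. Whatever f() is actually used, the bookkeeping is the same: the observed value equals the sum of the layer predictions plus the last residue.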

Distributions of Post-PW Pauses to Labeled Breaks (PG Levels)

Results of Pause Prediction

Duration Model
[Flowchart: COSPRO Database -> Normalized Duration -> Syllable Layer -> Residues (Delta 1) -> PW Layer -> Residues (Delta 2) -> PPh Layer -> Residues (Delta 3) -> BG Layer]

Duration Model at Syllable Layer
- Ynor = Const + CCt + CVt + Ton + PCt + PVt + PTt + FCt + FVt + FTt + 2-way factors of each factor above + 3-way factors of each syllable + PW Boundary Constraint of each factor above + Delta 1
[Diagram: a PW consisting of Syl_I, Syl_M, Syl_M, Syl_F; break labels B1, B1, B1, B2]
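The main-effect part of the syllable-layer equation amounts to a lookup-and-sum over the nine factors. The sketch below illustrates only that part, under stated assumptions: the 2-way, 3-way and PW-boundary terms are omitted, the factor names are treated as opaque categorical labels, and every coefficient, level name and duration value is hypothetical.

```python
FACTORS = ("CCt", "CVt", "Ton", "PCt", "PVt", "PTt", "FCt", "FVt", "FTt")


def predict_main_effects(syllable, const, coef):
    """Return Const plus one coefficient per factor; unlisted levels add 0."""
    total = const
    for factor in FACTORS:
        level = syllable[factor]
        total += coef.get(factor, {}).get(level, 0.0)
    return total


# Hypothetical coefficients in normalized-duration units; only two factors are filled in.
const = 0.0
coef = {"Ton": {"T3": 0.25, "T5": -0.5}, "CVt": {"a": 0.125}}

# Hypothetical syllable: every factor needs a level, even if no coefficient is listed.
syllable = {factor: "x" for factor in FACTORS}
syllable.update({"Ton": "T3", "CVt": "a"})

observed = 0.5                               # hypothetical normalized duration
predicted = predict_main_effects(syllable, const, coef)
delta1 = observed - predicted                # residue passed on to the PW layer
```

Here delta1 plays the role of Delta 1 in the equation: whatever the syllable-layer terms do not account for is left for the PW layer to explain.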

Distribution of Delta 1 (Syllable Residues): Speaker Variation

Speaker Variation in PW-I

Speaker Variation in PW-M

Speaker Variation in PW-F

Revised Duration Model at PW Layer
- Delta 1 = f(PW length, PW sequence) + Delta 2 (where the calculation of f(PW) is constrained at the B2 level and adds speaker intention)
[Diagram: a PPh spanning PW_I ... PW_M ... PW_F, with syllables Syl_I, Syl_M, Syl_F shown under the PWs]

Previous Model (Interspeech 2005)--Duration Patterns at PW Layer

Revised Model--Duration Pattern at PW Layer from f(PW) constrained only at the B2 level

Revised Model--Duration Pattern at PW Layer – General Pattern When Threshold = 0.5

Revised Model--Duration Pattern at PW Layer – Speaker Intention Pattern (Speaker Variation) When Threshold = 0.5

Revised Model--Distributions of Delta 2 (PW Residues), f(PW) constrained only at the B2 level

Revised Model--Distribution of Delta 2 (PW Residues) When Threshold = 0.5

Duration Model at PPh & BG Layers
- Delta 2 = f(PPh length, PPh sequence) + Delta 3
- Delta 3 = f(BG IMF, PPh length, PPh sequence) + Delta 4
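Chaining these two equations with the syllable-layer and PW-layer equations stated earlier gives the overall decomposition of the normalized duration. The shorthand below is introduced only for illustration and is not the slides' notation: f_Syl stands for the whole syllable-layer sum (constant, factors, interactions and boundary constraints), and f_PW, f_PPh, f_BG stand for the layer functions with the arguments listed on the slides.

```latex
\begin{align*}
Y_{nor}  &= f_{\mathrm{Syl}} + \Delta_1 \\
\Delta_1 &= f_{\mathrm{PW}}(\text{PW length}, \text{PW sequence}) + \Delta_2 \\
\Delta_2 &= f_{\mathrm{PPh}}(\text{PPh length}, \text{PPh sequence}) + \Delta_3 \\
\Delta_3 &= f_{\mathrm{BG}}(\text{BG IMF}, \text{PPh length}, \text{PPh sequence}) + \Delta_4 \\
\Rightarrow\quad Y_{nor} &= f_{\mathrm{Syl}} + f_{\mathrm{PW}} + f_{\mathrm{PPh}} + f_{\mathrm{BG}} + \Delta_4
\end{align*}
```

Read this way, each layer contributes one additive term, and Delta 4 is the part of the duration that none of the four layers accounts for.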

Revised Model--Distributions of Delta 3 (PPh Residues), f(PW) constrained only at the B2 level

Revised Model--Distribution of Delta 3 (PPh Residues) When Threshold = 0.5

Previous Model (Interspeech 2005)--Duration Patterns at PPh Layer

Revised Model--Duration Patterns at PPh Layer

Correlations between Previous and Revised Duration Models

T.R.E. Results of Previous and Revised Duration Models
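The slides do not define T.R.E. The sketch below assumes it stands for total residual error and takes it to be the sum of squared residues expressed as a percentage of the sum of squared observed values. Both the expansion of the abbreviation and the formula are assumptions made for illustration, and the numbers are hypothetical.

```python
def total_residual_error(observed, residues):
    """Assumed definition: 100 * sum(residue^2) / sum(observed^2)."""
    ss_res = sum(r * r for r in residues)
    ss_obs = sum(y * y for y in observed)
    if ss_obs == 0:
        raise ValueError("observed values are all zero; the ratio is undefined")
    return 100.0 * ss_res / ss_obs


# Hypothetical normalized durations and the residues left after some layer.
observed = [1.0, -2.0, 2.0]
residues = [0.0, 1.5, 0.0]

tre = total_residual_error(observed, residues)  # percent of squared variation left
```

Under this assumed definition, a lower value after a given layer means that layer has explained more of the variation, which is how two duration models could be compared layer by layer.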

Conclusions (1/2)
Break/Pause Model
1. Break calculations can be refined: the calculation model is revised by separating break levels and information.
2. Prediction of prosodic boundary breaks across continuous speech was greatly improved by fine-tuning lower-level breaks (B1 and B2).
3. Punctuation marks are also useful information for the prediction model.

Conclusions (2/2)
Duration Model
1. Analyzing the residue distributions of every prosodic layer (from syllable to PPh) yielded more stable patterns that lead to better prediction.
2. Duration patterns at the PPh layer yielded clearer evidence that the coefficients of the last four syllables are similar irrespective of PPh length (from 6 to 13 syllables).
3. The improved duration patterns can be applied directly to speech synthesis.

Future Work
- To find speaker variation patterns through further perceptual experiments.
- To find better ways to adjust the coefficients of both fixed and speaker-intended duration patterns to help improve prediction accuracy.
- To learn more about boundaries and boundary effects in fluent speech, with a view to technology development.