Predictions of Mandarin Syllable Duration and Boundary Pauses from HPG Prosody Structure
Chun-Hsiang Chang, Institute of Linguistics, Academia Sinica
lawrence@phslab.sinica.edu.tw
10/27/2020, Ne. GSST 2007

Outline
- Purposes:
  1. How HPG Governs and Constrains Boundary Breaks and Boundary Effects in Fluent Speech
  2. How to Utilize HPG Structure to Predict Boundaries and Boundary Breaks
  3. How to Further Improve the Constructed Prediction Model
- Pause Features and Cues for Prosodic Boundaries
- Pause Prediction Model
- Duration Features and Cues for Prosodic Boundaries
- Duration Prediction Model
- Conclusion

References
- Tseng Chiu-yu and Lee Yeh-lin (2004). "Speech rate and prosody units: Evidence of interaction from Mandarin Chinese", Proceedings of the International Conference on Speech Prosody 2004 (Mar. 23-26, 2004), Nara, Japan, pp. 251-254.
- Tseng Chiu-yu, Pin Shao-huang, Lee Yeh-lin, Wang Hsin-min and Chen Yong-cheng (2005). "Fluent speech prosody: framework and modeling", Speech Communication, Vol. 46, issues 3-4 (July 2005), Special Issue on Quantitative Prosody Modeling for Natural Speech Description and Generation, pp. 284-309.
- Tseng Chiu-yu and Fu Bau-Ling (2005). "Duration, Intensity and Pause Predictions in Relation to Prosody Organization", Proceedings of Interspeech 2005 (September 4-8, 2005), Lisbon, Portugal, pp. 1405-1408.

Sinica COSPRO 05 (Mandarin Chinese Continuous Speech Prosody Corpus)
http://www.myet.com/cospro
- The speech data consisted of readings of 26 paragraphs of text (11592 syllables in total), ranging from 85 to 981 characters per paragraph, by two speakers.
- The speakers, 1 female (F051P) and 1 male (M051P) radio announcer, both under 35 years of age, read the text at a normal speaking rate of 200 ms/syllable.
- All labeling was also spot-checked by trained transcribers.

How Important Are Boundary Breaks in Fluent Speech?
- All breaks removed
- Longest and shortest breaks swapped
- Original

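The two manipulated conditions on this slide can be sketched on a list of pause durations. This is only an illustration: the slide does not say how the stimuli were built, so the function names, the millisecond unit, and the reading of "swapped" as exchanging the single longest and single shortest pause are all assumptions.

```python
def remove_breaks(pauses_ms):
    """Return a copy in which every pause is set to zero (the 'all breaks removed' condition)."""
    return [0.0 for _ in pauses_ms]


def swap_extremes(pauses_ms):
    """Return a copy with the longest and the shortest pause exchanged.

    Assumption: only one longest and one shortest pause are swapped;
    ties are resolved by taking the first occurrence.
    """
    out = list(pauses_ms)
    if not out:
        return out
    i_max = max(range(len(out)), key=out.__getitem__)
    i_min = min(range(len(out)), key=out.__getitem__)
    out[i_max], out[i_min] = out[i_min], out[i_max]
    return out


# Hypothetical pause durations (ms) after four successive units.
original = [50.0, 300.0, 10.0, 120.0]
no_breaks = remove_breaks(original)
swapped = swap_extremes(original)
```

Both functions leave the input list untouched, so the original condition stays available for comparison.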
Distribution of Pauses as Boundary Breaks

Nonzero Pauses in B1
[Diagram: a PW with a B1 break between the preceding syllable and the following syllable]
- Significant implications for synthesis and recognition

Distributions of B3 and Punctuation Marks in Text
- Significant implications for prosody prediction in unlimited TTS

Revised Pause Model
- Ynor = f(PW length, PW sequence) + Delta1
  (where the calculation of f(PW) is constrained to the B2 level)
[Diagram: a PPh spanning PW_I ... PW_M ... PW_F, with B2 and B3 break labels]

Revised Pause Model (cont.)
- Delta1 = f(PPh marks, PPh length, PPh sequence) + Delta2
- Delta2 = f(BG IMF, PPh length, PPh sequence) + Delta3

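The layered structure of the pause model, where each layer explains part of the residue left by the layer below, can be sketched as a cascade. The slides do not specify the form of f, so the per-cell mean used here is a stand-in, and `fit_cell_means`, `fit_cascade`, and the toy data are hypothetical.

```python
from collections import defaultdict


def fit_cell_means(keys, values):
    """Mean of `values` within each distinct key (one cell per factor combination)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for key, value in zip(keys, values):
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def fit_cascade(targets, layer_keys):
    """Fit one table of cell means per layer; each layer sees the previous layer's residue."""
    residual = list(targets)
    tables = []
    for keys in layer_keys:
        table = fit_cell_means(keys, residual)
        residual = [r - table[k] for k, r in zip(keys, residual)]
        tables.append(table)
    return tables, residual


# Toy data (assumed): normalized pauses after four PWs.
pauses = [10.0, 20.0, 30.0, 40.0]
pw_keys = [(2, 1), (2, 1), (3, 2), (3, 2)]           # (PW length, PW sequence)
pph_keys = [("comma", 6, 1), ("none", 6, 1),
            ("comma", 6, 1), ("none", 6, 1)]         # (PPh mark, PPh length, PPh sequence)

tables, delta2 = fit_cascade(pauses, [pw_keys, pph_keys])
```

Here `tables[0]` plays the role of f(PW length, PW sequence), `tables[1]` the role of the PPh-layer function, and `delta2` is what would be passed on to the BG layer.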
Distributions of Post-PW Pauses to Labeled Breaks (PG Levels)

Results of Pause Prediction

Duration Model
[Flowchart: COSPRO Database -> Normalized Duration -> Syllable Layer -> Residues (Delta1) -> PW Layer -> Residues (Delta2) -> PPh Layer -> Residues (Delta3) -> BG Layer]

Duration Model at Syllable Layer
- Ynor = Const + CCt + CVt + Ton + PCt + PVt + PTt + FCt + FVt + FTt
       + 2-way factors of each factor above
       + 3-way factors of each syllable
       + PW Boundary Constraint of each factor above
       + ... + Delta1
[Diagram: a PW made up of Syl_I, Syl_M, Syl_M, Syl_F, with B1 breaks between syllables and a B2 break at the end of the PW]

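As a rough illustration of the main-effects part of this equation, the sketch below adds one coefficient per factor level to a constant and takes Delta1 as the difference from the observed value. The coefficient values and level names are invented, the interaction and boundary-constraint terms are left out, and reading CCt/CVt/Ton as current consonant type, vowel type and tone (with P and F for preceding and following) is an assumption not spelled out on the slide.

```python
def predict_normalized_duration(syllable, const, main_effects):
    """Const plus one additive coefficient per factor level (main effects only).

    `syllable` maps factor name -> level; levels without a coefficient contribute 0.
    """
    total = const
    for factor, level in syllable.items():
        total += main_effects.get(factor, {}).get(level, 0.0)
    return total


# Hypothetical coefficients; a real model would estimate these from the corpus.
const = 1.0
main_effects = {
    "CCt": {"stop": -0.1},
    "Ton": {"T4": 0.05},
}

syllable = {"CCt": "stop", "CVt": "a", "Ton": "T4"}
predicted = predict_normalized_duration(syllable, const, main_effects)

observed = 1.10                  # assumed normalized duration
delta1 = observed - predicted    # residue handed to the PW layer
```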
Distribution of Delta1 (Syllable Residues): Speaker Variation

Speaker Variation in PW-I

Speaker Variation in PW-M

Speaker Variation in PW-F

Revised Duration Model at PW Layer
- Delta1 = f(PW length, PW sequence) + Delta2
  (where the calculation of f(PW) is constrained to the B2 level and adds speaker intention)
[Diagram: a PPh spanning PW_I ... PW_M ... PW_F, each PW made up of Syl_I, Syl_M, Syl_F]

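One possible reading of "constrained to the B2 level" is that PW length and position are counted inside units closed by a B2 or higher break. The sketch below labels each item as initial (I), medial (M) or final (F) within such a unit. The function name, the input encoding (the break index following each item), and the choice to label a one-item unit as F are assumptions.

```python
def label_positions(breaks_after, boundary_level=2):
    """Label each item I/M/F within units closed by a break >= boundary_level.

    breaks_after[i] is the break index that follows item i.
    Returns a list of (position, unit_length) pairs.
    A one-item unit is labeled "F" (an arbitrary convention for this sketch).
    """
    labels = []
    unit = []
    for i, brk in enumerate(breaks_after):
        unit.append(i)
        is_last_item = i == len(breaks_after) - 1
        if brk >= boundary_level or is_last_item:
            n = len(unit)
            for j in range(n):
                if j == n - 1:
                    position = "F"
                elif j == 0:
                    position = "I"
                else:
                    position = "M"
                labels.append((position, n))
            unit = []
    return labels


# Five PWs followed by breaks B1, B1, B2, B1, B3 (hypothetical sequence).
labels = label_positions([1, 1, 2, 1, 3])
```

The resulting (position, length) pairs could then serve as the keys for f(PW length, PW sequence).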
Previous Model (Interspeech 2005): Duration Patterns at PW Layer

Revised Model: Duration Pattern at PW Layer, f(PW) Constrained Only at B2 Level

Revised Model: Duration Pattern at PW Layer, General Pattern When Threshold = 0.5

Revised Model: Duration Pattern at PW Layer, Speaker Intention Pattern (Speaker Variation) When Threshold = 0.5

Revised Model: Distributions of Delta2 (PW Residues), f(PW) Constrained Only at B2 Level

Revised Model: Distribution of Delta2 (PW Residues) When Threshold = 0.5

Duration Model at PPh & BG Layers
- Delta2 = f(PPh length, PPh sequence) + Delta3
- Delta3 = f(BG IMF, PPh length, PPh sequence) + Delta4

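Substituting each layer's equation into the one below it shows how the four layers combine into a single prediction. The shorthand f_Syl, f_PW, f_PPh and f_BG for the per-layer terms is introduced here for compactness and does not appear on the slides.

```latex
\begin{align*}
Y_{\mathrm{nor}} &= f_{\mathrm{Syl}} + \Delta_1 \\
\Delta_1 &= f_{\mathrm{PW}} + \Delta_2 \\
\Delta_2 &= f_{\mathrm{PPh}} + \Delta_3 \\
\Delta_3 &= f_{\mathrm{BG}} + \Delta_4 \\
Y_{\mathrm{nor}} &= f_{\mathrm{Syl}} + f_{\mathrm{PW}} + f_{\mathrm{PPh}} + f_{\mathrm{BG}} + \Delta_4
\end{align*}
```

Under this reading, Delta4 is the part of the normalized duration that none of the layers accounts for.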
Revised Model: Distributions of Delta3 (PPh Residues), f(PW) Constrained Only at B2 Level

Revised Model: Distribution of Delta3 (PPh Residues) When Threshold = 0.5

Previous Model (Interspeech 2005): Duration Patterns at PPh Layer

Revised Model: Duration Patterns at PPh Layer

Correlations between Previous and Revised Duration Models

T.R.E. Results of Previous and Revised Duration Models

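The two evaluation measures named on these slides can be sketched as follows. The slides do not define them, so this assumes that the correlation is Pearson's r between observed and predicted values and that T.R.E. stands for total residual error, taken here as the sum of squared residues as a percentage of the sum of squared observed values. Both function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def total_residual_error(observed, predicted):
    """Sum of squared residues as a percentage of the sum of squared observations (assumed definition)."""
    ss_residue = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_observed = sum(o * o for o in observed)
    return 100.0 * ss_residue / ss_observed


# Toy values (assumed), not corpus results.
r = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
tre = total_residual_error([3.0, 4.0], [3.0, 2.0])
```

A lower T.R.E. and a higher correlation would both indicate that a model's predictions lie closer to the observed values.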
Conclusions (1/2)
- Break/Pause Model
  1. Break calculations can be refined. The calculation model is revised by separating break levels and information.
  2. Prediction of prosodic boundary breaks across continuous speech was greatly improved by fine-tuning lower-level breaks (B1 and B2).
  3. Punctuation marks are also useful information for the prediction model.

Conclusions (2/2)
- Duration Model
  1. Analyzing residue distributions at every prosodic layer (from syllable to PPh) yielded more stable patterns that lead to better prediction.
  2. Duration patterns at the PPh layer yielded clearer evidence that the coefficients of the last four syllables are similar irrespective of PPh length (from 6 to 13 syllables).
  3. The improved duration patterns can be applied directly to speech synthesis.

Future Work
- To find speaker variation patterns through further perceptual experiments.
- To find better ways to adjust the coefficients of both fixed and speaker-intended duration patterns to help improve prediction accuracy.
- To learn more about boundaries and boundary effects in fluent speech for technology development.