Towards an Improved Modeling of the Glottal Source


![Introduction HMM-based speech synthesizer [Tokuda et al]](https://slidetodoc.com/presentation_image/2c77f1831019d582681aa52ae7a260c4/image-3.jpg)






![Voice source model Estimation of tc, tp and to [Gobl & Chasaide]](https://slidetodoc.com/presentation_image/2c77f1831019d582681aa52ae7a260c4/image-10.jpg)













- Slides: 23
Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis João P. Cabral, Steve Renals, Korin Richmond and Junichi Yamagishi The Centre for Speech Technology Research The University of Edinburgh
Outline • Introduction • Voice source model • System • Perceptual evaluation • Concluding remarks • Future work 2
Introduction HMM-based speech synthesizer [Tokuda et al] (block diagram: training speech → F0 extraction and spectral features estimation → HMMs; text → text analysis → HMMs → F0 and spectrum; pulse train + noise component → synthesis filter → synthetic speech) 3
Voice source model Obtaining the glottal source signal • Source-filter model: source Ug → vocal tract A(z) → lip radiation d/dz → speech • Inverse filtering: speech → inverse filter 1/A(z) → lip radiation cancellation (∫) 4
Voice source model Liljencrants-Fant model (LF-model) • T: period • to: opening instant • tp: instant of max airflow • te: instant of max excitation • ta: return phase duration • tc: closing instant • Ee: excitation amplitude 5
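To make the parameter list concrete, here is a minimal sketch that samples one period of an LF-model pulse from these parameters. It assumes the standard LF formulation (an exponentially growing sinusoid up to te, then an exponential return phase, with zero net area over the period); those equations are not shown on the slide. The function name `lf_pulse`, the choices to = 0 and tc = T, and the solver settings are illustrative assumptions, not the authors' implementation.

```python
import math

def lf_pulse(T, tp, te, ta, Ee, fs):
    """Sample one period of the LF-model flow derivative.

    Assumptions (not from the slides): to = 0, tc = T, tp < te < 2*tp,
    and ta well below tc - te. Times are in seconds, fs in Hz.
    """
    tc = T
    wg = math.pi / tp
    sin_e, cos_e = math.sin(wg * te), math.cos(wg * te)

    # Return-phase constant eps from eps*ta = 1 - exp(-eps*(tc - te)),
    # solved by fixed-point iteration.
    eps = 1.0 / ta
    for _ in range(50):
        eps = (1.0 - math.exp(-eps * (tc - te))) / ta
    decay_end = math.exp(-eps * (tc - te))

    # Closed-form area of the return phase (a negative number).
    area_return = -(Ee / (eps * ta)) * (
        (1.0 - decay_end) / eps - (tc - te) * decay_end
    )

    def net_area(alpha):
        # Closed-form area of the open phase plus the return-phase area.
        num = alpha * sin_e - wg * cos_e + wg * math.exp(-alpha * te)
        return (-Ee / sin_e) * num / (alpha ** 2 + wg ** 2) + area_return

    # Bisection for the growth factor alpha that gives zero net area.
    lo, hi = -50.0 / te, 50.0 / te
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if net_area(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)

    pulse = []
    for n in range(int(round(T * fs))):
        t = n / fs
        if t <= te:
            # Open phase, scaled so that the value at te is -Ee.
            v = -Ee * math.exp(alpha * (t - te)) * math.sin(wg * t) / sin_e
        else:
            # Return phase: exponential recovery towards zero at tc.
            v = -(Ee / (eps * ta)) * (math.exp(-eps * (t - te)) - decay_end)
        pulse.append(v)
    return pulse
```

For example, `lf_pulse(T=0.008, tp=0.00375, te=0.005, ta=0.0003, Ee=1.0, fs=16000)` returns 128 samples whose most negative value, -Ee, falls at the sample corresponding to te. The parameter values are arbitrary illustrations, not values measured in this work.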
Voice source model Other parameters of the LF-model Open quotient: Speed quotient: Return quotient: 6
Voice source model Description of the LF-model spectrum • Linear stylization of the LF-model spectrum [Doval and d’Alessandro] • Fg: glottal spectral peak • Fc: spectral tilt 7
Voice source model Feature extraction • utterances sampled at 16 kHz • pitch-synchronous analysis (ESPS tools) • LPCs calculated with 20 ms windows centered at the glottal epochs • inverse filtering to estimate DGS • pre-emphasis filter (α = 0.97) • low-pass filtering of the residual at 4 kHz 8
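The analysis steps above can be sketched in plain Python as follows. This is a hedged illustration rather than the ESPS-based pipeline used in the work: the function names are hypothetical, the LPC order is left to the caller, no tapering window is applied to the frame, and the 4 kHz low-pass stage is omitted. Note the naming convention: `a` holds the LPC polynomial 1 + a1 z^-1 + ... + ap z^-p, so the FIR filter in `inverse_filter` plays the role of the block labelled 1/A(z) on the inverse-filtering slide.

```python
def pre_emphasis(x, coeff=0.97):
    """First-order pre-emphasis: y[n] = x[n] - coeff * x[n-1]."""
    return [x[0]] + [x[n] - coeff * x[n - 1] for n in range(1, len(x))]

def lpc(frame, order):
    """Autocorrelation-method LPC via Levinson-Durbin.

    Returns [1, a1, ..., ap]. A 20 ms frame at 16 kHz would hold 320 samples.
    Assumption: rectangular window (no Hamming or similar taper).
    """
    n = len(frame)
    r = [sum(frame[i] * frame[i + k] for i in range(n - k))
         for k in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)                # prediction error update
    return a

def inverse_filter(x, a):
    """Filter x with the LPC polynomial to obtain the residual."""
    return [sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(x))]
```

As a sanity check, feeding `lpc` the impulse response of a known second-order all-pole filter recovers that filter's coefficients, and `inverse_filter` then returns (approximately) a single impulse.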
Voice source model Estimation of te and Ee • te and Ee are estimated from the pitch-marks 9
Voice source model Estimation of tc, tp and to [Gobl & Chasaide] 10
Voice source model Estimation of ta • Fs: sampling frequency • m: slope of the tangent at t = te 11
Voice source model Examples of the estimated parameters Curves of the LF-parameters for 2 voiced regions of an utterance 12
System General description - Nitech-HTS 2005 system - STRAIGHT method for analysis and synthesis - mixed multi-band excitation with phase manipulation / pulse train - Mel Log Spectrum Approximation (MLSA) filter How was the LF-model integrated in the synthesizer? 13
System Generation of the periodic excitation (pulse signal) • Pulse centered within the frame • multiplied by asymmetric windows • summed with Gaussian noise 14
System Periodic excitation with the LF-model • 2 LF-waveforms centered at the instant te • multiplied by asymmetric windows • summed with Gaussian noise 15
System Technical problem • Problem: the synthesis filter assumes the excitation to have a flat spectrum, like the pulse train • Solution: post-filter (linear-phase FIR filter): -6 dB/dec for 1 Hz ≤ f ≤ Fg; +6 dB/dec for Fg < f ≤ Fc; +12 dB/dec for Fc < f ≤ 16 kHz 16
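The piecewise slopes of the post-filter can be turned into a target magnitude curve, as in the sketch below. It only computes the desired gain in dB at a given frequency; designing the actual linear-phase FIR filter is not shown. The function name, the 0 dB reference at 1 Hz, and continuity at Fg and Fc are assumptions made for illustration. The slopes are taken per decade, exactly as written on the slide; if they were meant per octave, `math.log10` would be replaced by `math.log2`.

```python
import math

def postfilter_gain_db(f, fg, fc, f_max=16000.0):
    """Target post-filter gain in dB at frequency f (Hz).

    Assumptions: 1 <= fg < fc <= f_max, gain is 0 dB at 1 Hz,
    and the curve is continuous at fg and fc.
    """
    if not 1.0 <= f <= f_max:
        raise ValueError("f must lie between 1 Hz and f_max")
    # -6 dB per decade from 1 Hz up to fg.
    gain = -6.0 * math.log10(min(f, fg))
    # +6 dB per decade between fg and fc.
    if f > fg:
        gain += 6.0 * math.log10(min(f, fc) / fg)
    # +12 dB per decade above fc.
    if f > fc:
        gain += 12.0 * math.log10(f / fc)
    return gain
```

With the illustrative values Fg = 100 Hz and Fc = 1000 Hz, the curve falls to -12 dB at Fg, rises to -6 dB at Fc, and reaches +6 dB one decade above Fc.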
System Effect of the post-filtering 17
Perceptual evaluation Generation of the stimuli • Built US-English voice EM001, provided by ATR for the Blizzard Challenge • Glottal parameters were measured in 8 utterances and the mean values were calculated • Simple excitation, without multi-band noise or phase manipulation • Ten utterances were synthesized, using the LF-model and the pulse model 18
Perceptual evaluation Experiment • Forced-choice test • Presented via a web browser interface • Subjects were asked if they used headphones or speakers, and if they were native speakers (U.K./U.S.) • 18 listeners (7 native speakers of English) • The listener panel was mainly university students and staff • Examples of test speech signals (audio): pulse, LF-model 19
Perceptual evaluation Results

| Excitation | LF-model | Pulse train |
|---|---|---|
| Non-native speakers | 61% | 39% |
| Native speakers | 68.6% | 31.4% |
| Total scores and 95% CI | 64% ± 6.7% | 36% ± 6.7% |

20
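As a quick consistency check on the reported scores, the total can be recomputed as a listener-weighted mean of the two groups. The sketch assumes that every listener judged the same number of stimulus pairs, which the slides do not state explicitly; the group sizes (11 non-native, 7 native) follow from the 18 listeners reported in the experiment description. The function name is hypothetical.

```python
def pooled_preference(groups):
    """Listener-weighted mean preference.

    groups: list of (number_of_listeners, percent_preferring_lf_model).
    Assumption: each listener contributed the same number of judgements.
    """
    total_listeners = sum(n for n, _ in groups)
    return sum(n * p for n, p in groups) / total_listeners

# 11 non-native listeners at 61% and 7 native listeners at 68.6%.
total = pooled_preference([(11, 61.0), (7, 68.6)])
```

Under that assumption the pooled value rounds to the 64% total reported for the LF-model.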
Conclusions • The Nitech-HTS 2005 speech synthesizer was implemented with the LF-model for the voice source • Results showed that the LF-model can give better speech quality than the traditionally used pulse train • Direct methods used for the estimation of the mean LF-parameters seemed to perform well • A technical problem with the integration of the LF-model in the system was solved using a post-filter 21
Future work • To find better analysis/synthesis methods to use with the LF-model in HMM-based speech synthesis • To evaluate the speech quality when using the mixed excitation with the LF-model • To implement voice quality transformations using the LF-model • To evaluate the parameterization methods • To model the glottal parameters with HMMs 22
Acknowledgements This work was financially supported by the Marie Curie EdSST programme. Thank you! 23