Towards an Improved Modeling of the Glottal Source

Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis
João P. Cabral, Steve Renals, Korin Richmond and Junichi Yamagishi
The Centre for Speech Technology Research, The University of Edinburgh

Outline
• Introduction
• Voice source model
• System
• Perceptual evaluation
• Concluding remarks
• Future work

Introduction
HMM-based speech synthesizer [Tokuda et al]
[Block diagram. Training: training speech, F0 extraction, spectral features estimation, HMMs. Synthesis: text, text analysis, HMMs, F0 (pulse train / noise component) and spectrum, synthesis filter, synthetic speech.]
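
The traditional excitation in the diagram above can be illustrated with a minimal sketch: a pulse train for voiced frames and white noise for unvoiced ones. The function names, the rounding of the pitch period to whole samples, and the unit noise variance are assumptions made for illustration, not details taken from the slides.

```python
import random


def pulse_train(f0_hz, fs_hz, n_samples):
    """Unit impulses spaced one pitch period apart (period rounded to samples)."""
    period = max(1, round(fs_hz / f0_hz))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def simple_excitation(f0_hz, fs_hz, n_samples, seed=0):
    """Pulse train when voiced (f0 > 0), white Gaussian noise otherwise."""
    if f0_hz > 0:
        return pulse_train(f0_hz, fs_hz, n_samples)
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
```

For example, `simple_excitation(100, 16000, 480)` places impulses at samples 0, 160 and 320.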

Voice source model
Obtaining the glottal source signal
• Source-filter model: Source (Ug), Vocal tract (A(z)), Lip radiation (d/dz), Speech
• Inverse filtering: Speech, Inverse filter (1/A(z)), Lip radiation cancellation (∫)
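
The two chains above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the vocal tract (the slide's A(z)) is modelled as an all-pole filter whose denominator coefficients are the hypothetical list `lpc`, lip radiation is a first difference, and its cancellation is a running sum with an optional leak. None of these choices is confirmed by the slides.

```python
def vocal_tract(x, lpc):
    """All-pole filter: y[n] = x[n] - sum_k lpc[k-1] * y[n-k]."""
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for k, ak in enumerate(lpc, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y.append(acc)
    return y


def inverse_filter(s, lpc):
    """Matching FIR filter: r[n] = s[n] + sum_k lpc[k-1] * s[n-k]."""
    r = []
    for n, sn in enumerate(s):
        acc = sn
        for k, ak in enumerate(lpc, start=1):
            if n - k >= 0:
                acc += ak * s[n - k]
        r.append(acc)
    return r


def lip_radiation(x):
    """Lip radiation approximated by a first difference (assumption)."""
    return [x[n] - (x[n - 1] if n > 0 else 0.0) for n in range(len(x))]


def cancel_lip_radiation(x, leak=1.0):
    """Running sum; a leak slightly below 1 is a common practical choice."""
    y, prev = [], 0.0
    for xn in x:
        prev = xn + leak * prev
        y.append(prev)
    return y
```

Passing a source signal through `vocal_tract` and `lip_radiation`, then through `inverse_filter` and `cancel_lip_radiation` with the same `lpc`, returns the original source up to rounding error. Skipping the last step leaves the differentiated source.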

Voice source model
Liljencrants-Fant model (LF-model)
T : period
to : opening instant
tp : instant of max airflow
te : instant of max excitation
ta : return phase duration
tc : closing instant
Ee : excitation amplitude

Voice source model
Other parameters of the LF-model
• Open quotient
• Speed quotient
• Return quotient
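
The defining formulas for these quotients did not survive in this transcript. The sketch below uses one commonly cited convention, which may differ from the one on the original slide: open quotient (te - to + ta) / T, speed quotient (tp - to) / (te - tp), and return quotient ta / T. The function name and argument order are hypothetical.

```python
def lf_quotients(t_o, t_p, t_e, t_a, period):
    """Return (OQ, SQ, RQ) from LF-model time parameters, all in seconds.

    Assumed convention (not confirmed by the slides):
      OQ = (te - to + ta) / T
      SQ = (tp - to) / (te - tp)
      RQ = ta / T
    """
    if not (t_o <= t_p < t_e) or t_a < 0 or period <= 0:
        raise ValueError("expected to <= tp < te, ta >= 0 and T > 0")
    oq = (t_e - t_o + t_a) / period
    sq = (t_p - t_o) / (t_e - t_p)
    rq = t_a / period
    return oq, sq, rq
```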

Voice source model
Description of the LF-model spectrum
Linear stylization of the LF-model spectrum [Doval and d’Alessandro]
Fg : glottal spectral peak
Fc : spectral tilt

Voice source model
Feature extraction
• utterances sampled at 16 kHz
• pitch-synchronous analysis (ESPS tools)
• LPCs calculated with 20 ms windows centered at the glottal epochs
• inverse filtering to estimate DGS
• pre-emphasis filter (α = 0.97)
• low-pass filtering of the residual at 4 kHz
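
The pre-emphasis step in the list above is usually a first-order filter, y[n] = x[n] - α·x[n-1]. The sketch below assumes that standard form with the slide's α = 0.97; the function name is hypothetical, and the slides do not state where in the chain the filter is applied.

```python
def pre_emphasis(x, alpha=0.97):
    """First-order pre-emphasis: y[n] = x[n] - alpha * x[n-1], with x[-1] = 0."""
    y = []
    prev = 0.0
    for xn in x:
        y.append(xn - alpha * prev)
        prev = xn
    return y
```

On a constant input the output drops to 1 - α after the first sample, which shows the attenuation of low frequencies.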

Voice source model
Estimation of te and Ee
Ø te and Ee are estimated from the pitch-marks

Voice source model
Estimation of tc, tp and to [Gobl & Chasaide]

Voice source model
Estimation of ta
Fs : sampling frequency
m : slope of the tangent at t = te
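
The formula relating ta to Fs and m is missing from this transcript. In the LF-model, ta is the time the tangent to the return phase at te takes to reach zero, so one plausible reading is ta = Ee / (m · Fs), with m measured in amplitude per sample. The sketch below follows that reading. The forward-difference estimate of m and the function name are assumptions.

```python
def estimate_ta(dgs, te_index, fs_hz):
    """Estimate ta in seconds from a differentiated glottal signal.

    Assumptions: dgs[te_index] is the negative peak (-Ee), and the tangent
    slope m at te is approximated by a forward difference (per sample).
    """
    ee = -dgs[te_index]
    m = dgs[te_index + 1] - dgs[te_index]
    if ee <= 0 or m <= 0:
        raise ValueError("expected a negative peak followed by a rising return phase")
    # m * fs_hz is the slope per second; the tangent rises by Ee in ta seconds.
    return ee / (m * fs_hz)
```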

Voice source model
Examples of the estimated parameters
[Figure: curves of the LF-parameters for 2 voiced regions of an utterance]

System
General description
- Nitech-HTS 2005 system
- STRAIGHT method for analysis and synthesis
- mixed multi-band excitation with phase manipulation / pulse train
- Mel Log Spectrum Approximation (MLSA) filter
How was the LF-model integrated in the synthesizer?

System
Generation of the periodic excitation (pulse signal)
• Pulse centered within the frame
• multiplied by asymmetric windows
• summed with Gaussian noise

System
Periodic excitation with the LF-model
• 2 LF-waveforms centered at the instant te
• multiplied by asymmetric windows
• summed with Gaussian noise

System
Technical problem
Ø Problem: the synthesis filter assumes the excitation to have a flat spectrum, like the pulse train
Ø Solution: post-filter (linear-phase FIR filter)
   -6 dB/dec     1 Hz ≤ f ≤ Fg (Hz)
   +6 dB/dec     Fg < f ≤ Fc (Hz)
   +12 dB/dec    Fc < f ≤ 16 kHz
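
The three slopes above define a piecewise-linear target magnitude response on a log-frequency axis. The sketch below evaluates that target in dB. It assumes 0 dB at 1 Hz and continuity at Fg and Fc, neither of which is stated on the slide, and it does not design the FIR filter itself. The function name is hypothetical.

```python
import math


def post_filter_gain_db(f_hz, fg_hz, fc_hz):
    """Target post-filter gain in dB at f_hz (assumes 0 dB at 1 Hz).

    Slopes per decade: -6 up to Fg, +6 between Fg and Fc, +12 above Fc.
    """
    if f_hz < 1.0 or not (1.0 <= fg_hz <= fc_hz):
        raise ValueError("expected f >= 1 Hz and 1 Hz <= Fg <= Fc")
    if f_hz <= fg_hz:
        return -6.0 * math.log10(f_hz)
    gain_at_fg = -6.0 * math.log10(fg_hz)
    if f_hz <= fc_hz:
        return gain_at_fg + 6.0 * math.log10(f_hz / fg_hz)
    gain_at_fc = gain_at_fg + 6.0 * math.log10(fc_hz / fg_hz)
    return gain_at_fc + 12.0 * math.log10(f_hz / fc_hz)
```

With the illustrative values Fg = 100 Hz and Fc = 1000 Hz, the target falls to -12 dB at Fg, rises to -6 dB at Fc and reaches +6 dB at 10 kHz. The rising slopes offset the falling spectrum of the LF-model above its glottal peak.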

System
Effect of the post-filtering

Perceptual evaluation
Generation of the stimuli
• Built US-English voice EM001 provided by ATR for the Blizzard Challenge
• Glottal parameters were measured in 8 utterances and the mean values were calculated
• Simple excitation, without multi-band noise or phase manipulation
• Ten utterances were synthesized, using the LF-model and the pulse model

Perceptual evaluation
Experiment
• Forced-choice test
• Presented via a web browser interface
• Subjects were asked if they used headphones or speakers, and if they were native speakers (U.K./U.S.)
• 18 listeners (7 native speakers of English)
• The listener panel was mainly university students and staff
Example test speech signals: pulse and LF-model

Perceptual evaluation
Results
Excitation                  LF-model       Pulse train
Non-native speakers         61%            39%
Native speakers             68.6%          31.4%
Total scores and 95% CI     64% ± 6.7%     36% ± 6.7%

Conclusions
• The Nitech-HTS 2005 speech synthesizer was implemented with the LF-model for the voice source
• Results showed that the LF-model can give better speech quality than the traditionally used pulse train
• Direct methods used for the estimation of the mean LF-parameters seemed to perform well
• A technical problem with the integration of the LF-model in the system was solved using a post-filter

Future work
• To find better analysis/synthesis methods to use with the LF-model in HMM-based speech synthesis
• To evaluate the speech quality when using the mixed excitation with the LF-model
• To implement voice quality transformations using the LF-model
• To evaluate the parameterization methods
• To model the glottal parameters with HMMs

Acknowledgements
This work was financially supported by the Marie Curie EdSST programme.
Thank you!