A Bayesian Approach to HMM-Based Speech Synthesis
Kei Hashimoto 1, Heiga Zen 1, Yoshihiko Nankaku 1, Takashi Masuko 2, and Keiichi Tokuda 1
1 Nagoya Institute of Technology, 2 Tokyo Institute of Technology
Background
- HMM-based speech synthesis system
  - Spectrum, excitation, and duration are modeled
  - Speech parameter sequences are generated
- Maximum likelihood (ML) criterion
  - Train HMMs and generate speech parameters
  - Point estimate ⇒ the over-fitting problem
- Bayesian approach
  - Estimate the posterior distribution of the model parameters
  - Prior information can be used ⇒ alleviates the over-fitting problem
Outline
- Bayesian speech synthesis
  - Variational Bayesian method
  - Speech parameter generation
- Bayesian context clustering
  - Prior distribution using cross validation
- Experiments
- Conclusion & future work
Bayesian speech synthesis (1/2)
Model training and speech synthesis: ML vs. Bayes
- Synthesis data sequence
- Training data sequence
- Model parameters
- Label sequence for synthesis
- Label sequence for training
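The equations on this slide were images and did not survive extraction. A plausible reconstruction, writing o for the synthesis data sequence, O for the training data sequence, λ for the model parameters, and l, L for the synthesis and training label sequences (notation assumed, not from the slide):

```latex
% ML: point-estimate the model parameters, then generate
\hat{\lambda} = \arg\max_{\lambda} \, p(O \mid \lambda, L), \qquad
\hat{o} = \arg\max_{o} \, p(o \mid \hat{\lambda}, l)

% Bayes: marginalize the model parameters over their posterior
p(o \mid O, l, L) = \int p(o \mid \lambda, l)\, p(\lambda \mid O, L)\, d\lambda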
Bayesian speech synthesis (2/2)
Predictive distribution (marginal likelihood)
- HMM state sequence for synthesis data
- HMM state sequence for training data
- Likelihood of synthesis data
- Likelihood of training data
- Prior distribution for model parameters
Variational Bayesian method [Attias; '99]
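A hedged reconstruction of the predictive distribution with the hidden state sequences made explicit (z and Z for the synthesis-side and training-side HMM state sequences; label sequences omitted for brevity; notation assumed):

```latex
p(o \mid O) = \frac{1}{p(O)} \sum_{z} \sum_{Z} \int
  p(o, z \mid \lambda)\, p(O, Z \mid \lambda)\, p(\lambda)\, d\lambda
```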
Variational Bayesian method (1/2)
Estimate an approximate posterior distribution
⇒ Maximize a lower bound (Jensen's inequality)
- Expectation w.r.t. the approximate distribution
- Approximate distribution of the true posterior distribution
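Jensen's inequality yields the variational lower bound that is maximized in place of the intractable log marginal likelihood. A sketch, with Q(z, Z, λ) denoting the approximate posterior (notation assumed):

```latex
\log p(o, O) \;\ge\; \mathcal{F}
  = \sum_{z} \sum_{Z} \int Q(z, Z, \lambda)\,
    \log \frac{p(o, z \mid \lambda)\, p(O, Z \mid \lambda)\, p(\lambda)}
              {Q(z, Z, \lambda)} \, d\lambda
```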
Variational Bayesian method (2/2)
- Assume the random variables are statistically independent
- Optimal posterior distributions (up to normalization terms)
- Iterative updates, as in the EM algorithm
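Under the independence assumption Q(z, Z, λ) = Q(z) Q(Z) Q(λ), functional differentiation of the lower bound gives updates of the familiar variational-Bayes form (a sketch; C_Z denotes a normalization term, and the symmetric update for Q(z) is analogous):

```latex
Q(\lambda) \propto p(\lambda)\,
  \exp\!\Big( \big\langle \log p(o, z \mid \lambda)
    + \log p(O, Z \mid \lambda) \big\rangle_{Q(z)\,Q(Z)} \Big)

Q(Z) = \frac{1}{C_Z}\,
  \exp\!\Big( \big\langle \log p(O, Z \mid \lambda) \big\rangle_{Q(\lambda)} \Big)
```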
Approximation for speech synthesis
- The approximate posterior depends on the synthesis data
  ⇒ Huge computational cost in the synthesis part
- Ignore the dependency on the synthesis data
  ⇒ Estimate from the training data only
Prior distribution
- Conjugate prior distribution
  ⇒ The posterior belongs to the same family of distributions as the prior
  (likelihood function ↔ conjugate prior distribution)
- Determination using statistics of prior data
  - Number of prior data points
  - Dimension of the feature vector
  - Mean of the prior data
  - Covariance of the prior data
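For a Gaussian output distribution, the conjugate prior is of Gauss-Wishart type, and its hyperparameters can be set from exactly the statistics listed above: a pseudo-count, the mean, and the scatter of the prior data. A minimal numpy sketch of the resulting posterior-hyperparameter update, assuming a standard Gauss-Wishart parameterization (the function name and exact form are illustrative, not taken from the slides):

```python
import numpy as np

def posterior_hyperparameters(phi, mu0, S0, X):
    """Update Gauss-Wishart-style hyperparameters with data X (N x D).

    phi: pseudo-count of prior data, mu0: prior mean, S0: prior scatter.
    Returns the posterior pseudo-count, mean, and scatter.
    """
    N, D = X.shape
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)          # within-data scatter matrix
    phi_post = phi + N
    mu_post = (phi * mu0 + N * xbar) / phi_post
    d = (xbar - mu0).reshape(-1, 1)
    # prior scatter + data scatter + shrinkage term between the two means
    S_post = S0 + S + (phi * N / phi_post) * (d @ d.T)
    return phi_post, mu_post, S_post
```

The pseudo-count phi plays the role of "# of prior data" on the slide: the larger it is, the more the posterior mean is pulled toward the prior mean.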
Speech parameter generation
- Speech parameters consist of static and dynamic features
  ⇒ Only the static feature sequence is generated
- Speech parameter generation based on the Bayesian approach
  ⇒ Maximize the lower bound
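Generating only the static sequence under static-plus-dynamic constraints reduces to solving a linear system of the classic parameter-generation form, W^T Σ^{-1} W c = W^T Σ^{-1} μ; in the Bayesian variant, μ and Σ are replaced by posterior expectations. A toy numpy sketch, assuming a diagonal covariance and a simple first-difference delta window (both simplifications for illustration, not the deck's actual windows):

```python
import numpy as np

def generate_static(mu, sigma2, T):
    """Solve W^T P W c = W^T P mu for the static sequence c (length T).

    mu, sigma2: mean and diagonal variance of stacked [static; delta]
    features, each of length 2T. The delta is a plain first difference.
    """
    # Window matrix W maps static c (T) -> stacked [static; delta] (2T)
    W = np.zeros((2 * T, T))
    W[:T, :] = np.eye(T)
    for t in range(1, T):
        W[T + t, t] = 1.0       # delta_t = c_t - c_{t-1}
        W[T + t, t - 1] = -1.0
    P = np.diag(1.0 / sigma2)   # diagonal precision
    A = W.T @ P @ W
    b = W.T @ P @ mu
    return np.linalg.solve(A, b)
```

With constant static means and zero delta means, the solution is the constant sequence, which is a quick sanity check on the window construction.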
Relation between Bayes and ML
Compare with the ML criterion
- Output distribution: ML uses the estimated model parameters; Bayes uses expectations of the model parameters
- Can be solved in the same fashion as ML
Bayesian context clustering
Context clustering based on maximizing the lower bound
- Example question: "Is this phoneme a vowel?" (yes / no)
- Gain of a split: select the question and split the node based on the gain
- Stopping condition
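The clustering loop above (evaluate each question's gain, split on the best one) can be sketched as follows. Here `lb` stands in for the per-node objective, which in the deck is the marginal-likelihood lower bound; the question set, data layout, and scoring function used below are illustrative assumptions:

```python
def gain(parent, yes, no, lb):
    """Gain of splitting `parent` into `yes`/`no` under node score lb."""
    return lb(yes) + lb(no) - lb(parent)

def select_question(data, questions, lb):
    """Return (name, gain) of the question whose split maximizes the gain.

    `questions` maps a question name to a yes/no predicate over items.
    Questions producing an empty child are skipped.
    """
    best = None
    for name, pred in questions.items():
        yes = [x for x in data if pred(x)]
        no = [x for x in data if not pred(x)]
        if not yes or not no:
            continue
        g = gain(data, yes, no, lb)
        if best is None or g > best[1]:
            best = (name, g)
    return best
```

A node would then be split recursively with the winning question until no candidate yields a gain above the stopping threshold.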
Impact of prior distribution
- The prior affects model selection through its tuning parameters
  ⇒ A technique for determining the prior distribution is required
- Conventional: maximize the marginal likelihood
  - Leads to the over-fitting problem, as with ML
  - Tuning parameters are still required
- Determination of the prior distribution using cross validation [Hashimoto; '08]
Bayesian approach using CV
Prior distribution based on cross validation
- Training data is randomly divided into K groups
- The prior distribution for each group is estimated from the remaining groups
- The posterior distribution is then estimated and the likelihood is calculated
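A minimal sketch of the cross-validation prior: split the data into K groups and compute each group's prior statistics from the other K-1 groups. The fold layout, function name, and returned statistics (count, mean, scatter) are assumptions for illustration:

```python
import numpy as np

def cv_prior_stats(X, k, K):
    """Prior statistics for fold k, computed from the other K-1 folds.

    X: data matrix (N x D). Returns (count, mean, scatter) of the
    held-out data, usable as prior hyperparameter statistics.
    """
    folds = np.array_split(X, K)
    rest = np.vstack([f for i, f in enumerate(folds) if i != k])
    mu = rest.mean(axis=0)
    S = (rest - mu).T @ (rest - mu)
    return len(rest), mu, S
```

Because each group's prior never sees that group's own data, maximizing the resulting objective behaves like a cross-validated model selection rather than fitting the prior to the training data it will score.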
Experimental conditions (1/2)
- Database: ATR Japanese speech database B-set
- Speaker: MHT
- Training data: 450 utterances
- Test data: 53 utterances
- Sampling rate: 16 kHz
- Window: Blackman window
- Frame size / shift: 25 ms / 5 ms
- Feature vector: 24 mel-cepstrum + Δ + ΔΔ and log F0 + Δ + ΔΔ (78 dimensions)
- HMM: 5-state left-to-right HMM without skip transitions
Experimental conditions (2/2)
- Compared approaches:

  System      | Training       | Context clustering       | # of states
  ML-MDL      | ML             | MDL                      | 2,491
  Bayes-Bayes | Bayes using CV | Bayes using CV           | 25,911
  Bayes-MDL   | Bayes          | MDL (threshold adjusted) | 2,553
  ML-Bayes    | ML             | MDL (threshold adjusted) | 27,106

- Mean Opinion Score (MOS) test
  - Subjects were 10 Japanese students
  - 20 sentences were chosen at random
Subjective listening test
[Figure: mean opinion scores for ML-MDL (2,491 states), Bayes-Bayes (25,911), Bayes-MDL (2,553), and ML-Bayes (27,106)]
Conclusions and future work
- A new framework based on the Bayesian approach
  - All processes are derived from a single predictive distribution
  - Improves the naturalness of synthesized speech
- Future work
  - Introduce HSMMs instead of HMMs
  - Investigate the relation between speech quality and model structures