Bayesian Speech Synthesis Framework Integrating Training and Synthesis Processes
Kei Hashimoto, Yoshihiko Nankaku, and Keiichi Tokuda
Nagoya Institute of Technology
September 23, 2010
Background
- Bayesian speech synthesis [Hashimoto et al., '08]
  - Represents the whole speech synthesis problem: all processes can be derived from one single predictive distribution
  - Approximation for estimating the posterior: the posterior is independent of the synthesis data
    ⇒ Training and synthesis processes are separated
- This work: integration of training and synthesis processes
  - Derive an algorithm in which the posterior and the synthesis data are iteratively updated
Outline
- Bayesian speech synthesis
  - Variational Bayesian method
  - Speech parameter generation
- Problem & proposed method
  - Approximation of the posterior
  - Integration of training and synthesis processes
- Experiments
- Conclusion & future work
Bayesian speech synthesis (1/2)
Model training and speech synthesis:
- ML:
  - Training: \hat{\lambda} = \arg\max_{\lambda} p(O \mid \lambda, L)
  - Synthesis: \hat{o} = \arg\max_{o} p(o \mid \hat{\lambda}, l)
- Bayes:
  - Training & synthesis in one step: \hat{o} = \arg\max_{o} p(o \mid O, l, L)
where o: synthesis data, O: training data, \lambda: model parameters, l: label seq. for synthesis, L: label seq. for training.
Bayesian speech synthesis (2/2)
Predictive distribution (marginal likelihood):
  p(o \mid O, l, L) \propto p(o, O \mid l, L) = \int \sum_{q} \sum_{Q} p(o, q \mid \lambda, l)\, p(O, Q \mid \lambda, L)\, p(\lambda)\, d\lambda
where q: HMM state seq. for the synthesis data, Q: HMM state seq. for the training data, p(o, q \mid \lambda, l): likelihood of the synthesis data, p(O, Q \mid \lambda, L): likelihood of the training data, p(\lambda): prior distribution over the model parameters.
⇒ Evaluated with the variational Bayesian method [Attias, '99]
Variational Bayesian method (1/2)
Estimate an approximate posterior distribution by maximizing a lower bound F on the log marginal likelihood (Jensen's inequality):
  \log p(o, O \mid l, L) \ge \Bigl\langle \log \frac{p(o, q \mid \lambda, l)\, p(O, Q \mid \lambda, L)\, p(\lambda)}{Q(q, Q, \lambda)} \Bigr\rangle_{Q(q, Q, \lambda)} = \mathcal{F}
where \langle \cdot \rangle_{Q} denotes the expectation w.r.t. Q, and Q(q, Q, \lambda) is an approximate distribution of the true posterior distribution.
Variational Bayesian method (2/2)
- Assume the random variables are statistically independent: Q(q, Q, \lambda) = Q(q)\, Q(Q)\, Q(\lambda)
- Optimal posterior distributions (maximizing F):
  Q(q) = \frac{1}{C_q} \exp \bigl\langle \log p(o, q \mid \lambda, l) \bigr\rangle_{Q(\lambda)}
  Q(Q) = \frac{1}{C_Q} \exp \bigl\langle \log p(O, Q \mid \lambda, L) \bigr\rangle_{Q(\lambda)}
  Q(\lambda) = \frac{1}{C_\lambda}\, p(\lambda) \exp \bigl( \langle \log p(o, q \mid \lambda, l) \rangle_{Q(q)} + \langle \log p(O, Q \mid \lambda, L) \rangle_{Q(Q)} \bigr)
where C_q, C_Q, C_\lambda are normalization terms.
⇒ Iterative updates, as in the EM algorithm (a minimal loop sketch follows below)
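A minimal sketch of the coordinate-ascent loop implied above. All names here (`update_Q_q`, `update_Q_Q`, `update_Q_lambda`, `lower_bound`) are hypothetical stand-ins, not the authors' implementation: a real system would run forward-backward over the HMM/HSMM states and accumulate sufficient statistics for the parameter posterior.

```python
# Sketch, assuming the three factorized posteriors of the slide above.
from typing import Any, Callable, Dict

def vb_iterate(
    update_Q_Q: Callable[[Dict[str, Any]], Any],       # posterior of training state seq. Q
    update_Q_lambda: Callable[[Dict[str, Any]], Any],  # posterior of model parameters
    update_Q_q: Callable[[Dict[str, Any]], Any],       # posterior of synthesis state seq. q
    lower_bound: Callable[[Dict[str, Any]], float],
    n_iters: int = 10,
    tol: float = 1e-6,
) -> Dict[str, Any]:
    """Alternate the three posterior updates until the lower bound F converges."""
    state: Dict[str, Any] = {"Q_q": None, "Q_Q": None, "Q_lambda": None}
    prev = -float("inf")
    for _ in range(n_iters):
        state["Q_Q"] = update_Q_Q(state)
        state["Q_lambda"] = update_Q_lambda(state)
        state["Q_q"] = update_Q_q(state)
        f = lower_bound(state)
        if f - prev < tol:   # each update is non-decreasing in F, as in EM
            break
        prev = f
    return state
```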
Speech parameter generation
- Speech parameter generation based on the Bayesian approach:
  - The lower bound F approximates the true marginal likelihood well
  - Generate speech parameters by maximizing the lower bound: \hat{o} = \arg\max_{o} \mathcal{F}
Outline
- Bayesian speech synthesis
  - Variational Bayesian method
  - Speech parameter generation
- Problem & proposed method
  - Approximation of the posterior
  - Integration of training and synthesis processes
- Experiments
- Conclusion & future work
Bayesian speech synthesis
- Maximize the lower bound F of the log marginal likelihood consistently for:
  - Estimation of the posterior distributions
  - Speech parameter generation
⇒ All processes are derived from the single predictive distribution
Approximation of the posterior
- The parameter posterior Q(λ) depends on the synthesis data
  ⇒ But the synthesis data is not observed
- Conventional approach [Hashimoto et al., '08]: assume the synthesis data is independent of Q(λ)
  ⇒ Estimate the posterior from only the training data (see the sketch below)
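One way to write this approximation down, in the notation of the earlier slides. This is a sketch of the conventional simplification, not necessarily the paper's exact formulation: the synthesis-data term is dropped from the parameter-posterior update.

```latex
% Sketch: conventional approximation, assuming the factorized posteriors above.
% The term involving the unobserved synthesis data o is dropped, so Q(lambda)
% is estimated from the training data alone.
\begin{align}
  Q(\lambda) &\simeq \frac{1}{C_\lambda}\, p(\lambda)\,
    \exp \bigl\langle \log p(O, Q \mid \lambda, L) \bigr\rangle_{Q(Q)}
\end{align}
```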
Separation of training & synthesis
Training:
1. Update the posterior distribution Q(Q) (HMM state sequence of the training data)
2. Update the posterior distribution Q(λ) (model parameters)
Synthesis:
3. Update the posterior distribution Q(q) (HMM state sequence of the synthesis data)
4. Generate the synthesis data
⇒ The synthesis data never feeds back into the training steps 1-2.
Use of generated data
- Problem:
  - The posterior distribution depends on the synthesis data
  - The synthesis data is not observed
- Proposed method:
  - Use generated data instead of observed data for estimating the posterior distribution
  - Iterative updates, as in the EM algorithm (see the sketch below)
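A hypothetical way to express the proposed substitution, again in the notation of the earlier slides: if \hat{o} denotes the currently generated synthesis data, the parameter-posterior update can be evaluated with \hat{o} standing in for the unobserved o. This is my reading of the slide, not a formula quoted from the paper.

```latex
% Sketch: proposed update, assuming the generated data \hat{o} replaces the
% unobserved synthesis data o, so both training and generated synthesis data
% shape the parameter posterior Q(lambda).
\begin{align}
  Q(\lambda) &\simeq \frac{1}{C_\lambda}\, p(\lambda)\,
    \exp \bigl(
      \bigl\langle \log p(\hat{o}, q \mid \lambda, l) \bigr\rangle_{Q(q)}
      + \bigl\langle \log p(O, Q \mid \lambda, L) \bigr\rangle_{Q(Q)}
    \bigr)
\end{align}
```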
Previous method
1. Update Q(Q) (HMM state sequence of the training data) — from the training data only
2. Update Q(λ) (model parameters) — from the training data only
3. Update Q(q) (HMM state sequence of the synthesis data)
4. Generate the synthesis data
Proposed method
1. Update Q(Q) (HMM state sequence of the training data) — from the training data and the generated synthesis data
2. Update Q(λ) (model parameters) — from the training data and the generated synthesis data
3. Update Q(q) (HMM state sequence of the synthesis data)
4. Generate the synthesis data, which is fed back into steps 1-2
Synthesis data
- The synthesis data can include several utterances
  - The synthesis data impacts the posterior distributions
  - How many utterances should be generated in one update step?
- Two methods are discussed:
  - Batch-based method: update the posterior distributions for several test sentences at once
  - Sentence-based method: update the posterior distributions for one test sentence at a time
Update method (1/2)
- Batch-based method:
  - The generated synthesis data of all test sentences is used for the update of the posterior distributions
  - The synthesis data of all test sentences is generated using the same posterior distributions
[diagram: Sentence 1, Sentence 2, ..., Sentence N sharing a single set of posterior distributions]
Update method (2/2)
- Sentence-based method:
  - The generated synthesis data of one test sentence is used for the update of the posterior distributions
  - The synthesis data of each test sentence is generated using different posterior distributions
[diagram: Sentence 1, Sentence 2, ..., Sentence N, each with its own posterior distributions]
(A code sketch contrasting the two methods follows below.)
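A minimal sketch contrasting the two update strategies. The helpers `estimate` and `generate` are hypothetical placeholders for VB posterior estimation and Bayesian parameter generation; only the control flow is the point here.

```python
# Sketch, assuming: estimate(train, synth) re-estimates the posterior dists.
# from training data plus generated data, and generate(posterior, label)
# produces one utterance for one test label sequence.

def estimate(train, synth):
    """Placeholder: posterior estimation from training + generated data."""
    return ("posterior", len(synth))

def generate(posterior, label):
    """Placeholder: generate one utterance for one test label sequence."""
    return ("utt", label, posterior)

def batch_update(train, labels, n_iters=3):
    """All N test sentences share one posterior; all regenerated per update."""
    posterior = estimate(train, [])
    for _ in range(n_iters):
        synth = [generate(posterior, lab) for lab in labels]
        posterior = estimate(train, synth)      # one update for all N sentences
    return posterior

def sentence_update(train, labels, n_iters=3):
    """Each test sentence gets its own posterior, updated independently."""
    posteriors = []
    for lab in labels:                           # N separate posterior dists.
        posterior = estimate(train, [])
        for _ in range(n_iters):
            synth = [generate(posterior, lab)]
            posterior = estimate(train, synth)   # update from one sentence only
        posteriors.append(posterior)
    return posteriors
```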
Outline
- Bayesian speech synthesis
  - Variational Bayesian method
  - Speech parameter generation
- Problem & proposed method
  - Approximation of the posterior
  - Integration of training and synthesis processes
- Experiments
- Conclusion & future work
Experimental conditions
- Database: ATR Japanese speech database B-set
- Speaker: MHT
- Training data: 450 utterances
- Test data: 53 utterances
- Sampling rate: 16 kHz
- Window: Blackman window
- Frame size / shift: 25 ms / 5 ms
- Feature vector: 24 mel-cepstral coefficients + Δ + ΔΔ and log F0 + Δ + ΔΔ (78 dimensions)
- HMM: 5-state left-to-right HSMM without skip transitions
Iteration process
Update of the posterior distributions and the synthesis data:
1. The posterior distributions are estimated from the training data
2. Initial synthesis data is generated
3. Context clustering using the training data and the generated synthesis data
4. The posterior distributions are re-estimated from the training data and the generated synthesis data (number of updates: 5)
5. The synthesis data is re-generated
6. Steps 3, 4, and 5 are iterated
(A sketch of this schedule follows below.)
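A sketch of this schedule as code. All three helper functions are hypothetical stubs standing in for the real procedures (decision-tree context clustering, VB re-estimation, Bayesian parameter generation); the loop structure mirrors steps 1-6 above.

```python
# Sketch of the experimental iteration schedule, assuming hypothetical stubs.

def cluster_contexts(train, synth):
    """Placeholder: decision-tree context clustering on both data sets."""
    return ("tree", len(train), len(synth))

def estimate_posteriors(train, synth, tree, n_updates=5):
    """Placeholder: n_updates VB re-estimation passes of the posterior dists."""
    return ("posteriors", tree, n_updates)

def generate(posteriors, labels):
    """Placeholder: regenerate synthesis data from the current posteriors."""
    return [("utt", lab) for lab in labels]

def run_schedule(train, labels, n_iters=3):
    posteriors = estimate_posteriors(train, [], tree=None)   # step 1
    synth = generate(posteriors, labels)                     # step 2
    for _ in range(n_iters):                                 # step 6: iterate
        tree = cluster_contexts(train, synth)                # step 3
        posteriors = estimate_posteriors(train, synth, tree) # step 4
        synth = generate(posteriors, labels)                 # step 5
    return synth
```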
Comparison of the number of updates
Data used for estimating the posterior distributions:
- Iteration 0: 450 training utterances
- Iteration 1: 450 utterances + 1 utterance generated in Iteration 0
- Iteration 2: 450 utterances + 1 utterance generated in Iteration 1
- Iteration 3: 450 utterances + 1 utterance generated in Iteration 2
Experimental results [figure: comparison of the number of updates]
Comparison of Batch and Sentence
Data used for estimating the posterior distributions:
- ML (baseline): ML training & generation, 450 utterances
- Bayes (Batch): 450 + 53 generated utterances
- Bayes (Sentence): 450 + 1 generated utterance (53 different posterior dists.)
Experimental results [figure: comparison of Batch and Sentence]
Conclusions and future work
- Integration of training and synthesis processes
  - Generated synthesis data is used for estimating the posterior distributions
  - The posterior distributions and the synthesis data are updated iteratively
  - Outperforms the baseline method
- Future work
  - Investigate the relation between the amounts of training and synthesis data
  - Experiments on various amounts of training data
Thank you