Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011
Kei Hashimoto, Shinji Takaki, Keiichiro Oura, and Keiichi Tokuda
Nagoya Institute of Technology
2 September, 2011
Background
HMM-based speech synthesis
- The quality of synthesized speech depends on the acoustic models
- Model estimation is one of the most important problems
Appropriate training algorithms are required:
- Deterministic annealing EM (DAEM) algorithm
  - To overcome the local maxima problem
- Step-wise model selection
  - To jointly optimize model structures and state sequences
Outline
- HMM-based speech synthesis system
- Deterministic annealing EM (DAEM) algorithm
- Step-wise model selection
- Experiments
- Conclusion & future work
Overview of HMM-based system
Training part:
- Speech database (speech signals and labels)
- Excitation parameter extraction and spectral parameter extraction
- Training of HMMs → context-dependent HMMs & duration models
Synthesis part:
- Text analysis (TEXT → labels)
- Parameter generation from HMMs (excitation and spectral parameters)
- Excitation generation and synthesis filter → synthesized speech
Base techniques
- Hidden semi-Markov model (HSMM)
  - HMM with explicit state duration probability distributions
  - Estimates state output and duration probability distributions
- STRAIGHT
  - A high-quality speech vocoding method
  - Spectrum, F0, and aperiodicity measures
- Parameter generation considering global variance (GV)
  - GV features are calculated only from speech regions, excluding silence and pauses
  - Context-dependent GV models
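The GV computation described above can be sketched in a few lines. This is a minimal illustration, not the system's actual code: the function name `global_variance` and the frame-level silence mask are assumptions.

```python
import numpy as np

def global_variance(mcep, is_silence):
    """Per-utterance global variance of mel-cepstral frames, computed
    only over speech frames (silence and pauses excluded), as on the
    slide.  Returns one variance per cepstral dimension."""
    speech = mcep[~np.asarray(is_silence)]
    return speech.var(axis=0)

# toy example: 4 frames x 2 dims; frames 0 and 3 are silence
mcep = np.array([[0.0, 0.0],
                 [1.0, 2.0],
                 [3.0, 4.0],
                 [0.0, 0.0]])
gv = global_variance(mcep, [True, False, False, True])
```

Excluding silence matters because silent frames have near-constant cepstra and would artificially shrink the variance the GV model is meant to capture.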
Outline
- HMM-based speech synthesis system
- Deterministic annealing EM (DAEM) algorithm
- Step-wise model selection
- Experiments
- Conclusion & future work
EM algorithm
Maximum likelihood (ML) criterion:
  λ̂ = argmax_λ p(O | λ)
  (λ: model parameters, O: training data, q: HMM state sequence)
Expectation-Maximization (EM) algorithm:
- E-step: compute the posterior p(q | O, λ) over hidden state sequences
- M-step: update λ to maximize the expected complete-data log likelihood
The EM algorithm suffers from the local maxima problem.
DAEM algorithm
Tempered posterior probability:
  f(q | O, λ) ∝ p(O, q | λ)^β
  (β: temperature parameter, 0 < β ≤ 1)
Model update process:
- E-step: compute the tempered posterior f(q | O, λ)
- M-step: update λ using the tempered posterior
- Increase the temperature parameter β toward 1 (at β = 1 the update is ordinary EM)
Optimization of state sequence
Likelihood function in the DAEM algorithm (figure: state output and state transition probabilities over time)
- At a low temperature, all state sequences have nearly uniform probability
Optimization of state sequence
Likelihood function in the DAEM algorithm (figure: state output and state transition probabilities over time)
- As the temperature rises, the posterior changes from uniform to sharp
Optimization of state sequence
Likelihood function in the DAEM algorithm (figure: state output and state transition probabilities over time)
- This annealing lets us estimate reliable acoustic models
Outline
- HMM-based speech synthesis system
- Deterministic annealing EM (DAEM) algorithm
- Step-wise model selection
- Experiments
- Conclusion & future work
Problem of context clustering
Context-dependent models
- Appropriate model structures are required
Decision-tree-based context clustering (questions such as "Vowel?", "/a/?", "Silence?")
- Assumption: state occupancies do not change during clustering
  - In fact, state occupancies depend on the model structure
  - State sequences and model structures should therefore be optimized simultaneously
Step-wise model selection
Gradually change the size of the decision trees
- Performs joint optimization of model structures and state sequences
Minimum Description Length (MDL) criterion, using:
- a tuning parameter (scales the penalty term)
- the number of nodes
- the dimension of the feature vector
- the amount of training data assigned to the root node
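A minimal sketch of how the tuning parameter scales the MDL penalty when deciding whether to split a node. The diagonal-Gaussian parameter count and both function names are assumptions for illustration, not the system's actual clustering code.

```python
import math

def mdl_score(log_likelihood, n_leaves, dim, n_frames_root, alpha=1.0):
    """Two-term MDL score: model fit plus a complexity penalty.
    alpha is the tuning parameter from the slide; alpha > 1 favours
    smaller trees, alpha < 1 allows larger ones."""
    # assume each leaf holds a diagonal Gaussian: dim means + dim variances
    n_params = 2 * dim * n_leaves
    penalty = alpha * 0.5 * n_params * math.log(n_frames_root)
    return -log_likelihood + penalty

def accept_split(ll_before, ll_after, n_leaves, dim, n_frames_root, alpha):
    """Accept a node split only if it lowers the MDL score."""
    before = mdl_score(ll_before, n_leaves, dim, n_frames_root, alpha)
    after = mdl_score(ll_after, n_leaves + 1, dim, n_frames_root, alpha)
    return after < before
```

With a large alpha a split must buy much more likelihood to be accepted, which is why lowering alpha step-wise (4 → 2 → 1, as on the next slide) grows the tree gradually.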
Model training process
1. Estimate monophone models (DAEM)
   - Number of temperature parameter updates: 10
   - Number of EM steps at each temperature: 5
2. Select decision trees by the MDL criterion using the tuning parameter
3. Estimate context-dependent models (EM)
   - Number of EM steps: 5
4. Decrease the tuning parameter
   - The tuning parameter decreases as 4, 2, 1
5. Repeat from step 2
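The steps above can be written down as an explicit plan. The sketch below only enumerates the schedule; the stage names are illustrative, and each tuple stands in for a real DAEM, clustering, or EM run.

```python
def training_schedule(n_temp_updates=10, em_per_temp=5,
                      alphas=(4, 2, 1), em_per_alpha=5):
    """Return the training plan from the slide as
    (stage, setting, n_em_steps) tuples."""
    plan = []
    # step 1: monophone DAEM; temperature raised 10 times, 5 EM steps each
    for i in range(1, n_temp_updates + 1):
        plan.append(("daem_monophone", f"temperature_update_{i}", em_per_temp))
    # steps 2-5: re-select trees with the current tuning parameter,
    # re-estimate with EM, then decrease the tuning parameter (4, 2, 1)
    for alpha in alphas:
        plan.append(("mdl_clustering", f"alpha={alpha}", 0))
        plan.append(("em_context_dependent", f"alpha={alpha}", em_per_alpha))
    return plan

plan = training_schedule()
```

Laying the loop out this way makes the joint optimization visible: each decrease of the tuning parameter re-clusters with the state occupancies produced by the previous EM pass.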
Outline
- HMM-based speech synthesis system
- Deterministic annealing EM (DAEM) algorithm
- Step-wise model selection
- Experiments
- Conclusion & future work
Speech analysis conditions
- Training data: 10,000 utterances (pruned by the alignment likelihood)
- Sampling rate: 48 kHz
- Window: F0-adaptive Gaussian window
- Frame shift: 5 ms
- Feature vector: 49-dim. STRAIGHT mel-cepstrum, log F0, 26 band-filtered aperiodicity measures + ΔΔ (231 dimensions)
- HMM: 5-state left-to-right HSMM without skip transitions
Likelihood & model structure
Average log likelihood of monophone models:
- EM: 227.716
- DAEM: 229.174
Number of leaf nodes:

  Tuning param.   Mel-cep.   Log F0    Dur.     Sum
  Monophone           290       290      58      638
  4                 1,934     3,454     914    6,302
  2                 3,270     7,899   1,760   12,929
  1                11,721    24,897   3,923   40,541

- Phone set: Unilex (58 phonemes)
- Number of leaf nodes (full-context): 6,175,466
Experimental results

                      HMM-based (16 kHz)   HMM-based (48 kHz)   Unit selection
  Naturalness                 ―                    ―                  ―
  Speaker similarity          ―                    ―                  ×
  Intelligibility             ―                    ―                  ○

Compared with the benchmark HMM-based systems:
- The NIT system achieved the same performance
- High intelligibility
Compared with the benchmark unit-selection system:
- Worse in speaker similarity
- Better in intelligibility
Speech samples
Original vs. NIT system
- Generates highly intelligible speech
- Includes voiced/unvoiced errors
- Feature extraction and excitation need improvement
Conclusion
NIT HMM-based speech synthesis system
- DAEM algorithm
  - Overcomes the local maxima problem
- Step-wise model selection
  - Performs joint optimization of state sequences and model structures
- Generates highly intelligible speech
Future work
- Improve feature extraction and excitation
- Investigate the schedules of the temperature parameter and the step-wise model selection
Thank you