Towards optimal TTS corpora
CADIC Didier, BOIDIN Cedric, D'ALESSANDRO Christophe

Unit-selection TTS
[Diagram: input text "This is an example." → Linguistic modules → Unit selection (from the Speaker database) → Unit concatenation]
Towards optimal TTS corpora France Telecom Group restricted

Unit-selection TTS
How to prepare the recording script?

Preparation of the recording script
Classic optimization approach
• Criterion = diphone and triphone coverage
• Algorithm = greedy, corpus condensation
• The link between diphone or triphone coverage and the final TTS quality is not clear
• The process is constrained by the limited combinations encountered in the finite reference corpus
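The classic approach above amounts to a greedy set cover over a reference corpus. A minimal sketch follows, assuming sentences are already phonetized; the function names, the toy corpus and the restriction to diphones are illustrative assumptions, not taken from the slides.

```python
def diphones(phonemes):
    """Return the set of adjacent phoneme pairs in one sentence."""
    return {(a, b) for a, b in zip(phonemes, phonemes[1:])}


def condense(corpus, max_sentences):
    """Greedy condensation: repeatedly keep the sentence adding the most unseen diphones."""
    remaining = {sid: diphones(ph) for sid, ph in corpus.items()}
    covered, script = set(), []
    while remaining and len(script) < max_sentences:
        # sorted() makes tie-breaking deterministic (first id in alphabetical order wins)
        best = max(sorted(remaining), key=lambda sid: len(remaining[sid] - covered))
        if not remaining[best] - covered:
            break  # the reference corpus has nothing new left to offer
        covered |= remaining.pop(best)
        script.append(best)
    return script, covered


# Hypothetical toy corpus: sentence id -> phoneme sequence.
toy_corpus = {
    "s1": ["a", "b", "a", "c"],
    "s2": ["a", "b"],
    "s3": ["c", "a", "d"],
}
script, covered = condense(toy_corpus, max_sentences=2)
```

The early `break` illustrates the second limitation listed on the slide: once no remaining sentence contributes a new unit, condensation cannot go further than what the finite reference corpus contains.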

Preparation of the recording script
Our optimization approach
• Criterion = vocalic sandwich coverage
• Algorithm = greedy, sentence construction

Vocalic sandwiches (Cadic et al., Interspeech 2009)

Sentence construction
Towards optimality
Finite State Transducers compute "optimal" sequences of sandwiches, so that:
- the coverage increment is maximized (greedy approach)
- only sandwich transitions observed in a reference corpus are allowed
Towards readability
• No syntactic or semantic considerations are taken into account, so generated sequences are likely to be nonsense
• Development of a semi-automatic tool, allowing an operator to iteratively correct generated sequences, in order to build an acceptable and almost optimal sentence
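The two constraints above (maximize the coverage increment, allow only transitions seen in a reference corpus) can be illustrated without finite state transducers. The sketch below swaps the FSTs for a plain step-by-step greedy loop; the unit labels "A" to "D" stand in for vocalic sandwiches, and all names are hypothetical.

```python
def allowed_transitions(reference_sequences):
    """Collect every unit-to-unit transition observed in the reference corpus."""
    successors = {}
    for seq in reference_sequences:
        for a, b in zip(seq, seq[1:]):
            successors.setdefault(a, set()).add(b)
    return successors


def build_sequence(start, successors, covered, max_len):
    """Extend the sequence one unit at a time, preferring units not yet covered."""
    sequence = [start]
    newly_covered = {start} - covered
    while len(sequence) < max_len:
        candidates = sorted(successors.get(sequence[-1], ()))
        if not candidates:
            break  # no observed transition leaves the current unit
        # New units sort first (False < True); ties are broken alphabetically.
        nxt = min(candidates, key=lambda u: (u in covered or u in newly_covered, u))
        sequence.append(nxt)
        if nxt not in covered:
            newly_covered.add(nxt)
    return sequence, newly_covered


# Hypothetical reference corpus and an already-covered unit "C".
successors = allowed_transitions([["A", "B", "C"], ["B", "D"], ["D", "A"]])
sequence, gained = build_sequence("A", successors, covered={"C"}, max_len=4)
```

Nothing in this loop looks at syntax or meaning, which is why the resulting sequences need the operator-driven correction step described on the slide.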

Sentence construction (step-by-step example)
[Screenshots of the construction tool are not recoverable; only the English glosses of the successive sentence states survive:]
- (I don't the week of the six.)
- (I don't…)
- (I don't take it out…)
- (I don't take it out the weeks…)
- (I don't take it out the weeks like you.)
- (I don't take it out the black weeks,)

Sentence construction
• The procedure is time-consuming (around 3 min and 50 steps to build a plausible sentence)
• Most built sentences lack semantic coherence (redundancy is minimized at the price of semantics)
• Built scripts are much denser than with corpus condensation

Sentence construction
[Plot: sandwich coverage rate (%); curves not recoverable]
Density increase of 30 to 40% compared to condensation

Conclusion
For the creation of unit-selection TTS recording scripts:
• We suggested using the Vocalic Sandwiches Coverage Rate as the optimization criterion (since it is a convenient symbolic approximation of the selection cost)
• We presented a novel corpus building technique, based on sentence construction rather than sentence selection. The procedure is time-consuming and built sentences tend to lack semantic coherence, but a density increase of 30 to 40% can be obtained.
Recent work (SSW 7 submission)
• Extensive evaluation of the vocalic sandwiches as optimization criterion
• Construction of full recording scripts. Density estimations seem to be confirmed. However, semantic limitations had significant repercussions on the reading stage.

Database constitution: two ways
Rushes from DVD, websites…
• Expensive process, poor TTS quality
• The only way to reach otherwise inaccessible voices
OR
Dedicated recordings (script read by a speaker)
• Control of the content
• Best TTS quality

Vocalic sandwiches (Cadic et al., Interspeech 2009)
Given an input sentence, the selection module searches the database for units presenting:
• Maximum adequacy to the target sequence (target cost)
• Minimum distortion between consecutive units (concatenation cost)
[Illustration]
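The search described above, trading off target cost against concatenation cost, is commonly implemented as a Viterbi search over candidate units. The sketch below is a generic illustration under that assumption, not the selection module used by the authors; the unit representation `(label, database_index)` and both toy cost functions are hypothetical.

```python
def select_units(targets, candidates, target_cost, concat_cost):
    """Viterbi search for the candidate sequence with minimum total cost."""
    # best[u] = (cost of the cheapest path ending in candidate u, that path)
    best = {u: (target_cost(targets[0], u), [u]) for u in candidates[0]}
    for t, options in zip(targets[1:], candidates[1:]):
        new_best = {}
        for u in options:
            prev_cost, prev_path = min(
                ((c + concat_cost(p[-1], u), p) for c, p in best.values()),
                key=lambda item: item[0],
            )
            new_best[u] = (prev_cost + target_cost(t, u), prev_path + [u])
        best = new_best
    return min(best.values(), key=lambda item: item[0])


# Toy costs (assumptions): a unit matches its target if the labels agree,
# and two units join for free if they were adjacent in the database.
def toy_target_cost(target, unit):
    return 0 if unit[0] == target else 10


def toy_concat_cost(prev, unit):
    return 0 if unit[1] == prev[1] + 1 else 1


targets = ["a", "b", "c"]
candidates = [
    [("a", 0), ("a", 5)],
    [("b", 6), ("b", 9)],
    [("c", 7), ("c", 2)],
]
cost, path = select_units(targets, candidates, toy_target_cost, toy_concat_cost)
```

In this toy database the units at indices 5, 6 and 7 were recorded contiguously, so the search selects them and pays no concatenation cost.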

Vocalic sandwiches (Cadic et al., Interspeech 2009)
Correlations of coverage rates with the selection cost:
• Vocalic sandwiches: -0.78
• Diphones: -0.44
• Triphones: -0.64
[Illustration]
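The slides do not say which correlation coefficient was used. Assuming a Pearson coefficient, a figure of this kind could be computed as sketched below; the input numbers are made-up toy values chosen only to show the sign convention (higher coverage, lower selection cost), not data from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up toy values: coverage rate (%) of four scripts and their mean selection cost.
coverage_rates = [10, 20, 30, 40]
selection_costs = [8, 6, 4, 2]
r = pearson(coverage_rates, selection_costs)
```

A value close to -1 means the coverage rate is a good predictor of a low selection cost, which is the sense in which the strongest of the three reported correlations favours vocalic sandwiches.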

Sentence construction
Finite State Transducers compute "optimal" sequences of sandwiches, so that:
- the coverage increment is maximized (greedy approach)
- only sandwich transitions observed in a reference corpus are allowed
Coverage increment is averaged over the sequence length.
15 FSTs give 15 optimal sandwich sequences, one for each length ≤ 15:
- Optimal sequence of length 1
- Optimal sequence of length 2
- Optimal sequence of length 3
- Optimal sequence of length 4
- …
- Optimal sequence of length 15
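The idea of one optimal sequence per length, scored by its average coverage increment, can be sketched with plain dynamic programming in place of FSTs. This is a simplification under stated assumptions: each unit has a fixed hypothetical gain, gains are treated as additive (so a unit repeated within a sequence is counted each time), and all names are illustrative.

```python
def best_sequences(successors, gain, max_len):
    """For each length 1..max_len, return (average gain, sequence) of the best allowed path."""
    # layer[u] = (total gain, path) of the best path of the current length ending in u
    layer = {u: (gain[u], [u]) for u in sorted(gain)}
    results = {}
    for length in range(1, max_len + 1):
        if not layer:
            break  # no allowed path of this length exists
        total, path = max(layer.values(), key=lambda item: item[0])
        results[length] = (total / length, path)
        nxt = {}
        for u, (g, p) in layer.items():
            for v in sorted(successors.get(u, ())):
                cand = (g + gain[v], p + [v])
                if v not in nxt or cand[0] > nxt[v][0]:
                    nxt[v] = cand
        layer = nxt
    return results


# Hypothetical transitions and per-unit coverage gains.
successors = {"A": {"B"}, "B": {"C", "D"}, "D": {"A"}}
gain = {"A": 1, "B": 0, "C": 3, "D": 1}
results = best_sequences(successors, gain, max_len=3)
```

Averaging over the length makes sequences of different lengths comparable: here the single unit "C" has the highest average gain, while the length-3 sequence has the highest total.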