Text-To-Speech Synthesis: An Overview

What is a TTS System
o Goal
  - A system that can read any text
  - Automatic production of new sentences
  - Not just audio playback
    o Simple voice response systems
o Definition
  - The production of speech by machines, by way of the automatic phonetization of the sentences to utter

Text-To-Speech
o Text Processing
  - Text Normalization
  - Pronunciation
  - Timing and Intonation
o Speech Generation
  - Segmental Concatenation
  - Waveform Synthesis

Functional Diagram: TTS Synthesizer
[Diagram: Text -> Natural Language Processing (Morphosyntactic Analysis, Letter-to-Sound, Prosody Generation) -> Narrow Phonetic Transcription (Phones, Prosody) -> Digital Signal Processing (Mathematical Models, Algorithms, Computations) -> Speech]

The Natural Language Processing Module
[Diagram: Text -> NLP Module -> Phone Names, Prosody. Blocks inside the NLP Module: Preprocessor, Morphological Analyzer, Contextual Analyzer, Syntactic and Prosodic Parser (Morphosyntactic Analyzer), Letter-to-Sound Module, Natural Prosody Generator]

Text Preprocessing: Challenges
o Text Segmentation - Tokenization
  - (I) ( ) (know) ( ) (1) (,) (000) ( ) (words)
o Sentence End Detection
  - Jones lives at the end of St. James St.
o Normalization
  - Abbreviations
    o κ.: κύριος, κυρίου, κύριο (inflected forms of "mister")
    o κ.: κύριος, κιλό ("mister" or "kilo")
  - Acronyms
    o ΦΠΑ (VAT), ΔΕΗ (Public Power Corporation), ΝΑΤΟ (NATO)
  - Numbers
    o 1.023,32
    o 12/1/2002
    o 13:23
    o 12.15πμ (πμ = a.m.)
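The sentence-end problem above can be illustrated with a minimal sketch. It assumes a hand-made abbreviation list (`ABBREVIATIONS` is hypothetical, not from the slides) and treats a full stop as a sentence end unless it belongs to a listed abbreviation. It is a toy heuristic, not a complete detector.

```python
# Hypothetical abbreviation list; a real system would use a much larger lexicon.
ABBREVIATIONS = {"st.", "mr.", "dr."}


def split_sentences(text):
    """Split text into sentences with a simple abbreviation-aware rule."""
    words = text.split()
    sentences, current = [], []
    for i, word in enumerate(words):
        current.append(word)
        is_last = i == len(words) - 1
        ends_with_stop = word.endswith((".", "!", "?"))
        is_abbreviation = word.lower() in ABBREVIATIONS
        # Close the sentence at the end of the text, or at a stop mark
        # that is not part of a known abbreviation.
        if is_last or (ends_with_stop and not is_abbreviation):
            sentences.append(" ".join(current))
            current = []
    return sentences


print(split_sentences("It rains. Jones lives at the end of St. James St."))
```

The rule keeps "St. James" together, but it cannot tell that a trailing "St." also ends a sentence when more text follows. That ambiguity is why sentence end detection is listed as a challenge.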

Text Preprocessing: Dealing with Non-Standard Words
o Tokenizer
  - Breaks up single tokens that need splitting
  - 12:35 AM -> 12 : 35 AM
o Classifier
  - Determines the most likely class for a given token
  - January 1956 (year) vs. 1956 potatoes (quantity)
o Expansion Module
  - Methods for expanding numbers and classes that can be handled algorithmically
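The three stages above can be sketched as a small pipeline. This is an illustrative sketch only: the function names, the month-based year rule, and the English number wording are all assumptions, and the expansion is only meant for small non-negative integers.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
MONTHS = {"January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"}


def tokenize(text):
    """Tokenizer: split digit runs, letter runs and punctuation apart."""
    tokens = []
    for chunk in text.split():
        tokens.extend(re.findall(r"\d+|[A-Za-z]+|[^\sA-Za-z\d]", chunk))
    return tokens


def classify(tokens, i):
    """Classifier: assumed rule, a 4-digit number after a month is a year."""
    token = tokens[i]
    if not token.isdigit():
        return "word"
    if len(token) == 4 and i > 0 and tokens[i - 1] in MONTHS:
        return "year"
    return "cardinal"


def number_to_words(n):
    """Expansion: spell out a cardinal (intended for 0..999999)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = " " + number_to_words(rest) if rest else ""
        return ONES[hundreds] + " hundred" + tail
    thousands, rest = divmod(n, 1000)
    tail = " " + number_to_words(rest) if rest else ""
    return number_to_words(thousands) + " thousand" + tail


def year_to_words(n):
    """Expansion: read a year as two two-digit groups (1956 -> 19 | 56)."""
    if n % 1000 == 0:
        return number_to_words(n)
    high, low = divmod(n, 100)
    if low == 0:
        return number_to_words(high) + " hundred"
    if low < 10:
        return number_to_words(high) + " oh " + ONES[low]
    return number_to_words(high) + " " + number_to_words(low)


def expand(text):
    """Run tokenizer, classifier and expansion over a text."""
    tokens = tokenize(text)
    out = []
    for i, token in enumerate(tokens):
        kind = classify(tokens, i)
        if kind == "year":
            out.append(year_to_words(int(token)))
        elif kind == "cardinal":
            out.append(number_to_words(int(token)))
        else:
            out.append(token)
    return " ".join(out)


print(tokenize("12:35AM"))
print(expand("January 1956"))
print(expand("1956 potatoes"))
```

The same digit string "1956" is read differently depending on the class the classifier assigns, which is the point of the slide's two examples.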

Text Preprocessing: Dealing with Non-Standard Words
o Not all tokens can be handled with a deterministic set of rules
o Methods for designing domain-dependent expansion and tagging modules
  - Supervised: work on tagged text corpus
  - Unsupervised: work on raw text
o Determines the probability of a tag t given the observed string o
  - p(o): the probability of the observed text
  - p(t): the prior probability of observing the tag t in the text
  - p(o|t): a trigram letter language model for predicting observations of a particular tag t
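The slide's own equation did not survive extraction. The three quantities defined above combine through Bayes' rule, and the block below is a reconstruction under that assumption. The trigram factorization over the letters o_1 ... o_n of the observed string is likewise an assumed standard form, not taken from the slide.

```latex
\[
  p(t \mid o) = \frac{p(o \mid t)\, p(t)}{p(o)},
  \qquad
  \hat{t} = \arg\max_{t}\; p(o \mid t)\, p(t)
\]
\[
  p(o \mid t) \approx \prod_{i=1}^{n} p(o_i \mid o_{i-2},\, o_{i-1},\, t)
\]
```

Because p(o) does not depend on t, it can be dropped when only the most likely tag is needed.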

Morphological Analysis
o Function Words
  - Determiners, Pronouns, Prepositions, Conjunctions
  - Skeleton of sentence
  - Stored in lexicon, along with pronunciation
o Content Words
  - Inflection + Compounding
  - Used for pronunciation and stressing

Synthesis
o Input
  - Sequence of phonemes
  - Prosodic Information
o Output
  - Digital Speech

Synthesis Strategies
o Synthesis by Rule
  - Cognitive approach of the phonation mechanism
  - Speech is produced by mathematical rules that formally describe the influence of phonemes on one another
o Synthesis by Concatenation
  - Limited knowledge of the data to be handled
  - Elementary speech units are stored in a database and then concatenated and processed to produce the speech signal

Synthesis by Rule: Functional Diagram
[Diagram. Inputs: Phone Names, Prosody. Output: Speech. Blocks: Speech Corpus, Speech Analysis, Parametric Speech Corpus, Rule Finding, Rule Database, Rule Matching, Signal Synthesis. Region labels: Speech Science, Signal Processing, DSP Module]

Synthesis by Rule: Analysis and Synthesis
o Preparation
  - Words are read by a professional speaker
  - Data parameterization through speech analyzer
  - Rule extraction (manual)
  - Trial-and-error optimization
o Synthesis
  - Rules are matched to phonetic input
  - Production of parametric signal
  - Synthesis of speech signal by re-implementing analysis model

Synthesis by Rule: Segmental Quality
o Rule Efficiency
o Corpus Quality
  - Choice of utterances and recording quality
  - Intrinsic Errors: Accuracy of model describing high-quality speech
    o Even simple analysis-resynthesis may produce problems!
  - Extrinsic Errors: Parameter extraction algorithm
    o Improvements during trial-and-error tuning

Synthesis by Rule: Formant Synthesizers
+ Speech is a dynamic evolution of up to 60 parameters
  - Formant, antiformant frequencies and bandwidths
  - Glottal waveforms
+ Almost free of modeling errors
− Difficult to estimate
− Time consuming
  - Intensive trial-and-error testing to cope with extrinsic errors
− Signal buzziness, low signal quality
  - High-quality synthesis rules are yet to be discovered

Synthesis by Concatenation: Functional Diagram
[Diagram. Inputs: Phone Names, Prosody. Output: Speech. Blocks: Speech Corpus, Selective Segmentation, Speech Segment DB, Segment Info, Speech Analysis, Parametric Segment DB, Equalization, Speech Coding, Synthesis Segment DB, Segment List Generation, Speech Decoding, Prosody Matching, Concatenation, Signal Synthesis. Region labels: Speech Science, Signal Processing, DSP Module]

Synthesis by Concatenation: Analysis - Database Preparation
o Choose the appropriate speech units
  - Diphones, Half-Syllables and Triphones
o Compile and record utterances
o Segment signal and extract speech units
o Store segment waveforms (along with context) and extended information in database
o Extract parameters and create parametric segment database
  - Useful for data compaction
  - Easier prosody matching and modification
o Perform amplitude equalization to prevent mismatches

Synthesis by Concatenation: Unit Database Issues
o Very large combinatorial space of combinations of phonemes and prosodic contexts
  - In English: 43 phones, 79,507 possible triphones, only 70,000 used
  - Which of them should we keep?
o Unit Selection vs Concatenative Synthesis
  - We record a large speech corpus
  - In unit selection, the corpus is segmented into phonetic units, indexed, and used as-is
    o Unit selection is made on-line
  - In concatenative synthesis, the selection is made offline and manually!
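The triphone figure above follows directly from the phone count: a triphone is an ordered triple of phones, so 43 phones give 43^3 combinations. A quick check of the arithmetic, with variable names chosen only for illustration:

```python
phones = 43                        # English phone inventory quoted on the slide
possible_triphones = phones ** 3   # ordered triples of phones
used_triphones = 70_000            # figure quoted on the slide

print(possible_triphones)                             # 79507
print(f"{used_triphones / possible_triphones:.0%}")   # 88%
```

Even before prosodic context is taken into account, the unit inventory is already in the tens of thousands.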

Concatenating Segments: The PSOLA Method
o Pitch Synchronous Overlap and Add
  - A window (2 pitch periods long) is multiplied with the signal
  - The signal is broken into a set of localized signals (non-zero only at the window intervals)
o Pitch Modification
  - Relative shifting of localized signals
    o Spacing reflects pitch duration
  - Good result for modification factor β in [0.6, 1.5]
o Duration
  - Localized signals are added or deleted from output
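The steps above can be sketched in plain Python under strong simplifying assumptions: a constant, known pitch period in samples, pitch marks placed at multiples of that period, and a Hann window. The function name `psola` and its parameters are hypothetical, and real PSOLA implementations track pitch marks per glottal period and treat unvoiced regions separately.

```python
import math


def hann(length):
    """Hann window; two copies shifted by half the length sum to 1."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def psola(signal, period, beta=1.0, duration=1.0):
    """Toy overlap-add resynthesis (assumes a constant pitch period).

    beta     > 1 raises the pitch (localized signals are spaced closer).
    duration > 1 lengthens the output (localized signals are repeated).
    """
    window = hann(2 * period + 1)
    # Assumed pitch marks: one every `period` samples.
    marks = list(range(period, len(signal) - period, period))
    if not marks:
        raise ValueError("signal is shorter than two pitch periods")
    # Localized signals: the signal times a window two periods long.
    segments = [[signal[m - period + k] * window[k]
                 for k in range(2 * period + 1)] for m in marks]

    new_period = max(1, round(period / beta))
    out_len = int(len(signal) * duration)
    output = [0.0] * out_len
    t = period
    while t < out_len - period:
        # Pick the localized signal closest to the matching input time;
        # this repeats or drops segments when duration or beta changes.
        src_time = t / duration
        idx = min(range(len(marks)), key=lambda i: abs(marks[i] - src_time))
        for k, value in enumerate(segments[idx]):
            output[t - period + k] += value
        t += new_period  # new spacing sets the output pitch period
    return output


tone = [math.sin(2 * math.pi * n / 50) for n in range(1000)]
unchanged = psola(tone, period=50)
higher = psola(tone, period=50, beta=1.25)
longer = psola(tone, period=50, duration=2.0)
print(len(unchanged), len(higher), len(longer))
```

With beta = 1 and duration = 1 the overlapping windows sum to one, so the interior of the signal is reproduced unchanged. For other values of beta the overlap changes, so the output level changes too, and a practical implementation would normalize for that.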

Concatenative and Rule-Based Synthesis: Comparison
o Concatenative Synthesis is the state-of-the-art
  - Storage is of little concern now
    o Storing the segment database is no longer an issue
  - Advances in ensuring smoothness in concatenations
    o Rule-based synthesis output used to be smoother
  - Certain sounds are too hard to be produced by rule
    o Vowels are easy to create by rule
    o Bursts and voiceless stops are too difficult; we do not fully understand their production mechanisms