5 - Text To Speech (TTS)
Speech Synthesis Concept
Phone Units
Phone Sequence To Speech
Naturalness
– Concatenative Approaches
– Rule-Based Approaches

Speech Synthesis Concept
– Text to Phone Sequence: Natural Language Processing (NLP)
– Phone Sequence to Speech: Speech Processing

Phone Units
– Paragraph
– Sentence
– Word (depends on the language; usually more than 100,000)
– Syllable
– Diphone and Triphone
– Phoneme (between 10 and 100)

Phone Units (Cont’d)
Diphone: models the transition between two adjacent phonemes.
[Figure: a phoneme sequence p1 p2 p3 p4 p5 ..., with spans labeled "Phoneme" and "Diphone"]
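
The idea of a diphone as a unit spanning two adjacent phonemes can be sketched in a few lines. This is only an illustration: the function name `to_diphones` and the symbols `p1` ... `p5` are placeholders, and a real system would usually cut each unit from the middle of one phoneme to the middle of the next rather than pair whole symbols.

```python
def to_diphones(phonemes):
    """Return the adjacent phoneme pairs (diphones) of a sequence, in order."""
    return [(a, b) for a, b in zip(phonemes, phonemes[1:])]


# Placeholder phoneme symbols, as in the slide figure.
phones = ["p1", "p2", "p3", "p4", "p5"]
diphones = to_diphones(phones)
print(diphones)
```

A sequence of n phonemes yields n - 1 diphones, which is why a diphone inventory has to cover every phoneme pair that can occur in the language.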

Phone Units (Cont’d)
– Farsi has 30 phonemes, so in theory there are 30 × 30 = 900 diphones. In practice, the only diphone that does not occur in Farsi is /zho/.
– In theory there are 30 × 30 × 30 = 27,000 triphones, but in practice only about 15,000 occur in Farsi.
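
The inventory sizes quoted above follow from simple counting. The short sketch below redoes the arithmetic; the variable names are ours, and the 15,000 figure is the approximate value from the slide, not a measured one.

```python
NUM_PHONEMES = 30  # Farsi phoneme count quoted on the slide

theoretical_diphones = NUM_PHONEMES ** 2   # every ordered pair of phonemes
theoretical_triphones = NUM_PHONEMES ** 3  # every ordered triple of phonemes

observed_triphones = 15000  # approximate figure quoted on the slide
coverage = observed_triphones / theoretical_triphones

print(theoretical_diphones, theoretical_triphones, round(coverage, 2))
```

Roughly half of the theoretical triphones occur in practice, which is what makes a triphone inventory feasible to store at all.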

Phone Units (Cont’d)
– Syllable = Onset (consonant) + Rhyme
– A syllable is a set of phonemes that contains exactly one vowel.
– Syllables in Farsi: CV, CVCC. Farsi has about 4,000 syllables.
– Syllables in English: V, CVC, CCVCC, CCCVCC, ... The number of syllables in English is very large.
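
The "exactly one vowel" rule and the onset/rhyme split can be made concrete with a small sketch. Everything here is an assumption for illustration: the vowel set uses plain Latin letters as stand-ins for phoneme symbols, and the helper names are hypothetical.

```python
# Stand-in vowel symbols; a real system would use the language's phoneme set.
VOWELS = {"a", "e", "i", "o", "u"}


def cv_pattern(phonemes):
    """Map a phoneme sequence to its consonant/vowel pattern, e.g. 'CVCC'."""
    return "".join("V" if p in VOWELS else "C" for p in phonemes)


def is_syllable(phonemes):
    """A syllable contains exactly one vowel."""
    return cv_pattern(phonemes).count("V") == 1


def split_onset_rhyme(phonemes):
    """Split a syllable into onset (before the vowel) and rhyme (vowel onward)."""
    pattern = cv_pattern(phonemes)
    if pattern.count("V") != 1:
        raise ValueError("a syllable must contain exactly one vowel")
    vowel_index = pattern.index("V")
    return phonemes[:vowel_index], phonemes[vowel_index:]


syllable = ["d", "a", "s", "t"]  # a CVCC example
print(cv_pattern(syllable), split_onset_rhyme(syllable))
```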

Phone Sequence To Speech
– Concatenative approaches: a trade-off between naturalness, memory usage, and the variety of desired functions.
– Rule-based approaches: the most important rule-based approach is the Klatt method.

Phone Sequence To Speech (Cont’d)
– Text to Phone Sequence: NLP
– Phone Sequence to Primitive Utterance: Speech Processing
– Primitive Utterance to Natural Speech: Speech Processing

Speech Naturalness
Removing undesirable noise, distortion, and discontinuities from the speech
Prosody generation
– Speech energy
– Duration
– Pitch
– Intonation
– Stress

Speech Naturalness (Cont’d)
Intonation and stress contribute strongly to speech naturalness.
– Intonation: the variation of pitch frequency over the course of an utterance
– Stress: an increase of the pitch frequency at a specific time
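
One way to picture these two definitions is as a pitch (F0) contour over frames: intonation is the overall shape, and stress is a local rise. The sketch below is a toy model under our own assumptions: a linearly falling contour, a triangular stress bump, and made-up frequency values. It is not a prosody model from the slides.

```python
def pitch_contour(n_frames, start_hz=130.0, end_hz=100.0,
                  stress_frame=None, stress_boost_hz=20.0, stress_width=3):
    """Return one F0 value per frame.

    Intonation: a straight line from start_hz to end_hz (assumed shape).
    Stress: a triangular pitch rise centered on stress_frame, if given.
    """
    contour = []
    for i in range(n_frames):
        t = i / (n_frames - 1) if n_frames > 1 else 0.0
        f0 = start_hz + (end_hz - start_hz) * t
        if stress_frame is not None and abs(i - stress_frame) <= stress_width:
            f0 += stress_boost_hz * (1 - abs(i - stress_frame) / (stress_width + 1))
        contour.append(f0)
    return contour


plain = pitch_contour(11)
stressed = pitch_contour(11, stress_frame=5)
print([round(f, 1) for f in stressed])
```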

Concatenative Approaches
– In these approaches, units of natural speech are stored and used to reconstruct the desired speech.
– The appropriate phone unit can be selected for synthesis.
– Compressed parameters can be stored instead of the original waveform.

Concatenative Approaches (Cont’d)
Benefits of storing compressed parameters instead of the original waveform:
– Lower memory use
– A general representation instead of one specific stored utterance
– Easier prosody generation

Concatenative Approaches (Cont’d)

Phone Unit    Type of Storing
Paragraph     Main Waveform
Sentence      Main Waveform
Word          Main Waveform
Syllable      Coded/Main Waveform
Diphone       Coded Waveform
Phoneme       Coded Waveform

Concatenative Approaches (Cont’d)
– Pitch Synchronous Overlap-Add (PSOLA) is a well-known method for smoothing the transitions between phonemes.
– Overlap-add is a standard DSP method.
– PSOLA is a basic operation for voice conversion.
– In the analysis stage of this method, frames are selected synchronously with the pitch marks.
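
The analysis and overlap-add steps can be sketched as follows. This is a simplified illustration, not a full PSOLA implementation: it assumes the pitch marks are already known and evenly spaced (a constant pitch period), uses two-period Hann-windowed frames, and the function names are ours. A real system must detect pitch marks, handle a varying period, and repeat or drop frames to preserve duration.

```python
import math


def hann(length):
    """Periodic Hann window; two copies shifted by length/2 sum to 1."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / length) for n in range(length)]


def extract_frames(signal, marks, period):
    """Analysis: cut one two-period, Hann-windowed frame centered on each pitch mark."""
    window = hann(2 * period)
    frames = []
    for m in marks:
        frame = []
        for k in range(2 * period):
            idx = m - period + k
            sample = signal[idx] if 0 <= idx < len(signal) else 0.0
            frame.append(sample * window[k])
        frames.append(frame)
    return frames


def overlap_add(frames, centers, period, out_len):
    """Synthesis: add each frame back into the output, centered on its new position."""
    out = [0.0] * out_len
    for frame, c in zip(frames, centers):
        for k, value in enumerate(frame):
            idx = c - period + k
            if 0 <= idx < out_len:
                out[idx] += value
    return out


period = 40  # assumed constant pitch period, in samples
signal = [math.sin(2 * math.pi * n / period) for n in range(400)]
marks = list(range(period, 400, period))  # assumed pitch marks, one per period

frames = extract_frames(signal, marks, period)

# Same positions: the interior of the signal is reconstructed.
same = overlap_add(frames, marks, period, len(signal))

# Closer spacing: the frames repeat more often, which raises the pitch.
new_centers = [period + 32 * i for i in range(len(marks))]
raised = overlap_add(frames, new_centers, period, len(signal))
```

Placing the frames back at their original marks reproduces the signal between the first and last mark, because the overlapping Hann windows sum to one. Changing the spacing of the synthesis positions is what modifies the pitch.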

Rule-Based Approach Stages
– Determine the speech model and its parameters.
– Determine the type of phone units.
– Determine the parameter values for each phone unit.
– Replace the sequence of phone units with the equivalent parameter sequence.
– Feed the parameter sequence into the speech model.
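
The substitution stage can be sketched as a table lookup that expands each phone unit into a run of parameter frames. The table below is hypothetical: the phone symbols, formant values, and durations are illustrative placeholders, not values from the slides or from any Klatt parameter set.

```python
# Hypothetical per-phone parameters: formant frequencies in Hz, duration in ms.
PHONE_PARAMS = {
    "a": {"F1": 700, "F2": 1200, "F3": 2600, "dur_ms": 100},
    "i": {"F1": 300, "F2": 2300, "F3": 3000, "dur_ms": 80},
}


def phones_to_frames(phones, frame_ms=10):
    """Replace each phone unit with its parameter frames, one frame per frame_ms."""
    frames = []
    for phone in phones:
        params = PHONE_PARAMS[phone]
        n_frames = params["dur_ms"] // frame_ms
        frame = {k: v for k, v in params.items() if k != "dur_ms"}
        # Copy the frame so later smoothing could edit frames independently.
        frames.extend(dict(frame) for _ in range(n_frames))
    return frames


frames = phones_to_frames(["a", "i"])
print(len(frames), frames[0], frames[-1])
```

The resulting frame sequence is what would be fed into the speech model in the last stage. A practical system would also smooth the parameter tracks across phone boundaries instead of switching abruptly.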

KLATT 80 Model

KLATT 88 Model

The KLSYN88 Cascade/Parallel Formant Synthesizer
[Figure: block diagram. Recoverable blocks:]
– Laryngeal sound sources: filtered impulse train, KLGLOTT88 model (default), modified LF model, spectral tilt low-pass resonator, aspiration noise generator
– Frication sound source: frication noise generator
– Cascade vocal tract model (laryngeal sound sources): nasal pole-zero pair, tracheal pole-zero pair, first to fifth formant resonators
– Parallel vocal tract model (frication sound sources): second to sixth formant resonators and a bypass path
– Parallel vocal tract model (laryngeal sound sources, normally not used): first difference preemphasis, first to fourth formant resonators, nasal and tracheal formant resonators
– Parameter labels on the diagram include F0, AV, TL, FL, DI, AH, F1 to F6, B1 to B6, DF1, DB1, FNP, FNZ, BNP, BNZ, FTP, FTZ, BTP, BTZ, A2F to A6F, B2F to B6F, AB, AF, A1V to A4V, ANV, ATV
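
The formant resonators named in the diagram are second-order digital filters. The sketch below uses the resonator difference equation from Klatt's 1980 synthesizer description, y[n] = A x[n] + B y[n-1] + C y[n-2], and chains a few of them in cascade. It is a minimal illustration, not the KLSYN88 synthesizer: the source is a bare impulse train, and the formant frequencies, bandwidths, and sample rate are illustrative assumptions.

```python
import math


def resonator_coeffs(freq_hz, bw_hz, sample_rate):
    """Coefficients of a second-order resonator with unity gain at 0 Hz."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c


def resonate(x, freq_hz, bw_hz, sample_rate):
    """Filter x with y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, sample_rate)
    y1 = y2 = 0.0
    out = []
    for sample in x:
        y = a * sample + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def cascade(x, formants, sample_rate):
    """Pass the signal through one resonator per (frequency, bandwidth) pair."""
    for freq_hz, bw_hz in formants:
        x = resonate(x, freq_hz, bw_hz, sample_rate)
    return x


SAMPLE_RATE = 10000  # assumed sample rate in Hz
F0 = 100             # assumed pitch in Hz

# Impulse train standing in for the voicing source.
source_period = SAMPLE_RATE // F0
source = [1.0 if n % source_period == 0 else 0.0 for n in range(2000)]

# Illustrative (frequency, bandwidth) pairs in Hz for three formants.
formants = [(700, 130), (1220, 70), (2600, 160)]
speech_like = cascade(source, formants, SAMPLE_RATE)
```

In the cascade configuration only the frequency and bandwidth of each formant need to be specified. The parallel configuration additionally needs one amplitude control per formant, which matches the extra amplitude labels on the parallel branches of the diagram.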

Three Voicing Source Models in KLATT 88
– The old KLSYN impulsive source
– The KLGLOTT88 model
– The modified LF model