Fundamentals of Speech Signal Processing

1.0 Speech Signals

Waveform plots of typical vowel sounds: voiced
[Figure: waveforms versus time t for tone 2, tone 1 and tone 4]

Speech Production and Source Model
• Human vocal mechanism
• Speech Source Model: excitation u(t) → vocal tract → speech signal x(t)

Voiced and Unvoiced Speech
[Figure: excitation u(t) and speech signal x(t) for voiced speech (with the pitch period marked) and for unvoiced speech]

Waveform plots of typical consonant sounds
[Figure: waveforms of unvoiced and voiced consonants]

Waveform plot of a sentence

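Frequency domain spectra of speech signals
[Figure: spectra of voiced and unvoiced speech]

Frequency Domain: Voiced
[Figure: voiced spectrum with the formant frequencies marked]

Frequency Domain: Unvoiced
[Figure: unvoiced spectrum with the formant frequencies marked]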

Spectrogram

Formant Frequencies

Formant frequency contours
[Figure: formant frequency contours for the sentence "He will allow a rare lie."]
Reference: Section 6.1 of Huang, or Sections 2.2 and 2.3 of Rabiner and Juang

2.0 Speech Signal Processing

Speech Signal Processing
x(t) → LPF → x[n] → Processing Algorithms → output
• Major Application Areas
1. Speech Coding: Digitization and Compression
   x[n] → Processing → x_k (110101…) → storage/transmission → Inverse Processing → $\hat{x}[n]$
   Considerations: 1) bit rate (bps) 2) recovered quality 3) computation complexity/feasibility
2. Voice-based Network Access: User Interface, Content Analysis, User-Content Interaction
• Speech Signals
– Carrying Linguistic Knowledge and Human Information: Characters, Words, Phrases, Sentences, Concepts, etc.
– Double Levels of Information: Acoustic Signal Level/Symbolic or Linguistic Level
– Processing and Interaction of the Double-level Information

Sampling of Signals
[Figure: continuous-time signal x(t) versus t and its samples x[n] versus n]

Double Levels of Information
[Figure: character (字), word (詞) and sentence (句) levels, with the example sentence 人人用電腦, "everyone uses computers"]

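Speech Signal Processing: Processing of Double-Level Information
• Speech Signal
• Sampling
• Processing
[Figure: algorithms running on chips or computers turn the signal into candidate characters (今, 天, 常, 的, 氣, 非, 好)]
• Linguistic Structure
• Linguistic Knowledge
[Figure: lexicon and grammar resolve the candidates into the sentence 今天的天氣非常好, "the weather is very good today"]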

Voice-based Network Access
[Figure: User Interface, Content Analysis and User-Content Interaction between the user and the Internet]
• User Interface: when keyboards/mice are inadequate
• Content Analysis: helps in browsing/retrieval of multimedia content
• User-Content Interaction: all text-based interaction can be accomplished by spoken language

User Interface: Wireless Communications Technologies are Creating a Whole Variety of User Terminals
[Figure: user terminals reaching Text Content and Multimedia Content through the Internet and other networks]
• At Any Time, from Anywhere
• Smart phones, Hand-held Devices, Notebooks, Vehicular Electronics, Hands-free Interfaces, Home Appliances, Wearable Devices…
• Small in Size, Light in Weight, Ubiquitous, Invisible…
Post-PC Era
Keyboard/Mouse, Most Convenient for PCs, are not Convenient any longer: human fingers never shrink, and the application environment has changed
• Service Requirements Growing Exponentially
• Voice is the Only Interface Convenient for ALL User Terminals at Any Time, from Anywhere, and to the point in one utterance
• Speech Processing is the only less mature part of the Technology Chain

Content Analysis: Multimedia Technologies are Creating a New World of Multimedia Content
[Figure: Future Integrated Networks]
– Real-time Information: weather, traffic, flight schedule, stock price, sports scores
– Knowledge Archives: digital libraries, virtual museums
– Special Services: Google, Facebook, YouTube, Amazon
– Intelligent Working Environment: e-mail processors, intelligent agents, teleconferencing, distance learning, electronic commerce
– Private Services: personal notebook, business databases, home appliances, network entertainments
• Most Attractive Form of the Network Content will be in Multimedia, which usually Includes Speech Information (but Probably not Text)
• Multimedia Content is Difficult to Summarize and Show on the Screen, thus Difficult to Browse
• The Speech Information, if Included, usually Tells the Subjects, Topics and Concepts of the Multimedia Content, thus Becomes the Key for Browsing and Retrieval
• Multimedia Content Analysis based on Speech Information

User-Content Interaction: Wireless and Multimedia Technologies are Creating An Era of Network Access by Spoken Language Processing
[Figure: voice input/output, Text-to-Speech Synthesis, Spoken and Multi-modal Dialogue, Voice-based Information Retrieval, Text Information Retrieval and Multimedia Content Analysis linking voice information and text information to the Text Content and Multimedia Content on the Internet]
• Network Access is Primarily Text-based today, but almost all Roles of Texts can be Accomplished by Speech
• User-Content Interaction can be Accomplished by Spoken and Multi-modal Dialogues
• Hand-held Devices with Multimedia Functionalities are Commonly used Today
• Using Speech Instructions to Access Multimedia Content whose Key Concepts are Specified by Speech Information

3.0 Speech Coding

Waveform-based Approaches
• Pulse-Coded Modulation (PCM)
– binary representation for each sample x[n] by quantization
• Differential PCM (DPCM)
– encoding the differences
  $d[n] = x[n] - x[n-1]$
  $d[n] = x[n] - \sum_{k=1}^{P} a_k x[n-k]$
• Adaptive DPCM (ADPCM)
– with adaptive algorithms
Ref: Haykin, "Communication Systems", 4th Ed., Sections 3.7, 3.13, 3.14, 3.15
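A minimal Python sketch of the first-order DPCM idea above. The uniform quantizer, the step size and the function names are illustrative assumptions, not taken from the slides. The encoder here also predicts from the decoder's reconstructed previous sample rather than from the true x[n-1], a common design choice (assumed here) that keeps quantization error from accumulating.

```python
def quantize(value, step):
    """Uniform quantizer (assumed): round to the nearest multiple of `step`."""
    return step * round(value / step)


def dpcm_encode(x, step):
    """Quantize d[n] = x[n] - x_rec[n-1], where x_rec is the decoder's reconstruction."""
    codes = []
    prev = 0.0  # reconstructed previous sample, starts at zero
    for sample in x:
        d = quantize(sample - prev, step)
        codes.append(d)
        prev += d  # track exactly what the decoder will rebuild
    return codes


def dpcm_decode(codes):
    """Rebuild the signal by accumulating the quantized differences."""
    out = []
    prev = 0.0
    for d in codes:
        prev += d
        out.append(prev)
    return out


signal = [0.0, 0.4, 0.9, 1.1, 0.8]
codes = dpcm_encode(signal, step=0.25)
reconstructed = dpcm_decode(codes)
```

With this closed-loop arrangement each reconstructed sample differs from the input by at most half a quantizer step.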

Speech Source Model and Source Coding
• Speech Source Model
[Figure: Excitation Generator (excitation parameters) → u[n], U(ω), U(z) → Vocal Tract Model G(ω), G(z), g[n] (vocal tract parameters) → x[n]]
$x[n] = u[n] * g[n]$
$X(\omega) = U(\omega)\,G(\omega)$
$X(z) = U(z)\,G(z)$
– digitization and transmission of the parameters will be adequate
– at the receiver the parameters can produce x[n] with the model
– far fewer parameters with much slower variation in time mean far fewer bits are required
– the key for low bit rate speech coding

Speech Source Model
[Figure: x(t) versus t and a[n] versus n]

Speech Source Model and Source Coding
• Analysis and Synthesis
– High computation requirements are the price for low bit rate

Simplified Speech Source Model
[Figure: a periodic pulse train generator (voiced, period N) or a random sequence generator (unvoiced), selected by v/u and scaled by the gain G, produces the excitation u[n], which drives the Vocal Tract Model G(z), G(ω), g[n] to produce x[n]]
$G(z) = \dfrac{1}{1 - \sum_{k=1}^{P} a_k z^{-k}}$
– Excitation parameters
  v/u: voiced/unvoiced
  N: pitch for voiced
  G: signal gain
  → excitation signal u[n]
– Vocal Tract parameters
  {a_k}: LPC coefficients
  → formant structure of speech signals
– A good approximation, though not precise enough
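The simplified model can be sketched in a few lines of Python. The function names, the Gaussian noise source and the example coefficients are assumptions made for illustration. The recursion x[n] = u[n] + Σ a_k x[n-k] is the time-domain form of the all-pole G(z) = 1/(1 − Σ a_k z^{-k}).

```python
import random


def excitation(voiced, n_samples, pitch_period, gain, seed=0):
    """u[n]: pulse train with period N if voiced, else a random sequence, scaled by G."""
    if voiced:
        return [gain if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    return [gain * rng.gauss(0.0, 1.0) for _ in range(n_samples)]


def all_pole_filter(u, a):
    """Vocal tract model: x[n] = u[n] + sum_{k=1..P} a[k-1] * x[n-k]."""
    x = []
    for n, un in enumerate(u):
        acc = un
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * x[n - k]
        x.append(acc)
    return x


# Hypothetical parameters: pitch period N = 40 samples, a stable 2nd-order filter.
a = [1.3, -0.8]
voiced_frame = all_pole_filter(excitation(True, 160, 40, 1.0), a)
unvoiced_frame = all_pole_filter(excitation(False, 160, 40, 0.1), a)
```

Only v/u, N, G and {a_k} are needed to regenerate a frame, which is why transmitting the parameters instead of the waveform lowers the bit rate.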

LPC Vocoder (Voice Coder)
– N by pitch detection
– v/u by voicing detection
– {a_k} can be non-uniform or vector quantized to reduce bit rate further
Ref: Sections 3.3.1 to 3.3.9 of Rabiner and Juang, "Fundamentals of Speech Recognition", Prentice Hall, 1993
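The slide does not say how {a_k} and N are estimated. One common textbook choice, assumed here, is the autocorrelation method solved with the Levinson-Durbin recursion, plus a simple autocorrelation peak picker for the pitch period. All function names below are illustrative.

```python
def autocorrelation(x, max_lag):
    """r[lag] = sum_n x[n] * x[n - lag] for lag = 0..max_lag."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve sum_k a_k r[|i-k|] = r[i] for a_1..a_P; returns (a, prediction error)."""
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient of stage i
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err


def estimate_pitch_period(x, min_lag, max_lag):
    """Pick the lag in [min_lag, max_lag] with the largest autocorrelation."""
    r = autocorrelation(x, max_lag)
    return max(range(min_lag, max_lag + 1), key=lambda lag: r[lag])


# Autocorrelation of a first-order process with a_1 = 0.5 (r[k] = 0.5 ** k).
coeffs, err = levinson_durbin([1.0, 0.5, 0.25], order=2)
```

A real vocoder would window each frame and add a voicing decision; both are omitted from this sketch.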

Multipulse LPC
• poor modeling of u[n] is the main source of quality degradation in the LPC vocoder
• u[n] replaced by a sequence of pulses
  $u[n] = \sum_{k} b_k\,\delta[n - n_k]$
– roughly 8 pulses per pitch period
– u[n] close to periodic for voiced
– u[n] close to random for unvoiced
• Estimating (b_k, n_k) is a difficult problem

Multipulse LPC
• Estimating (b_k, n_k) is a difficult problem
– analysis by synthesis
– large amount of computation is the price paid for better speech quality

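Multipulse LPC
• Perceptual Weighting
  $W(z) = \dfrac{1 - \sum_{k=1}^{P} a_k z^{-k}}{1 - \sum_{k=1}^{P} a_k c^{k} z^{-k}}, \quad 0 < c < 1$
  for perceptual sensitivity
  $W(z) = 1$, if $c = 1$
  $W(z) = 1 - \sum_{k=1}^{P} a_k z^{-k}$, if $c = 0$
  practically $c \approx 0.8$
• Error Evaluation
  $E = \int \left| X(\omega) - \hat{X}(\omega) \right|^{2} W(\omega)\, d\omega$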
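As a small illustration of the weighting filter, the sketch below builds the numerator and denominator coefficients of W(z) from a set of LPC coefficients. The function name and the example coefficients are assumptions. Coefficients are listed in increasing powers of z^{-1}.

```python
def weighting_filter(a, c=0.8):
    """Return (numerator, denominator) of W(z) as coefficient lists in powers of z^-1."""
    numerator = [1.0] + [-ak for ak in a]
    denominator = [1.0] + [-ak * c ** k for k, ak in enumerate(a, start=1)]
    return numerator, denominator


# Hypothetical 2nd-order LPC coefficients a_1 = 0.5, a_2 = -0.25.
num, den = weighting_filter([0.5, -0.25], c=0.8)
```

Setting c = 1 makes the two lists identical, so W(z) = 1. Setting c = 0 leaves only the leading 1 in the denominator, so W(z) reduces to 1 − Σ a_k z^{-k}, matching the two special cases on the slide.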

Multipulse LPC
• Error Minimization and Pulse Search
  $u[n] = \sum_{k} b_k\,\delta[n - n_k]$
  $\hat{x}[n] = \sum_{k} b_k\, g[n - n_k]$
  $E = E(b_1, n_1, b_2, n_2, \ldots)$
– sub-optimal solution: finding 1 pulse at a time
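The one-pulse-at-a-time search can be sketched as a greedy loop. This is a simplified illustration: it minimizes the plain squared error in the time domain (that is, it assumes W = 1 and omits perceptual weighting), and the function names are hypothetical.

```python
def shifted_response(g, position, length):
    """g[n - position] for n = 0..length-1, zero outside the support of g."""
    return [g[n - position] if 0 <= n - position < len(g) else 0.0
            for n in range(length)]


def multipulse_search(x, g, num_pulses):
    """Greedy analysis-by-synthesis: add the pulse that most reduces the error each time."""
    residual = list(x)
    pulses = []
    for _ in range(num_pulses):
        best = None
        for pos in range(len(x)):
            h = shifted_response(g, pos, len(x))
            energy = sum(v * v for v in h)
            if energy == 0.0:
                continue
            corr = sum(r * v for r, v in zip(residual, h))
            reduction = corr * corr / energy  # error drop with the optimal amplitude
            if best is None or reduction > best[0]:
                best = (reduction, pos, corr / energy, h)
        _, pos, amp, h = best
        pulses.append((amp, pos))  # (b_k, n_k)
        residual = [r - amp * v for r, v in zip(residual, h)]
    return pulses, sum(r * r for r in residual)


# Target built from two pulses: b = 2.0 at n = 1 and b = -1.0 at n = 5.
g = [1.0, 0.5, 0.25]
x = [0.0, 2.0, 1.0, 0.5, 0.0, -1.0, -0.5, -0.25]
pulses, error = multipulse_search(x, g, num_pulses=2)
```

Each pass fixes the earlier pulses and searches every position for the next one, which is why the result is sub-optimal in general but far cheaper than a joint search over all (b_k, n_k).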

Code-Excited Linear Prediction (CELP)

4.0 Speech Recognition and Voice-based Network Access

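Speech Recognition as a pattern recognition problem
[Figure: the unknown speech signal x(t) goes through Feature Extraction to give the feature vector sequence X; the training speech y(t) goes through Feature Extraction to give Y, which forms the Reference Patterns; Pattern Matching compares X with the Reference Patterns, and Decision Making produces the output word W]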

Basic Approach for Large Vocabulary Speech Recognition
• A Simplified Block Diagram
[Figure: Input Speech → Front-end Signal Processing → Feature Vectors → Linguistic Decoding and Search Algorithm → Output Sentence. Speech Corpora → Acoustic Model Training → Acoustic Models. Text Corpora → Language Model Construction → Language Model. The decoder uses the Acoustic Models, the Lexicon and the Language Model.]
• Example Input Sentence: this is speech
• Acoustic Models: (th-ih-s-ih-z-s-p-iy-ch)
• Lexicon: (th-ih-s) → this, (ih-z) → is, (s-p-iy-ch) → speech
• Language Model: (this) – (is) – (speech)
  P(this) P(is | this) P(speech | this is)
  P(w_i | w_{i-1}): bi-gram language model
  P(w_i | w_{i-1}, w_{i-2}): tri-gram language model, etc.
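The role of the lexicon in the example can be illustrated with a toy lookup that segments a phone string into words. This is only a sketch: a real decoder searches over acoustic and language model scores at the same time, and the `segment` function is a hypothetical helper, not part of any toolkit.

```python
LEXICON = {
    ("th", "ih", "s"): "this",
    ("ih", "z"): "is",
    ("s", "p", "iy", "ch"): "speech",
}


def segment(phones, lexicon):
    """Return every word sequence whose pronunciations concatenate to `phones`."""
    if not phones:
        return [[]]
    results = []
    for pronunciation, word in lexicon.items():
        n = len(pronunciation)
        if tuple(phones[:n]) == pronunciation:
            for rest in segment(phones[n:], lexicon):
                results.append([word] + rest)
    return results


phones = "th-ih-s-ih-z-s-p-iy-ch".split("-")
hypotheses = segment(phones, LEXICON)
```

When several segmentations are possible, the language model probabilities such as P(is | this) are what rank the competing word sequences.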

Observation Sequences

State Transition Probabilities, 1-dim Gaussian Mixtures

Simplified HMM RGBGGBBGRRR……

Peripheral Processing for Human Perception

Mel-scale Filter Bank

N-gram
[Figure: word sequence W_1 W_2 W_3 W_4 W_5 W_6 ... W_R, shown again with tri-gram windows over consecutive words]

Example numbers for successively longer word sequences:
… this … : 50000
… this is … : 500
… this is a … : 5
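Assuming the three numbers above are corpus counts of the word sequences "this", "this is" and "this is a" (the slide does not label them), n-gram probabilities can be estimated as ratios of counts. The function name is illustrative, and no smoothing is applied.

```python
COUNTS = {
    ("this",): 50000,
    ("this", "is"): 500,
    ("this", "is", "a"): 5,
}


def conditional_probability(counts, history, word):
    """Maximum-likelihood estimate P(word | history) = count(history + word) / count(history)."""
    history = tuple(history)
    return counts[history + (word,)] / counts[history]


p_is_given_this = conditional_probability(COUNTS, ["this"], "is")        # bi-gram
p_a_given_this_is = conditional_probability(COUNTS, ["this", "is"], "a")  # tri-gram
```

The counts shrink quickly as the sequence gets longer, so higher-order n-grams need far more text to estimate reliably.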

Text-to-speech Synthesis
• Transforming any input text into corresponding speech signals
• E-mail/Web page reading
• Prosodic modeling
• Basic voice units/rule-based, non-uniform units/corpus-based, model-based
[Figure: Input Text → Text Analysis and Letter-to-sound Conversion (Lexicon and Rules) → Prosody Generation (Prosodic Model) → Signal Processing and Concatenation (Voice Unit Database) → Output Speech Signal]

Text-to-Speech Synthesis: Three Major Steps
text analysis → prosody generation → speech synthesis
Input text: 假如今天胡適先生還活著，又會怎樣呢？ ("If Mr. Hu Shih were still alive today, what would things be like?")
• text analysis
  假如|今天|胡適|先生|還|活|著|，又|會|怎樣|呢|？
  Cbb, Nd, Nb, Na, Dfa, VH, Di, ，D, D, D, T, ？
• prosody generation
  5 1 3 1 2 1 1 4 1 2 1 5
  假如|今天胡適|先生還|活|著|，又|會怎樣|呢|？
  Cbb, Nd, Nb, Na, Dfa, VH, Di, ，D, D, D, T, ？
  Tone, Intonation, Energy, Pause, Duration
• speech synthesis

Automatic Prosodic Analysis for an Arbitrary Text Sentence
break indices: 5 1 3 1 2 1 1 4 1 2 1 5
假如|今天胡適|先生還|活|著|，又|會怎樣|呢|？
Cbb, Nd, Nb, Na, Dfa, VH, Di, ，D, D, D, T, ？
• Prosodic units: words, minor phrases, major phrases, breath groups, prosodic groups
• Predict B1, B2, B3 (B4, B5 determined by punctuation marks)
• minor phrase patterns: (Cbb+Nd, Nb+Na, Dfa+VH+Di, D+D, D+T, etc.)

Speech Understanding
• Understanding Speaker's Intention rather than Transcribing into Word Strings
• Limited Domains/Finite Tasks
[Figure: input utterance → Syllable Recognition (acoustic models) → syllable lattice → Key Phrase Matching (phrase lexicon) → phrase graph → Semantic Decoding (concept set, phrase/concept language model) → concept graph → understanding results]
• An Example
  utterance: 請幫我查一下台灣銀行的電話號碼是幾號? ("Please look up the phone number of the Bank of Taiwan for me.")
  key phrases: (查一下) - (台灣銀行) - (電話號碼), i.e. (look up) - (Bank of Taiwan) - (phone number)
  concept: (inquiry) - (target) - (phone number)
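The example can be mimicked with a toy key-phrase matcher. This is a heavy simplification made for illustration: it matches phrases in a text transcription rather than in a syllable lattice, uses no language model, and both the phrase table and the function name are assumptions built from the slide's single example.

```python
PHRASE_CONCEPTS = [
    ("查一下", "inquiry"),
    ("台灣銀行", "target"),
    ("電話號碼", "phone number"),
]


def extract_concepts(utterance, phrase_concepts):
    """Find each known key phrase in the utterance and return (phrase, concept) in spoken order."""
    hits = []
    for phrase, concept in phrase_concepts:
        index = utterance.find(phrase)
        if index >= 0:
            hits.append((index, phrase, concept))
    return [(phrase, concept) for _, phrase, concept in sorted(hits)]


utterance = "請幫我查一下台灣銀行的電話號碼是幾號"
concepts = extract_concepts(utterance, PHRASE_CONCEPTS)
```

The filler words around the key phrases are simply ignored, which reflects the slide's point that the goal is the speaker's intention rather than a full word-by-word transcription.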

Speaker Verification
• Verifying the speaker as claimed
• Applications requiring verification
• Text dependent/independent
• Integrated with other verification schemes
[Figure: input speech → Feature Extraction → Verification (using Speaker Models) → yes/no]

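Voice-based Information Retrieval
• Speech Instructions
• Speech Documents (or Multi-media Documents including Speech Information)
[Figure: a speech instruction or text instruction, e.g. 我想找有關新政府組成的新聞？ ("I want to find news about the formation of the new government"), retrieves text documents d1, d2, d3 or speech documents d1, d2, d3, e.g. 總統當選人陳水扁今天早上… ("President-elect Chen Shui-bian this morning…")]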

Spoken Dialogue Systems
• Almost all human-network interactions can be accomplished by spoken dialogue
• Speech understanding, speech synthesis, dialogue management
• System/user/mixed initiatives
• Reliability/efficiency, dialogue modeling/flow control
• Transaction success rate/average number of dialogue turns
[Figure: Users → Input Speech → Speech Recognition and Understanding → User's Intention → Dialogue Manager (with Discourse Context, Databases and the Internet) → Response to the user → Sentence Generation and Speech Synthesis → Output Speech → Users; the Dialogue Server is reached through Networks]

References
1. "Speech and Language Processing over the Web", IEEE Signal Processing Magazine, May 2008
2. G. Tur, R. De Mori, "Spoken Language Understanding: Systems for Extracting Semantic Information from Speech", John Wiley & Sons, 2011