Mandarin Chinese Speech Recognition


Mandarin Chinese
  • Tonal language (inflection matters!)
      • 1st tone – High, constant pitch (like saying “aaah”)
      • 2nd tone – Rising pitch (“Huh?”)
      • 3rd tone – Low pitch (“ugh”)
      • 4th tone – High pitch with a rapid descent (“No!”)
      • “5th tone” – Neutral, used for de-emphasized syllables
  • Monosyllabic language
      • Each character represents a single base syllable and tone
      • Most words consist of 1, 2, or 4 characters
      • Heavily contextual language

Mandarin Chinese and Speech Processing
  • Acoustic representations of Chinese syllables
  • Structural form
      • (consonant) + vowel + (consonant)

Mandarin Chinese and Speech Processing
  • Phone sets
      • Initial/final phones [1], e.g. shi, ge, zi = (shi + ib), (ge + e), (z + if)
      • Initial phones: unvoiced
          • 1 phone
      • Final phones: voiced (tones 1-5)
          • Can consist of multiple phones

Mandarin Chinese and Speech Processing
  • Strong tonal recognition is crucial to distinguish between homonyms [3] (especially without context)
  • Creating tone models is difficult
      • Discontinuities exist in the F0 contour between voiced and unvoiced regions
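One simple way to work around the F0 discontinuity is to fill the unvoiced frames so that a tone model sees a continuous contour. The sketch below is only an illustration of that idea, assuming unvoiced frames are marked with 0.0 and using plain linear interpolation; the function name `interpolate_f0` and the 0.0 convention are assumptions made here, not details taken from the cited papers.

```python
def interpolate_f0(f0, unvoiced=0.0):
    """Fill unvoiced frames by linear interpolation between voiced neighbours.

    Leading and trailing unvoiced frames copy the nearest voiced value.
    """
    voiced = [i for i, v in enumerate(f0) if v != unvoiced]
    if not voiced:
        return list(f0)  # no voiced frame to anchor on

    out = list(f0)
    # Extend the first and last voiced values out to the edges.
    for i in range(voiced[0]):
        out[i] = f0[voiced[0]]
    for i in range(voiced[-1] + 1, len(f0)):
        out[i] = f0[voiced[-1]]
    # Interpolate inside every gap between consecutive voiced frames.
    for left, right in zip(voiced, voiced[1:]):
        span = right - left
        for i in range(left + 1, right):
            w = (i - left) / span
            out[i] = (1 - w) * f0[left] + w * f0[right]
    return out


contour = interpolate_f0([0.0, 100.0, 0.0, 0.0, 130.0, 0.0])
```

Interpolation hides the voiced/unvoiced decision from the model, which is the trade-off that approaches such as the multi-space distribution try to avoid.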

Prosody
  • Prosody: “the rhythmic and intonational aspect of language” [2]
  • Embedded Tone Modeling [4]
  • Explicit Tone Modeling [4]

Tone Modeling
  • Embedded Tone Modeling
      • Tonal acoustic units are joined with spectral features at each frame [4]
  • Explicit Tone Modeling
      • Tone recognition is completed independently and combined after post-processing [4]

Tone Modeling
  • Pitch, energy, and duration (prosody) combined with lexical and syntactic features improve tonal labeling
  • Coarticulation
      • Neighboring syllables can change the tone that is actually spoken:
          • Bu4 + Dui4 = Bu2 Dui4 (“wrong”)
          • Ni3 + Hao3 = Ni2 Hao3 (“hello”)
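The two tone changes above can be written as small rewrite rules. The following is a deliberately simplified sketch, assuming syllables are given as (pinyin, tone) pairs and covering only the two patterns shown on the slide (bu4 before a 4th tone, and a 3rd tone before another 3rd tone). The name `apply_sandhi` is hypothetical, and real tone sandhi also depends on word and phrase grouping, which is ignored here.

```python
def apply_sandhi(syllables):
    """Return a copy of `syllables` with two simplified sandhi rules applied.

    syllables: list of (pinyin, tone) tuples with lexical (dictionary) tones.
    """
    out = list(syllables)
    for i in range(len(syllables) - 1):
        base, tone = syllables[i]
        _, next_tone = syllables[i + 1]  # lexical tone of the next syllable
        if base == "bu" and tone == 4 and next_tone == 4:
            out[i] = (base, 2)  # bu4 + 4th tone -> bu2
        elif tone == 3 and next_tone == 3:
            out[i] = (base, 2)  # 3rd + 3rd -> 2nd + 3rd
    return out


wrong = apply_sandhi([("bu", 4), ("dui", 4)])
hello = apply_sandhi([("ni", 3), ("hao", 3)])
```

A tone model trained only on lexical tones would label these syllables incorrectly, which is why context features matter for tone recognition.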

Embedded Tone Modeling: Two Stream Modeling (Ni, Liu, Xu)
  • Spectral Stream – MFCCs (Mel-frequency cepstral coefficients)
      • Describes vocal tract information
      • Distinctive for phones (short time duration)
  • Pitch/Tone Stream – requires smoothing
      • Describes vibrations of the vocal cords
      • Independent of spectral features
      • d/dt(pitch), a.k.a. tone, and d²/dt²(pitch) are added
      • Embedded in an entire syllable
      • Affected by coarticulation (requires a longer time window), i.e. tone sandhi (context dependency)
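As a rough illustration of the pitch stream, the sketch below builds a per-frame vector of pitch, its first derivative, and its second derivative, and appends it to each spectral frame. It assumes a smoothed F0 track with one value per frame and uses simple finite differences; production front ends typically use regression-window deltas instead. The helper names `deltas`, `pitch_stream`, and `embed` are hypothetical.

```python
def deltas(x):
    """Finite differences: central inside the sequence, one-sided at the edges."""
    n = len(x)
    if n < 2:
        return [0.0] * n
    d = [0.0] * n
    d[0] = x[1] - x[0]
    d[-1] = x[-1] - x[-2]
    for t in range(1, n - 1):
        d[t] = (x[t + 1] - x[t - 1]) / 2.0
    return d


def pitch_stream(f0):
    """Per-frame [pitch, d/dt(pitch), d2/dt2(pitch)]."""
    d1 = deltas(f0)
    d2 = deltas(d1)
    return [[p, a, b] for p, a, b in zip(f0, d1, d2)]


def embed(mfcc_frames, f0):
    """Append the pitch stream to each spectral (MFCC) frame."""
    return [list(m) + p for m, p in zip(mfcc_frames, pitch_stream(f0))]


frames = embed([[1.0, 2.0]] * 4, [100.0, 110.0, 120.0, 130.0])
```

Here each two-dimensional stand-in MFCC frame grows to five dimensions; with a steadily rising pitch the first derivative is constant and the second is zero.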

Embedded Tone Modeling: Two Stream Modeling [4]
  • Tonal identification features
      • F0
      • Energy
      • Duration
      • Coarticulation (continuous speech)
  • Initially use a two-stream embedded model, followed by explicit modeling during lattice rescoring (alignment?)
      • Explicit tone modeling uses a maximum entropy framework [4] (discriminative model)

Explicit Tone Modeling [4]

  Feature Description                                                                        # of Features
  Duration of current, previous, and following syllables                                     3
  Previous syllable is or is not sp                                                          1
  Statistical parameters of pitch and log-energy of current syllable (e.g. max, min, mean)   10
  Normalized max and mean of pitch and energy in each syllable in the context window         12
  Location of current syllable within word                                                   1
  Slope and intercept of F0 contour of current syllable, its delta, and delta-delta          6
  Tones of preceding and following syllables                                                 2
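A maximum entropy tone classifier is a log-linear model: each tone gets a weighted sum of the syllable's features, and the scores are normalized into a posterior. The sketch below shows only that scoring step, assuming the features have already been packed into a numeric vector and that one weight vector per tone has been trained elsewhere. The function name and the toy weights are hypothetical and are not values from [4].

```python
import math


def maxent_posterior(features, weights):
    """P(tone | features) under a log-linear (maximum entropy) model.

    features: list of floats.
    weights:  dict mapping tone -> list of floats of the same length.
    """
    scores = {tone: sum(w * f for w, f in zip(ws, features))
              for tone, ws in weights.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {tone: math.exp(s - top) for tone, s in scores.items()}
    z = sum(exps.values())
    return {tone: e / z for tone, e in exps.items()}


# Toy example: two features, two candidate tones (weights are made up).
posterior = maxent_posterior([2.0, 0.0], {1: [1.0, 0.0], 4: [0.0, 1.0]})
```

During lattice rescoring, posteriors of this kind can be combined with the acoustic and language model scores of each hypothesis.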

Other Work: Chang, Zhou, Di, Huang, & Lee [1]
  • 3 methods
      • Powerful language model (no tone modeling)
          • CER = 7.32%
      • Embedded 2 stream
          • Tone stream + feature stream
          • CER = 6.43%
      • Embedded 1 stream
          • Developed pitch extractor
          • Pitch track added to feature vector
          • CER = 6.03%
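CER here is the character error rate: the minimum number of character substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the reference length. A minimal sketch of that computation follows; the function name `cer` is chosen for illustration and a non-empty reference is assumed.

```python
def cer(reference, hypothesis):
    """Character error rate = edit distance / number of reference characters."""
    prev = list(range(len(hypothesis) + 1))  # distances for an empty reference
    for i, r in enumerate(reference, 1):
        cur = [i]
        for j, h in enumerate(hypothesis, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1] / len(reference)


rate = cer("你好吗", "你号吗")  # one substituted character out of three
```

Because each Chinese character maps to one syllable, CER is the usual metric for Mandarin, where word boundaries are ambiguous.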

Other Work: Qian, Soong [3]
  • F0 contour smoothing
  • Multi-Space Distribution (MSD)
      • Models 2 probability spaces
          • Unvoiced: discrete
          • Voiced (F0 contour): continuous
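The idea of two probability spaces can be illustrated with a single state's observation likelihood: an unvoiced frame is scored with a discrete probability, while a voiced frame is scored with a space weight times a continuous density over F0. The sketch below assumes one Gaussian for the voiced space and represents unvoiced frames as `None`; these choices, and the name `msd_likelihood`, are simplifications made here rather than the configuration used in [3].

```python
import math


def msd_likelihood(obs, w_voiced, mean, var):
    """Likelihood of one frame under a two-space distribution.

    obs: None for an unvoiced frame, otherwise an F0 (or log-F0) value.
    w_voiced: prior weight of the voiced space; unvoiced gets 1 - w_voiced.
    """
    if obs is None:
        return 1.0 - w_voiced  # discrete, zero-dimensional space
    gauss = math.exp(-(obs - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
    return w_voiced * gauss   # continuous, one-dimensional space


p_unvoiced = msd_likelihood(None, 0.7, 5.0, 0.04)
p_voiced = msd_likelihood(5.0, 0.7, 5.0, 0.04)
```

With this formulation no artificial F0 values need to be invented for unvoiced frames.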

Other Work: Lamel, Gauvain, Le, Oparin, Meng [6]
  • Multi-layer perceptron features
      • Combined with MFCCs and pitch features
  • Compare language models
      • N-gram: back-off language model
      • Neural network language model
      • Adaptation

Other Work: O. Kalinli [7]
  • Replace prosodic features with biologically inspired auditory attention cues
      • Cochlear filtering, inner hair cell, etc.
  • Other features are extracted from the auditory spectrum
      • Intensity
      • Frequency contrast
      • Temporal contrast
      • Orientation (phase)

Other Work: Qian, Xu, Soong [8]
  • Cross-lingual voice transformation
      • Phonetic mapping between languages
      • Difficult for Mandarin and English
          • Very different prosodic features

References
[1] Eric Chang, Jianlai Zhou, Shuo Di, Chao Huang, & Kai-Fu Lee, “Large Vocabulary Mandarin Speech Recognition with Different Approaches in Modeling Tones”
[2] Merriam-Webster Dictionary, http://www.merriam-webster.com/
[3] Yao Qian & Frank Soong, “A Multispace Distribution (MSD) and Two Stream Tone Modeling Approach to Mandarin Speech Recognition”, Science Direct, 2009
[4] Chongjia Ni, Wenju Liu, & Bo Xu, “Improved Large Vocabulary Mandarin Speech Recognition Using Prosodic and Lexical Information in Maximum Entropy Framework”
[5] Yi Liu & Pascale Fung, “Pronunciation Modeling for Spontaneous Mandarin Speech Recognition”, International Journal of Speech Technology, 2004
[6] Lori Lamel, J. L. Gauvain, V. B. Le, I. Oparin, S. Meng, “Improved Models for Mandarin Speech to Text Transcription”, ICASSP, 2011
[7] O. Kalinli, “Tone and Pitch Accent Classification Using Auditory Attention Cues”, ICASSP, 2011