University of Illinois Acoustic Modeling for Multi-Language, Multi-Style, Multi-Channel Automatic Speech Recognition
University of Illinois Acoustic Modeling for Multi-Language, Multi-Style, Multi-Channel Automatic Speech Recognition
Mark Hasegawa-Johnson
Yuxiao Hu, Dennis Lin, Xiaodan Zhuang, Jui-Ting Huang, Xi Zhou, Zhen Li, and Thomas Huang
Including also the research results of Lae-Hoon Kim and Harsh Sharma
Motivation
Applications in a Multilingual Society
- News Hound: find all TV news segments, in any language, mentioning "Barack Obama"
- Language Learner: transcribe a learner's accented speech; tell him which words sound accented
- Broadcaster/Podcaster: automatically transcribe "man on the street" interviews in a multilingual city (LA, Singapore)
Problems
- Physical variability: noise, echo, talker
- Imprecise categories: dependent on context
- Content variability: language, topic, dialect, style
Method: Transform and Infer (ubiquitous methodology in ASR; see, e.g., Jelinek, 1976)
- Signal transforms
- Classifier transforms
- Likelihood vector b_i = p(observation_t | state_t = i)
- Inference algorithm: a parametric model of p(state_1, ..., state_T, label_1, ..., label_T)
- Best label sequence = argmax p(label_1, ..., label_T | observation_1, ..., observation_T)
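A minimal sketch of the "infer" step for the scalar-state case: Viterbi decoding over the per-frame likelihood vector, assuming an HMM with known transition probabilities (the arrays log_b, log_A, log_pi are hypothetical inputs standing in for the transform/classifier stages, not the talk's trained models):

```python
import numpy as np

def viterbi(log_b, log_A, log_pi):
    """Best state sequence argmax_q p(q | observations) under an HMM.

    log_b[t, i] = log p(observation_t | state_t = i)   (the likelihood vector)
    log_A[i, j] = log p(state_{t+1} = j | state_t = i)
    log_pi[i]   = log p(state_1 = i)
    """
    T, N = log_b.shape
    delta = np.full((T, N), -np.inf)     # best log-score ending in state i at time t
    psi = np.zeros((T, N), dtype=int)    # backpointers
    delta[0] = log_pi + log_b[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A     # N x N candidate extensions
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_b[t]
    # Backtrace the best state (label) sequence
    q = np.zeros(T, dtype=int)
    q[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        q[t] = psi[t + 1, q[t + 1]]
    return q
```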
Signal Transforms
Determined by a physical model of the signal
- A good signal model tells you a lot:
  - Reverberation model: y[n] = v[n] + Σ_m h[m] x[n-m]
  - x[n] produced by a human vocal tract, designed for efficient processing by a human auditory system
- A good signal transform improves the accuracy of all classifiers:
  - Denoising: correct for additive noise
  - Dereverberation: correct for convolutional noise
  - Perceptual freq warping: hear what humans hear
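A minimal sketch of the signal model above, with hypothetical placeholder arrays for the clean speech x, room impulse response h, and noise v (not data from the talk):

```python
import numpy as np

def reverberant_noisy(x, h, v):
    """Apply the physical signal model y[n] = v[n] + sum_m h[m] x[n-m]."""
    y = np.convolve(x, h)[: len(x)]   # convolutional (reverberation) distortion
    return y + v[: len(y)]            # additive noise

# Toy usage with random placeholders for x, h, v
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)            # 1 s of "speech" at 16 kHz
h = np.exp(-np.arange(800) / 200.0)       # decaying room impulse response
v = 0.1 * rng.standard_normal(16000)      # additive noise
y = reverberant_noisy(x, h, v)
```

Denoising and dereverberation are the inverse problems: estimate x[n] (or features of it) given only y[n].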
Denoising Example (Kim et al., 2006)
Classifier Transforms
Compute a precise and accurate estimate of p(obs_t | state_t)
Robust machine learning: from a limited amount of training data, learn parameterized probability models that are as precise as possible, with a known upper bound on generalization error
Methods that trade off precision and generalization:
- Decorrelate the signal measurements: PCA, DCT
- Select the most informative features from an inventory: AdaBoost
- Train a linear or nonlinear function z_t = f(y_t) that discriminates among the training examples from different classes and has known upper bounds on generalization error (SVM, ANN)
- Train another nonlinear function p(z_t | state_t) with the same properties
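A minimal sketch of the first method, decorrelating frame features with PCA (the feature matrix Y and the 39-to-13 reduction are hypothetical, not the talk's actual front end):

```python
import numpy as np

def pca_decorrelate(Y, n_components):
    """Project frames Y (num_frames x num_dims) onto the top principal components,
    so the transformed dimensions are (approximately) decorrelated."""
    Y_centered = Y - Y.mean(axis=0)
    cov = np.cov(Y_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :n_components]     # top-variance directions first
    return Y_centered @ top

# Toy usage: 1000 frames of 39-dimensional features reduced to 13 decorrelated dims
Z = pca_decorrelate(np.random.randn(1000, 39), 13)
```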
Inference
Integrate information to choose the best global label set
Labels = variables that matter globally:
- Speech recognition: what words were spoken?
- Information retrieval: which segment best matches the query?
- Language learning: where's the error?
States = variables that can be classified locally:
- May be scalar, e.g., q_t = sub-phoneme
- May be vector, e.g., q_t = [vector of articulatory states]
Inference algorithm = parametric model of p(states, labels):
- Scalar states: hidden Markov model, finite state transducer
- Vector states: dynamic Bayesian network, conditional random field
Example: Language-Independent Phone Recognition (Huang et al., in preparation)
- Voice activity detection
- Perceptual freq warping
- Gaussian mixtures
- Likelihood vector b_i = p(observation_t | state_t = i)
- Inference algorithm: hidden Markov model with token passing, p(state_1, ..., state_T, phone_1, ..., phone_T)
- Best label sequence = argmax p(phone_1, ..., phone_T | observation_1, ..., observation_T)
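A minimal sketch of the Gaussian-mixture likelihood vector b_i for one frame, using diagonal covariances (the parameter arrays are hypothetical placeholders, not the trained models from the talk):

```python
import numpy as np

def log_likelihood_vector(obs, weights, means, variances):
    """log b_i = log p(obs | state = i) under a diagonal-covariance GMM per state.

    obs:       (D,) feature vector for one frame
    weights:   (num_states, M) mixture weights
    means:     (num_states, M, D) component means
    variances: (num_states, M, D) component variances
    """
    diff = obs - means                                                    # (S, M, D)
    log_comp = -0.5 * (np.log(2 * np.pi * variances) + diff ** 2 / variances).sum(axis=-1)
    log_weighted = np.log(weights) + log_comp                             # (S, M)
    # log-sum-exp over the mixture components of each state
    m = log_weighted.max(axis=-1, keepdims=True)
    return (m + np.log(np.exp(log_weighted - m).sum(axis=-1, keepdims=True))).squeeze(-1)
```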
A Language-Independent Phone Set (Consonants)
Plus secondary articulations (glottis, pharynx, palate, lips), sequences, and syllabics
A Language-Independent Phone Set (Vowels)
Training Data
- 10 languages, 11 corpora: Arabic, Croatian, English, Japanese, Mandarin, Portuguese, Russian, Spanish, Turkish, Urdu
- 95 hours of speech, sampled from a larger set of corpora
- Mixed styles of speech: broadcast, read, and spontaneous
Summary of Corpora
Dictionaries (Hasegawa-Johnson and Fleck, http://www.isle.uiuc.edu/dict/)
- Orthographic transcriptions (Urdu: no vowels!!), e.g., ﺻﺎﺣﺐ , ﺻﺎﻋﻖ
- Is a diacriticized version available on the web?
  - No: Ruleset #1 maps consonant letters to phones (ﻕ = q, ک = k, گ = g, ...)
  - Yes: Ruleset #2 also maps diacritics and ligatures to vowels (= A, = ligature, = u, ...)
- Phonetic transcriptions: /sAh{SV}b{SV}/, /sA!iqƏ/
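A minimal sketch of what such a letter-to-phone ruleset looks like in code; the mapping entries below are hypothetical illustrations, not the actual ISLE dictionary rulesets:

```python
# Hypothetical Ruleset #1: consonant letters to phones (illustrative only)
RULESET_1 = {
    "\u0642": "q",   # ق -> q
    "\u06A9": "k",   # ک -> k
    "\u06AF": "g",   # گ -> g
    # ... one entry per letter; vowels must be guessed or left unspecified
}

def orthography_to_phones(word, ruleset):
    """Map each letter to a phone; unknown letters pass through unchanged."""
    return [ruleset.get(ch, ch) for ch in word]

print(orthography_to_phones("\u0642\u06A9", RULESET_1))   # ['q', 'k']
```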
Context-Dependent Phones
Triphones: when is a /t/ not a /t/?
- "writer": /t/ is unusual; call it /aI-t+3r/
- "a tree": /t/ is unusual; call it /&-t+r/
- "that soup": /t/ is unusual; call it /ae-t+s/
Lexical stress:
- /i/ in "reek" is longer than in "recover"; call them /r-i+k'/ vs. /r-i+k/
Punctuation, an easy-to-transcribe proxy for prosody:
- /n/ in "I'm done." is 2x as long as /n/ in "Done yet?"; call them /^-n+{PERIOD}/ vs. /^-n+j/
Language, dialect, style:
- /o/ in "atone": call it /t-o+n%eng/
- /o/ in あとに: call it /t-o+n%jap/
Gender: handled differently (speaker adaptation)
Example labels: ^'-A+b%eng, ^'-A+b'%eng, >-A+d%cmn, ...
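A minimal sketch of constructing such context-dependent labels from a phone string; the helper and its label format are hypothetical, modeled on the slide's examples rather than the talk's exact labeler:

```python
def triphone_labels(phones, language, punct=None):
    """Build left-center+right context labels like /t-o+n%eng/ from a phone list.

    phones:   e.g. ["t", "o", "n"]
    language: e.g. "eng", "jap", "cmn"
    punct:    optional punctuation tag used as the final right context
    """
    labels = []
    for i, center in enumerate(phones):
        left = phones[i - 1] if i > 0 else "^"                  # "^" = utterance boundary
        right = phones[i + 1] if i + 1 < len(phones) else (punct or "^")
        labels.append(f"{left}-{center}+{right}%{language}")
    return labels

print(triphone_labels(["t", "o", "n"], "eng"))
# ['^-t+o%eng', 't-o+n%eng', 'o-n+^%eng']
```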
Decision Tree State Tying
Categories for decision tree questions:
- Distinctive phone features (manner/place of articulation) of right or left context
- Language identity
- Dialect identity (L1 vs. L2)
- Lexical stress
- Punctuation mark
Example triphones: ^'-A+b%eng L2, ^'-A+b'%eng L2, >-A+d%cmn, ...
Each leaf node contains at least 3.5 seconds of training data
Phone Recognition Experiment (Huang et al., in preparation)
- Language-independent triphone bigram language model
- Standard classifier transforms (PLP+d+dd, CDHMM, 11-17 Gaussians)
- Vocabulary size: top 60K most frequent triphones (since 140K is too many!)
- The remaining infrequent triphones are mapped back to their center monophones
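A minimal sketch of that back-off step, keeping the top-K triphones and mapping the rest to their center monophones (hypothetical data and label format; not the talk's exact procedure):

```python
from collections import Counter

def backoff_vocabulary(triphone_sequences, top_k=60000):
    """Keep the top_k most frequent triphones; map all others to the center phone."""
    counts = Counter(t for seq in triphone_sequences for t in seq)
    keep = {t for t, _ in counts.most_common(top_k)}

    def map_unit(triphone):
        if triphone in keep:
            return triphone
        # e.g. "aI-t+3r%eng" -> center monophone "t"
        return triphone.split("-", 1)[1].split("+", 1)[0]

    return [[map_unit(t) for t in seq] for seq in triphone_sequences]
```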
Recognition Results (Huang et al., in preparation)
Test set: 50 sentences per corpus
Example: Language-Independent Speech Information Retrieval (Zhuang et al., in preparation)
- Voice activity detection
- Perceptual freq warping
- Gaussian mixtures
- Likelihood vector b_i = p(observation_t | state_t = i)
- Inference algorithm: finite state transducer built from ASR lattices, computing E(count(query) | observations)
- Retrieval ranking = E(count(query) | segment observations)
Information Retrieval: Standard Methods
Task description: given a query, find the "most relevant" segments in a database
Published algorithms:
- EXACT MATCH: segment = argmin d(query, segment)
- SUMMARY STATISTICS: segment = argmax p(query | segment), with no concept of "word order." Fast; good for text, e.g., Google, Yahoo, etc.
- TRANSFORM AND INFER: segment = argmax p(query | segment) ≈ E(count(query) | segment); word order matters. Flexible, but slow. (A sketch of the expected-count ranking follows.)
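A minimal sketch of ranking segments by expected query count, using a posterior-weighted n-best list as a stand-in for the full lattice transducer (hypothetical data structure; the actual system builds finite state transducers from ASR lattices):

```python
def expected_query_count(query, nbest):
    """E(count(query) | segment) over an n-best list.

    query: tuple of units, e.g. phones or words
    nbest: list of (posterior, hypothesis) pairs, hypothesis = list of units
    """
    expected = 0.0
    for posterior, hyp in nbest:
        occurrences = sum(
            1 for i in range(len(hyp) - len(query) + 1)
            if tuple(hyp[i:i + len(query)]) == tuple(query)
        )
        expected += posterior * occurrences
    return expected

def rank_segments(query, segments):
    """segments: dict segment_id -> n-best list; return ids sorted best-first."""
    return sorted(segments, key=lambda sid: -expected_query_count(query, segments[sid]))
```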
Language-Independent IR: The Star Challenge
- A multi-language, multi-media broadcast news retrieval competition, sponsored by A*STAR
- Elimination rounds, June-August 2008: three rounds, each of 48 hours duration; 56 teams entered from around the world
- 5 teams selected for the Grand Finals: 10/23/2008, Singapore
Star Challenge Tasks
- VT1, VT2: given an image category (e.g., "crowd," "sports," "keyboard"), find examples
- AT1: given an IPA phoneme sequence (example: /ɻogutʃA/), find audio segments
- AT2: given a waveform containing a word or word sequence in any language, find audio segments containing the same word
- AT1+VT2: find a specified video class whose speech contains the IPA sequence (e.g., "man monologue" + /groʊɵ/)
Star Challenge: Simplified Results
- Rounds 1 and 3: 48,000 CPU hours
  - Round 1: English, 20 queries
  - Round 3: English and Mandarin, 3 queries each
- Grand Final: 6 CPU hours; English, Mandarin, Malay, and Tamil, 2 queries each
Open Research Areas
When does "Transform and Infer" help?
- Round 3 (1,000 CPUs, 48 hours): best algorithms were "transform and infer"
- Grand Final (3 CPUs, 2 hours): best algorithms were "exact match"
- Open research area #1: complexity
"Inference algorithm": user constraints → simplified classifier
- Improved transforms and improved classifiers allow the use of a less-constrained user interface
- Open research area #2: accuracy
Existence Proof: ASR Can Beat Human Listeners (Sharma et al., in preparation)
- The task: speech of talkers with gross motor disability (cerebral palsy)
- Familiar listeners in familiar situations understand most of what they say...
- ASR can also be talker-dependent and vocabulary-constrained
Open Research Areas: Remove the Constraints!
- ASR can beat a human listener if the ASR knows more than the human (e.g., knows the talker and the vocabulary)
- Better knowledge = better signal models, better classifiers, better inference
Thank You! Questions?
Decision Tree State Tying (Odell, Woodland & Young, 1994)
1. Divide each IPA phone into three temporally sequential "states": /i/ -> /i/onset, /i/center, /i/offset
2. Start with one model for each state; create a statistical model p(acoustics | state) using training data
3. Ask yes-no questions about context variables (left phone, right phone, lexical stress, language ID):
   - If p(acoustics | state, yes) ≠ p(acoustics | state, no), split the training data into two groups: the "yes" examples vs. the "no" examples
   - If many such questions exist, choose the best
   - Repeat this process as long as each group contains enough training data examples
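A minimal sketch of one split decision in that procedure: choose the context question with the largest log-likelihood gain under single-Gaussian models, subject to a minimum-data constraint (simplified relative to the cited paper; the question predicates and the 350-frame threshold, a stand-in for the 3.5-second rule, are hypothetical):

```python
import numpy as np

def gaussian_loglik(frames):
    """Log-likelihood of frames under a diagonal-covariance Gaussian fit to them."""
    frames = np.asarray(frames)
    var = frames.var(axis=0) + 1e-6
    return -0.5 * len(frames) * (np.log(2 * np.pi * var) + 1.0).sum()

def best_split(frames, contexts, questions, min_frames=350):
    """Pick the yes/no context question with the largest likelihood gain.

    frames:     (N, D) acoustic frames pooled at this tree node
    contexts:   list of N context records (left phone, right phone, stress, language, ...)
    questions:  dict name -> predicate over a context record
    min_frames: do not split if either child gets fewer frames than this
    """
    frames = np.asarray(frames)
    parent_ll = gaussian_loglik(frames)
    best = None
    for name, predicate in questions.items():
        yes = np.array([predicate(c) for c in contexts])
        if yes.sum() < min_frames or (~yes).sum() < min_frames:
            continue
        gain = gaussian_loglik(frames[yes]) + gaussian_loglik(frames[~yes]) - parent_ll
        if best is None or gain > best[1]:
            best = (name, gain)
    return best   # (question name, log-likelihood gain), or None if no valid split
```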