Computer Speech Recognition: Mimicking the Human System

Computer Speech Recognition: Mimicking the Human System
Li Deng, Microsoft Research, Redmond
Feb. 2, 2005, at the IPAM Workshop on Math of Ear and Sound Processing (UCLA)
Collaborators: Dong Yu (MSR), Xiang Li (CMU), A. Acero (MSR)

Speech Recognition --- Introduction
• Converting naturally uttered speech into text and meaning
• Human-machine dialogues (scenario demos)
• Conventional technology --- statistical modeling and estimation (HMM)
• Limitations:
  – noisy acoustic environments
  – rigid speaking style
  – constrained tasks
  – unrealistic demands on training data
  – huge model sizes, etc.
  – far below human speech recognition performance
• Trend: incorporate key aspects of human speech processing mechanisms

Production & Perception: Closed-Loop Chain
[Diagram: the SPEAKER encodes a message via motor/articulators into Speech Acoustics; the LISTENER receives it via the ear/auditory reception and, using an internal model, produces the decoded message, closing the communication loop.]

Encoder: Two-Stage Production Mechanisms
Phonology (higher level):
• Symbolic encoding of the linguistic message
• Discrete representation by phonological features
• Loosely-coupled multiple feature tiers
• Overcomes the beads-on-a-string phone model
• Theories of distinctive features, feature geometry & articulatory phonology
• Accounts for partial/full sound deletion/modification in casual speech

Phonetics (lower level):
• Converts discrete linguistic features to continuous acoustics
• Mediated by motor control & articulatory dynamics
• Mapping from articulatory variables to VT area function to acoustics
• Accounts for co-articulation and reduction (target undershoot), etc.

Encoder: Phonological Modeling
Computational phonology:
• Represents pronunciation variations as a constrained factorial Markov chain (see the toy sketch after this slide)
• Constraints: from articulatory phonology
• Language-universal representation
[Diagram: the phrase "ten themes" /t ɛ n θ i: m z/ represented on loosely coupled feature tiers, e.g. Tongue Tip and Tongue Body tiers with values such as Mid/Front and High/Front.]
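The factorial-chain idea can be made concrete with a toy sketch. This is a minimal illustration, not the authors' implementation: the state inventories, transition matrices, and the simple asynchrony constraint standing in for the articulatory-phonology constraints are all invented for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-tier feature-state inventories.
N_TT, N_TB = 3, 3   # tongue-tip and tongue-body states

# Hypothetical per-tier Markov transition matrices (rows sum to 1).
A_tt = np.array([[0.8, 0.2, 0.0],
                 [0.0, 0.8, 0.2],
                 [0.1, 0.0, 0.9]])
A_tb = np.array([[0.7, 0.3, 0.0],
                 [0.0, 0.7, 0.3],
                 [0.2, 0.0, 0.8]])

def sample_factorial_chain(T, max_async=1):
    """Sample two loosely coupled Markov tiers; the coupling constraint
    (tiers may not drift more than `max_async` states apart) is a toy
    stand-in for the slide's articulatory-phonology constraints."""
    tt, tb = 0, 0
    path = [(tt, tb)]
    for _ in range(T - 1):
        tt_next = rng.choice(N_TT, p=A_tt[tt])
        tb_next = rng.choice(N_TB, p=A_tb[tb])
        # Reject joint moves that violate the asynchrony constraint.
        if abs(tt_next - tb_next) > max_async:
            tt_next, tb_next = tt, tb
        tt, tb = tt_next, tb_next
        path.append((tt, tb))
    return path

print(sample_factorial_chain(10))
```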

Encoder: Phonetic Modeling
Computational phonetics:
• Segmental factorial HMM for sequential targets in the articulatory or vocal-tract-resonance domain
• Switching trajectory model for target-directed articulatory dynamics
• Switching nonlinear state-space model for dynamics in speech acoustics
• Illustration: [figure not recoverable; see the reconstructed equations below]
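The switching state-space formulation named in the bullets appears in Deng's published hidden-dynamic-model work in roughly the following form; the notation here is a reconstruction, not a transcription of the slide:

$$\mathbf{z}_{t+1} = \Phi_s \mathbf{z}_t + (I - \Phi_s)\,\mathbf{u}_s + \mathbf{w}_t, \qquad \mathbf{o}_t = h(\mathbf{z}_t) + \mathbf{v}_t,$$

where $s$ indexes the (switching) phonological state, $\mathbf{u}_s$ is its articulatory/VTR target, $\Phi_s$ is a state-dependent time-constant matrix driving the hidden dynamics $\mathbf{z}_t$ toward that target, $h(\cdot)$ is the nonlinear mapping to acoustic observations $\mathbf{o}_t$, and $\mathbf{w}_t$, $\mathbf{v}_t$ are Gaussian noises.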

Phonetic Encoder: Computation
[Diagram: message → targets → articulation → distortion-free acoustics → distorted acoustics (Speech Acoustics), with distortion factors & feedback to articulation.]

Phonetic Reduction Illustration
[Figure: "yo-yo" (formal) vs. "yo-yo" (casual).]

Decoder I: Auditory Reception
• Converts speech acoustic waves into an efficient & robust auditory representation
• This processing is largely independent of phonological units
• Involves processing stages in the cochlea (ear), cochlear nucleus, SOC, IC, …, all the way to the A1 cortex
• Principal roles:
  1) combat environmental acoustic distortion
  2) detect relevant speech features
  3) provide temporal landmarks to aid decoding
• Key properties:
  1) critical-band frequency scale, logarithmic compression (see the sketch below)
  2) adaptive frequency selectivity, cross-channel correlation
  3) sharp response to transient sounds
  4) modulation in independent frequency bands
  5) binaural noise suppression, etc.
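As one concrete instance of the "critical-band frequency scale, logarithmic compression" property, here is a small sketch using Zwicker's standard Bark-scale approximation; the choice of this particular formula is ours, not the slide's.

```python
import numpy as np

def hz_to_bark(f_hz):
    """Zwicker's approximation of the critical-band (Bark) scale."""
    f = np.asarray(f_hz, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def log_compress(power, eps=1e-10):
    """Logarithmic amplitude compression, as in the auditory periphery."""
    return np.log(np.maximum(power, eps))

# Example: map linear FFT-bin frequencies onto the perceptual Bark axis.
freqs = np.linspace(0, 8000, 9)
print(np.round(hz_to_bark(freqs), 2))
```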

Decoder II: Cognitive Perception
• Cognitive process: recovery of the linguistic message
• Relies on:
  1) the "internal" model: structural knowledge of the encoder (production system)
  2) robust auditory representation of features
  3) temporal landmarks
• The child speech-acquisition process is one that gradually establishes the "internal" model
• Strategy: analysis by synthesis
  – i.e., probabilistic inference on (deeply) hidden linguistic units using the internal model
• No motor theory: the above strategy requires no articulatory recovery from speech acoustics

Speaker-Listener Interaction
• On-line modification of the speaker's articulatory behavior (speaking effort, rate, clarity, etc.) based on the listener's "decoding" performance (i.e., discrimination)
• Especially important for conversational speech recognition and understanding
• On-line adaptation of "encoder" parameters
• Novel criterion: maximize discrimination while minimizing articulation effort
• In this closed-loop model, "effort" is quantified as the "curvature" of the temporal sequence of the articulatory vector z_t (see the sketch below)
• No such concept of "effort" exists in conventional HMM systems
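A minimal sketch of one way to quantify that "curvature" follows; the squared second-difference norm is our assumption, since the slide does not define the measure precisely.

```python
import numpy as np

def articulation_effort(z):
    """Effort of an articulatory trajectory z (T x D array), measured as
    the summed squared second difference ("curvature") over time.
    The specific norm is an illustrative choice, not the slide's formula."""
    z = np.asarray(z, dtype=float)
    curvature = z[2:] - 2.0 * z[1:-1] + z[:-2]   # discrete 2nd derivative
    return float(np.sum(curvature ** 2))

# A smooth (low-effort) vs. a jagged (high-effort) 1-D trajectory:
t = np.linspace(0, 1, 50)[:, None]
print(articulation_effort(t),                    # straight line: ~0
      articulation_effort(np.sin(40 * t)))       # rapid wiggles: large
```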

Stage-I illustration (effects of speaking rate)

Sound Confusion for Casual Speech (model vs. data)
[Figure: two panels, "model prediction" and "hand measurements", each plotting sound confusion against speaking rate.]
• Two sounds merge when they become "sloppy"
• Human perception does "extrapolation"; so does our model
• 5000 hand-labeled speech tokens
• Source: J. Acoustical Society of America, 2000

Model Stage-I:
• Impulse response of FIR filter (noncausal)
• Output of filter
(The equations were rendered as images in the original; see the reconstruction below.)
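The slide's equations are not recoverable verbatim. As a hedged reconstruction based on the published hidden-trajectory-model literature (the symbols $\gamma_s$, $D$, $c$, and $\mathbf{u}_s$ are our notation), the noncausal FIR smoother typically has the form

$$h_s(k) = c\,\gamma_s^{|k|}, \qquad -D \le k \le D,$$

with a state-dependent "stiffness" $\gamma_s$ and normalizing constant $c$, and its output filters the target sequence over a symmetric window:

$$\mathbf{z}_t = \sum_{k=-D}^{D} h_{s(t+k)}(k)\,\mathbf{u}_{s(t+k)},$$

where $\mathbf{u}_s$ is the articulatory/VTR target of the phonological state $s$ active at frame $t+k$. The bidirectional window is what gives the model its context dependence and reduction behavior.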

Model Stage-II:
• Analytical prediction of cepstra, assuming a P-th order all-pole model
• Residual random vector for statistical bias modeling (finite pole order, no zeros)
(The equations were rendered as images in the original; see the reconstruction below.)
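Again as a reconstruction rather than a transcription: the standard VTR-to-cepstrum prediction used in this line of work, with $f_p$ and $b_p$ denoting the frequency and bandwidth of pole $p$ and $f_s$ the sampling frequency, is

$$c_n = \sum_{p=1}^{P} \frac{2}{n}\, e^{-\pi n b_p / f_s} \cos\!\left(\frac{2\pi n f_p}{f_s}\right), \qquad n = 1, 2, \ldots,$$

with the observed cepstrum modeled as $o_n = c_n + r_n$, where the residual $r_n$ is a Gaussian random vector absorbing the bias from the finite pole order and the absence of zeros.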

Illustration: Output of Stage-II
[Figure: model output (green) overlaid on data.]

Speech Recognizer Architecture
• Stages I and II of the hidden trajectory model, in combination, form a speech recognizer
• No context-dependent parameters: the bidirectional FIR filter provides context dependence, as well as reduction
• Training procedure
• Recognition procedure

Procedure --- Training
[Flowchart: the training waveform and phonetic transcript with time marks feed LPCC feature extraction; in parallel, table lookup produces a target sequence, FIR filtering yields VTR tracks, and a nonlinear mapping produces predicted LPCC; the extracted LPCC minus (predicted LPCC + residual) trains the residual parameters, alongside a monophone HMM trainer. A sketch of the residual-fitting step follows.]
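A minimal sketch of the residual-training step implied by the flowchart; fitting a single Gaussian to the prediction error is our reading of "training residual parameters", and the random arrays stand in for the real feature pipeline.

```python
import numpy as np

def fit_residual_gaussian(observed_lpcc, predicted_lpcc):
    """Fit the residual model r ~ N(mu, Sigma) to the difference between
    extracted and model-predicted LPCC frames (both T x D arrays)."""
    r = np.asarray(observed_lpcc) - np.asarray(predicted_lpcc)
    mu = r.mean(axis=0)
    sigma = np.cov(r, rowvar=False)
    return mu, sigma

# Toy usage with random stand-ins for the real features:
rng = np.random.default_rng(1)
obs, pred = rng.normal(size=(200, 12)), rng.normal(size=(200, 12))
mu, sigma = fit_residual_gaussian(obs, pred)
print(mu.shape, sigma.shape)   # (12,) (12, 12)
```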

Procedure --- N-best Evaluation
[Flowchart: test data passes through LPCC feature extraction and a triphone HMM system to produce an N-best list (N = 1000), each hypothesis carrying a phonetic transcript & time marks; each hypothesis then goes through table lookup, FIR filtering, the nonlinear mapping, and a parameter-free Gaussian scorer; the winning hypothesis is H* = arg max { P(H_1), P(H_2), …, P(H_1000) }. A sketch of this rescoring loop follows.]
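A compact sketch of the rescoring loop the flowchart describes; the Gaussian scorer below is a simplified stand-in for the table-lookup → FIR → nonlinear-mapping → scorer chain, and all names and data are hypothetical.

```python
import numpy as np

def gaussian_log_score(observed, predicted, var=1.0):
    """Parameter-free Gaussian scorer: log-likelihood of the observed
    features under N(predicted, var*I), summed over frames."""
    diff = np.asarray(observed) - np.asarray(predicted)
    return -0.5 * np.sum(diff ** 2) / var

def rescore_nbest(observed_lpcc, nbest_predictions):
    """Pick H* = argmax over the N-best list of hypothesis scores.
    `nbest_predictions` maps hypothesis id -> predicted LPCC frames."""
    scores = {h: gaussian_log_score(observed_lpcc, pred)
              for h, pred in nbest_predictions.items()}
    return max(scores, key=scores.get), scores

# Toy usage with two fake hypotheses:
obs = np.zeros((5, 12))
preds = {"Hyp1": np.zeros((5, 12)), "Hyp2": np.ones((5, 12))}
best, scores = rescore_nbest(obs, preds)
print(best)   # Hyp1
```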

Results (recognition accuracy %)
[Figure: recognition accuracy (%) as a function of N in the N-best list, with the HMM baseline shown for comparison.]

Summary & Conclusion
• Human speech production & perception viewed as synergistic elements in a closed-loop communication chain
• They function as encoding & decoding of linguistic messages, respectively
• In humans, the speech "encoder" (production system) consists of phonological (symbolic) and phonetic (numeric) levels
• The current HMM approach approximates these two levels in a crude way:
  – phone-based phonological model ("beads-on-a-string")
  – multiple Gaussians as the phonetic model, applied to acoustics directly
  – very weak hidden structure

Summary & Conclusion (cont'd)
• "Linguistic message recovery" (decoding) formulated as:
  – auditory reception, for an efficient & robust speech representation and for providing temporal landmarks for phonological features
  – cognitive perception, using "encoder" knowledge (the "internal model") to perform probabilistic analysis by synthesis or pattern matching
• Dynamic Bayes networks developed as a computational tool for constructing the encoder and decoder
• Speaker-listener interaction (in addition to poor acoustic environments) causes substantial changes in articulation behavior and acoustic patterns
• This work provides the scientific background and computational framework for our recent MSR speech recognition research

End & Backup Slides