
Multivariate-State Models for Speech Recognition
Mark Hasegawa-Johnson, Ph.D.
mhj@icsl.ucla.edu
NIH Post-Doctoral Fellow, Lecturer, UCLA Department of Electrical Engineering
Research Associate, MIT Speech Communication Group

Key Point
Speech is a one-dimensional signal that encodes multiple simultaneous, partially independent information streams.

Outline
• Background: Univariate Hidden Markov Models
• Problem Statement: Multivariate Content of Speech (Ex.: Composite Acoustic Cues)
• Multivariate State Models: Definition, Complexity Issues (Ex.: Composite Acoustic Cues)

Background: Statistical Classification
• Class Definition: Functional Form with Trainable Parameters
• Training: Modify Parameters of p(obs | class); Create Lookup Table of p(class)
• Classification: class = argmax p(class | obs)
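The three steps on this slide can be sketched in a few lines. This is a minimal illustration, not the system described in the talk: it assumes a univariate Gaussian as the functional form for p(obs | class), and the function names (`train`, `classify`) and the toy samples are hypothetical. It uses the fact that argmax p(class | obs) equals argmax p(obs | class) p(class), since p(obs) does not depend on the class.

```python
import math

def gaussian_log_pdf(x, mean, std):
    """Log density of a univariate normal distribution."""
    return -0.5 * math.log(2.0 * math.pi * std * std) - 0.5 * ((x - mean) / std) ** 2

def train(samples_by_class):
    """Fit p(obs | class) as a Gaussian and p(class) as a lookup table."""
    total = sum(len(xs) for xs in samples_by_class.values())
    model = {}
    for label, xs in samples_by_class.items():
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        model[label] = {
            "mean": mean,
            "std": math.sqrt(var),
            "log_prior": math.log(len(xs) / total),  # p(class) from relative frequency
        }
    return model

def classify(model, obs):
    """Return argmax over classes of log p(obs | class) + log p(class)."""
    def score(label):
        params = model[label]
        return gaussian_log_pdf(obs, params["mean"], params["std"]) + params["log_prior"]
    return max(model, key=score)

# Toy one-dimensional observations (hypothetical values, not data from the talk).
model = train({"low": [1.0, 2.0, 3.0], "high": [7.0, 8.0, 9.0]})
low_label = classify(model, 2.5)
high_label = classify(model, 8.2)
```

Working in the log domain avoids numerical underflow once many observations are multiplied together, which matters for the HMM scoring that follows.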

Hidden Markov Models

HMM Phone Models

HMM Word Models

HMM Sentence Models

Recognition Scoring
Find Q to maximize the “Recognition Probability,”
P(O, Q) = p(i) p(o_1 | i) p(i | i) p(o_2 | i) …
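As a sketch of how this product is evaluated for one candidate state sequence Q, the function below accumulates the initial, transition, and observation terms in the log domain. The two-state model, the symbol alphabet, and all probabilities are hypothetical toy values chosen for illustration; a discrete observation table stands in for whatever observation density a real recognizer would use.

```python
import math

def log_joint(obs, path, init, trans, emit):
    """Return log P(O, Q) for one state path Q through a discrete-output HMM."""
    # First frame: p(q_1) p(o_1 | q_1)
    score = math.log(init[path[0]]) + math.log(emit[path[0]][obs[0]])
    # Later frames: p(q_t | q_{t-1}) p(o_t | q_t)
    for prev, cur, o in zip(path, path[1:], obs[1:]):
        score += math.log(trans[prev][cur]) + math.log(emit[cur][o])
    return score

# Hypothetical toy model with states "i" and "j".
init = {"i": 0.6, "j": 0.4}
trans = {"i": {"i": 0.7, "j": 0.3}, "j": {"i": 0.4, "j": 0.6}}
emit = {"i": {"a": 0.5, "b": 0.4, "c": 0.1},
        "j": {"a": 0.1, "b": 0.3, "c": 0.6}}
obs = ["a", "b", "c"]

stay = log_joint(obs, ["i", "i", "i"], init, trans, emit)    # p(i) p(o_1|i) p(i|i) p(o_2|i) ...
switch = log_joint(obs, ["i", "i", "j"], init, trans, emit)  # same prefix, ends in state j
```

Comparing `stay` and `switch` shows the recognition problem in miniature: different paths Q explain the same observations O with different probabilities, and the recognizer wants the best one.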

Implementation: the Viterbi Algorithm
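A compact sketch of the Viterbi recursion for a discrete-output HMM follows. It is a textbook formulation under assumed toy parameters, not the implementation used in the talk; the state names, symbols, and probabilities are hypothetical. For each frame it keeps, per state, the best log-probability of any path ending there plus a backpointer, then traces the backpointers from the best final state.

```python
import math

def viterbi(obs, states, init, trans, emit):
    """Return (best_path, log P(O, best_path)) for a discrete-output HMM."""
    # delta[s]: best log-probability of any path that ends in state s at the current frame
    delta = {s: math.log(init[s]) + math.log(emit[s][obs[0]]) for s in states}
    backpointers = []
    for o in obs[1:]:
        prev = delta
        delta, pointer = {}, {}
        for s in states:
            # Best predecessor of s under the transition model
            best = max(states, key=lambda r: prev[r] + math.log(trans[r][s]))
            delta[s] = prev[best] + math.log(trans[best][s]) + math.log(emit[s][o])
            pointer[s] = best
        backpointers.append(pointer)
    # Trace back from the best final state
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path, delta[last]

# Hypothetical toy model (all probabilities nonzero, so every log is defined).
states = ["s1", "s2"]
init = {"s1": 0.6, "s2": 0.4}
trans = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
emit = {"s1": {"a": 0.5, "b": 0.4, "c": 0.1},
        "s2": {"a": 0.1, "b": 0.3, "c": 0.6}}

best_path, best_log_prob = viterbi(["a", "b", "c"], states, init, trans, emit)
```

The cost is proportional to the number of frames times the square of the number of states, which is why the size of the state space becomes the central complexity question once the state is multivariate.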

Background: Stop Consonant Release
• Three “Places of Articulation”:
  • Lips (b, p)
  • Tongue Blade (d, t)
  • Tongue Body (g, k)

Problem Statement: Content of Speech is Multivariate
1. Source Information: Prosody, Articulatory Features

Content of Speech is Multivariate
2. Useful Non-Source Information: Composite Acoustic Cues

Composite Cues: Traditional Solution

Types of Measurement Error
• Small Errors: Spectral Perturbation
• Large Errors: Pick the Wrong Peak
[Figure: spectrum, Amp. (dB) vs. Frequency (Hertz)]

Large Errors are 20% of Total
• Std Dev of Small Errors = 45-72 Hz
• Std Dev of Large Errors = 218-1330 Hz
• P(Large Error) = 0.17-0.22
[Figure: Log PDF vs. Measurement Error (Hertz) re: Manual Transcriptions]
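One way to read these statistics is as a two-component mixture: a narrow component for small errors and a wide one for large errors. The sketch below makes that reading concrete under loud assumptions: both components are taken to be zero-mean Gaussians, and the three constants are single illustrative values picked from inside the ranges on the slide, not figures reported in the talk.

```python
import math

# Illustrative constants chosen from within the slide's ranges (assumptions).
SMALL_STD_HZ = 60.0    # slide range: 45-72 Hz
LARGE_STD_HZ = 500.0   # slide range: 218-1330 Hz
P_LARGE = 0.2          # slide range: 0.17-0.22

def _log_normal(err_hz, std_hz):
    """Log density of a zero-mean Gaussian at err_hz."""
    return -0.5 * math.log(2.0 * math.pi * std_hz ** 2) - 0.5 * (err_hz / std_hz) ** 2

def log_pdf(err_hz):
    """Log density of the small/large error mixture, computed stably."""
    a = math.log(1.0 - P_LARGE) + _log_normal(err_hz, SMALL_STD_HZ)
    b = math.log(P_LARGE) + _log_normal(err_hz, LARGE_STD_HZ)
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))  # log-sum-exp of the two components

def posterior_large(err_hz):
    """Probability that an error of this size came from the large-error component."""
    b = math.log(P_LARGE) + _log_normal(err_hz, LARGE_STD_HZ)
    return math.exp(b - log_pdf(err_hz))

near_zero = posterior_large(0.0)      # most likely a small perturbation
far_out = posterior_large(1000.0)     # almost certainly a wrong-peak error
```

Plotted on a log scale, a mixture like this shows a sharp central peak with long shallow tails, which is the qualitative shape the slide's log-PDF axis suggests.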

Measurement Error Predicts Classification Error

Solution: Composite Cues as State Variables

Complexity of Solution Without Additional Constraints

Useful Constraint #1: State Independence

Useful Constraint #2: Hierarchical Dependence
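To make the complexity argument concrete, the sketch below counts transition-table entries for a state made of several discrete variables under three structures. The counting rules are assumptions made for illustration, since the slides give only titles here: the unconstrained model uses one joint table over all value combinations, the independent model uses one table per variable, and the hierarchical model is taken to be a chain in which each variable depends on its own previous value and on its parent's current value.

```python
import math

def joint_size(cards):
    """Unconstrained: one transition table over every combination of values."""
    n_joint = math.prod(cards)
    return n_joint * n_joint

def independent_size(cards):
    """State independence: each variable has its own n-by-n transition table."""
    return sum(n * n for n in cards)

def hierarchical_size(cards):
    """Chain hierarchy (assumed form): child depends on its previous value and its parent."""
    root = cards[0] * cards[0]
    children = sum(parent * n * n for parent, n in zip(cards, cards[1:]))
    return root + children

cards = [10, 10, 10]  # hypothetical: three state variables with ten values each
sizes = (joint_size(cards), independent_size(cards), hierarchical_size(cards))
```

With these toy cardinalities the joint table has a million entries while the two constrained structures need a few hundred to a few thousand, which is the sense in which the constraints make the model tractable.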

Description of the Test System

Test System Results

A Posteriori Measurement Distributions: 10 ms After /d/ in “dark”
[Figure: panels DFT Amplitude, DFT Convexity, P(F | O, Q); x-axis: Frequency (0-4000 Hertz)]

Conclusions
• Speech Signal is Affected by Multiple Information Streams.
• Multivariate State Models Can Explicitly Model Multiple Information Streams:
  • Articulatory Features
  • Composite Acoustic Cues
• Complexity is Viable if State Variables are Independent or Hierarchically-Dependent.

Future Directions
• Multivariate-State Training Algorithms
  • Search for Provable Low-Cost Algorithms
  • Test Heuristic, Non-Provable Algorithms
• Replace “Phone String” w/ Multivariate Articulatory-Feature Representation
• Prosody
  • Simultaneous Recog. of Prosody and Text
  • Combine Prosody and Text to Extract Meaning

Speech Production Research: Factor Analysis of MRI-Derived Tongue Shapes
• Hypothesis 1: During speech, the tongue is controlled in a low-dimensional subspace.
• Hypothesis 2: The shape of the subspace is speaker-dependent.
• Hypothesis 3: Speaker-dependent control spaces are more similar acoustically than articulatorily.

MRI Image Collection
• GE Signa 1.5 T
• T1-weighted
• 3 mm slices
• 24 cm FOV
• 256 x 256 pixels
• Coronal, Axial
• 3 Subjects
• 11 Vowels
• Breath-hold in vowel position for 25 seconds

MRI Image Segmentation
In CTMRedit:
• Manual
• Seeded Region Growing
Tested:
• Snake
• Structural Saliency
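For readers unfamiliar with seeded region growing, a minimal generic sketch follows. It is not CTMRedit's implementation: the acceptance rule (absolute intensity difference from the seed pixel within a tolerance), the 4-connected neighborhood, the function name, and the toy image are all assumptions made for illustration.

```python
from collections import deque

def seeded_region_grow(image, seed, tolerance):
    """Return the set of (row, col) pixels connected to seed with similar intensity."""
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        # Visit the 4-connected neighbors of the current pixel
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in region:
                if abs(image[nr][nc] - seed_value) <= tolerance:
                    region.add((nr, nc))
                    queue.append((nr, nc))
    return region

# Hypothetical 3 x 3 intensity image: a dark patch in the top-left corner.
image = [
    [10, 11, 50],
    [12, 10, 52],
    [51, 49, 50],
]
region = seeded_region_grow(image, seed=(0, 0), tolerance=5)
```

The user supplies only the seed and the tolerance, which is what makes the method a middle ground between fully manual tracing and fully automatic approaches such as snakes.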

Problem Statement: Content of Speech is Multivariate
2. Higher-Level Acoustic Information: Relational Spectral Cues