
Landmark-Based Speech Recognition Mark Hasegawa-Johnson Artificial Intelligence Group Human-Computer Intelligent Interaction

Phoneme Recognition Accuracy: Machines (short-time spectra) vs. Humans
Human Machine
Vowels: Monophone 55% 70
Vowels: Triphone 66 70
Stops: VFSV Stops: VSV 88 97 67 89
Phones in Context 98.5 79

Landmark-Based Speech Recognition
Vowels:
- Spectrum changes slowly
- Measure: short-time spectrum (hidden Markov model)
Consonants:
- Spectrum changes rapidly
- Measure: time-frequency pattern of change (landmark-based recognition)
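The contrast between a slowly changing and a rapidly changing spectrum can be illustrated with a small sketch. This is not the landmark detector from the talk. It only computes a short-time log spectrum and a frame-to-frame change measure on a synthetic signal, and the function names, frame length, hop size, and test tones are all assumptions chosen for illustration.

```python
import numpy as np

def short_time_spectrum(signal, frame_len=256, hop=128):
    """Log-magnitude spectrum of overlapping Hamming-windowed frames."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # A small floor keeps the logarithm finite in near-empty bins.
    return np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-6)

def spectral_change(log_spec):
    """Mean absolute difference between consecutive log spectra."""
    return np.mean(np.abs(np.diff(log_spec, axis=0)), axis=1)

# Synthetic signal (assumption): a steady 500 Hz tone that switches
# abruptly to a steady 2000 Hz tone at sample 4096, sampled at 8 kHz.
t = np.arange(4096) / 8000.0
signal = np.concatenate([np.sin(2 * np.pi * 500 * t),
                         np.sin(2 * np.pi * 2000 * t)])

log_spec = short_time_spectrum(signal)
change = spectral_change(log_spec)
peak_frame = int(np.argmax(change))  # frame pair with the largest change
```

Inside each steady tone the change measure stays near zero, and it peaks at the frames that straddle the switch, which is the kind of abrupt event a landmark-based measure is meant to capture.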

Hybrid HMM/Landmark Recognizer
Omar, Hasegawa-Johnson, and Levinson, 2001
Q_i = [ phoneme string with K_i landmarks ]
X_i = [ K_i landmark observations, N-K_i short-time spectra ]
Q* = argmax p(Q_i | X_i) ?
Comparability Conditions
p(X_1, X_2, ... | Q_1) = p(X_1 | Q_1) p(~X_1 | babble)
p(X_1 | babble) = p(X_2 | babble)
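One way to read the first comparability condition is that every hypothesis is scored on the same pool of observations: the part it explains is scored by its own model, and the rest by a background "babble" model. The sketch below illustrates that idea only. It assumes independent observations and toy 1-D Gaussian models, neither of which comes from the slides, and the names `log_gauss` and `comparable_log_score` are hypothetical.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def comparable_log_score(observations, explained, hyp_model, babble_model):
    """log p(X_i | Q_i) + log p(~X_i | babble) over the full observation pool.

    explained: indices of the observations that hypothesis Q_i accounts for.
    hyp_model, babble_model: hypothetical (mean, variance) pairs.
    Observations are treated as independent (an assumption of this sketch).
    """
    total = 0.0
    for idx, x in enumerate(observations):
        mean, var = hyp_model if idx in explained else babble_model
        total += log_gauss(x, mean, var)
    return total

# Toy pool of four observations; the two hypotheses explain different subsets.
pool = [1.1, 0.9, -0.2, 0.1]
babble = (0.0, 4.0)  # broad background model (assumed)
score_q1 = comparable_log_score(pool, {0, 1}, (1.0, 0.1), babble)
score_q2 = comparable_log_score(pool, {0, 1, 2, 3}, (1.0, 0.1), babble)
best = "Q1" if score_q1 > score_q2 else "Q2"
```

Because both scores cover all four observations, they can be compared directly even though the hypotheses explain different numbers of them.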

Hybrid Landmark-HMM Recognition Results

Problem: Feature Selection
INFOGRAMS: Time-Frequency Spread of Information About Phonetic Distinctions (Hasegawa-Johnson, 2000)

Feature Selection Method 1: Phonetic Knowledge

Method 2: Maximum Mutual Information Feature Selection
Omar & Hasegawa-Johnson, submitted
Mutual Information Between Feature Vector and Phonetic Distinction:
LPCC, MFCC, PLP ---- Standard Feature Vectors
MMIA ---- Maximum Mutual Information Acoustic Features
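To make the quantity being compared concrete, the following sketch computes a simple histogram (plug-in) estimate of the mutual information between one scalar feature and a class label. It is not the MMIA procedure of Omar & Hasegawa-Johnson. The function name, the bin count, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def mutual_information_bits(feature, labels, n_bins=16):
    """Plug-in estimate of I(feature; label) in bits from a joint histogram."""
    edges = np.histogram_bin_edges(feature, bins=n_bins)
    binned = np.digitize(feature, edges[1:-1])  # bin index in 0 .. n_bins-1
    classes, class_idx = np.unique(labels, return_inverse=True)
    joint = np.zeros((n_bins, len(classes)))
    np.add.at(joint, (binned, class_idx), 1.0)  # count co-occurrences
    joint /= joint.sum()
    p_x = joint.sum(axis=1, keepdims=True)      # marginal over feature bins
    p_y = joint.sum(axis=0, keepdims=True)      # marginal over classes
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (p_x @ p_y)[nz])))

# Synthetic binary "phonetic distinction" (assumption) and two features:
# one whose mean depends on the class, one that ignores it.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=5000)
informative = 2.0 * labels + 0.5 * rng.standard_normal(5000)
uninformative = rng.standard_normal(5000)

mi_informative = mutual_information_bits(informative, labels)
mi_uninformative = mutual_information_bits(uninformative, labels)
```

A feature that separates the two classes well scores close to the one-bit ceiling for a binary distinction, while a feature unrelated to the class scores near zero. Ranking candidate features by such a score is the basic idea behind mutual-information feature selection.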

Summary of Current Results
- Landmark/HMM Hybrid Recognizer improves stop consonant recognition accuracy from 27% to 66%.
- MMI feature selection doubles mutual information between features and phonetic class.
- Preliminary results with speech in noise: soft-decision auditory scene analysis gives 70% WRA at 0 dB SNR.

Continuing Research: Speech in Complex Noise Environments Jing & Hasegawa-Johnson, submitted

Continuing Research: Prosody-Dependent Landmark Models

Other Continuing Research
Articulatory Dynamics at Landmarks:
- Multimodal interfaces
- Hidden dynamic trajectory ASR
MRI Analysis of Normal and Abnormal Tongue
Active Sound Control in a Virtual Reality Environment