Visual Speech Recognition Using Hidden Markov Models
Kofi A. Boakye
CS 280 Course Project

Motivation
• Visual articulation provides a good information source for speech
  – Humans who lip-read can recognize speech intelligibly
  – Visual information provides robustness to noise
• Can enhance speech recognition in various applications
  – Text annotation of multimedia data
  – Automatic computer dictation
  – Lip-reading in mobile phones for noisy environments

Project Overview
• Visual speech recognition task using the Tulips 1 database
• Recognition performed by training HMMs on visual features
• Cross-validation procedure used for training and testing
• Experimented with features and HMM architecture
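A minimal sketch of how HMM-based recognition of this kind is commonly done: one HMM per digit, with an utterance assigned to the digit whose model gives the highest likelihood. The slides do not give the implementation, so everything here is an assumption for illustration: the left-to-right topology, the single diagonal Gaussian per state, and the helper names `left_to_right`, `forward_log_likelihood`, and `classify`. Parameter training (e.g. Baum-Welch) is not shown.

```python
import numpy as np

def left_to_right(n_states, stay=0.6):
    """Hypothetical left-to-right topology: each state loops or moves one step on."""
    trans = np.zeros((n_states, n_states))
    for i in range(n_states):
        if i + 1 < n_states:
            trans[i, i] = stay
            trans[i, i + 1] = 1.0 - stay
        else:
            trans[i, i] = 1.0  # final state only loops
    start = np.zeros(n_states)
    start[0] = 1.0  # always begin in the first state
    with np.errstate(divide="ignore"):  # log(0) = -inf is intended
        return np.log(start), np.log(trans)

def log_gaussian_diag(x, means, variances):
    """Log density of one frame under each state's diagonal Gaussian."""
    diff = x - means  # shape (n_states, n_features)
    return -0.5 * np.sum(np.log(2 * np.pi * variances) + diff**2 / variances, axis=1)

def forward_log_likelihood(frames, log_start, log_trans, means, variances):
    """log p(frames | model), computed with the forward algorithm in the log domain."""
    alpha = log_start + log_gaussian_diag(frames[0], means, variances)
    for x in frames[1:]:
        scores = alpha[:, None] + log_trans  # rows: previous state, cols: next state
        alpha = np.logaddexp.reduce(scores, axis=0) + log_gaussian_diag(x, means, variances)
    return np.logaddexp.reduce(alpha)

def classify(frames, models):
    """Return the label of the model with the highest log-likelihood."""
    return max(models, key=lambda label: forward_log_likelihood(frames, *models[label]))
```

Here `models` maps a digit label to a tuple `(log_start, log_trans, means, variances)`, and `frames` is an array of shape `(n_frames, n_features)`.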

Tulips 1
• Small public audiovisual database
• Consists of 12 speakers (9 male, 3 female) saying the first four English digits
• Video format:
  – Digitized (8-bit grayscale PGM) images of lips of size 100 x 75
  – Sampling rate: 30 fps
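With only 12 speakers, one plausible cross-validation scheme is to hold out one speaker per fold. The slides do not say which scheme was used, so the leave-one-speaker-out split below is an assumption, and `leave_one_speaker_out` is a hypothetical helper.

```python
def leave_one_speaker_out(speaker_ids):
    """Yield (held_out_speaker, train_indices, test_indices), one fold per speaker."""
    for held_out in sorted(set(speaker_ids)):
        train = [i for i, s in enumerate(speaker_ids) if s != held_out]
        test = [i for i, s in enumerate(speaker_ids) if s == held_out]
        yield held_out, train, test

# Illustrative labels only: 12 speakers, one utterance of each of the four digits.
speaker_ids = [speaker for speaker in range(12) for _ in range(4)]
folds = list(leave_one_speaker_out(speaker_ids))
```

Splitting by speaker rather than by utterance keeps every test speaker unseen during training, so the reported accuracy reflects speaker-independent recognition.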

Features
• Contour features
  – 6 features related to geometry of the mouth and lips (hand generated)
• PCA on raw image pixels
  – Experimented with different numbers of components
• Image preprocessing + PCA
  – Processing included:
    1) Symmetry enforcement
    2) Lowpass filtering (9 x 9 Gaussian kernel, σ = 1.5) and subsampling (5)
    3) Compression and linearization
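The preprocessing steps and the PCA projection can be sketched roughly as follows. Several details are assumptions, since the slides do not spell them out: symmetry enforcement is taken to mean averaging the image with its left-right mirror, the "5" is read as a subsampling step of 5 pixels, and the unspecified "compression" step is left out. The function names are hypothetical.

```python
import numpy as np

def gaussian_kernel_1d(size=9, sigma=1.5):
    """Normalized 1-D Gaussian; applied along rows and columns it gives a 9 x 9 kernel."""
    x = np.arange(size) - (size - 1) / 2
    kernel = np.exp(-x**2 / (2 * sigma**2))
    return kernel / kernel.sum()

def preprocess(image, step=5):
    """Turn one grayscale lip image (rows x cols) into a feature vector."""
    img = image.astype(float)
    # 1) Symmetry enforcement (assumed): average with the left-right mirror image.
    img = 0.5 * (img + img[:, ::-1])
    # 2) Separable Gaussian lowpass filter, then keep every `step`-th pixel.
    k = gaussian_kernel_1d()
    img = np.apply_along_axis(lambda row: np.convolve(row, k, mode="same"), 1, img)
    img = np.apply_along_axis(lambda col: np.convolve(col, k, mode="same"), 0, img)
    img = img[::step, ::step]
    # 3) Linearization: flatten to a vector ("compression" is unspecified and omitted).
    return img.ravel()

def pca_fit(X, n_components):
    """Return the data mean and the top principal directions (rows) of X."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_project(X, mean, components):
    """Project feature vectors onto the principal directions."""
    return (X - mean) @ components.T
```

For a 100 x 75 image stored as a 75 x 100 array, `preprocess` returns a vector of length 300 under these assumptions; the raw-pixel PCA variant would skip `preprocess` and flatten the image directly.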

Results
Contour Features
• Best choice: 5 states and 1 Gaussian
• Note high accuracy with even 1 state
• Indicates importance of delta components
Raw Image Features
• Best choice: 10 components
• Similar performance to contour features, which require human assistance
• Demonstrates power of PCA
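Delta components are usually frame-to-frame differences appended to the static features, which is why a 1-state model can still capture some of the dynamics. The slides do not define the delta computation, so the simple first-order difference below is an assumption, and `add_deltas` is a hypothetical helper.

```python
import numpy as np

def add_deltas(features):
    """Append first-order differences to a (n_frames, n_features) float array."""
    deltas = np.zeros_like(features)
    deltas[1:] = features[1:] - features[:-1]  # first frame keeps a zero delta
    return np.hstack([features, deltas])

# Tiny illustrative sequence: 3 frames, 2 static features each.
static = np.array([[1.0, 2.0], [3.0, 5.0], [6.0, 9.0]])
with_deltas = add_deltas(static)
```

The output has twice as many columns as the input: the static features followed by their deltas.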

Results
Preprocessed Image Features
• Procedure produces fair performance
• Even better with addition of PCA

Conclusions
• For given task, HMMs proved very effective
• HMM architecture significantly affects results
• Delta features appear to be quite useful
• Feature selection
  – Contour features best
    • Generation can potentially be automatic
  – Within limited exploration, “blind” statistical technique (i.e., PCA) superior to image-specific one