MULTIMICROPHONE MULTICAMERA AUDIOVISUAL SPEECH RECOGNITION IN AN AUTOMOTIVE ENVIRONMENT

Faculty: Mark Hasegawa-Johnson, Stephen Levinson, Thomas Huang
Graduate Students: Bowon Lee, Lae-Hoon Kim, Ming Liu, Sarah Borys
Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign, USA

1. Research overview
• In human perception of speech, the acoustic speech signal is primary. But visual observation of the lips, teeth, tongue, and jaw contributes to perception of phoneme articulation, while the angle of the head and raising of the eyebrows help convey sentence-level prosody. Visual information improves speech recognition, especially in noisy environments.
• Automatic speech recognition (ASR) can be very accurate when using a specifically designed Hidden Markov Model (HMM) recognizer. However, performance degrades severely when the training and test data sets have mismatched signal-to-noise ratio (SNR) or speaking styles.
• Combining visual and audio information can improve ASR accuracy in low-SNR conditions. For humans, adding a visual signal roughly equals a 12 dB SNR gain. For ASR, an audio-video coupled HMM can improve word recognition accuracy by more than 40% at 20 dB SNR.
• A moving car is a good example for combining visual and audio information. Drivers may be less distracted when they operate devices by speaking commands to an ASR system instead of operating them manually. Background noise in a car typically produces 15 to -10 dB SNR at the speech-recording microphones. A microphone array can improve SNR by using beamforming algorithms, which attenuate off-axis sounds such as wind noise, road noise, and passing vehicles. An array of cameras allows for the extraction of 3D shape-based features for audiovisual ASR. Model-based (rather than image-based) feature extraction is relatively immune to the widely changing illumination in a moving car.

2. Database
• The AVICAR corpus is data recorded in a real car environment using a multi-sensory array consisting of eight microphones on the sun visor and four video cameras on the dashboard.
• The script for the corpus consists of four categories.
• Speakers from various language backgrounds are included, 50 male and 50 female.
• Each script has five different noise conditions:
  • Idling (IDL)
  • Driving at 35 mph with windows up (35U) and down (35D)
  • Driving at 55 mph with windows up (55U) and down (55D)

3. Sequential MVDR beamforming + postfilter
[Block diagram: microphone inputs Y1, Y2, Y3, ..., YN; T(Y); a sequence of MVDR beamformers (implemented by LMS-GSC) with directions d, r1, r2, ..., rM, linked by delay blocks Z-d(d,r1), Z-d(r1,r2), ..., Z-d(rM-1,rM); post-filter (MMSE-logSA); output S]
• 55 miles per hour with window down

4. Direction-Based VAD

5. CHMM and Audio-Visual fusion
[Figures: speech recognition WER (%); speaker verification result]

6. Visual features
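
The MVDR beamformers named in section 3 pass the look direction undistorted while minimizing output noise power. The textbook closed form is w = R^{-1} d / (d^H R^{-1} d), where R is the noise covariance and d the steering vector. The following is a minimal Python sketch of that closed form only; it is not the poster's LMS-GSC implementation, which is adaptive. The function names (`solve`, `mvdr_weights`, `response`), the three-microphone geometry, and the noise values are illustrative assumptions.

```python
import cmath


def solve(a, b):
    """Solve a x = b for a small complex system (Gaussian elimination, partial pivoting)."""
    n = len(b)
    # Work on an augmented copy so the caller's matrix is left untouched.
    m = [[complex(v) for v in row] + [complex(b[i])] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    # Back substitution.
    x = [0j] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def mvdr_weights(noise_cov, steering):
    """Closed-form MVDR weights w = R^{-1} d / (d^H R^{-1} d)."""
    r_inv_d = solve(noise_cov, steering)
    denom = sum(d.conjugate() * v for d, v in zip(steering, r_inv_d))
    return [v / denom for v in r_inv_d]


def response(weights, steering):
    """Array response w^H d toward the direction described by `steering`."""
    return sum(w.conjugate() * d for w, d in zip(weights, steering))


# Hypothetical 3-microphone example: broadside talker, one off-axis interferer.
look = [1 + 0j, 1 + 0j, 1 + 0j]
interferer = [cmath.exp(-1j * cmath.pi / 2 * k) for k in range(3)]
# Assumed noise covariance: weak sensor noise (0.01) plus the interferer.
noise_cov = [
    [(0.01 if i == j else 0.0) + interferer[i] * interferer[j].conjugate()
     for j in range(3)]
    for i in range(3)
]
w = mvdr_weights(noise_cov, look)
print("gain toward talker:    ", abs(response(w, look)))
print("gain toward interferer:", abs(response(w, interferer)))
```

In this sketch the gain toward the talker stays at unity by construction, while the gain toward the assumed interferer is strongly reduced. That is the off-axis attenuation the overview attributes to beamforming.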