Features for Improved Speech Activity Detection for Recognition of Multiparty Meetings


Features for Improved Speech Activity Detection for Recognition of Multiparty Meetings
Kofi A. Boakye
International Computer Science Institute
Speech Group Lunch Talk, May 30th, 2006


Overview
● Background
● Previous work and proposed changes
● HMM segmenter and ASR system
● Features investigated
● Experimental results
● Conclusions


Background
● Segmentation of audio into speech/nonspeech is a critical first step in ASR
● Especially true for the Individual Headset Microphone (IHM) condition in meetings
  – Issues: 1) crosstalk, 2) breath/contact noise
  – Single-channel energy-based methods are ineffective


Background
● Initiatives such as AMI, IM2, and the NIST RT evaluations show interest in recognition and understanding of multispeaker meetings


Background
● Major source of error for IHM recognition: speech activity detection errors


Previous Work
● Previous approach: time-based intersection of two distinct segmenters
1) HMM-based segmenter with standard cepstral features
  – 12 MFCCs
  – Log-energy
  – First and second differences


Previous Work
● Previous approach: time-based intersection of two distinct segmenters
2) Local-energy detector
  – Generates segments by zero-thresholding a "crosstalk-compensated" energy-like signal


Proposed Changes
● Though the intersection approach was effective, it was believed to be limited:
  – Cross-channel analysis is disjoint from speech activity modeling
  – The fixed threshold potentially lacks robustness
  – It fails to incorporate other acoustically derived features (e.g., cross-correlation)
● New approach: integrate features directly into the HMM segmenter
  – Append features to the cepstral feature vector


HMM Segmenter
● Derived from an HMM-based speech recognition system
● Two-class HMM with a three-state phone model
● Multivariate GMM with 256 components
● Segmentation proceeds by repeatedly decoding the waveform with decreasing transition penalties
  – Results in segments less than 60 s
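The decode-with-decreasing-penalty loop described above can be sketched as follows. This is a minimal illustration only: `viterbi_decode` is a hypothetical stand-in for the real HMM decoder, and the initial penalty and step size are invented values, not those of the actual system.

```python
# Illustrative control loop: decode, and while any segment is 60 s or
# longer, lower the speech/nonspeech transition penalty and decode again
# so the best path switches between the two classes more readily.
# `viterbi_decode(waveform, penalty)` is a hypothetical decoder returning
# (start, end) segments in seconds; penalty values here are assumptions.

MAX_SEG = 60.0  # target maximum segment duration, per the slide

def segment(waveform, viterbi_decode, penalty=100.0, step=10.0):
    segments = viterbi_decode(waveform, penalty)
    while any(e - s >= MAX_SEG for s, e in segments) and penalty > step:
        penalty -= step  # a smaller penalty encourages more transitions
        segments = viterbi_decode(waveform, penalty)
    return segments
```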


HMM Segmenter
● Post-processing:
  – Pad segments by a fixed amount (40 ms) to prevent "clipping" effects
  – Merge segments with small separation (< 0.4 s) to "smooth" the segmentation
  – Constraints optimized based on recognition accuracy and runtime for the segmenter with standard cepstral features
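The two post-processing steps above are simple enough to sketch directly. The pad (40 ms) and merge threshold (0.4 s) are the values quoted on the slide; the segment representation as `(start, end)` tuples in seconds is an assumption for illustration.

```python
PAD = 0.040      # pad each segment boundary by 40 ms (from the slide)
MIN_GAP = 0.400  # merge segments separated by less than 0.4 s (from the slide)

def postprocess(segments, duration):
    """Pad segments at both ends, then merge those with small separation."""
    padded = [(max(0.0, s - PAD), min(duration, e + PAD)) for s, e in segments]
    merged = []
    for s, e in sorted(padded):
        if merged and s - merged[-1][1] < MIN_GAP:
            # gap to the previous segment is too small: absorb this one
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged
```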


ASR System
● The ICSI-SRI RT-05S system was used for development and validation experiments:
  – Multiple decoding passes and front-ends for cross-adaptation and hypothesis refinement
  – PLP and MFCC+MLP features
  – Features transformed with VTLN and HLDA, along with feature-level constrained MLLR
  – Models trained on 2000 hours of telephone data and MAP-adapted to 100 hours of meeting data
  – 4-gram LM trained on telephone, meeting transcript, broadcast, and Web data


Features: Cross-channel
● Log-Energy Differences (LEDs)
  – Log of the ratio of short-time energy between the target and each non-target channel
● Normalized Log-Energy Differences (NLEDs)
  – Subtract the minimum frame energy of a channel from all energy values in that channel
  – Addresses significant gain differences
  – Largely independent of the amount of speech in the channel
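A minimal sketch of the LED and NLED computations, following the definitions above. Assumptions are flagged in the comments: the energy floor constant, the use of natural log, and performing the minimum subtraction in the log-energy domain are illustrative choices, not details confirmed by the slides.

```python
import math

def log_energy(frame):
    """Short-time log-energy of one frame (list of samples).
    The 1e-10 floor is an assumption to keep the log finite."""
    return math.log(sum(x * x for x in frame) + 1e-10)

def led(target_frames, other_frames):
    """LED per frame: log of the energy ratio between the target channel
    and one non-target channel, i.e. log E_target - log E_other."""
    return [log_energy(t) - log_energy(o)
            for t, o in zip(target_frames, other_frames)]

def nled(target_frames, other_frames):
    """NLED: subtract each channel's minimum frame log-energy before
    differencing, compensating for per-channel gain offsets.
    (Subtracting in the log domain is an assumption.)"""
    lt = [log_energy(f) for f in target_frames]
    lo = [log_energy(f) for f in other_frames]
    mt, mo = min(lt), min(lo)
    return [(a - mt) - (b - mo) for a, b in zip(lt, lo)]
```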


Features: Cross-channel
● Normalized Maximum Cross-correlation (NMXC)
  – Serves as an indicator of crosstalk
  – A more common cross-channel feature than energy differences
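One plausible formulation of NMXC is sketched below: take the peak of the cross-correlation between the target and a non-target frame over a small lag range, normalized by the target frame energy. The lag range and the choice of normalizer are assumptions for illustration, not necessarily the exact definition used in the system.

```python
def nmxc(x, y, max_lag=5):
    """Normalized maximum cross-correlation between a target frame x and
    a non-target frame y (lists of samples). The lag range and the
    energy normalizer are illustrative assumptions."""
    n = len(x)
    energy = sum(v * v for v in x) + 1e-10  # floor keeps division safe
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        c = sum(x[i] * y[i + lag] for i in range(n) if 0 <= i + lag < n)
        best = max(best, abs(c))
    return best / energy
```

A frame correlated with itself yields a value near 1, while an uncorrelated (e.g. silent) non-target channel yields a value near 0, which is what makes the feature a crosstalk indicator.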


Features: Cross-channel
● Feature vector length standardization
  – For cross-channel features, the number of channels may vary, but the feature vector length must be fixed
  – Proposed solution: use order statistics (maximum and minimum) of the feature values generated on the different channels
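The order-statistics trick above reduces to a one-liner, sketched here assuming one feature value per non-target channel at each frame: however many channels a meeting has, each frame contributes a fixed `[max, min]` pair.

```python
def standardize(per_channel_feats):
    """per_channel_feats: for each frame, one feature value per non-target
    channel (frames x channels, and the channel count may vary by meeting).
    Returns a fixed-length [max, min] pair per frame, so the appended
    feature vector has the same dimensionality for every meeting."""
    return [[max(frame), min(frame)] for frame in per_channel_feats]
```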


Experiments: AMI devtest
● Performance of the features was initially investigated on the AMI development set
● Testing: 12-minute excerpts from 4 meetings
● Training: first 10 minutes from each of 35 meetings
● A "fast" (two-decoding-pass) version of the recognition system was used for quick turnaround


Experiments: AMI devtest
● Results:
  – New features give significant improvement over the baseline (reduced insertions)
  – NLEDs give ~1% reduction over LEDs


Experiments: Eval04
● Having established the effectiveness of the features, systems were evaluated on the RT-04S set
● Meetings vary in style, number of participants, and room acoustics
● Testing: 11-minute excerpts from 8 meetings, 2 from each of CMU, ICSI, NIST, and LDC
● Training: first 10 minutes from each of 15 NIST meetings and 73 ICSI meetings


Experiments: Eval04
● Results:
  – Features give improvement over the baseline and the previous system
  – NMXC features not as robust; removed from consideration for the final SAD system


System Validation: Eval05 (and 06)
● Finalized system: HMM segmenter with baseline and NLED features*
● Training: union of the previous training sets
  – AMI (35 mtgs), NIST (15 mtgs), ICSI (73 mtgs)
  – Baseline and intersection systems used two models (ICSI+NIST and AMI)
  – New systems used a single model with pooled data
*Eval06 official submission used LEDs


System Validation: Eval05 (and 06)
● Using the SDM signal
  – Eval05 included a meeting with an unmiked participant
  – The SDM served as a "stand-in" mic for that participant
  – Including the SDM signal (and energy normalization) improved results by >12% on NIST meetings
  – The SDM signal was not used for eval06, since there were no unmiked speakers


System Validation: Eval05 (and 06)
● 1.2% gain over last year's segmenter on eval05
● Energy normalization gave an extra 1.2% gain on eval06 and 2.0% on eval05 (due to the unmiked speaker in a NIST meeting)


Additional Experiments: MLP Features
● Use the features as inputs to a Multi-Layer Perceptron (MLP) to see if additional gains can be made
● Training:
  – Inputs consist of baseline and either LED or NLED features (41 components)
  – Input context window of 11 frames and 400 hidden units
  – 90/10 split for cross-validation


Additional Experiments: MLP Features
● AMI devtest results:
  – MLP with LEDs better than with NLEDs
  – Addition of the baseline features degrades performance
  – No combination outperforms the NLED features


Conclusions
● Integrating cross-channel analysis with speech activity modeling yields large WER reductions
● Simple cross-channel energy-based features perform very well and are more robust than cross-correlation-based features
● Minimum energy subtraction produces still further gains
● Inclusion of an omnidirectional mic allows crosstalk suppression even for speakers without dedicated microphones
● Still room for improvement, as a significant gap (>2%) exists between automatic and ideal segmentation

Fin
