PERFORMANCE ANALYSIS OF AURORA LARGE VOCABULARY BASELINE SYSTEM

Naveen Parihar and Joseph Picone, Human and Systems Engineering, Department of Electrical and Computer Engineering, Mississippi State University, {parihar, picone}@isip.msstate.edu
D. Pearce, Speech and MultiModal Group, Motorola Labs, UK, bdp003@motorola.com
H. G. Hirsch, Department of Electrical and Computer Science, Niederrhein University, hans-guenter.hirsch@hs-niederrhein.de
URL: www.isip.msstate.edu/projects/ies/publications/conferences/eusipco/2004/distributed_speech/
Introduction: Abstract

In this paper, we present the design and analysis of the baseline recognition system used for the ETSI Aurora large vocabulary (ALV) evaluation. The experimental paradigm is presented along with the results from a number of experiments designed to minimize the computational requirements of the system. The ALV baseline system achieved a WER of 14.0% on the standard 5K Wall Street Journal task, and required 4 xRT for training and 15 xRT for decoding (on an 800 MHz Pentium processor). It is shown that increasing the sampling frequency from 8 kHz to 16 kHz improves performance significantly only for the noisy test conditions. Utterance detection resulted in significant improvements only on the noisy conditions for the mismatched training conditions. Use of the DSR standard VQ-based compression algorithm did not result in a significant degradation. The model mismatch and microphone mismatch resulted in relative increases in WER of 300% and 200%, respectively.
Introduction: Motivation

• ALV goal was at least a 25% relative improvement over the baseline MFCC front end
• Develop a generic baseline LVCSR system with no front-end-specific tuning
• Benchmark the baseline MFCC front end using the generic LVCSR system on six focus conditions: sampling frequency reduction, utterance detection, feature-vector compression, model mismatch, microphone variation, and additive noise
ALV Baseline System Development: ISIP WSJ0 System

Standard context-dependent cross-word HMM-based system:
• Acoustic models: state-tied 4-mixture cross-word triphones
• Language model: WSJ0 5K bigram
• Search: Viterbi one-best using lexical trees for N-gram cross-word decoding
• Lexicon: based on CMUlex
• Performance: 8.0% WER at 85 xRT

[Diagram: training flow, from Training Data through Monophone Modeling, CD-Triphone Modeling, State-Tying, CD-Triphone Modeling, and Mixture Modeling (16)]
ALV Baseline System Development: ETSI WI 007 Front End

The baseline HMM system used an ETSI standard MFCC-based front end:
• Zero-mean debiasing
• 10 ms frame duration
• 25 ms Hamming window
• Absolute energy
• 12 cepstral coefficients
• First and second derivatives

[Diagram: front-end pipeline, from Input Speech through Zero-mean and Pre-emphasis, Fourier Transform, and Energy / Cepstral Analysis]
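The pipeline above maps directly to a short program. Below is a minimal NumPy sketch of a WI 007-style analysis; the 23-channel mel filterbank, the 0.97 pre-emphasis coefficient, and the FFT sizing are assumptions based on ETSI ES 201 108 rather than a verbatim reimplementation of the standard.

```python
import numpy as np

def wi007_style_mfcc(signal, fs=8000, frame_ms=25, shift_ms=10,
                     n_ceps=12, n_filters=23, preemph=0.97):
    """Sketch of a WI 007-style MFCC analysis. The 23-channel filterbank
    and 0.97 pre-emphasis are assumptions based on ETSI ES 201 108."""
    signal = signal - np.mean(signal)                     # zero-mean debiasing
    signal = np.append(signal[0], signal[1:] - preemph * signal[:-1])

    frame_len = int(fs * frame_ms / 1000)                 # 25 ms window
    shift = int(fs * shift_ms / 1000)                     # 10 ms frame shift
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    window = np.hamming(frame_len)
    nfft = 1 << (frame_len - 1).bit_length()              # next power of two

    # Triangular mel filterbank spanning 0 .. fs/2
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = imel(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.floor((nfft + 1) * edges / fs).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # DCT-II rows for cepstral coefficients c1..c12
    k = np.arange(1, n_ceps + 1)[:, None]
    n = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * k * (n + 0.5) / n_filters)

    feats = []
    for t in range(n_frames):
        frame = signal[t * shift:t * shift + frame_len] * window
        log_e = np.log(max(np.sum(frame ** 2), 1e-10))    # absolute energy
        power = np.abs(np.fft.rfft(frame, nfft)) ** 2
        logmel = np.log(np.maximum(fbank @ power, 1e-10))
        feats.append(np.concatenate([dct @ logmel, [log_e]]))
    return np.array(feats)   # deltas and delta-deltas appended downstream
```

Note the absence of cepstral mean subtraction, which matches the "no CMS" property of the WI 007 front end noted on the next slide.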
ALV Baseline System Development: Real-time Reduction

• Derived from ISIP WSJ0 system (with CMS)
• Aurora-4 database terminal filtering resulted in marginal degradation
• ETSI WI 007 front end is 14% worse (no CMS)
• ALV baseline system performance: 14.0% WER
• Real time: 4 xRT for training and 15 xRT for decoding on an 800 MHz Pentium (see the measurement sketch below)

Factor                      WER     Relative Degradation
Baseline system (ISIP)      8.3%    N/A
Terminal filtering          8.4%    1%
ETSI front end              9.6%    14%
Beam adjustment (15 xRT)    11.8%   23%
Reduce 16 to 4 mixtures     14.1%   20%
50% reduction of eval set   14.9%   6%
Endpointing silences        14.0%   -6%
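The xRT figures are simply processing time divided by audio duration. A minimal sketch of how such a measurement can be scripted; the `decode` callable is a hypothetical stand-in for the front end or recognizer being timed:

```python
import time

def real_time_factor(decode, audio, audio_seconds):
    """Return xRT: CPU seconds consumed per second of audio processed.
    `decode` is a hypothetical stand-in for any front end or recognizer."""
    start = time.process_time()
    decode(audio)
    elapsed = time.process_time() - start
    return elapsed / audio_seconds

# Example: a 15 xRT decoder spends 150 CPU seconds on 10 s of audio.
```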
Benchmarking ALV Baseline System: Aurora-4 Database

Acoustic Training:
• Derived from the 5000-word WSJ0 task
• TrS1 (clean) and TrS2 (multi-condition)
• Clean plus 6 noise conditions
• Randomly chosen SNR between 10 and 20 dB
• 2 microphone conditions (Sennheiser and secondary)
• 2 sampling frequencies: 16 kHz and 8 kHz
• G.712 filtering at 8 kHz and P.341 filtering at 16 kHz

Development and Evaluation Sets:
• Derived from the WSJ0 Evaluation and Development sets
• 14 test sets for each
• 7 recorded on the Sennheiser mic.; 7 on a secondary mic.
• Clean plus 6 noise conditions
• Randomly chosen SNR between 5 and 15 dB
• G.712 filtering at 8 kHz and P.341 filtering at 16 kHz
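The multi-condition sets mix noise into clean speech at a randomly chosen SNR. Below is a minimal sketch of that operation, assuming a uniform SNR draw and power-based scaling; the exact Aurora-4 recipe, including its G.712/P.341 filtering, is not reproduced here.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db_range=(10.0, 20.0), rng=None):
    """Mix a noise recording into clean speech at an SNR drawn uniformly
    from the given range. The uniform draw and power-based scaling are
    assumptions about the exact Aurora-4 recipe."""
    rng = rng or np.random.default_rng()
    snr_db = rng.uniform(*snr_db_range)
    noise = np.resize(noise, speech.shape)     # loop/trim noise to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale noise so that 10*log10(p_speech / p_noise_scaled) == snr_db
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```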
Experimental Results: Sampling Frequency Reduction

[Chart: WER (%) on TS1-TS7 at 16 kHz vs. 8 kHz sampling]

• No degradation on the perfectly matched condition (TrS1 and TS1)
• No clear trend on mismatched conditions (TrS1 and TS2-TS14)
• Significant degradation on most matched conditions (TrS2 and TS3-TS14)
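For reference, an 8 kHz condition can be simulated from 16 kHz material by anti-alias filtering and decimation. A minimal SciPy sketch follows; note that the deck's data used ITU G.712 band-limiting, which `decimate`'s default low-pass filter only approximates.

```python
import numpy as np
from scipy.signal import decimate

speech_16k = np.random.randn(16000)   # stand-in for 1 s of 16 kHz audio
# decimate() low-pass filters (order-8 Chebyshev type I by default)
# before keeping every 2nd sample; G.712 filtering would differ in detail.
speech_8k = decimate(speech_16k, 2)
```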
Experimental Results: Utterance Detection

• No significant improvement on the perfectly matched condition (TrS1 and TS1)
• Significant improvement on mismatched conditions (TrS1 and TS2-TS14) due to a reduction in insertions
• No significant improvement on matched conditions (TrS2 and TS1-TS14)

                   W/O Endpointing          With Endpointing
Test Set           Sub.    Del.    Ins.     Sub.    Del.    Ins.
TS2 (Senn., Car)   41.4%   3.6%    20.1%    40.0%   3.6%    13.0%
TS9 (Sec., Car)    54.4%   12.3%   15.1%    49.1%   15.1%   10.1%
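Insertions drop because silence regions, where noise alone can be misrecognized as words, are trimmed before decoding. A crude energy-threshold sketch of utterance detection follows; the detector actually used in the evaluation is not specified here, so treat the margin and framing as assumptions.

```python
import numpy as np

def energy_endpoints(signal, fs=8000, frame_ms=10, margin_db=30.0):
    """Crude energy-based utterance detection: keep everything between the
    first and last frame whose energy is within `margin_db` of the peak.
    A sketch only; the evaluation's actual detector is not specified here."""
    shift = int(fs * frame_ms / 1000)
    n_frames = len(signal) // shift
    if n_frames == 0:
        return signal
    energy = np.array([np.sum(signal[i * shift:(i + 1) * shift] ** 2)
                       for i in range(n_frames)])
    energy_db = 10.0 * np.log10(np.maximum(energy, 1e-12))
    active = np.flatnonzero(energy_db > energy_db.max() - margin_db)
    if active.size == 0:
        return signal                      # nothing detected; pass through
    return signal[active[0] * shift:(active[-1] + 1) * shift]
```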
Experimental Results: Feature-Vector Compression

• Sampling-frequency-specific codebooks: 8 kHz and 16 kHz
• No significant degradation on the perfectly matched condition (TrS1 and TS1)
• No significant degradation on mismatched conditions (TrS1 and TS2-TS14)
• Significant degradation on a few matched conditions: TS3, 8, 9, 10, 12 at 16 kHz sampling and TS7, 12 at 8 kHz sampling
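The DSR compression referenced here is a split vector quantizer: the feature vector is partitioned into small sub-vectors, each coded by nearest-neighbour search in its own codebook. A minimal sketch follows; the pairing of coefficients and the codebook sizes are assumptions, with ETSI ES 201 108 defining the actual tables.

```python
import numpy as np

def split_vq_encode(feats, codebooks):
    """Split-VQ encoding in the spirit of the ETSI DSR standard: the
    feature vector is split into pairs, each quantized against its own
    codebook (nearest neighbour in Euclidean distance). Pairing and
    codebook sizes are assumptions; see ETSI ES 201 108 for the spec."""
    indices = []
    for i, cb in enumerate(codebooks):       # cb: (codebook_size, 2) array
        pair = feats[:, 2 * i:2 * i + 2]     # (n_frames, 2) sub-vector
        dist = ((pair[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        indices.append(dist.argmin(axis=1))  # nearest codeword per frame
    return np.stack(indices, axis=1)         # (n_frames, n_splits) indices

def split_vq_decode(indices, codebooks):
    """Reconstruct features by table lookup of the transmitted indices."""
    return np.concatenate([cb[indices[:, i]]
                           for i, cb in enumerate(codebooks)], axis=1)
```

For example, with seven 64-entry codebooks a 14-dimensional frame costs 42 bits instead of 14 floating-point values; these sizes are illustrative, not the standard's, but they show the flavour of rate reduction DSR targets.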
Experimental Results: Model Mismatch

[Charts: WER (%) on TS1 (clean) and TS2-TS7 for TrS1 vs. TrS2 training]

• Best performance observed on the perfectly matched condition (TrS1 and TS1)
• Significant degradation on mismatched conditions (TrS1 and TS2-TS14)
• Matched conditions (TrS2 and TS1-TS14) significantly better than mismatched conditions
Experimental Results: Microphone Variation

• Train on the Sennheiser mic.; evaluate on the secondary mic.
• The perfectly matched condition (TrS1 and TS1) results in optimal performance
• Significant degradation on mismatched conditions (TrS1 and TS8)
• Less severe degradation when samples of the secondary microphone are seen during training (TrS2)

[Chart: WER (%) for the Sennheiser vs. secondary microphone under TrS1 and TrS2 training]
Experimental Results: Additive Noise

[Charts: WER (%) on TS1 (clean) and TS2-TS7 for TrS1 vs. TrS2 training]

• Performance degrades on noisy conditions when systems are trained only on clean data (mismatched conditions)
• Exposing systems to noise and microphone variations (TrS2) improves performance (matched conditions)
Summary and Conclusions: What have we learned?

• Presented a WSJ0-based LVCSR system that runs at 4 xRT for training and 15 xRT for decoding on an 800 MHz Pentium
• An increase in sampling frequency from 8 kHz to 16 kHz results in significant improvement only on matched noisy test conditions
• Utterance detection resulted in significant improvements only on the noisy conditions for the mismatched training conditions
• VQ-based compression is robust in the DSR environment
• Exposing models to varied noise and microphone conditions improves recognition performance in adverse conditions
Summary and Conclusions: Available Resources

• Aurora Project Website: recognition toolkit, multi-CPU scripts, database definitions, publications, and a performance summary of the baseline MFCC front end
• Speech Recognition Toolkits: compare front ends to standard approaches using a state-of-the-art ASR toolkit
• ETSI DSR Website: reports and front-end standards
Summary and Conclusions: Brief Bibliography

• N. Parihar, Performance Analysis of Advanced Front Ends, M.S. Dissertation, Mississippi State University, December 2003.
• N. Parihar and J. Picone, "An Analysis of the Aurora Large Vocabulary Evaluation," Eurospeech 2003, pp. 337-340, Geneva, Switzerland, September 2003.
• N. Parihar and J. Picone, "DSR Front End LVCSR Evaluation - AU/384/02," Aurora Working Group, European Telecommunications Standards Institute, December 6, 2002.
• D. Pearce, "Overview of Evaluation Criteria for Advanced Distributed Speech Recognition," ETSI STQ-Aurora DSR Working Group, October 2001.
• G. Hirsch, "Experimental Framework for the Performance Evaluation of Speech Recognition Front-ends in a Large Vocabulary Task," ETSI STQ-Aurora DSR Working Group, December 2002.
• "ETSI ES 201 108 v1.1.2 Distributed Speech Recognition; Front-end Feature Extraction Algorithm; Compression Algorithm," ETSI, April 2000.