Robust Feature Extraction for Automatic Speech Recognition based on Data-driven and Physiologically-motivated Approaches

Mark J. Harvilla¹, Chanwoo Kim², and Richard M. Stern¹,²
¹Electrical and Computer Engineering Department and ²Language Technologies Institute
Carnegie Mellon University, Pittsburgh, PA

Introduction

It is well known that the accuracy of automatic speech recognition (ASR) systems is compromised in high-noise environments. In contrast, humans have a remarkable ability to recognize continuous speech accurately, fairly independently of the environment. This observation suggests that the robustness of ASR systems can be increased by exploiting principles and adopting characteristic mechanisms of the human auditory system (HAS). A contrasting, and quite possibly complementary, approach is to confront the problem from a statistical standpoint: by designing algorithms with statistical optimality in mind, robust systems can be built that are effective but that do not necessarily adhere to any physiological mechanism.

Power-Normalized Cepstral Coefficients (PNCC)

Processing pipeline: input audio → pre-emphasis → STFT → magnitude squared → Gammatone filter bank → peak power normalization → medium-duration power bias subtraction → power-function nonlinearity → DCT & CMN → features.

• PNCC combines multiple properties of the HAS, including:
  • knowledge of the shape of the effective auditory filters related to the cochlear response,
  • the precedence effect (see SSF below),
  • the rate-level nonlinearity.
• PNCC applies power bias subtraction, motivated by the mismatch in the AM-GM ratio between clean and noisy speech.
  • The AM-GM ratio is related to the shape parameter of the Gamma distribution, which characterizes the distribution of the linear power of speech well.
  • The AM-GM ratio can also be used for blind SNR measurement (waveform-amplitude distribution analysis, WADA).
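The AM-GM ratio mentioned above is straightforward to compute. The Python sketch below is illustrative only: it uses synthetic Gamma-distributed power values (not real speech, and not PNCC's actual bias-subtraction procedure) to show that a peaky, speech-like power distribution yields a much larger ratio than near-constant noise power.

```python
import numpy as np

def am_gm_ratio_db(power, eps=1e-12):
    """AM-to-GM ratio (in dB) of a vector of linear power values.

    By the AM-GM inequality the ratio is >= 1 (>= 0 dB). Peaky
    distributions, such as clean-speech subband power, give large
    values; near-constant noise power pushes the ratio toward 0 dB.
    """
    p = np.asarray(power, dtype=float) + eps  # guard against log(0)
    am = np.mean(p)
    gm = np.exp(np.mean(np.log(p)))
    return 10.0 * np.log10(am / gm)

rng = np.random.default_rng(0)
# Small Gamma shape parameter -> peaky, speech-like power;
# large shape parameter -> nearly constant, noise-like power.
speechlike = rng.gamma(shape=0.3, scale=1.0, size=10_000)
noiselike = rng.gamma(shape=20.0, scale=1.0, size=10_000)
assert am_gm_ratio_db(speechlike) > am_gm_ratio_db(noiselike)
```

This is the sense in which additive noise shrinks the AM-GM ratio: it fills in the low-power gaps of speech, raising the geometric mean toward the arithmetic mean.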
The figures to the left illustrate clean and noisy subband speech power signals with and without PNCC processing (bottom and top, respectively).

Objectives

Our general objectives are to:
• Develop portable front-end features that work generally to improve the robustness of speech-based systems, most specifically ASR, but conceivably helpful for any fundamentally speech-based system, such as voice-activity detectors, speaker recognizers, and keyword spotters.
• Maintain the generality of the features so that they are independent of ASR systems, tasks, and other adaptation and normalization techniques.
• Design the features with the overall objective of reducing the mismatch between training and testing data.

Compensatory Spectral Averaging and Warping using Histograms (CSAWH)

Processing pipeline: input audio → pre-emphasis → STFT → magnitude squared → Gammatone filter bank → peak power normalization → power-function nonlinearity → histogram matching → weighted spectral averaging → audio resynthesis → output audio.

• CSAWH is based on the observation that noise significantly alters the characteristic distribution of subband speech power.
• CSAWH applies nonparametric transformations to match the distribution of the input speech to prototype distributions observed over clean reference data.
• The nonlinear transformations do not inherently discriminate between speech and noise; weighted spectral averaging helps to mitigate sporadic suppression of speech or amplification of noise.

The figure to the left illustrates the effect of noise on the distribution of the subband speech power.

Suppression of Slowly-varying components and the Falling edge (SSF)

Processing pipeline: input audio → pre-emphasis → STFT → magnitude squared → Gammatone filter bank → SSF processing → spectral reshaping → inverse STFT → post de-emphasis → output audio.

• SSF is based on the precedence effect, the tendency of the HAS to focus on the first-arriving wavefront of a given sound source.
• By emphasizing onsets, the spectral smearing effect of reverberation can be partially counteracted.
• SSF processing tracks an exponentially weighted moving average M[m, l] of the subband power P[m, l] and retains only the power above it, subject to a floor:

    M[m, l] = λ·M[m−1, l] + (1 − λ)·P[m, l]
    P1[m, l] = max(P[m, l] − M[m, l], c0·P[m, l])   (SSF Type-I)
    P2[m, l] = max(P[m, l] − M[m, l], c0·M[m, l])   (SSF Type-II)

• The spectral reshaping, inverse STFT, and post de-emphasis blocks effectively resynthesize audio; this helps to smooth spectral discontinuities introduced by the nonlinear processing.

The figure below illustrates the ability of SSF to emphasize onsets in reverberation.

Selected Experimental Results

The above plots show results from CMU Sphinx-3 on RM1 in white noise. Below, results are depicted for RATS-like noise on the SRI DECIPHER ASR. In both cases, the left column shows clean training and the right column shows multistyle training.

Below, results for SSF are compared with other standard feature extraction algorithms. The left plot shows results for clean speech in reverberation, and the right plot shows results for speech in music noise. The differences between SSF Type-I and Type-II are pronounced in reverberation but insignificant in the other case.
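CSAWH's histogram matching step can be illustrated with a generic nonparametric quantile-mapping sketch. This is a simplified stand-in, not the published method: the exact estimator, per-subband handling, and spectral weighting of CSAWH may differ, and the data below are synthetic.

```python
import numpy as np

def histogram_match(x, reference, n_quantiles=100):
    """Map samples in x so their empirical distribution matches reference.

    Generic quantile mapping: each input value is sent through the
    monotone map (input quantile -> reference quantile), a common
    nonparametric way to match one distribution to another.
    """
    q = np.linspace(0.0, 1.0, n_quantiles)
    x_q = np.quantile(x, q)            # empirical quantiles of the input
    ref_q = np.quantile(reference, q)  # quantiles of the clean prototype
    return np.interp(x, x_q, ref_q)    # piecewise-linear monotone map

# Hypothetical clean subband log-power, and a "noisy" version whose
# distribution has been compressed and shifted by additive noise:
rng = np.random.default_rng(1)
clean = rng.normal(-2.0, 3.0, 5_000)
noisy = 0.5 * clean + 1.0
matched = histogram_match(noisy, clean)
```

Because the mapping is monotone, it restores the distribution without reordering values in time; as the poster notes, it cannot by itself tell speech from noise, which is why CSAWH follows it with weighted spectral averaging.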
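The SSF update equations above translate directly into code. The following minimal Python sketch implements them on a frames-by-channels subband power matrix; the default λ and c0 values and the initialization M[0] = P[0] are illustrative assumptions, not necessarily the published settings.

```python
import numpy as np

def ssf(P, lam=0.4, c0=0.01, type_ii=False):
    """SSF processing of a subband power matrix P[m, l] (frames x channels).

    Implements the poster's equations:
        M[m, l]  = lam * M[m-1, l] + (1 - lam) * P[m, l]
        Type-I : P1[m, l] = max(P[m, l] - M[m, l], c0 * P[m, l])
        Type-II: P2[m, l] = max(P[m, l] - M[m, l], c0 * M[m, l])
    """
    P = np.asarray(P, dtype=float)
    M = np.zeros_like(P)
    M[0] = P[0]  # assumed initialization
    for m in range(1, P.shape[0]):  # exponentially weighted moving average
        M[m] = lam * M[m - 1] + (1 - lam) * P[m]
    floor = c0 * (M if type_ii else P)
    return np.maximum(P - M, floor)

# A step onset: SSF keeps the onset frame strong and floors the tail.
P = np.ones((10, 1))
P[3:] = 5.0                      # sudden power onset at frame 3
out = ssf(P)
assert out[3, 0] > out[9, 0]     # onset frame exceeds the steady state
```

The onset frame retains most of its power because the moving average M has not yet caught up, while the slowly varying tail is suppressed to the floor; this is the mechanism that counteracts the smeared reverberant energy.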