A Sparse Modeling Approach to Speech Recognition Using Kernel Machines

A Sparse Modeling Approach to Speech Recognition Using Kernel Machines
Jon Hamaker (hamaker@isip.msstate.edu)
Institute for Signal and Information Processing, Mississippi State University
Jon Hamaker / Microsoft / May 2003

Abstract
Statistical techniques based on hidden Markov models (HMMs) with Gaussian emission densities have dominated the signal processing and pattern recognition literature for the past 20 years. However, HMMs suffer from an inability to learn discriminative information and are prone to overfitting and over-parameterization. Recent work in machine learning has focused on models, such as the support vector machine (SVM), that automatically control generalization and parameterization as part of the overall optimization process. SVMs have been shown to provide significant improvements in performance on small pattern recognition tasks compared to a number of conventional approaches. SVMs, however, require ad hoc (and unreliable) methods to couple them to probabilistic learning machines. Probabilistic Bayesian learning machines, such as the relevance vector machine (RVM), are fairly new approaches that attempt to overcome the deficiencies of SVMs by explicitly accounting for sparsity and statistics in their formulation. In this presentation, we describe both of these modeling approaches in brief. We then describe our work to integrate them as acoustic models in large-vocabulary speech recognition systems. Particular attention is given to algorithms for training these learning machines on large corpora. In each case, we find that both SVM- and RVM-based systems perform better than Gaussian-mixture-based HMMs in open-loop recognition. We further show that the RVM-based solution performs on par with the SVM system using an order of magnitude fewer parameters. We conclude with a discussion of the remaining hurdles to providing this technology in a form amenable to current state-of-the-art recognizers.

Bio
Jon Hamaker is a Ph.D. candidate in the Department of Electrical and Computer Engineering at Mississippi State University under the supervision of Dr. Joe Picone. He has been a senior member of the Institute for Signal and Information Processing (ISIP) at MSU since 1996. Mr. Hamaker's research work has revolved around automatic structural analysis and optimization methods for acoustic modeling in speech recognition systems. His most recent work has been in the application of kernel machines as replacements for the underlying Gaussian distribution in hidden Markov acoustic models. His dissertation work compares the popular support vector machine with the relatively new relevance vector machine in the context of a speech recognition system. Mr. Hamaker has co-authored 4 journal papers (2 under review), 22 conference papers, and 3 invited presentations during his graduate studies at Mississippi State (http://www.isip.msstate.edu/publications). He also spent two summers as an intern at Microsoft in the recognition engine group.

Outline
• The acoustic modeling problem for speech
• Current state-of-the-art
• Discriminative approaches
• Structural optimization and Occam’s Razor
• Support vector classifiers
• Relevance vector classifiers
• Coupling vector machines to ASR systems
• Scaling relevance vector methods to “real” problems
• Extensions of this work

ASR Problem
(Figure: input speech → acoustic front-end → search, driven by the language model p(W) and the statistical acoustic models p(A|W) → recognized utterance. The acoustic models are the focus of this work.)
• The front-end maintains information important for modeling in a reduced parameter set
• The language model typically predicts a small set of next words based on knowledge of a finite number of previous words (N-grams)
• The search engine uses knowledge sources and models to choose amongst competing hypotheses

Acoustic Confusability
Requires reasoning under uncertainty!
• Regions of overlap represent classification error
• Reduce overlap by introducing acoustic and linguistic context
Comparison of “aa” in “lOck” and “iy” in “bEAt” for SWB

Probabilistic Formulation
To deal with the uncertainty, we typically formulate speech as a probabilistic problem:
• Objective: minimize the word error rate by maximizing P(W|A)
• Approach: maximize P(A|W) during training
• Components:
  • P(A|W): acoustic model
  • P(W): language model
  • P(A): acoustic probability (ignored during maximization)
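
Since P(A) is constant over candidate word sequences, the decoder only needs the argmax of P(A|W)P(W). A minimal sketch of that decision rule, with invented words and scores (not from the talk):

```python
import math

def decode(log_acoustic, log_lm):
    """Pick the word sequence W maximizing P(A|W) * P(W).

    log_acoustic: dict mapping candidate W -> log P(A|W)
    log_lm:       dict mapping candidate W -> log P(W)
    P(A) is the same for every W, so it drops out of the argmax.
    """
    return max(log_acoustic, key=lambda w: log_acoustic[w] + log_lm[w])

# Toy scores: the acoustic model slightly prefers "their",
# but the language model overrules it.
log_p_a_w = {"there": math.log(0.30), "their": math.log(0.35)}
log_p_w   = {"there": math.log(0.60), "their": math.log(0.10)}
print(decode(log_p_a_w, log_p_w))  # -> there
```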

Acoustic Modeling - HMMs
• HMMs model temporal variation in the transition probabilities of the state machine
• GMM emission densities are used to account for variations in speaker, accent, and pronunciation
• Sharing model parameters is a common strategy to reduce complexity
(Figure: a five-state HMM, s0–s4, with word arcs such as THREE, TWO, FIVE, EIGHT)

Maximum Likelihood Training
• Data-driven modeling supervised only from a word-level transcription
• Approach: maximum likelihood estimation
• The EM algorithm is used to improve our estimates: guaranteed convergence to a local maximum
  • No guard against overfitting!
• Computationally efficient training algorithms (Forward-Backward) have been crucial
• Decision trees are used to optimize parameter sharing, minimize system complexity, and integrate additional linguistic knowledge
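
The Forward-Backward machinery rests on the forward recursion, which computes the data likelihood P(O|model) efficiently. A minimal discrete-HMM sketch, with a toy model whose numbers are invented for illustration:

```python
def forward(pi, A, B, obs):
    """Forward algorithm: total likelihood P(O | model) for a discrete HMM.

    pi:  initial state probabilities, length N
    A:   N x N transition matrix, A[i][j] = P(state j at t+1 | state i at t)
    B:   N x M emission matrix, B[i][k] = P(symbol k | state i)
    obs: observation sequence as a list of symbol indices
    """
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]   # initialization
    for o in obs[1:]:                                   # induction
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)                                   # termination

# Toy 2-state, 2-symbol model (illustrative numbers only).
pi = [0.8, 0.2]
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
print(forward(pi, A, B, [0, 1, 0]))
```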

Drawbacks of Current Approach
• ML convergence does not translate to optimal classification
• Error arises from incorrect modeling assumptions
• Finding the optimal decision boundary requires only one parameter!

Drawbacks of Current Approach
• Data not separable by a hyperplane – a nonlinear classifier is needed
• Gaussian MLE models tend toward the center of mass – overtraining leads to poor generalization

Acoustic Modeling
Acoustic models must:
• Model the temporal progression of the speech
• Model the characteristics of the sub-word units
We would also like our models to:
• Optimally trade off discrimination and representation
• Incorporate Bayesian statistics (priors)
• Make efficient use of parameters (sparsity)
• Produce confidence measures of their predictions for higher-level decision processes

Paradigm Shift - Discriminative Modeling
• Discriminative training (Maximum Mutual Information Estimation)
  • Essential idea: maximize the numerator (ML term), minimize the denominator (discriminative term)
• Discriminative modeling (e.g. ANN hybrids – Bourlard and Morgan)
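
The slide's equation did not survive extraction; the standard MMIE objective it refers to can be reconstructed as:

```latex
F_{\mathrm{MMIE}}(\lambda)
  = \sum_{r=1}^{R} \log
    \frac{P_\lambda(A_r \mid W_r)\, P(W_r)}
         {\sum_{W'} P_\lambda(A_r \mid W')\, P(W')}
```

The numerator is the ML term for the correct transcription W_r of utterance r; the denominator sums over competing hypotheses W', so maximizing F boosts the correct model while suppressing confusable ones.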

Research Focus
Our research: replace the Gaussian likelihood computation with a machine that incorporates notions of:
• Discrimination
• Bayesian statistics (prior information)
• Confidence
• Sparsity
All while maintaining computational efficiency

ANN Hybrids
(Figure: an ANN maps an input feature vector to class posteriors P(c1|o) … P(cn|o))
Architecture:
• The ANN provides flexible, discriminative classifiers for emission probabilities that avoid HMM independence assumptions (can use wider acoustic context)
• Trained using Viterbi iterative training (hard decision rule), or trained to learn Baum-Welch targets (soft decision rule)
Shortcomings:
• Prone to overfitting: requires cross-validation to determine when to stop training; needs methods to automatically penalize overfitting
• No substantial recognition improvements over HMM/GMM

Structural Optimization
(Figure: error vs. model complexity – training-set error falls monotonically while open-loop error reaches an optimum and then rises)
• Structural optimization is often guided by an Occam’s Razor approach
• Trading goodness of fit against model complexity
• Examples: MDL, BIC, AIC, Structural Risk Minimization, Automatic Relevance Determination

Structural Risk Minimization
• Expected risk: cannot be estimated directly, since P(x, y) is unknown
• Empirical risk: measured on the training set
• The two are related through the VC dimension, h
• Approach: choose the machine that gives the least upper bound on the actual risk
(Figure: bound on the expected risk = empirical risk + VC confidence; the optimum trades the two off as the VC dimension grows)
• The VC dimension is a measure of the complexity of the learning machine
• A higher VC dimension gives a looser bound on the actual risk – thus penalizing a more complex model (Vapnik)
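
The slide's formulas were lost in extraction; the standard Vapnik bound it describes, holding with probability 1 − η for ℓ training samples and VC dimension h, is:

```latex
R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha)
  \;+\; \sqrt{\frac{h\bigl(\log(2\ell/h) + 1\bigr) - \log(\eta/4)}{\ell}}
```

Here R(α) is the expected risk under the unknown P(x, y) and R_emp(α) is the training-set average loss; the square-root term is the "VC confidence" that grows with h, penalizing more complex machines.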

Support Vector Machines
(Figure: two separable classes; hyperplanes C0–C2 all separate the data, with H1 and H2 bounding the margin around the optimal classifier C0; w is the normal to the hyperplane)
Optimization, separable data:
• Hyperplane: defined by its normal w and a bias
• Constraints: every training point lies on the correct side with at least unit margin
• Hyperplanes C0–C2 achieve zero empirical risk; C0 generalizes optimally
• The data points that define the boundary are the support vectors
• Quadratic optimization of a Lagrangian functional minimizes the risk criterion (maximizes the margin); only a small portion of the data become support vectors
• Final classifier: a weighted sum over the support vectors

SVMs as Nonlinear Classifiers
• Data for practical applications is typically not separable using a hyperplane in the original input feature space
• Transform the data to a higher dimension where a hyperplane classifier is sufficient to model the decision surface
• Kernels are used for this transformation
• Final classifier: f(x) = sign(Σi αi yi K(xi, x) + b)
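
A sketch of the kernel idea: an RBF kernel (the kernel type used later in the talk, with gamma = 0.5) and the resulting kernel-expansion classifier. The support vectors and weights below are hand-picked for illustration, not trained:

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq)

def svm_decision(x, support_vectors, alphas, labels, b, gamma=0.5):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b; classify by sign."""
    s = sum(a * y * rbf_kernel(sv, x, gamma)
            for sv, a, y in zip(support_vectors, alphas, labels))
    return s + b

# Toy model: one "support vector" per class, equal weights.
svs    = [(0.0, 0.0), (2.0, 2.0)]
alphas = [1.0, 1.0]
labels = [+1, -1]
print(svm_decision((0.1, 0.0), svs, alphas, labels, b=0.0))  # positive: near class +1
```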

SVMs for Non-Separable Data
• No hyperplane could achieve zero empirical risk (in any dimension space!)
• Recall the SRM principle: trade off empirical risk and model complexity
• Relax our optimization constraint to allow for errors on the training set
• A new parameter, C, must be estimated to optimally control the trade-off between training-set errors and model complexity
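
The role of C can be made concrete in the soft-margin primal objective, 0.5·||w||² + C·Σ hinge losses. The sketch below minimizes it by plain subgradient descent on a linear model — a toy stand-in for the quadratic program the slides describe, with invented data and hyperparameters:

```python
def train_linear_svm(points, labels, C=1.0, lr=0.01, epochs=200):
    """Minimize 0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i + b))
    by per-sample subgradient descent (toy stand-in for the QP solver).
    Larger C punishes training errors more; smaller C favors a wide margin."""
    dim = len(points[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            for i in range(dim):
                # Regularizer gradient always applies; hinge term only if margin < 1.
                g = w[i] - (C * y * x[i] if margin < 1 else 0.0)
                w[i] -= lr * g
            if margin < 1:
                b += lr * C * y
    return w, b

# Linearly separable toy data in one dimension.
pts = [(-2.0,), (-1.0,), (1.0,), (2.0,)]
ys  = [-1, -1, 1, 1]
w, b = train_linear_svm(pts, ys)
print(w, b)
```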

SVM Drawbacks
• Uses a binary (yes/no) decision rule
  • Generates a distance from the hyperplane, but this distance is often not a good measure of our “confidence” in the classification
  • Can produce a “probability” as a function of the distance (e.g. using sigmoid fits), but these are inadequate
• Number of support vectors grows linearly with the size of the data set
• Requires the estimation of a trade-off parameter, C, via held-out sets
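
The sigmoid fit mentioned above (Platt-style) maps SVM decision values to posteriors P(y=1|f) = 1/(1+exp(Af+B)). The gradient-descent fit and the toy decision values below are illustrative assumptions, not the procedure used in the talk:

```python
import math

def fit_sigmoid(scores, labels, lr=0.1, epochs=500):
    """Fit a Platt-style posterior P(y=1|f) = 1 / (1 + exp(A*f + B))
    to decision values by gradient descent on the cross-entropy."""
    A, B = 0.0, 0.0
    for _ in range(epochs):
        for f, y in zip(scores, labels):      # y in {0, 1}
            p = 1.0 / (1.0 + math.exp(A * f + B))
            # Cross-entropy gradients: dL/dA = (y - p)*f, dL/dB = (y - p)
            A -= lr * (y - p) * f
            B -= lr * (y - p)
    return A, B

def posterior(f, A, B):
    return 1.0 / (1.0 + math.exp(A * f + B))

# Toy decision values: positive scores belong to class 1.
fs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]
A, B = fit_sigmoid(fs, ys)
print(posterior(2.0, A, B), posterior(-2.0, A, B))
```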

Evidence Maximization
• Build a fully specified probabilistic model – incorporate prior information/beliefs as well as a notion of confidence in predictions
• MacKay posed a special form of regularization in neural networks – sparsity
• Evidence maximization: evaluate candidate models based on their “evidence”, P(D|Hi)
• Structural optimization by maximizing the evidence across all candidate models!
• Steeped in Gaussian approximations

Evidence Framework
(Figure: the posterior P(w|D, Hi), of width Δw, is concentrated relative to the prior P(w|Hi), of width σw)
Evidence approximation:
• The likelihood of the data given the best-fit parameter set
• A penalty that measures how well our posterior model fits our prior assumptions
• We can set the prior in favor of sparse, smooth models!
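
The evidence approximation the slide lists in words is, in MacKay's standard form (reconstructed, since the slide's equation was lost):

```latex
P(D \mid H_i) \;\approx\;
  \underbrace{P(D \mid w_{\mathrm{MP}}, H_i)}_{\text{best-fit likelihood}}
  \;\times\;
  \underbrace{P(w_{\mathrm{MP}} \mid H_i)\,\Delta w}_{\text{Occam factor}}
```

The first factor rewards fitting the data at the most probable weights w_MP; the Occam factor, roughly the ratio of posterior width Δw to prior width, penalizes models whose priors spread probability over parameters the data then rules out.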

Relevance Vector Machines
• A kernel-based learning machine
• Incorporates an automatic relevance determination (ARD) prior over each weight (MacKay)
• A flat (non-informative) prior over the hyperparameters α completes the Bayesian specification

Relevance Vector Machines
• The goal in training becomes finding the hyperparameters that maximize the evidence
• Estimation of the “sparsity” parameters is inherent in the optimization – no need for a held-out set!
• A closed-form solution to this maximization problem is not available; rather, we iteratively re-estimate the weights and hyperparameters
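
The iterative re-estimation can be illustrated with the standard hyperparameter update from Tipping's RVM formulation; the toy numbers and the pruning threshold below are invented for illustration:

```python
def reestimate_alphas(alphas, mu, sigma_diag, prune_at=1e6):
    """One RVM hyperparameter update (Tipping, 2001):
        gamma_i = 1 - alpha_i * Sigma_ii   (how well-determined w_i is)
        alpha_i <- gamma_i / mu_i**2
    mu and sigma_diag are the posterior mean and the diagonal of the
    posterior covariance of the weights. Weights whose alpha diverges
    are pruned from the model -- this is the source of sparsity.
    """
    new_alphas, keep = [], []
    for a, m, s in zip(alphas, mu, sigma_diag):
        gamma = 1.0 - a * s
        a_new = gamma / (m * m) if m != 0.0 else float("inf")
        new_alphas.append(a_new)
        keep.append(a_new < prune_at)
    return new_alphas, keep

# Toy posterior: weight 0 is well-determined (large mean, small variance);
# weight 1 hovers near zero and gets driven toward pruning.
alphas, keep = reestimate_alphas([1.0, 1.0], [2.0, 1e-5], [0.1, 0.99])
print(alphas, keep)
```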

Laplace’s Method
• Fix α and estimate w (e.g. by gradient descent)
• Use the Hessian to approximate the covariance of a Gaussian posterior of the weights, centered at the most probable weights
• With the mean and covariance of this Gaussian approximation in hand, we re-estimate α by maximizing the approximated evidence

RVMs Compared to SVMs

RVM:
• Data: class labels (0, 1)
• Goal: learn the posterior, P(t=1|x)
• Structural optimization: hyperprior distribution encourages sparsity
• Training: iterative – O(N³)

SVM:
• Data: class labels (-1, 1)
• Goal: find the optimal decision surface under constraints
• Structural optimization: trade-off parameter that must be estimated
• Training: quadratic – O(N²)

Simple Example

ML Comparison

SVM Comparison

SVM With Sigmoid Posterior Comparison

RVM Comparison

Experimental Progression
• Proof of concept on speech classification data
• Coupling classifiers to the ASR system
• Reduced-set tests on the Alphadigits task
• Algorithms for scaling up RVM classifiers
• Further tests on the Alphadigits task (still not the full training set, though!)
• New work aiming at larger data sets and HMM decoupling

Vowel Classification
• Deterding vowel data: 11 vowels spoken in an “h*d” context; 10 log-area parameters; 528 train, 462 speaker-independent test

Approach                  | % Error | # Parameters
SVM: Polynomial Kernels   | 49%     |
K-Nearest Neighbor        | 44%     |
Gaussian Node Network     | 44%     |
SVM: RBF Kernels          | 35%     | 83 SVs
Separable Mixture Models  | 30%     |

Coupling to ASR
• Data size: 30 million frames of data in the training set
  • Solution: segmental phone models
• Source for segmental data:
  • Solution: use the HMM system in a bootstrap procedure
  • Could also build a segment-based decoder
• Probabilistic decoder coupling:
  • SVMs: sigmoid-fit posterior
  • RVMs: naturally probabilistic
(Figure: a k-frame segment, e.g. “hh aw aa r y uw”, divided into region 1 (0.3k frames), region 2 (0.4k frames), and region 3 (0.3k frames); the mean of each region forms the segmental feature)
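
The 0.3/0.4/0.3 region means can be sketched directly. This helper is an illustrative assumption, not ISIP's actual converter, and it assumes a segment long enough that every region is non-empty:

```python
def segmental_features(frames):
    """Collapse a variable-length segment of feature frames into a
    fixed-length vector: the mean of the first ~30%, middle ~40%, and
    last ~30% of the frames, concatenated. Assumes len(frames) >= 3."""
    k = len(frames)
    b1, b2 = max(1, round(0.3 * k)), max(2, round(0.7 * k))
    regions = [frames[:b1], frames[b1:b2], frames[b2:]]
    dim = len(frames[0])
    feat = []
    for r in regions:
        feat.extend(sum(f[d] for f in r) / len(r) for d in range(dim))
    return feat

# Ten one-dimensional frames -> three region means (3/4/3 frames).
frames = [[float(i)] for i in range(10)]
print(segmental_features(frames))  # [1.0, 4.5, 8.0]
```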

Coupling to ASR System
(Figure: system diagram – mel-cepstral features drive HMM recognition, which produces segment information and an N-best list; a segmental converter turns the features into segmental features; the hybrid decoder rescores the N-best list to produce the final hypothesis)

Alphadigit Recognition
• OGI Alphadigits: continuous, telephone-bandwidth letters and numbers (“A 19 B 4 E”)
• Reduced training-set size for the RVM comparison: 2000 training segments per phone model
  • Could not, at this point, run larger sets efficiently
• 3329 utterances tested using 10-best lists generated by the HMM decoder
• SVM and RVM system architectures are nearly identical: RBF kernels with gamma = 0.5
  • The SVM requires the sigmoid posterior estimate to produce likelihoods, so the sigmoid parameters must also be estimated

SVM Alphadigit Recognition

Transcription | Segmentation | SVM   | HMM
Hypothesis    | N-best       | 11.0% | 11.9%
Reference     | N-best+Ref   | 3.3%  | 6.3%

• The HMM system uses cross-word state-tied triphones with 16-mixture Gaussian models
• The SVM system uses monophone models with segmental features
• A system combination experiment yields another 1% reduction in error

SVM/RVM Alphadigit Comparison

Approach | Error Rate | Avg. # Parameters | Training Time | Testing Time
SVM      | 16.4%      | 257               | 0.5 hours     | 30 mins
RVM      | 16.2%      | 12                | 30 days       | 1 min

• RVMs yield a large reduction in the parameter count while attaining superior performance
• The computational cost of RVMs lies mainly in training, but it is still prohibitive for larger sets

Scaling Up
• Central to RVM training is the inversion of an M×M Hessian matrix: an O(N³) operation initially
• Solutions:
  • Constructive approach: start with an empty model and iteratively add candidate parameters; M is typically much smaller than N
  • Divide-and-conquer approach: divide the complete problem into a set of sub-problems, and iteratively refine the candidate parameter set according to the sub-problem solutions; M is user-defined

Constructive Approach (Tipping and Faul, MSR Cambridge)
• Define the marginal likelihood’s dependence on each hyperparameter αi in isolation
• The marginal likelihood has a unique maximum with respect to each αi
• The results give a set of rules for adding vectors to the model, removing vectors from the model, or updating parameters in the model

Constructive Approach Algorithm

  Prune all parameters;
  While not converged
      For each parameter:
          If parameter is pruned: checkAddRule
          Else: checkPruneRule; checkUpdateRule
      End
      Update model
  End

• Begin with all weights set to zero and iteratively construct an optimal model without evaluating the full N×N inverse
• Formulated for RVM regression – can have oscillatory behavior for classification
• The rule subroutines require the full design matrix: an N×N storage requirement
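
The add/prune/update rules can be sketched from Tipping and Faul's analysis, in which each candidate basis function has a "sparsity" factor s_i and a "quality" factor q_i. The function below is a schematic of that rule set under those definitions, exercised with invented toy values:

```python
import math

def decide_action(s, q, in_model):
    """Per-basis decision from the fast marginal-likelihood analysis
    (Tipping & Faul): the evidence has a unique maximum w.r.t. alpha_i at
        alpha_i = s_i**2 / (q_i**2 - s_i)   if q_i**2 > s_i   (finite)
        alpha_i = infinity                   otherwise         (excluded)
    A basis with finite optimal alpha is added (if absent) or updated
    (if present); one with infinite optimal alpha is pruned or skipped.
    """
    relevant = q * q > s
    if relevant:
        alpha = s * s / (q * q - s)
        return ("update" if in_model else "add"), alpha
    return ("prune" if in_model else "skip"), math.inf

print(decide_action(s=1.0, q=2.0, in_model=False))  # adds with alpha = 1/3
print(decide_action(s=4.0, q=1.0, in_model=True))   # prunes (alpha -> inf)
```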

Iterative Reduction Algorithm
(Figure: at iteration I, the candidate pool is split into subsets 0…J; each subset is trained, and its relevance vectors feed the pool for iteration I+1)
• O(M³) in run-time and O(M×N) in memory; M is a user-defined parameter
• Assumes that if P(wk=0 | wI,J, D) is 1, then P(wk=0 | w, D) is also 1! Optimality?
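
The divide-and-conquer loop can be sketched generically. Here `train_subset` is a stand-in for a full RVM optimization over one subset (the mock below just keeps odd-indexed "vectors" so the control flow can be exercised); everything about this sketch is an illustrative assumption:

```python
def iterative_reduction(candidates, train_subset, subset_size):
    """Divide-and-conquer refinement: split the candidate pool into
    subsets of at most `subset_size`, keep only each subset's surviving
    'relevance vectors', and repeat until the pool fits in one subset.
    Only subset_size x subset_size problems are ever solved."""
    pool = list(candidates)
    while len(pool) > subset_size:
        next_pool = []
        for i in range(0, len(pool), subset_size):
            next_pool.extend(train_subset(pool[i:i + subset_size]))
        if len(next_pool) == len(pool):   # nothing pruned -> stop refining
            break
        pool = next_pool
    return train_subset(pool)

# Mock 'trainer': pretend vectors with odd ids are the relevant ones.
mock_train = lambda subset: [c for c in subset if c % 2 == 1] or subset[:1]
print(iterative_reduction(range(20), mock_train, subset_size=5))
```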

Alphadigit Recognition

Approach         | Error Rate | Avg. # Parameters | Training Time | Testing Time
SVM              | 15.5%      | 994               | 3 hours       | 1.5 hours
RVM Constructive | 14.8%      | 72                | 5 days        | 5 mins
RVM Reduction    | 14.8%      | 74                | 6 days        | 5 mins

• Data increased to 10000 training vectors
• The reduction method has been trained on up to 100k vectors (on a toy task); this is not possible for the constructive method

Summary
• First to apply kernel machines as acoustic models
• Comparison of two machines that apply structural optimization to learning: SVM and RVM
• Performance exceeds that of the HMM, but with quite a bit of HMM interaction
• Algorithms for increased data sizes are key

Decoupling the HMM
• Still want to use segmental data (data size)
• Want the kernel-machine acoustic model to determine an optimal segmentation, though
• Need a new decoder
  • Hypothesize each phone for each possible segment
  • Pruning is a huge issue
  • A stack decoder is beneficial
• Status: in development

Improved Iterative Algorithm
(Figure: subset 0 is trained, its relevance vectors rejoin the candidate pool, the pool is trained with subset 1, and so on)
• Same principle of operation
• One pass over the data – much faster!
• Status: equivalent performance on all benchmarks – running on Alphadigits now

Active Learning for RVMs
• Idea: given the current model, iteratively choose a subset of points from the full training set that will improve system performance
• Problem #1: “performance” is typically defined as classifier error rate (e.g. boosting). What about the accuracy of the posterior estimate?
• Problem #2: for kernel machines, an added training point can:
  • Assist in bettering the model performance
  • Become part of the model itself!
  • How do we determine which points should be added?
• Look to work in Gaussian processes (Lawrence, Seeger, Herbrich, 2003)

Extensions
• Not ready for prime time as an acoustic model
• How else might we use the same techniques for speech?
• Online speech/noise classification?
  • Requires adaptation methods
• Application of automatic relevance determination to model selection for HMMs?

Acknowledgments
• Collaborators: Aravind Ganapathiraju and Joe Picone at Mississippi State
• Consultants: Michael Tipping (MSR Cambridge) and Thorsten Joachims (now at Cornell)