Dynamic Programming Search for Continuous Speech Recognition

2004/11/3, Presenter: 陳怡婷

Concerned Problems
• Could this huge search space be handled by DP in an efficient way?
• How to decide on a good DP search strategy?
• Does DP compute only the single best sentence?

Outline
• Some problems and the decision rule in speech recognition
• One-Pass DP Search Using a Linear Lexicon
• One-Pass DP Search Using a Tree Lexicon
• One-Pass DP Search for Word Graph Construction

• Difficulties: boundaries of the acoustic signal or of words, speakers, quality of the speech signal, natural language…
• Bayes decision rule (shown in Figure 1): maximize Pr(w1…wN | x1…xT), i.e. Pr(w1…wN) · Pr(x1…xT | w1…wN)
• Language model: syntactic, semantic… (for large vocabulary: a bigram or trigram model)
• Acoustic-phonetic model (training, HMMs, pronunciation lexicon or dictionary)
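The decision rule written out in LaTeX, using the slide's notation (w_1…w_N for the word sequence, x_1…x_T for the acoustic vectors); dropping Pr(x_1…x_T) is the standard step, since it does not depend on the word sequence:

```latex
% Bayes decision rule: choose the word sequence with maximum posterior probability.
\hat{w}_1^N
  = \operatorname*{argmax}_{w_1^N} \Pr(w_1^N \mid x_1^T)
  = \operatorname*{argmax}_{w_1^N} \Pr(w_1^N)\,\Pr(x_1^T \mid w_1^N)
```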

(Figure 1: Bayes decision rule)

Specification of the Search Problem
• Decision on the spoken words
• Language model
• Acoustic-phonetic model
• Pronunciation lexicon (optimization over these knowledge sources)
• Way of thinking: a super HMM for a hypothesized word sequence

• Consider only the most probable state sequence: the Viterbi (maximum) approximation.
• Therefore the search has to be performed at two levels: the state level and the word level.
• Hypotheses are recombined at both levels by DP, combined with beam search.
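A short LaTeX sketch of the maximum (Viterbi) approximation referred to here, in the same notation as the preceding slides; the explicit sum-to-max form is the standard statement and not copied from the slide:

```latex
% Viterbi / maximum approximation: replace the sum over HMM state sequences
% by its dominant term, the single most probable state sequence.
\Pr(x_1^T \mid w_1^N)
  = \sum_{s_1^T} \Pr(x_1^T, s_1^T \mid w_1^N)
  \;\approx\; \max_{s_1^T} \Pr(x_1^T, s_1^T \mid w_1^N)
```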

One-Pass DP Search Using a Linear Lexicon
• Linear lexicon
• For a three-word vocabulary, the search space is shown in the figure.

• Maximum approximation: assign each acoustic vector observed at time t to a (state, word) index pair:
  (s1, w1), …, (st, wt), …, (sT, wT)

• HMMs in the word interior (top) and at word boundaries (bottom)
• Path probability: Pr(w1…wt) · Pr(x1…xt, s1…st | w1…wt)
• Note that the unknown word sequence and the unknown state sequence are determined simultaneously by the DP algorithm.

DP Recursion
• Two quantities:
  Q(t, s; w) := score of the best path up to time t that ends in state s of word w
  B(t, s; w) := start time of the best path up to time t that ends in state s of word w
• The DP recursion uses two types of transition rules for the path: in the word interior and at word boundaries.

• In the word interior:
  Q(t, s; w) = max_{s'} { P(x_t, s | s'; w) · Q(t-1, s'; w) }
  B(t, s; w) = B(t-1, s_max(t, s; w); w)
  where s_max(t, s; w) is the optimum predecessor state for the hypothesis (t, s; w).
• At word boundaries:
  H(w; t) := max_v { P(w | v) · Q(t, S_v; v) }
  where S_v denotes the terminal state of predecessor word v.
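A minimal Python sketch of this time-synchronous recursion, assuming log-domain scores, a simple left-to-right HMM topology, and a bigram LM; the callables `log_emit`, `log_trans`, `log_bigram` and the overall data layout are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict

NEG_INF = float("-inf")

def one_pass_dp_linear(T, words, n_states, log_emit, log_trans, log_bigram):
    """Time-synchronous one-pass DP over a linear lexicon (log domain, bigram LM).

    Illustrative callables (assumptions of this sketch):
      log_emit(t, w, s)        ~ log P(x_t | s; w)
      log_trans(w, s_prev, s)  ~ log P(s | s_prev; w); s_prev = 0 is the word-entry state
      log_bigram(v, w)         ~ log P(w | v); v = None marks the sentence start
    Word w has states 1..n_states[w]; state n_states[w] plays the role of S_w.
    """
    # Q[(w, s)]: score of the best path ending at the current time in state s of word w
    # B[(w, s)]: start time of that path (kept for traceback)
    Q = defaultdict(lambda: NEG_INF)
    B = {}
    # H[(w, t)] corresponds to H(w; t); at t = 0 only the sentence start is active.
    H = {(w, 0): log_bigram(None, w) for w in words}

    for t in range(1, T + 1):
        newQ, newB = defaultdict(lambda: NEG_INF), {}
        for w in words:
            for s in range(1, n_states[w] + 1):
                # Word interior: Q(t, s; w) = max_{s'} P(x_t, s | s'; w) * Q(t-1, s'; w)
                for sp in (s - 1, s):                       # left-to-right topology
                    prev = H.get((w, t - 1), NEG_INF) if sp == 0 else Q[(w, sp)]
                    start = t - 1 if sp == 0 else B.get((w, sp), t - 1)
                    cand = prev + log_trans(w, sp, s) + log_emit(t, w, s)
                    if cand > newQ[(w, s)]:
                        newQ[(w, s)] = cand
                        newB[(w, s)] = start                # B(t, s; w) bookkeeping
        Q, B = newQ, newB
        # Word boundary: H(w; t) = max_v P(w | v) * Q(t, S_v; v)
        for w in words:
            H[(w, t)] = max((Q[(v, n_states[v])] + log_bigram(v, w) for v in words),
                            default=NEG_INF)
    return Q, B, H
```

The best sentence hypothesis at time T is then read off from the terminal-state scores Q[(v, n_states[v])], and the B array supports the traceback mentioned on the next slide.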

• The search procedure works with a time-synchronous breadth-first strategy (Table 2).
• To reduce the storage requirements: a traceback array.
• Pruning strategy: beam search.

Table 2

One-Pass DP Search Using a Tree Lexicon
• For large-vocabulary recognition, the pronunciation lexicon is organized in the form of a prefix tree for efficiency reasons.
• How can the search algorithm be presented for such a context?
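A small Python sketch of how a pronunciation lexicon can be organized as a prefix tree; the toy words and phoneme strings are invented for illustration and are not from the slides.

```python
from collections import defaultdict

def build_prefix_tree(lexicon):
    """Organize a pronunciation lexicon as a prefix (lexical) tree.

    lexicon: dict mapping word -> list of phoneme symbols.
    Returns (children, word_at_node):
      children[node][phoneme] -> child node id (node 0 is the root),
      word_at_node[node]      -> word whose pronunciation ends at that node.
    """
    children = defaultdict(dict)
    word_at_node = {}
    next_id = 1
    for word, phonemes in lexicon.items():
        node = 0
        for ph in phonemes:
            if ph not in children[node]:
                children[node][ph] = next_id
                next_id += 1
            node = children[node][ph]
        word_at_node[node] = word  # the word identity is only known at its end node
    return children, word_at_node

# Toy example: shared prefixes collapse into a single branch of the tree.
lexicon = {"sit": ["s", "ih", "t"], "sitter": ["s", "ih", "t", "er"], "sun": ["s", "ah", "n"]}
children, word_at_node = build_prefix_tree(lexicon)
print(len(word_at_node), "word ends,", sum(len(c) for c in children.values()), "arcs")
```

Because the word identity is only known at a tree leaf, the LM probability cannot be applied at the word start, which is what motivates the language model look-ahead discussed later.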

• Structure of the search space (bigram case): a separate copy of the lexical tree for each predecessor word v (see figure).

DP Recursion
• Q_v(t, s) := score of the best partial path that ends at time t in state s of the lexical tree for predecessor word v.
• B_v(t, s) := start time of the best partial path that ends at time t in state s of the lexical tree for predecessor word v.

• In the tree interior:
  Q_v(t, s) = max_{s'} { P(x_t, s | s') · Q_v(t-1, s') }
  B_v(t, s) = B_v(t-1, s_v_max(t, s))
  where s_v_max(t, s) is the optimum predecessor state for the hypothesis (t, s) and predecessor word v.
• At the word boundaries:
  H(w; t) := max_v { P(w | v) · Q_v(t, S_w) }
  Q_v(t-1, s = 0) = H(v; t-1)
  B_v(t-1, s = 0) = t-1
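The same recursion in LaTeX form; reading S_w as the terminal state of word w within the lexical tree is an assumption of this rendering (the slide does not spell it out):

```latex
% Tree interior: within the lexical-tree copy for predecessor word v
Q_v(t,s) = \max_{s'} \bigl\{ p(x_t, s \mid s')\, Q_v(t-1, s') \bigr\},
\qquad
B_v(t,s) = B_v\bigl(t-1,\, s_v^{\max}(t,s)\bigr)

% Word boundary: recombination over predecessor words v, then restart at the tree root
H(w;t) = \max_{v} \bigl\{ p(w \mid v)\, Q_v(t, S_w) \bigr\},
\qquad
Q_v(t-1, s{=}0) = H(v; t-1),
\quad
B_v(t-1, s{=}0) = t-1
```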

Extension to Trigram Language Models
• The root of each tree copy is labeled with its two-word history.
• The probabilities or costs of each edge depend only on the edge itself.
• The size of the potential search space increases drastically.
• A pruning strategy is even more crucial.

Refinements and Implementation Issues
• Pruning refinements
• Acoustic pruning:
  Q_AC(t) := max_{(v, s)} { Q_v(t, s) }
  prune a hypothesis (t, s; v) if Q_v(t, s) < f_AC · Q_AC(t)
• Language model pruning (word-end pruning):
  Q_LM(t) := max_v { Q_v(t, s = 0) }
  prune if Q_v(t, s = 0) < f_LM · Q_LM(t)
• Histogram pruning: keep at most a fixed number of the best hypotheses per time frame.
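A minimal Python sketch of these three pruning steps over a set of active hypotheses; the record layout and the parameter names (`log_f_ac`, `log_f_lm`, `max_hyps`) are illustrative, and scores are kept in the log domain, so the multiplicative beam factors become additive offsets.

```python
def prune_hypotheses(hyps, log_f_ac, log_f_lm, max_hyps):
    """Apply acoustic, language-model (word-end), and histogram pruning.

    hyps: list of dicts {"v": predecessor word, "s": tree state, "score": log score}.
    Word-end (root) hypotheses are marked by s == 0.
    """
    if not hyps:
        return hyps

    # Acoustic pruning: drop hypotheses far below the best one at this time frame.
    q_ac = max(h["score"] for h in hyps)
    hyps = [h for h in hyps if h["score"] >= q_ac + log_f_ac]

    # Language-model pruning: a tighter beam applied only to word-end hypotheses.
    end_scores = [h["score"] for h in hyps if h["s"] == 0]
    if end_scores:
        q_lm = max(end_scores)
        hyps = [h for h in hyps if h["s"] != 0 or h["score"] >= q_lm + log_f_lm]

    # Histogram pruning: keep at most max_hyps of the best surviving hypotheses.
    if len(hyps) > max_hyps:
        hyps = sorted(hyps, key=lambda h: h["score"], reverse=True)[:max_hyps]
    return hyps

# Toy usage with invented scores; log_f_ac / log_f_lm are negative log beam widths.
active = [{"v": "a", "s": 3, "score": -10.0}, {"v": "b", "s": 0, "score": -12.5},
          {"v": "a", "s": 0, "score": -11.0}, {"v": "c", "s": 7, "score": -25.0}]
print(prune_hypotheses(active, log_f_ac=-8.0, log_f_lm=-1.0, max_hyps=3))
```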

Language Model Look-Ahead
• Incorporate the language model probabilities as early as possible into the search process, where W(s) is the set of words that can be reached from tree state s.
• Incorporate the anticipated LM probabilities into the three pruning operations of the search.
• Why? It reduces the number of state hypotheses. Problem?
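A small Python sketch of one common form of this look-ahead: for every tree node s, precompute the best anticipated LM log probability over the word set W(s) by a bottom-up pass over the lexical tree. The factored form and the data layout (same (children, word_at_node) structure as the earlier prefix-tree sketch) are assumptions for illustration.

```python
import math

NEG_INF = float("-inf")

def lm_lookahead(children, word_at_node, log_lm):
    """Compute pi(s) = max over w in W(s) of log_lm(w) for every tree node s.

    children[node] : dict phoneme -> child node id (node 0 is the root)
    word_at_node   : dict node id -> word ending at that node
    log_lm(w)      : anticipated LM log probability of word w (e.g. log p(w | v)
                     for the tree copy of predecessor v), given as a callable.
    """
    pi = {}

    def visit(node):
        best = log_lm(word_at_node[node]) if node in word_at_node else NEG_INF
        for child in children.get(node, {}).values():
            best = max(best, visit(child))
        pi[node] = best          # best reachable word probability from this node
        return best

    visit(0)
    return pi

# Toy tree: root -> "s" -> {"ih"->"t" (sit), "ah"->"n" (sun)}, with invented LM scores.
toy_children = {0: {"s": 1}, 1: {"ih": 2, "ah": 4}, 2: {"t": 3}, 4: {"n": 5}}
toy_words = {3: "sit", 5: "sun"}
pi = lm_lookahead(toy_children, toy_words, lambda w: math.log({"sit": 0.7, "sun": 0.3}[w]))
print(round(pi[0], 3), round(pi[2], 3), round(pi[4], 3))
```

During search, the anticipated factor pi(s) would be added (in the log domain) to the score used in the pruning comparisons and taken out again once the word end is reached; that intermediate bookkeeping is omitted here.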

Fig. 11

Implementation
• To arrive at an efficient implementation of the tree search:
• Set representation of active hypotheses: a set of active trees, within each tree a set of active arcs, and within each arc a set of active HMM states (state index s, score Q, back pointer).
• Forward DP recombination (reduces the computational cost) at word boundaries, phoneme boundaries, and HMM states.
• Direct access to each new successor hypothesis; an exception: the trigram LM.
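A compact Python sketch of one way such nested sets of active hypotheses could be represented (tree copies, arcs, HMM states); the field names are illustrative and not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class StateHyp:
    s: int                      # HMM state index within the arc
    score: float                # accumulated log score Q
    back_pointer: int           # index into the traceback array

@dataclass
class ActiveArc:
    arc_id: int                 # arc (phoneme) within the lexical tree
    states: List[StateHyp] = field(default_factory=list)

@dataclass
class ActiveTree:
    predecessor: Optional[str]  # predecessor word v labeling this tree copy
    arcs: Dict[int, ActiveArc] = field(default_factory=dict)

# The decoder keeps only the currently active tree copies, and within each copy only
# the active arcs and states, so storage stays far below the full potential search space.
active: Dict[Optional[str], ActiveTree] = {}
```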

Implementation
• Traceback and garbage collection
• Back pointers refer to a special traceback array (word index, end time of the predecessor word, score, and back pointer).
• Apply a garbage collection or purging method: extend each entry by an additional component (a time stamp).
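A minimal Python sketch of such a traceback array with time-stamped purging; the mark-and-keep strategy shown here is one plausible reading of the slide, not the paper's exact procedure.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TracebackEntry:
    word: str          # identity of the word that ended
    pred_end_time: int # end time of the predecessor word
    score: float       # accumulated log score at the word end
    back_pointer: int  # index of the predecessor entry; -1 marks the sentence start
    time_stamp: int    # last time frame at which this entry was still reachable

def garbage_collect(traceback: List[TracebackEntry], active_bps: List[int], t: int):
    """Time-stamp all traceback entries reachable from active hypotheses.

    active_bps: back pointers held by the currently active state hypotheses.
    Entries (and their predecessors) reachable from an active hypothesis get the
    current time stamp t; all other entries can be recycled.  Returns live indices.
    """
    for bp in active_bps:
        while bp >= 0 and traceback[bp].time_stamp != t:
            traceback[bp].time_stamp = t          # mark as still needed
            bp = traceback[bp].back_pointer       # follow the chain backwards
    return [i for i, e in enumerate(traceback) if e.time_stamp == t]
```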

Table 4

One-Pass DP Search for Word Graph Construction
• Main idea: word alternatives in regions of the speech signal.
• Keep track of word sequence hypotheses whose scores are very close to that of the optimal hypothesis but which do not survive the single-best search.
• Represent these word sequences by a word graph; each word sequence in the graph should be close to the single best sentence.

Using the same principle of time synchrony for the word graph, define two quantities:
• the conditional probability that word w produces the acoustic vectors between a hypothesized boundary time and end time t
• the joint probability of observing the acoustic vectors up to time t and a word sequence with end time t
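The two formulas are images in the original slides; a plausible LaTeX reconstruction following the notation of the underlying word-graph literature (the symbol choices h and G, and the boundary time τ, are assumptions of this reconstruction):

```latex
% Acoustic score of word w over the vectors x_{tau+1} .. x_t
h(w;\, \tau, t) := \Pr(x_{\tau+1} \dots x_t \mid w)

% Joint probability of the acoustic vectors x_1 .. x_t and a word sequence ending at t
G(w_1 \dots w_n;\, t) := \Pr(w_1 \dots w_n)\, \Pr(x_1 \dots x_t \mid w_1 \dots w_n)
```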

• Decomposition of this joint probability, where one factor is the language model probability.
• To construct a word graph, introduce a formal definition of the word boundary.

• Exploiting an m-gram language model, we can recombine word sequence hypotheses at the phrase level if they do not differ in their final (m-1) words.
• It is therefore sufficient to distinguish partial word sequence hypotheses by their final words (the pruning strategy is included as well).

Word Pair Approximation
• The crucial assumption now is that the dependence of the word boundary can be confined to the final word pair (v, w).
• Assuming the word pair approximation:
  At every time frame t, consider all word pairs (v, w).
  For each triple (t; v, w), keep track of the word boundary and the word score.
  At the end of the speech signal, the word graph is constructed from these stored triples.
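A minimal Python sketch of this bookkeeping; recording a single boundary and score per (t; v, w) triple follows the word pair approximation, while the exact record layout and the final graph-building step are illustrative assumptions.

```python
class WordGraphBuilder:
    """Record, for each word end (t; v, w), the word boundary tau and the word score,
    then emit the word-graph edges at the end of the utterance."""

    def __init__(self):
        # (t, v, w) -> (boundary time tau, acoustic word score)
        self.entries = {}

    def record_word_end(self, t, v, w, tau, word_score):
        # Under the word pair approximation one boundary per (t; v, w) suffices;
        # keep the better-scoring one if the same triple is hypothesized twice.
        key = (t, v, w)
        if key not in self.entries or word_score > self.entries[key][1]:
            self.entries[key] = (tau, word_score)

    def build(self):
        # Each stored triple becomes an edge: word w spanning (tau, t], labeled with its
        # acoustic score; nodes are time frames.  LM scores are added when the graph is used.
        edges = [(tau, t, w, score) for (t, v, w), (tau, score) in self.entries.items()]
        return sorted(edges)

# Toy usage with invented times and scores:
wg = WordGraphBuilder()
wg.record_word_end(t=42, v="the", w="cat", tau=30, word_score=-55.2)
wg.record_word_end(t=42, v="a", w="cat", tau=31, word_score=-57.0)
print(wg.build())
```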

Fig. 13

• How is this computed?

Table 5

• A third level, the phrase level, is introduced. Depending on whether the phrase-level recognition is carried out in a time-synchronous way:
  Extended one-pass approach
  Two-pass approach, e.g. with a cache-based language model

Fig. 14

• Principal properties:
  There is a maximum for the number of incoming word edges at any node, namely the vocabulary size.
  There is no maximum for the number of outgoing word edges.
• Two refinements of the word graph method:
  Short words
  Long words with identical ending portions may waste search effort.

Table 6

Table 8