Automatic Speech Recognition CS 4705

• Opportunity to participate in a new user study for Newsblaster and get $25-$30 for 2.5-3 hours of time respectively.
• http://www1.cs.columbia.edu/~delson/study.html
• More opportunities will be coming…

What is speech recognition?
• Transcribing words?
• Understanding meaning?
• Today:
– Overview
– ASR issues
– Building an ASR system
– Using an ASR system
– Future research

“It’s hard to... recognize speech / wreck a nice beach”
• Speaker variability: within and across speakers
• Recording environment varies with respect to noise
• Transcription task must handle all of this and produce a transcript of what was said, from limited, noisy information in the speech signal
– Success: low word error rate (WER)
• WER = (S + I + D) / N * 100
– “Thesis test” vs. “This is a test”: 75% WER
• Understanding task must do more: from words to meaning
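
The WER formula above is computed by aligning hypothesis against reference with a standard edit-distance dynamic program; a minimal sketch (function name is my own):

```python
def wer(ref, hyp):
    """Word error rate: (S + I + D) / N * 100, via word-level edit distance."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = min edits turning the first i reference words into the first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i          # deleting i reference words
    for j in range(len(h) + 1):
        d[0][j] = j          # inserting j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])  # substitution (free if equal)
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r) * 100
```

On the slide's example, "thesis test" against "this is a test" needs one substitution and two deletions over four reference words, giving the stated 75%.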

– Measure concept accuracy (CA) of a string in terms of accuracy of recognition of domain concepts mentioned in the string and their values
I want to go from Boston to Baltimore on September 29
– Domain concepts and values:
– source city: Boston
– target city: Baltimore
– travel date: September 29
– Score recognized string “Go from Boston to Washington on December 29” (1/3 = 33% CA)
– “Go to Boston from Baltimore on September 29”
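
CA can be scored by comparing slot/value pairs between the expected and recognized interpretation; a toy sketch using the slide's first example (slot names follow the slide, the function name is my own):

```python
def concept_accuracy(expected, recognized):
    """Percentage of expected concept/value pairs recognized correctly."""
    correct = sum(1 for slot, value in expected.items()
                  if recognized.get(slot) == value)
    return correct / len(expected) * 100

expected = {"source city": "Boston", "target city": "Baltimore",
            "travel date": "September 29"}
# concepts extracted from "Go from Boston to Washington on December 29"
hyp = {"source city": "Boston", "target city": "Washington",
       "travel date": "December 29"}
```

Only the source city survives recognition, so the score is 1/3, the 33% CA on the slide.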

Again, the Noisy Channel Model: Source → Noisy Channel → Decoder
– Input to channel: spoken sentence s
– Output from channel: an observation O
– Decoding task: find s’ = argmax_s P(s|O)
– Using Bayes’ Rule: P(s|O) = P(O|s) P(s) / P(O)
– And since P(O) doesn’t change for any hypothetical s’:
– s’ = argmax_s P(O|s) P(s)
– P(O|s) is the observation likelihood, or Acoustic Model, and P(s) is the prior, or Language Model
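
Since P(O) is constant across hypotheses, decoding reduces to maximizing P(O|s) P(s); a toy illustration with invented probabilities, using logs as real decoders do to avoid underflow:

```python
import math

def decode(am, lm, hypotheses):
    """Pick s' = argmax_s P(O|s) P(s), combining acoustic and language scores in log space."""
    return max(hypotheses,
               key=lambda s: math.log(am[s]) + math.log(lm[s]))

# toy numbers: the acoustic model slightly prefers the homophone string,
# but the language model prior rescues the intended one
am = {"recognize speech": 0.3, "wreck a nice beach": 0.4}   # P(O|s)
lm = {"recognize speech": 0.01, "wreck a nice beach": 0.0001}  # P(s)
```

Here 0.3 × 0.01 beats 0.4 × 0.0001, so the decoder picks "recognize speech" despite the weaker acoustic score.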

What do we need to build and use an ASR system?
• Corpora for training and testing of components
• Feature extraction component
• Pronunciation Model
• Acoustic Model
• Language Model
• Algorithms to search the hypothesis space efficiently

Training and Test Corpora
• Collect corpora appropriate for the recognition task at hand
– Small speech corpus + phonetic transcription, to associate sounds with symbols (Acoustic Model)
– Large (>= 60 hrs) speech corpus + orthographic transcription, to associate words with sounds (Acoustic Model)
– Very large text corpus, to identify unigram and bigram probabilities (Language Model)

Representing the Signal
• What parameters (features) of the speech input:
– Can be extracted automatically
– Will preserve phonetic identity and distinguish one phone from others
– Will be independent of speaker variability and channel conditions
– Will not take up too much space
• Speech representations (for [ae] in had):
– Waveform: change in sound pressure over time
– LPC Spectrum: component frequencies of a waveform
– Spectrogram: overall view of how frequencies change from phone to phone

• Speech captured by microphone and sampled (digitized) -- may not capture all vital information
• Signal divided into frames
• Power spectrum computed to represent energy in different bands of the signal
– LPC spectrum, Cepstra, PLP
– Each frame’s spectral features represented by a small set of numbers
• Frames clustered into ‘phone-like’ groups (phones in context) -- Gaussian or other models
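
The frame-then-spectrum steps can be sketched with a naive DFT (a real front end uses FFTs plus cepstral or PLP processing; frame size, hop, and the test tone here are illustrative):

```python
import cmath
import math

def frames(signal, frame_len, hop):
    """Split digitized samples into overlapping fixed-length frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def power_spectrum(frame):
    """Energy per frequency bin via a naive O(n^2) DFT."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(frame))) ** 2 / n
            for k in range(n // 2 + 1)]

# 100 samples of a pure tone at 5 cycles per frame: its spectrum peaks at bin 5
sig = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
```

Each frame's spectrum is then reduced to a small feature vector (e.g. cepstral coefficients) before clustering into phone-like groups.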

• Why does this work?
– Different phonemes have different spectral characteristics
• Why doesn’t it work?
– Phonemes can have different properties in different acoustic contexts, spoken by different people …
– Nice white rice

Pronunciation Model
• Models likelihood of a word given a network of candidate phone hypotheses (weighted phone lattice)
• Allophones: butter vs. but
• Multiple pronunciations for each word
• Lexicon may be a weighted automaton or a simple dictionary
• Words come from all corpora; pronunciations from a pronouncing dictionary or TTS system
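
A weighted lexicon with multiple pronunciations per word might be sketched like this (phone symbols and weights are invented for illustration, not taken from a real pronouncing dictionary):

```python
# toy weighted lexicon: each word maps to (phone sequence, probability) pairs;
# "butter" shows the flapped-/t/ allophone that "but" lacks
lexicon = {
    "butter": [(("b", "ah", "dx", "er"), 0.8),   # flapped /t/ variant
               (("b", "ah", "t", "er"), 0.2)],   # careful-speech variant
    "but":    [(("b", "ah", "t"), 1.0)],
}

def pronunciation_prob(word, phones):
    """P(phones | word) under the weighted lexicon; 0.0 if unlisted."""
    return dict(lexicon.get(word, [])).get(tuple(phones), 0.0)
```

A weighted automaton generalizes this dictionary by scoring partial phone paths, not just complete pronunciations.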

Acoustic Models
• Model likelihood of phones or subphones given spectral features and prior context
• Use pronunciation models
• Usually represented as an HMM
– Set of states representing phones or other subword units
– Transition probabilities on states: how likely is it to see one phone after seeing another?
– Observation/output likelihoods: how likely is a spectral feature vector to be observed from phone state i, given phone state i-1?
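
The HMM ingredients above (phone states, transition probabilities, observation likelihoods) can be sketched with discrete toy observations; all states, symbols, and numbers here are invented:

```python
# transition probabilities between phone states (rows sum to 1)
transitions = {"start": {"b": 1.0},
               "b": {"b": 0.3, "ah": 0.7},
               "ah": {"ah": 0.5, "t": 0.5},
               "t": {"t": 0.4, "end": 0.6}}
# observation likelihoods: P(acoustic symbol | phone state)
emissions = {"b": {"o1": 0.6, "o2": 0.4},
             "ah": {"o1": 0.1, "o2": 0.9},
             "t": {"o1": 0.5, "o2": 0.5}}

def path_likelihood(states, obs):
    """Joint probability of one state path and its observations."""
    p = 1.0
    prev = "start"
    for s, o in zip(states, obs):
        p *= transitions[prev][s] * emissions[s][o]
        prev = s
    return p * transitions[prev]["end"]
```

Real acoustic models replace the discrete emission tables with Gaussian mixtures over spectral feature vectors.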

• Initial estimates for:
– Transition probabilities between phone states
– Observation probabilities associating phone states with acoustic examples
• Re-estimate both probabilities by feeding the HMM the transcribed speech training corpus (forced alignment)
• I.e., we tell the HMM the ‘right’ answers -- which words to associate with which sequences of sounds
• Iteratively retrain the transition and observation probabilities by running the training data through the model and scoring output until no further improvement

Language Model
• Models likelihood of a word given prior words, and of an entire sentence
• Ngram models:
– Build the LM by calculating bigram or trigram probabilities from a text training corpus
– Smoothing issues very important for real systems
• Grammars
– Finite state grammar, Context Free Grammar (CFG), or semantic grammar
• Out of Vocabulary (OOV) problem
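
Bigram estimation with one simple smoothing scheme, add-one (Laplace); real systems use stronger methods such as backoff or Kneser-Ney, so treat this as a sketch only:

```python
from collections import Counter

def train_bigram_lm(corpus):
    """Bigram probabilities from a list of sentences, with add-one smoothing."""
    tokens = []
    for sent in corpus:
        tokens += ["<s>"] + sent.split() + ["</s>"]
    vocab = set(tokens)
    unigrams = Counter(tokens)
    # note: concatenation also counts the harmless (</s>, <s>) boundary bigram
    bigrams = Counter(zip(tokens, tokens[1:]))

    def prob(w_prev, w):
        # add-one smoothing: every bigram, seen or not, gets nonzero mass
        return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + len(vocab))
    return prob
```

Smoothing matters because any unseen bigram with probability zero would zero out every hypothesis containing it during decoding.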

• Entropy H(X): the amount of information in an LM or grammar
– How many bits will it take on average to encode a choice or a piece of information?
– More likely things will take fewer bits to encode
• Perplexity 2^H: a measure of the weighted mean number of choice points in, e.g., a language model
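
Entropy and perplexity over a finite distribution, as a quick sketch:

```python
import math

def entropy(probs):
    """H(X) = -sum p log2 p: average bits to encode one choice."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def perplexity(probs):
    """2**H: the weighted mean number of choice points."""
    return 2 ** entropy(probs)
```

For a uniform choice among 8 equally likely words, H = 3 bits and perplexity = 8: the model faces 8 effective choices at each step.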

Search/Decoding
• Find the best hypothesis, argmax_s P(O|s) P(s), given:
– Lattice of subword units (Acoustic Model)
– Segmentation of all paths into possible words (Pronunciation Model)
– Probabilities of word sequences (Language Model)
• Produces a huge search space: how to reduce it?
– Lattice minimization and determinization
– Forward algorithm: sum of all paths leading to a state
– Viterbi algorithm: max of all paths leading to a state
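
The forward (sum) vs. Viterbi (max) distinction in one sketch, on a toy two-state HMM (all states, symbols, and numbers invented):

```python
def forward(obs, states, init, trans, emit):
    """Forward algorithm: total observation likelihood, summing over all state paths."""
    alpha = {s: init[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[p] * trans[p][s] for p in states) * emit[s][o]
                 for s in states}
    return sum(alpha.values())

def viterbi(obs, states, init, trans, emit):
    """Viterbi algorithm: likelihood of the single best state path."""
    delta = {s: init[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        delta = {s: max(delta[p] * trans[p][s] for p in states) * emit[s][o]
                 for s in states}
    return max(delta.values())

states = ["A", "B"]
init = {"A": 0.6, "B": 0.4}
trans = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit = {"A": {"x": 0.5, "y": 0.5}, "B": {"x": 0.9, "y": 0.1}}
```

The two recursions are identical except that sum becomes max, which is why decoders can share one lattice structure for both.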

– Forward-backward (Baum-Welch, Expectation-Maximization) algorithm: computes probability of a sequence at any state in the search space
– Beam search: prune the lattice

Varieties of Speech Recognition
• Mode: isolated words --> continuous
• Style: read, prepared, spontaneous
• Enrollment: speaker-dependent or independent
• Vocabulary size: <20 --> 5K --> 60K --> ~1M
• Language Model: finite state, ngrams, CFGs, CSGs
• Perplexity: <10 --> >100
• SNR: >30 dB (high) --> <10 dB (low)
• Input device: telephone, microphones

Challenges for Transcription
• Robustness to channel characteristics and noise
• Portability to new applications
• Adaptation: to speakers, to environments
• LMs: simple ngrams need help
• Confidence measures
• OOV words
• New speaking styles/genres
• New applications

Challenges for Understanding
• Recognizing communicative ‘problems’
– ASR errors
– User corrections
– Disfluencies and self-repairs
• Possibilities:
– Recognizing speaker emotion
– Identifying speech acts: okay
– Locating topic boundaries for topic tracking, audio browsing, speech data mining

An Unsuccessful Dialogue
S: Hi, this is AT&T Amtrak schedule system. This is TOOT. How may I help you?
U: I'd like to find a train from Ch- to Chicago from Baltimore.
S: Sorry, I can't understand you. Please repeat your utterance.
U: I'd like to find a train from Baltimore to Chicago.
S: I heard you say “leave today around 5 thirty in the morning.” Which city do you want to go to?
U: I would like to leave Saturday at eight in the morning.

S: You can say the arrival city name, such as “New York City.”
U: Baltimore to Chicago.
S: You can say the arrival city name, such as “New York City.”
U: Baltimore to Chicago. (I hate it.)
U: Train from Baltimore...
S: ... I heard you s...
U: to Chicago. Hello?
S: You can say the name of your departure city, such as “New York City.”
U: Baltimore.

U: Bal-ti-more.
S: You can say...
U: Baltimore.
S: Do you want to continue and find a new set of trains?
U: Yes.
S: You can say "yes" or "I am done here" to find a new set of trains.
U: Yes.
S: Thank you for using AT&T Amtrak train time table system. See you next time.
U: I nev-

Summary
• ASR technology relies upon a large number of phenomena and techniques we’ve already seen to convert sound into words
– Phonetic/phonological, morphological, and lexical events
– FSAs, Ngrams, dynamic programming algorithms
• Better modeling of linguistic phenomena will be needed to improve performance on transcription, and especially on understanding
• For next class: we’ll start talking about larger structures in language above the word (Ch 8)

Disfluencies and Self-Repairs
• Disfluencies abound in spontaneous speech -- every 4.6 s in radio call-in (Blackmer & Mitton ‘91)
– hesitation: Ch- change strategy.
– filled pause: Um Baltimore.
– self-repair: Ba- uh Chicago.
• Hard to recognize:
– Ch- change strategy. --> to D C today ten fifteen.
– Um Baltimore. --> From Baltimore ten.
– Ba- uh Chicago. --> For Boston Chicago.