Application of Speech Recognition, Synthesis, Dialog

Speech for communication
• The difference between speech and language
• Speech recognition and speech understanding
© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann, Carnegie Mellon

Speech recognition can only identify words
• System does not know what you want
• System does not know who you are

Speech and Audio Processing
• Signal processing:
  • Convert the audio wave into a sequence of feature vectors
• Speech recognition:
  • Decode the sequence of feature vectors into a sequence of words
• Semantic interpretation:
  • Determine the meaning of the recognized words
• Dialog management:
  • Correct errors and help get the task done
• Response generation:
  • What words to use to maximize user understanding
• Speech synthesis:
  • Generate synthetic speech from a ‘marked-up’ word string

Data Flow
[Diagram: Part I and Part II of the processing pipeline, with boxes for Signal Processing, Speech Recognition, Semantic Interpretation, Discourse Interpretation, Dialog Management, Response Generation, and Speech Synthesis]

Semantic Interpretation: Word Strings
• Content is just words
  • System: What is your address?
  • User: My address is fourteen eleven main street
• Need concept extraction / keyword(s) spotting
• Applications
  • template filling
  • directory services
  • information retrieval

Semantic Interpretation: Pattern-Based
• Simple (typically regular) patterns specify content
• ATIS (Air Travel Information System) task:
  • System: What are your travel plans?
  • User: [On Monday], I’m going [from Boston] [to San Francisco].
  • Content: [DATE=Monday, ORIGIN=Boston, DESTINATION=SFO]
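The pattern-based idea on this slide can be sketched with a few regular expressions. This is only an illustration: the pattern set, the slot names, and the function `extract_slots` are assumptions, not the ATIS grammar. The sketch returns the city name as spoken; mapping "San Francisco" to the airport code SFO would need a separate lookup table.

```python
import re

# Hypothetical patterns, one per slot; a real system would use a much larger grammar.
DAY = r"(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)"
CITY = r"([A-Z][a-z]+(?: [A-Z][a-z]+)*)"  # one or more capitalized words

SLOT_PATTERNS = {
    "DATE": re.compile(r"\bon " + DAY + r"\b", re.IGNORECASE),
    "ORIGIN": re.compile(r"\bfrom " + CITY),
    "DESTINATION": re.compile(r"\bto " + CITY),
}

def extract_slots(utterance):
    """Return the slots whose pattern matches; unmatched slots are simply omitted."""
    slots = {}
    for name, pattern in SLOT_PATTERNS.items():
        match = pattern.search(utterance)
        if match:
            slots[name] = match.group(1)
    return slots

if __name__ == "__main__":
    print(extract_slots("On Monday, I'm going from Boston to San Francisco."))
```

Because each slot is matched independently, an utterance that supplies only some of the phrases still yields a partial result.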

Robustness and Partial Success
• Controlled speech
  • limited task vocabulary; limited task grammar
• Spontaneous speech
  • Can have high out-of-vocabulary (OOV) rate
  • Includes restarts, word fragments, omissions, phrase fragments, disagreements, and other disfluencies
  • Contains much grammatical variation
  • Causes high word error rate in recognizer
• Interpretation is often partial, allowing:
  • omission
  • parsing fragments

Speech Dialog Management

Discourse & Dialog Processing
• Discourse interpretation:
  • Understand what the user really intends by interpreting utterances in context
• Dialog management:
  • Determine system goals in response to user utterances based on user intention
• Response generation:
  • Generate natural language utterances to achieve the selected goals

Discourse Interpretation
• Goal: understand what the user really intends
• Example: Can you move it?
  • What does “it” refer to?
  • Is the utterance intended as a simple yes-no query or a request to perform an action?
• Issues addressed:
  • Reference resolution
  • Intention recognition
• Interpret user utterances in context

Reference Resolution
U: Where is A Bug’s Life playing in Monroeville?
S: A Bug’s Life is playing at the Carmike theater.
U: When is it playing there?
S: It’s playing at 2 pm, 5 pm, and 8 pm.
U: I’d like 1 adult and 2 children for the first show. How much would that be?
• Knowledge sources:
  • Domain knowledge
  • Discourse knowledge
  • World knowledge

Reference Resolution
• Focus stacks:
  • Maintain recent objects in stack
  • Select objects that satisfy semantic/pragmatic constraints starting from top of stack
  • Take into account discourse structure
• Rule-based filtering & ranking of objects for pronoun resolution
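A focus stack can be sketched in a few lines. The class below is a hypothetical illustration: the name `FocusStack`, the `kind` feature, and the exact-match constraint test are all assumptions, and discourse structure is ignored. It resolves "it" and "there" from the movie dialog by searching from the most recently mentioned entity downward.

```python
class FocusStack:
    """Keeps recently mentioned entities; the most recent one is on top."""

    def __init__(self):
        self._items = []  # last element is the top of the stack

    def mention(self, name, **features):
        # Re-mentioning an entity moves it back to the top.
        self._items = [e for e in self._items if e["name"] != name]
        self._items.append({"name": name, **features})

    def resolve(self, **constraints):
        # Walk from the top and return the first entity meeting every constraint.
        for entity in reversed(self._items):
            if all(entity.get(key) == value for key, value in constraints.items()):
                return entity["name"]
        return None

if __name__ == "__main__":
    stack = FocusStack()
    stack.mention("A Bug's Life", kind="movie")
    stack.mention("Monroeville", kind="place")
    stack.mention("Carmike theater", kind="place")
    print(stack.resolve(kind="movie"))  # candidate referent for "it"
    print(stack.resolve(kind="place"))  # candidate referent for "there"
```

A rule-based ranker could replace the first-match policy by scoring every candidate that passes the constraints.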

Dialog Management: Motivating Example
S: Would you like movie showtime or theater playlist information?
U: Movie showtime.
S: What movie do you want showtime information about?
U: Saving Private Ryan.
S: At what theater do you want to see Saving Private Ryan?
U: Carmike.
S: Saving Private Ryan is not playing at the Carmike theater.

Interacting with the user
• Dialog manager
  • Guide interaction through task
  • Map user inputs and system state into actions
• Domain agent
  • Interact with back-end(s)
  • Interpret information using domain knowledge

Dialog Management
• Goal: determine what to accomplish in response to user utterances, e.g.:
  • Answer user question
  • Solicit further information
  • Confirm/clarify user utterance
  • Notify invalid query and suggest alternative
• Interface between user/language processing components and system knowledge base

Graph-based systems
• Welcome to Bank ABC! Please say one of the following: Balance, Hours, Loan, ...
• What type of loan are you interested in? Please say one of the following: Mortgage, Car, Personal, ...
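A graph-based dialog of this kind can be sketched as a table of states, each with a prompt and keyword-labelled transitions. The state names, the `DIALOG_GRAPH` layout, and the `step` function are hypothetical; only the two prompts come from the slide, and the leaf states are left undefined.

```python
# Each state has a prompt and the keywords that lead to the next state.
DIALOG_GRAPH = {
    "welcome": {
        "prompt": "Welcome to Bank ABC! Please say one of the following: "
                  "Balance, Hours, Loan",
        "next": {"balance": "balance", "hours": "hours", "loan": "loan"},
    },
    "loan": {
        "prompt": "What type of loan are you interested in? Please say one of "
                  "the following: Mortgage, Car, Personal",
        "next": {"mortgage": "mortgage", "car": "car", "personal": "personal"},
    },
}

def step(state, user_words):
    """Follow the first recognized keyword; stay in place (re-prompt) otherwise."""
    node = DIALOG_GRAPH.get(state)
    if node is None:  # leaf state with no outgoing transitions in this sketch
        return state
    for word in user_words.lower().split():
        word = word.strip(".,!?")
        if word in node["next"]:
            return node["next"][word]
    return state

if __name__ == "__main__":
    state = "welcome"
    for reply in ["Loan, please", "a car loan"]:
        print(DIALOG_GRAPH[state]["prompt"])
        state = step(state, reply)
    print("Reached state:", state)
```

The system keeps the initiative throughout: the user can only choose among the transitions the current state offers.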

Frame-based systems
[Diagram: two frames of unfilled slots with placeholder labels (e.g. “Zxfgdh_dxab: _____”), linked by an arrow labelled “Transition on keyword or phrase”]

Application Task Complexity
• Examples, from simple to complex: Call Routing, Weather Information, Automatic Banking, ATIS, Travel Planning, University Course Advising
• Directly affects:
  • Types and quantity of system knowledge
  • Complexity of system’s reasoning abilities

Dialog Complexity
• Determines what can be talked about:
  • The task only
  • Subdialog: e.g., clarification, confirmation
  • The dialog itself: meta-dialog
    • Could you hold on for a minute?
    • What was that click? Did you hear it?
• Determines who can talk about them:
  • System only
  • User only
  • Both participants

Dialogue Management: Process
• Determines how the system will go about selecting among the possible goals
• At the dialogue level, determined by system designer in terms of initiative strategies:
  • System-initiative: system always has control, user only responds to system questions
  • User-initiative: user always has control, system passively answers user questions
  • Mixed-initiative: control switches between system and user using fixed rules
  • Variable-initiative: control switches between system and user dynamically based on participant roles, dialogue history, etc.

Response Generation
U: Is Saving Private Ryan playing at the Chatham cinema?
S: No, it’s not.
  • S provides elliptical response
S: No, Saving Private Ryan is not playing at the Chatham cinema.
  • S provides full response (which provides grounding information)
S: No, Saving Private Ryan is not playing at the Chatham cinema; theater’s under renovation.
  • S provides full response and supporting evidence

Communicating with the user
• Language generator: decide what to say to user (and how to phrase it)
• Speech synthesizer: construct sounds and intonation
• Display generator
• Action generator

Response Generation
• Goal: generate natural language utterances to achieve goal(s) selected by the dialogue manager
• Issues:
  • Content selection: determining what to say
  • Surface realization: determining how to say it
  • Generation gap: discrepancy between the actual output of the content selection process and the expected input of the surface realization process

Language generation
• Template-based systems
  • Sentence templates with variables
• “Linguistic” systems
  • Generate surface from meaning representation
• Stochastic approaches
  • Statistical models of domain-expert speech
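Of the three approaches, the template-based one is the easiest to sketch: a sentence template with named variables is filled from the slots the dialog manager selected. The template names, slot names, and helper functions below are hypothetical; the sentences reproduce the movie-line responses shown earlier in the deck.

```python
# Hypothetical sentence templates keyed by the dialog act chosen upstream.
TEMPLATES = {
    "not_playing": "No, {movie} is not playing at the {theater}.",
    "showtimes": "{movie} is playing at {times}.",
}

def join_list(items):
    """Render a list as 'a', 'a and b', or 'a, b, and c'."""
    if len(items) == 1:
        return items[0]
    if len(items) == 2:
        return items[0] + " and " + items[1]
    return ", ".join(items[:-1]) + ", and " + items[-1]

def generate(act, **slots):
    """Fill the template for the given act with the supplied slot values."""
    return TEMPLATES[act].format(**slots)

if __name__ == "__main__":
    print(generate("not_playing", movie="Saving Private Ryan",
                   theater="Chatham cinema"))
    print(generate("showtimes", movie="A Bug's Life",
                   times=join_list(["2 pm", "5 pm", "8 pm"])))
```

Content selection happens before this step; the template only handles surface realization.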

Dialog Evaluation
• Goal: determine how “well” a dialogue system performs
• Main difficulties:
  • No strict right or wrong answers
  • Difficult to determine what features make a dialogue system better than another
  • Difficult to select metrics that contribute to the overall “goodness” of the system
  • Difficult to determine how the metrics compensate for one another
  • Expensive to collect new data for evaluating incremental improvement of systems

Dialog Evaluation (Cont’d)
• System-initiative, explicit confirmation
  • better task success rate
  • lower WER
  • longer dialogs
  • fewer recovery subdialogs
  • less natural
• Mixed-initiative, no confirmation
  • lower task success rate
  • higher WER
  • shorter dialogs
  • more recovery subdialogs
  • more natural
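WER (word error rate) is one of the few metrics here with a standard definition: the number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. A minimal sketch using word-level edit distance follows; the function name and the whitespace tokenization are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = cur[j - 1] + 1   # extra word in hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)

if __name__ == "__main__":
    reference = "my address is fourteen eleven main street"
    print(word_error_rate(reference, "my address is fourteen eleven maine street"))
```

Because insertions count as errors, the value can exceed 1.0 for very poor hypotheses.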

Speech Synthesis

Speech Synthesis (Text-to-Speech, TTS)
• Prior knowledge
  • Vocabulary from words to sounds; surface markup
• Recorded prompts
• Formant synthesis
  • Model vocal tract as source and filters
• Concatenative synthesis
  • Record and segment expert’s voice
  • Splice appropriate units into full utterances
• Intonation modeling

Recorded Prompts
• The simplest (and most common) solution is to record prompts spoken by a (trained) human
• Produces human quality voice
• Limited by number of prompts that can be recorded
• Can be extended by limited cut-and-paste or template filling

The Source-Filter Model of Formant Synthesis
• Model of features to be extracted and fitted
• Excitation or voicing source(s) to model sound source
  • standard wave of glottal pulses for voiced sounds
  • randomly varying noise for unvoiced sounds
  • modification of airflow due to lips, etc.
  • high frequency (F0 rate), quasi-periodic, choppy
  • modeled with vector of glottal waveform patterns in voiced regions
• Acoustic filter(s)
  • shapes the frequency character of vocal tract and radiation character at the lips
  • relatively slow (samples around 5 ms suffice) and stationary
  • modeled with LPC (linear predictive coding)
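The source-filter split can be sketched without any signal-processing library. In the sketch below the source is an impulse train at F0 for voiced sounds or uniform noise for unvoiced sounds, and the filter is a cascade of two-pole resonators, one per formant. This is an all-pole filter of the same general form that LPC analysis produces, but here the coefficients are set by hand rather than estimated from speech. The function names and the formant frequencies and bandwidths are illustrative assumptions.

```python
import math
import random

def excitation(n_samples, sample_rate, f0=None, seed=0):
    """Voiced source: impulse train at f0 Hz. Unvoiced (f0 is None): white noise."""
    if f0 is None:
        rng = random.Random(seed)
        return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
    period = int(round(sample_rate / f0))
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

def resonator(signal, sample_rate, freq, bandwidth):
    """Two-pole filter y[n] = x[n] + a1*y[n-1] + a2*y[n-2] with a peak near freq."""
    r = math.exp(-math.pi * bandwidth / sample_rate)  # pole radius, always < 1
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq / sample_rate)
    a2 = -r * r
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize(n_samples, sample_rate, f0, formants):
    """Pass the excitation through one resonator per (frequency, bandwidth) pair."""
    signal = excitation(n_samples, sample_rate, f0)
    for freq, bandwidth in formants:
        signal = resonator(signal, sample_rate, freq, bandwidth)
    return signal

if __name__ == "__main__":
    # 100 ms of a vowel-like sound at 8 kHz; formant values are illustrative only.
    samples = synthesize(800, 8000, f0=100, formants=[(700, 130), (1200, 70)])
    print(len(samples), max(abs(s) for s in samples))
```

Changing `f0` alters the pitch while the formant list shapes the vowel quality, which is the separation the slide describes.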

Concatenative Synthesis
• Record basic inventory of sounds
• Retrieve appropriate sequence of units at run time
• Concatenate and adjust durations and pitch
• Synthesize waveform

Diphone and Polyphone Synthesis
• Phone sequences capture co-articulation
• Cut speech in positions that minimize context contamination
• Need single phones, diphones and sometimes triphones
• Reduce number collected by
  • phonotactic constraints
  • collapsing in cases of no co-articulation
• Data collection methods
  • Collect data from a single (professional) speaker
  • Select text with maximal coverage (typically with greedy algorithm), or
  • Record minimal pairs in desired contexts (real words or nonsense)
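The greedy text-selection step can be sketched as a set-cover loop: repeatedly pick the candidate sentence that adds the most diphones not yet covered. The function names and the toy phone sequences are assumptions; a real inventory would come from a pronunciation lexicon and would also apply the phonotactic constraints mentioned above.

```python
def diphones(phones):
    """Set of adjacent phone pairs in a phone sequence."""
    return set(zip(phones, phones[1:]))

def greedy_select(candidates):
    """candidates maps a sentence id to its phone list; returns the ids chosen."""
    remaining = set()
    for phones in candidates.values():
        remaining |= diphones(phones)
    chosen = []
    while remaining:
        # Pick the sentence covering the most still-uncovered diphones.
        best = max(candidates,
                   key=lambda s: len(diphones(candidates[s]) & remaining))
        gain = diphones(candidates[best]) & remaining
        if not gain:
            break
        chosen.append(best)
        remaining -= gain
    return chosen

if __name__ == "__main__":
    toy = {
        "s1": ["a", "b", "c"],
        "s2": ["b", "c", "d"],
        "s3": ["a", "b", "c", "d"],
    }
    print(greedy_select(toy))
```

Greedy selection does not guarantee the smallest possible script, but it is simple and usually close enough for recording purposes.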

Signal Processing for Concatenative Synthesis
• Diphones recorded in one context must be generated in other contexts
• Features are extracted from recorded units
• Signal processing manipulates features to smooth boundaries where units are concatenated
• Signal processing modifies signal via ‘interpolation’
  • intonation
  • duration

Intonation in Bell Labs TTS
• Generate a sequence of F0 targets for synthesis
• Example:
  • We were away a year ago.
  • phones: w E w R & w A & y E r & g O
source: Multilingual Text-to-Speech Synthesis, R. Sproat, ed., Kluwer, 1998

What you can do with Speech Recognition
• Transcription
  • dictation, information retrieval
• Command control
  • data entry, device control, navigation, call routing
• Information access
  • airline schedules, stock quotes, directory assistance
• Problem solving
  • travel planning, logistics

Human-machine interface is critical
• Speech recognition is NOT the core function of most applications
• Speech is a feature of applications that offers specific advantages
• Errorful recognition is a fact of life

Properties of Recognizers
• Speaker Independent vs. Speaker Dependent
• Large Vocabulary (2K-200K words) vs. Limited Vocabulary (2-200)
• Continuous vs. Discrete
• Speech Recognition vs. Speech Verification
• Real Time vs. multiples of real time
• Spontaneous Speech vs. Read Speech
• Noisy Environment vs. Quiet Environment
• High Resolution Microphone vs. Telephone vs. Cellphone
• Push-and-hold vs. push-to-talk vs. always-listening
• Adapt to speaker vs. non-adaptive
• Low vs. High Latency
• With online incremental results vs. final results
• Dialog Management

Speech Recognition vs. Touch Tone
• Shorter calls
• Choices mean something
• Automate more tasks
• Reduces annoying operations
• Available

Transcription and Dictation
• Transcription is transforming a stream of human speech into computer-readable form
  • Medical reports, court proceedings, notes
  • Indexing (e.g., broadcasts)
• Dictation is the interactive composition of text
  • Report, correspondence, etc.

SpeechWear
• Vehicle inspection task
  • USMC mechanics, fixed inspection form
  • Wearable computer (COTS components)
  • html-based task representation
• film clip

Speech recognition and understanding
• Sphinx system
  • speaker-independent
  • continuous speech
  • large vocabulary
• ATIS system
  • air travel information retrieval
  • context management
• film clip (1994)

Sample Market: Call Centers
• Automate services, lower payroll
• Shorten time on hold
• Shorten agent and client call time
• Reduce fraud
• Improve customer service

Interface guidelines
• State transparency
• Input control
• Error recovery
• Error detection
• Error correction
• Log performance
• Application integration

Applications related to Speech Recognition
• Speech Recognition: figure out what a person is saying.
• Speaker Verification: authenticate that a person is who she/he claims to be. Limited speech patterns.
• Speaker Identification: assign an identity to the voice of an unknown person. Arbitrary speech patterns.
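The difference between verification and identification can be sketched as a one-to-one check against a threshold versus a one-to-many search. Everything below is a hypothetical illustration: real systems compare statistical voice models rather than the tiny hand-made feature vectors used here, and the cosine score, the threshold value, and the function names are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(sample, claimed_id, enrolled, threshold=0.8):
    """Verification: does the sample match the ONE identity being claimed?"""
    return cosine(sample, enrolled[claimed_id]) >= threshold

def identify(sample, enrolled):
    """Identification: which of ALL enrolled speakers matches the sample best?"""
    return max(enrolled, key=lambda name: cosine(sample, enrolled[name]))

if __name__ == "__main__":
    enrolled = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}  # toy voice features
    sample = [0.9, 0.1]
    print(verify(sample, "alice", enrolled))
    print(verify(sample, "bob", enrolled))
    print(identify(sample, enrolled))
```

Note that this `identify` always returns some enrolled speaker; rejecting unknown voices would need an additional threshold.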

Three Types of Security
+ What You Have: key, card, token
+ What You Know: password, PIN, maiden name
+ Who You Are
Stronger Authentication

Family Tree: Voice Biometrics
[Diagram: family tree placing Voice Biometrics (Speaker Verification, Speaker Identification) under both Speech Processing (Output: Speech Synthesis, Digitized Speech; Input: Speech Recognition, Voice Biometrics) and Biometrics (DNA, Face Recognition, Finger Geometry, Fingerprinting, Hand Geometry, Iris/Retina Scan, Signature Verification, Typing Dynamics, …)]

Carnegie Mellon Speech Demos
• CMU Communicator
  • Call: 1-877-CMU-PLAN (268-7526), also 268-5144, or x8-1084
  • the information is accurate; you can use it for your own travel planning…
• CMU Universal Speech Interface (USI)
• CMU Movie Line
  • Seems to be about apartments now…
  • Call: (412) 268-1185

Telephone Demos
• Nuance http://www.nuance.com
  • Banking: 1-650-847-7438
  • Travel Planning: 1-650-847-7427
  • Stock Quotes: 1-650-847-7423
• SpeechWorks http://www.speechworks.com/demos.htm
  • Banking: 1-888-729-3366
  • Stock Trading: 1-800-786-2571
• MIT Spoken Language Systems Laboratory http://www.sls.lcs.mit.edu/sls/whatwedo/applications.html
  • Travel Plans (Pegasus): 1-877-648-8255
  • Weather (Jupiter): 1-888-573-8255
• IBM http://www-3.ibm.com/software/speech/
  • Mutual Funds, Name Dialing: 1-877-VIA-VOICE

Questions?