Adapting and Learning Dialogue Models
Discourse & Dialogue, CMSC 35900-1, November 19, 2006

Roadmap
• The problem: portability
• Task domain: call-routing
• Porting:
  – Speech recognition
  – Call-routing
  – Dialogue management
• Conclusions
• Learning DM strategies:
  – HMMs and POMDPs

SLS Portability
• Spoken language system design:
  – Record or simulate user interactions
  – Collect vocabulary, sentence style, and sequencing
• Transcribe/label:
  – Expert creates vocabulary, language model, and dialogue model
• Problem: costly, time-consuming, and expert-dependent

Call-routing
• Goal: given an utterance, identify its type
  – Dispatch to the right operator
• Classification task:
  – Manual rules or data-driven methods
  – Feature-based classification (boosting; sketched below)
  – Pre-defined types, e.g.:
    • “Hello?” -> hello; “I have a question” -> request(info)
    • “I would like to know my balance.” -> request(balance)
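
A minimal sketch of feature-based call-type classification, standing in for BoosTexter with scikit-learn's AdaBoost over word n-gram features; the utterances, labels, and type names are invented for illustration.

```python
# Hypothetical call-type classifier: word n-gram features + boosting
# (scikit-learn used here in place of BoosTexter).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.pipeline import make_pipeline

# Invented training data: (utterance, call-type) pairs.
utterances = [
    "hello", "hi there",
    "i have a question", "i would like to ask something",
    "i would like to know my balance", "what is my account balance",
]
call_types = ["hello", "hello",
              "request(info)", "request(info)",
              "request(balance)", "request(balance)"]

# Word unigrams and bigrams as features, boosted decision stumps on top.
clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    AdaBoostClassifier(n_estimators=100),
)
clf.fit(utterances, call_types)

# Expected to route a balance query to request(balance).
print(clf.predict(["can you tell me my balance"]))
```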

Dialogue Management
• Flow controller:
  – Pluggable dialogue strategy modules (sketched below)
  – ATN: call-flow, easy to augment, manages context
  – Inputs: context, semantic representation of the utterance
• ASR:
  – Language models: trigrams, in a probabilistic framework
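
A minimal sketch of a flow controller with pluggable strategy modules, assuming each module either handles the current call type or passes; all names and replies are hypothetical, not the system described in the slides.

```python
# Hypothetical flow controller with pluggable dialogue strategy modules.
from typing import Callable, Optional

# A strategy maps (call_type, context) to a system prompt, or None to pass.
Strategy = Callable[[str, dict], Optional[str]]

def greeting_strategy(call_type: str, context: dict) -> Optional[str]:
    return "Hello! How can I help you?" if call_type == "hello" else None

def balance_strategy(call_type: str, context: dict) -> Optional[str]:
    if call_type == "request(balance)":
        return f"Your balance is {context.get('balance', 'unavailable')}."
    return None

class FlowController:
    """Tries each plugged-in strategy in order; first non-None reply wins."""
    def __init__(self, strategies: list[Strategy]):
        self.strategies = strategies

    def respond(self, call_type: str, context: dict) -> str:
        for strategy in self.strategies:
            reply = strategy(call_type, context)
            if reply is not None:
                return reply
        return "Let me transfer you to an operator."  # fallback routing

controller = FlowController([greeting_strategy, balance_strategy])
print(controller.respond("request(balance)", {"balance": "$42.00"}))
```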

Adaptation: ASR
• ASR language models:
  – Usually trained from in-domain transcriptions
• Here: out-of-domain transcriptions
  – Switchboard, spoken dialogue (telecom, insurance)
  – In-domain web pages
• New domain: pharmaceuticals
• Style differences: spoken-dialogue data captures SLS style (pronouns); medical web data covers OOV terms best
• Best accuracy: spoken dialogue + web (one way to combine sources is sketched below)
  – Switchboard alone is too big/slow
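
The slide does not say how the data sources were combined; a standard approach is linear interpolation of language-model probabilities. A minimal sketch under that assumption, with invented corpora and an invented interpolation weight, simplified to unigrams:

```python
# Hypothetical linear interpolation of two unigram language models:
# out-of-domain spoken-dialogue text + in-domain web text.
from collections import Counter

def unigram_probs(corpus: list[str]) -> dict[str, float]:
    """Maximum-likelihood unigram probabilities from a tokenized corpus."""
    counts = Counter(corpus)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

dialogue_lm = unigram_probs("i would like to refill my prescription please".split())
web_lm = unigram_probs("ibuprofen dosage prescription refill pharmacy".split())

LAMBDA = 0.7  # invented weight favoring the spoken-dialogue model

def interp_prob(word: str) -> float:
    return LAMBDA * dialogue_lm.get(word, 0.0) + (1 - LAMBDA) * web_lm.get(word, 0.0)

# Web data supplies probability mass for in-domain terms the dialogue data lacks.
print(interp_prob("prescription"), interp_prob("ibuprofen"))
```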

Adaptation: Call-routing
• Manual tagging: slow, expensive
• Here: existing out-of-domain labeled data
  – Meta call-types: a library
    • Generic: apply to all applications
    • Re-usable: in-domain, but already exist
    • Specific: only this application
  – Grouping done by experts
• Bootstrap: start with generic and re-usable types

Call-type Classification
• BoosTexter: word n-gram features; 1,100 iterations
  – Run on ASR output
• Telecom-based call-type library
• Two classification tasks: reject (yes/no) and call-type classification
  – In-domain: 78% (true transcripts); 62% (ASR output)
  – Generic, tested on generic types: 95%; 91%
  – Bootstrap (generic + re-usable + rules): 79%; 68%

Dialogue Model
• Build dialogue strategy templates
  – Based on call-type classification
• Generic types:
  – E.g., yes, no, hello, repeat, help
  – Trigger a generic, context-dependent reply
• Tag types as vague/concrete (template sketch below):
  – Vague: “I have a question” -> clarification
  – Concrete: clear routing, attributes, sub-dialogues
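
A minimal sketch of how generic, vague, and concrete call types might map to strategy templates; the type table and replies are invented, not the templates from the paper.

```python
# Hypothetical strategy templates keyed on call type.
# Vague types trigger clarification; concrete types route directly.
TEMPLATES = {
    "hello":            ("generic",  "Hello! How may I help you?"),
    "repeat":           ("generic",  "I said: {last_prompt}"),
    "request(info)":    ("vague",    "Could you tell me more about your question?"),
    "request(balance)": ("concrete", "Routing you to the balance desk."),
}

def next_prompt(call_type: str, last_prompt: str = "") -> str:
    kind, template = TEMPLATES.get(call_type, ("vague", "Sorry, could you rephrase that?"))
    reply = template.format(last_prompt=last_prompt)
    # Concrete types route immediately; vague ones open a clarification sub-dialogue.
    return f"[{kind}] {reply}"

print(next_prompt("request(info)"))     # clarification question
print(next_prompt("request(balance)"))  # direct routing
```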

Dialogue Model Porting
• Evaluation:
  – Compare to the original transcribed dialogue
• Task 1: DM category: 32 clusters of calls
  – Bootstrap covers 16 categories, 70% of instances
• Using call-type classifiers: get class, confidence, and concreteness
  – If confident, concrete, and correct -> correct
  – If incorrect -> error
  – Also classify vague/generic
• 67-70% accuracy on the DM and routing tasks

Conclusions
• Portability:
  – Bootstrapping of ASR, call-type, and DM models
  – Generally effective
• Call-type success is high
• Others show potential

Learning DM Strategies
• Prior approaches:
  – Hand-coded: state-, frame-, or agent-based
  – Adaptation bootstraps from existing structure
• Alternative:
  – Capture prior interaction patterns
  – Learn dialogue structure and management

Training HMM DM
• Construct a training corpus:
  – E.g., record human-human interactions
  – Identify and label states
• Train HMM dialogue management (estimation sketched below):
  – Use tagged sequences to learn:
    • Correspondences between utterances and states
    • State transition probabilities
• Effective, but still requires initial tagging
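
A minimal sketch of estimating HMM transition and emission counts from tagged dialogue-state sequences; the states and utterances are invented.

```python
# Hypothetical ML estimation of an HMM dialogue model from tagged dialogues.
from collections import Counter, defaultdict

# Invented corpus: each dialogue is a list of (state, utterance) pairs.
corpus = [
    [("greet", "hello"), ("request", "i need my balance"), ("close", "thanks")],
    [("greet", "hi"), ("request", "what is my balance"), ("close", "bye")],
]

transitions = Counter()            # (state, next_state) counts
emissions = defaultdict(Counter)   # state -> word counts (utterance/state link)

for dialogue in corpus:
    states = [s for s, _ in dialogue]
    for prev, curr in zip(states, states[1:]):
        transitions[(prev, curr)] += 1
    for state, utt in dialogue:
        emissions[state].update(utt.split())

def p_transition(prev: str, curr: str) -> float:
    """Maximum-likelihood transition probability P(curr | prev)."""
    total = sum(c for (p, _), c in transitions.items() if p == prev)
    return transitions[(prev, curr)] / total if total else 0.0

print(p_transition("greet", "request"))  # 1.0 in this tiny invented corpus
```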

Reinforcement Learning
• Model dialogues with (partially observable) Markov decision processes:
  – Users form the stochastic environment
  – Actions are system utterances
  – State is the dialogue so far
• Goal: maximize some utility measure
  – Task completion / user satisfaction
• Learn a policy, i.e., which action to take in each state (Q-learning sketched below)
  – One that optimizes the utility measure
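
A minimal sketch of tabular Q-learning for a toy dialogue MDP; the states, actions, and rewards are invented, and a real system would interact with a (simulated) user rather than this hard-coded environment.

```python
# Hypothetical tabular Q-learning for a toy slot-filling dialogue MDP.
import random
from collections import defaultdict

ACTIONS = ["ask_slot", "confirm", "route"]
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2   # invented learning parameters

Q = defaultdict(float)  # (state, action) -> estimated value

def choose(state: str) -> str:
    if random.random() < EPSILON:
        return random.choice(ACTIONS)                     # explore
    return max(ACTIONS, key=lambda a: Q[(state, a)])      # exploit

def update(state, action, reward, next_state):
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

# Invented environment: reward only for routing once the slot is filled.
state = "slot_empty"
for _ in range(200):
    a = choose(state)
    if state == "slot_empty" and a == "ask_slot":
        update(state, a, 0.0, "slot_filled"); state = "slot_filled"
    elif state == "slot_filled" and a == "route":
        update(state, a, 1.0, "done"); state = "slot_empty"  # restart episode
    else:
        update(state, a, -0.1, state)  # small penalty for unhelpful actions
```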

Applications
• TOOT: train information
  – Litman, Kearns, et al.
  – Learned different initiative/confirmation strategies
• Air travel bookings (Young et al. 2006)
  – Problem: huge number of possible states
    • More airports means dramatically more possible utterances
  – Approach: collapse all alternative slot fillers
    • Represent them with a single default

Turn-taking
Discourse and Dialogue, CMSC 35900-1, November 16, 2004

Agenda
• Motivation
  – Silence in human-computer dialogue
• Turn-taking in human-human dialogue
  – Turn-change signals
  – Back-channel acknowledgments
  – Maintaining contact
• Exploiting these cues to improve human-computer communication
  – Automatic identification of disfluencies, jump-in points, and jump-ins

Turn-taking in HCI
• Human turn end:
  – Detected by 250 ms of silence (endpointing sketched below)
• System turn end:
  – Signaled by end of speech
  – Indicated by any human sound (barge-in)
• Continued attention:
  – No signal
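
A minimal sketch of silence-based end-of-turn detection over a stream of voice-activity frames; the 250 ms threshold comes from the slide, while the frame size and the frame stream are invented.

```python
# Hypothetical silence-based endpointing: declare end of the user's turn
# after 250 ms of continuous silence (threshold from the slide).
FRAME_MS = 10                      # invented frame size
SILENCE_MS = 250                   # end-of-turn threshold
FRAMES_NEEDED = SILENCE_MS // FRAME_MS

def detect_turn_end(vad_frames):
    """vad_frames: iterable of booleans, True = speech in that 10 ms frame.
    Returns the frame index at which the turn is judged to have ended."""
    silent_run = 0
    for i, is_speech in enumerate(vad_frames):
        silent_run = 0 if is_speech else silent_run + 1
        if silent_run >= FRAMES_NEEDED:
            return i
    return None  # turn still in progress

# 30 frames of speech followed by 25 frames (250 ms) of silence.
frames = [True] * 30 + [False] * 25
print(detect_turn_end(frames))  # -> 54
```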

Yielding & Taking the Floor
• Turn-change signal:
  – Offers the floor to the auditor/hearer
  – Cues: pitch fall, lengthening, “but uh”, end of gesture, amplitude drop + “uh”, end of clause
  – Likelihood of a turn change increases with the number of cues (see the cue-counting sketch below)
  – Negated by any gesticulation
• Speaker-state signal:
  – Shift in head direction and/or start of a gesture
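
A minimal sketch of the cue-counting behavior the slide describes: the more cues displayed, the more likely a turn change, and any gesticulation vetoes the signal. The linear one-sixth-per-cue rate is an invented simplification.

```python
# Hypothetical cue-counting model of turn yielding.
TURN_YIELD_CUES = {
    "pitch_fall", "lengthening", "but_uh",
    "end_of_gesture", "amplitude_drop_uh", "end_of_clause",
}

def p_turn_change(displayed: set[str], gesticulating: bool) -> float:
    """Likelihood of a turn change grows with the number of cues displayed;
    ongoing gesticulation negates the signal entirely (per the slide).
    The linear per-cue rate is an invented simplification."""
    if gesticulating:
        return 0.0
    n_cues = len(displayed & TURN_YIELD_CUES)
    return min(1.0, n_cues / len(TURN_YIELD_CUES))

print(p_turn_change({"pitch_fall", "end_of_clause"}, gesticulating=False))  # ~0.33
print(p_turn_change({"pitch_fall", "end_of_clause"}, gesticulating=True))   # 0.0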

Retaining the Floor
• Within-turn signal:
  – Still speaker: look at the hearer as the clause ends
• Continuation signal:
  – Still speaker: look away after a within-turn signal/back-channel
• Back-channels:
  – “mmhm”/“okay”/etc.; nods
  – Sentence completions, clarification requests, restatements
  – NOT a turn: they signal attention, agreement, or confusion

Improving Human-Computer Turn-taking
• Identifying cues to turn change and turn start
• Meeting conversations:
  – Recorded, natural research meetings
  – Multi-party
  – Overlapping speech
  – Units = “spurts”: stretches of speech bounded by 500 ms of silence

Tasks
• Sentence/disfluency/non-boundary identification
  – End of sentence, break-off, or continuation
• Jump-in points
  – Times when others “jump in”
• Jump-in words
  – Interruption vs. start from silence
• Off-line and on-line settings
• Language model and/or prosodic cues

Text + Prosody
• Text sequence:
  – Modeled with an n-gram language model
  – Hidden-event prediction, e.g., boundary as a hidden state
  – Implemented as an HMM
• Prosody:
  – Duration, pitch, pause, energy
  – Decision trees: classification + posterior probability
• Integrate LM + DT (combination sketched below)
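
The slide does not specify how the two knowledge sources are combined; one common scheme is a log-linear combination of the hidden-event LM posterior and the prosodic decision-tree posterior at each inter-word boundary. A minimal sketch under that assumption, with invented scores and weight:

```python
# Hypothetical combination of a hidden-event LM posterior and a
# prosodic decision-tree posterior at each inter-word boundary.
import math

def combine(p_lm: float, p_dt: float, weight: float = 0.5) -> float:
    """Log-linear interpolation of the two posteriors (weight is invented)."""
    eps = 1e-10
    log_p = weight * math.log(p_lm + eps) + (1 - weight) * math.log(p_dt + eps)
    return math.exp(log_p)

# Invented per-boundary posteriors for P(sentence boundary | evidence).
boundaries = [
    {"word": "okay", "p_lm": 0.80, "p_dt": 0.70},
    {"word": "the",  "p_lm": 0.05, "p_dt": 0.20},
]
for b in boundaries:
    score = combine(b["p_lm"], b["p_dt"])
    label = "boundary" if score > 0.5 else "no boundary"
    print(b["word"], round(score, 3), label)
```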

Interpreting Breaks
• For each inter-word position:
  – Is it a disfluency, a sentence end, or a continuation?
• Key features:
  – Pause duration, vowel duration
• 62% accuracy w.r.t. a 50% chance baseline
  – ~90% overall
• Best result combines LM & DT

Jump-in Points
• (Used) possible turn changes:
  – Points WITHIN a spurt where a new speaker starts
• Key features:
  – Pause duration, low energy, pitch fall
  – No lexical/punctuation features used
  – Forward-looking features useless
    • Jump-in points look like sentence boundaries but aren’t
• Accuracy: 65% w.r.t. a 50% baseline
• Performance depends only on preceding prosodic features (classifier sketched below)
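
A minimal sketch of a prosodic decision-tree classifier for jump-in points, using only features preceding the candidate position as on the slide; the feature values and labels are invented.

```python
# Hypothetical decision-tree classifier for jump-in points from
# preceding prosodic features only (pause, energy, pitch slope).
from sklearn.tree import DecisionTreeClassifier

# Invented training rows: [pause_sec, energy_db, pitch_slope], label.
X = [
    [0.40, 55.0, -2.5],   # long pause, low energy, falling pitch
    [0.35, 52.0, -1.8],
    [0.05, 68.0,  0.3],   # short pause, high energy, level pitch
    [0.08, 70.0,  0.1],
]
y = [1, 1, 0, 0]  # 1 = jump-in point, 0 = not

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Posterior probability of a jump-in at a new candidate position.
print(tree.predict_proba([[0.30, 54.0, -2.0]]))  # high P(jump-in) expected
```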

Jump-in Features
• Do people speak differently when they jump in?
  – Do jump-ins differ from regular turn starts?
• Examine only the first words of turns
  – No language model
• Key features:
  – Raised pitch, raised amplitude
• Accuracy: 77% w.r.t. a 50% baseline
  – Prosody only

Summary
• Prosodic features signal conversational moves:
  – Pause and vowel duration distinguish sentence ends, disfluencies, and fluent continuations
  – Jump-ins occur at locations that sound like sentence ends
  – Speakers raise their voices when jumping in