
“k hypotheses + other” belief updating in spoken dialog systems
Dialogs on Dialogs Talk, March 2006
Dan Bohus
www.cs.cmu.edu/~dbohus
dbohus@cs.cmu.edu
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213

problem
spoken language interfaces lack robustness when faced with understanding errors
§ errors stem mostly from speech recognition
§ typical word error rates: 20-30%
§ significant negative impact on interactions

guarding against understanding errors
§ use confidence scores
  § machine learning approaches for detecting misunderstandings [Walker, Litman, San-Segundo, Wright, and others]
§ engage in confirmation actions
  § explicit confirmation: did you say you wanted to fly to Seoul?
    § yes → trust hypothesis
    § no → delete hypothesis
    § “other” → non-understanding
  § implicit confirmation: traveling to Seoul … what day did you need to travel?
    § rely on new values overwriting old values

today’s talk …
construct accurate beliefs by integrating information over multiple turns in a conversation
S: Where would you like to go?
U: Huntsville [SEOUL / 0.65]
destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
U: no no I’m traveling to Birmingham [THE TRAVELING TO BERLIN P_M / 0.60]
destination = {?}

belief updating: problem statement
destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
[THE TRAVELING TO BERLIN P_M / 0.60]
destination = {?}
§ given
  § an initial belief Binitial(C) over concept C
  § a system action SA
  § a user response R
§ construct an updated belief
  § Bupdated(C) ← f(Binitial(C), SA, R)

outline
§ proposed approach
§ data
§ experiments and results
§ effect on dialog performance
§ conclusion

belief updating: problem statement
destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
[THE TRAVELING TO BERLIN P_M / 0.60]
destination = {?}
§ given
  § an initial belief Binitial(C) over concept C
  § a system action SA(C)
  § a user response R
§ construct an updated belief
  § Bupdated(C) ← f(Binitial(C), SA(C), R)

belief representation
Bupdated(C) ← f(Binitial(C), SA(C), R)
§ most accurate representation
  § probability distribution over the set of possible values
§ however
  § system will “hear” only a small number of conflicting values for a concept within a dialog session
  § in our data
    § max = 3 (conflicting values heard)
    § only in 6.9% of cases, more than 1 value heard

belief representation
Bupdated(C) ← f(Binitial(C), SA(C), R)
§ compressed belief representation
  § k hypotheses + other
  § at each turn, the system retains the top m initial hypotheses and adds n new hypotheses from the input (m+n=k)
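A minimal sketch of the "k hypotheses + other" compression described on this slide, assuming beliefs are stored as a value-to-confidence dictionary. The function names, the dictionary layout, and the rule of ranking new hypotheses by recognition confidence are illustrative assumptions, not details taken from the talk.

```python
def build_hypothesis_set(initial_belief, new_hypotheses, m, n):
    """Pick the k = m + n values tracked this turn.

    initial_belief: dict mapping value -> confidence (without 'other').
    new_hypotheses: list of (value, confidence) pairs from the current input.
    """
    # Keep the top m hypotheses from the initial belief.
    top_initial = sorted(initial_belief, key=initial_belief.get, reverse=True)[:m]
    # Add up to n new values that are not already tracked (assumed ranking rule).
    new_values = []
    for value, _score in sorted(new_hypotheses, key=lambda p: p[1], reverse=True):
        if len(new_values) >= n:
            break
        if value not in top_initial and value not in new_values:
            new_values.append(value)
    return top_initial + new_values


def compress(belief, tracked):
    """Fold the probability mass of all untracked values into 'other'."""
    compressed = {value: belief.get(value, 0.0) for value in tracked}
    compressed["other"] = max(0.0, 1.0 - sum(compressed.values()))
    return compressed


# Example from the talk: Seoul is believed at 0.65, then Berlin is heard.
tracked = build_hypothesis_set({"seoul": 0.65}, [("berlin", 0.60)], m=1, n=1)
initial = compress({"seoul": 0.65}, tracked)
```

Here `initial` holds the confidences for `seoul`, `berlin` (not yet believed, so 0.0), and `other`, which is the form the update model takes as input.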

belief representation
Bupdated(C) ← f(Binitial(C), SA(C), R)
§ B(C) modeled as a multinomial variable
  § {h_1, h_2, …, h_k, other}
  § B(C) = <c_h1, c_h2, …, c_hk, c_other>
    § where c_h1 + c_h2 + … + c_hk + c_other = 1
§ belief updating can be cast as a multinomial regression problem:
  Bupdated(C) ← Binitial(C) + SA(C) + R

system action
Bupdated(C) ← f(Binitial(C), SA(C), R)
§ request
  S: For when do you want the room?
  U: Friday [FRIDAY / 0.65]
§ explicit confirmation
  S: Did you say you wanted a room for Friday?
  U: Yes [GUEST / 0.30]
§ implicit confirmation
  S: a room for Friday … starting at what time?
  U: starting at ten a.m. [STARTING AT TEN A_M / 0.86]
§ unplanned implicit confirmation
  S: I found 5 rooms available Friday from 10 until noon. Would you like a small or a large room?
  U: not Friday, Thursday [FRIDAY THURSDAY / 0.25]
§ no action / unexpected update
  S: okay. I will complete the reservation. Please tell me your name or say ‘guest user’ if you are not a registered user.
  U: guest user [THIS TUESDAY / 0.55]

user response
Bupdated(C) ← f(Binitial(C), SA(C), R)
§ acoustic / prosodic: acoustic and language scores, duration, pitch (min, max, mean, range, std. dev, min and max slope, plus normalized versions), voiced-to-unvoiced ratio, speech rate, initial pause, etc.
§ lexical: number of words, lexical terms highly correlated with corrections or acknowledgements (selected via mutual information computation)
§ grammatical: number of slots (new and repeated), parse fragmentation, parse gaps, etc.
§ dialog: state, turn number, expectation match, new value for concept, timeout, barge-in, concept identity
§ priors: for concept values (manually constructed by a domain expert for 3 of 29 concepts: date, start_time, end_time; uniform assumed otherwise)
§ confusability: empirically derived confusability scores

approach
Bupdated(C) ← f(Binitial(C), SA(C), R)
§ problem
  § <uc_h1, …, uc_hk, uc_oth> ← f(<ic_h1, …, ic_hk, ic_oth>, SA(C), R)
§ approach: multinomial generalized linear model
  § regression model, multinomial dependent variable
  § sample efficient
  § stepwise approach
    § feature selection
    § BIC to control over-fitting
  § one model for each system action
    § <uc_h1, …, uc_hk, uc_oth> ← f_SA(C)(<ic_h1, …, ic_hk, ic_oth>, R)
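A minimal sketch of what one per-action update model could look like for k=2, assuming a softmax (multinomial logit) link. The slides only name a multinomial generalized linear model, so the link function, the two response features (`said_yes`, `said_no`), and every weight below are hypothetical placeholders rather than the trained models from the talk.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def update_belief(initial, response_features, weights):
    """Compute <uc_h1, uc_h2, uc_oth> from <ic_h1, ic_h2, ic_oth> and R.

    weights holds one row per outcome (h1, h2, other); each row is a bias
    followed by one coefficient per input.
    """
    x = [1.0] + list(initial) + list(response_features)
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return softmax(scores)


# Hypothetical weights, one model per system action.
# Input order: bias, ic_h1, ic_h2, ic_oth, said_yes, said_no.
MODELS = {
    "explicit_confirm": [
        [0.0, 2.0, 0.0, 0.0, 3.0, -3.0],  # h1
        [0.0, 0.0, 2.0, 0.0, 0.0, 0.0],   # h2
        [0.0, 0.0, 0.0, 2.0, 0.0, 2.0],   # other
    ],
}

initial = [0.65, 0.0, 0.35]
after_yes = update_belief(initial, [1.0, 0.0], MODELS["explicit_confirm"])
after_no = update_belief(initial, [0.0, 1.0], MODELS["explicit_confirm"])
```

With these made-up weights a "yes" pushes most of the mass onto h1, while a "no" moves it to "other". In the talk the coefficients are fit from annotated data, with stepwise feature selection and BIC deciding which features enter each model.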


data
§ collected with RoomLine
  § a phone-based mixed-initiative spoken dialog system
  § conference room reservation
§ explicit and implicit confirmations
§ simple heuristic rules for belief updating
  § explicit confirm: yes / no
  § implicit confirm: new values overwrite old ones
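The heuristic rules above can be sketched as follows. This is an illustrative reading, not RoomLine's actual code: the function name, the tuple representation of a hypothesis, and the choice to set confidence to 1.0 when the user says "yes" are assumptions.

```python
def heuristic_update(hypothesis, action, answer=None, new_value=None):
    """Apply the simple heuristic update rules to a single hypothesis.

    hypothesis, new_value: (value, confidence) tuples, or None when absent.
    action: 'explicit_confirm' or 'implicit_confirm'.
    answer: 'yes', 'no', or anything else (treated as a non-understanding).
    """
    if action == "explicit_confirm":
        if answer == "yes":
            # Trust the hypothesis (assumed here to mean full confidence).
            return (hypothesis[0], 1.0) if hypothesis else None
        if answer == "no":
            # Delete the hypothesis.
            return None
        # Non-understanding: leave the belief unchanged (assumption).
        return hypothesis
    if action == "implicit_confirm":
        # New values overwrite old ones; otherwise keep the old value.
        return new_value if new_value is not None else hypothesis
    raise ValueError(f"unsupported system action: {action}")


# The misrecognized correction from the talk: Seoul is overwritten by Berlin.
result = heuristic_update(("seoul", 0.65), "implicit_confirm",
                          new_value=("berlin", 0.60))
```

The last call shows the weakness of the overwrite rule: a misrecognized correction replaces one wrong value with another, and the confidence information from the earlier turn is discarded.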

corpus
§ user study
  § 46 participants (naïve users)
  § 10 scenario-based interactions each
  § compensated per task success
§ corpus
  § 449 sessions, 8848 user turns
  § orthographically transcribed
  § manually annotated
    § misunderstandings
    § corrections
    § correct concept values


baselines
§ initial baseline
  § accuracy of system beliefs before the update
§ heuristic baseline
  § accuracy of heuristic update rule used by the system
§ oracle baseline
  § accuracy if we knew exactly when the user corrects

k=2 hypotheses + other
Informative features
§ priors and confusability
§ initial confidence score
§ concept identity
§ barge-in
§ expectation match
§ repeated grammar slots


a question remains …
… does this really matter? what is the effect on global dialog performance?

let’s run an experiment
guinea pigs from Speech Lab for exp: $0
getting change from guys in the lab: $2/$3/$5
real subjects for the experiment: $25
picture with advisor of the VERY last exp at CMU: priceless!!!!
[courtesy of Mohit Kumar]

a new user study …
§ implemented models in RavenClaw, performed a new user study
§ 40 participants, first-time users
§ 10 scenario-driven interactions each
§ non-native speakers of North-American English
  § improvements more likely at higher WER
    § supported by empirical evidence
§ between-subjects; 2 gender-balanced groups
  § control: RoomLine using heuristic update rules
  § treatment: RoomLine using runtime models

effect on task success
task success: control 73.6%, treatment 81.3%
even though
average user WER: control 21.9%, treatment 24.2%

effect on task success … a closer look
[figure: probability of task success vs. word error rate; annotated points: 78%, 64%, 30% WER, 16% WER]
Task Success ← 2.09 - 0.05∙WER + 0.69∙Condition
p=0.001
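A worked reading of the fitted model, assuming it is a logistic regression with WER expressed in percentage points and Condition coded as 0 for control and 1 for treatment. The slide does not state the link function or the coding, so both are assumptions, and `p_success` is a name introduced here for illustration.

```python
import math


def p_success(wer, treatment):
    """Probability of task success under the assumed logistic model."""
    condition = 1.0 if treatment else 0.0
    logit = 2.09 - 0.05 * wer + 0.69 * condition
    return 1.0 / (1.0 + math.exp(-logit))


control_at_30 = p_success(30, treatment=False)
treatment_at_30 = p_success(30, treatment=True)
control_at_16 = p_success(16, treatment=False)

# The treatment effect expressed in WER points: ratio of the two coefficients.
equivalent_wer_reduction = 0.69 / 0.05
```

Under these assumptions the three probabilities round to 64%, 78%, and 78%, which lines up with the annotated points on the figure: at 30% WER the treatment lifts success from about 64% to about 78%, roughly what the control system would reach at 16% WER. The coefficient ratio gives about 13.8 WER points, in line with that 14-point gap.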

improvements at different WER
[figure: absolute improvement in task success vs. word-error-rate]

effect on task duration (for successful tasks)
§ ANOVA on task duration for successful tasks
  Duration ← -0.21 + 0.013∙WER - 0.106∙Condition
§ significant improvement, equivalent to 7.9% absolute reduction in WER


summary
§ data-driven approach for constructing accurate system beliefs
  § integrate information across multiple turns
  § bridge together detection of misunderstandings and corrections
§ significantly outperforms current heuristics
§ significantly improves effectiveness and efficiency

other advantages
§ sample efficient
  § performs a local one-turn optimization
  § good local performance leads to good global performance
§ scalable
  § works independently on concepts
  § 29 concepts, varying cardinalities
§ portable
  § decoupled from dialog task specification
  § doesn’t make strong assumptions about dialog management technology

thank you!
questions …

user study
§ 10 scenarios, fixed order
§ presented graphically (explained during briefing)
§ participants compensated per task success