constructing accurate beliefs in spoken dialog systems
Dan Bohus, Alexander I. Rudnicky
Computer Science Department, Carnegie Mellon University

1 abstract
As a prerequisite for increased robustness and better decision making, dialog systems must be able to accurately assess the reliability of the information they use. Typically, recognition confidence scores provide an initial assessment of the reliability of the information obtained from the user. Ideally, a system should leverage information available in subsequent turns to update and improve the accuracy of its beliefs. We propose a data-driven approach for constructing more accurate beliefs over concept values in spoken dialog systems by integrating information across multiple turns in the conversation. The approach bridges existing work in confidence annotation and correction detection and provides a unified framework for belief updating. It significantly outperforms heuristic rules currently used in most spoken dialog systems.

2 problem
belief updating problem: given an initial belief over a concept Belief_t(C), a system action SA(C), and a user response R, compute the updated belief Belief_t+1(C)

belief representation:
§ most accurately: a probability distribution over the set of possible values
§ but: the system is not likely to "hear" more than 3 or 4 conflicting values
§ in our data, the maximum number of hypotheses for a concept accumulated through the interaction was 3; the system heard more than 1 hypothesis for a concept in only 6.9% of cases

compressed belief updating problem:
given an initial confidence score Conf_init(th_C) for the top hypothesis th of a concept C, construct an updated confidence score Conf_upd(th_C) in light of the system confirmation action SA(C) and the follow-up user response R:
Conf_upd(th_C) ← M(Conf_init(th_C), SA(C), R)

system actions:
§ explicit confirmation
§ implicit confirmation
§ unplanned implicit confirmation
§ request [system asks for the value of a concept]
§ unexpected update [system receives a value for a concept without asking for it, e.g. as a result of a misrecognition, the user over-answering, or an attempted topic shift]

examples:

explicit confirmation (correct value)
S: starting at what time do you need the room?
U: [STARTING AT TEN A M / 0.45] starting at ten a.m.
   start-time = {10:00 / 0.45}
S: did you say you wanted the room starting at ten a.m.?
U: [GUEST UNTIL ONE / 0.89] yes until noon

implicit confirmation (correct value)
S: for when do you need the room?
U: [NEXT THURSDAY / 0.75] next Thursday
   date = {2004-08-26 / 0.75}
S: a room for Thursday, August 26th … starting at what time do you need the room?
U: [FIVE TO SEVEN P_M / 0.58] five to seven p.m.
   date = {?}

implicit confirmation (incorrect value)
S: how may I help you?
U: [THREE TO RESERVE A ROOM / 0.65] I'd like to reserve a room
   start-time = {15:00 / 0.65}
S: starting at three p.m. … for which day do you need the conference room?
U: [CAN YOU DETAILS TIME / NONUNDER. (0.0)] I need a different time
   start-time = {?}

3 dataset
§ user study with the RoomLine spoken dialog system
§ phone-based, mixed-initiative system for conference room reservations
§ access to live schedules for 13 rooms in 2 buildings; rooms differ in size, location, and a/v equipment
§ 46 participants (first-time users)
§ 10 scenario-based interactions each
§ 449 dialogs
§ 8278 turns
§ corpus transcribed and annotated

user response analysis
§ how do users respond to correct and incorrect confirmations? how often do users correct the system?
[tables: distributions of YES / NO / other responses to explicit confirmations (1159 correct, 229 incorrect) and to implicit confirmations (554 correct, 279 incorrect), together with counts of whether the user corrects the system for ~correct, ~critical, and critical errors; the individual cell values are not recoverable from this rendering]
§ users interact strategically

4 models & results
compressed belief representation:
§ k hypotheses + other
§ for now, k = 1: top hypothesis + other [see current and future work for extensions]
§ for now, only updates after system confirmation actions

model:
§ logistic model tree [one for each system action]
§ 1-level deep, root splits on answer type (YES / NO / other)
§ leaves contain stepwise logistic regression models
§ sample-efficient, performs feature selection
§ good probability outputs (minimizes cross-entropy between model predictions and reality)

features:
§ initial: confidence score of the top hypothesis, number of initial hypotheses, concept type (bool / non-bool), concept identity
§ system action: indicators describing other system actions in conjunction with the current confirmation
§ user response, acoustic / prosodic: acoustic and language scores, duration, pitch (min, max, mean, range, std. dev, min and max slope, plus normalized versions), voiced-to-unvoiced ratio, speech rate, initial pause
§ lexical: number of words, lexical terms highly correlated with corrections (by mutual information)
§ grammatical: number of slots (new, repeated), parse fragmentation, parse gaps
§ dialog: state, turn number, expectation match, new value for concept, timeout, barge-in

evaluation
§ compared the error rate in system beliefs before the update (initial), after the update using the heuristic update rules (heuristic), after the update using the proposed logistic model tree (proposed model), and the oracle error rate
§ evaluated separately for each system action: explicit confirmation, implicit confirmation, unplanned implicit confirmation, request, and unexpected updates
[figures: error rates for the basic feature set and for basic + priors across the system actions; the proposed model consistently outperforms the initial and heuristic baselines and approaches oracle performance; the individual bar values are not recoverable from this rendering]
§ adding prior information on concepts (priors constructed manually) further improves performance

5 conclusion
§ proposed a data-driven approach for constructing more accurate beliefs in task-oriented spoken dialog systems
§ bridges insights from detection of misunderstandings and corrections into a unified belief updating framework
§ the model significantly outperforms the heuristics currently used in most spoken dialog systems

current & future work
a k-hypotheses + other
§ extend the compressed belief representation to k hypotheses + other
§ multinomial generalized linear model

b estimated impact on task success
§ how does the accuracy of the belief updating model affect task success?
§ relates the accuracy of the belief updates to overall task success through a logistic regression model
§ accuracy of belief updates: measured as the average likelihood (AVG-LIK) of the correct hypothesis
§ word-error-rate acts as a confounding factor
§ model: logit P(Task Success = 1) = α + β · WER + γ · AVG-LIK
§ fitted the model using 443 data points (dialog sessions)
§ β and γ capture the impact of WER and AVG-LIK on overall task success
[figures: probability of task success vs. word-error-rate at AVG-LIK levels 0.5–0.9, for native and non-native speakers, comparing the current heuristic with the proposed model; average word-error rates marked]
c using information from n-best lists
§ currently: using only the top hypothesis from the recognizer
§ next: extract more information from the n-best list or lattices
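The compressed belief update Conf_upd(th_C) ← M(Conf_init(th_C), SA(C), R) with a one-level logistic model tree can be sketched as follows. This is a minimal illustration, not the fitted model from the study: the leaf weights, the feature set (initial confidence plus a barge-in indicator), and the function names are all hypothetical placeholders; a real system would fit one such tree per system action by stepwise logistic regression on labeled data.

```python
import math

# Hypothetical per-leaf weights: (intercept, w_conf, w_barge_in).
# The root of the tree splits on the user's answer type (YES / NO / other);
# each leaf holds its own logistic regression over response features.
LEAF_WEIGHTS = {
    "YES":   (1.5,  2.0, -0.5),
    "NO":    (-2.0, 1.0, -0.5),
    "other": (0.0,  1.5, -0.5),
}

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def update_confidence(conf_init, answer_type, barge_in):
    """One-level logistic model tree: route on answer type, then apply
    the leaf's logistic regression to produce the updated confidence."""
    b0, w_conf, w_bi = LEAF_WEIGHTS[answer_type]
    return logistic(b0 + w_conf * conf_init + w_bi * (1.0 if barge_in else 0.0))

# With these placeholder weights, a "yes" answer to an explicit confirmation
# raises the confidence of the top hypothesis, a "no" answer lowers it.
up = update_confidence(0.45, "YES", barge_in=False)
down = update_confidence(0.45, "NO", barge_in=False)
```

Because each leaf is a plain logistic regression, the outputs are probabilities in [0, 1] and can be used directly as the updated confidence score of the top hypothesis.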
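The task-success model in section b, logit P(Task Success = 1) = α + β · WER + γ · AVG-LIK, can be sketched as below. The coefficient values here are invented for illustration only (the poster does not reproduce the fitted α, β, γ); their signs encode the expected directions: higher word-error-rate hurts task success, higher average likelihood of the correct hypothesis helps it.

```python
import math

# Hypothetical coefficients; the study fit these to 443 dialog sessions,
# but the actual fitted values are not reproduced here.
ALPHA, BETA, GAMMA = -1.0, -0.05, 4.0  # beta < 0: WER hurts; gamma > 0: AVG-LIK helps

def p_task_success(wer, avg_lik):
    """logit P(TaskSuccess = 1) = alpha + beta * WER + gamma * AVG-LIK,
    so P is the logistic function of the linear predictor."""
    z = ALPHA + BETA * wer + GAMMA * avg_lik
    return 1.0 / (1.0 + math.exp(-z))

# At a fixed word-error-rate, more accurate beliefs (higher AVG-LIK)
# predict a higher probability of task success.
p_low = p_task_success(wer=30.0, avg_lik=0.5)
p_high = p_task_success(wer=30.0, avg_lik=0.9)
```

Sweeping `wer` at fixed `avg_lik` values reproduces the qualitative shape of the curves in the figures: success probability falls with word-error-rate, with higher-AVG-LIK curves sitting above lower ones.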