Spoken Dialogue Systems
Julia Hirschberg
CS 4706
3/5/2021

Today
• Some Swedish examples
• Controlling the dialogue flow
  – State prediction
• Controlling lexical choice
• Learning from human-human dialogue
  – User feedback
• Evaluating systems

The Waxholm Project at KTH
• Tourist information
• Stockholm archipelago
• Time-tables, hotels, hostels, camping and dining possibilities
• Mixed-initiative dialogue
• Speech recognition
• Multimodal synthesis
• Graphic information
• Pictures, maps, charts and time-tables
• Demos at http://www.speech.kth.se/multimodal

The Waxholm system
[Figure: animated example dialogue with the Waxholm system. The user asks about boats, hotels and restaurants in Waxholm; the system asks where and when the user wants to go and shows time-tables and a map. The overlapping speech-bubble text is not recoverable verbatim.]

Dialogue control: state prediction
• Dialog grammar specified by a number of states
• Each state associated with an action: database search, system question, …
• Probable state determined from semantic features
• Transition probability from one state to another
• Dialog control design tool with a graphic interface

Waxholm Topics
• TIME_TABLE. Task: get a time-table. Example: När går båten? (When does the boat leave?)
• SHOW_MAP. Task: get a chart or a map displayed. Example: Var ligger Vaxholm? (Where is Vaxholm located?)
• EXIST. Task: display lodging and dining possibilities. Example: Var finns det vandrarhem? (Where are there hostels?)
• OUT_OF_DOMAIN. Task: the subject is out of the domain. Example: Kan jag boka rum? (Can I book a room?)
• NO_UNDERSTANDING. Task: no understanding of user intentions. Example: Jag heter Olle. (My name is Olle.)
• END_SCENARIO. Task: end a dialog. Example: Tack. (Thank you.)

Topic selection
[Table: probabilities relating semantic features (rows) to topics (columns). The cell values are not recoverable from the extracted slide.]
Features: OBJECT, QUEST-WHEN, QUEST-WHERE, FROM-PLACE, AT-PLACE, TIME, PLACE, OOD, END, HOTEL, HOSTEL, ISLAND, PORT, MOVE
Topics: TIME TABLE, SHOW MAP, FACILITY, NO UNDERSTANDING, OUT OF DOMAIN, END
Selected topic: $\operatorname*{argmax}_i \; p(t_i \mid F)$
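The selection rule, argmax over i of p(t_i | F), can be sketched as follows. This is a minimal illustration, not the Waxholm implementation: it assumes independent binary features and a uniform prior over topics (a naive-Bayes-style model), and the probabilities in `FEATURE_PROBS` are hypothetical, not the values from the slide's table. The names `topic_scores` and `predict_topic` are likewise made up for the example.

```python
import math

# Hypothetical p(feature present | topic) values, for illustration only.
# These are NOT the values from the slide's table.
FEATURE_PROBS = {
    "TIME_TABLE": {"QUEST-WHEN": 0.70, "QUEST-WHERE": 0.05, "HOTEL": 0.05},
    "SHOW_MAP": {"QUEST-WHEN": 0.05, "QUEST-WHERE": 0.70, "HOTEL": 0.10},
    "EXIST": {"QUEST-WHEN": 0.05, "QUEST-WHERE": 0.40, "HOTEL": 0.60},
}


def topic_scores(features):
    """Return p(topic | features), assuming independent binary features
    and a uniform prior over topics (both are assumptions of this sketch)."""
    all_features = sorted({f for probs in FEATURE_PROBS.values() for f in probs})
    log_scores = {}
    for topic, probs in FEATURE_PROBS.items():
        total = 0.0
        for f in all_features:
            p = probs[f]
            # Present features contribute p, absent features contribute 1 - p.
            total += math.log(p if f in features else 1.0 - p)
        log_scores[topic] = total
    # Normalize in a numerically stable way so the scores sum to 1.
    top = max(log_scores.values())
    unnormalized = {t: math.exp(s - top) for t, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {t: v / z for t, v in unnormalized.items()}


def predict_topic(features):
    """Pick the topic t_i that maximizes p(t_i | F)."""
    scores = topic_scores(features)
    return max(scores, key=scores.get)
```

With these made-up numbers, an utterance carrying only QUEST-WHEN is assigned to TIME_TABLE, while one carrying QUEST-WHERE and HOTEL is assigned to EXIST.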

Topic prediction results
[Bar chart: % errors in topic prediction for three conditions (raw data, no extralinguistic sounds, complete parse), shown for all utterances and with "no understanding" excluded. Values on the chart: 12.9 and 8.8; 12.7 and 8.5; 3.1 and 2.9.]

User answers to questions?
The answers to the question "What weekday do you want to go?" (Vilken veckodag vill du åka?):
• Friday (fredag)
• I want to go on Friday (jag vill åka på fredag)
• I want to go today (jag vill åka idag)
• on Friday (på fredag)
• I want to go a Friday (jag vill åka en fredag)
• are there any hotels in Vaxholm? (finns det några hotell i Vaxholm)
Percentages shown on the slide: 22%, 11%, 7%, 6%

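Examples of questions and answers
Questions:
• Hur ofta åker du utomlands på semestern? (How often do you go abroad on holiday?)
• Hur ofta reser du utomlands på semestern? (How often do you travel abroad on holiday?)
Answers:
• jag åker en gång om året kanske (I go once a year, maybe)
• jag åker ganska sällan utomlands på semester (I rather seldom go abroad on holiday)
• jag åker nästan alltid utomlands under min semester (I almost always go abroad during my holiday)
• jag åker ungefär 2 gånger per år utomlands på semester (I go abroad on holiday about twice a year)
• jag åker utomlands nästan varje år (I go abroad almost every year)
• jag reser en gång om året utomlands (I travel abroad once a year)
• jag reser inte ofta utomlands på semester det blir mera i arbetet (I don't often travel abroad on holiday; it is more for work)
• jag reser utomlands på semestern vartannat år (I travel abroad on holiday every other year)
• jag reser utomlands en gång per semester (I travel abroad once per holiday)
• jag reser utomlands på semester ungefär en gång per år (I travel abroad on holiday about once a year)
• jag brukar resa utomlands på semestern åtminståne en gång i året (I usually travel abroad on holiday at least once a year)
• jag åker utomlands på semestern varje år (I go abroad on holiday every year)
• jag åker utomlands ungefär en gång om året (I go abroad about once a year)
• jag är nästan aldrig utomlands (I am almost never abroad)
• en eller två gånger om året (once or twice a year)
• en gång per semester (once per holiday)
• kanske en gång per år (maybe once a year)
• ungefär en gång per år (about once a year)
• åtminståne en gång om året (at least once a year)
• nästan aldrig (almost never)
• en gång per år kanske (once a year, maybe)
• en gång vart annat år (once every other year)
• varje år (every year)
• vart tredje år ungefär (about every third year)
• nu för tiden inte så ofta (not so often nowadays)
• varje år brukar jag åka utomlands (every year I usually go abroad)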

Results
• reuse: 52%
• other: 24%
• ellipse: 18%
• no reuse: 4%
• no answer: 2%
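The result categories (reuse, no reuse, ellipse, other, no answer) can be illustrated with a small heuristic classifier. This is only a sketch under stated assumptions: the slides do not say how answers were labeled, so the rules below (matching the question's verb, and treating an answer containing "jag" but neither verb as "other") are hypothetical, as are the names `VERB_FORMS` and `classify_answer`.

```python
# Verb forms for the two question variants ("åker" = go, "reser" = travel).
# The list of forms is an assumption made for this sketch.
VERB_FORMS = {
    "åker": {"åker", "åka"},
    "reser": {"reser", "resa"},
}


def classify_answer(question_verb, answer):
    """Label an answer relative to the verb used in the question.

    Heuristic rules (assumptions, not the study's annotation scheme):
    - empty answer                      -> "no answer"
    - answer uses the question's verb   -> "reuse"
    - answer uses the alternative verb  -> "no reuse"
    - full sentence with another verb   -> "other" (approximated by "jag")
    - anything else                     -> "ellipse"
    """
    words = set(answer.lower().split())
    if not words:
        return "no answer"
    own_forms = VERB_FORMS[question_verb]
    other_forms = set().union(
        *(forms for verb, forms in VERB_FORMS.items() if verb != question_verb)
    )
    if words & own_forms:
        return "reuse"
    if words & other_forms:
        return "no reuse"
    if "jag" in words:
        return "other"
    return "ellipse"
```

For example, under these rules "jag åker utomlands nästan varje år" counts as reuse when the question used "åker", and as no reuse when the question used "reser".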

The August system
[Figure: animated example dialogue with the August system, an animated agent modelled on Strindberg. Users ask its name, when it was born, what it does for a living and how many people live in Stockholm; the system answers questions about Strindberg, KTH and Stockholm and shows information on a map. The overlapping speech-bubble text is not recoverable verbatim.]

Evidence from Human Performance
• Users provide explicit positive and negative feedback
• Corpus-based vs. laboratory experiments: do these tell us different things?

Adapt – demonstration of "complete" system

Feedback and 'Grounding': Bell & Gustafson ’00
• Positive and negative feedback
  – Previous corpora: August system
    • 18% of users gave positive or negative feedback in subcorpus
    • Push-to-talk
• Corpus: Adapt system
  – 50 dialogues, 33 subjects, 1845 utterances
  – Feedback utterances labeled with
    • Positive or negative
    • Explicit or implicit
    • Attention/Attitude
• Results:
  – 18% of utterances contained feedback
  – 94% of users provided feedback
  – 65% positive, 2/3 explicit, equal amounts of attention vs. attitude
  – Large variation
    • Some subjects provided feedback at almost every turn
    • Some never did
• Utility of study:
  – Use positive feedback to model the user better (preferences)
  – Use negative feedback in error detection

The HIGGINS domain
• The primary domain of HIGGINS is city navigation for pedestrians.
• Secondarily, HIGGINS is intended to provide simple information about the immediate surroundings.
This is a 3D test environment.

Initial experiments
• Studies on human-human conversation
• The Higgins domain (similar to Map Task)
• Using ASR in one direction to elicit error handling behaviour
[Diagram: the user speaks and the operator reads the ASR output; the operator speaks and the user listens through a vocoder.]

Non-Understanding Error Recovery (Skantze ’03)
• Humans tend not to signal non-understanding:
  – O: Do you see a wooden house in front of you?
  – U: ASR: YES CROSSING ADDRESS NOW (I pass the wooden house now)
  – O: Can you see a restaurant sign?
• This leads to
  – Increased experience of task success
  – Faster recovery from non-understanding

Evaluating Dialogue Systems
• PARADISE framework (Walker et al ’00)
• "Performance" of a dialogue system is affected both by what gets accomplished by the user and the dialogue agent and by how it gets accomplished
[Diagram: Maximize Task Success; Minimize Costs (Efficiency Measures, Qualitative Measures)]

Task Success
• Task goals seen as Attribute-Value Matrix (AVM)
  – ELVIS e-mail retrieval task (Walker et al ’97)
  – "Find the time and place of your meeting with Kim."
    Attribute            Value
    Selection Criterion  Kim or Meeting
    Time                 10:30 a.m.
    Place                2D516
• Task success defined by match between AVM values at end of dialogue with "true" values for AVM
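The idea of scoring task success by comparing the AVM at the end of the dialogue with the "true" AVM can be sketched as below. The exact-match fraction used here is an assumption chosen for simplicity, and the function name `task_success` is hypothetical; the slide only says that success is defined by the match between the two sets of values.

```python
def task_success(reference_avm, dialogue_avm):
    """Return the fraction of reference attributes whose value was matched
    exactly in the AVM collected at the end of the dialogue.

    Exact string match per attribute is a simplifying assumption.
    """
    if not reference_avm:
        raise ValueError("reference AVM must contain at least one attribute")
    matches = sum(
        1
        for attribute, value in reference_avm.items()
        if dialogue_avm.get(attribute) == value
    )
    return matches / len(reference_avm)


# The "true" AVM for the ELVIS example task on the slide.
REFERENCE = {
    "Selection Criterion": "Kim or Meeting",
    "Time": "10:30 a.m.",
    "Place": "2D516",
}
```

A dialogue that retrieves the right time but the wrong place would score 2/3 under this measure, since the selection criterion and time match and the place does not.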

Metrics
• Efficiency of the Interaction: User Turns, System Turns, Elapsed Time
• Quality of the Interaction: ASR rejections, Time Out Prompts, Help Requests, Barge-Ins, Mean Recognition Score (concept accuracy), Cancellation Requests
• User Satisfaction
• Task Success: perceived completion, information extracted

Experimental Procedures
• Subjects given specified tasks
• Spoken dialogues recorded
• Cost factors, states, dialog acts automatically logged; ASR accuracy, barge-in hand-labeled
• Users specify task solution via web page
• Users complete User Satisfaction surveys
• Use multiple linear regression to model User Satisfaction as a function of Task Success and Costs; test for significant predictive factors
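The regression step in the last bullet can be sketched as ordinary least squares. The version below is pure Python and solves the normal equations directly; the function names `solve` and `fit_linear` are hypothetical, the predictors are whatever task-success and cost measures are logged, and the sketch does not include the significance testing mentioned in the procedure.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with
    partial pivoting. Assumes the system is non-singular."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear(rows, targets):
    """Least-squares fit of targets ~ intercept + weights . row.

    rows:    one tuple of predictor values per dialogue
    targets: one User Satisfaction score per dialogue
    Returns [intercept, w1, w2, ...].
    """
    xs = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, targets)) for i in range(k)]
    return solve(xtx, xty)
```

In practice a statistics package would normally be used instead, since it also reports which coefficients are significant predictors.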

User Satisfaction: Sum of Many Measures
• Was Annie easy to understand in this conversation? (TTS Performance)
• In this conversation, did Annie understand what you said? (ASR Performance)
• In this conversation, was it easy to find the message you wanted? (Task Ease)
• Was the pace of interaction with Annie appropriate in this conversation? (Interaction Pace)
• In this conversation, did you know what you could say at each point of the dialog? (User Expertise)
• How often was Annie sluggish and slow to reply to you in this conversation? (System Response)
• Did Annie work the way you expected her to in this conversation? (Expected Behavior)
• From your current experience with using Annie to get your email, do you think you'd use Annie regularly to access your mail when you are away from your desk? (Future Use)

Performance Functions from Three Systems
• ELVIS User Sat. = .21 * COMP + .47 * MRS - .15 * ET
• TOOT User Sat. = .35 * COMP + .45 * MRS - .14 * ET
• ANNIE User Sat. = .33 * COMP + .25 * MRS + .33 * Help
  – COMP: User perception of task completion (task success)
  – MRS: Mean recognition score, i.e. recognition accuracy (cost)
  – ET: Elapsed time (cost)
  – Help: Help requests (cost)
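Applying one of these performance functions to a new dialogue is a weighted sum, sketched below. The coefficients are copied from the slide as printed, including their signs; the assumption that the input measures have already been normalized to a common scale, and the names `PERFORMANCE_FUNCTIONS` and `predicted_user_satisfaction`, are specific to this illustration.

```python
# Coefficients copied from the slide as printed (signs unchanged).
PERFORMANCE_FUNCTIONS = {
    "ELVIS": {"COMP": 0.21, "MRS": 0.47, "ET": -0.15},
    "TOOT": {"COMP": 0.35, "MRS": 0.45, "ET": -0.14},
    "ANNIE": {"COMP": 0.33, "MRS": 0.25, "Help": 0.33},
}


def predicted_user_satisfaction(system, measures):
    """Weighted sum of the measures for the given system.

    measures maps a measure name (COMP, MRS, ET, Help) to its value;
    values are assumed to be normalized already (an assumption of this sketch).
    """
    weights = PERFORMANCE_FUNCTIONS[system]
    return sum(weight * measures[name] for name, weight in weights.items())
```

For instance, an ELVIS dialogue with all three normalized measures equal to 1 gets .21 + .47 - .15 = .53.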

Performance Model
• Perceived task completion and mean recognition score are consistently significant predictors of User Satisfaction
• Performance model useful for system development
  – Making predictions about system modifications
  – Distinguishing 'good' dialogues from 'bad' dialogues
• But can we also tell on-line when a dialogue is 'going wrong'?

Next Class
• Turn-taking (J&M, Link to conversational analysis description, Beattie on Margaret Thatcher)