LREC 2010: O3 - Dialogue and Evaluation
Estimation Method of User Satisfaction Using N-gram-based Dialog History Model for Spoken Dialog System
Sunao Hara, Norihide Kitaoka, Kazuya Takeda
{naoh, kitaoka, kazuya.takeda}@nagoya-u.jp
Graduate School of Information Science, Nagoya University, Japan

Introduction
Outline: 1. Introduction 2. MusicNavi2 database 3. N-gram modeling 4. Estimation experiment 5. Conclusion
• The aim of this study
  – Construct an estimation model of user satisfaction for spoken dialog systems (SDSs) based on data from real PC environments
• Experiment
  – Field experiment using an SDS for a music retrieval application
  – Construct and evaluate an estimation model for user satisfaction using an N-gram history model
LREC 2010: Sunao HARA et al., Nagoya Univ., Japan. May 19, 2010

Background (1/2)
• Use of speech input applications (e.g. Skype) by PC users is spreading
  – More users may use spoken dialog systems (SDSs) via the Internet
• The acoustic properties of PC environments differ among users
  – e.g. microphones, noise conditions, etc.
  → Collect speech under realistic PC environments
• From a practical application standpoint
  – Evaluation and prediction of system performance (user satisfaction) are also important issues
  → Build an estimation model for user satisfaction

Background (2/2)
• Evaluation using automatically measured metrics
  – Tune the system parameters at the design stage
  – Select the best dialog strategy for SDS applications
  – PARADISE framework [Walker et al. 1997]
• Detection of problematic dialogs for call-center Interactive Voice Response (IVR) systems
  – Detect that “the conversation will break down” as soon as possible
  – Problematic dialog predictor using the SLU-success feature (SLU: Spoken Language Understanding) [Walker et al. 2002]
  – N-gram-based call quality monitoring system [Kim 2007]
→ Can we estimate the user satisfaction of an SDS by modeling the dialog context?

MusicNavi2 database
• Field experiment using a music retrieval system with a spoken dialog interface
  1. Download the system through the Internet
  2. Use it for a certain period
  3. Fill in questionnaires on the web page
• Music retrieval system: MusicNavi2
  – “Music retrieval application” + “spoken dialog interface”
  – A spoken dialog interface for retrieving and playing songs stored on the user’s PC
  – Can collect speech data in cooperation with a server program via the Internet

Example of a dialog (U = User, S = System; user’s utterances / system’s prompts)
U: Hello (ko-n-ni-chi-wa)
S: Hello
U: Da-i-to-ka-i
S: Do you want to retrieve the song “Da-i-to-ka-i”?
U: Yes (ha-i)
S: Now, playing the song “Da-i-to-ka-i” by “Crystal King.”
U: Stop (te-i-shi)
S: Now, stopping.

Data collection by the field test
• Large-scale field test through the Internet
  – Subjects used MusicNavi2 on their own PCs
  – Participants: 1369 subjects
  – Total usage: 488 hours
• User’s task
  – To listen to at least five songs
  – To perform at least twenty Q&A dialogs, or to use the system for over forty minutes
• Questionnaire (answered only by “task complete” users)
  – Satisfaction level for the SDS from 1 to 5
    1: Extremely unsatisfied  2: Unsatisfied  3: Acceptable  4: Satisfied  5: Extremely satisfied

Distributions of the experimental subjects and the equipment they used
• Subjects who answered questionnaires
  – 449 subjects (278 males and 171 females)
  – Total 34296 utterances
[Figure panels: Microphone; Loudspeaker / headphone]

Overview of the MusicNavi2 database
[Figure panels: # of utterances; Word Error Rate; Utterances per song played]

Pre-analysis of the MusicNavi2 database
• Classification of users by their satisfaction level
  – “task complete” users: c = 1, 2, 3, 4, 5
  – “task incomplete” users: c = ϕ
• Summary of data
  – Total 518 subjects

  c    # of subjects    # of utterances    WER [%]    Utt. / song
  ϕ    69               52.2               70.5       107
  1    38               134.5              54.1       7.21
  2    102              119.7              51.0       5.34
  3    107              114.9              46.8       5.12
  4    155              106.5              41.2       4.22
  5    47               98.4               35.3       3.43

Modeling method for the dialog context
• The dialog management of an SDS is designed by a dialog developer
  – The management is not always satisfactory for users
• Assume that satisfaction appears in the dialog context
• Statistically learn the naturalness of the dialog
  – Use N-grams to model the dialog context
  – Construct a model for each class of users
  – Estimate an unknown user’s satisfaction based on the likelihood under each N-gram model

Spoken dialog logs to dialog act symbols
• Vocabulary size of the recognition dictionary
  – That is, the number of songs
  – Differs between users
• Word-level information is informative, but it is too sparse to handle statistically
• Use dialog act symbols for the users’ and the system’s acts
  – Defined 21 system dialog acts and 19 user dialog acts

Example of an encoded dialog (U = User, S = System)

       Dialog act symbol             User’s utterance / system’s prompt
  U    x1 = USR_CMD_HELLO            Hello (ko-n-ni-chi-wa)
  S    x2 = SYS_INFO_GREETING        Hello
  U    x3 = USR_REQUEST_BYMUSIC      Da-i-to-ka-i
  S    x4 = SYS_CONFIRM_KEYWORD      Do you want to retrieve the song “Da-i-to-ka-i”?
  U    x5 = USR_CMD_YES              Yes (ha-i)
  S    x6 = SYS_PLAY_SONG            Now, playing the song “Da-i-to-ka-i” by “Crystal King.”
  U    x7 = USR_CMD_STOP             Stop (te-i-shi)
  S    x8 = SYS_INFO_STOPPED         Now, stopping.
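As a rough illustration of the encoding above, the following sketch stores the example dialog as a list of dialog act symbols and derives the system-only (SYS), user-only (USR), and combined (SYSUSR) streams. The symbol names come from the slide; the helper `select_stream` and the prefix-based filtering are assumptions made for illustration, not the authors' implementation.

```python
# Dialog act symbols x1..x8 from the example dialog on the slide.
ENCODED_DIALOG = [
    "USR_CMD_HELLO",
    "SYS_INFO_GREETING",
    "USR_REQUEST_BYMUSIC",
    "SYS_CONFIRM_KEYWORD",
    "USR_CMD_YES",
    "SYS_PLAY_SONG",
    "USR_CMD_STOP",
    "SYS_INFO_STOPPED",
]


def select_stream(symbols, stream):
    """Return the sub-sequence used as model input (hypothetical helper).

    stream is "SYS" (system acts only), "USR" (user acts only),
    or "SYSUSR" (both, kept in time order).
    """
    if stream == "SYSUSR":
        return list(symbols)
    if stream in ("SYS", "USR"):
        return [s for s in symbols if s.startswith(stream + "_")]
    raise ValueError(f"unknown stream: {stream}")


if __name__ == "__main__":
    for name in ("SYS", "USR", "SYSUSR"):
        print(name, select_stream(ENCODED_DIALOG, name))
```

Keeping the three streams separate makes it possible to train one N-gram model per stream and compare them.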

Modeling the dialog act sequence by N-gram
• A dialog act sequence
  – The dialog act symbols arranged in time order t
• N-gram probability (= likelihood) of the sequence given the model for a user class c
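For reference, the standard N-gram factorization of such a likelihood can be written as below. The notation (x for the sequence, T for its length, a context of the previous N-1 symbols) is assumed here and may differ from the original slide.

```latex
\[
  x = (x_1, x_2, \ldots, x_T), \qquad
  P(x \mid c) = \prod_{t=1}^{T} P\bigl(x_t \mid x_{t-N+1}, \ldots, x_{t-1},\, c\bigr)
\]
```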

Estimation experiment
• Detection of the user’s class using N-gram models
  – Exp. 1: “task incomplete” users
  – Exp. 2: “unsatisfied” users
• Experimental conditions
  – N-gram: 1-gram, 2-gram, …, 8-gram
    • Witten-Bell smoothing (using the SRILM toolkit)
  – Input sequence: USR, SYSUSR
  – Leave-one-out cross validation
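The experimental loop can be sketched roughly as follows: hold out one user, train one N-gram model on the target class and one on the remaining users, and score the held-out sequence by the log-likelihood ratio. This is only a sketch under stated assumptions: it uses add-one smoothing as a simple stand-in for the Witten-Bell smoothing done with SRILM, and the toy data, function names, and class labels are hypothetical.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"


def pad(seq, n):
    """Add n-1 start markers and one end marker around a symbol sequence."""
    return [BOS] * (n - 1) + list(seq) + [EOS]


def train_ngram(sequences, n):
    """Count n-grams and their (n-1)-symbol contexts."""
    ngrams, contexts = Counter(), Counter()
    for seq in sequences:
        padded = pad(seq, n)
        for i in range(n - 1, len(padded)):
            ctx = tuple(padded[i - n + 1:i])
            ngrams[ctx + (padded[i],)] += 1
            contexts[ctx] += 1
    return ngrams, contexts


def log_likelihood(seq, model, n, vocab_size):
    """log P(seq | model) with add-one smoothing (a stand-in, not Witten-Bell)."""
    ngrams, contexts = model
    padded = pad(seq, n)
    total = 0.0
    for i in range(n - 1, len(padded)):
        ctx = tuple(padded[i - n + 1:i])
        total += math.log((ngrams[ctx + (padded[i],)] + 1)
                          / (contexts[ctx] + vocab_size))
    return total


def leave_one_out(users, target, n):
    """Return (is_target, log-likelihood ratio) for every held-out user."""
    # Predictable symbols: all dialog acts seen in the data plus the end marker.
    vocab_size = len({s for _, seq in users for s in seq}) + 1
    scored = []
    for k, (label, seq) in enumerate(users):
        rest = users[:k] + users[k + 1:]
        pos = train_ngram([s for c, s in rest if c == target], n)
        neg = train_ngram([s for c, s in rest if c != target], n)
        llr = (log_likelihood(seq, pos, n, vocab_size)
               - log_likelihood(seq, neg, n, vocab_size))
        scored.append((label == target, llr))
    return scored


# Toy data (hypothetical): "phi" marks task-incomplete users.
ASK = ["USR_REQUEST_BYMUSIC", "SYS_CONFIRM_KEYWORD"]
PLAY = ["USR_CMD_YES", "SYS_PLAY_SONG"]
USERS = [
    ("3", ASK + PLAY),
    ("4", ASK + PLAY + ASK + PLAY),
    ("5", ASK + ASK + PLAY),
    ("phi", ASK + ASK + ASK),
    ("phi", ASK + ASK),
    ("phi", ASK + ASK + ASK + ASK),
]

if __name__ == "__main__":
    for is_target, llr in leave_one_out(USERS, target="phi", n=2):
        print(is_target, round(llr, 3))
```

A larger ratio means the held-out sequence looks more like the target class; the scores can then be thresholded to make a detection decision.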

Estimation experiment
• Detection method
  – Model selection by thresholding the likelihood ratio
• Evaluation metrics
  – ROC curve
  – Area under the ROC curve (AUC)
[Figure: ROC curve; horizontal axis: false detection (0 to 1), vertical axis: true detection (0 to 1)]
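To make the evaluation metrics concrete, here is a minimal sketch that sweeps a threshold over likelihood-ratio scores, collects (false detection rate, true detection rate) points, and integrates them with the trapezoidal rule. The function names and the example scores are hypothetical.

```python
def roc_points(scored):
    """Build ROC points from (is_target, score) pairs.

    Returns a list of (false detection rate, true detection rate) points,
    obtained by lowering the threshold over the sorted scores.
    Requires at least one target and one non-target example.
    """
    n_pos = sum(1 for is_target, _ in scored if is_target)
    n_neg = len(scored) - n_pos
    ordered = sorted(scored, key=lambda pair: pair[1], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ordered):
        # Consume all examples that share the same score in one step.
        j = i
        while j < len(ordered) and ordered[j][1] == ordered[i][1]:
            if ordered[j][0]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    # Hypothetical scores: two target users and three non-target users.
    example = [(True, 2.0), (True, 0.5), (False, 1.0), (False, -1.0), (False, -2.0)]
    print(round(auc(roc_points(example)), 4))
```

An AUC of 1.0 means that some threshold separates the two groups perfectly, while 0.5 corresponds to chance-level ranking.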

AUC (Area under the ROC curve)

              “task incomplete” users       “unsatisfied” users
  N           SYS      USR      SYSUSR      SYS      USR      SYSUSR
  1-gram      0.901    0.873    0.927       0.611    0.638    0.619
  2-gram      0.948    0.929    0.977       0.628    0.644    0.724
  3-gram      0.989    0.954    0.993       0.591    0.651    0.704
  4-gram      0.995    0.952    0.997       0.583    0.681    0.739
  5-gram      0.993    0.954    0.995       0.629    0.662    0.739
  6-gram      0.989    0.951    0.995       0.632    0.639    0.761
  7-gram      0.988    0.946    0.995       0.604    0.633    0.765
  8-gram      0.987    0.936    0.994       0.592    0.622    0.756

• High detection performance for “task incomplete” users when using the system dialog acts
• The results suggest the effectiveness of using both system and user dialog acts

Detection result of “task incomplete” users
• The SYSUSR 4-gram achieved a 100% true detection rate with a 6% false detection rate

  N (SYSUSR)    AUC
  1-gram        0.927
  2-gram        0.977
  3-gram        0.993
  4-gram        0.997
  5-gram        0.995
  6-gram        0.995
  7-gram        0.995
  8-gram        0.994

Detection result of “unsatisfied” users
• The larger the N of the N-gram, the lower the false detection rate

  N (SYSUSR)    AUC
  1-gram        0.619
  2-gram        0.724
  3-gram        0.704
  4-gram        0.739
  5-gram        0.739
  6-gram        0.761
  7-gram        0.765
  8-gram        0.756

Conclusion
• Estimation method of user satisfaction using an N-gram-based dialog history model for SDSs
  – Constructed a database recorded in real PC environments
  – Achieved high performance in the detection of “task incomplete” users
    • 100% true detection rate at a 6% false detection rate
  – Performance was not sufficient in the detection of “unsatisfied” users
  – Higher-order N-gram models were effective compared with the 1-gram model
  – Using both system and user dialog acts was effective
• Future work
  – N-gram model-based estimation of dialog failure (online detection)
  – Analysis of the dialog contexts that affect user satisfaction
  – Integrated method using acoustic features, prosodic features, dialog features, etc.

• Thanks for your kind attention!

Modeling the dialog act sequence by N-gram
• Dialog logs are encoded to dialog act symbols automatically
  – User’s dialog acts
    • Obtained from speech recognition results
    • Defined in the recognition dictionary
  – System’s dialog acts
    • Obtained from system prompts or responses
    • The same as the system’s internal acts
• A dialog act sequence: x
  – The dialog act symbols arranged in time order t
• N-gram probability (= likelihood) of x given the model for a satisfaction level s

Detection by thresholding
• Model selection by an a posteriori odds classifier
• Introduce the a priori odds 1/α and the Bayes factor B
• Finally, the decision reduces to thresholding
  * α = 1 means an ML (maximum likelihood) classifier
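One way to write out the a posteriori odds classifier, consistent with the definitions on this slide, is sketched below. The class labels c_1 (the class to detect) and c_0 (all other users) are assumed notation, not taken from the slide.

```latex
\[
  \frac{P(c_1 \mid x)}{P(c_0 \mid x)}
  = \underbrace{\frac{P(c_1)}{P(c_0)}}_{\text{a priori odds } 1/\alpha}
    \cdot
    \underbrace{\frac{P(x \mid c_1)}{P(x \mid c_0)}}_{\text{Bayes factor } B}
  = \frac{B}{\alpha}
\]
\[
  \text{decide } c_1 \iff \frac{B}{\alpha} > 1 \iff B > \alpha
\]
```

With α = 1 the rule picks the class whose model gives the higher likelihood, which is the ML classifier mentioned on the slide. Sweeping α traces out the ROC curve.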

Detection result for 6 classes of satisfaction
• Using only the system sequence, the 3-gram model achieves 34.4% accuracy

Confusion matrix
• 3-gram of the SYS sequence (rows: actual class, columns: estimated class)

  Actual \ Estimated    ϕ     1     2     3     4     5
  ϕ                     43    5     7     5     6     3
  1                     0     7     8     9     11    3
  2                     1     8     31    16    35    11
  3                     0     9     22    23    45    8
  4                     0     8     34    29    66    18
  5                     0     4     5     6     24    8

• “Task incomplete” users (ϕ) are identified with relatively high accuracy and few false detections
• For satisfied users as well, there are few cases where the estimate differs greatly from the actual level
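The figures in this matrix can be checked with a short sketch. It assumes that rows are actual classes and columns are estimated classes, which agrees with the per-class subject counts (69, 38, 102, 107, 155, 47); the function names are hypothetical.

```python
LABELS = ["phi", "1", "2", "3", "4", "5"]

# Rows: actual class, columns: estimated class (3-gram, SYS sequence).
CONFUSION = [
    [43, 5, 7, 5, 6, 3],
    [0, 7, 8, 9, 11, 3],
    [1, 8, 31, 16, 35, 11],
    [0, 9, 22, 23, 45, 8],
    [0, 8, 34, 29, 66, 18],
    [0, 4, 5, 6, 24, 8],
]


def accuracy(matrix):
    """Fraction of users on the diagonal (estimated class == actual class)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def recall(matrix, i):
    """Share of actual class i that was estimated as class i."""
    return matrix[i][i] / sum(matrix[i])


def precision(matrix, i):
    """Share of users estimated as class i that actually belong to class i."""
    return matrix[i][i] / sum(row[i] for row in matrix)


if __name__ == "__main__":
    print(f"accuracy: {accuracy(CONFUSION):.3f}")
    for i, label in enumerate(LABELS):
        print(label, f"recall={recall(CONFUSION, i):.3f}",
              f"precision={precision(CONFUSION, i):.3f}")
```

The diagonal sums to 178 of 518 users, which reproduces the 34.4% figure. Only one user who completed the task was estimated as ϕ, in line with the remark about few false detections.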

Modeling the N-gram
• Dialog logs are encoded to dialog act symbols automatically
  – User’s dialog acts
    • Obtained from speech recognition results
    • Defined in the recognition dictionary
  – System’s dialog acts
    • Obtained from system responses or acts
    • The same as the system’s internal acts
• A dialog act sequence: x
  – The dialog act symbols arranged in time order t
• Build an N-gram model for each of the 6 satisfaction classes
  – Witten-Bell smoothing, using the SRILM toolkit
