ITCS 6010 VUI Evaluation: PARADISE & SUM

PARADISE Paradigm for Dialogue System Evaluation
- Goal: Maximize User Satisfaction

PARADISE Paradigm for Dialogue System Evaluation
- Performance is modeled as a weighted function of a task-based success measure and dialogue-based cost measures, where weights are computed by correlating user satisfaction with performance.
- Dialogue tasks are represented as Attribute Value Matrix (AVM) pairs.
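The weighted function described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the weights `alpha` and `cost_weights` are supplied by hand rather than fitted against user satisfaction, each measure is z-score normalized so that differently scaled measures can be combined, and the data and the cost name `turns` are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Normalize a list of numbers to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def performance(task_success, costs, alpha, cost_weights):
    """Per-dialogue performance: weighted success minus weighted costs.

    task_success: one success score per dialogue
    costs: dict mapping a cost name to a list with one value per dialogue
    alpha: weight on task success (assumed given, not fitted here)
    cost_weights: dict mapping each cost name to its weight
    """
    norm_success = z_scores(task_success)
    norm_costs = {name: z_scores(vals) for name, vals in costs.items()}
    scores = []
    for i, s in enumerate(norm_success):
        penalty = sum(cost_weights[name] * norm_costs[name][i] for name in costs)
        scores.append(alpha * s - penalty)
    return scores


# Hypothetical data for three dialogues: success score and number of turns.
success = [1.0, 0.5, 0.0]
costs = {"turns": [4, 9, 14]}
scores = performance(success, costs, alpha=0.5, cost_weights={"turns": 0.3})
print(scores)
```

In a full evaluation the weights would come from relating these measures to collected user satisfaction ratings, which is why a sizeable sample of dialogues is needed before any score can be computed.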

Attribute Value Matrix (AVM)
- AVM represents dialogue task
  - Information exchange between agent and user
  - Represented as ordered pairs of attribute and possible values
  - AVM instantiation of task
    - Same regardless of dialogue

AVM (cont’d)

Dialogue with Agent A (attributes addressed by each utterance in brackets):
A1: Hello this is Train Enquiry Service. Please speak after the tone. Which information do you need? [DC, AC, DR, DT]
U1: I want to go from Torino to Milano. [DC, AC]
A2: Do you want to go from Trento to Milano? Yes or No? [DC, AC]
U2: No. [DC, AC]
A3: Do you want to leave from Trento? Yes or No? [DC]
U3: No. [DC]
A4: Where do you want to leave from? [DC]
U4: I want to leave from Torino. [DC]
A5: Do you want to leave from Torino? Yes or No? [DC]
U5: Yes. [DC]
A6: Do you want to go to Milano? [AC]
U6: Yes. [AC]
A7: At which time do you want to leave? [DR]
U7: I want to travel in the evening. [DR]
A8: Do you want to leave between 6 and 9 pm? Yes or No? [DR]
U8: Yes. [DR]
A9: There is a train leaving at 8:00 pm. [DT]
…

Dialogue with Agent B:
B1: Hello this is Train Enquiry Service. Please speak after the tone. Which information do you need?
U1: I want to travel from Torino to Milano.
B2: At which time do you want to leave from Merano to Milano?
U2: No, I want to leave from Torino in the evening.
B3: Do you want to leave from Torino between 6 and 11 pm? Please answer Yes or No.
U3: Yes.
B4: A train leaves at 8:00 pm.
…
(Attribute annotations for Dialogue B, alignment to utterances lost: DC, AC, DR, DT / DC, AC, DR / DC, DR / DT)

AVM for the task:
attribute            actual value
depart-city (DC)     Torino
arrival-city (AC)    Milano
depart-range (DR)    Evening
depart-time (DT)     8 pm
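One way to picture the AVM is as a plain mapping from attribute to value that stays the same whichever agent handles the call. The sketch below stores the key from this slide as a Python dict and checks how many attributes a dialogue ended with correctly. The function `matched_fraction` and the two example outcomes are hypothetical, and the simple fraction is a stand-in for illustration: PARADISE itself derives task success from a confusion matrix using the Kappa coefficient.

```python
# Attribute-value key for the train enquiry task, taken from this slide.
AVM_KEY = {
    "depart-city": "Torino",
    "arrival-city": "Milano",
    "depart-range": "Evening",
    "depart-time": "8 pm",
}


def matched_fraction(key, obtained):
    """Fraction of attributes whose obtained value equals the key value."""
    hits = sum(1 for attr, value in key.items() if obtained.get(attr) == value)
    return hits / len(key)


# Hypothetical outcomes: one dialogue ends with every value right,
# another with a misrecognized departure city left uncorrected.
all_correct = dict(AVM_KEY)
wrong_city = {**AVM_KEY, "depart-city": "Trento"}

print(matched_fraction(AVM_KEY, all_correct))  # 1.0
print(matched_fraction(AVM_KEY, wrong_city))   # 0.75
```

Because the key depends only on the task, the same `AVM_KEY` can score dialogues from Agent A and Agent B even though they take very different numbers of turns.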

PARADISE Paradigm for Dialogue System Evaluation
- Advantages
  - PARADISE approach addresses performance and user satisfaction
- Disadvantages
  - Too complex to compute
  - Need a large sample size up front

Alternative Approaches
- What’s important?
  - Maximize User Satisfaction
  - Maximize Task Success

User Satisfaction
- How do we measure user satisfaction?
  - Questionnaires
  - Interviews
  - Focus Groups

Task Success
- How do we measure task success?
  - Logging Actual Use
  - Performance Measurement
  - Walkthroughs
  - Pilot Testing

Task Success
- For each dialogue and the entire conversation establish AVMs.
- Measure task success with respect to:
  - Task completion time
  - Accuracy or Errors (e.g. misinterpretations)

Conclusions
- PARADISE is good, but too complex!
- Measure user satisfaction and task success.
- What if user satisfaction is not the most relevant aspect?

Speech Usability Metric (SUM)
- Uses 3 metrics:
  - User satisfaction
  - Accuracy
  - Task completion time
- Eliminates restriction of one factor to determine usability

Speech Usability Metric (SUM)
- SUM = X * User Satisfaction + Y * Accuracy + Z * Completion Time
  - X + Y + Z = 1
  - X, Y, Z > 0
  - Weights determined by evaluator
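The SUM formula and its two weight constraints translate directly into code. The sketch below assumes, beyond what the slide states, that all three inputs have already been scaled to the range 0 to 1 with higher meaning better (so a faster task yields a higher completion-time score). The function name `sum_score` and the example values are hypothetical.

```python
import math


def sum_score(satisfaction, accuracy, time_score, x, y, z):
    """Weighted Speech Usability Metric from three scores in [0, 1]."""
    # Enforce the constraints from the slide: X, Y, Z > 0 and X + Y + Z = 1.
    if min(x, y, z) <= 0:
        raise ValueError("weights X, Y, Z must all be greater than 0")
    if not math.isclose(x + y + z, 1.0):
        raise ValueError("weights X, Y, Z must sum to 1")
    return x * satisfaction + y * accuracy + z * time_score


# Hypothetical evaluation that cares most about user satisfaction.
score = sum_score(0.8, 0.9, 0.6, x=0.5, y=0.3, z=0.2)
print(round(score, 2))  # 0.79
```

Changing the weights lets an evaluator shift emphasis, for example toward accuracy for a banking application, without changing how the three metrics are collected.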

User Satisfaction
- Surveys
- Questionnaires
- Interviews

Accuracy
- Misinterpretations
  - System recognizes wrong word
- Out-of-vocabulary errors
  - Words not in system grammar
- Wrong choice
  - Correct word recognized, wrong path chosen

Task Completion Time
- Time to complete task
- Time for expert to complete task (ETCT)
- Maximum time to complete task (MTCT)
- Expected time to complete task (ExTCT)
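The slide lists the reference times but not how they combine into a single score. One possible scaling, offered purely as an assumption and not as the method used in the course, is to place a user's time linearly between the expert time (ETCT, score 1.0) and the maximum time (MTCT, score 0.0). The function name `time_score` and the example times in seconds are hypothetical.

```python
def time_score(actual, etct, mtct):
    """Map a completion time to [0, 1]: 1.0 at expert time, 0.0 at the maximum.

    Assumed linear scaling; the slides do not specify a formula.
    """
    if not etct < mtct:
        raise ValueError("ETCT must be less than MTCT")
    raw = (mtct - actual) / (mtct - etct)
    # Clip so times faster than the expert or slower than the maximum stay in range.
    return max(0.0, min(1.0, raw))


# Hypothetical times in seconds: expert needs 30, the cutoff is 90.
print(time_score(60, etct=30, mtct=90))   # 0.5
print(time_score(20, etct=30, mtct=90))   # 1.0
print(time_score(120, etct=30, mtct=90))  # 0.0
```

A score produced this way is already in the 0 to 1 range with higher meaning better, so it could be combined with the other two metrics in a weighted sum.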

Conclusion
- SUM determines usability of a speech application
  - Utilizes 3 pre-defined metrics
  - Allows for greater flexibility