Rating Scales: What the Research Says

Joe Dumas, UX Consultant, joe.dumas99@gmail.com
Tom Tullis, Fidelity Investments, tom.tullis@fmr.com

Mini_UPA, 2009

The Scope of the Session
• Discussion of literature about rating scales in usability methods, primarily usability testing
• Brief review of recommendations from older literature
• Focus on recent studies
• Recommendations for practitioners

Table of Contents
• Types of rating scales
• Guidelines from past studies
• How to evaluate a rating scale
• Guidelines from recent studies
• Additional advantages of rating scales

Types of Rating Scales

Formats
• One question format
• Before-after format
• Multiple question format

One Question Formats
• Original Likert scale format (Rensis Likert):
  I think that I would like to use this system frequently:
  ___ Strongly Disagree
  ___ Disagree
  ___ Neither agree nor disagree
  ___ Agree
  ___ Strongly Agree

One Question Formats
• Likert-like scales:
  Characters on the screen are:
  Hard to read  1  2  3  4  5  6  7  8  9  Easy to read

One Question Formats
• One more Likert-like scale (used in SUMI):
  I would recommend this software to my colleagues:
  __ Agree  __ Undecided  __ Disagree

One Question Formats
• Subjective Mental Effort Scale (SMEQ)

One Question Formats
• Semantic differential
• Magnitude estimation: use any positive number

Before-After Ratings
• Before the task: How easy or difficult do you expect this task to be?
  Very easy  1  2  3  4  5  6  7  Very difficult
• After the task: How easy or difficult was the task to do?
  Very easy  1  2  3  4  5  6  7  Very difficult
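As an illustration of how before-after data like this can be summarized, here is a minimal Python sketch; the task names and ratings are entirely hypothetical, not data from the presentation:

```python
# Illustrative sketch (hypothetical data): comparing expected vs. actual
# task-difficulty ratings collected with the before-after format.
# Ratings use the 1 (very easy) to 7 (very difficult) scale from the slide.

before = {"Task 1": [2, 3, 2, 4, 3],   # "How easy or difficult do you expect this task to be?"
          "Task 2": [3, 2, 3, 3, 4]}
after  = {"Task 1": [2, 3, 3, 4, 3],   # "How easy or difficult was the task to do?"
          "Task 2": [6, 5, 6, 7, 5]}

for task in before:
    expected = sum(before[task]) / len(before[task])
    actual = sum(after[task]) / len(after[task])
    # A large positive gap means the task was harder than participants expected.
    print(f"{task}: expected {expected:.1f}, actual {actual:.1f}, gap {actual - expected:+.1f}")
```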

Multiple Question Formats (Selected List)
• System Usability Scale (SUS) – 10 ratings
• *Questionnaire for User-Interface Satisfaction (QUIS) – 71 ratings (long form), 26 (short form)
• *Software Usability Measurement Inventory (SUMI) – 50 ratings
• After Scenario Questionnaire (ASQ) – three ratings
* Requires a license

More Multiple Question Formats
• Post-Study System Usability Questionnaire (PSSUQ) – 19 ratings; the electronic version is called the Computer System Usability Questionnaire (CSUQ)
• *Website Analysis and MeasureMent Inventory (WAMMI) – 20 ratings of website usability
* Requires a license
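Several of these instruments can be scored automatically. As one example, here is a minimal sketch of the standard SUS scoring rule (odd items contribute response - 1, even items contribute 5 - response, and the summed contributions are multiplied by 2.5 to give a 0-100 score); the example responses are made up:

```python
# Minimal SUS scoring sketch using the standard 0-100 scoring formula.

def sus_score(responses):
    """responses: list of 10 ratings (1-5), in questionnaire order."""
    if len(responses) != 10:
        raise ValueError("SUS has exactly 10 items")
    total = 0
    for i, r in enumerate(responses, start=1):
        # Odd-numbered items are positively worded, even-numbered items negatively worded.
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5

# One participant's (hypothetical) answers:
print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1]))  # -> 85.0
```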

Guidelines from Past Studies

Guidelines
• Have 5-9 levels in a rating
  • You gain no additional information by having more than 10 levels
• Include a neutral point in the middle of the scale
  • Otherwise you lose information by forcing some participants to take sides
  • People from some Asian cultures are more likely to choose the midpoint

Guidelines
• Use positive integers as numbers
  • 1-7 instead of -3 to +3 (participants are less likely to go below 0 than they are to use 1-3)
  • Or don't show numbers at all
• Use word labels for at least the end points
  • Hard to create labels for every point beyond 5 levels
  • Having labels on the end points only also makes the data more "interval-like"

Guidelines
• Most word labels produce a bipolar scale
• In a 1 to 7 scale from easy to difficult, what is increasing with the numbers? Is ease the absence of difficulty?
• This may be one reason why participants are reluctant to move to the difficult end – it is a different concept than lack of ease
• One solution – a scale from "not at all easy" to "very easy"

Evaluating a Rating Scale

Statistical Criteria
• Is it valid? Does it measure what it's supposed to measure?
  • For example, does it correlate with other usability measures?
• Is it sensitive?
  • Can it discriminate between tasks or products with small samples?

Practical Criteria
• Is it easy for the participant to understand and use?
  • Do they get what it means?
• Is it easy for the tester to present – online or paper – and score?
  • Do you need a widget to present it?
  • Can scoring be done automatically?

Guidelines from Recent Studies

Post-Task Ratings
• The simpler the better
• Tedesco and Tullis found this format the most sensitive:
  Overall this task was:  Very Easy … Very Difficult
• Sauro and Dumas found SMEQ just as sensitive as Likert

More on Post-Task Ratings
• They provide diagnostic information about usability issues with tasks
• They correlate moderately well with other measures, especially time, and their correlations are higher than for post-test ratings
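A minimal sketch of how one might check that relationship, assuming hypothetical per-task times and ease ratings (not data from the studies cited):

```python
# Sketch (hypothetical data): checking how post-task ease ratings relate to
# task time, the measure the slide says they correlate with best.
from scipy.stats import spearmanr

task_time_sec = [35, 48, 52, 75, 90, 120, 150, 180]   # one value per participant-task
ease_rating   = [7, 6, 6, 5, 5, 4, 3, 2]              # 1 = very difficult, 7 = very easy

rho, p = spearmanr(task_time_sec, ease_rating)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")  # expect a negative correlation: longer tasks feel harder
```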

More on Post-Task Ratings
• Even post-task ratings may be inflated (Teague et al., 2001). Ratings made during a task were significantly lower than ratings made after each task, and ratings were higher still when given only after the tasks were done.
  Mean ease rating: During task 4.44 | After task 4.78 | Post-task only 5.60

Post-Test Ratings
• Home-grown questionnaires perform more poorly than standardized ones
• Tullis and Stetson and others have found SUS most sensitive; many testers are using it
• Some of the standardized questionnaires (SUMI and WAMMI) have industry norms to compare against
  • But no one knows what the database of norms contains

More on Post-Test Ratings
• The lowest correlations among all measures used in testing are with post-test ratings (Sauro and Lewis)
• Why? They are tapping into factors that don't affect other measures, such as demand characteristics, the need to please, the need to appear competent, lack of understanding of what an "overall" rating means, etc.

Examine the Distribution
• See how the average would miss how bimodal the distribution is. Some participants find it very hard to use. Why?
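A small sketch, using made-up ratings, of printing the distribution alongside the mean so a bimodal pattern like this is not hidden:

```python
# Sketch (hypothetical ratings): look at the full distribution, not just the mean.
from collections import Counter

ratings = [1, 2, 1, 2, 2, 6, 7, 7, 6, 7, 1, 7]   # 1 = very difficult, 7 = very easy
mean = sum(ratings) / len(ratings)
print(f"mean = {mean:.1f}  (looks 'average', hides the split)")

counts = Counter(ratings)
for level in range(1, 8):
    print(f"{level}: {'#' * counts.get(level, 0)}")   # crude text histogram
```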

Low Sensitivity with Small Samples
• Three recent studies have all shown that post-task and post-test ratings do not discriminate well with sample sizes below about 10-12
• For sample sizes typical of laboratory formative tests, ratings are not reliable
• Ratings can be used as an opportunity to get participants to talk about why they have chosen a value

The Value of Confidence Intervals
• Actual data from an online study comparing the NASA & Wikipedia sites for finding info on the Apollo space program.
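A minimal sketch of computing a 95% confidence interval around mean ratings for two sites, with invented numbers rather than the NASA/Wikipedia data on the slide:

```python
# Sketch: 95% confidence intervals around mean task-ease ratings for two sites.
import statistics
from scipy.stats import t

def mean_ci(ratings, confidence=0.95):
    n = len(ratings)
    m = statistics.mean(ratings)
    sem = statistics.stdev(ratings) / n ** 0.5          # standard error of the mean
    half_width = t.ppf((1 + confidence) / 2, df=n - 1) * sem
    return m, m - half_width, m + half_width

site_a = [5, 6, 4, 7, 6, 5, 6, 7, 5, 6]
site_b = [3, 4, 2, 5, 4, 3, 4, 2, 5, 3]
for name, data in [("Site A", site_a), ("Site B", site_b)]:
    m, lo, hi = mean_ci(data)
    print(f"{name}: mean {m:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
# Non-overlapping intervals are a quick visual cue that the difference is real.
```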

Little Known Advantages of Rating Scales

Ratings Can Help Prioritize Work
• Quadrants: "Promote It", "Don't Touch It", "Big Opportunity", "Fix It Fast"
• Ease axis: 1 = Difficult … 7 = Easy
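A sketch of this quadrant idea in code. The slide only labels the ease axis (1 = difficult, 7 = easy); the second axis is assumed here to be task importance or frequency of use, and the quadrant-to-label mapping below is likewise an assumption:

```python
# Sketch: sorting tasks into prioritization quadrants.
# Cut points and the second (importance) axis are assumptions, not from the slide.

def quadrant(ease, importance, ease_cut=4, importance_cut=3):
    if ease >= ease_cut:
        return "Promote It" if importance >= importance_cut else "Don't Touch It"
    return "Fix It Fast" if importance >= importance_cut else "Big Opportunity"

tasks = {"Check balance": (6.2, 5), "Transfer funds": (2.8, 5),
         "Update profile": (3.1, 2), "Print statement": (6.5, 1)}
for name, (ease, importance) in tasks.items():
    print(f"{name}: {quadrant(ease, importance)}")
```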

Ratings Can Help Identify "Disconnects"
• This "disconnect" between the accuracy and task ease ratings is worrisome – it indicates users didn't realize they were screwing up on Task 2!
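A small sketch, with invented numbers, of flagging such disconnects automatically by comparing each task's success rate with its mean ease rating:

```python
# Sketch (hypothetical data): flag tasks that participants fail but still rate as easy.

results = {
    "Task 1": {"success_rate": 0.90, "mean_ease": 6.1},
    "Task 2": {"success_rate": 0.35, "mean_ease": 5.8},  # failed but rated easy
    "Task 3": {"success_rate": 0.40, "mean_ease": 2.2},
}
for task, r in results.items():
    if r["success_rate"] < 0.5 and r["mean_ease"] > 5:   # thresholds are arbitrary
        print(f"{task}: disconnect -- low success ({r['success_rate']:.0%}) "
              f"but high ease rating ({r['mean_ease']:.1f})")
```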

Ratings Can Help You Make Comparisons
• You can be very pleased if you get an average SUS score of 83 (the 94th percentile of this distribution). But you should be worried if you get an average SUS score of 48 (the 12th percentile).
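A sketch of how a score can be placed against a set of norms, using scipy's percentileofscore and an invented list of prior SUS study means; the 94th/12th percentile figures above come from the presenters' distribution, not from this sample:

```python
# Sketch: placing a SUS score against prior SUS study means (made-up norm values).
from scipy.stats import percentileofscore

norm_scores = [35, 42, 48, 51, 55, 58, 60, 62, 65, 66, 68, 70, 72, 74, 77, 80, 84, 88]

for score in (83, 48):
    pct = percentileofscore(norm_scores, score)
    print(f"SUS {score}: roughly {pct:.0f}th percentile of this (made-up) sample")
```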

In Closing…
• These slides, a bibliography of readings, and associated examples can be downloaded from: http://www.measuringUX.com/
• Feel free to contact us with questions!