Empirical Evaluation
Chris North
CS 5984: Information Visualization

Evaluating Visualizations
• Expert Review
  • Examination by visualization expert
• Heuristic Evaluation
  • Principles, Guidelines
  • Algorithmic
• Usability Evaluation
  • Observation, problem identification
• Empirical Experiment **
  • Controlled scientific experiment, “user study”
  • Comparisons, statistical analysis

What is Science?
• Measurement
• Modeling

Scientific Method
1. Form hypothesis
2. Collect data
3. Analyze
4. Accept/reject hypothesis

Deep Questions
• Is ‘computer science’ science?
• How can you “prove” a hypothesis with science?

Empirical Experiment
• Typical question:
  • Which visualization is better in which situations?
[Figures: Lifelines; Perspective Wall]

More Rigorous Question
• Does Vis Tool (Lifelines or Persp. Wall) have an effect on user performance time for task X?
• Null hypothesis:
  • No effect
  • Lifelines = Persp. Wall
  • Want to disprove, provide counter-example, show an effect

Variables
• Independent Variables (what you vary) and treatments (the variable values):
  • Visualization tool
    » Lifelines, Perspective Wall, Text UI
  • Task type
    » Find, count, pattern, compare
  • Data size (# of items)
    » 100, 1000000
• Dependent Variables (what you measure)
  • User performance time
  • Errors
  • Subjective satisfaction (survey)
• HCI metrics!

Example: 2 x 3 design

                             Ind Var 2: Task Type
                             Task 1    Task 2    Task 3
Ind Var 1:    Lifelines
Vis. Tool     Persp. Wall

• n users per cell
• Each cell holds measured user performance times (dep var)
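
The 2 x 3 layout can be sketched as the cross product of the two independent variables. This is only an illustration: the value of `n_per_cell` is an assumption borrowed from the 20-users-per-cell example on the Groups slide, not something this slide specifies.

```python
from itertools import product

# Independent variables and their treatments, as named on the slide.
vis_tools = ["Lifelines", "Persp. Wall"]
task_types = ["Task 1", "Task 2", "Task 3"]

# Each (tool, task) combination is one cell of the design.
cells = list(product(vis_tools, task_types))

# Assumed for illustration: n users contribute one time to each cell.
n_per_cell = 20
total_measurements = len(cells) * n_per_cell

for tool, task in cells:
    print(f"{tool:12s} x {task}: {n_per_cell} performance times")
print("cells:", len(cells), "measurements:", total_measurements)
```

Adding a third tool (for example the Text UI treatment) would turn this into a 3 x 3 design with nine cells.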

Groups
• “Between subjects” variable
  • 1 group of users for each variable treatment
  • Group 1: 20 users, Lifelines
  • Group 2: 20 users, Persp. Wall
  • Total: 40 users, 20 per cell
• “Within subjects” (repeated) variable
  • All users perform all treatments
  • Counter-balancing order effect
  • Group 1: 20 users, Lifelines then Persp. Wall
  • Group 2: 20 users, Persp. Wall then Lifelines
  • Total: 40 users, 40 per cell
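
A minimal sketch of the two grouping schemes, assuming 40 hypothetical user IDs and random assignment with a fixed seed. The function names `assign_between` and `assign_within` are made up for this example.

```python
import random

def assign_between(users, treatments, seed=0):
    """Between subjects: each user is randomly placed in exactly one treatment group."""
    rng = random.Random(seed)
    shuffled = users[:]
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, user in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(user)
    return groups

def assign_within(users, treatments, seed=0):
    """Within subjects: every user does both treatments; half start with each one."""
    rng = random.Random(seed)
    shuffled = users[:]
    rng.shuffle(shuffled)
    orders = [treatments, treatments[::-1]]  # counterbalanced orders for 2 treatments
    return {user: orders[i % 2] for i, user in enumerate(shuffled)}

users = [f"user{i:02d}" for i in range(40)]  # hypothetical participant IDs
tools = ["Lifelines", "Persp. Wall"]

between = assign_between(users, tools)
within = assign_within(users, tools)

print({tool: len(group) for tool, group in between.items()})
print(sum(order[0] == "Lifelines" for order in within.values()), "users start with Lifelines")
```

The between-subjects split yields 20 measurements per cell, while the within-subjects scheme yields 40 per cell from the same 40 users, at the cost of needing counterbalanced orders.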

Issues
• Randomized
• Fairness
  • Identical procedures
• Bias
• User privacy, data security

Procedure
• For each user:
  • Sign legal forms
  • Pre-Survey: demographics
  • Instructions
    » Do not reveal true purpose of experiment
  • Training runs
  • Actual runs
  • Post-Survey: subjective measures
• × n users

Data
• Measured dependent variables
• Spreadsheet:
  • Lifelines task 1, 2, 3, Persp. Wall task 1, 2, 3

Averages

                             Ind Var 2: Task Type
                             Task 1    Task 2    Task 3
Ind Var 1:    Lifelines      37.2      54.5      103.7
Vis. Tool     Persp. Wall    29.8      53.2      145.4

Measured user performance times (dep var)

Persp. Wall better than Lifelines?
[Chart: Perf time (secs), Lifelines vs. Persp. Wall]
• Problem with Averages: lossy
  • Compares only 2 numbers
  • What about the 40 data values? (Show me the data!)

The real picture
[Chart: Perf time (secs), Lifelines vs. Persp. Wall]
• Need stats that take all data into account
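
To see why two averages alone can mislead, here is a small sketch with made-up performance times. The numbers are hypothetical (chosen only so the means land near the Task 1 averages) and are not data from the study.

```python
import statistics

# Hypothetical performance times in seconds; NOT the study's data.
lifelines = [22, 51, 35, 40, 28, 47, 33, 42]
persp_wall = [15, 44, 27, 36, 20, 41, 25, 30]

for name, times in [("Lifelines", lifelines), ("Persp. Wall", persp_wall)]:
    print(
        f"{name:12s} mean={statistics.mean(times):.2f} "
        f"stdev={statistics.stdev(times):.2f} "
        f"range={min(times)}..{max(times)}"
    )

# The means differ, yet most individual values fall in a shared range.
overlap_low = max(min(lifelines), min(persp_wall))
overlap_high = min(max(lifelines), max(persp_wall))
in_overlap = [t for t in lifelines + persp_wall if overlap_low <= t <= overlap_high]
print(f"{len(in_overlap)} of {len(lifelines + persp_wall)} values lie in the shared range")
```

The gap between the means says little until it is weighed against the spread of the individual values, which is what the statistical tests account for.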

Statistics
• t-test
  • Compares 1 dep var on 2 treatments of 1 ind var
• ANOVA: Analysis of Variance
  • Compares 1 dep var on n treatments of m ind vars
• Result: “significant difference” between treatments?
  • p = probability of seeing a difference at least this large if the null hypothesis were true
  • typical cut-off (significance level): p < 0.05
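
A rough sketch of the two-treatment comparison using only the Python standard library. `pooled_t` computes the classic two-sample t statistic. Because the standard library has no t distribution, the p-value here is estimated with a permutation test (shuffling the group labels), which is a stand-in for the usual table lookup and not the method the slide names. The sample times are hypothetical.

```python
import math
import random
import statistics

def pooled_t(a, b):
    """Student's two-sample t statistic with pooled variance."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled_var = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))

def permutation_p(a, b, rounds=2000, seed=0):
    """Two-sided p-value estimated by repeatedly shuffling group labels."""
    rng = random.Random(seed)
    observed = abs(pooled_t(a, b))
    combined = list(a) + list(b)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(combined)
        if abs(pooled_t(combined[:len(a)], combined[len(a):])) >= observed:
            hits += 1
    return (hits + 1) / (rounds + 1)

# Hypothetical task times in seconds; NOT the study's data.
lifelines = [40, 42, 38, 41, 39, 43, 40, 41]
persp_wall = [30, 29, 31, 28, 32, 30, 29, 31]

t_stat = pooled_t(lifelines, persp_wall)
p_value = permutation_p(lifelines, persp_wall)
print(f"t = {t_stat:.2f}, estimated p = {p_value:.4f}")
print("significant at 0.05" if p_value < 0.05 else "no significant difference detected")
```

With more than two treatments or more than one independent variable, the slide's ANOVA would replace the t statistic; that case is not sketched here.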

p < 0.05
• Woohoo!
• Found a “statistically significant difference”
• Averages determine which is ‘better’
• Conclusion:
  • Vis Tool has an “effect” on user performance for task 1
  • Persp. Wall better user performance than Lifelines for task 1
  • “95% confident that Persp. Wall better than Lifelines”
  • Not “Persp. Wall beats Lifelines 95% of time”
• Found a counter-example to the null-hypothesis
  • Null-hypothesis: Lifelines = Persp. Wall
  • Hence: Lifelines ≠ Persp. Wall

p > 0.05
• Hence, same?
  • Vis Tool has no effect on user performance for task 1?
  • Lifelines = Persp. Wall ?
• NOT!
  • We did not detect a difference, but could still be different
  • Did not find a counter-example to null hypothesis
  • Provides evidence for Lifelines = Persp. Wall, but not proof
  • Boring! Basically found nothing
• How?
  • Not enough users
  • Need better tasks, data, …
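
The "not enough users" point can be illustrated with a small simulation. Everything here is an assumption made for the sketch: the true means (37 vs. 30 seconds), the standard deviation (15 seconds), normally distributed times, and the group sizes. The critical values are the standard two-sided 0.05 t-table entries for df = 2n - 2.

```python
import math
import random
import statistics

# Two-sided critical t values at the 0.05 level for df = 2n - 2
# (standard table values for n = 5, 20, 100 users per group).
T_CRIT = {5: 2.306, 20: 2.024, 100: 1.972}

def detects_difference(n, rng, mean_a=37.0, mean_b=30.0, sd=15.0):
    """Simulate one study with n users per group; True if |t| exceeds the cut-off."""
    a = [rng.gauss(mean_a, sd) for _ in range(n)]
    b = [rng.gauss(mean_b, sd) for _ in range(n)]
    pooled_var = (statistics.variance(a) + statistics.variance(b)) / 2  # equal group sizes
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var * 2 / n)
    return abs(t) > T_CRIT[n]

def detection_rate(n, trials=500, seed=0):
    """Fraction of simulated studies that reach p < 0.05 even though a real effect exists."""
    rng = random.Random(seed)
    return sum(detects_difference(n, rng) for _ in range(trials)) / trials

rates = {n: detection_rate(n) for n in T_CRIT}
for n, rate in rates.items():
    print(f"{n:3d} users per group: difference detected in {rate:.0%} of simulated studies")
```

Under these assumptions the difference is real in every simulated study, yet small groups usually fail to detect it, which is why p > 0.05 does not show that the tools are the same.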

Data Mountain
• Robertson, “Data Mountain”
• Quoc, Reenal (Microsoft)

Assignment
• Thurs: Visualization Development
  • Bederson, “Jazz”
    » Jun, Rohit
• Literature Review due Thurs
• Homework #2 due Thurs Oct 4