Lecture 3 Data Driven Visualizations Credit Aditya Parameswaran

Groups and QCRs 1. Complete groups now • Discussion leaders vs research group 2.

Today’s Lecture 1. Why data visualization? 2. Visualization recommendations 3. DB-inspired Optimizations 3

Section 1 It all started with John Snow Hypothesis at the time: Diseases such

Section 1 How did he do it? Death/Survival in cholera in 19 th century

Section 1 How did he do it? Snow's data visualization study is regarded as

Section 1 Standard Data Visualization Recipe: 1. Load dataset into data viz tool 2.

Section 1 Tedious and Time-consuming! Key Issue: Visualization can be generated by: varying subsets

Section 2 2. Visualization recommendations 12

Section 1 What you will learn about in this section 1. Space of Visualizations

Section 1 Goal Given a dataset and a task, automatically produce a set of

Section 2 Example • Data analyst studying census data • age, education, marital-status, sex,

Section 2 Space of visualizations For simplicity, assume a single table (star schema) Visualizations

Section 2 Space of visualizations Vi = SELECT d, f(m) FROM table WHERE ___

Section 2 Interesting visualizations Deviation-based Utility A visualization is interesting if it displays a

Section 2 Deviation-based Utility Metric A visualization is interesting if it displays a large

Section 2 Computing Expected Trend Race vs. AVG(capital-gain) Reference Trend SELECT race, AVG(capital-gain) FROM

Section 2 Computing Actual Trend Race vs. AVG(capital-gain) Target. Trend SELECT race, AVG(capital-gain) FROM

Section 2 Computing Utility U = D[P(V 1) , P(V 2)] D = EMD,

Section 2 Low Utility Visualization Actual Expected 25

Section 2 High Utility Visualization Actual Expected 26

Section 2 Other metrics • • • Data characteristics Task or Insight Semantics and

Section 3 3. DB-inspired Optimizations 28

Section 3 What you will learn about in this section 1. Ranking Visualizations 2.

Section 3 Ranking Across all (d, m, f), where V 1 = SELECT d,

Section 3 Even larger space of queries • • Binning 3 dimensional or 4

Section 3 Back to ranking Across all (d, m, f), where V 1 =

Section 3 Issues with Naïve Approach • Repeated processing of same data in sequence

Section 3 Optimizations • Each visualization = 2 SQL queries • Latency > 100

Section 3 Optimizations • Combine aggregate queries on target and ref • Combine multiple

Section 3 Combining Multiple Group-by’s • Too few group-bys leads to many table scans

Section 3 Pruning optimizations Discard low-utility views early to avoid wasted computation • Keep

Visualizations Middleware Layer Queries (100 s) Optimizer DBMS Sharing Pruning 38

Section 3 More on automated visualizations 39

Section 3 ZQL: a viz exploration language 40

Section 3 Intelligent query optimizer 41

Summary Human in the loop analytics are here to stay! 42

Slides: 42

Download presentation

Lecture 3: Data Driven Visualizations Credit: Aditya Parameswaran 1

Groups and QCRs 1. Complete groups now • Discussion leaders vs research group 2. QCs are due: Sunday 11: 59 pm and Tuesday 11: 59 pm • This is to allow discussion leaders to prepare responses 3. Please start forming research groups • Up to three (this means you can even be on your own) 2

Today’s Lecture 1. Why data visualization? 2. Visualization recommendations 3. DB-inspired Optimizations 3

Section 1 1. Why data visualization? 4

Section 1 It all started with John Snow Hypothesis at the time: Diseases such as cholera and bubonic plague are caused by pollution or a noxious form of "bad air”. John Snow’s Research: The source of the outbreak was the public water pump [On the Mode of Communication of Cholera] 5

Section 1 How did he do it? Death/Survival in cholera in 19 th century London by district and water company. 6

Section 1 How did he do it? Snow's data visualization study is regarded as the founding event of the science of epidemiology. 7

Section 1 Data Visualizations Today 8

Section 1 Data Visualizations Today 9

Section 1 Standard Data Visualization Recipe: 1. Load dataset into data viz tool 2. Start with a desired hypothesis/pattern (explore combination of attributes) 3. Select viz to be generated 4. See if it matches desired pattern 5. Repeat 3 -4 until you find a match 10

Section 1 Tedious and Time-consuming! Key Issue: Visualization can be generated by: varying subsets of data varying attributes being visualized Too many visualization to look at to find desired visual patterns! 11

Section 2 2. Visualization recommendations 12

Section 1 What you will learn about in this section 1. Space of Visualizations 2. Recommendation Metrics 13

Section 1 Goal Given a dataset and a task, automatically produce a set of visualizations that are the most “interesting” given the task Particularly vague 14

Section 1 Goal Given a dataset and a task, automatically produce a set of visualizations that are the most “interesting” given the task 15

Section 2 Example • Data analyst studying census data • age, education, marital-status, sex, race, income, hours-worked etc. • A = # attributes in table • Task: Compare on various socioeconomic indicators, unmarried adults vs. all adults 16

Section 2 Space of visualizations For simplicity, assume a single table (star schema) Visualizations = agg. + grp. by queries Vi = SELECT d, f(m) FROM table WHERE ___ GROUP BY d 4. 5 4 3. 5 50 3 10 10 CA IL 30 2. 5 2 2000 2002 2004 2006 2008 2010 2012 (d, m, f): dimension, measure, aggregate 1. 5 MA NY 17

Section 2 Space of visualizations Vi = SELECT d, f(m) FROM table WHERE ___ GROUP BY d (d, m, f): dimension, measure, aggregate {d} : race, work-type, sex etc. {m} : capital-gain, capital-loss, hours-per-week {f} : COUNT, SUM, AVG 18

Section 1 Goal Given a dataset and a task, automatically produce a set of visualizations that are the most “interesting” given the task 19

Section 2 Interesting visualizations Deviation-based Utility A visualization is interesting if it displays a large deviation from some reference Target Reference Task: compare unmarried adults with all adults V 1 = SELECT d, f(m) FROM table WHERE target GROUP BY d V 2 = SELECT d, f(m) FROM table WHERE reference GROUP BY d 50 V 1 Compare induced probability distributions! V 1 MA V 2 30 MA 30 10 10 CA IL 20 CA V 2 NY 40 10 IL NY 20

Section 2 Deviation-based Utility Metric A visualization is interesting if it displays a large deviation from some reference Many metrics for computing distance between distributions D [P( V 1), P(V 2)] V 1 V 2 Earth mover’s distance L 1, L 2 distance K-L divergence Any distance metric b/n distributions is OK! 21

Section 2 Computing Expected Trend Race vs. AVG(capital-gain) Reference Trend SELECT race, AVG(capital-gain) FROM census GROUP BY race P(V 1) Expected Distribution 22

Section 2 Computing Actual Trend Race vs. AVG(capital-gain) Target. Trend SELECT race, AVG(capital-gain) FROM census GROUP BY race WHERE marital-status=‘unmarried’ P(V 2) Actual Distribution 23

Section 2 Computing Utility U = D[P(V 1) , P(V 2)] D = EMD, L 2 etc. 24

Section 2 Low Utility Visualization Actual Expected 25

Section 2 High Utility Visualization Actual Expected 26

Section 2 Other metrics • • • Data characteristics Task or Insight Semantics and Domain Knowledge Visual Ease of Understanding User Preference 27

Section 3 3. DB-inspired Optimizations 28

Section 3 What you will learn about in this section 1. Ranking Visualizations 2. Optimizations 29

Section 3 Ranking Across all (d, m, f), where V 1 = SELECT d, f(m) FROM table WHERE target GROUP BY d V 2 = SELECT d, f(m) FROM table WHERE reference GROUP BY d Goal: return k best utility visualizations (d, m, f), (those with largest D[V 1, V 2]) Vi = (d: dimension, m: measure, f: aggregate) 10 s of dimensions, 10 s of measures, handful of aggregates 2* d * m * f è 100 s of queries for a single user task! èCan be even larger. How? 30

Section 3 Even larger space of queries • • Binning 3 dimensional or 4 dimensional visualizations Scatterplot or map visualizations … 31

Section 3 Back to ranking Across all (d, m, f), where V 1 = SELECT d, f(m) FROM table WHERE target GROUP BY d V 2 = SELECT d, f(m) FROM table WHERE reference GROUP BY d Goal: return k best utility visualizations (d, m, f), (those with largest D[V 1, V 2]) Naïve Approach For each (d, m, f) in sequence evaluate queries for V 1 (target), V 2 (reference) compute D[V 1, V 2] Return the k (d, m, f) with largest D values 32

Section 3 Issues with Naïve Approach • Repeated processing of same data in sequence across queries • Computation wasted on low-utility visualizations Sharing Pruning 33

Section 3 Optimizations • Each visualization = 2 SQL queries • Latency > 100 s • Minimize number of queries and scans 34

Section 3 Optimizations • Combine aggregate queries on target and ref • Combine multiple aggregates (d 1, m 1, f 1), (d 1, m 2, f 1) (d 1, [m 1, m 2], f 1) • Combine multiple group-bys* (d 1, m 1, f 1), (d 2, m 1, f 1) ([d 1, d 2], m 1, f 1) Could be problematic… • Parallel Query Execution 35

Section 3 Combining Multiple Group-by’s • Too few group-bys leads to many table scans • Too many group-bys hurt performance • # groups = Π (# distinct values per attributes) • Optimal group-by combination ≈ bin-packing • Bin volume = log S (max number of groups) • Volume of items (attributes) = log (|ai|) • Minimize # bins s. t. Σi log (|ai|) <= log S 36

Section 3 Pruning optimizations Discard low-utility views early to avoid wasted computation • Keep running estimates of utility • Prune visualizations based on estimates • Two flavors • Vanilla Confidence Interval based Pruning • Multi-armed Bandit Pruning 37

Visualizations Middleware Layer Queries (100 s) Optimizer DBMS Sharing Pruning 38

Section 3 More on automated visualizations 39

Section 3 ZQL: a viz exploration language 40

Section 3 Intelligent query optimizer 41

Summary Human in the loop analytics are here to stay! 42