Provenance for Generalized Map and Reduce Workflows

PANDA Project
Robert Ikeda, Hyunjung Park, Jennifer Widom
Stanford University

Provenance
• Where data came from
• How it was derived, manipulated, combined, processed, …
• How it has evolved over time
• Uses:
  ◦ Explanation
  ◦ Debugging and verification
  ◦ Recomputation

The Panda Environment
[Figure: workflow with inputs $I_1, \ldots, I_n$ and output $O$]
• Data-oriented workflows
  ◦ Graph of processing nodes
  ◦ Data sets on edges
  ◦ Statically defined; batch execution; acyclic

Provenance
[Figure: workflow from Twitter Posts to Movie Sentiments]
• Backward tracing
  ◦ Find the input subsets that contributed to a given output element
• Forward tracing
  ◦ Determine which output elements were derived from a particular input element

Provenance
• Basic idea
  ◦ Capture provenance one node at a time (lazy or eager)
  ◦ Use it for backward and forward tracing
  ◦ Handle processing nodes of all types

Generalized Map and Reduce Workflows
[Figure: workflow whose nodes are Map (M) and Reduce (R) functions]
What if every node were a Map or Reduce function?
• Provenance is easier to define, capture, and exploit than in the general case
• Transparent provenance capture in Hadoop
  ◦ Doesn’t interfere with parallelism or fault tolerance

Remainder of Talk
• Defining Map and Reduce provenance
• Recursive workflow provenance
• Capturing and tracing provenance
• System description and performance
• Future work

Remainder of Talk
• Defining Map and Reduce provenance
• Recursive workflow provenance (surprising theoretical result)
• Capturing and tracing provenance
• System description and performance (implementation details)
• Future work

Map and Reduce Provenance
• Map functions
  ◦ $M(I) = \bigcup_{i \in I} M(\{i\})$
  ◦ Provenance of $o \in O$ is $i \in I$ such that $o \in M(\{i\})$
• Reduce functions
  ◦ $R(I) = \bigcup_{1 \le k \le n} R(I_k)$, where $I_1, \ldots, I_n$ partition $I$ on reduce-key
  ◦ Provenance of $o \in O$ is $I_k \subseteq I$ such that $o \in R(I_k)$
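
The two definitions above can be read directly as (inefficient) lazy procedures: rerun the function on each single input, or on each reduce-key group, and keep the ones that produce the output element. The sketch below is only an illustration of the definitions; `split_words` and `count_words` are hypothetical example functions, not part of the talk.

```python
from collections import defaultdict


def map_provenance(map_fn, inputs, o):
    """Inputs i such that o is in M({i})."""
    return [i for i in inputs if o in map_fn([i])]


def reduce_provenance(reduce_fn, key_fn, inputs, o):
    """Groups I_k (the partition of the input on reduce-key) with o in R(I_k)."""
    groups = defaultdict(list)
    for i in inputs:
        groups[key_fn(i)].append(i)
    return [group for group in groups.values() if o in reduce_fn(group)]


# Hypothetical map function: split every line into words.
def split_words(lines):
    return [word for line in lines for word in line.split()]


# Hypothetical reduce function: count the elements that share a reduce-key.
def count_words(words):
    counts = defaultdict(int)
    for w in words:
        counts[w] += 1
    return sorted(counts.items())


lines = ["a b", "b c"]
words = split_words(lines)
prov_a = map_provenance(split_words, lines, "a")
prov_b2 = reduce_provenance(count_words, lambda w: w, words, ("b", 2))
```

Here `prov_a` holds the single line that produced the word `a`, and `prov_b2` holds the one reduce group (both occurrences of `b`) that produced the count `("b", 2)`.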

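Workflow Provenance
[Figure: workflow of Map (M) and Reduce (R) nodes from inputs $I_1, \ldots, I_n$ through intermediate data sets $E_1$, $E_2$ to output $O$; an output element $o \in O$ is traced back to input subsets $I^*_1, \ldots, I^*_n$]
• Intuitive recursive definition
• Desirable “replay” property: $o \in W(I^*_1, \ldots, I^*_n)$
  ◦ Usually holds, but not always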

Replay Property Example
Workflow: Twitter Posts → TweetScan (M) → Inferred Movie Ratings → Summarize (R) → Rating Medians → Count (R) → #Movies Per Rating

Twitter Posts
  “Avatar was great”
  “I hated Twilight”
  “Twilight was pretty bad”
  “I enjoyed Avatar”
  “I loved Twilight”
  “Avatar was okay”

Inferred Movie Ratings
  Movie      Rating
  Avatar     8
  Twilight   0
  Twilight   2
  Avatar     7
  Twilight   7
  Avatar     4

Rating Medians
  Movie      Median
  Avatar     7
  Twilight   2

#Movies Per Rating
  Median     #Movies
  2          1
  7          1

Replay Property Example
Workflow: Twitter Posts → TweetScan (M) → Inferred Movie Ratings → Summarize (R) → Rating Medians → Count (R) → #Movies Per Rating

Twitter Posts
  “Avatar was great”
  “I hated Twilight”
  “Twilight was pretty bad”
  “I enjoyed Avatar And Twilight too”
  “Avatar was okay”

Inferred Movie Ratings
  Movie      Rating
  Avatar     8
  Twilight   0
  Twilight   2
  Avatar     7
  Twilight   7
  Avatar     4

Rating Medians
  Movie      Median
  Avatar     7
  Twilight   2

#Movies Per Rating
  Median     #Movies
  2          1
  7          1

Replay Property Example
Workflow: Twitter Posts → TweetScan (M, one-many) → Inferred Movie Ratings → Summarize (R, nonmonotonic) → Rating Medians → Count (R, nonmonotonic) → #Movies Per Rating

Twitter Posts
  “Avatar was great”
  “I hated Twilight”
  “Twilight was pretty bad”
  “I enjoyed Avatar And Twilight too”
  “Avatar was okay”

Inferred Movie Ratings
  Movie      Rating
  Avatar     8
  Twilight   0
  Twilight   2
  Avatar     7
  Twilight   7
  Avatar     4

Rating Medians
  Movie      Median
  Avatar     7
  Twilight   7 (was 2)

#Movies Per Rating
  Median     #Movies
  2          1
  7          2 (was 1)
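
One way to see why the replay property can fail in this example is to simulate it. The sketch below runs the three-node workflow on all tweets, traces the output element (median 7, 1 movie) back to its tweets one node at a time, and reruns the workflow on just those tweets. The ratings come from the slide; the `INFERRED` lookup table stands in for TweetScan's unspecified sentiment inference and is an assumption of this sketch, as are all function names.

```python
from collections import Counter, defaultdict
from statistics import median

# Assumed stand-in for TweetScan's inference: tweet -> (movie, rating) pairs.
INFERRED = {
    "Avatar was great": [("Avatar", 8)],
    "I hated Twilight": [("Twilight", 0)],
    "Twilight was pretty bad": [("Twilight", 2)],
    "I enjoyed Avatar And Twilight too": [("Avatar", 7), ("Twilight", 7)],
    "Avatar was okay": [("Avatar", 4)],
}


def tweet_scan(tweets):
    """Map (one-many): each tweet yields one rating per movie it mentions."""
    return [rating for t in tweets for rating in INFERRED[t]]


def summarize(ratings):
    """Reduce on movie: median rating per movie."""
    by_movie = defaultdict(list)
    for movie, rating in ratings:
        by_movie[movie].append(rating)
    return {movie: median(rs) for movie, rs in by_movie.items()}


def count(medians):
    """Reduce on median: number of movies per median rating."""
    return dict(Counter(medians.values()))


def workflow(tweets):
    return count(summarize(tweet_scan(tweets)))


def backward_trace(tweets, target_median):
    """Trace the output with key target_median back to tweets, node by node."""
    medians = summarize(tweet_scan(tweets))
    # Count: the provenance is the reduce group whose key is target_median.
    movies = {m for m, med in medians.items() if med == target_median}
    # Summarize: the provenance is every rating of those movies.
    # TweetScan: the provenance is every tweet that produced such a rating.
    return [t for t in tweets if any(m in movies for m, _ in INFERRED[t])]


tweets = list(INFERRED)
full = workflow(tweets)               # output on all tweets
traced = backward_trace(tweets, 7)    # provenance of the output (7, 1)
replayed = workflow(traced)           # output on the provenance only
```

On the full input the output contains (7, 1). Its provenance is the three tweets that mention Avatar, but one of them also yields a Twilight rating of 7; with the low Twilight ratings absent, the Twilight median becomes 7 as well, so the replay produces (7, 2) and the traced element (7, 1) is not reproduced.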

Capturing and Tracing Provenance
• Map functions
  ◦ Add the input ID to each of the output elements
• Reduce functions
  ◦ Add the input reduce-key to each of the output elements
• Tracing
  ◦ Straightforward recursive algorithms
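
The capture rules above can be sketched for a small Map-then-Reduce workflow. This is a minimal in-memory illustration, not the system's implementation: the ID scheme (`"d1.0"`, `"b#0"`), the function names, and the dictionaries used to hold provenance are all assumptions made for the example.

```python
from collections import defaultdict


def run_map(map_fn, inputs):
    """inputs: {id: element}. Returns outputs and provenance {out_id: in_id}."""
    outputs, prov = {}, {}
    for in_id, element in inputs.items():
        for n, out in enumerate(map_fn(element)):
            out_id = f"{in_id}.{n}"
            outputs[out_id] = out
            prov[out_id] = in_id          # input ID attached to the output
    return outputs, prov


def run_reduce(key_fn, reduce_fn, inputs):
    """Returns outputs, provenance {out_id: key}, and groups {key: [in_ids]}."""
    groups = defaultdict(list)
    for in_id, element in inputs.items():
        groups[key_fn(element)].append(in_id)
    outputs, prov = {}, {}
    for key, in_ids in groups.items():
        for n, out in enumerate(reduce_fn(key, [inputs[i] for i in in_ids])):
            out_id = f"{key}#{n}"
            outputs[out_id] = out
            prov[out_id] = key            # reduce-key attached to the output
    return outputs, prov, dict(groups)


def backward(out_id, map_prov, red_prov, groups):
    """Final output ID -> IDs of the original inputs it was derived from."""
    return sorted({map_prov[mid] for mid in groups[red_prov[out_id]]})


def forward(in_id, map_prov, red_prov, groups):
    """Original input ID -> IDs of the final outputs derived from it."""
    keys = {
        key
        for key, mids in groups.items()
        if any(map_prov[mid] == in_id for mid in mids)
    }
    return sorted(out for out, key in red_prov.items() if key in keys)


# Word count over two tiny documents.
docs = {"d1": "a b", "d2": "b c"}
words, map_prov = run_map(str.split, docs)
counts, red_prov, groups = run_reduce(
    lambda w: w, lambda key, ws: [(key, len(ws))], words
)
back = backward("b#0", map_prov, red_prov, groups)
fwd = forward("d1", map_prov, red_prov, groups)
```

Backward tracing from the count of `b` reaches both documents, while forward tracing from `d1` reaches the counts of `a` and `b`. For longer workflows the same two steps would be applied recursively, one node at a time.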

RAMP System
• Built as an extension to Hadoop
• Supports MapReduce workflows
  ◦ Each node is a MapReduce job
• Provenance capture is transparent
  ◦ Retains Hadoop’s parallel execution and fault tolerance
• Users need not be aware of provenance capture
  ◦ Wrapping is automatic
  ◦ RAMP stores provenance separately from the input and output data

RAMP System: Provenance Capture
• Hadoop components
  ◦ Record-reader
  ◦ Mapper
  ◦ Combiner (optional)
  ◦ Reducer
  ◦ Record-writer

RAMP System: Provenance Capture
Input
  → RecordReader (wrapped): reads (ki, vi); the wrapper emits (ki, 〈vi, p〉)
  → Mapper (wrapped): the wrapper passes (ki, vi) to the Mapper, which emits (km, vm); the wrapper emits (km, 〈vm, p〉)
  → Map Output

RAMP System: Provenance Capture
Map Output
  → Reducer (wrapped): the wrapper receives (km, [〈vm1, p1〉, …, 〈vmn, pn〉]) and passes (km, [vm1, …, vmn]) to the Reducer, which emits (ko, vo); the wrapper emits (ko, 〈vo, kmID〉)
  → RecordWriter (wrapped): the wrapper passes (ko, vo) to the RecordWriter; q identifies the written element
  → Output: (ko, vo)
  → Provenance: (kmID, pj) and (q, kmID)
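
The wrapper dataflow can be emulated in a few lines of plain Python to show which provenance records end up stored. This is a sketch under stated assumptions, not Hadoop code and not RAMP's implementation: `run_wrapped_job` and `backward_trace` are hypothetical names, `p` and `q` are taken to be positions in the input and output, `kmID` is taken to be the index of the reduce key, and the optional combiner is ignored.

```python
from collections import defaultdict


def run_wrapped_job(records, mapper, reducer):
    """Emulate one MapReduce job with provenance wrappers.

    records: list of (ki, vi). mapper(ki, vi) returns (km, vm) pairs and
    reducer(km, values) returns (ko, vo) pairs; neither sees any provenance.
    """
    # Record-reader wrapper: (ki, vi) -> (ki, <vi, p>)
    tagged = [(ki, (vi, p)) for p, (ki, vi) in enumerate(records)]

    # Mapper wrapper: strip p, run the mapper, re-attach p to every output.
    map_output = defaultdict(list)
    for ki, (vi, p) in tagged:
        for km, vm in mapper(ki, vi):
            map_output[km].append((vm, p))

    output, key_to_inputs, output_to_key = [], [], []
    for km_id, km in enumerate(sorted(map_output)):
        pairs = map_output[km]
        # Provenance records (kmID, pj), one per value in the reduce group.
        key_to_inputs += [(km_id, p) for _, p in pairs]
        # Reducer wrapper: the reducer only sees the bare values.
        for ko, vo in reducer(km, [vm for vm, _ in pairs]):
            # Record-writer wrapper: write (ko, vo) and record (q, kmID).
            q = len(output)
            output.append((ko, vo))
            output_to_key.append((q, km_id))
    return output, key_to_inputs, output_to_key


def backward_trace(q, key_to_inputs, output_to_key):
    """Output ID q -> input IDs p, via the two stored provenance relations."""
    km_ids = {k for out, k in output_to_key if out == q}
    return sorted({p for k, p in key_to_inputs if k in km_ids})


records = [(0, "a b"), (1, "b c")]
output, key_to_inputs, output_to_key = run_wrapped_job(
    records,
    lambda k, line: [(word, 1) for word in line.split()],
    lambda word, ones: [(word, sum(ones))],
)
traced = backward_trace(1, key_to_inputs, output_to_key)
```

The output data stays free of provenance, matching the point that provenance is stored separately. Backward tracing joins the (q, kmID) records with the (kmID, pj) records, which also suggests why tracing cost grows with fan-in: a key with many inputs has many (kmID, pj) records.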

Experiments
• 51 large EC2 instances (Thank you, Amazon!)
• Two MapReduce “workflows”
  ◦ Wordcount
    ▪ Many-one with large fan-in
    ▪ Input sizes: 100, 300, 500 GB
  ◦ Terasort
    ▪ One-one
    ▪ Input sizes: 93, 279, 466 GB

Results: Wordcount
[Figure: Wordcount results chart]

Results: Terasort
[Figure: Terasort results chart]

Summary of Results
• Overhead of provenance capture
  ◦ Terasort
    ▪ 20% time overhead, 21% space overhead
  ◦ Wordcount
    ▪ 76% time overhead; space overhead depends directly on fan-in
• Backward tracing
  ◦ Terasort
    ▪ 1.5 seconds for one element
  ◦ Wordcount
    ▪ Time directly dependent on fan-in

Future Work
• RAMP
  ◦ Selective provenance capture
  ◦ More efficient backward and forward tracing
  ◦ Indexing
• General
  ◦ Incorporating SQL processing nodes

PANDA
A System for Provenance and Data
“stanford panda”