Provenance for Generalized Map and Reduce Workflows PANDA
- Slides: 26
Provenance for Generalized Map and Reduce Workflows
PANDA Project
Robert Ikeda, Hyunjung Park, Jennifer Widom
Stanford University
Provenance
- Where data came from
- How it was derived, manipulated, combined, processed, …
- How it has evolved over time
- Uses:
  - Explanation
  - Debugging and verification
  - Recomputation
Robert Ikeda 2
The Panda Environment
[Diagram: a workflow with inputs I_1, …, I_n and output O]
- Data-oriented workflows
  - Graph of processing nodes
  - Data sets on edges
  - Statically defined; batch execution; acyclic
Provenance
[Diagram labels: Twitter Posts, Movie Sentiments]
- Backward tracing
  - Find the input subsets that contributed to a given output element
- Forward tracing
  - Determine which output elements were derived from a particular input element
Provenance
- Basic idea
  - Capture provenance one node at a time (lazy or eager)
  - Use it for backward and forward tracing
  - Handle processing nodes of all types
Generalized Map and Reduce Workflows
[Diagram: a workflow of Map (M) and Reduce (R) nodes]
What if every node were a Map or Reduce function?
- Provenance is easier to define, capture, and exploit than in the general case
- Transparent provenance capture in Hadoop
  - Doesn't interfere with parallelism or fault tolerance
Remainder of Talk
- Defining Map and Reduce provenance
- Recursive workflow provenance (surprising theoretical result)
- Capturing and tracing provenance
- System description and performance (implementation details)
- Future work
Map and Reduce Provenance
- Map functions
  - $M(I) = \bigcup_{i \in I} M(\{i\})$
  - Provenance of $o \in O$ is $i \in I$ such that $o \in M(\{i\})$
- Reduce functions
  - $R(I) = \bigcup_{1 \le k \le n} R(I_k)$, where $I_1, \ldots, I_n$ partition $I$ on reduce-key
  - Provenance of $o \in O$ is $I_k \subseteq I$ such that $o \in R(I_k)$
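The two definitions above can be checked directly on a small data set. The following is a minimal sketch, not the talk's implementation: `map_provenance`, `reduce_provenance`, `tokenize`, and `count` are hypothetical names, and the word-count data is made up for illustration. It finds provenance by brute force, applying the function to each single element (Map) or each reduce-key group (Reduce).

```python
from collections import defaultdict

def map_provenance(map_fn, inputs, o):
    """Every input element i such that o is in M({i})."""
    return [i for i in inputs if o in map_fn([i])]

def reduce_provenance(reduce_fn, key_fn, inputs, o):
    """Every reduce-key group I_k such that o is in R(I_k)."""
    groups = defaultdict(list)
    for i in inputs:
        groups[key_fn(i)].append(i)
    return [g for g in groups.values() if o in reduce_fn(g)]

# Hypothetical Map function: one line in, many (word, 1) pairs out.
def tokenize(lines):
    return [(w, 1) for line in lines for w in line.split()]

# Hypothetical Reduce function: sums the counts for each word.
def count(pairs):
    totals = defaultdict(int)
    for word, n in pairs:
        totals[word] += n
    return list(totals.items())

lines = ["a b", "b c"]
pairs = tokenize(lines)

# Both lines emit ("b", 1), so both are in its Map provenance.
map_prov = map_provenance(tokenize, lines, ("b", 1))

# The Reduce provenance of ("b", 2) is the whole group for key "b".
red_prov = reduce_provenance(count, lambda p: p[0], pairs, ("b", 2))
```

Brute-force evaluation like this is only useful for checking the definitions; the capture approach described later in the talk avoids re-running the functions.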
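Workflow Provenance
[Diagram: a workflow W of Map (M) and Reduce (R) nodes with inputs $I_1, \ldots, I_n$, intermediate data sets $E_1$ and $E_2$, and output $O$; the input subsets $I^*_1, \ldots, I^*_n$ are the provenance of an output element $o \in O$]
- Intuitive recursive definition
- Desirable “replay” property: $o \in W(I^*_1, \ldots, I^*_n)$
  - Usually holds, but not always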
Replay Property Example
Workflow: Twitter Posts → TweetScan (M) → Inferred Movie Ratings → Summarize (R) → Rating Medians → Count (R) → #Movies Per Rating
- Twitter Posts → Inferred Movie Ratings (Movie, Rating)
  - “Avatar was great” → (Avatar, 8)
  - “I hated Twilight” → (Twilight, 0)
  - “Twilight was pretty bad” → (Twilight, 2)
  - “I enjoyed Avatar” → (Avatar, 7)
  - “I loved Twilight” → (Twilight, 7)
  - “Avatar was okay” → (Avatar, 4)
- Rating Medians (Movie, Median)
  - (Avatar, 7)
  - (Twilight, 2)
- #Movies Per Rating (Median, #Movies)
  - (2, 1)
  - (7, 1)
Replay Property Example
Workflow: Twitter Posts → TweetScan (M) → Inferred Movie Ratings → Summarize (R) → Rating Medians → Count (R) → #Movies Per Rating
- Twitter Posts → Inferred Movie Ratings (Movie, Rating)
  - “Avatar was great” → (Avatar, 8)
  - “I hated Twilight” → (Twilight, 0)
  - “Twilight was pretty bad” → (Twilight, 2)
  - “I enjoyed Avatar and Twilight too” → (Avatar, 7), (Twilight, 7)
  - “Avatar was okay” → (Avatar, 4)
- Rating Medians (Movie, Median)
  - (Avatar, 7)
  - (Twilight, 2)
- #Movies Per Rating (Median, #Movies)
  - (2, 1)
  - (7, 1)
Replay Property Example
Workflow: Twitter Posts → TweetScan (M) → Inferred Movie Ratings → Summarize (R) → Rating Medians → Count (R) → #Movies Per Rating
- Annotations on the workflow: One-Many (Map function), Nonmonotonic Reduce Function
- Twitter Posts → Inferred Movie Ratings (Movie, Rating)
  - “Avatar was great” → (Avatar, 8)
  - “I hated Twilight” → (Twilight, 0)
  - “Twilight was pretty bad” → (Twilight, 2)
  - “I enjoyed Avatar and Twilight too” → (Avatar, 7), (Twilight, 7)
  - “Avatar was okay” → (Avatar, 4)
- Rating Medians (Movie, Median)
  - (Avatar, 7)
  - (Twilight, 2), overlaid with 7
- #Movies Per Rating (Median, #Movies)
  - (2, 1)
  - (7, 1), overlaid with 2
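The replay failure in this example can be reproduced in a few lines. This is a sketch under stated assumptions: the `RATINGS` lookup table is a hypothetical stand-in for TweetScan's sentiment inference (it simply hard-codes the ratings shown on the slide), and the provenance of the output element (7, 1) is written out by hand from the recursive definition rather than computed.

```python
from collections import defaultdict
from statistics import median

# Hypothetical stand-in for TweetScan: ratings copied from the slide.
RATINGS = {
    "Avatar was great": [("Avatar", 8)],
    "I hated Twilight": [("Twilight", 0)],
    "Twilight was pretty bad": [("Twilight", 2)],
    "I enjoyed Avatar and Twilight too": [("Avatar", 7), ("Twilight", 7)],
    "Avatar was okay": [("Avatar", 4)],
}

def tweet_scan(tweets):
    """Map (one-many): each tweet yields one rating per movie it mentions."""
    return [r for t in tweets for r in RATINGS[t]]

def summarize(ratings):
    """Reduce on movie: median rating per movie."""
    by_movie = defaultdict(list)
    for movie, rating in ratings:
        by_movie[movie].append(rating)
    return [(m, median(rs)) for m, rs in by_movie.items()]

def count(medians):
    """Reduce on median: number of movies having each median."""
    per_rating = defaultdict(int)
    for _, med in medians:
        per_rating[med] += 1
    return sorted(per_rating.items())

def workflow(tweets):
    return count(summarize(tweet_scan(tweets)))

full_output = workflow(list(RATINGS))

# Provenance of (7, 1), traced by hand: (7, 1) <- (Avatar, 7)
# <- the three Avatar ratings <- the three tweets mentioning Avatar.
provenance = [
    "Avatar was great",
    "I enjoyed Avatar and Twilight too",
    "Avatar was okay",
]
replay_output = workflow(provenance)
```

On the full input the workflow returns `[(2, 1), (7, 1)]`. Replayed on the provenance alone it returns `[(7, 2)]`: the one-many tweet drags a lone Twilight rating of 7 into the replay, the median for Twilight changes, and the element (7, 1) is not regenerated.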
Capturing and Tracing Provenance
- Map functions
  - Add the input ID to each of the output elements
- Reduce functions
  - Add the input reduce-key to each of the output elements
- Tracing
  - Straightforward recursive algorithms
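A minimal sketch of the capture and tracing ideas above, for a linear chain of nodes held in memory. All names here (`run_map`, `run_reduce`, `backward`, `forward`) are hypothetical, element IDs are assumed to be list positions, and the reduce-key tag is resolved immediately into the set of input IDs sharing that key.

```python
from collections import defaultdict

def run_map(map_fn, inputs):
    """Run a Map function per element, tagging each output with its input ID."""
    outputs, prov = [], []
    for in_id, element in enumerate(inputs):
        for out in map_fn(element):
            outputs.append(out)
            prov.append({in_id})
    return outputs, prov

def run_reduce(reduce_fn, key_fn, inputs):
    """Run a Reduce function per key group, tagging outputs with the group's IDs."""
    groups = defaultdict(set)  # reduce-key -> IDs of inputs with that key
    for in_id, element in enumerate(inputs):
        groups[key_fn(element)].add(in_id)
    outputs, prov = [], []
    for key, ids in groups.items():
        for out in reduce_fn(key, [inputs[i] for i in sorted(ids)]):
            outputs.append(out)
            prov.append(ids)
    return outputs, prov

def backward(stage_provs, out_id):
    """Recursively trace an output ID back to workflow-input IDs."""
    if not stage_provs:
        return {out_id}
    *earlier, last = stage_provs
    result = set()
    for in_id in last[out_id]:
        result |= backward(earlier, in_id)
    return result

def forward(stage_provs, in_id):
    """Recursively trace a workflow-input ID forward to final-output IDs."""
    if not stage_provs:
        return {in_id}
    first, *later = stage_provs
    result = set()
    for out_id, sources in enumerate(first):
        if in_id in sources:
            result |= forward(later, out_id)
    return result

lines = ["a b", "b c"]
pairs, map_prov = run_map(lambda line: [(w, 1) for w in line.split()], lines)
counts, red_prov = run_reduce(lambda k, vs: [(k, len(vs))], lambda p: p[0], pairs)
stage_provs = [map_prov, red_prov]

lines_behind_b = backward(stage_provs, counts.index(("b", 2)))
outputs_from_line0 = forward(stage_provs, 0)
```

Here the count for "b" traces back to both input lines, and the first line traces forward to the counts for "a" and "b". A real system would look provenance up in stored tables instead of scanning lists.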
RAMP System
- Built as an extension to Hadoop
- Supports MapReduce workflows
  - Each node is a MapReduce job
- Provenance capture is transparent
  - Retains Hadoop's parallel execution and fault tolerance
- Users need not be aware of provenance capture
  - Wrapping is automatic
  - RAMP stores provenance separately from the input and output data
RAMP System: Provenance Capture
- Hadoop components
  - Record-reader
  - Mapper
  - Combiner (optional)
  - Reducer
  - Record-writer
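RAMP System: Provenance Capture
[Diagram: map-side wrappers, from Input to Map Output]
- RecordReader wrapper: the RecordReader emits (ki, vi); the wrapper attaches provenance p and emits (ki, 〈vi, p〉)
- Mapper wrapper: receives (ki, 〈vi, p〉) and passes (ki, vi) to the Mapper; the Mapper emits (km, vm); the wrapper reattaches p and emits (km, 〈vm, p〉) to Map Output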
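RAMP System: Provenance Capture
[Diagram: reduce-side wrappers, from Map Output to Output and Provenance]
- Reducer wrapper: receives (km, [〈vm1, p1〉, …, 〈vmn, pn〉]) and passes (km, [vm1, …, vmn]) to the Reducer; the Reducer emits (ko, vo); the wrapper emits (ko, 〈vo, kmID〉) and records (kmID, pj) in Provenance
- RecordWriter wrapper: receives (ko, 〈vo, kmID〉) and passes (ko, vo) to the RecordWriter, which writes output element q to Output; the wrapper records (q, kmID) in Provenance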
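The wrapper data flow on these two slides can be imitated in plain Python to see what each wrapper strips and reattaches. This is an illustrative sketch only: it does not use the Hadoop API or RAMP's actual classes, every function name is hypothetical, the provenance ID p is assumed to be the record's position, kmID is assumed to be the reduce key itself, and the shuffle is simulated with a dictionary.

```python
from collections import defaultdict

def wrap_record_reader(records):
    """Emit (ki, <vi, p>), where p identifies the input record."""
    for p, (ki, vi) in enumerate(records):
        yield ki, (vi, p)

def wrap_mapper(mapper):
    """Hide p from the user's mapper, then reattach it to every output."""
    def wrapped(ki, tagged):
        vi, p = tagged
        for km, vm in mapper(ki, vi):
            yield km, (vm, p)
    return wrapped

def wrap_reducer(reducer, key_prov):
    """Strip provenance from the value list; record (kmID, pj) pairs."""
    def wrapped(km, tagged_values):
        km_id = km  # assumption: the reduce key doubles as its ID
        for _, p in tagged_values:
            key_prov.append((km_id, p))
        values = [vm for vm, _ in tagged_values]
        for ko, vo in reducer(km, values):
            yield ko, (vo, km_id)
    return wrapped

def wrap_record_writer(output, out_prov):
    """Write (ko, vo) as output element q; record (q, kmID) separately."""
    def write(ko, tagged):
        vo, km_id = tagged
        q = len(output)
        output.append((ko, vo))
        out_prov.append((q, km_id))
    return write

def run_job(records, mapper, reducer):
    output, out_prov, key_prov = [], [], []
    wmap = wrap_mapper(mapper)
    wred = wrap_reducer(reducer, key_prov)
    write = wrap_record_writer(output, out_prov)
    shuffled = defaultdict(list)  # simulated shuffle: group by map key
    for ki, tagged in wrap_record_reader(records):
        for km, tagged_vm in wmap(ki, tagged):
            shuffled[km].append(tagged_vm)
    for km in sorted(shuffled):
        for ko, tagged_vo in wred(km, shuffled[km]):
            write(ko, tagged_vo)
    return output, out_prov, key_prov

# User code is unaware of provenance: an ordinary word-count mapper and reducer.
def wc_mapper(ki, vi):
    return [(word, 1) for word in vi.split()]

def wc_reducer(km, values):
    return [(km, sum(values))]

output, out_prov, key_prov = run_job([(0, "a b"), (1, "b c")], wc_mapper, wc_reducer)
```

Joining `out_prov` with `key_prov` on kmID gives, for each output element q, the input records it came from; the job's output itself carries no provenance fields.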
Experiments
- 51 large EC2 instances (Thank you, Amazon!)
- Two MapReduce “workflows”
  - Wordcount
    - Many-one with large fan-in
    - Input sizes: 100, 300, 500 GB
  - Terasort
    - One-one
    - Input sizes: 93, 279, 466 GB
Results: Wordcount
Results: Terasort
Summary of Results
- Overhead of provenance capture
  - Terasort
    - 20% time overhead, 21% space overhead
  - Wordcount
    - 76% time overhead; space overhead depends directly on fan-in
- Backward tracing
  - Terasort
    - 1.5 seconds for one element
  - Wordcount
    - Time directly dependent on fan-in
Future Work
- RAMP
  - Selective provenance capture
  - More efficient backward and forward tracing
  - Indexing
- General
  - Incorporating SQL processing nodes
PANDA: A System for Provenance and Data
“stanford panda”