CS 347 Parallel and Distributed Data Management Notes
CS 347: Parallel and Distributed Data Management, Notes X: S4. Hector Garcia-Molina.
Material based on: the S4 paper.
S4
• Platform for processing unbounded data streams
  – general purpose
  – distributed
  – scalable
  – partially fault tolerant (whatever this means!)
Data Stream Terminology: event (data record), key, attribute
• Question: can an event have duplicate attributes for the same key? (I think so...)
• The stream is unbounded, generated by, say, user queries, purchase transactions, phone calls, sensor readings, ...
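For concreteness, a minimal sketch (an assumed representation, not from the S4 paper) of what an event keyed on one attribute might look like:

    # Hypothetical event representation: a keyed attribute plus other attributes.
    event = {
        "stream": "Purchases",
        "key": ("customer_id", "c42"),            # (attribute name, value) the PE is keyed on
        "attributes": {"item": "book", "price": 12.50},
    }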
S4 Processing Workflow (figure: user-specified "processing units" wired into a workflow)
Inside a Processing Unit: one processing element (PE) per key value (PE1 for key=a, PE2 for key=b, ..., PEn for key=z)
Example:
• Stream of English quotes
• Produce a sorted list of the top K most frequent words
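A minimal sketch (my own, not from the S4 paper) of how this could map onto PEs: one PE per word counts occurrences, and a single aggregating PE keeps the top K:

    import heapq
    from collections import defaultdict

    # One logical "WordCountPE" per word (key = word), plus one "TopKPE" (key = fake).
    counts = defaultdict(int)      # state of the per-word PEs
    top_k_state = {}               # state of the aggregating PE

    def word_count_pe(word):
        counts[word] += 1
        return ("top", word, counts[word])     # event emitted downstream

    def top_k_pe(word, count, k=10):
        top_k_state[word] = count
        return heapq.nlargest(k, top_k_state.items(), key=lambda kv: kv[1])

    for w in "to be or not to be".split():
        result = top_k_pe(*word_count_pe(w)[1:])
    print(result)    # e.g. [('to', 2), ('be', 2), ('or', 1), ('not', 1)]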
Processing Nodes: events are routed to one of m processing nodes by hash(key); each node hosts the PEs (key=a, key=b, ...) for the keys hashed to it
Dynamic Creation of PEs
• As a processing node sees new key attributes, it dynamically creates new PEs to handle them
• Think of PEs as threads
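A rough sketch (assumed behavior, not S4's actual code) of hash routing plus lazy PE creation on a node:

    # Hypothetical dispatcher: hash routes a key to a node; the node creates PEs on first use.
    NUM_NODES = 4
    pes = {}   # (node_id, key) -> PE state

    def node_for(key):
        return hash(key) % NUM_NODES                 # which processing node owns this key

    def dispatch(key, event):
        node = node_for(key)
        pe_state = pes.setdefault((node, key), {"count": 0})   # dynamic PE creation
        pe_state["count"] += 1                                   # PE-specific processing
        return node, pe_state

    dispatch("user42", {"action": "click"})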
Another View of Processing Node
Failures
• Communication layer detects node failures and provides failover to standby nodes
• What happens to events in transit during a failure? (My guess: events are lost!)
How do we do DB operations on top of S4?
• Selects & projects: easy!
• What about joins?
What is a Stream Join? (streams R and S)
• For a true join, we would need to store all inputs forever! Not practical...
• Instead, define a window join:
  – at time t a new R tuple arrives
  – it only joins with the previous w S tuples
One Idea for Window Join ("key" is the join attribute; streams R and S)
code for PE:
  for each event e:
    if e.rel = R then
      [ store e in Rset (last w)
        for s in Sset: output join(e, s) ]
    else ...
Is this right??? (this enforces the window on a per-key-value basis)
Maybe add sequence numbers to events to enforce the correct window?
Another Idea for Window Join (all R & S events have key = "fake"; say the join key is C)
code for PE:
  for each event e:
    if e.rel = R then
      [ store e in Rset (last w)
        for s in Sset:
          if e.C = s.C then output join(e, s) ]
    else if e.rel = S ...
Entire join is done in one PE; no parallelism
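A small runnable sketch (my own, assuming a tuple-count window held in lists; not from the lecture) of the windowed-join logic a single PE would run:

    from collections import deque

    W = 3                         # window size in tuples
    r_window = deque(maxlen=W)    # last w R tuples seen by this PE
    s_window = deque(maxlen=W)    # last w S tuples seen by this PE

    def on_event(rel, tup, join_attr="C"):
        """Join a new tuple against the other relation's window, then remember it."""
        if rel == "R":
            out = [(tup, s) for s in s_window if tup[join_attr] == s[join_attr]]
            r_window.append(tup)
        else:
            out = [(r, tup) for r in r_window if r[join_attr] == tup[join_attr]]
            s_window.append(tup)
        return out

    on_event("R", {"C": 1, "a": "x"})
    print(on_event("S", {"C": 1, "b": "y"}))   # [({'C': 1, 'a': 'x'}, {'C': 1, 'b': 'y'})]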
Do You Have a Better Idea for Window Join? (streams R and S)
Final Comment: managing the state of a PE
• Who manages state? S4: the user does; Muppet: the system does
• Is state persistent?
CS 347: Parallel and Distributed Data Management, Notes X: Hyracks. Hector Garcia-Molina.
Hyracks
• Generalization of map-reduce
• Infrastructure for "big data" processing
• Material here based on the Hyracks paper, which appeared in ICDE 2011
A Hyracks data object:
• Records partitioned across N sites (figure shows sites 1, 2, 3)
• A simple record schema is available (more than just key-value)
Operations (figure: an operator plus a distribution rule)
Operations (parallel execution) (figure: operator instances op1, op2, op3 running on different partitions)
Example: Hyracks Specification
Example: Hyracks Specification (annotations: initial partitions; 2 replicated partitions; input & output is a set of partitions; mapping to new partitions)
Notes
• Job specification can be done manually or automatically
Example: Activity Node Graph
Example: Activity Node Graph (annotations: sequencing constraints; activities)
Example: Parallel Instantiation
Example: Parallel Instantiation (annotation: a stage starts after its input stages finish)
System Architecture
Map-Reduce on Hyracks
Library of Operators:
• File readers/writers
• Mappers
• Sorters
• Joiners (various types)
• Aggregators
• Can add more
Library of Connectors:
• N:M hash partitioner
• N:M hash-partitioning merger (input sorted)
• N:M range partitioner (with partition vector)
• N:M replicator
• 1:1 connector
• Can add more!
Hyracks Fault Tolerance: Work in progress?
Hyracks Fault Tolerance: Work in progress?
• The Hadoop/MR approach: save output (partial results) to files in persistent storage; after a failure, redo all the work needed to reconstruct the missing data
Hyracks Fault Tolerance: Work in progress?
• Can we do better? Maybe: each process retains its previous results until they are no longer needed?
• Example: Pi's output is r1, r2, r3, r4, r5, r6, r7, r8; the earlier results have "made their way" to the final result, the later ones are the current output
CS 347: Parallel and Distributed Data Management, Notes X: Pregel. Hector Garcia-Molina.
Material based on:
• The Pregel paper, in SIGMOD 2010
• Note there is an open-source version of Pregel called Giraph
Pregel
• A computational model/infrastructure for processing large graphs
• Prototypical example: PageRank, e.g. PR[i+1, x] = f(PR[i, a]/n_a, PR[i, b]/n_b) (figure: graph with nodes a, b, d, e, f, x, where a and b point to x)
Pregel (same example graph and formula: PR[i+1, x] = f(PR[i, a]/n_a, PR[i, b]/n_b))
• Synchronous computation in iterations
• In one iteration, each node:
  – gets messages from its neighbors
  – computes
  – sends data to its neighbors
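For concreteness, a sketch in Python (my own; the Vertex class and its methods are invented stand-ins, not Pregel's actual API) of a PageRank-style compute implementing the formula above:

    class Vertex:
        """Minimal stand-in for a Pregel vertex (illustrative only)."""
        def __init__(self, vid, out_edges):
            self.id, self.out_edges, self.value = vid, out_edges, 0.0
            self.outbox, self.halted = [], False
        def send(self, target, msg): self.outbox.append((target, msg))
        def vote_to_halt(self): self.halted = True

    DAMPING, NUM_VERTICES, MAX_ITER = 0.85, 4, 30

    def pagerank_compute(v, superstep, messages):
        # f(...) instantiated as the usual PageRank combination of incoming rank shares
        if superstep == 0:
            v.value = 1.0 / NUM_VERTICES
        else:
            v.value = (1 - DAMPING) / NUM_VERTICES + DAMPING * sum(messages)
        if superstep < MAX_ITER:
            for target in v.out_edges:
                v.send(target, v.value / len(v.out_edges))   # share rank with out-neighbors
        else:
            v.vote_to_halt()

    pagerank_compute(Vertex("x", ["d", "e"]), 0, [])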
Pregel vs Map-Reduce/S4/Hyracks/...
• In Map-Reduce, S4, Hyracks, ..., the workflow is separate from the data
• In Pregel, the data (the graph) drives the data flow
Pregel Motivation
• Many applications require graph processing
• Map-Reduce and other workflow systems are not a good fit for graph processing
• Need to run graph algorithms on many processors
Example of Graph Computation
Termination
• After each iteration, each vertex votes to halt or not
• If all vertexes vote to halt, the computation terminates
Vertex Compute (simplified)
• Available data for iteration i:
  – InputMessages: { [from, value] }
  – OutputMessages: { [to, value] }
  – OutEdges: { [to, value] }
  – MyState: value
• OutEdges and MyState are remembered for the next iteration
Max Computation
• change := false
• for [f, w] in InputMessages do
    if w > MyState.value then
      [ MyState.value := w
        change := true ]
• if (superstep = 1) OR change then
    for [t, w] in OutEdges do
      add [t, MyState.value] to OutputMessages
  else
    vote to halt
Page Rank Example
Page Rank Example (annotations on the code: iteration count; iterate through InputMessages; MyState.value; shorthand: send the same message to all)
Single-Source Shortest Paths
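The slide shows the shortest-paths code as an image; as a stand-in, here is a sketch (mine, reusing the invented Vertex stand-in from the PageRank sketch above, but with out_edges holding (target, weight) pairs) of the usual Pregel single-source shortest paths:

    INF = float("inf")

    def sssp_compute(v, superstep, messages, source_id="a"):
        """Relax this vertex's distance from the source; propagate only if it improved."""
        if superstep == 0:
            v.value = INF
        candidate = 0 if v.id == source_id else INF
        if messages:
            candidate = min(candidate, min(messages))
        if candidate < v.value:
            v.value = candidate
            for target, weight in v.out_edges:       # out_edges carry [to, edge weight]
                v.send(target, candidate + weight)
        v.vote_to_halt()                             # reactivated when a new message arrives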
Architecture: a master plus workers A, B, C reading input data 1 and 2 (sample record: [a, value]; the graph has nodes a, b, c, d, ...)
Architecture: the master partitions the graph and assigns vertexes to workers (worker A: a, b, c; worker B: d, e; worker C: f, g, h)
Architecture: workers read the input data; a worker forwards input values to the appropriate workers
Architecture: the master tells the workers to run superstep 1
Architecture: at the end of superstep 1, workers send their messages and report to the master whether they can halt
Architecture: the master tells the workers to run superstep 2
Architecture: periodically, the master tells the workers to checkpoint
Architecture: to checkpoint, each worker writes to stable storage: MyState, OutEdges, InputMessages (or OutputMessages)
Architecture: if a worker dies, find a replacement and restart from the latest checkpoint
Architecture (figure: master, workers, and input data as before)
Interesting Challenge
• How best to partition the graph for efficiency? (figure: example graph with nodes a, b, d, e, f, g, x)
CS 347: Parallel and Distributed Data Management, Notes X: BigTable, HBase, Cassandra. Hector Garcia-Molina.
Sources
• HBase: The Definitive Guide, Lars George, O'Reilly, 2011.
• Cassandra: The Definitive Guide, Eben Hewitt, O'Reilly, 2011.
• Bigtable: A Distributed Storage System for Structured Data, F. Chang et al., ACM Transactions on Computer Systems, Vol. 26, No. 2, June 2008.
Lots of Buzz Words!
• "Apache Cassandra is an open-source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tunably consistent, column-oriented database that bases its distribution design on Amazon's Dynamo and its data model on Google's Bigtable."
• Clearly, it is buzz-word compliant!!
Basic Idea: Key-Value Store. Table T:
Basic Idea: Key-Value Store. Table T: keys are sorted
• API:
  – lookup(key) → value
  – lookup(key range) → values
  – getNext → value
  – insert(key, value)
  – delete(key)
• Each row has a timestamp
• Single-row actions are atomic (but not persistent in some systems?)
• No multi-key transactions
• No query language!
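A toy sketch (mine, purely illustrative; a real system would be distributed and persistent) of this API over a sorted in-memory map:

    import bisect, time

    class KVTable:
        """Sorted key-value table with the API above; single-row operations only."""
        def __init__(self):
            self.keys, self.rows = [], {}             # sorted keys + key -> (timestamp, value)

        def insert(self, key, value):
            if key not in self.rows:
                bisect.insort(self.keys, key)
            self.rows[key] = (time.time(), value)

        def lookup(self, key):
            return self.rows[key][1]

        def lookup_range(self, lo, hi):               # lookup(key range) -> values for keys in [lo, hi)
            i, j = bisect.bisect_left(self.keys, lo), bisect.bisect_left(self.keys, hi)
            return [self.rows[k][1] for k in self.keys[i:j]]

        def delete(self, key):
            self.keys.remove(key); self.rows.pop(key, None)

    t = KVTable(); t.insert("b", 2); t.insert("a", 1)
    print(t.lookup_range("a", "c"))    # [1, 2]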
Fragmentation (Sharding): the table is split into tablets spread over servers 1, 2, 3
• use a partition vector
• "auto-sharding": vector selected automatically
Tablet Replication: a tablet has a primary (e.g., server 3) and backups (e.g., servers 4 and 5)
• Cassandra:
  – Replication Factor (# of copies)
  – R/W Rule: One, Quorum, All
  – Policy (e.g., Rack Unaware, Rack Aware, ...)
  – Read all copies (return the fastest reply, do repairs if necessary)
• HBase: does not manage replication itself, relies on HDFS
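To make the R/W rule concrete, a small sketch (my own, not Cassandra code; the level-to-count mapping is the usual one) of checking whether a read/write consistency-level pair guarantees overlap given replication factor N:

    def quorum(n):
        return n // 2 + 1

    def replicas_touched(level, n):
        return {"ONE": 1, "QUORUM": quorum(n), "ALL": n}[level]

    def reads_see_latest_write(read_level, write_level, n):
        # A read overlaps every acknowledged write when R + W > N
        return replicas_touched(read_level, n) + replicas_touched(write_level, n) > n

    print(reads_see_latest_write("QUORUM", "QUORUM", 3))   # True  (2 + 2 > 3)
    print(reads_see_latest_write("ONE", "ONE", 3))         # False (1 + 1 <= 3)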
Need a "directory"
• Maps [table name, key] → the server that stores the key, plus the backup servers
• Can be implemented as a special table.
Tablet Internals (figure: data split between memory and disk)
• Design philosophy (?): the primary scenario is where all data is in memory; disk storage added as an afterthought
Tablet Internals (figure: in-memory buffer flushed periodically to disk segments; deletions marked by tombstones)
• the tablet is the merge of all segments (files)
• disk segments are immutable
• writes are efficient; reads are only efficient when all data is in memory
• periodically reorganize into a single segment
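A toy sketch (mine, heavily simplified; real systems use sorted files, indexes, and bloom filters) of what "tablet = merge of memory buffer plus immutable segments, with tombstones" means for reads:

    TOMBSTONE = object()           # marker for a deleted key

    memtable = {}                  # in-memory buffer, newest data
    segments = []                  # immutable on-disk segments, newest first

    def write(key, value):
        memtable[key] = value

    def delete(key):
        memtable[key] = TOMBSTONE

    def flush():
        segments.insert(0, dict(memtable)); memtable.clear()   # segment becomes immutable

    def read(key):
        # newest-to-oldest: memtable first, then each segment
        for store in [memtable] + segments:
            if key in store:
                v = store[key]
                return None if v is TOMBSTONE else v
        return None

    write("a", 1); flush(); delete("a")
    print(read("a"))    # None: the tombstone in the memtable shadows the old segment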
Column Family
Column Family
• for storage, treat each row as a single "super value"
• API provides access to sub-values (use family:qualifier to refer to sub-values, e.g., price:euros, price:dollars)
• Cassandra allows a "super column": two-level nesting of columns (e.g., column A can have sub-columns X & Y)
Vertical Partitions: can be manually implemented as separate per-column tables (figure on slide)
Vertical Partitions (each partition is a column family)
• good for sparse data; good for column scans; not so good for tuple reads
• are atomic updates to a row still supported?
• API supports actions on the full table, mapped to actions on the column tables
• API supports column "project"
• To decide on a vertical partition, need to know the access patterns
Failure Recovery (BigTable, HBase) (figure: the master node pings tablet servers; a spare tablet server can take over; writes go through write-ahead logging, with the log in GFS or HDFS and the tablet in memory)
Failure Recovery (Cassandra)
• No master node; all nodes in the "cluster" are equal (figure: servers 1, 2, 3)
Failure Recovery (Cassandra)
• No master node; all nodes in the "cluster" are equal
• A client can access any table in the cluster at any server; that server sends requests to the other servers as needed
CS 347: Parallel and Distributed Data Management, Notes X: MemCacheD. Hector Garcia-Molina.
MemCacheD
• General-purpose distributed memory caching system
• Open source
What MemCacheD Should Be (but ain't): a client calls get_object(X) on a "distributed cache" made up of cache 1, 2, 3, backed by data sources 1, 2, 3
What MemCacheD Should Be (but ain't): the distributed cache itself would locate x, fetching it from the backing data source on a miss (figure: x lives at data source 3, cached copy in cache 1)
What MemCacheD Is (as far as I can tell): the application explicitly does put(cache1, myName, X) and get_object(cache1, myName), choosing the cache itself (figure: caches 1, 2, 3; data sources 1, 2, 3; x at data source 3)
What MemCacheD Is
• each cache is a hash table of (name, value) pairs
• the application does put(cache1, myName, X) and get_object(cache1, myName)
• the cache can purge myName whenever it wants
• there is no connection between the caches and the data sources
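So the usual usage is the cache-aside pattern, managed entirely by the application. A small sketch (mine, with a toy in-process cache object rather than the real memcached client API):

    def get_object(cache, name, load_from_source):
        """Cache-aside read: try the cache, fall back to the data source, then repopulate."""
        value = cache.get(name)             # may be None: never stored, or purged by the cache
        if value is None:
            value = load_from_source(name)  # the application talks to the data source itself
            cache.put(name, value)          # best effort; the cache may drop it at any time
        return value

    class ToyCache(dict):                   # stand-in for one memcached server
        def get(self, k): return super().get(k)
        def put(self, k, v): self[k] = v

    cache1 = ToyCache()
    print(get_object(cache1, "user:42", lambda k: {"id": 42, "name": "Ana"}))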
CS 347: Parallel and Distributed Data Management, Notes X: ZooKeeper. Hector Garcia-Molina.
ZooKeeper
• Coordination service for distributed processes
• Provides clients with a high-throughput, highly available, memory-only file system (figure: clients connected to ZooKeeper, which stores a tree of znodes rooted at /, e.g., znode /a/d/e, each holding state)
ZooKeeper Servers: clients connect to servers; each server holds a replica of the state
ZooKeeper Servers: a read is served locally by the server the client is connected to, from its state replica
ZooKeeper Servers: a write is propagated and synched to the other state replicas; writes are totally ordered using the Zab algorithm
Failures: if your server dies, just connect to a different one!
ZooKeeper Notes
• Differences with a file system:
  – all nodes can store data
  – storage size is limited
• API: insert node, read children, delete node, ...
• Can set triggers (watches) on nodes
• Clients and servers must know all servers
• ZooKeeper works as long as a majority of servers are available
• Writes are totally ordered; reads are ordered w.r.t. writes
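For a feel of how the API is used for coordination, a hedged sketch (the zk client object and its method names are hypothetical, not a specific ZooKeeper library; ephemeral and sequential znodes are real ZooKeeper concepts):

    # zk is assumed to be an already-connected client; method names are illustrative only.
    def elect_leader(zk, my_id):
        """Toy leader election: the lowest-numbered child of /election wins."""
        zk.create("/election/candidate-", data=my_id,
                  ephemeral=True, sequential=True)        # znode disappears if this client dies
        candidates = sorted(zk.get_children("/election"))
        leader_id = zk.get("/election/" + candidates[0])
        # watch the election node so we re-run this when membership changes
        zk.watch("/election", callback=lambda event: elect_leader(zk, my_id))
        return leader_id == my_id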
CS 347: Parallel and Distributed Data Management, Notes X: Kestrel. Hector Garcia-Molina.
Kestrel
• A Kestrel server handles a set of reliable, ordered message queues
• A server does not communicate with other servers (advertised as good for scalability!)
(figure: clients talking to a server that holds the queues)
Kestrel
• A Kestrel server handles a set of reliable, ordered message queues, each backed by a log
• A server does not communicate with other servers (advertised as good for scalability!)
• Clients do put(q) and get(q) against a server's queues
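A toy sketch (mine, with in-memory stand-ins rather than real Kestrel clients) of what "independent servers" means for producers and consumers:

    from collections import deque, defaultdict

    # Toy stand-in: each "server" is an independent set of queues (no cross-talk between servers).
    servers = {"kestrel1": defaultdict(deque), "kestrel2": defaultdict(deque)}

    def put(server, q, item):
        servers[server][q].append(item)        # reliable/ordered only within that one server

    def get(q):
        # A consumer polls every server, since an item lives only where it was put.
        for name, queues in servers.items():
            if queues[q]:
                return queues[q].popleft()
        return None

    put("kestrel2", "work", "job-1")
    print(get("work"))    # job-1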