CS 347: Parallel and Distributed Data Management
Notes 01: Introduction
Hector Garcia-Molina
CS 347 Lecture 1
In CS 245: Centralized DB system
[Figure: one processor (P) with memory (M)]
Software: Application, SQL Front End, Query Processor, Transaction Proc., File Access, . . .
Simplifications:
• single front end
• one place to keep locks
• if processor fails, system fails, . . .
In CS 347
• Multiple processors (+ memories)
• Heterogeneity and autonomy of “components”
Multiple processors
• Opportunity for parallelism
• Opportunity for reliability
• Synchronization issues
To illustrate synchronization problems: Two Generals Problem
The one-general problem (trivial!)
[Figure: a general (G) and troops on a battlefield]
The two-generals problem:
[Figure: blue army (Blue G), enemy, red army (Red G); messengers travel between the two generals]
Rules:
• Blue and red armies must attack at the same time
• Blue and red generals synchronize through messengers
• Messengers can be lost
How Many Messages Do We Need? (assume blue starts)
BG → RG: attack at 9 am
Is this enough??
How Many Messages Do We Need? (assume blue starts)
BG → RG: attack at 9 am
RG → BG: ack (red goes at 9 am)
Is this enough??
How Many Messages Do We Need? (assume blue starts)
BG → RG: attack at 9 am
RG → BG: ack (red goes at 9 am)
BG → RG: got ack
Is this enough??
Stated problem is impossible!
• Theorem: There is no protocol that uses a finite number of messages and solves the two-generals problem (as stated here)
Alternatives??
Probabilistic Approach?
• Send as many messages as possible, hope one gets through. . . (assume blue starts)
BG → RG: attack at 9 am
Eventual Commit
• Eventually both sides attack. . . (assume blue starts)
BG → RG: attack ASAP (retransmits)
RG → BG: on my way!
Eventual Commit
• One message sent every time unit
• Probability of success of one message is p
• What is the probability that red commits by time t?
Eventual Commit
• C(1) = p
Eventual Commit
• C(1) = p
• C(2) = p + (1-p)p
Eventual Commit
• C(1) = p
• C(2) = p + (1-p)p
• C(3) = p + (1-p)p + (1-p)^2 p
• C(4) = p + (1-p)p + (1-p)^2 p + (1-p)^3 p
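The sums C(1), C(2), . . . on the Eventual Commit slides are partial sums of a geometric series, so they collapse to C(t) = 1 - (1-p)^t. A minimal Python sketch, assuming each message gets through independently with the same probability p (the function names are illustrative and not from the notes), checks the closed form against the term-by-term sum:

```python
def commit_prob_sum(p: float, t: int) -> float:
    """C(t) as the explicit sum p + (1-p)p + ... + (1-p)^(t-1) p."""
    return sum((1 - p) ** k * p for k in range(t))


def commit_prob_closed(p: float, t: int) -> float:
    """Closed form of the same geometric series: 1 - (1-p)^t."""
    return 1 - (1 - p) ** t


if __name__ == "__main__":
    p = 0.5  # assumed per-message success probability
    for t in range(1, 5):
        print(t, commit_prob_sum(p, t), commit_prob_closed(p, t))
```

For any p > 0 the value climbs toward 1 as t grows, which is why the commit is "eventual": red almost surely hears the order, but no finite t guarantees it.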
Eventual Commit
[Figure: plot of C(t) against t, with p marked on the C(t) axis]
Eventual Commit
• How expensive is the protocol?
• E = expected number of messages
• Homework: compute E (function of p)
2-Phase Eventual Commit
• Eventually both sides attack. . . (assume blue starts)
Phase 1: BG → RG: ready to attack? (retransmits); RG → BG: yes, at your disposal
Phase 2: BG → RG: attack ASAP (retransmits); RG → BG: ack
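The two-phase exchange can be simulated to get a feel for its message cost. The sketch below is only one possible reading of the slide: it assumes each message is delivered independently with probability p, that blue retransmits its request until a reply arrives, and that red answers every request it receives. The names `run_phase` and `two_phase_eventual_commit` are hypothetical.

```python
import random


def run_phase(rng: random.Random, p: float) -> int:
    """Blue retransmits a request until a reply gets back.

    Returns the number of messages sent (requests plus replies).
    Assumption: Red sends one reply for every request it receives.
    """
    sent = 0
    while True:
        sent += 1                 # Blue's request (or a retransmission)
        if rng.random() < p:      # request delivered to Red
            sent += 1             # Red's reply
            if rng.random() < p:  # reply delivered to Blue
                return sent


def two_phase_eventual_commit(p: float, seed: int = 0) -> int:
    """Total messages for phase 1 plus phase 2 under the assumptions above."""
    rng = random.Random(seed)
    phase1 = run_phase(rng, p)  # "ready to attack?" / "yes, at your disposal"
    phase2 = run_phase(rng, p)  # "attack ASAP" / "ack"
    return phase1 + phase2


if __name__ == "__main__":
    print(two_phase_eventual_commit(p=0.8, seed=42))
```

With a perfect channel (p = 1) the run uses exactly four messages, two per phase; any loss adds retransmissions on top of that.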
Commit Protocols
• Will study commit protocols like these. . .
Heterogeneity
[Figure: an application that selects new investments draws on a stock ticker tape, an RDBMS (portfolio), and files (history of dividends, ratios, . . .)]
Autonomy
Example: unable to get statistics for query optimization
Example: blue general may have mind of his (or her) own!
• So, in CS 347 we study data management with multiple processors and possible autonomy, heterogeneity
– Impact on:
  • Data organization
  • Query processing
  • Access structures
  • Concurrency control
  • Recovery
• Renewed Interest in Distributed/Parallel Data Processing!
– Massive web data, manage with many computers
– How to crawl and search the web?
– Peer-to-peer systems manage huge amounts of data
– Data from many sources (e.g., comparison shopping): how to integrate?
– Sensor networks: data generated at many sensors/devices, need to analyze
– Multi-player games (e.g., Second Life): tons of distributed data
Data: It’s the Economy, Stupid!
• Example: Multi-player games
[Figures: two diagrams of players (P) arranged around the game state]
Logistics
• LECTURES: Mondays and Wednesdays 12:50 pm to 2:05 pm, Gates B01
• INSTRUCTOR: Hector Garcia-Molina; Office: Gates Hall 434; Email: hector@cs.stanford.edu; Office Hours: Mondays, Wednesdays 11 am to 12 noon.
• TEACHING ASSISTANT (tentative):
– Lin Huang, Email: linhuang@cs.stanford.edu
– Vasilis Verroios, Email: verroios@stanford.edu
• Piazza forum (tentative): https://piazza.com/class#spring2014/cs347
• SECRETARY: Marianne Siroker; Office: Gates Hall 435; Email: siroker@cs.stanford.edu; Phone: (650) 723-0872
Logistics
• TEXTBOOK: No required textbook. You'll be expected to read several research papers.
• CLASS WEB PAGE: http://www.stanford.edu/class/cs347 Will contain homework assignments, course news, etc. Be sure to check it periodically.
• ASSIGNMENTS: about 5 homeworks
• GRADING: Homeworks: 20%, Midterm: 30%, Final: 50%.
Tentative Syllabus 2014 (Part I)
DATE: Monday March 31; Wednesday April 2; Monday April 7; Wednesday April 9; Monday April 14; Wednesday April 16; Monday April 21; Wednesday April 23; Monday April 28; Wednesday April 30
TOPIC: Introduction [N01]; Data Fragmentation [N02]; Query processing [N03]; Query processing & Optimization [N04]; Concurrency Control, Failures [N05]; Reliable Data Management [N06]; Replicated Data Management [N07]; Partitions, Entity Resolution [N11]; Midterm
Tentative Syllabus 2014 (Part II)
DATE: Monday May 5; Wednesday May 7; Monday May 12; Wednesday May 14; Monday May 19; Wednesday May 21; Wednesday May 28; Monday June 2; Wednesday June 4; Friday June 6
TOPIC: Peer to Peer Systems [N08]; Map-Reduce & Pig [N09]; Other Open Source Systems [N09b]; Distributed IR [N10]; Time [N12]; Publish/Subscribe Systems [N13]; 8:30 am!!! FINAL EXAM
Interesting New Systems
• Storm (from Twitter)
• S4 (from Yahoo)
• Cassandra (key-value store)
• Hive (SQL over Hadoop)
• Pregel (graph execution)
• Kestrel (queues?)
• ZooKeeper (replicated data)
• Spark (Berkeley)
• HBase
• Hyracks (UC Irvine)
• Memcached
• PNUTS
• Dynamo (Amazon)
• Megastore (Google)
• Paxos
• G-Store (UC Santa Barbara)
• ElasTraS (UC Santa Barbara)
• TAO (Facebook)
Concepts you should be familiar with:
• CS 245: query plan, cost estimation, join algorithms, recovery, logging, …
• Interconnection networks (bus, mesh, hypercube, …)
• Computer networks (LAN, WAN, …)
Introductory topics
• Database architectures
• Client-server systems
• Distributed vs. parallel DB systems
• Cloud Computing
DB architectures (1) Shared memory
[Figure: processors (P . . . P) sharing one memory (M)]
DB architectures (2) Shared disk
[Figure: processors (P), each with its own memory (M), sharing disks]
DB architectures (2B) Shared data storage (disk or file?)
[Figure: processors (P), each with its own memory (M), sharing a storage layer]
• storage area network (SAN)
• Hadoop/Google file system
DB architectures (3) Shared nothing
[Figure: independent nodes, each with its own processor (P) and memory (M)]
DB architectures (4) Hybrid example
[Figure: processors (P . . . P) and memory (M)]
DB architectures (4) Hybrid example 2
[Figure: two LANs (LAN #1, LAN #2), each connecting processor/memory (P/M) nodes, joined through R nodes over a WAN]
DB architectures (4) Hybrid: Tandem-like; also in Microsoft SQL Server Parallel Data Warehouse
[Figure: groups of processors (P) with memories (M)]
DB architectures (5) Unusual? Datacycle (broadcast disks)
• Entire DB broadcast
[Figure: processors (P) with memories (M) receiving the broadcast]
(5) Unusual: sorting network
[Figure: processor/memory (P/M) nodes connected through a sort net]
(5) Unusual: processor per track or processor per disk
[Figure: “small” processors (P’) + “tiny” memories, alongside a processor (P) with memory (M)]
Related idea in Oracle Exadata "DB machine"
(6) Unusual: sensor networks
[Figure: battery-powered (B) nodes, each with a processor (P), memory (M), and sensor, plus a data collection node]
Issues for selecting architecture
• Reliability
• Scalability
• Geographic distribution of data
• Data “clusters”
• Performance
• Cost
Client-Server Systems (or how to partition software)
[Figure: the software stack (Application, Front End, Query Processor, Transaction Processing, File Access) split between client and server]
Transaction Servers
• Clients ship transactions consisting of 1 or more SQL commands
E.g., Open DataBase Connectivity (ODBC) (standard API)
Data Servers
• Client requests pages or records
• Popular for OODB systems
Issues
• Object granularity
• Where is data cached?
• Where is locking done?
Basic Tradeoff
• Offloading work to clients
• Data transmitted
[Figure: two client (C) / server (S) pairs; one client sends “Reserve hotel room”, the other sends “Get pages”]
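The tradeoff can be made concrete with a toy model of the two request styles. Everything numeric here is an assumption chosen for illustration (4096-byte pages, 64 room records per page, 64-byte request and reply messages), the class and function names are hypothetical, and locking is ignored entirely.

```python
from __future__ import annotations

PAGE_SIZE = 4096     # assumed page size in bytes
ROOMS_PER_PAGE = 64  # assumed number of room records per page
MSG_SIZE = 64        # assumed size of a short request or reply


class Server:
    def __init__(self, num_rooms: int) -> None:
        self.free = [True] * num_rooms

    # Transaction-server style: the server runs the whole reservation.
    def reserve(self, room: int) -> bool:
        if self.free[room]:
            self.free[room] = False
            return True
        return False

    # Data-server style: the server only ships and accepts pages.
    def get_page(self, page_no: int) -> list[bool]:
        start = page_no * ROOMS_PER_PAGE
        return self.free[start:start + ROOMS_PER_PAGE]

    def put_page(self, page_no: int, page: list[bool]) -> None:
        start = page_no * ROOMS_PER_PAGE
        self.free[start:start + len(page)] = page


def reserve_via_transaction(server: Server, room: int) -> tuple[bool, int]:
    """'Reserve hotel room': returns (success, bytes moved)."""
    ok = server.reserve(room)
    return ok, 2 * MSG_SIZE  # one request plus one reply


def reserve_via_pages(server: Server, room: int) -> tuple[bool, int]:
    """'Get pages': the client fetches the page, updates it, writes it back."""
    page_no, slot = divmod(room, ROOMS_PER_PAGE)
    page = server.get_page(page_no)
    moved = PAGE_SIZE
    ok = page[slot]
    if ok:
        page[slot] = False
        server.put_page(page_no, page)
        moved += PAGE_SIZE
    return ok, moved
```

Under these assumed sizes the page-based path moves far more data per reservation, but the server does almost no work, and a client that caches the page could serve later requests locally.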
Note: Similar issues arise when we partition software/functionality within server
[Figure: a “Reserve hotel room” request arriving at a server made of several processor/memory (P/M) nodes]
• Where is data cached?
• Where is locking done?
Parallel or distributed DB system?
• More similarities than differences!
• Typically, parallel DBs:
– Fast interconnect
– Homogeneous software
– High performance is goal
– Transparency is goal
• Typically, distributed DBs:
– Geographically distributed
– Data sharing is goal (may run into heterogeneity, autonomy)
– Disconnected operation possible
Cloud Computing
• Is CC just a marketing term??
– utility (like power)
– data or CPU cycles?
– many processors, many storage units
– business model
Is CC a subset, superset, disjoint from, or overlaps with:
• grid computing
• distributed computing
• Web 2.0
• Cluster Computing
• Peer-to-peer computing
• software as a service
• client-server computing
• data center as a computer
• massively parallel computing
[Figure: four diagrams, (A) through (D), showing the possible set relationships with CC]
Clash of the Clouds (Economist, April 4, 2009)
CC Issues
• Customer lock-in
• Privacy
• Standards
• Software licensing
Next
• How to describe distributed data
• Query processing in parallel DBs
• Query processing in distributed DBs
Query processing in parallel DBs:
• Typically: we can distribute/partition/sort. . . data to make certain DB operations (e.g., Join) fast
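As one illustration of partitioning data to speed up a join, the sketch below hash-partitions both relations on the join attribute so that matching tuples land in the same bucket, then joins each bucket pair independently. It is a single-process stand-in written for this example: rows are plain dicts, the function names are hypothetical, and in a real parallel DB each bucket pair would be handled by a different processor.

```python
from collections import defaultdict


def hash_partition(rows, key, n):
    """Split rows into n buckets by hashing the join attribute."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts


def local_hash_join(r_part, s_part, key):
    """Classic build/probe hash join on one pair of buckets."""
    table = defaultdict(list)
    for r in r_part:
        table[r[key]].append(r)
    return [{**r, **s} for s in s_part for r in table.get(s[key], [])]


def parallel_join(r_rows, s_rows, key, n=4):
    """Partition both inputs the same way, then join bucket i with bucket i."""
    r_parts = hash_partition(r_rows, key, n)
    s_parts = hash_partition(s_rows, key, n)
    out = []
    for i in range(n):  # each iteration could run on a separate processor
        out.extend(local_hash_join(r_parts[i], s_parts[i], key))
    return out


if __name__ == "__main__":
    R = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
    S = [{"id": 1, "b": "u"}, {"id": 1, "b": "v"}, {"id": 3, "b": "w"}]
    print(parallel_join(R, S, "id"))
```

Because both relations use the same hash function, no tuple in bucket i of R can match a tuple in bucket j of S for i ≠ j, so the buckets need no communication during the join.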
Query processing in distributed DBs:
• Typically: we are given data distribution; we need to find query processing strategy to minimize cost (e.g., communication cost)
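A tiny cost comparison shows what "find a strategy to minimize communication cost" can look like. The model below is a deliberate simplification assumed for this example: each site holds one relation of known size, a join at a given site requires shipping every other relation there in full, and the result is shipped to the querying site if the join ran elsewhere. The function names and the sizes are hypothetical.

```python
def plan_cost(join_site, sizes, result_size, query_site):
    """Bytes shipped if the join runs at join_site."""
    ship_inputs = sum(size for site, size in sizes.items() if site != join_site)
    ship_result = result_size if join_site != query_site else 0
    return ship_inputs + ship_result


def cheapest_join_site(sizes, result_size, query_site):
    """Return (cost, site) for the site that minimizes bytes shipped."""
    return min((plan_cost(site, sizes, result_size, query_site), site)
               for site in sizes)


if __name__ == "__main__":
    sizes = {"A": 10_000, "B": 500}  # assumed relation sizes in bytes
    print(cheapest_join_site(sizes, result_size=200, query_site="B"))
```

In this example the query is issued at B, yet the cheaper plan ships the small relation to A and sends the result back: 500 + 200 bytes rather than 10,000.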