Temple University – CIS Dept. CIS 331 – Principles of Database Systems. V. Megalooikonomou. Distributed Databases (based on notes by Silberschatz, Korth, and Sudarshan and notes by C. Faloutsos at CMU)

General Overview – rel. model
• Relational model – SQL
• Functional Dependencies & Normalization
• Physical Design; Indexing
• Query optimization
• Transaction processing
• Database System Architecture
• Distributed Databases

Overview
• Problem – motivation
• Design issues
• Query processing – semijoins
• Transactions (recovery, conc. control)

Problem – definition
• centralized DB: one site serving all users (LA, CHICAGO, NY)

Problem – definition
• distributed DB: DB stored in many places (sites). . . connected (e.g., LA – NY)

Problem – definition
• distributed DB (e.g., LA, NY):
  – stored in many connected sites
  – DB typically geographically separated
  – separately administered
  – transactions are differentiated: Local vs. Global
  – possibly different DBMSs, DB schemas (heterogeneous)

Problem – definition
now:
    connect to LA;
    exec sql select * from EMP;
    . . .
    connect to NY;
    exec sql select * from EMPLOYEE;
    . . .
(LA runs DBMS 1 with table EMP; NY runs DBMS 2 with table EMPLOYEE)

Problem – definition
ideally:
    connect to distr-LA;
    exec sql select * from EMPL;
(a D-DBMS layer spans LA's DBMS 1 / EMP and NY's DBMS 2 / EMPLOYEE)

Pros + Cons ?

Pros + Cons
• Pros:
  – data sharing
  – reliability & availability
  – autonomy (local)
  – speed-up of query processing
• Cons:
  – software development cost
  – more bugs
  – increased processing overhead (messages)

Overview
• Problem – motivation
• Design issues
• Query processing – semijoins
• Transactions (recovery, conc. control)

Design of Distr. DBMS
• Homogeneous distr. DBs:
  – identical DBMS
  – same DB schema
  – aware of one another
  – agree to cooperate in transaction processing
• Heterogeneous distr. DBs:
  – different DBMS
  – different DB schema
  – may not be aware of one another
  – may provide limited facilities for cooperation in transaction processing

Design of Distr. DBMS
• what are our choices for storing a table?

Design of Distr. DBMS
• replication (several copies of a table at different sites)
• fragmentation (horizontal; vertical; hybrid)
• or both…

Design of Distr. DBMS
• Replication: a copy of a relation is stored at two or more sites
• Pros and cons:
  – availability
  – increased parallelism (possible minimization of data movement among sites)
  – increased overhead on update (replicas must be kept consistent)

Design of Distr. DBMS
Fragmentation (example relation):

    ssn | name    | address
    ----+---------+------------
    123 | smith   | wall str.
    ... | ...     | ...
    234 | johnson | sunset blvd

(vertical fragm. splits the columns; horiz. fragm. splits the rows)
• keep tuples/attributes at the sites where they are used the most
• ensure that the table can be reconstructed
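The two fragmentation styles and the reconstruction requirement can be sketched in a few lines of Python; the table contents follow the slide, but the split predicate and variable names are illustrative assumptions:

```python
# Sketch: horizontal vs. vertical fragmentation of a small relation.
# A table is a list of dicts; every fragment must allow reconstruction.

emp = [
    {"ssn": 123, "name": "smith", "address": "wall str."},
    {"ssn": 234, "name": "johnson", "address": "sunset blvd"},
]

# Horizontal fragmentation: each site keeps the rows it uses most
# (the predicate here is an arbitrary example).
frag_ny = [r for r in emp if r["address"] == "wall str."]
frag_la = [r for r in emp if r["address"] != "wall str."]
reconstructed_h = frag_ny + frag_la            # union of the row fragments

# Vertical fragmentation: split the columns, but repeat the key (ssn)
# in every fragment so the table can be rebuilt by joining on it.
frag_names = [{"ssn": r["ssn"], "name": r["name"]} for r in emp]
frag_addrs = [{"ssn": r["ssn"], "address": r["address"]} for r in emp]
by_ssn = {r["ssn"]: dict(r) for r in frag_names}
for r in frag_addrs:
    by_ssn[r["ssn"]].update(r)                 # join fragments on the key
reconstructed_v = list(by_ssn.values())
```

Both reconstructions recover the original table, which is exactly the "can be reconstructed" requirement on the slide.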

Transparency & autonomy
Issues/goals:
• naming and local autonomy
• replication transparency
• fragmentation transparency
• location transparency
i.e.:

Problem – definition
ideally:
    connect to distr-LA;
    exec sql select * from EMPL;
(a D-DBMS layer spans LA's DBMS 1 / EMP and NY's DBMS 2 / EMPLOYEE)

Overview
• Problem – motivation
• Design issues
• Query processing – semijoins
• Transactions (recovery, conc. control)

Distributed Query processing
• issues (additional to centralized q-opt):
  – cost of transmission (cpu, disk, #bytes transmitted, #messages transmitted)
  – parallelism / overlap of delays
• minimize elapsed time? or minimize resource consumption?

Distributed Query processing

    S1: SUPPLIER        S2: SHIPMENT
    s#   . . .          s#   p#
    s1                  s1   p1
    s2                  s2   p1
    s5                  s3   p5
    s11                 s2   p9

    S3: SUPPLIER Join SHIPMENT = ?

semijoins
• choice of plans?
• plan #1: ship SHIPMENT -> S1; join; ship -> S3
• plan #2: ship SHIPMENT -> S3; ship SUPPLIER -> S3; join
• . . . others?

Distr. Q-opt – semijoins

    S1: SUPPLIER        S2: SHIPMENT
    s#   . . .          s#   p#
    s1                  s1   p1
    s2                  s2   p1
    s5                  s3   p5
    s11                 s2   p9

    S3: SUPPLIER Join SHIPMENT = ?

Semijoins

    S1: SUPPLIER        S2: SHIPMENT
    s#   . . .          s#   p#
    s1                  s1   p1
    s2                  s2   p1
    s5                  s3   p5
    s11                 s2   p9

    S3: SUPPLIER Join SHIPMENT = ?

• Idea: reduce the tables before shipping

Semijoins
• How to do the reduction, cheaply?
• E.g., reduce ‘SHIPMENT’:

Semijoins

    S1: SUPPLIER  --(s1, s2, s5, s11)-->  S2: SHIPMENT
    s#   . . .                            s#   p#
    s1                                    s1   p1
    s2                                    s2   p1
    s5                                    s3   p5
    s11                                   s2   p9

    S3: SUPPLIER Join SHIPMENT = ?

• Idea: reduce the tables before shipping

Semijoins
• Formally: SHIPMENT’ = SHIPMENT ⋉ SUPPLIER
• express semijoin w/ rel. algebra:
  SHIPMENT ⋉ SUPPLIER = Π_{s#, p#} (SHIPMENT ⋈ SUPPLIER)
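A minimal Python sketch of this reduction, using the slide's SUPPLIER/SHIPMENT data (relations modeled as lists of tuples; the attribute layout is an assumption):

```python
# Semijoin sketch: SHIPMENT' = SHIPMENT semijoin SUPPLIER.
# Only SUPPLIER's join-column values travel to SHIPMENT's site;
# SHIPMENT is reduced there before being shipped for the full join.

supplier = [("s1",), ("s2",), ("s5",), ("s11",)]   # s# (other attrs omitted)
shipment = [("s1", "p1"), ("s2", "p1"), ("s3", "p5"), ("s2", "p9")]

supplier_keys = {s[0] for s in supplier}           # cheap to ship: keys only
shipment_reduced = [t for t in shipment if t[0] in supplier_keys]
# shipment_reduced == [("s1", "p1"), ("s2", "p1"), ("s2", "p9")]
```

The tuple (s3, p5) is filtered out at SHIPMENT's site, so it never crosses the network.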

Semijoins – e.g.:
• suppose each attr. is 4 bytes
• Q: transmission cost (#bytes) for the semijoin SHIPMENT’ = SHIPMENT semijoin SUPPLIER

Semijoins

    S1: SUPPLIER  --(s1, s2, s5, s11)-->  S2: SHIPMENT
    s#   . . .     (4 bytes each)         s#   p#
    s1                                    s1   p1
    s2                                    s2   p1
    s5                                    s3   p5
    s11                                   s2   p9

    S3: SUPPLIER Join SHIPMENT = ?

• Idea: reduce the tables before shipping

Semijoins – e.g.:
• suppose each attr. is 4 bytes
• Q: transmission cost (#bytes) for the semijoin SHIPMENT’ = SHIPMENT semijoin SUPPLIER
• A: 4*4 bytes

Semijoins – e.g.:
• suppose each attr. is 4 bytes
• Q1: give a plan, with semijoin(s)
• Q2: estimate its cost (#bytes shipped)

Semijoins – e.g.:
• A1:
  – reduce SHIPMENT to SHIPMENT’
  – SHIPMENT’ -> S3
  – SUPPLIER -> S3
  – do join @ S3
• Q2: cost?

Semijoins

    S1: SUPPLIER  --(s1, s2, s5, s11)-->  S2: SHIPMENT
    s#   . . .     (4 bytes per attr.)    s#   p#
    s1                                    s1   p1
    s2                                    s2   p1
    s5                                    s3   p5
    s11                                   s2   p9

    S3

Semijoins – e.g.:
• A2:
  – 4*4 bytes – reduce SHIPMENT to SHIPMENT’
  – 3*8 bytes – SHIPMENT’ -> S3
  – 4*8 bytes – SUPPLIER -> S3
  –   0 bytes – do join @ S3
  –  72 bytes TOTAL
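The byte counts in A2 can be checked directly under the slide's assumptions (4 bytes per attribute, two attributes per tuple in each relation):

```python
# Transmission-cost arithmetic for plan A1, as a quick sanity check.
ATTR = 4                        # bytes per attribute (slide's assumption)

keys_shipped   = 4 * ATTR       # SUPPLIER's 4 s# values -> SHIPMENT's site
shipment_prime = 3 * 2 * ATTR   # 3 surviving SHIPMENT tuples, 2 attrs each
supplier_full  = 4 * 2 * ATTR   # 4 SUPPLIER tuples, 2 attrs each
join_at_s3     = 0              # the join itself ships nothing

total = keys_shipped + shipment_prime + supplier_full + join_at_s3
# total == 72 bytes
```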

Other plans?

Other plans?
P2:
• reduce SHIPMENT to SHIPMENT’
• reduce SUPPLIER to SUPPLIER’
• SHIPMENT’ -> S3
• SUPPLIER’ -> S3

Other plans?
P3:
• reduce SUPPLIER to SUPPLIER’
• SUPPLIER’ -> S2
• do join @ S2
• ship results -> S3

A brilliant idea: two-way semijoins
• (not in book, not in final exam)
• reduce both relations with one more exchange [Kang, ’86]: ship back the list of keys that didn’t match
• CAN NOT LOSE! (why?)
• further improvement: or the list of the ones that matched – whichever is shorter!

Two-way Semijoins

    S1: SUPPLIER  --(s1, s2, s5, s11)-->  S2: SHIPMENT
    s#   . . .                            s#   p#
    s1                                    s1   p1
    s2                                    s2   p1
    s5                                    s3   p5
    s11                                   s2   p9

    S1  <--(s5, s11)--  S2    (keys that didn’t match; drop s5, s11 from SUPPLIER)

    S3
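A sketch of the backward phase with the slide's key values; the rule "ship back whichever of the matched/unmatched lists is shorter" follows the slide's improvement, and the variable names are illustrative:

```python
# Two-way semijoin sketch: after reducing SHIPMENT with SUPPLIER's keys,
# the SHIPMENT site ships a key list back so SUPPLIER can be reduced too.
# The backward message is at most as long as the forward key list,
# which is why the extra exchange "can not lose".

supplier_keys = ["s1", "s2", "s5", "s11"]
shipment_keys = {"s1", "s2", "s3"}       # s# values present in SHIPMENT

matched   = [k for k in supplier_keys if k in shipment_keys]       # join
unmatched = [k for k in supplier_keys if k not in shipment_keys]   # drop
# Ship back whichever list is shorter (here it's a tie, 2 vs. 2).
send_back = matched if len(matched) <= len(unmatched) else unmatched

supplier_reduced = matched   # SUPPLIER keeps only rows that will join
```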

Overview
• Problem – motivation
• Design issues
• Query processing – semijoins
• Transactions (recovery, conc. control)

Transactions – recovery
• Problem: e.g., a transaction moves $100 from NY: $50 to LA, $50 to Chicago
  – 3 sub-transactions, on 3 systems
  – how to guarantee atomicity (all-or-none)?
• Observation: additional types of failures (links, servers, delays, time-outs. . .)

Distributed recovery
How?

    NY:      T1,1: -$100
    CHICAGO: T1,2: +$50
    LA:      T1,3: +$50

Distributed recovery
Step 1: choose coordinator

    NY:      T1,1: -$100   (the coordinator)
    CHICAGO: T1,2: +$50
    LA:      T1,3: +$50

Distributed recovery
Step 2: execute a commit protocol, e.g., “2 phase commit”
• when a transaction T completes execution (i.e., when all sites at which T has executed inform the transaction coordinator Ci that T has completed), Ci starts the 2PC protocol ->

2 phase commit   (time flows downward)

    T1,1 (coord.) -> T1,2, T1,3 : “prepare to commit”

2 phase commit   (time flows downward)

    T1,1 (coord.) -> T1,2, T1,3 : “prepare to commit”
    T1,2 -> T1,1 : Y        T1,3 -> T1,1 : Y

2 phase commit   (time flows downward)

    T1,1 (coord.) -> T1,2, T1,3 : “prepare to commit”
    T1,2 -> T1,1 : Y        T1,3 -> T1,1 : Y
    T1,1 -> T1,2, T1,3 : “commit”

2 phase commit (e.g., T1,3 failure)   (time flows downward)

    T1,1 (coord.) -> T1,2, T1,3 : “prepare to commit”

2 phase commit   (time flows downward)

    T1,1 (coord.) -> T1,2, T1,3 : “prepare to commit”
    T1,2 -> T1,1 : Y        T1,3 -> T1,1 : N

2 phase commit   (time flows downward)

    T1,1 (coord.) -> T1,2, T1,3 : “prepare to commit”
    T1,2 -> T1,1 : Y        T1,3 -> T1,1 : N
    T1,1 -> T1,2, T1,3 : “abort”

Distributed recovery
• Many, many additional details (what if the coordinator fails? what if a link fails? etc.)
• and many other solutions (e.g., 3-phase commit)
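The core decision rule of 2PC (commit iff every participant votes Yes) can be sketched as follows; `Site`, `prepare`, and `finish` are hypothetical names, and real 2PC also needs logging and timeouts, which this sketch omits:

```python
# Two-phase commit, decision logic only.
# Phase 1 collects votes; phase 2 broadcasts COMMIT iff all voted Yes.

def two_phase_commit(participants):
    # Phase 1: coordinator asks every site to "prepare to commit".
    votes = [p.prepare() for p in participants]
    decision = "COMMIT" if all(votes) else "ABORT"
    # Phase 2: the same, uniform decision goes to every site.
    for p in participants:
        p.finish(decision)
    return decision

class Site:
    """A participant sub-transaction (hypothetical stand-in)."""
    def __init__(self, ready):
        self.ready = ready     # can this site make its work durable?
        self.state = None
    def prepare(self):
        return self.ready      # vote Yes iff ready
    def finish(self, decision):
        self.state = decision
```

One site voting No (or timing out, treated as No) is enough to abort everywhere, which is exactly the all-or-none guarantee the slides ask for.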

Overview
• Problem – motivation
• Design issues
• Query processing – semijoins
• Transactions (recovery, conc. control)

Distributed conc. control
• also more complicated: distributed deadlocks!

Distributed deadlocks

    LA: T1,la, T2,la        NY: T1,ny, T2,ny

Distributed deadlocks

    LA: T2,la -- T1,la        NY: T2,ny -- T1,ny
    (each site’s local wait-for graph, by itself, has no cycle)

Distributed deadlocks

    LA: T2,la, T1,la        NY: T2,ny, T1,ny

• sites need to exchange wait-for graphs
• clever algorithms, to reduce # of messages
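A sketch of why exchanging wait-for graphs matters: with "waits-for" edges as pairs, each site's local graph can be acyclic while a cycle (a deadlock) appears in their union. The edge data is illustrative:

```python
# Global deadlock detection sketch: union the sites' wait-for graphs
# and look for a cycle with DFS. Edge (a, b) means "a waits for b".

def has_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
    visiting, done = set(), set()
    def dfs(node):
        if node in visiting:
            return True                   # back edge found -> cycle
        if node in done:
            return False
        visiting.add(node)
        if any(dfs(n) for n in graph.get(node, [])):
            return True
        visiting.discard(node)
        done.add(node)
        return False
    return any(dfs(n) for n in list(graph))

# Each site alone sees no cycle; only the union reveals the deadlock
# (T1,la / T1,ny are the same global transaction T1; likewise T2).
la_edges = [("T2", "T1")]                 # at LA: T2 waits for T1
ny_edges = [("T1", "T2")]                 # at NY: T1 waits for T2
```

This is why the sites must exchange (or centrally collect) their wait-for graphs, and why reducing the number of messages needed to do so is worth clever algorithms.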

Conclusions
• Distr. DBMSs: not deployed
• BUT: produced clever ideas:
  – semijoins
  – distributed recovery / conc. control
• which can be useful for:
  – parallel db / clusters
  – ‘active disks’
  – replicated db (e-commerce servers)