Outline Distributed DBMS Introduction Background Distributed DBMS Architecture
Outline Distributed DBMS Introduction Background Distributed DBMS Architecture Distributed Database Design Distributed Query Processing à Query Processing Methodology à Distributed Query Optimization Transaction Management Building Distributed Database Systems (RAID) Mobile Database Systems Privacy, Trust, and Authentication Peer to Peer Systems © 1998 M. Tamer Özsu & Patrick Valduriez Page 7 -9. 1
Useful References Textbook Principles of Distributed Database Systems , Chapter 6 Distributed DBMS © 1998 M. Tamer Özsu & Patrick Valduriez Page 7 -9. 2
Query Processing high level user query processor low level data manipulation commands Distributed DBMS © 1998 M. Tamer Özsu & Patrick Valduriez Page 7 -9. 3
Query Processing Components Query language that is used à SQL: “intergalactic dataspeak” Query execution methodology à The steps that one goes through in executing highlevel (declarative) user queries. Query optimization à How do we determine the “best” execution plan? Distributed DBMS © 1998 M. Tamer Özsu & Patrick Valduriez Page 7 -9. 4
Selecting Alternatives SELECT FROM WHERE AND ENAME EMP, ASG EMP. ENO = ASG. ENO DUR > 37 Project Select Join Strategy 1 ENAME( DUR>37 EMP. ENO=ASG. ENO (EMP ASG)) Strategy 2 ENAME(EMP ENO ( DUR>37 (ASG))) Strategy 2 avoids Cartesian product, so is “better” Distributed DBMS © 1998 M. Tamer Özsu & Patrick Valduriez Page 7 -9. 5
What is the Problem? Site 1 Site 2 ASG 1= ENO≤“E 3”(ASG) Site 3 ASG 2= ENO>“E 3”(ASG) EMP 1= ENO≤“E 3”(EMP) result = EMP 1’ EMP 2’ EMP 1’ Site 4 EMP 1’=EMP 1 Site 1 ’ ENOASG 1’= DUR>37(ASG 1) Distributed DBMS Site 5 EMP 2= ENO>“E 3”(EMP) Result Site 5 Site 3 Site 4 result 2=(EMP 1 EMP 2) ENO DUR>37(ASG 1) EMP 2’=EMP 2 Site 2 ’ ENOASG 2 ASG 1 ASG 2 EMP 1 EMP 2 Site 1 Site 2 Site 3 Site 4 ASG 2’= DUR>37(ASG 2) © 1998 M. Tamer Özsu & Patrick Valduriez Page 7 -9. 6
Cost of Alternatives Assume: à size(EMP) = 400, size(ASG) = 1000 à tuple access cost = 1 unit; tuple transfer cost = 10 units Strategy 1 produce ASG': (10+10) tuple access cost transfer ASG' to the sites of EMP: (10+10) tuple transfer cost produce EMP': (10+10) tuple access cost 2 transfer EMP' to result site: (10+10) tuple transfer cost 20 200 40 200 Total cost 460 Strategy 2 transfer EMP to site 5: 400 tuple transfer cost transfer ASG to site 5 : 1000 tuple transfer cost produce ASG': 1000 tuple access cost join EMP and ASG': 400 20 tuple access cost 4, 000 10, 000 1, 000 8, 000 Total cost 23, 000 Distributed DBMS © 1998 M. Tamer Özsu & Patrick Valduriez Page 7 -9. 7
Query Optimization Objectives Minimize a cost function I/O cost + CPU cost + communication cost These might have different weights in different distributed environments Wide area networks à communication cost will dominate (80 – 200 ms) low bandwidth low speed high protocol overhead à most algorithms ignore all other cost components Local area networks à communication cost not that dominant (1 – 5 ms) à total cost function should be considered Can also maximize throughput Distributed DBMS © 1998 M. Tamer Özsu & Patrick Valduriez Page 7 -9. 8
Complexity of Relational Operations Operation Assume à relations of cardinality n à sequential scan Complexity Select Project (without duplicate elimination) O(n) Project (with duplicate elimination) Group O(nlog n) Join Semi-join O(nlog n) Division Set Operators Cartesian Product Distributed DBMS © 1998 M. Tamer Özsu & Patrick Valduriez O(n 2) Page 7 -9. 9
Query Optimization Issues – Types of Optimizers Exhaustive search à cost-based à optimal à combinatorial complexity in the number of relations Heuristics à not optimal à regroup common sub-expressions à perform selection, projection first à replace a join by a series of semijoins à reorder operations to reduce intermediate relation size à optimize individual operations Distributed DBMS © 1998 M. Tamer Özsu & Patrick Valduriez Page 7 -9. 10
Query Optimization Issues – Optimization Granularity Single query at a time à cannot use common intermediate results Multiple queries at a time à efficient if many similar queries à decision space is much larger Distributed DBMS © 1998 M. Tamer Özsu & Patrick Valduriez Page 7 -9. 11
Query Optimization Issues – Optimization Timing Static à compilation optimize prior to the execution à difficult to estimate the size of the intermediate results error propagation à can amortize over many executions à R* Dynamic à à run time optimization exact information on the intermediate relation sizes have to reoptimize for multiple executions Distributed INGRES Hybrid à compile using a static algorithm à if the error in estimate sizes > threshold, reoptimize at run time à MERMAID Distributed DBMS © 1998 M. Tamer Özsu & Patrick Valduriez Page 7 -9. 12
Query Optimization Issues – Statistics Relation à cardinality à size of a tuple à fraction of tuples participating in a join with another relation Attribute à cardinality of domain à actual number of distinct values Common assumptions à independence between different attribute values à uniform distribution of attribute values within their domain Distributed DBMS © 1998 M. Tamer Özsu & Patrick Valduriez Page 7 -9. 13
Query Optimization Issues – Decision Sites Centralized à single site determines the “best” schedule à simple à need knowledge about the entire distributed database Distributed à cooperation among sites to determine the schedule à need only local information à cost of cooperation Hybrid à one site determines the global schedule à each site optimizes the local subqueries Distributed DBMS © 1998 M. Tamer Özsu & Patrick Valduriez Page 7 -9. 14
Query Optimization Issues – Network Topology Wide area networks (WAN) – point-to-point à characteristics low bandwidth low speed high protocol overhead à communication cost will dominate; ignore all other cost factors à global schedule to minimize communication cost à local schedules according to centralized query optimization Local area networks (LAN) à à Distributed DBMS communication cost not that dominant total cost function should be considered broadcasting can be exploited (joins) special algorithms exist for star networks © 1998 M. Tamer Özsu & Patrick Valduriez Page 7 -9. 15
- Slides: 15