1 Outline Introduction Background Distributed Database Design Database

1 Outline Introduction Background Distributed Database Design Database Integration Semantic Data Control Distributed Query Processing Overview Query decomposition and localization Distributed query optimization Multidatabase Query Processing Distributed Transaction Management Data Replication Parallel Database Systems Distributed Object DBMS Peer-to-Peer Data Management Web Data Management Current Issues

Query Processing in a DDBMS high level user query processor Low-level data manipulation commands for D-DBMS 2

3 Query Processing Components Query language that is used Query execution methodology The steps that one goes through in executing high-level (declarative) user queries. Query optimization SQL: “intergalactic dataspeak” How do we determine the “best” execution plan? We assume a homogeneous D-DBMS

Selecting Alternatives SELECT ENAME FROM EMP, ASG WHERE EMP. ENO = ASG. ENO AND RESP = "Manager" Strategy 1 ENAME ( RESP=“Manager” EMP. ENO=ASG. ENO (EMP×ASG)) Strategy 2 ENAME (EMP ⋈ENO ( RESP=“Manager” (ASG)) Strategy 2 avoids Cartesian product, so may be “better”. 4

5 What is the Problem? - Selecting the “best” strategy Site 1 Site 2 ASG 1=σENO≤“E 3”(ASG) Site 3 Site 4 ASG 2= σENO>“E 3”(ASG) EMP = σ 1 ENO≤“E 3”(EMP) Site 5 EMP 2= σENO>“E 3”(EMP) Site 5 Result Site 5 result= (EMP 1 ⋃ EMP 2)⋈ENOσRESP=“Manager”(ASG 1 ⋃ ASG 2) ASG 1 Site 1 ASG 2 Site 2 EMP 1 Site 3 EMP 2 Site 4 Site 3 Site 4 EMP’ 1=EMP 1 ⋈ENO ASG’ 1 EMP’ 2=EMP 2 ⋈ENO ASG’ 2 Strategy X Site 1 Site 2 Strategy Y

6 Cost of Alternative Strategies Assume size(EMP) = 400, size(ASG) = 1000 tuple access cost = 1 unit; tuple transfer cost = 10 units Strategy X transfer EMP to site 5: 400 tuple transfer cost transfer ASG to site 5: 1000 tuple transfer cost produce ASG': 1000 tuple access cost join EMP and ASG': 400 20 tuple access cost Total Cost 4, 000 10, 000 1, 000 8, 000 23, 000 Strategy Y produce ASG': (10+10) tuple access cost transfer ASG' to the sites of EMP: (10+10) tuple transfer cost produce EMP': (10+10) tuple access cost 2 transfer EMP' to result site: (10+10) tuple transfer cost Total Cost 20 200 460

7 Query Optimization Objectives Minimize a cost function I/O cost + CPU cost + communication cost NOTE: These might have different weights in different distributed environments. Wide area networks communication cost may dominate or vary much bandwidth speed high protocol overhead Local area networks communication cost not that dominant total cost function should be considered Can also maximize throughput

Complexity of Relational Operations Operation Assume relations of cardinality n sequential scan Select Project (without duplicate elimination) Project (with duplicate elimination) Group 8 Complexity O(n) O(n log n) Join Semi-join O(n log n) Division Set Operators Cartesian Product O(n 2)

Query Optimization Issues – Types 9 Of Optimizers Exhaustive search Cost-based Optimal Combinatorial complexity in the number of relations Heuristic search Try to restrict the search space in order to find a very good solution (although may not be optimal) Regroup common sub-expressions Perform selection, projection first Replace a join by a series of semijoins Reorder operations to reduce intermediate relation size Optimize individual operations

Query Optimization Issues – Optimization Granularity Single query at a time Cannot use common intermediate results Multiple queries at a time Efficient if many similar queries Decision space is much larger 10

Query Optimization Issues – 11 Optimization Timing Static Compilation-time optimization optimize prior to the execution + The size of the intermediate results are estimated (using database statistics). Errors in the estimates may lead to suboptimal strategies. - The optimization cost can amortize over multiple executions. Dynamic Run-time optimization + Exact information on the intermediate relation sizes Estimation errors are minimized. - Have to re-optimize for multiple executions of the same query Hybrid Compile using a static algorithm If the error in estimate sizes > threshold, re-optimize at run time.

Query Optimization Issues – Statistics on the database Relation Cardinality of relations (or fragments of relations) Size of a tuple Fraction of tuples participating in a join with another relation Attribute Cardinality of domain Actual number of distinct values Common assumptions Independence between different attribute values Uniform distribution of attribute values within their domain Periodic updates of statistics to maintain accuracy 12

Query Optimization Issues – Decision Sites Centralized Single site determines the “best” schedule Simple Need knowledge about the entire distributed database Distributed Cooperation among sites to determine the schedule Need only local information Cost of cooperation Hybrid One site determines the global schedule Each site optimizes the local subqueries 13

Query Optimization Issues – 14 Network Topology Wide area networks (WAN) – point-to-point Characteristics: Low bandwidth, Low speed, High protocol overhead Communication cost will dominate; ignore all other cost factors Global schedule to minimize communication cost Local schedules according to centralized query optimization Local area networks (LAN) Communication cost not that dominant Total cost function should be considered LAN broadcasting can be exploited (to optimize processing of joins) In client-server systems, query optimization involves data shipping and/or query shipping.

Query Processing in Client/Server Systems query shipping The client sends a query to the server. The server processes the query and sends the result back to the client. + Only the query and the result are sent over the network. - The client resources are not exploited. - High workload on the server data shipping The server sends the data to the clients, and coordinates with the clients in processing the query. The clients are doing the actual data processing. + Client resources are utilized in query processing. - More complex, coordinated query processing

Query Optimization Issues – 16 Replicated Fragments A distributed relation is usually divided into relation fragments. Localization: Distributed queries expressed on global relations are mapped into queries on physical fragments of relations by translating relations into fragments. The fragments are replicated at different sites. Why? Implications to query optimization?

Query Optimization Issues – 17 Semijoins Using joins or semijoins? A semijoin is particularly useful for improving the processing of distributed join operators as it reduces the size of data exchanged between sites. Semijoins help to reduce the join operands. Trade-offs?

18 Distributed Query Processing Methodology Calculus Query on Distributed Relations Query Decomposition GLOBAL SCHEMA Algebraic Query on Distributed Relations CONTROL SITE Data Localization FRAGMENT SCHEMA Fragment Query Global Optimization STATS ON FRAGMENTS Optimized Fragment Query with Communication Operations LOCAL SITES Local Optimization Optimized Local Queries LOCAL SCHEMAS