Distributed Database Systems COP 5711 Outline Introduction Distributed

Distributed Database Systems COP 5711

Outline • Introduction • Distributed Database Architectures • Distributed Database Design • Query Processing

What is a Distributed Database System ? A distributed database is a collection of databases which are distributed over different computers of a computer network. • Each site has autonomous processing capability and can perform local applications. • Each site also participates in the execution of at least one global application which requires accessing data at several sites. Shared-Nothing System Global application No local applications Local application

Replication & Fragmentation Transparency • The user is unaware of the replication of fragments • Queries are specified on the relations (rather than the fragments). Applications see only the relations Data fragments Relation R Applications do not need to know this Copy 1 of R 1 Copy 1 of R 2 Fragment R 1 Fragment R 2 Site A Copy 2 of R 1 Site B Fragment R 3 Fragment R 4 Site C Copy 2 of R 2

Why Distributed Databases ? 1. Local Autonomy: permits setting and enforcing local policies regarding the use of local data (suitable for organization that are inherently decentralized). 2. Improved Performance: The regularly used data is proximate to the users and given the parallelism inherent in distributed systems. 3. Improved Reliability/Availability: q q Data replication can be used to obtain higher reliability and availability. The autonomous processing capability of the different sites ensures a graceful degradation property. 4. Incremental Growth: supports a smooth incremental growth with a minimum degree of impact on the already existing sites. 5. Shareability: allows preexisting sites to share data. 6. Reduced Communication Overhead: The fact that many applications are local clearly reduces the communication overhead with respect to centralized databases.

Disadvantages of DDBSs Cost: replication of effort (manpower). Security: More difficult to control Complexity: • The possible duplication is mainly due to reliability and efficiency considerations. Data redundancy, however, complicates update operations. • If some sites fail while an update is being executed, the system must make sure that the effects will be reflected on the data residing at the failing sites as soon as the system can recover from the failure. • The synchronization of transactions on multiple sites is considerably harder than for a centralized system.

Distributed DBMS Architecture

ANSI/SPARC Architecture External Schema External view Conceptual Schema Conceptual view Internal Schema Internal view External view CREAT VIEW … CREAT TABLE … B+-tree Index … Internal view: deals with the physical definition and organization of data. Conceptual view: abstract definition of the database. It is the “real world” view of the enterprise being modeled in the database. External view: individual user’s view of the database.

Distributed Data Systems A distributed database can be defined as • a logically integrated collection of shared data which is • physically distributed across the nodes of a computer network. Simplify software development Improve system performance Logically integrated Physically distributed

A Taxonomy of Distributed Data Systems No local user Distributed data systems Homogeneous Heterogeneous (Multidatabase) Loosely coupled (interoperable DB systems using export schema) Tightly coupled (/w global schema)

Architecture of a Homogeneous DDBMS Global user view 1 . . . Global user view n Global Schema Fragmentation Schema Allocation Schema Local conceptual schema 1 Local conceptual schema n Local internal schema 1 Local internal schema n Local DB 1 Local DB n A homogeneous DDBMS resembles a centralized DB, but instead of storing all the data at one site, the data is distributed across a number of sites in a network.

Architecture of a Homogeneous DDBMS Global user view 1 Global user view n Global Schema Fragmentation Schema Allocation Schema Local conceptual schema 1 Local conceptual schema n Local internal schema 1 Local internal schema n Local DB 1 Local DB n A homogeneous DDBMS also looks like a shared-nothing parallel DBMS; however, a data fragment is a relation of a local database

Fragmentation Schema & Allocation Schema Fragmentation Schema: describes how the global relations are divided into fragments. Allocation Schema: specifies at which sites each fragment is stored. Applications see only the relations (global schema) Allocation Relation R Fragmentation Copy 1 of R 1 Copy 1 of R 2 Fragment R 1 Fragment R 2 Site A Copy 2 of R 1 Site B Fragment R 3 Fragment R 4 Site C Copy 2 of R 2

Homogeneous vs. Heterogeneous • Homogeneous DDBMS Global users – No local users – Most systems do not have local schemas (i. e. , every user uses the same schema) • Heterogeneous DDBMS Homogeneous Database management system DBMS

Homogeneous vs. Heterogeneous • Homogeneous DDBMS – No local users Global user Local user Multidatabase Management system – Most systems do not have local schemas (i. e. , every user uses the same schema) • Heterogeneous DDBMS Local user – There are both local and global users – Multidatabase systems are split into: DBMS • Tightly Coupled Systems: have a global schema • Loosely Coupled Systems: do not have a global schema.

Schema Architecture of a Tightly. Coupled System Global user view 1 Global user view n Global Conceptual Schema Auxiliary Schema 1 Local user view 2 An individual node’s participation in the MDB is defined by means of a participation schema. Local Participation Schema 1 Local Participation Schema n Auxiliary Schema 1 Local Conceptual Schema n Local user view 1 Local Internal Schema n Local user view 2 Local DB 1 Local DB n

Auxiliary Schema (1) Auxiliary schema describes the rules which govern the mappings between the local and global levels. q Rules for unit conversion: may be required when one site expresses distance in kilometers and another in miles, … q Rules for handling null values: may be necessary where one site stores additional information which is not stored at another site. – Example: One site stores the name, home address and telephone number of its employees, whereas another just stores names and addresses.

Auxiliary Schema (2) q Rules for naming conflicts: naming conflicts occur when: § semantically identical data items are named differently • DNAME Department name (at Site 1) • DEPTNAME Department name (at Site 2) § semantically different data items are named identically. • NAME Department name (at Site 1) • NAME Manager name (at Site 2) q Rules for handling data representation conflicts: Such conflicts occur when semantically identical data items are represented differently in different data source. § Example: Data represented as a character string in one database may be represented as a real number in the other database.

Auxiliary Schema (3) q Rules for handling data scaling conflicts: Such conflicts occur when semantically identical data items stored in different databases using different units of measure. § Example: “Large”, “New”, “Good”, etc. These problems are called domain mismatch problems

Loosely-Coupled Systems (Interoperable Database Systems) No global conceptual schema Local user view 1 Local user view 2 Global user view 1 Global user view 2 Global user view 3 Local Conceptual schema 1 Local Conceptual Schema 2 Local Conceptual Schema n Local internal schema 1 Local internal Schema 2 Local internal Schema n Local DB 1 Local DB 2 Local DB n

Loosely-Coupled Systems No global conceptual schema Interoperable DB system using export schema Local user view 1 Local user view 2 Global user view 1 Export schema 1 Global user view 2 Export schema 2 Export Schema 3 Global user view m Export Schema n Local Conceptual schema 1 Local Conceptual Schema 2 Local Conceptual Schema n Local internal schema 1 Local internal Schema 2 Local internal Schema n Local DB 1 Local DB 2 Local DB n

Loosely-Coupled vs. Tightly Coupled No global conceptual schema

A Book on Distributed DB 2020 edition ISBN-13: 9783030262525 ISBN-10: 3030262529

Distributed Database Design • Top-Down Approach: The database system is being designed from scratch. • Issue: fragmentation & allocation • Bottom-up Approach: Integrating existing databases into one database • Issue: Design of the export and global schemas.

Fragmentation Alternatives J JNO JNAME BUDGET J 1 J 2 J 3 J 4 Instrumental Database Dev. CAD/CAM Maintenance 150, 000 135, 000 250, 000 350, 000 LOC Montreal New York Paris Horizontal Partitioning J 1 JNO J 1 J 2 JNAME Instrumental Database Dev. Vertical Partitioning BUDGET LOC 150, 000 135, 000 Montreal New York JNO JNAME BUDGET LOC J 3 J 4 CAD/CAM Maintenance. 150, 000 310, 000 Montreal Paris JNO J 1 J 2 J 3 J 4 BUDGET 150, 000 135, 000 250, 000 310, 000 JNAME Instrumentation Database Devl CAD/CAM Maintenance LOC Montreal New York Paris Must ensure lossless join

Why fragment at all? Reasons: • Interquery concurrency: Executing multiple queries on different fragments simultaneously • Intraquery concurrency: Executing multiple subqueries of a query on different fragments simultaneously Query 2 Query 1 Intraquery concurrency Query 1. 1 F 2 Interquery concurrency Query 1. 2 F 3 F 4 F 5 Data Fragments

Fragmentation Disadvantage • Vertical fragmentation may incur overhead. • Attributes participating in a functional dependency may be allocated to different sites. F Integrity checking is more costly. SSN Name Skills Functional dependency Salary

Degree of Fragmentation • Application views are usually subsets of relations. Hence, it is only natural to consider subsets of relations as distribution units. • The appropriate degree of fragmentation is dependent on the applications. Relation A distribution unit Very popular at location i Query A distribution unit 40 tuples

Correctness Rules • Vertical Partitioning • Lossless-Join decomposition • Dependency preservation • Horizontal Partitioning • Disjoint fragments

Simple Predicates Given a relation R(A 1, A 2, …, An) where Ai has domain Di, a simple predicate pj defined on R has the form p j : Ai Value where Example: and Value J JNO J 1 J 2 J 3 J 4 JNAME Instrumental Database Dev. CAD/CAM Maintenance Simple predicates: A simple predicate define a data fragment BUDGET 150, 000 135, 000 250, 000 350, 000 LOC Montreal New York Orlando p 1: JNAME = “Maintenance” P 2: BUDGET < 200, 000 p 2

MINTERM PREDICATE Given a set of simple predicates for relation R. P = {p 1, p 2, …, pm} The set of minterm predicates Employee M = {m 1, m 2, …, mn} is defined as M = {m i | m i = where Possible simple predicates: P 1: TITLE=“Elect. Eng. ” P 2: TITLE=“Syst. Analy” P 3: TITLE=“Mech. Eng. ” P 4: TITLE=“Programmer” P 5: SAL ≤ 35, 000 P 6: SAL > 35, 000 NAME } TITLE SAL Brandon Elect. Eng. 40, 000 Kacey Syst. Analy. 54, 000 Jacob Mech. Eng. 32, 000 Kevin Programmer 42, 000 Some corresponding minterm predicates: A minterm predicate defines a data fragment

MINTERM PREDICATE Given a set of simple predicates for relation R. P = {p 1, p 2, …, pm} The set of minterm predicates Employee M = {m 1, m 2, …, mn} is defined as M = {m i | m i = where SELECT * FROM Employee WHERE Title = “Elect. Eng. ” AND SAL > 30, 000 Query predicate is a minterm predicate NAME } TITLE SAL Brandon Elect. Eng. 40, 000 Kacey Syst. Analy. 54, 000 Jacob Mech. Eng. 32, 000 Kevin Programmer 42, 000 Some corresponding minterm predicates: A minterm predicate defines a data fragment

Primary Horizontal Fragmentation A primary horizontal fragmentation is defined by a selection operation on the relation E ENO ENAME TITLE A possible fragmentation of J is defined as follows: J JNO JNAME BUDGET LOC L 3 L 2 G ENO JNO RESP DUR

Horizontal Fragments Thus, a horizontal fragment Ri of relation R consists of all the tuples of R that satisfy a minterm predicate mi Example: Salary > 100000 Λ Age < 20 RELATION There as many horizontal fragments (also called minterm fragments) as there are minterm predicates. Minterm fragment Minterm fragment

Completeness: Scenario 1 Simple Predicates A 1 ≥ k 1 Minterm Fragments F 2 A 3 ≤ k 3 A 4 = k 4 p 1 F 1 A 2 = k 2 Pr Applications p 1 p 3 p 4 F 3 Complete The fragments look homogeneous (i. e. , accessed with the same probability) to any of the applications App 1 p 4 App 2 App 3 App 4 A 4 accesses any tuple of fragment F 3 with the same probability p 4

Completeness: Scenario 2 Simple Predicates A 1 ≥ k 1 Minterm Fragments A 3 ≤ k 3 p 1 F 1 A 2 = k 2 F 2 p 1 p 3 A 4 = k 4 F 3 A 5 was not considered Applications App 1 p 3 p 4 p 5 App 2 App 3 App 4 SELECT … FROM … WHERE A 5 > k 5

Completeness: Scenario 2 Simple Predicates A 1 ≥ k 1 Minterm Fragments p 1 F 1 A 2 = k 2 F 2 A 3 ≤ k 3 A 4 = k 4 F 31 F 3 A 5 < k 5 Applications p 1 p 3 App 1 p 3 p 4 p 5 App 2 App 3 App 4 F 32 Additional simple predicate A 5 ≥ k 5 Now complete !

Completeness A set of simple predicate Pr is said to be complete if and only if there is an equal probability of access by every application to any two tuples belonging to any minterm fragment that is defined according to Pr.

Completeness Example 1 Case 1: The only application that accesses J wants to access the tuples according to the location. Horizonal fragmentation The set of simple predicates J J 1 30% LOC=“New York” J 2 LOC=“Orlando” J 3 Pr = 40% 30% Application LOC=“Montreal”, LOC=“New York”, LOC=“Orlando” is complete because each tuple of each fragment has the same probability of being accessed.

Completeness Example 2 Example: J 1 J 2 J 3 JNO 001 JNAME Instrumental JNO 004 007 JNAME GUI CAD/CAM JNO 003 JNAME Database Dev. BUDGET 150, 000 LOC Montreal BUDGET LOC 135, 000 New York 250, 000 New York BUDGET LOC 310, 000 Orlando LOC=“Montreal”, Pr = LOC=“New York”, LOC=“Orlando” n atio c li App 1 Budget < $200, 000 App Not homogeneous licat ion 2 Case 2: There is a second application which accesses only those project tuples where the budget is less than $200, 000. Ø Since tuple “ 004” is accessed more frequently than tuple “ 007” by this application, Pr is not complete. Ø To make the set of simple predicates complete, we need to add (BUDGET< 200, 000) to Pr.

Completeness Example 2 Example: J 1 J 2 J 3 JNO 001 JNO 004 007 JNO 003 JNAME Instrumental JNAME GUI CAD/CAM JNAME Database Dev. BUDGET 150, 000 LOC Montreal BUDGET LOC 135, 000 New York 250, 000 New York BUDGET LOC 310, 000 Orlando LOC=“Montreal”, Pr = LOC=“New York”, LOC=“Orlando” n atio lic p p A 1 Smallbudget projects App Not homogeneous licat ion 2 The set of simple predicates should include “BUDGET” !

Completeness Example 2 BUDGET<=200, 000 LOC=“Montreal” J 1 J LOC=“New York” J 2 LOC=“Orlando” J 3 J 11 J 12 BUDGET>200, 000 BUDGET<=200, 000 J 21 Small-budget projects in New York J 22 BUDGET>200, 000 BUDGET<=200, 000 J 31 J 32 BUDGET>200, 000 Homogeneous again

Completeness Example 2 This makes data allocation easier Homogeneous Completeness is a desirable property because a complete set defines fragments that are not only logically uniform (in that they all satisfy the minterm predicate), but statistically homogeneous.

Redundant Fragmentation I want only apples I also want the other apples Too much fragmentation ! Logically uniform & statistically homogeneous fragment Fragment 1 Fragment 2

Relevant: Let mi and mj be two almost identical minterm predicates: mi = p 1 Λ p 2 Λ p 3 fragment fi mj = p 1 Λ ¬ p 2 Λ p 3 fragment fj The simple predicate p 2 is relevant if and only if Access frequency Cardinality f p 1 f 1 p 3 f 12 p 2 fi ¬p 2 fj Prob 1 Prob 2 A Prob 1 ≠ Prob 2

Relevant That is, there should be at least one application that accesses fi and fj differently. i. e. , The simple predicate pi should be relevant in determining a fragmentation. f p 1 f 1 p 3 f 12 p 2 fi ¬p 2 fj Prob 1 Prob 2 A Prob 1 ≠ Prob 2

Minimality If all the predicates of a set Pr are relevant, Pr is minimal. f p 1 f 1 p 3 f 12 p 2 fi ¬p 2 fj Prob 1 Prob 2 A Prob 1 = Prob 2

A Complete and Minimal Example Two applications: 1. One application accesses the tuples according to location. 2. Another application accesses only those project tuples where the budget is less than $200, 000. Case 1: Pr={Loc=“Montreal”, Loc=“New York”, Loc=“Orlando”, BUDGET<=200, 000, BUDGET>200, 000} is complete and minimal. Case 2: If, however, we were to add the predicate JNAME= “Instrumentation” to Pr, the resulting set would not be minimal since the new predicate is not relevant with respect to the applications. (see next slide)

“Relevant” Example BUDGET<=200, 000 LOC=“Montreal” J 1 J LOC=“New York” J 2 LOC=“Orlando” J 3 JNAME = “Instrument” J 11 J 122 BUDGET>200, 000 JNAME! “Instrument” BUDGET<=200, 000 J 21 [ JNAME = “Instrument” ] is not relevant. J 22 BUDGET>200, 000 BUDGET<=200, 000 J 31 J 32 Database Fragmentation should base on the applications BUDGET>200, 000 Relevant Irrelevant

Information for Fragment Allocation • Minterm Selectivity sel(mi): number of tuples that would be accessed by a query specified according to a given minterm predicate. • Access Frequency acc(qi): the access frequency of queries in a given period. Minterm selectivity and access frequency guide the allocation activity Very popular at location i Give this fragment to location i Query 4000 tuples

Meaningful Minterm Predicates Applications: • Take the salary and determine a raise accordingly. • The employee records are managed in two places, one handling the records of those with salary less than or equal to $30, 000 and the other handling the records of those who earn more than $30, 000. All the applications are about salary: Pr={p 1: SAL ≤ 30, 000, p 2: SAL > 30, 000} is complete and minimal. The minterm predicates: This fragment is empty is contradictory Implications: We are left with two meaningful minterm predicates M = {m 2, m 3}

Compute Complete & Minimal Set Rule: Only relevant simple predicates are used in the fragmentation. • Repeat until the predicate set is complete P M I M’ – Find a simple predicate pi that is relevant – Determine minterm fragments fi and fj according to pi and ¬pi, respectively – Accept pi , fi , and fj – Remove any pk and fk from acceptance list if pk becomes irrelevant /* the list is minimal */

Compute Complete & Minimal Set Rule: Only relevant simple predicates are used in the fragmentation. • Repeat until the predicate set is complete P M – Find a simple predicate pi that is relevant – Determine minterm fragments fi and fj according to pi and ¬pi, respectively – Accept pi , fi , and fj – Remove any pk and fk from acceptance list if pk becomes irrelevant /* the list is minimal */ • Determine the set of minterm predicates M (using the acceptance list) I • Determine the set of implications I (among the acceptance list) M’ • For each mi in M, remove mi if it is contradictory according to I

Derived Horizontal Fragmentation Example: Not a good idea to fragment these two relations independently PAY (TITLE, SAL) EMP (ENO, ENAME, TITLE)

Derived Horizontal Fragmentation • In some cases, the horizontal fragmentation of a relation cannot be based on a property of its own attributes, but is derived from the horizontal fragmentation of another relation. • Derived fragmentation is used to facilitate the join between fragments. Example: Not a good idea to fragment these two relations independently PAY (TITLE, SAL) EMP (ENO, ENAME, TITLE)

Not Using Derived Fragmentation Example: Not a good idea to fragment these two relations independently PAY (TITLE, SAL) EMP (ENO, ENAME, TITLE) One can divide EMP into EMP 1 and EMP 2 based on TITLE and divide PAY into PAY 1, PAY 2, PAY 3 based on SAL using primary fragmentation. To join EMP and PAY, we have the following scenarios. EMP 1 PAY 1 EMP 2 PAY 2 EMP 3 PAY 3 Fragments based on ENO Fragments based on TITLE More communication overhead !

Benefits of Derived Fragmentation PAY (TITLE, SAL) EMP (ENO, ENAME, TITLE) Primary Fragmentation: SJ: semi-join EMP 1 = EMP SJ PAY 1 EMP 2 = EMP SJ PAY 2 EMP 3 = EMP SJ PAY 3 EMP 1 and PAY 1 can be allocated to the same site EMP 1 PAY 1 EMP 2 PAY 2 EMP 3 PAY 3 Derived fragmentation Primary fragmentation based on TITLE

Apply horizontal fragmentation for R 1 (R 1 PK, …) Chain Relationships Apply derived fragmentation for R 2 based on R 1 R 2 (R 2 PK, R 1 FK, …) R 3 (R 3 PK, R 2 FK, …). . . Apply derived fragmentation for R 3 based on R 2 • • Design the primary fragmentation for R 1. Derive the derived fragmentation for Rk as follows: • Rk = Rk SJRKFK=R(k-1)PK R(k-1) • for 2 k n in that order.

Derived Fragmentation EMP (ENO, ENAME, TITLE) Join might be required PROJ (PNO, PNAME, BUDGET) EMP_PROJ (ENO, PNO, RESP, DUR) • How do we fragment EMP_PROJ ? – Semi-Join with EMP, or – Semi-Join with PROJ • Criterion: Support the more-frequent join operation

VERTICAL FRAGMENTATION Purpose: Identify fragments Ri such that many applications can be executed using just one vertical fragment. Advantage: When many applications which use R 1 and many applications which use R 2 are issued at different sites, fragmenting R avoids communication overhead. Local access. No communication A 7 A 1 R 2 Site 1 Site 2

A B A C B D C D F E G F G B H H D H E L L F H L L L H L L L H A B C D E A D B C F F • Splitting is considered only for attributes that do not participate in the primary key. (Primary key must be in each vertical fragment) • The split approach involves three steps: H H C Split Approach H H A G E G 1. Obtain attribute affinity matrix. 2. Use a clustering algorithm to group some attributes together based on the attribute affinity matrix. This algorithm produces a clustered affinity matrix. 3. Use a partitioning algorithm to partition attributes such that set of attributes are accessed solely or for the most part by distinct set of applications.

Attribute Usage Matrix Attribute-query relationship PROJ PNO PNAME BUDGET LOC A 1 q 1: SELECT BUDGET FROM PROJ WHERE PNO=Value; q 2: SELECT PNAME, BUDGET FROM PROJ; q 3: SELECT PNAME FROM PROJ WHERE LOC=Value; q 4: SELECT SUM(BUDGET) FROM PROJ WHERE Loc=Value A 2 A 3 use(qi, Aj) = q 1 q 2 q 3 q 4 A 4 1 if Aj is referenced by qi 0 otherwise A 1 A 2 A 3 A 4 Attribute Usage Matrix

Attribute Affinity Measure Relation R Ai I use query qk at Site m Site n Site m qi qk qi Ak Aj refs(qk) : Number of accesses to attributes (Ai, Aj) for each execution of qk at site s refs(qk) x accs(qk) = How much Site S would like Ai and Aj together due to query qk qi Site s qk qi accs (qk) : Application access frequency of qk at site s.

Attribute Affinity Measure Sum over each query that uses both Ai and Aj Popularity of using Ai and Aj together For each query qk that uses both Ai and Aj Sum over all sites that host such a query Popularity of such Ai-Aj pair at all sites Use(q 1, A 2) = 1 refs(qk) x accs(qk) = How much Site S would like Ai and Aj together due to query qk

Attribute Affinity Matrix For each query qk that uses both Ai and Aj A 1 A 2 A 3 A 4 Attribute Affinity Matrix Popularity of such Ai-Aj pair at all sites

Attribute Affinity Matrix Example A 1 A 2 A 3 A 4 q 1 q 2 q 3 q 4 Attribute Usage Matrix A 1 A 2 A 3 A 4 Attribute Affinity Matrix (AA) Next Step - Determine clustered affinity (CA) matrix

Clustered Affinity Matrix Step 1: Initialize CA Copy first 2 columns A 1 A 2 A 3 A 4 Attribute Affinity Matrix (AA) A 1 A 2 A 3 A 4 Clustered Affinity Matrix (CA)

Clustered Affinity Matrix Step 2: Determine Location for A 3 Dummy attribute A 0 3 possible positions for A 3 e n o h c i h W A 0 A 3 A 1 A 2 A 3 A 4 A 5 A 1 A 2 A 3 A 4 Attribute Affinity Matrix (AA) ? r e t t is be A 1 A 2 A 3 A 1 A 3 A 2 A 0 A 1 A 2 A 3 A 4 A 5 A 1 A 2 A 3 A 4 Clustered Affinity Matrix (CA) Answer: Compare Cont(A 0, A 3, A 1), Cont(A 1, A 3, A 2), and Cont(A 2, A 3, A 4)

Clustered Affinity Matrix Step 2: Determine the order for A 3 When z=x, aff(Ax, Ax) · aff(Ax, Ay) = aff(Ax, Ay) Contribution When z=x When z≠x and z≠y Ax Ax Ak Ay Az Ay

Clustered Affinity Matrix Step 2: Determine the order for A 3 Cont(A 0, A 3, A 1) = 8820 Cont(A 1, A 3, A 2) = 10150 Cont(A 2, A 3, A 4) = 1780 Since Cont(A 1, A 3, A 2) is the greatest, [A 1, A 3, A 2] is the best order (configuration). A 1 A 2 A 3 A 4 Attribute Affinity Matrix (AA) A 1 A 3 A 2 A 4 A 1 A 2 A 3 A 4 Clustered Affinity Matrix (CA) Note: aff(A 0, Ai)=aff(Ai, A 0)=aff(A 5, Ai)=aff(Ai, A 5)=0 by definition

Clustered Affinity Matrix Step 2: Determine the order for A 4 Since Cont(A 3, A 2, A 4) is the biggest, [A 3, A 2, A 4] is the best order. A 1 A 2 A 3 A 4 Attribute Affinity Matrix (AA) A 1 A 3 A 2 A 4 A 1 A 2 A 3 A 4 Clustered Affinity Matrix (CA)

Clustered Affinity Matrix Step 3: Re-order the Rows Arrange the rows in the same order as the columns A 1 A 3 A 2 A 4 A 1 A 2 A 3 A 4 Clustered Affinity Matrix (CA) A 1 A 3 A 2 A 4 Clustered Affinity Matrix (CA)

Partitioning Find the sets of attributes that are accessed, for the most part, by distinct sets of applications We look for a good dividing points along the diagnose Bad grouping since A 1 and A 2 are never accessed together A 1 A 3 A 2 A 4 Clustered Affinity Matrix (CA) Cluster 1: Cluster 2: A 1 & A 3 A 2 & A 4 Two vertical fragments: PROJ 1(A 1, A 3) and PROJ 2(A 2, A 4) A 4 and A 3 are usually not accessed together A 4 and A 2 are often accessed together

MIXED FRAGMENTATION • Apply horizontal fragmentation to vertical fragments. • Apply vertical fragmentation to horizontal fragments. Example: Applications about work at each department reference tuples of employees in the departments located around the site with 80% probability. EMP(ENUM, NAME, SAL, TAX, MGRNUM, DNUM) ENUM NAME TAX SAL ENUM NAME MGRNUM DNUM Jacksonville Orlando Miami Vertical fragmentation Horizontal Fragmentation (local work)

Fragment Allocation Fragmentation Allocation

ALLOCATION – Notations i: fragment index j: site index k: application index fkj: the frequency of application k at site j the number of retrieval references of application k to fragment i. rki: uki: the number of update references of application k to fragment i. nki = rki + uki Site j Fragment i rki uki Application k /w freq. fkj

ALLOCATION – Notations i: fragment index j: site index k: application index fkj: the frequency of application k at site j the number of retrieval references of application k to fragment i. rki: uki: the number of update references of application k to fragment i. nki = rki + uki Site j Fragment i nki Application k /w freq. fkj

ALLOCATION – Notations Site j Fragment i nki Application k /w freq. fkj

Allocation of Horizontal Fragments (1) No replication: Best Fit Strategy • The number of local references of Ri at site j is Fragment i Benefit fragment i to. Site j Sum over all applications k at Site j Application k How much Site j would like to have fragment i due to Application k • Ri is allocated at site j* such that Bij* is maximum. Advantage: A fragment is allocated to a site that needs it most. Disadvantage: It disregards the “mutual” effect of placing a fragment at a given site if a related fragment is also at that site (e. g. , due to derived fragmentation).

Allocation of Horizontal Fragments (2) All beneficial sites approach (replication) Good for Site j Fragment i Site j Bad for other sites Other sites Savings due to retrieval references Cost of update references from other sites • Ri is allocated at all sites j* such that Bij* > 0. • When all Bij’s are negative, a single copy of Ri is placed at the site such that Bij* is maximum.

Allocation of Horizontal Fragments (3) Considering reliability and availability: di The degree of redundancy of Ri Fi The reliability and availability benefit of having Ri fully replicated. (di) The reliability and availability benefit when the fragment has d i copies. β Reliability The benefit of introducing a new copy of Ri at site j : Fi 1 Same as All Beneficial Sites approach in last slide Redundancy Also takes into account the benefit of availability di

Allocation of Vertical Fragments Notations Applications of type A 3 at site PSr that access both Rs and Rt PSr Processin g Site PSr Applications of type A 2 at site PSr , that access only Rt A 1 A 3 R Rs As At A 4 PSs PSt PS 4 A 2 Rs and Rt are two vertical fragments of relation R Rt . . . An PSn Other processing sites Should we allocate fragments Rs and Rt to other sites instead ?

Allocation of Vertical Fragments What is the benefit of allocating fragment Rs to site PSs , and fragment Rt to site PSt ? PSs likes it PSt also likes it PSr As A 1 A 3 Rs Rt PSs A 4 . . . An Applications of type As at PSs PSr loses the two fragments and does not like it A 2 At PS 4 PSn The remaining locations do not like it because the fragments must be downloaded from two (instead of one) locations This formula can be used within an exhaustive “splitting” algorithm by trying all possible combinations of sites s and t. PSt

SUMMARY Top-Down Design of a distributed DB consists of 4 phases: – Phase 1: Global schema design (same as in centralized DB design) – Phase 2: Fragmentation • Horizontal Fragmentation – Primary fragmentation: Determent a complete and minimal set of predicates – Derived fragmentation: Use semijoin • Vertical Fragmentation Identify fragments such that many applications can be executed using just one fragment (clustering the attributes). – Phase 3: Allocation Minimize the number of remote accesses – Phase 4: Physical schema design (same as in centralized DB design).

Database Integration Bottom-up Design

Overview • The design process in multidatabase systems is bottomup. – The individual databases actually exists – Designing the global conceptual schema (GCS) involves integrating these local databases into a multidatabase. Global Conceptual Schema

Overview • The design process in multidatabase systems is bottomup. – The individual databases actually exists – Designing the global conceptual schema (GCS) involves integrating these local databases into a multidatabase. • Database integration can occur in two steps: 1) Schema Translation 2) Schema Integration Database 1 Translator 1 In. S 1 Intermediate schema in canonical representation Database 2 Translator 2 Database 3 Translator 3 In. S 2 INTEGRATOR GCS In. S 3 2 1

Network Data Model (Review) • There are two basic data structures in the network model: records and sets. Record type: a group of records of the same type. Set type: indicates a many-to-one relationship in the direction of the arrow. DEPARTMENT (DEPT-NAME, BUDGET, MANAGER) Employs owner record type set type EMPLOYEE (E#, NAME, ADDRESS, TITLE, SALARY) • Implementation of set instances: A one-to-many relationship DEPARTMENT (owner record) Database Jones, L. Patel, J. member record type EMPLOYEE (member records) Vu, K.

Review: Entity-Relationship (ER) Model Entity & Entity Set Relationship 89

Review: Entity-Relationship (ER) Model from to name ssn dname did lot budget No Constraint Employees Works_In Departments manyto-many since name ssn dname did lot Employees Key Constraint Manages budget Departments 1 -tomany 90

Review: ER vs Relational KEY Many-tomany KEY KEY Key constraint KEY KEY Foreign Key Merge 91

Example: Three Local Databases We choose ER as the canonical representation

Schema Translation: Relational to ER Database 1 S (TITLE, SAL) ENO E (ENO, ENAME, TITLE) J (JNO, JNAME, BUDGET, LOC, CNAME) ENAME N G (ENO, JNO, RESP, DUR) ENAME E TITLE N SAL RESP G DUR JNO JNAME M J BUDGET LOC CNAME 1 S TITLE ENO G DUR PAY • E & J have a many-tomany relationship • E & S have a 1 -to-many relationship RESP JNO JNAME M J BUDGET LOC CNAME Database 1 in ER representation SAL Treat salary as an attribute of an engineer entity (to make it look more like Database 3) Relationships may be identified from the foreign keys defined for each relation.

Schema Translation: Network to ER Database 2 DEPARTMENT 1 EMPLOYEE WORKS-IN N WORK Dummy record type DEPARTMENT 1 EMPLOYS Works-in Employs EMPLOYEE M WORK N EMPLOYS M EMPLOYEE Database 2 in ER representation • Map each record type in the network schema to an entity and each set type to a relationship. • Network model uses dummy records in its representation of many-to-many relationships that need to be recognized during mapping.

After Translation Engineer No. Database 3 Engineer Name N ENGINEER Title Project No. Responsibility Budget 1 WORKS IN PROJECT N Salary ENO ENAME Contract Date E TITLE N SAL RESP G DUR JNO 1 CLIENT JNAME M Client Name J BUDGET LOC CNAME Database 2 Location CONTRACTED BY Duration Database 1 Project Name DEPARTMENT Address All three databases in ER representation, ready for integration N EMPLOYS M EMPLOYEE

Schema Integration Schema integration follows the translation process and generates the GCS by integrating the intermediate schemas. 1) Identify the components of a database which are related to one another. Two components can be related as (1) equivalent, (2) one contained in the other one, (3) overlapped, or (4) disjoint. 2) Select the best representation for the GCS. 3) Integrate the components of each intermediate schema.

Integration Methodologies Integration Process Binary Ladder Balanced N-ary One-shot Iterative Binary: Decreases the potential integration complexity and lead toward automation techniques. One-shot: There is no implied priority for integration order of schemas, and the trade-off can be made among all schemas rather than among a few.

Integration Process Schema integration occurs in a sequence of four steps: 1) Preintegration: establish the “rules” of the integration process before actual integration occurs. 2) Comparison: naming and structural conflicts are identified. 3) Conformation: resolve naming and structural conflicts 4) Merging and restructuring: all schemas must be merged into a single database schema and then restructured to create the “best integrated schema.

Step 1: Preintegration 1. An integration method (binary or n-ary) must be selected and the schema integration order defined. – The order implicitly defines priorities. 2. Candidate keys in each schema are identified to enable the integrator to determine dependencies implied by the schemas. 3. The mapping or transformation rules should be described before integration begins. – e. g. , mapping from degree Celsius in one schema to degrees Fahrenheit in another.

Preintegration Example (NOTE: This three databases are different from the last example) In. S 1 They do not work on projects

Key Identification to determine dependencies implied by the schemas In. S 1 KEYS In. S 1: Engineer No. in ENGINEER Project No. in PROJECT Client name in CLIENT In. S 2: E# in EMPLOYEE Dept-name in DEPARTMENT In. S 3: Eno in E Jno in J

Schema Comparison • Some parts of In. S 1 (about engineers) and In. S 3 (about engineers) occur in In. S 2 (about employees) • In. S 1 and In. S 3 are subsets of In. S 2 In. S 1

Integration Method Integrate the engineering centers first. Their schemas look similar The schema of the nonengineering location looks quite different 2 1 In. S 3 In. S 2 Integration order

Step 2: Schema Comparison Naming Conflict (1) Synonyms: Two identical entities that have different names. In. S 1 ENGINEER Engineering No Engineer Name Salary WORKSIN Responsibility Duration PROJECT Project No Project Name Location In. S 3 E Eno Ename Sal G Resp Dur J Jno Jname Loc In. S 1

Step 2: Schema Comparison Naming Conflict (2) Homonyms: Two different entities that have identical names. • In In. S 1, ENGINEER. Title refers to the title of engineers. In. S 1 • In In. S 2, EMPLOYEE. Title refers to the title of all employees. domain (EMPLOYEE. Title) >> domain (ENIGNEREER. Title)

Schema Comparison – Structural Conflicts (1) • Type conflicts: occur when the same object is represented by an attribute in one schema and by an entity in another schema. – The client of a project is modeled as an entity in In. S 1, however – the client is included as an attribute of the J entity in In. S 3 Resp EMPLOYS Dur JNO M N Jname CONTRACTED BY Budget J Cname In. S 3 Loc Contract Date 1 CLIENT Client Name Address PROJECT In. S 1

Schema Comparison – Structural Conflicts (2) This is 1 -to-many Dependency conflicts: occur when different relationship modes are used to represent the same thing in different schemas. Eno Engineer No. Title ENGINEER Title N ENGINEER Sal N WORKS IN EMPLOYS Dur 1 PROJECT In. S 1 Salary This is many-to-many Resp Ename Project No. Engineer Name M In. S 3 J

Schema Comparison: Structural Conflicts (3) • Key conflicts: occur when different candidate keys are available and different primary keys are selected in different schemas • Behavioral conflicts: are implied by the modeling mechanism, – e. g. , deletion of the last employee causes the dissolution of the department.

Step 3: Conformation Naming Conflicts Naming conflicts are resolved simply by renaming conflict ones. Synonyms: rename the schema of In. S 3 to conform to the naming of In. S 1. In. S 3 E Eno Engineering No Ename Engineering Name Salary G Resp Dur Responsibility Duration J Jno Project No Jname Project Name Location In. S 1 ENGINEER Engineering No Engineer Name Salary WORKSIN Responsibility Duration PROJECT Project No Project Name Location Homonyms: • Prefix each attribute by the name of the entity to which it belong, e. g. , ENGINEER. Title EMPLOYEE. Title • and prefix each entity by the name of the schema to which it belongs. e. g. , In. S 1. ENGINEER In. S 2. EMPLOYEE

Step 3: Resolving Structural Conflict In. S 1 OPTION 1 Transform this attribute to an entity C

Step 3: Resolving Structural Conflicts Transforming entities/attributes/relationships among one another Engineer No. In. S 3 Project No. Responsibility Engineer Name Project Name Budget ENGINEER Title Engineer No. OPTION 2 WORKS IN Salary M PROJECT WORKS IN Relationship between projects and clients is m: n Project No. Responsibility N Location Client Name Duration Engineer Name ENGINEER Title N M Project Name Budget PROJECT N C-P Duration M Client Name C Location New In. S 3

Step 3: Schema Integration Merging & Restructuring Merging requires that the information contained in the participating schemas be retained in the integrated schema. Merging using the IS-A relationship (Employees vs Engineers) In. S 2 (Employees) In. S 3 In. S 1 (Engineers) Use In. S 3 as the final schema since it is more general in terms of the C-P relationship (many-to-many vs 1 -to-many) (next page)

Integrate In. S 1 & In. S 3 The many-to -many relationship in In. S 3 is more general Engineer No. Engineer Name Title Salary N WORKS IN M N CONTRACTED BY Duration M Client Name N C Budget Location CONTRACTED BY 1 CLIENT Client Name Project Name PROJECT Contract Date Project No. Responsibility 1 Duration In. S 1 Engineer Name ENGINEER WORKS IN Salary Use In. S 3 as the final schema Engineer No. N ENGINEER Project No. Responsibility Budget Location In. S 3 Address OPTION 2 is used

Merging & Restructuring Example Final Result: ENGINEER Duration Responsibility N WORKS IN E# Name EMPLOYEE N EMPLOYS 1 In. S 2 Title Address Manager Project Name Budget PROJECT N Location CONTRACTED BY M CLIENT Client name In. S 1/In. S 3 Address SAL DEPARTMENT Budget M Project No. Dept-name Unfortunately, Conformation and restructuring stages are an art rather then a science

Query Processing in Multidatabase Systems

Query Processing in Three Steps 1. Global query is decomposed into local queries Schema Integration Local Schema 1 Local Schema 2 Local Schema 3 Translator 1 Translator 2 Translator 3 In. S 1 In. S 2 Q 1, 1 Q 1, 2 In. S 3 Q 1, 3 INTEGRATOR Q 1 GCS

Query Processing in Three Steps 2. Each local query is translated into queries over the corresponding local database system Schema Integration Local Schema 1 Q’ 1, 1 Local Schema 2 Q’ 1, 3 Q’ 1, 2 Translator 1 Translator 2 In. S 1 Local Schema 3 Translator 3 In. S 2 Q 1, 1 Q 1, 2 In. S 3 Q 1, 3 INTEGRATOR Q 1 GCS

Query Processing in Three Steps 3. Results of the local queries are combined into the answer Schema Integration Local Schema 1 Q’ 1, 1 Translator 2 In. S 1 Combine Local Schema 3 Q’ 1, 2 Translator 1 Final answer Local Schema 2 Translator 3 In. S 2 Q 1, 1 Q 1, 2 In. S 3 Q 1, 3 INTEGRATOR Q 1 GCS

Query Processing in Three Steps 1. Global query is decomposed into local queries 2. Each local query is translated into queries over the corresponding local database system 3. Results of the local queries are combined into the answer Schema Integration Local Schema 1 Local Schema 2 Local Schema 3 Translator 1 Translator 2 Translator 3 In. S 1 In. S 2 INTEGRATOR GCS In. S 3

Outline • Major query processing components in multidatabase systems: – Query Decomposition – Query Translation – Global Query Optimization • Techniques for each of the above components

Query Decomposition

Query Decomposition Overview Global Query decomposition & global optimization SQ 1 Query translator 1 TQ 1 DB 1 SQ 2 Query translator 2 TQ 2 DB 2 . . . … . . . SQn Query translator n TQn DBn PQ 1 … PQn SQi export-schema subquery in global query language TQi target query (local subquery) in local query language PQi postprocessing query used to combine results returned by subqueries to form the answer

Assumptions • We use the object-oriented data model to present a query decomposition algorithm • To simplify the discussion, we assume that there are only two export schemas: ES 1 Emp 1: SSN Name Salary Age ES 2 Emp 2: SSN Name Salary Rank

Object-Oriented DB Objects belong to other persistent classes world: The world of C, denoted by world(C), is the set of real-world objects described by C. World Type type: Given a class C, the type of C, denoted by type(C), is the set of attributes defined for C and their corresponding domain. ion ns e t x E A Persistent Class C extension: The extension of C, denoted by extension(C), is the set instances contained in C.

Object-Oriented DB: Example Cars, tables, trees, apples, buildings Human Type t Curren ts studen Student “Class”

Schema Integration • Integration through outerjoin • Integration through outerunion (generalization) Two local databases

Review: Outerjoin The outerjoin of relation R 1 and R 2 (R 1 ⋈o R 2 ) is the union of three components: – the join of R 1 and R 2, /* matching pairs – dangling tuples of R 1 padded with null values, and /* no matching tuples in R 2 – dangling tuples of R 2 padded with null values. /* no matching tuples in R 1

Outerjoin Example Emp 1 Emp. O = Emp 1 ⋈o Emp 2 (Outerjoin on OID) OID SSN Name Salary Age 3 6789 Smith 90, 000 40 OID SSN Name Salary Age Rank 4 4321 Chang 62, 000 30 1 2222 Ahad 98, 000 null S. Mgr. 5 8642 Patel 75, 000 35 2 7531 Wang 95, 000 mull S. Mgr. 3 6789 Smith Inconsistent 40 Mgr. 4 4321 Chang 62, 000 30 null 5 8642 Patel 75, 000 35 null Emp 2 OID SSN Name Salary Rank 1 2222 Ahad 98, 000 S. Mgr. 2 7531 Wang 95, 000 S. Mgr. 3 6789 Smith 25, 000 Mgr. Dangling Tuple

Outerunion Emp 2 OID SSN Name Salary Rank 1 2222 Ahad 98, 000 S. Mgr. 2 7531 Wang 95, 000 3 6789 Smith 25, 000 Emp. G = Emp 1 Uo Emp 2 OID SSN Name Salary Age Rank S. Mgr. 1 2222 Ahad 98, 000 null S. Mgr. 2 7531 Wang 95, 000 mull S. Mgr. 3 6789 Smith Conflict null Mgr. 3 6789 Smith Conflict 40 null Emp 1 OID SSN Name Salary Age 4 4321 Chang 62, 000 30 null 3 6789 Smith 90, 000 40 5 8642 Patel 75, 000 35 null 4 4321 Chang 62, 000 30 5 8642 Patel 75, 000 35

Schema Integration Using Outerjoin Two classes C 1 and C 2 can be integrated by equi-outerjoining the two classes on the OID to form a new class C. /* C = C 1 ⋈o C 2 – extension(C ) = extension(C 1 ) ⋈o extension(C 2 ) – type(C ) = type(C 1 ) ⋃ type(C 2 ) – world(C ) = world(C 1 ) ⋃ world(C 2 ) As seen by Integration Two local databases global applications

Schema Integration thru Generalization • e pe rt Ex Generalization is Professor Attributes Staff Age Name Hours Phone Outer union

Generalization Example Emp 1: SSN Name Salary Age Emp 2: SSN Name Salary Rank Emp. G: SSN Name Salary Emp 2 Generalization Emp 1 Emp. G SSN Name Salary Emp 1 Age Rank Emp 2 SSN More specific Age Name Salary Rank

Inconsistency Resolution • The schema integration techniques work as long as there is no data inconsistency • If data inconsistency occurs, aggregate functions may be used to resolve the problem Example: Emp. O. Salary = Emp 1. Salary + Emp 2. Salry Local databases

Inconsistency Resolution: Generalization Export Schemas Emp 1: SSN Name Salary Age Emp 2: SSN Name Salary Rank Integrated Schema Generalization World (Emp 2) World (Emp 1) world(Emp 2) – world(Emp 1) ⋂ world(Emp 2) Aggregate Functions – Some examples: world(Emp 1) – world(Emp 2) Emp. G. Name = Emp 1. Name, if Emp. G is in world(Emp 1) = Emp 2. Name, if Emp. G is in world(Emp 2) – world(Emp 1) Use the name at location 1 Location 2 1 Use the name at location 1 Employees working at both locations

Inconsistency Resolution: Generalization Export Schemas Emp 1: SSN Name Salary Age Emp 2: SSN Name Salary Rank Integrated Schema Generalization World (Emp 2) World (Emp 1) world(Emp 2) – world(Emp 1) ⋂ world(Emp 2) Aggregate Functions – Some examples: world(Emp 1) – world(Emp 2) Emp. G. Name = Emp 1. Name, if Emp. G is in world(Emp 1) = Emp 2. Name, if Emp. G is in world(Emp 2) – world(Emp 1) Employees working at both locations. They have two salaries Emp. G. Salary = Emp 1. Salary, if Emp. G is in world(Emp 1) – world(Emp 2) = Emp 2. Salary, if. Emp. G is in world(Emp 2) – world(Emp 1) = Sum(Emp 1. Salary, Emp 2. Salary), if Emp. G is in world(Emp 1) ⋂ world(Emp 2)

Inconsistency Resolution: Outerjoin Export Schemas Emp 1: SSN Name Salary Age Emp 2: SSN Name Salary Rank Integrated Schema Emp. O: SSN Name Salary Outer Age join Rank world(Emp 2) – world(Emp 1) Aggregate Functions – Some examples: Emp. O. Age = Emp 1. Age, if Emp. O is in world(Emp 1) = Null, if Emp. O is in world(Emp 2) – world(Emp 1) Emp. O. Rank = Emp 2. Rank, if Emp. O is in world(Emp 2) = Null, if Emp. O is in world(Emp 1) – world(Emp 2) world(Emp 1) ⋂ world(Emp 2) world(Emp 1) – world(Emp 2) World (Emp 1) World (Emp 2) Emp. O. Salary = Emp 1. Salary, if Emp. O is in world(Emp 1) – world(Emp 2) = Emp 2. Salary, if Emp. O is in world(Emp 2) – world(Emp 1) = Sum(Emp 1. Salary, Emp 2. Salary), if Emp. O is in world(Emp 1) ⋂ world(Emp 2)

Query Decomposition Emp 1 Emp. O = Emp 1 ⋈o Emp 2 (Outerjoin on OID) OID SSN Name Salary Age 3 6789 Smith 90, 000 40 OID SSN Name Salary Age Rank 4 4321 Chang 62, 000 30 1 2222 Ahad 98, 000 null S. Mgr. 5 8642 Patel 75, 000 35 2 7531 Wang 95, 000 mull S. Mgr. 3 6789 Smith Inconsistent 40 Mgr. 4 4321 Chang 62, 000 30 null 5 8642 Patel 75, 000 35 null Emp 2 OID SSN Name Salary Rank 1 2222 Ahad 98, 000 S. Mgr. 2 7531 Wang 95, 000 S. Mgr. 3 6789 Smith 25, 000 Mgr. Dangling Tuple

Query Decomposition Global Query Q From Emp 2 Emp. O = Emp 1 ⋈o Emp 2 (Outerjoin on OID) Decomposition Q 1 Q 2 Q 3 OID SSN Name Salary Age Rank 1 2222 Ahad 98, 000 null S. Mgr. 2 7531 Wang 95, 000 mull S. Mgr. 3 6789 Smith SUM 40 4 4321 Chang 62, 000 30 null 5 8642 Patel 75, 000 35 null From both Mgr. From Emp 1 Emp. O. Salary = Emp 1. Salary, if Emp. O is in world(Emp 1) – world(Emp 2) = Emp 2. Salary, if Emp. O is in world(Emp 2) – world(Emp 1) = Sum(Emp 1. Salary, Emp 2. Salary), if Emp. O is in world(Emp 1) ⋂ world(Emp 2)

Query Decomposition Step 1: Determine Number of Subqueries Global Query Select From Emp. O. Name, Emp. O. Rank Emp. O Where Emp. O. Salary > 80, 000 AND Emp. O. Age > 35 Assume Outerjoin is used for schema integration Obtain a partition of world(Emp. O) based on the aggregate function used to resolve the data inconsistency. Part 1 Emp. O. Salary = Emp 1. Salary, if Emp. O is in world(Emp 1) – world(Emp 2) Part 3 Part 2 = Emp 2. Salary, if. Emp. O is in world(Emp 2) – world(Emp 1) = Sum(Emp 1. Salary, Emp 2. Salary), if Emp. O is in world(Emp 1) ⋂ world(Emp 2) world(Emp 1) Part 1 Part 3 Part 2 world(Emp 2)

Query Decomposition Step 1: Determine Number of Subqueries Global Query Select From Emp. O. Name, Emp. O. Rank Emp. O Where Emp. O. Salary > 80, 000 AND Emp. O. Age > 35 Assume Outerjoin is used for schema integration Obtain a partition of world(Emp. O) based on the aggregate function (last slide) used to resolve the data inconsistency. Inconsistency Function: Option 1 (based on Salary) part. 1: world(Emp 1) – world(Emp 2) part. 2: world(Emp 2) – world(Emp 1) part. 3: world(Emp 1) ⋂ world(Emp 2) world(Emp 1) Part 1 Part 3 Emp. O. Salary = Emp 1. Salary, if Emp. O is in world(Emp 1) – world(Emp 2) = Emp 2. Salary, if Emp. O is in world(Emp 2) – world(Emp 1) Part 2 world(Emp 2) = Sum(Emp 1. Salary, Emp 2. Salary), if Emp. O is in world(Emp 1) ⋂ world(Emp 2)

Query Decomposition Step 1: Determine Number of Subqueries Global Query Select From Emp. O. Name, Emp. O. Rank Emp. O Where Emp. O. Salary > 80, 000 AND Emp. O. Age > 35 Obtain a partition of world(Emp. O) based on the aggregate function used to resolve the data inconsistency. Part 1 world(Emp 1) Part 1 Part 2 world(Emp 2) Part 2 “Age” is not stored at Location 2

Query Decomposition Step 1: Determine Number of Subqueries Global Query Select From Emp. O. Name, Emp. O. Rank Emp. O Where Emp. O. Salary > 80, 000 AND Emp. O. Age > 35 Obtain a partition of world(Emp. O) based on the aggregate function used to resolve the data inconsistency. Inconsistency Function: Emp. O. Age = Emp 1. Age, if Emp. O is in world(Emp 1) = Null, if Emp. O is in world(Emp 2) – world(Emp 1) Option 2 (based on Age) part. 1: world(Emp 1) part. 2: world(Emp 2) – world(Emp 1) Part 1 world(Emp 2) Part 2 “Age” is not stored at Location 2

Query Decomposition Step 1: Determine Number of Subqueries Global Query Select From Emp. O. Name, Emp. O. Rank Emp. O Where Emp. O. Salary > 80, 000 AND Emp. O. Age > 35 Obtain a partition of world(Emp. O) based on the aggregate function used to resolve the data inconsistency. Option 1 (based on Salary) part. 1: world(Emp 1) – world(Emp 2) part. 2: world(Emp 2) – world(Emp 1) part. 3: world(Emp 1) ⋂ world(Emp 2) world(Emp 1) 1 Option 2 (based on Age) part. 1: world(Emp 1) part. 2: world(Emp 2) – world(Emp 1) Using two partitions cannot resolve inconsisitency due to “Salary” world(Emp 1) 3 2 world(Emp 2) 1 2 world(Emp 2) We use Option 1 since it is the finest partition among all the partitions Decompose the query into 3 subqueries

Query Decomposition Another Example Option 1: Option 2: world(Emp 1) 1 world(Emp 1) 2 1 world(Emp 2) Use finer partition (Option 3): world(Emp 1) 1 3 2 2 world(Emp 2)

Query Decomposition Step 2: Query Decomposition Global Query: Select Emp. O. Name, Emp. O. Rank From Emp. O Where Emp. O. Salary > 80, 000 AND Emp. O. Age > 35 Partition: 1 world(Emp 1) 3 2 world(Emp 2) Query Decomposition: Obtain a query for each subset in the chosen partition. Aggregate functions relevant to the query Emp. O. Age = Emp 1. Age, if Emp. O is in world(Emp 1) = Null, if Emp. O is in world(Emp 2) – world(Emp 1) Emp. O. Salary = Emp 1. Salary, if Emp. O is in world(Emp 1) – world(Emp 2) = Emp 2. Salary, if. Emp. O is in world(Emp 2) – world(Emp 1) = Sum(Emp 1. Salary, Emp 2. Salary), if Emp. O is in world(Emp 1) ⋂ world(Emp 2)

Query Decomposition Step 2: Query Decomposition Global Query: Select Emp. O. Name, Emp. O. Rank From Emp. O Where Emp. O. Salary > 80, 000 AND Emp. O. Age > 35 Partition: 1 3 world(Emp 1) 2 world(Emp 2) part. 1: Select Emp 1. Name From Emp 1 Where Emp 1. Salary > 80, 000 AND Emp 1. Age > 35 AND Emp 1. SSN NOT IN (Select Emp 2. SSN From Emp 2) part. 2: This subquery is discarded because Emp. O. Age is Null. Select Emp 2. Name, Emp 2. Rank From Emp 2 Where Emp 2. Salary > 80, 000 AND empty Emp 2. Age > 35 AND Emp 2. SSN NOT IN Emp. O. Age = Emp 1. Age, if Emp. O is in world(Emp 1) (Select Emp 1. SNN = Null, if Emp. O is in world(Emp 2) – world(Emp 1) From Emp 1) Emp. O. Salary = Emp 1. Salary, if Emp. O is in world(Emp 1) – world(Emp 2) Query Decomposition: Obtain a query for each subset in the Result is chosen partition. = Emp 2. Salary, if. Emp. O is in world(Emp 2) – world(Emp 1) = Sum(Emp 1. Salary, Emp 2. Salary), if Emp. O is in world(Emp 1) ⋂ world(Emp 2)

Query Decomposition Step 2: Query Decomposition Global Query: Select Emp. O. Name, Emp. O. Rank From Emp. O Where Emp. O. Salary > 80, 000 AND Emp. O. Age > 35 Partition: 1 world(Emp 1) 3 2 world(Emp 2) part. 1: Select Emp 1. Name From Emp 1 Where Emp 1. Salary > 80, 000 AND Emp 1. Age > 35 AND Emp 1. SSN NOT IN (Select Emp 2. SSN From Emp 2) part. 2: This subquery is discarded because Emp. O. Age is Null. part. 3: Select Emp 1. Name, Emp 2. Rank From Emp 1, Emp 2 Where Sum(Emp 1. Salary, Emp 2. Salary) > 80, 000 AND Emp 1. Age > 35 AND Emp. O. Age = Emp 1. Age, if Emp. O is in world(Emp 1) Emp 1. SSN = Emp 2. SSN = Null, if Emp. O is in world(Emp 2) – world(Emp 1) Query Decomposition: Obtain a query for each subset in the chosen partition. Emp. O. Salary = Emp 1. Salary, if Emp. O is in world(Emp 1) – world(Emp 2) = Emp 2. Salary, if. Emp. O is in world(Emp 2) – world(Emp 1) = Sum(Emp 1. Salary, Emp 2. Salary), if Emp. O is in world(Emp 1) ⋂ world(Emp 2)

Query Decomposition Step 2: Query Decomposition part. 1: Select Emp 1. Name From Emp 1 Where Emp 1. Salary > 80, 000 AND Emp 1. Age > 35 AND Emp 1. SSN NOT IN (Select Emp 2. SSN on i Query Decomposition: Obtain a t From Emp 2) ica f i query for each subset in the d o M chosen partition. ery u Q part. 2: This subquery is discarded because Emp. O. Age is Null. Global Query: Select Emp. O. Name, Emp. O. Rank From Emp. O Where Emp. O. Salary > 80, 000 AND Emp. O. Age > 35 Emp 1. Salary 3 Emp 1. Age Emp 1. Salary + 1 Emp 2. Salary 2 Emp 2. Salary Emp 1. Age = null world(Emp 1) world(Emp 2) part. 3: Select Emp 1. Name, Emp 2. Rank From Emp 1, Emp 2 Where Sum(Emp 1. Salary, Emp 2. Salary) > 80, 000 AND Emp 1. Age > 35 AND Emp 1. SSN = Emp 2. SSN

Query Decomposition Step 3: Further Decomposition STEP 3: Some resulting query may still reference data from more than one database. They need to be further decomposed into subqueries and possibly also postprocessing queries Before Select From Where STEP 3: Emp 1. Name Emp 1. Salary > 80, 000 and Emp 1. Age > 35 and Emp 1. SSN NOT IN (Select Emp 2. SSN From Emp 2) Select Emp 1. Name From Emp 1 Where Emp 1. Salary > 80, 000 and Emp 1. Age > 35 and Emp 1. SSN NOT IN X X Insert INTO X Select Emp 2. SSN From Emp 2)

Query Decomposition Step 4: Query Optimization STEP 4: It may be desirable to reduce the number of subqueries by combining subqueries for the same database.

Query Translation

Query Translation (1) IF THEN Global Query Language ≠ Local Query Language Export Schema Subquery Translator Local Query Language

Query Translation (2) IF the source query language has a higher expressive power THEN EITHER – Some source queries cannot be translated; or – they must be translated using both • the syntax of the target query language, and • some facilities of a high-level programming language. Example: A recursive OODB query may not be translated into a relational query using SQL alone.

Object-Oriented Database Example Schema

Object-Oriented Database Example Auto 677 Red 17 • • • Company 17 Ford 99… 9 135 6546 People 6546 Kevin 135 677 25 Kevin lives in Orlando Kevin is the President of a company Kevin is the President of a car company Kevin has a red car Kevin’s car is manufactured by his own company City 135 Orlando Florida Schema

Some Simple Queries Auto 677 Red 17 Company 17 Ford 99… 9 135 6546 People City 6546 135 Kevin 135 677 25 Orlando Florida Where do all the Kevin’s live? Retrieve People. Hometown. Name Where People. name = ‘Kevin’ For all people whose name is ‘Kevin’ … what is the name of their hometown ? People. Name People. Hometown. Name

Some Simple Queries Auto 677 Red 17 Company 17 Ford 99… 9 135 6546 what are their name ? People 6546 Kevin 135 677 25 Who live in Orlando? Retrieve People. Name Where People. Hometown. name = ‘Orlando’ City 135 Orlando Florida For all people who live in Orlando …

Some Simple Queries Auto 677 Red 17 Company 17 Ford 99… 9 135 6546 People 6546 Kevin 135 677 25 City 135 Orlando Florida What is the color of the car owned by the President of Clean Energy ? Retrieve Company. President. Auto. Color Where Company. Name = ‘Clean Energy’

Some Simple Queries Auto 677 Red 17 Company 17 Ford 99… 9 135 6546 People 6546 Kevin 135 677 25 City 135 Orlando Florida Find 53 -year-old Presidents of car companies ? Retrieve Auto. Manufacturer. President. Name Where Auto. Manufacturer. President. Age = 53 Note: Not every object of Company is a car company

Equivalent Relational Schema OODB Schema: Auto OID Color Manufacturer Company OID Name Profit Headquarter President Equivalent Relational Schema: People OID Name Hometown Automobile Age Foreign key Auto (Auto-OID, Color, Company-OID) Company (Company-OID, Name, Profit, City-OID, People-OID) People (People-OID, Name, Age, City-OID, Auto-OID) City (City-OID, Name, State) City OID Name State

Relational-to-OO Translation Global Query: Select Auto 1. * From Auto 1, Auto 2, Company, People, City 1, City 2 Auto (Auto-OID, Color, Company-OID) Company (Company-OID, Name, Profit, City-OID, People. OID) People (People-OID, Name, Age, City-OID, Auto-OID) City (City-OID, Name, State) n yo r e u al A q ation e rel abas t da Where Auto 1. Company-OID = Company-OID AND Company. People-OID = People-OID AND People. Age = 52 AND People. Auto-OID = Auto 2. Auto-OID AND Auto 2. Color = “red” AND People. City-OID = City 1. City-OID AND City 1. Name = City 2. Name AND Company. City-OID = City 2. City-OID We translate this relational query to an equivalent query on the following OODB in three steps

Step 1: Construct Relational Predicate Graph Auto (Auto-OID, Color, Company-OID) Company (Company-OID, Name, Profit, City-OID, People. OID) People (People-OID, Name, Age, City-OID, Auto-OID) City (City-OID, Name, State) Global Query: Relational Predicate Graph: Select Auto 1. * From 1 2 3 4 5 6 Auto 1, Auto 2, Company, People, City 1, City 2 Where Auto 1. Company-OID = Company-OID AND Company. People-OID = People-OID AND People. Age = 52 AND People. Auto-OID = 0 Auto 2. Auto-OID AND Auto 2. Color = “red” AND People. City-OID = City 1. City-OID AND City 1. Name = City 2. Name AND Company. City-OID = City 2. City-OID Auto 1 1) Company-OID ID 6) O y. Cit 2) People-OID People City 2 e am 5) N ) Join ( City 1 Company Age=52 D C 4) I -O y it 3) Auto-OID Auto 2 Color=red Find all red cars own by a 52 year old who is the President 0+1+2+3 of a car manufacturer and lives in the same city of the car manufacturer 4+5+6

Step 2: Construct OO Predicate Graph: Auto 1 Company-OID City 1 Name Relational Predicate Graph: Auto 1 1) Company-OID D 6) I -O y t Ci 2) People-OID People City 2 5 me ) Na (J oin) City 1 Company Age=52 C 4) D OI ity 3) Auto-OID Auto 2 Color=red City 2 Company OID ter) y r t Ci qua d ea (H D OI n) y Cit etow om (H People-OID People Age=52 Auto-OID Auto 2 Color=red

Step 3: Construct the OO Query Predicate 3 OO Predicate Graph: Auto 1 Company-OID City 1 Name City 2 Company OID ter) y r Cit qua d ea (H D OI n) y Cit etow om (H People-OID Predicate 1 People Age=52 Auto-OID Auto 2 Color=red Predicate 2 OO Query: 1 Where Auto. Manufacturer. President. Age = 52 AND 2 Auto. Manufacturer. President. Automobile. Color = red AND Auto. Manufacturer. Headquarter. Name = 3 Auto. Manufacturer. President. Hometown. Name

For Test 2: OO Predicate Graph: Auto 1 Company-OID Company People-OID People Age=56 City-OID City Name=Orlando

Global Query Optimization

Query Optimization (1) CASE 1: A single target query is generated IF the target database system has a query optimizer THEN the query optimizer can be used to optimize the translated query ELSE the translator has to consider the performance issues

Query Optimization (2) CASE 2: A set of target queries is needed. • It might pay to have the minimum number of queries – It minimizes the number of invocations of the target system – It may also reduce the cost of combining the partial results • It might pay for a set to contain target queries that can be well coordinated – The results or intermediate results of the queries processed earlier can be used to reduce the cost of processing the remaining queries

Global Query Optimization (1) Global Query: Select Emp. O. Name, Emp. O. Rank From Emp. O Where Emp. O. Salary > 80, 000 AND Emp. O. Age > 35 part. 3: Select Emp 1. Name, Emp 2. Rank From Emp 1, Emp 2 Where Sum(Emp 1. Salary, n Emp 2. Salary) > 80, 000 AND o i t a Emp 1. Age > 35 AND c i f i Emp 1. SSN = Emp 2. SSN od e u Q Emp 1. Salary 3 Emp 1. Age Emp 1. Salary + 1 Emp 2. Salary Emp 1. Age world(Emp 1) m y r 2 Emp 2. Salary Age = null world(Emp 2) A query obtained by the query modification process may still reference data from more than one database. Some global strategy is needed to process such query

Global Query Optimization (2) Select Emp 1. Name, Emp 2. Rank From Emp 1, Emp 2 /* access two databases Where sum(Emp 1. Salary, Emp 2. Salary) > 80, 000 AND Emp 1. Age > 35 AND Emp 1. SSN = Emp 2. SSN → Some global strategy is needed to process such queries Site 1 Emp 1 Age > 35 form result Emp 1 Less data transmission form result Emp 2 Site 2 Emp 1 form result Emp 2 Site 1 Site 3 Site 2

Without Data Inconsistency • Attribute A value Operator Example: Salary > $75, 000

Without Data Inconsistency • Example: Each employee can have only one salary �� Emp. O. Salary > 100, 000 (Emp. O) Emp 2 Emp 1 Result OID SSN Name Salary Age OID SSN Name Salary Rank 4 6789 Smith 105, 000 40 1 2222 Ahad 98, 000 S. Mgr. 5 4321 Chang 62, 000 30 2 7531 Wang 95, 000 S. Mgr. 6 8642 Patel 75, 000 35 3 6789 Jones 85, 000 Mgr.

With Data Inconsistency • Emp 1 Emp 2 OID SSN Name Salary Age 3 6789 Smith 90, 000 40 4 4321 Chang 62, 000 30 5 8642 Patel 75, 000 35 OID SSN Name Salary Rank 1 2222 Ahad 98, 000 S. Mgr. 2 7531 Wang 95, 000 S. Mgr. 3 6789 Smith 25, 000 Mgr. The correct answer should have the record for Smith. However, the above query returns an empty set Smith does have a combined salary greater than 100, 000

Notations C 1 C = C 1 ⋈o C 2 OID SSN Name Salary Age 3 6789 Smith 90, 000 40 OID SSN Name Salary Age Rank 4 4321 Chang 62, 000 30 1 2222 Ahad 98, 000 null S. Mgr. 5 8642 Patel 75, 000 35 2 7531 Wang 95, 000 mull S. Mgr. 3 6789 Smith Inconsistent 40 Mgr. 4 4321 Chang 62, 000 30 null 5 8642 Patel 75, 000 35 null C 2 OID SSN Name Salary Rank 1 2222 Ahad 98, 000 S. Mgr. 2 7531 Wang 95, 000 S. Mgr. 3 6789 Smith 25, 000 Mgr. C 1 -O: Those tuples of C 1 that have no matching tuples in C 2 (private part) C 1 -C: Those tuples of C 1 that have matching tuples in C 2 (overlap part) C = C 1 ⋈o C 2 = C 1 -O ⋃o C 2 -O ⋃o (C 1 -C ⋈OID C 2 -C)

Data Inconsistency - Optimization Schema integration through outer-join • 3

Distribution of Selections (1) Expensive operation Attribute A is defined by an aggregate function to resolve inconsistency, e. g. , Emp. O = sum(Emp 1. Salary, Emp 2. Salary)

Distribution of Selections (2) Example: 1 2 3

Distribution of Selection (3) Four cases are identified when all arguments of the aggregate function (for resolving conflicts) are non-negative A = f(A 1, A 2), A 1 ≥ 0 and A 2 ≥ 0 Example: sum(Emp 1. Salary, Emp 2. Salary)

Distribution of Selection (3) Four cases were identified when all arguments of the aggregate function (for resolving conflicts) are non-negative A = f(A 1, A 2), A 1 ≥ 0 and A 2 ≥ 0 1. f(A 1, A 2) op a ≡ A 1 op a AND A 2 op a: б. A op a (C 1 -C ⋈ C 2 -C) = б. A op a (C 1 -C) ⋈ б. A op a ( C 2 -C) Example: max(Emp 1 -C. Salary, Emp 2 -C. Salary) < 30 K ≡ Emp 1 -C. Salary < 30 K AND Emp 2 -C. Salary < 30 K An aggregate function This form is useful for optimization

Distribution of Selection (3) Four cases were identified when all arguments of the aggregate function (for resolving conflicts) are non-negative A = f(A 1, A 2), A 1 ≥ 0 and A 2 ≥ 0 2. f(A 1, A 2) op a ≡ f(A 1 op a, A 2 op a) op a: б. A op a(C 1 -C ⋈ C 2 -C) = б. A op a(б. A 1 op a(C 1 -C) ⋈ б. A 2 op a(C 2 -C)) Example: sum(Emp 1 -C. Salary, Emp 2 -C. Salary) < 30 K ≡ sum(Emp 1 -C. Salary < 30 K, Emp 2 -C. Salary < 30 K) < 30 K

Distribution of Selection (4) •

• A op > ≥ ≤ < = ≠ sum(A 1, A 2) 4 4 2 2 3 4 avg(A 1, A 2) 4 4 2 2 3 4 max(A 1, A 2) 4 4 1 1 3 4 min(A 1, A 2) 1 1 4 4 3 4 No improvement possible A: Defined by the aggregate function to resolve conflict

Problem in Global Query Optimization (1) Important information about local entity sets that is needed to determine global query processing plans may not be provided by the local database systems. – Example: cardinalities availability of fast access paths – Techniques: • Sampling queries may be designed to collect statistics about the local databases. • A monitoring system can be used to collect the completion time for subqueries. This can be used to better estimate subsequent subqueries.

Problems in Global Query Optimization (2) • Different query processing algorithms may have been used in different local database systems. → Cooperation across different systems difficult Examples: Semijoin may not be supported on some local systems. • Data transmission between different local database systems may not be fully supported. Examples: – A local database system may not allow update operations – For many nonrelational systems, the instances of one entity set are more likely to be clustered with the instances of other entity sets. Such clustering makes it very expensive to extract data for one entity set. → Need more sophisticated decomposition algorithms.