Vertical Fragmentation Has been studied within the centralized

Vertical Fragmentation Has been studied within the centralized context à design methodology à physical clustering More difficult than horizontal, because more alternatives exist. Two approaches : à grouping attributes to fragments à splitting relation Distributed DBMS to fragments © 1998 M. Tamer Özsu & Patrick Valduriez Page 5. 1

Vertical Fragmentation Overlapping fragments à grouping Non-overlapping fragments à splitting We do not consider the replicated key attributes to be overlapping. Advantage: Easier to enforce functional dependencies (for integrity checking etc. ) Distributed DBMS © 1998 M. Tamer Özsu & Patrick Valduriez Page 5. 2

VF – Information Requirements Application Information à Attribute affinities a measure that indicates how closely related the attributes are This is obtained from more primitive usage data à Attribute usage values Given a set of queries Q = {q 1, q 2, …, qq} that will run on the relation R[A 1, A 2, …, An], use(qi, Aj) = 1 if attribute Aj is referenced by query qi 0 otherwise use(qi, • ) can be defined accordingly Distributed DBMS © 1998 M. Tamer Özsu & Patrick Valduriez Page 5. 3

VF – Definition of use(qi, Aj) Consider the following 4 queries for relation PROJ q 1 : q 3 : SELECT FROM WHERE BUDGET PROJ PNO=Value PNAME PROJ LOC=Value q 2: SELECT FROM PNAME, BUDGET PROJ q 4: SELECT FROM WHERE SUM(BUDGET) PROJ LOC=Value Let A 1= PNO, A 2= PNAME, A 3= BUDGET, A 4= LOC Distributed DBMS A 1 A 2 A 3 A 4 q 1 1 0 q 2 0 1 1 0 q 3 0 1 q 4 0 0 1 1 © 1998 M. Tamer Özsu & Patrick Valduriez Page 5. 4

VF – Affinity Measure aff(Ai, Aj) The attribute affinity measure between two attributes Ai and Aj of a relation R[A 1, A 2, …, An] with respect to the set of applications Q = (q 1, q 2, …, qq) is defined as follows : aff (Ai, Aj) query access Distributed DBMS all queries that access Ai and Aj all sites (query access) access frequency of a query © 1998 M. Tamer Özsu & Patrick Valduriez access execution Page 5. 5

VF – Calculation of aff(Ai, Aj) Assume each query in the previous example accesses the attributes once during each execution. Also assume the access frequencies Then aff(A 1, A 3) = 15*1 + 20*1+10*1 = 45 and the attribute affinity matrix Distributed DBMS © 1998 M. Tamer Özsu & Patrick Valduriez S 1 S 2 S 3 q 1 15 20 10 q 2 5 0 0 q 3 25 25 25 q 4 3 0 0 A 1 AAA 1 45 is A 2 0 A 3 45 A 4 0 A 2 A 3 A 4 0 45 0 80 5 75 5 53 3 75 3 78 Page 5. 6

VF – Clustering Algorithm Take the attribute affinity matrix AA and reorganize the attribute orders to form clusters where the attributes in each cluster demonstrate high affinity to one another. Bond Energy Algorithm (BEA) has been used for clustering of entities. BEA finds an ordering of entities (in our case attributes) such that the global affinity measure AM (affinity of A and A with their neighbors) i i j j is maximized. Distributed DBMS © 1998 M. Tamer Özsu & Patrick Valduriez Page 5. 7

Bond Energy Algorithm Input: The AA matrix Output: The clustered affinity matrix CA which is a perturbation of AA Initialization: Place and fix one of the columns of AA in CA. Iteration: Place the remaining n-i columns in the remaining i+1 positions in the CA matrix. For each column, choose the placement that makes the most contribution to the global affinity measure. Row order: Order the rows according to the column ordering. Distributed DBMS © 1998 M. Tamer Özsu & Patrick Valduriez Page 5. 8

Bond Energy Algorithm “Best” placement? Define contribution of a placement: cont(Ai, Ak, Aj) = 2 bond(Ai, Ak)+2 bond(Ak, Al) – 2 bond(Ai, Aj) where n bond(Ax, Ay) = aff(A , A ) z x z y z 1 Distributed DBMS © 1998 M. Tamer Özsu & Patrick Valduriez Page 5. 9

BEA – Example Consider the following AA matrix and the corresponding CA matrix where A 1 and A 2 have been placed. Place A 3: Ordering (0 -3 -1) : cont(A 0, A 3, A 1) = 2 bond(A 0 , A 3)+2 bond(A 3 , A 1)– 2 bond(A 0 , A 1) = 2* 0 + 2* 4410 – 2*0 = 8820 Ordering (1 -3 -2) : cont(A 1, A 3, A 2) = 2 bond(A 1 , A 3)+2 bond(A 3 , A 2)– 2 bond(A 1, A 2) = 2* 4410 + 2* 890 – 2*225 = 10150 Ordering (2 -3 -4) : cont (A 2, A 3, A 4) Distributed DBMS = 1780 © 1998 M. Tamer Özsu & Patrick Valduriez Page 5. 10

BEA – Example Therefore, the CA matrix has to form A 1 A 3

BEA – Example When A 4 is placed, the final form of the CA matrix (after row organization) is A 1 A 3 A 2 A 4 Distributed DBMS A 1 45 45 0 0 A 3 45 53 5 3 A 2 0 5 80 75 A 4 0 3 75 78 © 1998 M. Tamer Özsu & Patrick Valduriez Page 5. 12

VF – Algorithm How can you divide a set of clustered attributes {A 1, A 2, …, An} into two (or more) sets {A 1, A 2, …, Ai} and {Ai, …, An} such that there are no (or minimal) applications that access both (or more than one) of the sets. A 1 A 2 A 3 … Ai Ai+1. . . Am . . . A 1 A 2 TA Ai . . . Ai+1 BA Am Distributed DBMS © 1998 M. Tamer Özsu & Patrick Valduriez Page 5. 13

VF – ALgorithm Define TQ = set of applications that access only TA BQ = set of applications that access only BA OQ = set of applications that access both TA and BA and CTQ = total number of accesses to attributes by applications that access only TA CBQ = total number of accesses to attributes by applications that access only BA COQ = total number of accesses to attributes by applications that access both TA and BA Then find the point along the diagonal that maximizes CTQ CBQ COQ 2 Distributed DBMS © 1998 M. Tamer Özsu & Patrick Valduriez Page 5. 14

VF – Algorithm Two problems : Cluster forming in the middle of the CA matrix à Shift a row up and a column left and apply the algorithm to find the “best” partitioning point à Do this for all possible shifts à Cost O(m 2) More than two clusters à m-way partitioning à try 1, 2, …, m– 1 split points along diagonal and try to find the best point for each of these à Cost O(2 m) Distributed DBMS © 1998 M. Tamer Özsu & Patrick Valduriez Page 5. 15

VF – Correctness A relation R, defined over attribute set A and key K, generates the vertical partitioning FR = {R 1, R 2, …, Rr}. Completeness à The following should be true for A: A = A Ri Reconstruction à Reconstruction can be achieved by R = K Ri Ri FR Disjointness à TID's are not considered to be overlapping since they are maintained by the system à Duplicated keys are not considered to be overlapping Distributed DBMS © 1998 M. Tamer Özsu & Patrick Valduriez Page 5. 16

Hybrid Fragmentation R HF HF R 1 R 2 Distributed DBMS VF VF VF

Fragment Allocation Problem Statement Given F = {F 1, F 2, …, Fn} S ={S 1, S 2, …, Sm} Q = {q 1, q 2, …, qq} fragments network sites applications Find the "optimal" distribution of F to S. Optimality à Minimal cost Communication + storage + processing (read & update) Cost in terms of time (usually) à Performance Response time and/or throughput à Constraints Distributed DBMS Per site constraints (storage & processing) © 1998 M. Tamer Özsu & Patrick Valduriez Page 5. 18

Information Requirements Database information à selectivity of fragments à size of a fragment Application information à access types and numbers à access localities Communication network information à unit cost of storing data at a site à unit cost of processing at a site Computer system information à bandwidth à latency à communication overhead Distributed DBMS © 1998 M. Tamer Özsu & Patrick Valduriez Page 5. 19

Allocation File Allocation (FAP) vs Database Allocation (DAP): à Fragments are not individual files relationships have to be maintained à Access to databases is more complicated remote file access model not applicable relationship between allocation and query processing à Cost of integrity enforcement should be considered à Cost of concurrency control should be considered Distributed DBMS © 1998 M. Tamer Özsu & Patrick Valduriez Page 5. 20

Allocation – Information Requirements Database Information à selectivity of fragments à size of a fragment Application Information à à à number of read accesses of a query to a fragment number of update accesses of query to a fragment A matrix indicating which queries updates which fragments A similar matrix for retrievals originating site of each query Site Information à unit cost of storing data at a site à unit cost of processing at a site Network Information à communication cost/frame between two sites à frame size Distributed DBMS © 1998 M. Tamer Özsu & Patrick Valduriez Page 5. 21

Allocation Model General Form min(Total Cost) subject to response time constraint storage constraint processing constraint Decision Variable xij Distributed DBMS 1 0 if fragment Fi is stored at site Sj otherwise © 1998 M. Tamer Özsu & Patrick Valduriez Page 5. 22

Allocation Model Total Cost all queries query processing cost all sites all fragments cost of storing a fragment at a site Storage Cost (of fragment Fj at Sk) (unit storage cost at Sk) (size of Fj) xjk Query Processing Cost (for one query) processing component + transmission component Distributed DBMS © 1998 M. Tamer Özsu & Patrick Valduriez Page 5. 23

Allocation Model Query Processing Cost Processing component access cost + integrity enforcement cost + concurrency control cost à Access cost all sites all fragments (no. of update accesses+ no. of read accesses) xij local processing cost at a site à Integrity enforcement and concurrency control costs Can Distributed DBMS be similarly calculated © 1998 M. Tamer Özsu & Patrick Valduriez Page 5. 24

Allocation Model Query Processing Cost Transmission component cost of processing updates + cost of processing retrievals à Cost of updates all sites all fragments all sites update message cost all fragments acknowledgment cost à Retrieval Cost all fragments min all sites (cost of retrieval command cost of sending back the result) Distributed DBMS © 1998 M. Tamer Özsu & Patrick Valduriez Page 5. 25

Allocation Model Constraints à Response Time execution time of query ≤ max. allowable response time for that query à Storage Constraint (for a site) all fragments storage requirement of a fragment at that site storage capacity at that site à Processing constraint (for a site) all queries processing load of a query at that site processing capacity of that site Distributed DBMS © 1998 M. Tamer Özsu & Patrick Valduriez Page 5. 26

Allocation Model Solution Methods à FAP is NP-complete à DAP also NP-complete Heuristics based on à single commodity warehouse location (for FAP) à knapsack problem à branch and bound techniques à network flow Distributed DBMS © 1998 M. Tamer Özsu & Patrick Valduriez Page 5. 27

Allocation Model Attempts to reduce the solution space à assume all candidate partitionings known; select the “best” partitioning à ignore replication at first à sliding window on fragments Distributed DBMS © 1998 M. Tamer Özsu & Patrick Valduriez Page 5. 28