CS 347: Distributed Databases and Transaction Processing
Notes 03: Query Processing — Hector Garcia-Molina
Query Processing
• Decomposition
• Localization
• Optimization
Decomposition (same as in a centralized system)
• Normalization
• Eliminating redundancy
• Algebraic rewriting
Normalization
• Convert from a general language to a “standard” form (e.g., relational algebra)
Example:
  Select A, C
  From R, S
  Where (R.B=1 and S.D=2) or (R.C>3 and S.D=2)

In conjunctive normal form:
  πA,C σ(R.B=1 ∨ R.C>3) ∧ S.D=2 (R × S)
Also: detect invalid expressions, e.g.:
  Select * From R Where R.A = 3
  — R does not have an “A” attribute
Eliminate redundancy, e.g., in conditions:
  (S.A=1) ∧ (S.A>5)  ≡  False
  (S.A<10) ∧ (S.A<5) ≡  S.A<5
E.g.: common sub-expressions — rewrite the query tree so the shared sub-expression over R and S (σcond(R ⋈ S)) is computed once and reused in the union with T. [tree diagram]
Algebraic rewriting, e.g., push conditions down:
  σcond(R ⋈ S)  →  σcond3( σcond1(R) ⋈ σcond2(S) )
[tree diagram]
• After decomposition:
  – one or more algebraic query trees on relations
• Localization:
  – replace relations by the corresponding fragments
Localization steps:
(1) Start with the query
(2) Replace relations by fragments
(3) Push ∪ and ⋈ up, σ and π down (use CS 245 rules)
(4) Simplify — eliminate unnecessary operations
Notation for fragments:
  [R: cond] — a fragment of R whose tuples all satisfy cond
Example A
(1) σE=3(R)
(2) σE=3( [R1: E < 10] ∪ [R2: E ≥ 10] )
(3) σE=3 [R1: E < 10]  ∪  σE=3 [R2: E ≥ 10]
    — the second branch yields Ø (no tuple with E ≥ 10 satisfies E = 3)
(4) σE=3 [R1: E < 10]
Rule 1:
  σC1 [R: C2]  →  [R: C1 ∧ C2]
  [R: False]   →  Ø
In Example A:
  σE=3 [R2: E ≥ 10]  →  [R2: E=3 ∧ E ≥ 10]  →  [R2: False]  →  Ø
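Rule 1 can be sketched in a few lines. This is a minimal illustration, assuming fragment conditions are simple predicates on the fragmentation attribute E (the fragment names and conditions come from Example A; the function name is illustrative):

```python
# A minimal sketch of localization Rule 1 for an equality selection.
# Fragment conditions are given as Python predicates on the
# fragmentation attribute E.

fragments = {
    "R1": lambda e: e < 10,    # [R1: E < 10]
    "R2": lambda e: e >= 10,   # [R2: E >= 10]
}

def localize_equality(fragments, value):
    """sigma_{E=value}(R1 u R2 u ...): keep only fragments whose
    condition is consistent with E = value; the rest reduce to
    [R: False] = empty and are dropped."""
    return [name for name, cond in fragments.items() if cond(value)]

print(localize_equality(fragments, 3))   # ['R1']
```

The same idea generalizes to range selections by testing interval intersection instead of a single value.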
Example B
(1) R ⋈A S   (A = the common attribute)
(2) ([R1: A<5] ∪ [R2: 5≤A≤10] ∪ [R3: A>10]) ⋈A ([S1: A<5] ∪ [S2: A≥5])
(3) Distribute the join over the unions — six candidate joins:
    [R1: A<5] ⋈A [S1: A<5]        [R1: A<5] ⋈A [S2: A≥5]
    [R2: 5≤A≤10] ⋈A [S1: A<5]     [R2: 5≤A≤10] ⋈A [S2: A≥5]
    [R3: A>10] ⋈A [S1: A<5]       [R3: A>10] ⋈A [S2: A≥5]
    — three of these have contradictory conditions and reduce to Ø
(4) ([R1: A<5] ⋈A [S1: A<5]) ∪ ([R2: 5≤A≤10] ⋈A [S2: A≥5]) ∪ ([R3: A>10] ⋈A [S2: A≥5])
Rule 2:
  [R: C1] ⋈A [S: C2]  →  [R ⋈A S: C1 ∧ C2 ∧ R.A = S.A]
In step 4 of Example B:
  [R1: A<5] ⋈A [S2: A≥5]
  → [R1 ⋈A S2: R1.A < 5 ∧ S2.A ≥ 5 ∧ R1.A = S2.A]
  → [R1 ⋈A S2: False]  =  Ø
Localization with derived fragmentation — Example C:
(2) (R1 ∪ R2) ⋈K (S1 ∪ S2), where
    R1: A<10                  R2: A≥10
    S1: K=R.K ∧ R.A<10        S2: K=R.K ∧ R.A≥10
    (S's fragmentation is derived from R's)
(3) ([R1] ⋈K [S1]) ∪ ([R1] ⋈K [S2]) ∪ ([R2] ⋈K [S1]) ∪ ([R2] ⋈K [S2])
(4) ([R1: A<10] ⋈K [S1: K=R.K ∧ R.A<10]) ∪ ([R2: A≥10] ⋈K [S2: K=R.K ∧ R.A≥10])
In step 4 of Example C:
  [R1: A<10] ⋈K [S2: K=R.K ∧ R.A≥10]
  → [R1 ⋈K S2: R1.A < 10 ∧ S2.K=R.K ∧ R.A ≥ 10 ∧ R1.K = S2.K]
  → [R1 ⋈K S2: False]  =  Ø   (since K is the key of R and of R1)
(4) simplified further:
    (R1 ⋈K S1) ∪ (R2 ⋈K S2)
Localization with vertical fragmentation — Example D:
(1) πA(R), with R fragmented as R1(K, A, B) and R2(K, C, D)
(2) πA( R1(K, A, B) ⋈K R2(K, C, D) )
(3) πA( πK,A(R1) ⋈K πK(R2) ) — the R2 branch is not really needed
(4) πA( R1(K, A, B) )
Rule 3
• Given a vertical fragmentation of R: Ri = πAi(R), ∪i Ai = A
• Then for any B ⊆ A:
    πB(R) = πB( ⋈ { Ri | B ∩ Ai ≠ Ø } )
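The fragment-selection part of Rule 3 can be sketched directly. A minimal illustration, assuming each vertical fragment is described by its attribute set (the key K replicated in every fragment, as in Example D; names are illustrative):

```python
# A sketch of Rule 3's fragment selection for vertical fragmentation:
# pi_B(R) only needs the fragments Ri whose attribute set Ai
# intersects B; the others can be dropped from the join.

fragments = {
    "R1": {"K", "A", "B"},   # R1 = pi_{K,A,B}(R)
    "R2": {"K", "C", "D"},   # R2 = pi_{K,C,D}(R)
}

def fragments_needed(fragments, B):
    """Return the fragments with Ai intersect B nonempty, sorted by name."""
    return sorted(name for name, attrs in fragments.items() if attrs & B)

print(fragments_needed(fragments, {"A"}))       # ['R1']  -- Example D
print(fragments_needed(fragments, {"A", "C"}))  # ['R1', 'R2']
```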
Localization with hybrid fragmentation — Example E:
  R1 = σk<5( πk,A(R) )    R2 = σk≥5( πk,A(R) )    R3 = πk,B(R)
Query: πA( σk=3(R) )
Reduced query: πA( σk=3(R1) )
Summary — Query Processing
• Decomposition
• Localization
• Optimization
  – Overview
  – Tricks for joins + other operations
  – Strategies for optimization
Optimization Process:
• Generate query plans P1, P2, P3, …, Pn
• Estimate the sizes of intermediate results
• Estimate the cost Ci of each plan ($, time, …)
• Pick the plan with minimum cost
Differences from centralized optimization:
• New strategies for some operations (semi-join, range-partitioning, sort, …)
• Many ways to assign and schedule processors
Parallel/distributed sort
Input:
(a) relation R on a single site/disk
(b) R fragmented/partitioned by the sort attribute
(c) R fragmented/partitioned by another attribute
Output:
(a) sorted R on a single site/disk, or
(b) fragments/partitions sorted, each covering a disjoint key range
    (e.g., F1: 5…10, F2: 12…20, F3: 21…50)
Basic sort
• R(K, …), sort on K
• R fragmented on K with vector k0, k1, …, kn
  [e.g., fragment 1 holds 7, 3 (keys < k0); fragment 2 holds 10, 11, 17, 14;
   fragment 3 holds 20, 27, 22]
• Algorithm: each fragment is sorted independently
• If necessary, ship the results
Same idea on different architectures:
• Shared-nothing: P1 (with its own memory M and disk F1) sorts F1;
  P2 sorts F2; the processors communicate over the network
• Shared-memory: P1 and P2 share memory M and sort F1, F2
Range-partitioning sort
• R(K, …), sort on K
• R located at one or more sites/disks, not fragmented on K
• Algorithm:
  (a) Range partition on K
  (b) Basic sort
  [Ra, Rb → range partition by vector (k0, k1) → R1, R2, R3
   → local sort → R′1, R′2, R′3 = result]
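Steps (a) and (b) can be sketched as follows. A minimal single-process simulation, assuming the partition vector (k0, k1) is already chosen (function and variable names are illustrative):

```python
# A sketch of range-partitioning sort: bisect assigns each tuple to the
# "site" whose key range contains it (step a); each site then runs a
# local sort (step b). Concatenating the results yields a global sort.
import bisect

def range_partition_sort(site_inputs, vector):
    nsites = len(vector) + 1
    partitions = [[] for _ in range(nsites)]
    for tuples in site_inputs:                      # step (a)
        for key in tuples:
            partitions[bisect.bisect_right(vector, key)].append(key)
    return [sorted(p) for p in partitions]          # step (b)

Ra, Rb = [20, 7, 14, 3], [27, 11, 17, 22, 10]
R1, R2, R3 = range_partition_sort([Ra, Rb], vector=[10, 20])
print(R1, R2, R3)   # concatenation R1+R2+R3 is fully sorted
```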
• Selecting a good partition vector
  [diagram: sites Ra, Rb, Rc, each holding an assortment of key values]
Example
• Each site sends to the coordinator:
  – Min sort key
  – Max sort key
  – Number of tuples
• The coordinator computes the vector and distributes it to the sites
  (it also decides the number of sites for the local sorts)
• Sample scenario — the coordinator receives:
    SA: Min=5, Max=10, #tuples=10
    SB: Min=7, Max=17, #tuples=10
  Expected tuples (assuming keys uniform within each site's range):
  SA contributes 2 tuples per key unit on [5, 10]; SB contributes 1 per
  unit on [7, 17]. Where should k0 go?
  [assuming we want to sort at 2 sites]
  Choose k0 so that the expected number of tuples with key < k0 is half
  the total (10 of 20):
    2(k0 − 5) + 1·(k0 − 7) = 10
    3·k0 − 17 = 10
    k0 = 9
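The coordinator's computation generalizes to any number of sites: build the piecewise-linear expected-count function and solve for the split point. A sketch under the same uniformity assumption (names are illustrative):

```python
# A sketch of the coordinator's vector computation. expected_below(k)
# is the expected number of tuples with key < k, assuming keys are
# uniform within each site's [min, max]; we solve for the k0 that hits
# a target count by bisection on this monotone function.

def expected_below(k, stats):
    total = 0.0
    for lo, hi, n in stats:                  # (min, max, #tuples) per site
        density = n / (hi - lo)
        total += density * min(max(k - lo, 0), hi - lo)
    return total

def pick_k0(stats, target):
    lo = min(s[0] for s in stats)
    hi = max(s[1] for s in stats)
    for _ in range(60):                      # bisection
        mid = (lo + hi) / 2
        if expected_below(mid, stats) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

stats = [(5, 10, 10), (7, 17, 10)]           # SA and SB from the scenario
print(round(pick_k0(stats, target=10), 3))   # 9.0 -- matches 2(k0-5)+(k0-7)=10
```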
Variations — send more info to the coordinator:
• A local partition vector, e.g., SA: boundaries 5, 6, 8, 10 with
  3 tuples in each sub-range
• A histogram of the local key distribution (e.g., over 5, 6, 7, 8, 9, 10)
More than one round, e.g.:
(1) Sites send range and # of tuples
(2) Coordinator returns a “preliminary” vector V0
(3) Sites tell the coordinator how many tuples fall in each V0 range
(4) Coordinator computes the final vector Vf
Can you come up with a distributed algorithm (no coordinator)?
Parallel external sort-merge
• Same as range-partition sort, except sort first:
  Ra, Rb → local sort → Ra′, Rb′ → range partition (k0, k1)
  → merge, in order, at R1, R2, R3 → result
• Note: a merging network can be used if available (e.g., Teradata)
Parallel/distributed join
Input:  relations R, S — may or may not be partitioned
Output: R ⋈ S — result at one or more sites
Partitioned join (equi-join)
• Partition R and S with the same function f(A) on the join attribute:
  Ra, Rb → R1, R2, R3 and Sa, Sb, Sc → S1, S2, S3
• Join the matching fragments locally: Ri ⋈ Si; the union is the result
Notes:
• The same partition function f is used for both R and S (applied to the
  join attribute)
• f can be range or hash partitioning
• The local join can be of any type (use any CS 245 optimization)
• Various scheduling options, e.g.:
  (a) partition R; partition S; join
  (b) partition R; build local hash tables for R; partition S and join
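The scheme above can be sketched as a single-process simulation. This hedged sketch uses hash partitioning and a local hash join per partition (relation contents and function names are illustrative):

```python
# A sketch of a partitioned equi-join. Both relations use the same
# partition function f on the join attribute A, so matching tuples
# always land in the same partition; each partition is then joined
# locally with a build/probe hash join.

def partitioned_join(R, S, nparts):
    f = lambda a: hash(a) % nparts          # same f for both R and S
    Rp = [[] for _ in range(nparts)]
    Sp = [[] for _ in range(nparts)]
    for a, b in R:
        Rp[f(a)].append((a, b))
    for a, c in S:
        Sp[f(a)].append((a, c))
    result = []
    for Ri, Si in zip(Rp, Sp):              # "local" join per partition
        table = {}
        for a, b in Ri:                     # build on Ri
            table.setdefault(a, []).append(b)
        for a, c in Si:                     # probe with Si
            for b in table.get(a, []):
                result.append((a, b, c))
    return result

R = [(2, "a"), (10, "b"), (25, "c"), (30, "d")]
S = [(3, "x"), (10, "y"), (15, "z"), (25, "w")]
print(sorted(partitioned_join(R, S, nparts=3)))  # [(10, 'b', 'y'), (25, 'c', 'w')]
```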
More notes:
• We already know why a partitioned join works:
  (R1 ∪ R2 ∪ R3) ⋈ (S1 ∪ S2 ∪ S3) = (R1 ⋈ S1) ∪ (R2 ⋈ S2) ∪ (R3 ⋈ S3)
• It is useful to give this type of join a name, because we may want to
  partition data precisely to make partition-join possible (especially in
  a parallel DB system)
Even more notes:
• Selecting a good partition function f is very important:
  – number of fragments
  – hash function
  – partition vector
• Good partition vector
  – Goal: |Ri| + |Si| roughly the same for all i
  – Can use a coordinator to select it
Asymmetric fragment-and-replicate join
• Partition R with f into R1, R2, R3; replicate S in full at every site
• Local joins Ri ⋈ S; the union is the result
Notes:
• Can use any partition function f for R (even round robin)
• Can do any join — not just equi-join, e.g.: R ⋈R.A < S.B S
General fragment-and-replicate join
• Partition R into n fragments (here f1 gives 3) and S into m fragments
  (here 2), in similar fashion
• Replicate fragments so that every one of the n×m pairings is computed
  at some site: R1⋈S1, R1⋈S2, R2⋈S1, R2⋈S2, R3⋈S1, R3⋈S2
• The union of all pairings is the result
Notes:
• The asymmetric F+R join is a special case of general F+R
• Asymmetric F+R may be good if S is small
• Works for non-equi-joins
Semi-join
• Goal: reduce communication traffic
• R ⋈A S  ≡  (R ⋉A S) ⋈A S
          ≡  R ⋈A (S ⋉A R)
          ≡  (R ⋉A S) ⋈A (S ⋉A R)
  where R ⋉A S = R ⋈A πA(S)   (the semi-join)
Example:
  R:  A   B        S:  A   C
      2   a            3   x
      10  b            10  y
      25  c            15  z
      30  d            25  w
                       32  x

  πA(R) = [2, 10, 25, 30]
  S ⋉A R:  A   C
           10  y
           25  w
  Answer:  R ⋈ S = R ⋈ (S ⋉A R)
Computing the transmitted data in the example:
• With semi-join R ⋈ (S ⋉A R):  T = 4·|A| + 2·|A+C| + (result)
  — ship the 4 values of πA(R), then the 2 matching S tuples
• With join R ⋈ S:  T = 4·|A+B| + (result) — ship all 4 R tuples
• The semi-join is better if, say, |B| is large
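Plugging illustrative attribute widths into the two formulas makes the comparison concrete (the byte sizes below are assumptions, not from the slides):

```python
# A sketch of the transmission-cost comparison for the example:
# 4 R tuples, 2 S tuples match the semi-join. Attribute widths in
# bytes are assumed, illustrative values.

size_A, size_B, size_C = 4, 100, 4

# Semi-join R |X| (S |x| R): ship pi_A(R), then the matching S tuples
semijoin_cost = 4 * size_A + 2 * (size_A + size_C)

# Plain join: ship all of R to S's site
join_cost = 4 * (size_A + size_B)

print(semijoin_cost, join_cost)  # 32 416 -- semi-join wins when |B| is large
```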
In general — say R is the smaller relation:
  (R ⋉A S) ⋈A S  is better than  R ⋈A S
  if  size(πA(S)) + size(R ⋉A S) < size(R)
• Similar comparisons apply for the other semi-join variants
• Remember: we are only taking transmission cost into account
• Trick: encode πA(S) (or πA(R)) as a bit vector, one bit per possible
  key in S:  001101000010100
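The bit-vector trick can be sketched directly. A minimal illustration, assuming keys fall in a small known range so each possible key maps to one bit (with a hash function instead of the identity map this becomes a Bloom-filter-style vector, which can admit false positives):

```python
# A sketch of the bit-vector encoding of pi_A(R): ship one integer
# whose set bits mark R's keys, instead of the key list itself.

def make_bitvector(keys, nbits):
    v = 0
    for k in keys:
        assert 0 <= k < nbits
        v |= 1 << k          # one bit per possible key
    return v

R_keys = [2, 10, 25, 30]                               # pi_A(R), from the example
S = [(3, "x"), (10, "y"), (15, "z"), (25, "w"), (32, "x")]

bv = make_bitvector(R_keys, nbits=64)                  # shipped instead of pi_A(R)
S_semijoin = [(a, c) for a, c in S if (bv >> a) & 1]   # S |x| R at S's site
print(S_semijoin)                                      # [(10, 'y'), (25, 'w')]
```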
Three-way joins with semi-joins
• Goal: R ⋈ S ⋈ T
• Option 1: R′ ⋈ S′ ⋈ T, where R′ = R ⋉ S and S′ = S ⋉ T
• Option 2: R″ ⋈ S′ ⋈ T, where R″ = R ⋉ S′ and S′ = S ⋉ T
Many options! The number of semi-join options is exponential in the
number of relations in the join.
Privacy-preserving join
• Site 1 has R(A, B); Site 2 has S(A, C)
• Want to compute R ⋈ S
• Site 1 should NOT discover any S info not in the join
• Site 2 should NOT discover any R info not in the join
Semi-join does not work
• If Site 1 sends πA(R) to Site 2, Site 2 learns all keys of R!
  [Site 1: R with keys a1, a2, a3, a4; Site 2: S with keys a1, a3, a5, a7;
   πA(R) = (a1, a2, a3, a4)]
Fix: send hashed keys
• Site 1 hashes each value of A before sending:
  πA(R) → (h(a1), h(a2), h(a3), h(a4))
• Site 2 hashes (with the same function) its own A values to see which
  tuples match: it finds h(a1), h(a3), i.e., tuples (a1, c1), (a3, c3)
What is the problem? Dictionary attack!
• Site 2 takes all possible keys a1, a2, a3, … and checks whether
  h(a1), h(a2), h(a3), … match what Site 1 sent, recovering R's keys
Adversary model: honest but curious
• A dictionary attack is possible (the cheating is internal and can't be caught)
• Sending incorrect keys is not possible (the cheater could be caught)
One solution (Agrawal et al.)
• Use a commutative encryption function:
  – Ei(x) = x encrypted using site i's private key
  – E1(E2(x)) = E2(E1(x))
• Applying both encryptions, in either order, yields the same value
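One concrete way to get commutative encryption is Pohlig–Hellman-style exponentiation, Ei(x) = x^ei mod p. The sketch below demonstrates the intersection protocol with that choice; the prime, exponents, and protocol details are illustrative assumptions (and far too small to be secure), not the parameters from the paper:

```python
# A sketch of commutative-encryption intersection, assuming
# E_i(x) = x^e_i mod p with p prime and each e_i coprime to p-1.
# Commutativity: (x^e1)^e2 = (x^e2)^e1 (mod p). Toy parameters only!

p = 1_000_003                 # prime; p-1 = 2 * 3 * 166667
e1, e2 = 65537, 101           # sites' private exponents, coprime to p-1

E1 = lambda x: pow(x, e1, p)
E2 = lambda x: pow(x, e2, p)

R_keys = [2, 10, 25, 30]      # site 1's keys
S_keys = [3, 10, 15, 25, 32]  # site 2's keys

# Commutativity check: E1(E2(x)) == E2(E1(x)) for all x
assert all(E1(E2(x)) == E2(E1(x)) for x in R_keys + S_keys)

# Site 1 ships E1(R_keys); site 2 double-encrypts them (site 1 tracks
# which of its keys each value came from) and ships E2(S_keys) back.
double_R = {E2(E1(x)): x for x in R_keys}
double_S = {E1(E2(y)) for y in S_keys}   # site 1 applies E1 to E2(S_keys)

common = sorted(x for enc, x in double_R.items() if enc in double_S)
print(common)  # [10, 25] -- the join keys, without exposing the others
```

Because each site keeps its exponent private, neither side can mount the dictionary attack: guessing a key does not let you compute the other site's encryption of it.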
Solution:
• Site 1 sends its keys encrypted: E1(a1), E1(a2), E1(a3), E1(a4);
  Site 2 sends E2(a1), E2(a3), E2(a5), E2(a7)
• Each site applies its own encryption to the values it received,
  producing double-encrypted key sets; since E1(E2(x)) = E2(E1(x)),
  the two sets agree exactly on the common keys
• Intersecting them identifies a1 and a3, yielding the join tuples
  (a1, b1), (a3, b3)
Why does this solution work?
Other privacy-preserving operations?
• Inequality join: R ⋈R.A > S.A S
• Similarity join: R ⋈sim(R.A, S.A) < e S
Other parallel operations
• Duplicate elimination
  – Sort first (in parallel), then eliminate duplicates in the result
  – Or partition tuples (range or hash) and eliminate duplicates locally
• Aggregates
  – Partition by the grouping attributes; compute each aggregate locally
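The partition-then-aggregate idea can be sketched in a few lines. A minimal single-process simulation of sum(sal) group by dept over two partitions (relation contents and names are illustrative):

```python
# A sketch of parallel aggregation: each site computes local partial
# sums per dept first, then only the (dept, partial sum) pairs are
# "shipped" and merged -- less data than repartitioning raw tuples.
from collections import Counter

def local_partial_sums(tuples):
    sums = Counter()
    for dept, sal in tuples:
        sums[dept] += sal
    return sums

Ra = [("toys", 10), ("sales", 20), ("toys", 5)]
Rb = [("sales", 15), ("toys", 30)]

merged = local_partial_sums(Ra) + local_partial_sums(Rb)  # Counter addition
print(dict(merged))  # {'toys': 45, 'sales': 35}
```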
Example: sum(sal) group by dept, with R partitioned into Ra, Rb
• Option 1: repartition the raw tuples by dept, then sum at each site
• Option 2: compute partial sums locally at Ra and Rb first, then
  repartition the partial sums by dept and add them — less data shipped!
Preview: MapReduce
[diagram: records data A1, data B1, data C1, data B2, data C2, data A3
 routed to sites by key A / B / C]
Enhancements for aggregates
• Perform the aggregate during partitioning to reduce the data transmitted
• Does not work for all aggregate functions… which ones?
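A quick experiment shows which aggregates survive pre-aggregation: sum, count, min, and max merge correctly from partial results (and avg does, if we ship (sum, count) pairs), while median does not. The partition contents below are illustrative:

```python
# A sketch of which aggregates decompose over partitions.
import statistics

part1, part2 = [1, 2, 9], [3, 4, 5]

# sum and max decompose: combining partials gives the global answer
assert sum([sum(part1), sum(part2)]) == sum(part1 + part2)
assert max(max(part1), max(part2)) == max(part1 + part2)

# avg decomposes only if we ship (sum, count), not local averages
s, n = sum(part1) + sum(part2), len(part1) + len(part2)
assert s / n == statistics.mean(part1 + part2)

# median does NOT decompose: median of local medians != global median
local = statistics.median([statistics.median(part1), statistics.median(part2)])
print(local, statistics.median(part1 + part2))  # 3.0 vs 3.5
```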
Selection
• Range or hash partition
• Straightforward — but what about indexes?
Indexing
• Can think of the partition vector as the root of a distributed index:
  [vector (k0, k1) at the root; local indexes at Site 1, Site 2, Site 3]
• Index on a non-partition attribute:
  [vector (k0, k1) at the index sites; entries point to the tuple sites]
Notes:
• If the index is not too big, it may be better to keep it whole and make
  copies…
• If updates are frequent, the update work can be partitioned…
  (Question: how do we handle splits of B-tree pages?)
• Extensible or linear hashing
  [diagram: hash function f directs tuples to buckets R1, R2, R3;
   R4 ← added by a split]
• How do we adapt these schemes?
• Where do we store the directory, the set of participants…?
• Which scheme is better for a distributed environment?
• Can we design a hashing scheme with no global knowledge (P2P)?
Summary: Query processing
• Decomposition and localization
• Optimization
  – Overview
  – Tricks for joins, sort, …
  – Tricks for inter-operation parallelism
  – Strategies for optimization
Inter-operation parallelism
• Pipelined
• Independent
Pipelined parallelism — e.g., computing σc(R) ⋈ S:
Site 1 scans R and ships the tuples matching c to Site 2, which probes S
and produces join results as the tuples stream in
Independent parallelism — e.g., (R ⋈ S) ⋈ (T ⋈ V):
(1) temp1 ← R ⋈ S; temp2 ← T ⋈ V (at Site 1 and Site 2, in parallel)
(2) result ← temp1 ⋈ temp2
• Pipelining cannot be used in all cases — e.g., in a hash join, the full
  stream of R tuples must arrive to build the hash table before the
  stream of S tuples can probe it
Summary
As we consider query plans for optimization, we must consider various tricks:
- for individual operations
- for scheduling operations