Performance Guarantees for Distributed Reachability Queries Wenfei Fan
Wenfei Fan (1, 2), Xin Wang (1), Yinghui Wu (1, 3)
(1) University of Edinburgh; (2) Harbin Institute of Technology; (3) University of California, Santa Barbara
Outline
✓ Querying distributed real-life graphs
• Real-life graphs are often fragmented/distributed
• Distributed reachability queries (via partial evaluation)
• Distributed bounded reachability queries (via partial evaluation)
• Distributed regular reachability queries (via partial evaluation)
✓ Distributed reachability with MapReduce
✓ Experimental study
✓ Conclusion
Distributed query evaluation with performance guarantees
Yinghui Wu, VLDB 2012
Distributed Real-life Graphs
✓ Real-life graphs are distributed
• Geo-distributed, e.g., data centers
• Decentralization, e.g., social networks
• Distributed entity and personal information
Real-life graphs are purposely or naturally distributed
Distributed querying methods
✓ Federated/centralized graph database
• collect and link graph fragments
• query the centralized graph
[Figure: the fragments are collected into one centralized graph; centralized querying of Q returns Q(G)]
Drawback: construction and maintenance cost
Distributed querying methods
✓ Graph exploration strategy
• Master node and slave nodes
• Predefined graph partition and query execution plan
[Figure: the master node sends a query plan for Q to the slave nodes, which exchange intermediate results and return Q(G)]
Drawback: no bounds on the number of site visits or on data shipment
Querying a distributed social network
[Figure: a social network distributed over three data centers DC1, DC2 and DC3. Nodes are labeled with a name and a job title: Ann "CTO"; Mat, Fred, Emmy, Walt and Ross "HR"; Ben and Jack "MK"; Dan and Bill "DB"; Mark "FA"; Pat "SE"; Tom "AI". The query Q uses the regular expression (DB*∪HR*).]
Centralized method? Graph exploration?
Using partial evaluation to obtain performance guarantees
Partial evaluation
✓ Partial evaluation (a.k.a. program specialization)
• given a function f(s, d) and a part of its input, e.g., s, specializes f(s, d) w.r.t. s
• only conducts the part of f's computation that depends on s
• generates a residual function f'
[Figure: f(s, d) with s known yields f'(d). For graph queries? Q(Fi, G) with fragment Fi known yields Q'(G).]
Partial evaluation: generating partial answers
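A minimal sketch of program specialization, assuming a toy function that is not from the slides: `power(exponent, base)` plays the role of f(s, d), the exponent is the known input s, and `specialize_power` (a hypothetical helper) returns the residual function f'(d) together with its generated source text.

```python
from typing import Callable, Tuple


def power(exponent: int, base: float) -> float:
    """The general function f(s, d): loops over the exponent at call time."""
    result = 1.0
    for _ in range(exponent):
        result *= base
    return result


def specialize_power(exponent: int) -> Tuple[Callable[[float], float], str]:
    """Specialize power() w.r.t. a known exponent (the input s).

    The loop over the exponent is executed now, at specialization time.
    What remains is a residual function f'(d) that only multiplies.
    """
    body = " * ".join(["base"] * exponent) or "1.0"
    source = f"lambda base: {body}"
    # eval is acceptable here because the source string is built from
    # fixed tokens only, never from user-supplied text.
    return eval(source), source


cube, cube_source = specialize_power(3)
print(cube_source)   # the residual program, with the loop unrolled
print(cube(2.0))     # same value as power(3, 2.0)
```

The same idea carries over to graph queries: the part of the query that depends only on a known fragment Fi is evaluated locally, and what remains is a residual answer over the unknown rest of the graph.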
Distributed graphs and graph queries
✓ Distributed graph
• graph fragmentation F = (F, Gf)
• fragment graph Gf
✓ Reachability query
• reachability query Qr(s, t), e.g., Qr(Ann, Mark)
• bounded reachability query Qbr(s, t, l), e.g., Qbr(Ann, Mark, 5)
• regular reachability (path) query Qrr(s, t, R), e.g., Qrr(Ann, Mark, (DB*∪HR*))
▪ R ::= ε | a | RR | R∪R | R*
[Figure: the social network split into fragments F1, F2 and F3, together with the fragment graph Gf. Callouts mark an in-node of F1, a virtual node of F1, a cross edge and a fragment.]
Distributed graph querying framework
A coordinating site Sc and a set of graph fragments F1, …, Fn
• Distributing at Sc: post Q to the fragments
• Local evaluation: partially evaluate Q on each fragment, producing Q(Fi)
• Assembling at Sc: combine the partial answers Q(Fi) into Q(G)
[Figure: coordinator Sc sends Q to each fragment and receives Q(Fi) back]
Applying partial evaluation to graph querying
Distributed reachability queries
✓ Performance guarantees: over a fragmentation F = (F, Gf) of a graph G, reachability queries can be evaluated
(a) in O(|Vf||Fm|) time,
(b) by visiting each site only once, and
(c) with the total network traffic bounded by O(|Vf|^2),
where Gf = (Vf, Ef) and Fm is the largest fragment in F.
✓ A distributed reachability evaluation algorithm disReach
• Coordinator Sc posts qr(s, t) to each fragment site in F
• Each site locally evaluates qr(s, t) in parallel, and produces a partial answer as a set of Boolean equations
• Sc collects and assembles the partial results
Distributed reachability: partial evaluation
✓ Locally evaluate each qr(v, t) on Fi in parallel: qr(v, t) = Xv1' or … or Xvn'
✓ For each in-node v of Fi, decide whether v reaches t; introduce a Boolean variable Xv' = qr(v', t) for each virtual node v' of Fi
✓ Partial answer to qr(v, t): a set of Boolean formulas, each a disjunction of the variables Xv' of the virtual nodes v' that v can reach
[Figure: a fragment in which v reaches virtual nodes v', which may in turn reach t]
Partial evaluation by introducing Boolean variables
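The local evaluation step can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function name `local_eval`, the argument layout and the toy fragment are assumptions, and it runs one breadth-first search per in-node rather than whatever traversal the original algorithm uses.

```python
from collections import deque


def local_eval(nodes, edges, in_nodes, virtual_nodes, s, t):
    """Partially evaluate qr(s, t) on one fragment.

    nodes:         set of nodes stored in this fragment
    edges:         list of (u, w) pairs, including cross edges to virtual nodes
    in_nodes:      set of local nodes with an incoming cross edge
    virtual_nodes: set of nodes that live in other fragments
    Returns {v: True} if v reaches t locally, otherwise
    {v: frozenset of virtual nodes v' reachable from v}, read as the
    equation Xv = OR of Xv'.
    """
    adj = {}
    for u, w in edges:
        adj.setdefault(u, []).append(w)

    sources = set(in_nodes)
    if s in nodes:            # the query source also needs an equation
        sources.add(s)

    equations = {}
    for v in sources:
        seen, queue = {v}, deque([v])
        while queue:
            u = queue.popleft()
            if u in virtual_nodes:
                continue      # never explore beyond the fragment boundary
            for w in adj.get(u, []):
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        if t in seen:
            equations[v] = True
        else:
            equations[v] = frozenset(seen & virtual_nodes)
    return equations


# Hypothetical fragment: s -> a -> v1 and b -> v2, b -> a; v1, v2 are virtual.
eqs = local_eval(
    nodes={"s", "a", "b"},
    edges=[("s", "a"), ("a", "v1"), ("b", "v2"), ("b", "a")],
    in_nodes={"b"},
    virtual_nodes={"v1", "v2"},
    s="s",
    t="t",
)
print(eqs)
```

Each site only needs its own fragment to produce these equations, which is why all sites can run in parallel and each is visited once.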
Distributed reachability: assembling
✓ Collect the Boolean equation set at coordinator Sc (figure annotation: O(|Vf|))
✓ Solve the Boolean equation system over a dependency graph
✓ qr(s, t) is true iff Xs = true at Sc
Example equation system:
Xs = Xv
Xv = Xv'' or Xv'
Xv' = Xt
Xv'' = false
Xt = 1
Partial evaluation by introducing Boolean variables
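A minimal sketch of the assembling step on the example system from this slide. The dictionary encoding and the function name `solve` are assumptions made for illustration: each variable maps either to a Boolean constant or to the set of variables in its disjunction, and solving reduces to a reachability search over the dependency graph.

```python
def solve(equations, start):
    """Return the value of `start` in a system of disjunctive Boolean equations.

    equations maps a variable name to True, False, or a set of variable
    names (meaning the OR of those variables). `start` is true iff a
    variable whose equation is True is reachable from it in the
    dependency graph.
    """
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        rhs = equations.get(x, False)   # unknown variables count as false
        if rhs is True:
            return True
        if rhs is False:
            continue
        for y in rhs:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return False


# The system shown on the slide.
system = {
    "Xs": {"Xv"},
    "Xv": {"Xv''", "Xv'"},
    "Xv'": {"Xt"},
    "Xv''": False,
    "Xt": True,
}
print(solve(system, "Xs"))
```

Because every variable is visited at most once, the search is linear in the size of the collected equation set, which depends on the fragment graph and not on the size of G.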
Distributed reachability queries: example
1. Dispatch Q to fragments (at Sc)
2. Partial evaluation: generating Boolean equations (at Fi)
3. Assembling: solving equation system (at Sc)
[Figure: coordinator Sc posts Q to fragments F1, F2 and F3 of the social network; the query asks whether Ann reaches Mark]
Distributed bounded reachability queries
1. Dispatch Q to fragments (at Sc)
2. Partial evaluation: generating equations (at Fi)
3. Assembling: solving equation system (at Sc)
[Figure: coordinator Sc posts Q to fragments F1, F2 and F3 of the social network. Callouts: "A weighted dependency graph"; "Variables denoting numeric values".]
Distributed bounded reachability queries
✓ Performance guarantees: bounded reachability queries can be evaluated with the same performance guarantees as for reachability queries.
• Number of visits: each site is visited once; O(|F|)
• Total network traffic: O(|Vf|^2); independent of |G|
• Computational cost: O(|Vf||Fm|); parallel computation
• Local evaluation: any strategy (e.g., indexes, graph partitions) can be applied on the fragments
Performance guarantees for distributed bounded reachability
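For bounded reachability the variables carry numeric values (distances) instead of Booleans, and the dependency graph becomes weighted. The sketch below shows only the assembling side, under stated assumptions: the name `bounded_reach`, the nested-dictionary encoding and the toy weights are hypothetical, and Dijkstra's algorithm stands in for whatever solver the original algorithm uses.

```python
import heapq


def bounded_reach(weighted_deps, s, t, bound):
    """Decide whether dist(s, t) <= bound in a weighted dependency graph.

    weighted_deps[x][y] is the local distance from x to y reported by
    the fragment that stores x (x is s or an in-node; y is a virtual
    node or t).
    """
    best = {s: 0}
    heap = [(0, s)]
    while heap:
        d, x = heapq.heappop(heap)
        if d > best.get(x, float("inf")):
            continue                      # stale heap entry
        if x == t:
            return d <= bound
        for y, w in weighted_deps.get(x, {}).items():
            nd = d + w
            # Paths longer than the bound can never qualify, so prune them.
            if nd <= bound and nd < best.get(y, float("inf")):
                best[y] = nd
                heapq.heappush(heap, (nd, y))
    return False


# Hypothetical partial answers: s -2-> v1, v1 -1-> v2, v1 -4-> t, v2 -1-> t.
deps = {"s": {"v1": 2}, "v1": {"v2": 1, "t": 4}, "v2": {"t": 1}}
print(bounded_reach(deps, "s", "t", 4))
print(bounded_reach(deps, "s", "t", 3))
```

The shortest assembled path here is s, v1, v2, t with length 4, so the query holds for a bound of 4 and fails for a bound of 3.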
Distributed regular reachability queries
✓ Performance guarantees: over a fragmentation F = (F, Gf) of a graph G, regular reachability queries qrr(s, t, R) can be evaluated
(a) in O((|Vf|^2 + |Fm|)|R|^2) time,
(b) by visiting each site only once, and
(c) with the total network traffic bounded by O(|R|^2 |Vf|^2),
where Gf = (Vf, Ef) and Fm is the largest fragment in F.
✓ Query automaton Gq(R) of R: <Vq, Eq, Lq, us, ut>
Automaton representation for queries
Query automaton
✓ A node v is a match of state uv in Gq(R) iff (1) they have the same label, and (2) there is a path ρ from v to t and a path ρ' from uv to ut, s.t. ρ and ρ' induce the same label
✓ Given a graph G, qrr(s, t, R) over G is true if and only if s is a match of us in Gq(R)
[Figure: the query automaton for (DB*∪HR*), with states for Ann, DB, HR and Mark, shown next to graph nodes Ann, Mat "HR", Fred "HR", Walt "HR", Emmy "HR", Ross "HR" and Mark "FA"]
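The matching condition can be illustrated on a single, unfragmented graph with a search over (node, state) pairs. This is a sketch under assumptions: the automaton for (DB*∪HR*) is written out by hand rather than compiled from R, the graph edges are made up for the example (only the names and labels come from the slides), and `regular_reach` is a hypothetical name.

```python
from collections import deque


def regular_reach(graph, labels, s, t, q_edges, q_labels, us="us", ut="ut"):
    """Return True iff s is a match of the start state us.

    graph:    {node: [successor, ...]}
    labels:   {node: label}
    q_edges:  {state: [successor state, ...]} of the query automaton
    q_labels: {state: label} for the states other than us and ut
    """
    def matches(v, u):
        if u == us:
            return v == s
        if u == ut:
            return v == t
        return labels.get(v) == q_labels.get(u)

    start = (s, us)
    seen, queue = {start}, deque([start])
    while queue:
        v, u = queue.popleft()
        if (v, u) == (t, ut):
            return True
        # Advance the graph node and the automaton state together.
        for v2 in graph.get(v, []):
            for u2 in q_edges.get(u, []):
                pair = (v2, u2)
                if pair not in seen and matches(v2, u2):
                    seen.add(pair)
                    queue.append(pair)
    return False


# Hand-written automaton for (DB* U HR*): us -> DB* -> ut or us -> HR* -> ut.
q_edges = {"us": ["DB", "HR", "ut"], "DB": ["DB", "ut"], "HR": ["HR", "ut"]}
q_labels = {"DB": "DB", "HR": "HR"}

# Hypothetical edges over nodes named in the slides.
graph = {"Ann": ["Walt", "Bill"], "Walt": ["Mat"], "Mat": ["Mark"],
         "Bill": ["Pat"], "Pat": ["Mark"]}
labels = {"Ann": "CTO", "Walt": "HR", "Mat": "HR", "Bill": "DB",
          "Pat": "SE", "Mark": "FA"}

print(regular_reach(graph, labels, "Ann", "Mark", q_edges, q_labels))
```

In this toy graph the path Ann, Walt, Mat, Mark passes only through HR nodes and is accepted, while the path through Bill and Pat is rejected because Pat is labeled SE.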
Distributed regular query evaluation: algorithm
✓ Query distribution
• construct query automaton Gq(R)
• post Gq(R) to each graph fragment
✓ Partial evaluation
• compute partial answers as a set of Boolean formula vectors
✓ Assembling
• collect partial answers and construct dependency graph
• assemble the final answer
Distributed regular query evaluation: partial evaluation
✓ For each node v in Fi, assign v.rvec: a vector of O(|Vq|) Boolean formulas, where each entry v.rvec[u] denotes whether v matches u
✓ Introduce a Boolean variable X(v', w) for each virtual node v' of Fi and each state w in Vq, denoting whether v' matches w
✓ Partial answer to qrr(s, t): a set of Boolean formulas from each in-node of Fi
[Figure: in-nodes v1 and v2 carry formula vectors (f11, f12, …, f1k) and (f21, f22, …, f2k) over the variables X(v', w) of virtual node v' and automaton states vq, …, wq]
Partial evaluation by introducing Boolean variables
Distributed regular query evaluation: assembling
✓ Collects the partial results as sets of Boolean formulas
✓ Constructs a dependency graph: a node vd for each in-node and each entry of its formula vector, labeled with the Boolean formula, and an edge for each dependency
✓ Checks whether vd(s, us) can reach vd(t, ut) in the dependency graph
[Figure: dependency graph with nodes vd(s, us), vd(v1, vq), vd(v2, vq), vd(v', w) and vd(t, ut) = true]
Partial evaluation by introducing Boolean variables
Distributed Regular Reachability Evaluation: Example
1. Dispatch Q to fragments (at Sc)
2. Partial evaluation: generating a set of Boolean equations (at Fi)
3. Assembling: solving equation system (at Sc)
[Figure: coordinator Sc posts Q to fragments F1, F2 and F3 of the social network. Callouts: "vector of Boolean formulas"; "Test reachability in dependency graph".]
Distributed regular reachability query evaluation
Distributed Reachability with MapReduce
1. Coordinator: generates the query automaton Gq; partitions graph G into K fragments, each as a key/value pair (i, <Fi, Gq>)
2. Map function: local evaluation upon (i, <Fi, Gq>); generates <1, rvset>
3. Reduce function: assembles the collected partial results and writes <0, ans> to the distributed file system
[Figure: mappers 1, …, k receive the pairs (1, <F1, Gq>), …, (k, <Fk, Gq>) and send <1, rvset1>, …, <1, rvsetk> to a single reducer, which outputs <0, ans>. Cost annotations: O(Fm), O(|R|^2 |Vf|^2), and O(Fm + |R|^2 |Vf|^2); label: "Processing path".]
Partial evaluation properly fits in the MapReduce framework
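The map and reduce roles can be sketched in plain Python without any MapReduce runtime. To keep the example short it evaluates plain reachability qr(s, t) instead of the regular query with Gq, so the mapper emits Boolean equations rather than the rvset vectors named on the slide. The names `map_fn`, `reduce_fn` and `run_job` and the fragment dictionaries are assumptions for illustration only.

```python
from collections import deque


def map_fn(frag_id, payload):
    """Mapper: local evaluation on one fragment; emits (1, equations)."""
    frag, s, t = payload
    adj = {}
    for u, w in frag["edges"]:
        adj.setdefault(u, []).append(w)
    sources = set(frag["in_nodes"])
    if s in frag["nodes"]:
        sources.add(s)
    equations = {}
    for v in sources:
        seen, queue = {v}, deque([v])
        while queue:
            u = queue.popleft()
            if u in frag["virtual"]:
                continue              # stop at the fragment boundary
            for w in adj.get(u, []):
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        # (reaches t locally?, virtual nodes reachable from v)
        equations[v] = (t in seen, seen & frag["virtual"])
    return (1, equations)


def reduce_fn(key, partials, s):
    """Reducer: merge all equations and test whether Xs is true."""
    system = {}
    for eqs in partials:
        system.update(eqs)
    seen, stack = {s}, [s]
    while stack:
        x = stack.pop()
        reaches_t, deps = system.get(x, (False, set()))
        if reaches_t:
            return (0, True)
        for y in deps - seen:
            seen.add(y)
            stack.append(y)
    return (0, False)


def run_job(fragments, s, t):
    """Simulate one map phase followed by a single reduce call."""
    mapped = [map_fn(i, (frag, s, t)) for i, frag in fragments.items()]
    partials = [value for key, value in mapped if key == 1]
    return reduce_fn(1, partials, s)


# Hypothetical fragmentation of the path s -> a -> b -> c -> t.
fragments = {
    1: {"nodes": {"s", "a"}, "edges": [("s", "a"), ("a", "b")],
        "in_nodes": set(), "virtual": {"b"}},
    2: {"nodes": {"b", "c"}, "edges": [("b", "c"), ("c", "t")],
        "in_nodes": {"b"}, "virtual": {"t"}},
    3: {"nodes": {"t"}, "edges": [], "in_nodes": {"t"}, "virtual": set()},
}
print(run_job(fragments, "s", "t"))
```

All mappers emit the same key, so a single reducer sees every partial answer, which mirrors the one-round structure on the slide.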
Experimental Evaluation
✓ Experimental setting
• Real-life datasets
• Synthetic data: larger random graphs following the densification law
• Algorithms:
▪ disReach, disReachn and disReachm
▪ disDist and disDistn
▪ disRPQ, disRPQn and disRPQd
▪ MRdRPQ
Distributed reachability
✓ Efficiency and scalability
[Figure: performance plots. Callouts: "20% and 6%"; "9% of disReachn"; "three thousand visits over 4 fragments".]
disReach outperforms centralized and message-passing approaches
Distributed regular reachability
✓ Efficiency and network traffic
[Figure: performance plots. Callouts: "Time: 60% of disRPQn"; "Traffic: at most 25% and 3%".]
disRPQ takes much less time and incurs much less communication cost
Distributed regular reachability (cont.)
✓ Scalability
[Figure: scalability plots. Callout: "Scales well with the number of fragments; takes less time over more fragments".]
disRPQ scales well with the number of fragments
Performance of MapReduce implementation
✓ Efficiency and scalability
[Figure: performance plots. Callouts: "Scales well with the size of fragments"; "Takes less time with more mappers"; "Takes more time over more complex queries".]
Partial evaluation works well in the MapReduce model
Conclusion
✓ Distributed reachability querying
• Partial evaluation based distributed evaluation
• Reachability, bounded reachability and regular reachability queries
• Performance guarantees
✓ Partial evaluation can be naturally conducted as MapReduce
✓ Future work
• Distributed evaluation for other queries, e.g., graph pattern matching using simulation
• Combining partial evaluation and incremental computation
Partial evaluation based distributed query evaluation
Thank you!