CS 268 DHTs Scott Shenker and Ion Stoica

How Did it Start? § A killer application: Naptser - Free music over the

Model § § Each user stores a subset of files Each user has access

Main Challenge § Find where a particular file is stored E F D E?

Other Challenges § § Scale: up to hundred of thousands or millions of machines

Napster § § Assume a centralized index system that maps files (songs) to machines

Napster: Example m 5 E m 6 F E? E E? m 5 m

Gnutella § § § Distribute file location Idea: flood the request Hot to find

Gnutella § § § Ad-hoc topology Queries are flooded for bounded number of hops

Distributed Hash Tables (DHTs) § Abstraction: a distributed hash-table data structure - insert(id, item);

DHT Design Goals § § § Make sure that an item (file) identified is

Structured Networks § § Distributed Hash Tables (DHTs) Hash table interface: put(key, item), get(key)

Content Addressable Network (CAN) § § Associate to each node and item a unique

CAN Example: Two Dimensional Space § § Space divided between nodes All nodes cover

CAN Example: Two Dimensional Space § Node n 2: (4, 2) joins space is

CAN Example: Two Dimensional Space § Nodes n 4: (5, 5) and n 5:

CAN Example: Two Dimensional Space § § Nodes: n 1: (1, 2); n 2:

CAN Example: Two Dimensional Space § Each item is stored by the node who

CAN: Query Example § § Each node knows its neighbors in the d-space Forward

CAN: Node Joining I new node 1) Discover some node “I” already in CAN

CAN: Node Joining (x, y) I new node 2) Pick random point in space

CAN: Node Joining (x, y) J I new node 3) I routes to (x,

CAN: Node Joining J new 4) split J’s zone in half… new node owns

Node departure § Node explicitly hands over its zone and the associated (key, value)

Chord § § Associate to each node and item a unique id in an

Identifier to Node Mapping Example § § § Node 8 maps [5, 8] Node

Lookup § § lookup(37) Each node maintains its successor Route packet (ID, data) to

Joining Operation § § Each node A periodically sends a stabilize() message to its

Joining Operation § § Node with id=50 joins the ring Node 50 needs to

Joining Operation § § § succ=4 pred=44 Node 50: send join(50) to node 15

Joining Operation - update predecessor to 50 send notify() back 0) 4 ed =5

Joining Operation (cont’d) § 4 8 =50 ) 58 red ify( p § Node

Joining Operation (cont’d) § § Node 44 sends a stabilize message to its new

Joining Operation (cont’d) § This completes the joining operation! pred=50 4 58 succ=58 pred=44

Achieving Efficiency: finger tables Finger Table at 80 i 0 1 2 3 4

Achieving Robustness § § § To improve robustness each node maintains the k (>

CAN/Chord Optimizations § Reduce latency - Chose finger that reduces expected time to reach

Conclusions § § § Distributed Hash Tables are a key component of scalable and

Two Other Papers Ø § Boon Loo Tau et al, “Enhancing P 2 P

Goal § § P 2 P Search - What is the best design? Target

Keyword Search using DHTs § Inverted Lists hashed by keyword (term) in the DHT

Solution Space & Proposal Flood-based Network (Gnutella) (All items) Flood Query Very Few Results?

Gnutella Measurements § “Quality” of Searches: - Recall (% of all relevant items retrieved)

Result Size CDF Flooding is good enough. Single Query “Union-of-30” Query Our Focus 45

Query Latency Few Results Many results Queries that return few results have poor response

Hybrid Search Hybrid = “Best of both worlds” Flood-based Network (Search Popular Items) Flood

Challenges § Identifying Rare Items - Query Results Size (QRS) • Publish items from

Identifying Rare Items (%) • Upper-bound: Perfect scheme with complete knowledge of network •

Results § Improved Response Time: - PIER (DHT) returns first result in 10 seconds

Motivation § New DHTs constantly proposed - CAN, Chord, Pastry, Tapestry, Plaxton, Viceroy, Kademlia,

Our approach: Component-based analysis § Break DHT design into independent components § Analyze impact

Three aspects of a DHT design 1) Geometry: a graph structure that inspires a

Chord Distance function captures Ring 000 111 001 110 010 101 d(100, 111) =

Chord Neighbor and Route selection Algorithms 111 000 110 001 d(000, 001) = 1

One Geometry, Many Algorithms § Algorithm : exact rules for selecting neighbors, routes -

Insight: Geometry => Flexibility => Performance § Geometry captures flexibility in selecting algorithms §

Route selection flexibility allowed by Ring Geometry 111 000 110 001 010 101 011

Neighbor selection flexibility allowed by Ring Geometry 000 111 001 110 010 101 011

Geometries we compare Geometry Algorithm Ring Chord, Symphony Hypercube CAN Tree Plaxton Hybrid =

Metrics for flexibilities § FNS: Flexibility in Neighbor Selection = number of node choices

Flexibility in neighbor selection (FNS) for Tree h=3 h=2 h=1 000 001 010 011

Flexibility in route selection (FRS) for Hypercube 110 d(010, 011) = 3 100 d(010,

Summary of our flexibility analysis Flexibility Ordering of Geometries Neighbors Hypercube << Tree, XOR,

Analysis of Static Resilience Two aspects of robust routing § Dynamic Recovery : how

Does flexibility affect Static Resilience? Tree << XOR ≈ Hybrid < Hypercube < Ring

Analysis of Overlay Path Latency § Goal: Minimize end-to-end overlay path latency - not

Which is more effective, FNS or FRS? Plain << FRS << FNS ≈ FNS+FRS

Does Geometry affect performance of FNS or FRS? No, performance of FNS/FRS is independent

Summary of results § FRS matters for Static Resilience - Ring has the best

Conclusions § Routing Geometry is a fundamental design choice - Geometry determines flexibility -

Slides: 73

Download presentation

CS 268: DHTs Scott Shenker and Ion Stoica April 11, 2005 1

How Did it Start? § A killer application: Naptser - Free music over the Internet § Key idea: share the content, storage and bandwidth of individual (home) users Internet 2

Model § § Each user stores a subset of files Each user has access (can download) files from all users in the system 3

Main Challenge § Find where a particular file is stored E F D E? A C B 4

Other Challenges § § Scale: up to hundred of thousands or millions of machines Dynamicity: machines can come and go any time 5

Napster § § Assume a centralized index system that maps files (songs) to machines that are alive How to find a file (song) - Query the index system return a machine that stores the required file • Ideally this is the closest/least-loaded machine - ftp the file § Advantages: - Simplicity, easy to implement sophisticated search engines on top of the index system § Disadvantages: - Robustness, scalability (? ) 6

Napster: Example m 5 E m 6 F E? E E? m 5 m 1 m 2 m 3 m 4 m 5 m 6 m 4 C A m 1 D A B C D E F B m 3 m 2 7

Gnutella § § § Distribute file location Idea: flood the request Hot to find a file: - Send request to all neighbors - Neighbors recursively multicast the request - Eventually a machine that has the file receives the request, and it sends back the answer § Advantages: - Totally decentralized, highly robust § Disadvantages: - Not scalable; the entire network can be swamped with request (to alleviate this problem, each request has a TTL) 8

Gnutella § § § Ad-hoc topology Queries are flooded for bounded number of hops No guarantees on recall xyz Query: “xyz” 9

Distributed Hash Tables (DHTs) § Abstraction: a distributed hash-table data structure - insert(id, item); - item = query(id); (or lookup(id); ) - Note: item can be anything: a data object, document, file, pointer to a file… § Proposals - CAN, Chord, Kademlia, Pastry, Tapestry, etc 10

DHT Design Goals § § § Make sure that an item (file) identified is always found Scales to hundreds of thousands of nodes Handles rapid arrival and failure of nodes 11

Structured Networks § § Distributed Hash Tables (DHTs) Hash table interface: put(key, item), get(key) O(log n) hops Guarantees on recall (K , I ) K I 1 1 K I K I I 1 K I put(K 1, I 1) K I get (K 1) K I 12

Content Addressable Network (CAN) § § Associate to each node and item a unique id in an ddimensional Cartesian space on a d-torus Properties - Routing table size O(d) - Guarantees that a file is found in at most d*n 1/d steps, where n is the total number of nodes 13

CAN Example: Two Dimensional Space § § Space divided between nodes All nodes cover the entire space Each node covers either a square or a rectangular area of ratios 1: 2 or 2: 1 Example: - Node n 1: (1, 2) first node that joins cover the entire space 7 6 5 4 3 n 1 2 1 0 0 1 2 3 4 5 6 14 7

CAN Example: Two Dimensional Space § Node n 2: (4, 2) joins space is divided between n 1 and n 2 7 6 5 4 3 n 2 n 1 2 1 0 0 1 2 3 4 5 6 15 7

CAN Example: Two Dimensional Space § Node n 2: (4, 2) joins space is divided between n 1 and n 2 7 6 n 3 5 4 3 n 2 n 1 2 1 0 0 1 2 3 4 5 6 16 7

CAN Example: Two Dimensional Space § Nodes n 4: (5, 5) and n 5: (6, 6) join 7 6 n 4 n 3 5 n 5 4 3 n 2 n 1 2 1 0 0 1 2 3 4 5 6 17 7

CAN Example: Two Dimensional Space § § Nodes: n 1: (1, 2); n 2: (4, 2); n 3: (3, 5); n 4: (5, 5); n 5: (6, 6) Items: f 1: (2, 3); f 2: (5, 1); f 3: (2, 1); f 4: (7, 5); 7 6 n 5 n 4 n 3 5 f 4 4 f 1 3 n 2 n 1 2 f 3 1 f 2 0 0 1 2 3 4 5 6 18 7

CAN Example: Two Dimensional Space § Each item is stored by the node who owns its mapping in the space 7 6 n 4 n 3 5 n 5 f 4 4 f 1 3 n 2 n 1 2 f 3 1 f 2 0 0 1 2 3 4 5 6 19 7

CAN: Query Example § § Each node knows its neighbors in the d-space Forward query to the neighbor that is closest to the query id Example: assume n 1 queries f 4 Can route around some failures 7 6 n 4 n 3 5 n 5 f 4 4 f 1 3 n 2 n 1 2 f 3 1 f 2 0 0 1 2 3 4 5 6 20 7

CAN: Node Joining I new node 1) Discover some node “I” already in CAN 21

CAN: Node Joining (x, y) I new node 2) Pick random point in space 22

CAN: Node Joining (x, y) J I new node 3) I routes to (x, y), discovers node J 23

CAN: Node Joining J new 4) split J’s zone in half… new node owns one half 24

Node departure § Node explicitly hands over its zone and the associated (key, value) database to one of its neighbors § Incase of network failure this is handled by a take-over algorithm § Problem : take over mechanism does not provide regeneration of data § Solution: every node has a backup of its neighbours 25

Chord § § Associate to each node and item a unique id in an unidimensional space 0. . 2 m-1 Goals - Scales to hundreds of thousands of nodes - Handles rapid arrival and failure of nodes § Key design decision - Decouple correctness from efficiency § Properties - Routing table size O(log(N)) , where N is the total number of nodes - Guarantees that a file is found in O(log(N)) steps 26

Identifier to Node Mapping Example § § § Node 8 maps [5, 8] Node 15 maps [9, 15] Node 20 maps [16, 20] … Node 4 maps [59, 4] 4 58 8 15 § Each node maintains a pointer to its successor 44 20 35 32 27

Lookup § § lookup(37) Each node maintains its successor Route packet (ID, data) to the node responsible for ID using successor pointers 4 58 8 node=44 15 44 20 35 32 28

Joining Operation § § Each node A periodically sends a stabilize() message to its successor B Upon receiving a stabilize() message node B - returns its predecessor B’=pred(B) to A by sending a notify(B’) message § Upon receiving notify(B’) from B, - if B’ is between A and B, A updates its successor to B’ - A doesn’t do anything, otherwise 29

Joining Operation § § Node with id=50 joins the ring Node 50 needs to know at least one node already in the system succ=4 pred=44 4 58 8 - Assume known node is 15 succ=nil pred=nil 15 50 succ=58 pred=35 44 20 35 32 30

Joining Operation § § § succ=4 pred=44 Node 50: send join(50) to node 15 Node 44: returns node 58 Node 50 updates its successor to 58 4 58 8 join(50) succ=nil succ=58 pred=nil 50 15 58 succ=58 pred=35 44 20 35 32 31

Joining Operation - update predecessor to 50 send notify() back 0) 4 ed =5 58 8 (pr § succ=4 pred=50 pred=44 Node 50: send stabilize() to node 58 Node 58: no tify § stabilize() succ=58 pred=nil 15 50 succ=58 pred=35 44 20 35 32 32

Joining Operation (cont’d) § 4 8 =50 ) 58 red ify( p § Node 44 sends a stabilize message to its successor, node 58 Node 58 reply with a notify message Node 44 updates its successor to 50 succ=58 not § succ=4 pred=50 stabilize() pred=nil 15 50 succ=58 succ=50 pred=35 44 20 35 32 33

Joining Operation (cont’d) § § Node 44 sends a stabilize message to its new successor, node 50 Node 50 sets its predecessor to node 44 succ=58 pred=44 pred=nil succ=4 pred=50 4 58 8 15 Stabilize() 50 succ=50 pred=35 44 20 35 32 34

Joining Operation (cont’d) § This completes the joining operation! pred=50 4 58 succ=58 pred=44 succ=50 8 50 15 44 20 35 32 35

Achieving Efficiency: finger tables Finger Table at 80 i 0 1 2 3 4 5 6 ft[i] 96 96 96 112 20 Say m=7 0 80 + 25 (80 + 26) mod 27 = 16 112 20 96 80 + 24 80 + 23 80 + 22 80 + 21 80 + 20 32 80 45 ith entry at peer with id n is first peer with id >= 36

Achieving Robustness § § § To improve robustness each node maintains the k (> 1) immediate successors instead of only one successor In the notify() message, node A can send its k-1 successors to its predecessor B Upon receiving notify() message, B can update its successor list by concatenating the successor list received from A with A itself 37

CAN/Chord Optimizations § Reduce latency - Chose finger that reduces expected time to reach destination - Chose the closest node from range [N+2 i-1, N+2 i) as successor § Accommodate heterogeneous systems - Multiple virtual nodes per physical node 38

Conclusions § § § Distributed Hash Tables are a key component of scalable and robust overlay networks CAN: O(d) state, O(d*n 1/d) distance Chord: O(log n) state, O(log n) distance Both can achieve stretch < 2 Simplicity is key Services built on top of distributed hash tables - persistent storage (Open. DHT, Oceanstore) - p 2 p file storage, i 3 (chord) - multicast (CAN, Tapestry) 39

Two Other Papers Ø § Boon Loo Tau et al, “Enhancing P 2 P File-Sharing with an Internet-Scale Query Processor”, VLDB’ 04 Krishna Gummadi et al, “The Impact of DHT Routing Geometry on Resilience and Proximity”, SIGCOMM’ 03 40

Goal § § P 2 P Search - What is the best design? Target Environment - Typical P 2 P file-sharing environments. - Items stored in P 2 P network, queried by keywords - Replicas of items follow a long-tailed distribution • Popular items at head of distribution • Rare items at tail of distribution 41

Keyword Search using DHTs § Inverted Lists hashed by keyword (term) in the DHT K I (h(T 2), {I 2, I 3}) (h(T 1), {I 1, I 2}) K I (h(T 3), {I 4, I 5}) K I {I 1, I 2} K I K I {I 2} K I Query: “T 1 AND T 2” K I 42

Solution Space & Proposal Flood-based Network (Gnutella) (All items) Flood Query Very Few Results? DHT (Partial Index on Rare Items) DHT Query More Results Simpler alternative to either optimizing unstructured or structured networks. 43

Gnutella Measurements § “Quality” of Searches: - Recall (% of all relevant items retrieved) - Distinct Recall (% of all relevant distinct items retrieved) - Response Time (Latency) to 1 st result § Computed Query Recall - Each query issued simultaneously from 30 ultrapeers - Union of results from 30 ultrapeers • “Union-of-30” is our approximation of perfect answer 44

Result Size CDF Flooding is good enough. Single Query “Union-of-30” Query Our Focus 45

Query Latency Few Results Many results Queries that return few results have poor response times 46

Hybrid Search Hybrid = “Best of both worlds” Flood-based Network (Search Popular Items) Flood Query Very Few Results? DHT (Search Rare Items) DHT Query More Results 47

Challenges § Identifying Rare Items - Query Results Size (QRS) • Publish items from previous queries that return few results - Term Frequency Statistics • Single Term (TF) • Term Pairs (TPF) - Sample items on neighboring nodes (SAM) § Network Churn (IPTPS ’ 04) - Use Ultrapeers as DHT nodes - Avoid publishing items from short-lived nodes 48

Identifying Rare Items (%) • Upper-bound: Perfect scheme with complete knowledge of network • Lower-bound: Random scheme • Diminishing returns as more items are published • All schemes in between Random and Perfect 49

Results § Improved Response Time: - PIER (DHT) returns first result in 10 seconds • 40 seconds in aggregate including 30 seconds timeout - Gnutella (Flood) queries returns first result in 65 seconds - 25 seconds (38%) reduction in latency § Improved Recall Analysis: - 18% reduction in #queries with empty results • Using the naïve QRS rare-item selection scheme - Opportunity to do far better • Recall: 66% potential reduction based on “Union-of-30” • Approaches: larger-scale deployments and using better rare-item schemes. 50

Two Other Papers § Ø Boon Loo Tau et al, “Enhancing P 2 P File-Sharing with an Internet-Scale Query Processor”, VLDB’ 04 Krishna Gummadi et al, “The Impact of DHT Routing Geometry on Resilience and Proximity”, SIGCOMM’ 03 51

Motivation § New DHTs constantly proposed - CAN, Chord, Pastry, Tapestry, Plaxton, Viceroy, Kademlia, Skipnet, Symphony, Koorde, Apocrypha, Land, ORDI … § Each is extensively analyzed but in isolation § Each DHT has many algorithmic details making it difficult to compare a) Goals: Separate fundamental design choices from algorithmic details b) Understand their affect reliability and efficiency 52

Our approach: Component-based analysis § Break DHT design into independent components § Analyze impact of each component choice separately - compare with black-box analysis: • benchmark each DHT implementation • rankings of existing DHTs vs. hints on better designs § Two types of components - Routing-level : neighbor & route selection - System-level : caching, replication, querying policy etc. 53

Three aspects of a DHT design 1) Geometry: a graph structure that inspires a DHT design, with its exciting properties - 2) Distance function: captures a geometric structure - 3) Tree, Hypercube, Ring, Butterfly, Debruijn d(id 1, id 2) for any two node identifiers Algorithm: rules for selecting neighbors and routes using the distance function 54

Chord DHT has Ring Geometry 55

Chord Distance function captures Ring 000 111 001 110 010 101 d(100, 111) = 3 011 100 § Nodes are points on a clock-wise Ring § d(id 1, id 2) = length of clock-wise arc between ids = (id 2 – id 1) mod N 56

Chord Neighbor and Route selection Algorithms 111 000 110 001 d(000, 001) = 1 010 101 d(000, 010) = 2 011 100 d(000, 001) = 4 § Neighbor selection: ith neighbor at 2 i distance § Route selection: pick neighbor closest to destination 57

One Geometry, Many Algorithms § Algorithm : exact rules for selecting neighbors, routes - Chord, CAN, PRR, Tapestry, Pastry etc. - inspired by geometric structures like Ring, Hyper-cube, Tree § Geometry : an algorithm’s underlying structure - Distance function is the formal representation of Geometry - Chord, Symphony => Ring - many algorithms can have same geometry Why is Geometry important? 58

Insight: Geometry => Flexibility => Performance § Geometry captures flexibility in selecting algorithms § Flexibility is important for routing performance - Flexibility in selecting routes leads to shorter, reliable paths - Flexibility in selecting neighbors leads to shorter paths 59

Route selection flexibility allowed by Ring Geometry 111 000 110 001 010 101 011 100 § Chord algorithm picks neighbor closest to destination § A different algorithm picks the best of alternate paths 60

Neighbor selection flexibility allowed by Ring Geometry 000 111 001 110 010 101 011 100 § Chord algorithm picks ith neighbor at 2 i distance § A different algorithm picks ith neighbor from [2 i , 2 i+1) 61

Geometries we compare Geometry Algorithm Ring Chord, Symphony Hypercube CAN Tree Plaxton Hybrid = Tree + Ring XOR d(id 1, id 2) = id 1 XOR id 2 Tapestry, Pastry Kademlia 62

Metrics for flexibilities § FNS: Flexibility in Neighbor Selection = number of node choices for a neighbor § FRS: Flexibility in Route Selection = avg. number of next-hop choices for all destinations § Constraints for neighbors and routes - select neighbors to have paths of O(log. N) - select routes so that each hop is closer to destination 63

Flexibility in neighbor selection (FNS) for Tree h=3 h=2 h=1 000 001 010 011 100 101 110 111 § log. N neighbors in sub-trees of varying heights § FNS = 2 i-1 for ith neighbor of a node 64

Flexibility in route selection (FRS) for Hypercube 110 d(010, 011) = 3 100 d(010, 011) = 1 d(000, 011) = 2 000 111 101 010 011 001 d(001, 011) = 1 § Routing to next hop fixes one bit § FRS =Avg. (#bits destination differs in)=log. N/2 65

Summary of our flexibility analysis Flexibility Ordering of Geometries Neighbors Hypercube << Tree, XOR, Ring, Hybrid (FNS) Routes (FRS) (log. N) (2 i-1) Tree << XOR, Hybrid < Hypercube < Ring (1) (log. N/2) (log. N) How relevant is flexibility for DHT routing performance? 66

Analysis of Static Resilience Two aspects of robust routing § Dynamic Recovery : how quickly routing state is recovered after failures § Static Resilience : how well the network routes before recovery finishes - captures how quickly recovery algorithms need to work - depends on FRS Evaluation: § Fail a fraction of nodes, without recovering any state § Metric: % Paths Failed 67

Does flexibility affect Static Resilience? Tree << XOR ≈ Hybrid < Hypercube < Ring Flexibility in Route Selection matters for Static Resilience 68

Analysis of Overlay Path Latency § Goal: Minimize end-to-end overlay path latency - not just the number of hops § Both FNS and FRS can reduce latency - Tree has FNS, Hypercube has FRS, Ring & XOR have both Evaluation: § Using Internet latency distributions (see paper) 69

Which is more effective, FNS or FRS? Plain << FRS << FNS ≈ FNS+FRS Neighbor Selection is much better than Route Selection 70

Does Geometry affect performance of FNS or FRS? No, performance of FNS/FRS is independent of Geometry A Geometry’s support for neighbor selection is crucial 71

Summary of results § FRS matters for Static Resilience - Ring has the best resilience § Both FNS and FRS reduce Overlay Path Latency § But, FNS is far more important than FRS - Ring, Hybrid, Tree and XOR have high FNS 72

Conclusions § Routing Geometry is a fundamental design choice - Geometry determines flexibility - Flexibility improves resilience and proximity § Ring has the greatest flexibility 73