CS 268 DHTs Scott Shenker and Ion Stoica
- Slides: 73
CS 268: DHTs Scott Shenker and Ion Stoica April 11, 2005 1
How Did it Start? § A killer application: Naptser - Free music over the Internet § Key idea: share the content, storage and bandwidth of individual (home) users Internet 2
Model § § Each user stores a subset of files Each user has access (can download) files from all users in the system 3
Main Challenge § Find where a particular file is stored E F D E? A C B 4
Other Challenges § § Scale: up to hundred of thousands or millions of machines Dynamicity: machines can come and go any time 5
Napster § § Assume a centralized index system that maps files (songs) to machines that are alive How to find a file (song) - Query the index system return a machine that stores the required file • Ideally this is the closest/least-loaded machine - ftp the file § Advantages: - Simplicity, easy to implement sophisticated search engines on top of the index system § Disadvantages: - Robustness, scalability (? ) 6
Napster: Example m 5 E m 6 F E? E E? m 5 m 1 m 2 m 3 m 4 m 5 m 6 m 4 C A m 1 D A B C D E F B m 3 m 2 7
Gnutella § § § Distribute file location Idea: flood the request Hot to find a file: - Send request to all neighbors - Neighbors recursively multicast the request - Eventually a machine that has the file receives the request, and it sends back the answer § Advantages: - Totally decentralized, highly robust § Disadvantages: - Not scalable; the entire network can be swamped with request (to alleviate this problem, each request has a TTL) 8
Gnutella § § § Ad-hoc topology Queries are flooded for bounded number of hops No guarantees on recall xyz Query: “xyz” 9
Distributed Hash Tables (DHTs) § Abstraction: a distributed hash-table data structure - insert(id, item); - item = query(id); (or lookup(id); ) - Note: item can be anything: a data object, document, file, pointer to a file… § Proposals - CAN, Chord, Kademlia, Pastry, Tapestry, etc 10
DHT Design Goals § § § Make sure that an item (file) identified is always found Scales to hundreds of thousands of nodes Handles rapid arrival and failure of nodes 11
Structured Networks § § Distributed Hash Tables (DHTs) Hash table interface: put(key, item), get(key) O(log n) hops Guarantees on recall (K , I ) K I 1 1 K I K I I 1 K I put(K 1, I 1) K I get (K 1) K I 12
Content Addressable Network (CAN) § § Associate to each node and item a unique id in an ddimensional Cartesian space on a d-torus Properties - Routing table size O(d) - Guarantees that a file is found in at most d*n 1/d steps, where n is the total number of nodes 13
CAN Example: Two Dimensional Space § § Space divided between nodes All nodes cover the entire space Each node covers either a square or a rectangular area of ratios 1: 2 or 2: 1 Example: - Node n 1: (1, 2) first node that joins cover the entire space 7 6 5 4 3 n 1 2 1 0 0 1 2 3 4 5 6 14 7
CAN Example: Two Dimensional Space § Node n 2: (4, 2) joins space is divided between n 1 and n 2 7 6 5 4 3 n 2 n 1 2 1 0 0 1 2 3 4 5 6 15 7
CAN Example: Two Dimensional Space § Node n 2: (4, 2) joins space is divided between n 1 and n 2 7 6 n 3 5 4 3 n 2 n 1 2 1 0 0 1 2 3 4 5 6 16 7
CAN Example: Two Dimensional Space § Nodes n 4: (5, 5) and n 5: (6, 6) join 7 6 n 4 n 3 5 n 5 4 3 n 2 n 1 2 1 0 0 1 2 3 4 5 6 17 7
CAN Example: Two Dimensional Space § § Nodes: n 1: (1, 2); n 2: (4, 2); n 3: (3, 5); n 4: (5, 5); n 5: (6, 6) Items: f 1: (2, 3); f 2: (5, 1); f 3: (2, 1); f 4: (7, 5); 7 6 n 5 n 4 n 3 5 f 4 4 f 1 3 n 2 n 1 2 f 3 1 f 2 0 0 1 2 3 4 5 6 18 7
CAN Example: Two Dimensional Space § Each item is stored by the node who owns its mapping in the space 7 6 n 4 n 3 5 n 5 f 4 4 f 1 3 n 2 n 1 2 f 3 1 f 2 0 0 1 2 3 4 5 6 19 7
CAN: Query Example § § Each node knows its neighbors in the d-space Forward query to the neighbor that is closest to the query id Example: assume n 1 queries f 4 Can route around some failures 7 6 n 4 n 3 5 n 5 f 4 4 f 1 3 n 2 n 1 2 f 3 1 f 2 0 0 1 2 3 4 5 6 20 7
CAN: Node Joining I new node 1) Discover some node “I” already in CAN 21
CAN: Node Joining (x, y) I new node 2) Pick random point in space 22
CAN: Node Joining (x, y) J I new node 3) I routes to (x, y), discovers node J 23
CAN: Node Joining J new 4) split J’s zone in half… new node owns one half 24
Node departure § Node explicitly hands over its zone and the associated (key, value) database to one of its neighbors § Incase of network failure this is handled by a take-over algorithm § Problem : take over mechanism does not provide regeneration of data § Solution: every node has a backup of its neighbours 25
Chord § § Associate to each node and item a unique id in an unidimensional space 0. . 2 m-1 Goals - Scales to hundreds of thousands of nodes - Handles rapid arrival and failure of nodes § Key design decision - Decouple correctness from efficiency § Properties - Routing table size O(log(N)) , where N is the total number of nodes - Guarantees that a file is found in O(log(N)) steps 26
Identifier to Node Mapping Example § § § Node 8 maps [5, 8] Node 15 maps [9, 15] Node 20 maps [16, 20] … Node 4 maps [59, 4] 4 58 8 15 § Each node maintains a pointer to its successor 44 20 35 32 27
Lookup § § lookup(37) Each node maintains its successor Route packet (ID, data) to the node responsible for ID using successor pointers 4 58 8 node=44 15 44 20 35 32 28
Joining Operation § § Each node A periodically sends a stabilize() message to its successor B Upon receiving a stabilize() message node B - returns its predecessor B’=pred(B) to A by sending a notify(B’) message § Upon receiving notify(B’) from B, - if B’ is between A and B, A updates its successor to B’ - A doesn’t do anything, otherwise 29
Joining Operation § § Node with id=50 joins the ring Node 50 needs to know at least one node already in the system succ=4 pred=44 4 58 8 - Assume known node is 15 succ=nil pred=nil 15 50 succ=58 pred=35 44 20 35 32 30
Joining Operation § § § succ=4 pred=44 Node 50: send join(50) to node 15 Node 44: returns node 58 Node 50 updates its successor to 58 4 58 8 join(50) succ=nil succ=58 pred=nil 50 15 58 succ=58 pred=35 44 20 35 32 31
Joining Operation - update predecessor to 50 send notify() back 0) 4 ed =5 58 8 (pr § succ=4 pred=50 pred=44 Node 50: send stabilize() to node 58 Node 58: no tify § stabilize() succ=58 pred=nil 15 50 succ=58 pred=35 44 20 35 32 32
Joining Operation (cont’d) § 4 8 =50 ) 58 red ify( p § Node 44 sends a stabilize message to its successor, node 58 Node 58 reply with a notify message Node 44 updates its successor to 50 succ=58 not § succ=4 pred=50 stabilize() pred=nil 15 50 succ=58 succ=50 pred=35 44 20 35 32 33
Joining Operation (cont’d) § § Node 44 sends a stabilize message to its new successor, node 50 Node 50 sets its predecessor to node 44 succ=58 pred=44 pred=nil succ=4 pred=50 4 58 8 15 Stabilize() 50 succ=50 pred=35 44 20 35 32 34
Joining Operation (cont’d) § This completes the joining operation! pred=50 4 58 succ=58 pred=44 succ=50 8 50 15 44 20 35 32 35
Achieving Efficiency: finger tables Finger Table at 80 i 0 1 2 3 4 5 6 ft[i] 96 96 96 112 20 Say m=7 0 80 + 25 (80 + 26) mod 27 = 16 112 20 96 80 + 24 80 + 23 80 + 22 80 + 21 80 + 20 32 80 45 ith entry at peer with id n is first peer with id >= 36
Achieving Robustness § § § To improve robustness each node maintains the k (> 1) immediate successors instead of only one successor In the notify() message, node A can send its k-1 successors to its predecessor B Upon receiving notify() message, B can update its successor list by concatenating the successor list received from A with A itself 37
CAN/Chord Optimizations § Reduce latency - Chose finger that reduces expected time to reach destination - Chose the closest node from range [N+2 i-1, N+2 i) as successor § Accommodate heterogeneous systems - Multiple virtual nodes per physical node 38
Conclusions § § § Distributed Hash Tables are a key component of scalable and robust overlay networks CAN: O(d) state, O(d*n 1/d) distance Chord: O(log n) state, O(log n) distance Both can achieve stretch < 2 Simplicity is key Services built on top of distributed hash tables - persistent storage (Open. DHT, Oceanstore) - p 2 p file storage, i 3 (chord) - multicast (CAN, Tapestry) 39
Two Other Papers Ø § Boon Loo Tau et al, “Enhancing P 2 P File-Sharing with an Internet-Scale Query Processor”, VLDB’ 04 Krishna Gummadi et al, “The Impact of DHT Routing Geometry on Resilience and Proximity”, SIGCOMM’ 03 40
Goal § § P 2 P Search - What is the best design? Target Environment - Typical P 2 P file-sharing environments. - Items stored in P 2 P network, queried by keywords - Replicas of items follow a long-tailed distribution • Popular items at head of distribution • Rare items at tail of distribution 41
Keyword Search using DHTs § Inverted Lists hashed by keyword (term) in the DHT K I (h(T 2), {I 2, I 3}) (h(T 1), {I 1, I 2}) K I (h(T 3), {I 4, I 5}) K I {I 1, I 2} K I K I {I 2} K I Query: “T 1 AND T 2” K I 42
Solution Space & Proposal Flood-based Network (Gnutella) (All items) Flood Query Very Few Results? DHT (Partial Index on Rare Items) DHT Query More Results Simpler alternative to either optimizing unstructured or structured networks. 43
Gnutella Measurements § “Quality” of Searches: - Recall (% of all relevant items retrieved) - Distinct Recall (% of all relevant distinct items retrieved) - Response Time (Latency) to 1 st result § Computed Query Recall - Each query issued simultaneously from 30 ultrapeers - Union of results from 30 ultrapeers • “Union-of-30” is our approximation of perfect answer 44
Result Size CDF Flooding is good enough. Single Query “Union-of-30” Query Our Focus 45
Query Latency Few Results Many results Queries that return few results have poor response times 46
Hybrid Search Hybrid = “Best of both worlds” Flood-based Network (Search Popular Items) Flood Query Very Few Results? DHT (Search Rare Items) DHT Query More Results 47
Challenges § Identifying Rare Items - Query Results Size (QRS) • Publish items from previous queries that return few results - Term Frequency Statistics • Single Term (TF) • Term Pairs (TPF) - Sample items on neighboring nodes (SAM) § Network Churn (IPTPS ’ 04) - Use Ultrapeers as DHT nodes - Avoid publishing items from short-lived nodes 48
Identifying Rare Items (%) • Upper-bound: Perfect scheme with complete knowledge of network • Lower-bound: Random scheme • Diminishing returns as more items are published • All schemes in between Random and Perfect 49
Results § Improved Response Time: - PIER (DHT) returns first result in 10 seconds • 40 seconds in aggregate including 30 seconds timeout - Gnutella (Flood) queries returns first result in 65 seconds - 25 seconds (38%) reduction in latency § Improved Recall Analysis: - 18% reduction in #queries with empty results • Using the naïve QRS rare-item selection scheme - Opportunity to do far better • Recall: 66% potential reduction based on “Union-of-30” • Approaches: larger-scale deployments and using better rare-item schemes. 50
Two Other Papers § Ø Boon Loo Tau et al, “Enhancing P 2 P File-Sharing with an Internet-Scale Query Processor”, VLDB’ 04 Krishna Gummadi et al, “The Impact of DHT Routing Geometry on Resilience and Proximity”, SIGCOMM’ 03 51
Motivation § New DHTs constantly proposed - CAN, Chord, Pastry, Tapestry, Plaxton, Viceroy, Kademlia, Skipnet, Symphony, Koorde, Apocrypha, Land, ORDI … § Each is extensively analyzed but in isolation § Each DHT has many algorithmic details making it difficult to compare a) Goals: Separate fundamental design choices from algorithmic details b) Understand their affect reliability and efficiency 52
Our approach: Component-based analysis § Break DHT design into independent components § Analyze impact of each component choice separately - compare with black-box analysis: • benchmark each DHT implementation • rankings of existing DHTs vs. hints on better designs § Two types of components - Routing-level : neighbor & route selection - System-level : caching, replication, querying policy etc. 53
Three aspects of a DHT design 1) Geometry: a graph structure that inspires a DHT design, with its exciting properties - 2) Distance function: captures a geometric structure - 3) Tree, Hypercube, Ring, Butterfly, Debruijn d(id 1, id 2) for any two node identifiers Algorithm: rules for selecting neighbors and routes using the distance function 54
Chord DHT has Ring Geometry 55
Chord Distance function captures Ring 000 111 001 110 010 101 d(100, 111) = 3 011 100 § Nodes are points on a clock-wise Ring § d(id 1, id 2) = length of clock-wise arc between ids = (id 2 – id 1) mod N 56
Chord Neighbor and Route selection Algorithms 111 000 110 001 d(000, 001) = 1 010 101 d(000, 010) = 2 011 100 d(000, 001) = 4 § Neighbor selection: ith neighbor at 2 i distance § Route selection: pick neighbor closest to destination 57
One Geometry, Many Algorithms § Algorithm : exact rules for selecting neighbors, routes - Chord, CAN, PRR, Tapestry, Pastry etc. - inspired by geometric structures like Ring, Hyper-cube, Tree § Geometry : an algorithm’s underlying structure - Distance function is the formal representation of Geometry - Chord, Symphony => Ring - many algorithms can have same geometry Why is Geometry important? 58
Insight: Geometry => Flexibility => Performance § Geometry captures flexibility in selecting algorithms § Flexibility is important for routing performance - Flexibility in selecting routes leads to shorter, reliable paths - Flexibility in selecting neighbors leads to shorter paths 59
Route selection flexibility allowed by Ring Geometry 111 000 110 001 010 101 011 100 § Chord algorithm picks neighbor closest to destination § A different algorithm picks the best of alternate paths 60
Neighbor selection flexibility allowed by Ring Geometry 000 111 001 110 010 101 011 100 § Chord algorithm picks ith neighbor at 2 i distance § A different algorithm picks ith neighbor from [2 i , 2 i+1) 61
Geometries we compare Geometry Algorithm Ring Chord, Symphony Hypercube CAN Tree Plaxton Hybrid = Tree + Ring XOR d(id 1, id 2) = id 1 XOR id 2 Tapestry, Pastry Kademlia 62
Metrics for flexibilities § FNS: Flexibility in Neighbor Selection = number of node choices for a neighbor § FRS: Flexibility in Route Selection = avg. number of next-hop choices for all destinations § Constraints for neighbors and routes - select neighbors to have paths of O(log. N) - select routes so that each hop is closer to destination 63
Flexibility in neighbor selection (FNS) for Tree h=3 h=2 h=1 000 001 010 011 100 101 110 111 § log. N neighbors in sub-trees of varying heights § FNS = 2 i-1 for ith neighbor of a node 64
Flexibility in route selection (FRS) for Hypercube 110 d(010, 011) = 3 100 d(010, 011) = 1 d(000, 011) = 2 000 111 101 010 011 001 d(001, 011) = 1 § Routing to next hop fixes one bit § FRS =Avg. (#bits destination differs in)=log. N/2 65
Summary of our flexibility analysis Flexibility Ordering of Geometries Neighbors Hypercube << Tree, XOR, Ring, Hybrid (FNS) Routes (FRS) (log. N) (2 i-1) Tree << XOR, Hybrid < Hypercube < Ring (1) (log. N/2) (log. N) How relevant is flexibility for DHT routing performance? 66
Analysis of Static Resilience Two aspects of robust routing § Dynamic Recovery : how quickly routing state is recovered after failures § Static Resilience : how well the network routes before recovery finishes - captures how quickly recovery algorithms need to work - depends on FRS Evaluation: § Fail a fraction of nodes, without recovering any state § Metric: % Paths Failed 67
Does flexibility affect Static Resilience? Tree << XOR ≈ Hybrid < Hypercube < Ring Flexibility in Route Selection matters for Static Resilience 68
Analysis of Overlay Path Latency § Goal: Minimize end-to-end overlay path latency - not just the number of hops § Both FNS and FRS can reduce latency - Tree has FNS, Hypercube has FRS, Ring & XOR have both Evaluation: § Using Internet latency distributions (see paper) 69
Which is more effective, FNS or FRS? Plain << FRS << FNS ≈ FNS+FRS Neighbor Selection is much better than Route Selection 70
Does Geometry affect performance of FNS or FRS? No, performance of FNS/FRS is independent of Geometry A Geometry’s support for neighbor selection is crucial 71
Summary of results § FRS matters for Static Resilience - Ring has the best resilience § Both FNS and FRS reduce Overlay Path Latency § But, FNS is far more important than FRS - Ring, Hybrid, Tree and XOR have high FNS 72
Conclusions § Routing Geometry is a fundamental design choice - Geometry determines flexibility - Flexibility improves resilience and proximity § Ring has the greatest flexibility 73
- Scott shenker
- Ion stoica berkeley
- Ion stoica
- Anat shenker-osorio
- Beatrice stoica
- Monica stoica
- Emilia stoica
- Adrian stoica jpl
- Procuror emilia ion
- C6h12 fuerza intermolecular
- Qumica
- Uniones intramoleculares
- London dispersion forces diagram
- Cs 268
- Suipacha 268
- 268 rounded to the nearest hundred
- Iso tc 268
- Opwekking 268
- Cs 268
- Iso/tc 268
- Ps 135
- Px 268
- Odo268 swiss medical
- 268/33
- Protein chromatography
- Principle of mass spectrophotometer
- Formula?
- Solubility and complex ion equilibria
- Atom molecule and ion
- Affinity chromatography principle
- Common ion solubility
- Are all ionic compounds soluble in water
- Common ion effect example
- Common ion effect example
- Solubility product and common ion effect
- Neg ion combination pump
- Gorry and scott morton grid
- Monatomic ions
- Significance of urea
- Common ion effect example
- Criss cross method of aluminium chloride
- Crisscross method
- Tin iv sulfate formula
- Carbon dioxide ion
- Tcs internal certification
- Sputter ion pump working principle
- What are spectator ions
- Common ion effect example
- How is positive ion formed
- Ion trapping definition
- What is ion trapping in pharmacology
- Aktivlik koeffitsienti formulasi
- Reticulo prefix
- Common ion effect explained
- Ion 8800
- Boundary potential definition
- Nitrite nitrate nitride
- When adding a subscript to a polyatomic ion you should
- What is a polyatomic ion
- What is a polyatomic ion definition
- An ion source is producing 6li ions
- First pass effect drugs
- Atom size trend periodic table
- Carbonium ion rearrangement
- Aluminum ion symbol
- Hs- ion name
- Net ionic equation definition
- Ionic bonding worksheet answers
- P2o5 compound name
- Applications of common ion effect
- Monatomic ions
- Fintonal
- Long medical words
- Ligand and voltage gated channels