Chord Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber Google, Inc. OSDI 2006

Introduction
• Dynamo stores objects associated with a key through a simple interface:
  ◦ get(), put()
• It should be possible to scale Dynamo incrementally
• This requires the ability to partition data over the set of nodes (storage hosts)
• Dynamo relies on a concept called consistent hashing
  ◦ The approach used is similar to that found in Chord.

Distributed Hash Tables (DHT)
• Operationally like standard hash tables
• Stores (key, value) pairs
  ◦ The key is like a filename
  ◦ The value can be file contents or a pointer to a location
• Goal: efficiently insert/lookup/delete (key, value) pairs
• Each peer stores a subset of the (key, value) pairs in the system

DHT
• Core operation: find the node responsible for a key
  ◦ Map key to node
  ◦ Efficiently route insert/lookup/delete requests to this node
• Allow for frequent node arrivals and departures

DHT
• Introduce a hash function to map the object being searched for to a unique global identifier:
  ◦ e.g., h(“NGC'02 Tutorial Notes”) → 8045
• Distribute the range of the hash function among all nodes in the network
[Figure: nodes covering the ranges 0-999, 1000-1999, 1500-4999, 4500-6999, 7000-8500, 8000-8999, 9000-9500 and 9500-9999; identifier 8045 falls within 7000-8500 and 8000-8999]
• Each node must “know about” at least one copy of each object that hashes within its range (when one exists)

DHT: Desirable Properties
• Key ID space (search space) is uniformly populated
  ◦ Mapping of keys to IDs using (consistent) hashing
• A node is responsible for indexing all the keys in a certain subspace of the ID space
• Nodes have only partial knowledge of other nodes' responsibilities
• Messages should be routed to a node efficiently (small number of hops)
• Node arrival/departure should only affect a few nodes.

Consistent Hashing
• The main idea: map both keys and nodes (node IPs) to the same (metric) ID space
• The ring is just one possibility; any metric space will do

Consistent Hashing
• With high probability, the hash function balances load (all nodes receive roughly the same number of keys).
• With high probability, when the Nth node joins (or leaves) the network, only an O(1/N) fraction of the keys are moved to a different location.
  ◦ This is clearly the minimum necessary to maintain a balanced load.
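
The key-movement claim above can be checked with a small experiment. This is only a sketch: the node and key names, the 32-bit ID space and the helper names h() and owner() are assumptions made for illustration, not part of the slides.

```python
import hashlib
from bisect import bisect_left

def h(name, m=32):
    """Hash a string to an m-bit identifier (SHA-1 reduced modulo 2^m)."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big") % (2 ** m)

def owner(key_id, ring):
    """First node ID at or after key_id on the sorted ring, wrapping around."""
    i = bisect_left(ring, key_id)
    return ring[i % len(ring)]

# Hypothetical population: 50 nodes and 10,000 keys.
nodes = sorted(h(f"node-{i}") for i in range(50))
keys = [h(f"key-{i}") for i in range(10_000)]
before = {k: owner(k, nodes) for k in keys}

# One more node joins; recompute ownership.
new_node = h("node-new")
nodes_after = sorted(nodes + [new_node])
after = {k: owner(k, nodes_after) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"fraction of keys moved: {len(moved) / len(keys):.4f}")
```

Every key that changes owner moves to the new node, and no key moves between two existing nodes, which is why the expected fraction moved is on the order of 1/N.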

Consistent Hashing
• The consistent hash function assigns each node and key an m-bit identifier using SHA-1 as a base hash function.
• A node's identifier is chosen by hashing the node's IP address.
• A key identifier is produced by hashing the key.
• For more info see:
  ◦ D. R. Karger, E. Lehman, F. Leighton, M. Levine, D. Lewin, and R. Panigrahy, “Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web,” in Proc. 29th ACM Symp. Theory of Computing, El Paso, TX, May 1997, pp. 654–663.
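
A minimal sketch of the identifier assignment described above. The function name chord_id, the choice m = 6 and the reduction of the SHA-1 digest modulo 2^m are assumptions for illustration; the slides only say that SHA-1 is the base hash function.

```python
import hashlib

M = 6  # number of identifier bits; kept small so IDs are easy to read (assumption)

def chord_id(name: str, m: int = M) -> int:
    """Map a key or a node address to an m-bit identifier via SHA-1."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    # Assumption: keep the low-order m bits of the 160-bit digest.
    return int.from_bytes(digest, "big") % (2 ** m)

node_id = chord_id("129.100.16.93")  # node identifier from its IP address
key_id = chord_id("LetItBe")         # key identifier from the key itself
print(node_id, key_id)
```

Because keys and nodes go through the same function, both land in the same ID space, which is what lets a key be assigned to a nearby node.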

P2P Middleware: Differences
• Different P2P middlewares differ in:
  ◦ The choice of the ID space
  ◦ The structure of their network of nodes (i.e., how each node chooses its neighbors)
  ◦ For each object, node(s) whose range(s) cover that object must be reachable via a “short” path
• This is a major research topic

Chord
• m-bit identifier space for both keys and nodes
• Key identifier = SHA-1(key)
  ◦ Key = “LetItBe” → SHA-1 → ID=50
• Node identifier = SHA-1(IP address)
  ◦ IP = “129.100.16.93” → SHA-1 → ID=70
• How do we assign keys to nodes?

Chord
• Nodes are organized in an identifier circle based on node identifiers
• Keys are assigned to their successor node in the identifier circle, i.e., the first node whose ID is equal to or follows the key's ID.

Chord
• Hash function ensures even distribution of nodes and keys on the circle
• Range covered by a node is from the previous node's ID (exclusive) up to its own ID (inclusive)
• Assume an N-node network
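
The successor rule can be sketched in a few lines. The ring below (m = 6, ten nodes) is an assumed example chosen to be consistent with the node IDs used in the join example later in the deck; the function name successor is illustrative.

```python
from bisect import bisect_left

M = 6                                            # identifier bits (assumed)
RING = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]    # sorted node IDs (assumed example)

def successor(ident, ring=RING, m=M):
    """Return the first node whose ID is >= ident, wrapping past 2^m - 1 to the lowest ID."""
    ident %= 2 ** m
    i = bisect_left(ring, ident)
    return ring[i % len(ring)]

for key in (10, 21, 54, 60):
    print(f"key {key} -> N{successor(key)}")
```

Key 60 wraps around the circle and is stored at N1, the node with the lowest ID.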

Chord: Search Possibilities
• Routing table size vs. search cost
• Every peer knows every other peer: O(N) routing table size, O(1) search time
• Every peer knows only its successor: O(1) routing table size, O(N) search time
• The “compromise” is to have each peer know m other peers, at exponentially increasing distances around the ring.
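
To make the O(N) end of the trade-off concrete, here is a sketch of a lookup that follows successor pointers only. The ring and the helper names are assumptions for illustration.

```python
RING = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]  # sorted node IDs (assumed example)

def in_half_open(x, a, b):
    """True if x lies in the ring interval (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b  # interval wraps past zero

def linear_lookup(start, key):
    """Walk successor pointers from start; return (responsible node, hops taken)."""
    i = RING.index(start)
    hops = 0
    while True:
        node = RING[i]
        succ = RING[(i + 1) % len(RING)]
        if in_half_open(key, node, succ):
            return succ, hops
        i = (i + 1) % len(RING)
        hops += 1

print(linear_lookup(8, 54))
```

Starting at N8, the query for key 54 has to pass through every node up to N51 before N56 is identified as responsible, so the cost grows linearly with the number of nodes.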

Finger Table
• Let m be the number of bits in the key/node identifiers
• Each node, n, maintains a routing table with at most m entries, called the finger table.
• The ith entry in the table at node n contains the identity of the first node, s, that succeeds n by at least 2^(i-1) (arithmetic is modulo 2^m).
  ◦ s = successor(n + 2^(i-1))
  ◦ s is called the ith finger of node n

Chord: Finger Table
• Finger table: finger[i] = successor(n + 2^(i-1)), where 1 ≤ i ≤ m
• O(log N) table size
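
The finger table definition translates directly into code. This sketch builds the table from a global view of the ring, which a real node would not have; the ring, m = 6 and the function names are assumptions for illustration.

```python
from bisect import bisect_left

M = 6                                            # identifier bits (assumed)
RING = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]    # sorted node IDs (assumed example)

def successor(ident, ring=RING, m=M):
    """First node whose ID is >= ident (mod 2^m), wrapping around the circle."""
    i = bisect_left(ring, ident % 2 ** m)
    return ring[i % len(ring)]

def finger_table(n, ring=RING, m=M):
    """finger[i] = successor(n + 2^(i-1)) for i = 1..m, returned as a list."""
    return [successor(n + 2 ** (i - 1), ring, m) for i in range(1, m + 1)]

for i, f in enumerate(finger_table(21), start=1):
    print(f"N21+{2 ** (i - 1):<2} -> N{f}")
```

Several entries point at the same node, which is why only O(log N) of the m entries are distinct in practice.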

The Chord algorithm – Scalable node localization

Chord: Search
• Assume node n is searching for key k.
• Node n does the following:
  ◦ Find the ith finger table entry of node n such that k ∈ [finger[i].start, finger[i+1].start), where finger[i].start = (n + 2^(i-1)) mod 2^m
  ◦ If no such entry exists, use the node in the last entry of the finger table
  ◦ The two steps above are repeated at the selected node until the condition in the first step is satisfied.
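
A sketch of finger-based routing. It uses the closest-preceding-finger formulation, a common way to express the same search, rather than the interval test on the slide. The ring, m = 6 and all function names are assumptions, and the finger tables are precomputed from a global view purely for the simulation.

```python
from bisect import bisect_left

M = 6                                            # identifier bits (assumed)
RING = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]    # sorted node IDs (assumed example)

def successor(ident):
    i = bisect_left(RING, ident % 2 ** M)
    return RING[i % len(RING)]

def between(x, a, b):
    """True if x lies in the open ring interval (a, b)."""
    if a < b:
        return a < x < b
    return x > a or x < b  # interval wraps past zero

# Per-node finger tables; FINGERS[n][0] is n's immediate successor.
FINGERS = {n: [successor(n + 2 ** (i - 1)) for i in range(1, M + 1)] for n in RING}

def closest_preceding_finger(n, key):
    """Farthest finger of n that still lies strictly between n and key."""
    for f in reversed(FINGERS[n]):
        if between(f, n, key):
            return f
    return n

def lookup(start, key):
    """Return (node responsible for key, list of nodes the query visited)."""
    n, path = start, [start]
    # Stop once key falls in (n, successor(n)].
    while not (between(key, n, FINGERS[n][0]) or key == FINGERS[n][0]):
        nxt = closest_preceding_finger(n, key)
        if nxt == n:
            break
        n = nxt
        path.append(n)
    return FINGERS[n][0], path

print(lookup(8, 54))
```

From N8, the query for key 54 visits N42 and N51 before N56 is returned, compared with a walk through every intermediate node when only successor pointers are used.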

Chord: Join
• Nodes can join (and leave) at any time.
• Challenge: preserving the ability to locate every key in the network
• Chord must preserve the following:
  ◦ Each node's successor is correctly maintained
  ◦ For every key k, node successor(k) is responsible for k.
• For lookups to be fast, it is desirable for the finger tables to be correct.

Chord: Join Implementation
• Each node in Chord maintains a predecessor pointer.
  ◦ This consists of the Chord ID and IP address of the immediate predecessor of that node.
  ◦ It can be used to walk counterclockwise around the identifier circle.
• The new node to be added learns the identity of an existing Chord node by some external mechanism

Chord: Join Initialization Steps
• Assume n is the node to join.
• Find any existing node, n'.
• Find the successor of n from n'. Label this successor(n).
• Ask successor(n) for its predecessor. This is labelled predecessor(successor(n)).

Chord: Join Example
• Assume N26 wants to join; it finds N8
• N8's finger table suggests that N26 will be “between” N21 and N32.

Chord: Join (Initialize finger table)
• Node n needs to have its finger table initialized
• Node n can ask its predecessor for a copy of its finger table to use as a starting point

Chord: Join (Changing Existing Finger Tables)
• Node n needs to be entered into the finger tables of some existing nodes.
• Node n becomes the ith finger of node p iff:
  ◦ p precedes n by at least 2^(i-1); and
  ◦ the ith finger of node p succeeds n.
• The first node, p, that satisfies these conditions is the immediate predecessor of n - 2^(i-1)
• For a given n, the algorithm starts with the ith finger of node n and then walks counter-clockwise around the identifier circle until it encounters a node whose ith finger precedes n.

Chord: Join Example (add N26)

Finger table of N21:

  Start     Old finger   New finger
  N21+1     N32          N26
  N21+2     N32          N26
  N21+4     N32          N26
  N21+8     N32          (unchanged)
  N21+16    N38          (unchanged)
  N21+32    N56          (unchanged)

• i=1: Does N21 precede N26 by at least 1 (2^(i-1))? Yes: N21+1 becomes N26.
• i=2: Does N21 precede N26 by at least 2? Yes: N21+2 becomes N26.
• i=3: Does N21 precede N26 by at least 4? Yes: N21+4 becomes N26.
• i=4: Does N21 precede N26 by at least 8? No: evaluate N14.

Chord: Join Example (add N26)

Finger table of N14:

  Start     Old finger   New finger
  N14+1     N21          (unchanged)
  N14+2     N21          (unchanged)
  N14+4     N21          (unchanged)
  N14+8     N32          N26
  N14+16    N32          (unchanged)
  N14+32    N48          (unchanged)

• i=4: Does N14 precede N26 by at least 8? Yes: N14+8 becomes N26.
• i=5: Does N14 precede N26 by at least 16? No: evaluate N8.
• Etc.
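
The update rule from the join slides can be checked against a brute-force recomputation. This is a sketch with global knowledge of the ring, not the distributed counter-clockwise walk itself; the ring before the join, m = 6 and the function names are assumptions chosen to match the node IDs in the example.

```python
from bisect import bisect_left

M = 6                                               # identifier bits (assumed)
OLD_RING = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]   # assumed ring before the join
NEW = 26                                            # the joining node, N26

def successor(ident, ring):
    i = bisect_left(ring, ident % 2 ** M)
    return ring[i % len(ring)]

def finger(p, i, ring):
    """ith finger of node p: successor(p + 2^(i-1))."""
    return successor(p + 2 ** (i - 1), ring)

def dist(a, b):
    """Clockwise distance from a to b on the identifier circle."""
    return (b - a) % 2 ** M

def must_update(p, i):
    """p precedes NEW by at least 2^(i-1), and p's current ith finger succeeds NEW."""
    old = finger(p, i, OLD_RING)
    return dist(p, NEW) >= 2 ** (i - 1) and dist(p, old) > dist(p, NEW)

by_rule = {(p, i) for p in OLD_RING for i in range(1, M + 1) if must_update(p, i)}

# Brute force: rebuild every finger on the new ring and see which entries changed.
new_ring = sorted(OLD_RING + [NEW])
by_recompute = {(p, i) for p in OLD_RING for i in range(1, M + 1)
                if finger(p, i, new_ring) != finger(p, i, OLD_RING)}

print(sorted(by_rule))
```

On this ring both approaches select entries 1 to 3 of N21, entry 4 of N14, entry 5 of N8 and entry 6 of N56, which agrees with the walk through N21, N14 and N8 in the example.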

Chord: Join (Transferring Keys)
• Move responsibility for all the keys for which node n is now the successor.
• Typically this involves moving the data associated with each key to the new node.
• Node n can only become the successor for keys that were previously the responsibility of the node immediately following n.
• Node n therefore only needs to contact one node to transfer responsibility for all relevant keys.
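
A sketch of the key hand-off, assuming each node keeps its data in a dictionary keyed by key ID. The stored keys, values and function names are made up for illustration.

```python
def in_half_open(x, a, b):
    """True if x lies in the ring interval (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b  # interval wraps past zero

def transfer_keys(succ_store, pred_id, new_id):
    """Move keys in (pred_id, new_id] out of the successor's store; return them."""
    moved = {k: v for k, v in succ_store.items() if in_half_open(k, pred_id, new_id)}
    for k in moved:
        del succ_store[k]
    return moved

# Hypothetical contents of N32 before N26 joins between N21 and N32.
n32_store = {22: "a", 24: "b", 26: "c", 27: "d", 30: "e"}
n26_store = transfer_keys(n32_store, pred_id=21, new_id=26)
print(n26_store, n32_store)
```

Only N32 is contacted: every key N26 becomes responsible for was stored there before the join.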

Chord: Join
• The previous discussion on join focuses on a single node join.
• What if there are multiple node joins?
• Join requires that each node's successor is correctly maintained

Chord: Stabilization Protocol
• The successor/predecessor links are rebuilt by periodic stabilize notification messages
  ◦ Sent by each node to its successor to inform it of the (possibly new) identity of the predecessor
• The successor pointers are used to verify and correct finger table entries.

Chord: Join/Stabilize Example
• N26 joins the system
• N26 acquires N32 as its successor
• N26 notifies N32
• N32 acquires N26 as its predecessor

Chord: Join/Stabilize Example
• N26 copies keys
• N21 runs stabilize() and asks its successor N32 for its predecessor, which is N26.

Chord: Join/Stabilize Example
• N21 acquires N26 as its successor
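
The join/stabilize sequence can be replayed with two small methods. This is a sketch in a single process with direct object references standing in for network messages; the Node class and the two-node starting ring (N21 and N32 only) are assumptions made for brevity.

```python
def between(x, a, b):
    """True if x lies in the open ring interval (a, b)."""
    if a < b:
        return a < x < b
    return x > a or x < b  # interval wraps past zero

class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self
        self.predecessor = None

    def stabilize(self):
        """Ask the successor for its predecessor; adopt it if it sits between us."""
        x = self.successor.predecessor
        if x is not None and between(x.id, self.id, self.successor.id):
            self.successor = x
        self.successor.notify(self)

    def notify(self, candidate):
        """candidate believes it might be our predecessor."""
        if self.predecessor is None or between(candidate.id, self.predecessor.id, self.id):
            self.predecessor = candidate

n21, n26, n32 = Node(21), Node(26), Node(32)
# Existing ring before the join (assumed to contain only N21 and N32).
n21.successor, n21.predecessor = n32, n32
n32.successor, n32.predecessor = n21, n21

n26.successor = n32   # N26 joins and acquires N32 as its successor
n26.stabilize()       # N26 notifies N32, which adopts N26 as predecessor
n21.stabilize()       # N21 asks N32 for its predecessor and learns about N26

print(n21.successor.id, n26.successor.id, n32.predecessor.id)
```

After the two stabilize() rounds, the ring is N21, N26, N32 in both directions, matching the steps in the example.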

Chord Stabilization
• Pointers and finger tables may be in a state of flux
• Is it possible that data will not be found?
  ◦ Yes
• Recovery: try again

Chord: Node Failure
[Figure: ring with nodes N80, N85, N102, N113 and N120; Lookup(90) at N80]
• N80 doesn't know its correct successor, so the lookup is incorrect

Chord: Node Failure
• Solution: use successor lists
• Each node knows its r immediate successors
• After a failure, a node will still know its first live successor
• Stabilize messages correct finger tables
• Replicas of the data associated with a key, held at the r successor nodes, might be used
  ◦ Application dependent
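
A sketch of how a successor list masks failures. The node IDs come from the failure figure; r = 3, the set of failed nodes and the function names are assumptions for illustration.

```python
R = 3                              # successor-list length r (assumed)
RING = [80, 85, 102, 113, 120]     # node IDs from the failure figure, in ring order

def successor_list(n, r=R):
    """The r nodes that immediately follow n on the ring."""
    i = RING.index(n)
    return [RING[(i + j) % len(RING)] for j in range(1, r + 1)]

def first_live_successor(n, failed):
    """Skip failed entries in n's successor list; raise if all r have failed."""
    for s in successor_list(n):
        if s not in failed:
            return s
    raise LookupError(f"all {R} successors of N{n} have failed")

# Hypothetical failure of N85: N80 falls back to N102, which is responsible for key 90.
print(first_live_successor(80, failed={85}))
```

The lookup only becomes impossible when all r entries fail at once, so a larger r trades extra state for robustness.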

Chord Properties
• In a system with N nodes and K keys, with high probability:
  ◦ each node receives at most (1 + ε)K/N keys
  ◦ each node maintains information about O(log N) other nodes
  ◦ lookups are resolved in O(log N) hops
  ◦ node insertions cost O(log^2 N) messages
• The developers of Chord validated this through simulation studies.
• No consistency among replicas
• Hops have poor network locality

Chord: Network Locality
• Nodes close on the ring can be far apart in the network.
[Figure: ring nodes N20, N40, N41 and N80 mapped onto the underlying network. Figure from http://project-iris.net/talks/dht-toronto-03.ppt]