Chord Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber Google, Inc. OSDI 2006

Introduction
• Dynamo stores objects associated with a key through a simple interface:
  - get(), put()
• It should be possible to scale Dynamo incrementally
• This requires the ability to partition data over the set of nodes (storage hosts)
• Dynamo relies on a concept called consistent hashing
  - The approach they used is similar to that found in Chord.

Distributed Hash Tables (DHT)
• Operationally like standard hash tables
• Stores (key, value) pairs
  - The key is like a filename
  - The value can be file contents or a pointer to a location
• Goal: Efficiently insert/lookup/delete (key, value) pairs
• Each peer stores a subset of the (key, value) pairs in the system

DHT
• Core operation: Find the node responsible for a key
  - Map key to node
  - Efficiently route insert/lookup/delete requests to this node
• Allow for frequent node arrivals and departures

DHT
• Introduce a hash function to map the object being searched for to a unique global identifier:
  - e.g., h(“NGC’02 Tutorial Notes”) → 8045
• Distribute the range of the hash function among all nodes in the network
  [Figure: nodes labelled with the hash ranges 0-999, 1000-1999, 1500-4999, 4500-6999, 7000-8500, 8000-8999, 9000-9500 and 9500-9999; identifier 8045 is marked]
• Each node must “know about” at least one copy of each object that hashes within its range (when one exists)

DHT: Desirable Properties
• Key ID space (search space) is uniformly populated
  - Mapping of keys to IDs using (consistent) hashing
• A node is responsible for indexing all the keys in a certain subspace of the ID space
• Nodes have only partial knowledge of other nodes’ responsibilities
• Messages should be routed to a node efficiently (small number of hops)
• Node arrival/departure should only affect a few nodes.

Consistent Hashing
• The main idea: map both keys and nodes (node IPs) to the same (metric) ID space
• The ring is just one possibility; any metric space will do.

Consistent Hashing
• With high probability, the hash function balances load (all nodes receive roughly the same number of keys).
• With high probability, when a node joins (or leaves) an N-node network, only an O(1/N) fraction of the keys is moved to a different location.
  - This is clearly the minimum necessary to maintain a balanced load.
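The effect of a join on key placement can be checked with a small experiment. The following is a minimal sketch, not taken from the slides: the 32-bit identifier width, the node addresses, the object names and the helper names `ident` and `owner` are all assumptions made for illustration. It places 20 nodes and 2000 keys on a ring, adds one node, and reports which keys changed owner.

```python
import hashlib
from bisect import bisect_left

M = 32  # identifier bits (assumed; small enough to read, large enough to avoid collisions)

def ident(name):
    """Hash a string onto the M-bit identifier ring using SHA-1."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** M)

def owner(key_id, node_ids):
    """First node ID >= key_id in the sorted list, wrapping around the ring."""
    return node_ids[bisect_left(node_ids, key_id) % len(node_ids)]

# Hypothetical node addresses and object names.
nodes = sorted(ident(f"10.0.0.{i}") for i in range(1, 21))
keys = [ident(f"object-{i}") for i in range(2000)]

before = {k: owner(k, nodes) for k in keys}

new_node = ident("10.0.0.99")
nodes_after = sorted(nodes + [new_node])
after = {k: owner(k, nodes_after) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(len(moved), "of", len(keys), "keys changed owner")
print("all moved keys went to the new node:",
      all(after[k] == new_node for k in moved))
```

Every key that moves goes to the joining node; no key is shuffled between two existing nodes, which is the property the slide describes.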

Consistent Hashing
• The consistent hash function assigns each node and key an m-bit identifier using SHA-1 as a base hash function.
• A node’s identifier is chosen by hashing the node’s IP address.
• A key identifier is produced by hashing the key.
• For more info see:
  - D. R. Karger, E. Lehman, F. Leighton, M. Levine, D. Lewin, and R. Panigrahy, “Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web,” in Proc. 29th ACM Symp. Theory of Computing, El Paso, TX, May 1997, pp. 654–663.
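A minimal sketch of the identifier assignment described above, using Python's standard `hashlib`. The function name `chord_id` and the choice to reduce the 160-bit SHA-1 digest modulo 2^m are assumptions for illustration; the slides do not say how a digest is shortened when m < 160.

```python
import hashlib

def chord_id(value, m=160):
    """Map a string (a key or a node's IP address) to an m-bit identifier.

    SHA-1 yields 160 bits; for smaller m the digest is reduced modulo 2**m
    (an assumption made here, not something the slides specify).
    """
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** m)

# Keys and nodes share one identifier space.
key_id = chord_id("LetItBe", m=6)
node_id = chord_id("129.100.16.93", m=6)
print(key_id, node_id)  # two integers in the range 0..63
```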

P2P Middleware: Differences
• Different P2P middlewares differ in:
  - The choice of the ID space
  - The structure of their network of nodes (i.e., how each node chooses its neighbors)
  - For each object, the node(s) whose range(s) cover that object must be reachable via a “short” path
• This is a major research topic

Chord
• m-bit identifier space for both keys and nodes
• Key identifier = SHA-1(key)
  - Key = “LetItBe” → SHA-1 → ID = 50
  - Key = “129.100.16.93” → SHA-1 → ID = 70
• How do we assign keys to nodes?

Chord
• Nodes are organized in an identifier circle based on node identifiers
• Keys are assigned to their successor node in the identifier circle, i.e., the first node whose ID is equal to or higher than the key ID (wrapping around the circle).
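The successor rule can be sketched with a sorted list and a binary search. The node IDs below are the ones named in the join examples later in these slides; treating them as the complete ring is an assumption made for illustration.

```python
from bisect import bisect_left

def successor(key_id, node_ids):
    """Return the first node ID that is >= key_id, wrapping past the top of the ring."""
    ring = sorted(node_ids)
    return ring[bisect_left(ring, key_id) % len(ring)]

ring = [8, 14, 21, 32, 38, 48, 56]   # assumed membership, 6-bit identifiers
print(successor(50, ring))  # 56: next node at or after 50
print(successor(21, ring))  # 21: a key equal to a node ID stays on that node
print(successor(60, ring))  # 8:  wraps around the circle
```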

Chord
• The hash function ensures an even distribution of nodes and keys on the circle
• The range covered by a node runs from the previous node’s ID (exclusive) up to its own ID (inclusive)
• Assume an N-node network

Chord: Search Possibilities
• Routing table size vs. search cost
• Every peer knows every other peer: O(N) routing table size
• Every peer knows only its successor: O(N) search time.
• The “compromise” is to have each peer know m other peers, spaced at exponentially increasing distances around the circle.

Finger Table
• Let m be the number of bits in the key/node identifiers
• Each node, n, maintains a routing table with at most m entries, called the finger table.
• The ith entry in the table at node n contains the identity of the first node, s, that succeeds n by at least 2^(i-1) on the identifier circle.
  - s = successor(n + 2^(i-1)), with arithmetic modulo 2^m
  - s is called the ith finger of node n

Chord: Finger Table
• Finger table: finger[i] = successor(n + 2^(i-1)), where 1 ≤ i ≤ m
• O(log N) table size
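The finger table formula can be evaluated directly. This is a minimal sketch under two assumptions: m = 6, and a ring consisting only of the nodes named in the join examples (N8, N14, N21, N32, N38, N48, N56). The helper names are illustrative.

```python
from bisect import bisect_left

M = 6                                   # identifier bits, so IDs run from 0 to 63
RING = [8, 14, 21, 32, 38, 48, 56]      # assumed ring membership (sorted)

def successor(ident, ring):
    """First node ID >= ident, wrapping around the ring."""
    return ring[bisect_left(ring, ident) % len(ring)]

def finger_table(n, ring, m=M):
    """Return rows (i, start, finger) with start = (n + 2^(i-1)) mod 2^m."""
    rows = []
    for i in range(1, m + 1):
        start = (n + 2 ** (i - 1)) % (2 ** m)
        rows.append((i, start, successor(start, ring)))
    return rows

for i, start, finger in finger_table(21, RING):
    print(f"finger[{i}]: start={start:2d} -> N{finger}")
```

For N21 this yields N32, N32, N32, N32, N38, N56, which agrees with the "old finger table" for N21 in the join example.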

The Chord algorithm – Scalable node localization

Chord: Search
• Assume node n is searching for key k.
• Node n does the following:
  - Find the ith table entry of node n such that k ∈ [finger[i].start, finger[i+1].start)
  - If no such entry exists, then return the node in the last entry of the finger table
  - The above two steps are repeated until the condition in the first step is satisfied.
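A lookup over finger tables can be sketched as follows. Note that this uses the closest-preceding-finger formulation from the Chord paper rather than the exact interval test written on the slide; both forward the query to the farthest known node that does not overshoot the key. The ring membership, m = 6 and all helper names are assumptions for illustration, and the whole ring is simulated in one process.

```python
from bisect import bisect_left

M = 6                                   # identifier bits, so IDs run from 0 to 63
RING = [8, 14, 21, 32, 38, 48, 56]      # assumed ring membership (sorted)

def successor(ident, ring=RING):
    return ring[bisect_left(ring, ident) % len(ring)]

def in_half_open(x, a, b):
    """True if x is in the ring interval (a, b]; a == b means the whole ring."""
    return a < x <= b if a < b else (x > a or x <= b)

# FINGER[n][i-1] holds successor(n + 2^(i-1)); FINGER[n][0] is n's successor.
FINGER = {n: [successor((n + 2 ** (i - 1)) % 2 ** M) for i in range(1, M + 1)]
          for n in RING}

def closest_preceding_finger(n, key):
    """Farthest finger of n that lies strictly between n and key."""
    for f in reversed(FINGER[n]):
        if f != key and in_half_open(f, n, key):
            return f
    return n

def lookup(start, key):
    """Return (node responsible for key, number of forwarding hops)."""
    n, hops = start, 0
    while not in_half_open(key, n, FINGER[n][0]):
        n = closest_preceding_finger(n, key)
        hops += 1
    return FINGER[n][0], hops

print(lookup(8, 54))   # (56, 1): N8 forwards to N48, whose successor N56 owns key 54
```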

Chord: Join
• Nodes can join (and leave) at any time.
• Challenge: Preserving the ability to locate every key in the network
• Chord must preserve the following:
  - Each node’s successor is correctly maintained
  - For every key k, node successor(k) is responsible for k.
• For lookups to be fast, it is desirable for the finger tables to be correct.

Chord: Join Implementation
• Each node in Chord maintains a predecessor pointer.
  - This consists of the Chord ID and IP address of the immediate predecessor of that node.
  - It can be used to walk counterclockwise around the identifier circle.
• The new node to be added learns the identity of an existing Chord node by some external mechanism

Chord: Join Initialization Steps
• Assume n is the node to join.
• Find any existing node, n’.
• Find the successor of n from n’. Label this successor(n).
• Ask successor(n) for its predecessor. This is labelled predecessor(successor(n)).

Chord: Join Example
• Assume N26 wants to join; it finds N8
• N8’s finger table suggests that N26 will be “between” N21 and N32.

Chord: Join (Initialize finger table)
• Node n needs to have its finger table initialized
• Node n can ask its predecessor for a copy of its finger table to use as a starting point

Chord: Join (Changing Existing Finger Tables)
• Node n needs to be entered into the finger tables of some existing nodes.
• Node n becomes the ith finger of node p iff
  - p precedes n by at least 2^(i-1); and
  - the ith finger of node p succeeds n.
• The first node, p, that satisfies these conditions is the immediate predecessor of n - 2^(i-1)
• For a given n, the algorithm starts with the ith finger of node n and then continues to walk in the counter-clockwise direction on the identifier circle until it encounters a node whose ith finger precedes n.

Chord: Join Example (add N26)
N21 finger table:
  Start     Old finger   New finger
  N21+1     N32          N26
  N21+2     N32          N26
  N21+4     N32          N26
  N21+8     N32          N32
  N21+16    N38          N38
  N21+32    N56          N56
• i=1: Does N21 precede N26 by at least 1 (2^(i-1))? Yes: N21+1 becomes N26
• i=2: Does N21 precede N26 by at least 2? Yes: N21+2 becomes N26
• i=3: Does N21 precede N26 by at least 4? Yes: N21+4 becomes N26
• i=4: Does N21 precede N26 by at least 8? No: evaluate N14

Chord: Join Example (add N26)
N14 finger table:
  Start     Old finger   New finger
  N14+1     N21          N21
  N14+2     N21          N21
  N14+4     N21          N21
  N14+8     N32          N26
  N14+16    N32          N32
  N14+32    N48          N48
• i=4: Does N14 precede N26 by at least 8? Yes: N14+8 becomes N26
• i=5: Does N14 precede N26 by at least 16? No: evaluate N8
• Etc.
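The finger updates in this example can be cross-checked by brute force. The sketch below does not perform the counter-clockwise walk described on the slide; it simply recomputes every finger table before and after N26 joins and lists the entries that differ. The ring membership (only the nodes named in the examples) and m = 6 are assumptions.

```python
from bisect import bisect_left

M = 6  # identifier bits (assumed)

def successor(ident, ring):
    """First node ID >= ident in the sorted ring, wrapping around."""
    return ring[bisect_left(ring, ident) % len(ring)]

def finger_tables(ring, m=M):
    """Map (node, i) -> finger node for every node in the sorted ring."""
    return {(n, i): successor((n + 2 ** (i - 1)) % 2 ** m, ring)
            for n in ring for i in range(1, m + 1)}

old_ring = [8, 14, 21, 32, 38, 48, 56]       # assumed membership before the join
new_ring = sorted(old_ring + [26])

old, new = finger_tables(old_ring), finger_tables(new_ring)
changed = sorted(k for k in old if old[k] != new[k])

for node, i in changed:
    print(f"N{node} finger {i}: N{old[(node, i)]} -> N{new[(node, i)]}")
```

Under these assumptions the changed entries are fingers 1 to 3 of N21, finger 4 of N14, finger 5 of N8 and finger 6 of N56 (whose start wraps around to 24), all of which now point at N26.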

Chord: Join (Transferring Keys)
• Move responsibility for all the keys for which node n is the successor.
• Typically this involves moving the data associated with each key to the new node.
• Node n can become the successor only for keys that were previously the responsibility of the node immediately following n.
• Node n only needs to contact one node to transfer responsibility for all relevant keys.
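Key transfer touches only the joining node's successor. A minimal sketch, assuming each node keeps its keys in a dictionary indexed by key ID; the function names and the stored values are hypothetical. Keys in the ring interval (predecessor, new node] move to the new node and the rest stay put.

```python
def in_half_open(x, a, b):
    """True if x is in the ring interval (a, b]; a == b means the whole ring."""
    return a < x <= b if a < b else (x > a or x <= b)

def transfer_keys(successor_store, predecessor_id, new_id):
    """Remove from the successor's store the keys now owned by the new node and return them."""
    moved = {k: v for k, v in successor_store.items()
             if in_half_open(k, predecessor_id, new_id)}
    for k in moved:
        del successor_store[k]
    return moved

# Hypothetical contents of N32 before N26 joins between N21 and N32.
n32_store = {22: "a", 24: "b", 26: "c", 30: "d", 31: "e"}
n26_store = transfer_keys(n32_store, predecessor_id=21, new_id=26)

print(sorted(n26_store))  # [22, 24, 26]
print(sorted(n32_store))  # [30, 31]
```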

Chord: Join
• The previous discussion on join focuses on a single node join.
• What if there are multiple node joins?
• Join requires that each node’s successor is correctly maintained

Chord: Stabilization Protocol
• The successor/predecessor links are rebuilt by periodic stabilize notification messages
  - Sent by each node to its successor to inform it of the (possibly new) identity of the predecessor
• The successor pointers are used to verify and correct finger table entries.

Chord: Join/Stabilize Example
• N26 joins the system
• N26 acquires N32 as its successor
• N26 notifies N32
• N32 acquires N26 as its predecessor

Chord: Join/Stabilize Example
• N26 copies keys
• N21 runs stabilize() and asks its successor N32 for its predecessor, which is N26.

Chord: Join/Stabilize Example
• N21 acquires N26 as its successor
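The join/stabilize sequence above can be replayed in a few lines. This is a simplified, single-process sketch: the `Node` class, its method names and the two-node starting ring (N21 and N32 only) are assumptions, and real nodes would exchange these calls as network messages on a timer.

```python
def in_open(x, a, b):
    """True if x is strictly inside the ring interval (a, b)."""
    return a < x < b if a < b else (x > a or x < b)

class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self
        self.predecessor = None

    def join(self, successor):
        """Enter the ring knowing only a successor; the predecessor is learned later."""
        self.predecessor = None
        self.successor = successor

    def stabilize(self):
        """Adopt the successor's predecessor if it sits between us, then notify the successor."""
        x = self.successor.predecessor
        if x is not None and in_open(x.id, self.id, self.successor.id):
            self.successor = x
        self.successor.notify(self)

    def notify(self, candidate):
        """candidate believes it may be our predecessor."""
        if self.predecessor is None or in_open(candidate.id, self.predecessor.id, self.id):
            self.predecessor = candidate

# Assumed starting ring with two nodes: N21 <-> N32.
n21, n32, n26 = Node(21), Node(32), Node(26)
n21.successor, n21.predecessor = n32, n32
n32.successor, n32.predecessor = n21, n21

n26.join(n32)        # N26 acquires N32 as its successor
n32.notify(n26)      # N26 notifies N32; N32 acquires N26 as its predecessor
n21.stabilize()      # N21 learns of N26 from N32 and adopts it as successor

print(n21.successor.id, n26.successor.id, n32.predecessor.id, n26.predecessor.id)
# 26 32 26 21
```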

Chord Stabilization
• Pointers and finger tables may be in a state of flux
• Is it possible that data will not be found?
  - Yes
• Recovery: try again

Chord: Node Failure
[Figure: identifier ring with nodes N80, N85, N102, N113 and N120, and a Lookup(90) request]
• N80 doesn’t know its correct successor, so the lookup is incorrect

Chord: Node Failure
• Solution: Use successor lists
• Each node knows its r immediate successors
• After a failure, a node will know its first live successor
• Stabilize messages correct finger tables
• Replicas of the data associated with a key at the r successor nodes might be used
  - Application dependent
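Falling back along a successor list can be sketched as below. The list length r = 3, the set of failed nodes and the `is_alive` callback are assumptions for illustration; in a deployment, liveness would be determined by timeouts on real messages.

```python
def first_live_successor(successor_list, is_alive):
    """Return the first entry of the successor list that is still reachable, or None."""
    for node in successor_list:
        if is_alive(node):
            return node
    return None

# Hypothetical scenario on the ring from the figure: N80 keeps r = 3 successors and N85 has failed.
failed = {85}
n80_successors = [85, 102, 113]

print(first_live_successor(n80_successors, lambda n: n not in failed))  # 102
```

With N85 gone, N80 can route Lookup(90) to N102, which is now the first live node at or after 90.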

Chord Properties
• In a system with N nodes and K keys, with high probability:
  - each node receives at most (1 + ε)K/N keys
  - each node maintains info about O(log N) other nodes
  - lookups are resolved with O(log N) hops
  - insertions: O(log^2 N) messages
• The developers of Chord validated this through simulation studies.
• No consistency among replicas
• Hops have poor network locality

Chord: Network Locality
• Nodes close on the ring can be far apart in the network.
[Figure: nodes N20, N40, N41 and N80. Figure from http://project-iris.net/talks/dht-toronto-03.ppt]