
MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing
Bin Fan (CMU), Dave Andersen (CMU), Michael Kaminsky (Intel Labs)
Presenter: Shouqian Shi
Credit to: Son Nguyen

Memcached
• DRAM-based key-value store to alleviate database load
  – Set(key, value)
  – Get(key) -> value

Memcached
• LRU (Least Recently Used) eviction
• Often used for small objects (Facebook [Atikoglu 12])
  – 90% of keys < 31 bytes
• 10s of millions of queries per second (Facebook [Atikoglu 12])

Memcached internals
• Chaining hashtable
• LRU caching using a doubly linked list
• Drawbacks: global lock, huge space overhead

Goals
• Target: read-intensive workloads with small objects
• Reduce space overhead (bytes/key)
• Improve throughput (queries/sec)
  – Result: 3X throughput, 30% more objects

Memcached: Chaining Hashtable
• Uses linked lists: costly space overhead for pointers
• Pointer dereferences are slow (no advantage from the CPU cache)
• Reads are not constant time (a chain can be long)

Cuckoo Hashing
• Use 2 hash functions to pick two candidate buckets
• Each bucket has exactly 4 slots (fits in CPU cache)
• Each (key, value) object can therefore reside in one of 8 possible slots
[Figure: (ka, va) maps to two candidate buckets, HASH1(ka) and HASH2(ka)]
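
The two-bucket, four-slot layout above can be sketched in a few lines of Python. This is only an illustration: the salted BLAKE2b hash, the list-of-lists table, and the names `bucket_index`, `candidate_buckets`, and `lookup` are assumptions of this sketch, not taken from the slides or from MemC3's code.

```python
import hashlib

SLOTS_PER_BUCKET = 4  # from the slide: each bucket has exactly 4 slots


def bucket_index(key, salt, num_buckets):
    """Hypothetical hash function: salted BLAKE2b reduced modulo the bucket count."""
    digest = hashlib.blake2b(key, key=salt, digest_size=8).digest()
    return int.from_bytes(digest, "little") % num_buckets


def candidate_buckets(key, num_buckets):
    """Return the two candidate buckets for a key (they may coincide)."""
    return (bucket_index(key, b"hash1", num_buckets),
            bucket_index(key, b"hash2", num_buckets))


def lookup(buckets, key):
    """Check at most 2 buckets x 4 slots = 8 slots; return the value or None."""
    for b in candidate_buckets(key, len(buckets)):
        for slot in buckets[b]:
            if slot is not None and slot[0] == key:
                return slot[1]
    return None
```

A read never follows a chain of pointers: it inspects two fixed-size buckets, which is why its cost is bounded regardless of table load.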

Cuckoo Hashing: insertion example
[Figure: three animation frames; X marks an occupied slot]
• Insert a: both candidate buckets of ka are full, so (ka, va) takes b's slot in HASH1(ka) and b is kicked out
• Insert b: (kb, vb) moves to its other candidate bucket, HASH1(kb), taking c's slot; c is kicked out
• Insert c: (kc, vc) moves to its other candidate bucket, which has an empty slot. Done!

Cuckoo Hashing
• Read: 4 lookups on average
• Write: write(ka, va)
  – Find an empty slot among the 8 possible slots of ka
  – If all are full, randomly kick some (kb, vb) out
  – Now find an empty slot for (kb, vb)
  – Repeat up to 500 times or until an empty slot is found
  – If still not found, expand the table
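
The write procedure above can be sketched as follows. This is a minimal single-threaded illustration, not the MemC3 implementation: the class name `CuckooTable`, the salted BLAKE2b hashes, and the choice to return the leftover item (so the caller can expand the table and re-insert it) are assumptions of this sketch. The 4 slots per bucket and the limit of 500 kicks come from the slides.

```python
import hashlib
import random

SLOTS_PER_BUCKET = 4  # from the slides
MAX_KICKS = 500       # from the slides


class CuckooTable:
    def __init__(self, num_buckets, seed=0):
        self.buckets = [[None] * SLOTS_PER_BUCKET for _ in range(num_buckets)]
        self.rng = random.Random(seed)  # seeded so runs are repeatable

    def _candidates(self, key):
        n = len(self.buckets)

        def h(salt):  # hypothetical hash: salted BLAKE2b modulo bucket count
            d = hashlib.blake2b(key, key=salt, digest_size=8).digest()
            return int.from_bytes(d, "little") % n

        return (h(b"hash1"), h(b"hash2"))

    def get(self, key):
        for b in self._candidates(key):
            for slot in self.buckets[b]:
                if slot is not None and slot[0] == key:
                    return slot[1]
        return None

    def _place_in_empty_slot(self, key, value):
        for b in self._candidates(key):
            bucket = self.buckets[b]
            for i in range(SLOTS_PER_BUCKET):
                if bucket[i] is None:
                    bucket[i] = (key, value)
                    return True
        return False

    def put(self, key, value):
        """Return None on success, or the item left without a slot."""
        # Overwrite in place if the key is already stored.
        for b in self._candidates(key):
            for i, slot in enumerate(self.buckets[b]):
                if slot is not None and slot[0] == key:
                    self.buckets[b][i] = (key, value)
                    return None
        for _ in range(MAX_KICKS):
            if self._place_in_empty_slot(key, value):
                return None
            # All 8 slots are full: kick out a random victim and take its slot.
            b = self.rng.choice(self._candidates(key))
            i = self.rng.randrange(SLOTS_PER_BUCKET)
            victim = self.buckets[b][i]
            self.buckets[b][i] = (key, value)
            key, value = victim  # the victim is now the item to place
        return (key, value)  # caller would expand the table and retry
```

Note that between a kick and the victim's re-placement, the victim is in no bucket at all, so a concurrent reader would miss it.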

Cuckoo's advantages
• Concurrency: multiple readers / single writer
• Read optimized (entries fit in CPU cache)
• Still O(1) amortized time for writes
• 30% less space overhead
• 95% table occupancy

Floating Problem
• During an insertion, one item is always outside the table: a concurrent read of that item sees a false cache miss
• Solution: compute the kick-out path (cuckoo path) first, then move items backward along it

Computed Cuckoo path
[Figure: inserting (ka, va). The path is computed before anything moves: a would displace b in HASH1(ka), and b would displace c]

Cuckoo path backward insert
[Figure: inserting (ka, va). Items are moved backward along the computed path, so a is written into b's slot in HASH1(ka) only after b has been moved]
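
The two steps (find the path, then move items backward) can be sketched as below. Several things here are assumptions of the sketch rather than details from the slides: the path is found with a breadth-first search, the key is assumed not to be in the table yet, and the helper names (`candidates`, `find_cuckoo_path`, `insert`, `lookup`) are hypothetical. The point it illustrates is that every move copies an item into an empty slot before clearing its old slot, so no item is ever absent from the table.

```python
import hashlib
from collections import deque

SLOTS = 4  # slots per bucket, as in the slides


def candidates(key, n):
    def h(salt):  # hypothetical hash: salted BLAKE2b modulo bucket count
        d = hashlib.blake2b(key, key=salt, digest_size=8).digest()
        return int.from_bytes(d, "little") % n
    return (h(b"hash1"), h(b"hash2"))


def lookup(buckets, key):
    for b in candidates(key, len(buckets)):
        for slot in buckets[b]:
            if slot is not None and slot[0] == key:
                return slot[1]
    return None


def find_cuckoo_path(buckets, key):
    """Return a list of (bucket, slot) positions ending at an empty slot, or None."""
    n = len(buckets)
    seen, queue = set(), deque()
    for b in candidates(key, n):
        for s in range(SLOTS):
            if (b, s) not in seen:
                seen.add((b, s))
                queue.append([(b, s)])
    while queue:
        path = queue.popleft()
        b, s = path[-1]
        item = buckets[b][s]
        if item is None:
            return path
        # The occupant could move to any slot of its other candidate bucket.
        b1, b2 = candidates(item[0], n)
        other = b2 if b == b1 else b1
        for s2 in range(SLOTS):
            if (other, s2) not in seen:
                seen.add((other, s2))
                queue.append(path + [(other, s2)])
    return None


def insert(buckets, key, value):
    path = find_cuckoo_path(buckets, key)
    if path is None:
        return False  # table untouched; the caller would expand it
    # Walk the path from its empty end back toward the new key's slot.
    for i in range(len(path) - 1, 0, -1):
        (src_b, src_s), (dst_b, dst_s) = path[i - 1], path[i]
        buckets[dst_b][dst_s] = buckets[src_b][src_s]  # copy first
        buckets[src_b][src_s] = None                   # then vacate
    b, s = path[0]
    buckets[b][s] = (key, value)
    return True
```

A failed insert leaves the table exactly as it was, since nothing moves until a complete path to an empty slot is known.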

Cuckoo and optimistic lock
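
One generic way to realize optimistic locking for a single writer and many readers is a version counter per slot (a seqlock-style scheme). The sketch below assumes that scheme; the class `VersionedSlot` and its methods are hypothetical, and it may differ from what the slide showed. Python's threading model hides the memory-ordering concerns a C implementation would have, so this only illustrates the read protocol.

```python
class VersionedSlot:
    """One slot guarded by a version counter (assumed scheme, for illustration)."""

    def __init__(self, value=None):
        self.version = 0  # even: stable, odd: a write is in progress
        self.value = value

    def write(self, value):
        # Single writer only: bump to odd, modify, bump back to even.
        self.version += 1
        self.value = value
        self.version += 1

    def read(self):
        # Readers take no lock; they retry if a write overlapped the read.
        while True:
            before = self.version
            if before % 2 == 1:
                continue  # write in progress, try again
            value = self.value
            if self.version == before:
                return value  # version unchanged, so the read was consistent
```

Readers never block the writer and never write to shared state, which suits a read-intensive workload.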

Memcached: Doubly Linked List
• At least two pointers per item: expensive for small key-value pairs
• Both reads and writes update an item's position
  – this changes the list's structure
  – so threads must lock against each other (no concurrency)
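
The locking problem can be seen in a small sketch of strict LRU. This is not memcached's code: `LockedLRU` is a hypothetical class, and Python's `OrderedDict` stands in for the hashtable plus doubly linked list. What it shows is that even `get` reorders the list, so it must take the same lock as `set`.

```python
import threading
from collections import OrderedDict


class LockedLRU:
    """Strict LRU cache: every operation, including reads, takes one global lock."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # ordered from least to most recently used
        self.lock = threading.Lock()

    def get(self, key):
        with self.lock:  # a read mutates the recency order
            if key not in self.items:
                return None
            self.items.move_to_end(key)
            return self.items[key]

    def set(self, key, value):
        with self.lock:
            self.items[key] = value
            self.items.move_to_end(key)
            if len(self.items) > self.capacity:
                self.items.popitem(last=False)  # evict least recently used
```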

Solution: CLOCK-based LRU
• Only for multiple readers / single writer
• Approximate LRU
• Circular queue instead of linked list
  – less space overhead
  – 1 bit per entry vs. 16 bytes

CLOCK example (each entry is shown with its recency bit)

Operation        Entries
Originally       (ka, va):0  (kb, vb):1  (kc, vc):0  (kd, vd):0  (ke, ve):1
Read(kd)         (ka, va):0  (kb, vb):1  (kc, vc):0  (kd, vd):1  (ke, ve):1
Write(kf, vf)    (ka, va):0  (kb, vb):0  (kf, vf):1  (kd, vd):1  (ke, ve):1
Write(kg, vg)    (kg, vg):1  (kb, vb):0  (kf, vf):1  (kd, vd):0  (ke, ve):0
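
The example above can be traced with a short simulation. Two details are assumptions inferred from the example's numbers rather than stated on the slide: the clock hand starts at (kb, vb), and a newly written entry gets recency bit 1. The class `ClockCache` is hypothetical, it only handles writes of new keys, and it finds keys by linear scan for brevity (in MemC3 the hash table does the lookup).

```python
class ClockCache:
    """Fixed-size circular buffer with one recency bit per entry."""

    def __init__(self, entries, bits, hand=0):
        self.entries = list(entries)
        self.bits = list(bits)
        self.hand = hand  # index the eviction sweep starts from

    def read(self, key):
        for i, (k, v) in enumerate(self.entries):
            if k == key:
                self.bits[i] = 1  # a read only sets a bit; nothing is reordered
                return v
        return None

    def write(self, key, value):
        n = len(self.entries)
        # Sweep: clear bits that are 1 until an entry with bit 0 is found.
        while self.bits[self.hand] == 1:
            self.bits[self.hand] = 0
            self.hand = (self.hand + 1) % n
        self.entries[self.hand] = (key, value)  # evict and replace
        self.bits[self.hand] = 1
        self.hand = (self.hand + 1) % n


cache = ClockCache(
    [("ka", "va"), ("kb", "vb"), ("kc", "vc"), ("kd", "vd"), ("ke", "ve")],
    [0, 1, 0, 0, 1],
    hand=1,  # assumption: the hand starts at (kb, vb)
)
cache.read("kd")         # bits become 0 1 0 1 1
cache.write("kf", "vf")  # clears kb's bit, evicts kc
cache.write("kg", "vg")  # clears kd's and ke's bits, evicts ka
```

Because a read only sets a bit, readers do not need to coordinate with one another, unlike the linked-list LRU.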

Eviction and lock

Evaluation
• 68% throughput improvement in the all-hit case; 235% in the all-miss case

Evaluation
• 3X throughput on a “real” workload

Discussion
• Single-machine, multi-core optimization
  – no cooperation between machines
  – needs a load balancer
  – cannot address hotspots at the cluster level
• Atomic insertion: lock along the path
• The impact of hotspot false eviction?