In-memory Caching in HDFS: Lower latency, same great taste


In-memory Caching in HDFS: Lower latency, same great taste
Andrew Wang | awang@cloudera.com
Colin McCabe | cmccabe@cloudera.com

[Motivating scenario, shown as diagrams: analyst Alice sends queries to a Hadoop cluster and gets result sets back; fresh data keeps arriving; a rollup job runs over the same cluster.]

Problems
• Data hotspots
  • Everyone wants to query some fresh data
  • Shared disks are unable to handle high load
• Mixed workloads
  • Data analyst making small point queries
  • Rollup job scanning all the data
  • Point query latency suffers because of I/O contention
• Same theme: disk I/O contention!

How do we solve I/O issues?
• Cache important datasets in memory!
  • Much higher throughput than disk
  • Fast random/concurrent access
• Interesting working sets often fit in cluster memory
  • Traces from Facebook's Hive cluster
• Increasingly affordable to buy a lot of memory
  • Moore's law
  • 1 TB server is ~$40k on HP's website

[Diagram slides: what happens with just the OS page cache — Alice repeats her query (does it hit?), the rollup job runs over the same data, and reads still pay for extra copies and checksum verification.]

Design considerations
1. Explicitly pin hot datasets in memory
2. Place tasks for memory locality
3. Zero-overhead reads of cached data

Outline
• Implementation
  • NameNode and DataNode modifications
  • Zero-copy read API
• Evaluation
  • Microbenchmarks
  • MapReduce
  • Impala
• Future work

Architecture
• The NameNode schedules which DataNodes cache each block of a file.

Architecture
• DataNodes periodically send cache reports describing which replicas they have cached.

Cache Locations API
• Clients can ask the NameNode where a file is cached via getFileBlockLocations (a client-side sketch follows below)
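
The snippet below is a minimal sketch of how a client could inspect cache locations through the standard FileSystem API; the file path is hypothetical, and it assumes the caching-aware BlockLocation.getCachedHosts() accessor that accompanied this feature.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CacheLocations {
  public static void main(String[] args) throws Exception {
    // Hypothetical path; substitute a file that exists in your cluster.
    Path file = new Path("/warehouse/fresh_data.csv");

    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus stat = fs.getFileStatus(file);

    // Ask the NameNode where each block lives.
    BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());

    for (BlockLocation block : blocks) {
      // getCachedHosts() lists the DataNodes that report this block
      // as cached (empty if no replica is cached yet).
      System.out.printf("offset %d: hosts=%s cached=%s%n",
          block.getOffset(),
          String.join(",", block.getHosts()),
          String.join(",", block.getCachedHosts()));
    }
  }
}
```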

Cache Directives
• A cache directive describes a file or directory that should be cached (sketched below)
  • Path
  • Cache replication factor
• Stored permanently on the NameNode
• Also have cache pools for access control and quotas, but we won't be covering that here
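
Below is a minimal sketch of adding a directive programmatically, assuming the DistributedFileSystem.addCacheDirective() client API that shipped alongside HDFS caching; the path and the "hot-queries" pool are hypothetical, and the pool would have to be created by an administrator beforehand.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;

public class PinHotDataset {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Cast is safe when fs.defaultFS points at an HDFS cluster.
    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

    // "hot-queries" is a hypothetical cache pool; pools carry the
    // access control and quota settings mentioned on the slide.
    CacheDirectiveInfo directive = new CacheDirectiveInfo.Builder()
        .setPath(new Path("/warehouse/fresh_data.csv"))  // hypothetical path
        .setPool("hot-queries")
        .setReplication((short) 1)  // number of cached replicas to keep
        .build();

    long directiveId = dfs.addCacheDirective(directive);
    System.out.println("Added cache directive " + directiveId);
  }
}
```

Unpinning is the reverse operation: passing the returned id to removeCacheDirective lets an application keep a dataset pinned only while it is hot.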

mlock
• The DataNode pins each cached block into the page cache using mlock.
• Because we're using the page cache, the blocks don't take up any space on the Java heap.

Zero-copy read API
• Clients can use the zero-copy read API to map the cached replica into their own address space (sketched below)
• The zero-copy API avoids the overhead of the read() and pread() system calls
• However, we don't verify checksums when using the zero-copy API
• The zero-copy API can only be used on cached data, or when the application computes its own checksums
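
Here is a minimal client-side sketch, assuming the enhanced ByteBuffer read call (FSDataInputStream.read(ByteBufferPool, maxLength, EnumSet<ReadOption>)) that backs this API; the path is hypothetical, and a real vecsum-style reader would interpret the mapped bytes as doubles rather than summing raw byte values.

```java
import java.nio.ByteBuffer;
import java.util.EnumSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.ReadOption;
import org.apache.hadoop.io.ByteBufferPool;
import org.apache.hadoop.io.ElasticByteBufferPool;

public class ZeroCopySum {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    ByteBufferPool pool = new ElasticByteBufferPool();
    long sum = 0;

    try (FSDataInputStream in = fs.open(new Path("/warehouse/fresh_data.csv"))) {
      while (true) {
        // For a locally cached replica this maps the block into our
        // address space; SKIP_CHECKSUMS lets the read proceed without
        // per-read checksum verification.
        ByteBuffer buf = in.read(pool, 4 * 1024 * 1024,
            EnumSet.of(ReadOption.SKIP_CHECKSUMS));
        if (buf == null) {
          break;  // end of file
        }
        while (buf.hasRemaining()) {
          sum += buf.get() & 0xff;
        }
        in.releaseBuffer(buf);  // unmap / return the buffer to the pool
      }
    }
    System.out.println("byte sum = " + sum);
  }
}
```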

Skipping Checksums
• We would like to skip checksum verification when reading cached data
  • DataNode already checksums when caching the block
• Requirements
  • Client needs to know that the replica is cached
  • DataNode needs to notify the client if the replica is uncached

Skipping Checksums
• The DataNode and DFSClient use shared memory segments to communicate which blocks are cached.

Outline
• Implementation
  • NameNode and DataNode modifications
  • Zero-copy read API
• Evaluation
  • Single-Node Microbenchmarks
  • MapReduce
  • Impala
• Future work

Test Cluster
• 5 nodes
  • 1 NameNode
  • 4 DataNodes
• 48 GB of RAM per node
  • Configured 38 GB of HDFS cache per DN
• 11x SATA hard disks
• 2x 4-core 2.13 GHz Westmere Xeon processors
• 10 Gbit/s full-bisection bandwidth network

Single-Node Microbenchmarks
• How much faster are cached and zero-copy reads?
• Introducing vecsum (vector sum)
  • Computes sums of a file of doubles
  • Highly optimized: uses SSE intrinsics
  • libhdfs program
  • Can toggle between various read methods

Throughput
[Bar chart, single-thread read throughput in GB/s: TCP 0.8, TCP no csums 0.9, SCR 1.9, SCR no csums 2.4, ZCR 5.9]

ZCR 1 GB vs 20 GB
[Bar chart, ZCR throughput in GB/s: 5.9 for the 1 GB working set, 2.7 for the 20 GB working set]

Throughput
• Skipping checksums matters more when going faster
• ZCR gets close to bus bandwidth
  • ~6 GB/s
• Need to reuse client-side mmaps for maximum perf
  • page_fault function is 1.16% of cycles in the 1 GB case
  • 17.55% in the 20 GB case

Client CPU cycles (billions)
[Bar chart, client CPU cycles in billions: TCP 57.6, TCP no csums 51.8, SCR 27.1, SCR no csums 23.4, ZCR 12.7]

Why is ZCR more CPU-efficient?
[Diagram slides: fewer copies in the ZCR read path]

Remote Cached vs. Local Uncached
• Zero-copy is only possible for local cached data
• Is it better to read from remote cache, or local disk?

Remote Cached vs. Local Uncached
[Bar chart, read throughput in MB/s: TCP (remote cached) 841, iperf (raw network) 1092, SCR (local uncached) 125, dd (raw disk) 137]

Microbenchmark Conclusions
• Short-circuit reads need less CPU than TCP reads
• ZCR is even more efficient, because it avoids a copy
• ZCR goes much faster when re-reading the same data, because it can avoid mmap page faults
• Network and disk may be the bottleneck for remote or uncached reads

Outline
• Implementation
  • NameNode and DataNode modifications
  • Zero-copy read API
• Evaluation
  • Microbenchmarks
  • MapReduce
  • Impala
• Future work

MapReduce
• Started with example MR jobs
  • Wordcount
  • Grep
• Same 4-DN cluster
  • 38 GB HDFS cache per DN
  • 11 disks per DN
• 17 GB of Wikipedia text
  • Small enough to fit into cache at 3x replication
• Ran each job 10 times, took the average

wordcount and grep
[Bar chart, job runtime in seconds: wordcount 280, wordcount cached 275, grep 55, grep cached 52 — almost no speedup; wordcount runs at ~60 MB/s and grep at ~330 MB/s, so neither is I/O bound]

wordcount and grep
• End-to-end latency barely changes
• These MR jobs are simply not I/O bound!
  • Best map phase throughput was about 330 MB/s
  • 44 disks can theoretically do 4400 MB/s
• Further reasoning
  • Long JVM startup and initialization time
  • Many copies in TextInputFormat, doesn't use zero-copy
  • Caching input data doesn't help the reduce step

Introducing bytecount
• Trivial version of wordcount
  • Counts # of occurrences of byte values
  • Heavily CPU optimized
• Each mapper processes an entire block via ZCR
  • No additional copies
  • No record slop across block boundaries
  • Fast inner loop
• Very unrealistic job, but serves as a best case
• Also tried 2 GB block size to amortize startup costs

bytecount
[Bar chart, job runtime in seconds: grep 55 (52 cached); bytecount and bytecount with 2 GB blocks at 58, 45, 39, and 35 s — caching makes bytecount about 1.3x faster, but even the best case is still only ~500 MB/s]

MapReduce Conclusions
• Many MR jobs will see marginal improvement
  • Startup costs
  • CPU inefficiencies
  • Shuffle and reduce steps
• Even bytecount sees only modest gains
  • 1.3x faster than disk
  • 500 MB/s with caching and ZCR
  • Nowhere close to the GB/s possible with memory
• Needs more work to take full advantage of caching!

Outline
• Implementation
  • NameNode and DataNode modifications
  • Zero-copy read API
• Evaluation
  • Microbenchmarks
  • MapReduce
  • Impala
• Future work

Impala Benchmarks
• Open-source OLAP database developed by Cloudera
• Tested with Impala 1.3 (CDH 5.0)
• Same 4-DN cluster as the MR section
  • 38 GB of 48 GB per DN configured as HDFS cache
  • 152 GB aggregate HDFS cache
  • 11 disks per DN

Impala Benchmarks
• 1 TB TPC-DS store_sales table, text format
• count(*) on different numbers of partitions
  • Has to scan all the data, no skipping
• Queries
  • 51 GB small query (34% of cache capacity)
  • 148 GB big query (98% of cache capacity)
  • Small query with a concurrent workload
• Tested "cold" and "hot"
  • echo 3 > /proc/sys/vm/drop_caches
  • Lets us compare HDFS caching against the page cache

Small Query
[Bar chart, average response time in seconds: Uncached cold 19.8, Cached cold 5.8, Uncached hot 4.0, Cached hot 3.0. The cold uncached scan is I/O bound at ~2550 MB/s, while the cached hot scan runs at ~17 GB/s; caching is 3.4x faster than disk when cold, and still 1.3x faster after warmup, where it also wins on CPU efficiency]

Big Query
[Bar chart, average response time in seconds: Uncached cold 48.2, Cached cold 11.5, Uncached hot 40.9, Cached hot 9.4. Caching is 4.2x faster than disk when cold, and 4.3x faster when hot, because the 148 GB working set doesn't fit in the page cache and tasks cannot be scheduled for page cache locality]

Small Query with Concurrent Workload
[Bar chart, average response time in seconds for Uncached, Cached, and Cached (not concurrent): the small query is 7x faster when its working set is cached, though about 2x slower than when run in isolation due to CPU contention]

Impala Conclusions
• HDFS cache is faster than disk or page cache
  • ZCR is more efficient than SCR from the page cache
  • Better when the working set is ~cluster memory size
  • Can schedule tasks for cache locality
• Significantly better for concurrent workloads
  • 7x faster when contending with a single background query
• Impala performance will only improve
  • Many CPU improvements on the roadmap

Outline
• Implementation
  • NameNode and DataNode modifications
  • Zero-copy read API
• Evaluation
  • Microbenchmarks
  • MapReduce
  • Impala
• Future work

Future Work
• Automatic cache replacement
  • LRU, LFU, ?
• Sub-block caching
  • Potentially important for automatic cache replacement
• Compression, encryption, serialization
  • Lose many benefits of zero-copy API
• Write-side caching
  • Enables Spark-like RDDs for all HDFS applications

Conclusion
• I/O contention is a problem for concurrent workloads
• HDFS can now explicitly pin working sets into RAM
• Applications can place their tasks for cache locality
• Use the zero-copy API to efficiently read cached data
• Substantial performance improvements
  • 6 GB/s for single-thread microbenchmark
  • 7x faster for concurrent Impala workload

bytecount
[Backup slide: same runtime chart as above, annotated that the 2 GB block size means less disk parallelism]