Using memory mapping to program Intel Optane SSD


Using memory mapping to program Intel® Optane™ SSD drives
Antonio Cisternino (@cisterni)

How long should we use persistent memory as HDDs?
We’ve built all software around the notion that HDDs are SLOW.
BUT… now it’s not true anymore.
NAND SSD drives have addressed bandwidth issues (2.5 GB/s per NVMe drive with 4 PCIe lanes).
Intel® Optane™ SSD is about addressing latency.
Memory mapping is the natural way to exploit it.

Memory mapping (examples)

C++:
    std::file_mapping mapping("/home/user/file", std::memory_mappable::read_write);
    std::mapped_region region(mapping, std::mapped_region::read_write);
    void *address = region.get_address();
    std::size_t size = region.get_size();
    std::memset(address, 0xFF, size);
    region.flush();

MATLAB:
    m = memmapfile('records.dat', 'Format', 'double')
    m.Data

Python:
    import mmap
    with open("hello.txt", "wb") as f:
        f.write(b"Hello Python!\n")
    with open("hello.txt", "r+b") as f:
        mm = mmap.mmap(f.fileno(), 0)
        print(mm.readline())
        print(mm[:5])
        mm[6:] = b" world!\n"   # slice assignment must keep the same length
        mm.seek(0)
        print(mm.readline())
        mm.close()

[Figure: x86-64 address translation. CR3 points to the PML4; a PML4E selects the PDPT, a PDPE selects the PD, a PDE selects the PT, and a PTE plus the offset selects the location in physical memory.]

Sweeping memory
To validate the memory mapping access we simulated sweeping of a file using mmap.
The algorithm is very simple: read a bytes every b bytes.
By varying a and b it is possible to sweep the file in various ways.
The benchmark has been developed with F# (Mono on Linux) because:
Memory mapping is available.
The runtime has some overhead in memory management due to GC.
It is easy to get CPU time vs completion time.
Completion time - CPU time ~ IO idle time.
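The sweep described above can be sketched in a few lines of Python. This is only an illustration of "read a bytes every b bytes" with wall-clock and CPU time measured separately, not the F# benchmark used for the measurements. The names `sweep` and `make_file` are hypothetical, and the optional write-back of the reversed chunk is an assumption modelled on the read-and-write variant of the test.

```python
import mmap
import os
import tempfile
import time


def sweep(path, a, b, write=False):
    """Touch `a` bytes every `b` bytes of a memory-mapped file.

    Returns (bytes_touched, wall_seconds, cpu_seconds).
    """
    size = os.path.getsize(path)
    touched = 0
    wall0 = time.perf_counter()
    cpu0 = time.process_time()
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mm:
        for off in range(0, size - a + 1, b):
            chunk = mm[off:off + a]            # read a bytes through the mapping
            if write:
                mm[off:off + a] = chunk[::-1]  # write them back reversed
            touched += len(chunk)
        if write:
            mm.flush()
    wall = time.perf_counter() - wall0
    cpu = time.process_time() - cpu0
    return touched, wall, cpu


def make_file(size):
    """Create a temporary file of `size` bytes (hypothetical demo helper)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(bytes(range(256)) * (size // 256))
    return path


if __name__ == "__main__":
    path = make_file(1 << 20)
    try:
        touched, wall, cpu = sweep(path, a=4096, b=16384)
        # wall - cpu approximates the time spent idle waiting for I/O
        print(touched, wall, cpu, wall - cpu)
    finally:
        os.remove(path)
```

On a small temporary file the data comes from the page cache, so the idle time is close to zero; the difference only becomes meaningful on files larger than the available memory.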

Our benchmark (Mono & F# & (Intel® Optane™ SSD | P3700))

    open System.IO
    open System.IO.MemoryMappedFiles

    let benchmarkFile(name: string, a: int, b: int64) =
      let sz = (FileInfo(name)).Length
      // Create mapping
      let fm = MemoryMappedFile.CreateFromFile(name, FileMode.Open)
      // Map into memory
      let v = fm.CreateViewAccessor(0L, sz)
      let start = System.DateTime.Now
      let buf : byte[] = Array.zeroCreate a
      for i in 0L .. b .. sz do
        if i + int64(a) < sz then
          // Read data
          let cnt = v.ReadArray<byte>(i, buf, 0, a)
          if cnt = a then
            System.Array.Reverse(buf)
            // Write data
            v.WriteArray(i, buf, 0, a) |> ignore
            buf |> Array.max |> ignore
      let endt = System.DateTime.Now
      let duration = (endt - start)
      printfn "%s, %d, %d, %f" name a b duration.TotalMilliseconds

Different benchmarks
We executed the benchmark along the following dimensions:
Single thread: various a and b; read and write.
Multiple threads: file sliced in strides and skips; every thread executes the benchmark on its stride; read and write.
Memory mapped memory (12 GB): single-thread test on memory, Optane and Nand; read and write.
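As a rough illustration of the multi-thread variant, the sketch below splits a mapped file into one contiguous slice per thread and lets each thread sweep its own slice. The names `slices`, `sweep_slice` and `parallel_sweep` are hypothetical, and the equal-size contiguous partitioning is an assumption, since the slides do not say how the strides were assigned. CPython threads also share the interpreter lock, so this shows the partitioning rather than the scaling observed with Mono.

```python
import mmap
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def slices(size, n_threads):
    """Split [0, size) into n_threads contiguous (start, end) slices."""
    step = size // n_threads
    bounds = [i * step for i in range(n_threads)] + [size]
    return list(zip(bounds[:-1], bounds[1:]))


def sweep_slice(mm, start, end, a, b):
    """Read a bytes every b bytes within [start, end); return bytes read."""
    total = 0
    for off in range(start, end - a + 1, b):
        total += len(mm[off:off + a])
    return total


def parallel_sweep(path, a, b, n_threads):
    """Sweep the whole file with one thread per slice; return bytes read."""
    size = os.path.getsize(path)
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            futures = [pool.submit(sweep_slice, mm, s, e, a, b)
                       for s, e in slices(size, n_threads)]
            return sum(fut.result() for fut in futures)


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as fh:
        fh.write(b"\x00" * (1 << 16))
    try:
        print(parallel_sweep(path, a=16, b=64, n_threads=4))
    finally:
        os.remove(path)
```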

Single-thread mmap (read)
[Figure: IO Intel® Optane™ / IO Nand, with series labelled 4096, 131072, 1048576, 4194304 and 1073741824; horizontal axis in powers of two from 16 to 16777216.]
[Figure: Total time and CPU time for Optane and Nand; horizontal axis from 1 to 31.]

Single-thread mmap (rw)
[Figure: IO Intel® Optane™ / IO Nand, with series labelled 4096, 131072, 1048576, 4194304 and 1073741824; horizontal axis in powers of two from 16 to 16777216.]
[Figure: Total time and CPU time for Optane and Nand; horizontal axis from 1 to 31.]

Multi-thread mmap
[Figures: two charts with series labelled 4096, 131072, 1048576, 4194304 and 1073741824; horizontal axis in powers of two from 16 to 16777216.]

Memory using mmap

Read (a = 4096, b = 4096)
Mem ms     Optane ms    Nand ms     Mem/Optane    Mem/Nand      Optane/Nand
21028.7    25842.664    26254.08    0.813720327   0.80096897    0.984329558

Read and write (only one of a, b is legible per row)
a or b     Mem ms      Optane ms    Nand ms     Mem/Optane    Mem/Nand      Optane/Nand
4194304    27390.1     42102.431    85178.88    0.650558539   0.321559726   0.494282538
4096       31470.33    40953.16     45844.77    0.768446928   0.686454048   0.89330053

With Intel® Optane™ and a similar mmap overhead layer the slowdown is only 35%.
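The ratio columns and the 35% figure can be re-derived from the reported timings. The short check below is a sketch that assumes the slowdown is computed as 1 - Mem ms / Optane ms for the first read-and-write row; the slides do not spell out the formula, and the names `timings` and `ratios` are only illustrative.

```python
# Timings in milliseconds as reported on the slide (decimal commas converted).
timings = {
    "read": (21028.7, 25842.664, 26254.08),
    "read and write, first row": (27390.1, 42102.431, 85178.88),
    "read and write, second row": (31470.33, 40953.16, 45844.77),
}


def ratios(mem, optane, nand):
    """Return (Mem/Optane, Mem/Nand, Optane/Nand) for one row of timings."""
    return mem / optane, mem / nand, optane / nand


for label, row in timings.items():
    mem_opt, mem_nand, opt_nand = ratios(*row)
    slowdown = (1 - mem_opt) * 100  # assumed definition of the slowdown
    print(f"{label}: {mem_opt:.3f} {mem_nand:.3f} {opt_nand:.3f} "
          f"(Optane slowdown vs memory: {slowdown:.0f}%)")
```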

MR Fingerprinting
Undersampling may help reduce examination time significantly (e.g. a 40 min traditional scan can be performed in 2 min).
Undersampled movie acquired.
Data compared with pre-computed dictionary: MATCH! (COMPUTATION INTENSIVE)
Multiple parameters at once.
This computation step doesn’t require the patient.
D. Ma et al., Nature 495, 187-192 (2013), doi:10.1038/nature11971

Using Intel® Optane™, mmap and MATLAB
We used mmap and MATLAB to multiply very large matrices.
The behavior reflects our tests.
MATLAB is attempting to use the disk; nevertheless, Intel® Optane™ SSD is performing better than Nand SSD.
Researchers (G. Buonincontri) are working on a smarter

Conclusions
Traditional I/O assumptions (orders of magnitude in latency and slow access) are diverging from reality.
NVMe dropped the disk controller, but software still behaves as if drives were not solid state.
Our experiments show that Intel® Optane™ SSD technology is a significant step towards main memory tiering instead of storage.
Memory mapping is a natural programming abstraction for explicitly addressing storage: it benefits from the virtual memory subsystem without thrashing system performance.