MICRO 51 Persistence Parallelism Optimization: A Holistic Approach from Memory Bus to RDMA Network

Xing Hu∗, Matheus Ogleari†, Jishen Zhao‡, Shuangchen Li∗, Abanti Basak∗, Yuan Xie∗
∗University of California, Santa Barbara; †University of California, Santa Cruz; ‡University of California, San Diego
Presenter: Seunghyo Kang

Persistent memory
  • Data remains after power-off
  • How do we keep data persistent and survive a crash?
  • Hardware ordering control + multi-versioning
  • Problem: inefficient along the persistent data paths
  • Solution: exploit parallelism

Three Datapaths
  • RDMA overhead: 90% of the time is stalled by the network
  • Bank-conflict overhead: 36% of requests are stalled by bank conflicts
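
To make the bank-conflict overhead concrete, here is a toy model, not the hardware from the talk. It assumes a simple block-interleaved address-to-bank mapping. The names `bank_of` and `count_conflicts` and the parameters `num_banks` and `block_bytes` are hypothetical.

```python
from collections import Counter

def bank_of(addr: int, num_banks: int = 8, block_bytes: int = 64) -> int:
    """Map an address to a bank by interleaving fixed-size blocks (assumed mapping)."""
    return (addr // block_bytes) % num_banks

def count_conflicts(addrs, num_banks: int = 8, block_bytes: int = 64) -> int:
    """Count requests in one batch whose bank is already used by an earlier request."""
    per_bank = Counter(bank_of(a, num_banks, block_bytes) for a in addrs)
    # The first request to each bank proceeds; every further one must wait.
    return sum(n - 1 for n in per_bank.values())

# Addresses 0 and 512 both map to bank 0; 64 and 128 map to banks 1 and 2.
batch = [0, 512, 64, 128]
stalled = count_conflicts(batch)
```

In this model a batch whose requests all map to distinct banks can be served in parallel, while requests sharing a bank are serialized.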

Motivation – In Local Memory Bus: Epoch Barrier
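
As a reminder of what an epoch barrier enforces, the sketch below checks a persist order against the usual epoch rule: writes within one epoch may persist in any order, but no write may persist before a write from an earlier epoch. The function name `respects_epochs` and the dictionary layout are assumptions for illustration only.

```python
def respects_epochs(epoch_of, persist_order):
    """Return True if the epochs seen in persist_order never decrease."""
    highest_seen = -1
    for write in persist_order:
        epoch = epoch_of[write]
        if epoch < highest_seen:
            # A write from an earlier epoch persisted after a later one.
            return False
        highest_seen = epoch
    return True

# Writes A and B belong to epoch 0; write C comes after the barrier (epoch 1).
epoch_of = {"A": 0, "B": 0, "C": 1}
ok_order = respects_epochs(epoch_of, ["B", "A", "C"])   # A and B may swap
bad_order = respects_epochs(epoch_of, ["A", "C", "B"])  # C overtakes B
```

The barrier only constrains ordering across epochs, which leaves the writes inside an epoch free to proceed in parallel.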

Motivation – In Local Memory Bus: Bank-Level Parallelism Barrier

Motivation – In Remote Memory

Architectural Design
  ① St X = x
  ② St X = y
  Dependency!
  Barrier Region Of Interest (BROI)
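
The dependency on the slide (two stores to the same address X) can be illustrated in software. This is only a sketch of the ordering constraint, not the BROI hardware, and the name `dependent_pairs` is hypothetical. It assumes that stores to the same address must persist in program order and that stores to different addresses carry no such constraint.

```python
def dependent_pairs(stores):
    """Given (addr, value) stores in program order, return index pairs (i, j)
    where store j must persist after store i because both write the same address."""
    last_writer = {}
    pairs = []
    for j, (addr, _value) in enumerate(stores):
        if addr in last_writer:
            pairs.append((last_writer[addr], j))
        last_writer[addr] = j
    return pairs

# ① St X = x, an unrelated store to Y, then ② St X = y.
stores = [("X", "x"), ("Y", 1), ("X", "y")]
pairs = dependent_pairs(stores)
```

Here only the two stores to X are ordered; the store to Y is free to persist in parallel with either of them.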

Architectural Design – Epoch

Architectural Design

Architectural Design – Overhead (≈ 400 B)

System Design
  • RDMA_pwrite: RDMA_write with ordering control
  • Advanced RDMA NIC: enable ACK
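
To see why ordering control at the NIC could reduce network stalls, consider a toy latency model. It is an assumption for illustration, not the protocol from the talk: the sender either waits one round trip for an ACK after every remote write, or waits once per epoch. The function name `ack_stall_us` and its parameters are hypothetical.

```python
def ack_stall_us(writes_per_epoch, rtt_us, ack_per_write):
    """Total time the sender spends waiting for ACKs, in microseconds.

    ack_per_write=True:  one round trip after every write.
    ack_per_write=False: one round trip at the end of each epoch.
    """
    if ack_per_write:
        return sum(writes_per_epoch) * rtt_us
    return len(writes_per_epoch) * rtt_us

# Three epochs with 4, 4 and 2 writes, and an assumed 2 us round trip.
epochs = [4, 4, 2]
per_write = ack_stall_us(epochs, rtt_us=2.0, ack_per_write=True)
per_epoch = ack_stall_us(epochs, rtt_us=2.0, ack_per_write=False)
```

Under these assumptions the stall time shrinks from one round trip per write to one per epoch, because writes inside an epoch no longer wait on each other.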

Experiment Setup
  • Benchmarks: microbenchmark, Whisper persistent benchmark

Local Application Performance
  • Local: 16%
  • Hybrid (Local + RDMA): 18%

Remote Application Performance
  • 2× improvement

Conclusion
  • Ordering control with intra-thread and inter-thread parallelism in local and remote nodes
  • Throughput improved in both local and remote persistent memory environments

Thank you
