Peking University
Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions
Shiru Ren, Chunqi Li, Le Tan, and Zhen Xiao
July 27, 2015

Table of Contents
1 Introduction
2 Background & Motivation
3 Samsara Overview
4 R&R the Memory Interleaving with HAV
5 Evaluation
6 Conclusion

Introduction
Deterministic Replay
- Gives computer users the ability to travel backward in time, recreating past states and events in the computer.
- Checkpoint + record all non-deterministic events (NDEs)
Application Scenarios
- For debugging: cyclic debugging
- For security: forensics, intrusion detection, malware analysis
- For fault tolerance: hot standby, data recovery
[Figure: Recording phase: starting from the initial state, the instruction stream runs together with non-deterministic events (e.g. user/network input, interrupts) to reach the final state, and the NDEs are written to a log. Replay phase: starting from the checkpoint, the same instruction stream is executed and the logged NDEs are injected at the logged points to reach Final State'.]
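
The record and replay phases described on this slide can be illustrated with a toy model. The sketch below is an assumption-laden illustration, not Samsara's implementation: `run`, `record`, and `replay` are hypothetical names, the "instruction stream" is a simple arithmetic loop, and the NDE log is a dictionary keyed by step number.

```python
import random

def run(steps, nde_source):
    """Execute a deterministic instruction stream; nde_source(i) may inject an event."""
    state = 0
    for i in range(steps):
        state = (state * 31 + i) % 1_000_003   # deterministic part of the execution
        event = nde_source(i)                  # non-deterministic event, or None
        if event is not None:
            state ^= event
    return state

def record(steps, seed):
    """Recording phase: run with live NDEs and log each one at the step it occurred."""
    rng = random.Random(seed)
    log = {}
    def live(i):
        if rng.random() < 0.1:                 # an input or interrupt arrives
            log[i] = rng.randrange(1 << 16)
            return log[i]
        return None
    return run(steps, live), log

def replay(steps, log):
    """Replay phase: same instruction stream, NDEs injected at the logged points."""
    return run(steps, log.get)
```

Replaying with the recorded log reproduces the recorded final state because every source of non-determinism has been replaced by a log lookup.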

Introduction
Deterministic Replay for Multi-processor
- Deterministic replay for a single processor is relatively mature and well developed
- Challenge on multi-processor systems: memory interleaving


Background & Motivation
Software-only schemes
- Modify the OS, compiler, runtime libraries, or VMM
- Virtualization-based approaches: CREW protocol
- CREW: Concurrent-Read & Exclusive-Write
Issues
- Each memory access operation must be checked for logging
- Serious performance degradation (more than 10X)
- Huge log size (approximately 1 MB/processor/second)

Background & Motivation
Hardware-based schemes
- Use special hardware support for recording memory-access interleaving
- Redesign the cache coherence protocol
Issues
- Huge space overhead, which limits the duration of the recorded interval
- Modeled only using software simulations
- Only support Sequential Consistency (SC)
- Impractical for use in realistic systems
We believe software-only approaches will remain the focus for optimization, as commercial processors with dedicated hardware-based R&R features are not commonly available yet.

Background & Motivation
Combine the merits of both solutions
- An efficient way to record memory interleaving
- Without hardware changes
- Utilize current hardware acceleration as much as possible
Hardware-assisted virtualization
- Some hardware characteristics are available to boost performance
- Efficient full virtualization using help from hardware capabilities
Evaluation results show that our system:
- Incurs less than 3X overhead when compared to native execution
- Reduces log size by 90% (even smaller than hardware-based schemes)


Samsara Overview
[Figure: Samsara system overview]

Samsara Overview
System composition
- Controller
- DMA recorder
- R&R Component
- Log recorder daemon
Record and Replay Non-deterministic Events
- Synchronous Events: record the contents
- Asynchronous Events: record the timestamp
- Compound Events: DMA, record both
- Memory interleaving: the most important challenge


R&R the Memory Interleaving with HAV Extensions
Motivation
- Point-to-point logging approach: records dependences between pairs of instructions (large log size and recording overhead)
- Goal: avoid the large number of memory access detections
- Chunk-based schemes: only the total sequence of chunks is recorded
Chunk-based Strategy
- Restrict virtual processors' execution into a series of chunks
- Merely need to record the commit order
- Chunk execution must satisfy:
  - Atomicity
  - Serializability

R&R the Memory Interleaving with HAV Extensions
- Serializability: COW, conflict detection strategy
- Atomicity: some instructions are hard to undo
[Figure: chunk execution on processors P0 and P1. Stores (ST) go to copy-on-write (COW) pages. A chunk is truncated at an I/O instruction or when it reaches the chunk size limit. At chunk completion, the chunk's R-set and W-set go through conflict detection, followed either by commit or by squash, rollback, and re-execution. Example sets shown: R-set {D}, W-set {D}; R-set {A}, W-set {A, B}; R-set {A, B}, W-set {B}.]
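
The conflict detection step can be sketched with the example sets from this slide. The rule used here is an assumption for illustration (a chunk conflicts if any page it read or wrote was written by a chunk that committed while it was running); the slides do not spell out the exact check, and `has_conflict` is a hypothetical name.

```python
def has_conflict(r_set, w_set, committed_w_sets):
    """Return True if this chunk touched a page written by a concurrently committed chunk.

    r_set / w_set: pages read / written by the chunk that wants to commit.
    committed_w_sets: W-sets of chunks that committed while this chunk was executing.
    """
    touched = r_set | w_set
    return any(touched & other_w for other_w in committed_w_sets)

# Sets taken from the slide's figure.
first_chunk = ({"D"}, {"D"})          # R-set {D}, W-set {D}
committed_chunk = ({"A"}, {"A", "B"}) # R-set {A}, W-set {A, B}; assumed to commit first
late_chunk = ({"A", "B"}, {"B"})      # R-set {A, B}, W-set {B}
```

Under this rule, the chunk with R-set {A, B} overlaps the committed W-set {A, B}, so it is squashed, rolled back, and re-executed, while the chunk that only touched D can commit.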

R&R the Memory Interleaving with HAV Extensions
Obtain R&W-set Efficiently via HAV Extensions
- VM-based approaches: VM exit (hardware page protection)
- Our approach: a single EPT traversal
  - Accessed and Dirty flags of EPT
  - Optimization: tree-based design of EPT
- Only the first write to each memory page triggers an EPT violation
- Reduces extra VM exits by at least 50%
[Figure: a sequence of accesses R(b), W(b), W(e), R(a), R(c), contrasting a VM exit per access with a single EPT traversal.]
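
The idea of collecting the R-set and W-set with one traversal at the end of a chunk can be sketched on a toy two-level page table. Everything below is a simplified model and an assumption: `ToyEPT` is a hypothetical class, the flag values are illustrative constants, and the R-set is taken conservatively as every accessed page (a written page also has its accessed flag set). Real EPT structures are maintained by hardware and have more levels.

```python
ACCESSED = 1 << 8   # illustrative flag value for "page was accessed"
DIRTY = 1 << 9      # illustrative flag value for "page was written"
FANOUT = 512        # entries per table in this toy model

class ToyEPT:
    def __init__(self):
        self.dir_flags = {}    # upper-level index -> flags of the upper-level entry
        self.leaf_flags = {}   # upper-level index -> {leaf index -> flags}

    def access(self, page, write=False):
        """Stand-in for the hardware setting flags as the guest touches a page."""
        d, p = divmod(page, FANOUT)
        self.dir_flags[d] = self.dir_flags.get(d, 0) | ACCESSED
        leaf = self.leaf_flags.setdefault(d, {})
        leaf[p] = leaf.get(p, 0) | ACCESSED | (DIRTY if write else 0)

    def collect_and_clear(self):
        """Single traversal at chunk end: build R/W-sets and reset flags for the next chunk."""
        r_set, w_set = set(), set()
        for d, flags in self.dir_flags.items():
            if not flags & ACCESSED:
                continue                       # skip subtrees no one touched
            leaf = self.leaf_flags[d]
            for p, f in leaf.items():
                page = d * FANOUT + p
                if f & ACCESSED:
                    r_set.add(page)            # conservative: accessed pages
                if f & DIRTY:
                    w_set.add(page)
                leaf[p] = 0
            self.dir_flags[d] = 0
        return r_set, w_set
```

Skipping upper-level entries whose accessed flag is clear is what the tree-based optimization buys: untouched regions of guest memory cost one check instead of one check per page.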

R&R the Memory Interleaving with HAV Extensions
Observations
- Chunk commit is a time-consuming process
- The time spent waiting for the commit lock is excessive (40%)
- The write-back of updates causes serious performance degradation
[Figure: chunk commit timeline on P0, with stages labeled Chunk Complete, Obtain R&W-set, Wait for Lock, Lock, Detect Conflict, Broadcast Updates, Write-back Updates, Subsequent Chunk.]

R&R the Memory Interleaving with HAV Extensions
Reduce Lock Granularity with a Decentralized Three-Phase Commit Protocol
- Pages committed concurrently by different chunks have no intersection
- Move the write-back out of the synchronized block
- Chunks may commit out of order
[Figure: three-phase commit timeline on P0, with stages labeled Chunk Complete, Obtain R&W-set, Wait for Lock, Lock, Detect Conflict, Insert into committing list, Broadcast Updates, Write-back Updates, Update Chunk Info, Check Committing List, Subsequent Chunk.]
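
A minimal sketch of moving the write-back out of the critical section is shown below. It is an interpretation of the stage labels on this slide, not the authors' protocol: `CommitManager`, its fields, and the exact split into phases are assumptions, and guest memory is modeled as a dictionary.

```python
import threading

class CommitManager:
    def __init__(self):
        self.lock = threading.Lock()
        self.committing = {}   # chunk id -> pages currently being written back
        self.commit_log = []   # recorded commit order
        self.memory = {}       # page -> value (stand-in for guest memory)

    def commit(self, chunk_id, r_set, updates):
        """updates maps page -> new value; returns False if the chunk must be squashed."""
        w_set = set(updates)
        # Phase 1 (short critical section): detect conflicts against chunks that
        # are still committing, then reserve this chunk's pages and log the order.
        with self.lock:
            for pages in self.committing.values():
                if pages & (r_set | w_set):
                    return False
            self.committing[chunk_id] = w_set
            self.commit_log.append(chunk_id)
        # Phase 2 (no lock held): write back updates. Safe to run in parallel
        # because concurrently committing chunks hold disjoint page sets.
        for page, value in updates.items():
            self.memory[page] = value
        # Phase 3 (short critical section): update chunk info, leave the list.
        with self.lock:
            del self.committing[chunk_id]
        return True
```

The lock now covers only the conflict check and list bookkeeping, so chunks with disjoint pages can overlap their write-back phases instead of queuing behind one global lock.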

R&R the Memory Interleaving with HAV Extensions
Replay Memory Interleaving
- Guarantee that all chunks are properly rebuilt and executed in the original order:
  1. Truncate a chunk at the recorded timestamp (hardware performance counter)
  2. Ensure that all preceding chunks have been committed successfully before the subsequent chunk starts
- Allows processors to execute concurrently during replay
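
The ordering constraint in step 2 can be sketched with one thread per virtual processor and a shared condition variable. This covers only the commit ordering; chunk truncation by performance counter is not modeled. The function name `replay_commit_order` and the log format (a list of processor ids in recorded commit order) are assumptions for illustration.

```python
import threading

def replay_commit_order(commit_log, num_cpus):
    """Replay commits so they occur in exactly the order recorded in commit_log."""
    cond = threading.Condition()
    state = {"next": 0}          # index of the next log entry allowed to commit
    replayed = []

    def run_cpu(cpu):
        for pos, owner in enumerate(commit_log):
            if owner != cpu:
                continue
            with cond:
                # Wait until every preceding chunk in the log has committed.
                cond.wait_for(lambda: state["next"] == pos)
                replayed.append(owner)
                state["next"] += 1
                cond.notify_all()

    threads = [threading.Thread(target=run_cpu, args=(c,)) for c in range(num_cpus)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return replayed
```

The processors still run as separate threads; they synchronize only at the commit points, which is what lets replay proceed concurrently between those points.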


Evaluation
Experimental Setup
- 4-core Intel Core i7-4790 processor, 12 GB memory, 1 TB hard drive
- Host: Ubuntu 12.04 with Linux kernel version 3.11.0 and Qemu-1.2.2
- Guest: Ubuntu 14.04 with Linux kernel version 3.13.0
Workloads
- PARSEC 3.0
[Table: workload characteristics. Column headings: Workload, Application Domain, Parallelization Granularity, Data Usage (Sharing, Exchange), Working Set. Row values in the order they appear on the slide:]
- blackscholes: Financial Analysis; coarse, low, small
- bodytrack: Computer Vision; medium, high, medium
- raytrace: Rendering; medium, high, low, unbounded
- swaptions: Financial Analysis; coarse, low, medium
- freqmine: Data Mining; coarse, high, medium, unbounded
- x264: Media Processing; coarse, high, medium

Evaluation
Log Size
- Samsara generates logs at an average rate of 0.127 MB/s and 0.151 MB/s when recording two and four processors, respectively
- Reduces log size by 90% (even smaller than hardware-based schemes)
- For comparison: 0.131 MB/s (single processor)

Evaluation
Log Size
- The size of the chunk commit log is practically negligible compared with other non-deterministic events
- 2.33% with 2 processors and 7.37% with 4 processors on average

Evaluation
Recording Overhead Compared to Native Execution
- Measure the overhead of the system relative to the base platform it runs on (cannot reflect the actual execution time of the system in real life)
- 2.6X and 5.0X for recording these workloads on two and four processors, respectively

Conclusion
We made the first attempt to leverage HAV extensions to achieve an efficient software-based replay system
- Record processors' execution as a series of chunks
- Avoid the large number of memory access detections by performing a single EPT traversal at the end of each chunk
- Propose a decentralized three-phase commit protocol to reduce the lock granularity of the chunk commit process

Thanks