Cyrus: Unintrusive Application-Level Record-Replay for Replay Parallelism

Nima Honarmand, Nathan Dautenhahn, Josep Torrellas, and Samuel T. King (UIUC)
Gilles Pokam and Cristiano Pereira (Intel)
iacoma.cs.uiuc.edu

Record-and-Replay (RnR)
• Record the execution of a parallel program or a whole machine
  – Save non-deterministic events in a log
• During replay, use the recorded log to enforce the same execution
  – Each thread follows the same sequence of instructions
• Use cases
  – Debugging
  – Security
  – High availability

Contribution: the Cyrus RnR System
• Application-level RnR
  – RnR one or more programs in isolation
  – What users typically need
• Fast replay
  – Replay-time parallelism
  – Flexibly trade off parallelism for log size
• Unintrusive HW
  – No changes to the snoopy cache-coherence protocol

Capturing Non-determinism
• Sources of non-determinism
  – Program inputs
  – Memory access interleavings
• How to capture them?
  – An OS kernel extension captures program inputs (a sketch of a possible log entry follows below)
  – HW support captures memory interleavings (HW-assisted RnR)
• This talk: recording memory interleavings
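The slide only says that a kernel extension captures program inputs. As a hedged illustration of what such an input log might contain, the sketch below assumes the recorded inputs are system-call results and the bytes copied into the application; all field names are assumptions, not Cyrus's actual format.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical input-log entry written by the kernel extension during recording.
struct InputLogEntry {
    uint32_t tid;               // thread that received the input
    uint64_t syscall_number;    // which system call produced it
    int64_t  return_value;      // value to reproduce during replay
    std::vector<uint8_t> data;  // bytes copied into the application, if any
};

// During replay, the kernel extension would feed these recorded values back
// to the application instead of re-executing the real input sources.
```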

Recording Interleavings as Chunks
• Inter-processor data dependences manifest as coherence messages
• Capture the interleaving as ordered chunks of instructions (a data-structure sketch follows below)
[Figure: timelines for P0 and P1; the coherence request/response for P0's "store A" and P1's "load A" splits each processor's instruction stream into ordered chunks]
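As a minimal sketch of the idea, the structures below describe an execution as per-processor chunks plus ordering constraints derived from coherence transactions. This is an illustration of the concept, not the log format used by Cyrus; the names and fields are assumptions.

```cpp
#include <cstdint>

// A contiguous run of instructions on one processor.
struct Chunk {
    int      cpu;         // processor that executed the chunk
    uint64_t chunk_id;    // position of the chunk in that processor's stream
    uint64_t inst_count;  // number of instructions in the chunk
};

// One ordering constraint derived from a coherence transaction:
// the source chunk must be replayed before the successor chunk.
struct ChunkOrder {
    Chunk source;     // e.g., P0's chunk ending after "store A"
    Chunk successor;  // e.g., P1's chunk starting at "load A"
};

// During replay, a thread may only start a chunk once all chunks that
// precede it according to these constraints have completed.
```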

Restriction: Unintrusive HW
• Unmodified snoopy protocols
  – In some coherence transactions, there is no reply
[Figure: coherence traffic between P0 and P1 for a RAW and a WAR dependence; the WAR case involves only an invalidation, with no explicit reply]
• Requirements for HW-assisted RnR:
  – Do not augment or add coherence messages
  – Do not rely on explicit replies
• Only the source of a dependence is always aware of it → use source-only recording

Challenge 1: Enable Replay Parallelism
• Key to fast replay
  – Overlapped replay of chunks from different threads
• Previous work:
  – DAG-based ordering (Karma [ICS 2011])
    • Requires explicit replies
    • Augments coherence messages

Challenge 1: Enable Replay Parallelism
[Figure: a recorded P0→P1 dependence; the predecessor chunk on P0 is known, but with source-only recording the successor chunk on P1 is not directly identified, shown as "?"]

Challenge 2: Application-Level RnR
• Turn the recording hardware on only when a recorded application runs. Four cases for a dependence:
  (1) src = monitoring, dst = monitoring
  (2) src = monitoring, dst = not monitoring
  (3) src = not monitoring, dst = monitoring
  (4) src = not monitoring, dst = not monitoring
• Issues with source-only recording:
  – Cannot distinguish between (1) and (2)
  – (2) may result in a dependence later
  – Nothing is recorded in (3) and (4)
[Figure: dependences among P0, P1, and P2 illustrating cases (1) through (4); some processors run the monitored application, others run non-monitored work]

Challenge 2: Application-Level RnR
• Treat (2) as an Early Dependence (a handling sketch follows below)
  – Defer it and assign it to the next chunk of the target processor
• (3) and (4) are superseded by context switches
  – At a context switch, record a Serialization Dependence to all other processors
[Figure: Early and Serialization dependences among P0, P1, and P2, distinguishing the monitored application from non-monitored dependences]
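The sketch below illustrates, under stated assumptions, how a source-side recorder might act on a snooped request given these four cases; it is not Cyrus's actual hardware logic, and all names are invented for this example.

```cpp
#include <cstdint>

// Hypothetical entry the source would log when it records a dependence.
struct SourceRecord {
    int      src_cpu;        // processor acting as the dependence source
    int      requester_cpu;  // processor whose request was snooped
    uint64_t src_timestamp;  // coherence-transaction count at the source
};

enum class Action { Record, Ignore };

// Called at the source processor when a snooped request hits its chunk signature.
Action on_snooped_request(bool src_monitored) {
    if (!src_monitored) {
        // Cases (3) and (4): the source is not recording, so nothing is logged
        // here; any ordering that matters is covered by the Serialization
        // Dependence logged to all other processors at the next context switch.
        return Action::Ignore;
    }
    // Cases (1) and (2): the source cannot tell whether the requester is also
    // being monitored, so it records the dependence either way. If the
    // requester turns out not to be monitored (case 2), the software backend
    // later treats the entry as an Early Dependence and assigns it to the
    // requester's next monitored chunk.
    return Action::Record;
}
```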

Key: On-the-Fly Backend Software Pass
[Figure: the recording processors produce a source-only log; the on-the-fly backend turns it into a DAG of chunks, which the replaying processors consume]
• Transforms the source-only log into a DAG (for parallelism)
• Fixes the Early and Serialization dependences
  – To support application-level RnR
• Can trade replay parallelism for log size

Memory Race Recording Unit (RRU)
• HW module that observes coherence transactions and cache evictions
• Tracks the loads/stores of the current chunk in a signature
• Keeps signatures for multiple recent chunks
• Records for each chunk (see the sketch below):
  – Number of instructions
  – Timestamp (number of coherence transactions)
  – The dependences for which the chunk is the source
• Dumps recorded chunks into a log in memory
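As a hedged sketch of the per-chunk state listed on this slide, the structures below show one plausible shape for an RRU log entry; the exact layout, signature type, and field widths are assumptions made for illustration.

```cpp
#include <bitset>
#include <cstdint>
#include <vector>

// A dependence for which the current chunk is the source.
struct SourcedDependence {
    int      dst_cpu;    // processor that issued the snooped request
    uint64_t timestamp;  // coherence-transaction count when the request was seen
};

// What gets dumped to the in-memory log when a chunk closes.
struct ChunkLogEntry {
    uint64_t inst_count;                     // number of instructions in the chunk
    uint64_t timestamp;                      // coherence-transaction count so far
    std::vector<SourcedDependence> sourced;  // dependences this chunk is source of
};

// While a chunk is open, the RRU also keeps an address signature (sketched
// here as a simple hashed bit vector) so snooped requests can be tested
// against the chunk's loads and stores; the signature itself need not be logged.
struct OpenChunkState {
    std::bitset<1024> signature;  // hashed addresses touched by the chunk
    ChunkLogEntry     entry;
};
```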

RRUs Record the Source-Only Log
[Figure: worked example on a timestamp axis (100 to 300 coherence transactions): P0 executes chunks C00 and C01 (accessing A and B), P1 executes C10, and P2 executes C20; for each chunk the RRU logs its timestamp and a successor vector with one entry per other processor, recording the transaction at which a dependence to that processor was observed]

Backend Pass Creates the DAG
• Finds the target chunk for each recorded dependence
  – Creates bidirectional links between the source and destination chunks (see the sketch below)
• This algorithm is called MaxPar
[Figure: the successor-vector entries from the previous example are resolved into edges between the chunks of P0, P1, and P2]
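The sketch below illustrates a MaxPar-style matching step under these assumptions: each chunk carries the transaction timestamp at which it closed, timestamps from different processors are comparable (one snoopy bus), and each recorded dependence names a destination processor and the timestamp at which that processor's request was observed. The target is then the destination's earliest chunk closing at or after that timestamp. This is an illustration of the idea, not the paper's exact algorithm.

```cpp
#include <cstdint>
#include <vector>

struct DagChunk {
    int      cpu;
    uint64_t end_timestamp;        // transaction count when the chunk closed
    uint64_t inst_count;
    std::vector<DagChunk*> preds;  // chunks that must replay before this one
    std::vector<DagChunk*> succs;  // chunks that must replay after this one
};

struct RecordedDep {
    DagChunk* src_chunk;  // chunk that was the source of the dependence
    int       dst_cpu;    // processor that issued the dependent request
    uint64_t  timestamp;  // when that request was observed
};

// per_cpu_chunks[p] holds processor p's chunks in program order.
void build_dag(std::vector<std::vector<DagChunk>>& per_cpu_chunks,
               const std::vector<RecordedDep>& deps) {
    for (const RecordedDep& d : deps) {
        for (DagChunk& cand : per_cpu_chunks[d.dst_cpu]) {
            if (cand.end_timestamp >= d.timestamp) {
                // First chunk on the destination processor that could contain
                // the dependent access: link it both ways and stop searching.
                d.src_chunk->succs.push_back(&cand);
                cand.preds.push_back(d.src_chunk);
                break;
            }
        }
    }
}
```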

Trading Replay Parallelism for Log Size
[Figure: the example chunks (C00, C01, C10, C20) encoded in three alternative log formats]
• Stitched: consecutive chunks of the same processor are merged (e.g. C00 + C01) → less parallelism, smaller log (a stitching sketch follows below)
• StSerial and Serial: the chunks are totally ordered → no parallelism, progressively smaller logs
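As a hedged illustration of the stitching idea, the sketch below folds a chunk into its same-processor predecessor when no other processor has to observe the boundary between them; this conservative rule is an assumption for the example, not the transformation Cyrus actually applies.

```cpp
#include <cstdint>
#include <vector>

struct StitchChunk {
    uint64_t inst_count;
    bool     has_cross_cpu_successor;    // chunk is the source of a dependence
    bool     has_cross_cpu_predecessor;  // chunk is the target of a dependence
};

// Fold chunk i+1 into chunk i when no outgoing edge leaves i and no incoming
// edge enters i+1. Fewer chunks means a smaller log, but also fewer units
// that can be replayed in parallel.
std::vector<StitchChunk> stitch(const std::vector<StitchChunk>& chunks) {
    std::vector<StitchChunk> out;
    for (const StitchChunk& c : chunks) {
        if (!out.empty() &&
            !out.back().has_cross_cpu_successor &&
            !c.has_cross_cpu_predecessor) {
            out.back().inst_count += c.inst_count;
            out.back().has_cross_cpu_successor = c.has_cross_cpu_successor;
        } else {
            out.push_back(c);
        }
    }
    return out;
}
```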

Evaluation
• Using Simics
  – Full-system simulation with an OS
  – Wrote a Linux kernel module that
    • Records application inputs
    • Controls the RRUs
• Modeled 8 + 1 processors
  – 8 processors for the application
  – 1 processor for the backend
• 10 SPLASH-2 benchmarks

Replay Time Normalized to Recording
[Figure: normalized replay time (log scale) of rep-MaxPar, rep-Stitched, rep-StSerial, and rep-Serial across the SPLASH-2 benchmarks and their average]
• Large difference between MaxPar and Serial replay
• On 8 processors, unoptimized MaxPar replay is only 50% slower than recording

Conclusions
• Cyrus: an RnR system that supports
  – Application-level RnR
  – Unintrusive hardware
  – Flexible replay parallelism
• Key idea: an on-the-fly software backend pass
• On 8 processors:
  – Large difference between MaxPar and Serial replay
  – Unoptimized MaxPar replay is only 50% slower than recording
  – Negligible recording overhead
• An upcoming ISCA '13 paper describes our FPGA RnR prototype