Support for Symmetric Shadow Memory in Multiprocessors
Vijay Nagarajan, Rajiv Gupta
University of California, Riverside

Runtime Monitoring
• Applications of monitoring
  – Security
    • DIFT
  – Debugging
    • Memcheck, Redux, OnTrac
  – Performance
    • Speculation
• Requirements of monitoring
  – Shadow Memory (SM)
    • Meta-data associated with memory locations
  – Shadow memory instructions (SMIs)
    • Instructions for maintenance of meta-data

DIFT: Example
• Each word/reg associated with “taint” value
  – Data from input channels are considered tainted
  – Flow of tainted data is tracked
  – Usage of tainted data in “malicious” fashion detected

Original Instruction    Shadow Memory Operation
Ld reg, mem             Taint-val[reg] ← Taint-val[mem]
St reg, mem             Taint-val[mem] ← Taint-val[reg]
Add reg1, reg2          Taint-val[reg1] or Taint-val[reg2]
Jmp reg1                If Taint-val[reg1] raise exception
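
The taint rules on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the class `TaintTracker`, the exception `TaintError`, and the helper `taint_input` are hypothetical names, and the sketch assumes that `Add reg1, reg2` writes its result to `reg1`.

```python
class TaintError(Exception):
    """Raised when a tainted value is used as a jump target."""


class TaintTracker:
    """Hypothetical model of the per-instruction taint rules on the slide."""

    def __init__(self):
        self.reg_taint = {}  # register name -> bool
        self.mem_taint = {}  # memory address -> bool

    def taint_input(self, addr):
        # Data arriving from an input channel is considered tainted.
        self.mem_taint[addr] = True

    def ld(self, reg, addr):
        # Ld reg, mem: Taint-val[reg] <- Taint-val[mem]
        self.reg_taint[reg] = self.mem_taint.get(addr, False)

    def st(self, reg, addr):
        # St reg, mem: Taint-val[mem] <- Taint-val[reg]
        self.mem_taint[addr] = self.reg_taint.get(reg, False)

    def add(self, reg1, reg2):
        # Add reg1, reg2: taint is the OR of both operands.
        # Assumption: the result is written back to reg1.
        self.reg_taint[reg1] = (
            self.reg_taint.get(reg1, False) or self.reg_taint.get(reg2, False)
        )

    def jmp(self, reg):
        # Jmp reg1: raise an exception if the target register is tainted.
        if self.reg_taint.get(reg, False):
            raise TaintError(f"jump through tainted register {reg}")
```

A tainted input word loaded into `r1` and added into `r2` leaves `r2` tainted, so a later `jmp("r2")` raises `TaintError`.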

Shadow Memory Observations
• Single vs Multiple Shadow values
  – DIFT associates one taint value
  – Other applications associate multiple shadow values
    • DDG computes dynamic dependence graph on the fly
    • For each memory word, maintains (instruction, instance) pair that wrote to it last
• Symmetric SMIs
  – Original stores (loads) associated with shadow stores (loads)
• Atomic SMIs
  – The original memory instruction (OMI) and its SMIs must be executed atomically
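
As a rough illustration of the DDG shadow value described above, the sketch below keeps an (instruction, instance) pair per memory word and records a dependence edge on each load. The class name `DDGShadow` and its fields are hypothetical, and the instance counter is assumed to count dynamic executions of each static instruction.

```python
from collections import defaultdict


class DDGShadow:
    """Hypothetical last-writer shadow memory for dynamic dependence graphs."""

    def __init__(self):
        self.instance = defaultdict(int)  # static instruction -> executions so far
        self.last_writer = {}             # address -> (instruction, instance)
        self.edges = []                   # (writer, reader) dependence edges

    def store(self, instr, addr):
        # Shadow store: remember which dynamic instruction wrote this word last.
        self.instance[instr] += 1
        self.last_writer[addr] = (instr, self.instance[instr])

    def load(self, instr, addr):
        # Shadow load: link the reader to the last writer of the word, if any.
        self.instance[instr] += 1
        reader = (instr, self.instance[instr])
        writer = self.last_writer.get(addr)
        if writer is not None:
            self.edges.append((writer, reader))
```

Note the symmetry: the original store is paired with a shadow store, and the original load with a shadow load.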

Atomic SMIs
[Figure: interleavings of original stores and loads (St 1, St 2, Ld) with their shadow counterparts (S St 1, S St 2, S Ld) on Proc A and Proc B; panel labels: Inconsistent, Atomicity View]
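
The problem behind this slide can be reproduced with a toy interleaving. The sketch below is an assumption-laden illustration (the `run` function and the schedules are invented for this example): two processors store to the same word and then update its shadow value, and the final memory and shadow contents disagree when the pairs are not atomic.

```python
def run(schedule):
    """Apply (processor, operation) steps to one word 'x' and its shadow value."""
    mem = {}
    shadow = {}
    for proc, op in schedule:
        if op == "St":        # original store: the word now holds proc's data
            mem["x"] = proc
        elif op == "S_St":    # shadow store: the meta-data now describes proc
            shadow["x"] = proc
    return mem["x"], shadow["x"]


# Proc B's store and shadow store slip in between Proc A's pair.
non_atomic = [("A", "St"), ("B", "St"), ("B", "S_St"), ("A", "S_St")]

# Each original store executes atomically with its shadow store.
atomic = [("A", "St"), ("A", "S_St"), ("B", "St"), ("B", "S_St")]
```

With `non_atomic` the word holds Proc B's data while the shadow value describes Proc A's store; with `atomic` both agree.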

Robust & Efficient SM
• Each SM access involves
  – Calculating effective and shadow address
  – Accessing the shadow values
• Half-and-Half scheme
  – Reserve half of virtual space for shadow memory
  – Efficient SM access
  – Not robust [Nethercote and Seward VEE ’07]
• Valgrind’s s/w page table like scheme
  – Robust
  – Inefficient (Valgrind’s Memcheck causes 22x slowdown)
• Need to be efficient and robust!
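
The trade-off between the two schemes can be sketched as follows. This is a simplified model, not Valgrind's actual data structure: it assumes a 32-bit address space, and `SHADOW_BASE`, `half_and_half_shadow_addr`, and `TwoLevelShadow` are hypothetical names.

```python
SHADOW_BASE = 1 << 31  # assumption: upper half of a 32-bit space is reserved


def half_and_half_shadow_addr(addr):
    """Half-and-half: one addition, but breaks if the program uses the upper half."""
    if addr >= SHADOW_BASE:
        raise ValueError("address falls inside the reserved shadow half")
    return addr + SHADOW_BASE


class TwoLevelShadow:
    """Software page-table-like scheme: any address works, at the cost of lookups."""

    CHUNK_BITS = 16

    def __init__(self):
        self.primary = {}  # high address bits -> secondary table

    def store(self, addr, value):
        chunk = self.primary.setdefault(addr >> self.CHUNK_BITS, {})
        chunk[addr & ((1 << self.CHUNK_BITS) - 1)] = value

    def load(self, addr):
        chunk = self.primary.get(addr >> self.CHUNK_BITS)
        if chunk is None:
            return 0  # untouched memory has a default shadow value
        return chunk.get(addr & ((1 << self.CHUNK_BITS) - 1), 0)
```

The first function is a single add per access but rejects addresses in the reserved half; the table handles every address but needs two lookups per access.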

Research Question
• Can we make SMIs and OMIs atomic?
• Can we make SM accesses efficient without sacrificing robustness?
• Can we do the above with minimal HW support?

Our Approach
• Convey atomic block to the processor
  – Simple ISA support: shadow-start, shadow-end
  – SMIs implicitly identified
• Coupled Coherence
  – Coherence of SMIs and OMIs is coupled
  – Enforces the effect of atomicity
• OS Support
  – Couple allocation of original and shadow pages
  – Efficient addressing without sacrificing robustness

ISA Support
• Shadow-start / Shadow-end instructions
  – OMIs and SMIs enclosed
  – Conveys atomic block to the processor
  – Guides actions of cache coherence protocol
• Implicitly distinguishing SMIs
  – First instruction is an OMI
  – All others with same VA treated as SMIs
  – Multiple accesses implicitly assumed to access different shadow values

EXAMPLE
    shadow-start
    // Original load
    ld reg1, vaddr
    // 1st shadow load
    ld reg2, vaddr
    // 2nd shadow load
    ld reg3, vaddr
    shadow-end
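
One way to picture the implicit identification rule is a small classifier over the instructions between shadow-start and shadow-end. The function `classify_block` and its tuple layout are hypothetical; the slide does not say how accesses to a different virtual address inside the block are handled, so the sketch simply labels them "OTHER".

```python
def classify_block(instrs):
    """Label the memory instructions of one shadow-start/shadow-end block.

    instrs: list of (opcode, register, vaddr) tuples in program order.
    Returns a list of (kind, opcode, register, vaddr, shadow_index).
    """
    labeled = []
    omi_vaddr = None
    shadow_index = 0
    for opcode, reg, vaddr in instrs:
        if omi_vaddr is None:
            # The first memory instruction in the block is the OMI.
            omi_vaddr = vaddr
            labeled.append(("OMI", opcode, reg, vaddr, None))
        elif vaddr == omi_vaddr:
            # Later accesses to the same VA are SMIs; the k-th one is
            # assumed to touch the k-th shadow value.
            labeled.append(("SMI", opcode, reg, vaddr, shadow_index))
            shadow_index += 1
        else:
            # Assumption: behavior for a different VA is not specified.
            labeled.append(("OTHER", opcode, reg, vaddr, None))
    return labeled
```

Applied to the three loads in the example, the first is labeled the OMI and the next two become SMIs with shadow indices 0 and 1.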

Coupled Coherence
• Dependence Mirroring
  – Dependences among SMIs mirror those of the OMIs
  – If OMI 2 depends on OMI 1, then SMI 2 depends on SMI 1
  – Coupled coherence enforces this
[Figure: timeline of original accesses (St 1, St 2, Ld) and shadow accesses (S St 2, S Ld) on Proc A and Proc B]

Coupled Coherence
• Coupled Coherence involves
  – No explicit shadow coherence messages
    • SMIs do not trigger coherence messages
    • Shadow stores do not trigger invalidates
    • Shadow loads do not cause misses
  – Co-transfer
    • Data replies of original blocks are piggybacked with shadow blocks
  – Co-existence
    • Original blocks and shadow blocks co-exist in the cache
    • Brought in together
    • Replaced together

Dependence Mirroring: RAW
• Proc A: St, S St
• Proc B: Ld, S Ld
• Proc A sends invalidate for B and B’
• Proc B sends read miss for B and B’
• Proc A sends blocks B and B’
[Figure: coherence states (Shared, Exc, Inv) of block B and shadow block B’ on Proc A and Proc B]

Dependence Mirroring: RAW
• Proc A: shadow-start, St, S St, shadow-end
• Proc B: Ld, S Ld
• Proc A sends invalidate for B and B’
• Proc B sends read miss for B and B’
• Proc A waits until ready bit set
• Proc A sends blocks B and B’
[Figure: block B on Proc A (Exc, then Inv) with its ready bit going from 0 to 1]

Dependence Mirroring: WAR
• Proc A sends invalidates
• Proc B sends read miss for B and B’
• Proc A sends blocks B and B’
[Figure: timeline of St 1, S St 1, St 2 and Ld, S Ld on Proc A and Proc B]

Coupled Coherence
• On a cache miss
  – Original Ld / St
    • Place read miss for original, shadow block(s)
    • Write back dirty blocks
  – Shadow Ld / St
    • No coherence events
• Shadow-start
  – Set ready bit to 0
• Shadow-end
  – Set ready bit to 1
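
The rules above can be modeled with a toy two-cache example covering only the RAW case (one processor writes, the other reads). This is a sketch under several assumptions rather than the simulated protocol: invalidations and write-backs are omitted, a stalled request is modeled as an exception, and the names `Cache`, `Stalled`, `original_load`, and `shadow_load` are hypothetical.

```python
class Stalled(Exception):
    """The owner is inside a shadow block, so the reply must wait."""


class Cache:
    def __init__(self):
        self.blocks = {}       # block id -> {"data": ..., "shadow": ...}
        self.ready = 1         # ready bit: 1 outside a shadow block
        self.read_misses = 0   # coherence requests issued by this cache

    def shadow_start(self):
        self.ready = 0

    def shadow_end(self):
        self.ready = 1

    def original_store(self, block, data):
        self.blocks.setdefault(block, {"data": None, "shadow": None})["data"] = data

    def shadow_store(self, block, value):
        # Shadow stores update the co-resident shadow block; no messages.
        self.blocks[block]["shadow"] = value


def original_load(requester, owner, block):
    """Original load: on a miss, fetch the original and shadow block together."""
    if block not in requester.blocks:
        if owner.ready == 0:
            raise Stalled(block)  # the request is held until the ready bit is 1
        requester.read_misses += 1
        requester.blocks[block] = dict(owner.blocks[block])  # co-transfer
    return requester.blocks[block]["data"]


def shadow_load(cache, block):
    """Shadow load: always hits because the shadow block arrived with the original."""
    return cache.blocks[block]["shadow"]
```

If the owner is between shadow-start and shadow-end, the reader's original load stalls; once shadow-end sets the ready bit, one read miss brings in both blocks and the subsequent shadow load needs no further message.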

Symmetric/General SM
• Symmetric SM
  – Original loads (stores) accompanied by shadow loads (stores)
• General SM
  – Original load can be accompanied by both shadow loads and stores
    • E.g., Eraser: online race detection
  – Need to enforce shadow coherence for RAR
    • Typically no coherence events for RAR
    • Future work

Addressing Support
• Shadow pages allocated adjacent to original pages
  – Virtual memory space unaffected
  – Retains robustness
  – OS treats them as a single “superpage”
    • Swapped in and swapped out together
• Address Translation
  – During address translation add offset to access shadow page
  – Provides efficiency
  – No separate TLB for shadow pages
[Figure: the TLB maps a virtual page to the physical original page; an OMI uses the page offset directly, while an SMI adds an offset selected by the shadow value count to reach Shadow Page 1 or Shadow Page 2]
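
The offset-based translation can be sketched as below. The layout is an assumption made for illustration: a 4 KB page, one shadow page per shadow value, and shadow page k placed k + 1 pages after the original page in physical memory. The function `translate` and its parameters are hypothetical.

```python
PAGE_SIZE = 4096  # assumption: 4 KB pages


def translate(vaddr, tlb, shadow_index=None):
    """Translate a virtual address using a single TLB entry per superpage.

    tlb: dict mapping virtual page number -> physical base of the original page.
    shadow_index: None for an OMI, or k (0-based) for the k-th SMI.
    """
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    base = tlb[vpn]
    if shadow_index is None:
        return base + offset
    # Shadow pages sit right after the original page, so an SMI only
    # adds a fixed offset to the same translation.
    return base + (shadow_index + 1) * PAGE_SIZE + offset
```

Because the shadow address is derived from the original page's translation, no extra TLB entry or virtual address range is needed for shadow pages.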

Experiments
• Implementation in SESC Simulator
  – Cycle accurate, targets MIPS architecture
    • Shadow-start, Shadow-end instructions
  – Models cache coherence protocol
    • Coupled Coherence implementation
    • Bus-based protocol
  – Models basic OS services
    • Coupled page allocation
• Monitoring Applications
  – DIFT: Detection of security attacks
  – DDG: Computes dynamic dependence graph online
• Benchmarks
  – SPLASH-2

Efficiency of SM
• Three versions:
  – SM
    • Our SM implementation
    • ISA support
    • OS support for address translation
    • Coupled Coherence protocol for atomicity
  – VAL: serial
    • Valgrind’s SM support
    • Address translation: involves software page table accesses
    • Atomicity: enforced by thread serialization
  – VAL: lb
    • Valgrind’s SM support with no atomicity guarantees
    • Means of comparison of our address translation support

Efficiency of SM: DIFT
• VAL: serial causes 41 times overhead on average
  – Effect of serialization
• SM causes only 7 times overhead
  – Efficient address translation + coupled coherence
• Even without serialization, VAL: lb causes 12 times overhead
  – With coupled coherence this reduces to 7 times

Efficiency of SM: DDG
• VAL: serial causes 78 times overhead on average
  – Effect of serialization
• SM causes only 23 times overhead
  – Efficient address translation + coupled coherence
• Even without serialization, VAL: lb causes 27 times overhead
  – With coupled coherence this reduces to 23 times
  – Effect not as pronounced as in DIFT
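
To make the comparison concrete, the reported average overheads for DIFT and DDG can be turned into relative improvements. The dictionary below only restates the numbers from the two efficiency slides; the helper `improvement` is a hypothetical name.

```python
# Average slowdown factors reported on the DIFT and DDG slides.
overhead = {
    "DIFT": {"VAL: serial": 41, "VAL: lb": 12, "SM": 7},
    "DDG": {"VAL: serial": 78, "VAL: lb": 27, "SM": 23},
}


def improvement(app, baseline):
    """How many times lower the SM overhead is than the given baseline."""
    return overhead[app][baseline] / overhead[app]["SM"]


for app in overhead:
    for baseline in ("VAL: serial", "VAL: lb"):
        print(f"{app}: SM vs {baseline}: {improvement(app, baseline):.2f}x")
```

Against the serialized baseline, SM is roughly 5.9 times better for DIFT and 3.4 times better for DDG; against VAL: lb the gap is about 1.7 times for DIFT and 1.2 times for DDG, which matches the remark that the effect is less pronounced for DDG.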

Effect of Coupled Coherence
• Performance overhead < 0.6% for DIFT and DDG
  – Total amount of traffic is about the same
  – Coupled coherence sees more bursts in traffic

Related Work
• Enforcing Atomicity
  – Valgrind [Nethercote et al. PLDI ’07] through thread serialization
    • Not efficient
  – TM [Chung et al. HPCA ’08] can be used
    • Requires additional HW changes
    • Support for rollback and re-execution
• Address Translation
  – Valgrind [Nethercote VEE ’07] software page table structure
    • Proposed application-specific optimizations
    • Still inefficient
  – Half-and-Half scheme [Qin et al. MICRO ’07]
    • Divides virtual address space
    • Not robust

Conclusion
• SM used extensively for performing monitoring
  – Performance
  – Security
  – Debugging
• Support for improving SM performance
  – ISA support
  – Coupled coherence for atomicity
  – Coupled allocation for efficient addressing
  – Significant performance advantage
• Future Work
  – Extend the system beyond symmetric SMIs
  – Look at other techniques for providing atomicity without changes to the coherence protocol

Questions?