Support for Symmetric Shadow Memory in Multiprocessors
Vijay Nagarajan, Rajiv Gupta
University of California, Riverside

Runtime Monitoring
• Applications of monitoring
  – Security
    • DIFT
  – Debugging
    • Memcheck, Redux, OnTrac
  – Performance
    • Speculation
• Requirements of monitoring
  – Shadow Memory (SM)
    • Meta-data associated with memory locations
  – Shadow memory instructions (SMIs)
    • Instructions for maintenance of meta-data

DIFT: Example
• Each word/reg associated with a “taint” value
  – Data from input channels are considered tainted
  – Flow of tainted data is tracked
  – Usage of tainted data in “malicious” fashion is detected

Original Instruction | Shadow Memory Operation
Ld reg, mem          | Taint-val[reg] ← Taint-val[mem]
St reg, mem          | Taint-val[mem] ← Taint-val[reg]
Add reg1, reg2       | Taint-val[reg1] or Taint-val[reg2]
Jmp reg1             | If Taint-val[reg1], raise exception
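The propagation rules in the table above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class `TaintTracker`, the exception `TaintViolation`, and the choice to write the result of `Add` back into `reg1` are all assumptions made for the example.

```python
from dataclasses import dataclass, field


class TaintViolation(Exception):
    """Raised when a tainted value is used as a jump target."""


@dataclass
class TaintTracker:
    # One taint bit per register and per memory word (hypothetical layout).
    reg_taint: dict = field(default_factory=dict)
    mem_taint: dict = field(default_factory=dict)

    def taint_input(self, addr):
        # Data arriving from an input channel is considered tainted.
        self.mem_taint[addr] = True

    def load(self, reg, addr):
        # Ld reg, mem: Taint-val[reg] <- Taint-val[mem]
        self.reg_taint[reg] = self.mem_taint.get(addr, False)

    def store(self, reg, addr):
        # St reg, mem: Taint-val[mem] <- Taint-val[reg]
        self.mem_taint[addr] = self.reg_taint.get(reg, False)

    def add(self, reg1, reg2):
        # Add reg1, reg2: taint is the OR of both operands.
        # Assumption: the result is written back to reg1.
        self.reg_taint[reg1] = (self.reg_taint.get(reg1, False)
                                or self.reg_taint.get(reg2, False))

    def jump(self, reg):
        # Jmp reg1: a tainted jump target raises an exception.
        if self.reg_taint.get(reg, False):
            raise TaintViolation(f"jump through tainted register {reg}")
```

A tracker used this way would flag a jump whose target was derived, through any chain of loads, stores, and adds, from input data.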

Shadow Memory Observations
• Single vs. multiple shadow values
  – DIFT associates one taint value
  – Other applications associate multiple shadow values
    • DDG computes the dynamic dependence graph on the fly
    • For each memory word, it maintains the (instruction, instance) pair that wrote to it last
• Symmetric SMIs
  – Original stores (loads) associated with shadow stores (loads)
• Atomic SMIs
  – An original memory instruction (OMI) and its SMIs must be executed atomically
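To make the DDG shadow value concrete, here is a small sketch of a tracker that keeps the last-writer (instruction, instance) pair per memory word and records a dependence edge on each load. The class `DDGTracker` and its method names are hypothetical; instructions are identified by an integer program counter purely for illustration.

```python
from collections import defaultdict


class DDGTracker:
    def __init__(self):
        self.instance = defaultdict(int)  # pc -> number of executions so far
        self.last_writer = {}             # addr -> (pc, instance) shadow value
        self.edges = []                   # (reader, writer) dependence edges

    def _next_instance(self, pc):
        # Each dynamic execution of an instruction gets a fresh instance number.
        self.instance[pc] += 1
        return (pc, self.instance[pc])

    def store(self, pc, addr):
        # Shadow store: remember which dynamic instruction wrote this word.
        self.last_writer[addr] = self._next_instance(pc)

    def load(self, pc, addr):
        # Shadow load: look up the last writer and add a dependence edge.
        reader = self._next_instance(pc)
        writer = self.last_writer.get(addr)
        if writer is not None:
            self.edges.append((reader, writer))
        return writer
```

Note the symmetry the slide describes: the original store is paired with a shadow store, and the original load with a shadow load.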

Atomic SMIs
[Figure: Proc A and Proc B each issue original accesses (St 1, St 2, Ld) together with their shadow counterparts (S St 1, S St 2, S Ld). Labels: "Inconsistent", "Atomicity View".]
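The need for atomicity can be illustrated by enumerating interleavings of two processors that each perform a store followed by its shadow store to the same location. The sketch below is an assumption-laden toy (the helpers `interleavings` and `final_state` are hypothetical, and each store simply writes its processor's name), but it shows that some interleavings leave the memory word and its shadow value written by different processors.

```python
from itertools import combinations


def interleavings(a, b):
    """Yield every merge of sequences a and b that preserves program order."""
    n = len(a) + len(b)
    for a_slots in combinations(range(n), len(a)):
        a_iter, b_iter = iter(a), iter(b)
        yield [next(a_iter) if i in a_slots else next(b_iter) for i in range(n)]


def final_state(schedule):
    """Replay a schedule; return (last writer of memory, last writer of shadow)."""
    mem = shadow = None
    for kind, writer in schedule:
        if kind == "st":
            mem = writer
        else:  # "sst": shadow store
            shadow = writer
    return mem, shadow


proc_a = [("st", "A"), ("sst", "A")]
proc_b = [("st", "B"), ("sst", "B")]

# Schedules in which memory and shadow end up written by different processors.
inconsistent = [s for s in interleavings(proc_a, proc_b)
                if final_state(s)[0] != final_state(s)[1]]
```

When each store and its shadow store execute as one atomic unit, only the two schedules that run one processor's pair entirely before the other's remain, and both are consistent.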

Robust & Efficient SM
• Each SM access involves
  – Calculating the effective and shadow addresses
  – Accessing the shadow values
• Half-and-Half scheme
  – Reserve half of the virtual address space for shadow memory
  – Efficient SM access
  – Not robust [Nethercote and Seward VEE ’07]
• Valgrind’s software page-table-like scheme
  – Robust
  – Inefficient (Valgrind’s Memcheck causes a 22x slowdown)
• Need to be efficient and robust!
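The trade-off between the two schemes can be sketched as follows. The constants (a 32-bit address space, 64 KB chunks) and the names `half_and_half_shadow` and `TableShadow` are assumptions for illustration only; the table class is a simplified stand-in for a software page-table-like structure, not Valgrind's actual code.

```python
ADDRESS_BITS = 32                  # assumption: 32-bit virtual address space
HALF = 1 << (ADDRESS_BITS - 1)


def half_and_half_shadow(addr):
    """Shadow address is a single add, but the program loses half its space."""
    if not 0 <= addr < HALF:
        # The lack of robustness: the upper half is reserved for shadow memory.
        raise ValueError("address falls in the half reserved for shadow memory")
    return addr + HALF


class TableShadow:
    """Two-level lookup: robust for any address, but every access walks a table."""
    CHUNK_BITS = 16                # assumption: 64 KB second-level chunks

    def __init__(self):
        self.primary = {}          # high bits -> second-level map

    def _locate(self, addr):
        return addr >> self.CHUNK_BITS, addr & ((1 << self.CHUNK_BITS) - 1)

    def set(self, addr, value):
        hi, lo = self._locate(addr)
        self.primary.setdefault(hi, {})[lo] = value

    def get(self, addr, default=0):
        hi, lo = self._locate(addr)
        return self.primary.get(hi, {}).get(lo, default)
```

The first function needs one addition per access; the second needs an index computation and two lookups, which is where the software overhead comes from.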

Research Question
• Can we make SMIs and OMIs atomic?
• Can we make SM accesses efficient without sacrificing robustness?
• Can we do the above with minimal HW support?

Our Approach
• Convey atomic block to the processor
  – Simple ISA support: shadow-start, shadow-end
  – SMIs implicitly identified
• Coupled Coherence
  – Coherence of SMIs and OMIs is coupled
  – Enforces the effect of atomicity
• OS Support
  – Couple allocation of original and shadow pages
  – Efficient addressing without sacrificing robustness

ISA Support
• Shadow-start / Shadow-end instructions
  – OMIs and SMIs enclosed
  – Conveys atomic block to the processor
  – Guides actions of the cache coherence protocol
• Implicitly distinguishing SMIs
  – First instruction is an OMI
  – All others with the same VA are treated as SMIs
  – Multiple accesses are implicitly assumed to access different shadow values

EXAMPLE
    shadow-start
    ld reg1, vaddr    // Original load
    ld reg2, vaddr    // 1st shadow load
    ld reg3, vaddr    // 2nd shadow load
    shadow-end
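The implicit identification rule can be sketched as a small classifier over one shadow-start / shadow-end block. The function `classify_block` and the tuple encoding of instructions are hypothetical. Rejecting an access to a different virtual address is an assumption of this sketch, since the slides do not say how such a case is handled.

```python
def classify_block(instrs):
    """Label each memory access in one shadow block as the OMI or the k-th SMI.

    instrs: [("shadow-start",), (op, reg, vaddr), ..., ("shadow-end",)]
    """
    if (len(instrs) < 3 or instrs[0] != ("shadow-start",)
            or instrs[-1] != ("shadow-end",)):
        raise ValueError("block must be delimited by shadow-start / shadow-end")
    labels = []
    omi_vaddr = None
    for op, reg, vaddr in instrs[1:-1]:
        if omi_vaddr is None:
            # The first enclosed memory instruction is the original access.
            omi_vaddr = vaddr
            labels.append("OMI")
        elif vaddr == omi_vaddr:
            # Later accesses to the same VA are shadow accesses; the k-th one
            # is assumed to target the k-th shadow value.
            labels.append(f"SMI#{len(labels)}")
        else:
            # Assumption of this sketch: other addresses are not supported.
            raise ValueError("access to a different VA inside a shadow block")
    return labels
```

Applied to the example above, the three loads would be labeled as the original load, the first shadow load, and the second shadow load, in that order.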

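Coupled Coherence
• Dependence Mirroring
  – Dependences among SMIs mirror those of the OMIs
  – If OMI 2 depends on OMI 1, then SMI 2 depends on SMI 1
  – Coupled coherence enforces this

[Figure: Proc A issues St 1 and its shadow store; Proc B issues St 2 with its shadow store (S St 2), then Ld with its shadow load (S Ld).]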

Coupled Coherence
• Coupled Coherence involves
  – No explicit shadow coherence messages
    • SMIs do not trigger coherence messages
    • Shadow stores do not trigger invalidates
    • Shadow loads do not cause misses
  – Co-transfer
    • Data replies of original blocks are piggybacked with shadow blocks
  – Co-existence
    • Original blocks and shadow blocks co-exist in the cache
    • Brought in together
    • Replaced together

Dependence Mirroring: RAW
[Figure: Proc A executes St and S St; Proc B executes Ld and S Ld. Block ‘B’ and Shadow Block ‘B’ move through the Shared, Exc, and Inv states together on each processor.]
• Proc A sends invalidate for B and B’
• Proc B sends read miss for B and B’
• Proc A sends blocks B and B’

Dependence Mirroring: RAW
[Figure: Proc A executes shadow-st, St, S St, shadow-end; Proc B executes Ld and S Ld. Block ‘B’ on Proc A carries a ready bit (0, then 1) alongside its Exc / Inv state.]
• Proc A sends invalidate for B and B’
• Proc B sends read miss for B and B’
• Proc A waits until the ready bit is set
• Proc A sends blocks B and B’

Dependence Mirroring: WAR
[Figure: Proc A executes St 1 and S St 1, later St 2; Proc B executes Ld and S Ld.]
• Proc A sends invalidates
• Proc B sends read miss for B and B’
• Proc A sends blocks B and B’

Coupled Coherence
• On a cache miss
  – Original Ld / St
    • Place read miss for original and shadow block(s)
    • Write back dirty blocks
  – Shadow Ld / St
    • No coherence events
• Shadow-start
  – Set ready bit to 0
• Shadow-end
  – Set ready bit to 1
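The protocol actions above can be approximated by a toy model in which each block and its shadow value travel together and a per-processor ready bit delays remote requests. This is a heavily simplified sketch under stated assumptions: the class `CoupledCoherenceModel` is hypothetical, only one cached copy of a block exists at a time (no Shared state), and a request that arrives while the owner's ready bit is 0 is reported back as "retry" rather than stalling the owner.

```python
class CoupledCoherenceModel:
    def __init__(self, procs):
        self.ready = {p: True for p in procs}   # per-processor ready bit
        self.cache = {p: {} for p in procs}     # proc -> {block: [data, shadow]}
        self.messages = 0                       # bus transactions issued

    def shadow_start(self, p):
        self.ready[p] = False                   # ready bit := 0

    def shadow_end(self, p):
        self.ready[p] = True                    # ready bit := 1

    def _owner(self, block):
        for p, lines in self.cache.items():
            if block in lines:
                return p
        return None

    def _acquire(self, p, block):
        """Original Ld/St miss: fetch the block and its shadow in one reply."""
        if block in self.cache[p]:
            return True
        owner = self._owner(block)
        if owner is not None and not self.ready[owner]:
            return False                        # owner is mid-block: retry later
        self.messages += 1
        line = self.cache[owner].pop(block) if owner is not None else [0, 0]
        self.cache[p][block] = line             # co-transfer and co-existence
        return True

    def store(self, p, block, value):
        if not self._acquire(p, block):
            return False
        self.cache[p][block][0] = value
        return True

    def load(self, p, block):
        if not self._acquire(p, block):
            return None
        return self.cache[p][block][0]

    def shadow_store(self, p, block, value):
        # No coherence event: the shadow block arrived with the original.
        self.cache[p][block][1] = value

    def shadow_load(self, p, block):
        return self.cache[p][block][1]
```

In this model a reader on another processor can never observe a new data value paired with a stale shadow value, because the pair is only handed over once the owner has executed shadow-end.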

Symmetric/General SM
• Symmetric SM
  – Original loads (stores) accompanied by shadow loads (stores)
• General SM
  – Original load can be accompanied by both shadow loads and stores
    • E.g., Eraser: online race detection
  – Need to enforce shadow coherence for RAR
    • Typically no coherence events for RAR
    • Future work

Addressing Support
• Shadow pages allocated adjacent to original pages
  – Virtual memory space unaffected
  – Retains robustness
  – OS treats them as a single “superpage”
    • Swapped in and swapped out together
• Address translation
  – During address translation, add an offset to access the shadow page
  – Provides efficiency
  – No separate TLB for shadow pages

[Figure: the TLB maps a virtual page to a physical page. An OMI uses the page offset to access the original page; an SMI uses the same offset into Shadow Page 1 or Shadow Page 2, selected by the shadow value count (cnt).]
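The translation step can be sketched as a single TLB lookup followed by an add. The page size, the dictionary standing in for the TLB, and the function name `translate` are assumptions for illustration; the sketch also assumes that the k-th shadow page sits k pages after the original page in physical memory.

```python
PAGE_SIZE = 4096  # assumption: 4 KB pages


def translate(tlb, vaddr, shadow_index=0, shadow_pages=1):
    """Return the physical address for an OMI (index 0) or the k-th SMI.

    tlb maps a virtual page number to the physical address of the original
    page, which is assumed to be the first page of the "superpage".
    """
    if not 0 <= shadow_index <= shadow_pages:
        raise ValueError("shadow index outside the allocated superpage")
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    base = tlb[vpn]  # one TLB entry serves the original and shadow pages
    return base + shadow_index * PAGE_SIZE + offset
```

Because the shadow pages live only in physical memory next to the original page, the program's virtual address space is left untouched, which is what preserves robustness.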

Experiments
• Implementation in SESC simulator
  – Cycle accurate, targets MIPS architecture
    • Shadow-start, shadow-end instructions
  – Models cache coherence protocol
    • Coupled Coherence implementation
    • Bus-based protocol
  – Models basic OS services
    • Coupled page allocation
• Monitoring applications
  – DIFT: detection of security attacks
  – DDG: computes dynamic dependence graph online
• Benchmarks
  – SPLASH-2

Efficiency of SM
• Three versions:
  – SM
    • Our SM implementation
    • ISA support
    • OS support for address translation
    • Coupled Coherence protocol for atomicity
  – VAL: serial
    • Valgrind’s SM support
    • Address translation: involves software page table accesses
    • Atomicity: enforced by thread serialization
  – VAL: lb
    • Valgrind’s SM support with no atomicity guarantees
    • Means of comparison for our address translation support

Efficiency of SM: DIFT
• VAL: serial causes 41x overhead on average
  – Effect of serialization
• SM causes only 7x overhead
  – Efficient address translation + coupled coherence
• Even without serialization, VAL: lb causes 12x overhead
  – With our SM support this reduces to 7x

Efficiency of SM: DDG
• VAL: serial causes 78x overhead on average
  – Effect of serialization
• SM causes only 23x overhead
  – Efficient address translation + coupled coherence
• Even without serialization, VAL: lb causes 27x overhead
  – With our SM support this reduces to 23x
  – Effect not as pronounced as in DIFT
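The reported average overheads for DIFT and DDG can be turned into relative improvement factors with a short calculation. The dictionary below only restates the numbers from the slides; the helper `improvement` is a hypothetical name introduced for this example.

```python
# Average slowdown factors as reported on the slides.
overheads = {
    "DIFT": {"VAL: serial": 41, "VAL: lb": 12, "SM": 7},
    "DDG": {"VAL: serial": 78, "VAL: lb": 27, "SM": 23},
}


def improvement(app, baseline):
    """How many times lower the SM overhead is than the given baseline."""
    return overheads[app][baseline] / overheads[app]["SM"]


for app in overheads:
    for baseline in ("VAL: serial", "VAL: lb"):
        print(f"{app}: SM vs {baseline}: {improvement(app, baseline):.2f}x")
```

The ratio against VAL: lb isolates the benefit of the address translation support, since neither side pays for thread serialization, and it is smaller for DDG than for DIFT, matching the slide's remark that the effect is less pronounced there.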

Effect of Coupled Coherence
• Performance overhead < 0.6% for DIFT and DDG
  – Total amount of traffic is about the same
  – Coupled coherence sees more bursts in traffic

Related Work
• Enforcing atomicity
  – Valgrind [Nethercote et al. PLDI ’07] through thread serialization
    • Not efficient
  – TM [Chung et al. HPCA ’08] can be used
    • Requires additional HW changes
    • Support for rollback and re-execution
• Address translation
  – Valgrind [Nethercote VEE ’07] software page table structure
    • Proposed application-specific optimizations
    • Still inefficient
  – Half-and-Half scheme [Qin et al. MICRO ’07]
    • Divides virtual address space
    • Not robust

Conclusion
• SM used extensively for performing monitoring
  – Performance
  – Security
  – Debugging
• Support for improving SM performance
  – ISA support
  – Coupled coherence for atomicity
  – Coupled allocation for efficient addressing
  – Significant performance advantage
• Future Work
  – Extend the system beyond symmetric SMIs
  – Look at other techniques for providing atomicity without changes to the coherence protocol

Questions?