Dissertation Seminar, 18/11 2005, Auditorium Minus, Museum Gustavianum


























Software Techniques for Distributed Shared Memory
Zoran Radovic, zoran.radovic@it.uu.se
Dissertation Seminar, Nov 18, 2005, Auditorium Minus, Museum Gustavianum

Outline
- NUCA Locks
- DSZOOM: Software-based Shared Memory
- TMA: Trap-based Memory Architecture

Vasaloppet: “Contention Problem in Sweden”
- Traditional cross-country ski race, 90 km …
- [Photo labels: “85.6533 km to go…” and “CS”]

Spin Locks under Contention
- Spin locks with backoff: IF (more contention) THEN less efficient CS …
- “The more important the slower it runs…”
- [Plot: critical section (CS) cost versus amount of contention, for spin locks with backoff]

Queue-based Locks
- Queue-based locks: IF (more contention) THEN constant CS cost …
- [Plot: CS cost versus amount of contention, for spin locks with backoff and queue-based locks]

This Dissertation
- NUCA locks: IF (more contention) THEN more efficient CS …
- “The more important the faster it runs…”
- [Plot: CS cost versus amount of contention, for spin locks with backoff, queue-based locks, and NUCA locks]

NUCA Locks (Basic Idea)
1) Reduce traffic: one CPU per node is testing…
2) Improve lock handover
3) More efficient CS
- Local traffic is cheaper
- [Diagram: nodes of processors (P) with caches ($) and memory, connected through a switch; processors issue Lock/Unlock and Test operations]

The HBO Lock (the simplest HBO)
- What do we need?
  - node_id
  - Compare&swap (CAS) atomic operation: CAS(Lock_address, FREE, node_id)
- lock-acquire:
  - If the lock-value is in the state FREE:
    - The node_id is CAS-ed into the lock location
  - Else, 2 cases:
    - The lock is “local”: spin with small backoff
    - The lock is “remote”: spin with large backoff
- Simple but fairly effective…
- [Callout: Creates Communication Affinity]
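
The lock-acquire rules on this slide can be sketched in a few lines. This is only an illustrative model, not the dissertation's implementation: the hardware CAS is emulated with a Python mutex, `FREE = -1` is an assumed encoding, the class and parameter names are made up for the example, and `time.sleep` stands in for the busy-wait backoff a real lock would use.

```python
import threading
import time

FREE = -1  # assumed encoding of the FREE state; node ids are 0, 1, ...


class HBOLock:
    """Toy model of the simplest HBO lock (names are hypothetical)."""

    def __init__(self, small_backoff=0.0, large_backoff=0.0001):
        self._value = FREE
        self._guard = threading.Lock()  # stands in for hardware CAS atomicity
        self.small_backoff = small_backoff
        self.large_backoff = large_backoff

    def _cas(self, expected, new):
        """Emulated compare-and-swap; returns the value seen before the swap."""
        with self._guard:
            old = self._value
            if old == expected:
                self._value = new
            return old

    def acquire(self, node_id):
        while True:
            holder = self._cas(FREE, node_id)
            if holder == FREE:
                return  # our node_id is now stored in the lock word
            if holder == node_id:
                time.sleep(self.small_backoff)  # "local" holder: retry soon
            else:
                time.sleep(self.large_backoff)  # "remote" holder: stay away longer

    def release(self):
        with self._guard:
            self._value = FREE
```

Because waiters in the holder's node retry sooner than remote waiters, the lock tends to be handed over within a node, which is the affinity effect the slide points at.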

Performance Results
- Realistic microbenchmark, 2-node WildFire, 28 CPUs
- Fairness?
- [Plot labels: 14, 14, WF]

Fairness Study
- Realistic microbenchmark, 2-node WildFire, 28 CPUs

Application Performance
- 28-processor runs
- [Plot annotation: ≈ 4x]

Total Traffic: Raytrace

HBO Locks inside Linux Kernel
- Patch provided by Silicon Graphics, Inc.
- Linux-IA64 kernel implementation, May 2005
- Page-fault handler runs 3x faster
- 60 processors

Outline
- ✓ NUCA Locks
- DSZOOM: Software-based Shared Memory
- TMA: Trap-based Memory Architecture

The DSZOOM Proposal
- Run entire protocol in requesting-processor ⇒ No protocol agent communication!
- Assumes user-level remote memory access
  - put, get, and atomics [InfiniBand]
- Fine-grain memory protocols (e.g., 64 bytes)
- Hardware-like memory models [Shasta, Blizzard, Sirocco]
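
To make "run the entire protocol in the requesting processor" concrete, here is a toy read miss handled with nothing but get, put, and one atomic. The directory layout, the `LOCKED` marker, and every name below are assumptions made for this sketch; the slides do not show the actual DSZOOM protocol.

```python
import threading

LOCKED = "LOCKED"  # hypothetical marker for a directory entry that is in use


class Node:
    """One node's memory, reachable from other nodes via get/put/swap."""

    def __init__(self):
        self.mem = {}
        self._atomic = threading.Lock()  # models atomicity of the remote swap

    def get(self, addr):
        return self.mem.get(addr)

    def put(self, addr, value):
        self.mem[addr] = value

    def swap(self, addr, value):
        with self._atomic:
            old = self.mem.get(addr)
            self.mem[addr] = value
            return old


def read_miss(nodes, me, home, line):
    """Requester `me` fetches `line`; no protocol code runs on any other node."""
    dir_addr = ("dir", line)
    # 1. Lock the directory entry at the home node with a remote atomic.
    while True:
        entry = nodes[home].swap(dir_addr, LOCKED)
        if entry != LOCKED:
            break
    # 2. Copy the data straight out of the owner's memory.
    data = nodes[entry["owner"]].get(("data", line))
    nodes[me].put(("data", line), data)
    # 3. Write back the updated entry; this put also unlocks it.
    updated = {"owner": entry["owner"], "sharers": entry["sharers"] | {me}}
    nodes[home].put(dir_addr, updated)
    return data
```

The point of the sketch is the absence of a message handler: the home and owner nodes only have their memory read and written remotely.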

“Squeezing” Protocols into Binaries…
- Binary/Assembler level instrumentation
- [Figure labels: Fast-path Protocol Code; Slow-path Protocol Code (C-code)]

Original Program:

    ...
    cmp  %g0, %l5
    bne  0x24431
    nop
    ld   [%o1 + 64], %o0
    ldd  [%o0 + 16], %f4
    clr  %l5
    ...

DSZOOM Program (a mov / and / cmp / bne / nop check is inserted after the load; the operands of mov, and, and cmp survive only as the run “255, %g6, %o0, %g6, 170”):

    ...
    cmp  %g0, %l5
    bne  0x24431
    nop
    ld   [%o1 + 64], %o0
    mov  …
    and  …
    cmp  …
    bne  0x24450
    nop
    ldd  [%o0 + 16], %f4
    clr  %l5
    ...

Write Permission Caching
- Problem: store instrumentation relies on locking
  - More complex instrumentation
- Solution: write permission cache (WPC)
  - Small and fast software-managed cache
  - Keeps write permissions
- The WPC idea:
  - Exploit store locality
  - Dynamically reduce the number of memory references in store checking code
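
A minimal sketch of the WPC idea, under stated assumptions: the `acquire`/`release` callbacks stand in for the locking slow path, and the single-entry default, 64-byte unit, and FIFO eviction are illustrative choices rather than details taken from the slides.

```python
class WritePermissionCache:
    """Toy software WPC; sizes and the FIFO policy are assumptions."""

    def __init__(self, acquire, release, entries=1, unit_size=64):
        self.acquire = acquire      # slow path: obtain write permission for a unit
        self.release = release      # hand the permission back on eviction
        self.entries = entries
        self.unit_size = unit_size
        self.units = []             # coherence units we may currently write
        self.misses = 0

    def check_store(self, addr):
        unit = addr // self.unit_size
        if unit in self.units:
            return                  # hit: skip the locking slow path
        self.misses += 1
        if len(self.units) == self.entries:
            self.release(self.units.pop(0))   # evict the oldest entry
        self.acquire(unit)
        self.units.append(unit)
```

With store locality, a run of stores to the same coherence unit pays for the slow path once; the remaining stores hit in the WPC.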

Other “Features”
- Two kinds of protocols
  - Invalidate
  - Update
- Many optimizations
  - Instrumentation scheduling
  - Instrumentation batching
  - WPC-based write batching
  - WPC-based dirty-data filtering
  - Private-data filtering
  - # of WPC entries
  - Coherence unit size
- [Slide annotations whose placement was lost in extraction: (update and invalidate), (update), (update and invalidate)]

Coherence Flags and Profiling
- Coherence flags
  - Similar to optimization flags of compilers
  - Possible scenario: gcc -dszoom-cl128 -dszoom-inv -O3 my_app.c
- Execution profiling
  - Similar to profile feedback of compilers
  - Helps finding appropriate coherence flag settings
  - Low overhead implementation in DSZOOM
    - Less than 30 percent overhead
  - Works for both small and large input sets

DSZOOM Results
- 2-node WildFire, 16 CPUs
- [Plot annotations: 1.45x, 1.11x]

Outline
- ✓ NUCA Locks
- ✓ DSZOOM: Software-based Shared Memory
- TMA: Trap-based Memory Architecture

Instrumentation Drawbacks
- Binary transparency?
- Run-time execution overhead
- [Figure: the same Original Program versus DSZOOM Program listing as on the “Squeezing” Protocols into Binaries… slide, with Fast-path Protocol Code and Slow-path Protocol Code (C-code) labeled]

Trap-Based Memory Architectures
- Basic idea
  - Detect fine-grained coherence violations in hardware
  - Trigger a coherence trap when one occurs
  - Maintain coherence by software protocols
- No memory system modifications
- Minimal processor modifications
- Binary transparency
  - No need to instrument binaries/applications

TMA Lite Proof-of-concept Implementation
- Load permission check
  - Hardware implementation of software check
    - Predefined “magic-value” convention
- Store permission check
  - Hardware WPC
  - Can be seen as a very small cache
  - Operates on virtual addresses
  - Accessed in parallel with the data TLB
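
The “magic-value” convention can be illustrated with the software form of the load check. This is a hedged sketch: the constant `0xDEADBEEF`, the dict-based memory, and the function names are all hypothetical. In TMA Lite the comparison is done by hardware, and the call to `slow_path` corresponds to a coherence trap into the software protocol.

```python
MAGIC = 0xDEADBEEF  # hypothetical magic value marking invalidated data


def invalidate(local, addr):
    """Mark a local copy invalid by overwriting it with the magic value."""
    local[addr] = MAGIC


def checked_load(local, addr, slow_path):
    """Load with a permission check based on the loaded value itself."""
    value = local[addr]
    if value == MAGIC:
        # Possibly invalid: let the software protocol fetch the real value.
        value = slow_path(addr)
        local[addr] = value
    return value
```

In this sketch a load of valid data costs one comparison. Real data that happens to equal the magic value only causes an unnecessary slow-path call, and the result is still correct.
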

TMA Lite Performance [TMA: simulation study, 4 nodes | DSZOOM: 2-node WildFire]

![TMA Lite Performance (TMA: simulation study, 4 nodes; DSZOOM: 2-node WildFire)](https://slidetodoc.com/presentation_image_h/baa95fead18a45039699231e4ef71025/image-26.jpg)

- [Plot annotations: 1.75x, 1.01x]

Topics not Presented
- RH lock algorithm
  - Controlled (un)fairness
- HBO_GT and HBO_GT_SD algorithms
  - Global throttling and starvation detection
- DSZOOM implementation details
  - Instrumentation challenges; scheduling, batching, etc.
  - Bandwidth filtering techniques; dirty- & private-data
- Innovative TMA simulation tricks
  - Low-level “good days” hacks
  - Reusing Simics checkpoints
