Dissertation Seminar, 18/11 2005, Auditorium Minus, Museum Gustavianum

Software Techniques for Distributed Shared Memory
Zoran Radovic (zoran.radovic@it.uu.se)
Dissertation Seminar, Nov 18, 2005, Auditorium Minus, Museum Gustavianum

Outline
• NUCA Locks
• DSZOOM – Software-based Shared Memory
• TMA – Trap-based Memory Architecture

Vasaloppet: “Contention Problem in Sweden”
• Traditional cross-country ski race
• 90 km …
• 85.6533 km to go… [figure label: CS]

Spin Locks under Contention
• Spin locks with backoff: IF (more contention) THEN less efficient CS …
• “The more important the slower it runs…”
[Plot: Critical Section (CS) Cost vs. Amount of Contention]

Queue-based Locks
• Queue-based locks: IF (more contention) THEN constant CS cost …
[Plot: CS Cost vs. Amount of Contention; curves: spin locks with backoff, queue-based locks]

This Dissertation
• NUCA locks: IF (more contention) THEN more efficient CS …
• “The more important the faster it runs…”
[Plot: CS Cost vs. Amount of Contention; curves: spin locks with backoff, queue-based locks, NUCA locks]

NUCA Locks (Basic Idea)
1) Reduce switch traffic
   - one CPU per node is testing…
2) Improve lock handover
3) More efficient CS
   - local traffic is cheaper
[Figure: nodes with Memory, caches ($) and processors (P); labels: Lock/Unlock, Test]

The HBO Lock (the simplest HBO)
• What do we need?
  - node_id
  - Compare&swap (CAS) atomic operation: CAS(Lock_address, FREE, node_id)
  (Callout: Creates Communication Affinity)
• lock-acquire:
  - If the lock-value is in the state FREE:
    the node_id is CAS-ed into the lock location
  - Else: 2 cases
    the lock is “local”: spin with small backoff
    the lock is “remote”: spin with large backoff
• Simple but fairly effective…
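
The lock-acquire rule on this slide can be modelled in a few lines. The following is only a sketch in Python, not the implementation from the dissertation: the `LockWord` class emulates a CAS-capable memory word with a mutex, and the names `LockWord`, `hbo_acquire`, `hbo_release`, `run_demo` and the two backoff constants are assumptions made for illustration. Fixed backoff delays are used to keep the sketch short.

```python
import threading
import time

FREE = -1               # assumed sentinel meaning "nobody holds the lock"
SMALL_BACKOFF = 1e-6    # seconds; illustrative values, not from the slides
LARGE_BACKOFF = 1e-4


class LockWord:
    """One shared memory word with an atomic compare-and-swap (modelled with a mutex)."""

    def __init__(self):
        self._value = FREE
        self._guard = threading.Lock()

    def cas(self, expected, new):
        """Store `new` if the word equals `expected`; always return the old value."""
        with self._guard:
            old = self._value
            if old == expected:
                self._value = new
            return old

    def store(self, value):
        with self._guard:
            self._value = value


def hbo_acquire(lock, node_id):
    """Spin until CAS(lock, FREE, node_id) succeeds, backing off by owner locality."""
    while True:
        owner = lock.cas(FREE, node_id)
        if owner == FREE:
            return                      # our node_id is now in the lock word
        if owner == node_id:
            time.sleep(SMALL_BACKOFF)   # lock is "local": retry soon
        else:
            time.sleep(LARGE_BACKOFF)   # lock is "remote": retry rarely


def hbo_release(lock):
    lock.store(FREE)


def run_demo(nodes=2, threads_per_node=2, iterations=50):
    """Increment a shared counter under the lock from several simulated nodes."""
    lock = LockWord()
    counter = [0]

    def worker(node_id):
        for _ in range(iterations):
            hbo_acquire(lock, node_id)
            counter[0] += 1             # critical section
            hbo_release(lock)

    threads = [threading.Thread(target=worker, args=(n,))
               for n in range(nodes) for _ in range(threads_per_node)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

Because the failed CAS returns the current owner's node_id, a waiter learns whether the holder is local or remote without any extra memory traffic, which is what lets neighbours of the holder retry more aggressively than remote waiters.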

Performance Results
Realistic microbenchmark, 2-node WildFire (WF), 28 CPUs (14 + 14)
Fairness?

Fairness Study
Realistic microbenchmark, 2-node WildFire, 28 CPUs

Application Performance
28-processor runs
≈ 4x

Total Traffic: Raytrace

HBO Locks inside Linux Kernel
• Patch provided by Silicon Graphics, Inc.
• Linux-IA64 kernel implementation, May 2005
• Page-fault handler runs 3x faster
• 60 processors

Outline
✓ NUCA Locks
• DSZOOM – Software-based Shared Memory
• TMA – Trap-based Memory Architecture

The DSZOOM Proposal
• Run entire protocol in requesting-processor
  ⇒ No protocol agent communication!
• Assumes user-level remote memory access
  - put, get, and atomics [InfiniBand]
• Fine-grain memory protocols (e.g., 64 bytes)
• Hardware-like memory models [Shasta, Blizzard, Sirocco]

“Squeezing” Protocols into Binaries…
Binary/Assembler level instrumentation

Original Program:
    ...
    cmp   %g0, %l5
    bne   0x24431
    nop
    ld    [%o1 + 64], %o0
    ldd   [%o0 + 16], %f4
    clr   %l5
    ...

DSZOOM Program:
    ...
    cmp   %g0, %l5
    bne   0x24431
    nop
    ld    [%o1 + 64], %o0
    mov   255, %g6
    and   %o0, %g6, …
    cmp   …, 170
    bne   0x24450
    nop
    ldd   [%o0 + 16], %f4
    clr   %l5
    ...

Labels: Fast-path Protocol Code (the inserted instructions); Slow-path Protocol Code (C-code)

Write Permission Caching
• Problem: store instrumentation relies on locking
  - More complex instrumentation
• Solution: write permission cache (WPC)
  - Small and fast software-managed cache
  - Keeps write permissions
• The WPC idea:
  - Exploit store locality
  - Dynamically reduce the number of memory references in store checking code
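
To make the WPC idea concrete, here is a minimal Python model, offered as a sketch under stated assumptions rather than DSZOOM's actual store-checking code. It assumes a direct-mapped cache of line numbers, 64-byte coherence units, and two hypothetical slow-path hooks, `acquire` and `release`, standing in for whatever locking the real store instrumentation performs.

```python
LINE_SIZE = 64  # bytes per coherence unit (assumed; the slides mention 64 bytes as an example)


class WritePermissionCache:
    """Remembers which lines this processor may already write to."""

    def __init__(self, acquire, release, entries=1):
        self._acquire = acquire          # hypothetical hook: obtain write permission for a line
        self._release = release          # hypothetical hook: give write permission back
        self._lines = [None] * entries   # direct-mapped: one line number per slot
        self.hits = 0
        self.misses = 0

    def check_store(self, address):
        """Run before every store; only a miss reaches the expensive slow path."""
        line = address // LINE_SIZE
        slot = line % len(self._lines)
        if self._lines[slot] == line:
            self.hits += 1               # fast path: permission already held
            return
        self.misses += 1
        if self._lines[slot] is not None:
            self._release(self._lines[slot])   # evict the previous line's permission
        self._acquire(line)
        self._lines[slot] = line

    def flush(self):
        """Release every cached permission, e.g. at a synchronization point."""
        for slot, line in enumerate(self._lines):
            if line is not None:
                self._release(line)
                self._lines[slot] = None
```

With a single entry, a run of stores to the same 64-byte unit pays for one acquire and then hits on every later store, which is the store locality the slide refers to.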

Other “Features”
• Two kinds of protocols
  - Invalidate
  - Update
• Many optimizations
  - Instrumentation scheduling
  - Instrumentation batching
  - WPC-based write batching
  - WPC-based dirty-data filtering
  - Private-data filtering
  - # of WPC entries
  - Coherence unit size
Protocol labels shown beside the list: (update and invalidate), (update), (update and invalidate)

Coherence Flags and Profiling
• Coherence flags
  - Similar to optimization flags of compilers
  - Possible scenario: gcc -dszoom-cl128 -dszoom-inv -O3 my_app.c
• Execution profiling
  - Similar to profile feedback of compilers
  - Helps find appropriate coherence flag settings
  - Low-overhead implementation in DSZOOM: less than 30 percent overhead
  - Works for both small and large input sets

DSZOOM Results
2-node WildFire, 16 CPUs
Chart annotations: 1.45x, 1.11x

Outline
✓ NUCA Locks
✓ DSZOOM – Software-based Shared Memory
• TMA – Trap-based Memory Architecture

Instrumentation Drawbacks
• Binary transparency?
• Run-time execution overhead

Original Program:
    ...
    cmp   %g0, %l5
    bne   0x24431
    nop
    ld    [%o1 + 64], %o0
    ldd   [%o0 + 16], %f4
    clr   %l5
    ...

DSZOOM Program:
    ...
    cmp   %g0, %l5
    bne   0x24431
    nop
    ld    [%o1 + 64], %o0
    mov   255, %g6
    and   %o0, %g6, …
    cmp   …, 170
    bne   0x24450
    nop
    ldd   [%o0 + 16], %f4
    clr   %l5
    ...

Labels: Fast-path Protocol Code (the inserted instructions); Slow-path Protocol Code (C-code)

Trap-Based Memory Architectures
• Basic idea
  - Detect fine-grained coherence violations in hardware
  - Trigger a coherence trap when one occurs
  - Maintain coherence by software protocols
• No memory system modifications
• Minimal processor modifications
• Binary transparency
  - No need to instrument binaries/applications

TMA Lite: Proof-of-concept Implementation
• Load permission check
  - Hardware implementation of software check
    (predefined “magic-value” convention)
• Store permission check
  - Hardware WPC
  - Can be seen as a very small cache
  - Operates on virtual addresses
  - Accessed in parallel with the data TLB
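
The software check that the load permission check mirrors can be illustrated with a small model. This sketch assumes the common reading of a magic-value convention: locally invalid data is overwritten with a reserved bit pattern, so a load only needs to compare the value it read against that pattern. The slides do not spell the convention out, so the class `CachingNode`, the constant `MAGIC_VALUE`, and the list-based "home" memory are all hypothetical.

```python
MAGIC_VALUE = 0xDEADBEEF  # hypothetical reserved bit pattern marking invalid data


class CachingNode:
    """A node holding local copies of data whose authoritative copy lives at `home`."""

    def __init__(self, home):
        self._home = home                          # list standing in for remote memory
        self._local = [MAGIC_VALUE] * len(home)    # every local copy starts invalid
        self.slow_path_calls = 0

    def load(self, index):
        value = self._local[index]
        if value != MAGIC_VALUE:
            return value                 # fast path: one compare, no protocol action
        return self._load_miss(index)    # slow path (a coherence trap in a TMA-style design)

    def _load_miss(self, index):
        self.slow_path_calls += 1
        value = self._home[index]        # fetch the up-to-date value
        self._local[index] = value
        return value

    def invalidate(self, index):
        self._local[index] = MAGIC_VALUE
```

In this model, data that happens to equal the reserved pattern still loads correctly; it simply takes the slow path on every access.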

TMA Lite Performance
[TMA: simulation study, 4 nodes | DSZOOM: 2-node WildFire]
Chart annotations: 1.75x, 1.01x

Topics not Presented
• RH lock algorithm
  - Controlled (un)fairness
• HBO_GT and HBO_GT_SD algorithms
  - Global throttling and starvation detection
• DSZOOM implementation details
  - Instrumentation challenges; scheduling, batching, etc.
  - Bandwidth filtering techniques; dirty- & private-data
• Innovative TMA simulation tricks
  - Low-level “good days” hacks
  - Reusing Simics checkpoints
