Dissertation Seminar, Nov 18, 2005
Auditorium Minus, Museum Gustavianum
Software Techniques for Distributed Shared Memory
Zoran Radovic, zoran.radovic@it.uu.se
Outline
§ NUCA Locks
§ DSZOOM: Software-based Shared Memory
§ TMA: Trap-based Memory Architecture
Vasaloppet: “Contention Problem in Sweden”
Traditional cross-country ski race, 90 km …
85.6533 km to go…
[Figure label: CS]
Spin Locks under Contention
[Figure: critical section (CS) cost vs. amount of contention; curve: spin locks with backoff]
IF (more contention) THEN less efficient CS …
“The more important the slower it runs…”
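To make "spin locks with backoff" concrete, here is a minimal sketch of a test-and-test-and-set lock with exponential backoff using C11 atomics. The type and function names (`spin_lock_t`, `spin_delay`, and so on) and the backoff constants are illustrative assumptions, not taken from the talk.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type: true means the lock is held. */
typedef struct { atomic_bool held; } spin_lock_t;

/* Illustrative backoff bounds (assumed values, in delay-loop iterations). */
enum { BACKOFF_MIN = 4, BACKOFF_MAX = 1024 };

/* Busy-wait for roughly `iters` loop iterations. */
static void spin_delay(unsigned iters) {
    for (volatile unsigned i = 0; i < iters; i++) { }
}

/* One acquisition attempt; returns true if the caller now holds the lock. */
static bool spin_try_lock(spin_lock_t *l) {
    return !atomic_exchange_explicit(&l->held, true, memory_order_acquire);
}

static void spin_lock(spin_lock_t *l) {
    unsigned backoff = BACKOFF_MIN;
    for (;;) {
        /* Test with a plain load first, then attempt the atomic exchange. */
        if (!atomic_load_explicit(&l->held, memory_order_relaxed) &&
            spin_try_lock(l))
            return;
        /* Failed: wait, and double the wait up to the cap. */
        spin_delay(backoff);
        if (backoff < BACKOFF_MAX)
            backoff *= 2;
    }
}

static void spin_unlock(spin_lock_t *l) {
    assert(atomic_load_explicit(&l->held, memory_order_relaxed));
    atomic_store_explicit(&l->held, false, memory_order_release);
}
```

Under heavy contention every waiter keeps retrying on the same lock word, which is one way to read the slide's claim that the critical section becomes less efficient as contention grows.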
Queue-based Locks
[Figure: CS cost vs. amount of contention; curves: spin locks with backoff, queue-based locks]
IF (more contention) THEN constant CS cost …
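The slide does not name a particular queue-based lock. As one well-known example, the sketch below follows the MCS design: each waiter enqueues its own node and spins only on a flag in that node, so a release touches exactly one successor. All identifiers are illustrative assumptions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-thread queue node (hypothetical names). */
typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;              /* true while this waiter must spin */
} mcs_node_t;

/* The lock is just a pointer to the last node in the queue. */
typedef struct { _Atomic(mcs_node_t *) tail; } mcs_lock_t;

static void mcs_lock(mcs_lock_t *l, mcs_node_t *me) {
    assert(me != NULL);
    atomic_store_explicit(&me->next, (mcs_node_t *)NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, true, memory_order_relaxed);
    /* Append ourselves to the queue. */
    mcs_node_t *prev = atomic_exchange_explicit(&l->tail, me,
                                                memory_order_acq_rel);
    if (prev != NULL) {
        /* Link behind the predecessor and spin on our own flag only. */
        atomic_store_explicit(&prev->next, me, memory_order_release);
        while (atomic_load_explicit(&me->locked, memory_order_acquire)) { }
    }
}

static void mcs_unlock(mcs_lock_t *l, mcs_node_t *me) {
    mcs_node_t *succ = atomic_load_explicit(&me->next, memory_order_acquire);
    if (succ == NULL) {
        /* No visible successor: try to reset the queue to empty. */
        mcs_node_t *expected = me;
        if (atomic_compare_exchange_strong_explicit(
                &l->tail, &expected, (mcs_node_t *)NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* A successor is mid-enqueue: wait until it links itself. */
        while ((succ = atomic_load_explicit(&me->next,
                                            memory_order_acquire)) == NULL) { }
    }
    /* Hand the lock to exactly one waiter. */
    atomic_store_explicit(&succ->locked, false, memory_order_release);
}
```

Because each handover writes a single flag, the cost per critical section stays roughly flat as waiters are added, which matches the "constant CS cost" claim on the slide.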
This Dissertation
[Figure: CS cost vs. amount of contention; curves: spin locks with backoff, queue-based locks, NUCA locks]
IF (more contention) THEN more efficient CS …
“The more important the faster it runs…”
NUCA Locks (Basic Idea)
1) Reduce traffic - one CPU per node is testing…
2) Improve lock handover - local traffic is cheaper
3) More efficient CS
[Figure: nodes of processors (P) with caches ($) and memory, connected by a switch; arrows labeled Lock/Unlock and Test]
The HBO Lock (the simplest HBO)
§ What do we need?
  § node_id
  § Compare&swap (CAS) atomic operation
    CAS(Lock_address, FREE, node_id)
  [Callout: Creates Communication Affinity]
§ lock-acquire:
  § If the lock-value is in the state FREE:
    • The node_id is CAS-ed into the lock location
  § Else: 2 cases
    • The lock is “local”: spin with small backoff
    • The lock is “remote”: spin with large backoff
§ Simple but fairly effective…
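The lock-acquire steps above can be sketched in C11 as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the encoding of FREE as -1, the backoff constants, and all identifiers (`hbo_lock_t`, `hbo_cas`, `hbo_delay`) are hypothetical, and the caller is assumed to know its own `node_id`.

```c
#include <assert.h>
#include <stdatomic.h>

#define HBO_FREE (-1)   /* assumed encoding of the FREE state */

/* The lock word holds HBO_FREE or the node id of the current holder. */
typedef struct { atomic_int owner_node; } hbo_lock_t;

/* Illustrative backoff bounds (assumed values, in delay-loop iterations). */
enum { LOCAL_MIN = 4, LOCAL_MAX = 64, REMOTE_MIN = 64, REMOTE_MAX = 4096 };

static void hbo_delay(unsigned iters) {
    for (volatile unsigned i = 0; i < iters; i++) { }
}

/* CAS(Lock_address, FREE, node_id): returns the value seen in the lock
 * word, which is HBO_FREE on success or the holder's node id on failure. */
static int hbo_cas(hbo_lock_t *l, int node_id) {
    int seen = HBO_FREE;
    atomic_compare_exchange_strong_explicit(&l->owner_node, &seen, node_id,
                                            memory_order_acquire,
                                            memory_order_relaxed);
    return seen;
}

static void hbo_acquire(hbo_lock_t *l, int node_id) {
    assert(node_id != HBO_FREE);
    unsigned local_b = LOCAL_MIN, remote_b = REMOTE_MIN;
    for (;;) {
        int seen = hbo_cas(l, node_id);
        if (seen == HBO_FREE)
            return;                         /* our node_id is now installed */
        if (seen == node_id) {              /* "local": small backoff */
            hbo_delay(local_b);
            if (local_b < LOCAL_MAX) local_b *= 2;
        } else {                            /* "remote": large backoff */
            hbo_delay(remote_b);
            if (remote_b < REMOTE_MAX) remote_b *= 2;
        }
    }
}

static void hbo_release(hbo_lock_t *l) {
    atomic_store_explicit(&l->owner_node, HBO_FREE, memory_order_release);
}
```

Because waiters in the holder's node retry sooner than waiters elsewhere, the lock tends to be handed over within a node, which is the affinity effect the slide refers to.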
Performance Results
Realistic microbenchmark, 2-node WildFire, 28 CPUs
[Figure: performance results; labels: 14, 14, WF]
Fairness?
Fairness Study
Realistic microbenchmark, 2-node WildFire, 28 CPUs
[Figure: fairness results; axis label: t]
Application Performance
28-processor runs
[Figure: application performance; annotation: ≈ 4x]
Total Traffic: Raytrace
HBO Locks inside Linux Kernel
§ Patch provided by Silicon Graphics, Inc.
§ Linux-IA64 kernel implementation, May 2005
§ Page-fault handler runs 3x faster
§ 60 processors
Outline
✓ NUCA Locks
§ DSZOOM: Software-based Shared Memory
§ TMA: Trap-based Memory Architecture
The DSZOOM Proposal
The DSZOOM Proposal
§ Run entire protocol in requesting-processor
  ⇒ No protocol agent communication!
§ Assumes user-level remote memory access
  § put, get, and atomics [InfiniBand]
§ Fine-grain memory protocols (e.g., 64 bytes)
§ Hardware-like memory models [Shasta, Blizzard, Sirocco]
“Squeezing” Protocols into Binaries…
Binary/Assembler level instrumentation

Original Program:
    ...
    cmp  %g0, %l5
    bne  0x24431
    nop
    ld   [%o1 + 64], %o0
    ldd  [%o0 + 16], %f4
    clr  %l5
    ...

DSZOOM Program:
    ...
    cmp  %g0, %l5
    bne  0x24431
    nop
    ld   [%o1 + 64], %o0
    [Fast-path Protocol Code inserted after the load: mov, and, cmp, bne, nop; operands as extracted: 255, %g6, %o0, %g6, 170, 0x24450]
    ldd  [%o0 + 16], %f4
    clr  %l5
    ...

Fast-path Protocol Code
Slow-path Protocol Code (C-code)
Write Permission Caching
§ Problem: store instrumentation relies on locking
  § More complex instrumentation
§ Solution: write permission cache (WPC)
  § Small and fast software-managed cache
  § Keeps write permissions
§ The WPC idea:
  § Exploit store locality
  § Dynamically reduce the number of memory references in store checking code
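A minimal sketch of how a write permission cache could shorten the store check, assuming a direct-mapped cache indexed by coherence unit. The 64-byte unit size comes from the fine-grain protocol figure quoted earlier in the talk; the entry count, the layout, and the `acquire_write_permission` / `release_write_permission` stand-ins for the protocol slow path are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed parameters: 64-byte coherence units, 4 direct-mapped entries. */
enum { UNIT_SHIFT = 6, WPC_ENTRIES = 4 };
static_assert((WPC_ENTRIES & (WPC_ENTRIES - 1)) == 0,
              "WPC_ENTRIES must be a power of two");

#define WPC_EMPTY UINT64_MAX

typedef struct {
    uint64_t unit[WPC_ENTRIES];      /* units we hold write permission for */
    unsigned long slow_path_calls;   /* counts slow-path entries */
} wpc_t;

static void wpc_init(wpc_t *c) {
    for (int i = 0; i < WPC_ENTRIES; i++)
        c->unit[i] = WPC_EMPTY;
    c->slow_path_calls = 0;
}

/* Hypothetical stand-ins for the protocol slow path (locking, directory
 * updates, data transfer); here they only count or do nothing. */
static void acquire_write_permission(wpc_t *c, uint64_t unit) {
    (void)unit;
    c->slow_path_calls++;
}
static void release_write_permission(uint64_t unit) { (void)unit; }

/* Store check run before each instrumented store to `addr`. */
static void wpc_store_check(wpc_t *c, uintptr_t addr) {
    uint64_t unit = (uint64_t)addr >> UNIT_SHIFT;
    unsigned idx = (unsigned)(unit & (WPC_ENTRIES - 1));
    if (c->unit[idx] == unit)
        return;                      /* hit: permission already held */
    if (c->unit[idx] != WPC_EMPTY)
        release_write_permission(c->unit[idx]);   /* evict the old entry */
    acquire_write_permission(c, unit);
    c->unit[idx] = unit;
}
```

Repeated stores to the same coherence unit hit in the cache and skip the slow path, which is the store locality the slide proposes to exploit.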
Other “Features”
§ Two kinds of protocols
  § Invalidate
  § Update
§ Many optimizations
  § Instrumentation scheduling
  § Instrumentation batching
  § WPC-based write batching
  § WPC-based dirty-data filtering
  § Private-data filtering
  § # of WPC entries
  § Coherence unit size
[Annotations whose placement was lost in extraction: (update and invalidate), (update), (update and invalidate)]
Coherence Flags and Profiling
§ Coherence flags
  § Similar to optimization flags of compilers
  § Possible scenario: gcc -dszoom-cl128 -dszoom-inv -O3 my_app.c
§ Execution profiling
  § Similar to profile feedback of compilers
  § Helps finding appropriate coherence flag settings
  § Low overhead implementation in DSZOOM
    • Less than 30 percent overhead
  § Works for both small and large input sets
DSZOOM Results
2-node WildFire, 16 CPUs
[Figure: DSZOOM results; annotations: 1.45x, 1.11x]
Outline
✓ NUCA Locks
✓ DSZOOM: Software-based Shared Memory
§ TMA: Trap-based Memory Architecture
Instrumentation Drawbacks
[Same Original Program / DSZOOM Program listing as on the slide “Squeezing” Protocols into Binaries…, with Fast-path Protocol Code and Slow-path Protocol Code (C-code)]
• Binary transparency?
• Run-time execution overhead
Trap-Based Memory Architectures
§ Basic idea
  § Detect fine-grained coherence violations in hardware
  § Trigger a coherence trap when one occurs
  § Maintain coherence by software protocols
§ No memory system modifications
§ Minimal processor modifications
§ Binary Transparency
  § No need to instrument binaries/applications
TMA Lite: Proof-of-concept Implementation
§ Load permission check
  § Hardware implementation of software check
    • Predefined “magic-value” convention
§ Store permission check
  § Hardware WPC
  § Can be seen as a very small cache
  § Operates on virtual addresses
  § Accessed in parallel with the data TLB
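The slide names the "magic-value" convention without spelling it out. The sketch below shows the general idea of such a software load check, on the assumption that data without read permission is overwritten with a reserved value, so that only loads returning that value need to enter the protocol. The constant, the counter, and the `load_slow_path` stand-in are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed reserved value written into data that lacks read permission. */
#define MAGIC_VALUE 0xDEADBEEFu

typedef struct { unsigned long slow_path_calls; } load_stats_t;

/* Hypothetical stand-in for the protocol slow path. A real protocol would
 * check the coherence state and fetch remote data if needed; if the
 * application legitimately stored MAGIC_VALUE, the value is returned as is. */
static uint32_t load_slow_path(const uint32_t *addr, load_stats_t *s) {
    s->slow_path_calls++;
    return *addr;
}

/* Load followed by the permission check. */
static uint32_t checked_load(const uint32_t *addr, load_stats_t *s) {
    assert(addr != NULL);
    uint32_t v = *addr;              /* perform the load first */
    if (v == MAGIC_VALUE)            /* fast-path compare */
        v = load_slow_path(addr, s);
    return v;
}
```

In the common case the check costs one compare and an untaken branch after the load; per the slide, TMA Lite performs the equivalent check in hardware, so no instrumentation is inserted.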
TMA Lite Performance
[TMA: simulation study, 4 nodes | DSZOOM: 2-node WildFire]
[Figure: TMA Lite performance; annotations: 1.75x, 1.01x]
Topics not Presented
§ RH lock algorithm
  § Controlled (un)fairness
§ HBO_GT and HBO_GT_SD algorithms
  § Global throttling and starvation detection
§ DSZOOM implementation details
  § Instrumentation challenges; scheduling, batching, etc.
  § Bandwidth filtering techniques; dirty- & private-data
§ Innovative TMA simulation tricks
  § Low-level “good days” hacks
  § Reusing Simics checkpoints