Hardware Transactional Memory for GPU Architectures
Wilson W. L. Fung, Inderpeet Singh, Andrew Brownsword, Tor M. Aamodt
University of British Columbia
In Proc. 2011 ACM/IEEE Int'l Symp. Microarchitecture (MICRO-44)

Motivation
- Lifetime of GPU application development
  [Figure: functionality and performance plotted against development time, for fine-grained locking and for transactional memory (?)]
- E.g. N-body with 5M bodies
  - CUDA SDK: O(n^2), 1640 s (barrier)
  - Barnes-Hut: O(n log n), 5.2 s (locks)
Wilson Fung, Inderpeet Singh, Andrew Brownsword, Tor Aamodt. Hardware TM for GPU Architectures.

Are TM and GPUs Incompatible?
- GPUs differ from multi-core CPUs
  - 1000s of concurrent scalar threads
  - Challenges (from a TM perspective)
- Our solution: KILO TM
  - Hardware TM for GPUs

Hardware TM for GPUs, Challenge #1: SIMD Hardware
- On GPUs, scalar threads in a warp/wavefront execute in lockstep
- Example: a warp with 4 scalar threads (T0 to T3) runs
    ...
    TxBegin
    LD  r2, [B]
    ADD r2, 2
    ST  r2, [A]
    TxCommit
    ...
- At TxCommit some threads commit while others abort: branch divergence!

KILO TM, Solution to Challenge #1: SIMD Hardware
- Transaction abort
  - Like a loop
  - Extend the SIMT stack
    ...
    TxBegin
    LD  r2, [B]
    ADD r2, 2
    ST  r2, [A]
    TxCommit
    ...
  [Figure: an "Abort" arrow drawn alongside the transaction code]

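The "abort is like a loop" idea can be illustrated with a small Python model. This is only a sketch under stated assumptions: the active-lane set stands in for the extended SIMT stack, and `run_warp_transaction`, `try_commit`, and `odd_lanes_abort_once` are hypothetical names invented for the illustration, not part of KILO TM.

```python
def run_warp_transaction(num_lanes, try_commit):
    """Re-run the transaction body for aborted lanes until every lane commits.

    `active` plays the role of the warp's active mask: lanes that committed
    drop out and wait, lanes that aborted go around the loop again.
    """
    active = set(range(num_lanes))
    attempts = {lane: 0 for lane in active}
    while active:  # the loop formed by transaction abort
        for lane in active:
            attempts[lane] += 1  # lane executes TxBegin ... TxCommit once
        committed = {lane for lane in active if try_commit(lane, attempts[lane])}
        active -= committed  # only aborted lanes stay active
    return attempts


def odd_lanes_abort_once(lane, attempt):
    # Hypothetical conflict pattern: odd lanes abort on their first attempt.
    return lane % 2 == 0 or attempt >= 2


attempts = run_warp_transaction(4, odd_lanes_abort_once)
print(attempts)
```

In this model the even lanes commit on the first pass and simply sit out the second pass, which is the divergence the SIMT stack has to track.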
Hardware TM for GPUs, Challenge #2: Transaction Rollback
- CPU core: 10s of registers; checkpoint the register file at TX entry and restore it on abort
- GPU core (SM): 32k registers serving many warps; checkpoint the register file?
- 2 MB total on-chip storage

KILO TM, Solution to Challenge #2: Transaction Rollback
- SW register checkpoint
  - Most TX: registers are overwritten at first use
  - TX in Barnes-Hut: checkpoint 2 registers
    TxBegin
    LD  r2, [B]    (r2 overwritten)
    ADD r2, 2
    ST  r2, [A]
    TxCommit
  [Figure: an "Abort" arrow drawn alongside the transaction code]

Hardware TM for GPUs, Challenge #3: Conflict Detection
- Existing HTMs use the cache coherence protocol
  - Not available on GPUs
  - No private data cache per thread
- Signatures?
  - 1024 bits / thread
  - 3.8 MB / 30k threads

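The signature storage figure can be checked with a few lines of arithmetic. The sketch below assumes "30k threads" means 30,000 and that MB means 10^6 bytes; neither is stated on the slide.

```python
SIGNATURE_BITS = 1024   # per-thread signature size from the slide
THREADS = 30_000        # assumption: "30k" read as 30,000

bytes_per_thread = SIGNATURE_BITS // 8        # 128 bytes
total_bytes = bytes_per_thread * THREADS      # 3,840,000 bytes
total_mb = total_bytes / 1e6                  # about 3.8 MB

print(f"{total_mb:.2f} MB of signature storage")
```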
Hardware TM for GPUs, Challenge #4: Write Buffer
- GPU core (SM): 1024 to 1536 threads (many warps) share one L1 data cache
- Fermi's L1 data cache (48 kB) = 384 x 128 B lines
- Problem: 384 lines / 1536 threads < 1 line per thread!

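The per-thread share of the L1 data cache follows directly from the numbers on the slide. A minimal sketch, assuming 48 kB means 48 x 1024 bytes:

```python
L1_BYTES = 48 * 1024    # Fermi L1 data cache size (assumes 1 kB = 1024 B)
LINE_BYTES = 128        # cache line size from the slide

lines = L1_BYTES // LINE_BYTES
# Lines available per thread at the two occupancy levels on the slide.
lines_per_thread = {threads: lines / threads for threads in (1024, 1536)}

print(lines, lines_per_thread)
```

Either way each thread gets well under one cache line, so the L1 cannot serve as a per-thread write buffer.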
KILO TM: Value-Based Conflict Detection
- Global memory: initially A=1, B=0; after TX1 commits, B=2
- TX1: atomic {B=A+1}
    TxBegin
    LD  r1, [A]
    ADD r1, 1
    ST  r1, [B]
    TxCommit
  Private memory: Read-Log A=1, Write-Log B=2
- TX2: atomic {A=B+2}
    TxBegin
    LD  r2, [B]
    ADD r2, 2
    ST  r2, [A]
    TxCommit
  Private memory: Read-Log B=0, Write-Log A=2
- Self-validation + abort:
  - Only detects the existence of a conflict (not its identity)

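The read-log and write-log example above can be replayed in a short Python sketch. The `Transaction` class and its method names are hypothetical; it models only the idea of validating by re-reading logged values, not the KILO TM hardware, and it assumes a repeated load returns the value already in the log.

```python
class Transaction:
    """Buffers reads and writes privately; validates by comparing values."""

    def __init__(self, memory):
        self.memory = memory
        self.read_log = {}    # address -> value seen at first read
        self.write_log = {}   # address -> value to publish at commit

    def load(self, addr):
        if addr in self.write_log:          # read own buffered write
            return self.write_log[addr]
        if addr not in self.read_log:
            self.read_log[addr] = self.memory[addr]
        return self.read_log[addr]

    def store(self, addr, value):
        self.write_log[addr] = value

    def commit(self):
        # Self-validation: every logged value must still match memory.
        if any(self.memory[a] != v for a, v in self.read_log.items()):
            return False                    # conflict exists; abort
        self.memory.update(self.write_log)
        return True


memory = {"A": 1, "B": 0}
tx1, tx2 = Transaction(memory), Transaction(memory)
tx1.store("B", tx1.load("A") + 1)   # atomic {B = A + 1}
tx2.store("A", tx2.load("B") + 2)   # atomic {A = B + 2}

tx1_ok = tx1.commit()               # read-log A=1 still matches
tx2_ok = tx2.commit()               # read-log B=0, but memory has B=2

retry = Transaction(memory)         # aborted TX2 re-executes
retry.store("A", retry.load("B") + 2)
retry_ok = retry.commit()
print(tx1_ok, tx2_ok, retry_ok, memory)
```

Note that TX2 learns only that some logged value changed, not which transaction changed it, which matches the "existence, not identity" point on the slide.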
Parallel Validation? Data Race!?!
- Global memory: A=1, B=0
- TX1: atomic {B=A+1}; private memory: Read-Log A=1, Write-Log B=2
- TX2: atomic {A=B+2}; private memory: Read-Log B=0, Write-Log A=2
- Valid outcomes:
  - TX1 then TX2: A=4, B=2
  - OR TX2 then TX1: A=2, B=3

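The race can be made concrete by letting both transactions validate before either one writes. This is a sketch of one bad interleaving, with the logs copied from the slide; the dictionary layout and the `validate` helper are assumptions made for the illustration.

```python
memory = {"A": 1, "B": 0}
tx1 = {"read": {"A": 1}, "write": {"B": 2}}   # atomic {B = A + 1}
tx2 = {"read": {"B": 0}, "write": {"A": 2}}   # atomic {A = B + 2}


def validate(tx, mem):
    """Value-based validation: logged reads must equal current memory."""
    return all(mem[addr] == value for addr, value in tx["read"].items())


# Unsafe interleaving: both validate first, then both commit.
ok1 = validate(tx1, memory)
ok2 = validate(tx2, memory)
if ok1:
    memory.update(tx1["write"])
if ok2:
    memory.update(tx2["write"])

serial_outcomes = [{"A": 4, "B": 2}, {"A": 2, "B": 3}]
is_serializable = memory in serial_outcomes
print(memory, is_serializable)
```

Both validations pass, yet the final state matches neither serial order listed on the slide.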
Serialize Validation?
[Figure: timeline at the commit unit in front of global memory; TX1 performs V+C while TX2 stalls, then TX2 performs V+C. V = validation, C = commit]
- Benefit #1: No data race
- Benefit #2: No livelock
- Drawback: serializes non-conflicting transactions ("collateral damage")

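A serializing commit unit can be sketched as a loop that validates and commits one transaction at a time. The function name, the dictionary layout, and the third transaction (TX3, touching unrelated addresses C and D) are assumptions added to show the "collateral damage" case.

```python
def serial_commit_unit(memory, transactions):
    """Validate and commit strictly one transaction at a time, in order."""
    results = []
    for slot, tx in enumerate(transactions):
        ok = all(memory[a] == v for a, v in tx["read"].items())
        if ok:
            memory.update(tx["write"])
        # `slot` counts how many V+C steps this transaction waited behind.
        results.append({"name": tx["name"], "committed": ok, "stalled": slot})
    return results


memory = {"A": 1, "B": 0, "C": 5, "D": 0}
queue = [
    {"name": "TX1", "read": {"A": 1}, "write": {"B": 2}},
    {"name": "TX2", "read": {"B": 0}, "write": {"A": 2}},   # conflicts with TX1
    {"name": "TX3", "read": {"C": 5}, "write": {"D": 6}},   # conflicts with nobody
]
results = serial_commit_unit(memory, queue)
print(results, memory)
```

TX2 is correctly aborted, so there is no data race, but TX3 waits two slots even though it shares no address with the others.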
Solution: Speculative Validation
- Key idea: split conflict detection into two parts
  1. Recently committed TX, in parallel
  2. Concurrently committing TX, in commit order
     - Approximate
[Figure: timeline at the commit unit in front of global memory; TX1, TX2, and TX3 overlap their V+C phases, with a stall marked and "RS" labels shown. V = validation, C = commit]
- Conflicts are rare, so commit parallelism is good

KILO TM Implementation
- Minimal modification to existing GPU architecture
[Figure: GPU block diagram highlighting the SIMT stacks, the commit unit, and the TX log unit]

Evaluation Methodology
- GPGPU-Sim 3.0 (BSD license)
  - Detailed: IPC correlation of 0.93 vs GT200
  - KILO TM (timing-driven memory accesses)
- GPU TM applications
  - Hash Table (HT-H, HT-L)
  - Bank Account (ATM)
  - Cloth Physics (CL)
  - Barnes-Hut (BH)
  - CudaCuts (CC)
  - Data Mining (AP)

Performance (vs. Serializing TX)
[Figure: performance relative to serializing TX; higher is better]
- Serializing TX ≈ coarse-grained locks

Performance (Exec. Time)
[Figure: normalized execution time (axis 0 to 3) for Ideal TM, KILO TM, and FG Lock across HT-H, HT-L, ATM, CL, BH, CC, AP; lower is better]
- Captures 59% of FG lock performance

Implementation Complexity
- Logs in private memory @ L1 data cache
- Commit unit
  - 5 kB Last Writer History Unit
  - 19 kB Transaction Status
  - 32 kB Read-Set and Write-Set Buffer
- CACTI 5.3 @ 40 nm
  - 0.40 mm^2 x 6 memory partitions
  - 0.5% of 520 mm^2

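The 0.5% area figure follows from the CACTI numbers on the slide. A quick check, taking the slide's values at face value:

```python
AREA_PER_PARTITION_MM2 = 0.40   # commit unit area per memory partition
PARTITIONS = 6
DIE_AREA_MM2 = 520

total_area_mm2 = AREA_PER_PARTITION_MM2 * PARTITIONS      # about 2.4 mm^2
overhead_percent = 100 * total_area_mm2 / DIE_AREA_MM2    # about 0.46%

print(f"{total_area_mm2:.2f} mm^2, {overhead_percent:.2f}% of the die")
```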
Summary
- KILO TM: Hardware TM for GPUs
  - 1000s of concurrent scalar TXs
  - Handles scalar TX abort
  - No cache coherence protocol dependency
  - Word-level conflict detection
  - Unbounded transactions
- 59% of fine-grained locking performance
  - 128X faster than serializing TX execution
- 0.5% area overhead
Questions?

Backup Slides

ABA Problem?
- Classic example: linked-list-based stack
    top -> A -> B -> C -> Null
- Thread 0, pop():
    while (true) {
      t = top;
      next = t->Next;
      // thread 2: pop A, pop B, push A
      if (atomicCAS(&top, t, next) == t) break;  // succeeds!
    }
[Figure: stack states. Thread 0 reads t = A and next = B; thread 2 leaves top -> A -> C -> Null; the CAS then sets top to B, followed by C and Null]

ABA Problem?
- atomicCAS protects only a single word
  - Only part of the data structure
    top -> A -> B -> C -> Null
    while (true) {
      t = top;
      next = t->Next;
      if (atomicCAS(&top, t, next) == t) break;  // succeeds!
    }
- Value-based conflict detection protects all relevant parts of the data structure

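The ABA interleaving can be replayed single-threaded in Python. This is a sketch: `Node`, `Stack`, and `cas_top` are hypothetical stand-ins for the linked list and for `atomicCAS`, object identity stands in for pointer equality, and the "value-based" check simply re-reads every location thread 0 read.

```python
class Node:
    def __init__(self, name, next=None):
        self.name = name
        self.next = next


class Stack:
    def __init__(self, top):
        self.top = top

    def cas_top(self, expected, new):
        """Compare-and-swap on `top` only; returns the old value."""
        old = self.top
        if old is expected:
            self.top = new
        return old


c = Node("C")
b = Node("B", c)
a = Node("A", b)
stack = Stack(a)                      # top -> A -> B -> C

# Thread 0 starts pop(): reads top and top->Next.
t = stack.top                         # A
nxt = t.next                          # B

# Thread 2 interleaves: pop A, pop B, push A.
stack.top = stack.top.next            # pop A: top -> B
stack.top = stack.top.next            # pop B: top -> C
a.next = stack.top                    # push A: A -> C
stack.top = a                         # top -> A -> C

# Value-based validation re-checks both values thread 0 read.
validation_ok = stack.top is t and t.next is nxt

# The single-word CAS sees top == A again and succeeds anyway.
cas_succeeded = stack.cas_top(t, nxt) is t
print(cas_succeeded, stack.top.name, validation_ok)
```

The CAS succeeds and leaves `top` pointing at the already-popped node B, while the value-based check fails because `A.next` changed from B to C.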