Hardware Transactional Memory for GPU Architectures
Wilson W. L. Fung, Inderpreet Singh, Andrew Brownsword, Tor M. Aamodt
University of British Columbia
In Proc. 2011 ACM/IEEE Int’l Symp. Microarchitecture (MICRO-44)

Motivation
  • Lifetime of GPU application development
  • E.g. N-Body with 5M bodies
      ◦ CUDA SDK: O(n^2), 1640 s (barrier)
      ◦ Barnes Hut: O(n log n), 5.2 s (locks)
[Figure: functionality and performance plotted against development time, for fine-grained locking and for transactional memory (?)]
Wilson Fung, Inderpreet Singh, Andrew Brownsword, Tor Aamodt    Hardware TM for GPU Architectures

Are TM and GPUs Incompatible?
GPUs are different from multi-core CPUs
  • 1000s of concurrent scalar threads
  • Challenges (from a TM perspective)
Our solution: KILO TM
  • Hardware TM for GPUs

Hardware TM for GPUs
Challenge #1: SIMD Hardware
  • On GPUs, scalar threads in a warp/wavefront execute in lockstep
  • Example: a warp with 4 scalar threads (T0 to T3) runs

        ...
        TxBegin
        LD  r2, [B]
        ADD r2, 2
        ST  r2, [A]
        TxCommit
        ...

[Figure: at TxCommit, threads T0 to T3 of the warp split into committed and aborted threads: branch divergence!]

KILO TM – Solution to Challenge #1: SIMD Hardware
  • Transaction abort
      ◦ Like a loop
      ◦ Extend SIMT stack

        ...
        TxBegin
        LD  r2, [B]
        ADD r2, 2
        ST  r2, [A]
        TxCommit         (Abort: aborted threads loop back to TxBegin)
        ...
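
The "abort is like a loop" idea can be sketched as a retry loop over an active mask. This is only a software model under stated assumptions: `run_warp_transaction` and the `try_commit` callback are hypothetical names, and the real mechanism is an extension of the SIMT stack in hardware, which this sketch does not reproduce.

```python
def run_warp_transaction(num_threads, try_commit):
    """Model one warp retrying a transaction until every thread commits.

    try_commit(tid, attempt) -> bool is a hypothetical callback that says
    whether thread `tid` commits on its `attempt`-th try.
    Returns {tid: number of attempts that thread needed}.
    """
    active = set(range(num_threads))      # threads that still have to commit
    attempts = {tid: 0 for tid in active}
    while active:                         # one iteration = one pass TxBegin..TxCommit
        aborted = set()
        for tid in sorted(active):
            attempts[tid] += 1
            if not try_commit(tid, attempts[tid]):
                aborted.add(tid)          # aborted threads go around the loop again
        active = aborted                  # committed threads sit out the retry
    return attempts


# Assumed scenario: odd-numbered threads abort once, even-numbered threads never do.
attempts = run_warp_transaction(4, lambda tid, attempt: attempt > tid % 2)
print(attempts)  # {0: 1, 1: 2, 2: 1, 3: 2}
```

Committed threads drop out of the mask while aborted threads re-execute, which is the same shape as threads leaving a divergent loop at different iterations.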

Hardware TM for GPUs
Challenge #2: Transaction Rollback
  • CPU core: 10s of registers; checkpoint the register file @ TX entry, restore on abort
  • GPU core (SM): many warps share one register file of 32k registers; checkpoint?
  • 2 MB total on-chip storage

KILO TM – Solution to Challenge #2: Transaction Rollback
  • SW register checkpoint
      ◦ Most TX: registers overwritten at first use
      ◦ TX in Barnes Hut: checkpoint 2 registers

        TxBegin
        LD  r2, [B]      (r2 overwritten at first use)
        ADD r2, 2
        ST  r2, [A]
        TxCommit         (Abort: restart at TxBegin)
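
A minimal sketch of how the set of registers to checkpoint could be derived. The rule used here is an assumption, not something the slide states: a register needs a checkpoint only if the transaction reads its incoming value and also overwrites it. The function name and the `(dest, sources)` instruction encoding are hypothetical.

```python
def registers_to_checkpoint(instructions):
    """instructions: list of (dest, sources) in program order; dest may be None.

    Assumed rule: checkpoint a register only if it is read before being
    written inside the transaction (live-in) and is also written there.
    """
    written = set()
    live_in = set()
    for dest, sources in instructions:
        for reg in sources:
            if reg not in written:
                live_in.add(reg)      # the value from before TxBegin is used
        if dest is not None:
            written.add(dest)
    return live_in & written


# The transaction from the slide: r2 is overwritten at first use.
slide_tx = [
    ("r2", []),        # LD  r2, [B]
    ("r2", ["r2"]),    # ADD r2, 2
    (None, ["r2"]),    # ST  r2, [A]
]
print(registers_to_checkpoint(slide_tx))            # set()

# Hypothetical transaction that increments a live-in register.
print(registers_to_checkpoint([("r1", ["r1"])]))    # {'r1'}
```

Under this rule the slide's transaction needs no checkpoint at all, because a re-execution reloads r2 from memory before using it.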

Hardware TM for GPUs
Challenge #3: Conflict Detection
Existing HTMs use the cache coherence protocol
  • Not available on GPUs
  • No private data cache per thread
Signatures?
  • 1024 bits / thread
  • 3.8 MB / 30k threads
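
The signature storage figure can be checked with a few lines of arithmetic. This sketch assumes "30k" means 30,000 threads and that MB means decimal megabytes; neither is stated on the slide.

```python
BITS_PER_SIGNATURE = 1024      # from the slide
THREADS = 30_000               # assumption: "30k" read as 30,000

bytes_per_signature = BITS_PER_SIGNATURE // 8
total_bytes = bytes_per_signature * THREADS
total_mb = total_bytes / 1_000_000   # decimal megabytes (assumption)

print(bytes_per_signature)     # 128
print(round(total_mb, 2))      # 3.84
```

The result, about 3.8 MB, matches the slide.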

Hardware TM for GPUs
Challenge #4: Write Buffer
  • GPU core (SM): warps share one L1 data cache
  • Fermi’s L1 data cache (48 kB) = 384 × 128 B lines
  • 1024-1536 threads
  • Problem: 384 lines / 1536 threads < 1 line per thread!
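
The cache-capacity argument is simple arithmetic, sketched below. The helper name `lines_per_thread` is hypothetical, and 48 kB is taken as 48 × 1024 bytes, which is what makes the line count come out to 384.

```python
L1_BYTES = 48 * 1024       # Fermi L1 data cache size from the slide
LINE_BYTES = 128           # cache line size from the slide

lines = L1_BYTES // LINE_BYTES


def lines_per_thread(threads):
    """Cache lines available to each thread if the L1 were split evenly."""
    return lines / threads


print(lines)                       # 384
print(lines_per_thread(1024))      # 0.375
print(lines_per_thread(1536))      # 0.25
```

Across the whole 1024-1536 thread range each thread gets well under one line, so the L1 cannot serve as a per-thread write buffer.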

KILO TM: Value-Based Conflict Detection
Global memory initially holds A=1, B=0.

  TX1: atomic {B=A+1}              TX2: atomic {A=B+2}
    TxBegin                          TxBegin
    LD  r1, [A]                      LD  r2, [B]
    ADD r1, 1                        ADD r2, 2
    ST  r1, [B]                      ST  r2, [A]
    TxCommit                         TxCommit
  Private memory:                  Private memory:
    Read-Log:  A=1                   Read-Log:  B=0
    Write-Log: B=2                   Write-Log: A=2

Once TX1 commits, global memory holds A=1, B=2, so the B=0 in TX2's read-log no longer matches.
  • Self-validation + abort
      ◦ Only detects existence of conflict (not identity)
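
A minimal single-threaded sketch of value-based conflict detection for the TX1/TX2 example. The `Transaction` class and its method names are hypothetical; the sketch models only the logs and the re-read check, not where the logs live or how the hardware orders commits.

```python
class Transaction:
    """Buffers writes privately and validates reads by value at commit."""

    def __init__(self, memory):
        self.memory = memory      # shared "global memory" (a dict)
        self.read_log = {}        # address -> value seen at first read
        self.write_log = {}       # address -> value buffered until commit

    def load(self, addr):
        if addr in self.write_log:            # read our own buffered write
            return self.write_log[addr]
        value = self.memory[addr]
        self.read_log.setdefault(addr, value)
        return value

    def store(self, addr, value):
        self.write_log[addr] = value

    def validate(self):
        # Re-read every logged address; any changed value means a conflict.
        return all(self.memory[a] == v for a, v in self.read_log.items())

    def commit(self):
        if not self.validate():
            return False                      # abort: the caller must re-execute
        self.memory.update(self.write_log)
        return True


memory = {"A": 1, "B": 0}
tx1, tx2 = Transaction(memory), Transaction(memory)
tx1.store("B", tx1.load("A") + 1)     # atomic {B=A+1}
tx2.store("A", tx2.load("B") + 2)     # atomic {A=B+2}

print(tx1.commit())   # True
print(tx2.commit())   # False
print(memory)         # {'A': 1, 'B': 2}
```

TX2 learns only that some value it read has changed, not which transaction changed it, which is the "existence, not identity" point on the slide.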

Parallel Validation?
  TX1: atomic {B=A+1}    Read-Log: A=1    Write-Log: B=2
  TX2: atomic {A=B+2}    Read-Log: B=0    Write-Log: A=2
  Global memory: A=1, B=0

Valid outcomes:
  • TX1 then TX2: A=4, B=2
  • OR TX2 then TX1: A=2, B=3
Data race!?!
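
The race can be reproduced in a small model. This sketch assumes the worst-case interleaving, in which both transactions validate before either writes back; the function names are hypothetical.

```python
def serial(order):
    """Run the two atomic blocks one after another in the given order."""
    mem = {"A": 1, "B": 0}
    for tx in order:
        if tx == "TX1":
            mem["B"] = mem["A"] + 1     # atomic {B=A+1}
        else:
            mem["A"] = mem["B"] + 2     # atomic {A=B+2}
    return mem


def racy_parallel_commit():
    """Both transactions validate first, then both write back."""
    mem = {"A": 1, "B": 0}
    logs = [
        ({"A": 1}, {"B": 2}),   # TX1: read-log, write-log
        ({"B": 0}, {"A": 2}),   # TX2: read-log, write-log
    ]
    # Step 1: each read-log is checked against memory nobody has written yet.
    verdicts = [all(mem[a] == v for a, v in reads.items()) for reads, _ in logs]
    # Step 2: every transaction that passed writes its buffered values.
    for (_, writes), ok in zip(logs, verdicts):
        if ok:
            mem.update(writes)
    return verdicts, mem


print(serial(["TX1", "TX2"]))     # {'A': 4, 'B': 2}
print(serial(["TX2", "TX1"]))     # {'A': 2, 'B': 3}
print(racy_parallel_commit())     # ([True, True], {'A': 2, 'B': 2})
```

Both validations pass, yet the final state A=2, B=2 matches neither serial order.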

Serialize Validation?
[Figure: timeline at the commit unit in front of global memory; TX1 performs V+C while TX2 stalls, then TX2 performs V+C. V = validation, C = commit.]
  • Benefit #1: No data race
  • Benefit #2: No livelock
  • Drawback: Serializes non-conflicting transactions (“collateral damage”)
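
A sketch of serialized validation, assuming a commit unit that handles exactly one transaction at a time and sends a failed transaction back to re-execute and queue again. `execute` and `serialized_commit` are hypothetical names, and the retry policy is an assumption.

```python
from collections import deque


def execute(memory, body):
    """Run a transaction body speculatively; return (read_log, write_log)."""
    read_log, write_log = {}, {}

    def load(addr):
        if addr in write_log:
            return write_log[addr]
        return read_log.setdefault(addr, memory[addr])

    def store(addr, value):
        write_log[addr] = value

    body(load, store)
    return read_log, write_log


def serialized_commit(memory, bodies):
    """Validate and commit one transaction at a time; return abort counts."""
    # All transactions first execute concurrently against the same memory.
    queue = deque((i, execute(memory, body)) for i, body in enumerate(bodies))
    aborts = [0] * len(bodies)
    while queue:
        i, (read_log, write_log) = queue.popleft()
        if all(memory[a] == v for a, v in read_log.items()):
            memory.update(write_log)          # V+C with no other TX in flight
        else:
            aborts[i] += 1                    # abort, re-execute, queue again
            queue.append((i, execute(memory, bodies[i])))
    return aborts


memory = {"A": 1, "B": 0}
tx1 = lambda load, store: store("B", load("A") + 1)   # atomic {B=A+1}
tx2 = lambda load, store: store("A", load("B") + 2)   # atomic {A=B+2}
print(serialized_commit(memory, [tx1, tx2]))   # [0, 1]
print(memory)                                  # {'A': 4, 'B': 2}
```

In this model a transaction fails validation only because another one committed in the meantime, so every abort is paid for by progress elsewhere. The cost is that unrelated transactions also wait in the same queue.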

Solution: Speculative Validation
Key idea: split conflict detection into two parts
  1. Recently committed TX, in parallel
  2. Concurrently committing TX, in commit order
      ◦ Approximate
[Figure: timeline; TX1, TX2 and TX3 overlap their V+C at the commit unit (blocks labelled RS) in front of global memory, with one stall. V = validation, C = commit.]
Conflicts are rare: good commit parallelism.
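
One possible reading of part 2 is sketched below. It rests on an assumption the slide does not spell out: a transaction may validate in parallel unless it reads an address written by a transaction earlier in commit order, in which case it stalls behind that writer. The function name and the addresses C, D and E are hypothetical, and exact sets stand in for whatever approximate structure the hardware uses.

```python
def stalls_in_commit_order(transactions):
    """transactions: list of (name, read_addrs, write_addrs) in commit order.

    Returns {name: [earlier transactions it must wait for]} under the
    assumed rule that a read of an address written by an earlier,
    still-committing transaction forces a stall.
    """
    waits = {}
    for idx, (name, reads, _writes) in enumerate(transactions):
        waits[name] = [
            earlier
            for earlier, _earlier_reads, earlier_writes in transactions[:idx]
            if set(reads) & set(earlier_writes)
        ]
    return waits


in_flight = [
    ("TX1", {"A"}, {"B"}),
    ("TX2", {"C"}, {"D"}),    # touches nothing TX1 writes: proceeds in parallel
    ("TX3", {"B"}, {"E"}),    # reads B, which TX1 writes: stalls behind TX1
]
print(stalls_in_commit_order(in_flight))
# {'TX1': [], 'TX2': [], 'TX3': ['TX1']}
```

An approximate detector could report extra overlaps, which would only add stalls, but it must never miss a real one.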

KILO TM Implementation
  • Minimal modification to existing GPU architecture
[Figure: GPU architecture diagram; labelled components: SIMT stacks, commit unit, TX log unit]

Evaluation Methodology
  • GPGPU-Sim 3.0 (BSD license)
      ◦ Detailed: IPC correlation of 0.93 vs GT200
      ◦ KILO TM (timing-driven memory accesses)
  • GPU TM applications
      ◦ Hash Table (HT-H, HT-L)
      ◦ Bank Account (ATM)
      ◦ Cloth Physics (CL)
      ◦ Barnes Hut (BH)
      ◦ CudaCuts (CC)
      ◦ Data Mining (AP)

Performance (vs. Serializing TX)
[Figure: performance relative to serializing TX; higher is better]
  • Serializing TX ≈ coarse-grained locks

Performance (Exec. Time)
[Figure: normalized execution time (axis 0 to 3) of Ideal TM, KILO TM and FG Lock for HT-H, HT-L, ATM, CL, BH, CC and AP; lower is better]
  • Captures 59% of FG lock performance

Implementation Complexity
  • Logs in private memory @ L1 data cache
  • Commit unit
      ◦ 5 kB Last Writer History unit
      ◦ 19 kB transaction status
      ◦ 32 kB read-set and write-set buffer
  • CACTI 5.3 @ 40 nm
      ◦ 0.40 mm² × 6 memory partitions
      ◦ 0.5% of 520 mm²
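
The storage and area figures can be cross-checked with a short calculation. The variable names are hypothetical, and reading "× 6" as one commit unit per memory partition is an assumption.

```python
commit_unit_kb = {
    "last_writer_history": 5,
    "transaction_status": 19,
    "read_write_set_buffer": 32,
}
AREA_PER_PARTITION_MM2 = 0.40   # CACTI 5.3 @ 40 nm, from the slide
MEMORY_PARTITIONS = 6           # assumption: one commit unit per partition
DIE_AREA_MM2 = 520

storage_kb = sum(commit_unit_kb.values())
total_area_mm2 = AREA_PER_PARTITION_MM2 * MEMORY_PARTITIONS
overhead_percent = 100 * total_area_mm2 / DIE_AREA_MM2

print(storage_kb)                    # 56
print(round(total_area_mm2, 2))      # 2.4
print(round(overhead_percent, 2))    # 0.46
```

The 0.46% result rounds to the 0.5% quoted on the slide.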

Summary
  • KILO TM: hardware TM for GPUs
      ◦ 1000s of concurrent scalar TXs
      ◦ Handles scalar TX abort
      ◦ No cache coherence protocol dependency
      ◦ Word-level conflict detection
      ◦ Unbounded transactions
  • 59% of fine-grained locking performance
      ◦ 128× faster than serializing TX execution
  • 0.5% area overhead
Questions?

Backup Slides

ABA Problem?
  • Classic example: linked-list-based stack
        top -> A -> B -> C -> Null
  • Thread 0 – pop():

        while (true) {
          t = top;                                 // t = A
          next = t->Next;                          // next = B
          // thread 2: pop A, pop B, push A        (now top -> A -> C -> Null)
          if (atomicCAS(&top, t, next) == t)       // succeeds!
            break;
        }

  • Result: top -> B -> C -> Null, although B was already popped; a correct pop would leave top -> C -> Null
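
The interleaving can be replayed step by step in a single-threaded model. `Node`, `Stack` and `cas_top` are hypothetical stand-ins for the linked list and for `atomicCAS` on `top`; the sketch mimics only the compare-and-swap semantics, not real concurrency.

```python
class Node:
    def __init__(self, name, next_node=None):
        self.name = name
        self.next = next_node


class Stack:
    def __init__(self, top):
        self.top = top

    def cas_top(self, expected, new):
        """Single-word compare-and-swap on `top`; returns the old value."""
        old = self.top
        if old is expected:
            self.top = new
        return old


c = Node("C")
b = Node("B", c)
a = Node("A", b)
stack = Stack(a)                  # top -> A -> B -> C -> Null

# Thread 0 begins pop() and is then delayed.
t = stack.top                     # A
nxt = t.next                      # B

# Thread 2 runs in the gap: pop A, pop B, push A.
stack.top = stack.top.next        # pop A: top -> B
stack.top = stack.top.next        # pop B: top -> C
a.next = stack.top                # push A: A -> C
stack.top = a                     # top -> A -> C -> Null

# Thread 0 resumes: top is A again, so the CAS cannot see the change.
succeeded = stack.cas_top(t, nxt) is t
print(succeeded, stack.top.name)  # True B
```

The CAS compares only `top`, so it cannot tell that `A.next` changed from B to C in the meantime.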

ABA Problem?
  • atomicCAS protects only a single word
      ◦ Only part of the data structure
        top -> A -> B -> C -> Null

        while (true) {
          t = top;
          next = t->Next;
          if (atomicCAS(&top, t, next) == t)       // succeeds!
            break;
        }

  • Value-based conflict detection protects all relevant parts of the data structure
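
A sketch of why value-based validation catches this case. It assumes pop() runs as a transaction whose read-log covers both `top` and the popped node's `Next` field; the flat dictionary standing in for memory and the helper names are hypothetical.

```python
def pop_speculatively(mem):
    """Execute pop() as a transaction; return (read_log, write_log)."""
    read_log = {"top": mem["top"]}
    t = read_log["top"]
    read_log[t + ".next"] = mem[t + ".next"]
    write_log = {"top": read_log[t + ".next"]}
    return read_log, write_log


def validate(mem, read_log):
    """Value-based check: every logged read must still hold the same value."""
    return all(mem[addr] == value for addr, value in read_log.items())


memory = {"top": "A", "A.next": "B", "B.next": "C", "C.next": None}

# Thread 0 executes pop() speculatively: it reads top=A and A.next=B.
read_log, write_log = pop_speculatively(memory)

# Thread 2 commits pop A, pop B, push A in the meantime.
memory["top"] = "A"
memory["A.next"] = "C"

print(validate(memory, read_log))        # False

# Thread 0 aborts, re-executes against current memory, and then commits.
read_log, write_log = pop_speculatively(memory)
if validate(memory, read_log):
    memory.update(write_log)
print(memory["top"])                     # C
```

`top` alone still reads A, which is all a single-word CAS would compare. The logged `A.next` is what exposes the intervening pops and push.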