Hardware Transactional Memory for GPU Architectures
- Slides: 54
Hardware Transactional Memory for GPU Architectures*
Wilson W. L. Fung, Inderpeet Singh, Andrew Brownsword, Tor M. Aamodt
University of British Columbia
*In Proc. 2011 ACM/IEEE Int'l Symp. Microarchitecture (MICRO-44)
Motivation
- Lifetime of GPU application development: functionality first, then performance
- E.g. N-Body with 5M bodies
  - CUDA SDK: O(n^2), 1640 s (barrier)
  - Barnes Hut: O(n log n), 5.2 s (locks)
- [Figure: functionality and performance over development time; panels: Fine-Grained Locking, Transactional Memory (?)]
Wilson Fung, Inderpeet Singh, Andrew Brownsword, Tor Aamodt: Hardware TM for GPU Architectures
Talk Outline
- What we mean by "GPU" in this work.
- Data Synchronization on GPUs.
- What is Transactional Memory (TM)?
- TM is compatible with OpenCL.
- ... but is TM compatible with GPU hardware?
- KILO TM: A Hardware TM for GPUs.
- Results
What is a GPU (in this work)?
- GPU is an NVIDIA/AMD-like compute accelerator
  - SIMD HW + aggressive memory subsystem => high compute throughput and efficiency
- Non-graphics API: OpenCL, DirectCompute, CUDA
  - Programming model: hierarchy of scalar threads
  - Today: limited communication & synchronization
- [Figure: thread hierarchy; labels: Kernel, Work Groups / Thread Blocks, Wavefront / Warp, Scalar Thread, Barrier, Shared (Local) Memory, Global Memory]
Baseline GPU Architecture
- [Figure: SIMT cores connected through an interconnection network to memory partitions. Memory partition labels: Atomic Op. Unit, Last-Level Cache Bank, Off-Chip DRAM Channel. SIMT core labels: SIMT Front End (Fetch, Decode, Schedule, Branch), Done (Warp ID), SIMD Datapath, Memory Subsystem (Tex $, Const $, SMem, Non-Coherent L1 D-Cache)]
Stack-Based SIMD Reconvergence ("SIMT") (Levinthal SIGGRAPH'84, Fung MICRO'07)
- [Figure: a warp of 4 threads executing a control-flow graph with basic blocks and active masks A/1111, B/1111, C/1001, D/0110, E/1111, G/1111. The per-warp stack holds Reconv. PC, Next PC and Active Mask entries; the top of stack (TOS) supplies the common PC. A timeline shows which threads execute each block.]
Data Synchronizations on GPUs
- Motivation
  - Solve wider range of problems on GPU
  - Data race => data synchronization
- Current solution: atomic read-modify-write (32-bit/64-bit)
  - Best solution?
- Why Transactional Memory?
  - E.g. N-Body with 5M bodies (traditional sync, not TM)
    - CUDA SDK: O(n^2), 1640 s (barrier)
    - Barnes Hut: O(n log n), 5.2 s (atomics, harder to get right)
  - Easier to write/debug efficient algorithms
  - Practical efficiency: want the efficiency of the GPU with reasonable (not superhuman) effort and time.
Data Synchronizations on GPUs
- Deadlock-free code with fine-grained locks and 10,000+ hardware-scheduled threads is hard
  - [Figure: # possible global lock states, plotted over # Locks x # Sharing Threads. Which of these states are deadlocks?!]
- Other general problems with lock-based synchronization
  - Implicit relationship between locks and objects being protected
  - Code is not composable
Data Synchronization Problems Specific to GPUs
- Interaction between locks and SIMT control flow can cause deadlocks
- Spin lock (threads of one warp that lose the CAS keep spinning, so the winner never reaches the release):
    A: while(atomicCAS(lock,0,1)==1);
    B: // Critical Section …
    C: *lock = 0;
- Restructured so that the release happens inside the loop body (atomicCAS returns the old value, so 0 means the lock was acquired):
    A: done = 0;
    B: while(!done){
    C:   if(atomicCAS(lock,0,1)==0){
    D:     // Critical Section …
    E:     *lock = 0;
    F:     done = 1;
    G:   }
    H: }
Transactional Memory
- Program specifies atomic code blocks called transactions [Herlihy'93]
- Lock version (potential deadlock!):
    Lock(X[a]); Lock(X[b]); Lock(X[c]);
    X[c] = X[a]+X[b];
    Unlock(X[c]); Unlock(X[b]); Unlock(X[a]);
- TM version:
    atomic {
      X[c] = X[a]+X[b];
    }
Transactional Memory: Programmers' View
- Non-conflicting transactions may run in parallel
- Conflicting transactions are automatically serialized
- [Figure: memory locations A, B, C, D. Without a conflict, TX1 and TX2 both commit. With a conflict, TX1 commits while TX2 aborts and commits later. Timelines show the equivalent serial orders: TX1 then TX2, or TX2 then TX1.]
Transactional Memory
- Each transaction has 3 phases
  - Execution
    - Track all memory accesses (Read-Set and Write-Set)
  - Validation
    - Detect any conflicting accesses between transactions
    - Resolve conflict if needed (abort/stall)
  - Commit
    - Update global memory
Transactional Memory on OpenCL
- A natural extension to the OpenCL programming model
  - Program can launch many more threads than the GPU hardware can execute concurrently
  - GPU-TM? Current threads running transactions do not need to wait for future unscheduled threads
Are TM and GPUs Incompatible?
The problem with GPUs (from TM perspective):
- 1000s of concurrent threads
- Inter-thread spatial locality common
- No cache coherence
- No private cache for each thread (Buffering?)
- Tx abort => control flow divergence
Hardware TM for GPUs
Challenge: Conflict Detection
- Private data cache + coherence: a write sends an invalidation (Inv C) over the bus and the conflict is detected. On GPUs there is no coherence; does each scalar thread need its own cache?
- Signature: 1024-bit signature per thread => 3.8 MB for 30k threads. Scalable on GPUs?
- [Figure: transactions TX1 to TX4 with accesses R(A), W(C); R(D); R(A); R(C), W(B); the write to C raises "Conflict!"]
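The signature storage figure follows from the two numbers on the slide. A short worked version, assuming "30k" means 30,000 threads (the slide does not say whether it is a decimal or binary count):

```latex
\[
1024\ \text{bits} = 128\ \text{B per thread},
\qquad
30 \times 10^{3}\ \text{threads} \times 128\ \text{B} = 3.84 \times 10^{6}\ \text{B} \approx 3.8\ \text{MB}.
\]
```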
Hardware TM for GPUs
Challenge: Transaction Rollback
- CPU core: 10s of registers; checkpoint the register file @ TX entry, restore it on abort
- GPU core (SM): 32k registers shared by many warps; 2 MB total on-chip storage. Checkpoint the register file?
Hardware TM for GPUs
Challenge: Access Granularity and Write Buffer
- CPU core: 1-2 threads per 32 kB L1 data cache, which buffers TX writes until commit to global memory
- GPU core (SM): 1024-1536 threads share one L1 data cache
- Fermi's L1 data cache (48 kB) = 384 x 128 B lines
- Problem: 384 lines / 1536 threads < 1 line per thread!
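The "less than 1 line per thread" claim can be checked directly from the slide's numbers (taking 48 kB as 48 x 1024 B, which is what yields exactly 384 lines):

```latex
\[
\frac{48 \times 1024\ \text{B}}{128\ \text{B/line}} = 384\ \text{lines},
\qquad
\frac{384\ \text{lines}}{1536\ \text{threads}} = 0.25\ \text{lines per thread} = 32\ \text{B per thread}.
\]
```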
Hardware TM on GPUs
Challenge: SIMT Hardware
- On GPUs, scalar threads in a warp/wavefront execute in lockstep
    ...
    TxBegin
    LD  r2, [B]
    ADD r2, 2
    ST  r2, [A]
    TxCommit
    ...
- [Figure: a warp with 8 scalar threads reaches TxCommit; some threads are committed, others aborted. Reconvergence?]
Goal
- We take it as a given that most programmers trying lock-based programming on a GPU will give up before they manage to get their application working.
- Hence, our goal was to find the most efficient approach to implement TM on GPU.
KILO TM
- Supports 1000s of concurrent transactions
- Transaction-aware SIMT stack
- No cache coherence protocol dependency
- Word-level conflict detection
- Captures 59% of FG Lock performance
  - 128X faster than serialized Tx exec.
KILO TM: Design Highlights
- Value-Based Conflict Detection
  - Self-Validation + Abort: simple communication
  - No cache coherence dependence
- Speculative Validation
  - Increase commit parallelism
High Level GPU Architecture + KILO TM Implementation Overview
KILO TM: SIMT Core Changes
- SW Register Checkpoint
  - Observation: most overwritten registers are not used
  - Compiler analysis can identify what to checkpoint
- Transaction Abort
  - ~ Do-While loop
  - Extend SIMT stack with special entries to track aborted transactions in each warp
    TxBegin
    LD  r2, [B]
    ADD r2, 2
    ST  r2, [A]
    TxCommit
- [Figure annotations on the code: "Overwritten" (register r2), "Abort"]
Transaction-Aware SIMT Stack
    A: t = tid.x;
       if (…) {
    B:   tx_begin;
    C:   x[t%10] = y[t] + 1;
    D:   if (s[t])
    E:     y[t] = 0;
    F:   tx_commit;
    G:   z = y[t];
       }
    H: w = y[t+1];
- Implicit loop when abort
- [Figure: SIMT stack snapshots with columns Type, PC, RPC, Active Mask; entry types N, R and T; TOS marks the top of stack]
  - @ tx_begin: entries N H --, N B H, R C --, T C -- (TOS); the active mask is copied into the T entry
  - Branch divergence within Tx: an N E F entry is pushed above T F --
  - @ tx_commit, threads 6 & 7 failed validation: their bits are recorded in the R entry's mask
  - @ tx_commit, restart Tx for threads 6 & 7: a new T C -- entry is pushed with the R entry's mask
  - @ tx_commit, all threads with Tx committed: entries N H --, N G H (TOS)
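The "implicit loop when abort" behaviour can be sketched on the host as a per-warp retry loop. This is only an illustration under stated assumptions, not the hardware stack encoding: `Mask`, `run_warp_transaction` and the `try_commit` callback are hypothetical names, and the callback stands in for execution plus validation of one thread's transaction.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical model: one bit per scalar thread in an 8-thread warp.
using Mask = std::uint8_t;

// Runs a transaction for every thread set in `active` and retries only the
// threads whose commit attempt fails. `try_commit(tid, pass)` returns true
// when that thread's transaction commits on the given pass.
// Returns the number of passes the warp made through the transaction.
inline int run_warp_transaction(Mask active,
                                const std::function<bool(int, int)>& try_commit) {
    assert(static_cast<bool>(try_commit));  // a commit callback is required
    int passes = 0;
    Mask pending = active;                  // plays the role of the T entry's mask
    do {
        Mask restart = 0;                   // plays the role of the R entry's mask
        for (int tid = 0; tid < 8; ++tid) {
            if (!(pending & (1u << tid))) continue;  // thread is masked off
            if (!try_commit(tid, passes)) {
                restart = static_cast<Mask>(restart | (1u << tid));
            }
        }
        pending = restart;                  // aborted threads re-execute together
        ++passes;
    } while (pending != 0);                 // the implicit loop on abort
    return passes;
}
```

With all 8 threads active and threads 6 and 7 failing only on the first pass, the loop makes two passes and the second pass runs just those two threads, which mirrors the "restart Tx for threads 6 & 7" snapshot.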
KILO TM: Value-Based Conflict Detection
- Global memory initially holds A=1, B=0
- TX1: atomic{B=A+1}
    TxBegin
    LD  r1, [A]
    ADD r1, 1
    ST  r1, [B]
    TxCommit
  - Private memory: Read-Log A=1, Write-Log B=2
- TX2: atomic{A=B+2}
    TxBegin
    LD  r2, [B]
    ADD r2, 2
    ST  r2, [A]
    TxCommit
  - Private memory: Read-Log B=0, Write-Log A=2
- Once TX1 commits, global memory holds B=2, which no longer matches the B=0 in TX2's read-log
- Self-Validation + Abort:
  - Only detects existence of conflict (not identity) => no Tx-to-Tx messages, simple communication
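A minimal host-side sketch of value-based validation, assuming a map as a stand-in for global memory and per-transaction vectors for the logs. The names `Tx`, `validate` and `validate_and_commit` are hypothetical; the sketch ignores warps, commit units and timing, and only shows the compare-values-then-publish idea.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

using Addr = std::uint32_t;
using Word = std::int32_t;
using Memory = std::unordered_map<Addr, Word>;  // stand-in for global memory

struct LogEntry { Addr addr; Word value; };

// Hypothetical per-thread transaction state: values observed by loads and
// values buffered by stores stay in private logs until commit.
struct Tx {
    std::vector<LogEntry> read_log;
    std::vector<LogEntry> write_log;

    Word load(const Memory& mem, Addr a) {
        // A transaction sees its own most recent buffered store first.
        for (auto it = write_log.rbegin(); it != write_log.rend(); ++it)
            if (it->addr == a) return it->value;
        assert(mem.count(a) == 1);          // sketch assumes the address exists
        Word v = mem.at(a);
        read_log.push_back({a, v});
        return v;
    }
    void store(Addr a, Word v) { write_log.push_back({a, v}); }
};

// Self-validation: re-read every logged address and compare the values.
inline bool validate(const Tx& tx, const Memory& mem) {
    for (const LogEntry& e : tx.read_log)
        if (mem.at(e.addr) != e.value) return false;
    return true;
}

// Validates, then publishes the write log. Returns false (abort) on mismatch.
inline bool validate_and_commit(const Tx& tx, Memory& mem) {
    if (!validate(tx, mem)) return false;
    for (const LogEntry& e : tx.write_log) mem[e.addr] = e.value;
    return true;
}
```

Replaying the slide with A at address 0 and B at address 1: TX1 commits and sets B=2, after which TX2 fails validation because its read-log still says B=0. TX2 learns only that a conflict exists, not which transaction caused it.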
Parallel Validation? Data Race!?!
- Init: A=1, B=0
- TX1: atomic{B=A+1}; private memory: Read-Log A=1, Write-Log B=2
- TX2: atomic{A=B+2}; private memory: Read-Log B=0, Write-Log A=2
- Valid outcomes: Tx1 then Tx2: B=2, A=4 OR Tx2 then Tx1: A=2, B=3
- If both transactions validate in parallel against A=1, B=0, both pass and both commit, leaving A=2, B=2, which matches neither serial order
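The race can be replayed deterministically on one thread. The sketch below hard-codes the two transactions from the slide; the struct and function names are hypothetical, and the "parallel" schedule is modelled simply as both validations happening before either write.

```cpp
#include <cassert>

// Init on the slide: A=1, B=0.
struct Mem { int A; int B; };

// TX1: atomic { B = A + 1 }   TX2: atomic { A = B + 2 }
// Each transaction logs the value it read and the value it wants to write.
struct Tx1 { int readA; int writeB; };
struct Tx2 { int readB; int writeA; };

inline Tx1 exec_tx1(const Mem& m) { return {m.A, m.A + 1}; }
inline Tx2 exec_tx2(const Mem& m) { return {m.B, m.B + 2}; }

// The two legal serial orders.
inline Mem tx1_then_tx2(Mem m) {
    m.B = exec_tx1(m).writeB;
    m.A = exec_tx2(m).writeA;
    return m;
}
inline Mem tx2_then_tx1(Mem m) {
    m.A = exec_tx2(m).writeA;
    m.B = exec_tx1(m).writeB;
    return m;
}

// Unsafe schedule: both execute, both validate against the same memory
// state, and only then do both write.
inline Mem validate_in_parallel_then_commit(Mem m) {
    Tx1 t1 = exec_tx1(m);
    Tx2 t2 = exec_tx2(m);
    bool ok1 = (m.A == t1.readA);   // passes: nothing has committed yet
    bool ok2 = (m.B == t2.readB);   // passes for the same reason
    assert(ok1 && ok2);             // both checks succeed, which is the problem
    if (ok1) m.B = t1.writeB;
    if (ok2) m.A = t2.writeA;
    return m;
}
```

Starting from A=1, B=0, the serial orders give (A=4, B=2) and (A=2, B=3), while the parallel schedule gives (A=2, B=2).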
Serialize Validation?
- [Figure: timeline at the commit unit in front of global memory; TX1 validates and commits (V+C) while TX2 stalls, then TX2 validates and commits]
- Benefit #1: No data race
- Benefit #2: No livelock (generic lazy TM problem)
- Drawback: Serializes non-conflicting transactions ("collateral damage")
Identifying Non-conflicting Tx:
Step 1: Leverage Parallelism
- [Figure: TX1, TX2 and TX3 each send to the commit unit of a global memory partition]
Solution: Speculative Validation
- Key idea: split validation into two parts
  - Part 1: Check recently committed transactions
  - Part 2: Check concurrently committing transactions
KILO TM: Speculative Validation
- Memory subsystem is deeply pipelined and highly parallel
- [Figure: TX1 R(C), W(D); TX2 R(A), W(B); TX3 R(D), W(E) send their read-log and write-log to a commit unit at a global memory partition holding C, A, D. Commit unit stages: Log Transfer, Spec. Validation, Hazard Detection, Validation Wait, Finalize Outcome, Commit. Transactions wait in a validation queue.]
KILO TM: Speculative Validation
- [Figure: same commit unit pipeline with TX1 R(C), W(D); TX2 R(A), W(B); TX3 R(D), W(E). Hazard detection uses a Last Writer History: a lookup table mapping Addr to CID, with evicted entries falling back to a recency Bloom filter. The writes W(D), W(B), W(E) are recorded for TX1, TX2, TX3. Lookups A? and C? return Nil; lookup D? returns TX1, so TX3 STALLs in the validation queue behind TX1.]
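The hazard check on the slide can be approximated with a map from address to the commit ID of its latest pending writer. This is a rough sketch under stated assumptions: `LastWriterHistory`, `hazard`, `record` and `retire` are hypothetical names, the table here is unbounded (the slide's structure is a finite lookup table backed by a Bloom filter), and commit IDs are assumed to increase in arrival order.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

using Addr = char;                 // slide-style addresses: 'A', 'B', ...
using CommitId = std::uint32_t;

struct TxLog {
    CommitId cid;
    std::vector<Addr> reads;
    std::vector<Addr> writes;
};

class LastWriterHistory {
public:
    // Returns the youngest pending writer that this transaction's reads hit,
    // or nullopt ("Nil") when it can proceed without waiting.
    std::optional<CommitId> hazard(const TxLog& tx) const {
        std::optional<CommitId> youngest;
        for (Addr a : tx.reads) {
            auto it = last_writer_.find(a);
            if (it == last_writer_.end()) continue;
            if (!youngest || it->second > *youngest) youngest = it->second;
        }
        return youngest;
    }

    // Records this transaction as the latest writer of its write set.
    void record(const TxLog& tx) {
        assert(tx.cid >= newest_);  // assumption: logs arrive in commit-ID order
        newest_ = tx.cid;
        for (Addr a : tx.writes) last_writer_[a] = tx.cid;
    }

    // Drops the entries of a transaction whose outcome has been finalized.
    void retire(CommitId cid) {
        for (auto it = last_writer_.begin(); it != last_writer_.end();) {
            if (it->second == cid) it = last_writer_.erase(it);
            else ++it;
        }
    }

private:
    std::unordered_map<Addr, CommitId> last_writer_;
    CommitId newest_ = 0;
};
```

With the slide's three transactions, TX1 and TX2 see no hazard, while TX3's read of D hits TX1's pending write and has to wait until TX1 is retired.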
Log Storage
- Transaction logs are stored at the private memory of each thread
  - Located in DRAM, cached in L1 and L2 caches
- [Figure: a wavefront with threads T0 to T3 issuing LD and ST. T0's view of private memory has a Read-Log Ptr and a Write-Log Ptr over (address, value) entries. Entries shown at consecutive physical addresses: A 3, E 6, B 4, F 1, C 9, G 6, D 7, H 8, K 7, L 6, M 5, N 4.]
Log Transfer
- Entries heading to the same memory partition can be grouped into a larger packet
- [Figure: log entries sent as packets to commit units. Commit Unit 0 receives A 3; Commit Unit 1 receives B 4; Commit Unit 2 receives D 7 and C 9.]
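Grouping log entries by destination partition can be sketched as a simple bucketing pass. The address-to-partition mapping below (128 B interleaving) is an assumption chosen for illustration, not the mapping used in the work, and `partition_of` and `group_by_partition` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

struct LogEntry { std::uint32_t addr; std::int32_t value; };

// partition index -> packet of entries headed to that partition's commit unit
using Packets = std::map<unsigned, std::vector<LogEntry>>;

// Hypothetical mapping: consecutive 128 B blocks rotate across partitions.
inline unsigned partition_of(std::uint32_t addr, unsigned num_partitions) {
    assert(num_partitions > 0);
    return (addr / 128u) % num_partitions;
}

// Buckets a thread's log so each commit unit receives one larger packet
// instead of one message per entry. Entry order within a packet is kept.
inline Packets group_by_partition(const std::vector<LogEntry>& log,
                                  unsigned num_partitions) {
    Packets packets;
    for (const LogEntry& e : log)
        packets[partition_of(e.addr, num_partitions)].push_back(e);
    return packets;
}
```

For a log with entries at addresses 0, 128, 256 and 260 and three partitions, the result is three packets, the last of which carries two entries, similar to the slide where Commit Unit 2 receives both C and D.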
Distributed Commit / HW Org.
ABA Problem?
- Classic example: linked-list-based stack, top -> A -> B -> C -> Null
- Thread 0, pop():
    while (true) {
      t = top;                 // t = A
      next = t->Next;          // next = B
      // thread 2: pop A, pop B, push A   (now top -> A -> C -> Null)
      if (atomicCAS(&top, t, next) == t)
        break;                 // succeeds!
    }
- After the CAS: top -> B -> C -> Null, although B was already popped
ABA Problem?
- atomicCAS protects only a single word
  - Only part of the data structure
    while (true) {
      t = top;
      next = t->Next;
      if (atomicCAS(&top, t, next) == t)
        break;                 // succeeds!
    }
- Value-based conflict detection protects all relevant parts of the data structure
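The interleaving from the two ABA slides can be replayed on a single thread so the outcome is deterministic. This is a host-side sketch using `std::atomic` in place of `atomicCAS`; `Stack`, `Outcome` and `replay_aba` are hypothetical names, and the read-set check is a simplified stand-in for value-based validation over the two words thread 0 read (`top` and `t->next`).

```cpp
#include <atomic>
#include <cassert>

struct Node { char name; Node* next; };

struct Stack {
    std::atomic<Node*> top{nullptr};
    // Helpers used only to set up and replay the scenario on one thread.
    void push(Node* n) { n->next = top.load(); top.store(n); }
    Node* pop() {
        Node* t = top.load();
        assert(t != nullptr);
        top.store(t->next);
        return t;
    }
};

struct Outcome {
    bool read_set_still_valid;  // would value-based validation pass?
    bool cas_succeeded;         // did the single-word CAS pass?
    Node* top_after;            // node at the top after thread 0's CAS
};

inline Outcome replay_aba(Stack& s, Node& a, Node& b, Node& c) {
    s.push(&c); s.push(&b); s.push(&a);   // top -> A -> B -> C -> null

    // Thread 0, first half of pop(): read top and its successor.
    Node* t = s.top.load();               // A
    Node* next = t->next;                 // B

    // Thread 2 runs in between: pop A, pop B, push A.
    Node* popped_a = s.pop();
    s.pop();                              // B leaves the stack
    s.push(popped_a);                     // top -> A -> C -> null

    // Value-based check over everything thread 0 read: top and t->next.
    bool valid = (s.top.load() == t) && (t->next == next);

    // The CAS compares only the single word `top`, which is A again.
    Node* expected = t;
    bool swapped = s.top.compare_exchange_strong(expected, next);
    return {valid, swapped, s.top.load()};
}
```

In this replay the CAS succeeds and leaves `top` pointing at B, which is no longer part of the stack, while the check over both read values fails because A's successor changed from B to C.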
Evaluation Methodology
- GPGPU-Sim 3.0 (BSD license)
  - Detailed: IPC correlation of 0.93 vs GT200
  - KILO TM (timing-driven memory accesses)
- GPU TM applications
  - Hash Table (HT-H, HT-L)
  - Bank Account (ATM)
  - Cloth Physics (CL)
  - Barnes Hut (BH)
  - CudaCuts (CC)
  - Data Mining (AP)
GPGPU-Sim 3.0.x running SASS (decuda)
- 0.976 correlation on the subset of the CUDA SDK that decuda correctly disassembles
- Note: the rest of the data uses PTX instead of SASS (0.93 correlation)
- (We believe GPGPU-Sim is a reasonable proxy.)
Performance (vs. Serializing Tx)
Absolute Performance (IPC)
- TM on GPU performs well for applications with low contention.
- Poorly: memory divergence, low parallelism, high conflict rate (tackle through algorithm design/tuning?)
- CPU vs GPU?
  - CC: FG-Lock version 400X faster than its CPU version
  - BH: FG-Lock version 2.5X faster than its CPU version
Performance (Exec. Time)
- Captures 59% of FG Lock performance
- 128X faster than serialized Tx exec.
KILO TM Scaling
Abort Commit Ratio
- Increasing the number of TXs => increased probability of conflict
- Two possible solutions (future work):
  - Solution 1: Application performance tuning (easier with TM vs. FG Lock)
  - Solution 2: Transaction scheduling
Thread Cycle Breakdown
- Status of a thread at each cycle
- Categories:
  - TC: In a warp stalled by concurrency control
  - TO: In a warp committing its transactions
  - TW: Have passed commit, and waiting for other threads in the warp to pass
  - TA: Executing an eventually aborted transaction
  - TU: Executing an eventually committed transaction (useful work)
  - AT: Acquiring a lock or doing an atomic operation
  - BA: Waiting at a barrier
  - NL: Doing non-transactional (normal) work
Thread Cycle Breakdown
- [Figure: thread cycle breakdown per benchmark (HT-H, HT-L, ATM, CL, BH, CC, AP) for the configurations KL, FGL, KL-UC and IDEAL]
Core Cycle Breakdown
- Action performed by a core at each cycle
- Categories:
  - EXEC: Issuing a warp for execution
  - STALL: Stalled by a downstream warp
  - SCRB: All warps blocked by the scoreboard, due to data hazards, concurrency control, pending commits (or any combination thereof)
  - IDLE: None of the warps are ready in the instruction buffer.
Core Cycle Breakdown
- [Figure: core cycle breakdown per benchmark for the configurations KL, FGL, KL-UC and IDEAL]
Read-Write Buffer Usage
# In-Flight Buffers
Implementation Complexity
- Logs in private memory @ L1 data cache
- Commit unit
  - 5 kB Last Writer History unit
  - 19 kB Transaction Status
  - 32 kB Read-Set and Write-Set Buffer
- CACTI 5.3 @ 40 nm
  - 0.40 mm^2 x 6 memory partitions
  - 0.5% of 520 mm^2
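The area overhead follows from the two CACTI figures on the slide:

```latex
\[
0.40\ \text{mm}^2 \times 6 = 2.4\ \text{mm}^2,
\qquad
\frac{2.4\ \text{mm}^2}{520\ \text{mm}^2} \approx 0.0046 \approx 0.5\%.
\]
```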
Summary
- KILO TM
  - 1000s of concurrent transactions
  - Value-based conflict detection
  - Speculative validation for commit parallelism
  - 59% of fine-grained locking performance
  - 0.5% area overhead
Backup Slides
Logical Stage Organization
Execution Time Breakdown