Hardware Transactional Memory for GPU Architectures*
Wilson W. L. Fung, Inderpeet Singh, Andrew Brownsword, Tor M. Aamodt
University of British Columbia
*In Proc. 2011 ACM/IEEE Int'l Symp. Microarchitecture (MICRO-44)

Motivation
- Lifetime of GPU application development
  [Figure: functionality and performance over time, for fine-grained locking and for transactional memory (?)]
- E.g. N-Body with 5M bodies
  - CUDA SDK: O(n^2), 1640 s (barrier)
  - Barnes Hut: O(n log n), 5.2 s (locks)

Talk Outline
- What we mean by "GPU" in this work.
- Data Synchronization on GPUs.
- What is Transactional Memory (TM)?
- TM is compatible with OpenCL ... but is TM compatible with GPU hardware?
- KILO TM: A Hardware TM for GPUs.
- Results

What is a GPU (in this work)?
- GPU is an NVIDIA/AMD-like compute accelerator
  - SIMD HW + aggressive memory subsystem => high compute throughput and efficiency
- Non-graphics API: OpenCL, DirectCompute, CUDA
  - Programming model: hierarchy of scalar threads
  - Today: limited communication & synchronization
  [Figure: thread hierarchy: kernel, work groups / thread blocks, wavefront / warp, scalar thread; barrier; shared (local) memory; global memory]

Baseline GPU Architecture
[Figure: SIMT cores connected through an interconnection network to memory partitions. Each memory partition: atomic op. unit, last-level cache bank, off-chip DRAM channel. Each SIMT core: SIMT front end (fetch, decode, schedule, branch; "Done (Warp ID)" signal), SIMD datapath, and memory subsystem (Tex $, Const $, SMem, non-coherent L1 D-cache) attached to the interconnection network.]

Stack-Based SIMD Reconvergence ("SIMT") (Levinthal SIGGRAPH'84, Fung MICRO'07)
[Figure: control-flow graph of a 4-thread warp with per-block active masks A/1111, B/1111, C/1001, D/0110, E/1111, G/1111 (plus block F); a reconvergence stack with columns Reconv. PC, Next PC and Active Mask, and a TOS pointer; and the common PC over time for threads 1-4: A, B, C, D, E, G.]
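The stack mechanism named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the hardware design: the `CFG` table, the `Entry` type and `run_warp` are hypothetical names, block F is omitted because no thread executes it, and the masks are the ones listed on the slide (B diverges into C/1001 and D/0110, reconverging at E).

```python
from typing import NamedTuple, Optional

class Entry(NamedTuple):
    rpc: Optional[str]  # reconvergence PC for this entry (None for the base entry)
    pc: str             # next PC to execute
    mask: int           # active mask, one bit per thread

# Hypothetical CFG: block -> (list of (target, mask of threads taking it), reconvergence PC)
CFG = {
    "A": ([("B", 0b1111)], None),
    "B": ([("C", 0b1001), ("D", 0b0110)], "E"),
    "C": ([("E", 0b1111)], None),
    "D": ([("E", 0b1111)], None),
    "E": ([("G", 0b1111)], None),
    "G": ([], None),
}

def run_warp(start="A", mask=0b1111):
    """Execute one warp block by block and return the (PC, active mask) trace."""
    stack = [Entry(None, start, mask)]
    trace = []
    while stack:
        rpc, pc, m = stack.pop()               # always execute the top-of-stack entry
        trace.append((pc, format(m, "04b")))
        successors, reconv = CFG[pc]
        targets = [(t, m & tm) for t, tm in successors if m & tm]
        if not targets:
            continue                           # end of program for these threads
        if len(targets) == 1:
            target, tmask = targets[0]
            if target != rpc:                  # not yet at the reconvergence point
                stack.append(Entry(rpc, target, tmask))
            # otherwise the entry stays popped and the entry below resumes at rpc
        else:
            stack.append(Entry(rpc, reconv, m))    # where the full warp resumes
            for target, tmask in reversed(targets):
                stack.append(Entry(reconv, target, tmask))
    return trace

print(run_warp())
# [('A', '1111'), ('B', '1111'), ('C', '1001'), ('D', '0110'), ('E', '1111'), ('G', '1111')]
```

The two divergent paths run one after the other with partial masks, and the full mask is restored once both reach E.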

Data Synchronizations on GPUs
- Motivation
  - Solve wider range of problems on GPU
  - Data race => data synchronization
- Current solution: atomic read-modify-write (32-bit/64-bit). Best solution?
- Why Transactional Memory?
  - E.g. N-Body with 5M bodies (traditional sync, not TM)
    - CUDA SDK: O(n^2), 1640 s (barrier)
    - Barnes Hut: O(n log n), 5.2 s (atomics, harder to get right)
  - Easier to write/debug efficient algorithms
  - Practical efficiency: want efficiency of GPU with reasonable (not superhuman) effort and time.

Data Synchronizations on GPUs
- Deadlock-free code with fine-grained locks and 10,000+ hardware-scheduled threads is hard
  [Figure: # possible global lock states vs. # locks x # sharing threads. Which of these states are deadlocks?!]
- Other general problems with lock-based synchronization
  - Implicit relationship between locks and objects being protected
  - Code is not composable

Data Synchronization Problems Specific to GPUs
- Interaction between locks and SIMT control flow can cause deadlocks

  Spin lock:
    A: while(atomicCAS(lock,0,1)==1);
    B: // Critical Section ...
    C: lock = 0;

  Rewritten with a done flag (atomicCAS returns the old value, so the lock is acquired when it returns 0):
    A: done = 0;
    B: while(!done){
    C:   if(atomicCAS(lock,0,1)==0){
    D:     // Critical Section ...
    E:     lock = 0;
    F:     done = 1;
    G:   }
    H: }
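Why the plain spin lock can hang a warp while the done-flag loop cannot is easier to see in a toy lockstep model. The Python sketch below is an illustration under stated assumptions, not a model of any specific GPU: it assumes the thread that wins the lock is parked at the loop's reconvergence point until every thread of its warp has left the loop, and the function names are hypothetical.

```python
def spin_lock_warp(num_threads=4, max_iters=1000):
    """Model of `while(atomicCAS(lock,0,1)==1);` executed by one warp."""
    lock = 0
    parked = set()                   # threads waiting at the reconvergence point
    spinning = set(range(num_threads))
    for _ in range(max_iters):
        for t in sorted(spinning):
            if lock == 0:            # atomicCAS succeeded (old value was 0)
                lock = 1
                parked.add(t)
        spinning -= parked
        if not spinning:
            return "completed"
        # The winner cannot reach `lock = 0` while its warp-mates still spin.
    return "deadlock"

def done_flag_warp(num_threads=4, max_iters=1000):
    """Model of the done-flag loop: the critical section sits inside the loop body."""
    lock = 0
    done = [False] * num_threads
    order = []
    for _ in range(max_iters):
        active = [t for t in range(num_threads) if not done[t]]
        if not active:
            return order
        winners = []
        for t in active:
            if lock == 0:            # atomicCAS succeeded
                lock = 1
                winners.append(t)
        for t in winners:            # divergent if-body runs before the loop branch
            order.append(t)          # critical section
            lock = 0
            done[t] = True
    return None

print(spin_lock_warp(4))   # deadlock
print(done_flag_warp(4))   # [0, 1, 2, 3]
```

In the second form the lock holder releases the lock before the warp reconverges at the loop branch, so one thread makes progress per iteration.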

Transactional Memory
- Program specifies atomic code blocks called transactions [Herlihy'93]

  Lock version (potential deadlock!):
    Lock(X[a]); Lock(X[b]); Lock(X[c]);
    X[c] = X[a]+X[b];
    Unlock(X[c]); Unlock(X[b]); Unlock(X[a]);

  TM version:
    atomic {
      X[c] = X[a]+X[b];
    }

Transactional Memory
Programmers' view:
- Non-conflicting transactions may run in parallel
  [Figure: TX1 and TX2 both commit to memory (locations A, B, C, D); over time the outcome matches TX1 then TX2, OR TX2 then TX1.]
- Conflicting transactions automatically serialized
  [Figure: TX1 commits to memory (locations A, B, C, D); TX2 aborts, then commits.]

Transactional Memory
- Each transaction has 3 phases
  - Execution
    - Track all memory accesses (Read-Set and Write-Set)
  - Validation
    - Detect any conflicting accesses between transactions
    - Resolve conflict if needed (abort/stall)
  - Commit
    - Update global memory

Transactional Memory on OpenCL
- A natural extension to the OpenCL programming model
  - Program can launch many more threads than the hardware can execute concurrently
  - GPU-TM? Current threads running transactions do not need to wait for future unscheduled threads
  [Figure: threads scheduled onto the GPU HW]

Are TM and GPUs Incompatible?
The problem with GPUs (from TM perspective):
- 1000s of concurrent threads
- Inter-thread spatial locality common
- No cache coherence
- No private cache for each thread (Buffering?)
- Tx abort => control flow divergence

Hardware TM for GPUs
Challenge: Conflict Detection
- Private data cache + coherence: TX1 R(A), W(C); TX2 R(D); TX3 R(A); TX4 R(C), W(B). "Inv C" on the bus reaches the transaction that read C: conflict!
  - GPU: no coherence. Each scalar thread needs own cache?
- Signature: 1024-bit signature/thread
  - 3.8 MB / 30k threads. Scalable on GPUs?
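The 3.8 MB figure follows directly from the two numbers on the slide. A small check in Python, assuming "30k" means 30,000 threads (the constant and function names are illustrative only):

```python
SIGNATURE_BITS = 1024
THREADS = 30_000  # assumption: "30k threads" read as 30,000

def signature_storage_bytes(bits_per_thread: int, threads: int) -> int:
    """Total bytes needed if every thread keeps its own signature."""
    return (bits_per_thread // 8) * threads

total = signature_storage_bytes(SIGNATURE_BITS, THREADS)
print(f"{total / 1e6:.2f} MB")  # 3.84 MB
```

Each signature is 128 bytes, so the cost grows linearly with the thread count and reaches megabytes at GPU scale.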

Hardware TM for GPUs
Challenge: Transaction Rollback
- CPU core: 10s of registers; register file checkpoint @ TX entry, used @ abort
- GPU core (SM): many warps, 32k registers
  - Checkpoint register file? 2 MB total on-chip storage

Hardware TM for GPUs
Challenge: Access Granularity and Write Buffer
- CPU core: L1 data cache (32 kB) shared by 1-2 threads; TX commits from the cache to global memory
- GPU core (SM): L1 data cache shared by the warps of 1024-1536 threads
- Fermi's L1 data cache (48 kB) = 384 x 128 B lines
- Problem: 384 lines / 1536 threads < 1 line per thread!
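The line count and the per-thread share quoted on this slide can be reproduced with plain arithmetic. The Python below only restates the slide's numbers; the variable names are illustrative.

```python
CACHE_BYTES = 48 * 1024      # Fermi L1 data cache, 48 kB
LINE_BYTES = 128             # cache line size
THREADS_PER_CORE = 1536      # upper end of the 1024-1536 range

lines = CACHE_BYTES // LINE_BYTES
lines_per_thread = lines / THREADS_PER_CORE
bytes_per_thread = CACHE_BYTES / THREADS_PER_CORE

print(lines, lines_per_thread, bytes_per_thread)  # 384 0.25 32.0
```

A quarter of a line (32 bytes) per thread is too little to buffer a transaction's writes at cache-line granularity.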

Hardware TM on GPUs
Challenge: SIMT Hardware
- On GPUs, scalar threads in a warp/wavefront execute in lockstep
  [Figure: a warp with 8 scalar threads executes the code below; some threads end up committed, others aborted. Reconvergence?]
    ...
    TxBegin
    LD r2,[B]
    ADD r2,2
    ST r2,[A]
    TxCommit
    ...

Goal
- We take it as a given that most programmers trying lock-based programming on a GPU will give up before they manage to get their application working.
- Hence, our goal was to find the most efficient approach to implement TM on GPU.

KILO TM
- Supports 1000s of concurrent transactions
- Transaction-aware SIMT stack
- No cache coherence protocol dependency
- Word-level conflict detection
- Captures 59% of FG Lock performance
  - 128X faster than serialized Tx exec.

KILO TM: Design Highlights
- Value-Based Conflict Detection
  - Self-Validation + Abort: Simple Communication
  - No Cache Coherence Dependence
- Speculative Validation
  - Increase Commit Parallelism

High Level GPU Architecture + KILO TM Implementation Overview

KILO TM: SIMT Core Changes
- SW Register Checkpoint
  - Observation: Most overwritten registers not used
  - Compiler analysis can identify what to checkpoint
- Transaction Abort
  - ~ Do-While Loop
  - Extend SIMT Stack with special entries to track aborted transactions in each warp
  [Figure: the transaction below, annotated "Overwritten" and with an "Abort" edge.]
    TxBegin
    LD r2,[B]
    ADD r2,2
    ST r2,[A]
    TxCommit

Transaction-Aware SIMT Stack
    A: t = tid.x;
       if (…) {
    B:   tx_begin;
    C:   x[t%10] = y[t] + 1;
    D:   if (s[t])
    E:     y[t] = 0;
    F:   tx_commit;       // implicit loop when abort
    G:   z = y[t];
       }
    H: w = y[t+1];
[Figure: SIMT stack snapshots for an 8-thread warp, columns Type, PC, RPC, Active Mask; the threads inside the if have mask 1111 0011.
 - @ tx_begin: entries N (PC H), N (PC B, RPC H, mask 1111 0011), R (PC C) and T (PC C) at TOS; the T entry gets a copy of the active mask.
 - Branch divergence within Tx: T entry at PC F; N entry (PC E, RPC F, mask 0001 0011) at TOS.
 - @ tx_commit, thread 6 & 7 failed validation: R entry (PC C) holds mask 0000 0011; the T entry's mask is cleared.
 - @ tx_commit, restart Tx for thread 6 & 7: active mask + PC copied into a T entry (PC C, mask 0000 0011) at TOS.
 - @ tx_commit, all threads with Tx committed: N entry (PC G, RPC H, mask 1111 0011) at TOS.]
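The retry behaviour of the R and T entries can be approximated with two bit masks. The Python sketch below is a simplification under stated assumptions: it ignores PCs and nested divergence, takes the set of threads that fail validation in each round as a hypothetical input (`failing_masks`), and uses the slide's example of threads 6 and 7 (the two rightmost mask bits) failing once.

```python
def run_transaction_warp(active_mask, failing_masks):
    """Return the active mask used for each execution of the transaction body.

    failing_masks[i] is the (hypothetical) set of threads that fail
    validation at tx_commit in round i; later rounds all succeed.
    """
    retry_mask = 0          # R entry: threads that must re-execute
    tx_mask = active_mask   # T entry: copy of the active mask at tx_begin
    rounds = []
    attempt = 0
    while tx_mask:
        rounds.append(tx_mask)
        failed = failing_masks[attempt] if attempt < len(failing_masks) else 0
        retry_mask |= failed & tx_mask   # aborted threads are recorded in R
        tx_mask = retry_mask             # restart: R's mask becomes the new T mask
        retry_mask = 0
        attempt += 1
    return rounds

rounds = run_transaction_warp(0b11110011, [0b00000011])
print([format(m, "08b") for m in rounds])  # ['11110011', '00000011']
```

Committed threads simply wait with their mask bits cleared while the aborted ones loop, which is the "implicit loop when abort" behaviour noted beside the code.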

KILO TM: Value-Based Conflict Detection
Global memory: A=1, B=0 (B=2 once TX1 commits)

  TX1: atomic{B=A+1}
    TxBegin
    LD r1,[A]
    ADD r1,1
    ST r1,[B]
    TxCommit
    Private memory: Read-Log A=1, Write-Log B=2

  TX2: atomic{A=B+2}
    TxBegin
    LD r2,[B]
    ADD r2,2
    ST r2,[A]
    TxCommit
    Private memory: Read-Log B=0, Write-Log A=2

- Self-Validation + Abort:
  - Only detects existence of conflict (not identity) => No Tx to Tx Msg, Simple Communication
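Value-based conflict detection can be illustrated with a small software model of the two transactions on this slide. This Python sketch is not the KILO TM hardware: the `Transaction` class and its method names are hypothetical, global memory is a dictionary, and commits are applied one at a time.

```python
class Transaction:
    """Buffers reads and writes privately; validates by re-reading values."""

    def __init__(self, memory):
        self.memory = memory
        self.read_log = {}    # address -> value observed during execution
        self.write_log = {}   # address -> value to publish at commit

    def load(self, addr):
        if addr in self.write_log:           # read own buffered write
            return self.write_log[addr]
        if addr not in self.read_log:
            self.read_log[addr] = self.memory[addr]
        return self.read_log[addr]

    def store(self, addr, value):
        self.write_log[addr] = value

    def try_commit(self):
        # Self-validation: every logged value must still be in global memory.
        if any(self.memory[a] != v for a, v in self.read_log.items()):
            return False                     # abort; the caller re-executes
        self.memory.update(self.write_log)
        return True

def tx1_body(tx):   # atomic{B=A+1}
    tx.store("B", tx.load("A") + 1)

def tx2_body(tx):   # atomic{A=B+2}
    tx.store("A", tx.load("B") + 2)

memory = {"A": 1, "B": 0}
tx1, tx2 = Transaction(memory), Transaction(memory)
tx1_body(tx1)
tx2_body(tx2)                      # both execute against the initial values

first = tx1.try_commit()           # True: A is still 1, B becomes 2
second = tx2.try_commit()          # False: read-log has B=0, memory has B=2
retry = Transaction(memory)
tx2_body(retry)                    # re-execution reads B=2
third = retry.try_commit()         # True: A becomes 4
print(first, second, third, memory)  # True False True {'A': 4, 'B': 2}
```

TX2 learns only that some logged value changed, not which transaction changed it, so no transaction-to-transaction messages are needed.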

Parallel Validation? Data Race!?!
- Init: A=1, B=0 in global memory
- TX1: atomic{B=A+1}; private memory: Read-Log A=1, Write-Log B=2
- TX2: atomic{A=B+2}; private memory: Read-Log B=0, Write-Log A=2
- Valid outcomes: Tx1 then Tx2: B=2, A=4 OR Tx2 then Tx1: A=2, B=3
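The race can be reproduced by letting both transactions validate before either one writes. The Python below is a sketch using plain dictionaries for the logs shown on the slide; `validate` is a hypothetical helper.

```python
def validate(tx, memory):
    """Value-based validation: every logged read must match global memory."""
    return all(memory[a] == v for a, v in tx["reads"].items())

memory = {"A": 1, "B": 0}
tx1 = {"reads": {"A": 1}, "writes": {"B": 2}}   # atomic{B=A+1}
tx2 = {"reads": {"B": 0}, "writes": {"A": 2}}   # atomic{A=B+2}

# Both validations run before either commit touches memory, so both pass.
both_pass = validate(tx1, memory) and validate(tx2, memory)
if both_pass:
    memory.update(tx1["writes"])
    memory.update(tx2["writes"])

serial_outcomes = [{"A": 4, "B": 2}, {"A": 2, "B": 3}]
print(memory, memory in serial_outcomes)  # {'A': 2, 'B': 2} False
```

The final state A=2, B=2 matches neither serial order, so validation and commit of conflicting transactions cannot simply overlap.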

Serialize Validation?
[Figure: timeline at the commit unit in front of global memory: TX1 validates and commits (V+C) while TX2 stalls, then TX2 validates and commits.]
- Benefit #1: No Data Race
- Benefit #2: No Live Lock (generic lazy TM prob.)
- Drawback: Serializes Non-Conflicting Transactions ("collateral damage")

Identifying Non-conflicting Tx: Step 1: Leverage Parallelism
[Figure: TX1, TX2 and TX3 sent to the commit units at the global memory partitions.]

Solution: Speculative Validation
- Key Idea: Split Validation into two parts
  - Part 1: Check recently committed transactions
  - Part 2: Check concurrently committing transactions

KILO TM: Speculative Validation
- Memory subsystem is deeply pipelined and highly parallel
  [Figure: TX1 R(C), W(D); TX2 R(A), W(B); TX3 R(D), W(E) send their read-logs and write-logs to a commit unit at a global memory partition holding C, A, D. Commit unit stages: Log Transfer, Spec. Validation, Hazard Detection, Validation Wait (validation queue), Finalize Outcome, Commit.]

KILO TM: Speculative Validation
[Figure: TX1 R(C), W(D); TX2 R(A), W(B); TX3 R(D), W(E) pass through the commit unit stages Log Transfer, Spec. Validation, Hazard Detection, Validation Wait, Finalize Outcome and Commit at a global memory partition holding C, A, D. Hazard detection uses a Last Writer History: a lookup table of (Addr, CID) entries for D, B and E, plus a recency bloom filter fed by evicted entries. Reads are looked up (D? A? C?); TX3's read of D matches W(D), so TX3 is marked STALL in the validation queue behind TX2 and TX1.]
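The hazard-detection step can be sketched as a lookup of each incoming read against the last in-flight writer of that address. This Python model is an assumption-laden simplification: it uses an exact dictionary where the slide names a lookup table plus a bloom filter, commit IDs are simply arrival order, and `detect_hazards` is a hypothetical name.

```python
def detect_hazards(transactions):
    """Map each commit ID to the earlier in-flight commit IDs it must wait for.

    transactions: list of (read_addresses, write_addresses) in arrival order.
    """
    last_writer = {}     # address -> commit ID of the most recent in-flight writer
    must_wait_for = {}
    for cid, (reads, writes) in enumerate(transactions, start=1):
        must_wait_for[cid] = sorted(
            {last_writer[a] for a in reads if a in last_writer}
        )
        for addr in writes:
            last_writer[addr] = cid
    return must_wait_for

txs = [
    ({"C"}, {"D"}),   # TX1: R(C), W(D)
    ({"A"}, {"B"}),   # TX2: R(A), W(B)
    ({"D"}, {"E"}),   # TX3: R(D), W(E)
]
print(detect_hazards(txs))  # {1: [], 2: [], 3: [1]}
```

Only TX3 has to wait for TX1's outcome; TX1 and TX2 touch disjoint addresses and can be validated in parallel.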

Log Storage
- Transaction logs are stored at the private memory of each thread
  - Located in DRAM, cached in L1 and L2 caches
  [Figure: a wavefront of threads T0-T3 issuing LD and ST to addresses A-H and K-N; T0's view of private memory, with a Read-Log Ptr and a Write-Log Ptr over consecutive physical addresses holding (address, value) entries: A 3, E 6, B 4, F 1, C 9, G 6, D 7, H 8, K 7, L 6, M 5, N 4.]

Log Transfer
- Entries heading to same memory partition can be grouped into a larger packet
  [Figure: log entries (A 3, E 6, B 4, F 1, C 9, G 6, D 7, H 8, K 7, L 6, M 5, N 4) behind the Read-Log Ptr and Write-Log Ptr are packed into packets to the commit units of each partition: Commit Unit 0: A 3; Commit Unit 1: B 4; Commit Unit 2: D 7, C 9.]
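Grouping log entries by destination partition amounts to a bucketed pass over the log. In the Python sketch below the address-to-partition map `PARTITION_OF` is invented to match the slide's example packets, and the four-entry log is an assumed subset of the entries in the figure.

```python
from collections import defaultdict

# Hypothetical address-to-partition map chosen to reproduce the slide's packets.
PARTITION_OF = {"A": 0, "B": 1, "C": 2, "D": 2}

def group_by_partition(log_entries, partition_of):
    """Bundle (address, value) log entries into one packet per memory partition."""
    packets = defaultdict(list)
    for addr, value in log_entries:
        packets[partition_of[addr]].append((addr, value))
    return dict(packets)

log = [("A", 3), ("B", 4), ("C", 9), ("D", 7)]   # assumed log contents
packets = group_by_partition(log, PARTITION_OF)
print(packets)  # {0: [('A', 3)], 1: [('B', 4)], 2: [('C', 9), ('D', 7)]}
```

Entries C and D share a partition and travel in one larger packet instead of two small ones.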

Distributed Commit / HW Org.

ABA Problem?
- Classic example: linked-list-based stack
    top -> A -> B -> C -> Null
- Thread 0, pop():
    while (true) {
      t = top;
      next = t->Next;
      // thread 2: pop A, pop B, push A  (stack is now top -> A -> C -> Null)
      if (atomicCAS(&top, t, next) == t) break; // succeeds!
    }
  [Figure: after the CAS, top -> B -> C -> Null, although B has already been popped.]

ABA Problem?
- atomicCAS protects only a single word
  - Only part of the data structure
    top -> A -> B -> C -> Null

    while (true) {
      t = top;
      next = t->Next;
      if (atomicCAS(&top, t, next) == t) break; // succeeds!
    }
- Value-based conflict detection protects all relevant parts of the data structure
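The difference between checking one word and checking every value that was read can be replayed step by step. The Python sketch below models the stack with plain objects; `Node`, `cas_top` and the interleaving are illustrative assumptions that follow the scenario on the slides.

```python
class Node:
    def __init__(self, name, next=None):
        self.name = name
        self.next = next

def cas_top(stack, expected, new):
    """Model of atomicCAS on the top pointer: returns the old value."""
    old = stack["top"]
    if old is expected:
        stack["top"] = new
    return old

c = Node("C")
b = Node("B", c)
a = Node("A", b)
stack = {"top": a}                 # top -> A -> B -> C -> Null

# Thread 0 starts pop(): reads top and top->Next.
t = stack["top"]                   # A
nxt = t.next                       # B

# Thread 2 interleaves: pop A, pop B, push A.
stack["top"] = c                   # after the two pops
a.next = c
stack["top"] = a                   # push A: top -> A -> C -> Null

# Value-based validation re-checks every value thread 0 read.
tm_valid = stack["top"] is t and t.next is nxt     # False: A.next is now C

# The single-word CAS only compares top, which is A again.
cas_succeeded = cas_top(stack, t, nxt) is t        # True
print(tm_valid, cas_succeeded, stack["top"].name)  # False True B
```

The CAS installs B, a node that is no longer part of the stack, while validation over both reads would have aborted the pop.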

Evaluation Methodology
- GPGPU-Sim 3.0 (BSD license)
  - Detailed: IPC correlation of 0.93 vs GT200
  - KILO TM (timing-driven memory accesses)
- GPU TM Applications
  - Hash Table (HT-H, HT-L)
  - Bank Account (ATM)
  - Cloth Physics (CL)
  - Barnes Hut (BH)
  - CudaCuts (CC)
  - Data Mining (AP)

GPGPU-Sim 3.0.x running SASS (decuda)
- 0.976 correlation on the subset of CUDA SDK that decuda correctly disassembles
- Note: rest of data uses PTX instead of SASS (0.93 correlation)
- (We believe GPGPU-Sim is a reasonable proxy.)

Performance (vs. Serializing Tx)

Absolute Performance (IPC)
- TM on GPU performs well for applications with low contention.
- Poorly: memory divergence, low parallelism, high conflict rate (tackle through alg. design/tuning?)
- CPU vs GPU?
  - CC: FG-Lock version 400X faster than its CPU version
  - BH: FG-Lock version 2.5X faster than its CPU version

Performance (Exec. Time)
- Captures 59% of FG Lock performance
- 128X faster than serialized Tx exec.

KILO TM Scaling

Abort Commit Ratio
- Increasing number of TXs => increase probability of conflict
- Two possible solutions (future work):
  - Solution 1: Application performance tuning (easier with TM vs. FG Lock)
  - Solution 2: Transaction schedule

Thread Cycle Breakdown
- Status of a thread at each cycle
- Categories:
  - TC: In a warp stalled by concurrency control
  - TO: In a warp committing its transactions
  - TW: Have passed commit, and waiting for other threads in the warp to pass
  - TA: Executing an eventually aborted transaction
  - TU: Executing an eventually committed transaction (useful work)
  - AT: Acquiring a lock or doing an atomic operation
  - BA: Waiting at a barrier
  - NL: Doing non-transactional (normal) work

Thread Cycle Breakdown
[Figure: thread cycle breakdown for HT-H, HT-L, ATM, CL, BH, CC and AP, with bars for the KL, FGL, KL-UC and IDEAL configurations.]

Core Cycle Breakdown
- Action performed by a core at each cycle
- Categories:
  - EXEC: Issuing a warp for execution
  - STALL: Stalled by a downstream warp
  - SCRB: All warps blocked by the scoreboard, due to data hazards, concurrency control, pending commits (or any combination thereof)
  - IDLE: None of the warps are ready in the instruction buffer.

Core Cycle Breakdown
[Figure: core cycle breakdown per benchmark, with bars for the KL, FGL, KL-UC and IDEAL configurations.]

Read-Write Buffer Usage

# In-Flight Buffers

Implementation Complexity
- Logs in Private Memory @ L1 Data Cache
- Commit Unit
  - 5 kB Last Writer History Unit
  - 19 kB Transaction Status
  - 32 kB Read-Set and Write-Set Buffer
- CACTI 5.3 @ 40 nm
  - 0.40 mm^2 x 6 Memory Partition
  - 0.5% of 520 mm^2
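The 0.5% figure can be checked from the numbers on this slide. The Python below only restates that arithmetic; the constant names are illustrative, and the per-unit storage total is simply the sum of the three listed structures.

```python
AREA_PER_COMMIT_UNIT_MM2 = 0.40   # CACTI 5.3 estimate at 40 nm
MEMORY_PARTITIONS = 6             # one commit unit per memory partition
DIE_AREA_MM2 = 520

total_area = AREA_PER_COMMIT_UNIT_MM2 * MEMORY_PARTITIONS
overhead_pct = 100 * total_area / DIE_AREA_MM2
storage_per_unit_kb = 5 + 19 + 32  # last writer history + status + read/write-set buffer

print(f"{total_area:.2f} mm^2, {overhead_pct:.2f}%, {storage_per_unit_kb} kB")
# 2.40 mm^2, 0.46%, 56 kB
```

The six commit units add about 2.4 mm^2, which rounds to the quoted 0.5% of a 520 mm^2 die.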

Summary
- KILO TM
  - 1000s of Concurrent Transactions
  - Value-Based Conflict Detection
  - Speculative Validation for Commit Parallelism
- 59% Fine-Grained Locking Performance
- 0.5% Area Overhead

Backup Slides

Logical Stage Organization

Execution Time Breakdown