EECE 571T Compute Accelerator Architectures Spring 2019

EECE 571T: Compute Accelerator Architectures, Spring 2019
Slide Set #5: GPU Architecture Research
Instructor: Dr. Tor M. Aamodt (aamodt@ece.ubc.ca)
[Title slide image: NVIDIA Tegra X1 die photo]

Learning Objectives
• After this lecture you should be familiar with some research on GPU architecture beyond that covered in the assigned readings.

Reading (for quiz on Feb 28)
• Rhu et al., “vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design”, MICRO 2016.
• Arunkumar et al., “MCM-GPU: Multi-Chip-Module GPUs for Continued Performance Scalability”, ISCA 2017.

Research Direction 1: Mitigating SIMT Control Divergence

Recall: SIMT Hardware Stack
[Figure: per-warp reconvergence stack with Next PC, Reconv. PC, and Active Mask fields for a 4-thread warp executing a control-flow graph A through G; diverged paths (e.g., C with mask 1001 and D with mask 0110) are pushed on the stack and executed serially until threads reconverge at E and G.]
Potential for significant loss of throughput when control flow diverges!
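As a concrete illustration (a minimal sketch of the kind of kernel behind the slide's control-flow graph; the kernel name and arrays are my own, not from the slides), the CUDA code below diverges inside a warp and relies on the SIMT stack to serialize the two paths until they reconverge:

__global__ void divergent_branch(const int *A, int *B) {   // hypothetical kernel
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int k = A[tid];          // all 32 lanes active
    if (k > 10)              // lanes in the same warp may disagree here
        k = 10;              // taken path runs under a partial active mask
    else
        k = 0;               // other path runs under the complementary mask
    B[tid] = k;              // reconvergence point: full mask restored
}

With one warp split across the two paths, each path issues with some lanes idle, which is the throughput loss the slide warns about.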

Performance vs. Warp Size
• 165 applications; IPC with warp size 4 normalized to warp size 32
[Chart: applications split into three groups: convergent applications (lose performance at warp size 4), warp-size insensitive applications, and divergent applications (gain performance at warp size 4).]
Rogers et al., A Variable Warp-Size Architecture, ISCA 2015

Dynamic Warp Formation (Fung, MICRO 2007)
[Figure: three static warps diverge at a branch; while some warps wait on reissue/memory latency, threads from different warps that are on the same path are packed into new dynamic warps, raising SIMD efficiency to 88% in the example.]
How to pick threads to pack into warps?

Dynamic Warp Formation: Hardware Implementation
[Figure: on a branch (e.g., A: BEQ R2, B), warp update registers for the taken and not-taken paths carry thread IDs, request bits, and target PCs into a PC-Warp LUT; threads with the same next PC are merged into entries of a warp pool (subject to no lane conflicts), and a warp allocator plus issue logic feed the thread scheduler.]
Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt, Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow

DWF Pathologies: Starvation
• Majority scheduling
– Best performing: prioritize the largest group of threads with the same PC
• Starvation
– LOWER SIMD efficiency!
– Tricky: variable memory latency
• Other warp scheduler?
Example: B: if (K > 10) C: K = 10; else D: K = 0; E: B = C[tid.x] + K;
[Figure: under majority scheduling, a minority group of threads waiting at E can starve for 1000s of cycles while the majority keeps issuing on paths C and D.]
Wilson Fung, Tor Aamodt, Thread Block Compaction

DWF Pathologies: Extra Uncoalesced Accesses
• Coalesced memory access = memory SIMD
– A 1st-order CUDA programmer optimization
• Not preserved by DWF
Example: E: B = C[tid.x] + K; (see the sketch below)
[Figure: without DWF, the static warps touch 3 cache lines (0x100, 0x140, 0x180) for this load; with DWF's shuffled thread groupings the same load takes 9 accesses. The L1 cache absorbs the redundant memory traffic but suffers L1$ port conflicts.]
Wilson Fung, Tor Aamodt, Thread Block Compaction
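To make the coalescing contrast concrete, here is a minimal CUDA sketch (my own example; array and kernel names are hypothetical). The first kernel is the coalesced pattern the programmer wrote; the second mimics what happens when threads are regrouped so that the lanes of one warp no longer touch consecutive addresses:

// Coalesced: consecutive lanes load consecutive words, so a 32-thread
// warp touches only a few 128 B cache lines per load.
__global__ void coalesced_add(const float *C, const float *K, float *B) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    B[tid] = C[tid] + K[tid];
}

// Uncoalesced: an index permutation scatters one warp's lanes across many
// cache lines, similar to the effect of DWF regrouping threads.
__global__ void uncoalesced_add(const float *C, const float *K, float *B,
                                const int *perm) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int t = perm[tid];                 // arbitrary thread-to-data shuffle
    B[t] = C[t] + K[t];
}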

DWF Pathologies: Implicit Warp Sync.
• Some CUDA applications depend on the lockstep execution of “static warps”
(Warp 0 = threads 0...31, Warp 1 = threads 32...63, Warp 2 = threads 64...95)
– E.g. task queue in ray tracing (implicit warp sync.):
int wid = tid.x / 32;
if (tid.x % 32 == 0) {
    sharedTaskID[wid] = atomicAdd(&g_TaskID, 32);
}
my_TaskID = sharedTaskID[wid] + tid.x % 32;
ProcessTask(my_TaskID);
Wilson Fung, Tor Aamodt, Thread Block Compaction
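A self-contained version of this task-queue pattern is sketched below (my own reconstruction, not the original application code; ProcessTask and the 32-tasks-per-warp grant are placeholders). The slide's version relies on implicit lockstep within a static warp; on GPUs with independent thread scheduling an explicit __syncwarp() is needed, so it is included here:

__device__ int g_TaskID;                       // global task counter (hypothetical)
__device__ void ProcessTask(int id) { /* ... */ }

__global__ void task_queue_kernel() {
    __shared__ int sharedTaskID[32];           // one slot per warp (up to 1024 threads/block)
    int tid = threadIdx.x;
    int wid = tid / 32;
    if (tid % 32 == 0) {
        // Lane 0 grabs a batch of 32 task IDs for its whole warp.
        sharedTaskID[wid] = atomicAdd(&g_TaskID, 32);
    }
    __syncwarp();                              // make lane 0's write visible to its warp
    int my_TaskID = sharedTaskID[wid] + tid % 32;
    ProcessTask(my_TaskID);
}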

Observation
• Compute kernels usually contain divergent and non-divergent (coherent) code segments
• Coalesced memory accesses usually occur in coherent code segments
– DWF provides no benefit there
[Figure: static warps run coherent code with coalesced LD/ST; at a divergence they become dynamic warps, and at the reconvergence point the warps are reset to the static (coherent) arrangement.]
Wilson Fung, Tor Aamodt, Thread Block Compaction

Thread Block Compaction
• Run a thread block like a warp
– The whole block moves between coherent and divergent code
– A block-wide stack tracks execution paths and reconvergence
• Barrier @ branch/reconvergence point
– All available threads arrive at the branch
– Insensitive to warp scheduling
• Warp compaction
– Regrouping with all available threads
– If there is no divergence, this gives the static warp arrangement
[Callouts: these mechanisms address the DWF pathologies of implicit warp sync., starvation, and extra uncoalesced memory accesses.]
Wilson Fung, Tor Aamodt, Thread Block Compaction

Thread Block Compaction
Example: A: K = A[tid.x]; B: if (K > 10) C: K = 10; else D: K = 0; E: B = C[tid.x] + K;
[Figure: a block-wide stack with PC, RPC, and active-thread entries for a 12-thread block. At the divergent branch, the threads on path C (1 2 -- -- 5 -- 7 8 -- -- 11 12) and path D (-- -- 3 4 -- 6 -- -- 9 10 -- --) are each compacted into fewer warps, then restored to the full static warps at reconvergence point E.]
Wilson Fung, Tor Aamodt, Thread Block Compaction

Thread Compactor
• Convert the active mask from the block-wide stack into thread IDs in the warp buffer
• Array of priority encoders
[Figure: the per-lane columns of path C's active mask feed an array of priority encoders (P-Enc), producing the compacted warps C: 1 2 7 8 and C: 5 -- 11 12 in the warp buffer.]
Wilson Fung, Tor Aamodt, Thread Block Compaction

Experimental Results
• 2 benchmark groups:
– COHE = non-divergent CUDA applications
– DIVG = divergent CUDA applications
[Chart: IPC relative to the baseline per-warp stack. DWF suffers serious slowdowns from its pathologies; TBC has no penalty on COHE and a 22% speedup on DIVG.]
Wilson Fung, Tor Aamodt, Thread Block Compaction

Recent work on warp divergence
• Intel [MICRO 2011]: Thread Frontiers – early reconvergence for unstructured control flow.
• UT-Austin/NVIDIA [MICRO 2011]: Large Warps – similar to TBC, except it decouples the size of the thread stack from the thread block size.
• NVIDIA [ISCA 2012]: Simultaneous branch and warp interweaving – enables the SIMD hardware to execute two paths at once.
• Intel [ISCA 2013]: Intra-warp compaction – extends the Xeon Phi uarch to enable compaction.
• NVIDIA: Temporal SIMT [described briefly in an IEEE Micro article and in more detail in a CGO 2013 paper]
• NVIDIA [ISCA 2015]: Variable Warp-Size Architecture – merge small warps (4 threads) into “gangs”.

+ Thread Frontiers [Diamos et al., MICRO 2011]

Temporal SIMT
[Figure: spatial SIMT (current GPUs) issues 1 warp instruction = 32 threads across a 32-wide datapath in a single cycle; pure temporal SIMT runs the same 32 threads down a 1-wide datapath, one thread per cycle, instruction by instruction (ld, mul, add, st).]
[slide courtesy of Bill Dally]

Temporal SIMT Optimizations
• Control divergence: hybrid MIMD/SIMT: 32-wide (41%), 4-wide (65%), 1-wide (100%)
• Scalarization: factor common instructions from multiple threads; execute them once and place the results in common registers [See: SIMT Affine Value Structure (ISCA 2013)]
[slide courtesy of Bill Dally]

Scalar Instructions in SIMT Lanes
[Figure: during temporal execution of a warp, a scalar instruction spans the warp and its scalar register is visible to all threads, across multiple lanes/warps.]
Y. Lee, CGO 2013 [slide courtesy of Bill Dally]

Variable Warp-Size Architecture
• Most recent work by NVIDIA [ISCA 2015]
• Split the SM datapath into narrow slices
– Extensively studied 4-thread slices
• Gang slice execution to gain the efficiencies of a wider warp
[Figure: slices share an L1 I-cache and memory unit but can execute independently.]
Tim Rogers, A Variable Warp-Size Architecture

Divergent Application Performance
[Chart: IPC normalized to warp size 32 for WS 32, WS 4, I-VWS (break on control flow only), and E-VWS (break + reform) on the divergent applications CoMD, Lighting, GamePhysics, ObjClassifier, Raytracing, and their harmonic mean (HMEAN-DIV).]
Tim Rogers, A Variable Warp-Size Architecture

Convergent Application Performance
[Chart: IPC normalized to warp size 32 for WS 32, WS 4, I-VWS (break on control flow only), and E-VWS (break + reform) on convergent and warp-size-insensitive applications; these applications are essentially unaffected.]
Tim Rogers, A Variable Warp-Size Architecture

Research Direction 2: Mitigating High GPGPU Memory Bandwidth Demands

Improve upon CCWS?
• CCWS detects bad scheduling decisions and avoids them in the future.
• It would be better if we could “think ahead” / “be proactive” instead of “being reactive”.

Observations [Rogers et al., MICRO 2013]
• Memory divergence in static instructions is predictable
• The data touched by divergent loads depends on the active mask
[Figure: the same divergent load issues 4 main-memory accesses under a full active mask but only 2 under a mask with two active threads; both properties are used to create a cache footprint prediction.]

Footprint Prediction
1. Detect loops with locality (some loops have locality, some don't) and limit multithreading there
2. Classify loads in the loop as diverged or not diverged
3. Compute the footprint from the active mask (a sketch of this step follows below)
[Figure: for a loop with locality containing a diverged load (4 accesses under warp 0's active mask) and a non-diverged load (1 access), warp 0's footprint is 5 cache lines.]
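A minimal sketch of step 3 (my own illustration of the arithmetic, not the paper's hardware): charge a diverged load roughly one cache line per active thread and a coalesced load roughly one line per warp.

// Hypothetical footprint estimate for one warp and one loop body.
int predict_footprint(unsigned active_mask, int n_diverged, int n_coalesced) {
    int active = 0;
    for (unsigned m = active_mask; m; m >>= 1)   // count active threads in the warp
        active += m & 1;
    return n_diverged * active                   // diverged load: ~1 line per active thread
         + (active ? n_coalesced : 0);           // coalesced load: ~1 line per warp
}

For the slide's example (4 active threads, one diverged and one non-diverged load) this gives 4 + 1 = 5 cache lines.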

DAWS Operation Example
Example compressed sparse row kernel:
int C[] = {0, 64, 96, 128, 160, 192, 224, 256};
void sum_row_csr(float* A, …) {
    float sum = 0;
    int i = C[tid];
    while (i < C[tid+1]) {   // divergent branch
        sum += A[i];         // memory-divergent load
        ++i;
    }
    …
}
[Figure: timeline of stop/go decisions driven by per-warp cache-footprint predictions. On the 1st iteration, early warps profile the loop (1 diverged load detected); warp 0 has branch divergence (active mask 0 1 1 1, footprint 3 × 1) while warp 1 does not (footprint 4 × 1), and both go. On later iterations the number of active threads decreases, shrinking the footprint; warps whose footprints would overflow the cache are stopped so that warps capturing the detected locality run together, and by the 33rd iteration the loop's loads (A[0], A[32], A[64], …) hit in the cache.]
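For reference, a compilable CUDA version of this row-sum kernel might look like the following (a sketch under my own assumptions: the row-pointer array C and an output vector y are passed as parameters rather than hard-coded, and one thread handles one row):

// Each thread sums the nonzeros of one CSR row. Rows of different lengths
// produce the branch divergence and memory divergence discussed above.
__global__ void sum_row_csr(const float *A, const int *C, float *y, int num_rows) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= num_rows) return;
    float sum = 0.0f;
    for (int i = C[tid]; i < C[tid + 1]; ++i)
        sum += A[i];            // divergent load: rows begin at scattered offsets
    y[tid] = sum;
}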

Sparse MM Case Study Results
• Performance (normalized to the optimized version)
[Chart: with DAWS, the execution time of the divergent code is within 4% of the optimized version, with no programmer input.]

Coordinated Criticality-Aware Warp Acceleration (CAWA) [Lee et al., ISCA 2015]
• Some warps execute longer than others due to a lack of uniformity in the underlying workload.
• Give these warps more space in the cache and more scheduling slots.
• Estimate the critical path by observing the amount of branch divergence and memory stalls.
• Also, predict whether an inserted cache line will be used by a critical warp, using a modified version of the SHiP cache replacement algorithm.

Other Memory System Performance Considerations
• TLB design for GPUs
– Current GPUs have translation lookaside buffers (this makes managing multiple graphics application surfaces easier; it does not support paging)
– How does the large number of threads impact TLB design?
– E.g., Power et al., Supporting x86-64 Address Translation for 100s of GPU Lanes, HPCA 2014. Shows the importance of a multithreaded page table walker + page walk cache.

Research Direction 3: Easier Programming with Synchronization

Synchronization
• Locks are not encouraged in current GPGPU programming manuals.
• Interaction with the SIMT stack can easily cause deadlocks:
while (atomicCAS(&lock[a[tid]], 0, 1) != 0)
    ; // deadlock here if a[i] == a[j] for any threads i, j in the same warp
// critical section goes here
atomicExch(&lock[a[tid]], 0);

Correct way to write a critical section for GPGPU:
done = false;
while (!done) {
    if (atomicCAS(&lock[a[tid]], 0, 1) == 0) {
        // critical section goes here
        atomicExch(&lock[a[tid]], 0);
        done = true;
    }
}
Most current GPGPU programs use barriers within thread blocks and/or lock-free data structures. This leads to the following picture…
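A self-contained kernel built around this pattern might look as follows (a sketch with hypothetical names; the locks array must be zero-initialized in global memory before launch, and the simple bin update merely stands in for a larger critical section). A __threadfence() is added before the release so the critical-section writes are visible before the lock is freed:

__global__ void update_bins_locked(const int *a, int *bins, int *locks, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    bool done = false;
    while (!done) {
        if (atomicCAS(&locks[a[tid]], 0, 1) == 0) {   // try to take the per-bin lock
            bins[a[tid]] += 1;                        // critical section
            __threadfence();                          // publish writes before release
            atomicExch(&locks[a[tid]], 0);            // release the lock
            done = true;                              // leave the retry loop
        }
    }
}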

Lifetime of GPU Application Development
[Figure: functionality and performance versus development time, comparing fine-grained locking / lock-free code with transactional memory (?).]
• E.g. N-Body with 5M bodies:
– CUDA SDK: O(n^2) – 1640 s (barrier)
– Barnes-Hut: O(n log n) – 5.2 s (locks)
Wilson Fung, Inderpeet Singh, Andrew Brownsword, Tor Aamodt

Transactional Memory
• Programmer specifies atomic code blocks called transactions [Herlihy ’93]
Lock version (potential deadlock!):
Lock(X[a]); Lock(X[b]); Lock(X[c]);
X[c] = X[a] + X[b];
Unlock(X[c]); Unlock(X[b]); Unlock(X[a]);
TM version:
atomic { X[c] = X[a] + X[b]; }

Transactional Memory
Programmers' view:
[Figure: non-conflicting transactions may run in parallel (TX1 and TX2 access different memory locations and both commit); conflicting transactions are automatically serialized (TX2 aborts and commits only after TX1 commits).]

Are TM and GPUs Incompatible?
GPU uarch is very different from a multicore CPU…
KILO TM [MICRO 2011, IEEE Micro Top Picks]
• Hardware TM for GPUs
• Half the performance of fine-grained locking
• Chip area overhead of 0.5%

Hardware TM for GPUs Challenge #1: SIMD Hardware
• On GPUs, scalar threads in a warp/wavefront execute in lockstep
[Figure: a warp with 4 scalar threads runs … TxBegin; LD r2,[B]; ADD r2,2; ST r2,[A]; TxCommit; … If some threads commit while others abort, the warp suffers branch divergence.]

KILO TM – Solution to Challenge #1: SIMD Hardware
• Transaction abort is handled like a loop: extend the SIMT stack so aborted threads branch back and re-execute the transaction (… TxBegin; LD r2,[B]; ADD r2,2; ST r2,[A]; TxCommit; …).
Wilson Fung, Inderpeet Singh, Andrew Brownsword, Tor Aamodt

Hardware TM for GPUs Challenge #2: Transaction Rollback
[Figure: a CPU core checkpoints its register file (10s of registers) at TX entry and restores it at TX abort; a GPU core (SM) has 32k registers per SM (about 2 MB of total on-chip register storage), so a register file checkpoint is impractical.]

KILO TM – Solution to Challenge #2: Transaction Rollback
• SW register checkpoint
– Most TX: registers are overwritten at their first appearance (idempotent), so no checkpoint is needed
– TX in Barnes-Hut: checkpoint only 2 registers
[Figure: in TxBegin; LD r2,[B]; ADD r2,2; ST r2,[A]; TxCommit, r2 is overwritten before it is read, so an abort can simply re-execute the transaction.]

Hardware TM for GPUs Challenge #3: Conflict Detection
• Existing HTMs use the cache coherence protocol
– Not available on (current) GPUs
– No private data cache per thread
• Signatures?
– 1024 bits / thread
– 3.8 MB / 30k threads

Hardware TM for GPUs Challenge #4: Write Buffer
• Fermi's L1 data cache (48 kB) = 384 × 128 B lines
• Problem: 384 lines for 1024-1536 threads < 1 line per thread, so the L1 cache cannot buffer each thread's transactional writes

KILO TM: Value-Based Conflict Detection
[Figure: TX1 (atomic {B = A + 1}: TxBegin; LD r1,[A]; ADD r1,1; ST r1,[B]; TxCommit) keeps a private read-log (A=1) and write-log (B=2); TX2 (atomic {A = B + 2}: TxBegin; LD r2,[B]; ADD r2,2; ST r2,[A]; TxCommit) keeps a read-log (B=0) and write-log (A=2); global memory holds A=1, B=0.]
• Self-validation + abort:
– Only detects the existence of a conflict (not its identity)
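The idea of value-based validation can be sketched in a few lines of device-style code (my own illustration, not KILO TM's commit-unit hardware; it assumes the validation-plus-commit step is made atomic elsewhere, e.g., by serializing commits as the following slides discuss): re-read every logged address at commit and compare against the value originally read; only if every value still matches are the buffered writes published.

struct LogEntry { int *addr; int value; };

__device__ bool try_commit(LogEntry *read_log, int n_reads,
                           LogEntry *write_log, int n_writes) {
    for (int i = 0; i < n_reads; ++i)
        if (*read_log[i].addr != read_log[i].value)  // value changed since it was read
            return false;                            // a conflict exists: abort and retry
    for (int i = 0; i < n_writes; ++i)
        *write_log[i].addr = write_log[i].value;     // publish buffered writes
    return true;
}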

Parallel Validation? Data Race!?!
[Figure: TX1 (atomic {B = A + 1}, read-log A=1, write-log B=2) and TX2 (atomic {A = B + 2}, read-log B=0, write-log A=2) validate against global memory (A=1, B=0) in parallel; both can pass even though the only correct serialized outcomes are TX1 then TX2 (A=4, B=2) or TX2 then TX1 (A=2, B=3).]

Serialize Validation?
[Figure: the commit unit validates and commits TX1 (V+C) against global memory while TX2 stalls, then validates and commits. V = validation, C = commit.]
• Benefit #1: no data race
• Benefit #2: no livelock
• Drawback: serializes non-conflicting transactions (“collateral damage”)

Solution: Speculative Validation
Key idea: split conflict detection into two parts
1. Against recently committed TX – in parallel
2. Against concurrently committing TX – in (approximate) commit order
[Figure: TX1, TX2, and TX3 overlap validation and commit (V+C) in the commit unit against global memory; a transaction stalls only when a conflict is detected. Conflicts are rare, so commit parallelism is good. V = validation, C = commit.]

Efficiency Concerns?
[Results: 128X speedup over coarse-grained locks, 40% of fine-grained lock performance, 2X energy usage.]
• Scalar transaction management
– A scalar transaction fits the SIMT model and keeps the design simple
– But it makes poor use of the SIMD memory subsystem
• Rereading every memory location
– Each memory access costs energy

Inefficiency from Scalar Transaction Management
• Kilo TM ignores the GPU thread hierarchy
– Excessive control message traffic between SIMT cores and commit units (Send-Log, CU-Pass/Fail, TX-Outcome, Commit Done; 4 B-32 B messages)
– Scalar validation and commit results in poor L2 (last-level cache) bandwidth utilization
• Simplifies the HW design, but costs energy

Intra-Warp Conflict
• The potential existence of intra-warp conflicts introduces complex corner cases:
[Figure: four transactions in one warp with read sets X=9, Y=8, Z=7, W=6 and write sets Y=9, Z=8, W=7, X=6, starting from global memory X=9, Y=8, Z=7, W=6. Committing all of them at validation yields X=6, Y=9, Z=8, W=7, which is wrong; the correct serializable outcomes are X=6, Y=8, Z=8, W=6 or X=9, Y=9, Z=7, W=7.]

Intra-Warp Conflict Resolution
[Pipeline: execution → intra-warp conflict resolution → validation → commit.]
• Kilo TM stores the read-set and write-set in logs
– Compact, fits in caches
– Inefficient to search
• Naive, pair-wise resolution is too slow
– T threads/warp, R+W words/thread
– O((R+W)^2) comparisons per pair of transactions, O(T^2 × (R+W)^2) overall, with T ≥ 32

Intra-Warp Conflict Resolution: 2-Phase Parallel Conflict Resolution [Fung, MICRO 2013]
• Insight: a fixed priority for conflict resolution enables parallel resolution
• O(R+W)
• Two phases (a rough sketch follows below):
– Ownership table construction
– Parallel match
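A rough sketch of the general idea, under my own assumptions (a hashed ownership table, priority equal to thread ID with lower IDs winning via atomicMin); this is an illustration of the two phases, not the paper's exact mechanism:

#define TABLE_SIZE 1024          // hypothetical ownership-table size; entries reset to INT_MAX

// Phase 1: each thread claims the addresses in its write-set.
__device__ void claim_writes(int *owner, int **write_set, int n_writes, int tid) {
    for (int i = 0; i < n_writes; ++i) {
        unsigned slot = (unsigned)(((size_t)write_set[i] >> 2) % TABLE_SIZE);
        atomicMin(&owner[slot], tid);            // fixed priority: lowest thread ID wins
    }
}

// Phase 2: each thread matches its read- and write-set against the table in parallel;
// a clash with a higher-priority (lower-ID) owner forces this transaction to abort.
__device__ bool must_abort(const int *owner, int **read_set, int n_reads,
                           int **write_set, int n_writes, int tid) {
    for (int i = 0; i < n_reads; ++i) {
        unsigned slot = (unsigned)(((size_t)read_set[i] >> 2) % TABLE_SIZE);
        if (owner[slot] < tid) return true;      // an earlier thread writes what we read
    }
    for (int i = 0; i < n_writes; ++i) {
        unsigned slot = (unsigned)(((size_t)write_set[i] >> 2) % TABLE_SIZE);
        if (owner[slot] < tid) return true;      // write-write clash with an earlier thread
    }
    return false;
}

Each thread performs O(R+W) table operations in each phase, matching the O(R+W) cost named on the slide.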

Results
[Charts: execution time and energy usage normalized to fine-grained locks for KiloTM-Base, TCD, and WarpTM+TCD. KiloTM-Base achieves 40% of FG-lock performance at 2X energy; WarpTM+TCD achieves 66% at 1.3X energy.]
• Low-contention workloads: Kilo TM with SW optimizations is on par with FG locking

Other Research Directions….
• Non-deterministic behavior for buggy code: different results over multiple executions
– GPUDet, ASPLOS 2013
[Chart: result variation on Kepler (0-100%) versus the number of graph edges.]
• Lack of good performance analysis tools
– NVIDIA Profiler / Parallel Nsight
– AerialVision [ISPASS 2010]
– GPU analytical perf/power models (Hyesoon Kim)

Lack of I/O and System Support…
• Support for printf, malloc from kernels in CUDA
• File system I/O?
• GPUfs (ASPLOS 2013):
– POSIX-like file system API
– One file per warp to avoid control divergence
– Weak file system consistency model (close-to-open)
– Performance API: O_GWRONCE
– Eliminate the seek pointer
• GPUnet (OSDI 2014): POSIX-like API for sockets programming on GPGPU

Conclusions
• We discussed some recent work on GPU architecture research.