Scaling Usable Computing Capability
Tor M. Aamodt, University of British Columbia
14 July 2014, SAMOS XIV
Brief history of computing… + crystal ball. How?
1971: Intel 4004 · 1981: IBM 5150 · 2007: iPhone · Datacenter
Advancing Computer Systems without Technology Progress, DARPA/ISAT Workshop, March 26-27, 2012, Mark Hill & Christos Kozyrakis
10/2/2020 2
Classic Dennard Scaling (Dally/NVIDIA)
Scale chip features down 0.7x per process generation:
- 1.4x faster transistors
- 2x more transistors
- 0.7x capacitance
- 0.7x voltage
=> 2.8x chip capability in the same power
[chart: Chip Power vs. Chip Capability]
Post-Dennard Scaling (Dally/NVIDIA)
- Transistors are no faster
- Static leakage limits reduction in Vth => Vdd stays constant
- 2x more transistors, 0.7x capacitance
=> 2x chip capability at 1.4x power, or 1.4x chip capability at the same power
+ Restrictive geometry => 1.2x capability at the same power
Conclusions? End of the “free lunch” for software developers?
[chart: Chip Power vs. Chip Capability; source: Bill Dally/NVIDIA]
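The scaling factors on these two slides can be sanity-checked with a few lines of arithmetic. This is a sketch of that check, using the dynamic-power relation P ∝ N·C·V²·f; the slides round 1/0.7² up to "2x more transistors".

```python
# Sanity check of the Dennard-scaling arithmetic on the two slides.
S = 0.7                           # linear feature scaling per generation
transistors = 1 / S**2            # ~2x more transistors per unit area
capacitance = S                   # 0.7x capacitance per transistor

# Classic Dennard: frequency up 1/0.7x, supply voltage down 0.7x.
classic_power = transistors * capacitance * S**2 * (1 / S)   # P ∝ N·C·V²·f
classic_cap = transistors * (1 / S)                          # capability ∝ N·f
print(f"classic: {classic_cap:.2f}x capability at {classic_power:.2f}x power")

# Post-Dennard: leakage pins Vdd, transistors are no faster (V = f = 1).
post_power = transistors * capacitance
post_cap = transistors
print(f"post:    {post_cap:.2f}x capability at {post_power:.2f}x power")
```

The classic case comes out to roughly 2.9x capability at constant power (the slide's 2.8x, with transistor count rounded to 2x), and the post-Dennard case to roughly 2x capability at 1.4x power, matching the slide.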
Meanwhile: the software perspective…
[chart: TIOBE Index language rankings, 1994-2014: Python, JavaScript, C++]
- Software developers still expect the “free lunch”
- Demand for increased programmer productivity
Post-Dennard Scaling “Iron Triangle”: Programmability, Efficiency, Performance (cf. [Amant et al., ISCA 2014])
How to study programmability?
- Write software for the proposed architecture. Think about the hard part. Consider how to change the system to make it easier next time.
- Architecture research: benchmark suites
  - Created by a 3rd party. No confirmation bias!
  - Save a huge amount of time. More papers!
Approximate Computing?
- Huge potential to increase efficiency
- Floating- to fixed-point conversion (MASc, circa 2000)
  - Easy to determine an error bound, e.g., 90 dB QSNR
  - Linear system: input & transfer-function norms bound range/error
  - Input-output nonlinearity: hard to infer the required precision at each operation to ensure the error bound
- Debugging approximate programs is hard: there is always some error, so is a given error due to the approximation?
- Real-time graphics: approximate computing for a single application domain (SIGGRAPH)
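The 90 dB QSNR bound above can be made concrete with a small experiment: quantize a test signal to a given number of fractional bits and measure the resulting quantization SNR. This is an illustrative sketch (the test signal and helper name are my own, not from the talk); the familiar rule of thumb is roughly 6 dB per bit.

```python
import math

def qsnr_db(signal, frac_bits):
    """Quantization SNR (dB) when rounding to `frac_bits` fractional bits."""
    q = 2.0 ** -frac_bits                      # quantization step
    quantized = [round(x / q) * q for x in signal]
    sig_pow = sum(x * x for x in signal)
    err_pow = sum((x - y) ** 2 for x, y in zip(signal, quantized))
    return 10 * math.log10(sig_pow / err_pow)

# A unit-amplitude sinusoid as a stand-in input signal.
signal = [math.sin(2 * math.pi * k / 997) for k in range(10000)]
for bits in (8, 12, 16, 20):
    print(f"{bits} fractional bits -> {qsnr_db(signal, bits):5.1f} dB QSNR")
```

Around 16 fractional bits suffice for a 90 dB bound on this input; the hard part the slide points to is that with nonlinearities, the per-operation precision needed to guarantee such a bound is much harder to infer.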
Fix the application and performance target.
[chart: Programmability (better ↑) vs. Efficiency (→)]
Brawny (OoO) multicore (how to get here?) · Wimpy (in-order) multicore · 16K-thread SIMT accelerator · FPGA · ASIC
Programmable Accelerators: GPU gives ~10x perf. vs. CPU for the right workload
Inevitable solution? [chart: Programmability vs. Efficiency]
“Serial Fraction”, i.e., hard to accelerate, vs. “Parallel Fraction”, i.e., easy to accelerate
What defines the boundary between hard and easy?
Fraction_hard = f(problem, software budget, programming model)
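The hard fraction matters because of Amdahl's Law: whatever the accelerator cannot touch bounds the overall speedup. A minimal worked example (function name is mine):

```python
def overall_speedup(hard_fraction, accel_speedup):
    """Amdahl's Law: the hard fraction runs unaccelerated."""
    easy = 1.0 - hard_fraction
    return 1.0 / (hard_fraction + easy / accel_speedup)

# Even an infinitely fast accelerator is capped at 1/hard_fraction:
for f in (0.5, 0.2, 0.05):
    print(f"hard={f:.2f}: 10x accel -> {overall_speedup(f, 10):.2f}x overall, "
          f"cap -> {1 / f:.0f}x")
```

With half the work hard to accelerate, a 10x accelerator yields under 2x overall; shrinking the hard fraction to 5% raises that to nearly 7x. This is why the boundary between hard and easy, and hence the programming model, is the real lever.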
Result of improving programmability: as a hard-to-program accelerator becomes easier to program, more of the workload becomes easy to accelerate; the difference is the resulting improvement.
Why is GPU coding difficult?
- Manual data movement between CPU and GPU
- Lack of generic I/O and system support on the GPU
- For complex algorithms, synchronization
- Need for performance tuning to reduce:
  - off-chip accesses
  - memory divergence
  - control divergence
- Non-deterministic behavior for buggy code
- Lack of good performance analysis tools
[chart: Programmability (better ↑) vs. Efficiency] 16K-thread SIMT accelerator
Programmable Accelerator
- Multi-threaded SIMD HW
- Aggressive memory subsystem
- OpenCL, CUDA, C++ AMP, OpenACC
[diagram: the CPU spawns a kernel on the GPU; the kernel is divided into blocks (work groups) of wavefronts of scalar threads; threads share local memory within a work group and global memory across the GPU; barriers synchronize within a work group]
GPU Microarchitecture (10,000 feet): Single-Instruction, Multiple-Threads (SIMT)
[diagram: SIMT core clusters of SIMT cores, connected via an interconnection network to memory partitions and off-chip GDDR3/GDDR5 DRAM]
SIMT Execution Model (Levinthal, SIGGRAPH ’84)
- Programmer sees MIMD threads (scalar)
- GPU bundles threads into warps and runs them in lockstep on SIMD hardware
- Reconvergence stack: PDOM (Fung, MICRO ’07)

foo[] = {4, 8, 12, 16};
A: n = foo[tid.x];
B: if (n > 10)
C:     …;
   else
D:     …;
E: …;

Reconvergence stack entries are (PC, RPC, active mask): all 4 threads execute A and B (mask 1111); the branch pushes (D, E, 1100) and (C, E, 0011); threads 3-4 execute C, threads 1-2 execute D, and all reconverge at E (mask 1111).
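The stack mechanics above can be simulated in a few lines. This is a minimal sketch of the reconvergence stack for the slide's example, not real SIMT hardware: entries are pushed at the divergent branch and popped to execute each side under its mask, reconverging at E.

```python
# Simulate the slide's example: foo[] = {4, 8, 12, 16}, branch "if (n > 10)".
foo = [4, 8, 12, 16]
n = foo[:]                          # A: n = foo[tid.x], all 4 threads active

taken = [x > 10 for x in n]         # lanes taking the "then" side (block C)
not_taken = [not t for t in taken]  # lanes taking the "else" side (block D)

# Push (PC, RPC, active mask); E is the reconvergence point (RPC).
stack = [("E", None, [True] * 4),
         ("D", "E", not_taken),
         ("C", "E", taken)]

trace = []
while stack:
    pc, rpc, mask = stack.pop()
    trace.append((pc, mask))        # "execute" block pc under mask

for pc, mask in trace:
    lanes = "".join(str(i + 1) if m else "-" for i, m in enumerate(mask))
    print(f"{pc}: active lanes {lanes}")
```

The trace reproduces the slide's table: C runs with lanes --34, D with lanes 12--, and E with all lanes active again.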
Simplify communication
Transactional Memory
- Programmer specifies atomic code blocks called transactions [Herlihy ’93]

Lock version (potential deadlock!):
Lock(X[a]); Lock(X[b]); Lock(X[c]);
X[c] = X[a] + X[b];
Unlock(X[c]); Unlock(X[b]); Unlock(X[a]);

TM version:
atomic { X[c] = X[a] + X[b]; }
Transactional Memory: the programmers’ view
- Non-conflicting transactions may run in parallel: TX1 and TX2 touching disjoint memory commit in either order
- Conflicting transactions are automatically serialized: TX2 aborts, re-executes, and commits after TX1
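The two cases on this slide reduce to a read/write-set overlap test. This toy predicate (mine, not part of KILO TM or any real TM system) captures when two transactions may run in parallel versus when they must serialize:

```python
def conflicts(tx1_reads, tx1_writes, tx2_reads, tx2_writes):
    """Two transactions conflict if either one writes a location
    the other reads or writes; otherwise they may commit in parallel."""
    return bool(tx1_writes & (tx2_reads | tx2_writes) or
                tx2_writes & (tx1_reads | tx1_writes))

# TX1: X[c] = X[a] + X[b];  TX2: X[f] = X[d] + X[e]  -> disjoint, parallel OK
print(conflicts({"a", "b"}, {"c"}, {"d", "e"}, {"f"}))
# TX1: X[c] = X[a] + X[b];  TX2: X[a] = X[c] + 1     -> overlap, serialize
print(conflicts({"a", "b"}, {"c"}, {"c"}, {"a"}))
```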
Hardware TM for GPUs? Challenge #1: SIMD hardware
- On GPUs, scalar threads in a warp/wavefront execute in lockstep

A warp with 4 scalar threads (T0-T3):
…
TxBegin
LD  r2, [B]
ADD r2, 2
ST  r2, [A]
TxCommit
…
If some threads commit while others abort: branch divergence!
KILO TM solution to Challenge #1 (SIMD hardware)
- Transaction abort is like a loop: employ the SIMT stack (aborted threads branch back to TxBegin and retry)
Hardware TM for GPUs? Challenge #2: transaction rollback
- CPU core: 10s of registers, so checkpointing the register file at TX entry for abort is cheap
- GPU core (SM): 32K registers, 2 MB total on-chip storage. Register file checkpoint?
KILO TM solution to Challenge #2 (transaction rollback)
- SW register checkpoint
  - Most TX: registers are overwritten on first appearance (idempotent), so no checkpoint is needed
  - Barnes-Hut: checkpoint only 2 registers
Hardware TM for GPUs? Challenge #3: conflict detection
- Existing HTMs use the cache coherence protocol
  - Not available on (current) GPUs: no private data cache per thread
- Signatures? 1024 bits/thread = 3.8 MB for 30K threads
Hardware TM for GPUs? Challenge #4: write buffer
- Problem: Fermi’s L1 data cache (48 kB) = 384 lines of 128 B, but an SM runs 1024-1536 threads: less than 1 cache line per thread!
KILO TM value-based conflict detection
TX1: atomic { B = A + 1 }    TX2: atomic { A = B + 2 }
Global memory initially A = 1, B = 0. Each TX buffers its accesses in private memory:
- TX1: read-log A=1, write-log B=2
- TX2: read-log B=0, write-log A=2
- Self-validation + abort: only detects the existence of a conflict (not its identity)
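A hedged sketch of the value-based scheme this slide describes (the function and data layout are mine, not KILO TM's implementation): at commit, a transaction re-reads the addresses in its read-log and aborts if any logged value no longer matches global memory, then publishes its write-log.

```python
def try_commit(memory, read_log, write_log):
    """Value-based validation: abort if any read-log value is stale."""
    if any(memory[addr] != val for addr, val in read_log.items()):
        return False                        # conflict detected by value
    memory.update(write_log)                # publish buffered writes
    return True

memory = {"A": 1, "B": 0}
tx1 = ({"A": 1}, {"B": memory["A"] + 1})    # atomic { B = A + 1 }
tx2 = ({"B": 0}, {"A": memory["B"] + 2})    # atomic { A = B + 2 }

ok1 = try_commit(memory, *tx1)              # TX1 validates (A still 1), commits
ok2 = try_commit(memory, *tx2)              # TX2's logged B=0 is now stale
print("TX1:", ok1, "TX2:", ok2, "memory:", memory)
```

TX1 commits and sets B=2; TX2 then fails validation (its read-log says B=0) and must abort and retry, illustrating why self-validation detects that a conflict exists without identifying the conflicting transaction.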
[MICRO 2013] Intra-warp conflict resolution, temporal conflict detection (TCD)
[charts: execution time and energy usage normalized to fine-grained (FG) locking, for KiloTM-Base, TCD, and WarpTM+TCD]
- WarpTM+TCD reaches 66% of FG-lock performance at 1.3x its energy usage
- Low-contention workloads: KILO TM with SW optimizations is on par with FG locking
Simplify performance tuning
Sparse vector-matrix multiply: 2 versions from SHOC
- Simple version: divergence; each thread has locality
- Optimized version: explicit scratchpad use; dependent on warp size; added complication: parallel reduction
- “Simple” on a GPU with divergence-aware warp scheduling [MICRO 2013] is within 4% of the performance of “Optimized” on current GPUs
Sources of locality: intra-wavefront locality (a wavefront hits in the data cache on lines it loaded earlier) vs. inter-wavefront locality (wave 0 hits on a line brought in by wave 1)
[chart: hits/misses per kilo-instruction, averaged over highly cache-sensitive workloads (BFS, garbage collection, K-means, memcached): intra-wavefront hits PKI dominate inter-wavefront hits PKI]
The scheduler affects the access pattern
- Round-robin scheduler: interleaves wavefronts (wave 0 loads A, B, C, D…; wave 1 loads Z, Y, X, W…), mixing their lines in the cache
- Greedy-then-oldest scheduler: runs one wavefront’s loads (A, B, C, D…) back-to-back, keeping its lines resident
Use the scheduler to shape the access pattern
- Cache-Conscious Wavefront Scheduling: greedy-then-oldest scheduling that keeps the per-wavefront working set within the cache
- 63% perf. improvement
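The effect the two slides above describe can be shown with a toy model (mine, not the CCWS hardware): two wavefronts each stream through their own 4 cache lines repeatedly, but the L1 only holds 4 lines, so interleaving thrashes while running one wavefront to completion preserves intra-wavefront locality.

```python
from collections import OrderedDict

def run(schedule, cache_lines=4):
    """Count hits for a sequence of line accesses in a tiny LRU cache."""
    cache, hits = OrderedDict(), 0
    for line in schedule:
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # refresh LRU position
        else:
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict least-recently-used line
    return hits

w0 = ["A", "B", "C", "D"] * 4              # wavefront 0's loads
w1 = ["Z", "Y", "X", "W"] * 4              # wavefront 1's loads

round_robin = [l for pair in zip(w0, w1) for l in pair]   # interleave
greedy_then_oldest = w0 + w1               # run one wavefront to completion

print("round-robin hits:", run(round_robin))
print("greedy-then-oldest hits:", run(greedy_then_oldest))
```

With round-robin the combined working set (8 lines) never fits and every access misses; greedy-then-oldest keeps each wavefront's 4 lines resident and hits on every reuse. That is the intuition behind shaping the access pattern from the scheduler.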
Proactive vs. reactive scheduling [Rogers et al., MICRO 2013]
- Memory divergence in static load instructions is predictable
- The data touched by a divergent load depends on the active mask: e.g., 4 accesses to main memory with all 4 threads active, 2 accesses when only 2 are active
- Use divergence prediction to create the cache footprint
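The footprint estimate on this slide is just a count of distinct cache lines touched by the active lanes. A small sketch of that idea (line size and addresses are assumptions for illustration, not from the talk):

```python
LINE = 128  # assumed bytes per cache line, as on Fermi-class GPUs

def footprint(addresses, active_mask):
    """Distinct cache lines touched by a divergent load's active lanes."""
    lines = {addr // LINE for addr, on in zip(addresses, active_mask) if on}
    return len(lines)

addrs = [0x1000, 0x1080, 0x1100, 0x1180]   # one line per lane: fully divergent
print(footprint(addrs, [1, 1, 1, 1]))      # all 4 lanes active
print(footprint(addrs, [1, 0, 0, 1]))      # 2 lanes active under divergence
```

This reproduces the slide's numbers (4 accesses with a full mask, 2 under divergence), which is why a per-load divergence predictor plus the active mask is enough to estimate a load's cache footprint proactively.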
Sparse MM case study results
- Performance normalized to the optimized code: the divergent (simple) code runs within 4% of optimized, with no programmer input
Summary
- Effective solutions to the post-Dennard-scaling “Iron Triangle” require studying performance, efficiency AND programmability
- Amdahl’s Law implies that the programmability of accelerators is key to increasing usable computing capability