Aging-Aware Compiler-Directed VLIW Assignment for GPGPU Architectures
Abbas Rahimi‡, Luca Benini†, Rajesh K. Gupta‡
‡UC San Diego, †University of Bologna
Variability.org
Micrel.deis.unibo.it/MultiTherman

Variability is about Cost and Scale
• Variability in transistor characteristics is a major challenge in nanoscale CMOS:
  o Static process variation: Leff and Vth
  o Dynamic variations: temperature fluctuations, supply voltage droops, and device aging (NBTI, HCI)
• To handle variations, designers use conservative guardbands, which cause a loss of operational efficiency.
• NBTI-induced performance degradation:
  o ∆VTH = F(Process, Temp, Voltage, Stress)
  o Stress (workload) consumes timing margin.
  o Lifetime is limited by the most aged component.
  o Complicated with 512 CUDA cores, or 320 5-way VLIW cores!
[Figure: clock period = actual circuit delay + guardband; guardband components: process, aging (∆Vth), temperature, VCC droop. A second plot shows VTH rising under stress from the operational region toward failure.]

Related Work
1. NBTI-aware power-gating exploits the sleep state where a circuit is inherently immune to aging [Calimera'09, Calimera'12]
  • High power-gating factors impose performance degradation
2. Equalize the stress among various functional units in single-core [Gunadi'10]
  • They intrusively modified the pipeline to support complement mode execution and operand swapping
3. Traditional coarse-grained multi-core designs utilize selective voltage scaling [Tiwari'08, Karpuzcu'09]
  • Difference between adaptive voltage and over-designed voltage is small
4. Process variation in GPGPU [Lee'11]
  • Disabling the slowest cores!
  • Cannot capture the aging, which is dynamic in nature!

Contribution
• Aging-aware compiler that utilizes a dynamic binary optimizer to customize the kernel code in response to the specific health state of the hardware:
  • Specific health state (online NBTI sensors)
  • Uniformly distributes the stress of instructions among the various VLIW slots, which results in healthy code generation.
• An adaptive reallocation strategy, a fully software solution, without any architectural modification, with iso-throughput kernels:
  • Throughput (healthy kernel) = Throughput (naïve kernel)

AMD Evergreen GPGPU Architecture
Radeon HD 5870
• 20 Compute Units (CUs)
• 16 Stream Cores (SCs) per CU (SIMD execution)
• 5 Processing Elements (PEs) per SC (VLIW execution)
  • 4 identical PEs (PEX, PEY, PEW, PEZ)
  • 1 special PET
Example VLIW bundle (ILP, packing ratio = 3/5):
  X: MOV R8.x, 0.0f
  Y: AND_INT T0.y, KC0[1].x
  Z: ASHR T0.x, KC1[3].x
  W: ____
  T: ____
[Figure: compute device (ultra-threaded dispatcher, Compute Units CU 0 to CU 19, L1, crossbar, global memory hierarchy); compute unit (wavefront scheduler, SIMD fetch unit, Stream Cores SC 0 to SC 15, local data storage); stream core (processing elements X, Y, Z, W, T, branch unit, general-purpose registers).]
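The packing ratio quoted for the example bundle can be checked with a few lines of Python. This is only an illustrative sketch: the dictionary representation of a bundle and the function name `packing_ratio` are assumptions made here, not part of the slides or of any AMD tool.

```python
# Slot names of one 5-way VLIW bundle on a stream core (from the slide).
SLOTS = ("X", "Y", "Z", "W", "T")

def packing_ratio(bundle):
    """Fraction of the five slots that hold an instruction.

    `bundle` maps a slot name to an instruction string, or to None
    when the slot is empty (hypothetical representation).
    """
    used = sum(1 for slot in SLOTS if bundle.get(slot) is not None)
    return used / len(SLOTS)

# The example bundle shown on the slide: X, Y, Z filled; W and T empty.
example = {
    "X": "MOV R8.x, 0.0f",
    "Y": "AND_INT T0.y, KC0[1].x",
    "Z": "ASHR T0.x, KC1[3].x",
    "W": None,
    "T": None,
}
print(packing_ratio(example))  # 0.6, i.e. 3/5
```

The two empty slots are what the later slides exploit: an instruction can be placed on a different identical PE without changing how many instructions issue per bundle.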

GPGPU Workload Variation
✓ 1. Inter-compute units
  • Uniform workload variation between CUs: 0% to 0.26%
  • Load balancing algorithm of the ultra-thread dispatcher
✓ 2. Inter-stream cores
  • SIMD execution
× 3. Inter-processing elements
  • Instructions are NOT uniformly distributed among PEs!!
  • Seven kernels execute more than 40% of the ALU engine instructions only on PEX
  • Compiler only increases the packing ratio
  • Weighted VLIW code generation is needed
We leverage an average packing ratio of 0.3 towards reliability improvement!
Finding N-young slots among all available slots
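The inter-PE imbalance described above can be measured by counting, per slot, how many instructions a kernel's bundles place there. The sketch below is a hedged illustration: the bundle format, the function name `slot_shares`, and the four-bundle toy kernel are all made up for this example and do not come from the AMD APP SDK kernels.

```python
from collections import Counter

SLOTS = ("X", "Y", "Z", "W", "T")

def slot_shares(bundles):
    """Return the fraction of all issued instructions that land on each slot."""
    counts = Counter()
    for bundle in bundles:
        for slot in SLOTS:
            if bundle.get(slot) is not None:
                counts[slot] += 1
    total = sum(counts.values())
    return {slot: (counts[slot] / total if total else 0.0) for slot in SLOTS}

# Toy kernel (hypothetical): a compiler that fills slots from X upward.
toy_kernel = [
    {"X": "MOV", "Y": "ADD"},
    {"X": "MUL"},
    {"X": "ADD", "Z": "ASHR"},
    {"X": "MOV", "Y": "AND_INT"},
]

shares = slot_shares(toy_kernel)
for slot in SLOTS:
    print(f"{slot}: {shares[slot]:.0%}")
```

In this toy input PEX receives 4 of the 7 instructions, the kind of skew the slide reports for real kernels, while PEW and PET stay idle.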

Aging-Aware Compilation Flow
[Figure: on the host CPU, the naïve kernel passes through static code analysis and the dynamic binary optimizer, which also reads the NBTI sensors of the GPGPU, to produce the healthy kernel that runs on the GPGPU.]
Naïve kernel, 5-way VLIW bundle (non-uniform instruction distribution, limited packing ratio):
  X: MOV …
  Y: ASHR …
  W: ____
  T: ____
Healthy kernel (uniform VLIW assignment, leveling of slots):
  X: ____
  Y: ____
  Z: MOV …
  W: ASHR …
  T: ____
Periodic healthy kernel generation:
1. "Fatigued" PEs are relaxing!
2. "Young" PEs are working hard!
Equalizes the expected lifetime of each PE
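The move from the naïve bundle to the healthy bundle can be sketched as a relocation of instructions onto the least-aged identical PEs. This is a minimal sketch under stated assumptions, not the authors' optimizer: the function name `level_bundle`, the bundle format, and the Vth readings are hypothetical, and the sketch ignores any ISA constraints (for example operand or register-channel rules) that may tie an instruction to a particular slot.

```python
IDENTICAL_SLOTS = ("X", "Y", "Z", "W")  # T is special and is left untouched

def level_bundle(bundle, vth_mv):
    """Move the instructions in the identical slots onto the youngest PEs.

    bundle: slot name -> instruction string or None (hypothetical format).
    vth_mv: slot name -> sensed threshold voltage in mV; lower means younger.
    """
    # Instructions currently sitting in the four interchangeable slots.
    instrs = [bundle[s] for s in IDENTICAL_SLOTS if bundle.get(s) is not None]
    # Rank the interchangeable PEs from youngest to most aged.
    by_age = sorted(IDENTICAL_SLOTS, key=lambda s: vth_mv[s])
    healthy = {s: None for s in IDENTICAL_SLOTS}
    healthy["T"] = bundle.get("T")
    # Fill the youngest slots first; the number of used slots is unchanged.
    for slot, ins in zip(by_age, instrs):
        healthy[slot] = ins
    return healthy

# Naïve bundle as on the slide, with made-up sensor readings in which
# PEX and PEY are the most aged.
naive = {"X": "MOV", "Y": "ASHR", "Z": None, "W": None, "T": None}
vth = {"X": 413.0, "Y": 411.0, "Z": 406.0, "W": 407.0}

print(level_bundle(naive, vth))
```

With these readings the MOV lands on Z and the ASHR on W, while two slots are still used per bundle, so the packing ratio, and with it the throughput, is the same as in the naïve bundle.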

Experimental Results
• Process variation and NBTI-induced VTH shift for 360 hours without power gating in HD 5870.
[Figure annotations: uniform, ∆VTH = 0.6 mV, VTH = 406 mV; inter-PEs, ∆VTH = 10 mV, VTH = 413 mV; extended lifetime.]
• Periodic execution of healthy kernels, compared to the naïve kernels:
  • Reduces the Vth shift by up to 49% (11%) and on average by 34% (6%) in the presence (absence) of power-gating support
  • Imposes 0% throughput penalty (maintaining the naïve ILP)

Conclusion
• An adaptive compiler-directed technique that uniformly distributes the stress of instructions throughout various VLIW resource slots.
• Equalizing the expected lifetime of each processing element by regenerating aging-aware healthy kernels that respond to the specific health state of the GPGPU while maintaining iso-throughput execution.
• Work in progress
  • Memory subsystems: reducing the Vth shift by up to 43% for register files of the GPGPU.
Thank you!

Aging-aware Kernel Adaptation Flow
1. Read the sensor measurements.
2. A static code analysis technique estimates the percentage of instructions that will be carried out on every PE (a linear calibration module later fits the predicted ∆VTH shift to the observed ∆VTH shift).
3. Finally, the uniform slot assignment assigns fewer instructions to higher-stressed slots and more instructions to lower-stressed slots.
4. The healthy kernel binary is generated.
[Figure: on the host CPU, the naïve kernel binary goes through a just-in-time disassembler to device-dependent assembly code, then static code analysis, wearout estimation, linear calibration, aging-aware slot assignment, and healthy code generation, which outputs the healthy kernel binary. Memory-mapped NBTI sensor banks on the GPGPU compute device feed a performance degradation measurement, ∆Vth−{X, …, W}[t] and τ{X, …, W}[t]. Slots are ranked by age (Age[1]: Vth-X[t], τX[t]; Age[2]: Vth-Y[t], τY[t]; …) and by predicted utilization (Util[1]: ∆Vth-Y[t+1], ∆τY[t+1]; Util[2]: ∆Vth-Z[t+1], ∆τZ[t+1]; …), using Pred-∆Vth−{X, …, W}[t+1] and the calibrated ∆Vth−{X, …, W}[t+1] and ∆τ{X, …, W}[t+1].]
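The linear calibration in step 2 can be illustrated with an ordinary least-squares line that maps predicted ∆Vth values to observed ones. The slides state only that the module is linear; the choice of least squares, the function names `fit_line` and `calibrate`, and all numbers below are assumptions made for this sketch.

```python
def fit_line(predicted, observed):
    """Least-squares fit of observed ≈ a * predicted + b; returns (a, b)."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    a = cov / var_p
    b = mean_o - a * mean_p
    return a, b

def calibrate(pred, a, b):
    """Apply the fitted line to a new prediction."""
    return a * pred + b

# Made-up history: predictions from static analysis vs. sensor readings (mV).
predicted_shift = [1.0, 2.0, 3.0, 4.0]
observed_shift = [2.5, 4.5, 6.5, 8.5]

a, b = fit_line(predicted_shift, observed_shift)
print(a, b)                    # slope and offset of the fitted line
print(calibrate(5.0, a, b))    # calibrated estimate for a new prediction
```

For this toy history the fit is exact (slope 2.0, offset 0.5), so a new prediction of 5.0 is calibrated to 10.5. In a real flow the fit would be refreshed each time new sensor readings arrive, and the calibrated shifts would then feed the slot ranking of step 3.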

Total execution time of adaptation flow
• Average execution time of the entire process, from the disassembler up to healthy code generation:
  • Kernel disassembly using online CAL (95% of total time)
  • Static code analysis: 220 K to 900 K cycles
  • Uniform slot assignment algorithm: ≤ 2 K cycles
• On average 13 milliseconds on a host machine with an Intel i5 at 2.67 GHz
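As a rough consistency check, the cycle counts can be converted to time. This sketch assumes the cycle counts are host CPU cycles at the quoted 2.67 GHz clock, which the slide does not state explicitly; the variable names are illustrative.

```python
CLOCK_HZ = 2.67e9  # host clock from the slide; cycle counts assumed to be host cycles

def cycles_to_ms(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to milliseconds at the given clock rate."""
    return cycles / clock_hz * 1e3

analysis_lo = cycles_to_ms(220_000)    # about 0.082 ms
analysis_hi = cycles_to_ms(900_000)    # about 0.337 ms
assign_max = cycles_to_ms(2_000)       # about 0.00075 ms

# Disassembly takes 95% of the 13 ms average, leaving 5% for everything else.
non_disassembly_budget = 0.05 * 13.0   # about 0.65 ms

print(f"static analysis: {analysis_lo:.3f} to {analysis_hi:.3f} ms")
print(f"slot assignment: at most {assign_max:.5f} ms")
print(f"non-disassembly budget: {non_disassembly_budget:.2f} ms")
```

Under that assumption even the worst-case analysis plus slot assignment (about 0.34 ms) fits inside the roughly 0.65 ms not spent on disassembly, which agrees with the 95% figure.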

AMD APP SDK 2.5 kernels with parameters
Kernel:
  Reduction (Rdn)
  BinarySearch (BSe)
  DwtHaar1D (DH1D)
  BitonicSort (BSo)
  FastWalshTransform (FWT)
  FloydWarshall (FW)
  BinomialOption (BO)
  DiscreteCosineTransform (DCT)
  MatrixTranspose (MT)
  MatrixMultiplication (MM)
  SobelFilter (SF)
  URNG
Parameter:
  N = 100,000
  N = 10,000
  N = 1,000
  N = 100
  X = 500
  Y = 500
  X = 300
  Y = 300
  Z = 300
  default input file