
Bridging the Computation Gap Between Programmable Processors and Hardwired Accelerators
Kevin Fan¹, Manjunath Kudlur², Ganesh Dasika, Scott Mahlke
Advanced Computer Architecture Laboratory, University of Michigan
¹ Parakinetics, Inc. ² Nvidia
University of Michigan Electrical Engineering and Computer Science

Introduction
[Figure: MPEG-4 decoder frame rate (frames/sec) and cell-phone battery life (hours) on ARM 9, ARM 11, TI C6x, and Core 2 Duo, with the 24 fps minimum marked]
• Emerging applications have high performance, cost, and energy demands
  – High-quality video
  – Flash animation
• Clear need for application- and domain-specific hardware

Flexibility
• Multiple instances of the same application
  – E.g., multiple video codecs: Xvid, DivX, FFMpeg
• Software algorithms change over time
• NRE
• Time-to-market

ASIC Alternatives
[Figure: design space with flexibility on one axis and efficiency/performance on the other, placing FPGAs, general-purpose processors, DSPs, domain-specific accelerators, and loop accelerators/ASICs. A "???" marks the open region: highly efficient, some programmability.]

How much postprogrammability is really required?
mdct.c in faad2, Version 1.39 and Version 1.40. Both share the loop:
  for (k = 0; k < N4; k++) {
    ...
    real = Z1[k][0];
    img = Z1[k][1];
    Z1[k][0] = real * sincos[k][0] - img * sincos[k][1];
    (scaling step)
  }
The scaling step differs:
  Version 1.39:  Z1[k][0] = Z1[k][0] << 1;
  Version 1.40:  if (b_scale) { Z1[k][0] = Z1[k][0] * scale; }

How much postprogrammability is really required?
mdct.c in faad2, Version 1.33 and Version 1.34. Both share the loop:
  for (k = 0; k < N4; k++) {
    ...
    uint16_t n = k << 1;
    ComplexMult(...);
    (output assignments)
  }
The output assignments differ only in sign:
  Version 1.33:
    X_out[         n] =  RE(x);
    X_out[N2 - 1 - n] = -IM(x);
    X_out[N2 +     n] =  IM(x);
    X_out[N  - 1 - n] = -RE(x);
  Version 1.34:
    X_out[         n] = -RE(x);
    X_out[N2 - 1 - n] =  IM(x);
    X_out[N2 +     n] = -IM(x);
    X_out[N  - 1 - n] =  RE(x);

How much postprogrammability is really required?
H.264 reference implementation, Versions 13.0, 13.1, and 13.2:
  for (coeff_ctr = 0; coeff_ctr < 64; coeff_ctr++) {
    i = pos_scan[coeff_ctr][0];
    j = pos_scan[coeff_ctr][1];
    run++;
    ilev = 0;
    if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) {
      MCcoeff = MC(coeff_ctr);
      runs[MCcoeff]++;
    }
    m7 = &curr_res[block_y + j][block_x];
    level = iabs(m7[i]);
    if (img->AdaptiveRounding) {
      fadjust8x8[j][block_x+i] = 0;
    }
    if (level != 0) {
      nonzero = TRUE;
      if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) {
        *coeff_cost += MAX_VALUE;
        img->cofAC[b8+pl_off][MCcoeff][0][scan_poss[MCcoeff]  ] = isignab(level, m7[i]);
        img->cofAC[b8+pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff];
        ++scan_pos;
        runs[MCcoeff] = -1;
      } else {
        *coeff_cost += MAX_VALUE;
        ACLevel[scan_pos  ] = isignab(level, m7[i]);
        ACRun  [scan_pos++] = run;
        run = -1; // reset zero level counter
      }
      level = isignab(level, m7[i]);
      ilev = level;
    }
  }
The slide repeats this loop once per version. The first copy indexes img->cofAC with [b8+pl_off]; the other two copies use [pl_off]. The rest of the loop is identical.
• Mostly minor changes to loops
  – Bug fixes
  – Revisions
• Possible to design custom HW with minor programmability extensions

Programmable Loop Accelerator
• Generalize accelerator without losing efficiency
[Figure: design space with flexibility on one axis and efficiency/performance on the other, placing FPGAs, general-purpose processors, DSPs, and domain-specific accelerators; the "???" region beside loop accelerators and ASICs is now labelled "Programmable"]

Designing Loop Accelerators
[Figure: a C code loop is mapped to hardware built from function units (+, &, <<, *), MEM units with local memories, a BR unit, a CRF, and point-to-point connections]

Loop Accelerator Architecture
[Figure: datapath with function units (+, &), a MEM unit with local memory, BR, and CRF joined by point-to-point connections; an FSM drives the control signals]
• Hardware realization of a modulo-scheduled loop
• Parameterized execution resources, storage, and connectivity
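To make "modulo-scheduled loop" concrete, the following is a minimal Python sketch of two textbook checks from modulo scheduling: the resource-constrained lower bound on the initiation interval (II), and a test that no two operations claim the same function unit in the same slot modulo II. The function names, the operation kinds, and the example FU counts are illustrative assumptions; they are not taken from the talk or its synthesis system.

```python
from collections import Counter
from math import ceil


def res_mii(op_kinds, fu_counts):
    """Resource-constrained lower bound on II.

    op_kinds:  one entry per loop operation, e.g. "add", "mul", "ld".
    fu_counts: number of function units available for each kind.
    """
    demand = Counter(op_kinds)
    return max(ceil(n / fu_counts[kind]) for kind, n in demand.items())


def modulo_conflicts(schedule, ii):
    """Return pairs of ops that use the same FU in the same slot modulo II.

    schedule maps op name -> (fu name, issue time).
    """
    seen = {}
    clashes = []
    for op, (fu, time) in schedule.items():
        key = (fu, time % ii)
        if key in seen:
            clashes.append((seen[key], op))
        else:
            seen[key] = op
    return clashes


# Hypothetical loop: three adds, one multiply, two loads.
ops = ["add", "add", "add", "mul", "ld", "ld"]
fus = {"add": 2, "mul": 1, "ld": 1}
print(res_mii(ops, fus))

# Two ops on FU0 at times 0 and 2 collide when II = 2 but not when II = 3.
sched = {"a": ("FU0", 0), "b": ("FU0", 2), "c": ("FU1", 1)}
print(modulo_conflicts(sched, 2))
print(modulo_conflicts(sched, 3))
```

A loop accelerator fixes II and the FU set at design time, so a schedule with no modulo conflicts can be wired directly into hardware.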

LA Scheduling
[Figure: dataflow graphs (loads, adds, multiplies, and a subtract; operations numbered 1-6) next to a schedule table of time versus FU 0-FU 3, on hardware with +, Mem, and x units built for the FIR loop kernel]
Callouts:
• Mult result has longer lifetime
• Paths missing
• No subtract

LA Datapath Restrictions
[Figure: slow-down versus graph difference (0-10) for two kinds of restriction: functionality and connectivity]

Programmable Loop-Accelerator Architecture
[Figure: PLA datapath with generalized function units (+/-, &/|), MEM with local memory, BR, CRF, SRF, rotating registers (RR), literals, a control FSM with control signals, point-to-point connections, a bus, and a memory bus]

               LA                      PLA
Functionality  Custom FU set           Generalized FUs + MOVs
Storage        Limited size, no addr.  Rotating reg. files
Connectivity   Point-to-point          Bus + port-swapping
Control        Hardwired control       Lit. reg. file + control mem
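The slide names rotating register files without describing them. As a rough illustration, here is a minimal Python sketch of the usual rotating-register idea: a base pointer is decremented once per loop iteration, so a value written to logical register r in one iteration is visible as r + 1 in the next. The class name, the register count, and the decrementing-base convention are assumptions borrowed from common modulo-scheduled architectures; the talk does not specify the PLA's exact scheme.

```python
class RotatingRegisterFile:
    """Toy model of a rotating register file (assumed convention)."""

    def __init__(self, size):
        self.size = size
        self.base = 0              # rotating base, changed once per iteration
        self.regs = [None] * size  # physical storage

    def _physical(self, logical):
        # Logical register numbers are offset by the rotating base.
        return (logical + self.base) % self.size

    def write(self, logical, value):
        self.regs[self._physical(logical)] = value

    def read(self, logical):
        return self.regs[self._physical(logical)]

    def rotate(self):
        # Called at each iteration boundary of the software-pipelined loop.
        self.base = (self.base - 1) % self.size


rrf = RotatingRegisterFile(4)
rrf.write(0, "value from iteration 0")
rrf.rotate()
rrf.write(0, "value from iteration 1")
print(rrf.read(1))  # the iteration-0 value, now one register "older"
print(rrf.read(0))  # the iteration-1 value
```

Because each iteration writes the same logical register without overwriting the previous iteration's value, results with long lifetimes can stay live across overlapped iterations.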

Experimental Setup
• Wide variety of benchmarks
  – DSP
  – Media
  – Linear algebra
• Baseline LAs:
  – Used LA synthesis system to generate HDL
  – 200 MHz @ 0.13 µm
• Comparisons:
  – PLAs (200 MHz @ 0.13 µm)
  – OR-1200 (300 MHz @ 0.13 µm)

Area
[Figure: area (mm²) of the LA, the PLA, and the OR-1200 for each benchmark (bfform, dcac, dequant, fft, fir, fmradio, fsed, heat, lu, sobel) and the average]

Power Consumption
[Figure: normalized power consumption of the PLA, the OR-1200, and the OR-1200-equiv for each benchmark (bfform, dcac, dequant, fft, fir, fmradio, fsed, heat, lu, sobel) and the average]

Power Breakdown
[Figure: relative power overhead for each benchmark (bfform, dcac, dequant, fft, fir, fmradio, fsed, heat, lu, sobel) and the average, broken down into Ctrl, Mux, FU, Bus, and RR]

Scheduling for PLAs
[Figure: Loop 1 passes through the synthesis system to produce the hardware; Loop 2 passes through the compiler + SMT solver to run on that hardware]
• Generalize accelerator architecture
• Map new loops to existing hardware

PLA Scheduling
[Figure: LA hardware (+, Mem, +, x units) next to PLA hardware (+/-, Mem, x, +/- units plus a bus)]

PLA Scheduling
[Figure: PLA hardware (+/-, Mem, +/-, x units plus a bus) with a loop's operations numbered 1-6, including an LD, a MOV, and bus transfers, next to a schedule table of time versus FU 0-FU 3 and the bus]
SMT constraints:
$sched\_time(j) \ge sched\_time(i) + lat(i) - dist(i,j) \cdot II$
$(X_{i,f_i,t_i} \wedge X_{j,f_j,t_j}) \rightarrow (S_j \cdot II + t_j \ge S_i \cdot II + t_i + lat(i) - dist(i,j) \cdot II)$
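The dependence constraint on this slide says that a consumer j may not issue before its producer i has finished, with loop-carried dependences relaxed by dist(i, j) iterations of II cycles each. The Python sketch below only checks a given schedule against that inequality; it is not the SMT formulation used in the talk, and the function name, the example operations, and their latencies are hypothetical.

```python
def dependence_violations(sched_time, latency, edges, ii):
    """List dependence edges that break
    sched_time[j] >= sched_time[i] + latency[i] - dist * ii.

    edges: (i, j, dist) tuples; j consumes i's result dist iterations later.
    Returns (i, j, earliest legal time for j) for each violated edge.
    """
    bad = []
    for i, j, dist in edges:
        earliest = sched_time[i] + latency[i] - dist * ii
        if sched_time[j] < earliest:
            bad.append((i, j, earliest))
    return bad


# Hypothetical loop: load -> multiply -> accumulate, where the accumulate
# feeds itself in the next iteration (dist = 1).
latency = {"ld": 2, "mul": 3, "add": 1}
edges = [("ld", "mul", 0), ("mul", "add", 0), ("add", "add", 1)]

good = {"ld": 0, "mul": 2, "add": 5}
print(dependence_violations(good, latency, edges, ii=2))

# Issuing the multiply one cycle early breaks the ld -> mul dependence.
early = {"ld": 0, "mul": 1, "add": 5}
print(dependence_violations(early, latency, edges, ii=2))
```

A solver-based scheduler would treat the issue times as unknowns and search for an assignment that leaves this list empty while also respecting the FU and bus limits.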

Programmability
[Figure: number of perturbations (0-20) for each benchmark: bfform, dcac, dequant, fft, fir, fmradio, fsed, heat, lu, sobel. Callouts mark loops that are "small, with complex communication" and "small, with simple communication".]

Power Efficiency
[Figure: performance (MIPS) versus power (mW), with each design labelled by its efficiency in MIPS/mW. Legible labels: TI C6x: 5 MIPS/mW; ARM 11: 3 MIPS/mW; OR1K: 2 MIPS/mW; Itanium 2: 0.08 MIPS/mW. Points for the LA, the PLA, Core 2, and Tensilica Diamond also appear.]
• Faster than ARM 11 AND 8x more efficient!

Conclusion
• Programmable loop accelerators retain efficiency while being programmable
• Loop accelerator datapath generalized in a cost-effective way
• Significant benefits over GPP:
  – 4x-34x improved power efficiency
  – 30x improved area efficiency

Questions?
http://cccp.eecs.umich.edu
