
Warp Processing – Towards FPGA Ubiquity

Frank Vahid
Professor, Department of Computer Science and Engineering, University of California, Riverside
Associate Director, Center for Embedded Computer Systems, UC Irvine

Work supported by the National Science Foundation, the Semiconductor Research Corporation, Xilinx, Intel, and Freescale.

Contributing students: Roman Lysecky (Ph.D. 2005, now asst. prof. at U. Arizona), Greg Stitt (Ph.D. 2006), Kris Miller (M.S. 2007), David Sheldon (3rd-year Ph.D.), Ryan Mannion (2nd-year Ph.D.), Scott Sirowy (1st-year Ph.D.)

Outline
- FPGAs
  - Why they’re great
  - Why they’re not ubiquitous yet
- Hiding FPGAs from programmers
- Warp processing
  - Binary decompilation
  - Just-in-time FPGA compilation
- Directions

FPGAs
FPGA: Field-Programmable Gate Array
- Implement a circuit by downloading particular bits
  - An N-address memory ("LUT") implements N-input combinational logic
  - A register-controlled switch matrix (SM) connects LUTs
- FPGA fabric
  - Thousands of LUTs and SMs, increasingly with additional hard-core components such as multipliers and RAM
  - CAD tools automatically map the desired circuit onto the FPGA fabric
[Figure: a 4x2 memory used as a LUT with inputs a, b and outputs F, G; a 2x2 switch matrix; and an FPGA built from LUTs connected through SMs, each configured by downloaded bits]
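The idea that a small memory can stand in for combinational logic can be sketched in C. This is a minimal illustration, not the behavior of any particular FPGA: the type `Lut4x2`, the function names, the bit layout (bit 1 = F, bit 0 = G), and the half-adder configuration are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* A 2-input, 2-output LUT modelled as a 4-word x 2-bit memory.
   word[addr] holds the outputs for address addr = (a << 1) | b.
   Assumed layout: bit 1 is output F, bit 0 is output G. */
typedef struct {
    uint8_t word[4];
} Lut4x2;

/* "Downloading bits": fill the memory from two 4-entry truth tables. */
Lut4x2 lut_configure(const uint8_t f_bits[4], const uint8_t g_bits[4]) {
    Lut4x2 lut;
    for (int addr = 0; addr < 4; addr++) {
        lut.word[addr] =
            (uint8_t)(((f_bits[addr] & 1u) << 1) | (g_bits[addr] & 1u));
    }
    return lut;
}

/* Evaluating the logic is just a memory read at the address formed by the inputs. */
unsigned lut_f(Lut4x2 lut, unsigned a, unsigned b) {
    assert(a <= 1u && b <= 1u);
    return (lut.word[(a << 1) | b] >> 1) & 1u;
}

unsigned lut_g(Lut4x2 lut, unsigned a, unsigned b) {
    assert(a <= 1u && b <= 1u);
    return lut.word[(a << 1) | b] & 1u;
}

/* Hypothetical configuration: F = a XOR b, G = a AND b (a half adder). */
Lut4x2 lut_half_adder(void) {
    static const uint8_t xor_bits[4] = {0, 1, 1, 0};
    static const uint8_t and_bits[4] = {0, 0, 0, 1};
    return lut_configure(xor_bits, and_bits);
}
```

Loading different truth tables into the same memory yields a different circuit, which is the sense in which the hardware is programmed by bits alone.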

FPGAs are "Programmable" like Microprocessors – Just Download Bits
- Microprocessor binaries (001010010 ...): bits loaded into program memory
- FPGA "binaries" (01110100 ...), more commonly known as "bitstreams": bits loaded into LUTs and SMs

FPGA – Why (Sometimes) Better than Microprocessor

C code for bit reversal (masks written out at full 32-bit width, to match the 16-bit first swap):
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
    x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
    x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
    x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);

Compilation produces the binary:
    sll $v1[3], $v0[2], 0x10
    srl $v0[2], 0x10
    or  $v0[2], $v1[3], $v0[2]
    srl $v1[3], $v0[2], 0x8
    and $v1[3], $t5[13]
    sll $v0[2], 0x8
    and $v0[2], $t4[12]
    or  $v0[2], $v1[3], $v0[2]
    srl $v1[3], $v0[2], 0x4
    and $v1[3], $t3[11]
    sll $v0[2], 0x4
    and $v0[2], $t2[10]
    ...

[Figure: circuit for bit reversal, connecting the original X value to the bit-reversed X value]

- Processor: requires between 32 and 128 cycles
- FPGA: requires only 1 cycle (speedup of 32x to 128x)
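Why the circuit is so cheap becomes clearer from a bit-at-a-time formulation: output bit i is simply input bit (width - 1 - i), which in hardware is only a crossing of wires. The sketch below is an illustrative reference version, not the slide's code; the function name `reverse_bits_loop` and the `width` parameter are assumptions for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the low `width` bits of x, one bit per iteration.
   Software visits the bits in sequence; a circuit can route all of
   them at once because each output bit depends on one input bit. */
uint32_t reverse_bits_loop(uint32_t x, unsigned width) {
    assert(width >= 1u && width <= 32u);
    uint32_t result = 0;
    for (unsigned i = 0; i < width; i++) {
        uint32_t bit = (x >> i) & 1u;      /* input bit i */
        result |= bit << (width - 1u - i); /* becomes output bit width-1-i */
    }
    return result;
}
```

Applying the function twice with the same width returns the original low bits, which is a convenient sanity check for any faster variant.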

FPGA: Why (Sometimes) Better than Microprocessor

C code for FIR filter:
    for (i = 0; i < 128; i++)
        y[i] += c[i] * x[i];
    ...

[Figure: circuit for FIR filter, a row of multipliers feeding adders]

- Processor: 1000’s of instructions, several thousand cycles
- FPGA: ~7 cycles, speedup > 100x

In general, an FPGA is better due to the circuit's concurrency, from the bit level to the task level.
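One plausible reading of the "~7 cycles" figure, offered here as an assumption rather than something the slide states, is that all 128 products are formed concurrently and then summed by a balanced adder tree, which has log2(128) = 7 levels. The sketch below treats the filter as a 128-tap multiply-accumulate and counts the tree levels; the names `fir_sequential` and `fir_tree` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TAPS 128

/* Sequential form: one multiply-accumulate per loop iteration. */
long fir_sequential(const int c[TAPS], const int x[TAPS]) {
    long sum = 0;
    for (size_t i = 0; i < TAPS; i++) {
        sum += (long)c[i] * x[i];
    }
    return sum;
}

/* Tree form: every product is independent, so a circuit can compute
   them side by side; the sums are then combined pairwise, level by
   level. *levels receives the number of adder levels used. */
long fir_tree(const int c[TAPS], const int x[TAPS], int *levels) {
    assert(levels != NULL);
    long partial[TAPS];
    for (size_t i = 0; i < TAPS; i++) {
        partial[i] = (long)c[i] * x[i];
    }
    int depth = 0;
    for (size_t width = TAPS; width > 1; width /= 2) {
        /* Each pass models one level of adders operating concurrently. */
        for (size_t i = 0; i < width / 2; i++) {
            partial[i] = partial[2 * i] + partial[2 * i + 1];
        }
        depth++;
    }
    *levels = depth;
    return partial[0];
}
```

Both functions return the same sum; only the dependency structure differs, and that structure is what a circuit can exploit.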

Extensive Studies over Past Decade
- Large speedups on many important applications
  - See the ACM/SIGDA Int. Symp. on FPGAs
- So why aren't FPGAs ubiquitous?

Why FPGAs aren’t Mainstream
- Cost: but improving yearly
- Power: but improving yearly, and energy benefits too
- Extra chip: but integration continues
- Programming methodology
[Figure: 1 million system gate FPGA cost. Source: Xilinx]

Why FPGAs aren’t Mainstream
- Cost
- Power
- Extra chip
- Programming methodology
  - Though tremendous progress in past decade
[Figure: design flow from an application (C/C++/Java/SystemC/Handel-C/Streams-C/...) through automated hardware/software partitioning.
 Microprocessor path: compilation (1960s, 1970s), assembly code, assembling and linking (1950s, 1960s), microprocessor binary, downloading.
 FPGA path: behavioral synthesis (1990s), register transfers, RT synthesis (1980s, 1990s), logic equations / FSMs, logic synthesis and physical design (1970s, 1980s), FPGA binary, downloading.]

So What’s the Holdup?
- FPGAs require special compilers
  - Includes synthesis, tech. map, place & route
- Limits adoption: desktop world dominates
  - 100 software writers for every CAD user
  - Millions of compiler seats worldwide, vs. 15,000 CAD seats
[Figure: an application passes through a standard compiler to a microprocessor binary for the processor, and through a special compiler to an FPGA binary for the FPGA]


Can we Hide FPGAs from Programmers and Standard Tools?
- Example: radically different x86 architectures hidden from programmers and tools
  - RISC architecture, VLIW architecture
  - All execute standard x86 binaries
  - On-chip tools dynamically translate binary to particular architecture
- Idea: hide FPGA from programmers and tools
  - Download standard binary
  - Have on-chip tools dynamically translate binary (portions) to FPGA
  - We call this Warp Processing
[Figure: SW, standard compiler, binary, profiling, translator, processor and FPGA, with the annotation "Traditional partitioning done here"]

Warp Processing Idea
Step 1: Initially, software binary loaded into instruction memory
[Figure: chip containing profiler, µP, I Mem, D$, FPGA, and on-chip CAD]

Software binary:
    Mov reg3, 0
    Mov reg4, 0
    loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg6
    Add reg3, 1
    Beq reg3, 10, -5
    Ret reg4

Warp Processing Idea
Step 2: Microprocessor executes instructions in software binary

Warp Processing Idea
Step 3: Profiler monitors instructions and detects critical regions in binary
[Figure: the profiler observes the repeated beq and add instructions of the loop; critical loop detected]
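The slides do not say how the profiler identifies a critical region. A common approach, assumed here purely for illustration, is to count taken backward branches, since each one usually marks another iteration of a loop. The table size and all names below (`LoopProfiler`, `profiler_observe_branch`, `profiler_hottest_loop`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PROFILE_SLOTS 16 /* hypothetical size of the on-chip table */

typedef struct {
    uint32_t target[PROFILE_SLOTS]; /* loop start addresses seen so far */
    uint32_t count[PROFILE_SLOTS];  /* taken backward branches per loop */
    int used;
} LoopProfiler;

void profiler_init(LoopProfiler *p) {
    assert(p != NULL);
    p->used = 0;
}

/* Called for every taken branch; only backward branches are counted. */
void profiler_observe_branch(LoopProfiler *p, uint32_t pc, uint32_t target) {
    assert(p != NULL);
    if (target > pc) {
        return; /* forward branch: not a loop iteration */
    }
    for (int i = 0; i < p->used; i++) {
        if (p->target[i] == target) {
            p->count[i]++;
            return;
        }
    }
    if (p->used < PROFILE_SLOTS) {
        p->target[p->used] = target;
        p->count[p->used] = 1;
        p->used++;
    }
    /* Table full: further new loops are ignored in this sketch. */
}

/* Writes the most frequently iterated loop's start address to *target.
   Returns 1 on success, 0 if no backward branch has been observed. */
int profiler_hottest_loop(const LoopProfiler *p, uint32_t *target) {
    assert(p != NULL && target != NULL);
    int best = -1;
    for (int i = 0; i < p->used; i++) {
        if (best < 0 || p->count[i] > p->count[best]) {
            best = i;
        }
    }
    if (best < 0) {
        return 0;
    }
    *target = p->target[best];
    return 1;
}
```

A small fixed-size table keeps the hardware cost low, at the price of missing loops first seen after the table has filled.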

Warp Processing Idea
Step 4: On-chip CAD reads in critical region

Warp Processing Idea
Step 5: On-chip CAD, labeled Dynamic Part. Module (DPM), decompiles critical region into control data flow graph (CDFG)

Decompiled region:
    reg3 := 0
    reg4 := 0
    loop:
    reg4 := reg4 + mem[reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4

Warp Processing Idea
Step 6: On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit
[Figure: the decompiled loop replaced by a tree of adders]

Warp Processing Idea
Step 7: On-chip CAD maps circuit onto FPGA
[Figure: the adder circuit placed onto CLBs connected by SMs]

Warp Processing Idea
Step 8: On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to “warp” by an order of magnitude or more

Updated software binary:
    Mov reg3, 0
    Mov reg4, 0
    loop:
    // instructions that interact with FPGA
    Ret reg4

[Figure: software-only vs. “warped” execution]

Warp Processing Idea
Likely multiple microprocessors per chip, serviced by one on-chip CAD block
[Figure: chip with several µPs, profiler, I Mem, D$, FPGA, and a single on-chip CAD block]

Warp Processing Challenges
- Two key challenges
  - Can we decompile binaries to recover enough high-level constructs to create fast circuits on FPGAs?
  - Can we just-in-time (JIT) compile to FPGAs using limited on-chip compute resources?
[Figure: on-chip CAD flow with stages binary, profiling & partitioning, decompilation, synthesis, std. HW binary, JIT FPGA compilation, and FPGA binary, plus a binary updater that produces the micropr. binary; the chip contains profiler, µP, I$, D$, FPGA, and on-chip CAD]

Decompilation
- Synthesis from binary has a potential hurdle
  - High-level information (e.g., loops, arrays) lost during compilation
  - Direct translation of assembly to circuit: huge overheads
  - Need to recover high-level information
[Figure: overhead of microprocessor/FPGA solution WITHOUT decompilation, vs. microprocessor alone]

Decompilation
- Solution: recover high-level information from binary via decompilation
  - Adapted extensive previous work (for different purposes)
  - Developed new decompilation methods also
  - Ph.D. work of Greg Stitt (Ph.D. UCR 2006)
  - Numerous publications: http://www.cs.ucr.edu/~vahid/pubs

Original C code (accumulator initialized to 0, matching "Mov reg4, 0" in the assembly):
    long f( short a[10] ) {
        long accum = 0;
        for (int i=0; i < 10; i++) {
            accum += a[i];
        }
        return accum;
    }

Corresponding assembly:
    Mov reg3, 0
    Mov reg4, 0
    loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg6
    Add reg3, 1
    Beq reg3, 10, -5
    Ret reg4

Decompilation stages:

1. Control/data flow graph creation
    reg3 := 0
    reg4 := 0
    loop:
    reg1 := reg3 << 1
    reg5 := reg2 + reg1
    reg6 := mem[reg5 + 0]
    reg4 := reg4 + reg6
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4

2. Data flow analysis
    reg3 := 0
    reg4 := 0
    loop:
    reg4 := reg4 + mem[reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4

3. Function recovery
    long f( long reg2 ) {
        int reg3 = 0;
        int reg4 = 0;
        loop:
        reg4 = reg4 + mem[reg2 + (reg3 << 1)];
        reg3 = reg3 + 1;
        if (reg3 < 10) goto loop;
        return reg4;
    }

4. Control structure recovery
    long f( long reg2 ) {
        long reg4 = 0;
        for (long reg3 = 0; reg3 < 10; reg3++) {
            reg4 += mem[reg2 + (reg3 << 1)];
        }
        return reg4;
    }

5. Array recovery
    long f( short array[10] ) {
        long reg4 = 0;
        for (long reg3 = 0; reg3 < 10; reg3++) {
            reg4 += array[reg3];
        }
        return reg4;
    }

The recovered code and the original C code are almost identical representations.
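The claim that the recovered form is equivalent to the source can be checked with a small sketch. It compares the original function with the register-level loop from before array recovery, where the array is still seen as raw byte-addressed memory. The byte-array model of `mem`, the helper `mem_load16`, and the function names are assumptions for illustration, and the sketch assumes a 2-byte `short`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Source-level function, with the accumulator initialized. */
long f_original(const short a[10]) {
    long accum = 0;
    for (int i = 0; i < 10; i++) {
        accum += a[i];
    }
    return accum;
}

/* Models mem[byte_addr] as a 16-bit load from byte-addressed memory. */
static int16_t mem_load16(const unsigned char *mem, long byte_addr) {
    int16_t value;
    memcpy(&value, mem + byte_addr, sizeof value);
    return value;
}

/* Register-level form: reg2 is the array's base byte address, and
   (reg3 << 1) scales the index by the 2-byte element size. */
long f_decompiled(const unsigned char *mem, long reg2) {
    assert(mem != NULL);
    assert(sizeof(short) == 2); /* assumption stated in the lead-in */
    long reg4 = 0;
    for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += mem_load16(mem, reg2 + (reg3 << 1));
    }
    return reg4;
}
```

Array recovery amounts to noticing that the shift by 1 is an element-size scaling, so `mem[reg2 + (reg3 << 1)]` can be rewritten as `array[reg3]`.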

Decompilation Results vs. C
- Compared with synthesis from C
  - Synthesis after decompilation often quite similar
  - Almost identical performance, small area overhead
Source: FPGA 2005

Decompilation Results on Optimized H.264
In-depth study with Freescale
- Used highly-optimized benchmark
- Results: binary approach competitive
  - Speedups compared to ARM9 software: binary 2.48, C 2.53
  - Decompilation recovered nearly all high-level information needed for partitioning and synthesis

Simple Coding Guidelines Bring Speedups Closer to Ideal
- Interesting discovery during H.264 study: C style limited speedup
  - Orthogonal to binary vs. C issue: coding style hurt both
- Developed simple coding guidelines
  - Rewritten software: 20 minutes, and only ~3% slower than original
  - New speedups: binary 6.55, C 6.56
- Binary still competitive with C
- Following guidelines not required, but helps any approach targeting FPGAs


JIT FPGA Compilation
- Developed ultra-lean CAD heuristics for synthesis, placement, routing, and technology mapping; simultaneously developed CAD-oriented FPGA
  - e.g., our router (ROCR) is 10x faster and uses 20x less memory than the popular VPR tool, at the cost of a 30% longer critical path. Similar results for synthesis and placement
- Ph.D. work of Roman Lysecky (Ph.D. UCR 2005, now Asst. Prof. at Univ. of Arizona)
- Numerous publications: http://www.cs.ucr.edu/~vahid/pubs
Source: DAC’04

JIT FPGA Compilation

Tool                                          Time     Memory
Xilinx ISE                                    9.1 s    60 MB
Riverside JIT FPGA tools                      0.2 s    3.6 MB
Riverside JIT FPGA tools on a 75 MHz ARM7     1.4 s    3.6 MB

Overall Warp Processing Results
Performance speedup (most frequent kernel only), relative to SW-only execution
- Average kernel speedup of 41, vs. 21 for Virtex-E
- Simpler FPGA fabric yields faster HW circuits

Overall Warp Processing Results
Performance speedup (overall, multiple kernels), relative to SW-only execution
- Average speedup of 7.4
- Energy reduction of 38% - 94%
- Assuming 100 MHz ARM, fabric in same technology and clocked at rate determined by synthesis
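Overall speedup is lower than kernel speedup because only part of the run time is accelerated; Amdahl's law gives the standard relation. The sketch below is a generic illustration with hypothetical inputs. The slides do not report what fraction of run time the kernels occupy, so it is not an attempt to reproduce the numbers above, and the function name `overall_speedup` is an assumption.

```c
#include <assert.h>

/* Amdahl's law: if a fraction `kernel_fraction` of software run time
   is accelerated by `kernel_speedup`, the remainder still runs at the
   original speed and bounds the overall gain. */
double overall_speedup(double kernel_fraction, double kernel_speedup) {
    assert(kernel_fraction >= 0.0 && kernel_fraction <= 1.0);
    assert(kernel_speedup > 0.0);
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup);
}
```

For instance, a kernel that takes 90% of run time and is accelerated 41x gives an overall speedup of roughly 8x under this model, since the untouched 10% dominates the remaining time.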

FPGA Ubiquity via Obscurity
- FPGA is hidden from languages and tools
- Thus, ANY microprocessor platform extendible with FPGA
  - So any program can potentially be sped up by FPGAs
  - No new languages, no new tools
  - Maintains "ecosystem" among application, tool, and architecture developers
[Figure: standard binaries shared among architectures, applications, and tools; SW, standard compiler, binary, profiling, translator, processor and FPGA]


Directions – What’s Next?
- Immediate future: develop warp processing using benchmarks from other domains
  - Desktop, server, scientific
  - With partners: IBM, Freescale
  - May require new decompilation techniques

Directions – What’s Next?
- Application-specific FPGA
  - Tune FPGA fabric to application (or domain)
    - Parameters: LUTs/CLB, LUT size
    - Many more possible, e.g., switch matrix size, # long vs. short channels
[Figure: delay for each configuration (LUTs/CLB, and LUT sizes 2-7) for one application]
[Figure: delay & area when tuning parameters for best delay for each app, rather than for all apps]

Directions – What’s Next?
- Parallel benchmarks
  - NAS, SPEComp, Splash, ...
- Map each thread to custom FPGA circuit
  - Huge potential speedups
[Figure: sample speedups from other works]
[Figure: threads Thrd 1 through Thrd N of a SW binary, compiled with a standard compiler, mapped onto FPGA circuits by on-chip CAD alongside multiple µPs]

Directions – What’s Next?
- With JIT FPGA compiler, what else is possible?
- Implications for existing applications?
  - Image processing, neural networks, ...
  - Add FPGA hardware to improve performance, like expandable memory?
- Standard binaries for FPGAs?
  - Rather than extracting circuit from sequential code, distribute circuit binary itself, use JIT FPGA compiler to best map to FPGA resources
[Figure: a circuit binary passed through translators to different FPGAs]

Summary
- FPGA future looks bright
- Hiding FPGA via warp processing is feasible
  - Decompilation can recover high-level constructs to yield speedups competitive with source-level
  - JIT FPGA compilation can be made sufficiently lean
- Many possible directions exist that may use FPGAs to gain ultra-high performance without ultra-high engineering or hardware costs
- Publications can be found at: http://www.cs.ucr.edu/~vahid/pubs