The New Software: Invisible Ubiquitous FPGAs that Enable Next-Generation Embedded Systems

Frank Vahid
Professor, Department of Computer Science and Engineering, University of California, Riverside
Associate Director, Center for Embedded Computer Systems, UC Irvine

Work supported by the National Science Foundation, the Semiconductor Research Corporation, Xilinx, Intel, and Freescale

Contributing students: Roman Lysecky (Ph.D. 2005, now asst. prof. at U. Arizona), Greg Stitt (Ph.D. 2006), David Sheldon (3rd-year Ph.D.), Ryan Mannion (2nd-year Ph.D.), Scott Sirowy (1st-year Ph.D.)

Frank Vahid, UC Riverside

Outline
- FPGAs: The New Software
  - Why they're great
  - Why they're not ubiquitous yet
- Hiding FPGAs from programmers
  - Warp processing
    - Binary decompilation
    - Just-in-time FPGA compilation
- Towards Standard Binaries for FPGAs

FPGAs
FPGA: Field-Programmable Gate Array
- Implement circuit by downloading particular bits
  - A memory with N address inputs (2^N words), used as a lookup table ("LUT"), implements any N-input combinational logic function
  - Register-controlled switch matrix (SM) connects LUTs
- FPGA fabric
  - Thousands of LUTs and SMs, increasingly with additional hard-core components like multipliers, RAM, etc.
  - CAD tools automatically map the desired circuit onto the FPGA fabric
[Figure: a 4x2 memory used as a LUT with inputs a, b and outputs F, G; a 2x2 switch matrix; an FPGA fabric of LUTs and SMs configured by the downloaded bits.]
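The LUT idea above can be modeled in a few lines of C. This is a minimal sketch, not the slide's exact figure: the type `lut4x2`, the functions `lut_load`, `lut_f`, `lut_g`, and the half-adder configuration bits are all illustrative assumptions. It shows a 4x2 memory whose contents alone decide which two 2-input functions the "hardware" computes.

```c
#include <stdint.h>

/* A 4-word x 2-bit memory: inputs (a, b) form the address,
 * the two stored bits are the outputs F (bit 1) and G (bit 0). */
typedef struct {
    uint8_t word[4];
} lut4x2;

/* "Download" four 2-bit configuration words into the LUT memory. */
lut4x2 lut_load(const uint8_t bits[4]) {
    lut4x2 lut;
    for (int addr = 0; addr < 4; addr++) {
        lut.word[addr] = bits[addr] & 0x3u;
    }
    return lut;
}

/* Output F: bit 1 of the word addressed by (a, b). */
unsigned lut_f(const lut4x2 *lut, unsigned a, unsigned b) {
    unsigned addr = ((a & 1u) << 1) | (b & 1u);
    return (lut->word[addr] >> 1) & 1u;
}

/* Output G: bit 0 of the word addressed by (a, b). */
unsigned lut_g(const lut4x2 *lut, unsigned a, unsigned b) {
    unsigned addr = ((a & 1u) << 1) | (b & 1u);
    return lut->word[addr] & 1u;
}

/* Hypothetical configuration: F = a XOR b, G = a AND b (a half adder).
 * Index is the address (a << 1) | b. */
const uint8_t HALF_ADDER_BITS[4] = { 0x0, 0x2, 0x2, 0x1 };
```

Loading a different four-word table into the same structure yields a different pair of logic functions, which is the sense in which the circuit is implemented "by downloading bits".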

FPGAs are "Programmable" like Microprocessors: Just Download Bits
- Microprocessor binaries: bits loaded into program memory
- FPGA "binaries" (more commonly known as "bitstreams"): bits loaded into LUTs and SMs
[Figure: both kinds of binary sit on the "software" side; the processor and the FPGA they are downloaded into are the "hardware".]

FPGA: Why (Sometimes) Better than Microprocessor

C code for bit reversal:
    x = (x >> 16)               | (x << 16);
    x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
    x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
    x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
    x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);

Binary after compilation (excerpt, as shown on the slide):
    sll $v1[3], $v0[2], 0x10
    srl $v0[2], 0x10
    or  $v0[2], $v1[3], $v0[2]
    srl $v1[3], $v0[2], 0x8
    and $v1[3], $t5[13]
    sll $v0[2], 0x8
    and $v0[2], $t4[12]
    or  $v0[2], $v1[3], $v0[2]
    srl $v1[3], $v0[2], 0x4
    and $v1[3], $t3[11]
    sll $v0[2], 0x4
    and $v0[2], $t2[10]
    ...

- Processor: requires between 32 and 128 cycles
- FPGA: requires only 1 cycle (speedup of 32x to 128x)
[Figure: circuit for bit reversal; wires connect each bit of the original X value to the mirrored position in the bit-reversed X value.]
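The bit-reversal comparison above can be tried directly in C. The following is a minimal sketch, assuming a 32-bit word; the function names `reverse_bits32` and `reverse_bits32_naive` are illustrative and do not come from the slides. The first function is the mask-and-shift sequence that a processor executes one instruction at a time; the second is a bit-at-a-time reference used to check it.

```c
#include <stdint.h>

/* Mask-and-shift bit reversal of a 32-bit word: swap halves,
 * then bytes, nibbles, bit pairs, and finally adjacent bits. */
uint32_t reverse_bits32(uint32_t x) {
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ffu) | ((x << 8) & 0xff00ff00u);
    x = ((x >> 4) & 0x0f0f0f0fu) | ((x << 4) & 0xf0f0f0f0u);
    x = ((x >> 2) & 0x33333333u) | ((x << 2) & 0xccccccccu);
    x = ((x >> 1) & 0x55555555u) | ((x << 1) & 0xaaaaaaaau);
    return x;
}

/* Reference version: move one bit per iteration from x into r. */
uint32_t reverse_bits32_naive(uint32_t x) {
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}
```

In hardware neither version is needed: the reversal is a fixed permutation of 32 wires, which is why the slide quotes a single cycle for the FPGA.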

FPGA: Why (Sometimes) Better than Microprocessor

C code for FIR filter:
    for (i = 0; i < 128; i++)
        y[i] += c[i] * x[i];
    ...

- Processor
  - 1000's of instructions
  - Several thousand cycles
- FPGA
  - ~7 cycles
  - Speedup > 100x
- In general, the FPGA is better due to the circuit's concurrency, from bit level to task level
[Figure: circuit for FIR filter; a row of multipliers feeding adders.]
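One plausible reading of the "~7 cycles" figure is a circuit that performs all 128 multiplications at once and then sums the products in a balanced adder tree, whose depth is log2(128) = 7. The slides do not spell this out, so the sketch below is an assumption; `fir_tree_sum`, `TAPS`, and the `levels` output are hypothetical names. The C code runs sequentially, but each pass of the outer loop corresponds to one level of adders that would operate concurrently in hardware.

```c
#include <stddef.h>

#define TAPS 128

/* Multiply all taps, then reduce pairwise like an adder tree.
 * Returns the sum of c[i] * x[i]; *levels receives the tree depth. */
long fir_tree_sum(const int c[TAPS], const int x[TAPS], int *levels) {
    long partial[TAPS];

    /* Level 0: 128 independent multiplications (parallel in hardware). */
    for (size_t i = 0; i < TAPS; i++) {
        partial[i] = (long)c[i] * x[i];
    }

    /* Each pass halves the number of partial sums: 128 -> 64 -> ... -> 1. */
    int depth = 0;
    for (size_t width = TAPS; width > 1; width /= 2) {
        for (size_t i = 0; i < width / 2; i++) {
            partial[i] = partial[2 * i] + partial[2 * i + 1];
        }
        depth++;
    }

    *levels = depth;
    return partial[0];
}
```

For 128 taps the reduction takes seven passes, against the thousands of sequential instructions the processor needs for the same loop.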

Extensive Studies over Past Decade
- Large speedups on many important applications
  - See ACM/SIGDA Int. Symp. on FPGAs
- So why aren't FPGAs ubiquitous?

Why FPGAs aren't Ubiquitous
- Cost: but improving yearly
- Power: but improving yearly, and energy benefits too
- Extra chip: but integration continues
- Programming methodology
[Figure: cost of a 1-million-system-gate FPGA over time. Source: Xilinx]

Why FPGAs aren't Mainstream
- Cost
- Power
- Extra chip
- Programming methodology
  - Though tremendous progress in past decade
[Figure: implementation flows.
  Microprocessors: Application (C/C++/Java/...) -> Compilation (1960s, 1970s) -> Assembly code -> Assembling, linking (1950s, 1960s) -> Microprocessor binary -> Downloading -> Microprocessor
  FPGAs: C/C++/Java/VHDL/Verilog/SystemC/Handel-C/Streams-C... -> Behavioral synthesis (1990s) -> Register transfers -> RT synthesis (1980s, 1990s) -> Logic equations / FSMs -> Logic synthesis, physical design (1970s, 1980s) -> FPGA binary -> Downloading -> FPGA circuits
  Automated hardware/software partitioning splits an application (C/C++/Java/SystemC/Handel-C/Streams-C/...) between the two flows.]

So What's the Holdup?
- FPGAs require special compilers
  - Includes synthesis, tech. map, place & route
- Limits adoption: desktop world dominates
  - 100 software writers for every CAD user
  - Millions of compiler seats worldwide, vs. 15,000 CAD seats
- Can't ignore "ecosystem" from separation of applications, tools, and architectures
  - Just consider history of popular processors
[Figure: Application -> Standard Compiler -> Microprocessor binary -> Proc., versus Application -> Special Compiler -> FPGA binary -> FPGA; an ecosystem of applications, tools, and architectures built around standard binaries.]

Can we Hide FPGAs from Programmers and Standard Tools?
- Example: radically different x86 architectures hidden from programmers and tools
  - All execute standard x86 binaries
  - On-chip tools dynamically translate binary to particular architecture (e.g., RISC architecture, VLIW architecture)
- Idea: hide FPGA from programmers and tools
  - Download standard binary
  - Have on-chip tools dynamically translate binary (portions) to FPGA
  - We call this Warp Processing
[Figure: SW -> Profiling -> Standard Compiler -> Binary -> Translator -> Proc. and FPGA. Traditional partitioning is done at the compiler stage; here it moves into the on-chip translator.]

Warp Processing Idea
Step 1: Initially, software binary loaded into instruction memory
[Figure: platform with profiler, instruction memory (I Mem), microprocessor (µP), data cache (D$), FPGA, and on-chip CAD.]
Software binary:
          Mov reg3, 0
          Mov reg4, 0
    loop: Shl reg1, reg3, 1
          Add reg5, reg2, reg1
          Ld  reg6, 0(reg5)
          Add reg4, reg6
          Add reg3, 1
          Beq reg3, 10, -5
          Ret reg4

Warp Processing Idea
Step 2: Microprocessor executes instructions in software binary
[Figure: same platform and software binary as in step 1.]

Warp Processing Idea
Step 3: Profiler monitors instructions and detects critical regions in binary
[Figure: same platform and software binary as in step 1; the profiler sees the loop's add and beq instructions repeat and flags "Critical Loop Detected".]
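The slides do not describe how the profiler recognizes a critical loop. One simple approach, sketched here purely as an assumption, is to count taken backward branches per target address, since a loop's closing branch jumps back to the loop head on every iteration. The names `loop_profiler`, `profiler_observe`, `profiler_hottest`, and the table size are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define PROFILER_ENTRIES 16

/* Small table of loop-head addresses and how often each was re-entered. */
typedef struct {
    uint32_t target[PROFILER_ENTRIES];
    uint32_t count[PROFILER_ENTRIES];
    size_t used;
} loop_profiler;

void profiler_init(loop_profiler *p) {
    p->used = 0;
}

/* Called for every executed branch; only taken backward branches count. */
void profiler_observe(loop_profiler *p, uint32_t pc, uint32_t target, int taken) {
    if (!taken || target >= pc) {
        return;
    }
    for (size_t i = 0; i < p->used; i++) {
        if (p->target[i] == target) {
            p->count[i]++;
            return;
        }
    }
    if (p->used < PROFILER_ENTRIES) {
        p->target[p->used] = target;
        p->count[p->used] = 1;
        p->used++;
    }
    /* Table full: this sketch simply ignores further new loop heads. */
}

/* Writes the most frequently re-entered loop head to *target.
 * Returns 1 on success, 0 if no backward branch has been seen. */
int profiler_hottest(const loop_profiler *p, uint32_t *target) {
    if (p->used == 0) {
        return 0;
    }
    size_t best = 0;
    for (size_t i = 1; i < p->used; i++) {
        if (p->count[i] > p->count[best]) {
            best = i;
        }
    }
    *target = p->target[best];
    return 1;
}
```

In the example binary, the Beq at the end of the loop would be observed once per iteration, so the address of the loop label would quickly dominate the table.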

Warp Processing Idea
Step 4: On-chip CAD reads in critical region
[Figure: same platform and software binary as in step 1.]

Warp Processing Idea
Step 5: On-chip CAD decompiles critical region into control data flow graph (CDFG)
[Figure: same platform and software binary as in step 1; the on-chip CAD is labeled Dynamic Partitioning Module (DPM).]
Decompiled CDFG:
    reg3 := 0
    reg4 := 0
    loop:
    reg4 := reg4 + mem[reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4

Warp Processing Idea
Step 6: On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit
[Figure: same platform, software binary, and CDFG as in step 5; the CDFG is turned into a tree of adders.]

Warp Processing Idea
Step 7: On-chip CAD maps circuit onto FPGA
[Figure: same platform, software binary, and CDFG as in step 5; the adder circuit is placed onto CLBs and routed through SMs.]

Warp Processing Idea
Step 8: On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to "warp" by an order of magnitude or more
Updated software binary:
          Mov reg3, 0
          Mov reg4, 0
    loop: // instructions that interact with FPGA
          Ret reg4
[Figure: same platform as in step 5, with the circuit mapped onto the FPGA; a bar chart compares software-only execution with "warped" execution.]

Warp Processing Challenges
- Two key challenges
  - Can we decompile binaries to recover enough high-level constructs to create fast circuits on FPGAs?
  - Can we just-in-time (JIT) compile to FPGAs using limited on-chip compute resources?
[Figure: platform with profiler, µP, I$, D$, FPGA, and on-chip CAD. On-chip CAD flow: Binary -> Profiling & partitioning -> Decompilation -> Synthesis -> Std. HW binary -> JIT FPGA compilation -> FPGA binary, with a Binary Updater producing the updated microprocessor binary.]

Decompilation
- If we don't decompile
  - High-level information (e.g., loops, arrays) lost during compilation
  - Direct translation of assembly to circuit: big overhead
- Need to recover high-level information
[Figure: overhead of microprocessor/FPGA solution WITHOUT decompilation, vs. microprocessor alone. The Decompilation stage of the on-chip CAD flow is highlighted.]

Decompilation
- Solution: recover high-level information from binary (decompilation)
  - Adapted extensive previous work (for different purposes)
  - Developed new decompilation methods also
  - Ph.D. work of Greg Stitt (Ph.D. UCR 2006)
  - Numerous publications: http://www.cs.ucr.edu/~vahid/pubs

Original C code:
    long f( short a[10] ) {
        long accum = 0;
        for (int i = 0; i < 10; i++) {
            accum += a[i];
        }
        return accum;
    }

Corresponding assembly:
          Mov reg3, 0
          Mov reg4, 0
    loop: Shl reg1, reg3, 1
          Add reg5, reg2, reg1
          Ld  reg6, 0(reg5)
          Add reg4, reg6
          Add reg3, 1
          Beq reg3, 10, -5
          Ret reg4

Decompilation stages, applied in sequence to the assembly:

Control/data flow graph creation:
    reg3 := 0
    reg4 := 0
    loop:
    reg1 := reg3 << 1
    reg5 := reg2 + reg1
    reg6 := mem[reg5 + 0]
    reg4 := reg4 + reg6
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4

Data flow analysis:
    reg3 := 0
    reg4 := 0
    loop:
    reg4 := reg4 + mem[reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4

Function recovery:
    long f( long reg2 ) {
        int reg3 = 0;
        int reg4 = 0;
    loop:
        reg4 = reg4 + mem[reg2 + (reg3 << 1)];
        reg3 = reg3 + 1;
        if (reg3 < 10) goto loop;
        return reg4;
    }

Control structure recovery:
    long f( long reg2 ) {
        long reg4 = 0;
        for (long reg3 = 0; reg3 < 10; reg3++) {
            reg4 += mem[reg2 + (reg3 << 1)];
        }
        return reg4;
    }

Array recovery:
    long f( short array[10] ) {
        long reg4 = 0;
        for (long reg3 = 0; reg3 < 10; reg3++) {
            reg4 += array[reg3];
        }
        return reg4;
    }

The recovered code and the original C code are almost identical representations.
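The array-recovery step above rests on the observation that `mem[reg2 + (reg3 << 1)]` is an access to a 16-bit array element. A minimal sketch of that equivalence follows, assuming a byte-addressed memory and 16-bit elements; the function names `f_register_level` and `f_recovered` and the flat `mem` buffer are illustrative, not part of the original tool flow.

```c
#include <stdint.h>
#include <string.h>

/* Register-level form: reg2 is a byte offset into memory and
 * (reg3 << 1) scales the loop index by the 2-byte element size. */
long f_register_level(const uint8_t *mem, uint32_t reg2) {
    long reg4 = 0;
    for (uint32_t reg3 = 0; reg3 < 10; reg3++) {
        int16_t reg6;
        /* 16-bit load from mem[reg2 + (reg3 << 1)] */
        memcpy(&reg6, &mem[reg2 + (reg3 << 1)], sizeof reg6);
        reg4 += reg6;
    }
    return reg4;
}

/* Recovered high-level form: the same computation as an array sum. */
long f_recovered(const int16_t array[10]) {
    long accum = 0;
    for (int i = 0; i < 10; i++) {
        accum += array[i];
    }
    return accum;
}
```

Both functions return the same value when the ten elements are stored contiguously starting at byte offset `reg2`, which is the fact a decompiler exploits to replace the address arithmetic with an array index.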

Decompilation Results vs. C
- Compared with synthesis from C
  - Synthesis after decompilation often quite similar
  - Almost identical performance, small area overhead
[Figure: results chart, from FPGA 2005.]

Decompilation Results on Optimized H.264
In-depth study with Freescale
- Used highly-optimized benchmark
- Results: binary approach competitive
  - Speedups compared to ARM9 software: Binary: 2.48, C: 2.53
  - Decompilation recovered nearly all high-level information needed for partitioning and synthesis

Tangent: Simple Coding Guidelines Bring Speedups Closer to Ideal
- Interesting discovery during H.264 study: C style limited speedup
  - Orthogonal to binary vs. C issue: coding style hurt both
- Developed simple coding guidelines
  - Rewritten software: 20 minutes, and only ~3% slower than original
  - New speedups: Binary: 6.55, C: 6.56
  - Binary still competitive with C
- Following guidelines not required, but helps any approach targeting FPGAs

JIT FPGA Compilation
- Developed ultra-lean CAD heuristics for synthesis, placement, routing, and technology mapping; simultaneously developed CAD-oriented FPGA
  - e.g., our router (ROCR) is 10x faster and uses 20x less memory than the popular VPR tool, at the cost of a 30% longer critical path. Similar results for synthesis and placement
  - Ph.D. work of Roman Lysecky (Ph.D. UCR 2005, now Asst. Prof. at Univ. of Arizona)
  - Numerous publications: http://www.cs.ucr.edu/~vahid/pubs

Tool flow                                    Time     Memory
Xilinx ISE                                   9.1 s    60 MB
Riverside JIT FPGA tools                     0.2 s    3.6 MB
Riverside JIT FPGA tools on a 75 MHz ARM7    1.4 s    3.6 MB
(Source: DAC'04)

[Figure: the JIT FPGA compilation stage of the on-chip CAD flow is highlighted.]

Overall Warp Processing Results
- Performance speedup (most frequent kernel only)
  - Average kernel speedup of 41, vs. 21 for Virtex-E
  - Simpler FPGA fabric yields faster HW circuits
  - Currently prototyping our simpler FPGA fabric with Intel, scheduled for Q3 shuttle
- Overall application speedup average is 7.4
[Figure: per-benchmark speedup chart relative to software-only execution.]
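The gap between a kernel speedup and an overall application speedup follows the usual Amdahl's-law relationship: only the time spent in the accelerated kernel shrinks. The sketch below illustrates that relationship; the function name `overall_speedup` and the 90% kernel fraction used in the note are hypothetical, and the slides' averages over benchmarks cannot be reproduced from a single fraction.

```c
/* Amdahl's law: kernel_fraction is the share of software-only execution
 * time spent in the kernel (0.0 to 1.0); kernel_speedup is how much
 * faster that kernel runs in hardware. */
double overall_speedup(double kernel_fraction, double kernel_speedup) {
    double remaining = 1.0 - kernel_fraction;          /* still in software */
    double accelerated = kernel_fraction / kernel_speedup;
    return 1.0 / (remaining + accelerated);
}
```

For instance, a kernel that accounts for 90% of execution time and runs 41 times faster gives an overall speedup of about 8.2, since 1 / (0.1 + 0.9/41) = 41/5. The software that stays on the microprocessor bounds the overall gain.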

FPGA Ubiquity via Obscurity
- Warp processing hides FPGA from languages and tools
  - ANY microprocessor platform extendible with FPGA
  - Maintains "ecosystem": application, tool, and architecture developers
- New processor platforms with FPGA evolving
  - New platforms with FPGAs appearing
[Figure: SW -> Profiling -> Standard Compiler -> Standard Binary -> Translator -> Proc. and FPGA; an ecosystem of applications, tools, and architectures built around standard binaries.]

FPGA Standard Binaries?
- Microprocessor binary represents one form of a "standard binary for FPGAs"
  - Missing is explicit concurrency: parallelism, pipelining, queues, etc.
- As FPGAs appear in more platforms, might a more general FPGA binary evolve? (SystemC?)
- Ecosystem for FPGAs presently sorely missing
[Figure: SW -> Profiling -> Standard FPGA Compiler -> Standard FPGA binary? -> Translator -> Proc. and FPGA; an ecosystem of applications, tools, and architectures built around standard FPGA binaries.]

FPGA Standard Binaries?
- Translator makes best use of existing FPGA resources
- Can even add FPGA, like adding memory, to improve performance
  - Add more FPGA to your PDA to implement a compute-intensive application?
[Figure: the same FPGA binary is translated for two platforms. Low-end PDA with a small FPGA: 100 sec. High-end PDA with a larger FPGA holding more parallel adders and multipliers: 1 sec.]

Summary
- FPGAs may be the new software
- Hiding FPGA via warp processing is feasible
  - Decompilation can recover high-level constructs to yield speedups competitive with source-level
  - JIT FPGA compilation can be made sufficiently lean
- Future: standard binaries for FPGAs?
  - Extensive work to be done

Publications can be found at: http://www.cs.ucr.edu/~vahid/pubs