Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis

Greg Stitt
Department of Electrical and Computer Engineering
University of Florida

Introduction
- Improved performance enables new applications
  - Past decade: MP3 players, portable game consoles, cell phones, etc.
  - Future architectures: speech/image recognition, self-guiding cars, computational biology, etc.

Introduction
- FPGAs (Field Programmable Gate Arrays) implement custom circuits
  - FPGAs are capable of large performance improvements
  - 10x, 100x, even 1000x for scientific and embedded apps [Najjar 04][He, Lu, Sun 05][Levine, Schmit 03][Prasanna 06][Stitt, Vahid 05], ...
- But FPGAs are not mainstream
- Warp processing goal: bring FPGAs into the mainstream
  - Make FPGAs "invisible"
[Figure: performance of µP vs. FPGA]

Introduction – Hardware/Software Partitioning
C code for FIR filter:
    for (i=0; i < 128; i++)
        y[i] += c[i] * x[i];
- Hardware/software partitioning selects performance-critical regions for hardware implementation [Ernst, Henkel 93][Gupta, De Micheli 97][Vahid, Gajski 94][Eles et al. 97][Sangiovanni-Vincentelli 94]
- Designer creates custom hardware for the loop using a hardware description language (HDL)
- Processor (compiled code): ~1000 cycles
- FPGA (parallel multipliers feeding a tree of adders): ~10 cycles
- Speedup = 1000 cycles / 10 cycles = 100x
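As a rough illustration of the arithmetic on this slide, the sketch below writes the loop as an ordinary C function and computes the speedup ratio. The names `fir_step` and `speedup` are made up for this example, and the cycle counts passed to `speedup` are the slide's estimates rather than measurements.

```c
#include <stddef.h>

/* Software version of the slide's loop: one multiply-accumulate per
 * element, executed one iteration at a time on a processor. */
void fir_step(long *y, const short *c, const short *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += (long)c[i] * x[i];
}

/* Speedup is the ratio of software cycles to hardware cycles. */
double speedup(double sw_cycles, double hw_cycles)
{
    return sw_cycles / hw_cycles;
}
```

Calling `speedup(1000.0, 10.0)` reproduces the 100x figure. In hardware the multiplications can all happen at once, which is where the cycle reduction comes from.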

Introduction – High-level Synthesis
- Problem: Describing a circuit using HDL is time consuming and difficult
- Solution: High-level synthesis
  - Create circuit from high-level code [Gupta, De Micheli 92][Camposano, Wolf 91][Rabaey 96][Gajski, Dutt 92]
  - Allows developers to use higher-level specification
  - Potentially enables synthesis for software developers
[Figure: tool flow. High-level code goes through hw/sw partitioning; software goes to the compiler and linker (with libraries/object code) to produce a binary for the µP; hardware goes to high-level synthesis to produce a bitstream for the FPGA]


Introduction – High-level Synthesis
Example: high-level synthesis creates a circuit of multipliers (*) feeding adders (+) from high-level code:
    for (i=0; i < 16; i++)
        y[i] += c[i] * x[i];

Outline
- Introduction
- Warp Processing Overview
- Enabling Technology – Binary Synthesis
  - Key techniques for synthesis from binaries
    - Decompilation
- Current and Future Directions
  - Multi-threaded Warp Processing
  - Custom Communication

Problems with High-Level Synthesis
- Problem: High-level synthesis is unattractive to software developers
  - Requires specialized language
    - SystemC, NapaC, HandelC, ...
  - Requires specialized compiler
    - Spark, ROCCC, CatapultC, ...
- Limited commercial success
  - Software developers reluctant to change tools
[Figure: non-standard software tool flow. Specialized language -> specialized compiler -> software (linker, binary, µP) and hardware (synthesis, bitstream, FPGA)]

Warp Processing – "Invisible" Synthesis
- Solution: Make synthesis "invisible"
- 2 requirements
  - Standard software tool flow
    - Perform compilation before synthesis
  - Hide synthesis tool
    - Move synthesis on chip
- Similar to dynamic binary translation [Transmeta]
  - But translate to hardware
[Figure: standard software tool flow. High-level code -> compiler -> linker (with libraries/object code) -> binary; decompilation and synthesis then produce an updated binary for the µP and a bitstream for the FPGA]

Warp Processing – "Invisible" Synthesis
- Warp processor looks like standard µP but invisibly synthesizes hardware

Warp Processing – "Invisible" Synthesis
- Advantages
  - Supports all languages, compilers, IDEs (e.g., C, C++, Java, Matlab; gcc, g++, javac, keil)
  - Supports synthesis of assembly code
  - Supports synthesis of library code
  - Also enables dynamic optimizations

Warp Processing Background: Basic Idea
1. Initially, software binary loaded into instruction memory
Software binary:
        Mov reg3, 0
        Mov reg4, 0
    loop:
        Shl reg1, reg3, 1
        Add reg5, reg2, reg1
        Ld reg6, 0(reg5)
        Add reg4, reg6
        Add reg3, 1
        Beq reg3, 10, -5
        Ret reg4
[Figure: warp processor with profiler, I Mem, µP, D$, FPGA, and on-chip CAD]

Warp Processing Background: Basic Idea
2. Microprocessor executes instructions in software binary

Warp Processing Background: Basic Idea
3. Profiler monitors instructions and detects critical regions in binary
- The profiler sees the loop's add and beq instructions repeat: critical loop detected
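One way such a profiler could work is sketched below, under the assumption that critical loops are found by counting taken backward branches. The slide only says that the profiler monitors instructions, so the table size, the struct layout, and the function names here are all hypothetical.

```c
#include <stddef.h>

#define MAX_TRACKED 16   /* hypothetical size of the profiler's table */

struct loop_counter {
    unsigned target;     /* address a backward branch jumped to */
    unsigned count;      /* how many times that branch was taken */
};

struct profiler {
    struct loop_counter table[MAX_TRACKED];
    size_t used;
};

/* Record one taken branch. Only backward branches (target <= pc) are
 * counted, because those are the ones that close loops. */
void profiler_on_branch(struct profiler *p, unsigned pc, unsigned target)
{
    if (target > pc)
        return;                          /* forward branch: not a loop */
    for (size_t i = 0; i < p->used; i++) {
        if (p->table[i].target == target) {
            p->table[i].count++;
            return;
        }
    }
    if (p->used < MAX_TRACKED) {         /* new loop head, if room remains */
        p->table[p->used].target = target;
        p->table[p->used].count = 1;
        p->used++;
    }
}

/* Return the loop-head address with the highest count.
 * *found is set to 0 when no backward branch has been recorded. */
unsigned profiler_hottest(const struct profiler *p, int *found)
{
    unsigned best_target = 0, best_count = 0;
    *found = 0;
    for (size_t i = 0; i < p->used; i++) {
        if (p->table[i].count > best_count) {
            best_count = p->table[i].count;
            best_target = p->table[i].target;
            *found = 1;
        }
    }
    return best_target;
}
```

Feeding it the `Beq` of the example binary once per iteration would make the `loop:` address the hottest entry.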

Warp Processing Background: Basic Idea
4. On-chip CAD reads in critical region

Warp Processing Background: Basic Idea
5. On-chip CAD (dynamic partitioning module, DPM) converts critical region into control data flow graph (CDFG)
        reg3 := 0
        reg4 := 0
    loop:
        reg4 := reg4 + mem[reg2 + (reg3 << 1)]
        reg3 := reg3 + 1
        if (reg3 < 10) goto loop
        ret reg4

Warp Processing Background: Basic Idea
6. On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit
[Figure: the decompiled loop synthesized into a tree of adders]

Warp Processing Background: Basic Idea
7. On-chip CAD maps circuit onto FPGA
[Figure: adder circuit mapped onto the FPGA's configurable logic blocks (CLB) and switch matrices (SM)]

Warp Processing Background: Basic Idea
8. On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to "warp" by an order of magnitude or more
Updated software binary:
        Mov reg3, 0
        Mov reg4, 0
    loop:
        // instructions that interact with FPGA
        Ret reg4
[Figure: software-only vs. "warped" performance]

Expandable Logic
- Expandable RAM: system detects RAM during start, improves performance invisibly
- Expandable logic: warp tools detect amount of FPGA, invisibly adapt application to use less/more hardware
[Figure: platform with µPs, cache, DMA, RAM, profiler, warp tools, and expandable FPGA; performance of µP alone vs. with expandable logic]

Expandable Logic
- Allows for customization of platforms
  - User can select FPGAs based on used applications
[Figure: performance of a portable gaming application: unacceptable performance]

Expandable Logic
- User can customize FPGAs to the desired amount of performance
- Performance improvement is invisible: doesn't require new binary from the developer
[Figure: performance of a portable gaming application]

Expandable Logic
- Platform designer doesn't have to decide on fixed amount of FPGA
- User doesn't have to pay for FPGA that isn't needed
[Figure: a web browser application reaches acceptable performance with no FPGA]

Warp Processing Background: Basic Technology
- Challenge: CAD tools normally require powerful workstations
- Develop extremely efficient on-chip CAD tools
  - Requires efficient synthesis
  - Requires specialized FPGA, physical design tools (JIT FPGA compilation) [Lysecky FCCM 05/DAC 04], University of Arizona
[Figure: on-chip CAD flow. Binary -> binary synthesis -> HW -> JIT FPGA compilation (logic optimization, technology mapping, placement & routing); output is an updated binary. Platform: profiler, µP, I$, D$, FPGA, on-chip CAD]

Warp Processing Background: On-Chip CAD
- Xilinx ISE: 9.1 s, 60 MB
- On-chip CAD: 0.2 s, 3.6 MB
  - 46x improvement
  - 30% perf. penalty
- On a 75 MHz ARM7: only 1.4 s
[Figure: tool-flow stages compared (RT synthesis, logic synthesis, logic optimization, technology mapping, placement, routing); part of the flow is labeled "manually performed"]

Warp Processing: Initial Results – Embedded Applications
- Average speedup of 6.3x
  - Achieved completely transparently
- Also, energy savings of 66%


Binary Synthesis
- Warp processors perform synthesis from a software binary: "binary synthesis"
- Problem: No high-level information
  - Synthesis needs high-level constructs; without them, > 10x slowdown
- Can we recover high-level information for synthesis?
  - Make binary synthesis (and warp processing) competitive with high-level synthesis
Example: the compiler turns
    for (i=0; i < 128; i++)
        y[i] += c[i] * x[i];
into a binary with no high-level constructs (arrays, loops, etc.):
    Addi r1, r0, 0
    Ld r3, 256(r1)
    Ld r4, 512(r1)
    Subi r2, r1, 128
    Jnz r2, -5
- Hardware synthesized directly from the binary can be > 10x to 100x slower

Decompilation
- We realized decompilation recovers high-level information
  - But generally used for binary translation or source-code recovery
  - May not be suitable for synthesis
- We studied existing approaches [Cifuentes 94, 99, 01][Mycroft 99, 01]
  - DisC, dcc, Boomerang, Mocha, SourceAgain
- Determined relevant techniques
- Adapted existing techniques for synthesis

Decompilation – Control/Data Flow Graph Recovery
- Recovery of control/data flow graph (CDFG)
  - Format used by synthesis
- Difficult because of indirect jumps
  - Cannot statically analyze control flow
  - But heuristics are over 99% successful on standard benchmarks [Cifuentes 99, 00]
Original C code:
    long f( short a[10] ) {
        long accum = 0;
        for (int i=0; i < 10; i++) {
            accum += a[i];
        }
        return accum;
    }
Corresponding assembly:
        Mov reg3, 0
        Mov reg4, 0
    loop:
        Shl reg1, reg3, 1
        Add reg5, reg2, reg1
        Ld reg6, 0(reg5)
        Add reg4, reg6
        Add reg3, 1
        Beq reg3, 10, -5
        Ret reg4
After control/data flow graph creation:
        reg3 := 0
        reg4 := 0
    loop:
        reg1 := reg3 << 1
        reg5 := reg2 + reg1
        reg6 := mem[reg5 + 0]
        reg4 := reg4 + reg6
        reg3 := reg3 + 1
        if (reg3 < 10) goto loop
        ret reg4
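The first step of control flow recovery, splitting the instruction stream into basic blocks, can be sketched as follows. This covers only direct branches; indirect jumps, which the slide names as the hard case, are not modeled. The `struct instr` encoding and the convention that a branch target is the branch's own index plus its offset (so `Beq reg3, 10, -5` at index 7 targets index 2) are assumptions made for this example.

```c
#include <stddef.h>

enum op { OP_MOV, OP_SHL, OP_ADD, OP_LD, OP_BRANCH, OP_RET };

struct instr {
    enum op op;
    int branch_offset;   /* OP_BRANCH only: target = own index + offset */
};

/* Mark basic-block leaders: the first instruction, every branch target,
 * and every instruction that follows a branch.
 * Returns the number of basic blocks found. */
size_t mark_leaders(const struct instr *code, size_t n, int *is_leader)
{
    size_t blocks = 0;
    for (size_t i = 0; i < n; i++)
        is_leader[i] = 0;
    if (n > 0)
        is_leader[0] = 1;
    for (size_t i = 0; i < n; i++) {
        if (code[i].op != OP_BRANCH)
            continue;
        long target = (long)i + code[i].branch_offset;
        if (target >= 0 && (size_t)target < n)
            is_leader[target] = 1;       /* branch target starts a block */
        if (i + 1 < n)
            is_leader[i + 1] = 1;        /* fall-through starts a block */
    }
    for (size_t i = 0; i < n; i++)
        blocks += (size_t)is_leader[i];
    return blocks;
}
```

For the nine-instruction example this yields three blocks: the two initial `Mov` instructions, the loop body, and the final `Ret`.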

Decompilation – Data Flow Analysis
- Original purpose: remove temporary registers
- Area overhead: 130%
- Need new techniques for binary synthesis
After data flow analysis:
        reg3 := 0
        reg4 := 0
    loop:
        reg4 := reg4 + mem[reg2 + (reg3 << 1)]
        reg3 := reg3 + 1
        if (reg3 < 10) goto loop
        ret reg4

Decompilation – Data Flow Analysis
- Strength reduction: compare-with-zero instructions
        Sub reg3, reg4, reg5
        Bz reg3, -5
  - The subtraction is not needed and wastes area
  - Optimized DFG: branch on reg4 = reg5 directly
- Operator size reduction
        Lb reg4, 0(reg1)
        Mvi reg5, 16
        Add reg3, reg4, reg5
  - Original DFG: 32-bit reg4 + 32-bit reg5 -> 32-bit reg3 (32-bit adder)
  - Optimized DFG: 8-bit reg4 (load byte) + 5-bit reg5 (constant 16) -> 8-bit reg3
  - Only 8-bit adder needed
- Area overhead reduced to 10%
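The operand widths in the operator size reduction example can be checked with a small sketch. The helper names are hypothetical, and the adder is sized by its widest operand, as on the slide; a possible carry-out bit is deliberately left out.

```c
/* Number of bits needed to represent an unsigned constant (at least 1). */
unsigned bits_needed(unsigned long value)
{
    unsigned bits = 1;
    while (value >>= 1)
        bits++;
    return bits;
}

/* Size an adder by its widest operand. */
unsigned adder_operand_width(unsigned width_a, unsigned width_b)
{
    return width_a > width_b ? width_a : width_b;
}
```

Here `bits_needed(16)` gives the 5-bit width of reg5, and combining it with the 8-bit byte load gives the 8-bit adder.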

Decompilation – Function Recovery
- Recover parameters and return values
  - Def-use analysis of prologue/epilogue
  - 100% success rate
After function recovery:
    long f( long reg2 ) {
        int reg3 = 0;
        int reg4 = 0;
    loop:
        reg4 = reg4 + mem[reg2 + (reg3 << 1)];
        reg3 = reg3 + 1;
        if (reg3 < 10) goto loop;
        return reg4;
    }

Decompilation – Control Structure Recovery
- Recover loops, if statements
- Uses interval analysis techniques [Cifuentes 94]
  - 100% success rate
After control structure recovery:
    long f( long reg2 ) {
        long reg4 = 0;
        for (long reg3 = 0; reg3 < 10; reg3++) {
            reg4 += mem[reg2 + (reg3 << 1)];
        }
        return reg4;
    }

Decompilation – Array Recovery
- Detect linear memory patterns and row-major ordering calculations
- ~95% success rate [Stitt, Guo, Najjar, Vahid 05][Cifuentes 00]
After array recovery:
    long f( short array[10] ) {
        long reg4 = 0;
        for (long reg3 = 0; reg3 < 10; reg3++) {
            reg4 += array[reg3];
        }
        return reg4;
    }
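The linear memory pattern that array recovery looks for can be illustrated by writing the same loop twice: once with the raw address arithmetic `base + (index << 1)` and once with the recovered array subscript. This is only a sketch; the function names are invented, `mem` stands in for byte-addressed memory, and a 2-byte `short` is assumed.

```c
#include <string.h>

/* Access pattern before array recovery: a byte address built from a
 * base plus (index << 1), i.e. a stride of 2 bytes per element. */
long sum_by_address(const unsigned char *mem, unsigned long base, long count)
{
    long total = 0;
    for (long i = 0; i < count; i++) {
        short value;
        memcpy(&value, mem + base + ((unsigned long)i << 1), sizeof value);
        total += value;
    }
    return total;
}

/* The same computation after array recovery: the shift-and-add address
 * calculation has been recognized as indexing an array of shorts. */
long sum_by_index(const short *array, long count)
{
    long total = 0;
    for (long i = 0; i < count; i++)
        total += array[i];
    return total;
}
```

Both functions return the same sum for the same data, which is what allows the address arithmetic to be replaced by `array[reg3]`.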

Comparison of Decompiled Code and Original Code
- Decompiled code almost identical to original code
  - Only difference is variable names
- Binary synthesis is competitive with high-level synthesis
Original C code:
    long f( short a[10] ) {
        long accum = 0;
        for (int i=0; i < 10; i++) {
            accum += a[i];
        }
        return accum;
    }
Decompiled code:
    long f( short array[10] ) {
        long reg4 = 0;
        for (long reg3 = 0; reg3 < 10; reg3++) {
            reg4 += array[reg3];
        }
        return reg4;
    }

Binary Synthesis Tool Flow
- Initially, high-level source is compiled and linked (with libraries/object code) to form a binary
- Binary synthesis tool (~30,000 lines of C code) takes the binary as input
  - Profiling
  - Decompilation: recovers high-level information needed for synthesis
  - Hw/sw estimation
  - Hw/sw partitioning
  - Synthesis (hardware): hardware netlists -> bitstream for FPGA
  - Binary updater (software): modifies binary to use synthesized hardware -> updated binary for µP

Binary Synthesis is Competitive with High-Level Synthesis
- Small difference in speedup: binary synthesis competitive with high-level synthesis
  - Binary speedup: 8x; high-level speedup: 8.2x
  - High-level synthesis only 2.5% better
- Commercial products beginning to appear
  - CriticalBlue, Binachip

Binary Synthesis with Software Compiler Optimizations
- But those binaries were generated with few optimizations
  - Optimizations for software may hurt hardware
  - Need new decompilation techniques
- Binary is optimized for software; hardware synthesized from optimized binary may be inefficient
[Figure: C code -> SW compiler -> optimized binary -> synthesis -> µP + FPGA]

Loop Rerolling
- Problem: Loop unrolling may cause inefficient hardware
  - Longer synthesis times
    - Super-linear heuristics
    - Unrolling 100 times => synthesis time is 100^2 times longer
  - Larger area requirements
    - Unrolling by compiler unlikely to match unrolling by synthesis
  - Loop structure needed for advanced synthesis techniques
- Solution: We introduce loop rerolling to undo loop unrolling
Non-unrolled loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg6
    Add reg3, 1
    Beq reg3, 10, -5
Unrolled loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg6
    Add reg3, 1
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg6
    Add reg3, 1
[Figure: synthesis execution times]

Loop Rerolling – Identifying Unrolled Loops
- Idea: Identify consecutively repeating instruction sequences
Original C code:
    x = x + 1;
    for (i=0; i < 2; i++)
        a[i]=b[i]+1;
    y=x;
Unrolled loop:
    x = x + 1;
    a[0] = b[0]+1;
    a[1] = b[1]+1;
    y = x;
Binary, with each instruction mapped to a character:
    Add r3, 1        => B
    Ld r0, b(0)      => A
    Add r1, r0, 1    => B
    St a(0), r1      => C
    Ld r0, b(1)      => A
    Add r1, r0, 1    => B
    St a(1), r1      => C
    Mov r4, r3       => D
- String representation: BABCABCD
- Build suffix tree of the string [Ukkonen 95]
- Find consecutive repeating substrings: adjacent nodes with same substring
- Result: unrolled loop with 2 unrolled iterations; each iteration = abc (Ld, Add, St)
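The search for consecutive repeating substrings can be sketched without a suffix tree. The version below swaps in a brute-force scan, which is far slower than the suffix-tree approach cited on the slide but easier to follow; `struct repeat` and `find_unrolled_body` are names invented for this example.

```c
#include <string.h>

struct repeat {
    size_t start;    /* index of the first copy in the string */
    size_t length;   /* length of one copy (one loop iteration) */
    size_t copies;   /* number of back-to-back copies */
};

/* Brute-force search for the consecutively repeated substring that
 * covers the most characters. length == 0 means no repeat was found. */
struct repeat find_unrolled_body(const char *s)
{
    struct repeat best = {0, 0, 1};
    size_t n = strlen(s);
    for (size_t start = 0; start < n; start++) {
        for (size_t len = 1; start + 2 * len <= n; len++) {
            size_t copies = 1;
            /* Count how many times s[start..start+len) repeats in a row. */
            while (start + (copies + 1) * len <= n &&
                   memcmp(s + start, s + start + copies * len, len) == 0)
                copies++;
            if (copies >= 2 && copies * len > best.copies * best.length) {
                best.start = start;
                best.length = len;
                best.copies = copies;
            }
        }
    }
    return best;
}
```

On the slide's string `"BABCABCD"` this finds the body `ABC` starting at index 1 with 2 copies, matching the two unrolled iterations of Ld, Add, St.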

Loop Rerolling
Original C code:
    x = x + 1;
    for (i=0; i < 2; i++)
        a[i]=b[i]+1;
    y=x;
1) Unrolled loop identification
    Add r3, 1
    Ld r0, b(0)
    Add r1, r0, 1
    St a(0), r1
    Ld r0, b(1)
    Add r1, r0, 1
    St a(1), r1
    Mov r4, r3
2) Determine relationship of constants; replace constants with induction variable expression
    Add r3, 1
    i=0
    loop:
    Ld r0, b(i)
    Add r1, r0, 1
    St a(i), r1
    Bne i, 2, loop
    Mov r4, r3
3) Rerolled, decompiled code
    reg3 = reg3 + 1;
    for (i=0; i < 2; i++)
        array1[i]=array2[i]+1;
    reg4=reg3;
- Average speedup of 1.6x
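Step 2, determining the relationship of the constants, amounts to checking that the constants seen in successive unrolled iterations form an arithmetic progression. A minimal sketch follows, with the hypothetical function name `find_induction`; a real implementation would have to do this for every constant position in the loop body.

```c
#include <stddef.h>

/* Check whether the constants from successive unrolled iterations can be
 * written as base + k * stride. Returns 1 and fills *base and *stride on
 * success, 0 otherwise (the outputs are then left untouched). */
int find_induction(const long *constants, size_t n, long *base, long *stride)
{
    if (n < 2)
        return 0;                        /* need at least two iterations */
    long step = constants[1] - constants[0];
    for (size_t k = 2; k < n; k++) {
        if (constants[k] - constants[k - 1] != step)
            return 0;                    /* not a linear relationship */
    }
    *base = constants[0];
    *stride = step;
    return 1;
}
```

For the example, the constants in `b(0)` and `b(1)` give base 0 and stride 1, so both accesses can be rewritten as `b(i)`.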

Strength Promotion
- Problem: Strength reduction may cause inefficient hardware
- Identify strength-reduced subgraphs
- Replace with multiplication
- However, some of the strength reduction was beneficial
  - Synthesis reapplies strength reduction to get optimal DFG
- Strength promotion lets synthesis decide on strength reduction, not software compiler
- Average speedup of 1.5x
[Figure: DFG computing A[i] from B[i], B[i+1], B[i+2], B[i+3]: shift (<<) and add (+) chains are replaced with multiplications by constants (*), then selectively re-reduced by synthesis]
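The "replace with multiplication" step can be illustrated on a single strength-reduced expression. The sketch assumes the subgraph is a sum of left shifts of the same value; the constant 10 and both function names are chosen for illustration and are not taken from the slide's figure.

```c
#include <stddef.h>

/* Strength-reduced form a software compiler might emit for x * 10. */
unsigned long times10_shift_add(unsigned long x)
{
    return (x << 3) + (x << 1);
}

/* Promote a sum of (x << shifts[k]) terms back to one multiplier:
 * the constant is the sum of the corresponding powers of two. */
unsigned long promoted_constant(const unsigned *shifts, size_t n)
{
    unsigned long constant = 0;
    for (size_t k = 0; k < n; k++)
        constant += 1UL << shifts[k];
    return constant;
}
```

Once the multiplication `x * 10` is visible again, synthesis is free to choose its own implementation instead of inheriting the compiler's shift-and-add chain.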

Multiple ISA/Optimization Results
- What about aggressive software compiler optimizations?
  - May obscure binary, making decompilation impossible
- What about different instruction sets?
  - Side effects may degrade hardware performance
- Results
  - Speedups similar for –O1 and –O3 optimizations
    - –O3 optimizations were very beneficial to hardware
  - Speedups similar between ARM and MIPS
    - Complex instructions of ARM didn't hurt synthesis
  - MicroBlaze speedups much larger
    - MicroBlaze is a slower microprocessor
[Figure: speedup by instruction set and optimization level]

High-level vs. Binary Synthesis: Proprietary H.264 Decoder
- High-level synthesis vs. binary synthesis
  - Collaboration with Freescale Semiconductor
- H.264 decoder
  - MPEG-4 Part 10 Advanced Video Coding (AVC)
  - 3x smaller than MPEG-2
  - Better quality
[Figure: MPEG-2 vs. H.264 image comparison]

High-level vs. Binary Synthesis: Proprietary H.264 Decoder
- Binary synthesis was competitive with high-level synthesis
  - High-level speedup: 6.56x
  - Binary speedup: 6.55x


Thread Warping – Overview
- Architectural trend: include more cores on chip
- Result: more multi-threaded applications
Function a( ):
    for (i=0; i < 10; i++)
        createThread( b );
- OS can only schedule 2 threads on the µPs
  - Remaining 8 threads placed in thread queue
- Warp tools create custom accelerators for b( ) on the warp FPGA
- OS schedules 4 threads to custom accelerators
- 3x more thread parallelism
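The effect of the extra accelerators on the thread queue can be estimated with a ceiling division. This is a simplified sketch that assumes every thread takes the same time wherever it runs, which the slide does not claim; `rounds_needed` is a hypothetical helper and `slots` must be nonzero.

```c
/* Scheduling rounds needed to run `threads` threads when `slots` of them
 * can execute at once (ceiling division; slots must be > 0). */
unsigned rounds_needed(unsigned threads, unsigned slots)
{
    return (threads + slots - 1) / slots;
}
```

With 2 µPs alone, `rounds_needed(10, 2)` is 5. With 2 µPs plus 4 accelerators there are 6 slots, three times as many, and `rounds_needed(10, 6)` is 2.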

Thread Warping – Overview
- Profiler detects performance-critical loop in b( )
- Warp tools create larger/faster accelerators
- Potentially > 100x speedup

Thread Warping – Results
- Comparison of thread warping (TW) and multi-core
  - Simulated multi-cores ranging from 4 to 64
  - Thread warping: 4 cores + FPGA
- Thread warping 120x faster than 4-µP (ARM) system

Warp Processing – Custom Communication
- NoC (Network on a Chip) provides communication between multiple cores [Benini, De Micheli][Hemani][Kumar]
- Problem: Best topology is application dependent
[Figure: four µPs connected by a NoC; performance of App1 and App2 on bus vs. mesh topologies]

Warp Processing – Custom Communication
- Warp processing can dynamically choose topology: 2x to 100x improvement
- Collaboration with Rakesh Kumar, University of Illinois, Urbana-Champaign
  - "Amoebic Computing"
[Figure: µPs connected through an FPGA; performance of App1 and App2 on bus vs. mesh topologies]

Summary
- Any language, any compiler, standard binary: developer is unaware of FPGA/synthesis
- Decompilation makes binary synthesis possible
- On-chip CAD performs JIT FPGA compilation and updates the binary
- Expandable logic
- Warp processing invisibly achieves > 100x speedups

References
Patent: Warp Processor for Dynamic Hardware/Software Partitioning. F. Vahid, R. Lysecky, G. Stitt. Patent Pending, 2004.
1. Hardware/Software Partitioning of Software Binaries. G. Stitt and F. Vahid. IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2002, pp. 164-170.
2. Warp Processors. R. Lysecky, G. Stitt, and F. Vahid. ACM Transactions on Design Automation of Electronic Systems (TODAES), 2006, Volume 11, Number 3, pp. 659-681.
3. Binary Synthesis. G. Stitt and F. Vahid. Accepted for publication in ACM Transactions on Design Automation of Electronic Systems (TODAES).
4. Expandable Logic. G. Stitt, F. Vahid. Submitted to IEEE/ACM Conference on Design Automation (DAC), 2007.
5. New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2005, pp. 547-554.
6. Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, B. Einloth. IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005, pp. 285-290.
7. A Decompilation Approach to Partitioning Software for Microprocessor/FPGA Platforms. G. Stitt and F. Vahid. IEEE/ACM Design Automation and Test in Europe (DATE), 2005, pp. 396-397.
8. Dynamic Hardware/Software Partitioning: A First Approach. G. Stitt, R. Lysecky and F. Vahid. IEEE/ACM Conference on Design Automation (DAC), 2003, pp. 250-255.
Supported by NSF, SRC, Intel, IBM, Xilinx