Warp Processors: Towards Separating Function and Architecture


Frank Vahid, Professor, Department of Computer Science and Engineering, University of California, Riverside; faculty member, Center for Embedded Computer Systems, UC Irvine. This research is supported in part by the National Science Foundation, the Semiconductor Research Corporation, and Motorola.

Main Idea: Warp Processors (Dynamic HW/SW Partitioning)
1. Initially execute the application in software only.
2. Profile the application to determine its critical regions.
3. Partition the critical regions to hardware.
4. Program the configurable logic and update the software binary.
5. The partitioned application executes faster and with lower energy consumption (its speed has been "warped").
(Architecture: microprocessor with I$/D$, a profiler, warp configurable logic, and a dynamic partitioning module (DPM).)
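The five steps above can be sketched as a control loop. This is an illustrative Python sketch only; all function and parameter names (`profile`, `synthesize`, `update_binary`, `threshold`) are hypothetical, since the real DPM implements these steps in on-chip hardware and firmware.

```python
# Illustrative sketch of the warp-processing loop; names are hypothetical.

def warp_partition(profile, synthesize, update_binary, binary, threshold=0.5):
    """profile: {region: fraction of execution time} gathered in step 2.
    Returns (updated binary, bitstream) after steps 3-4, or the original
    binary unchanged if no region dominates execution time."""
    # Steps 2-3: pick the most critical region above the threshold.
    hot = [r for r, f in sorted(profile.items(), key=lambda kv: -kv[1])
           if f >= threshold]
    if not hot:
        return binary, None
    # Steps 3-4: synthesize the region to configurable logic and patch the
    # binary so it invokes the hardware; step 5 then runs the "warped" app.
    bitstream = synthesize(hot[0])
    return update_binary(binary, hot[0]), bitstream
```

With stub functions this shows the shape of the flow: the binary is only patched when the profiler finds a dominant region.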

Separating Function and Architecture
- Concept: separate function from detailed architecture.
- Benefits of a "standard binary" for microprocessors:
  - Uniform, mature development tools.
  - The same binary can run on a variety of architectures.
  - New architectures can be developed and introduced for existing applications.
- Trend towards dynamic translation and optimization of function when mapping to an architecture.
(Diagram: SW source, standard compiler, binary plus profiling, running on Processor 1, 2, or 3.)

Introduction
Partitioning software kernels, or all of the software, to an FPGA (processor and FPGA are commonly one chip today):
- Improvements eclipse those of dynamic software methods: speedups of 10x to 1000x, far more potential than dynamic SW optimizations (1.3x, maybe 2-3x?).
- Energy reductions of 90% or more.
Why not more popular?
- Loses the benefits of a standard binary.
- Non-standard chips (they didn't even exist a few years ago).
- Special tools; harder to design, test, and debug.

Single-Chip Microprocessor/FPGA Platforms Appearing Commercially
FPGAs are big, but Moore's Law continues on, and mass-produced platforms can be very cost effective. An FPGA next to a processor is increasingly common. (Product images courtesy of Atmel, Altera, Xilinx (PowerPCs), and Triscend.)

Introduction: Binary-Level Hardware/Software Partitioning
- Traditional partitioning is done before compilation. Can we instead dynamically move software kernels to an FPGA?
- Enabler: binary-level partitioning and synthesis [Stitt & Vahid, ICCAD'02].
  - Partition and synthesize starting from the SW binary: binary -> partitioner -> netlist plus modified binary -> processor plus ASIC/FPGA. Initially desktop based.
- Advantages: any compiler, any language, multiple sources, assembly/object support, legacy code support.
- Disadvantage: loses high-level information. Quality loss?

Introduction: Binary-Level Hardware/Software Partitioning (results; Stitt/Vahid '04)

Introduction: Binary Partitioning Enables Dynamic Partitioning
- Dynamic HW/SW partitioning: embed the partitioning CAD tools on-chip. Feasible in the era of billion-transistor chips.
- Advantages:
  - No special desktop tools; completely transparent.
  - Avoids the complexity of supporting different FPGA types.
  - Complements other approaches: desktop CAD is best from a purely technical perspective, but dynamic partitioning opens additional market segments (i.e., all software developers) that might otherwise never use desktop CAD.
  - Back to a "standard binary": opens processor architects to a world of speedup using FPGAs.

Warp Processors: Tools & Requirements
- Warp processor architecture: on-chip profiling architecture, configurable logic architecture, and dynamic partitioning module (DPM).
- On-chip tool flow: binary -> decompilation -> partitioning -> RT synthesis -> logic synthesis -> technology mapping -> placement & routing -> HW; a binary updater produces the updated binary.
- Is a DPM with its own microprocessor overkill? Consider that the FPGA is much bigger than the microprocessor. Also, there may be dozens of microprocessors, but all of them can share one DPM.

Warp Processors: All that CAD on-chip?
CAD people may first think dynamic HW/SW partitioning is "absurd": those CAD tools are complex.
- They require long execution times on powerful desktop workstations (the chart showed per-stage times of roughly 30 seconds to 30 minutes, from decompilation through placement and routing).
- They require very large memory resources (roughly 10 MB to 60 MB per stage) and usually gigabytes of hard drive space.
- The cost of a complete CAD tool package can exceed $1 million.
All that on-chip?

Warp Processors: Tools & Requirements
But, in fact, on-chip CAD may be practical, since it is specialized:
- Traditional CAD: huge, arbitrary input. Warp processor CAD: critical SW kernels only.
- Traditional FPGA: huge, arbitrary netlists, ASIC prototyping, varied I/O. Warp processor FPGA: kernel speedup only.
- Careful simultaneous design of the FPGA and the CAD: FPGA features are evaluated for their impact on CAD, the CAD influences FPGA features, and architecture features are added for kernels.

Warp Processors: Configurable Logic Architecture (Lysecky/Vahid, DATE'04)
- Loop support hardware: data address generators (DADG) and loop control hardware (LCH), as found in digital signal processors, for fast loop execution.
  - Supports memory accesses with a regular access pattern.
  - Synthesis of an FSM is not required for many critical loops.
- 32-bit fast multiply-accumulate (MAC) unit.
(Architecture: DADG & LCH, registers Reg0-Reg2, and the 32-bit MAC sit above the configurable logic fabric.)
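As a rough software model of what the DADG does for a regular (strided) access pattern, the following sketch may help; the function name and interface are illustrative, not part of the actual hardware.

```python
def dadg_addresses(base, stride, count):
    """Model of a data address generator: emit the addresses of a regular
    strided access pattern, so the loop body needs no address arithmetic
    and no FSM has to be synthesized for the loop control."""
    addr = base
    for _ in range(count):
        yield addr
        addr += stride
```

In hardware, this address sequence is produced each cycle by the DADG while the datapath consumes the corresponding data.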

Warp Processors: Configurable Logic Fabric (Lysecky/Vahid, DATE'04)
- Simple fabric: an array of configurable logic blocks (CLBs) surrounded by switch matrices (SMs).
- Simple CLB: two 3-input, 2-output LUTs with carry-chain support, plus routing to adjacent CLBs.
- Simple switch matrices: 4 short and 4 long channels.
- Designed for simple, fast CAD.

Warp Processors: Dynamic Partitioning Module (DPM)
- Executes the on-chip partitioning tools (decompilation, partitioning, RT synthesis, logic synthesis, technology mapping, placement & routing, and binary updating).
- Consists of a small low-power processor (ARM7); current SoCs can have dozens of these.
- On-chip instruction and data caches; a few megabytes of memory.

Warp Processors: Decompilation
- Goal: recover the high-level information lost during compilation; otherwise, synthesis results will be poor.
- Utilizes sophisticated decompilation methods developed over the past decades for binary translation.
- Indirect jumps hamper CDFG recovery, but they are not too common in critical loops (function pointers, switch statements).
Flow: software binary -> binary parsing -> CDFG creation -> control structure recovery (discover loops, if-else, etc.) -> removing instruction-set overhead (reduce operation sizes, etc.) -> undoing back-end compiler optimizations (reroll loops, etc.) -> alias analysis (allows parallel memory access) -> annotated CDFG.
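One piece of the control-structure recovery step above, finding the loops in the recovered control-flow graph, can be sketched as standard DFS back-edge detection. The representation here is illustrative, not the DPM's actual data structure.

```python
def find_back_edges(cfg, entry):
    """Return the back edges (tail, header) of a control-flow graph; each
    back edge identifies a natural loop header, the first step in
    recovering high-level loop structure from a binary's CFG.
    cfg: {block: [successor blocks]}."""
    back, on_stack, done = [], set(), set()

    def dfs(node):
        on_stack.add(node)
        for succ in cfg.get(node, []):
            if succ in on_stack:          # edge back into an active block
                back.append((node, succ))
            elif succ not in done:
                dfs(succ)
        on_stack.discard(node)
        done.add(node)

    dfs(entry)
    return back
```

For a simple while-loop-shaped CFG, the single back edge points from the loop body back to the loop header.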

Warp Processors: Decompilation Results
In most situations, we can recover all high-level information. Recovery succeeded for dozens of benchmarks, using several different compilers and optimization levels.

Warp Processors: Execution Time and Memory Requirements
(Chart: on-chip decompilation requires under 1 second and about 1 MB, versus the desktop tools' per-stage times of 30 seconds to 30 minutes and 10-60 MB.)

Warp Processors: Dynamic Partitioning Module (DPM)
(Tool-flow overview repeated as a transition slide.)

Warp Processors: RT Synthesis (Stitt/Lysecky/Vahid, DAC'03)
- Converts the decompiled CDFG to Boolean expressions.
- Maps memory accesses to our data address generator architecture: detects reads/writes, the memory access pattern, and memory read/write ordering.
- Optimizes the dataflow graph: removes address calculations and loop counter/exit conditions; loop control is handled by the loop control hardware.
(Example: a memory read feeding an add becomes a DADG read whose address is incremented automatically.)

Warp Processors: RT Synthesis (Stitt/Lysecky/Vahid, DAC'03)
- Maps dataflow operations to hardware components; we currently support adders, comparators, shifters, Boolean logic, and multipliers.
- Creates a Boolean expression for each output bit of the dataflow graph. For example, for a 32-bit adder r4 = r1 + r2:
  r4[0] = r1[0] xor r2[0], carry[0] = r1[0] and r2[0]
  r4[1] = (r1[1] xor r2[1]) xor carry[0], carry[1] = ...
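The per-bit expressions above describe a ripple-carry adder. A small sketch that evaluates them (LSB-first bit lists; illustrative only, not the tool's internal representation):

```python
def ripple_add(a_bits, b_bits):
    """Evaluate the slide's per-bit adder expressions, LSB first:
    sum[i] = a[i] xor b[i] xor carry[i-1], with the carry chain
    carry[i] = majority(a[i], b[i], carry[i-1]); with carry = 0 this
    reduces to the slide's carry[0] = a[0] and b[0] for the first bit."""
    out, carry = [], 0
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (a & carry) | (b & carry)
    return out, carry
```

Each output line of the Python loop corresponds to one generated Boolean expression per result bit.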

Warp Processors: Execution Time and Memory Requirements
(Chart updated: on-chip decompilation and RT synthesis each run in under 1 second, using about 1 MB and 0.5 MB respectively.)

Warp Processors: Dynamic Partitioning Module (DPM)
(Tool-flow overview repeated as a transition slide.)

Warp Processors: Logic Synthesis (Stitt/Lysecky/Vahid, DAC'03)
- Optimizes the hardware circuit created during RT synthesis.
- There is a large opportunity for logic minimization due to the use of immediate values in the binary code.
- Utilizes a simple two-level logic minimization approach. For example, for r2 = r1 + 4, the constant bits of the immediate fold away:
  r2[0] = r1[0] xor 0              ->  r2[0] = r1[0]
  r2[1] = r1[1] xor 0 xor carry[0] ->  r2[1] = r1[1] xor carry[0]
  r2[2] = r1[2] xor 1 xor carry[1]     (the constant 1 remains as an inversion)
  r2[3] = r1[3] xor 0 xor carry[2] ->  r2[3] = r1[3] xor carry[2]
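A sketch of the constant folding shown above for r2 = r1 + imm. Names and the string output are illustrative; a real two-level minimizer such as ROCM works on cube lists, not expression strings. Each bit of the immediate is fixed, so `xor 0` terms vanish while `xor 1` terms survive as inversions.

```python
def add_imm_bit_exprs(width, imm):
    """Per-bit expressions for r2 = r1 + imm after constant folding:
    an immediate bit of 0 contributes nothing (x xor 0 = x), while an
    immediate bit of 1 is kept as an inverter (x xor 1 = not x)."""
    exprs = []
    for i in range(width):
        terms = ["r1[%d]" % i]
        if (imm >> i) & 1:
            terms.append("1")            # kept: this is an inversion
        if i > 0:
            terms.append("carry[%d]" % (i - 1))
        exprs.append("r2[%d] = " % i + " xor ".join(terms))
    return exprs
```

For imm = 4 this reproduces the simplified expressions on the slide, with only bit 2 keeping its constant term.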

Warp Processors: ROCM (Lysecky/Vahid, DAC'03; Lysecky/Vahid, CODES+ISSS'03)
- ROCM: the Riverside On-Chip Minimizer, a two-level minimization tool.
- Uses a combination of approaches from Espresso-II [Brayton et al., 1984] and Presto [Svoboda & White, 1979]:
  - Eliminates the need to compute the off-set, reducing memory usage.
  - Uses a single expand phase (over the on-set and dc-set) instead of multiple expand/reduce/irredundant iterations.
- On average, results are only 2% larger than the optimal solution for our benchmarks.
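The single expand phase can be sketched on an explicit minterm representation. This is purely illustrative: ROCM itself operates on cubes far more efficiently, but the sketch shares the key property that the off-set is never computed, since expansion is checked only against the on-set plus dc-set.

```python
from itertools import product

def expand_cube(cube, care, nvars):
    """One expand pass in the Espresso/ROCM style: try dropping each
    literal of a cube, keeping the expansion whenever every minterm the
    bigger cube covers is still in the on-set or dc-set ('care').
    cube: {var_index: 0 or 1}; care: set of minterm tuples."""
    def covered(c):
        return [m for m in product((0, 1), repeat=nvars)
                if all(m[v] == b for v, b in c.items())]

    for var in list(cube):
        trial = {v: b for v, b in cube.items() if v != var}
        if all(m in care for m in covered(trial)):
            cube = trial                 # literal dropped: bigger implicant
    return cube
```

For f(a, b) = a (on-set {10, 11}), the minterm cube a·b' expands to the single-literal implicant a.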

Warp Processors: ROCM Results (Lysecky/Vahid, DAC'03; Lysecky/Vahid, CODES+ISSS'03)
- ROCM executing on a 40 MHz ARM7 (Triscend A7) requires less than 1 second; the comparison point was a 500 MHz Sun Ultra 60.
- Small code size of only 22 kilobytes.
- Average data memory usage of only 1 megabyte.

Warp Processors: Execution Time and Memory Requirements
(Chart updated: on-chip decompilation, RT synthesis, and logic synthesis run in under 1 second, under 1 second, and about 1 second respectively, each within 0.5-1 MB.)

Warp Processors: Dynamic Partitioning Module (DPM)
(Tool-flow overview repeated as a transition slide.)

Warp Processors: Technology Mapping/Packing (Lysecky/Vahid, DATE'04; Stitt/Lysecky/Vahid, DAC'03)
ROCPAR technology mapping/packing:
1. Decompose the hardware circuit into basic logic gates (AND, OR, XOR, etc.).
2. Traverse the logic network, combining nodes to form single-output LUTs.
3. Combine LUTs with common inputs to form the final 2-output LUTs.
4. Pack pairs of LUTs in which the output of one LUT is an input to the other.
5. Pack the remaining LUTs into CLBs.
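Step 2 above, combining nodes into single-output LUTs, can be sketched as greedy cone collapsing. The gate-network representation and names here are illustrative assumptions, not ROCPAR's actual data structures.

```python
def pack_into_lut(root, fanin, k=3):
    """Greedily collapse gates into the cone rooted at `root` while the
    cone's support (its distinct leaf inputs) still fits in a k-input
    LUT. fanin: {gate: [inputs]}; primary inputs have no fanin entry."""
    support = list(dict.fromkeys(fanin[root]))
    changed = True
    while changed:
        changed = False
        for s in support:
            if s in fanin:                       # s is a gate we may absorb
                merged = [x for x in support if x != s] + list(fanin[s])
                merged = list(dict.fromkeys(merged))
                if len(merged) <= k:             # still fits in one LUT
                    support, changed = merged, True
                    break
    return support
```

A two-gate cone whose combined support is three signals collapses into one 3-input LUT; a wider cone stops at the LUT input limit.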

Warp Processors: Placement (Lysecky/Vahid, DATE'04; Stitt/Lysecky/Vahid, DAC'03)
ROCPAR placement:
- Identify the critical path and place its nodes in the center of the configurable logic fabric.
- Use the dependencies between the remaining CLBs to determine their placement, attempting to use adjacent-cell routing whenever possible.
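A minimal sketch of the first step, seating the critical path along the centre of the fabric so consecutive nodes are physically adjacent. The coordinate scheme and interface are illustrative assumptions.

```python
def place_critical_path(path, fabric_w, fabric_h):
    """Place critical-path CLBs in consecutive columns of the fabric's
    centre row, so each pair of neighbours can use fast adjacent-cell
    routing; remaining CLBs would then be placed around them by
    dependency. Returns {node: (col, row)}."""
    row = fabric_h // 2
    start = max(0, (fabric_w - len(path)) // 2)
    return {node: (start + i, row) for i, node in enumerate(path)}
```

On a 5x5 fabric, a three-node critical path lands in columns 1-3 of the middle row.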

Warp Processors: Execution Time and Memory Requirements
(Chart updated: technology mapping and placement each add about a second or less and roughly 0.5-1 MB, keeping every on-chip stage so far well under the desktop tools' requirements.)

Warp Processors: Routing
- FPGA routing: find a path within the FPGA to connect the source and sinks of each net.
- VPR, the Versatile Place and Route tool [Betz et al., 1997], uses a modified Pathfinder algorithm:
  - Allows overuse of routing resources during each routing iteration.
  - If illegal routes exist, it updates routing costs, rips up all routes, and reroutes (repeating until there is no congestion).
  - Increases performance over the original Pathfinder algorithm.
- Routability-driven routing uses the fewest tracks possible; timing-driven routing optimizes circuit speed.
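The rip-up-and-reroute loop can be sketched as negotiated congestion on a small routing-resource graph. This is a simplification (unit capacities, no timing term), and all names are illustrative, not VPR's API.

```python
import heapq

def shortest_path(adj, src, dst, edge_cost):
    """Dijkstra over an undirected graph; returns the path from src to
    dst as a list of edges, each a sorted (u, v) tuple."""
    dist, prev, pq = {src: 0}, {}, [(0, src)]
    while pq:
        d, n = heapq.heappop(pq)
        if d > dist.get(n, float("inf")):
            continue
        for m in adj[n]:
            e = tuple(sorted((n, m)))
            nd = d + edge_cost(e)
            if nd < dist.get(m, float("inf")):
                dist[m], prev[m] = nd, n
                heapq.heappush(pq, (nd, m))
    path, n = [], dst
    while n != src:
        path.append(tuple(sorted((n, prev[n]))))
        n = prev[n]
    return path

def pathfinder_route(adj, nets, capacity=1, max_iters=20):
    """Negotiated congestion in the Pathfinder style: each iteration
    routes every net with a cost that penalises present and historical
    overuse, then rips everything up; over-capacity edges accumulate
    history cost until the routes separate."""
    edges = {tuple(sorted((u, v))) for u in adj for v in adj[u]}
    hist = {e: 0 for e in edges}
    for _ in range(max_iters):
        use, routes = {e: 0 for e in edges}, {}
        for net, (src, dst) in nets.items():
            path = shortest_path(adj, src, dst,
                                 lambda e: 1 + hist[e] + use[e])
            for e in path:
                use[e] += 1
            routes[net] = path
        over = [e for e in edges if use[e] > capacity]
        if not over:
            return routes                # legal routing found
        for e in over:
            hist[e] += 1                 # rip up and negotiate next pass
    return None
```

On a square graph with two nets sharing endpoints, the congestion term pushes the second net onto the disjoint path.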

Warp Processors: Routing (Lysecky/Vahid/Tan, DAC'04)
Riverside On-Chip Router (ROCR):
- Represents routing between CLBs as routing between switch matrices (SMs).
- Resource graph: nodes correspond to SMs; edges correspond to the short and long channels between SMs.
- Routing: a greedy, depth-first routing algorithm routes nets between SMs (with the same rip-up-and-reroute loop on congestion), then assigns specific channels to each route using Brelaz's greedy vertex coloring algorithm.
- Requires much less memory than VPR, as the resource graph is much smaller.
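The channel-assignment step can be sketched with Brelaz's algorithm (DSATUR) on a conflict graph whose nodes are routed segments and whose edges join segments that share a channel span. The graph representation is an illustrative assumption.

```python
def brelaz_coloring(conflicts):
    """Brelaz's (DSATUR) greedy vertex coloring: repeatedly color the
    uncolored node with the highest saturation (most distinct neighbor
    colors), breaking ties by degree. Here colors model physical
    channels: adjacent nodes (conflicting route segments) must get
    different channels. conflicts: {node: [neighbors]}."""
    color = {}
    while len(color) < len(conflicts):
        node = max((n for n in conflicts if n not in color),
                   key=lambda n: (len({color[m] for m in conflicts[n]
                                       if m in color}),
                                  len(conflicts[n])))
        used = {color[m] for m in conflicts[node] if m in color}
        color[node] = next(c for c in range(len(conflicts)) if c not in used)
    return color
```

Three mutually conflicting segments need three channels; a chain of conflicts needs only two.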

Warp Processors: Routing Performance and Memory Usage Results (Lysecky/Vahid/Tan, DAC'04)
- On average, 10x faster than timing-driven VPR; up to 21x faster (for ex5p).
- Memory usage of only 3.6 MB: 13x less than VPR.

Warp Processors: Routing Critical Path Results (Lysecky/Vahid/Tan, DAC'04)
- 32% longer critical path than timing-driven VPR.
- 10% shorter critical path than routability-driven VPR.

Warp Processors: Execution Time and Memory Requirements
(Chart complete: routing takes about 10 seconds and 3.6 MB; every other on-chip stage runs in about a second or less within 0.5-1 MB, versus the desktop tools' minutes and tens of MB per stage.)

Warp Processors: Experimental Setup
- Warp processor, based on a commercial single-chip platform (Triscend A7):
  - Embedded ARM microprocessor; configurable logic fabric with a fixed frequency 80% that of the microprocessor.
  - Used the dynamic partitioning module to map the critical region to hardware; our CAD tools executed on a 75 MHz ARM7 processor, with the DPM active for ~10 seconds.
  - Experiment: the key tools were automated; some other tasks were assisted by hand.
- Versus traditional HW/SW partitioning: an ARM7 processor plus a Xilinx Virtex-E FPGA (at its maximum possible speed); the software was manually partitioned using VHDL, synthesized with Xilinx ISE 4.1 on a desktop.

Warp Processors: Initial Results, Speedup (Critical Region/Loop)
Average loop speedup of 29x.

Warp Processors: Initial Results, Speedup (Overall Application, with ONLY 1 Loop Sped Up)
Average speedup of 2.1x, vs. 2.2x for the Virtex-E (ISE 4.1).
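The gap between a 29x loop speedup and roughly 2.1x overall is Amdahl's law; a one-line check follows. The 55% loop fraction used below is an illustrative value, not a number from the slides.

```python
def overall_speedup(loop_fraction, loop_speedup):
    """Amdahl's law: whole-application speedup when a region taking
    `loop_fraction` of the original runtime is accelerated by
    `loop_speedup` while the rest runs unchanged."""
    return 1.0 / ((1.0 - loop_fraction) + loop_fraction / loop_speedup)
```

With roughly 55% of the runtime in the accelerated loop, a 29x loop speedup yields only about 2.1x overall, which is why speeding up more loops matters.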

Warp Processors: Initial Results, Energy Reduction (Overall Application, 1 Loop ONLY)
Average energy reduction of 33%, vs. 36% for the Xilinx Virtex-E (best case: 74%).

Warp Processors: Execution Time and Memory Requirements (on a PC)
- Xilinx ISE: 9.1 s (plus manually performed steps), 60 MB.
- ROCPAR: 0.2 s, 3.6 MB, a 46x improvement.
- On a 75 MHz ARM7: only 1.4 s.

Multi-Processor Platforms
- Multiple processors can share a single DPM (time-multiplexed); the DPM is just another processor whose task is to help the other processors.
- Processors can even be soft cores in the FPGA.
- The DPM can even revisit the same application in case its use or data has changed.

The Idea of Warp Processing Can Be Viewed as JIT FPGA Compilation
- Idea: a standard binary for the FPGA, with benefits similar to those of a standard binary for a microprocessor: portability, transparency, standard tools.
- VHDL/Verilog is compiled once to a standard binary; profiling plus a JIT FPGA compiler then map that binary to any particular FPGA's hardware binary.
- May involve a microprocessor for compactness of non-critical behavior.

Future Directions
- It is already widely known that mapping software to FPGAs has great potential.
- Our work has shown that mapping software to FPGAs dynamically may be feasible.
- Extensive future work is needed on tools and fabric to achieve overall application speedup/energy improvements of 100x-1000x.

Ultimately…
Working towards the separation of function from architecture:
- Write the application and create a "standard binary".
- Map the binary to any microprocessor (one or more), any FPGA, or any combination thereof.
- This enables improvements in function and architecture without the heavy interdependence of today.

Publications & Acknowledgements
All of these publications are available at http://www.cs.ucr.edu/~vahid/pubs
- Dynamic FPGA Routing for Just-in-Time FPGA Compilation, R. Lysecky, F. Vahid, and S. Tan, Design Automation Conference (DAC), 2004.
- A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning, R. Lysecky and F. Vahid, Design Automation and Test in Europe Conference (DATE), February 2004.
- Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware, A. Gordon-Ross and F. Vahid, ACM/IEEE Conf. on Compilers, Architecture and Synthesis for Embedded Systems (CASES), 2003; to appear in the special issue "Best of CASES/MICRO" of IEEE Trans. on Computers.
- A Codesigned On-Chip Logic Minimizer, R. Lysecky and F. Vahid, ACM/IEEE CODES+ISSS Conference, 2003.
- Dynamic Hardware/Software Partitioning: A First Approach, G. Stitt, R. Lysecky, and F. Vahid, Design Automation Conference (DAC), 2003.
- On-Chip Logic Minimization, R. Lysecky and F. Vahid, Design Automation Conference (DAC), 2003.
- The Energy Advantages of Microprocessor Platforms with On-Chip Configurable Logic, G. Stitt and F. Vahid, IEEE Design and Test of Computers, November/December 2002.
- Hardware/Software Partitioning of Software Binaries, G. Stitt and F. Vahid, IEEE/ACM International Conference on Computer-Aided Design (ICCAD), November 2002.
We gratefully acknowledge financial support from the National Science Foundation and the Semiconductor Research Corporation for this work. We also appreciate the collaborations and support from Motorola, Triscend, and Philips/TriMedia.