Computing System Element Choices

Ordered from highest programmability/flexibility (software end) to highest specialization (hardware end):
• General Purpose Processors (GPPs): superscalar, VLIW.
• Reconfigurable Computing: also known as Custom Computing Machines (CCMs). Utilizes hardware devices customized to match the computation, using FPGAs (fine grain) or micro-coded arrays of simple processors (coarse grain).
  – Reconfigurable hardware customization/reconfigurability, how? Change both the functionality of hardware cells (elements) and their spatial connectivity to match the requirements of the computation/application on the fly (at runtime).
• Application Specific Processors (ASPs): DSPs, network processors, graphics processors, ...
• Co-processors.
• ASICs.

Moving toward specialization increases development cost/time and performance per chip area per watt (computational efficiency), but also shortens the useful life cycle.

EECC 722 - Shaaban, lec #9, Fall 2012, 10-15-2012
Spatial vs. Temporal Computing: The Space vs. Time Trade-off
• Spatial (using hardware): computation defined by the fixed functionality and connectivity of hardware elements.
• Temporal (using software/programs): a processor running programs written using a pre-defined, fixed set of instructions (ISA).
Computing Element Programmability: Defining Terms
• Fixed Functionality (fixed hardware): computes one function (e.g. FP-multiply, divider, DCT); the function is defined at fabrication time. Example: ASICs.
• Programmable (functionality not fixed): computes "any" computable function; the function is defined/changed after fabrication (e.g. at compilation or runtime), i.e. "late binding":
  – Processors: GPPs, ASPs.
  – Configurable hardware: e.g. FPGAs.
• Parameterizable Hardware: performs a limited "set" of functions. Example: co-processors.
Computing Element Choices: Observation
• Generality and computational efficiency are in some sense inversely related: the more general-purpose a computing element is, and thus the greater the number of tasks it can perform, the less efficient it will be at performing any specific task.
  – Design decisions are therefore almost always compromises; designers identify the key features or requirements of the target applications that must be met and compromise on other, less important features.
• To counter the problem of specialized, computationally intense problems for which general-purpose machines cannot achieve the necessary performance, special-purpose processors (ASPs), attached processors, and coprocessors have been built for many years, especially in areas such as image or signal processing (where many of the computational tasks are very well defined).
  – The problem with such machines (i.e. ASPs) is that they are special-purpose: as problems change or new ideas and techniques develop, their lack of flexibility (due to fixed-ISA limitations) makes them problematic as long-term solutions.
• Reconfigurable computing, or Custom Computing Machines (CCMs), using FPGAs (Field Programmable Gate Arrays, first introduced in 1986 by Xilinx) or other reconfigurable (customizable) hardware can offer an attractive alternative to the other computing element choices.
  – FPGAs were originally developed for: 1) hardware design verification, 2) rapid prototyping, and 3) potential ASIC replacement.
What is Reconfigurable Computing (RC)?
• Utilize reconfigurable hardware devices (spatially-programmed connections of hardware processing elements) tailored to the application: customize the hardware to match the computations needed in a particular application by changing hardware functionality on the fly (at runtime).
• Goal: use reconfigurable hardware devices to build systems with advantages over conventional computing solutions in terms of:
  – Flexibility, time-to-market, life-cycle cost (vs. ASICs).
  – Performance and power, i.e. computational efficiency (vs. processors).
• "Hardware" customized to the specifics of the problem: a direct map of the problem-specific dataflow and control; circuits "adapted" as problem requirements change.
• Hardware customization/reconfigurability, how? Change both the functionality of hardware cells (elements) and their spatial connectivity to match the requirements of the computation/application on the fly (at runtime). This is still spatial computing, but both the functionality and the connectivity of the hardware elements are not fixed.
Conventional Programmable Processors vs. Configurable Devices
Conventional programmable processors:
• Moderately wide datapaths, which have been growing larger over time (e.g. 16, 32, 64, 128 bits).
• Support for large on-chip instruction/data caches, also growing larger over time, that can now hold thousands of instructions.
• High-bandwidth instruction distribution so that several instructions may be issued per cycle, at the cost of dedicating considerable die area to instruction fetch/distribution/issue/scheduling.
• A single thread of computation control per processor core (SMT changes this).
Configurable devices (such as FPGAs):
• Narrow datapaths (almost always one bit).
• On-chip space for only one instruction per compute element, i.e. the single instruction that tells the FPGA array cell (Configurable Logic Block, CLB) what function to perform and how to route its inputs and outputs (connectivity to other cells).
• Minimal die area dedicated to instruction distribution, such that it takes hundreds or thousands of compute cycles to change the active set of array instructions (e.g. from one FPGA configuration to another). Issue: potentially long reconfiguration latency.
• Can handle regular and bit-level computations more efficiently than processors.
Why Reconfigurable Computing?
• To improve performance (including predictability) and computational energy efficiency over software implementations (vs. processors: GPPs, ASPs).
  – e.g. signal processing applications in configurable hardware.
• To provide powerful, application-specific operations in hardware (ASIC-like).
• To improve product flexibility and lower development cost/time compared to hardware (vs. ASICs).
  – e.g. encryption, compression, or network protocol handling in configurable hardware.
• To use the same hardware for different purposes at different points in time in the computation (lowers cost vs. ASICs).
  – Given sufficient use of each configuration to tolerate the potentially long reconfiguration latency/overheads.
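The last point can be made quantitative with a simple break-even sketch (all the cycle counts below are hypothetical): reconfiguration pays off only once enough work items pass through the configured circuit to amortize the reload latency.

```python
import math

def breakeven_items(reconfig_cycles, sw_cycles_per_item, hw_cycles_per_item):
    """Smallest number of items for which configuring hardware beats software.

    Solves: reconfig_cycles + n * hw_cycles_per_item <= n * sw_cycles_per_item.
    """
    saved_per_item = sw_cycles_per_item - hw_cycles_per_item
    if saved_per_item <= 0:
        raise ValueError("hardware must be faster per item to ever break even")
    return math.ceil(reconfig_cycles / saved_per_item)

# E.g. a 1M-cycle reconfiguration, 100 cycles/item in software, 10 in hardware:
# the configuration must process 11,112 items before it has paid for itself.
```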
Benefits of Reconfigurable Logic Devices
• Non-permanent customization and application development after fabrication ("late binding"): customization is achieved by changing both the function of the hardware elements and their connectivity to match the requirements of the application.
• Economies of scale (amortize large, fixed design costs).
• Lower development time/cost and shorter time-to-market than ASICs (dealing with evolving requirements and standards, new ideas).
Potential disadvantages:
• Efficiency penalty (area, performance, power) compared to ASICs, i.e. lower computational efficiency than ASICs.
• Need for correctness verification (common to all hardware-based solutions).
Spatial/Configurable Hardware Benefits
• Potentially an order of magnitude (10x) or higher raw computational density advantage over processors.
• Potential for fine-grained (bit-level) control/parallelism, which can offer another order of magnitude of benefit.
• Locality.
Spatial/Configurable Drawbacks
• Each compute/interconnect resource is dedicated to a single function.
• Must dedicate resources to every computational subtask.
• Infrequently needed portions of a computation sit idle, leading to inefficient use of resources vs. ASICs (common to all hardware-based solutions, but still much better than processors).
Configurable Computing Application Areas
In general, many types of applications with a few computationally intensive "kernels" (inner loops) that can be done more efficiently in hardware:
• Digital signal processing
• Encryption
• Image processing
• Telemetry data processing (remote sensing)
• Data/image/video compression/decompression
• Low-power designs (through hardware "sharing")
• Scientific/engineering physical-system modeling (e.g. finite-element computations)
• Network applications (e.g. reconfigurable routers)
• Variable-precision arithmetic
• Logic-intensive applications
• In-the-field hardware enhancements
• Adaptive (learning) hardware elements
• Rapid system prototyping and verification of processor and ASIC designs (the original applications of FPGAs)
• ...
Technology Trends Driving Configurable Computing
• The increasing gap between the "peak" performance of general-purpose processors and the "average actually achieved" performance.
  – Most programmers don't write code that gets anywhere near the peak performance of current superscalar CPUs.
• Improvements in FPGA hardware capacity and speed:
  – FPGAs use standard SRAM processes and "ride the commodity technology" curve (e.g. VLSI technology).
  – Volume pricing, even though it is a customized solution.
• Improvements in synthesis and FPGA mapping/routing software.
• The increasing number of transistors on a (processor) chip (one billion+): how to use them efficiently?
  – Bigger caches (most popular)?
  – Multiple processor cores (Chip Multiprocessors, CMPs)?
  – SMT support?
  – IRAM-style vector/memory?
  – DSP cores or other application-specific processors?
  – Reconfigurable logic (FPGA or other reconfigurable logic)?
  – A combination of the above choices? A heterogeneous computing system on a chip? Micro-Heterogeneous Computing (MHC)?
Configurable Computing Architectures
• Configurable computing architectures combine elements of general-purpose computing and application-specific integrated circuits (ASICs).
  – A general-purpose processor operates with fixed circuits that perform multiple tasks under the control of software.
  – An ASIC contains circuits specialized to a particular task and thus needs little or no software to instruct it.
• The configurable computer can execute software commands that alter its configurable devices (e.g. FPGA circuits) as needed to perform a variety of jobs, i.e. a configuration bitstream changes both the functionality and connectivity of the hardware elements (cells) on the fly (at runtime).
Levels of the Reconfigurable Computational Elements (e.g. FPGAs)
(according to the grain size of the implemented components, finer to coarser grain)
• Reconfigurable logic: bit-level operations (e.g. encoding).
• Reconfigurable datapaths: dedicated data paths (e.g. filters, AGUs).
• Reconfigurable arithmetic: arithmetic kernels (e.g. convolution).
• Configurable processors: real-time operating systems (RTOS), process management.
Hybrid-Architecture Computer
• Combines general-purpose processors (GPPs) and reconfigurable devices, commonly:
  1. FPGA chips (fine-grain reconfigurable hardware), or
  2. Micro-coded arrays of simple processors (coarse-grain reconfigurable hardware).
• A controller FPGA may load circuit configurations stored in memory onto the processor FPGA in response to requests from the operating program. If the memory does not contain a requested circuit, the processor FPGA sends a request to the PC host, which then loads the configuration for the desired circuit.
• Common hybrid configurable architecture today: one or more FPGAs on a board connected to the host via an I/O bus (e.g. PCI-Express).
• Possible future hybrid configurable architectures:
  – Integrate a region of configurable hardware (FPGA or something else) onto the processor chip itself as reconfigurable functional units or coprocessors.
  – Integrate configurable hardware onto a DRAM chip => flexible computing without the memory bottleneck.
• Current hybrid architecture on a chip, hybrid FPGAs: integrate one or more hard-wired GPPs with an FPGA on the same chip. Example: Xilinx Virtex-II Pro, Virtex-4/5 FX (FPGA with one or two PowerPC cores).
Hybrid-Reconfigurable Computer: Levels of Coupling
Different levels of coupling in a hybrid reconfigurable system, from tightest to loosest:
1. Reconfigurable functional units (on chip, with ISA support): the future direction.
2. Reconfigurable coprocessor (on or off chip).
3. Attached reconfigurable processing unit (e.g. via PCI, invoked through function calls): the most common today.
4. External standalone processing unit (e.g. via a network/IO interface).
Tighter coupling means lower communication time/overheads (higher bandwidth, lower latency).
Sample Configurable Computing Application: Prototype Video Communications System
• Uses a single FPGA to perform four functions that typically require separate chips. A memory chip stores the four circuit configurations and loads them sequentially into the FPGA as needed.
• Initially, the FPGA's circuits are configured to acquire digitized video data. The chip is then rapidly reconfigured to transform the video information into a compressed form, and reconfigured again to prepare it for transmission. Finally, the FPGA circuits are reconfigured to modulate and transmit the video information.
• At the receiver, the four configurations are applied in reverse order to demodulate the data, uncompress the image, and then send it to a digital-to-analog converter so it can be displayed on a television screen.
Early Configurable (or Custom) Computing Successes
• DEC Programmable Active Memories, PAM (1992):
  – A universal FPGA-based hardware co-processor closely coupled to a standard host computer, developed at DEC's Paris Research Laboratory.
  – Fast RSA decryption implementation on a reconfigurable machine (10x faster than the fastest ASIC at the time).
• Splash 2 (1993) (more on Splash 2 in the lecture handout):
  – An attached processor system using Xilinx FPGAs as processing elements, developed at the Center for Computing Sciences.
  – Performs DNA sequence matching at 300x Cray-2 speed, and 200x the speed of a 16K-processor Thinking Machines CM-2.
• Many modern processors and ASICs are verified using FPGA-based hardware emulation systems.
• For many digital signal processing/filtering (e.g. FIR, IIR) algorithms, single-chip FPGAs outperform DSPs by 10-100x (fixed-point, not FP).
Fine-grain Reconfigurable Hardware Devices: Programmable Circuitry: FPGAs
• The Field-Programmable Gate Array (FPGA) was introduced by Xilinx (1986).
• Original target applications of FPGAs: 1) hardware design verification, 2) rapid prototyping, and 3) potential ASIC replacement.
• Programmable circuits can be created or removed by sending signals to gates in the logic elements (a configuration bitstream), changing both the functionality and connectivity of the logic blocks.
• A built-in grid of circuits arranged in columns and rows allows the designer to connect a logic element to other logic elements or to an external memory or microprocessor. The logic elements are grouped in Configurable Logic Blocks (CLBs) that perform basic binary operations such as AND, OR, and NOT.
• Firms, including Xilinx and Altera, have developed devices with the capability of 4,000 or more equivalent gates.
• Recently, in addition to "general-purpose" or generic FPGAs, more specialized FPGA families targeting specific areas such as DSP applications have been developed with hard-wired functional units (e.g. hard-wired MAC units, processors, ...), i.e. hybrid FPGAs.
Fine-grain Reconfigurable Hardware Devices: Field Programmable Gate Arrays (FPGAs)
• The chip contains many small building blocks that can be configured to implement different functions. These building blocks are known as CLBs (Configurable Logic Blocks).
• FPGAs are typically "programmed" by having them read in a stream of configuration information from off-chip (the configuration bitstream).
  – Typically in-circuit programmable (as opposed to EPLDs -- Electrically Programmable Logic Devices -- which are typically programmed by removing them from the circuit and using a PROM programmer).
• Only about 25% of an FPGA's gates are application-usable; the rest control the configurability, interconnects, etc.
• As much as 10x clock-rate degradation compared to fully custom hardware implementations (ASICs).
• Typically built using SRAM fabrication technology. Since FPGAs "act" like SRAM or logic, they lose their program (i.e. configuration) when they lose power.
  – Thus the configuration bits need to be reloaded on power-up, usually from a PROM, or downloaded from memory via an I/O bus.
Fine-grain Reconfigurable Hardware Devices: FPGAs: Look-Up Table (LUT)
• A K-LUT is a K-input lookup table backed by configuration memory.
• It can implement any function of K inputs by programming the table.
• Example 2-LUT configuration (XOR):
  In:  00 01 10 11
  Out:  0  1  1  0
• Larger LUTs (e.g. a 4-LUT) work the same way, with 2^K table entries.
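The LUT mechanism above can be modeled in a few lines (a software sketch, not any vendor's hardware): the "configuration" is simply the truth-table contents.

```python
class LUT:
    """A K-input lookup table: programming the 2^K-entry table 'configures'
    the cell to compute any Boolean function of its K inputs."""

    def __init__(self, k, table):
        assert len(table) == 2 ** k, "one table entry per input combination"
        self.k = k
        self.table = table  # the configuration bits

    def __call__(self, *inputs):
        # Pack the K input bits into a table index (first input = LSB).
        index = sum(bit << i for i, bit in enumerate(inputs))
        return self.table[index]

# The 2-LUT from the slide (out = 0,1,1,0 for in = 00,01,10,11), i.e. XOR:
xor2 = LUT(2, [0, 1, 1, 0])
```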
Fine-grain Reconfigurable Hardware Devices: FPGAs: Conventional FPGA Tile
• A tile pairs a Configurable Logic Block (CLB) -- a K-LUT (typically K=4) with an optional output flip-flop -- occupying ~25% of the FPGA area, with the surrounding routing, occupying ~75% of the FPGA area.
Fine-grain Reconfigurable Hardware Devices: FPGAs: A Generic Island-Style FPGA Routing Architecture
(Figure: an 8x8 array of 64 CLB tiles.)
Customization (configurability) is achieved by changing both the functionality of the hardware elements (CLBs here) and their spatial connectivity to match the requirements of the computation on the fly, using the configuration bitstream.
Fine-grain Reconfigurable Hardware Devices: FPGAs: Xilinx XC4000 Interconnect
Customization (configurability) is achieved by changing both the functionality of the hardware elements (CLBs here) and their spatial connectivity to match the requirements of the computation on the fly, using the configuration bitstream.
Fine-grain Reconfigurable Hardware Devices: FPGAs: Xilinx XC4000 Configurable Logic Block (CLB)
(Figure: cascaded LUTs -- two 4-LUTs feeding one 3-LUT.)
Fine-grain Reconfigurable Hardware Devices: FPGAs vs. RISC Processors
(Figure: computational density comparison -- FPGAs offer roughly 10x the computational density of RISC processors.)
Fine-grain Reconfigurable Hardware Devices: FPGAs: Processor vs. FPGA Area
(Figure: die-area comparison of a processor, including its cache, and an FPGA.)
Programming/Configuring FPGAs
Step 0?: Determine what portion of the computation is migrated to hardware, i.e. hardware-software partitioning (co-design).
• (1) Hardware design specification: a hardware design that realizes the selected hardware-bound, computationally intensive portion of the application is specified using RTL/HDL/logic diagrams (the result of hardware-software partitioning).
• Synthesis & layout: vendor-supplied, device-specific software tools are used to convert the hardware design to netlist format:
  – (2) Partition the design into logic blocks (CLBs): LUT mapping.
  – Then find a good (3) placement for each block and (4) the routing between them.
• (5) The serial configuration bitstream is then generated and fed down to the FPGAs themselves:
  – The configuration bits are loaded into a "long shift register" on the FPGA.
  – The output lines from this shift register are control wires that control the behavior of all the CLBs on the chip.
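The "long shift register" loading scheme can be sketched as follows (a toy model; real devices add framing, addressing, and integrity checks):

```python
def load_bitstream(bits, register_length):
    """Serially shift configuration bits into the on-chip shift register.

    The register's parallel outputs are the control wires that set every
    CLB function and routing switch on the chip.
    """
    register = [0] * register_length
    for bit in bits:                      # one bit per configuration clock
        register = [bit] + register[:-1]  # shift in at the head
    return register
```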
Programming/Configuring FPGAs: Tool Flow
Step 0?: Determine what portion of the computation is migrated to hardware (an FPGA in this case).
(1) Hardware design, specified in RTL/HDL/logic diagrams, followed by technology-independent optimization.
Synthesis & layout:
(2) Partition the design into CLBs (LUT mapping).
(3) Placement of each CLB.
(4) Routing between CLBs.
(5) Configuration bitstream generation, producing the configuration data.
Reconfigurable Processor Tools Flow (Hardware/Software Co-design Process Flow)
Starting from the customer application / IP (C code), partition into:
• Portion to be done in software: C compiler -> object code -> linker -> executable, supported by a C model, simulator, and C debugger (e.g. an ARC processor toolchain).
• Portion to be done in reconfigurable hardware (e.g. an FPGA): (1) hardware design specification in RTL/HDL -> synthesis & layout ((2) partitioning, (3) placement, (4) routing) -> (5) configuration bitstream generation -> configuration bits.
Both halves come together on the development board / hybrid system.
Programming/Configuring FPGAs
Starting point: (1) hardware design specification in RTL/HDL/logic diagrams.
• RTL:
  – t = A + B
  – Reg(t, C, clk)
• Logic (one adder bit-slice):
  – Oi = Ai XOR Bi XOR Ci
  – Ci+1 = Ai·Bi OR Bi·Ci OR Ai·Ci
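The bit-slice logic equations above correspond directly to this software model of a ripple-carry adder, exactly the kind of bit-level datapath that maps naturally onto LUTs:

```python
def full_adder(a, b, c):
    # Sum: Oi = Ai XOR Bi XOR Ci; carry: Ci+1 = Ai·Bi OR Bi·Ci OR Ai·Ci.
    s = a ^ b ^ c
    carry = (a & b) | (b & c) | (a & c)
    return s, carry

def ripple_add(a_bits, b_bits):
    """Add two equal-length LSB-first bit vectors; return (sum_bits, carry_out)."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out, carry
```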
Programming/Configuring FPGAs
(2) Partition the design into logic blocks (CLBs): LUT mapping.
Programming/Configuring FPGAs
(3) Placement of CLBs:
• Maximize locality:
  – Minimize the number of wires in each channel.
  – Minimize the length of wires.
  – (But everything cannot be placed close together.)
• Often starts with partitioning/clustering.
• State-of-the-art tools finish via simulated annealing.
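A minimal simulated-annealing placer might look like this sketch (the two-pin netlist format and every parameter value are my own simplifications; production placers use far richer cost functions):

```python
import math
import random

def wirelength(placement, nets):
    """Total Manhattan length over two-pin nets -- the 'locality' cost."""
    total = 0
    for a, b in nets:
        (xa, ya), (xb, yb) = placement[a], placement[b]
        total += abs(xa - xb) + abs(ya - yb)
    return total

def anneal(placement, nets, temp=5.0, cooling=0.99, steps=2000, seed=0):
    """Swap random block pairs, accepting uphill moves with probability
    exp(-delta/temp) so early passes can escape local minima."""
    rng = random.Random(seed)
    blocks = list(placement)
    cost = wirelength(placement, nets)
    for _ in range(steps):
        a, b = rng.sample(blocks, 2)
        placement[a], placement[b] = placement[b], placement[a]
        new_cost = wirelength(placement, nets)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost                                          # accept
        else:
            placement[a], placement[b] = placement[b], placement[a]  # undo
        temp = max(temp * cooling, 1e-9)                             # cool
    return cost
```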
Programming/Configuring FPGAs
(3) Placement of CLBs. Goal: maximize locality.
Programming/Configuring FPGAs
(4) Routing between CLBs:
• Often done in two passes:
  – Global routing to determine the channel.
  – Detailed routing to determine the actual wires and switches.
• The difficulty is:
  – Limited available channels.
  – Switchbox connectivity restrictions.
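Detailed routing on a grid can be illustrated with classic Lee-style maze expansion (a toy with unit-capacity cells and no switchbox model, both simplifications of mine): breadth-first search finds a shortest path around already-occupied resources.

```python
from collections import deque

def route(width, height, occupied, src, dst):
    """BFS from src to dst over free grid cells; returns the path or None.

    'occupied' models routing resources already claimed by earlier nets;
    a real router would rip up and retry when no path remains.
    """
    prev = {src: None}
    frontier = deque([src])
    while frontier:
        x, y = frontier.popleft()
        if (x, y) == dst:
            path, node = [], (x, y)
            while node is not None:       # walk predecessors back to src
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            if (0 <= nx < width and 0 <= ny < height
                    and nxt not in occupied and nxt not in prev):
                prev[nxt] = (x, y)
                frontier.append(nxt)
    return None
```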
Programming/Configuring FPGAs
(4) Routing between CLBs.
Overall Configurable Hardware Approach
• Select critical portions or phases of an application where hardware customization will offer an advantage, e.g. the computationally intensive "kernel(s)" of the application (hardware-software partitioning).
• Map those application phases to FPGA hardware:
  – Hand hardware design / RTL / VHDL.
  – VHDL => synthesis & layout.
• If it doesn't fit in the FPGA, re-select a (smaller) application phase and try again.
• Perform timing analysis to determine the rate at which the configurable design can be clocked.
• Write interface software for communication between the main processor (GPP) and the configurable hardware:
  – Determine where the input/output data communicated between software and configurable hardware will be stored.
  – Write code to manage its transfer (like a procedure-call interface in standard software).
  – Write code to invoke the configurable hardware (e.g. via memory-mapped I/O).
• Compile the software (including the interface code).
• Send the configuration bits to the configurable hardware.
• Run the program.
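The interface-software step might look like the following host-side sketch (all register names and offsets are hypothetical, and the fake device stands in for real mmap'd hardware):

```python
class AcceleratorInterface:
    """Procedure-call-style wrapper around a memory-mapped accelerator."""

    # Hypothetical register offsets within the device's mapped window.
    OPERAND_A, OPERAND_B, START, STATUS, RESULT = 0x00, 0x04, 0x08, 0x0C, 0x10

    def __init__(self, device):
        self.device = device  # object standing in for the mapped hardware

    def run(self, a, b):
        self.device.write(self.OPERAND_A, a)  # stage input data
        self.device.write(self.OPERAND_B, b)
        self.device.write(self.START, 1)      # invoke the configured circuit
        while self.device.read(self.STATUS) != 1:
            pass                              # poll until done
        return self.device.read(self.RESULT)  # read output data back

class FakeDevice:
    """Stand-in FPGA that 'computes' a*b the moment START is written."""

    def __init__(self):
        self.regs = {}

    def write(self, addr, value):
        self.regs[addr] = value
        if addr == AcceleratorInterface.START and value:
            self.regs[AcceleratorInterface.RESULT] = (
                self.regs[AcceleratorInterface.OPERAND_A]
                * self.regs[AcceleratorInterface.OPERAND_B])
            self.regs[AcceleratorInterface.STATUS] = 1

    def read(self, addr):
        return self.regs.get(addr, 0)
```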
Configurable Hardware Application Challenges
• This process turns application programmers into part-time hardware designers. Highly desirable to ease the transition: silicon compilers (C -> Si, e.g. SystemC?).
• Performance analysis problems => what should we put in hardware? The hardware-software partitioning (co-design) problem.
• Choice and granularity of computational elements.
• Choice and granularity of the interconnect network / degree of coupling.
• Long reconfiguration latency.
• Synthesis problems.
• Testing/reliability problems.
Issues in Using FPGAs for Reconfigurable Computing
• Hardware-software partitioning (co-design).
• Run-time reconfiguration latency/overhead: the time to load the configuration bitstream may take seconds (improving).
  – Reconfiguration-latency-hiding techniques.
• I/O bandwidth limitations: need for tight coupling with software (processors), e.g. hybrid FPGAs.
• Speed, power, cost, density (improving).
• High-level language support (improving): C -> Si.
• Performance and space estimators.
• Design verification.
• Partitioning and mapping across several FPGAs.
• Partial reconfiguration (supported in some/most recent high-end FPGAs).
• Configuration caching, to reduce reconfiguration latency.
Example Reconfigurable Computing Research Efforts
• PRISM (Brown)
• PRISC (Harvard) -- paper RC-1
• DPGA-coupled uP (MIT)
• GARP (UCB) -- paper RC-3; Pleiades (UCB), ...
• OneChip (Toronto) -- paper RC-2
• RAW (MIT) -- paper RC-4
• REMARC (Stanford) -- paper RC-5
• CHIMAERA (Northwestern) -- paper RC-6
• DEC PAM
• Splash 2
• NAPA (NSC)
• E5 etc. (Triscend)
Hybrid-Architecture RC Compute Models
1. Unaffected by array logic (interfacing): Triscend E5.
2. Dedicated IO processor: NAPA 1000.
   (Models 1 and 2 may be considered ASIC-replacement efforts rather than RC.)
3. Instruction augmentation (tight coupling):
   – Special instructions / coprocessor ops, usually FPGA-based:
     - PRISM (Brown, 1991)
     - PRISC (Harvard, 1994)
     - Chimaera (Northwestern, 1997)
     - GARP (Berkeley, 1997)
     - Virtex-4 FX (Xilinx)
   – VLIW/microcoded array extensions to the processor, usually arrays of simple processors:
     - REMARC (Stanford, 1998)
     - Raw (MIT, 1997)
     - MATRIX (MIT, 1997)
     - MorphoSys (UC Irvine, 2000)
     - RaPiD (Reconfigurable Pipelined Datapaths) (University of Washington, 1996)
     - PipeRench (Carnegie Mellon, 1999)
     - DAPDNA-2 (IPFlex Inc., 2004?) -- see the DAPDNA handout
     - ...
4. Autonomous co/stream processor (the RC array runs a separate thread): OneChip (Toronto, 1998).
Hybrid-Architecture RC Compute Models: Interfacing
• Configurable logic used in place of:
  – ASIC environment customization.
  – External FPGA/PLD devices.
• Examples: bus protocols, peripherals, sensors, actuators.
• Case for:
  – There is always some system adaptation to do.
  – Modern chips have the capacity to hold a processor + glue logic, reducing part count.
  – Glue logic varies; the value added must now be accommodated on chip (formerly at board level).
• May be considered an ASIC-replacement effort rather than RC.
Example: Interface/Peripherals: Triscend E5
(May be considered an ASIC-replacement effort rather than RC.)
Hybrid-Architecture RC Compute Models: IO Processor
• A configurable array dedicated to servicing IO channel(s): sensor, LAN, WAN, peripheral, ...
• Provides:
  – Flexible protocol handling.
  – Flexible stream computation: compression, encryption (in place, by the RC hardware).
• Looks like an IO peripheral to the processor.
• Case for:
  – Many protocols and services are supported, but only a few are needed at a time.
  – Dedicate attention: offload work to the IO processor.
Reconfigurable IO Processor Example: NAPA 1000 Block Diagram
• TBT: ToggleBus Transceiver (system port)
• CR32: CompactRISC 32-bit processor
• BIU: Bus Interface Unit (to peripheral devices)
• RPC: Reconfigurable Pipeline Controller
• PMA: Pipeline Memory Array
• SMA: Scratchpad Memory Array
• ALP: Adaptive Logic Processor
• CIO: Configurable I/O
• External memory interface
Reconfigurable IO Processor Example: NAPA 1000 as IO Processor
(Figure: the system host connects to the NAPA 1000 via the system port; the application-specific CIO connects to sensors, actuators, or other circuits; the memory interface connects to ROM & DRAM.)
Hybrid-Architecture RC Compute Models: Instruction Augmentation
The computation/ISA semantic gap -- an observation on instruction bandwidth:
• A processor can only describe a small number of basic computations per cycle (i.e. per instruction): I opcode bits select among at most 2^I operations (I = opcode size).
• This is a tiny fraction of the operations one could define even on w-bit operands (w = operand word size): there are on the order of 2^(2^(2w)) distinct functions of two w-bit inputs per output bit.
• A processor could therefore have to issue on the order of w·2^(2^(2w) - I) instructions just to describe some computations.
• An a priori selected base set of functions (i.e. fixed ISA instructions) could be very bad for some applications: an ISA/application mismatch.
• This is the motivation for application-specific processors/ISAs (ASPs).
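The counts behind this argument can be checked for tiny sizes (they grow double-exponentially, so keep w small):

```python
def encodable_ops(opcode_bits):
    # An I-bit opcode field can name at most 2^I distinct operations per cycle.
    return 2 ** opcode_bits

def functions_per_output_bit(w):
    # Two w-bit operands give 2^(2w) input combinations; each combination may
    # map to 0 or 1 independently, so there are 2^(2^(2w)) one-bit-output
    # functions.
    return 2 ** (2 ** (2 * w))

# Even at w = 2, a 7-bit opcode (128 nameable ops) falls absurdly short of
# the 2^16 = 65536 possible single-output functions of the two operands.
```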
Hybrid-Architecture RC Compute Models: Instruction Augmentation
• Idea:
  – Provide a way to augment the processor's instruction set (base ISA) with operations needed by a particular application, realized by RC hardware.
  – Close the semantic gap / avoid the mismatch (shown on the last slide) between the fixed ISA and the computational operations the application needs.
• What's required:
  – Some way to fit the augmented instructions into the instruction stream.
  – An execution engine for the augmented instructions (decode/execute): a configurable hardware array -- an FPGA or an array of simple micro-coded processors -- to realize and execute the new augmented instructions. If programmable, it has its own instructions.
  – Interconnect to the augmented instructions.
Instruction Augmentation: PRISM (Brown, 1991)
The first effort in instruction augmentation:
• Processor Reconfiguration through Instruction Set Metamorphosis (PRISM).
• FPGA on a bus (similar to Splash 2), accessed as a memory-mapped peripheral.
• Explicit context management.
• PRISM-1:
  – 68010 (10 MHz) + XC3090.
  – Can reconfigure the FPGA in one second.
  – 50-75 clocks for operations.
PRISM-1 Results
(Figure: raw kernel speedups; IO/configuration time not included?)
Instruction Augmentation: PRISC (Harvard, 1994) -- paper RC-1
PRISC = PRogrammable Instruction Set Computers.
• Takes the next step:
  – What if we put the configurable array on chip (tight coupling)?
  – How to integrate it into the processor ISA? The base ISA selected here is MIPS.
• Architecture:
  – An FPGA-like configurable array tightly coupled into the processor register file as a "superscalar" Programmable Functional Unit (PFU).
  – A flow-through array (no state, combinational-logic PFU): single-cycle execution.
PRISC ISA Integration (base ISA: MIPS) -- paper RC-1
• Adds an expfu instruction (execute programmable functional unit) to the MIPS ISA.
• An 11-bit address space for user-defined expfu instructions (the requested logical PFU function number).
• Faults on a pfu-instruction mismatch, i.e. when the needed array configuration is not loaded in the PFU (a configuration miss): trap code services the instruction miss.
• All operations occur in one clock cycle (no state: combinational-logic PFU).
• Easily works with processor context switches: no state in the PFU + fault on a mismatched pfu instruction.
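The trap-on-miss policy can be modeled in software (a behavioral sketch of my own: the real PFU is combinational hardware and the "handler" is trap code):

```python
class PFU:
    """Single-configuration programmable functional unit, PRISC-style."""

    def __init__(self, configurations):
        self.configurations = configurations  # function number -> 2-input fn
        self.loaded = None                    # the one resident configuration
        self.reconfigurations = 0             # traps taken so far

    def expfu(self, fn_number, rs1, rs2):
        if self.loaded != fn_number:          # fault on pfu-instruction mismatch
            self.reconfigurations += 1        # trap handler reloads the array
            self.loaded = fn_number
        # No state in the PFU: the result is a pure function of rs1, rs2.
        return self.configurations[fn_number](rs1, rs2)

pfu = PFU({0: lambda a, b: a & b, 1: lambda a, b: a ^ b})
```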
PRISC Results -- paper RC-1
• All benchmarks compiled, working from MIPS binaries.
• <200 4-LUTs? (64 x 3)
• 200 MHz MIPS base.
• SPEC92 (may not be a good target application set?).
Instruction Augmentation: Chimaera (Northwestern, 1997) -- paper RC-6
• Starts from the PRISC idea; similar to PRISC:
  – Integrated as a functional unit; no state; a single-cycle, combinational-logic configurable array.
  – RFUOPs (like expfu): Reconfigurable FU Operations.
  – Stalls the processor on an instruction miss (i.e. when the needed array configuration is not loaded) and reloads.
• Adds:
  – Management of multiple loaded instructions, i.e. configuration caching.
  – More than 2 inputs possible, i.e. more than 2 operand registers.
Chimaera Architecture -- paper RC-6
• A "live" copy of the register-file values feeds into the (FPGA-like) array.
• Each row of the array may compute from register values or from intermediates (other rows).
• A tag on the array indicates the RFUOP.
• Major difference from PRISC: more than two register inputs are possible.
Chimaera Architecture -- paper RC-6
• The array can compute on values as soon as they are placed in the register file.
• The logic is combinational (no state in the array).
• When an RFUOP matches, the processor stalls until the result is ready:
  – The critical path comes only from the late inputs.
  – The result is driven from the matching row.
Instruction Augmentation: GARP (Berkeley, 1997) (paper RC-3)
GARP = Global Array Reconfigurable Processor, in one chip. MIPS is also the base ISA selected for GARP.
• Integrates a configurable (FPGA-like) array as a coprocessor, not an FU:
  – Similar bandwidth/coupling to the processor as an FU
  – Array has its own access to memory
• Supports multi-cycle operation:
  – State allowed in the array
  – Cycle counter tracks the operation
• Fast operation selection:
  – Cache for multiple configurations
  – Dense encodings, wide path to memory
GARP Base ISA (paper RC-3)
• Augmented MIPS ISA with coprocessor operations:
  – Issue gaconfig to make a particular configuration resident (may be active or cached)
  – Explicitly move data to/from the array: 2 writes, 1 read (operand count like an FU, but not 2R + 1W)
  – Processor suspends during the coprocessor operation:
    • A cycle count tracks the operation (state/multi-cycle operation allowed in GARP)
  – Array may directly access memory:
    • Processor and array share the memory space; the cache/MMU keeps it consistent between processor and array
    • Can exploit streaming and data-parallel operations
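The gaconfig / move-in / run / move-out sequence above can be sketched as a toy Python model. This is a hypothetical illustration (the class, slot names, and the example operation are invented; only the instruction names gaconfig/mtga/mfga come from the slides): the processor suspends while a cycle counter tracks the multi-cycle array operation.

```python
# Hypothetical model of the GARP coprocessor flow: gaconfig makes a
# configuration resident, mtga moves operands into the array, the array
# runs for a counted number of cycles (state allowed), and mfga reads
# the result back.

class GarpArray:
    def __init__(self):
        self.cache = {}    # resident configurations: name -> (fn, cycles)
        self.active = None
        self.regs = {}     # array-side state: multi-cycle ops allowed

    def gaconfig(self, name, fn, cycles):
        self.cache[name] = (fn, cycles)   # fast if cached, slow load otherwise
        self.active = name

    def mtga(self, slot, value):          # "move to GARP array"
        self.regs[slot] = value

    def run(self):
        fn, cycles = self.cache[self.active]
        for _ in range(cycles):           # processor suspends; cycle counter
            pass                          # tracks the operation
        self.regs["result"] = fn(self.regs[0], self.regs[1])

    def mfga(self):                       # "move from GARP array"
        return self.regs["result"]

garp = GarpArray()
garp.gaconfig("mac", lambda a, b: a * b + 1, cycles=3)   # invented example op
garp.mtga(0, 6)
garp.mtga(1, 7)
garp.run()
print(garp.mfga())    # 6*7 + 1 = 43
```

Note what this model leaves out: GARP's array can also reach memory directly, so real kernels stream data through the array rather than funneling every operand through mtga/mfga.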
GARP Processor Instructions (paper RC-3)
[Figure: instructions augmented to the MIPS ISA — load a configuration, pass operands to the array, get results from the array]
GARP Array (paper RC-3)
• Row-oriented logic (FPGA-like array): denser for datapath operations
• Dedicated path for processor/memory data
  – The processor does not have to be involved in the array-memory path, i.e. DMA used by the array
GARP Results (paper RC-3)
• General results (FPGA-like array):
  – 10-20x speedup on streaming, feed-forward operations
  – 2-3x when data dependencies limit pipelining
PRISC/Chimaera vs. GARP
• PRISC/Chimaera:
  – Basic op is single cycle: expfu (rfuop)
  – No state
  – Could conceivably have multiple PFUs?
  – Discover parallelism => run in parallel?
  – Can't run deep pipelines
  – Configurable array has no direct access to memory
• GARP:
  – Basic op is multi-cycle: gaconfig, mtga, mfga
  – Can have state/deep pipelining
  – Multiple arrays viable?
  – Identify mtga/mfga with the corresponding gaconfig?
  – Configurable array has direct access to memory
Common Instruction Augmentation Features
• To get around instruction expression limits (and the resulting computation/ISA semantic gap):
  – Define the new instruction in a configurable array (usually an FPGA)
    • Many bits of configuration allow broad expressibility
    • Many parallel operations possible (closes the semantic gap)
  – Give the array configuration a short "name" which the processor can call out (augmented instructions)
    • The name is effectively the address of the operation
Coarse-Grain Hybrid-Architecture RC Compute Models: VLIW/Microcoded Model
• Similar to the instruction augmentation model, but:
• Usually utilizes microcoded arrays of simple processors (coarse-grain reconfigurable hardware, not FPGAs)
• A single tag (address, instruction) controls a number of more basic operations
• Some difference in expectation: can sequence a number of different tags/operations together
Examples:
- REMARC (Stanford, 1998)
- RAW (MIT, 1997)
- MorphoSys (UC Irvine, 2000)
- MATRIX (MIT, 1997)
- RaPiD (Reconfigurable Pipelined Datapaths) (University of Washington, 1996)
- PipeRench (Carnegie Mellon, 1999)
- DAPDNA-2 (IPFlex Inc., 2004?) — see DAPDNA-2 handout
VLIW/Microcoded Model: REMARC (Stanford, 1998) (paper RC-5)
• Array of "nano-processors":
  – 16-bit datapath, 32 instructions each
  – VLIW-like execution, global sequencer (global control unit)
• Coprocessor interface (similar to GARP), but no direct array access to memory
REMARC Architecture (paper RC-5)
• Issue the coprocessor instruction rex:
  – The global controller sequences the nano-processors
  – Multiple cycles (microcode)
• Each nano-processor has its own instruction store (VLIW-style microcode RAM)
• Here the array has 8 x 8 = 64 nano-processors (microcoded)
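The lockstep VLIW execution model can be sketched as follows. This is a hypothetical toy model (the instruction set and two-processor array are invented; a real REMARC array is 8 x 8): the global control unit broadcasts one microcode PC per cycle, and every nano-processor executes the instruction at that PC from its own private instruction store.

```python
# Hypothetical sketch of the REMARC model: a global sequencer issues one
# PC each cycle; each nano-processor executes from its own I-store, so the
# array behaves like one very wide VLIW instruction per cycle.

class NanoProcessor:
    def __init__(self, istore):
        self.istore = istore       # up to 32 instructions, 16-bit datapath
        self.acc = 0

    def step(self, pc):
        op, operand = self.istore[pc]
        if op == "add":
            self.acc = (self.acc + operand) & 0xFFFF   # 16-bit wraparound
        elif op == "shl":
            self.acc = (self.acc << operand) & 0xFFFF

# Two nano-processors with different private programs (8x8 = 64 in REMARC).
nanos = [
    NanoProcessor([("add", 3), ("shl", 1)]),
    NanoProcessor([("add", 5), ("shl", 2)]),
]
for pc in range(2):                # global sequencer: same PC for all
    for n in nanos:
        n.step(pc)
print([n.acc for n in nanos])      # [6, 20]
```

Because each nano-processor runs a different instruction at the same PC, the array gets VLIW-style instruction-level parallelism without per-processor control logic, which is what makes the coarse-grain fabric denser than an FPGA for datapath work.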
REMARC Results (paper RC-5)
[Figure: results for MPEG-2 and DES]
Hybrid-Architecture RC Compute Models: Observation
• All the models so far are single threaded, limited to parallelism at:
  – Instruction level (VLIW, bit-level)
  – Data level (vector/stream/SIMD)
• No task/thread-level parallelism (TLP)
  – Except for I/O: a dedicated task runs in parallel with the processor task
Hybrid-Architecture RC Compute Models: Autonomous Coroutine
• The array task is decoupled from the processor (a separate thread):
  – Fork the operation / join upon completion
• The array has its own:
  – Internal state
  – Access to shared state (memory)
• NAPA supports this to some extent: task level, at least, with multiple devices
Example: OneChip (Toronto, 1998) (paper RC-2)
Autonomous Coroutine: OneChip (Toronto, 1998) (paper RC-2)
• Want the array to have direct memory operations
• Want to fit into the programming model/ISA:
  – Without forcing exclusive processor/FPGA operation
  – Allowing decoupled processor/array execution
• Key idea:
  – The FPGA operates on memory regions
  – Make the regions explicit to processor issue
  – Scoreboard memory blocks to check for dependency violations
OneChip Pipeline (paper RC-2)
[Figure: processor and FPGA pipelines; the FPGA has its own configured simple processor with a "limited ISA"]
OneChip Coherency (paper RC-2)
[Figure: memory-block dependency cases — RAW, WAR, WAW]
OneChip Instructions (paper RC-2)
• The basic operation is memory to memory:
  – FPGA: MEM[Rsource] -> MEM[Rdst]
    (registers hold the source and destination memory-block base addresses)
  – Block sizes are powers of 2
• Supports 14 "loaded" functions
  – DPGA/contexts, so 4 can be cached
OneChip (paper RC-2)
• Basic op is: FPGA <-> MEM
• No state between these ops; coherence means the ops appear sequential
• Could have multiple/parallel FPGA compute units
  – Scoreboarded with the processor and with each other
• Can't chain FPGA operations?
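OneChip's memory-block scoreboarding can be sketched directly. This is a hypothetical Python illustration (function names invented): each in-flight FPGA op holds a source block (read) and a destination block (write), and a new op issues only if its blocks raise no RAW, WAR, or WAW conflict with blocks still in flight. A block is (base address, power-of-two size).

```python
# Hypothetical sketch of OneChip-style memory-block scoreboarding: ops are
# memory-to-memory (MEM[Rsource] -> MEM[Rdst]) over explicit blocks, so
# dependence checking reduces to interval-overlap tests on block pairs.

def blocks_overlap(b1, b2):
    base1, size1 = b1
    base2, size2 = b2
    return base1 < base2 + size2 and base2 < base1 + size1

def can_issue(src, dst, in_flight):
    """in_flight: list of (src_block, dst_block) of unfinished FPGA ops."""
    for fsrc, fdst in in_flight:
        if blocks_overlap(src, fdst):   # RAW: reading a block being written
            return False
        if blocks_overlap(dst, fsrc):   # WAR: writing a block being read
            return False
        if blocks_overlap(dst, fdst):   # WAW: writing a block being written
            return False
    return True

# One unfinished FPGA op: reads block at 0x1000, writes block at 0x2000.
in_flight = [((0x1000, 64), (0x2000, 64))]
print(can_issue((0x3000, 64), (0x4000, 64), in_flight))  # True: disjoint blocks
print(can_issue((0x2000, 64), (0x5000, 64), in_flight))  # False: RAW on 0x2000
```

Making the regions explicit to processor issue is what lets decoupled processor/array execution stay coherent: the ops appear sequential even though disjoint-block ops run in parallel.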
Summary
• Several different models and uses for a "reconfigurable processor":
  – On computational kernels, we have seen the benefits of coarse-grain interaction: GARP, REMARC, OneChip
  – Missing: benefits of these architectures on fuller (multi-application) workloads
• Exploit the density and expressiveness of fine-grained, spatial operations
• A number of ways to integrate cleanly into a processor architecture... and their limitations