Re-configurable Parallel Stream Processor with Self-assembling and Self-restorable Micro-architecture
Lev Kirischian, Irina Terterian, Pil Woo Chun and Vadim Geurkov
Embedded and Re-configurable Systems Lab, Ryerson University, Canada

Example of a Multi-task Data-flow Workload Where Each Task Can Run in Different Modes
[Figure: timeline of Tasks 1 to 4 and their active modes over time]

Usual Approach: Conventional Processors with Software-to-Task Optimization (Compilers + OS)
Software-to-task optimization allows the use of conventional computing platforms with a fixed architecture (superscalar, VLIW, etc.) coupled with software compilers and an OS.
Limitations of conventional processors:
1. If tasks are executed on a sequential computing system, the processing time often cannot meet the specification requirements.
2. If tasks are executed on a parallel computing system with a fixed architecture, the cost-effectiveness of the system depends strongly on the task algorithm or data structure.

Alternative Approach: Application Specific Processors (ASP) with Static Hardware-to-Task Optimization
An ASP can reach the required cost-performance parameters because its architecture is optimized for the data-flow graph of the task and for the task data structure.
Limitations of Application Specific Processors:
1. Performance decreases if the task algorithm or data structure changes.
2. Limited possibility for further modernization.
3. High cost for multi-task or multi-mode custom computing systems.

Proposed Approach: Reconfigurable Processor with Dynamic Architecture-to-Task Optimization
A high-performance computing system for multi-task data-flow applications should contain two major components:
1. A Dynamically Re-configurable Computing Platform based on partially reconfigurable FPGA devices, to provide the maximum possible hardware flexibility.
2. A Library of Application Specific Virtual Processors (ASVP): configuration bit-streams that program the on-chip Application Specific Processor circuitry for the period of time during which the application (task) is active.

Architecture of Partially Reconfigurable FPGA Devices (Xilinx “Virtex” Family)
[Figure labels: Configuration Data Files; Internal Configuration SRAM; In; Out; I/O; CLBs; Block RAM; Frame #1, #i, #N; Internal (Virtual) Bus]
CLB (Configurable Logic Block): the uniform logic element of a frame and the smallest individually configurable component in the FPGA.

Concept of Application Specific Virtual Processor (ASVP)
An ASVP is a group of logic resources dedicated and optimally configured to reflect the algorithm and data structure of a task. It takes the form of a configuration data file (configuration bit-stream) that is downloaded into the FPGA when the task is to be activated.

Life-cycle of Application Specific Virtual Processor
1. The ASVP core is downloaded to the reconfigurable platform before task activation.
2. The ASVP processes the task data for as long as necessary, without interruption and without time-sharing its dedicated logic resources with any other task.
3. After task completion, all resources included in the ASVP can be re-configured for any other task.
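The three life-cycle steps can be sketched as a small slot-allocation model. This is only an illustration under stated assumptions, not the authors' implementation: the `Platform` class, its `load` and `release` methods, the task names and the slot count of 8 are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Platform:
    """Toy model of a reconfigurable platform with a fixed number of CLB slots."""
    total_slots: int
    owner: dict = field(default_factory=dict)  # slot index -> task name

    def free_slots(self):
        return [s for s in range(self.total_slots) if s not in self.owner]

    def load(self, task, slots_needed):
        """Step 1: download the ASVP into free slots before the task starts."""
        free = self.free_slots()
        if len(free) < slots_needed:
            raise RuntimeError(f"not enough free slots for {task}")
        chosen = free[:slots_needed]
        for s in chosen:
            self.owner[s] = task  # Step 2: slots stay dedicated to this task
        return chosen

    def release(self, task):
        """Step 3: after completion, the slots become available to any task."""
        for s in [s for s, t in self.owner.items() if t == task]:
            del self.owner[s]


platform = Platform(total_slots=8)
slots_a = platform.load("task_a", 3)
slots_b = platform.load("task_b", 2)   # runs alongside task_a, no sharing
platform.release("task_a")
slots_c = platform.load("task_c", 4)   # reuses the slots freed by task_a
```

In this sketch `task_b` keeps its slots untouched while `task_a` is released and `task_c` is loaded, which mirrors the requirement that active tasks are not interrupted by another task's start or termination.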

ASVP Architecture-to-Task Optimization in Partially Reconfigurable FPGA
[Figure: a data-flow graph (Data In, XOR, +, Data Out) mapped onto FPGA slots 1, 2, 3, ...; the XOR and + operations appear as Virtual Hardware Components placed in slots between Input and Output, together with the Internal (Virtual) Bus]

Micro-architecture of a Virtual Hardware Component

Virtual Hardware Component & Virtual Bus Interconnection
[Figure labels: Virtual Bus; Virtual Hardware Component Boundary]

Micro-architecture of Application Specific Virtual Processor (ASVP)
The micro-architecture of an ASVP is based on Virtual Hardware Components interconnected via Virtual Bus lines.
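A minimal software analogy of this composition, using the XOR and + data-flow graph from the earlier slide: each Virtual Hardware Component is modeled as a function assigned to a slot, and the virtual bus as the order in which words pass from slot to slot. The names `VirtualComponent` and `run_asvp`, the 8-bit word width and the constants are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class VirtualComponent:
    name: str
    slot: int                      # hypothetical FPGA slot index
    op: Callable[[int], int]       # function the component applies to one word


def run_asvp(components: List[VirtualComponent], stream):
    """Pass each input word through the components in slot order,
    mimicking a chain of components linked by virtual bus lines."""
    ordered = sorted(components, key=lambda c: c.slot)
    out = []
    for word in stream:
        for comp in ordered:
            word = comp.op(word)
        out.append(word)
    return out


# Data In -> XOR -> + -> Data Out; the constants are made up for illustration.
asvp = [
    VirtualComponent("XOR", slot=1, op=lambda w: w ^ 0x0F),
    VirtualComponent("ADD", slot=2, op=lambda w: (w + 1) & 0xFF),
]
result = run_asvp(asvp, [0x00, 0xF0, 0xFF])
```

Real components operate concurrently as a hardware pipeline; the sequential loop here only shows how the data-flow graph determines the chain of components.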

Parallel Task Processing on the Dynamically Reconfigurable Stream Processor (DRSP)
[Figure: ASVP 1 (for Task 1), ASVP 2 and ASVP 3 placed side by side on the platform; labels: Data in #1, #2; Data out #1, #2, #3; I/O 2, I/O 3, I/O 4; FU 1 to FU 4; RIM 1 to RIM 4; Virtual Bus]

DRSP: System Level Architecture
[Figure labels: Host PC; Data Stream Source; Task Memory holding Task 1: {Afix + Amodes} ... Task h: {Afix + Amodes}; PCI Bus; PCI-Interface Module; PRCP-base; Cache Memory {Amodes i}; Configuration & Data Bus; Reconfigurable Functional Unit (Afix i + ...); RT-HOS; Data Out]

Architecture of Reconfigurable Computing Module
[Figure: block diagram with the following labels]
• SPI PCI Interface: 800 Mbit/s
• SPI
• Input LVDS ports: 2 x 3.43 Gbit/s (12 bit * 300 MHz)
• Real-Time Hardware Operating System based on XCV50E Virtex FPGA
• Config. Files / Data Cache (4 x 512 KB)
• Reconfig. Functional Unit [RFM 0111-002]
• LVTTL bus: 8.12 Gbit/s (64 bit x 133 MHz)
• Output LVDS ports: 2 x 3.43 Gbit/s (12 bit * 300 MHz)
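As a quick check of the link figures, the raw rate of a parallel link is its width multiplied by its clock frequency. The helper below is a hypothetical illustration; it gives 3.6 Gbit/s for a 12-bit port at 300 MHz and 8.512 Gbit/s for a 64-bit bus at 133 MHz, slightly above the 3.43 and 8.12 Gbit/s quoted on the slide. The source does not state how the quoted figures were derived.

```python
def raw_gbit_per_s(width_bits, clock_mhz):
    """Raw link rate: bus width times clock frequency, in Gbit/s."""
    return width_bits * clock_mhz * 1e6 / 1e9


lvds = raw_gbit_per_s(12, 300)    # one 12-bit LVDS port at 300 MHz
lvttl = raw_gbit_per_s(64, 133)   # 64-bit LVTTL bus at 133 MHz
```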

Reconfigurable Computing Module Based on the Xilinx “Virtex-E” Family of FPGA Devices

Restoration of ASVP Using a Spare CLB-column
[Figure: an ASVP whose Virtual Hardware Components (XOR, +) occupy CLB-columns 1, 2, 3, ... between Input and Output; labels: Column #1, AP i, Communication Field]
If a hardware fault occurs, the damaged Virtual Hardware Component can be relocated to the reserved CLB-column.
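The relocation idea can be sketched as a remapping of column assignments. This is an assumption-laden illustration rather than the authors' restoration procedure: the `relocate` function, the dictionary representation and the column numbers are hypothetical, and fault detection is outside the scope of the sketch.

```python
def relocate(placement, faulty_column, spare_columns):
    """Move the component in a faulty CLB column to the first unused spare column.

    placement: dict mapping column index -> component name
    Returns a new placement; the faulty column is left empty.
    """
    if faulty_column not in placement:
        return dict(placement)          # nothing was placed there
    for spare in spare_columns:
        if spare not in placement:
            new_placement = dict(placement)
            new_placement[spare] = new_placement.pop(faulty_column)
            return new_placement
    raise RuntimeError("no spare CLB column left")


placement = {1: "XOR", 2: "ADD"}
restored = relocate(placement, faulty_column=2, spare_columns=[3])
```

Only the damaged component moves; the component in column 1 keeps its position, so in this model the rest of the ASVP does not need to be reloaded.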

When is the proposed technology most beneficial?
• The workload consists of many tasks, and each task can run in different modes.
• Each task requires high-speed data-stream processing.
• Task algorithms may be modified within the life cycle of the system.
• Active tasks must run in parallel and must not be interrupted when one of the tasks switches its mode or terminates.
• The system must be restorable, remotely or by itself, even if a hardware fault occurs.

DRSP Application for Networked Intelligent Manufacturing Systems
High-performance parallel data-stream processing (up to thousands of billions of operations per second) of large volumes of data (up to hundreds of gigabits) for:
a) Complex image processing and image recognition,
b) Spectrum analysis and digital signal processing,
c) Data transmission via LAN with data compression / decompression and encryption / decryption,
d) Control of high-performance manufacturing equipment and robotic systems.

Acceleration of Task / Mode Switching
[Figure: acceleration (axis from 0 to 25) versus number of CLB-slots in the Virtual Component (1 to 10)]
The acceleration of task or mode switching, compared with an entire-FPGA-based system, is greatest when the number of CLB-columns in the ASVP is minimal, and can exceed 20 times.
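A first-order way to see why fewer columns means faster switching is to assume that configuration time is proportional to the number of CLB columns rewritten. Under that assumption, which the source does not state, the speedup over reconfiguring the whole device is simply the ratio of total columns to ASVP columns. The device size of 24 columns below is hypothetical and is not taken from the slide.

```python
def switching_speedup(total_columns, asvp_columns):
    """First-order estimate: configuration time is assumed proportional to the
    number of CLB columns rewritten, so the speedup is their ratio."""
    if not 0 < asvp_columns <= total_columns:
        raise ValueError("asvp_columns must be in 1..total_columns")
    return total_columns / asvp_columns


TOTAL_COLUMNS = 24  # hypothetical device size, not taken from the source
speedups = {n: switching_speedup(TOTAL_COLUMNS, n) for n in range(1, 11)}
```

The model ignores any fixed per-reconfiguration overhead, so it should be read as an upper-bound sketch rather than a reproduction of the measured curve.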

Minimization of Hardware Resources
Minimization of logic resources in the DRSP approach compared with entire-FPGA-based systems, by number of tasks (rows) and number of modes (columns):

Tasks \ Modes      2       4       8      16
      4          2.8     4.4     7.6    14
      8          5.6     8.8    15.2    28
     16         11.2    17.6    30.4    56

As the number of tasks and task modes in a workload increases, the cost-effectiveness of the DRSP increases accordingly.
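Reading the rows as task counts (4, 8, 16) and the columns as mode counts (2, 4, 8, 16), the tabulated values happen to fit the expression tasks * (0.3 + 0.2 * modes) exactly. This is an empirical observation about the numbers, not a formula given in the source. The check below stores the values in tenths to avoid floating-point rounding, and all names are illustrative.

```python
TASKS = [4, 8, 16]
MODES = [2, 4, 8, 16]

# Values as given in the table, stored in tenths to avoid float rounding.
TABLE_TENTHS = {
    4: [28, 44, 76, 140],
    8: [56, 88, 152, 280],
    16: [112, 176, 304, 560],
}


def fitted_tenths(tasks, modes):
    """Empirical fit to the table: value = tasks * (0.3 + 0.2 * modes)."""
    return tasks * (3 + 2 * modes)


# Collect every (tasks, modes) cell where the fit disagrees with the table.
mismatches = [
    (t, m)
    for t in TASKS
    for m, given in zip(MODES, TABLE_TENTHS[t])
    if fitted_tenths(t, m) != given
]
```

The fit is linear in both the number of tasks and the number of modes, which is consistent with the slide's statement that the benefit grows with both.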

SUMMARY: DRSP Compared with Conventional CPU, DSP or ASP Platforms

               DRSP          Conv. CPU               DSP                     ASP
Performance    (baseline)    Much lower than DRSP    Lower than DRSP         Somewhat higher
Flexibility    (baseline)    Lower than DRSP         Much lower than DRSP    None, or very little
Reliability    (baseline)    Much lower than DRSP    Lower than DRSP

Thank you