NASA/ESA CONFERENCE ON ADAPTIVE HARDWARE AND SYSTEMS (AHS 2013)
A2B: AN INTEGRATED FRAMEWORK FOR DESIGNING HETEROGENEOUS AND RECONFIGURABLE SYSTEMS
C. Pilato, R. Cattaneo, G. Durelli, A. A. Nacci, M. D. Santambrogio, D. Sciuto
Politecnico di Milano, Dipartimento di Elettronica, Informazione e Bioingegneria, Italy
Torino, Italy – June 27th, 2013

Motivations
• The design of reconfigurable systems is a difficult task
  - Interactions between the different phases have to be taken into account
• Decisions in the frontend phase may strongly affect the backend implementation, which calls for iterative exploration
  - E.g., mapping onto reconfigurable regions and floorplacing of the tasks may generate low-quality solutions due to a wrong partitioning or assignment of implementations
• Currently, the optimal design methodology (and the number of its iterations) is not known in advance
A2B is an ongoing project at Politecnico di Milano to assist the design of such complex systems

Agenda
• Framework Overview
  - Design Space Exploration
  - Solution Generation
• Preliminary Results – Test Case
• Conclusions and Future Work

Framework Overview
• Inputs:
  - Information about the target device (.XML)
  - Application source files (.C) plus custom pragmas for additional information (e.g., task-level parallelism/kernels)
• Decision Making (Exploration):
  - Task graph generation
  - Library generation
  - Mapping, Scheduling, Floorplacing
  - Architectural modification
• Refinement (Evaluation):
  - Specification of the platform details
  - Code generation for the target platform
• Output:
  - Project files ready for synthesis with back-end tools

XML Exchange Format
• The entire project can be represented through an XML file
  - Architecture: components' characteristics (e.g., reconfigurable regions), …
  - Applications: source code files and profiling information
  - Library: task implementations with their characterization (time, resources, ...)
  - Partitions: task graph, mapping and scheduling, …
• It allows a modular organization of the framework, but also the sharing of information among the different phases
  - The phases can be applied in any order to progressively optimize the design
  - The designer can perform as many iterations as he/she wants to refine the solution
• Specific details of the target architecture are taken into account only in the refinement phase (interactions with backend tools)

Task Graph Generation
• Application source code files can be analyzed to extract the task graphs
  - Profiling information can drive the generation of such solutions
• The task graph is then specified in the XML file as processing nodes connected by data transfers
  - Currently task graphs can be designed by hand; methodologies for automatic extraction will be investigated in the future
  - Transformations can improve the description by splitting/merging the tasks

Example of an annotated task:

    /* Assumption: BPP (bytes per output pixel) is not defined on the slide;
       3 is used here because three channels are written per pixel. */
    #define BPP 3

    #pragma omp task
    void threshold(unsigned char *original1, unsigned char *result,
                   unsigned char thresh, int *p)
    {
        /* p packs the image width and the bounds of the region to process */
        int DIMH  = p[0];
        int minH1 = p[1];
        int maxH1 = p[2];
        int minV1 = p[3];
        int maxV1 = p[4];
        int v, h;

        for (v = minV1; v < maxV1; v++)
            for (h = minH1; h < maxH1; h++) {
                if (original1[v*DIMH + h] > thresh) {
                    /* above the threshold: white pixel */
                    result[v*DIMH*BPP + h*BPP]     = 255;
                    result[v*DIMH*BPP + h*BPP + 1] = 255;
                    result[v*DIMH*BPP + h*BPP + 2] = 255;
                } else {
                    /* at or below the threshold: black pixel */
                    result[v*DIMH*BPP + h*BPP]     = 0;
                    result[v*DIMH*BPP + h*BPP + 1] = 0;
                    result[v*DIMH*BPP + h*BPP + 2] = 0;
                }
            }
    }
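The slide describes the task graph as processing nodes connected by data transfers. A minimal sketch of how such a graph might be mirrored in memory is given below; the type names, the fields and the two helper functions are hypothetical and are not taken from A2B or from its XML format.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory mirror of a task graph: processing nodes
   connected by data transfers. Names and fields are illustrative only. */
typedef struct {
    const char *name;   /* task identifier, e.g. "threshold" */
} TaskNode;

typedef struct {
    size_t src;         /* index of the producing task */
    size_t dst;         /* index of the consuming task */
    size_t bytes;       /* amount of data moved along this edge */
} DataTransfer;

typedef struct {
    const TaskNode *nodes;
    size_t n_nodes;
    const DataTransfer *edges;
    size_t n_edges;
} TaskGraph;

/* Number of data transfers entering a node; 0 means the task is a source. */
size_t in_degree(const TaskGraph *g, size_t node)
{
    size_t i, count = 0;
    assert(node < g->n_nodes);
    for (i = 0; i < g->n_edges; i++)
        if (g->edges[i].dst == node)
            count++;
    return count;
}

/* Total number of bytes received by a node over all incoming transfers. */
size_t bytes_in(const TaskGraph *g, size_t node)
{
    size_t i, total = 0;
    assert(node < g->n_nodes);
    for (i = 0; i < g->n_edges; i++)
        if (g->edges[i].dst == node)
            total += g->edges[i].bytes;
    return total;
}
```

Keeping the transfers as explicit edges with a size, rather than as pointers between tasks, makes splitting or merging tasks a matter of rewriting the edge list, which is the kind of transformation the slide mentions.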

Library Generation: a collection of implementations
• LLVM-based compiler to extract the dataflow graph of each task
  - Estimation of required resources (including bit-width analysis) [IMP]
  - Interaction with HLS tools to obtain more accurate results
• Generated implementations are then stored in the XML file to offer opportunities to the mapping phase and information to the floorplacer
Joint effort between Politecnico di Milano and Imperial College London to integrate High Level Analysis techniques into the toolchain:
"A Framework for Effective Exploitation of Partial Reconfiguration in Dataflow Computing", R. Cattaneo, X. Niu, C. Pilato, T. Becker, W. Luk, M. D. Santambrogio (to appear in ReCoSoC 13)
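The library holds several implementations per task, each characterized by time and resources, and the mapping phase chooses among them. The sketch below shows one simple selection policy, picking the fastest implementation that fits an area budget. The struct fields and the function `pick_fastest_fitting` are hypothetical; the policy is an illustration, not the one used by A2B.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical library entry: one implementation of a task together with
   its characterization. Field names are illustrative only. */
typedef struct {
    unsigned parallelism;   /* data items processed at once */
    unsigned area_luts;     /* estimated resource usage */
    unsigned time_cycles;   /* estimated execution time */
} Implementation;

/* Return the index of the fastest implementation whose area fits the
   budget, or -1 when none fits. Ties keep the earliest entry. */
int pick_fastest_fitting(const Implementation *lib, size_t n,
                         unsigned budget_luts)
{
    int best = -1;
    size_t i;
    assert(lib != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (lib[i].area_luts > budget_luts)
            continue;   /* does not fit the available area */
        if (best < 0 || lib[i].time_cycles < lib[(size_t)best].time_cycles)
            best = (int)i;
    }
    return best;
}
```

A greedy per-task choice like this ignores interactions between tasks that share a region, which is one reason the framework treats mapping, scheduling and floorplacing together.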

Mapping, Scheduling and Floorplacing
• We generate one or more configurations where each task of the applications is analyzed and assigned (via Mapping, Scheduling and Floorplanning, M/S/FP) to:
  - An available and admissible implementation
  - A component of the architecture (e.g., processor or reconfigurable region)
• This makes it possible to:
  - "share" implementations across different tasks (hardware sharing)
  - move a task implementation to another processing element at run-time (task relocation)
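Each task is assigned both an implementation and an architecture component, and the implementation has to be admissible. The sketch below encodes one plausible reading of "admissible": software implementations go on processors, hardware implementations go on reconfigurable regions, and a hardware implementation must fit the capacity of its region. These rules, and all type and function names, are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { KIND_SW, KIND_HW } Kind;

/* Hypothetical descriptors; fields are illustrative only. */
typedef struct { Kind kind; unsigned area_luts; } Impl;          /* area unused for SW */
typedef struct { Kind kind; unsigned capacity_luts; } Component; /* KIND_SW: processor, KIND_HW: region */
typedef struct { size_t impl; size_t component; } Assignment;    /* one entry per task */

/* Assumed admissibility rule: matching kind, and HW area within capacity. */
int is_admissible(const Impl *impl, const Component *comp)
{
    if (impl->kind != comp->kind)
        return 0;
    if (impl->kind == KIND_HW && impl->area_luts > comp->capacity_luts)
        return 0;
    return 1;
}

/* A mapping is valid when every task has an admissible assignment.
   Two tasks may reference the same implementation (hardware sharing). */
int mapping_is_valid(const Assignment *map, size_t n_tasks,
                     const Impl *impls, size_t n_impls,
                     const Component *comps, size_t n_comps)
{
    size_t t;
    for (t = 0; t < n_tasks; t++) {
        assert(map[t].impl < n_impls);
        assert(map[t].component < n_comps);
        if (!is_admissible(&impls[map[t].impl], &comps[map[t].component]))
            return 0;
    }
    return 1;
}
```

Because an assignment stores indices rather than copies, relocating a task at run-time amounts to changing its `component` index and re-checking admissibility.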

Architecture Exploration
• An additional step can be included to explore the target architecture
  - Adding/removing processing elements (reconfigurable regions)
  - Modifying their parameters
  - Determining the proper interconnection topology
• It can iteratively affect:
  - task graph transformations and library generation
  - mapping and floorplacing: modification to the computational resources (especially the number of reconfigurable regions)
• It allows a progressive and iterative refinement of the solution and a concurrent customization of both architecture and application
  - E.g., mapping and floorplacing can suggest which resources should be added

Supported Platforms
• Virtex-5 XC5VLX110T (embedded)
  - Two XCF32P Platform Flash PROMs (32 Mbyte each)
  - SystemACE™ Compact Flash configuration controller
  - 64-bit wide 256 Mbyte DDR2 small outline DIMM (SODIMM)
• Maxeler MaxWorkstation (HPC system)
  - Intel i7 2600S @ 2.8 GHz, 16 GB RAM, 500 GB HDD
  - MAX3 dataflow engine (DFE): Virtex-6 SX475T FPGA, 24 GB memory
  - DFE connected to CPU via PCI Express
[Figure: block diagrams of the XUPV5 board (CPU0, CPU1, reconfigurable area, DDR2 256 MB) and of the MaxWorkstation (CPUs with 16 GB DRAM; MAX3 DFE with interface FPGA, compute FPGA and 24 GB DRAM)]

Target-Dependent Code Generation
[Figure: code generation flow for the two supported targets]
• Inputs:
  - .c: source code for CPU
  - .xml: DFGs for HW tasks, mapping configurations
• FPGA-based embedded system: CPU compiler, DFG-C, HLS (C-VHDL), manual VHDL implementations, bitstream generation
• MaxWorkstation: DFG-MaxJ, manual MaxJ implementations, HLS (MaxJ-VHDL), MaxIDE
• Outputs: executable binary (exec bin), bitstream (bit)
• The code can always be further optimized by hand, e.g., glue code for data transfers

Graphical User Interface (GUI)
• Practical GUI to support the designer, to limit errors in the interactions with the XML, and to allow custom design methodologies

Preliminary results: edge detection
• Edge detection application: 4 stages of computation
  - Description based on C plus custom #pragmas
  - [Figure: extracted task graph and corresponding DFG of the first stage (Scale, 1x parallelism)]
• We generate 4 implementations with different levels of parallelism and resource consumption for each of the 4 tasks of the application
  - "parallelism X": X pixels processed at once
• Maxeler backend

Experimental Results / 1
• Static vs reconfigurable design (both extracted using the framework)
• We limit the available area to 10 kLUT and implement the best-performing design
• Static (parallelism 4): IP0: S, IP1: B, IP2: E, IP3: T
  - Total area consumption: 332 + 32 + 3840 + 3688, reported as 7876 (the terms as listed sum to 7892)
• Reconfigurable (parallelism 8): R0: S, T; R1: B, E

Task Name    Area Occupation
S            664
B            64
E            7680
T            7376

Region Name  Final Area Occupation
R0           max(664, 64) = 664
R1           max(7680, 7376) = 7680
Total        664 + 7680 = 8344
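The area comparison on this slide follows two rules: a static design pays for the sum of all its IP cores, while a reconfigurable region pays only for the largest task it hosts. A minimal sketch of that arithmetic is given below. The function names are hypothetical, and the grouping of tasks into regions in the usage note follows the max() expressions on the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Static design: every task has its own IP core, so the areas add up. */
unsigned static_area(const unsigned *areas, size_t n)
{
    unsigned total = 0;
    size_t i;
    for (i = 0; i < n; i++)
        total += areas[i];
    return total;
}

/* Reconfigurable region: tasks are time-multiplexed on the same area,
   so the region must be as large as its largest task. */
unsigned region_area(const unsigned *areas, size_t n)
{
    unsigned largest;
    size_t i;
    assert(n > 0);
    largest = areas[0];
    for (i = 1; i < n; i++)
        if (areas[i] > largest)
            largest = areas[i];
    return largest;
}
```

With the parallelism-8 figures, `region_area` over {664, 64} plus `region_area` over {7680, 7376} gives 664 + 7680 = 8344, which stays under the 10 kLUT limit even though each task uses twice the parallelism of the static design.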

Experimental Results / 2
• Reconfiguration time is automatically masked (when possible)
• Partial reconfiguration improves the performance of the application via resource multiplexing
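Masking a reconfiguration means overlapping it with computation that is running elsewhere, so that it does not lengthen the schedule. The sketch below illustrates that idea under a simple assumed model: only the part of the reconfiguration that exceeds the overlapping computation is exposed. The struct, its fields and both functions are hypothetical, and the model is not taken from A2B.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one reconfiguration in a schedule. */
typedef struct {
    unsigned reconf_time;    /* time needed to reconfigure the region */
    unsigned overlap_time;   /* computation running elsewhere meanwhile */
} ReconfSlot;

/* Time by which the reconfiguration delays the schedule: the part that
   cannot be hidden behind the overlapping computation. */
unsigned exposed_reconf_time(const ReconfSlot *s)
{
    assert(s != NULL);
    return s->reconf_time > s->overlap_time
         ? s->reconf_time - s->overlap_time
         : 0;
}

/* A reconfiguration is fully masked when none of it is exposed. */
int is_masked(const ReconfSlot *s)
{
    return exposed_reconf_time(s) == 0;
}
```

Under this model, a scheduler would try to order tasks so that each region is reconfigured while another region is busy, which is what "when possible" refers to on the slide.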

Conclusions and Future Work
• A2B is a modular framework to design reconfigurable systems
  - Easy to plug in alternative methods for each of the phases
  - Possibility to perform progressive refinement of both application and architecture
• A2B is becoming part of a larger project (ASAP – Advanced Synthesis of Applications and Platforms)
  - Refinement will also include the generation of SystemC TLM models of the target system for (co-)simulation and early validation
  - More architectural templates
  - Closer interaction with actual synthesis (e.g., high-level synthesis)
  - Automated methodologies to accelerate the design

Thank you!
Riccardo Cattaneo
rcattaneo@elet.polimi.it
Research partially funded by the European Community's Seventh Framework Programme, FASTER project.