Zynq Development Flow to Accelerate C Code © Copyright 2013 Xilinx.

All Programmable SoC Approach
[Diagram: Requirements; SW Spec (iterate); HW Spec (iterate); Verify; Accelerators; RTL]

Hardware Acceleration
[Diagram: Requirements; SW Spec (iterate); HW Spec (iterate); Verify; Accelerators; RTL]
Vivado HLS synthesizes the C/C++ into RTL blocks.

Hardware Acceleration: The Reason Why
Processing System (PS): 1-2 Gops
– Two ARM® Cortex™-A9 with NEON™ extensions
– Floating Point support
– Up to 1 GHz operation
– L2 Cache – 512 KB Unified
– On-Chip Memory of 256 KB
– Integrated Memory Controllers
– Run full Linux
Programmable Logic (PL): 10-1000+ Gops
– 28K-444K logic cells
– High bandwidth AMBA interconnect
– ACP port - cache coherency for additional soft processors
Programmable Logic Provides Superior Performance.

Example: Medical Application
Back Projection algorithm: used in tomography applications, including CAT scanners
– HW Solution using 52 floating point operations at 166 MHz
– DMAs connected to ACP port for short communication times
– Accelerator built with Vivado HLS
[Chart: Seconds/Frame (axis 0 to 0.8) for Cortex A9 vs. Cortex A9 + Accelerators; processes shown: direct, resample 2b, resample 2a, resample 1]
7x Speedup.

Hardware Accelerators
C Functions Implemented in Hardware
– Custom, not off-the-shelf
There is no Library of Hardware Accelerators
– Each function in the user C code is typically unique
– Each Accelerator is therefore typically unique
– Common Functions would already be Xilinx IP
Vivado HLS C Libraries
– C Functions Provided with Vivado HLS
  • Guaranteed to synthesize into good hardware (QoR & Performance)
– Available for some commonly used functions
  • Math functions, Video functions, Xilinx IP functions
– Accelerators are typically required at a level of granularity above this.

Steps to Create An Accelerator
1. Assess The Original Software
– How to select a C function?
2. Select The Interconnect
– How to get data between PS & PL?
3. Create an Accelerator
– How to create high-performance RTL from C?
4. Integrate the Accelerator
– How to connect the CPU and Accelerator block?
5. Update the Original Software
– How to make code modifications?
[Diagram with callouts 1-5: Processing System (ARM® Dual Cortex-A9 MPCore™ System, Common Peripherals, Memory Interfaces) and 7 Series Programmable Logic (Common Peripherals, Custom Peripherals, Common Accelerators, Custom Accelerators)]
Let’s go through these steps …

Identify A Function To Accelerate
General Guidelines
– Some functions are obvious examples for investigation
– Some are not good candidates and give little improvement
Profiling
– Quantitative Identification
  • Analyze the Execution
– Software Analysis
  • Analysis of the code execution
– Hardware Profiling
  • Monitoring of the hardware buses

Software Profiling
Measure Where CPU time is spent
– A number of free tools are available
– Users may have their own in-house profiler
– Recommendation: gprof
  • Go to Google if you have questions: well supported
C/C++ Profilers
– Parasoft Insure++
– AQTime Pro (for MSV)
– CodeTune (Free)
– Proffy
– Profiny
– Embedded Profiler
– gprof (Free)
– Valgrind (Free)
– IgProf (Free)
– Very Sleepy (Free)
– LTProf
– Visual Studio Team System Profiler
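A profiler such as gprof gives the full picture, but a coarse manual timing can serve as a first sanity check on where CPU time goes. The following is a minimal sketch under stated assumptions: `invert_pixels` and `read_header` are hypothetical stand-ins for a pixel-level and a frame-level function, and are not taken from the presentation.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical pixel-level stage: touches every pixel, a typical acceleration candidate.
std::uint32_t invert_pixels(std::vector<std::uint8_t>& frame) {
    std::uint32_t checksum = 0;
    for (std::uint8_t& p : frame) {
        p = static_cast<std::uint8_t>(255 - p);
        checksum += p;
    }
    return checksum;
}

// Hypothetical frame-level stage: constant work per frame, usually fine on the CPU.
std::uint32_t read_header(const std::vector<std::uint8_t>& frame) {
    return frame.empty() ? 0u : frame.front();
}

// Time one call in microseconds using a monotonic clock.
template <typename F>
double time_us(F&& f) {
    const auto t0 = std::chrono::steady_clock::now();
    f();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count();
}

// Return the fraction of the measured time spent in the pixel-level stage.
double pixel_stage_share(int rows, int cols) {
    std::vector<std::uint8_t> frame(static_cast<std::size_t>(rows) * cols, 7);
    volatile std::uint32_t sink = 0;  // keeps the compiler from removing the work
    const double t_hdr = time_us([&] { sink = sink + read_header(frame); });
    const double t_pix = time_us([&] { sink = sink + invert_pixels(frame); });
    const double total = t_hdr + t_pix;
    return total > 0.0 ? t_pix / total : 0.0;
}
```

Calling `pixel_stage_share(1080, 1920)` returns a value between 0 and 1. A share close to 1 suggests the pixel-level function dominates and is worth investigating further with a real profiler.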

Hardware Profiling
PG037 AXI Performance Monitor
– Can be used to understand the operation of the hardware
– Initially the Accelerator does not exist
  • More likely to highlight that inputs and outputs are not meeting performance
  • OR that internal connections are not operating as fast as they could
– Allows a System Metric to be Monitored
[Diagram: Processing System (ARM® Dual Cortex-A9 MPCore™ System, Common Peripherals, Memory Interfaces) with Data In / Data Out; 7 Series Programmable Logic (Common Peripherals, Custom Peripherals, Common Accelerators, Custom Accelerators); AXI Monitor to be added]

Good Video Examples To Accelerate
Pixel Level Processing Functions
– Any video application performing pixel level processing
  • Find edges, Find motion, etc.
– HD video is a Plus
  • More Data Requires more Performance
– OpenCV Functions
  • OpenCV.org provides 2500 software algorithms for Computer Vision
  • The most popular functions have drop-in replacements with Vivado HLS video library functions
  • Discussed in Detail Later …
Frame level processing can be done on the CPU
  • Read Image, Scale Image, etc.

Steps to Accelerate the C Code
Partition out the C code to a top-level function
– There must be a single top-level function for synthesis
Add AXI Interfaces to the Design
– Select which Interfaces to use
– Select which ports to access via AXI4-Lite
Add Optimization Directives
– The primary optimization is typically pipelining
  • Many other optimizations are possible
– HLS C libraries can be used to optimize
Synthesize the design & Package the IP
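As an illustration of the first three steps, here is a minimal sketch of a top-level function with interface and pipeline directives. The function `threshold_top`, the bound `MAX_PIXELS`, and the `level` parameter are hypothetical. The pragma spellings follow later Vivado HLS releases and are an assumption here; the tool version behind these slides may express AXI interfaces differently. An ordinary C++ compiler ignores the pragmas, so the same source can be run in C simulation.

```cpp
#include <cassert>
#include <cstdint>

const int MAX_PIXELS = 1920 * 1080;  // assumed upper bound on the frame size

// Hypothetical single top-level function for synthesis: thresholds each pixel.
void threshold_top(const std::uint8_t src[MAX_PIXELS],
                   std::uint8_t dst[MAX_PIXELS],
                   int num_pixels,
                   std::uint8_t level) {
// Streaming data ports (assumed directive spelling).
#pragma HLS INTERFACE axis port=src
#pragma HLS INTERFACE axis port=dst
// Scalar arguments and block control grouped into an AXI4-Lite interface.
#pragma HLS INTERFACE s_axilite port=num_pixels
#pragma HLS INTERFACE s_axilite port=level
#pragma HLS INTERFACE s_axilite port=return
    for (int i = 0; i < num_pixels; ++i) {
// Primary optimization: request one new pixel per clock cycle.
#pragma HLS PIPELINE II=1
        dst[i] = static_cast<std::uint8_t>((src[i] > level) ? 255 : 0);
    }
}
```

Keeping all hardware-bound code behind one function like this makes the partition explicit: everything it calls is synthesized, and everything outside it stays in software.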

Optimize the C code for Synthesis
General case
– The source code functions now in hls.cpp should be made synthesizable
– Process is:
  • Read the Coding Style Guide
  • Synthesize & modify what cannot be synthesized
  • Review the results
  • Use optimization directives to improve performance or area (QoR)
  • Sometimes modify the code for better QoR

Optimization Strategy
Start with the Baseline Design
Pipeline the Loops & Functions
– Performance for each individual
Dataflow the Loops & Functions
– Performance operating side-by-side

Optimization   Baseline      Pipelined   Dataflow
BRAM           2792          2790        24
FF             891           1136        883
LUT            2315          2114        1606
Interval       128,744,588   4,150,224   2,076,613

[Diagram, Baseline / Pipelined: FIFO Stream; Sepia Filter; RAM; Sobel Filter; FIFO Stream. Process entire image …then… process entire image]
[Diagram, Dataflow (pipelined units): FIFO Stream; Sepia Filter; RAM; FIFO; Sobel Filter; FIFO Stream]
Pipelined and Parallel Operation
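The Interval figures on this slide translate directly into throughput gains. The sketch below works through that arithmetic, assuming the Interval row is reported in clock cycles per frame (as in HLS synthesis reports); the constant and function names are illustrative.

```cpp
#include <cassert>

// Interval figures from the slide (assumed to be clock cycles per frame).
const double kBaselineInterval  = 128744588.0;
const double kPipelinedInterval = 4150224.0;
const double kDataflowInterval  = 2076613.0;

// Throughput gain of one solution over another: the ratio of their intervals.
double speedup(double slower_interval, double faster_interval) {
    return slower_interval / faster_interval;
}
```

With these numbers, pipelining gives roughly a 31x gain over the baseline, dataflow roughly doubles that again, and the combined gain is roughly 62x. The BRAM count falls from 2792 to 24 in the dataflow solution.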

Pipeline for Optimal Performance
FPGA Benefits
– Operations can occur in parallel/concurrently/pipelined
– Vivado HLS creates a high-performance pipeline by directive

hls.cpp

    #include "hls_video.h"   // HLS video library
    #include "hls_opencv.h"  // HLS OpenCV I/O

    // AXI_STREAM, RGB_IMAGE and RGB_PIXEL are typedefs not shown on the slide.
    void top(AXI_STREAM& src_axi, AXI_STREAM& dst_axi, int rows, int cols) {
        // Add directives to create AXI streaming interfaces
        RGB_IMAGE img[6];
        RGB_PIXEL pix(100, 100);
        // Dataflow all functions to work in parallel
    #pragma HLS dataflow
        hls::AXIvideo2Mat(src_axi, img[0]);
        hls::Sobel(img[0], img[1], 1, 0);
        hls::SubS(img[1], pix, img[2]);
        hls::Scale(img[2], img[3], 2, 0);
        hls::Erode(img[3], img[4]);
        hls::Dilate(img[4], img[5]);
        hls::Mat2AXIvideo(img[5], dst_axi);
    }

Verify the Synthesizable Code
Create a top-level C Test Bench
– This will verify the algorithm
– Most of this code can come from the original C source
  • It will not be synthesized; it only needs to be valid C/C++

    int main() {
        // … (elided on the slide)
        IplImage* src = cvLoadImage("test_1080p.bmp");
        IplImage* dst = cvCreateImage(cvGetSize(src), src->depth, src->nChannels);

        // HLS function (hls.cpp)
        hls_top(src_axi, dst_axi, src->height, src->width);

        // Code to validate the HLS code is correct
        cvSaveImage("result_1080p.bmp", dst);
        cvReleaseImage(&src);
        cvReleaseImage(&dst);

        // Check the results, return 0 only if correct
        return err_cnt;  // Return 0 if no errors
    }
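The self-checking pattern behind this test bench can be shown without OpenCV. The following is a minimal sketch, not the presenter's code: `scale_top` is a hypothetical function under test, `scale_ref` is a hypothetical golden reference standing in for the original software, and `run_test_bench` holds what would normally be the body of `main()`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical function under test; stands in for the synthesizable top-level function.
void scale_top(const std::uint8_t* src, std::uint8_t* dst, int n) {
    for (int i = 0; i < n; ++i) {
        dst[i] = static_cast<std::uint8_t>(src[i] >> 1);
    }
}

// Hypothetical golden reference, written independently of the function under test.
std::uint8_t scale_ref(std::uint8_t p) {
    return static_cast<std::uint8_t>(p / 2);
}

// Test bench body: returns the number of mismatches, so 0 means pass.
int run_test_bench() {
    const int n = 256;
    std::vector<std::uint8_t> src(n), dst(n);
    for (int i = 0; i < n; ++i) {
        src[i] = static_cast<std::uint8_t>(i);  // cover every 8-bit input value
    }
    scale_top(src.data(), dst.data(), n);

    int err_cnt = 0;
    for (int i = 0; i < n; ++i) {
        if (dst[i] != scale_ref(src[i])) {
            ++err_cnt;
        }
    }
    return err_cnt;
}
```

In a real project `main()` would return this error count, so a return value of 0 signals that the results match the reference.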

Validating the C Algorithm
Run C Simulation
Built-in C Development
– A frame of video is processed in less than 15 secs
– In RTL it takes 10 Hours
– Algorithms productively refined
Accelerated Development
– Faster development
  • Bit-accurate modeling
Seconds vs Hours: The C Simulation Advantage.

Synthesis
Run Synthesis
– Executed from the toolbar
– Vivado HLS is project based, allowing multiple solutions for the same project
Accelerated Development
– Synthesis typically takes less than 10 minutes to complete

Adding Directives in the GUI
1) Open the source code
2) Switch to directives pane
3) Select & right-click
Directives can also be applied using a Tcl script.

Create Optimized Solutions
Create New Solutions
– Based on the original C code
– Apply Optimizations
– Change Device
– Change clock period
Compare Solutions
– Performance & Area
– Trade-offs

RTL Simulation
Run Simulation
Automated RTL Verification
– The C testbench is re-used
– No need for the user to write an RTL testbench
HDL Simulation Support
– Xsim, Isim, ModelSim …
– On Linux: VCS, NCSim
Accelerated Development
– If the C test bench checks the design, the RTL is auto-verified
Test Bench must return a 0 to confirm the results are good
No RTL Test Bench is Required to Verify the Design.

Create IP
Create an IP
– IP Catalog & Vivado
– System Generator Block
– Pcore for XPS
Evaluate RTL Implementation
– Optionally launch RTL synthesis
– View the Project in Vivado
IP RTL block creation finishes in <1 Minute.

AXI4-Lite Interface C Driver Files
AXI4-Lite Interface
– Can be used to group multiple ports into a memory mapped interface
– C Driver files are created automatically
[Screenshot callouts: C Driver Files; IP Catalog package]
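To give a feel for what a memory-mapped driver does, here is a minimal sketch that models the accelerator's registers in software. Everything in it is an assumption made for illustration: the register indices, the control-bit positions, and the `FakeAccelerator` type are hypothetical, and the real offsets and function names come from the automatically generated driver files.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical register map; real offsets come from the generated driver header.
enum : std::uint32_t { REG_CTRL = 0, REG_ROWS = 1, REG_COLS = 2, REG_COUNT = 3 };
const std::uint32_t CTRL_START = 1u << 0;  // assumed bit position
const std::uint32_t CTRL_DONE  = 1u << 1;  // assumed bit position

// Software stand-in for the accelerator's memory-mapped registers.
struct FakeAccelerator {
    std::uint32_t regs[REG_COUNT] = {0, 0, 0};

    void write(std::uint32_t idx, std::uint32_t value) {
        regs[idx] = value;
        // Model: the block finishes as soon as it is started.
        if (idx == REG_CTRL && (value & CTRL_START)) {
            regs[REG_CTRL] = CTRL_DONE;
        }
    }

    std::uint32_t read(std::uint32_t idx) const { return regs[idx]; }
};

// Driver-style helper: program the arguments, start the block, poll for done.
bool run_accelerator(FakeAccelerator& dev, std::uint32_t rows, std::uint32_t cols) {
    dev.write(REG_ROWS, rows);
    dev.write(REG_COLS, cols);
    dev.write(REG_CTRL, CTRL_START);
    for (int polls = 0; polls < 1000; ++polls) {
        if (dev.read(REG_CTRL) & CTRL_DONE) {
            return true;
        }
    }
    return false;  // timed out
}
```

On real hardware the `read` and `write` calls would become accesses to the block's base address, which is the detail the generated driver files hide from the application code.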

Accelerator Integration
Integrate the IP
– IP Integrator
– System Generator for DSP
– Pcore into XPS
Take and Use the RTL Output
– Use the RTL in the IP package
Open the Vivado Project & examine the RTL design
Recommendation is to use IP Integrator.
[Screenshot callouts: Output Directory; Software Drivers; Output IP; Vivado RTL Directories; Vivado Project]

IP Integrator Support
Add Vivado HLS IP to IP Catalog
– Data types supported: IPI can propagate them
[Screenshot callouts: Vivado HLS; Vivado IP Catalog; IP Settings; Add IP; Browse to IP Zip File]

IP Integrator Supported
Add HLS from IP Catalog
[Screenshot callouts: Vivado; Select IP; Right-Click & Add IP; Open Block Design; Connect IP]

Export to SDK
Vivado: Perform Connections
– Configure Zynq
– Configure Interconnect
– Validate the Design
– Generate Output Products
Export to SDK
– The final step, updating the source code, will be performed in SDK

Thank You XILINX CONFIDENTIAL Xilinx Confidential.
