Energy and Performance Exploration of Accelerator Coherency Port

Energy and Performance Exploration of Accelerator Coherency Port Using Xilinx ZYNQ
Mohammadsadegh Sadri, Christian Weis, Norbert Wehn and Luca Benini
Department of Electrical, Electronic and Information Engineering (DEI), University of Bologna, Italy
Microelectronic Systems Design Research Group, University of Kaiserslautern, Germany
{mohammadsadegh.sadr2, luca.benini}@unibo.it, {weis, wehn}@eit.uni-kl.de
ver 0

Outline
  • Introduction
  • ZYNQ Architecture (Brief)
  • Motivations & Contributions
  • Infrastructure Setup (Hardware & Software)
  • Memory Sharing Methods
  • Experimental Results
  • Lessons Learned & Conclusion
Mohammadsadegh Sadri, Christian Weis, Norbert Wehn, Luca Benini – Energy and performance exploration of ACP Using ZYNQ

Introduction
Performance Per Watt!!
  • 1951, UNIVAC I: 0.015 operations per watt-second
  • Half a century later! 2012, ST P2012: 40 billion operations per watt-second

Introduction
Solution: Specialized functional units (Accelerators)
  - Better Performance Per Watt! Faster! More Power Efficient!
  - Problem can be more complicated! e.g. Multiple CPU cores!
  - Every processing element should have a consistent view of the shared memory!
  - Accelerator Coherency Port (ACP): allows accelerator hardware to perform coherent accesses to the CPU(s) memory space!
What about variables? The CPU should flush the cache! (The diagram distinguishes Case 1 and Case 2.)
[Diagram labels: TASK 1 to TASK 4; var 1, var 2, var 3; CPU, L1$, DRAM; "cached"; "???"]

Xilinx ZYNQ Architecture
[Block diagram; recoverable labels]
PS:
  - Peripherals (UART, USB, Network, SD, GPIO, …)
  - DMA Controller (ARM PL330)
  - DRAM Controller (Synopsys IntelliDDR MPMC)
  - Interconnect (ARM NIC-301)
  - L2 (PL310)
  - OCM
  - Snoop
  - ARM A9 with NEON, MMU and L1
PL:
  - AXI Masters, AXI Slaves
Ports between PL and PS: SGP0, SGP1, HP0, HP1, HP2, HP3, MGP0, MGP1, ACP

Motivations & Contributions
  - Various acceleration methods are addressed in the literature (GPU, hardware boards, …)
  - Which method is better to share data between CPU and Accelerator?
  - For each method:
      What is the data transfer speed?
      How much is the energy consumption?
      What is the effect of background workload on performance?
  - We develop an infrastructure (HW+SW) for the Xilinx ZYNQ
  - We run practical tests & measurements to quantify the efficiency of different CPU-accelerator memory sharing methods.
[Diagram labels: PL, PS; AXI Master (Accelerator); HP0, ACP; DRAM Controller; Snoop; L2 (PL310); OCM; two ARM A9 cores, each with L1, NEON and MMU]

Hardware
[Figure only; no slide text survives]

Software
Linux kernel-level drivers:
  • AXI Dummy Driver (simple driver):
      - Initializes the dummy AXI masters (HP1)
      - Triggers an endless read/write loop
  • AXI Driver (more complicated):
      - Handles AXI masters (ACP & HP0)
      - Memory allocation (over ACP: kmalloc; over HP: dma_alloc_coherent)
      - ISR registration
      - Statistics
      - Time measurement
User side:
  • AXI Driver user-side interface application
  • Background application: a simple memory read/write loop
  • Oprofile statistical profiler: measures all CPU performance metrics
[Diagram label: PL310]

Processing Task Definition
We define: different methods to accomplish the task.
We measure: execution time & energy.
  - Image sizes: 4 KBytes, 16K, 65K, 128K, 256K, 1 MBytes, 2 MBytes
  - Source Image (image_size bytes) @Source Address
  - Result Image (image_size bytes) @Dest Address
  - Buffers allocated by kmalloc or dma_alloc_coherent, depending on the memory sharing method
  - Selection of packets (addressing): Normal or Bit-reversed
  - Data path: read, FIFO (128K), FIR process, write
  - Loop: N times; measure the execution interval.
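The two packet addressing modes can be illustrated with a short sketch. The slide does not give the packet size or the exact permutation, so this assumes a hypothetical `packet_size` parameter and the usual FFT-style bit reversal of the packet index; `bit_reverse` and `packet_offsets` are illustrative names, not part of the authors' driver.

```python
from typing import List


def bit_reverse(index: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of `index` (FFT-style bit reversal)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def packet_offsets(image_size: int, packet_size: int,
                   bit_reversed: bool = False) -> List[int]:
    """Return the byte offset of each packet in the order it is visited.

    Assumption (not from the slides): the image is split into equal packets
    and the packet count is a power of two, so bit reversal is a permutation.
    """
    if image_size <= 0 or packet_size <= 0 or image_size % packet_size != 0:
        raise ValueError("image_size must be a positive multiple of packet_size")
    count = image_size // packet_size
    if count & (count - 1) != 0:
        raise ValueError("packet count must be a power of two")
    bits = count.bit_length() - 1
    order = [bit_reverse(i, bits) if bit_reversed else i for i in range(count)]
    return [i * packet_size for i in order]


if __name__ == "__main__":
    # 4 KByte image split into hypothetical 512-byte packets.
    print(packet_offsets(4096, 512))
    print(packet_offsets(4096, 512, bit_reversed=True))
```

Normal addressing walks the buffer sequentially, while the bit-reversed order jumps across it, which is one plausible way to stress caches and DRAM with a less regular access pattern.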

Memory Sharing Methods
  • ACP Only (HP Only is similar, but without the SCU and L2 in the path)
      Path: Accelerator, ACP, SCU, L2, DRAM
  • CPU only (with & without cache)
  • CPU ACP (CPU HP is similar)
      Path: CPU and Accelerator (steps 1 and 2 in the diagram), ACP, SCU, L2, DRAM
      Sequence: ACP --- CPU --- ACP ---

Speed Comparison
  - ACP loses!
  - CPU OCM lies between CPU ACP & CPU HP
[Plot: transfer speed versus image size (4K, 16K, 64K, 128K, 256K, 1 MBytes); annotated values: 298 MBytes/s and 239 MBytes/s]
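A speed figure in MBytes/s can be derived from the measured execution interval of the N-times loop. The sketch below shows one way to do that arithmetic; the function name, the example numbers, and the choice of 1 MByte = 1,000,000 bytes are assumptions, since the slides do not say how the authors computed their values.

```python
def throughput_mbytes_per_s(image_size_bytes: int, loops: int, elapsed_s: float,
                            bytes_per_mbyte: int = 1_000_000) -> float:
    """Bytes processed over all loop iterations, divided by elapsed time.

    `bytes_per_mbyte` is an assumption; pass 1024 * 1024 for binary megabytes.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    total_bytes = image_size_bytes * loops
    return total_bytes / elapsed_s / bytes_per_mbyte


if __name__ == "__main__":
    # Placeholder inputs for illustration only, not measurements from the slides:
    # a 128 KiB image processed 1000 times in 0.5 s.
    rate = throughput_mbytes_per_s(128 * 1024, loops=1000, elapsed_s=0.5)
    print(f"{rate:.1f} MBytes/s")
```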

Dummy Traffic Effect
  - ACP: 1664 MBytes/s
  - HP: 1382 MBytes/s
  - CPU dummy traffic occupies cache entries, so fewer free entries remain for the accelerator.
[Plot annotation: 256K]

Power Comparison
[Figure only; no slide text survives]

Energy Comparison
  - CPU only methods: worst case!
  - CPU OCM: always between CPU ACP and CPU HP
  - CPU ACP: always better energy than CPU HP0
  - When the image size grows, CPU ACP converges to CPU HP0
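Energy combines the two quantities measured earlier: average power multiplied by execution time. A minimal sketch of that relation follows, assuming power is roughly constant over the interval; the method names and numbers are placeholders for illustration, not results from the slides.

```python
def energy_joules(avg_power_w: float, elapsed_s: float) -> float:
    """Energy = average power x execution time (assumes near-constant power)."""
    if avg_power_w < 0 or elapsed_s < 0:
        raise ValueError("power and time must be non-negative")
    return avg_power_w * elapsed_s


if __name__ == "__main__":
    # Placeholder (power in W, time in s) pairs, NOT measured values.
    runs = {
        "method_a": (0.60, 0.40),  # draws more power but finishes sooner
        "method_b": (0.55, 0.50),  # draws less power but runs longer
    }
    energies = {name: energy_joules(p, t) for name, (p, t) in runs.items()}
    best = min(energies, key=energies.get)
    for name, joules in energies.items():
        print(f"{name}: {joules:.3f} J")
    print("lowest energy:", best)
```

The placeholder pair shows why power and energy are compared separately: a method with higher power draw can still use less energy if it finishes sooner.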

Lessons Learned & Conclusion
  • If a specific task should be done by the cooperation of CPU and accelerator:
      - CPU ACP and CPU OCM are always better than CPU HP in terms of energy
      - If we are running other applications which heavily depend on caches, CPU OCM and then CPU HP are preferred!
  • If a specific task should be done by accelerator only:
      - For small arrays, ACP Only & OCM Only can be used
      - For large arrays (> size of L2$), HP Only always acts better.
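The guidelines above can be read as a small decision rule. The sketch below encodes them under two assumptions that are not on the slides: the L2 size defaults to 512 KB (the Zynq-7000 L2 cache size), and "small" is taken to mean "not larger than L2". The function and parameter names are hypothetical.

```python
from typing import List

# Assumption: 512 KB L2 cache, as on Zynq-7000 devices; not stated on the slides.
L2_SIZE_BYTES = 512 * 1024


def suggest_methods(cpu_cooperates: bool,
                    cache_heavy_background: bool = False,
                    array_bytes: int = 0,
                    l2_size_bytes: int = L2_SIZE_BYTES) -> List[str]:
    """Return the sharing methods the conclusions favor, in order of preference."""
    if cpu_cooperates:
        if cache_heavy_background:
            # Other applications depend heavily on the caches.
            return ["CPU OCM", "CPU HP"]
        # Best energy when CPU and accelerator cooperate.
        return ["CPU ACP", "CPU OCM"]
    if array_bytes > l2_size_bytes:
        # Large arrays (> size of L2$).
        return ["HP Only"]
    # Assumption: "small arrays" means arrays that fit in L2.
    return ["ACP Only", "OCM Only"]


if __name__ == "__main__":
    print(suggest_methods(cpu_cooperates=True))
    print(suggest_methods(cpu_cooperates=True, cache_heavy_background=True))
    print(suggest_methods(cpu_cooperates=False, array_bytes=2 * 1024 * 1024))
    print(suggest_methods(cpu_cooperates=False, array_bytes=4096))
```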