The World Leader in High Performance Signal Processing Solutions

Multi-core programming frameworks for embedded systems
Kaushal Sanghai and Rick Gentile
Analog Devices Inc., Norwood, MA

Outline
  • Multi-core programming challenge
  • Framework requirements
  • Framework methodology
  • Multimedia data-flow analysis
  • BF561 dual-core architecture analysis
  • Framework models
  • Combining frameworks
  • Results
  • Conclusion

Multi-core Programming Challenge
  • To meet the growing processing demands of embedded applications, multi-core architectures have emerged as a promising solution
  • Embedded developers strive to take advantage of the extra core(s) without a corresponding increase in programming complexity
  • Ideally, the performance increase should approach “N” times, where “N” is the number of cores
  • Managing shared memory and inter-core communication makes the difference!
  • A framework that manages code and data helps to shorten development time and ensure optimal performance
  • We target compute-intensive, high-bandwidth applications on an embedded dual-core processor
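
The "approach N times" goal can be made concrete with a small calculation. The sketch below is only an illustration: the function names are hypothetical, and the sample numbers (about 80 and 42 core cycles per pixel for line processing) are taken from the Results slide on the assumption that the larger figure is the single-core one.

```python
def speedup(single_core_cycles, multi_core_cycles):
    """Ratio of single-core to multi-core cycles per pixel."""
    return single_core_cycles / multi_core_cycles


def parallel_efficiency(single_core_cycles, multi_core_cycles, num_cores):
    """Achieved speed-up divided by the ideal N-times speed-up (1.0 = perfect scaling)."""
    return speedup(single_core_cycles, multi_core_cycles) / num_cores


# Assumed reading of the line-processing result: 80 cycles/pixel on one core, 42 on two.
line_speedup = speedup(80, 42)
line_efficiency = parallel_efficiency(80, 42, num_cores=2)
print(f"speed-up: {line_speedup:.2f}x, efficiency: {line_efficiency:.0%}")
```

An efficiency close to 1.0 means the second core is almost fully used; the gap to 1.0 is the cost of shared-memory and inter-core management.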

Framework requirements
  • Scalable across multiple cores
  • Equal load balancing between all cores
  • A core data item request is always met at the L1 memory level
  • Minimum possible data memory footprint

Framework methodology
  • Understanding the parallel data-flow of the application with respect to spatial and temporal locality
  • Efficiently mapping the data-flow to the private and shared resources of the architecture

Multimedia Data-flow Analysis

ADSP-BF561 Dual-core Architecture Analysis
  • Dual-core architecture
  • Private L1 code and data memory
  • Shared L2 and external memory
  • 4 memory DMA channels
  • Shared peripheral interface
[Figure: memory hierarchy. Core A and Core B, each with private L1 SRAM/cache: L1 code (32 KB) and L1 data (64 KB), 1 cclk. Shared L2, unified code and data (128 KB), 9 cclk. SDRAM (L3), 4 x (16 – 128 MB), 8-10 sclk.]

Framework models
  • Slice/line processing
  • Macro-block
  • Frame
  • GOP processing

Framework design
  • Data is moved directly from the peripheral DMA to the lowest possible memory level (Level 1 or Level 2), based on the data access granularity
  • DMA is used for all data management across memory levels, saving essential core cycles in managing data
  • Multiple data buffers are used to avoid core and DMA contention
  • Semaphores are used for inter-core communication
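
The buffering and signalling points above can be sketched as a small Python model (not Blackfin code). Two threads stand in for the DMA engine and a core, and a pair of buffers plays the role of the double buffer. The names run_pipeline, dma_producer and core_consumer and the pixel-inversion kernel are hypothetical. Using semaphores between the DMA and the core is also a simplification made here: the slide only states that semaphores handle inter-core communication.

```python
import threading

NUM_BUFFERS = 2  # ping-pong pair, standing in for two L1 buffers


def run_pipeline(lines):
    """Pass `lines` through a double buffer: one thread fills, the other processes."""
    buffers = [None] * NUM_BUFFERS
    empty = [threading.Semaphore(1) for _ in range(NUM_BUFFERS)]  # buffer free to fill
    full = [threading.Semaphore(0) for _ in range(NUM_BUFFERS)]   # buffer ready to process
    results = []

    def dma_producer():
        for i, line in enumerate(lines):
            slot = i % NUM_BUFFERS
            empty[slot].acquire()        # wait until the core has released this buffer
            buffers[slot] = list(line)   # stands in for a DMA transfer into L1
            full[slot].release()         # signal that the data is ready

    def core_consumer():
        for i in range(len(lines)):
            slot = i % NUM_BUFFERS
            full[slot].acquire()         # wait for the transfer to complete
            results.append([255 - p for p in buffers[slot]])  # placeholder kernel
            empty[slot].release()        # hand the buffer back for the next transfer

    threads = [threading.Thread(target=dma_producer),
               threading.Thread(target=core_consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_pipeline([[0, 1], [2, 3], [4, 5]]))
```

While the consumer works on one buffer, the producer is free to fill the other, which is the contention-avoidance effect the slide describes.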

Line processing framework
[Figure: Video In to Video Out. Rx_Line0, Rx_Line2, Tx_Line0, Tx_Line2 in Core A internal L1 memory; Rx_Line1, Rx_Line3, Tx_Line1, Tx_Line3 in Core B internal L1 memory.]
  • No L2 or L3 accesses made, thereby saving external memory bandwidth
  • Only DMA channels used to manage data
  • Applicable examples - color conversion, histogram equalization, f…
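
The figure labels suggest that even-numbered lines go to Core A and odd-numbered lines to Core B. A minimal sketch of that interleaving, assuming a simple round-robin split and a per-line kernel with no dependency on neighbouring lines (assign_lines and process_frame are hypothetical names):

```python
def assign_lines(num_lines, num_cores=2):
    """Return, for each core, the indices of the video lines it receives."""
    return [list(range(core, num_lines, num_cores)) for core in range(num_cores)]


def process_frame(frame, kernel, num_cores=2):
    """Apply `kernel` to every line, grouped by owning core, and reassemble in order."""
    out = [None] * len(frame)
    for core_lines in assign_lines(len(frame), num_cores):
        # On hardware each core would run its own loop concurrently;
        # here the loops run one after the other for clarity.
        for idx in core_lines:
            out[idx] = kernel(frame[idx])
    return out


print(assign_lines(4))  # [[0, 2], [1, 3]]
```

Because each line is processed independently, the two cores never need the same data, which is why this template stays entirely in L1.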

Macro-block processing framework
[Figure: Video In (PPI) into L2 Rx buffers; Rx and Tx buffers in Core A and Core B internal L1 memory; L2 Tx buffers.]
  • No L3 accesses
  • Applicable examples - edge detection, JPEG/MJPEG encoding/decoding…

Frame processing framework
[Figure: L3 frame buffers (Rx0, Rx1, Tx0, Tx1) feeding Rx0/Rx1 buffers in Core A and Core B internal L1 memory.]
  • Applicable example - motion detection

GOP processing framework
[Figure: L3 frame buffers, GOP = 4; Rx0, Rx1, Tx0, Tx1 frame buffers feeding Rx0/Rx1 buffers in Core A and Core B internal L1 memory.]
  • Applicable examples - encoding/decoding algorithms such as MP…

Results

Template | Core cycles/pixel* (approx.), single core | Core cycles/pixel* (approx.), two cores | L1 data memory required (bytes) | L2 data memory required (bytes) | Comments
Line processing | 80 | 42 | (line size)*2; for ITU-656: 1716*2 | - | Double buffering in L1
Macro-block processing | 72 | 36 | (macro-block size (n x m))*2 | Slice of a frame; (macro-block height * line size)*4 | Double buffering in L1 and L2
Frame processing | 70 | 35 | (size of sub-processing block)*(number of dependent blocks) | (size of sub-processing block)*(number of dependent blocks) | Only L1 or L2 cannot be used; double buffering in L1 or L2
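
The memory formulas in the results can be evaluated directly. The sketch below codes them as small helpers with hypothetical names. The 16 x 16 macro-block and one byte per sample are assumptions chosen for illustration; the 1716-byte ITU-656 line comes from the slide.

```python
L1_DATA_BYTES = 64 * 1024    # L1 data memory per core, from the architecture slide
L2_BYTES = 128 * 1024        # shared L2 memory, from the architecture slide


def line_l1_bytes(line_size):
    """Line template: two line buffers in L1 (double buffering)."""
    return line_size * 2


def macroblock_l1_bytes(n, m):
    """Macro-block template: two n x m blocks in L1 (assumes 1 byte per sample)."""
    return n * m * 2


def macroblock_l2_bytes(mb_height, line_size):
    """Macro-block template: slices of a frame held in L2, (height * line size) * 4."""
    return mb_height * line_size * 4


itu656_l1 = line_l1_bytes(1716)           # 3432 bytes
mb_l1 = macroblock_l1_bytes(16, 16)       # 512 bytes (assumed 16 x 16 block)
mb_l2 = macroblock_l2_bytes(16, 1716)     # 109824 bytes
print(itu656_l1 <= L1_DATA_BYTES, mb_l1 <= L1_DATA_BYTES, mb_l2 <= L2_BYTES)
```

Under these assumptions all three buffers fit their memory level, with the L2 slice buffers using most of the 128 KB.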

Using the Templates
Identify the following items for an application:
  • The granularity of the sub-processing block in the image processing algorithm
  • The available L1 and L2 data memory, as required by the specific templates
  • The estimated computation cycles required per sub-processing block
  • The spatial and temporal dependencies between the sub-processing blocks. If dependencies exist, the templates need modification to account for them
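
The cycle estimate in the checklist above can be turned into a rough real-time budget check. This is a sketch under stated assumptions, not a figure from the presentation: the 720 x 480 frame, 30 frames per second and 600 MHz core clock are illustrative values, the helper names are hypothetical, and the cycles-per-pixel numbers are treated as the effective rate for the whole system.

```python
def required_cycles_per_second(cycles_per_pixel, width, height, fps):
    """Core cycles needed each second to keep up with the video stream."""
    return cycles_per_pixel * width * height * fps


def fits_budget(cycles_per_pixel, width, height, fps, core_clock_hz):
    """True if the estimated load does not exceed the available core clock."""
    return required_cycles_per_second(cycles_per_pixel, width, height, fps) <= core_clock_hz


CORE_CLOCK_HZ = 600_000_000  # assumed core clock for this example

# Assumed workload: 720 x 480 pixels at 30 frames per second.
print(fits_budget(80, 720, 480, 30, CORE_CLOCK_HZ))  # about 829 M cycles/s needed
print(fits_budget(42, 720, 480, 30, CORE_CLOCK_HZ))  # about 435 M cycles/s needed
```

With these assumed numbers, the 80 cycles/pixel estimate exceeds the budget while 42 cycles/pixel fits, which is the kind of comparison the per-block cycle estimate is meant to support.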

Conclusion
  • Understanding the data access pattern of an application is key to an efficient programming model for embedded systems
  • The frameworks combine techniques to efficiently manage the shared resources and exploit the known data access patterns in multimedia applications to achieve a 2X speed-up
  • The memory footprint is equal to the smallest data access granularity of the application
  • The frameworks can be combined to integrate multiple algorithms with different data access patterns within an application