Platform Design: Exploiting Data Level Parallelism (DLP), SIMD


![SIMD performance: computational efficiency [MOPS/W] of application specific cores and SIMD](https://slidetodoc.com/presentation_image_h2/eb9057736e760fac70e5bce8eb24c0ca/image-3.jpg)

- Slides: 35

Platform Design: Exploiting Data Level Parallelism (DLP), SIMD architectures
TU/e 5kk70
Henk Corporaal, Bart Mesman
Platform Design, H. Corporaal and B. Mesman

[Figure: processor types ordered along two opposing axes, flexibility and efficiency.]
• Programmable CPU
• Programmable DSP
• Application specific instruction set processor (ASIP)
• Application specific processor
10/26/2021 Platform Design H. Corporaal and B. Mesman
SIMD Performance
[Figure: computational efficiency [MOPS/W], from 10^0 to 10^6, versus feature size [um], from 2 down to 0.07, for application specific cores, SIMD, and programmable processors. Source: [Roza]]

SIMD: Topics Overview
• Enhance performance: architecture methods
• Data Level Parallelism
  – Application area
  – Subword parallelism
• Locally connected SIMDs
  – Xetal
• Fully connected SIMDs
  – Imagine
• Communication in SIMD processors
  – RC-SIMD
  – DC-SIMD

Enhance performance: 4 architecture methods
• (Super)-pipelining
• Powerful instructions
  – MD-technique
    • multiple data operands per operation
  – MO-technique
    • multiple operations per instruction
• Multiple instruction issue

Characteristics of Media Applications
• Poorly matched to conventional architectures
  – Caches
  – Instruction-Level Parallelism
  – Few arithmetic units
• Well-matched to modern VLSI technology
  – Lots (100's - 1000's) of ALUs fit on a single chip
Communication/synchronization is often the bottleneck

Architecture methods: Powerful Instructions (1)
MD-technique
• Multiple data operands per operation
• SIMD: Single Instruction Multiple Data
Loop:
    for (i = 0; i < 64; i++) c[i] = a[i] + 5*b[i];
Vector instruction:
    c = a + 5*b
Assembly:
    set    vl, 64
    ldv    v1, 0(r2)
    mulvi  v2, v1, 5
    ldv    v1, 0(r1)
    addv   v3, v1, v2
    stv    v3, 0(r3)
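The correspondence between the loop and the vector sequence can be sketched in plain C. This is a minimal illustration, not the authors' code: the function names are hypothetical, the arrays `v1`, `v2`, `v3` merely stand in for vector registers, and the mapping r1 = a, r2 = b, r3 = c is inferred from the assembly on the slide.

```c
#include <stddef.h>

#define VL 64  /* vector length, as in "set vl, 64" on the slide */

/* Scalar reference: one multiply and one add per loop iteration. */
void scale_add_scalar(const int *a, const int *b, int *c, size_t n) {
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + 5 * b[i];
    }
}

/* Strip-mined form: each outer iteration models one pass of the
 * five-instruction vector sequence over at most VL elements.
 * Assumption: r1 points to a, r2 to b, r3 to c. */
void scale_add_strips(const int *a, const int *b, int *c, size_t n) {
    for (size_t base = 0; base < n; base += VL) {
        size_t len = (n - base < VL) ? n - base : VL;  /* last strip may be short */
        int v1[VL], v2[VL], v3[VL];                    /* stand-ins for vector registers */
        for (size_t i = 0; i < len; i++) v1[i] = b[base + i];    /* ldv   v1, 0(r2)  */
        for (size_t i = 0; i < len; i++) v2[i] = v1[i] * 5;      /* mulvi v2, v1, 5  */
        for (size_t i = 0; i < len; i++) v1[i] = a[base + i];    /* ldv   v1, 0(r1)  */
        for (size_t i = 0; i < len; i++) v3[i] = v1[i] + v2[i];  /* addv  v3, v1, v2 */
        for (size_t i = 0; i < len; i++) c[base + i] = v3[i];    /* stv   v3, 0(r3)  */
    }
}
```

For n = 64 the outer loop runs once, so the 64 scalar iterations collapse into a single pass of five vector instructions plus the `set vl`.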

Architecture methods: Powerful Instructions (1)
SIMD computing
• Exploit data locality of e.g. image processing applications
• Effect on code size?
• Effect on power consumption?
[Figure: SIMD execution method. Over time, node 1, node 2, ..., node K all execute the same sequence Instruction 1, Instruction 2, Instruction 3, ..., Instruction n.]

Architecture methods: Powerful Instructions (1)
• Sub-word parallelism
  – SIMD on a restricted scale
  – Used for multi-media instructions
  – Motivation: use a powerful 64-bit ALU as 4 x 16-bit ALUs
• Examples
  – MMX, SUN-VIS, HP MAX-2, AMD K7/Athlon 3DNow!, Trimedia II
  – Example: sum over i = 1..4 of |ai - bi|
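A minimal C sketch of the "64-bit ALU as 4 x 16-bit ALUs" idea, under stated assumptions: the function names are hypothetical, lane 0 is placed in the low 16 bits, and the sum-of-absolute-differences routine walks the lanes in software where a multimedia instruction set would do the work in one or two instructions.

```c
#include <stdint.h>

/* Most significant bit of each of the four 16-bit lanes. */
#define LANE_MSBS 0x8000800080008000ULL

/* Pack four 16-bit lanes into one 64-bit word (lane 0 in the low bits). */
uint64_t pack4x16(uint16_t l0, uint16_t l1, uint16_t l2, uint16_t l3) {
    return (uint64_t)l0 | ((uint64_t)l1 << 16) |
           ((uint64_t)l2 << 32) | ((uint64_t)l3 << 48);
}

/* Four independent 16-bit additions with a single 64-bit add.
 * The lane MSBs are masked off so no carry crosses a lane boundary,
 * then the MSB of each lane is restored with an exclusive-or. */
uint64_t padd16(uint64_t a, uint64_t b) {
    uint64_t low = (a & ~LANE_MSBS) + (b & ~LANE_MSBS);
    return low ^ ((a ^ b) & LANE_MSBS);
}

/* Sum of |ai - bi| over the four lanes, as in the slide's example. */
uint32_t sad4x16(uint64_t a, uint64_t b) {
    uint32_t sum = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t ai = (uint32_t)((a >> (16 * lane)) & 0xFFFF);
        uint32_t bi = (uint32_t)((b >> (16 * lane)) & 0xFFFF);
        sum += (ai > bi) ? ai - bi : bi - ai;
    }
    return sum;
}
```

For example, `sad4x16(pack4x16(10, 20, 30, 40), pack4x16(13, 18, 30, 45))` adds 3 + 2 + 0 + 5.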

LC-SIMD (Locally connected; e.g. Xetal, Imap)
• Long communication delays: shift operations
[Figure: an instruction bus feeding PE 0 through PE 319.]
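Why shift-based communication gives long delays can be sketched as follows. This is a simplified model, assuming a linear array in which one shift instruction moves every value exactly one PE to the right; the function names and the `fill` value for PE 0 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One SIMD shift step: every PE takes the value of its left neighbour.
 * PE 0 has no left neighbour and receives `fill`. */
void lc_shift_right(int *pe, size_t num_pes, int fill) {
    assert(num_pes > 0);
    for (size_t i = num_pes - 1; i > 0; i--) {
        pe[i] = pe[i - 1];
    }
    pe[0] = fill;
}

/* Number of shift steps needed to move a value from PE src to PE dst. */
size_t lc_shift_steps(size_t src, size_t dst) {
    return (src > dst) ? src - dst : dst - src;
}
```

In this model a value travelling from PE 0 to PE 319 needs 319 shift instructions, during which all other PEs shift as well.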

FC-SIMD (Fully connected; Imagine)
• Expensive communication network
[Figure: an instruction bus feeding PE 0, PE 1, PE 2, ..., PE 319, all attached to a fully connected communication network.]

LC: Xetal
Objectives
• High degree of system integration: CMOS imaging + DSP, low cost camera systems
• Low power consumption: mobile & remote sensing
• Flexibility: programmable DSP and control functions

Xetal Architecture

Parallel Processing (SIMD)
• 2 columns/processor
• neighbour communication
• low-speed clock (16 MHz)
• clock gating
• shared address decoding
• minimal memory read access
Result: LOW-POWER

Xetal Specs & Performance

Simulation Results (1-input)

Simulation Results (1 output)

Simulation Results (2)

Imagine: Representative Applications
• Stereo Depth Extraction
• Polygon Rendering
• MPEG Encoding/Decoding
[Figure: Render; Encode/Decode between a 2D video stream and encoded 2D data (101100 010110 001001).]

Stream Processing
[Figure: input data streams Image 0 and Image 1 pass through convolve kernels; a SAD kernel combines the resulting streams into the output data, a Depth Map.]
• Little data reuse (pixels never revisited)
• Highly data parallel (output pixels not dependent on other output pixels)
• Compute intensive (60 arithmetic ops per memory reference)
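The kernel chain in the figure can be sketched in C as two small functions applied to streams of pixels. This is a deliberately reduced illustration, not the Imagine kernels: it uses a hypothetical 3-tap, one-dimensional convolution and a single SAD over whole streams, whereas the stereo depth application works on 2D windows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Convolve kernel: 3-tap filter over a stream of n pixels.
 * Produces n - 2 outputs; each output depends only on input pixels. */
void convolve3(const int *in, size_t n, const int k[3], int *out) {
    assert(n >= 3);
    for (size_t i = 0; i + 2 < n; i++) {
        out[i] = k[0] * in[i] + k[1] * in[i + 1] + k[2] * in[i + 2];
    }
}

/* SAD kernel: sum of absolute differences between two streams. */
long sad(const int *x, const int *y, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        s += labs((long)x[i] - (long)y[i]);
    }
    return s;
}
```

Each input pixel is read once and then only intermediate streams are passed between kernels, which is the "little data reuse, highly data parallel" pattern the bullets describe.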

Stream Architecture Provides Data Bandwidth Hierarchy
[Figure: SDRAM banks feed the Stream Register File, which feeds the ALU clusters under SIMD/VLIW control.]
Peak BW:
• SDRAM to Stream Register File: 2 GB/s
• Stream Register File to ALU clusters: 32 GB/s
• Within the ALU clusters: 544 GB/s
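As a worked ratio, using only the three peak figures on the slide:

$$\frac{32\ \text{GB/s}}{2\ \text{GB/s}} = 16, \qquad \frac{544\ \text{GB/s}}{32\ \text{GB/s}} = 17, \qquad \frac{544\ \text{GB/s}}{2\ \text{GB/s}} = 272$$

Each level of the hierarchy therefore offers roughly 16 to 17 times the peak bandwidth of the level below it, and the ALU clusters can consume operands 272 times faster than SDRAM can deliver them.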

Application Data Bandwidth Usage
[Figure: SDRAM, Stream Register File, and ALU Cluster with bandwidth levels 2 GB/s, 32 GB/s, and 544 GB/s.]

Stream Register File: Details
• SRF: single-ported 128 KB SRAM (1024 x 32 W), 32 W/cycle
• Arbiter
• Stream buffers to/from the arithmetic clusters
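The capacity figures can be cross-checked with a short calculation, assuming one word (W) is 32 bits, i.e. 4 bytes; the slide does not state the word size, so this is an assumption, though it agrees with the 32-bit units of the arithmetic clusters:

$$1024 \times 32\ \text{W} = 32768\ \text{W} = 32\ \text{kW}, \qquad 32768\ \text{W} \times 4\ \text{B/W} = 131072\ \text{B} = 128\ \text{KB}$$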

Arithmetic Cluster: Details
[Figure: adders (+ + +), multipliers (* *), and a divider (/), each with a local register file, plus a CU and a cross point; connections from the SRF, to the SRF, and to the intercluster network.]
• Units support floating-point / 32-bit / dual 16-bit / quad 8-bit instructions
  – 4-cycle pipelined FMUL, FADD, FSUB, FTOI, ITOF, FFRAC
  – 17-cycle FDIV (pipelined for 1 FDIV every 7 cycles)
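The difference between FDIV latency and FDIV issue rate can be made concrete with a small timing sketch. It assumes a single divider, independent operands, and a new FDIV issued every 7 cycles starting at cycle 0; the function name is hypothetical.

```c
#include <assert.h>

enum {
    FDIV_LATENCY  = 17,  /* cycles from issue to result, per the slide */
    FDIV_INTERVAL = 7    /* a new FDIV can start every 7 cycles */
};

/* Cycle at which the last of n back-to-back independent FDIVs completes. */
unsigned fdiv_cycles(unsigned n) {
    assert(n > 0);
    return FDIV_LATENCY + FDIV_INTERVAL * (n - 1);
}
```

Under these assumptions ten divisions finish after 17 + 9 x 7 = 80 cycles, compared with 170 cycles if each division had to wait for the previous one to complete.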

The Imagine Stream Processor
[Figure: block diagram with the following blocks.]
• Host Processor
• Stream Controller
• Microcontroller: 2K VLIW instrs
• ALU Cluster 0 through ALU Cluster 7
• Stream Register File: 32 kW SRAM
• Streaming Memory System (SDRAM)
• Network Interface (Network)

Imagine Floorplan
• 22 million transistors
• 500 MHz
• TI GS30KA:
  – 0.15 um Ldrawn
  – 0.13 um Leff
  – CMOS process

Imagine Programming Environment

Application level:
    StereoDepthExtraction(…) {
        // Load Input Images
        ...
        // Run Kernels
        convolve7x7(RawImage, ConvImage);
        convolve3x3(ConvImage, Conv2Image);
        ...
        // Store Output
    }

Kernel level:
    Convolve7x7(…) {
        ...
        while (!In.empty()) {
            ...
            p0  = k0  * in10;
            p12 = k21 * in32;
            p34 = k43 * in54;
            p56 = k65 * in76;
            sum = (p0 + p12) + (p34 + p56);
            ...
        }
    }

Applications
• Algorithms need dynamic communication:
  – lens distortion
  – bucket processing
  – mirroring, …

Imap

Imap

DC-SIMD Architecture
[Figure: PE_1 through PE_7 with registers R1 through R7 on Bus_0, Bus_1, and Bus_2. Example transfers between PE_6 and PE_3 and between PE_4 and PE_2.]
Message format fields: V, dst-add, src-add, data

DC-SIMD Architecture
[Figure: PE_1 through PE_7 with registers R1 through R7 on Bus_0, Bus_1, and Bus_2.]
Larger distance: transfer between PE_7 and PE_1

DC-SIMD Architecture
[Figure: PE_1 through PE_7 with registers R1 through R7 on Bus_0, Bus_1, and Bus_2. Transfers between PE_7 and PE_5 and between PE_6 and PE_2.]
Priority

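DC-SIMD: arbitration
• PE read: the des-add field of a message (V, des-add, src-add, data) is compared (xor) with PEid to decide whether the PE reads the data
• PE write: give priority to further PEs
[Figure: PEn, PEn+1, and PEn+2 and the incoming message v feed a selector, controlled by Select (ab), in front of the next reg.]
Select (ab):
    ab = 00: v
    ab = 01: n+2
    ab = 10: n+1
    ab = 11: n
    a = v'.2'
    b = a'.v' + a.1'
Buffer instruction:
    n+2 : 2.v
    n+1 : (2+v).1
    n   : (1+2+v).0
(Here v, 2, 1, and 0 appear to denote a valid message from the previous register, PEn+2, PEn+1, and PEn respectively.)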
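The read match and the select equations can be written out as a small C sketch. It rests on an interpretation of the slide, not on a documented design: v, 2, 1, and 0 are taken to be "valid" flags for the message already travelling on the bus and for PEn+2, PEn+1, and PEn, the field widths in `Message` are arbitrary, and all type and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Message fields named on the slide; the widths are assumptions. */
typedef struct {
    bool     valid;  /* V */
    uint8_t  dst;    /* des-add */
    uint8_t  src;    /* src-add */
    uint16_t data;
} Message;

/* PE read: accept a valid message whose destination xor PEid is zero. */
bool pe_accepts(Message m, uint8_t pe_id) {
    return m.valid && ((m.dst ^ pe_id) == 0);
}

/* Select codes (ab) from the slide's table. */
typedef enum { SEL_V = 0, SEL_N2 = 1, SEL_N1 = 2, SEL_N = 3 } Select;

/* a = v'.2'   b = a'.v' + a.1'   (priority: v, then n+2, n+1, n) */
Select select_ab(bool v, bool p2, bool p1) {
    bool a = !v && !p2;
    bool b = (!a && !v) || (a && !p1);
    return (Select)((a << 1) | b);
}

/* Buffer (stall) conditions: a PE holds its message when a source
 * with higher priority is sending in the same cycle. */
typedef struct { bool n2, n1, n0; } Buffer;

Buffer buffer_signals(bool v, bool p2, bool p1, bool p0) {
    Buffer s = { p2 && v, (p2 || v) && p1, (p1 || p2 || v) && p0 };
    return s;
}
```

With this reading, a message already on the bus always wins, and among the local PEs the one furthest along wins, which matches "give priority to further PEs".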

Conclusions
• SIMD nicely matches
  – Image applications: data-level parallelism
  – VLSI efficiency: copy-paste of simple elements
• So
  – Very efficient architecture for image processing
  – Low power! Also by trading off clock vs performance
  – High memory bandwidth with a single memory port
• But
  – Programmer is burdened with vector thinking and code rewriting
  – Compilers are not good at recognizing opportunities for vector execution
  – How to provide the data: vector memories and register files
  – Need for a "control" processor for control code and if-then-else
• Communication is a problem:
  – Unable to perform indirect PE addressing -> DC-SIMD