Platform Design Exploiting Data Level Parallelism DLP SIMD

Platform Design: Exploiting Data Level Parallelism (DLP), SIMD architectures
TU/e 5kk70
Henk Corporaal, Bart Mesman

DSP
[Figure: processor design spectrum trading flexibility against efficiency: programmable CPU, programmable DSP, application specific instruction set processor (ASIP), application specific processor]
10/26/2021 Platform Design H. Corporaal and B. Mesman

SIMD Performance
[Figure: computational efficiency [MOPS/W], from 10^0 to 10^6, against feature size [µm] (2, 1, 0.5, 0.25, 0.13, 0.07), for application specific cores, SIMD, and programmable processors. Source: [Roza]]

SIMD: Topics Overview
• Enhance performance: architecture methods
• Data Level Parallelism
  – Application area
  – Subword parallelism
• Locally connected SIMDs
  – Xetal
• Fully connected SIMDs
  – Imagine
• Communication in SIMD processors
  – RCSIMD
  – DCSIMD

Enhance performance: 4 architecture methods
• (Super)-pipelining
• Powerful instructions
  – MD-technique: multiple data operands per operation
  – MO-technique: multiple operations per instruction
• Multiple instruction issue

Characteristics of Media Applications
• Poorly matched to conventional architectures
  – Caches
  – Instruction-Level Parallelism
  – Few arithmetic units
• Well-matched to modern VLSI technology
  – Lots (100's - 1000's) of ALUs fit on a single chip
Communication/synchronization is often the bottleneck

Architecture methods: Powerful Instructions (1)
MD-technique
• Multiple data operands per operation
• SIMD: Single Instruction Multiple Data

Vector instruction:
    for (i = 0; i < 64; i++)
        c[i] = a[i] + 5*b[i];
  or, as a vector expression: c = a + 5*b

Assembly:
    set   vl, 64
    ldv   v1, 0(r2)
    mulvi v2, v1, 5
    ldv   v1, 0(r1)
    addv  v3, v1, v2
    stv   v3, 0(r3)
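The vector sequence for c = a + 5*b can be emulated in plain C to make the MD-technique concrete. This is only a sketch: the helper functions ldv, mulvi, addv and stv are hypothetical stand-ins for the slide's instructions, and the mapping of r1, r2, r3 to the arrays a, b, c is inferred from the computation rather than stated on the slide.

```c
#include <stddef.h>

#define VL 64  /* vector length, as in "set vl, 64" */

/* Hypothetical helpers: each one emulates a single vector instruction
   that operates on all VL elements at once. */
static void ldv(int *v, const int *mem) {
    for (size_t i = 0; i < VL; i++) v[i] = mem[i];
}
static void mulvi(int *vd, const int *vs, int imm) {
    for (size_t i = 0; i < VL; i++) vd[i] = vs[i] * imm;
}
static void addv(int *vd, const int *va, const int *vb) {
    for (size_t i = 0; i < VL; i++) vd[i] = va[i] + vb[i];
}
static void stv(const int *v, int *mem) {
    for (size_t i = 0; i < VL; i++) mem[i] = v[i];
}

/* c = a + 5*b as five vector operations instead of a 64-iteration loop. */
void vector_a_plus_5b(const int *a, const int *b, int *c) {
    int v1[VL], v2[VL], v3[VL];
    ldv(v1, b);        /* ldv   v1, 0(r2)   (r2 assumed to point to b) */
    mulvi(v2, v1, 5);  /* mulvi v2, v1, 5                              */
    ldv(v1, a);        /* ldv   v1, 0(r1)   (r1 assumed to point to a) */
    addv(v3, v1, v2);  /* addv  v3, v1, v2                             */
    stv(v3, c);        /* stv   v3, 0(r3)   (r3 assumed to point to c) */
}
```

The scalar loop issues its loop-control and arithmetic instructions 64 times, while the vector form issues six instructions in total, which is where the questions about code size and power on the next slide come from.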

Architecture methods: Powerful Instructions (1)
SIMD computing
• Exploit data locality of e.g. image processing applications
• Effect on code size?
• Effect on power consumption?
[Figure: SIMD execution method: over time, Instruction 1, Instruction 2, Instruction 3, ..., Instruction n are each executed on node 1, node 2, ..., node K]

Architecture methods: Powerful Instructions (1)
• Sub-word parallelism
  – SIMD on restricted scale
  – Used for multi-media instructions
  – Motivation: use a powerful 64-bit ALU as 4 x 16-bit ALUs
• Examples
  – MMX, SUN-VIS, HP MAX-2, AMD K7/Athlon 3DNow!, Trimedia II
  – Example: sum over i = 1..4 of |a_i - b_i|
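The sum of absolute differences over four sub-words can be sketched in portable C. This emulates, lane by lane, what a single multimedia instruction would do on a 64-bit register; it does not use any real MMX or VIS intrinsic, and the function names pack4x16 and sad4x16 are hypothetical.

```c
#include <stdint.h>

/* Pack four 16-bit lanes into one 64-bit word (lane 0 in the low bits). */
uint64_t pack4x16(const uint16_t v[4]) {
    uint64_t w = 0;
    for (int i = 0; i < 4; i++)
        w |= (uint64_t)v[i] << (16 * i);
    return w;
}

/* Sum over the four 16-bit lanes of |a_i - b_i|.
   A sub-word SIMD unit would compute the four differences in parallel;
   here the lanes are extracted and processed one after another. */
uint32_t sad4x16(uint64_t a, uint64_t b) {
    uint32_t sum = 0;
    for (int i = 0; i < 4; i++) {
        uint16_t x = (uint16_t)(a >> (16 * i));
        uint16_t y = (uint16_t)(b >> (16 * i));
        sum += (x > y) ? (uint32_t)(x - y) : (uint32_t)(y - x);
    }
    return sum;
}
```

For lanes a = {10, 20, 30, 40} and b = {13, 18, 30, 100} the differences are 3, 2, 0 and 60, so sad4x16 returns 65.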

LC-SIMD (Locally connected; e.g. Xetal, Imap)
• Long communication delays: shift operations
[Figure: an instruction bus feeding PE 0 ... PE 319]
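A small C sketch can show why communication on a locally connected array is expressed as shift operations. It assumes a linear array of 320 PEs (PE 0 to PE 319, as in the figure) holding one value each, and a zero fill at the array edges; the edge rule and the function names are assumptions made for illustration, not Xetal or Imap specifics.

```c
#include <stddef.h>

#define NUM_PE 320  /* PE 0 .. PE 319 */

/* One shift step: every PE receives the value of its left neighbour.
   PE 0 has no left neighbour and receives `fill` (assumed edge rule). */
static void shift_from_left(const int *in, int *out, int fill) {
    out[0] = fill;
    for (size_t pe = 1; pe < NUM_PE; pe++) out[pe] = in[pe - 1];
}

/* One shift step in the other direction. */
static void shift_from_right(const int *in, int *out, int fill) {
    for (size_t pe = 0; pe + 1 < NUM_PE; pe++) out[pe] = in[pe + 1];
    out[NUM_PE - 1] = fill;
}

/* 3-tap horizontal sum: out[pe] = x[pe-1] + x[pe] + x[pe+1].
   Only distance-1 neighbours are needed, so two shift steps suffice. */
void sum3(const int *x, int *out) {
    int left[NUM_PE], right[NUM_PE];
    shift_from_left(x, left, 0);
    shift_from_right(x, right, 0);
    for (size_t pe = 0; pe < NUM_PE; pe++)
        out[pe] = left[pe] + x[pe] + right[pe];
}
```

A value that has to travel d PEs needs d such shift steps, which is the long communication delay the slide refers to.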

FC-SIMD (Fully Connected; e.g. Imagine)
• Expensive communication network
[Figure: an instruction bus feeding PE 0, PE 1, PE 2, ..., PE 319, all attached to a fully connected communication network]

LC: Xetal
Objectives
• High degree of system integration: CMOS imaging + DSP, low-cost camera systems
• Low power consumption: mobile & remote sensing
• Flexibility: programmable DSP and control functions

Xetal Architecture

Parallel Processing (SIMD)
• 2 columns / processor
• neighbour communication
• low-speed clock (16 MHz)
• clock gating
• shared address decoding
• minimal memory read access
Result: LOW POWER

Xetal Specs & Performance

Simulation Results (1-input)

Simulation Results (1 output)

Simulation Results (2)

Imagine: Representative Applications
• Stereo Depth Extraction
• Polygon Rendering
• MPEG Encoding/Decoding
[Figure: rendering, and encode/decode between a 2D video stream and encoded 2D data]

Stream Processing
[Figure: the input data Image 0 and Image 1 pass through convolve kernels; the resulting streams feed a SAD kernel that produces the output data, a Depth Map]
• Little data reuse (pixels never revisited)
• Highly data parallel (output pixels not dependent on other output pixels)
• Compute intensive (60 arithmetic ops per memory reference)
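The kernel-and-stream structure can be sketched in C with two small kernels chained over one-dimensional streams. This is a deliberately reduced illustration: a 3-tap convolution stands in for the image convolutions, the SAD kernel reduces two streams to a single number rather than a depth map, and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* Kernel 1: 3-tap convolution over a 1-D stream of n elements.
   Writes n - 2 outputs; each output depends only on input elements,
   never on other outputs, so all outputs could be computed in parallel. */
void convolve3(const int *in, size_t n, const int k[3], int *out) {
    for (size_t i = 0; i + 2 < n; i++)
        out[i] = k[0] * in[i] + k[1] * in[i + 1] + k[2] * in[i + 2];
}

/* Kernel 2: sum of absolute differences between two streams. */
long sad(const int *a, const int *b, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += labs((long)a[i] - (long)b[i]);
    return s;
}
```

A stream program would run convolve3 on each image and pass both outputs to sad. Each input element is consumed as it streams through and is not revisited afterwards, which matches the "little data reuse" point above.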

Stream Architecture Provides Data Bandwidth Hierarchy
[Figure: SDRAM banks feed the Stream Register File, which feeds the ALU clusters under SIMD/VLIW control]
Peak BW: 2 GB/s (SDRAM), 32 GB/s (Stream Register File), 544 GB/s (within the ALU clusters)

Application Data Bandwidth Usage
[Figure: application data bandwidth used at each level of the hierarchy: SDRAM (2 GB/s peak), Stream Register File (32 GB/s peak), ALU cluster (544 GB/s peak)]

Stream Register File: Details
[Figure: SRF: an arbiter and a single-ported 128 KB SRAM (1024 x 32 W, 32 W/cycle), connected through stream buffers to/from the arithmetic clusters]

Arithmetic Cluster: Details
[Figure: arithmetic units (+ + + * * /) with local register files and a CU, joined by a cross point, with connections from the SRF, to the SRF, and to the intercluster network]
• Units support floating-point / 32-bit / dual 16-bit / quad 8-bit instructions
  – 4-cycle pipelined FMUL, FADD, FSUB, FTOI, ITOF, FFRAC
  – 17-cycle FDIV (pipelined for 1 FDIV every 7 cycles)
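The difference between latency and issue rate in these figures can be worked out with a small helper. It is a sketch under two assumptions: the operations are independent of each other, and "4-cycle pipelined" means a new operation can start every cycle. The function name pipelined_cycles is hypothetical.

```c
/* Cycles until the last of n independent operations completes on a unit
   with the given result latency and initiation interval (cycles between
   successive issues). */
unsigned long pipelined_cycles(unsigned long n,
                               unsigned long latency,
                               unsigned long interval) {
    if (n == 0) return 0;
    return latency + (n - 1) * interval;
}
```

With the slide's numbers, one FDIV takes pipelined_cycles(1, 17, 7) = 17 cycles, while eight independent FDIVs take 17 + 7 * 7 = 66 cycles rather than 8 * 17 = 136. Eight FADDs, under the one-issue-per-cycle assumption, take 4 + 7 = 11 cycles.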

The Imagine Stream Processor
[Figure: block diagram of the Imagine stream processor: host processor, stream controller, microcontroller (2 K VLIW instrs), ALU clusters 0 to 7, stream register file (32 KW SRAM), streaming memory system with SDRAM, and a network interface to the network]

Imagine Floorplan
• 22 million transistors
• 500 MHz
• TI GS30KA:
  – 0.15 µm L_drawn
  – 0.13 µm L_eff
  – CMOS process

Imagine Programming Environment

Stream program:
    StereoDepthExtraction(…) {
      // Load Input Images
      ...
      // Run Kernels
      convolve7x7(RawImage, ConvImage);
      convolve3x3(ConvImage, Conv2Image);
      ...
      // Store Output
    }

Kernel:
    Convolve7x7(…) {
      ...
      while(!In.empty()) {
        ...
        p0 = k0 * in10;
        p12 = k21 * in32;
        p34 = k43 * in54;
        p56 = k65 * in76;
        sum = (p0 + p12) + (p34 + p56);
        ...
      }
    }

Applications
• Algorithms need dynamic communication:
  – lens distortion
  – bucket processing
  – mirroring, …
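Dynamic communication can be pictured as a gather: each PE reads from a source PE whose index is only known as data. The C sketch below models that with an index table and also reports the largest distance any value travels, which is a lower bound on the number of single-neighbour shift steps a locally connected array would need. The function names and the index-table representation are assumptions made for illustration.

```c
#include <stddef.h>

/* Gather: PE i receives the value held by PE src[i]. */
void gather(const int *in, const size_t *src, int *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = in[src[i]];
}

/* Largest |i - src[i]| over all PEs: a value moving that far needs at
   least that many neighbour-to-neighbour shift steps. */
size_t max_shift_distance(const size_t *src, size_t n) {
    size_t worst = 0;
    for (size_t i = 0; i < n; i++) {
        size_t d = (src[i] > i) ? src[i] - i : i - src[i];
        if (d > worst) worst = d;
    }
    return worst;
}
```

For mirroring over n PEs the table is src[i] = n - 1 - i, so the worst-case distance is n - 1: the value at one end of the array must cross the whole array.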

Imap

Imap (continued)

DC-SIMD Architecture
[Figure: PE_1 to PE_7 with registers R1 to R7 on Bus_0, Bus_1 and Bus_2; example transfers between PE_6 and PE_3 and between PE_4 and PE_2]
Message format fields: V, dst-add, src-add, data

DC-SIMD Architecture
[Figure: PE_1 to PE_7 with registers R1 to R7 on Bus_0, Bus_1 and Bus_2; larger distance: a transfer between PE_7 and PE_1]

DC-SIMD Architecture
[Figure: PE_1 to PE_7 with registers R1 to R7 on Bus_0, Bus_1 and Bus_2; priority: concurrent transfers between PE_7 and PE_5 and between PE_6 and PE_2]

DC-SIMD: arbitration
• PE read: V, des-add, data
• Write: give priority to further PEs
[Figure: arbitration logic for PEn, PEn+1 and PEn+2, each carrying V, des-add, data and src-add; des-add is compared (xor) with PEid to read data; a select signal (ab) picks the next register: 00 selects n+2, 01 selects n+1, 10 selects n. Expressions on the slide: a = v'.2', b = a'.v' + a.1'; n+2: 2.v; n+1: (2+v).1; n: (1+2+v).0. Buffer instruction.]

Conclusions
• SIMD nicely matches
  – Image applications: data-level parallelism
  – VLSI efficiency: copy-paste of simple elements
• So
  – Very efficient architecture for image processing
  – Low power! Also by trading off clock vs performance
  – High memory bandwidth with a single memory port
• But
  – Programmer is burdened with vector thinking and code rewriting
  – Compilers are not good at recognizing opportunities for vector execution
  – How to provide the data: vector memories and register files
  – Need for a "control" processor for control code and if-then-else
• Communication is a problem:
  – Unable to perform indirect PE addressing -> DC-SIMD