Platform-based Design: Exploiting ILP, VLIW architectures (part a)

TU/e 5kk70
Henk Corporaal, Bart Mesman

What are we talking about?
• ILP = Instruction Level Parallelism = ability to perform multiple operations (or instructions), from a single instruction stream, in parallel
• VLIW = Very Long Instruction Word architecture
Instruction format: | operation 1 | operation 2 | operation 3 | operation 4 | operation 5 |
6/19/2021 Platform Design H. Corporaal and B. Mesman

VLIW: Topics Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
  – Limits on ILP
• VLIW
  – Examples
• Clustering
• Code generation
• Hands-on

Enhance performance: 3 architecture methods
• (Super)-pipelining
• Powerful instructions
  – MD-technique: multiple data operands per operation
  – MO-technique: multiple operations per instruction
• Multiple instruction issue

Architecture methods: Pipelined Execution of Instructions
Simple 5-stage pipeline:
• IF: Instruction Fetch
• DC: Instruction Decode
• RF: Register Fetch
• EX: Execute instruction
• WB: Write Result Register

Cycle:          1   2   3   4   5   6   7   8
Instruction 1:  IF  DC  RF  EX  WB
Instruction 2:      IF  DC  RF  EX  WB
Instruction 3:          IF  DC  RF  EX  WB
Instruction 4:              IF  DC  RF  EX  WB

Purpose of pipelining:
• Reduce #gate_levels in critical path
• Reduce CPI close to one
• More efficient hardware
Problems:
• Hazards: pipeline stalls
  – Structural hazards: add more hardware
  – Control hazards, branch penalties: use branch prediction
  – Data hazards: bypassing required

Architecture methods: Pipelined Execution of Instructions
Superpipelining:
• Split one or more of the critical pipeline stages

Architecture methods: Powerful Instructions (1)
MD-technique
• Multiple data operands per operation
• SIMD: Single Instruction Multiple Data
Vector instruction example: c = a + 5*b
    for (i=0; i<64; i++) c[i] = a[i] + 5*b[i];
Assembly:
    set   vl, 64
    ldv   v1, 0(r2)
    mulvi v2, v1, 5
    ldv   v1, 0(r1)
    addv  v3, v1, v2
    stv   v3, 0(r3)
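
The effect of the six vector instructions can be sketched in Python. This is only an illustrative model: that r1, r2 and r3 point at a, b and c is inferred from the computation, and the function name and list-based "vector registers" are assumptions rather than anything defined on the slide.

```python
# Hypothetical model of the vector instructions on the slide.
# Assumption: r2 -> b, r1 -> a, r3 -> c (inferred from c = a + 5*b).

def run_vector_example(a, b, vl=64):
    """Compute c = a + 5*b one vector instruction at a time."""
    assert len(a) >= vl and len(b) >= vl      # set   vl, 64
    v1 = b[:vl]                               # ldv   v1, 0(r2)   load b
    v2 = [x * 5 for x in v1]                  # mulvi v2, v1, 5   multiply by immediate
    v1 = a[:vl]                               # ldv   v1, 0(r1)   load a (v1 is reused)
    v3 = [x + y for x, y in zip(v1, v2)]      # addv  v3, v1, v2
    return v3                                 # stv   v3, 0(r3)   store to c

a = list(range(64))
b = [2] * 64
c = run_vector_example(a, b)

# The scalar loop from the slide, for comparison.
scalar = [a[i] + 5 * b[i] for i in range(64)]
print(c == scalar)
```

Each list comprehension stands for one instruction that operates on all 64 elements, which is the point of the MD-technique: six instructions replace 64 loop iterations.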

Architecture methods: Powerful Instructions (1)
SIMD computing
• Nodes used for independent operations
• Mesh or hypercube connectivity
• Exploit data locality of e.g. image processing applications
• Dense encoding (few instruction bits needed)
[Figure: SIMD execution method. Horizontal axis: node 1, node 2, ..., node K; vertical axis: time, with Instruction 1, Instruction 2, Instruction 3, ..., Instruction n]

Architecture methods: Powerful Instructions (1)
• Sub-word parallelism
  – SIMD on restricted scale
  – Used for multi-media instructions
• Examples
  – MMX, SUN-VIS, HP MAX-2, AMD-K7/Athlon 3DNow!, Trimedia II
  – Example: $\sum_{i=1}^{4} |a_i - b_i|$
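
The sum-of-absolute-differences example can be modelled as follows. This is a sketch under assumptions: four unsigned 8-bit sub-words packed into a 32-bit word, and a hypothetical function name. Real multimedia instructions compute all four lanes in a single operation; the loop here only reproduces the result.

```python
def packed_sad(word_a, word_b):
    """Sum of |a_i - b_i| over the four unsigned bytes packed in two 32-bit words."""
    total = 0
    for lane in range(4):
        a_i = (word_a >> (8 * lane)) & 0xFF   # extract sub-word i of a
        b_i = (word_b >> (8 * lane)) & 0xFF   # extract sub-word i of b
        total += abs(a_i - b_i)
    return total

# Bytes (high to low): 0x10,0x20,0x30,0x40 versus 0x0F,0x22,0x30,0x40
result = packed_sad(0x10203040, 0x0F223040)
print(result)  # 1 + 2 + 0 + 0 = 3
```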

Architecture methods: Powerful Instructions (2)
MO-technique: multiple operations per instruction
• CISC (Complex Instruction Set Computer)
• VLIW (Very Long Instruction Word)
VLIW instruction example (one field per FU):
    FU 1: sub  r8, r5, 3
    FU 2: and  r1, r5, 12
    FU 3: mul  r6, r5, r2
    FU 4: ld   r3, 0(r5)
    FU 5: bnez r5, 13
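
A minimal sketch of what issuing this one VLIW instruction means: all five operations read their operands from the same register state and execute side by side. Several details are assumptions made for illustration: destination-first operand order, a dictionary as memory, the branch operand 13 treated as an absolute target, and operation latencies ignored.

```python
def execute_vliw(regs, mem, pc):
    """Model the 5-slot VLIW instruction from the slide as one parallel step."""
    old = dict(regs)                          # every slot reads the pre-issue state
    new = dict(regs)
    new["r8"] = old["r5"] - 3                 # FU 1: sub  r8, r5, 3
    new["r1"] = old["r5"] & 12                # FU 2: and  r1, r5, 12
    new["r6"] = old["r5"] * old["r2"]         # FU 3: mul  r6, r5, r2
    new["r3"] = mem[old["r5"]]                # FU 4: ld   r3, 0(r5)
    # FU 5: bnez r5, 13 (target 13 assumed to be an absolute address)
    next_pc = 13 if old["r5"] != 0 else pc + 1
    return new, next_pc

regs = {"r1": 0, "r2": 4, "r3": 0, "r5": 20, "r6": 0, "r8": 0}
mem = {20: 99}
new_regs, next_pc = execute_vliw(regs, mem, pc=7)
print(new_regs, next_pc)
```

Because the compiler packed these operations into one instruction, the hardware does not need to check whether they are independent.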

VLIW architecture: central Register File
[Figure: one central register file shared by nine exec units; issue slot 1 covers exec units 1-3, issue slot 2 covers exec units 4-6, issue slot 3 covers exec units 7-9]

TM1000 DSPCPU
• Functional units: 5 constant, 5 ALU, 2 memory, 2 shift, 2 DSP-ALU, 2 DSP-mul, 3 branch, 2 FP ALU, 2 Int/FP ALU, 1 FP compare, 1 FP div/sqrt
• Register file (128 regs, 32 bit, 15 ports)
• Instruction register (5 issue slots), PC
• Data cache (16 kB)
• Instruction cache (32 kB)

TriMedia TM32A processor
• 0.18 micron
• Area: 16.9 mm²
• 200 MHz (typ)
• 1.4 W
• 7 mW/MHz (MIPS = 0.9 mW/MHz)
[Figure: die layout with blocks I/O INTERFACE, SEQUENCER / DECODE, I-Cache, D-cache, TAG, and functional units DSPMUL2, DSPMUL1, IFMUL1 (FLOAT), IFMUL2 (FLOAT), FCOMP2, ALU1, ALU4, SHIFTER1, ALU2, FTOUGH1, DSPALU2, FALU0, FALU3, ALU0, SHIFTER0, DSPALU0]

Architecture methods: Powerful Instructions (2)
VLIW Characteristics
• Only RISC-like operation support ⇒ short cycle times
• Flexible: can implement any FU mixture
• Extensible
• Tight inter-FU connectivity required
• Large instructions (up to 1000 bits)
• Not binary compatible
• But good compilers exist

Architecture methods: Multiple instruction issue (per cycle)
Who guarantees semantic correctness?
  – can instructions be executed in parallel?
• User specifies multiple instruction streams
  – MIMD (Multiple Instruction Multiple Data)
• Run-time detection of ready instructions
  – Superscalar
• Compile into dataflow representation
  – Dataflow processors

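Multiple instruction issue: Three Approaches
Example code:
    a := b + 15;
    c := 3.14 * d;
    e := c / f;
Translation to DDG (Data Dependence Graph):
[Figure: DDG with nodes &b, 15, &a, &d, 3.14, &c, &f, &e, three ld nodes, +, *, / and three st nodes. ld(&b) and 15 feed +, which is stored to &a; ld(&d) and 3.14 feed *, which is stored to &c and feeds /; ld(&f) feeds /, which is stored to &e]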

Generated Code

Instr.  Sequential Code     Dataflow Code
I1      ld   r1, M(&b)      ld M(&b)    -> I2
I2      addi r1, 15         addi 15     -> I3
I3      st   r1, M(&a)      st M(&a)
I4      ld   r1, M(&d)      ld M(&d)    -> I5
I5      muli r1, 3.14       muli 3.14   -> I6, I8
I6      st   r1, M(&c)      st M(&c)
I7      ld   r2, M(&f)      ld M(&f)    -> I8
I8      div  r1, r2         div         -> I9
I9      st   r1, M(&e)      st M(&e)

Notes:
• An MIMD may execute two streams: (1) I1-I3, (2) I4-I9
  – No dependencies between streams; in practice communication and synchronization required between streams
• A superscalar issues multiple instructions from sequential stream
  – Obey dependencies (true and name dependencies)
  – Reverse engineering of DDG needed at run-time
• Dataflow code is direct representation of DDG
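
The "reverse engineering of DDG" that a superscalar performs at run-time can be sketched by tracking, for each register, the instruction that last wrote it. The read and write sets below are my reading of the two-operand sequential code (for example, "addi r1, 15" both reads and writes r1) and should be treated as assumptions; memory dependences are ignored because the addresses are distinct.

```python
# Sequential code as (name, opcode, registers read, registers written).
SEQUENTIAL = [
    ("I1", "ld",   [],           ["r1"]),
    ("I2", "addi", ["r1"],       ["r1"]),
    ("I3", "st",   ["r1"],       []),
    ("I4", "ld",   [],           ["r1"]),
    ("I5", "muli", ["r1"],       ["r1"]),
    ("I6", "st",   ["r1"],       []),
    ("I7", "ld",   [],           ["r2"]),
    ("I8", "div",  ["r1", "r2"], ["r1"]),
    ("I9", "st",   ["r1"],       []),
]

def true_dependences(code):
    """Return {producer: [consumers]} for register RAW dependences only."""
    last_writer = {}                          # register -> instruction that last wrote it
    successors = {name: [] for name, _, _, _ in code}
    for name, _, reads, writes in code:
        for reg in reads:
            producer = last_writer.get(reg)
            if producer is not None and name not in successors[producer]:
                successors[producer].append(name)
        for reg in writes:
            last_writer[reg] = name           # later readers depend on this write
    return successors

ddg = true_dependences(SEQUENTIAL)
for name, consumers in ddg.items():
    if consumers:
        print(name, "->", ", ".join(consumers))
```

The reuse of r1 by I4 creates only name dependences (WAR and WAW) with I1-I3, which is why the recovered graph still splits into the independent streams I1-I3 and I4-I9.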

Multiple Instruction Issue: Data flow processor
[Figure: dataflow processor with blocks Result Tokens, Token Matching, Token Store, Instruction Generate, Instruction Store, Reservation Stations, and FU-1, FU-2, ..., FU-K]

Instruction Pipeline Overview
[Figure: pipeline stages per architecture class]
• CISC: IF, DC, RF, EX
• RISC: IF, DC/RF, EX, WB
• Superscalar: IF1..IFk, DC1..DCk, ISSUE, RF1..RFk, EX1..EXk, ROB, WB1..WBk
• Superpipelined: IF1, IF2, ..., IFs, DC, RF, EX1, EX2, ..., EX5, WB
• VLIW: IF, DC, RF1..RFk, EX1..EXk, WB1..WBk
• Dataflow: RF1..RFk, EX1..EXk, WB1..WBk

Four dimensional representation of the architecture design space <I, O, D, S>
[Figure: four axes, Instructions/cycle 'I', Operations/instruction 'O', Data/operation 'D' and Superpipelining degree 'S', with tick marks from 0.1 to 100; plotted architectures: CISC, RISC, Superscalar, Superpipelined, VLIW, Vector, SIMD, MIMD, Dataflow]

Architecture design space
Typical values of K (# of functional units or processor nodes), and <I, O, D, S> for different architectures

Architecture     K     I     O     D     S     Mpar
CISC             1     0.2   1.2   ?     ?     0.26
RISC             1     1     1     1     1.2   1.2
VLIW             10    1     10    1     1.2   12
Superscalar      3     3     1     1     1.2   3.6
Superpipelined   1     1     1     1     3     3
Vector           7     0.1   1     64    5     32
SIMD             128   1     1     128   1.2   154
MIMD             32    32    1     1     1.2   38
Dataflow         10    10    1     1     1.2   12

$S(\text{architecture}) = \sum_{Op \in I\_set} f(Op) \cdot lt(Op)$
$M_{par} = I \cdot O \cdot D \cdot S$
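
The two formulas can be checked numerically. The sketch below assumes that f(Op) is the relative frequency of an operation and lt(Op) its latency in cycles; the operation mix is hypothetical, and the <I, O, D, S> tuples are my reading of the flattened table, so treat them as assumptions.

```python
def superpipelining_degree(mix):
    """S = sum over operations of f(Op) * lt(Op); mix maps op -> (frequency, latency)."""
    return sum(freq * latency for freq, latency in mix.values())

def mpar(i, o, d, s):
    """Mpar = I * O * D * S, as given on the slide."""
    return i * o * d * s

# Hypothetical operation mix: 80% single-cycle ALU operations, 20% two-cycle loads.
hypothetical_mix = {"alu": (0.8, 1), "load": (0.2, 2)}
s_value = superpipelining_degree(hypothetical_mix)
print(f"S = {s_value:.2f}")

# <I, O, D, S> as read from the table (assumed values).
examples = {
    "VLIW":   (1, 10, 1, 1.2),
    "Vector": (0.1, 1, 64, 5),
    "SIMD":   (1, 1, 128, 1.2),
}
for name, (i, o, d, s) in examples.items():
    print(f"{name}: Mpar = {mpar(i, o, d, s):.1f}")
```

The products come out at 12, 32 and about 154, in line with the Mpar column for these three rows.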

Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
  – Limits on ILP
• VLIW
  – Examples
• Clustering
• Code generation
• Hands-on

General organization of an ILP architecture
[Figure: CPU block diagram with instruction memory, instruction fetch unit, instruction decode unit, register file, bypassing network, FU-1 to FU-5, and data memory]

Motivation for ILP
• Increasing VLSI densities; decreasing feature size
• Increasing performance requirements
• New application areas, like
  – multi-media (image, audio, video, 3-D)
  – intelligent search and filtering engines
  – neural, fuzzy, genetic computing
• More functionality
• Use of existing code (compatibility)
• Low power: $P = f C V_{dd}^2$

Low power through parallelism
• Sequential Processor
  – Switching capacitance C
  – Frequency f
  – Voltage V
  – $P = f C V^2$
• Parallel Processor (two times the number of units)
  – Switching capacitance 2C
  – Frequency f/2
  – Voltage V' < V
  – $P = \frac{f}{2} \cdot 2C \cdot V'^2 = f C V'^2$
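
A short worked example of the two power expressions. All numeric values (frequency, capacitance, voltages) are hypothetical and chosen only to make the ratio visible.

```python
def dynamic_power(f, c, v):
    """P = f * C * V^2."""
    return f * c * v ** 2

f, c, v = 200e6, 1e-9, 1.8        # hypothetical frequency (Hz), capacitance (F), voltage (V)
v_reduced = 1.2                   # hypothetical V' < V, possible because f is halved

p_seq = dynamic_power(f, c, v)                   # sequential processor
p_par = dynamic_power(f / 2, 2 * c, v_reduced)   # twice the units at half the frequency
ratio = p_par / p_seq                            # reduces to (V'/V)^2

print(f"P_seq = {p_seq:.3f} W, P_par = {p_par:.3f} W, ratio = {ratio:.3f}")
```

The factors f/2 and 2C cancel, so the whole saving comes from the lower supply voltage: the ratio is (V'/V)², here (1.2/1.8)² ≈ 0.44.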

Measuring and exploiting available ILP
• How much ILP is there in applications?
• How to measure parallelism within applications?
  – Using existing compiler
  – Using trace analysis
    • Track all the real data dependencies (RaWs) of instructions from issue window
      – register dependence
      – memory dependence
    • Check for correct branch prediction
      – if prediction correct continue
      – if wrong, flush schedule and start in next cycle
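
A minimal sketch of trace analysis, under simplifying assumptions that are mine: unit latency, an unlimited issue window, perfect branch prediction, and register RAW dependences only (no memory dependences). The trace is the nine-instruction sequential code of the earlier "Generated Code" slide, written as (registers read, registers written) pairs.

```python
def schedule_trace(trace):
    """ASAP-schedule a trace under register RAW dependences with unit latency.

    Only the latest writer of each register is tracked, which models unlimited
    renaming: WAR and WAW conflicts never delay an instruction.
    """
    ready_cycle = {}                          # register -> first cycle its value is readable
    cycles = []
    for reads, writes in trace:
        cycle = max((ready_cycle.get(r, 1) for r in reads), default=1)
        cycles.append(cycle)
        for reg in writes:
            ready_cycle[reg] = cycle + 1      # result available one cycle later
    return cycles

TRACE = [
    ((), ("r1",)), (("r1",), ("r1",)), (("r1",), ()),        # I1-I3: a := b + 15
    ((), ("r1",)), (("r1",), ("r1",)), (("r1",), ()),        # I4-I6: c := 3.14 * d
    ((), ("r2",)), (("r1", "r2"), ("r1",)), (("r1",), ()),   # I7-I9: e := c / f
]

cycles = schedule_trace(TRACE)
ilp = len(TRACE) / max(cycles)
print(cycles, ilp)
```

The measured ILP is the trace length divided by the number of cycles in the parallel schedule. A finite window or imperfect branch prediction would add further constraints on top of this sketch.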

Trace analysis

Program:
    For i := 0..2
        A[i] := i;
    S := X+3;

Compiled code:
          set  r1, 0
          set  r2, 3
          set  r3, &A
    Loop: st   r1, 0(r3)
          add  r1, 1
          add  r3, 4
          brne r1, r2, Loop
          add  r1, r5, 3

Trace:
    set r1, 0
    set r2, 3
    set r3, &A
    st r1, 0(r3); add r1, 1; add r3, 4; brne r1, r2, Loop    (iteration 1)
    st r1, 0(r3); add r1, 1; add r3, 4; brne r1, r2, Loop    (iteration 2)
    st r1, 0(r3); add r1, 1; add r3, 4; brne r1, r2, Loop    (iteration 3)
    add r1, r5, 3

How parallel can this code be executed?

Trace analysis
Parallel Trace
[Figure: the 16 trace instructions (three set instructions, the loop iterations of st / add / add / brne, and the final add r1, r5, 3) arranged into a parallel schedule of 6 cycles]
Max ILP = Speedup = Lserial / Lparallel = 16 / 6 = 2.7

Ideal Processor
Assumptions for ideal/perfect processor:
1. Register renaming: infinite number of virtual registers => all register WAW & WAR hazards avoided
2. Branch and jump prediction: perfect => all program instructions available for execution
3. Memory-address alias analysis: addresses are known. A store can be moved before a load provided addresses not equal
Also:
  – unlimited number of instructions issued/cycle (unlimited resources), and unlimited instruction window
  – perfect caches
  – 1 cycle latency for all instructions (FP *, /)
Programs were compiled using MIPS compiler with maximum optimization level

Upper Limit to ILP: Ideal Processor
[Figure: IPC per benchmark on the ideal processor]
• FP: 75 - 150
• Integer: 18 - 60

Window Size and Branch Impact
• Change from infinite window to examine 2000 and issue at most 64 instructions per cycle
[Figure: IPC per benchmark for branch predictors Perfect, Tournament, BHT(512), Profile, No prediction]
• FP: 15 - 45
• Integer: 6 - 12

Impact of Limited Renaming Registers
• Changes: 2000 instr. window, 64 instr. issue, 8K 2-level predictor (slightly better than tournament predictor)
[Figure: IPC per benchmark for Infinite, 256, 128, 64 and 32 renaming registers]
• FP: 11 - 45
• Integer: 5 - 15

Memory Address Alias Impact
• Changes: 2000 instr. window, 64 instr. issue, 8K 2-level predictor, 256 renaming registers
[Figure: IPC per benchmark for alias analysis Perfect, Global/stack perfect, Inspection, None]
• FP: 4 - 45 (Fortran, no heap)
• Integer: 4 - 9

Window Size Impact
• Assumptions: perfect disambiguation, 1K selective predictor, 16 entry return stack, 64 renaming registers, issue as many as window
[Figure: IPC per benchmark for window sizes Infinite, 256, 128, 64, 32, 16, 8, 4]
• FP: 8 - 45
• Integer: 6 - 12

How to Exceed ILP Limits of This Study?
• WAR and WAW hazards through memory: the study eliminated WAW and WAR hazards through register renaming, but not in memory
• Unnecessary dependences: the compiler did not unroll loops, so the iteration variable dependence remains
• Overcoming the data flow limit: value prediction, i.e. predicting values and speculating on the prediction
  – Address value prediction and speculation: predicts addresses and speculates by reordering loads and stores. Could provide better aliasing analysis

Conclusions
• Amount of parallelism is limited
  – higher in multi-media
  – higher in kernels
• Trace analysis detects all types of parallelism
  – task, data and operation types
• Detected parallelism depends on
  – quality of compiler
  – hardware
  – source-code transformations

Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
  – Examples
    • C6
    • TM
    • IA-64: Itanium, ...
    • TTA
• Clustering
• Code generation
• Hands-on

VLIW concept
A VLIW architecture with 7 FUs
[Figure: instruction memory, instruction register, function units (Int FU, LD/ST, FP FU), Int register file, floating point register file, data memory]

VLIW characteristics
• Multiple operations per instruction
• One instruction per cycle issued (at most)
• Compiler is in control
• Only RISC-like operation support
  – Short cycle times
  – Easier to compile for
• Flexible: can implement any FU mixture
• Extensible / Scalable
However:
• tight inter-FU connectivity required
• not binary compatible!!
  – (new long instruction format)

VelociTI C6x datapath

VLIW example: TMS320C62 VelociTI Processor
• 8 operations (of 32-bit) per instruction (256 bit)
• Two clusters
  – 8 FUs: 4 FUs / cluster (2 multipliers, 6 ALUs)
  – 2 x 16 registers
  – One bus available to write in register file of other cluster
• Flexible addressing modes (like circular addressing)
• Flexible instruction packing
• All instructions conditional
• 5 ns, 200 MHz, 0.25 um, 5-layer CMOS
• 128 KB on-chip RAM

VLIW example: Trimedia TM1000
DSPCPU
• Functional units: 5 constant, 5 ALU, 2 memory, 2 shift, 2 DSP-ALU, 2 DSP-mul, 3 branch, 2 FP ALU, 2 Int/FP ALU, 1 FP compare, 1 FP div/sqrt
• Register file (128 regs, 32 bit, 15 ports)
• Instruction register (5 issue slots), PC
• Data cache (16 kB)
• Instruction cache (32 kB)

Intel Architecture IA-64
Explicitly Parallel Instruction Computing (EPIC)
• IA-64 architecture -> Itanium, first realization
Register model:
• 128 64-bit int x bits, stack, rotating
• 128 82-bit floating point, rotating
• 64 1-bit boolean
• 8 64-bit branch target address
• system control registers

EPIC Architecture: IA-64
• Instructions grouped in 128-bit bundles
  – 3 * 41-bit instruction
  – 5 template bits, indicate type and stop location
• Each 41-bit instruction
  – starts with 4-bit opcode, and
  – ends with 6-bit guard (boolean) register-id
• Supports speculative loads
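
The bundle arithmetic (5 + 3 × 41 = 128 bits) can be illustrated with a small pack/unpack sketch. The bit placement is an assumption for illustration: the template in the five low-order bits followed by slots 0 to 2, the opcode in the four most significant bits of a slot, and the guard register id in the six least significant bits. The function names are hypothetical.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5
SLOT_MASK = (1 << SLOT_BITS) - 1

def pack_bundle(template, slots):
    """Combine a 5-bit template and three 41-bit slots into one 128-bit integer."""
    bundle = template & ((1 << TEMPLATE_BITS) - 1)
    for n, slot in enumerate(slots):
        bundle |= (slot & SLOT_MASK) << (TEMPLATE_BITS + n * SLOT_BITS)
    return bundle

def unpack_bundle(bundle):
    """Split a 128-bit bundle into its template and three instruction slots."""
    assert 0 <= bundle < (1 << 128)
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [(bundle >> (TEMPLATE_BITS + n * SLOT_BITS)) & SLOT_MASK for n in range(3)]
    return template, slots

def slot_fields(slot):
    """Return (4-bit opcode from the top bits, 6-bit guard register id from the low bits)."""
    return slot >> (SLOT_BITS - 4), slot & 0x3F

# Slot 0 carries opcode 8 and is guarded by boolean register 3; other bits are zero.
slot0 = (0x8 << (SLOT_BITS - 4)) | 0x3
bundle = pack_bundle(0x10, [slot0, 1, 2])
template, slots = unpack_bundle(bundle)
print(hex(template), [hex(s) for s in slots], slot_fields(slots[0]))
```

The 6-bit guard field is what addresses the 64 one-bit boolean registers of the register model.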

Itanium

Itanium 2: McKinley

EPIC Architecture: IA-64
• EPIC allows for more binary compatibility than a plain VLIW:
  – Function unit assignment performed at run-time
  – Lock when FU results not available
• See other website for more info on IA-64:
  – www.ics.ele.tue.nl/~heco/courses/ACA
  – (look at related material)