CSE 431 Computer Architecture Fall 2005 Lecture 17

  • Slides: 18

CSE 431 Computer Architecture Fall 2005
Lecture 17: VLIW Processors
Mary Jane Irwin (www.cse.psu.edu/~mji)
www.cse.psu.edu/~cg431
[Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005]
CSE 431 L17 VLIW Processors.1 Irwin, PSU, 2005

Review: Multi-Issue Datapath Responsibilities
• Must handle, with a combination of hardware and software
  - Data dependencies (aka data hazards)
    - True data dependencies (read after write)
      - Use data forwarding hardware
      - Use compiler scheduling
    - Storage dependence (aka name dependence)
      - Use register renaming to solve both
        » Antidependencies (write after read)
        » Output dependencies (write after write)
  - Procedural dependencies (aka control hazards)
    - Use aggressive branch prediction (speculation)
    - Use predication
  - Resource conflicts (aka structural hazards)
    - Use resource duplication or resource pipelining to reduce (or eliminate) resource conflicts
    - Use arbitration for result and commit buses and register file read and write ports

Review: Multiple-Issue Processor Styles
• Dynamic multiple-issue processors (aka superscalar)
  - Decisions on which instructions to execute simultaneously (in the range of 2 to 8 in 2005) are made dynamically (at run time by the hardware)
  - E.g., IBM Power 2, Pentium 4, MIPS R10K, HP PA 8500
• Static multiple-issue processors (aka VLIW)
  - Decisions on which instructions to execute simultaneously are made statically (at compile time by the compiler)
  - E.g., Intel Itanium and Itanium 2 for the IA-64 ISA: EPIC (Explicitly Parallel Instruction Computing)
    - 128-bit "bundles" containing 3 instructions of 41 bits each + a 5-bit template field (specifies which FU each instr needs)
    - Five functional units (IntALU, MMedia, DMem, FPALU, Branch)
    - Extensive support for speculation and predication
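The bundle arithmetic on this slide (3 x 41 + 5 = 128 bits) can be checked with a small Python sketch. The function names and the placement of the template in the low-order bits are assumptions made for illustration; the slide only gives the field widths.

```python
SLOT_BITS = 41       # width of one instruction slot (from the slide)
TEMPLATE_BITS = 5    # width of the template field (from the slide)
SLOTS = 3            # instructions per bundle (from the slide)
BUNDLE_BITS = TEMPLATE_BITS + SLOTS * SLOT_BITS  # 5 + 3*41 = 128


def pack_bundle(template, slots):
    """Pack a template and three slots into one integer.

    Field order (template in the low bits) is an assumption for this sketch.
    """
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template does not fit in 5 bits")
    if len(slots) != SLOTS:
        raise ValueError("a bundle holds exactly 3 instructions")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("instruction does not fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Inverse of pack_bundle: return (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & slot_mask
             for i in range(SLOTS)]
    return template, slots
```

Packing and then unpacking returns the original fields, and the largest possible bundle still fits in 128 bits.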

History of VLIW Processors
• Started with (horizontal) microprogramming
  - Very wide microinstructions used to directly generate control signals in single-issue processors (e.g., IBM 360 series)
• VLIW for multi-issue processors first appeared in the Multiflow and Cydrome machines (in the early 1980s)
• Current commercial VLIW processors
  - Intel i860 RISC (dual mode: scalar and VLIW)
  - Intel IA-64 (EPIC: Itanium and Itanium 2)
  - Transmeta Crusoe
  - Lucent/Motorola StarCore
  - ADI TigerSHARC
  - Infineon (Siemens) Carmel

Static Multiple Issue Machines (VLIW)
• Static multiple-issue processors (aka VLIW) use the compiler to decide which instructions to issue and execute simultaneously
  - Issue packet: the set of instructions that are bundled together and issued in one clock cycle; think of it as one large instruction with multiple operations
  - The mix of instructions in the packet (bundle) is usually restricted: a single "instruction" with several predefined fields
  - The compiler does static branch prediction and code scheduling to reduce (control) or eliminate (data) hazards
• VLIWs have
  - Multiple functional units (like SS processors)
  - Multi-ported register files (again like SS processors)
  - Wide program bus

An Example: A VLIW MIPS
• Consider a 2-issue MIPS with a 2-instruction bundle (64 bits)

      | ALU Op (R format) or Branch (I format) | Load or Store (I format) |

• Instructions are always fetched, decoded, and issued in pairs
  - If one instr of the pair cannot be used, it is replaced with a noop
• Need 4 read ports and 2 write ports and a separate memory address adder
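As a rough illustration of the pairing rule, the Python sketch below builds one 2-slot bundle and fills an unused slot with a noop. The helper name `make_bundle`, the `nop` mnemonic string, and the small opcode sets are hypothetical and only cover the instructions used later in this lecture.

```python
NOOP = "nop"
# Illustrative opcode subsets only; a real ISA has many more of each kind.
ALU_OR_BRANCH_OPS = {"addu", "addi", "bne"}
DATA_TRANSFER_OPS = {"lw", "sw"}


def make_bundle(alu_or_branch=None, data_transfer=None):
    """Return (slot 0, slot 1); any slot left as None is filled with a noop."""
    if alu_or_branch is not None:
        if alu_or_branch.split()[0] not in ALU_OR_BRANCH_OPS:
            raise ValueError("slot 0 only takes an ALU op or a branch")
    if data_transfer is not None:
        if data_transfer.split()[0] not in DATA_TRANSFER_OPS:
            raise ValueError("slot 1 only takes a load or a store")
    return (alu_or_branch or NOOP, data_transfer or NOOP)
```

For example, `make_bundle(data_transfer="lw $t0, 0($s1)")` yields a bundle whose first slot is a noop, while passing a load as the first argument is rejected because the slot positions are fixed.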

A MIPS VLIW (2-issue) Datapath
• No hazard hardware (so no load use allowed)
[Figure: 2-issue datapath; labeled blocks include PC, Instruction Memory, Register File (Write Addr, Write Data), Sign Extend, ALU, adders, and Data Memory]

Code Scheduling Example
• Consider the following loop code

      lp: lw   $t0, 0($s1)      # $t0=array element
          addu $t0, $t0, $s2    # add scalar in $s2
          sw   $t0, 0($s1)      # store result
          addi $s1, $s1, -4     # decrement pointer
          bne  $s1, $0, lp      # branch if $s1 != 0

• Must "schedule" the instructions to avoid pipeline stalls
  - Instructions in one bundle must be independent
  - Must separate load use instructions from their loads by one cycle
  - Notice that the first two instructions have a load use dependency, the next two and last two have data dependencies
  - Assume branches are perfectly predicted by the hardware
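The dependencies that constrain the scheduler can be found mechanically. The Python sketch below is a minimal illustration, not a compiler pass: the tuple encoding of the loop and the function name `true_dependencies` are assumptions, and it only looks at read-after-write dependencies inside a single iteration.

```python
# Each entry: (opcode, destination register or None, source registers).
LOOP = [
    ("lw",   "$t0", ["$s1"]),          # lw   $t0, 0($s1)
    ("addu", "$t0", ["$t0", "$s2"]),   # addu $t0, $t0, $s2
    ("sw",   None,  ["$t0", "$s1"]),   # sw   $t0, 0($s1)
    ("addi", "$s1", ["$s1"]),          # addi $s1, $s1, -4
    ("bne",  None,  ["$s1", "$0"]),    # bne  $s1, $0, lp
]


def true_dependencies(code):
    """Return (producer index, consumer index, register) for each RAW pair."""
    deps = []
    last_writer = {}  # register -> index of the most recent instruction writing it
    for j, (_, dest, sources) in enumerate(code):
        for reg in sources:
            if reg in last_writer:
                deps.append((last_writer[reg], j, reg))
        if dest is not None:
            last_writer[dest] = j
    return deps
```

On the loop above this reports three true dependencies: the load-use pair `lw`/`addu` on `$t0`, `addu`/`sw` on `$t0`, and `addi`/`bne` on `$s1`.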

The Scheduled Code (Not Unrolled)

          ALU or branch          Data transfer        CC
      lp:                        lw   $t0, 0($s1)     1
          addi $s1, $s1, -4                           2
          addu $t0, $t0, $s2                          3
          bne  $s1, $0, lp       sw   $t0, 4($s1)     4

• Four clock cycles to execute 5 instructions for a
  - CPI of 0.8 (versus the best case of 0.5)
  - IPC of 1.25 (versus the best case of 2.0)
  - noops don't count towards performance!
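The CPI and IPC figures follow directly from the cycle and instruction counts. The short Python sketch below redoes the arithmetic; the function name `issue_stats` and the extra slot-utilization figure are additions for illustration, not part of the slide.

```python
def issue_stats(cycles, instructions, issue_width=2):
    """Compute CPI, IPC, and noop counts for a statically scheduled loop body.

    Only useful instructions are counted; noops fill the remaining slots.
    """
    slots = cycles * issue_width
    return {
        "cpi": cycles / instructions,
        "ipc": instructions / cycles,
        "noops": slots - instructions,
        "slot_utilization": instructions / slots,
    }


not_unrolled = issue_stats(cycles=4, instructions=5)
```

For the schedule above, `not_unrolled` holds a CPI of 0.8 and an IPC of 1.25, with 3 of the 8 issue slots occupied by noops.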

Loop Unrolling
• Loop unrolling: multiple copies of the loop body are made and instructions from different iterations are scheduled together as a way to increase ILP
• Apply loop unrolling (4 times for our example) and then schedule the resulting code
  - Eliminate unnecessary loop overhead instructions
  - Schedule so as to avoid load use hazards
• During unrolling the compiler applies register renaming to eliminate all data dependencies that are not true dependencies

Unrolled Code Example

      lp: lw   $t0, 0($s1)      # $t0=array element
          lw   $t1, -4($s1)     # $t1=array element
          lw   $t2, -8($s1)     # $t2=array element
          lw   $t3, -12($s1)    # $t3=array element
          addu $t0, $t0, $s2    # add scalar in $s2
          addu $t1, $t1, $s2    # add scalar in $s2
          addu $t2, $t2, $s2    # add scalar in $s2
          addu $t3, $t3, $s2    # add scalar in $s2
          sw   $t0, 0($s1)      # store result
          sw   $t1, -4($s1)     # store result
          sw   $t2, -8($s1)     # store result
          sw   $t3, -12($s1)    # store result
          addi $s1, $s1, -16    # decrement pointer
          bne  $s1, $0, lp      # branch if $s1 != 0
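The same transformation can be viewed at source level. The Python sketch below is only an analogy to the MIPS code: the function names are hypothetical, a list index stands in for the byte pointer in `$s1`, and the array length is assumed to be a multiple of 4 so no clean-up loop is needed.

```python
def add_scalar_rolled(a, s):
    """Original loop: one element, one pointer update, one branch per iteration."""
    i = len(a) - 1
    while i >= 0:
        a[i] = a[i] + s
        i -= 1


def add_scalar_unrolled4(a, s):
    """Unrolled 4 times with renamed temporaries t0..t3.

    Assumes len(a) is a multiple of 4 (an assumption of this sketch).
    """
    if len(a) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    i = len(a) - 1
    while i >= 0:
        # Four independent loads into distinct names (register renaming).
        t0, t1, t2, t3 = a[i], a[i - 1], a[i - 2], a[i - 3]
        t0, t1, t2, t3 = t0 + s, t1 + s, t2 + s, t3 + s
        a[i], a[i - 1], a[i - 2], a[i - 3] = t0, t1, t2, t3
        i -= 4  # a single pointer update and branch now cover four elements
```

Both versions produce the same array; the unrolled one executes a quarter as many index updates and loop tests, which is the loop overhead the slide refers to.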

The Scheduled Code (Unrolled)

          ALU or branch          Data transfer        CC
      lp: addi $s1, $s1, -16     lw   $t0, 0($s1)     1
                                 lw   $t1, 12($s1)    2
          addu $t0, $t0, $s2     lw   $t2, 8($s1)     3
          addu $t1, $t1, $s2     lw   $t3, 4($s1)     4
          addu $t2, $t2, $s2     sw   $t0, 16($s1)    5
          addu $t3, $t3, $s2     sw   $t1, 12($s1)    6
                                 sw   $t2, 8($s1)     7
          bne  $s1, $0, lp       sw   $t3, 4($s1)     8

• Eight clock cycles to execute 14 instructions for a
  - CPI of 0.57 (versus the best case of 0.5)
  - IPC of 1.75 (versus the best case of 2.0)
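Because the `addi` is moved to the first cycle, the later loads and stores use adjusted offsets (12, 8, 4, 16). A small functional simulation makes it easy to confirm that the schedule still updates every element. The Python sketch below is a simplified model under stated assumptions: every slot in a bundle reads registers before any slot writes, loads complete within their cycle, integer overflow is ignored, and the tuple encoding and function name are invented for this example.

```python
# One tuple per clock cycle: (ALU-or-branch slot, data-transfer slot); None = noop.
SCHEDULE = [
    (("addi", "$s1", "$s1", -16),   ("lw", "$t0", 0, "$s1")),
    (None,                          ("lw", "$t1", 12, "$s1")),
    (("addu", "$t0", "$t0", "$s2"), ("lw", "$t2", 8, "$s1")),
    (("addu", "$t1", "$t1", "$s2"), ("lw", "$t3", 4, "$s1")),
    (("addu", "$t2", "$t2", "$s2"), ("sw", "$t0", 16, "$s1")),
    (("addu", "$t3", "$t3", "$s2"), ("sw", "$t1", 12, "$s1")),
    (None,                          ("sw", "$t2", 8, "$s1")),
    (("bne", "$s1", "$0"),          ("sw", "$t3", 4, "$s1")),
]


def run_schedule(schedule, mem, s1, s2):
    """Run the loop until the bne falls through; return the cycle count.

    mem maps byte addresses to words and is updated in place.
    """
    regs = {"$0": 0, "$s1": s1, "$s2": s2,
            "$t0": 0, "$t1": 0, "$t2": 0, "$t3": 0}
    cycles = 0
    while True:
        taken = False
        for bundle in schedule:
            cycles += 1
            old = dict(regs)  # both slots read the register state from before this cycle
            for op in bundle:
                if op is None:
                    continue
                kind = op[0]
                if kind == "addi":
                    regs[op[1]] = old[op[2]] + op[3]
                elif kind == "addu":
                    regs[op[1]] = old[op[2]] + old[op[3]]
                elif kind == "lw":
                    regs[op[1]] = mem[old[op[3]] + op[2]]
                elif kind == "sw":
                    mem[old[op[3]] + op[2]] = old[op[1]]
                elif kind == "bne":
                    taken = old[op[1]] != old[op[2]]
        if not taken:
            return cycles
```

Starting with `$s1 = 32` and words at byte addresses 4 through 32, the loop runs twice (16 cycles) and adds the scalar to all eight words, matching what the unscheduled loop would do.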

Speculation
• Speculation is used to allow execution of future instructions that (may) depend on the speculated instruction
  - Speculate on the outcome of a conditional branch (branch prediction)
  - Speculate that a store (for which we don't yet know the address) that precedes a load does not refer to the same address, allowing the load to be scheduled before the store (load speculation)
• Must have (hardware and/or software) mechanisms for
  - Checking to see if the guess was correct
  - Recovering from the effects of the instructions that were executed speculatively if the guess was incorrect
    - In a VLIW processor the compiler can insert additional instructions that check the accuracy of the speculation and can provide a fix-up routine to use when the speculation was incorrect
• Ignore and/or buffer exceptions created by speculatively executed instructions until it is clear that they should really occur
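Load speculation with a compiler-inserted check and fix-up can be sketched in a few lines of Python. This is a conceptual model only: the function names are hypothetical, memory is a dictionary, and the "fix-up routine" is reduced to redoing the load.

```python
def store_then_load(mem, store_addr, store_val, load_addr):
    """Program order: the store executes before the load."""
    mem[store_addr] = store_val
    return mem[load_addr]


def speculative_load(mem, store_addr, store_val, load_addr):
    """Load hoisted above the store, guessing the addresses differ."""
    value = mem[load_addr]            # speculative load, issued early
    mem[store_addr] = store_val       # the store, now after the load
    if store_addr == load_addr:       # check inserted by the compiler
        value = mem[load_addr]        # fix-up: the guess was wrong, reload
    return value
```

When the addresses differ, the check costs little and the load has been started earlier; when they alias, the fix-up restores the result that program order would have produced.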

Predication
• Predication can be used to eliminate branches by making the execution of an instruction dependent on a "predicate", e.g.,

      if (p) {statement 1} else {statement 2}

  would normally compile using two branches. With predication it would compile as

      (p)  statement 1
      (~p) statement 2

• The use of (condition) indicates that the instruction is committed only if condition is true
• Predication can be used to speculate as well as to eliminate branches
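The commit-only-if-true behavior can be modeled in Python. This is a conceptual sketch: the helper `commit` stands in for the hardware gating of the register write, and the two "statements" (add 1 or subtract 1) are invented for illustration.

```python
def branching_version(p, x):
    """if (p) {statement 1} else {statement 2}, using control flow."""
    if p:
        x = x + 1
    else:
        x = x - 1
    return x


def commit(predicate, regs, dest, value):
    """Model of a predicated write: the result is kept only if predicate holds."""
    if predicate:
        regs[dest] = value


def predicated_version(p, x):
    """Both statements are issued; only one result is committed."""
    regs = {"x": x}
    commit(p, regs, "x", x + 1)       # (p)  statement 1
    commit(not p, regs, "x", x - 1)   # (~p) statement 2
    return regs["x"]
```

Both versions return the same value for either outcome of `p`. In the predicated form the instruction stream is the same straight-line sequence regardless of `p`, which is what lets the compiler schedule both statements into bundles without a branch.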

Compiler Support for VLIW Processors
• The compiler packs groups of independent instructions into the bundle
  - Done by code re-ordering (trace scheduling)
• The compiler uses loop unrolling to expose more ILP
• The compiler uses register renaming to solve name dependencies and ensures no load use hazards occur
• While superscalars use dynamic prediction, VLIWs primarily depend on the compiler for branch prediction
  - Loop unrolling reduces the number of conditional branches
  - Predication eliminates if-then-else branch structures by replacing them with predicated instructions
• The compiler predicts memory bank references to help minimize memory bank conflicts

VLIW Advantages & Disadvantages (as opposed to SS)
• Advantages
  - Simpler hardware (potentially less power hungry)
  - Potentially more scalable
    - Allow more instructions per VLIW bundle and add more FUs
• Disadvantages
  - Programmer/compiler complexity and longer compilation times
    - Deep pipelines and long latencies can be confusing (making peak performance elusive)
  - Lock step operation, i.e., on a hazard all future issues stall until the hazard is resolved (hence the need for predication)
  - Object (binary) code incompatibility
  - Needs lots of program memory bandwidth
  - Code bloat
    - Noops are a waste of program memory space
    - Loop unrolling to expose more ILP uses more program memory space

CISC vs RISC vs SS vs VLIW

                     CISC                       RISC                       Superscalar                      VLIW
  Instr size         variable size              fixed size                 fixed size                       fixed size (but large)
  Instr format       variable format            fixed format               fixed format                     fixed format
  Registers          few, some special          many GP                    GP and rename (RUU)              many, many GP
  Memory reference   embedded in many instr's   load/store                 load/store                       load/store
  Key Issues         decode complexity          data forwarding, hazards   hardware dependency resolution   (compiler) code scheduling
  Instruction flow   [Figure: pipeline diagrams for each style, built from IF ID EX M WB stages]

Next Lecture and Reminders
• Next lecture
  - Reading assignment: PH 7.1
• Reminders
  - HW4 due November 3
  - Final exam (tentatively) scheduled
    - Tuesday, December 13th, 2:30-4:20, Location TBD