CENG 450 Computer Systems and Architecture Lecture 12

CENG 450 Computer Systems and Architecture, Lecture 12. Amirali Baniasadi (amirali@ece.uvic.ca)



Multiple Issue
• Multiple issue is the ability of the processor to start more than one instruction in a given cycle. Two broad approaches:
• Superscalar processors
• Very Long Instruction Word (VLIW) processors


1990's: Superscalar Processors
• Bottleneck: CPI >= 1
  - Limit on scalar (single-instruction-issue) performance
    - Hazards
    - Superpipelining? Diminishing returns (hazards + overhead)
• How can we make the CPI = 0.5?
  - Issue multiple instructions in every pipeline stage (superscalar)

Two-way issue timeline (two instructions enter each stage together):

  Cycle:    1    2    3    4    5    6    7
  Inst 0:   IF   ID   EX   MEM  WB
  Inst 1:   IF   ID   EX   MEM  WB
  Inst 2:        IF   ID   EX   MEM  WB
  Inst 3:        IF   ID   EX   MEM  WB
  Inst 4:             IF   ID   EX   MEM  WB
  Inst 5:             IF   ID   EX   MEM  WB


Superscalar vs. VLIW
• Religious debate, similar to RISC vs. CISC
  - Wisconsin + Michigan (superscalar) vs. Illinois (VLIW)
  - Q: Who can schedule code better, hardware or software?


Hardware Scheduling
• High branch prediction accuracy
• Dynamic information on latencies (e.g., cache misses)
• Dynamic information on memory dependences
• Easy to speculate (and recover from mis-speculation)
• Works for generic, non-loop, irregular code
  - Ex: databases, desktop applications, compilers
• But: limited reorder buffer size limits "lookahead"
• But: high cost/complexity
• But: slow clock


Software Scheduling
• Large scheduling scope (the full program), large "lookahead"
  - Can handle very long latencies
• Simple hardware with a fast clock
• But: only works well for "regular" codes (scientific, FORTRAN)
• But: low branch prediction accuracy
  - Can improve by profiling
• But: no information on latencies like cache misses
  - Can improve by profiling
• But: painful to speculate and recover from mis-speculation
  - Can improve with hardware support


Superscalar Processors
• Pioneer: IBM (America => RIOS, RS/6000, POWER1)
• Superscalar instruction combinations:
  - 1 ALU, memory, or branch + 1 FP (RS/6000)
  - Any 1 + 1 ALU (Pentium)
  - Any 1 ALU or FP + 1 ALU + 1 load + 1 store + 1 branch (Pentium II)
• Impact of superscalar:
  - More opportunity for hazards (why?)
  - More performance loss due to hazards (why?)


Superscalar Processors
• Issue a varying number of instructions per clock
• Scheduling: static (by the compiler) or dynamic (by the hardware)
• Superscalar machines issue a varying number of instructions/cycle (1 to 8), scheduled by the compiler or by hardware (Tomasulo)
• Examples: IBM PowerPC, Sun UltraSPARC, DEC Alpha, HP 8000


Elements of Advanced Superscalars
• High-performance instruction fetching
  - Good dynamic branch and jump prediction
  - Multiple instructions per cycle; multiple branches per cycle?
• Scheduling and hazard elimination
  - Dynamic scheduling (not necessarily: the Alpha 21064 and Pentium were statically scheduled)
  - Register renaming to eliminate WAR and WAW hazards
• Parallel functional units; paths/buses/multiple register ports
• High-performance memory systems
• Speculative execution


SS + DS + Speculation
• Superscalar + dynamic scheduling + speculation: three great tastes that taste great together
• CPI >= 1?
  - Overcome with superscalar issue
• Superscalar increases hazards
  - Overcome with dynamic scheduling
• RAW dependences still a problem?
  - Overcome with a large instruction window
• Branches a problem for filling a large window?
  - Overcome with speculation


The Big Picture

  Static program -> fetch & branch predict -> issue -> execute -> reorder & commit


Superscalar Microarchitecture
(Block diagram) Predecode -> instruction cache -> instruction buffer -> decode/rename/dispatch, which feeds:
  - a floating-point instruction buffer -> FP functional units (floating-point register file)
  - an integer/address instruction buffer -> functional units and data cache (integer register file, memory interface)
Results flow to the reorder-and-commit stage.


Register Renaming Methods
• First method: a physical register file separate from the logical (architectural) register file
  - A mapping table associates a physical register with the current value of each logical register
  - A free list tracks the available physical registers
  - The physical register file is bigger than the logical register file
• Second method: a physical register file the same size as the logical one, plus a buffer with one entry per in-flight instruction: the reorder buffer


Register Renaming: First Method

  Before "add r3, 4":            After:
  r0 -> R8                       r0 -> R8
  r1 -> R7                       r1 -> R7
  r2 -> R5                       r2 -> R5
  r3 -> R1                       r3 -> R2   (new mapping)
  r4 -> R9                       r4 -> R9
  Free list: R2, R6, R13         Free list: R6, R13
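The first renaming method can be sketched in a few lines. This is a minimal illustration, not a hardware description; the class and method names are invented, and the register names follow the slide's example of "add r3, 4" (r3 = r3 + 4).

```python
# Hypothetical sketch of renaming method 1: a mapping table from logical
# to physical registers plus a free list of physical registers.

class Renamer:
    def __init__(self, mapping, free_list):
        self.mapping = dict(mapping)   # logical reg -> physical reg
        self.free = list(free_list)    # available physical registers

    def rename_dest(self, logical_reg):
        """Allocate a fresh physical register for a newly written logical reg."""
        new_phys = self.free.pop(0)           # take the next free register
        self.mapping[logical_reg] = new_phys  # update the mapping table
        return new_phys

# Before: r3 -> R1, free list = [R2, R6, R13]
r = Renamer({"r0": "R8", "r1": "R7", "r2": "R5", "r3": "R1", "r4": "R9"},
            ["R2", "R6", "R13"])
# "add r3, 4": the source reads the old mapping, the destination gets a new one.
src = r.mapping["r3"]        # source still reads R1
dst = r.rename_dest("r3")    # destination becomes R2
print(src, dst, r.free)      # R1 R2 ['R6', 'R13']
```

Because the destination gets a fresh physical register, a later write to r3 can never clobber a value an older in-flight instruction still needs, which is exactly how renaming removes WAR and WAW hazards.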


More Realistic HW: Register Impact
• Effect of limiting the number of renaming registers (figure omitted): FP programs reach roughly 11-45 IPC; integer programs roughly 5-15 IPC


Reorder Buffer
• Reserve an entry at the tail when the instruction is dispatched
• Place data in the entry when execution finishes
• Remove from the head when the instruction completes
• Bypass results to other instructions when needed
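The four bullets above amount to a queue discipline, which can be sketched as follows. This is an illustrative model with invented names, not any particular machine's ROB.

```python
# Minimal sketch of the reorder-buffer discipline: reserve at the tail on
# dispatch, fill in the result when execution finishes, and retire from the
# head only when the head entry is complete (enforcing program order).
from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # each entry: {"dest": reg, "value": None}

    def dispatch(self, dest):
        entry = {"dest": dest, "value": None}  # reserve an entry at the tail
        self.entries.append(entry)
        return entry

    def finish(self, entry, value):
        entry["value"] = value                 # execution finished

    def commit(self):
        # remove from the head only when the head instruction is complete
        if self.entries and self.entries[0]["value"] is not None:
            return self.entries.popleft()
        return None

rob = ReorderBuffer()
e1 = rob.dispatch("r3")
e2 = rob.dispatch("r4")
rob.finish(e2, 7)          # the younger instruction finishes first...
print(rob.commit())        # ...but nothing commits: the head (r3) isn't done
rob.finish(e1, 5)
print(rob.commit())        # now r3 commits, in program order
```

Note how out-of-order completion (e2 before e1) is hidden behind in-order commit; results for instructions past the head are the values that the bypass paths forward to waiting instructions.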


Register Renaming: Reorder Buffer

  Before "add r3, 4", the mapping table reads:
  r0 -> R8, r1 -> R7, r2 -> R5, r3 -> rob6, r4 -> R9

  The add is renamed to "add rob8, rob6, 4": the source r3 reads
  reorder-buffer entry 6, and the destination r3 is assigned a new
  entry, 8.

  After: r3 -> rob8; reorder-buffer entries 6 and 8 both hold in-flight
  values destined for r3.


Instruction Buffers
(Same microarchitecture block diagram as before: the floating-point and integer/address instruction buffers sit between decode/rename/dispatch and the functional units.)


Issue Buffer Organization
• (a) Single, shared queue: no out-of-order issue, no renaming
• (b) Multiple queues, one per instruction type: no out-of-order issue inside a queue, but the queues issue out of order with respect to each other


Issue Buffer Organization
• (c) Multiple reservation stations (one per instruction type, or one big pool), fed from instruction dispatch
  - No FIFO ordering
  - Execution starts as soon as operands are ready and hardware is available
  - Proposed by Tomasulo


Typical Reservation Station

  Fields of one entry:
  | operation | source1, data1, valid1 | source2, data2, valid2 | destination |
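The entry layout above can be modeled directly. This is a hedged sketch with assumed names, using the reorder-buffer tags from the earlier renaming example; a real design broadcasts results on a common data bus rather than calling a method.

```python
# Sketch of a reservation-station entry: an operation, two sources each with
# a data/valid pair, and a destination tag. An entry may begin execution only
# when both valid bits are set.

class ReservationStation:
    def __init__(self, op, dest):
        self.op = op
        self.dest = dest                  # destination tag (e.g., a ROB entry)
        self.data = [None, None]          # data1, data2
        self.valid = [False, False]       # valid1, valid2

    def set_source(self, i, tag=None, value=None):
        # A source is either a value (valid now) or a tag to wait on.
        if value is not None:
            self.data[i], self.valid[i] = value, True
        else:
            self.data[i], self.valid[i] = tag, False

    def wakeup(self, tag, value):
        # Result broadcast: any source waiting on `tag` captures the value.
        for i in range(2):
            if not self.valid[i] and self.data[i] == tag:
                self.data[i], self.valid[i] = value, True

    def ready(self):
        return all(self.valid)

rs = ReservationStation("add", dest="rob8")
rs.set_source(0, tag="rob6")   # waits for rob6's result
rs.set_source(1, value=4)      # immediate operand, valid from the start
print(rs.ready())              # False: still waiting on rob6
rs.wakeup("rob6", 10)
print(rs.ready(), rs.data)     # True [10, 4]
```

The valid bits are what make out-of-order issue work: whichever entry fills both of them first is eligible to execute, regardless of program order.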


Memory Hazard Detection Logic
(Block diagram) Instruction issue sends loads and stores through address add & translation. Load addresses go to a load address buffer and store addresses to a store address buffer; an address compare between the buffers drives the hazard control logic before requests go to memory.
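The address-compare step in the diagram reduces to a simple check, sketched below. The function name and buffer representation are illustrative, not from a real design, and real hardware compares against partial addresses in parallel rather than looping.

```python
# Hedged sketch of memory hazard detection: a load may go to memory only if
# its address does not match any older store still sitting in the store
# address buffer (otherwise it would read a stale value).

def load_may_proceed(load_addr, store_address_buffer):
    """Return True if no older buffered store writes the same address."""
    return all(load_addr != store_addr for store_addr in store_address_buffer)

stores = [0x1000, 0x2040]                # older stores not yet in memory
print(load_may_proceed(0x3000, stores))  # True: no conflict
print(load_may_proceed(0x2040, stores))  # False: must wait for the store
```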


Examples
• MIPS R10000, Alpha 21264, AMD K5: self study
• READ THE PAPER.


VLIW
• VLIW: very long instruction word
• In-order pipe, but each "instruction" is N operations packed into one VLIW
  - Typically "slotted" (i.e., the 1st must be an ALU op, the 2nd must be a load, etc.)
• The VLIW travels down the pipe as a unit
• The compiler packs independent instructions into each VLIW

  IF -> ID -> parallel lanes (ALU -> WB; Addr -> MEM -> WB; FP -> FP -> WB)
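The compiler's packing step described above can be sketched as a greedy pass over the instruction stream. This is a toy illustration, not a real compiler pass: the two-slot format (one ALU op, one load) and the tuple encoding are invented for the example.

```python
# Illustrative sketch of packing independent instructions into slotted VLIW
# words. A new word is started when the needed slot is taken or when the
# instruction depends on a result produced inside the current word.

def pack_vliw(instrs):
    """Greedily pack (kind, dest, srcs) tuples into {"alu", "load"} words."""
    words, word, written = [], {"alu": None, "load": None}, set()
    for kind, dest, srcs in instrs:
        slot_free = word[kind] is None
        independent = not any(s in written for s in srcs)
        if not (slot_free and independent):
            words.append(word)                     # close the current word
            word, written = {"alu": None, "load": None}, set()
        word[kind] = (dest, srcs)
        written.add(dest)
    words.append(word)
    return words

prog = [("alu", "r1", ("r2",)),    # independent of the load below
        ("load", "r3", ("r4",)),   # packs into the same word
        ("alu", "r5", ("r1",))]    # depends on r1 -> forced into a new word
print(len(pack_vliw(prog)))        # 2 words
```

Notice the third instruction wastes a word with an empty load slot; this is the code-size cost of VLIW that later slides return to.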


Very Long Instruction Word
• VLIW issues a fixed number of instructions, formatted either as one very large instruction or as a fixed packet of smaller instructions
• A fixed number of instructions (4-16) scheduled by the compiler; operations are placed into wide templates
  - Started with microcode ("horizontal microcode")
  - Joint HP/Intel agreement in 1999/2000
  - Intel Architecture-64 (IA-64), 64-bit addresses / Itanium
  - Explicitly Parallel Instruction Computing (EPIC)
  - Transmeta: translates x86 to VLIW
  - Many embedded controllers (TI, Motorola) are VLIW


Pure VLIW: What Does VLIW Mean?
• All latencies fixed
• All instructions in a VLIW issue at once
• No hardware interlocks at all
• The compiler is responsible for scheduling the entire pipeline
  - Including stall cycles
  - Possible only if you know the structure of the pipeline and the latencies exactly


Problems with Pure VLIW
• Latencies are not fixed (e.g., caches)
  - Option I: don't use caches (forget it)
  - Option II: stall the whole pipeline on a miss?
  - Option III: stall only the instructions waiting for memory? (needs out-of-order)
• Different implementations
  - Different pipe depths, different latencies
  - A new pipeline may produce wrong results (the code stalls in the wrong place)
  - Recompile for every new implementation?
• Code compatibility is very important; it made Intel what it is


Key: Static Scheduling
• VLIW relies on software being able to schedule code well
  - Loop unrolling (we have seen this one already)
    - Code growth
    - Poor scheduling across unrolled copies
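Loop unrolling, and its code-growth cost, can be illustrated at the source level. Python stands in here for the machine code a VLIW compiler would actually transform; the four separate accumulators break the dependence chain so the four adds per iteration are independent and schedulable in parallel.

```python
# Loop unrolled by 4 with independent accumulators. The cleanup loop that
# handles trip counts not divisible by 4 is part of the code growth.

def total(a):
    s0 = s1 = s2 = s3 = 0
    i = 0
    while i + 4 <= len(a):
        # four independent adds per iteration: no chain between s0..s3
        s0 += a[i]
        s1 += a[i + 1]
        s2 += a[i + 2]
        s3 += a[i + 3]
        i += 4
    s = s0 + s1 + s2 + s3
    while i < len(a):      # cleanup loop for the remainder
        s += a[i]
        i += 1
    return s

print(total(list(range(10))))  # 45
```

With a single accumulator each add would wait on the previous one; splitting the sum is what actually exposes the parallelism the unrolled slots need.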


Limits to Multi-Issue Machines
• Inherent limitations of ILP
  - 1 branch in every 5 instructions => how do you keep a 5-way VLIW busy?
  - Latencies of units => many operations must be in flight
  - Need about (pipeline depth x number of functional units) independent operations to keep the machine busy
• Difficulties in building the hardware
  - Duplicate functional units to get parallel execution
  - More ports on the register file
  - More ports to memory
  - Superscalar decoding and its impact on clock rate and pipeline depth
  - Complexity-effective designs


Limits to Multi-Issue Machines
• Limitations specific to either the superscalar or the VLIW implementation
  - Decode/issue complexity in superscalar
  - VLIW code size: unrolled loops + wasted fields in the VLIW
  - VLIW lock step => one hazard stalls all instructions
  - VLIW and binary compatibility


Multiple Issue Challenges
• While an integer/FP split is simple for the hardware, you get a CPI of 0.5 only for programs with:
  - Exactly 50% FP operations
  - No hazards
• If more instructions issue at the same time, decode and issue get harder
  - Even a 2-scalar machine must examine 2 opcodes and 6 register specifiers and decide whether 1 or 2 instructions can issue
• VLIW: trade instruction space for simple decoding
  - The long instruction word has room for many operations
  - By definition, all the operations the compiler puts in the long instruction word are independent => they execute in parallel
    - E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch
    - 16 to 24 bits per field => 7 x 16 = 112 bits to 7 x 24 = 168 bits wide
  - Needs a compiling technique that schedules across several branches


HW Support for More ILP
• How is this used in practice?
• Rather than predicting the direction of a branch, execute the instructions on both sides!
• We often know the target of a branch early, long before we know whether it will be taken.
• So begin fetching/executing at that target PC,
• but also continue fetching/executing as if the branch were NOT taken.


Studies of ILP
• Conflicting studies of how much improvement is available, depending on:
  - Benchmarks (vectorized FP Fortran vs. integer C programs)
  - Hardware sophistication
  - Compiler sophistication
• How much ILP is available using existing mechanisms with increasing HW budgets?
• Do we need to invent new HW/SW mechanisms to stay on the processor performance curve?


Summary
• Static ILP
  - Simple hardware (fast clock) with advanced loads and predication; complex compilers
  - VLIW


Summary
• Dynamic ILP
  - Instruction buffer
    - Split ID into two stages: one for in-order issue, the other for out-of-order issue
  - Scoreboard
    - Out of order, but doesn't deal with WAR/WAW hazards
  - Tomasulo's algorithm
    - Uses register renaming to eliminate WAR/WAW hazards
  - Dynamic scheduling + speculation
  - Superscalar