CENG 450 Computer Systems and Architecture Lecture 12
- Slides: 35
![CENG 450 Computer Systems and Architecture Lecture 12 Amirali Baniasadi amiraliece uvic ca 1 CENG 450 Computer Systems and Architecture Lecture 12 Amirali Baniasadi amirali@ece. uvic. ca 1](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-1.jpg)
CENG 450 Computer Systems and Architecture, Lecture 12. Amirali Baniasadi, amirali@ece.uvic.ca
![Multiple Issue Multiple Issue is the ability of the processor to start more Multiple Issue • Multiple Issue is the ability of the processor to start more](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-2.jpg)
Multiple Issue
- Multiple issue is the ability of the processor to start more than one instruction in a given cycle.
- Superscalar processors
- Very Long Instruction Word (VLIW) processors
![1990s Superscalar Processors z Bottleneck CPI 1 y Limit on scalar performance single 1990’s: Superscalar Processors z Bottleneck: CPI >= 1 y Limit on scalar performance (single](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-3.jpg)
1990s: Superscalar Processors
- Bottleneck: CPI >= 1
  - Limit on scalar performance (single instruction issue)
    - Hazards
    - Superpipelining? Diminishing returns (hazards + overhead)
- How can we make the CPI = 0.5?
  - Multiple instructions in every pipeline stage (superscalar)

| Cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Inst 0 | IF | ID | EX | MEM | WB | | |
| Inst 1 | IF | ID | EX | MEM | WB | | |
| Inst 2 | | IF | ID | EX | MEM | WB | |
| Inst 3 | | IF | ID | EX | MEM | WB | |
| Inst 4 | | | IF | ID | EX | MEM | WB |
| Inst 5 | | | IF | ID | EX | MEM | WB |
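The timing on this slide can be reproduced with a few lines of Python. This is only a sketch under idealized assumptions (a five-stage pipeline, no hazards, and exactly `width` instructions fetched every cycle); the function names are made up for illustration and are not from the lecture.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def ideal_schedule(num_insts, width):
    """Map each instruction index to {cycle: stage} for a hazard-free pipeline."""
    schedule = {}
    for i in range(num_insts):
        start = i // width + 1  # cycle in which instruction i is fetched
        schedule[i] = {start + s: STAGES[s] for s in range(len(STAGES))}
    return schedule

def total_cycles(num_insts, width):
    """Cycles until the last instruction leaves WB."""
    groups = -(-num_insts // width)  # ceiling division: number of fetch groups
    return groups + len(STAGES) - 1

def steady_state_cpi(width):
    """Best-case CPI once the pipeline is full."""
    return 1 / width

if __name__ == "__main__":
    for inst, cycles in ideal_schedule(6, 2).items():
        print(f"Inst {inst}:", cycles)
    print("total cycles:", total_cycles(6, 2))
    print("steady-state CPI:", steady_state_cpi(2))
```

With six instructions and a width of two, the sketch gives the seven cycles shown on the slide; with a width of one, the same six instructions need ten cycles.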
![Superscalar Vs VLIW z Religious debate similar to RISC vs CISC y Wisconsin Superscalar Vs. VLIW z Religious debate, similar to RISC vs. CISC y Wisconsin +](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-4.jpg)
Superscalar vs. VLIW
- Religious debate, similar to RISC vs. CISC
  - Wisconsin + Michigan (superscalar) vs. Illinois (VLIW)
  - Q: Who can schedule code better, hardware or software?
![Hardware Scheduling y High branch prediction accuracy y Dynamic information on latencies cache misses Hardware Scheduling y High branch prediction accuracy y Dynamic information on latencies (cache misses)](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-5.jpg)
Hardware Scheduling
- High branch prediction accuracy
- Dynamic information on latencies (cache misses)
- Dynamic information on memory dependences
- Easy to speculate (and recover from mis-speculation)
- Works for generic, non-loop, irregular code
- Examples: databases, desktop applications, compilers
- Limited reorder buffer size limits "lookahead"
- High cost/complexity
- Slow clock
![Software Scheduling y Large scheduling scope full program large lookahead x Can handle very Software Scheduling y Large scheduling scope (full program), large “lookahead” x. Can handle very](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-6.jpg)
Software Scheduling
- Large scheduling scope (full program), large "lookahead"
  - Can handle very long latencies
- Simple hardware with fast clock
- Only works well for "regular" codes (scientific, FORTRAN)
- Low branch prediction accuracy
  - Can improve by profiling
- No information on latencies like cache misses
  - Can improve by profiling
- Pain to speculate and recover from mis-speculation
  - Can improve with hardware support
![Superscalar Processors z Pioneer IBM America RIOS RS6000 Power1 y Superscalar instruction combinations Superscalar Processors z Pioneer: IBM (America => RIOS, RS/6000, Power-1) y Superscalar instruction combinations](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-7.jpg)
Superscalar Processors
- Pioneer: IBM (America => RIOS, RS/6000, Power-1)
  - Superscalar instruction combinations
    - 1 ALU or memory or branch + 1 FP (RS/6000)
    - Any 1 + 1 ALU (Pentium)
    - Any 1 ALU or FP + 1 ALU + 1 load + 1 store + 1 branch (Pentium II)
- Impact of superscalar
  - More opportunity for hazards (why?)
  - More performance loss due to hazards (why?)
![Superscalar Processors Issues varying number of instructions per clock Scheduling Static by Superscalar Processors • Issues varying number of instructions per clock • Scheduling: Static (by](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-8.jpg)
Superscalar Processors
- Issues a varying number of instructions per clock
- Scheduling: static (by the compiler) or dynamic (by the hardware)
- Superscalar has a varying number of instructions/cycle (1 to 8), scheduled by compiler or by HW (Tomasulo).
- IBM PowerPC, Sun UltraSPARC, DEC Alpha, HP 8000
![Elements of Advanced Superscalars z High performance instruction fetching y Good dynamic branch and Elements of Advanced Superscalars z High performance instruction fetching y Good dynamic branch and](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-9.jpg)
Elements of Advanced Superscalars
- High performance instruction fetching
  - Good dynamic branch and jump prediction
  - Multiple instructions per cycle, multiple branches per cycle?
- Scheduling and hazard elimination
  - Dynamic scheduling
  - Not necessarily: Alpha 21064 & Pentium were statically scheduled
  - Register renaming to eliminate WAR and WAW
- Parallel functional units, paths/buses/multiple register ports
- High performance memory systems
- Speculative execution
![SS DS Speculation z Superscalar Dynamic scheduling Speculation Three great SS + DS + Speculation z Superscalar + Dynamic scheduling + Speculation Three great](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-10.jpg)
SS + DS + Speculation
- Superscalar + dynamic scheduling + speculation: three great tastes that taste great together
  - CPI >= 1?
    - Overcome with superscalar
  - Superscalar increases hazards
    - Overcome with dynamic scheduling
  - RAW dependences still a problem?
    - Overcome with a large window
    - Branches a problem for filling large window?
    - Overcome with speculation
![The Big Picture issue Static program Fetch branch predict execution Reorder The Big Picture issue Static program Fetch & branch predict execution & Reorder &](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-11.jpg)
The Big Picture (figure): static program => fetch & branch predict => issue => execution => reorder & commit
![Superscalar Microarchitecture Floating point register file Predecode Inst Cache Inst buffer Decode rename dispatch Superscalar Microarchitecture Floating point register file Predecode Inst. Cache Inst. buffer Decode rename dispatch](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-12.jpg)
Superscalar Microarchitecture (figure). Block labels: predecode, instruction cache, instruction buffer, decode/rename/dispatch, floating-point instruction buffer, integer/address instruction buffer, functional units, functional units and data cache, memory interface, floating-point register file, integer register file, reorder and commit.
![Register renaming methods z z z First Method Physical register file vs logical architectural Register renaming methods z z z First Method: Physical register file vs. logical (architectural)](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-13.jpg)
Register renaming methods
- First method:
  - Physical register file vs. logical (architectural) register file
  - A mapping table associates a physical register with the current value of each logical register
  - Use a free list of physical registers
  - Physical register file is bigger than the logical register file
- Second method:
  - Physical register file is the same size as the logical one
  - Also use a buffer with one entry per instruction: the reorder buffer
![Register renaming first method Mapping table r 0 R 8 r 1 R 7 Register renaming: first method Mapping table r 0 R 8 r 1 R 7](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-14.jpg)
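Register renaming: first method

Instruction being renamed: add r3, r3, 4

| Logical register | Mapping before | Mapping after |
|---|---|---|
| r0 | R8 | R8 |
| r1 | R7 | R7 |
| r2 | R5 | R5 |
| r3 | R1 | R2 |
| r4 | R9 | R9 |

Free list before: R2, R6, R13. Free list after: R6, R13.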
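The mapping-table update on this slide can be sketched in Python. This is an illustrative model only: the `Renamer` class and its method names are hypothetical, immediates are passed through unchanged, and returning old physical registers to the free list is deliberately left out.

```python
from collections import deque

class Renamer:
    """Toy model of renaming with a mapping table and a free list."""

    def __init__(self, mapping, free_list):
        self.map = dict(mapping)      # logical register -> physical register
        self.free = deque(free_list)  # unused physical registers, oldest first

    def rename(self, dest, srcs):
        # Sources are looked up BEFORE the destination is remapped, so
        # "add r3, r3, 4" still reads the old physical copy of r3.
        phys_srcs = [self.map.get(s, s) for s in srcs]  # immediates pass through
        if not self.free:
            raise RuntimeError("no free physical register; rename must stall")
        new_dest = self.free.popleft()
        self.map[dest] = new_dest
        return new_dest, phys_srcs

if __name__ == "__main__":
    r = Renamer({"r0": "R8", "r1": "R7", "r2": "R5", "r3": "R1", "r4": "R9"},
                ["R2", "R6", "R13"])
    print(r.rename("r3", ["r3", 4]))  # renamed destination and sources
    print(r.map["r3"], list(r.free))
```

Because the destination always gets a fresh physical register, later writers of r3 never overwrite a value that an earlier reader still needs, which is how renaming removes WAR and WAW hazards.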
![More Realistic HW Register Impact z Effect of limiting the number of renaming registers More Realistic HW: Register Impact z Effect of limiting the number of renaming registers](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-15.jpg)
More Realistic HW: Register Impact
- Effect of limiting the number of renaming registers (figure; axis: IPC)
  - FP: 11 - 45
  - Integer: 5 - 15
![Reorder Buffer z Place data in entry when execution finished Reserve entry at tail Reorder Buffer z Place data in entry when execution finished Reserve entry at tail](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-16.jpg)
Reorder Buffer
- Place data in entry when execution finished
- Reserve entry at tail when dispatched
- Remove from head when complete
- Bypass to other instructions when needed
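The four reorder buffer operations listed above can be modeled with a small queue. The sketch below is a simplification under stated assumptions: the `ReorderBuffer` class, its method names, and the dictionary entry layout are invented for illustration, and exceptions and mis-speculation recovery are ignored.

```python
from collections import deque

class ReorderBuffer:
    """Toy in-order-commit buffer: head on the left, tail on the right."""

    def __init__(self, size):
        self.size = size
        self.entries = deque()
        self.next_tag = 0

    def dispatch(self, dest):
        """Reserve an entry at the tail and return its tag."""
        if len(self.entries) >= self.size:
            raise RuntimeError("reorder buffer full; dispatch must stall")
        tag = self.next_tag
        self.next_tag += 1
        self.entries.append({"tag": tag, "dest": dest, "value": None, "done": False})
        return tag

    def complete(self, tag, value):
        """Place the result in its entry when execution finishes."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["value"], entry["done"] = value, True
                return
        raise KeyError(tag)

    def bypass(self, tag):
        """Forward a finished but uncommitted value, or None if not ready."""
        for entry in self.entries:
            if entry["tag"] == tag and entry["done"]:
                return entry["value"]
        return None

    def commit(self, regfile):
        """Remove finished entries from the head, in program order."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            regfile[entry["dest"]] = entry["value"]
            committed.append(entry["tag"])
        return committed
```

Note that a younger instruction can finish first and have its value bypassed to consumers, yet nothing reaches the register file until every older entry at the head has completed.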
![register renaming reorder buffer Before add r 3 4 Add r 3 rob 6 register renaming: reorder buffer Before add r 3, 4 Add r 3, rob 6,](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-17.jpg)
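Register renaming: reorder buffer

Instruction before renaming: add r3, r3, 4. After renaming: add rob8, rob6, 4.

| Logical register | Mapping before | Mapping after |
|---|---|---|
| r0 | R8 | R8 |
| r1 | R7 | R7 |
| r2 | R5 | R5 |
| r3 | rob6 | rob8 |
| r4 | R9 | R9 |

Reorder buffer (figure): before renaming, entry 6 is reserved for r3; after renaming, entry 8 is reserved for the new value of r3 as well.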
![Instruction Buffers Floating point register file Predecode Inst Cache Inst buffer Decode rename dispatch Instruction Buffers Floating point register file Predecode Inst. Cache Inst. buffer Decode rename dispatch](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-18.jpg)
Instruction Buffers (figure). Block labels: predecode, instruction cache, instruction buffer, decode/rename/dispatch, floating-point instruction buffer, integer/address instruction buffer, functional units, functional units and data cache, memory interface, floating-point register file, integer register file, reorder and commit.
![Issue Buffer Organization z a Single shared queue No outoforder No Renaming bMultiple queue Issue Buffer Organization z a) Single, shared queue No out-of-order No Renaming b)Multiple queue;](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-19.jpg)
Issue Buffer Organization
- a) Single, shared queue
  - No out-of-order
  - No renaming
- b) Multiple queues, one per instruction type
  - No out-of-order inside queues
  - Queues issue out of order
![Issue Buffer Organization z z c Multiple reservation stations one per instruction type or Issue Buffer Organization z z c) Multiple reservation stations; (one per instruction type or](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-20.jpg)
Issue Buffer Organization
- c) Multiple reservation stations (one per instruction type or one big pool)
  - No FIFO ordering
  - Execution starts as soon as operands are ready and hardware is available
  - Proposed by Tomasulo
  - Fed from instruction dispatch (figure)
![Typical reservation station Operation source 1 data 1 valid 1 source 2 data 2 Typical reservation station Operation source 1 data 1 valid 1 source 2 data 2](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-21.jpg)
Typical reservation station entry fields: operation, source 1, data 1, valid 1, source 2, data 2, valid 2, destination.
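The reservation station fields on this slide map naturally onto a small record. The following is a minimal sketch, assuming that sources are named by producer tags (for example reorder buffer entries) and that results are broadcast as `(tag, value)` pairs; the `capture` and `ready` method names are illustrative, not from the lecture.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    operation: str
    source1: str            # tag of the producer of operand 1
    source2: str            # tag of the producer of operand 2
    destination: str        # tag under which the result will be broadcast
    data1: Optional[int] = None
    valid1: bool = False
    data2: Optional[int] = None
    valid2: bool = False

    def capture(self, tag, value):
        """Latch a broadcast result if this entry is waiting on that tag."""
        if not self.valid1 and self.source1 == tag:
            self.data1, self.valid1 = value, True
        if not self.valid2 and self.source2 == tag:
            self.data2, self.valid2 = value, True

    def ready(self):
        """An entry may issue once both operands are valid."""
        return self.valid1 and self.valid2
```

An entry such as `ReservationStation("add", "rob6", "rob7", "rob8")` becomes ready only after both `rob6` and `rob7` have been captured, regardless of the order in which they arrive, which is why no FIFO ordering is needed.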
![Memory Hazard Detection Logic Load address buffer Instruction issue loads Address add translation Memory Hazard Detection Logic Load address buffer Instruction issue loads Address add & translation](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-22.jpg)
Memory Hazard Detection Logic (figure). Block labels: instruction issue, loads, stores, address add & translation, load address buffer, store address buffer, address compare, hazard control, to memory.
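The address-compare step in this figure can be illustrated with a short check. This is a deliberately conservative sketch and an assumption about policy, not the logic of any particular processor: a load waits if any older pending store has the same address or an address that is not yet known. The function name and the use of `None` for an unresolved address are hypothetical.

```python
def load_may_issue(load_addr, pending_store_addrs):
    """Return True if the load can go to memory ahead of the pending stores.

    pending_store_addrs holds the addresses of older stores that have not
    yet written memory; None marks a store whose address is still unknown.
    """
    for store_addr in pending_store_addrs:
        if store_addr is None:
            return False  # unknown address: assume it might conflict
        if store_addr == load_addr:
            return False  # RAW hazard through memory
    return True

if __name__ == "__main__":
    print(load_may_issue(0x100, [0x200, 0x300]))  # no matching store
    print(load_may_issue(0x100, [0x200, 0x100]))  # same address pending
    print(load_may_issue(0x100, [None]))          # unresolved store address
```

More aggressive designs forward the store data to the load or speculate past unresolved stores, but those options go beyond what the slide shows.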
![Example z MIPS R 10000 Alpha 21264 AMD k 5 self study z Example z MIPS R 10000, Alpha 21264, AMD k 5 : self study z](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-23.jpg)
Example
- MIPS R10000, Alpha 21264, AMD K5: self study
- READ THE PAPER.
![VLIW z VLIW Very long instruction word y Inorder pipe but each instruction is VLIW z VLIW: Very long instruction word y In-order pipe, but each “instruction” is](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-24.jpg)
VLIW
- VLIW: very long instruction word
  - In-order pipe, but each "instruction" is N instructions (VLIW)
    - Typically "slotted" (i.e., 1st must be ALU, 2nd must be load, etc.)
  - VLIW travels down pipe as a unit
  - Compiler packs independent instructions into VLIW

Pipeline figure labels: IF, ID, ALU, Ad, MEM, FP, WB.
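How a compiler might pack independent instructions into slotted words can be sketched with a greedy, in-order packer. Everything here is an assumption made for illustration: three slots named `alu`, `mem`, and `fp`, instructions given as `(slot, dest, srcs)` tuples, and a rule that a new word starts whenever the slot is taken or a RAW or WAW dependence on the current word exists. Real VLIW compilers reorder code far more aggressively.

```python
def pack_vliw(instructions):
    """Greedily pack (slot, dest, srcs) tuples into VLIW words, in program order."""
    words = []
    current = {}       # slot name -> (dest, srcs) for the word being built
    written = set()    # registers written by the word being built
    for slot, dest, srcs in instructions:
        slot_taken = slot in current
        raw = any(s in written for s in srcs)  # reads a result from this word
        waw = dest in written                  # rewrites a result from this word
        if slot_taken or raw or waw:
            words.append(current)              # close the word and start a new one
            current, written = {}, set()
        current[slot] = (dest, srcs)
        written.add(dest)
    if current:
        words.append(current)
    return words

if __name__ == "__main__":
    program = [
        ("alu", "r1", ["r2", "r3"]),
        ("mem", "r4", ["r5"]),
        ("fp", "f1", ["f2"]),
        ("alu", "r6", ["r1"]),
        ("mem", "r7", ["r6"]),
    ]
    for word in pack_vliw(program):
        print(word)
```

In the sample program the last two instructions each end up alone in a word, which shows how dependences leave slots empty and waste instruction space.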
![Very Long Instruction Word z VLIW issues a fixed number of instructions formatted Very Long Instruction Word z VLIW - issues a fixed number of instructions formatted](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-25.jpg)
Very Long Instruction Word
- VLIW: issues a fixed number of instructions formatted either as one very large instruction or as a fixed packet of smaller instructions
- Fixed number of instructions (4-16) scheduled by the compiler; put operators into wide templates
  - Started with microcode ("horizontal microcode")
  - Joint HP/Intel agreement in 1999/2000
  - Intel Architecture-64 (IA-64): 64-bit address; Itanium
  - Explicitly Parallel Instruction Computer (EPIC)
  - Transmeta: translates x86 to VLIW
  - Many embedded controllers (TI, Motorola) are VLIW
![Pure VLIW What Does VLIW Mean y All latencies fixed y All instructions in Pure VLIW: What Does VLIW Mean? y All latencies fixed y All instructions in](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-26.jpg)
Pure VLIW: What Does VLIW Mean?
- All latencies fixed
- All instructions in VLIW issue at once
- No hardware interlocks at all
- Compiler responsible for scheduling entire pipeline
  - Includes stall cycles
  - Possible if you know structure of pipeline and latencies exactly
![Problems with Pure VLIW z Latencies are not fixed e g caches y Problems with Pure VLIW z Latencies are not fixed (e. g. , caches) y](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-27.jpg)
Problems with Pure VLIW
- Latencies are not fixed (e.g., caches)
  - Option I: don't use caches (forget it)
  - Option II: stall whole pipeline on a miss?
  - Option III: stall instructions waiting for memory? (need out-of-order)
- Different implementations
  - Different pipe depths, different latencies
  - New pipeline may produce wrong results (code stalls in wrong place)
  - Recompile for new implementations?
- Code compatibility is very important; it made Intel what it is
![Key Static Scheduling z VLIW relies on the fact that software can schedule code Key: Static Scheduling z VLIW relies on the fact that software can schedule code](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-28.jpg)
Key: Static Scheduling
- VLIW relies on the fact that software can schedule code well
  - Loop unrolling (we have seen this one already)
    - Code growth
    - Poor scheduling along unrolled copies
![Limits to MultiIssue Machines z Inherent limitations of ILP y 1 branch in 5 Limits to Multi-Issue Machines z Inherent limitations of ILP y 1 branch in 5](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-29.jpg)
Limits to Multi-Issue Machines
- Inherent limitations of ILP
  - 1 branch in 5 instructions => how to keep a 5-way VLIW busy?
  - Latencies of units => many operations must be scheduled
  - Need about (pipeline depth × number of functional units) independent operations to keep machines busy
- Difficulties in building HW
  - Duplicate functional units to get parallel execution
  - Increase ports to register file
  - Increase ports to memory
  - Superscalar decoding and its impact on clock rate and pipeline depth
  - Complexity-effective designs
![Limits to MultiIssue Machines z Limitations specific to either Superscalar or VLIW implementation y Limits to Multi-Issue Machines z Limitations specific to either Superscalar or VLIW implementation y.](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-30.jpg)
Limits to Multi-Issue Machines
- Limitations specific to either superscalar or VLIW implementation
  - Decode issue in superscalar
  - VLIW code size: unrolled loops + wasted fields in VLIW
  - VLIW lock step => 1 hazard and all instructions stall
  - VLIW and binary compatibility
![Multiple Issue Challenges z While IntegerFP split is simple for the HW get CPI Multiple Issue Challenges z While Integer/FP split is simple for the HW, get CPI](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-31.jpg)
Multiple Issue Challenges
- While Integer/FP split is simple for the HW, get CPI of 0.5 only for programs with:
  - Exactly 50% FP operations
  - No hazards
- If more instructions issue at same time, greater difficulty of decode and issue
  - Even 2-scalar => examine 2 opcodes, 6 register specifiers, and decide if 1 or 2 instructions can issue
- VLIW: trade off instruction space for simple decoding
  - The long instruction word has room for many operations
  - By definition, all the operations the compiler puts in the long instruction word are independent => execute in parallel
    - E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch
      - 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
  - Need compiling technique that schedules across several branches
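The instruction-word width quoted above is simple arithmetic, and a short Python check makes it explicit. The operation mix comes from the slide; the function name `vliw_width_bits` and the dictionary keys are just illustrative.

```python
def vliw_width_bits(ops_per_word, bits_per_field):
    """Width of a long instruction word with one field per operation."""
    return ops_per_word * bits_per_field

# Operation mix from the slide: 2 integer, 2 FP, 2 memory references, 1 branch.
slot_mix = {"integer": 2, "fp": 2, "memory": 2, "branch": 1}
ops_per_word = sum(slot_mix.values())

if __name__ == "__main__":
    print(ops_per_word)                       # 7
    print(vliw_width_bits(ops_per_word, 16))  # 112
    print(vliw_width_bits(ops_per_word, 24))  # 168
```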
![HW Support for More ILP z How is this used in practice z Rather HW Support for More ILP z How is this used in practice? z Rather](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-32.jpg)
HW Support for More ILP
- How is this used in practice?
- Rather than predicting the direction of a branch, execute the instructions on both sides!
- We know the target of a branch early on, long before we know if it will be taken or not.
- So begin fetching/executing at that new target PC.
- But also continue fetching/executing as if the branch is NOT taken.
![Studies of ILP z Conflicting studies of amount of improvement available x Benchmarks vectorized Studies of ILP z Conflicting studies of amount of improvement available x. Benchmarks (vectorized](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-33.jpg)
Studies of ILP
- Conflicting studies of amount of improvement available
  - Benchmarks (vectorized FP Fortran vs. integer C programs)
  - Hardware sophistication
  - Compiler sophistication
- How much ILP is available using existing mechanisms with increasing HW budgets?
- Do we need to invent new HW/SW mechanisms to keep on processor performance curve?
![Summary z Static ILP y Simple advanced loads predication hardware fast clock complex compilers Summary z Static ILP y Simple, advanced loads, predication hardware (fast clock), complex compilers](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-34.jpg)
Summary
- Static ILP
  - Simple, advanced loads, predication hardware (fast clock), complex compilers
  - VLIW
![Summary z Dynamic ILP y Instruction buffer x Split ID into two stages one Summary z Dynamic ILP y Instruction buffer x. Split ID into two stages one](https://slidetodoc.com/presentation_image_h/e0a471e792ea986bacf6e7539f15b1a6/image-35.jpg)
Summary
- Dynamic ILP
  - Instruction buffer
    - Split ID into two stages, one for in-order and the other for out-of-order issue
  - Scoreboard
    - Out-of-order, doesn't deal with WAR/WAW hazards
  - Tomasulo's algorithm
    - Uses register renaming to eliminate WAR/WAW hazards
  - Dynamic scheduling + speculation
  - Superscalar