CS 704 Advanced Computer Architecture
Lecture 17: Instruction Level Parallelism (High-Performance Instruction Delivery - Multiple Issue)
Prof. Dr. M. Ashraf Chughtai

Recap: MAC/VU-Advanced Computer Architecture Lecture 17 – Instruction Level Parallelism -Dynamic (6) 2

High-Performance Processors

Reducing Branch Penalties for High-Performance Processors
- Branch Target Buffer
- Integrated Instruction Fetch Units
- Return Address Predictors

1: Branch Target Buffer

2: Integrated Instruction Fetch Units
- Integrated Branch Prediction
- Instruction Prefetch
- Instruction memory access and buffering

2: Integrated Instruction Fetch Units (cont'd)
Integrated Branch Prediction
The branch predictor is included in the instruction fetch unit, so it predicts branches and drives the fetch pipeline.
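One way to picture a predictor that drives the fetch pipeline is a lookup that chooses the next fetch address. The sketch below is only an illustration: the dictionary-based branch target buffer, the name `next_fetch_pc`, the addresses, and the 4-byte instruction size are all assumptions, not details from the lecture.

```python
# Hypothetical model: the branch-target buffer is a dict that maps the PC of a
# branch predicted as taken to its predicted target address.
INSTR_BYTES = 4  # assumed fixed 32-bit instructions


def next_fetch_pc(pc, btb):
    """Return the address the fetch unit would request after `pc`."""
    # A hit means the instruction at `pc` is a branch predicted as taken.
    if pc in btb:
        return btb[pc]
    # A miss means fetch simply continues with the next sequential instruction.
    return pc + INSTR_BYTES


# Illustrative addresses: a loop-closing branch at 0x1010 jumps back to 0x1000.
btb = {0x1010: 0x1000}
trace = []
pc = 0x1008
for _ in range(4):
    trace.append(pc)
    pc = next_fetch_pc(pc, btb)
```

In this toy trace the fetch address is redirected to the loop head right after the branch is fetched, without waiting for the branch to execute.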

2: Integrated Instruction Fetch Units (cont'd)
Instruction Prefetch
An instruction prefetch queue is part of the integrated instruction fetch unit (IIFU). The queue holds multiple instructions and can deliver more than one instruction per cycle.

2: Integrated Instruction Fetch Units (cont'd)
Instruction-Memory Access and Buffering
Fetching multiple instructions per clock cycle may require accessing multiple cache lines, which is a complex operation. The integrated instruction fetch unit (IIFU) handles this complexity and hides the cost of crossing cache blocks. It also provides instruction buffering and on-demand issue.
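To see when a multi-instruction fetch needs more than one cache line, a small calculation helps. This is a minimal sketch under assumed parameters: 4-byte instructions and 32-byte cache blocks are illustrative values, and the function names are hypothetical.

```python
INSTR_BYTES = 4   # assumed 32-bit instructions
BLOCK_BYTES = 32  # assumed cache-block size, chosen only for illustration


def blocks_touched(pc, count):
    """Return the cache-block indices covered by `count` instructions starting at `pc`."""
    first = pc // BLOCK_BYTES
    last = (pc + count * INSTR_BYTES - 1) // BLOCK_BYTES
    return list(range(first, last + 1))


def crosses_block(pc, count):
    """True when the fetch group spans more than one cache block."""
    return len(blocks_touched(pc, count)) > 1
```

For example, two instructions starting at byte address 24 fit in block 0, while two instructions starting at address 28 need blocks 0 and 1; the second case is the one the fetch unit has to hide.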

3: Return Address Predictors
The return-address predictor predicts indirect jumps, i.e., jumps whose target address varies at run time. High-level language programs generate such jumps for indirect procedure calls and for select or case statements; procedure returns are the most common case.
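Procedure returns are one kind of indirect jump, and a small stack of return addresses is a common way to predict them. The sketch below is an assumption-laden illustration: the class name, the depth of 8, and the overflow policy (drop the oldest entry) are hypothetical choices, not details given in the lecture.

```python
class ReturnAddressStack:
    """Toy return-address predictor: push on call, pop on return."""

    def __init__(self, depth=8):
        self.depth = depth  # assumed capacity; real designs vary
        self.entries = []

    def on_call(self, return_pc):
        # On overflow, drop the oldest entry so the most recent calls survive.
        if len(self.entries) == self.depth:
            self.entries.pop(0)
        self.entries.append(return_pc)

    def predict_return(self):
        # The most recent call is the one expected to return next.
        # None means the predictor has no prediction to offer.
        return self.entries.pop() if self.entries else None
```

With a depth of 2, three nested calls followed by three returns would predict the two innermost returns correctly and offer no prediction for the outermost one.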

Summary: Minimizing Control Hazard Penalties

Multiple Instruction-Issue Processors
All of the schemes described so far can at best achieve 1 instruction/cycle. Issuing more than one instruction per cycle takes one of two forms:
- Superscalar processors
- Very Long Instruction Word (VLIW) processors

1: Superscalar
- Static Scheduling
- Dynamic Scheduling

Superscalar
§ Statically scheduled processors use in-order execution
§ Dynamically scheduled processors use out-of-order execution
§ The superscalar concept has been used in:
  - IBM Power2
  - Sun UltraSPARC
  - Pentium III/4
  - DEC Alpha
  - HP PA-8000

2: Very Long Instruction Word (VLIW) Processors

2: VLIW / EPIC Machines

2: VLIW Processors
VLIW includes new features for:
- predication,
- rotating registers, and
- speculation, etc.
Typical implementations are: i860, TriMedia, Itanium.
We will talk about statically scheduled superscalar today and about compiling for VLIW/EPIC later.

Statically Scheduled Superscalar Processor

Statically Scheduled Superscalar Processor
Instruction Issue Process:
§ Multiple instruction issue is a complex process
§ During instruction fetch, the pipeline receives all the instructions that could potentially issue together, called the issue packet (it may contain, say, 1 to 4 instructions)

Example: Statically Scheduled Superscalar MIPS Processor
As an example, let us consider a MIPS superscalar that issues 2 instructions per clock:
- 1 FP operation
- 1 integer operation (integer operations include loads/stores to integer or FP registers, branches, and integer ALU operations)
Issuing two instructions per cycle requires fetching and decoding 64 bits (two 32-bit instructions) per clock cycle.

Example (cont'd)
Fetching two instructions needs careful handling of the cache, because the pair may straddle a cache-block boundary: the first instruction may be at the end of one cache block and the second at the beginning of the next.
Hazard detection
The restriction to one FP and one integer instruction makes hazard checking simple.

Example (cont'd)
We only need to check for hazards between the two instructions in an issue packet.
The only difficulties arise when the integer instruction is an FP load, store, or move: it may create contention for the FP register port, and it may create a RAW hazard when the second instruction of the pair depends on the first.
If this situation exists, the simple solution is to treat it as a structural hazard and issue only one of the two instructions.

Example: Issuing
If placement is not a problem, then fetch and issue is completed in three steps:
1. Fetch two instructions from the cache
2. Determine whether 0, 1, or 2 instructions can issue
3. Issue them to the correct functional unit
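The second step, deciding whether 0, 1, or 2 instructions issue, can be sketched for the one-integer-plus-one-FP restriction described earlier. Everything here is a simplified assumption for illustration: the `Instr` record, the `first_stalled` flag (standing in for a hazard against an earlier instruction), and the rule that a RAW dependence inside the pair holds back the second instruction.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    op: str
    unit: str          # "int" (integer ALU, load/store, branch) or "fp"
    dest: str = ""     # destination register, empty if none
    srcs: tuple = ()   # source registers


def issue_count(first, second, first_stalled=False):
    """Return how many instructions of a fetched pair issue this cycle: 0, 1 or 2."""
    if first_stalled:
        return 0  # in-order issue: nothing may pass a stalled instruction
    # The pair must contain exactly one integer-class and one FP instruction.
    if {first.unit, second.unit} != {"int", "fp"}:
        return 1
    # RAW hazard inside the packet, e.g. an FP load feeding the FP operation.
    if first.dest and first.dest in second.srcs:
        return 1
    return 2


load = Instr("L.D", "int", "F0", ("R1",))
add_dep = Instr("ADD.D", "fp", "F4", ("F0", "F2"))
add_indep = Instr("ADD.D", "fp", "F6", ("F8", "F2"))
dec = Instr("DADDUI", "int", "R1", ("R1",))
```

Here `issue_count(load, add_dep)` returns 1 because the add reads the register the load writes, while `issue_count(dec, add_indep)` returns 2.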

Example: Superscalar Pipeline in Operation
Let us see how the instructions look when they go through the pipeline in pairs.
[Figure: pipeline timing diagram of paired instructions; only the stage labels EX and WB survived extraction]

Dynamic Scheduling in Superscalar Processors
1. Extend Tomasulo's concept to support a two-instruction-issue superscalar pipeline
§ We do not want to issue instructions to the reservation stations out of order, as this may violate program semantics.
§ Further, to gain the full advantage of dynamic scheduling, remove the constraint of issuing one FP and one integer instruction per clock.

Dynamic Scheduling in Superscalar Processors
2. Alternatively, separate the data structures for the FP and integer registers and issue both instructions simultaneously to their respective reservation stations, as long as the two issued instructions do not access the same registers.

Dynamic Scheduling in Superscalar Processors
There are two ways to issue two instructions in one clock cycle:
- Run the issue step (assigning a reservation station and updating the control tables) in half a clock cycle, so that two instructions can be processed in one clock cycle.
- Build the logic necessary to handle two instructions at once, including any dependence between them.
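Handling a dependence between two instructions issued in the same cycle can be sketched in software as two in-order updates of a register-status table. This is a rough model, not the hardware design: the shared pool of reservation-station tags (`RS1`, `RS2`), the tuple format, and the function name `issue_pair` are assumptions made for the example.

```python
def issue_pair(pair, reg_status, free_stations):
    """Issue up to two instructions in program order within one cycle.

    pair          : list of (dest, srcs) tuples in program order
    reg_status    : dict mapping a register to the tag of the station that will write it
    free_stations : list of free reservation-station tags (assumed shared pool)
    Returns a list of (tag, operand_sources) for the instructions issued.
    """
    issued = []
    for dest, srcs in pair:
        if not free_stations:
            break  # structural stall: later instructions wait, preserving order
        tag = free_stations.pop(0)
        # Each source is either a pending tag (wait for it on the CDB)
        # or the register name itself (value already available).
        operands = [reg_status.get(s, s) for s in srcs]
        if dest is not None:
            reg_status[dest] = tag
        issued.append((tag, operands))
    return issued


reg_status = {}
free = ["RS1", "RS2"]
# L.D F0, 0(R1) followed by ADD.D F4, F0, F2
out = issue_pair([("F0", ["R1"]), ("F4", ["F0", "F2"])], reg_status, free)
```

Because the second instruction is processed after the first has updated the table, its `F0` source resolves to the tag `RS1`, which is the dependence the combined issue logic has to detect.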

Dynamic Scheduling in Superscalar (cont'd)
Modern superscalar processors issue four or more instructions per clock cycle and often include both approaches.
In addition, branch prediction is integrated into the dynamically scheduled pipeline so that instructions can execute speculatively. This is referred to as hardware-based speculation.

Example
Let us consider a general 2-issue dynamically scheduled processor and see how a simple loop, which we considered for single-issue Tomasulo, executes on this processor.
Recall that our example loop adds a scalar in F2 to each element of a vector in memory.

Example
Loop: L.D     F0, 0(R1)       ; F0 = array element
      ADD.D   F4, F0, F2      ; add scalar in F2
      S.D     F4, 0(R1)       ; store result
      DADDUI  R1, R1, #-8     ; decrement pointer by 8 bytes (per DW)
      BNE     R1, R2, Loop    ; branch if R1 != R2
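For readers less used to MIPS assembly, the loop can be read as the following high-level routine. This is only an interpretation: the function name is hypothetical, and a Python list index stands in for the byte pointer held in R1.

```python
def add_scalar(x, s):
    """Add s to every element of x in place, walking from the last element down."""
    i = len(x) - 1
    while i >= 0:        # BNE: keep looping until the pointer reaches its end value
        x[i] = x[i] + s  # L.D, ADD.D, S.D: load element, add scalar, store result
        i -= 1           # DADDUI: move the pointer back by one double word
    return x
```

Each pass through the loop body corresponds to the five instructions of one iteration.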

Example
Let us create a table showing when each instruction issues, begins execution, and writes its result to the CDB for the first three iterations, using the 2-issue version of Tomasulo's pipeline rather than the single-issue processor.
Assume that both an FP and an integer operation can be issued on every clock cycle, even if they are dependent.

Example
Further assumptions:
- One integer functional unit is used for both ALU operations and effective address calculations, and there is a separate pipelined FP functional unit for each operation type.
- Issue and write result take one cycle each.
- There is dynamic branch prediction hardware and a separate functional unit to evaluate branch conditions.
- Latency is one cycle for integer ALU operations, two cycles for loads, and three cycles for FP add.

Let us look at the clock cycles of issue, execution, and writing result for the dual-issue version of Tomasulo's pipeline.

Example
Thus, sustaining one iteration every three cycles would lead to an IPC of 5/3 = 1.67 (5 instructions in 3 clocks).

Example
The completion rate is 15/16 = 0.94 (15 instructions execute in 16 cycles).
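The two rates quoted for this example, 5/3 and 15/16, are the same ratio of instructions to clock cycles applied to different intervals. A minimal check of the arithmetic, with a hypothetical helper name:

```python
def ipc(instructions, cycles):
    """Instructions per clock for a given instruction count and cycle count."""
    return instructions / cycles


# One 5-instruction iteration sustained every 3 cycles.
steady_state = ipc(5, 3)
# Three iterations (15 instructions) completing in 16 cycles.
measured = ipc(15, 16)
```

The first figure is the rate the loop would reach if an iteration could be sustained every three cycles; the second is what the first three iterations actually achieve.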

Resource usage table

Another Example: Overcoming the Single Integer Pipe Bottleneck
Now let us consider another example with the 2-issue version of Tomasulo's pipeline that overcomes the single integer unit bottleneck.

Example 2
In this example, we consider the execution of the same loop as in the previous example, but on a 2-issue version of Tomasulo's pipeline with a wider CDB (2 CDBs).

Example 2 (cont'd)
As in the previous example, the activity table shows the clock cycles of issue, execution, and writing result for the dual-issue version of Tomasulo's pipeline.
Notice that this dual-issue Tomasulo pipeline has:
- separate functional units for the integer ALU and effective address calculation; and
- a wider CDB

Activity Table for Example 2

Summary

Aslam-u-Alacum and Allah Hafiz