CS 704 Advanced Computer Architecture Lecture 11 Computer Hardware Design (Pipeline and Instruction Level Parallelism) Prof. Dr. M. Ashraf Chughtai

Today's Topics
- Recap of Lecture 10
- Structural Hazards
- Data Hazards
- Control Hazards

MAC/VU Advanced Computer Architecture Lecture 11 – Computer Hardware Design (5)

Recap: Lecture 10
- Multi-cycle datapath versus pipelined datapath
- Key components of the pipelined datapath
- Performance enhancement due to pipelining
- Introduction to hazards in the pipelined datapath

Structural Hazards
A structural hazard is an attempt to use the same resource in two different ways at the same time. For example, accessing a single memory port for both an instruction fetch and a data read in the same clock cycle is a structural hazard.

Single Memory is a Structural Hazard
[Figure: pipeline diagram, time (clock cycles) against instruction order, for Instr 1 (Load) through Instr 5]
There are two memory read operations in the 4th cycle: the Load instruction accesses memory to read data while the 4th instruction is fetched from the same memory.

Single Memory is a Structural Hazard
[Figure: the same pipeline diagram with a stall (bubble) inserted before Instr 4]
Insert a stall (bubble) to avoid the memory structural hazard.
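
The effect of the bubble can be sketched in a few lines of Python. This is a minimal illustration and not part of the lecture: the function name `fetch_schedule`, the instruction labels, and the assumption that only loads use the memory port in their MEM stage (the 4th stage) are hypothetical choices made for the example.

```python
def fetch_schedule(program):
    """Return the fetch cycle of each instruction when instruction fetch
    and data access share one memory port (assumed model)."""
    busy = set()      # cycles in which an earlier load reads data memory
    schedule = []
    cycle = 0
    for kind in program:
        cycle += 1
        while cycle in busy:      # port is taken: insert a bubble
            cycle += 1
        schedule.append(cycle)
        if kind == "load":
            busy.add(cycle + 3)   # MEM is the 4th stage of the load
    return schedule


# Instr 1 is the load; Instr 4 would be fetched in cycle 4 and is delayed.
program = ["load", "instr2", "instr3", "instr4", "instr5"]
print(fetch_schedule(program))  # [1, 2, 3, 5, 6]
```

No instruction is fetched in cycle 4 in this model, which is the one-cycle stall shown on the slide.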

Structural Hazards
A structural hazard also exists when the single write port of the register file is accessed for two WB operations in the same clock cycle.
- This situation does not arise when every instruction passes through all 5 stages.
- It can arise when 4-stage and 5-stage instructions are mixed in the pipeline.

Pipelining the Load Instruction
          Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7
1st lw    Ifetch   Reg/Dec  Exec     Mem      Wr
2nd lw             Ifetch   Reg/Dec  Exec     Mem      Wr
3rd lw                      Ifetch   Reg/Dec  Exec     Mem      Wr
- The five independent functional units in the pipeline datapath are: instruction fetch, decode/register read, the ALU for Exec, the data memory, and the register file's write port for the Wr stage.
- The register file has separate read and write ports, so registers can be read and written at the same time.
- Each functional unit is used once.

The Four Stages of R-type
          Cycle 1  Cycle 2  Cycle 3  Cycle 4
R-type    Ifetch   Reg/Dec  Exec     Wr
- An R-type instruction does not access data memory, so it takes only 4 clocks (4 stages) to complete.
- The ALU operates on the register operands.
- The result is written into the register during the WB stage.

Pipelining the R-type and Load Instruction
          Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8
R-type    Ifetch   Reg/Dec  Exec     Wr
R-type             Ifetch   Reg/Dec  Exec     Wr
Load                        Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                               Ifetch   Reg/Dec  Exec     Wr
R-type                                        Ifetch   Reg/Dec  Exec     Wr
Oops! We have a problem! There is a pipeline conflict, or structural hazard:
- Two instructions try to write to the register file at the same time.
- There is only one write port.

Important Observation
- Each functional unit can be used only once per instruction.
- Each functional unit must be used at the same stage for all instructions:
  - Load uses the register file's write port during its 5th stage.
  - R-type uses the register file's write port during its 4th stage.
Two possible solutions follow.

Solution 1: Insert "Bubble" into the Pipeline
[Figure: pipeline diagram over Cycles 1 to 9 for a Load and several R-type instructions, with a pipeline bubble inserted]
Insert a "bubble" into the pipeline to prevent 2 writes in the same cycle:
- The control logic can be complex.
- An instruction fetch and issue opportunity is lost: no instruction is started in Cycle 6.

Solution 2: Delay R-type's Write by One Cycle
Delay the R-type's register write by one cycle:
- R-type instructions now also use the register file's write port at stage 5.
- The Mem stage is a NO-OP stage for R-type: nothing is done in it.
R-type stages: 1 Ifetch, 2 Reg/Dec, 3 Exec, 4 Mem, 5 Wr
[Figure: pipeline diagram over Cycles 1 to 9 for R-type and Load instructions, each passing through Ifetch, Reg/Dec, Exec, Mem and Wr]
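
The write-port conflict and the effect of delaying the R-type write can be checked with a short Python sketch. It is only an illustration under stated assumptions: the names `write_port_conflicts`, `ORIGINAL` and `DELAYED` are hypothetical, one instruction is assumed to be fetched per cycle, and the instruction mix is the R-type, R-type, Load, R-type, R-type sequence discussed above.

```python
from collections import defaultdict

# Stage in which each instruction kind writes the register file,
# as stated in the slides (Load: 5th stage, R-type: 4th stage).
ORIGINAL = {"load": 5, "rtype": 4}
# Solution 2: R-type gets a no-op Mem stage and also writes in stage 5.
DELAYED = {"load": 5, "rtype": 5}


def write_port_conflicts(program, write_stage):
    """Map each cycle with more than one register write to the
    0-based indices of the instructions writing in that cycle.
    Instruction k is assumed to be fetched in cycle k + 1."""
    writers = defaultdict(list)
    for index, kind in enumerate(program):
        fetch_cycle = index + 1
        writers[fetch_cycle + write_stage[kind] - 1].append(index)
    return {c: ix for c, ix in writers.items() if len(ix) > 1}


program = ["rtype", "rtype", "load", "rtype", "rtype"]
print(write_port_conflicts(program, ORIGINAL))  # {7: [2, 3]}
print(write_port_conflicts(program, DELAYED))   # {}
```

With the original stage counts, the Load and the R-type behind it both write in cycle 7; once every instruction writes in stage 5, no two writes share a cycle.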

Eliminating Structural Hazards?
Structural hazards can be eliminated or minimized either by stalling or by adding multiple functional units.
[Figure: pipeline diagram, time against program flow, with stages IFetch, Dcd, Exec, Mem, WB for the Load and the 2nd to 5th instructions, with a stall inserted]

Example: Dual port vs. Single port
- Machine A: dual-ported memory
- Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate
- Ideal CPI = 1 for both
- Loads are 40% of instructions executed
SpeedUpA = Pipeline Depth / (1 + 0) x (clock_unpipe / clock_pipe) = Pipeline Depth
SpeedUpB = Pipeline Depth / (1 + 0.4 x 1) x (clock_unpipe / (clock_unpipe / 1.05)) = (Pipeline Depth / 1.4) x 1.05 = 0.75 x Pipeline Depth
SpeedUpA / SpeedUpB = Pipeline Depth / (0.75 x Pipeline Depth) = 1.33, so Machine A is about 1.33 times faster.

Stall degrades the performance
Here is an example: suppose data reference instructions constitute 40% of the mix, and the processor with the structural hazard has a clock rate 1.05 times higher than the processor without the hazard.
Average instruction time = CPI x clock cycle time
= (1 + 0.4 x 1) x (ideal clock cycle time / 1.05)
= (1.4 / 1.05) x ideal clock cycle time
= 1.33 x ideal clock cycle time (approximately)
The processor without the structural hazard is therefore about 1.3 times faster than the one with it.
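
The arithmetic can be reproduced with a small Python sketch. The function name `average_instruction_time` and its parameter names are hypothetical; the numbers (40% data references, 1 stall cycle each, 1.05 times faster clock) are the ones given in the example, and times are expressed in units of the ideal clock cycle time.

```python
def average_instruction_time(stall_fraction, stall_cycles, clock_speedup,
                             ideal_cpi=1.0):
    """Average instruction time in units of the ideal clock cycle time."""
    cpi = ideal_cpi + stall_fraction * stall_cycles
    return cpi / clock_speedup


with_hazard = average_instruction_time(0.4, 1, 1.05)
without_hazard = average_instruction_time(0.0, 0, 1.0)
ratio = with_hazard / without_hazard
print(round(ratio, 2))  # 1.33
```

Changing `stall_fraction` or `clock_speedup` shows how quickly a faster clock is cancelled out by stalls on a frequent instruction class.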

Additional Functional Units increase cost
- The memory structural hazard is removed by using two cache memory units: an instruction memory and a data memory.
- Two write ports in the register file allow a mix of 4-stage and 5-stage instructions in the pipeline.

Data Hazards
A data hazard is an attempt to use an item before it is ready.
- Laundry analogy: one sock of a pair is in the dryer and the other is in the washer; the pair cannot be folded until the sock in the washer has gone through the dryer.
- An instruction depends on the result of a prior instruction that is still in the pipeline.

Data Hazards
- Pipelining changes the relative timing of instructions by overlapping their execution.
- This overlap introduces data and control hazards.
- A data hazard occurs when the order of operand reads and writes changes vis-à-vis sequential access to the operands; this happens when instructions have a data dependency.
Let us consider an example.

Example: Data Hazard on R1
Add R1, R2, R3
Sub R4, R1, R3
And R6, R1, R7
Or  R8, R1, R9
Xor R10, R1, R11

Data Hazard: Dependencies backwards in time are hazards
[Figure: pipeline diagram, time (clock cycles) against instruction order, with stages IF, ID/RF, EX, MEM, WB for the Add, Sub, And, Or and Xor instructions of the example]
The Add instruction provides its result to Sub after 3 clock cycles, to And after 2, and to Or after 1.
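
A read-after-write check over the example sequence can be sketched as follows. This is an illustrative sketch only: the function `raw_hazards`, the tuple layout and the `window` parameter are hypothetical, and the Xor is assumed to read R1 as its second operand. `window` is the number of following instructions that read their operands before the producer has written its result; 3 corresponds to the diagram above, and 2 to a register file that is written in the first half of a cycle and read in the second half.

```python
def raw_hazards(program, window=3):
    """List (producer, consumer, register) triples where the consumer
    reads a register too soon after the producer writes it."""
    hazards = []
    for i, (producer, dest, _) in enumerate(program):
        last = min(i + 1 + window, len(program))
        for j in range(i + 1, last):
            consumer, _, sources = program[j]
            if dest in sources:
                hazards.append((producer, consumer, dest))
    return hazards


# (name, destination register, source registers)
program = [
    ("add", "r1", ("r2", "r3")),
    ("sub", "r4", ("r1", "r3")),
    ("and", "r6", ("r1", "r7")),
    ("or", "r8", ("r1", "r9")),
    ("xor", "r10", ("r1", "r11")),  # assumed operands
]
for hazard in raw_hazards(program):
    print(hazard)
```

With `window=3` the sketch reports Sub, And and Or as dependent on Add; the Xor is far enough behind to read R1 after it has been written.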

Data Hazard Solution #1 - Stall
Insert stall cycles after the dependent instruction's IF and decode, before its register read.
[Figure: pipeline diagram of the example, time (clock cycles) against instruction order, with a stall inserted before Sub]

XOR: No Data Hazard here, as the register is read after being written
[Figure: pipeline diagram of the example, time (clock cycles) against instruction order, with stages IF, ID/RF, EX, MEM, WB]

Data Hazard Solution - Forwarding
"Forward" the result from one stage to another:
- from the EX/MEM pipeline register to the ALU stage of Sub
- from the MEM/WB pipeline register to the ALU stage of And
No forwarding is needed for Or, because the register is written in the first half of the clock cycle and read in the second half.
[Figure: pipeline diagram of the example showing the forwarding paths]

Forwarding (or Bypassing): What about Loads?
lw  r1, 0(r2)
sub r4, r1, r3
[Figure: pipeline diagram, time (clock cycles), with stages IF, ID/RF, EX, MEM, WB for the two instructions]
- Dependencies backwards in time are hazards.
- This case cannot be solved with forwarding alone: the instruction that depends on the load must be delayed (stalled).
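
The forwarding cases and the load exception can be summarized in a small decision function. This is a sketch under the assumptions of the five-stage pipeline described in these slides (ALU results available in EX/MEM, load data available only after MEM, register file written in the first half of a cycle and read in the second half); the function name `forwarding_path` and its return format are hypothetical.

```python
def forwarding_path(producer_is_load, distance):
    """Return (stall_cycles, operand_source) for a consumer that follows
    its producer by `distance` instructions (1 = immediately after)."""
    if distance == 1:
        if producer_is_load:
            # Load data exists only after MEM: stall once, then forward.
            return 1, "MEM/WB"
        return 0, "EX/MEM"
    if distance == 2:
        return 0, "MEM/WB"
    # Three or more instructions later, the register file is up to date.
    return 0, "register file"


print(forwarding_path(False, 1))  # (0, 'EX/MEM')   add -> sub
print(forwarding_path(False, 2))  # (0, 'MEM/WB')   add -> and
print(forwarding_path(False, 3))  # (0, 'register file')   add -> or
print(forwarding_path(True, 1))   # (1, 'MEM/WB')   lw -> sub
```

Only the load followed immediately by a dependent instruction needs a stall; every other case is covered by a forwarding path or by the register file itself.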

Control Hazards
A control hazard occurs when one attempts to take an action before the condition is evaluated.
Explanation: when a branch instruction is executed, it may or may not change the PC to something other than PC+4.
- Branch taken: the branch changes the PC from PC+4 to the new target address.
- Branch not taken: the branch leaves the PC at PC+4.

Control Hazards
The simple way to deal with a branch is to:
- freeze the pipeline, holding any instruction after the branch instruction, and
- flush the pipeline to delete the instructions after the branch if, once the condition is evaluated, the branch is to be taken.

Control Hazard: Example BEQ Taken
[Figure: pipeline diagram, time (clock cycles) against instruction order, for ADD, BEQ and LOAD]
- The next instruction is fetched at the beginning of the ID stage of the BEQ instruction.
- The condition is evaluated at the end of the EXE stage of the BEQ instruction.

Explanation: Branch Hazard
If the BEQ is taken, the next instruction address is determined only after the branch condition is evaluated in the EX stage. However, the next instruction (LOAD) is fetched during the ID stage of the BEQ, i.e., before PC+4 is replaced by the new target address. This gives rise to the branch or control hazard.

Dealing with Branches
- Stall
- Redo fetch after branch
- Delayed branching
- Branch prediction
- Multiple streams

Solution #1: Stall
The result of the condition evaluation is available after the EXE stage, and the target address is available in the next stage; thus there are 3 stall cycles.
[Figure: pipeline diagram, time (clock cycles) against instruction order, for ADD, BEQ, the stall cycles and the target instruction]

Reducing the number of Stalls
[Figure: pipeline diagram, time (clock cycles) against instruction order, for ADD, BEQ, the stall and the target instruction]
Extra hardware evaluates the condition at the end of the ID stage of the BEQ instruction.

Reducing the number of Stalls
- If we move the decision up to the end of the ID stage (2nd stage) by adding hardware to compare the registers being read, the number of stalls reduces to 2 clock cycles per branch instruction.
- It can be reduced further to 1 for BEQZ or BNEZ if the register is tested for zero right after instruction fetch.
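
The cost of the three stall counts mentioned above (3, 2 and 1 cycles per branch) can be compared with a short sketch. The function name `cpi_with_branch_stalls` is hypothetical, and the branch frequency of 20% is an assumed figure chosen only for illustration; it does not come from the lecture.

```python
def cpi_with_branch_stalls(branch_fraction, stall_cycles, ideal_cpi=1.0):
    """Effective CPI when every branch costs `stall_cycles` extra cycles."""
    return ideal_cpi + branch_fraction * stall_cycles


BRANCH_FRACTION = 0.2  # assumption for illustration, not from the lecture

for stalls in (3, 2, 1):
    cpi = cpi_with_branch_stalls(BRANCH_FRACTION, stalls)
    print(f"{stalls} stall cycle(s) per branch: CPI = {cpi:.2f}")
```

The sketch makes the motivation for the extra comparison hardware concrete: each stall cycle removed from the branch saves `branch_fraction` cycles per instruction on average.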

Solution #2: Redo Fetch after Branch
[Figure: pipeline diagram with stages IFetch, Dcd, Exec, Mem, WB for the branch, the branch successor and branch successor + 1]
Once a branch has been detected during the instruction decode/register read stage, the next instruction fetch cycle should essentially be a stall, if we assume that the branch is taken.

Solution #2: Redo Fetch after Branch
- However, the instruction fetched in this cycle never performs useful work and is ignored.
- Re-fetching the branch successor instruction therefore provides the correct instruction.
- The second fetch is not essential if the branch is not taken.
Impact: 1 clock cycle per branch instruction if the branch is untaken.

Solution #3: Delayed Branch - S/W method
[Figure: pipeline diagram, time (clock cycles) against instruction order, for Add, Beq, Misc and Load]
- Redefine the branch behavior so that it takes place after the next instruction, by introducing another instruction (possibly a NO-OP) that is always executed.
- Impact: 0 clock cycles per branch instruction if an instruction can be found to put in the "slot" (about 50% of the time).

Solution #4: Prediction
This technique suggests that, for a branch instruction, we guess one direction of the branch to begin with, then back up if the guess is wrong.
The two possible predictions are:
- Predict branch not taken
- Predict branch taken

Branch Prediction Flowchart
[Figure: branch prediction flowchart]

1. Predict - Branch not taken
- This scheme treats every branch as not taken.
- The processor therefore continues to fetch the instructions after the branch as normal.
Sequence when the branch is not taken:
Branch inst. 'i'    IF   ID   EX   MEM  WB
Inst. 'i+1'              IF   ID   EX   MEM  WB
Inst. 'i+2'                   IF   ID   EX   MEM  WB

Predict Branch not taken … Cont'd
When the decision has been made and the branch is taken, the fetch operations already started are turned into NO-OPs and fetch is restarted at the target address.
Sequence when the branch is taken:
Taken branch inst. 'i'   IF   ID   EX   MEM  WB
Inst. 'i+1'                   IF   Idle
Branch target                      IF   ID   EX   MEM  WB
Branch target + 1                       IF   ID   EX   MEM  WB

2. Predict - Branch taken
- An alternative is to treat every branch as taken.
- As soon as the target address is computed, we assume that the branch will be taken and start fetching and executing at the target.
- In a five-stage pipeline, the target address and the condition evaluation are available at the same time, so this technique is of no use there.
Let us consider the example of a loop to explain the concept.

2. Predict - Branch taken
i = 0
Loop: ...
      i = i + 1
      IF i ≠ 1001 THEN Loop
      ...
- Here the branch is taken 1000 times and falls through only once, on loop exit, so the "branch taken" prediction fails only once and causes no stall for the 1000 taken branches.
- Further, the compiler can improve performance by organizing the code so that the most frequent path matches the hardware's choice.
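
The loop example can be checked by replaying the branch outcomes in Python. This is a minimal sketch: the function names `loop_branch_outcomes` and `mispredictions` are hypothetical, and the model covers only the single loop-closing branch of the example.

```python
def loop_branch_outcomes(limit=1001):
    """Replay the loop from the slide and record, for each execution of
    the loop-closing branch, whether it was taken."""
    outcomes = []
    i = 0
    while True:
        i = i + 1
        taken = i != limit      # IF i != 1001 THEN Loop
        outcomes.append(taken)
        if not taken:
            return outcomes


def mispredictions(outcomes, predict_taken):
    """Count branch executions whose outcome differs from a fixed guess."""
    return sum(1 for taken in outcomes if taken != predict_taken)


outcomes = loop_branch_outcomes()
print(len(outcomes))                                  # 1001
print(mispredictions(outcomes, predict_taken=True))   # 1
print(mispredictions(outcomes, predict_taken=False))  # 1000
```

For this loop, always predicting "taken" is wrong once and always predicting "not taken" is wrong 1000 times, which is why matching the prediction to the most frequent path matters.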

Solution #5: Multiple Streams
- Have two pipelines.
- Prefetch each branch path into a separate pipeline.
- Use the appropriate pipeline.
Results:
- Leads to bus and register contention.
- Multiple branches lead to further pipelines being needed.

Solution #5: Prefetch Branch Target
- The target of the branch is prefetched in addition to the instructions following the branch.
- The target is kept until the branch is executed.
- Used by the IBM 360/91.

Summary
Types of hazards in the pipelined datapath:
- Structural hazards occur when the same resource is accessed by more than one instruction at the same time, for example one memory port or one register write port.
- They can be removed either by using multiple resources or by inserting stalls.
- Stalls degrade the pipeline performance.

Summary
- Data hazards occur when an attempt is made to read data that is not yet valid.
- Data hazards can be removed by using stall and forwarding techniques.
- Control hazards occur when an attempt is made to branch before the condition has been evaluated.
- There are four ways to handle control hazards.

Summary: 4 ways to handle control hazards
1: Stall until the branch direction is clear
2: Predict Branch Not Taken
   - Execute successor instructions in sequence
   - "Squash" instructions in the pipeline if the branch is actually taken
   - PC+4 is already calculated, so use it to get the next instruction
3: Predict Branch Taken
4: Delayed Branch
   - Define the branch to take place AFTER a following instruction
   - A 1-slot delay allows the proper decision and branch target address in a 5-stage pipeline

Assalam-u-Alaikum and Allah Hafiz