CS 704 Advanced Computer Architecture
Lecture 12: Instruction Level Parallelism (Introduction to multi-cycle pipelined datapath)
Prof. Dr. M. Ashraf Chughtai

Today’s Topics
- Recap: Pipelining Basics
- Longer Pipelines – FP Instructions
- Loop Level Parallelism
- FP Loop Hazards
- Summary
MAC/VU-Advanced Computer Architecture Lecture 12 – Instruction Level Parallelism (1)

Recap: Pipelined datapath and control
In the previous lecture we reviewed the pipelined datapath to understand the basics of ILP: overlapping the execution of instructions to enhance performance.
- Key components of the pipelined datapath
- Performance enhancement due to pipelining: pipelining improves instruction throughput (bandwidth) but not the latency of an individual instruction

Recap: Pipeline Hazards
- Structural hazards

Recap: Pipeline Hazards (cont’d)
- Data hazards

Recap: Three Generic Data Hazards
Read After Write (RAW), a true data dependence: instruction j tries to read an operand before instruction i writes it.
    i: add r1, r2, r3
    j: sub r4, r1, r3

Recap: Three Generic Data Hazards
Write After Read (WAR), an anti-dependence: instruction j tries to write an operand before instruction i reads it.
    i: sub r4, r1, r3
    j: add r1, r2, r3
- This is a name dependence; it can be removed by register renaming.

Recap: Three Generic Data Hazards
Write After Write (WAW), an output dependence: instruction j tries to write an operand before instruction i writes it.
    i: sub r1, r4, r3
    j: add r1, r2, r3
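The three hazard types can be told apart by comparing the destination and source registers of two instructions. The sketch below is only an illustration: the function name `classify_hazards` and the `(dest, sources)` tuple encoding are assumptions made for this example, not part of the lecture.

```python
def classify_hazards(i, j):
    """Return the set of hazards between instruction i and a later instruction j.

    Each instruction is encoded as (dest, sources), e.g. ("r1", ("r2", "r3"))
    for `add r1, r2, r3`. This encoding is hypothetical.
    """
    dest_i, srcs_i = i
    dest_j, srcs_j = j
    hazards = set()
    if dest_i in srcs_j:
        hazards.add("RAW")  # j reads a register that i writes
    if dest_j in srcs_i:
        hazards.add("WAR")  # j writes a register that i reads
    if dest_i == dest_j:
        hazards.add("WAW")  # i and j write the same register
    return hazards


add_r1 = ("r1", ("r2", "r3"))  # add r1, r2, r3
sub_r4 = ("r4", ("r1", "r3"))  # sub r4, r1, r3
sub_r1 = ("r1", ("r4", "r3"))  # sub r1, r4, r3

print(classify_hazards(add_r1, sub_r4))  # the RAW example
print(classify_hazards(sub_r4, add_r1))  # the WAR example
print(classify_hazards(sub_r1, add_r1))  # the WAW example
```

The three calls reproduce the instruction pairs from the recap slides, in the same i-then-j order.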

Recap: Pipeline Hazards (cont’d)
- Control hazards
How to overcome hazards?
- Stall

Recap: How to remove Hazards?
- Structural hazards: multiple functional units
- Data hazards: forwarding or bypassing
- Control hazards: branch prediction, delayed branch

Instruction Level Parallelism
Performance can be improved by increasing:
- clock speed
- the number of instructions that can execute in parallel, i.e., increasing ILP

How to achieve Instruction Level Parallelism?
A superscalar processor:
- pre-fetches and decodes instructions
- starts several branch instruction streams
- finally, discards all but the correct stream

Superscalar Design

MIPS Longer Pipelines – FP Instructions

MIPS Longer Pipelines – FP Instructions
For example, adding two FP numbers takes a minimum of four steps, performed in the following sequence:

Flow diagram of MIPS FP Adder
[Flow diagram, p. 284]

Steps for FP Addition
Step 1: Compare the exponents of the two numbers; shift the significand of the number with the smaller exponent to the right until its exponent matches the larger exponent.
Step 2: Add the significands.
Step 3: Normalize the sum: either shift right and increment the exponent, or shift left and decrement the exponent.
Step 4: If there is no overflow or underflow, round the significand to the appropriate number of bits. Stop if no further normalization is required; otherwise go back to Step 3.
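The four steps can be traced in a small software model. This is a minimal sketch under several assumptions that are not in the lecture: operands are non-negative, a number is a hypothetical `(significand, exponent)` pair with value `significand * 2**exponent`, the significand is `p` bits wide, alignment keeps `guard` extra bits, and rounding is round-half-up. Overflow and underflow checks on the exponent are omitted.

```python
def fp_add(a, b, p=8, guard=8):
    """Add two non-negative binary FP numbers given as (significand, exponent).

    The p-bit significand is assumed normalized (top bit set) unless it is zero.
    """
    (sa, ea), (sb, eb) = a, b
    if sa == 0:
        return b
    if sb == 0:
        return a
    # Step 1: compare exponents; shift the significand with the smaller
    # exponent right until both exponents match (guard bits keep precision).
    if ea < eb:
        sa, ea, sb, eb = sb, eb, sa, ea
    sa <<= guard
    sb = (sb << guard) >> (ea - eb)
    exp = ea - guard
    # Step 2: add the significands.
    total = sa + sb
    while True:
        # Step 3: normalize, shifting right/incrementing or left/decrementing.
        while total >= 1 << (p + guard):
            total >>= 1
            exp += 1
        while total < 1 << (p + guard - 1):
            total <<= 1
            exp -= 1
        # Step 4: round to p bits; stop if the result is still normalized.
        rounded = (total + (1 << (guard - 1))) >> guard
        if rounded < 1 << p:
            return rounded, exp + guard
        total = rounded << guard  # rounding carried out: back to Step 3


print(fp_add((192, 0), (128, -2)))  # 192 + 32
print(fp_add((255, 0), (128, -8)))  # 255 + 0.5, rounding forces renormalization
```

The second call shows why Step 4 can loop back to Step 3: rounding 255.5 up produces a 9-bit significand, which must be normalized again.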

MIPS Longer Pipelines (cont’d)
- The latency of a functional unit is the number of intervening cycles between an instruction that produces a result and an instruction that uses that result.
- The initiation (or repeat) interval is the number of cycles that must elapse between issuing two operations of the same type.

MIPS Longer Pipelines (cont’d)
Functional unit                  Latency   Initiation (repeat) interval
Integer ALU                      0         1
Data memory (Int / FP load)      1         1
FP add                           3         1
FP / integer multiply            6         1
FP / integer divide              24        25
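The two definitions translate directly into simple arithmetic. The sketch below assumes full forwarding and uses hypothetical names (`UNITS`, `stall_cycles`, `earliest_reissue`); the latencies follow the table, while the interval of 1 for every pipelined unit is an assumption.

```python
# (latency, initiation interval) per functional unit; unit names are hypothetical.
UNITS = {
    "int_alu": (0, 1),
    "load": (1, 1),
    "fp_add": (3, 1),    # interval of 1 assumed (pipelined unit)
    "fp_mul": (6, 1),    # interval of 1 assumed (pipelined unit)
    "fp_div": (24, 25),  # not pipelined
}


def stall_cycles(producer_unit, distance=1):
    """Stalls needed when the consumer follows its producer by `distance`
    instructions (distance=1 means the very next instruction)."""
    latency, _ = UNITS[producer_unit]
    # distance - 1 independent instructions already fill that many cycles.
    return max(0, latency - (distance - 1))


def earliest_reissue(unit, issue_cycle):
    """Earliest cycle at which another operation of the same type can issue."""
    _, interval = UNITS[unit]
    return issue_cycle + interval


print(stall_cycles("fp_add"))          # consumer immediately after an FP add
print(stall_cycles("fp_mul", 3))       # two independent instructions in between
print(earliest_reissue("fp_div", 10))  # next divide after one issued in cycle 10
```

Under this model, useful independent instructions placed between a producer and its consumer reduce the stall count one for one, which is what scheduling tries to exploit.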

Typical MIPS FP Pipeline
Let us consider a typical MIPS FP pipeline with three un-pipelined FP functional units.
[Fig. A.29, page A-48]

Typical MIPS FP Pipeline

MIPS FP Pipeline with Pipelined FUs
The previous FP pipeline can be extended by adding pipeline stages inside the functional units.
[Fig. A.31, page A-50]

Working of extended FP Pipeline

Working of extended FP Pipeline
- Additional pipeline registers have been inserted between the intervening stages, e.g., A1/A2, A2/A3, ...
- The ID/EX register must be expanded to connect ID to the A1, M1, EX and DIV functional units.
- The FP divide unit is not pipelined; it requires 24 clock cycles to complete.

FP Pipeline Timing: Example

Hazards in Longer Latency Pipeline
- Not all functional units are fully pipelined, so structural hazards may occur.
- Instructions have varying running times, so more than one register write may be required in a single cycle.
- Instructions no longer reach the WB stage in order, so WAW data hazards may occur.
- WAR hazards are not possible, since registers are always read in the ID stage.
- Stalls for RAW data hazards may be more frequent because of the longer latency of operations.

FP Pipeline Hazards - RAW
Instruction sequence:
    L.D   F4, 0(R2)
    MUL.D F0, F4, F6
    ADD.D F2, F0, F8
    S.D   F2, 0(R2)
[Timing table, clock cycles 1 to 6: L.D proceeds through IF, ID, EX, Me, WB; each following instruction depends on the result of the one before it and is stalled (st) until that result is available.]

FP Pipeline Structural Hazard
Clock cycle   1   2   3   4   5   6   7   8   9   10  11
Instr 1       IF  ID  M1  M2  M3  M4  M5  M6  M7  Me  WB
Instr 2           IF  ID  Ex  Me  WB
Instr 3               IF  ID  Ex  Me  WB
Instr 4                   IF  ID  A1  A2  A3  A4  Me  WB
Instr 5                       IF  ID  Ex  Me  WB
Instr 6                           IF  ID  Ex  Me  WB
Instr 7                               IF  ID  EX  Me  WB
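A structural hazard of this kind can be found by computing, for each instruction, the cycle in which it reaches a given stage. The sketch below assumes one instruction is fetched per cycle with no stalls, and a seven-instruction mix of one FP multiply, one FP add and five integer-pipeline instructions; the names `STAGES` and `stage_conflicts` are made up for this example.

```python
from collections import defaultdict

# Stage sequences per instruction type (type names are hypothetical).
STAGES = {
    "int": ["IF", "ID", "EX", "Me", "WB"],
    "fp_add": ["IF", "ID", "A1", "A2", "A3", "A4", "Me", "WB"],
    "fp_mul": ["IF", "ID", "M1", "M2", "M3", "M4", "M5", "M6", "M7", "Me", "WB"],
}


def stage_conflicts(kinds, stage="WB"):
    """Map clock cycle -> indices of instructions occupying `stage` in that cycle.

    kinds[k] is the type of the instruction fetched in cycle k + 1.
    Only cycles with more than one instruction in `stage` are returned.
    """
    occupancy = defaultdict(list)
    for k, kind in enumerate(kinds):
        cycle = (k + 1) + STAGES[kind].index(stage)
        occupancy[cycle].append(k)
    return {c: idx for c, idx in occupancy.items() if len(idx) > 1}


sequence = ["fp_mul", "int", "int", "fp_add", "int", "int", "int"]
print(stage_conflicts(sequence, "Me"))
print(stage_conflicts(sequence, "WB"))
```

For this sequence the first, fourth and seventh instructions collide in both the Me and WB stages, so with a single memory port and a single register write port two of them would have to stall.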

Conclusion about FP Pipeline
Three checks are performed before an instruction can issue:
1. Check for structural hazards: wait until the required functional unit is available.
2. Check for RAW data hazards: wait until the source registers are not listed as pending destinations in a pipeline register that will not be available when this instruction needs the result.
3. Check for WAW data hazards: determine whether any instruction in A1, A2, ..., D, M1, M2, ... has the same destination register as this instruction.
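The three checks can be written as one issue test. This is a simplified sketch, not the actual interlock logic: it treats every pending destination as unavailable (ignoring exactly when forwarding would deliver the value), and the function name `can_issue` and the dictionary layout are assumptions.

```python
def can_issue(instr, busy_units, pending_dests):
    """Apply the structural, RAW and WAW checks in order.

    instr: dict with keys "unit", "dest" and "srcs" (hypothetical layout).
    busy_units: functional units that cannot accept a new operation.
    pending_dests: destination registers of instructions still in flight.
    Returns (True, "issue") or (False, <name of the hazard that blocks issue>).
    """
    # 1. Structural hazard: the required functional unit must be free.
    if instr["unit"] in busy_units:
        return False, "structural"
    # 2. RAW hazard: no source may be a pending destination.
    if any(src in pending_dests for src in instr["srcs"]):
        return False, "RAW"
    # 3. WAW hazard: no in-flight instruction may write the same register.
    if instr["dest"] in pending_dests:
        return False, "WAW"
    return True, "issue"


add_d = {"unit": "fp_add", "dest": "F2", "srcs": ("F0", "F8")}
print(can_issue(add_d, busy_units={"fp_div"}, pending_dests={"F0"}))
print(can_issue(add_d, busy_units=set(), pending_dests={"F2"}))
print(can_issue(add_d, busy_units=set(), pending_dests=set()))
```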

Precise Exceptions: Out-of-order Completion
In the program:
    DIV.D F0, F2, F4
    ADD.D F10, F8
    SUB.D F12, F14
ADD.D and SUB.D can complete before the much longer DIV.D, so the instructions complete out of order.

Overcoming the Data Hazard by Scheduling
- Static scheduling: compiler based
- Dynamic scheduling: hardware based
Statically Scheduled Pipeline:

Dynamic Scheduling: Overcoming the Data Hazard
Dynamically Scheduled Pipeline:
Advantages:
- Handles cases where a dependence is unknown at compile time
- Allows code compiled for one pipeline to run efficiently on a different pipeline

Concept of Dynamic Scheduling (cont’d)
In the program:
    DIV.D F0, F2, F4
    ADD.D F10, F8
    SUB.D F12, F8, F14
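The idea behind dynamic scheduling is that an instruction should not wait behind an earlier stalled instruction it does not depend on. The sketch below lists which instructions have all their sources available. It is an illustration only: it checks RAW dependences and ignores WAR, WAW and structural hazards, the name `ready_instructions` is made up, and the `F0` source of ADD.D is an assumption, since that operand is not shown in the program text above.

```python
def ready_instructions(program, completed):
    """Return indices of uncompleted instructions with no RAW dependence
    on an earlier instruction that has not completed yet.

    program: list of (op, dest, srcs); completed: set of finished indices.
    """
    ready = []
    for k, (_, _, srcs) in enumerate(program):
        if k in completed:
            continue
        # Destinations of earlier instructions that are still in flight.
        pending = {program[e][1] for e in range(k) if e not in completed}
        if not pending.intersection(srcs):
            ready.append(k)
    return ready


program = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F10", ("F0", "F8")),   # F0 as a source is an assumption
    ("SUB.D", "F12", ("F8", "F14")),
]
print(ready_instructions(program, completed=set()))
print(ready_instructions(program, completed={0}))
```

Under that assumption, SUB.D is ready at the start even though ADD.D must wait for the divide; an in-order pipeline would hold SUB.D behind ADD.D anyway.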

Problems of Out-of-order execution: WAR and WAW
In the program:
    DIV.D F0, F2, F4
    ADD.D F6, F0, F8
    SUB.D F8, F10, F14
    MUL.D F6, F10, F8
- WAR hazard: SUB.D writes F8, which ADD.D reads.
- WAW hazard: ADD.D and MUL.D both write F6.
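Because WAR and WAW hazards are name dependences, renaming the destination registers removes them. The sketch below illustrates that idea on the four-instruction program; it is not a description of any particular hardware scheme, and the function names and the `T1`, `T2`, ... register names are hypothetical.

```python
def rename_registers(program, prefix="T"):
    """Give every destination a fresh name; later sources use the latest name."""
    latest = {}   # architectural register -> most recent fresh name
    renamed = []
    for n, (op, dest, srcs) in enumerate(program, start=1):
        new_srcs = tuple(latest.get(s, s) for s in srcs)
        new_dest = f"{prefix}{n}"
        latest[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


def name_hazards(program):
    """List (kind, i, j, register) for WAR and WAW pairs, with i before j."""
    found = []
    for j, (_, dest_j, _) in enumerate(program):
        for i in range(j):
            _, dest_i, srcs_i = program[i]
            if dest_j in srcs_i:
                found.append(("WAR", i, j, dest_j))
            if dest_j == dest_i:
                found.append(("WAW", i, j, dest_j))
    return found


program = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]
print(name_hazards(program))
print(rename_registers(program))
print(name_hazards(rename_registers(program)))
```

The true (RAW) dependences survive renaming: ADD.D still reads the result of DIV.D, and MUL.D still reads the result of SUB.D, now under their new names.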

Exception due to out-of-order execution
- Already completed instructions
- Not yet completed instructions

Overcoming Exceptions
Split the ID pipe stage into two:
- Issue: decode instructions and check for structural hazards
- Read Operands: wait until there are no data hazards, then read the operands

Summary
We have talked about longer FP pipelines.

Assalam-u-Alaikum and Allah Hafiz