Machine Organization (CS 570) Lecture 4: Pipelining* Jeremy R. Johnson Wed. Oct. 18, 2000

*This lecture was derived from material in the text (Chap. 3). All figures from Computer Architecture: A Quantitative Approach, Second Edition, by John Hennessy and David Patterson, are copyrighted material (COPYRIGHT 1996 MORGAN KAUFMANN PUBLISHERS, INC. ALL RIGHTS RESERVED).


Introduction

• Objective: To understand pipelining and the enhanced performance it provides
• Pipelining is an implementation technique in which multiple instructions are overlapped in execution. Instructions are broken into stages; while one instruction executes one stage, another instruction can simultaneously execute a different stage.
• Topics
  – Review of DLX
  – Simple implementation of DLX
  – Basic pipeline for DLX
  – Pipeline hazards
  – Floating-point pipeline


Instruction Format

• R-type instruction (register format - add, sub, …): op rs rt rd func
• I-type instruction (immediate format - load, store, branch, immediate): op rs rt immediate
• J-type instruction (jump, jal): op offset (offset added to PC)


Implementation Stages

• Instruction Fetch Cycle (IF)
  – IR ← Mem[PC]
  – NPC ← PC + 4
• Instruction Decode/Register Fetch Cycle (ID)
  – A ← Regs[IR6..10]
  – B ← Regs[IR11..15]
  – Imm ← ((IR16)^16 ## IR16..31)  (the sign bit IR16 replicated 16 times, concatenated with the 16-bit immediate)


Implementation Stages

• Execution/Effective Address Cycle (EX)
  – Memory reference:
    • ALUOutput ← A + Imm
  – Register-register ALU instruction:
    • ALUOutput ← A func B
  – Register-immediate ALU instruction:
    • ALUOutput ← A op Imm
  – Branch:
    • ALUOutput ← NPC + Imm
    • Cond ← (A op 0)


Implementation Stages

• Memory Access/Branch Completion Cycle (MEM)
  – Memory reference:
    • LMD ← Mem[ALUOutput]  or  Mem[ALUOutput] ← B
  – Branch:
    • if (Cond) PC ← ALUOutput
• Write-back Cycle (WB)
  – Register-register ALU instruction:
    • Regs[IR16..20] ← ALUOutput
  – Register-immediate ALU instruction:
    • Regs[IR11..15] ← ALUOutput
  – Load instruction:
    • Regs[IR11..15] ← LMD
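The stage-by-stage semantics above can be sketched as a tiny interpreter. This is a minimal sketch, not the lecture's implementation: `execute`, `regs`, and `mem` are illustrative names, the IF stage (fetch and NPC update) is omitted, and only two instruction kinds are shown.

```python
# Sketch of the DLX cycle semantics for two instruction kinds.
# 'regs' and 'mem' are illustrative stand-ins for the register file and memory.

def execute(instr, regs, mem):
    op = instr[0]
    if op == "add":                      # register-register ALU instruction
        _, rd, rs, rt = instr
        a, b = regs[rs], regs[rt]        # ID: register fetch
        alu_output = a + b               # EX: ALUOutput <- A func B
        regs[rd] = alu_output            # WB: Regs[rd] <- ALUOutput
    elif op == "lw":                     # load: memory reference
        _, rd, offset, rs = instr
        a = regs[rs]                     # ID: register fetch
        alu_output = a + offset          # EX: effective address A + Imm
        lmd = mem[alu_output]            # MEM: LMD <- Mem[ALUOutput]
        regs[rd] = lmd                   # WB: Regs[rd] <- LMD
    return regs

regs = {1: 0, 2: 8, 3: 5}
mem = {8: 42}
execute(("add", 1, 2, 3), regs, mem)    # R1 <- R2 + R3
print(regs[1])                          # 13
execute(("lw", 1, 0, 2), regs, mem)     # R1 <- Mem[0 + R2]
print(regs[1])                          # 42
```

In the real datapath each of these steps happens in a separate clock cycle with results latched between them; collapsing them into one function call is what the pipeline registers introduced below undo.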

DLX Datapath (figure)



Simple DLX Pipeline

• Each stage (clock cycle) becomes a pipeline stage
• Overlap execution of instructions
• Add registers between stages
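The overlap can be made concrete by computing which stage each instruction occupies in each clock cycle. A small sketch (the stage names follow the lecture; the function and table layout are illustrative):

```python
# Ideal 5-stage pipeline occupancy: instruction i enters IF at cycle i.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(n_instructions):
    """Return, per instruction, the stage it occupies in each cycle ('.' = idle)."""
    total_cycles = n_instructions + len(STAGES) - 1
    rows = []
    for i in range(n_instructions):
        row = []
        for cycle in range(total_cycles):
            stage = cycle - i            # instruction i lags instruction 0 by i cycles
            row.append(STAGES[stage] if 0 <= stage < len(STAGES) else ".")
        rows.append(row)
    return rows

for i, row in enumerate(pipeline_diagram(3)):
    print(f"I{i}: " + " ".join(f"{s:>3}" for s in row))
# I0:  IF  ID  EX MEM  WB   .   .
# I1:   .  IF  ID  EX MEM  WB   .
# I2:   .   .  IF  ID  EX MEM  WB
```

Three instructions finish in 7 cycles instead of 15; in the steady state one instruction completes per cycle.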

Overlap of Functional Units (figure)


Pipelined Datapath (figure)



Pipeline Performance

• Expect speedup equal to the number of pipe stages
  – assumes equal-sized tasks
  – assumes no additional overhead due to pipelining
• Speedup from pipelining (reduce CPI or decrease the clock cycle)
  = avg. instruction execution time unpipelined / avg. instruction execution time pipelined
• Example: 10 ns clock without pipelining, 11 ns with pipelining (to account for overhead). ALU operations (40%) and branches (20%) take 4 cycles; memory references (40%) take 5.
• Speedup = 10 ns × ((0.4 + 0.2) × 4 + 0.4 × 5) / 11 ns = 44/11 = 4
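The arithmetic in the example can be checked directly. A minimal sketch (variable names are illustrative; note that the 44 ns total requires a 40% memory-reference frequency, so the frequencies sum to 100%):

```python
# Speedup = avg. unpipelined instruction time / avg. pipelined instruction time.
clock_unpipelined = 10   # ns
clock_pipelined = 11     # ns per completed instruction once the pipeline is full

# Instruction mix: ALU ops (40%) and branches (20%) take 4 cycles;
# memory references (40%) take 5 cycles.
avg_cpi = (0.4 + 0.2) * 4 + 0.4 * 5                   # 4.4 cycles
avg_time_unpipelined = clock_unpipelined * avg_cpi    # 44 ns

speedup = avg_time_unpipelined / clock_pipelined
print(round(speedup, 3))  # 4.0
```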


Pipeline Hazards

• Situations in pipelining when the next instruction cannot execute in the following clock cycle
• Structural hazards – the hardware cannot support the combination of instructions that we want to execute in the same cycle
• Control hazards – we need to make a decision based on the result of one instruction while others are executing
• Data hazards – an instruction depends on the result of a previous instruction still in the pipeline


Pipeline Performance II

• Must account for hazards
  – hazards introduce stall cycles in the pipeline

Speedup = avg. instruction execution time unpipelined / avg. instruction execution time pipelined
        = (CPI unpipelined × clock cycle unpipelined) / (CPI pipelined × clock cycle pipelined)
        = [CPI unpipelined / (1 + pipeline stall cycles per instruction)] × (clock cycle unpipelined / clock cycle pipelined)
        ≈ pipeline depth / (1 + pipeline stall cycles per instruction)

(The second-to-last step uses CPI pipelined = ideal CPI + stalls = 1 + stalls; the last assumes CPI unpipelined equals the pipeline depth and equal clock cycles.)
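Under those simplifying assumptions, the formula reduces to a one-line function. A sketch (the function name is illustrative):

```python
def pipeline_speedup(depth, stall_cycles_per_instr):
    """Speedup ~= pipeline depth / (1 + pipeline stall cycles per instruction),
    assuming an ideal pipelined CPI of 1 and equal stage lengths."""
    return depth / (1 + stall_cycles_per_instr)

print(pipeline_speedup(5, 0))   # 5.0: with no stalls, speedup equals the depth
print(pipeline_speedup(5, 1))   # 2.5: one stall per instruction halves it
```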


Structural Hazards

• Problem: conflict over resources
• Example: suppose instruction and data memory were shared in the pipeline. A data access would conflict with an instruction fetch occurring in the same cycle.
• Solutions: remove the conflicting stages, redesign to separate the resources, or replicate the resources

Structural Hazard (figure)



Data Hazards

• Problem: an instruction depends on the result of a previous instruction still in the pipeline
• Example:
  – add R1, R2, R3
  – sub R5, R1, R4
• Solutions:
  – forwarding (bypassing)
  – instruction reordering to remove dependencies


Data Hazard Example

– add R1, R2, R3
– sub R4, R1, R5
– and R6, R1, R7
– or R8, R1, R9
– xor R10, R1, R11

Data Dependencies (figure)


Data Forwarding (figure)



Implementing Forwarding

• Detection
  – e.g. EX/MEM.IR16..20 = ID/EX.IR6..10
• Use a multiplexor to select the forwarded result
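The detection test compares the destination register field of the instruction in EX/MEM with a source register field of the instruction in ID/EX. A minimal sketch in Python, using plain named fields as illustrative stand-ins for the bit ranges on the slide (rd for IR16..20, rs for IR6..10):

```python
# Forwarding detection: does the instruction in EX/MEM write the register
# that the instruction in ID/EX is about to read as its A input?
# Field names are illustrative stand-ins for the IR bit-range comparisons.

def forward_a(ex_mem_rd, id_ex_rs, ex_mem_regwrite=True):
    """True if the ALU's A input should take the forwarded EX/MEM result
    instead of the (stale) value read from the register file."""
    return ex_mem_regwrite and ex_mem_rd is not None and ex_mem_rd == id_ex_rs

# add R1, R2, R3 followed by sub R4, R1, R5:
# EX/MEM holds rd = R1 while ID/EX reads rs = R1 -> forward.
print(forward_a(ex_mem_rd=1, id_ex_rs=1))   # True:  mux selects the forwarded value
print(forward_a(ex_mem_rd=1, id_ex_rs=2))   # False: mux selects the register file
```

In hardware this comparison drives the select line of the multiplexor at the ALU input; a symmetric check handles the B input and the MEM/WB pipeline register.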


Data Hazard with Stall

– lw R1, 0(R2)
– sub R4, R1, R5
– and R6, R1, R7
– or R8, R1, R9


Compiler Scheduling for Data Hazards

• Data hazards are naturally generated
  – C = A + B
    • lw R1, A
    • lw R2, B
    • add R3, R1, R2
    • sw C, R3
• The compiler can reorder instructions to remove dependencies
  – a = b + c; d = e - f;
    • lw R1, b
    • lw R2, c
    • lw R3, e
    • add R5, R1, R2
    • lw R4, f
    • sw a, R5
    • sub R6, R3, R4
    • sw d, R6
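The benefit of scheduling can be sketched by counting load-use stalls: even with forwarding, a load whose result is needed by the very next instruction forces a one-cycle stall, and hoisting the loads removes it. This is a toy model with an illustrative (op, dest, sources) encoding, not the lecture's notation:

```python
# Count load-use stalls: one stall whenever a lw's destination register is
# read by the immediately following instruction (assumes forwarding covers
# all other producer-consumer distances). Instructions are (op, dest, sources).

def count_load_use_stalls(program):
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev[0] == "lw" and prev[1] in cur[2]:
            stalls += 1
    return stalls

# Unscheduled a = b + c; d = e - f: each add/sub immediately follows a load it uses.
unscheduled = [
    ("lw", "R1", ()), ("lw", "R2", ()), ("add", "R5", ("R1", "R2")),
    ("sw", None, ("R5",)),
    ("lw", "R3", ()), ("lw", "R4", ()), ("sub", "R6", ("R3", "R4")),
    ("sw", None, ("R6",)),
]
# Scheduled version from the slide: loads hoisted away from their consumers.
scheduled = [
    ("lw", "R1", ()), ("lw", "R2", ()), ("lw", "R3", ()),
    ("add", "R5", ("R1", "R2")), ("lw", "R4", ()),
    ("sw", None, ("R5",)), ("sub", "R6", ("R3", "R4")), ("sw", None, ("R6",)),
]
print(count_load_use_stalls(unscheduled))  # 2
print(count_load_use_stalls(scheduled))    # 0
```

Two stall cycles disappear without adding any instructions, which is exactly the effect the reordered sequence on the slide achieves.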

Effectiveness of Scheduling (figure)



Control Hazards

• Problem: the next instruction to enter the pipe may depend on the currently executing instruction, or we may have to wait until a stage completes to determine what comes next
• Example: branch instruction
• Solutions:
  – Stall – operate sequentially until the decision can be made (wastes time)
  – Predict – guess what to do next; if the guess is correct, operate normally, and if it is wrong, clear the pipe and begin again
  – Compute the branch target address earlier


Pipeline Stall for Branch

• Stall the pipeline until the MEM stage, which determines the new PC
• Don't stall until the branch is detected (in ID)
• 3 cycles lost per branch is significant
  – with a 30% branch frequency and an ideal CPI of 1, a machine with branch stalls achieves only about half the ideal speedup
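The "half the ideal speedup" claim follows directly from the stall formula; a quick check (variable names are illustrative):

```python
# Effective CPI with branch stalls: ideal CPI plus stall cycles per instruction.
ideal_cpi = 1.0
branch_frequency = 0.30
branch_stall_cycles = 3

cpi_with_stalls = ideal_cpi + branch_frequency * branch_stall_cycles
print(round(cpi_with_stalls, 2))               # 1.9

# Fraction of the ideal pipeline speedup that survives the stalls:
print(round(ideal_cpi / cpi_with_stalls, 2))   # 0.53 -- roughly half
```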


Computing the Taken PC Earlier

• Can detect the branch condition (BEQZ, BNEZ) during ID
• Need an extra adder to compute the branch target during ID
• This reduces the stall to one cycle


Compile-Time Branch Prediction

• Assume that the branch is either always taken or always not taken
• Proceed under this assumption – if wrong, "back out" and start over


Delayed Branch

• The instruction after the branch (the branch delay slot) is executed no matter what the outcome of the branch is
• Requires that the instruction in the branch delay slot is safe to execute independent of the branch
• Effectiveness depends on the compiler


Designing Instruction Sets (MIPS) for Pipelining

• Want to break instruction execution into a reasonable number of stages of roughly equal complexity
• All instructions are the same length
  – easier to fetch and decode
• Few instruction formats (source register fields are located in the same place)
  – can begin reading registers at the same time the instruction is decoded
• Memory operands appear only in loads and stores
  – calculate the address during the execute stage and access memory in the following stage; otherwise execution would expand into separate address, memory, and execute stages
• Operands must be aligned in memory
  – we don't have to worry about a single data transfer instruction requiring two data memory accesses, so it needs only a single pipeline stage