COMP 206 Computer Architecture and Implementation Montek Singh

Mon, Sep 19, 2005
Topic: Pipelining (Intermediate Concepts)

Outline
• Pipelining basics (contd.)
• Pipelining example
• Pipelining notation and terminology
• Hazards
  – Structural hazards
  – Data hazards
  – Hazard resolution
Reading: Appendix A (HP3)

How About Control Signals?
• Key observation: control signals at stage N = Func(instruction at stage N), for N = Exec, Mem, or WrB
  – Control signals at the Exec stage = Func(Load's Exec)
  – What about Ifetch and Reg/Dec?
[Figure: pipelined datapath (IF/ID, ID/Ex, Ex/Mem, and Mem/Wr registers) with a load in the Exec stage; the Ex/Mem register receives the load's address, and the Exec-stage controls are RegDst=0, ALUSrc=1, ALUOp=Add, ExtOp=1.]

Pipeline Control
• “Main Control” generates the control signals during Reg/Dec
  – Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later
  – Control signals for Mem (MemWr, Branch) are used 2 cycles later
  – Control signals for WrB (MemtoReg, RegWr) are used 3 cycles later
[Figure: Main Control in Reg/Dec writes all control signals into the ID/Ex register. ExtOp, ALUSrc, ALUOp, and RegDst are consumed in Exec, while MemWr, Branch, MemtoReg, and RegWr pass on to the Ex/Mem register. MemWr and Branch are consumed in Mem, and MemtoReg and RegWr pass on to the Mem/Wr register for WrB.]
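The following is a minimal Python sketch of this idea. Control is decoded once in Reg/Dec and then travels down the ID/Ex, Ex/Mem, and Mem/Wr registers, and each later stage reads only the register in front of it. The `MAIN_CONTROL` decode table is hypothetical. Its values, including ALUOp="Sub" for Beq and don't-cares written as 0, are illustrative assumptions and are not taken from the slides. For the Load, R-type, Store, Beq sequence, cycle 4 shows R-type's Exec signals and Load's Mem signals, with nothing yet in WrB.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Control:
    """Control bundle produced once, by Main Control, during Reg/Dec."""
    ExtOp: int     # used in Exec
    ALUSrc: int    # used in Exec
    ALUOp: str     # used in Exec
    RegDst: int    # used in Exec (selects the destination register field)
    MemWr: int     # used in Mem
    Branch: int    # used in Mem
    MemtoReg: int  # used in WrB
    RegWr: int     # used in WrB


# ASSUMED decode table (hypothetical values); "don't care" (x) is written as 0.
MAIN_CONTROL = {
    "Load":   Control(1, 1, "Add", 0, 0, 0, 1, 1),
    "R-type": Control(0, 0, "R-type", 1, 0, 0, 0, 1),
    "Store":  Control(1, 1, "Add", 0, 1, 0, 0, 0),
    "Beq":    Control(0, 0, "Sub", 0, 0, 1, 0, 0),
}

Latch = Optional[Tuple[str, Control]]


def simulate(program: List[str], cycles: int):
    """Per cycle, record the signals that Exec, Mem, and WrB actually use."""
    id_ex: Latch = None
    ex_mem: Latch = None
    mem_wr: Latch = None
    trace = []
    for cycle in range(1, cycles + 1):
        # Each back-end stage reads ONLY the pipeline register in front of it.
        exec_sigs = None if id_ex is None else (id_ex[0], id_ex[1].ALUSrc, id_ex[1].ALUOp)
        mem_sigs = None if ex_mem is None else (ex_mem[0], ex_mem[1].MemWr, ex_mem[1].Branch)
        wrb_sigs = None if mem_wr is None else (mem_wr[0], mem_wr[1].MemtoReg, mem_wr[1].RegWr)
        trace.append((cycle, exec_sigs, mem_sigs, wrb_sigs))
        # Clock edge: the instruction in Reg/Dec (fetched last cycle) is decoded
        # into ID/Ex, and older bundles move one register further down the pipe.
        k = cycle - 2
        decoded = (program[k], MAIN_CONTROL[program[k]]) if 0 <= k < len(program) else None
        mem_wr, ex_mem, id_ex = ex_mem, id_ex, decoded
    return trace


for row in simulate(["Load", "R-type", "Store", "Beq"], 8):
    print(row)
```

Carrying the whole bundle in every register is a simplification. A real design can drop the Exec fields after Exec and the Mem fields after Mem.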

A More Extensive Pipelining Example
Cycle                     1        2        3        4        5        6        7        8
0:  Load                  Ifetch   Reg/Dec  Exec     Mem      WrB
4:  R-type                         Ifetch   Reg/Dec  Exec     Mem      WrB
8:  Store                                   Ifetch   Reg/Dec  Exec     Mem      WrB
12: Beq (target is 1000)                             Ifetch   Reg/Dec  Exec     Mem      WrB
• End of Cycle 4: Load's Mem, R-type's Exec, Store's Reg, Beq's Ifetch
• End of Cycle 5: Load's WrB, R-type's Mem, Store's Exec, Beq's Reg
• End of Cycle 6: R-type's WrB, Store's Mem, Beq's Exec
• End of Cycle 7: Store's WrB, Beq's Mem

Pipelining Example: End of Cycle 4
• 0: Load's Mem; 4: R-type's Exec; 8: Store's Reg; 12: Beq's Ifetch
[Figure: datapath snapshot at the end of cycle 4. Pipeline registers: IF/ID holds the Beq instruction, ID/Ex holds Store's busA & busB, Ex/Mem holds R-type's result, and Mem/Wr holds Load's Dout; PC = 16. Control values: RegDst=1, ALUSrc=0, ALUOp=R-type, ExtOp=x, Branch=0, MemWr=0, RegWr=0, MemtoReg=x.]

CPU Designs: Summary
• Disadvantages of the Single Cycle Processor
  – Long cycle time
  – Cycle time wasted for the faster instructions
• Multiple Clock Cycle Processor
  – Divide the instructions into smaller steps
  – Execute each step (instead of the entire instruction) in 1 cycle
• Pipelined Processor
  – Natural enhancement of the multiple clock cycle processor
  – Each functional unit used only once per instruction
  – If an instruction is going to use a functional unit, it must use it at the same stage as all other instructions
  – Pipeline control: each stage's control signal depends ONLY on the instruction that is currently in that stage

Single Cycle vs. Multiple Cycle vs. Pipelined
• Single Cycle Implementation (one long clock cycle per instruction)
  – Cycle 1: Load
  – Cycle 2: Store, which finishes early; the rest of its cycle is wasted
• Multiple Cycle Implementation (short clock cycles 1 to 10)
  – Load: Ifetch, Reg, Exec, Mem, Wr (cycles 1-5)
  – Store: Ifetch, Reg, Exec, Mem (cycles 6-9)
  – R-type: Ifetch (cycle 10), continuing past the window shown
• Pipelined Implementation (short clock cycles)
  – Load: Ifetch, Reg, Exec, Mem, Wr (cycles 1-5)
  – Store: Ifetch, Reg, Exec, Mem, Wr (cycles 2-6)
  – R-type: Ifetch, Reg, Exec, Mem, Wr (cycles 3-7)

Pipelining: Notation, Terminology etc.
• Time
  – Discrete time steps
  – Represented as 1, 2, 3, …
• Space
  – Pipe stages or segments (things that do processing)
  – Represented as P, Q, R, S (or F, D, X, M, W for the MIPS pipeline)
• Operands
  – Instructions or data items
  – Things that flow through, and are processed by, the pipeline
  – Represented as a, b, c, …
• In drawing pipelines, we conceal the obvious fact that each operand undergoes some changes in each pipe stage

Notations for Describing Pipelines
• Space-time diagram, or Gantt chart
• Reservation table by stages
  – Rows represent pipeline stages
  – Unbounded one way
  – Notation of HP3
• Reservation table by instructions
  – Rows represent operands
  – Unbounded both ways
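The two reservation-table notations can be compared side by side in a small sketch. The helper names `by_stages` and `by_instructions` are hypothetical. The sketch assumes an ideal linear pipeline with no stalls and uses the F, D, X, M, W stage names with operands a, b, c. A "." marks an idle slot. Only a finite window of time is printed, while the tables themselves are unbounded.

```python
def by_stages(stages, operands):
    """Reservation table with one row per stage; each cell names the operand in that stage."""
    width = len(stages) + len(operands) - 1   # time steps until the last operand drains
    table = {}
    for s, stage in enumerate(stages):
        row = ["."] * width
        for o, operand in enumerate(operands):
            row[o + s] = operand              # operand o occupies stage s at time o + s + 1
        table[stage] = "".join(row)
    return table


def by_instructions(stages, operands):
    """Reservation table with one row per operand; each cell names the stage it occupies."""
    width = len(stages) + len(operands) - 1
    table = {}
    for o, operand in enumerate(operands):
        row = ["."] * width
        for s, stage in enumerate(stages):
            row[o + s] = stage
        table[operand] = "".join(row)
    return table


stages, operands = list("FDXMW"), list("abc")
header = "".join(str(t) for t in range(1, len(stages) + len(operands)))
print("time ", header)
for name, row in by_stages(stages, operands).items():
    print(f"{name:5}", row)
print()
print("time ", header)
for name, row in by_instructions(stages, operands).items():
    print(f"{name:5}", row)
```

In the by-stages table the number of rows is fixed by the pipeline and only time grows. In the by-instructions table both the operand rows and the time columns keep growing.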

Basic Terms
• Filling a pipeline
• Flushing or draining a pipeline
• Stage or segment delay
  – Each stage may have a different stage delay
• Beat time (= max stage delay), or clock cycle time
• Number of stages
• End-to-end latency
  – Number of stages × beat time
• Stages are separated by latches (registers)
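The beat-time and latency definitions can be sketched with hypothetical stage delays in ps, ignoring latch delay. Every stage is clocked at the beat time, so the pipelined end-to-end latency (1000 ps here) exceeds the plain sum of the stage delays (750 ps).

```python
def beat_time(stage_delays):
    # All stages advance on the same clock, so the slowest stage sets the beat.
    return max(stage_delays)


def end_to_end_latency(stage_delays):
    # An operand spends one full beat in every stage, even in a fast one.
    return len(stage_delays) * beat_time(stage_delays)


delays = [200, 100, 150, 200, 100]   # hypothetical per-stage delays in ps
print(beat_time(delays), end_to_end_latency(delays), sum(delays))  # 200 1000 750
```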

Speedup & Throughput of a Pipeline
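A standard formulation is given below as a sketch under stated assumptions. It assumes k stages of equal delay τ (so τ is also the beat time), no latch overhead, and n operands flowing through without stalls. The first operand needs the full end-to-end latency kτ, and each later operand completes one beat after its predecessor:

```latex
\begin{aligned}
T_{\text{pipe}}(n) &= k\tau + (n-1)\tau = (k+n-1)\,\tau, \qquad T_{\text{seq}}(n) = n\,k\,\tau,\\
S(n) &= \frac{T_{\text{seq}}(n)}{T_{\text{pipe}}(n)} = \frac{nk}{k+n-1} \;\xrightarrow{\;n\to\infty\;}\; k,\\
H(n) &= \frac{n}{T_{\text{pipe}}(n)} = \frac{n}{(k+n-1)\,\tau} \;\xrightarrow{\;n\to\infty\;}\; \frac{1}{\tau}.
\end{aligned}
```

Under these assumptions, the speedup approaches the number of stages and the throughput approaches one operand per beat only when n is much larger than k.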

Pipeline Hazards: Structural Hazard
• A relation between two instructions indicating that:
  – the two instructions may want to use the same hardware resource (function unit, register file port, shared bus, cache port, etc.)
  – …at the same time
• In principle, eliminated by duplicating resources
  – Low hardware utilization
  – Increased cost
• The MIPS pipeline as designed so far has no structural hazards
  – But we had to avoid them (see the example later)
• Usually occurs when a functional unit is not fully pipelined (e.g., in a floating-point pipeline)

Example: Unified I- and D-Memory
[Figure: these pipeline diagrams are invalid because of a structural hazard on the single memory port]
[Figure: pipeline diagrams with the hazards resolved]

Resolving Structural Hazards
• Early resolution (scheduling)
  – Done well before the collision could occur, and usually at a place different from where the collision could happen
  – Example: instructions are delayed in the ID stage
• Late resolution
  – Done at the place where the collision might happen
  – Done just before the collision is about to happen
  – Example: using an arbiter or a priority encoder
    – One instruction wins
    – Others are denied access, stall, and wait for their next chance
• Why allow structural hazards in the first place?
  – Reduce cost
  – Reduce unit latency (by avoiding pipeline latch delays)
  – Hazards may be infrequent (“make the common case fast”)
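Late resolution can be sketched as a fixed-priority arbiter (the function `arbitrate` is hypothetical) guarding one shared port. Giving the memory-stage request priority over instruction fetch is an assumption here. It favors the older instruction, so the younger one is the one that stalls.

```python
def arbitrate(requests, priority):
    """Grant a single shared port to the highest-priority requester.

    requests: set of requester names asserting a request this cycle
    priority: list of names, highest priority first
    Returns (winner, losers); losers are denied access and must retry next cycle.
    """
    for name in priority:
        if name in requests:
            return name, sorted(requests - {name})
    return None, []


# Hypothetical scenario: a load in MEM and an instruction fetch in IF both
# want the single unified memory port in the same cycle.
winner, stalled = arbitrate({"IF", "MEM"}, priority=["MEM", "IF"])
print(winner, stalled)  # MEM ['IF']
```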

Example: Cost of Structural Hazard
Suppose that 40% of the instruction mix are loads or stores, and that the ideal CPI of the pipelined machine is 1. Assume that the machine with the structural hazard has a clock rate that is 5% higher than the clock rate of the machine without the hazard. Which pipeline is faster, and by how much?
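Here is a worked version of the arithmetic. It assumes, as in the unified I- and D-memory situation, that each load or store costs exactly one stall cycle on the shared memory port. That stall count is an assumption, not given in the problem statement.

```python
def time_per_instruction(cpi, clock_rate):
    """Average time per instruction = CPI x clock period = CPI / clock rate."""
    return cpi / clock_rate


ideal_cpi = 1.0
mem_fraction = 0.40   # fraction of instructions that are loads or stores
stall_cycles = 1      # ASSUMED: each load/store stalls one cycle on the shared port

# Clock rates are relative to the machine without the hazard.
t_no_hazard = time_per_instruction(ideal_cpi, clock_rate=1.00)
t_hazard = time_per_instruction(ideal_cpi + mem_fraction * stall_cycles, clock_rate=1.05)

speedup = t_hazard / t_no_hazard   # how much faster the hazard-free machine is
print(f"{speedup:.3f}")  # 1.333
```

Under that assumption, the machine without the structural hazard is about 1.33 times faster, even though the other machine has a 5% higher clock rate.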

Data Hazard: Setup
• For an instruction u:
  – D(u), the domain of u: the set of all memory locations, registers (including implicit ones), flags, condition codes, etc. that may be read by instruction u
  – R(u), the range of u: the set of all memory locations, registers (including implicit ones), flags, condition codes, etc. that may be written by instruction u
• For instructions u and v: u < v is a relation that means that instruction u precedes instruction v in the original program order (i.e., on an unpipelined machine)
  – The relation < is irreflexive, anti-symmetric, and transitive

Data Hazard: Definition
• Given two instructions u and v, such that u < v, there is a data hazard between them if any of the following conditions holds:
  – R(u) ∩ D(v) ≠ ∅ (RAW: read after write)
  – D(u) ∩ R(v) ≠ ∅ (WAR: write after read)
  – R(u) ∩ R(v) ≠ ∅ (WAW: write after write)
• The existence of one of these conditions means that changing the order in which the instructions read and write operands, relative to the order seen when executing sequentially on an unpipelined machine, could violate the intended semantics
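The three conditions can be checked directly as set intersections. In this sketch each instruction is reduced to hypothetical `D` (domain) and `R` (range) sets. The helper `alu` is illustrative and models three-register instructions only. Implicit operands such as condition codes could be added to the same sets.

```python
def alu(dst, *srcs):
    """Hypothetical model of a three-register instruction: reads srcs, writes dst."""
    return {"D": set(srcs), "R": {dst}}


def hazards(u, v):
    """Data hazards between u and v, assuming u < v in program order."""
    found = []
    if u["R"] & v["D"]:
        found.append("RAW")   # v may read something u writes
    if u["D"] & v["R"]:
        found.append("WAR")   # v may write something u reads
    if u["R"] & v["R"]:
        found.append("WAW")   # both may write the same location
    return found


print(hazards(alu("R1", "R2", "R3"), alu("R2", "R1", "R5")))  # ['RAW', 'WAR']
print(hazards(alu("R1", "R2", "R3"), alu("R1", "R4", "R5")))  # ['WAW']
```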

Why Data Hazards Occur
• Pipelining changes the relative timing of instructions
• Reads and writes occur at fixed positions of the pipeline
• So, if two instructions are “too close” (a function of the pipeline structure), the order of reads and writes could change and produce incorrect values
    XOR R1, R1, R2
    XOR R2, R1, R2
    XOR R1, R1, R2
• This instruction sequence exchanges the values in R1 and R2
• On unpipelined MIPS, back-to-back execution of the sequence produces correct results
• On the current pipelined MIPS, initiating the sequence in consecutive cycles produces incorrect results
  – Reads are early and writes are late, so the RAW hazards would be violated
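A small simulation makes the failure concrete. It assumes the three-instruction XOR swap `XOR R1, R1, R2`, `XOR R2, R1, R2`, `XOR R1, R1, R2`, register reads in stage 2 (D) and writes in stage 5 (W), and one instruction issued per cycle. It also assumes no forwarding and that reads are ordered before writes within a cycle. Starting from R1 = 5 and R2 = 3, sequential execution swaps the values. In the pipelined run, every XOR reads the original values, so 5 XOR 3 = 6 ends up in both registers.

```python
SWAP = [("R1", "R1", "R2"),   # XOR R1, R1, R2
        ("R2", "R1", "R2"),   # XOR R2, R1, R2
        ("R1", "R1", "R2")]   # XOR R1, R1, R2


def run_sequential(regs, prog):
    """Unpipelined: each instruction reads and writes before the next one starts."""
    regs = dict(regs)
    for dst, a, b in prog:
        regs[dst] = regs[a] ^ regs[b]
    return regs


def run_pipelined_no_forwarding(regs, prog, read_stage=2, write_stage=5):
    """Instruction i (0-based) issues in cycle i + 1, so it reads its operands in
    cycle i + read_stage and writes its result in cycle i + write_stage.
    Within a cycle, all reads happen before any write, and nothing is forwarded."""
    regs = dict(regs)
    results = {}
    last_cycle = len(prog) - 1 + write_stage
    for cycle in range(1, last_cycle + 1):
        for i, (dst, a, b) in enumerate(prog):
            if cycle == i + read_stage:
                results[i] = regs[a] ^ regs[b]
        for i, (dst, a, b) in enumerate(prog):
            if cycle == i + write_stage:
                regs[dst] = results[i]
    return regs


start = {"R1": 5, "R2": 3}
print(run_sequential(start, SWAP))               # {'R1': 3, 'R2': 5}
print(run_pipelined_no_forwarding(start, SWAP))  # {'R1': 6, 'R2': 6}
```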

Data Dependence and Hazards
• A true (value, flow) dependence between instructions u and v means u produces a result value that v uses
  – This is a producer-consumer relationship
  – This is a dependence based on values, not on the names of the containers of the values
• Every true dependence is a RAW hazard
• Not every RAW hazard is a true dependence
  – Any RAW hazard that cannot be removed by renaming is a true dependence
Original program
  1: A = B+C
  2: A = D+E
  3: G = A+H
  True dependence: (2, 3); RAW hazards: (1, 3), (2, 3)
Renamed program
  1: X = B+C
  2: A = D+E
  3: G = A+H
  True dependence: (2, 3); RAW hazard: (2, 3)
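The following sketch separates the two notions for the A/B/C example, with each statement reduced to (written name, set of read names). `raw_hazards` applies the R(u) ∩ D(v) test to every earlier statement. `true_dependences` keeps only the last earlier writer of each name that is read, which is the statement whose value is actually consumed. Both function names are hypothetical.

```python
def raw_hazards(prog):
    """All (u, v), 1-based with u < v, where v reads a name that u writes."""
    return [(u + 1, v + 1)
            for v, (_, reads) in enumerate(prog)
            for u in range(v)
            if prog[u][0] in reads]


def true_dependences(prog):
    """(u, v) pairs where u is the LAST writer, before v, of a name that v reads."""
    deps = set()
    for v, (_, reads) in enumerate(prog):
        for name in reads:
            writers = [u for u in range(v) if prog[u][0] == name]
            if writers:
                deps.add((writers[-1] + 1, v + 1))
    return sorted(deps)


# Each statement is (written name, set of read names).
original = [("A", {"B", "C"}), ("A", {"D", "E"}), ("G", {"A", "H"})]
renamed = [("X", {"B", "C"}), ("A", {"D", "E"}), ("G", {"A", "H"})]

print(raw_hazards(original), true_dependences(original))  # [(1, 3), (2, 3)] [(2, 3)]
print(raw_hazards(renamed), true_dependences(renamed))    # [(2, 3)] [(2, 3)]
```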

More on Hazards
• RAW hazards corresponding to value dependences are the most difficult to deal with, since they can never be eliminated
  – The second instruction is waiting for information produced by the first instruction
• WAR and WAW hazards are name dependences
  – Two instructions happen to use the same register (name), although they don't have to
  – They can often be eliminated by renaming, either in software or in hardware
    – Renaming implies the use of additional resources, hence additional cost
    – Renaming is not always possible: implicit operands such as the accumulator, PC, or condition codes cannot be renamed
  – These hazards don't cause problems for the MIPS pipeline
    – Relative timing does not change even with pipelined execution, because reads occur early and writes occur late in the pipeline
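Software renaming can be sketched in a few lines, assuming an unlimited supply of fresh names (hypothetical `P0`, `P1`, ...). Every write gets a new name, and each read is redirected to the name currently holding that register's latest value. The WAR and WAW hazards on R1 disappear, and only the true dependence from the first instruction to the second remains. The unlimited name supply is where the additional-resources cost of renaming shows up.

```python
from itertools import count


def rename(prog):
    """prog: list of (dst, [srcs]) over architectural names.
    Returns the same program over fresh names P0, P1, ... (hypothetical)."""
    fresh = count()
    latest = {}   # architectural name -> fresh name holding its latest value
    out = []
    for dst, srcs in prog:
        # Rename reads BEFORE updating the map, so "R1 = R1 op ..." reads the old R1.
        new_srcs = [latest.get(s, s) for s in srcs]
        new_dst = f"P{next(fresh)}"   # never reuse a name that was already written
        latest[dst] = new_dst
        out.append((new_dst, new_srcs))
    return out


prog = [("R1", ["R2", "R3"]),   # writes R1
        ("R4", ["R1", "R5"]),   # reads R1: RAW on instruction 1
        ("R1", ["R6", "R7"])]   # writes R1 again: WAW with 1, WAR with 2
print(rename(prog))
# [('P0', ['R2', 'R3']), ('P1', ['P0', 'R5']), ('P2', ['R6', 'R7'])]
```

A hardware renamer would also have to map the final fresh names back to architectural registers. That step is omitted here.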

The Precedence Relation
• Consider a straight-line program in original program order
• Define a relation D (the dependence relation) between pairs of instructions (u, v) as follows:
  – D(u, v) if and only if (u < v), and there is a WAR, WAW, or RAW hazard between instructions u and v
  – D is irreflexive and anti-symmetric but not transitive
• Define the precedence relation P as the transitive closure of the dependence relation D
  – P is irreflexive, anti-symmetric, and transitive
  – Represent P by the graph of its transitive reduction (the precedence graph)
      ADD R4, R5, R6
      ADD R3, R4, R5
      ADD R2, R3, R7
  – If P(u, v), then u must precede v in execution
    – The two instructions cannot be interchanged, and in a pipeline they must maintain a “sufficient” distance
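The three-ADD example shows why D is not transitive but P is. In this sketch each instruction is reduced to (writes, reads) register sets. D holds (1, 2) and (2, 3) but not (1, 3), and the transitive closure adds (1, 3). The helper names are hypothetical.

```python
def dependence(prog):
    """D(u, v), 1-based: u < v and a RAW, WAR, or WAW hazard exists."""
    rel = set()
    for v in range(len(prog)):
        for u in range(v):
            (wu, ru), (wv, rv) = prog[u], prog[v]
            if wu & rv or ru & wv or wu & wv:
                rel.add((u + 1, v + 1))
    return rel


def transitive_closure(rel):
    """Repeatedly add (a, d) whenever (a, b) and (b, d) are present."""
    closure = set(rel)
    while True:
        new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if new <= closure:
            return closure
        closure |= new


# ADD Rd, Rs, Rt modeled as (writes, reads)
prog = [({"R4"}, {"R5", "R6"}),   # ADD R4, R5, R6
        ({"R3"}, {"R4", "R5"}),   # ADD R3, R4, R5
        ({"R2"}, {"R3", "R7"})]   # ADD R2, R3, R7

D = dependence(prog)
P = transitive_closure(D)
print(sorted(D))  # [(1, 2), (2, 3)]
print(sorted(P))  # [(1, 2), (1, 3), (2, 3)]
```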

Example of Precedence Relation
  1: ADD R1, R7, R8
  2: SW 2000(R9), R8
  3: LW R3, 0(R1)
  4: LW R4, 3000(R9)
  5: ADD R5, R3, R4
  6: MUL R6, R5
• Assume that registers R7, R8, R9 are already initialized such that (R7)+(R8) = (R9)+2000 holds
[Figure: precedence graph over instructions 1 to 6]
• These eight sequences of the six instructions result in correct execution, because they respect the sequencing constraints of the precedence graph:
  4, 1, 2, 3, 5, 6
  1, 4, 2, 3, 5, 6
  1, 2, 4, 3, 5, 6
  1, 2, 3, 4, 5, 6
  4, 2, 1, 3, 5, 6
  2, 4, 1, 3, 5, 6
  2, 1, 4, 3, 5, 6
  2, 1, 3, 4, 5, 6
• We still have to ensure that they maintain “sufficient” distance in the instruction pipeline, which depends on the structure of the pipeline and the latencies of the operations
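The eight valid orders can be reproduced by enumerating every permutation that respects the precedence graph. The edges below were derived by hand and are an assumption of this sketch. They are 1→3 (R1) and 2→3, where the store and the first load reach the same address because (R7)+(R8) = (R9)+2000. The remaining edges are 3→5 and 4→5 (R3, R4) and 5→6 (R5). The load at 3000(R9) touches a different address from the store, so it is ordered only before instruction 5.

```python
from itertools import permutations

# Precedence-graph edges (u must execute before v), derived by hand.
EDGES = [(1, 3), (2, 3), (3, 5), (4, 5), (5, 6)]


def respects(order, edges):
    """True if every edge's source appears before its target in the order."""
    pos = {instr: i for i, instr in enumerate(order)}
    return all(pos[u] < pos[v] for u, v in edges)


valid = [p for p in permutations(range(1, 7)) if respects(p, EDGES)]
print(len(valid))  # 8
for p in valid:
    print(p)
```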

Data Hazard: Effect on Pipelining
  1: ADD R1, R2, R3
  2: SUB R4, R5, R1
  3: AND R6, R1, R7
  4: OR R8, R1, R9
  5: XOR R10, R11
• RAW hazards: (1, 2), (1, 3), (1, 4), (1, 5)
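The listed RAW pairs can be turned into issue cycles for a 5-stage pipeline without forwarding. This sketch assumes in-order issue of one instruction per cycle, operand reads in D, and result writes in W. The `split_cycle` flag models a register file written in the first half of a cycle and read in the second half. That is a design option, not something this slide states. Only the (1, 2) pair actually costs bubbles. Once instruction 2 is pushed back, instructions 3 to 5 are already far enough behind instruction 1. The function `schedule` is hypothetical.

```python
STAGE = {"F": 1, "D": 2, "X": 3, "M": 4, "W": 5}


def schedule(raw_pairs, n, split_cycle=True):
    """Earliest issue cycle of instructions 1..n (in order, one per cycle) such that
    every consumer reads the register file in D no earlier than its producer writes in W."""
    issue = {}
    for v in range(1, n + 1):
        cycle = issue[v - 1] + 1 if v > 1 else 1
        for u, w in raw_pairs:
            if w == v:
                write_cycle = issue[u] + STAGE["W"] - 1
                # Split cycle: a read in the same cycle as the write already sees it.
                earliest_read = write_cycle if split_cycle else write_cycle + 1
                cycle = max(cycle, earliest_read - (STAGE["D"] - 1))
        issue[v] = cycle
    return issue


raw = [(1, 2), (1, 3), (1, 4), (1, 5)]
with_split = schedule(raw, 5)
without_split = schedule(raw, 5, split_cycle=False)
print(with_split, "stalls:", with_split[5] - 5)        # 2 stalls
print(without_split, "stalls:", without_split[5] - 5)  # 3 stalls
```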

Value Forwarding/Bypassing
• There is slack between how soon a value is actually available and how late it is actually required in the pipeline
  – The result of an R-type is available at the end of the X stage
  – The operand of a dependent R-type is not needed until the beginning of the X stage
• Communication of values among instructions happens through the register file
  – Globally known names of containers of values
  – Accessed at fixed stages of the pipeline (read in D, written in W)
• Forwarding/bypassing/short-circuiting corresponds to establishing a direct path between the producer of a value and its consumer, bypassing the container
  – Allows us to exploit slack
  – Requires additional resources (forwarding paths and controller)
    – Identify all forwarding paths needed on MIPS (the figure in the book is incomplete)

Forwarding: Example 2
  1: ADD R1, R2, R3
  2: LW R4, 0(R1)
  3: SW 12(R1), R4
[Figure: execution without forwarding, stalling as necessary]
[Figure: execution with forwarding]

Forwarding & Stalling: Example 3
  L1: LW R2, 40(R8)
  L2: LW R3, 60(R8)
  A: ADD R4, R2, R3
  S: SW 60(R8), R4
[Figure: execution without forwarding, stalling where needed]
[Figure: forwarding attempts annotated “Cannot go backward in time!”, “Cannot jump ahead in time!”, and “Bad attempt at forwarding!”]
• A load has a latency of one cycle that cannot be hidden, as seen between L2 and A
[Figure: execution with forwarding; a stall is still needed]
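With full forwarding in place, one hazard still remains, and this sketch detects it. It assumes results can be forwarded into the start of X from the EX/MEM and MEM/WB registers. Under that assumption, only an instruction that immediately follows a load and reads the loaded register must wait one cycle, because the load's data exists only at the end of M. Treating a store's data register like any other source is a conservative simplification here. The helper `load_use_stalls` is hypothetical.

```python
def load_use_stalls(prog):
    """Labels of instructions that need one bubble: they read the destination
    of a load that is immediately ahead of them."""
    stalled = []
    for prev, curr in zip(prog, prog[1:]):
        _, prev_op, prev_dst, _ = prev
        label, _, _, srcs = curr
        if prev_op == "LW" and prev_dst in srcs:
            stalled.append(label)
    return stalled


# (label, op, destination, source registers); addresses are reduced to base registers.
example3 = [("L1", "LW", "R2", ["R8"]),
            ("L2", "LW", "R3", ["R8"]),
            ("A", "ADD", "R4", ["R2", "R3"]),
            ("S", "SW", None, ["R8", "R4"])]
print(load_use_stalls(example3))  # ['A']
```

L1 is two instructions ahead of A, so its value can be forwarded in time. Only L2 forces the stall.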

Forwarding & Stalling: Example 4
  L: LW R1, 0(R1)
  S: SUB R4, R1, R5
  A: AND R6, R1, R7
  O: OR R8, R1, R9
• No forwarding is needed from L to A: this can be resolved by writing the register file in the first half of the cycle and reading it in the second half of the cycle.

Load Data Forwarding
[Figure: datapath with the ID/EX, EX/MEM, and MEM/WB registers, the register file, data memory, and a forwarding unit (inputs Rs, Rt, Rd) driving the Forward A and Forward B selects.]
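The forwarding unit's select logic can be sketched for one ALU operand, written in the style of the usual textbook forwarding conditions. The names `Dest` and `forward_select` are hypothetical. `rd` stands for the destination register number after the destination-field selection, so for a load it is the loaded register. EX/MEM wins over MEM/WB because it holds the more recent result. Register 0 is never forwarded, because it is hard-wired to zero in MIPS.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Dest:
    reg_write: bool       # will this instruction write the register file?
    rd: Optional[int]     # destination register number, None if there is none


def forward_select(src: int, ex_mem: Dest, mem_wb: Dest) -> str:
    """Choose where one ALU operand (Rs or Rt held in ID/EX) comes from."""
    if ex_mem.reg_write and ex_mem.rd not in (None, 0) and ex_mem.rd == src:
        return "EX/MEM"   # newest value: result of the instruction one ahead
    if mem_wb.reg_write and mem_wb.rd not in (None, 0) and mem_wb.rd == src:
        return "MEM/WB"   # ALU result or load data from two instructions ahead
    return "REGFILE"      # no match: use the value read in D


# ADD R1, R2, R3 sits in EX/MEM while SUB R4, R5, R1 is in ID/EX.
print(forward_select(5, Dest(True, 1), Dest(False, None)))  # REGFILE (Rs = R5)
print(forward_select(1, Dest(True, 1), Dest(False, None)))  # EX/MEM  (Rt = R1)
```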