361 Computer Architecture Lecture 12: Designing a Pipeline Processor

Overview of a Multiple Cycle Implementation
° The root of the single cycle processor’s problems:
  • The cycle time has to be long enough for the slowest instruction
° Solution:
  • Break the instruction into smaller steps
  • Execute each step (instead of the entire instruction) in one cycle
    - Cycle time: the time it takes to execute the longest step
    - Keep all the steps similar in length
  • This is the essence of the multiple cycle processor
° The advantages of the multiple cycle processor:
  • Cycle time is much shorter
  • Different instructions take different numbers of cycles to complete
    - Load takes five cycles
    - Jump only takes three cycles
  • Allows a functional unit to be used more than once per instruction

Multiple Cycle Processor
° MCP: allows a functional unit to be used more than once per instruction
[Figure: multiple cycle datapath. Components: PC, Ideal Memory (RAdr, WrAdr, Din, Dout), Instruction Reg, Reg File (Ra, Rb, Rw, busA, busB, busW), Extend, ALU, ALU Control, Target register, and muxes. Control signals: PCWrCond, PCSrc, BrWr, IorD, MemWr, IRWr, RegDst, RegWr, ExtOp, MemtoReg, ALUSelA, ALUSelB, ALUOp; the ALU also produces Zero.]

Outline of Today’s Lecture
° Recap and Introduction
° Introduction to the Concept of Pipelined Processor
° Pipelined Datapath and Pipelined Control
° How to Avoid Race Condition in a Pipeline Design?
° Pipeline Example: Instructions Interaction
° Summary

Pipelining is Natural!
° Laundry Example
° Sammy, Marc, Griffy, Albert each have one load of clothes to wash, dry, and fold (loads A, B, C, D)
° Washer takes 30 minutes
° Dryer takes 30 minutes
° “Folder” takes 30 minutes
° “Stasher” takes 30 minutes to put clothes into drawers

Sequential Laundry
[Figure: timeline from 6 PM to 2 AM. Tasks A, B, C, D run one after another in task order, each as four 30-minute steps.]
° Sequential laundry takes 8 hours for 4 loads
° If they learned pipelining, how long would laundry take?

Pipelined Laundry: Start work ASAP
[Figure: timeline starting at 6 PM. Tasks A, B, C, D overlap in task order, each starting 30 minutes after the previous one.]
° Pipelined laundry takes 3.5 hours for 4 loads!
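
The two laundry totals can be checked with a few lines of arithmetic. This is a minimal sketch, assuming four 30-minute steps per load (washer, dryer, folder, stasher) and that a new load can start as soon as the washer is free; the function names are illustrative and not part of the lecture.

```python
STAGE_MINUTES = 30  # washer, dryer, folder, stasher each take 30 minutes
NUM_STAGES = 4
NUM_LOADS = 4       # loads A, B, C, D


def sequential_minutes(loads, stages, stage_minutes):
    """Each load runs all of its steps before the next load starts."""
    return loads * stages * stage_minutes


def pipelined_minutes(loads, stages, stage_minutes):
    """The first load needs `stages` steps; every later load finishes one step after the previous one."""
    return (stages + loads - 1) * stage_minutes


seq = sequential_minutes(NUM_LOADS, NUM_STAGES, STAGE_MINUTES)
pipe = pipelined_minutes(NUM_LOADS, NUM_STAGES, STAGE_MINUTES)
print(f"sequential: {seq / 60} hours, pipelined: {pipe / 60} hours")
```

The latency of a single load is still 2 hours in both cases; only the total for the whole workload shrinks.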

Pipelining Lessons
[Figure: the pipelined laundry timeline, tasks A to D in task order, 30-minute steps starting at 6 PM.]
° Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
° Multiple tasks operate simultaneously using different resources
° Potential speedup = number of pipe stages
° Pipeline rate is limited by the slowest pipeline stage
° Unbalanced lengths of pipe stages reduce speedup
° Time to “fill” the pipeline and time to “drain” it reduce speedup
° Stall for dependences

Why Pipeline?
° Suppose we execute 100 instructions
° Single Cycle Machine
  • 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
° Multicycle Machine
  • 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns
° Ideal pipelined machine
  • 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
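
The three execution times above follow from cycle time, CPI, and instruction count. The sketch below reproduces them using only the numbers on the slide; the function names and the speedup ratio at the end are additions for illustration.

```python
def single_cycle_ns(instructions, cycle_ns=45, cpi=1):
    """One long cycle per instruction."""
    return instructions * cpi * cycle_ns


def multicycle_ns(instructions, cycle_ns=10, cpi=4.6):
    """Short cycles, but several of them per instruction (CPI from the instruction mix)."""
    return instructions * cpi * cycle_ns


def pipelined_ns(instructions, cycle_ns=10, cpi=1, drain_cycles=4):
    """One instruction completes per cycle, plus the cycles needed to drain the pipeline."""
    return (instructions * cpi + drain_cycles) * cycle_ns


n = 100
t_single = single_cycle_ns(n)
t_multi = multicycle_ns(n)
t_pipe = pipelined_ns(n)
print(f"single cycle: {t_single:.0f} ns")
print(f"multicycle:   {t_multi:.0f} ns")
print(f"pipelined:    {t_pipe:.0f} ns")
print(f"speedup over single cycle: {t_single / t_pipe:.2f}x")
```

With these numbers the ideal pipeline is a little over four times faster than the single cycle machine, short of the five-stage ideal because the single cycle machine's 45 ns cycle is less than five 10 ns stages.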

Timing Diagram of a Load
[Figure: timing diagram of a load, divided into Instruction Fetch, Instr Decode / Reg. Fetch, Address, Data Memory, and Reg Wr phases. Signals shown changing from Old Value to New Value: Clk, PC, Rs/Rt/Rd/Op/Func, ALUctr, ExtOp, ALUSrc, RegWr, busA, busB, Address, busW. Delays labeled in order: Clk-to-Q, Instruction Memory Access Time, Delay through Control Logic, Register File Access Time, Delay through Extender & Mux, ALU Delay, Data Memory Access Time, Register File Write Time.]

The Five Stages of Load
Cycle 1: Ifetch | Cycle 2: Reg/Dec | Cycle 3: Exec | Cycle 4: Mem | Cycle 5: Wr
° Ifetch: Instruction Fetch
  • Fetch the instruction from the Instruction Memory
° Reg/Dec: Registers Fetch and Instruction Decode
° Exec: Calculate the memory address
° Mem: Read the data from the Data Memory
° Wr: Write the data back to the register file

Pipelining the Load Instruction
[Figure: three lw instructions over Cycles 1 to 7. The 1st lw passes through Ifetch, Reg/Dec, Exec, Mem, Wr in Cycles 1 to 5; the 2nd lw in Cycles 2 to 6; the 3rd lw in Cycles 3 to 7.]
° The five independent functional units in the pipeline datapath are:
  • Instruction Memory for the Ifetch stage
  • Register File’s Read ports (busA and busB) for the Reg/Dec stage
  • ALU for the Exec stage
  • Data Memory for the Mem stage
  • Register File’s Write port (busW) for the Wr stage
° One instruction enters the pipeline every cycle
  • One instruction comes out of the pipeline (completes) every cycle
  • The “Effective” Cycles per Instruction (CPI) is 1
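
The overlap of the three loads can be generated mechanically. This is a small sketch, assuming one instruction issues per cycle and no stalls; `schedule`, `render`, and `completions` are hypothetical helper names, not part of the lecture.

```python
STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]


def schedule(num_instructions):
    """Map each instruction (0-based) to {cycle: stage}, issuing one instruction per cycle."""
    return {
        i: {i + 1 + offset: stage for offset, stage in enumerate(STAGES)}
        for i in range(num_instructions)
    }


def render(sched):
    """Draw the schedule as one text row per instruction, one column per cycle."""
    last_cycle = max(max(cycles) for cycles in sched.values())
    lines = []
    for i, cycles in sched.items():
        cells = [f"{cycles.get(c, ''):<8}" for c in range(1, last_cycle + 1)]
        lines.append(f"lw {i + 1}: " + "".join(cells).rstrip())
    return "\n".join(lines)


def completions(sched):
    """Cycles in which some instruction is in its Wr stage."""
    return sorted(c for cycles in sched.values() for c, s in cycles.items() if s == "Wr")


sched = schedule(3)
print(render(sched))
print("Wr cycles:", completions(sched))
```

Once the pipeline is full, the Wr cycles are consecutive, which is what an effective CPI of 1 means.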

Conventional Pipelined Execution Representation
[Figure: instructions listed down the page in Program Flow order, Time running across. Each instruction passes through IFetch, Dcd, Exec, Mem, WB, starting one cycle after the instruction before it.]

Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: three clock timelines.
Single Cycle Implementation: Load takes Cycle 1; Store takes Cycle 2, with part of that cycle marked Waste.
Multiple Cycle Implementation (Cycles 1 to 10): Load runs Ifetch, Reg, Exec, Mem, Wr; then Store runs Ifetch, Reg, Exec, Mem; then R-type begins with Ifetch.
Pipeline Implementation: Load (Ifetch, Reg, Exec, Mem, Wr), Store (Ifetch, Reg, Exec, Mem, Wr), and R-type (Ifetch, Reg, Exec, Mem, Wr) overlap, each starting one cycle after the previous one.]

Why Pipeline? Because the resources are there!
[Figure: Inst 0 to Inst 4 in instruction order against Time (clock cycles). Each instruction passes through Im, Reg, ALU, Dm, Reg, offset by one cycle from the previous one, so in a full pipeline every resource is busy with a different instruction.]

Can pipelining get us into trouble?
° Yes: Pipeline Hazards
  • structural hazards: attempt to use the same resource two different ways at the same time
    - E.g., a combined washer/dryer would be a structural hazard, or the folder busy doing something else (watching TV)
  • data hazards: attempt to use an item before it is ready
    - E.g., one sock of a pair in the dryer and one in the washer; can’t fold until the sock gets from the washer through the dryer
    - instruction depends on the result of a prior instruction still in the pipeline
  • control hazards: attempt to make a decision before the condition is evaluated
    - E.g., washing football uniforms and needing to get the proper detergent level; need to see the result after the dryer before putting the next load in
    - branch instructions
° Can always resolve hazards by waiting
  • pipeline control must detect the hazard
  • take action (or delay action) to resolve hazards

Single Memory is a Structural Hazard
[Figure: Load followed by Instr 1 to Instr 4 in instruction order against Time (clock cycles). Each passes through Mem, Reg, ALU, Mem, Reg using a single memory, so Load’s data access falls in the same cycle as a later instruction’s fetch.]
° Detection is easy in this case! (right half highlight means read, left half write)

Structural Hazards limit performance
° Example: if there are 1.3 memory accesses per instruction and only one memory access per cycle, then
  • average CPI ≥ 1.3
  • otherwise the resource is more than 100% utilized
  • More on Hazards later
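
The bound in this example comes from dividing the demand on the memory port by its capacity. The sketch below is one way to state that; it assumes an ideal CPI of 1 when memory is not the bottleneck, and the function names are illustrative.

```python
def memory_port_utilization(accesses_per_instruction, cpi, accesses_per_cycle=1.0):
    """Fraction of the memory port's capacity that is used at a given average CPI."""
    return accesses_per_instruction / (cpi * accesses_per_cycle)


def min_average_cpi(accesses_per_instruction, accesses_per_cycle=1.0, ideal_cpi=1.0):
    """Smallest average CPI that keeps the memory port at or below 100% utilization."""
    return max(ideal_cpi, accesses_per_instruction / accesses_per_cycle)


demand = 1.3  # memory accesses per instruction, from the slide
print(f"utilization at CPI 1.0: {memory_port_utilization(demand, 1.0):.0%}")
print(f"minimum average CPI:    {min_average_cpi(demand):.1f}")
```

Running at a CPI of 1 would ask the single port for 130% of what it can deliver, so the average CPI cannot drop below 1.3.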

Pipelining the R-type and Load Instruction
[Figure: Cycles 1 to 9. Instructions enter one per cycle in the order R-type, R-type, Load, R-type, R-type. R-type uses Ifetch, Reg/Dec, Exec, Wr; Load uses Ifetch, Reg/Dec, Exec, Mem, Wr. Callout: “Oops! We have a problem!”]
° We have a problem:
  • Two instructions try to write to the register file at the same time!
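
The collision can be found by computing the cycle in which each instruction reaches its Wr stage. This is a minimal sketch, assuming the issue order read from the figure (R-type, R-type, Load, R-type, R-type, one per cycle starting at Cycle 1); `write_port_conflicts` is a hypothetical helper.

```python
from collections import defaultdict

STAGES = {
    "R-type": ["Ifetch", "Reg/Dec", "Exec", "Wr"],
    "Load": ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"],
}


def write_port_conflicts(program):
    """Return {cycle: [(issue_cycle, kind), ...]} for cycles with more than one register write."""
    writers = defaultdict(list)
    for issue_cycle, kind in enumerate(program, start=1):
        wr_cycle = issue_cycle + STAGES[kind].index("Wr")
        writers[wr_cycle].append((issue_cycle, kind))
    return {cycle: who for cycle, who in writers.items() if len(who) > 1}


program = ["R-type", "R-type", "Load", "R-type", "R-type"]  # order assumed from the figure
conflicts = write_port_conflicts(program)
print(conflicts)
```

Under these assumptions the Load issued in Cycle 3 and the R-type issued in Cycle 4 both reach Wr in Cycle 7, and the register file has only one write port.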

The Four Stages of R-type
Cycle 1: Ifetch | Cycle 2: Reg/Dec | Cycle 3: Exec | Cycle 4: Wr
° Ifetch: Instruction Fetch
  • Fetch the instruction from the Instruction Memory
° Reg/Dec: Registers Fetch and Instruction Decode
° Exec: ALU operates on the two register operands
° Wr: Write the ALU output back to the register file

Important Observation
° Each functional unit can only be used once per instruction
° Each functional unit must be used at the same stage for all instructions:
  • Load uses Register File’s Write Port during its 5th stage
    - Load: 1 Ifetch, 2 Reg/Dec, 3 Exec, 4 Mem, 5 Wr
  • R-type uses Register File’s Write Port during its 4th stage
    - R-type: 1 Ifetch, 2 Reg/Dec, 3 Exec, 4 Wr

Solution 1: Insert “Bubble” into the Pipeline
[Figure: Cycles 1 to 9. A Load is followed by R-type instructions; a “Pipeline Bubble” delays the later R-type instructions by one cycle so that no two instructions write in the same cycle.]
° Insert a “bubble” into the pipeline to prevent 2 writes in the same cycle
  • The control logic can be complex
° No instruction is completed during Cycle 5:
  • The “Effective” CPI for load is > 1

Solution 2: Delay R-type’s Write by One Cycle
° Delay R-type’s register write by one cycle:
  • Now R-type instructions also use Reg File’s write port at Stage 5
  • Mem stage is a NOOP stage: nothing is being done
  • R-type: 1 Ifetch, 2 Reg/Dec, 3 Exec, 4 Mem, 5 Wr
[Figure: Cycles 1 to 9. R-type and Load instructions enter one per cycle; every instruction now passes through Ifetch, Reg/Dec, Exec, Mem, Wr, so each Wr falls in a different cycle.]
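
The rule that each functional unit must be used at the same stage by every instruction can be checked directly against a table of stage sequences. The sketch below compares the original four-stage R-type with the padded five-stage version; the table layout and function names are illustrative assumptions.

```python
def unit_stage_positions(stage_table, unit):
    """Map each instruction type to the 1-based stage number at which it uses `unit`."""
    return {
        kind: stages.index(unit) + 1
        for kind, stages in stage_table.items()
        if unit in stages
    }


def uses_unit_consistently(stage_table, unit):
    """True when every instruction type that uses `unit` does so at the same stage number."""
    return len(set(unit_stage_positions(stage_table, unit).values())) <= 1


before = {
    "Load": ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"],
    "R-type": ["Ifetch", "Reg/Dec", "Exec", "Wr"],
}
after = {
    "Load": ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"],
    # Mem is a NOOP slot for R-type: it only delays the register write by one cycle.
    "R-type": ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"],
}

print("before:", unit_stage_positions(before, "Wr"), uses_unit_consistently(before, "Wr"))
print("after: ", unit_stage_positions(after, "Wr"), uses_unit_consistently(after, "Wr"))
```

With the write port always used at Stage 5, two instructions issued in different cycles can never write in the same cycle, and no bubble logic is needed.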

The Four Stages of Store
Cycle 1: Ifetch | Cycle 2: Reg/Dec | Cycle 3: Exec | Cycle 4: Mem | (Wr slot unused)
° Ifetch: Instruction Fetch
  • Fetch the instruction from the Instruction Memory
° Reg/Dec: Registers Fetch and Instruction Decode
° Exec: Calculate the memory address
° Mem: Write the data into the Data Memory

The Four Stages of Beq
Cycle 1: Ifetch | Cycle 2: Reg/Dec | Cycle 3: Exec | Cycle 4: Mem | (Wr slot unused)
° Ifetch: Instruction Fetch
  • Fetch the instruction from the Instruction Memory
° Reg/Dec: Registers Fetch and Instruction Decode
° Exec: ALU compares the two register operands
  • Adder calculates the branch target address
° Mem: If the registers we compared in the Exec stage are the same,
  • Write the branch target address into the PC

A Pipelined Datapath
[Figure: pipelined datapath. The stages Ifetch, Reg/Dec, Exec, Mem, Wr are separated by the IF/ID, ID/Ex, Ex/Mem, and Mem/Wr registers, all clocked by Clk. Components: PC, IUnit, RFile (Ra, Rb, Rw, Di, busA, busB), Exec Unit, Data Mem (RA, WA, Di, Do), and muxes. Control signals: ExtOp, ALUOp, ALUSrc, RegDst, Branch, MemWr, MemtoReg, RegWr; the Exec Unit produces Zero. PC+4, Rs, Rt, Rd, and Imm16 are carried forward through the pipeline registers.]

The Instruction Fetch Stage
° Location 10: lw $1, 0x100($2)    $1 <- Mem[($2) + 0x100]
[Figure: pipelined datapath with “You are here!” at the Ifetch stage. IF/ID: lw $1, 100($2); PC = 14.]

A Detail View of the Instruction Unit
° Location 10: lw $1, 0x100($2)
[Figure: instruction unit detail with “You are here!” at the Ifetch stage. The PC drives the Address input of the Instruction Memory; an Adder adds “4” to the PC; a mux selects the next PC. IF/ID: lw $1, 100($2); PC = 14.]

The Decode / Register Fetch Stage
° Location 10: lw $1, 0x100($2)    $1 <- Mem[($2) + 0x100]
[Figure: pipelined datapath with “You are here!” at the Reg/Dec stage. ID/Ex: Reg. 2 & 0x100.]

Load’s Address Calculation Stage
° Location 10: lw $1, 0x100($2)    $1 <- Mem[($2) + 0x100]
[Figure: pipelined datapath with “You are here!” at the Exec stage. Control signals: ExtOp=1, ALUSrc=1, ALUOp=Add, RegDst=0. Ex/Mem: Load’s Address.]

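A Detail View of the Execution Unit
[Figure: execution unit detail between the ID/Ex and Ex/Mem registers, with “You are here!” at the Exec stage. An Adder forms the 32-bit branch Target from PC+4 and imm16 shifted left by 2. The Extender (ExtOp=1) widens imm16 to 32 bits, and a Mux (ALUSrc=1) selects it instead of busB as the second ALU input. ALU Control turns ALUOp=Add into the 3-bit ALUctr; the ALU takes busA and produces Zero and the 32-bit ALUout. Ex/Mem: Load’s Memory Address.]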

Load’s Memory Access Stage
° Location 10: lw $1, 0x100($2)    $1 <- Mem[($2) + 0x100]
[Figure: pipelined datapath with “You are here!” at the Mem stage. Control signals: Branch=0, MemWr=0. Mem/Wr: Load’s Data.]

Load’s Write Back Stage
° Location 10: lw $1, 0x100($2)    $1 <- Mem[($2) + 0x100]
[Figure: pipelined datapath with “You are somewhere out there!” at the Wr stage. Control signals: RegWr=1, MemtoReg=1.]

How About Control Signals?
° Key Observation: Control Signals at Stage N = Func (Instr. at Stage N)
  • N = Exec, Mem, or Wr
° Example: Control Signals at Exec Stage = Func(Load’s Exec)
[Figure: pipelined datapath with Load in the Exec stage. Control signals: ExtOp=1, ALUSrc=1, ALUOp=Add, RegDst=0. Ex/Mem: Load’s Address.]

Pipeline Control
° The Main Control generates the control signals during Reg/Dec
  • Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later
  • Control signals for Mem (MemWr, Branch) are used 2 cycles later
  • Control signals for Wr (MemtoReg, RegWr) are used 3 cycles later
[Figure: Main Control sits in Reg/Dec and feeds the ID/Ex register, which holds ExtOp, ALUSrc, ALUOp, RegDst, MemWr, Branch, MemtoReg, RegWr. Exec consumes ExtOp, ALUSrc, ALUOp, RegDst; the Ex/Mem register holds MemWr, Branch, MemtoReg, RegWr. Mem consumes MemWr and Branch; the Mem/Wr register holds MemtoReg and RegWr for the Wr stage.]
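
The way control signals travel with an instruction can be modeled as dictionaries handed from one pipeline register to the next, each stage consuming its own signals and forwarding the rest. This is a rough sketch: the signal grouping follows the slide's figure (with MemtoReg and RegWr taken as the Wr-stage signals), the load's values are the ones shown on the load walkthrough slides, and `control_pipeline` is a hypothetical helper.

```python
EXEC_SIGNALS = ("ExtOp", "ALUSrc", "ALUOp", "RegDst")
MEM_SIGNALS = ("MemWr", "Branch")
WR_SIGNALS = ("MemtoReg", "RegWr")

# Control word produced by the Main Control for a load during Reg/Dec.
LOAD_CONTROL = {
    "ExtOp": 1, "ALUSrc": 1, "ALUOp": "Add", "RegDst": 0,
    "MemWr": 0, "Branch": 0,
    "MemtoReg": 1, "RegWr": 1,
}


def control_pipeline(decoded):
    """Return the signals each later stage sees, keyed by stage name."""
    id_ex = dict(decoded)                                        # latched at the end of Reg/Dec
    ex_mem = {k: id_ex[k] for k in MEM_SIGNALS + WR_SIGNALS}     # Exec signals are dropped here
    mem_wr = {k: ex_mem[k] for k in WR_SIGNALS}                  # Mem signals are dropped here
    return {
        "Exec": {k: id_ex[k] for k in EXEC_SIGNALS},  # used 1 cycle after decode
        "Mem": {k: ex_mem[k] for k in MEM_SIGNALS},   # used 2 cycles after decode
        "Wr": {k: mem_wr[k] for k in WR_SIGNALS},     # used 3 cycles after decode
    }


for stage, signals in control_pipeline(LOAD_CONTROL).items():
    print(stage, signals)
```

Because every stage reads its signals from the register in front of it, the controls at a stage depend only on the instruction currently in that stage.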

Beginning of the Wr Stage: A Real World Problem
[Figure: the Mem/Wr register drives RegAdr and RegWr into the Reg File; the Ex/Mem register drives WrAdr and MemWr into the Data Memory. A timing diagram shows Clk and the Clk-to-Q delays of RegAdr, RegWr, WrAdr, and MemWr.]
° At the beginning of the Wr stage, we have a problem if:
  • RegAdr’s (Rd or Rt) Clk-to-Q > RegWr’s Clk-to-Q
° Similarly, at the beginning of the Mem stage, we have a problem if:
  • WrAdr’s Clk-to-Q > MemWr’s Clk-to-Q
° We have a race condition between Address and Write Enable!

The Pipeline Problem
° Multiple Cycle design prevents a race condition between Addr and WrEn:
  • Make sure Address is stable by the end of Cycle N
  • Assert WrEn during Cycle N + 1
° This approach can NOT be used in the pipeline design because:
  • Must be able to write the register file every cycle
  • Must be able to write the data memory every cycle
[Figure: Store, Store, and R-type overlap in the pipeline, one starting per cycle, so a memory write or a register write can be needed in every cycle.]

Synchronize Register File & Synchronize Memory
° Solution: AND the Write Enable signal with the Clock
  • This is the ONLY place where gating the clock is used
  • MUST consult circuit expert to ensure no timing violation:
    - Example: Clock High Time > Write Access Delay
[Figure: Synchronize Memory and Register File. The incoming I_Addr, I_Data, and I_WrEn are captured on the Clk edge and presented to the Reg File or Memory as Address, Data, and C_WrEn.]
° Address, Data, and WrEn must be stable at least 1 set-up time before the Clk edge
° Write occurs in the cycle following the clock edge that captures the signals

A More Extensive Pipelining Example
[Figure: Cycles 1 to 8. 0: Load, 4: R-type, 8: Store, and 12: Beq (target is 1000) enter the pipeline one per cycle, each passing through Ifetch, Reg/Dec, Exec, Mem, Wr.]
° End of Cycle 4: Load’s Mem, R-type’s Exec, Store’s Reg, Beq’s Ifetch
° End of Cycle 5: Load’s Wr, R-type’s Mem, Store’s Exec, Beq’s Reg
° End of Cycle 6: R-type’s Wr, Store’s Mem, Beq’s Exec
° End of Cycle 7: Store’s Wr, Beq’s Mem
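
The four snapshots listed above can be derived from the issue cycle of each instruction. This is a minimal sketch, assuming every instruction occupies all five stage slots and one instruction issues per cycle starting at Cycle 1; `snapshot` is a hypothetical helper.

```python
STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]
PROGRAM = [(0, "Load"), (4, "R-type"), (8, "Store"), (12, "Beq")]  # (address, instruction)


def snapshot(cycle):
    """Return {"address: name": stage} for every instruction that is in the pipeline during `cycle`."""
    in_flight = {}
    for issue_cycle, (address, name) in enumerate(PROGRAM, start=1):
        stage_index = cycle - issue_cycle
        if 0 <= stage_index < len(STAGES):
            in_flight[f"{address}: {name}"] = STAGES[stage_index]
    return in_flight


for cycle in range(4, 8):
    print(f"Cycle {cycle}: {snapshot(cycle)}")
```

By Cycle 7 the Load and the R-type have left the pipeline, so only the Store and the Beq remain from these four instructions.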

Pipelining Example: End of Cycle 4
° 0: Load’s Mem   4: R-type’s Exec   8: Store’s Reg   12: Beq’s Ifetch
[Figure: pipelined datapath at the end of Cycle 4.
Pipeline registers: IF/ID: Beq Instruction; ID/Ex: Store’s busA & B; Ex/Mem: R-type’s Result; Mem/Wr: Load’s Dout. PC = 16.
Control signals: Exec (R-type): ExtOp=x, ALUSrc=0, ALUOp=R-type, RegDst=1; Mem (Load): MemWr=0, Branch=0; Wr: RegWr=0, MemtoReg=x.]

Pipelining Example: End of Cycle 5
° 0: Lw’s Wr   4: R’s Mem   8: Store’s Exec   12: Beq’s Reg   16: R’s Ifetch
[Figure: pipelined datapath at the end of Cycle 5.
Pipeline registers: IF/ID: Instruction @ 16; ID/Ex: Beq’s busA & B; Ex/Mem: Store’s Address; Mem/Wr: R-type’s Result. PC = 20.
Control signals: Exec (Store): ExtOp=1, ALUSrc=1, ALUOp=Add, RegDst=x; Mem (R-type): MemWr=0, Branch=0; Wr (Load): RegWr=1, MemtoReg=1.]

Pipelining Example: End of Cycle 6
° 4: R’s Wr   8: Store’s Mem   12: Beq’s Exec   16: R’s Reg   20: R’s Ifetch
[Figure: pipelined datapath at the end of Cycle 6.
Pipeline registers: IF/ID: Instruction @ 20; ID/Ex: R-type’s busA & B; Ex/Mem: Beq’s Results; Mem/Wr: Nothing for Store. PC = 24.
Control signals: Exec (Beq): ExtOp=1, ALUSrc=0, ALUOp=Sub, RegDst=x; Mem (Store): MemWr=1, Branch=0; Wr (R-type): RegWr=1, MemtoReg=0.]

Pipelining Example: End of Cycle 7
° 8: Store’s Wr   12: Beq’s Mem   16: R’s Exec   20: R’s Reg   24: R’s Ifetch
[Figure: pipelined datapath at the end of Cycle 7.
Pipeline registers: IF/ID: Instruction @ 24; ID/Ex: R-type’s busA & B; Ex/Mem: R-type’s Results; Mem/Wr: Nothing for Beq. PC = 1000.
Control signals: Exec (R-type): ExtOp=x, ALUSrc=0, ALUOp=R-type, RegDst=1; Mem (Beq): MemWr=0, Branch=1; Wr (Store): RegWr=0, MemtoReg=x.]

The Delay Branch Phenomenon
[Figure: Cycles 4 to 11. 12: Beq (target is 1000) is fetched in Cycle 4, followed one per cycle by 16: R-type, 20: R-type, 24: R-type, and then 1000: Target of Br.]
° Although Beq is fetched during Cycle 4:
  • Target address is NOT written into the PC until the end of Cycle 7
  • Branch’s target is NOT fetched until Cycle 8
  • 3-instruction delay before the branch takes effect
° This is referred to as Branch Hazard:
  • Clever design techniques can reduce the delay to ONE instruction
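
The three-instruction delay follows from the stage in which the branch updates the PC. The sketch below computes it for this example. The slide does not say which technique brings the delay down to one instruction, so the second call, which resolves the branch in Reg/Dec, is only a hypothetical illustration; `branch_delay` is an illustrative name.

```python
STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]


def branch_delay(fetch_cycle, branch_address, resolve_stage="Mem"):
    """Return (resolve_cycle, target_fetch_cycle, addresses fetched before the target)."""
    resolve_cycle = fetch_cycle + STAGES.index(resolve_stage)  # PC is written at the end of this cycle
    target_fetch_cycle = resolve_cycle + 1
    delay = target_fetch_cycle - fetch_cycle - 1               # sequential fetches in between
    fetched_anyway = [branch_address + 4 * k for k in range(1, delay + 1)]
    return resolve_cycle, target_fetch_cycle, fetched_anyway


# Beq at address 12, fetched in Cycle 4, PC updated in the Mem stage (as on the slide).
print(branch_delay(4, 12))

# Hypothetical: if the branch could be resolved in Reg/Dec, only one instruction would follow it.
print(branch_delay(4, 12, resolve_stage="Reg/Dec"))
```

The instructions at 16, 20, and 24 are already in the pipeline by the time the target at 1000 is fetched in Cycle 8.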

The Delay Load Phenomenon
[Figure: Cycles 1 to 8. I0: Load is fetched in Cycle 1, followed one per cycle by Plus 1, Plus 2, Plus 3, and Plus 4, each passing through Ifetch, Reg/Dec, Exec, Mem, Wr.]
° Although Load is fetched during Cycle 1:
  • The data is NOT written into the Reg File until the end of Cycle 5
  • We cannot read this value from the Reg File until Cycle 6
  • 3-instruction delay before the load takes effect
° This is referred to as Data Hazard:
  • Clever design techniques can reduce the delay to ONE instruction
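
Which of the following instructions can see the loaded value depends on when each one reads the register file. This is a small sketch, assuming registers are read in Reg/Dec, one instruction issues per cycle, and there is no forwarding; `sees_loaded_value` is a hypothetical helper.

```python
STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]


def sees_loaded_value(distance, load_fetch_cycle=1):
    """True if the instruction `distance` slots after the load reads the new register value."""
    write_cycle = load_fetch_cycle + STAGES.index("Wr")   # value written by the end of this cycle
    first_readable_cycle = write_cycle + 1
    consumer_fetch_cycle = load_fetch_cycle + distance
    consumer_read_cycle = consumer_fetch_cycle + STAGES.index("Reg/Dec")
    return consumer_read_cycle >= first_readable_cycle


for distance in range(1, 5):
    print(f"Plus {distance}: {'new value' if sees_loaded_value(distance) else 'old value'}")
```

Plus 1 to Plus 3 read the register in Cycles 3 to 5 and get the old value; Plus 4 reads in Cycle 6 and is the first to see the loaded data.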

Summary
° Disadvantages of the Single Cycle Processor
  • Long cycle time
  • Cycle time is too long for all instructions except the Load
° Multiple Clock Cycle Processor:
  • Divide the instructions into smaller steps
  • Execute each step (instead of the entire instruction) in one cycle
° Pipeline Processor:
  • Natural enhancement of the multiple clock cycle processor
  • Each functional unit can only be used once per instruction
  • If an instruction is going to use a functional unit:
    - it must use it at the same stage as all other instructions
  • Pipeline Control:
    - Each stage’s control signal depends ONLY on the instruction that is currently in that stage