14:332:331 Computer Architecture and Assembly Language, Fall 2003
Lecture 18: Introduction to Pipelined Datapath
[Adapted from Dave Patterson’s UCB CS 152 slides and Mary Jane Irwin’s PSU CSE 331 slides]

Heads Up
- This week’s material
  - Introduction to pipelining
    - Reading assignment: PH 6.1
- Reminders
  - HW#6 deadline???
- Next week’s material
  - I/O, exceptions, and interrupts
    - Reading assignment: PH 5.6, 8.5, and A.7 through A.8

Review: Multicycle Data and Control Path
[Figure: multicycle datapath with PC, memory (instructions or data), IR, MDR, register file, A and B registers, sign extend, two shift-left-2 units, ALU with ALU control, and ALUout. The control FSM takes Instr[31-26] and drives PCWrite, PCWriteCond, IorD, MemRead, MemWrite, MemtoReg, IRWrite, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, and RegDst.]

Review: RTL Summary
- Instr fetch (all instructions)
  - IR = Memory[PC]; PC = PC + 4;
- Decode (all instructions)
  - A = Reg[IR[25-21]]; B = Reg[IR[20-16]];
  - ALUOut = PC + (sign-extend(IR[15-0]) << 2);
- Execute
  - R-type: ALUOut = A op B;
  - Mem Ref: ALUOut = A + sign-extend(IR[15-0]);
  - Branch: if (A==B) PC = ALUOut;
  - Jump: PC = PC[31-28] || (IR[25-0] << 2);
- Memory access
  - R-type: Reg[IR[15-11]] = ALUOut;
  - Mem Ref: MDR = Memory[ALUOut]; or Memory[ALUOut] = B;
- Writeback
  - Mem Ref: Reg[IR[20-16]] = MDR;

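The field selections used in the RTL above can be sketched in Python. This is an illustrative sketch only: the helper names (`bits`, `sign_extend16`, `branch_target`, `jump_target`) are assumptions made for this example and do not come from the lecture.

```python
def bits(word, hi, lo):
    """Return bits hi..lo (inclusive) of a 32-bit word, like IR[hi-lo]."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def sign_extend16(value):
    """Interpret a 16-bit field as a signed two's-complement integer."""
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, ir):
    """ALUOut = PC + (sign-extend(IR[15-0]) << 2); pc is already PC + 4."""
    return (pc + (sign_extend16(bits(ir, 15, 0)) << 2)) & 0xFFFFFFFF


def jump_target(pc, ir):
    """PC = PC[31-28] || (IR[25-0] << 2)."""
    return (pc & 0xF0000000) | (bits(ir, 25, 0) << 2)


# Hypothetical instruction word: lw r8, -4(r9)
lw_word = (35 << 26) | (9 << 21) | (8 << 16) | 0xFFFC
rs = bits(lw_word, 25, 21)                      # register read into A
rt = bits(lw_word, 20, 16)                      # register read into B
offset = sign_extend16(bits(lw_word, 15, 0))    # signed address offset
```

The masks keep results inside 32 bits because Python integers are unbounded, which real hardware registers are not.
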
Review: Multicycle Datapath FSM
[Figure: state diagram of the multicycle control FSM. Unless otherwise assigned, PCWrite, IRWrite, MemWrite, and RegWrite are 0 and all other signals are X.]
- Instr Fetch
  - State 0 (Start): IorD=0, MemRead, IRWrite, ALUSrcA=0, ALUSrcB=01, ALUOp=00, PCSource=00, PCWrite
- Decode
  - State 1: ALUSrcA=0, ALUSrcB=11, ALUOp=00, PCWriteCond=0
- Execute
  - State 2 (Op = lw or sw): ALUSrcA=1, ALUSrcB=10, ALUOp=00, PCWriteCond=0
  - State 6 (Op = R-type): ALUSrcA=1, ALUSrcB=00, ALUOp=10, PCWriteCond=0
  - State 8 (Op = beq): ALUSrcA=1, ALUSrcB=00, ALUOp=01, PCSource=01, PCWriteCond
  - State 9 (Op = j): PCSource=10, PCWrite
- Memory Access
  - State 3 (Op = lw): MemRead, IorD=1, PCWriteCond=0
  - State 5 (Op = sw): MemWrite, IorD=1, PCWriteCond=0
- Write Back
  - State 7 (R-type): RegDst=1, RegWrite, MemtoReg=0, PCWriteCond=0
  - State 4 (lw): RegDst=0, RegWrite, MemtoReg=1, PCWriteCond=0

Review: FSM Implementation
[Figure: combinational control logic clocked through a state register by the system clock. Inputs: Op5-Op0 (Inst[31-26]) and the current state. Outputs: PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource, ALUOp, ALUSourceB, ALUSourceA, RegWrite, RegDst, and the next state.]

Single Cycle Disadvantages & Advantages
- Uses the clock cycle inefficiently: the clock cycle must be timed to accommodate the slowest instruction
  [Figure: single cycle implementation timing. lw occupies Cycle 1; sw occupies Cycle 2, with part of that cycle marked as waste.]
- Is wasteful of area, since some functional units (e.g., adders) must be duplicated because they cannot be shared during a clock cycle
but
- Is simple and easy to understand

Multicycle Advantages & Disadvantages
- Uses the clock cycle efficiently: the clock cycle is timed to accommodate the slowest instruction step
  - balance the amount of work to be done in each step
  - restrict each step to use only one major functional unit
- Multicycle implementations allow
  - functional units to be used more than once per instruction as long as they are used on different clock cycles
  - faster clock rates
  - different instructions to take a different number of clock cycles
but
- Requires additional internal state registers, muxes, and more complicated (FSM) control

The Five Stages of Load Instruction
[Figure: lw occupies IFetch, Dec, Exec, Mem, and WB in Cycles 1 through 5.]
- IFetch: Instruction Fetch and Update PC
- Dec: Register Fetch and Instruction Decode
- Exec: Execute R-type; calculate memory address
- Mem: Read/write the data from/to the Data Memory
- WB: Write the data back to the register file

Single Cycle vs. Multiple Cycle Timing
[Figure: timing comparison. Single cycle implementation: lw in Cycle 1, sw in Cycle 2 with part of the cycle wasted. Multiple cycle implementation: lw takes Cycles 1-5 (IFetch, Dec, Exec, Mem, WB), sw takes Cycles 6-9 (IFetch, Dec, Exec, Mem), and an R-type instruction begins with IFetch in Cycle 10.]
- The multicycle clock period is longer than 1/5 of the single cycle clock period because of the flip-flop overhead of the stage registers

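The timing comparison can be worked through numerically. The sketch below is only illustrative: the step delay and flip-flop overhead (`STEP_PS`, `FF_OVERHEAD_PS`) are hypothetical values chosen for the example, while the step counts per instruction follow the figure (lw 5, sw 4) and the RTL summary (R-type 4).

```python
# Hypothetical delays in picoseconds, chosen only for illustration.
STEP_PS = 2000         # assumed delay of each of the five steps
FF_OVERHEAD_PS = 200   # assumed flip-flop overhead added to each multicycle step

# Number of clock cycles each instruction class needs in the multicycle design.
STEPS = {"lw": 5, "sw": 4, "R-type": 4}


def single_cycle_time(program):
    """Every instruction takes one long cycle sized for the slowest one (lw)."""
    period = 5 * STEP_PS
    return len(program) * period


def multicycle_time(program):
    """Each instruction takes only the steps it needs, on a shorter clock."""
    period = STEP_PS + FF_OVERHEAD_PS
    return sum(STEPS[instr] for instr in program) * period


program = ["lw", "sw", "R-type"]
single = single_cycle_time(program)
multi = multicycle_time(program)
```

With these assumed numbers the multicycle clock period (2200 ps) is longer than one fifth of the single cycle period (2000 ps), which is the flip-flop overhead effect noted on the slide.
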
Pipelined MIPS Processor
- Start the next instruction while still working on the current one
  - improves throughput: total amount of work done in a given time
  - instruction latency (execution time, delay time, response time) is not reduced: time from the start of an instruction to its completion
[Figure: lw, sw, and R-type overlap in time across Cycles 1-8, each starting IFetch one cycle after the previous instruction and moving through IFetch, Dec, Exec, Mem, WB.]

Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: the same instruction sequence under three implementations. Single cycle: Load in Cycle 1, Store in Cycle 2 with part of the cycle wasted. Multiple cycle: lw in Cycles 1-5 (IFetch, Dec, Exec, Mem, WB), sw starting in Cycle 6, R-type following, over Cycles 1-10. Pipeline: lw, sw, and R-type each pass through IFetch, Dec, Exec, Mem, WB, starting one cycle apart; the figure marks one wasted cycle.]

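The overlapped pipeline timing can be sketched as a small chart builder. This is a minimal sketch assuming an ideal pipeline with no stalls; the function name `pipeline_chart` is hypothetical.

```python
STAGES = ["IFetch", "Dec", "Exec", "Mem", "WB"]


def pipeline_chart(instrs):
    """Return (name, row) pairs; row[c] is the stage occupied in cycle c + 1.

    Assumes an ideal pipeline: instruction i enters IFetch in cycle i + 1
    and advances one stage per cycle with no stalls.
    """
    total_cycles = len(instrs) + len(STAGES) - 1
    rows = []
    for i, name in enumerate(instrs):
        row = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage
        rows.append((name, row))
    return rows


chart = pipeline_chart(["lw", "sw", "R-type"])
for name, row in chart:
    print(f"{name:8}" + "".join(f"{stage:8}" for stage in row))
```

Each printed row is shifted one column to the right of the row above it, which is the staircase shape of the pipeline implementation in the figure.
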
Pipelining the MIPS ISA
- What makes it easy
  - all instructions are the same length (32 bits)
  - few instruction formats (three) with symmetry across formats
  - memory operations can occur only in loads and stores
  - operands must be aligned in memory so a single data transfer requires only one memory access
- What makes it hard
  - structural hazards: what if we had only one memory?
  - control hazards: what about branches?
  - data hazards: what if an instruction’s input operands depend on the output of a previous instruction?

MIPS Pipeline Datapath Modifications
- What do we need to add/modify in our MIPS datapath?
  - State registers between pipeline stages to isolate them
[Figure: pipelined datapath divided into IFetch, Dec, Exec, Mem, and WB stages, with the state registers IFetch/Dec, Dec/Exec, Exec/Mem, and Mem/WB between stages, all driven by the system clock. Components: PC, instruction memory, adders, register file, sign extend, shift left 2, ALU, data memory, and multiplexors.]

MIPS Pipeline Control Path Modifications
- All control signals are determined during Decode
  - and held in the state registers between pipeline stages
[Figure: the pipelined datapath (IFetch, Dec, Exec, Mem, WB stages with the IFetch/Dec, Dec/Exec, Exec/Mem, and Mem/WB state registers) with a Control unit added in the Dec stage.]

Graphically Representing MIPS Pipeline
[Figure: one instruction drawn as the sequence IM, Reg, ALU, DM, Reg.]
- Can help with answering questions like:
  - how many cycles does it take to execute this code?
  - what is the ALU doing during cycle 4?
  - is there a hazard, why does it occur, and how can it be fixed?

Why Pipeline? For Throughput!
[Figure: pipeline diagram with time (clock cycles) on the horizontal axis and instruction order (Inst 0 through Inst 4) on the vertical axis; each instruction uses IM, Reg, ALU, DM, Reg. The first cycles are labeled as the time to fill the pipeline.]
- Once the pipeline is full, one instruction is completed every cycle

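The fill time and the one-instruction-per-cycle rate can be turned into a cycle count. This is a minimal sketch assuming an ideal k-stage pipeline with no stalls, compared against an unpipelined design that spends k cycles on every instruction; the function names are hypothetical.

```python
def pipelined_cycles(n, k=5):
    """Cycles to run n instructions: k to complete the first, then 1 each."""
    return k + (n - 1)


def unpipelined_cycles(n, k=5):
    """Cycles if each instruction must finish all k steps before the next starts."""
    return n * k


def speedup(n, k=5):
    """Ratio of unpipelined to pipelined cycle counts for n instructions."""
    return unpipelined_cycles(n, k) / pipelined_cycles(n, k)


for n in (1, 5, 1000):
    print(n, pipelined_cycles(n), round(speedup(n), 2))
```

For a single instruction there is no gain at all, and the ratio approaches the number of stages only as the instruction count grows large relative to the fill time.
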
Can pipelining get us into trouble?
- Yes: Pipeline Hazards
  - structural hazards: attempt to use the same resource by two different instructions at the same time
  - data hazards: attempt to use item before it is ready
    - instruction depends on result of prior instruction still in the pipeline
  - control hazards: attempt to make a decision before condition is evaluated
    - branch instructions
- Can always resolve hazards by waiting
  - pipeline control must detect the hazard
  - take action (or delay action) to resolve hazards

A Unified Memory Would Be a Structural Hazard
[Figure: pipeline diagram over time (clock cycles) for lw followed by Inst 1 through Inst 4, with a single memory (Mem) used for both instruction fetch and data access. In the cycle where lw is reading data from memory, a later instruction is reading its instruction from the same memory.]

How About Register File Access?
[Figure: pipeline diagram over time (clock cycles) for add, Inst 1, Inst 2, add, Inst 4; the first add writes the register file in the same cycle in which the second add reads it.]
- Can fix register file access hazard by doing reads in the second half of the cycle and writes in the first half.

Branch Instructions Cause Control Hazards
[Figure: pipeline diagram for add, beq, lw, Inst 3, Inst 4, each using IM, Reg, ALU, DM, Reg.]
- Dependencies backward in time cause hazards

One Way to “Fix” a Control Hazard
[Figure: pipeline diagram for add, beq, stall, lw, Inst 3; the instructions after beq are delayed.]
- Can fix branch hazard by waiting (stall), but this affects throughput

Register Usage Can Cause Data Hazards
[Figure: pipeline diagram for the following instruction sequence, each instruction using IM, Reg, ALU, DM, Reg.]
  add r1, r2, r3
  sub r4, r1, r5
  and r6, r1, r7
  or r8, r1, r9
  xor r4, r1, r5
- Dependencies backward in time cause hazards

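Detecting these dependencies can be sketched as a scan over the instruction sequence. This is a minimal sketch under stated assumptions: there is no forwarding, and the register file is written in the first half of a cycle and read in the second half, so only the two instructions immediately after a writer can read a stale value. The tuple layout and the name `raw_hazards` are hypothetical.

```python
# Each instruction: (text, destination register or None, source registers).
PROGRAM = [
    ("add r1, r2, r3", "r1", ("r2", "r3")),
    ("sub r4, r1, r5", "r4", ("r1", "r5")),
    ("and r6, r1, r7", "r6", ("r1", "r7")),
    ("or r8, r1, r9", "r8", ("r1", "r9")),
    ("xor r4, r1, r5", "r4", ("r1", "r5")),
]


def raw_hazards(program, window=2):
    """Return (writer_index, reader_index) pairs that read a register too early.

    window=2 reflects the assumption that a reader three or more
    instructions behind the writer sees the value already written back.
    """
    hazards = []
    for i, (_, dest, _srcs) in enumerate(program):
        if dest is None:
            continue  # instructions such as sw write no register
        for j in range(i + 1, min(i + 1 + window, len(program))):
            _, later_dest, later_srcs = program[j]
            if dest in later_srcs:
                hazards.append((i, j))
            if later_dest == dest:
                break  # a newer writer takes over this register
    return hazards


hazards = raw_hazards(PROGRAM)
```

Under these assumptions only `sub` and `and` are flagged; `or` reads r1 in the same cycle in which `add` writes it, and `xor` reads it later.
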
One Way to “Fix” a Data Hazard
[Figure: pipeline diagram for add r1, r2, r3, followed by stall, then sub r4, r1, r5 and and r6, r1, r7.]
- Can fix data hazard by waiting (stall), but this affects throughput

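The cost of stalling can be estimated by scheduling each instruction's decode cycle. This is a minimal sketch under stated assumptions: no forwarding, in-order issue, a result written in WB (three cycles after Dec), and a register file that can be read in the same cycle in which it is written. The function names and tuple layout are hypothetical.

```python
def decode_cycles(program):
    """Return the cycle in which each instruction reads its registers (Dec).

    program is a list of (text, destination or None, source registers).
    """
    ready = {}   # register -> cycle in which its pending value is written (WB)
    cycles = []
    prev_dec = 1  # so that the first instruction decodes in cycle 2
    for _, dest, srcs in program:
        dec = prev_dec + 1                # in order, at most one per cycle
        for reg in srcs:
            dec = max(dec, ready.get(reg, 0))  # wait until the value is written
        cycles.append(dec)
        if dest is not None:
            ready[dest] = dec + 3         # Dec -> Exec -> Mem -> WB
        prev_dec = dec
    return cycles


def stall_cycles(program):
    """Extra cycles compared with an ideal pipeline that never waits."""
    cycles = decode_cycles(program)
    return (cycles[-1] - cycles[0]) - (len(program) - 1)


program = [
    ("add r1, r2, r3", "r1", ("r2", "r3")),
    ("sub r4, r1, r5", "r4", ("r1", "r5")),
    ("and r6, r1, r7", "r6", ("r1", "r7")),
]
schedule = decode_cycles(program)
stalls = stall_cycles(program)
```

Under these assumptions `sub` waits until cycle 5, when `add` writes r1, so the back-to-back dependence costs two cycles; a different register file or a forwarding path would change that count.
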
Loads Can Cause Data Hazards
[Figure: pipeline diagram for the following instruction sequence, each instruction using IM, Reg, ALU, DM, Reg.]
  lw r1, 100(r2)
  sub r4, r1, r5
  and r6, r1, r7
  or r8, r1, r9
  xor r4, r1, r5
- Dependencies backward in time cause hazards

Stores Can Cause Data Hazards
[Figure: pipeline diagram for the following instruction sequence, each instruction using IM, Reg, ALU, DM, Reg.]
  add r1, r2, r3
  sw r1, 100(r5)
  and r6, r1, r7
  or r8, r1, r9
  xor r4, r1, r5
- Dependencies backward in time cause hazards

Other Pipeline Structures Are Possible
- What about (slow) multiply operation?
  - let it take two cycles
  [Figure: IM, Reg, ALU, DM, Reg, with a MUL unit alongside the ALU.]
- What if the data memory access is twice as slow as the instruction memory?
  - make the clock twice as slow or …
  - let data memory access take two cycles (and keep the same clock rate)
  [Figure: IM, Reg, ALU, DM1, DM2, Reg.]

Sample Pipeline Alternatives
- ARM7: IM, Reg, EX
  - IM: PC update, IM access
  - Reg: decode, reg access
  - EX: ALU op, DM access, shift/rotate, commit result (write back)
- StrongARM-1: IM, Reg, ALU, DM, Reg
- XScale: IM1, IM2, Reg, SHFT, ALU, DM1, DM2
  - IM1: PC update, BTB access, start IM access
  - IM2: IM access
  - Reg: decode, reg 1 access
  - SHFT: shift/rotate, reg 2 access
  - ALU: ALU op
  - DM1: start DM access, exception
  - DM2: DM write, reg write

Summary
- All modern day processors use pipelining
- Pipelining doesn’t help latency of single task, it helps throughput of entire workload
  - Multiple tasks operating simultaneously using different resources
- Potential speedup = Number of pipe stages
- Pipeline rate limited by slowest pipeline stage
  - Unbalanced lengths of pipe stages reduces speedup
  - Time to “fill” pipeline and time to “drain” it reduces speedup
- Must detect and resolve hazards
  - Stalling negatively affects throughput
- To learn (much) more take CSE 431

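The effect of unbalanced stages on the potential speedup can be illustrated numerically. This is a minimal sketch assuming the pipeline clock period equals the slowest stage delay and ignoring register overhead, fill time, and hazards; the stage delays are hypothetical values in picoseconds.

```python
def pipeline_clock_ps(stage_delays_ps):
    """The clock must be long enough for the slowest stage."""
    return max(stage_delays_ps)


def steady_state_speedup(stage_delays_ps):
    """Unpipelined time per instruction divided by the pipelined clock period."""
    return sum(stage_delays_ps) / pipeline_clock_ps(stage_delays_ps)


# Hypothetical stage delays (IFetch, Dec, Exec, Mem, WB), illustration only.
balanced = [200, 200, 200, 200, 200]
unbalanced = [200, 100, 200, 200, 100]

balanced_speedup = steady_state_speedup(balanced)
unbalanced_speedup = steady_state_speedup(unbalanced)
```

With balanced stages the ratio equals the number of stages (5); with the unbalanced delays it drops to 4, because the two faster stages sit idle for half of every cycle.
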