Datorsystem 1 and Datorarkitektur 1, lecture 10, Monday 19 November 2007

[Figure: the instruction cycle. Fetch (PC = PC + 4), Decode, Exec.]

R-type Instruction Data/Control Flow
[Figure: single-cycle MIPS datapath (PC, Instruction Memory, Register File, Sign Extend, Shift left 2, two adders, ALU, ALU control, Data Memory, Control Unit) with the control signals RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite and PCSrc, showing the data and control flow of an R-type instruction.]

Load Word Instruction Data/Control Flow
[Figure: the same single-cycle MIPS datapath and control signals (RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite, PCSrc), showing the data and control flow of a Load Word instruction.]

Single cycle design: fetch, decode and execute each instruction in one clock cycle
- No datapath resource can be used more than once per instruction, so some must be duplicated (e.g., separate Instruction Memory and Data Memory, several adders)
- Cycle time is determined by the length of the longest path
[Figure: state elements separated by combinational logic and driven by the clock, with one clock cycle marked.]
(Photo: C. E. Delohery, some rights reserved)

NOTE: this is a single-cycle implementation
- Clock cycle time must be long enough for the longest possible path
- A good candidate for the longest path? Load Word. It uses five functional units:
  1. Instruction memory
  2. Register file
  3. ALU
  4. Data memory
  5. Register file
- R-type instructions such as add use only four functional units:
  1. Instruction memory
  2. Register file
  3. ALU
  4. Register file
- What about Store Word?
(Photo: Fort Photo, some rights reserved)
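The longest-path argument can be made concrete with a small calculation. The sketch below is only an illustration: the unit delays are made-up values (the lecture gives none), and the names `UNIT_DELAY_PS`, `PATHS`, `path_delay` and `single_cycle_period` are hypothetical.

```python
# Hypothetical unit delays in picoseconds. These numbers are assumptions
# chosen for illustration; the slides do not specify any delays.
UNIT_DELAY_PS = {
    "instruction_memory": 200,
    "register_read": 100,
    "alu": 200,
    "data_memory": 200,
    "register_write": 100,
}

# Functional units used by each instruction, in the order listed on the slide.
PATHS = {
    "lw": ["instruction_memory", "register_read", "alu",
           "data_memory", "register_write"],
    "add": ["instruction_memory", "register_read", "alu", "register_write"],
}


def path_delay(units, delays=UNIT_DELAY_PS):
    """Sum the delays of the units an instruction passes through."""
    return sum(delays[unit] for unit in units)


def single_cycle_period(paths=PATHS, delays=UNIT_DELAY_PS):
    """The clock period must cover the slowest instruction path."""
    return max(path_delay(units, delays) for units in paths.values())


if __name__ == "__main__":
    for name, units in PATHS.items():
        print(name, path_delay(units), "ps")
    print("clock period:", single_cycle_period(), "ps")
```

With these assumed delays, lw needs 800 ps and add needs 600 ps, so every instruction, including add, is charged the 800 ps cycle.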

Single Cycle Disadvantages & Advantages
- Uses the clock cycle inefficiently: the clock cycle must be timed to accommodate the slowest instruction
  - especially problematic for more complex instructions like floating point multiply
- May be a waste of area, since some functional units (e.g., adders) must be duplicated because they cannot be shared during a clock cycle
but
- Is simple and easy to understand
[Figure: Clk timing for Cycle 1 (lw) and Cycle 2 (add), with the unused part of a cycle marked "Waste".]

Multicycle Datapath Approach
- Let an instruction take more than 1 clock cycle to complete
  - Break up instructions into steps where each step takes a cycle while trying to
    - balance the amount of work to be done in each step
    - restrict each cycle to use only one major functional unit
  - Not every instruction takes the same number of clock cycles
- In addition to faster clock rates, multicycle allows functional units to be used more than once per instruction as long as they are used on different clock cycles; as a result
  - only need one memory, but only one memory access per cycle
  - need only one ALU/adder, but only one ALU operation per cycle

Multicycle Datapath Approach, cont'd
- At the end of a cycle
  - Store values needed in a later cycle by the current instruction in an internal register (not visible to the programmer). All (except IR) hold data only between a pair of adjacent clock cycles (no write control signal needed)
    - IR: Instruction Register
    - MDR: Memory Data Register
    - A, B: regfile read data registers
    - ALUout: ALU output register
  - Data used by subsequent instructions are stored in programmer visible registers (i.e., register file, PC, or memory)
[Figure: multicycle datapath with PC, Memory, IR, MDR, Register File, A, B, ALU and ALUout.]

The Multicycle Datapath with Control Signals
[Figure: multicycle MIPS datapath (PC, Memory, IR, MDR, Register File, A, B, Sign Extend, Shift left 2, ALU, ALU control, ALUout) with the Control unit driving PCWriteCond, PCWrite, IorD, MemRead, MemWrite, MemtoReg, IRWrite, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite and RegDst.]

The Five Steps of the Load Instruction
[Figure: lw over Cycle 1 to Cycle 5: IFetch, Dec, Exec, Mem, WB.]
- IFetch: Instruction Fetch and Update PC
- Dec: Instruction Decode, Register Read, Sign Extend Offset
- Exec: Execute R-type; Calculate Memory Address; Branch Comparison; Branch and Jump Completion
- Mem: Memory Read; Memory Write Completion; R-type Completion (RegFile write)
- WB: Memory Read Completion (RegFile write)
INSTRUCTIONS TAKE FROM 3 TO 5 CYCLES!
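The step list above implies a cycle count per instruction class: branches and jumps complete in Exec (3 cycles), stores and R-type instructions complete in Mem (4 cycles), and loads complete in WB (5 cycles). A minimal sketch of the resulting average CPI follows; the instruction mix is a made-up example, not data from the lecture, and the names `CYCLES`, `EXAMPLE_MIX` and `average_cpi` are hypothetical.

```python
# Cycles per instruction class, read off the step in which each class completes.
CYCLES = {"lw": 5, "sw": 4, "rtype": 4, "beq": 3, "j": 3}

# Assumed instruction mix (fractions summing to 1). Purely illustrative.
EXAMPLE_MIX = {"lw": 0.25, "sw": 0.10, "rtype": 0.50, "beq": 0.10, "j": 0.05}


def average_cpi(mix, cycles=CYCLES):
    """Weighted average of cycles per instruction for a given mix."""
    total = sum(mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("instruction mix fractions must sum to 1")
    return sum(fraction * cycles[kind] for kind, fraction in mix.items())


if __name__ == "__main__":
    print(f"average CPI: {average_cpi(EXAMPLE_MIX):.2f}")
```

For this assumed mix the multicycle design averages about 4.1 cycles per instruction, between the 3-cycle and 5-cycle extremes.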

Review: Single Cycle vs. Multiple Cycle Timing
Single Cycle Implementation:
[Figure: Clk with Cycle 1 (lw) and Cycle 2 (sw); the unused part of a cycle is marked "Waste".]
Multiple Cycle Implementation:
[Figure: Clk with Cycle 1 to Cycle 10; lw takes IFetch, Dec, Exec, Mem, WB, then sw takes IFetch, Dec, Exec, Mem.]
- The multicycle clock cycle is longer than 1/5 of the single cycle clock cycle, due to stage register overhead

How can we make it even faster?
- Split the multiple instruction cycle into smaller and smaller steps
- There is a point of diminishing returns where as much time is spent loading the state registers as doing the work.
[Figure: multicycle datapath (PC, Memory, IR, MDR, Register File, A, B, ALU, ALUout) above a timing diagram over Cycle 1 to Cycle 10: lw takes IFetch, Dec, Exec, Mem, WB, then sw takes IFetch, Dec, Exec, Mem.]

- Fetch (and execute) more than one instruction at a time
  - Superscalar processing: stay tuned...
[Photo: CRAY-1, 1976]

Sandwich of the week!

These are two sandwiches, one with ham, one with EGG! A sandwich with ham. A sandwich with egg. EGG! HAM!

Start fetching and executing the next instruction before the current one has completed.
- Pipelining: (all?) modern processors are pipelined for performance
CPUtime = IC x CPI x CC
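In the formula above, IC is the instruction count, CPI the average number of clock cycles per instruction, and CC the clock cycle time. A minimal sketch of how the three factors combine; the numbers in the example are assumptions for illustration only, and `cpu_time` is a hypothetical helper name.

```python
def cpu_time(instruction_count, cpi, clock_cycle_s):
    """CPUtime = IC x CPI x CC, with CC given in seconds."""
    return instruction_count * cpi * clock_cycle_s


if __name__ == "__main__":
    # Assumed workload: one million instructions on a 1 ns clock.
    ic = 1_000_000
    cc = 1e-9
    multicycle = cpu_time(ic, 4.0, cc)   # assumed average CPI of 4
    pipelined = cpu_time(ic, 1.0, cc)    # ideal pipelined CPI of 1
    print(f"multicycle: {multicycle * 1e3:.1f} ms")
    print(f"pipelined:  {pipelined * 1e3:.1f} ms")
```

Pipelining attacks the CPI factor: with IC and CC held fixed, lowering CPI from the assumed 4 to the ideal 1 cuts CPU time by the same factor.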

A Pipelined MIPS Processor
- Start the next instruction before the current one has completed
  - improves throughput: total amount of work done in a given time
  - instruction latency (execution time, delay time, response time: the time from the start of an instruction to its completion) is not reduced
[Figure: lw, sw and an R-type instruction overlapped across Cycle 1 to Cycle 8, each passing through IFetch, Dec, Exec, Mem, WB.]
  - clock cycle (pipeline stage time) is limited by the slowest stage
  - for some instructions, some stages are wasted cycles

Single Cycle, Multiple Cycle, vs. Pipeline
Single Cycle Implementation:
[Figure: Clk with Cycle 1 (lw) and Cycle 2 (sw); the unused part of a cycle is marked "Waste".]
Multiple Cycle Implementation:
[Figure: Clk with Cycle 1 to Cycle 10; lw takes IFetch, Dec, Exec, Mem, WB, sw takes IFetch, Dec, Exec, Mem, then the R-type instruction starts with IFetch.]
Pipeline Implementation:
[Figure: lw, sw and the R-type instruction overlapped, each passing through IFetch, Dec, Exec, Mem, WB and starting one cycle after the previous instruction.]
- Completing lw, sw and an R-type instruction takes only 7 cycles
- How much faster is this compared to the Multiple Cycle implementation?
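The 7-cycle figure for the pipelined case can be reproduced with a small scheduling sketch. It assumes an ideal five-stage pipeline with no hazards or stalls; the helper names `pipeline_schedule`, `total_cycles` and `render` are hypothetical.

```python
STAGES = ["IFetch", "Dec", "Exec", "Mem", "WB"]


def pipeline_schedule(instructions, stages=STAGES):
    """Map each instruction to {cycle: stage} for an ideal pipeline.

    Instruction i (0-based) enters the first stage in cycle i + 1 and moves
    one stage forward every cycle; hazards and stalls are ignored.
    """
    return [
        (name, {i + 1 + s: stage for s, stage in enumerate(stages)})
        for i, name in enumerate(instructions)
    ]


def total_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles until the last instruction leaves the last stage."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


def render(schedule):
    """Return the schedule as a plain-text timing diagram."""
    last_cycle = max(max(row) for _, row in schedule)
    header = f"{'':8}" + "".join(
        f"{'C' + str(c):>8}" for c in range(1, last_cycle + 1)
    )
    lines = [header]
    for name, row in schedule:
        cells = "".join(f"{row.get(c, ''):>8}" for c in range(1, last_cycle + 1))
        lines.append(f"{name:8}" + cells)
    return "\n".join(lines)


if __name__ == "__main__":
    program = ["lw", "sw", "R-type"]
    print(render(pipeline_schedule(program)))
    print("cycles:", total_cycles(len(program)))
```

For three instructions and five stages this gives 5 + 3 - 1 = 7 cycles. Using the step counts from the load-instruction slide (lw 5, sw 4, R-type 4), the multicycle design would need 13 cycles for the same three instructions, if the clock cycle were the same in both designs.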

MIPS Pipeline Datapath Modifications
- What do we need to add/modify in our MIPS datapath?
  - State registers between each pipeline stage to isolate them
[Figure: pipelined MIPS datapath in five stages (IF: IFetch, ID: Dec, EX: Execute, MEM: MemAccess, WB: WriteBack), separated by the state registers IFetch/Dec, Dec/Exec, Exec/Mem and Mem/WB and driven by the System Clock.]

Pipelining the MIPS ISA
Is it hard to introduce pipelining to MIPS?
EASY
- all instructions are the same length (32 bits)
  - can fetch in the 1st stage and decode in the 2nd stage
- few instruction formats (three) with symmetry across formats
  - can begin reading register file in 2nd stage
- memory operations can occur only in loads and stores
  - can use the execute stage to calculate memory addresses
- each MIPS instruction writes at most one result (i.e., changes the machine state) and does so near the end of the pipeline (MEM and WB)
HARD
- structural hazards: what if we had only one memory?
- control hazards: what about branches?
- data hazards: what if an instruction's input operands depend on the output of a previous instruction?
(Photo: Land of the Lost, some rights reserved)

Graphically Representing MIPS Pipeline
[Figure: one instruction drawn as the stage sequence IM, Reg, ALU, DM, Reg.]
- Can help with answering questions like:
  - How many cycles does it take to execute this code?
  - What is the ALU doing during cycle 4?
  - Is there a hazard, why does it occur, and how can it be fixed?

Why Pipeline? For Performance!
[Figure: Inst 0 to Inst 4 in instruction order against time (clock cycles), each drawn as IM, Reg, ALU, DM, Reg and offset by one cycle; the first cycles are marked "Time to fill the pipeline".]
- Once the pipeline is full, one instruction is completed every cycle, so CPI = 1
CPUtime = IC x CPI x CC

Can Pipelining Get Us Into Trouble?
- Yes: Pipeline Hazards
  - structural hazards: attempt to use the same resource by two different instructions at the same time
  - data hazards: attempt to use data before it is ready
    - An instruction's source operand(s) are produced by a prior instruction still in the pipeline
  - control hazards: attempt to make a decision about program control flow before the condition has been evaluated and the new PC target address calculated
    - branch instructions
- Can always resolve hazards by waiting
  - pipeline control must detect the hazard
  - and take action to resolve hazards

A Single Memory Would Be a Structural Hazard
[Figure: lw followed by Inst 1 to Inst 4 in instruction order against time (clock cycles), each drawn as Mem, Reg, ALU, Mem, Reg; in the same cycle one instruction is reading data from memory while another is reading its instruction from memory.]
- Fix with separate instr and data memories (I$ and D$)

How About Register File Access?
[Figure: "add $1," followed by Inst 1, Inst 2 and "add $2, $1," in instruction order against time (clock cycles), each drawn as IM, Reg, ALU, DM, Reg.]
- Fix register file access hazard by doing reads in the second half of the cycle and writes in the first half

Register Usage Can Cause Data Hazards
- Dependencies backward in time cause hazards
[Figure: pipeline diagram, each instruction drawn as IM, Reg, ALU, DM, Reg:
  add $1,
  sub $4, $1, $5
  and $6, $1, $7
  or  $8, $1, $9
  xor $4, $1, $5]
- Read before write data hazard
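The dependencies in this instruction sequence can be found mechanically. The sketch below assumes a five-stage pipeline without forwarding, in which the register file is written in the first half of a cycle and read in the second half, so only consumers one or two instructions behind the producer read a stale value. The source registers of the add are truncated on the slide; `$2` and `$3` below are placeholders, and `Instr` and `raw_hazards` are hypothetical names.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


def raw_hazards(program):
    """Return (producer_index, consumer_index, register) for each hazard.

    Assumes no forwarding and a register file that writes in the first half
    of a cycle and reads in the second half, so only producers that are one
    or two instructions ahead of the consumer are a problem.
    """
    hazards = []
    for j, consumer in enumerate(program):
        for distance in (1, 2):
            i = j - distance
            if i < 0:
                continue
            producer = program[i]
            if producer.dest and producer.dest != "$0" \
                    and producer.dest in consumer.srcs:
                hazards.append((i, j, producer.dest))
    return hazards


PROGRAM = [
    Instr("add", "$1", ("$2", "$3")),  # sources assumed; the slide truncates them
    Instr("sub", "$4", ("$1", "$5")),
    Instr("and", "$6", ("$1", "$7")),
    Instr("or", "$8", ("$1", "$9")),
    Instr("xor", "$4", ("$1", "$5")),
]

if __name__ == "__main__":
    for i, j, reg in raw_hazards(PROGRAM):
        print(f"{PROGRAM[j].op} reads {reg} before {PROGRAM[i].op} has written it")
```

Under these assumptions only sub and and are flagged; or reads `$1` in the same cycle in which add writes it, and xor reads it one cycle later.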

Loads Can Cause Data Hazards
- Dependencies backward in time cause hazards
[Figure: pipeline diagram, each instruction drawn as IM, Reg, ALU, DM, Reg:
  lw  $1, 4($2)
  sub $4, $1, $5
  and $6, $1, $7
  or  $8, $1, $9
  xor $4, $1, $5]
- Load-use data hazard

One Way to “Fix” a Data Hazard
[Figure: pipeline diagram in which "add $1," is followed by stall cycles before sub $4, $1, $5 and and $6, $1, $7 proceed.]
- Can fix data hazard by waiting (stall), but impacts CPI

Another Way to “Fix” a Data Hazard
[Figure: pipeline diagram for "add $1,", sub $4, $1, $5, and $6, $1, $7, or $8, $1, $9, xor $4, $1, $5, with the result of the add forwarded to the instructions that need it.]
- Fix data hazards by forwarding results as soon as they are available to where they are needed

Forwarding with Load-use Data Hazards
[Figure: pipeline diagram, each instruction drawn as IM, Reg, ALU, DM, Reg:
  lw  $1, 4($2)
  sub $4, $1, $5
  and $6, $1, $7
  or  $8, $1, $9
  xor $4, $1, $5]
- Will still need one stall cycle even with forwarding
[Overlay text in Swedish: illegible in this transcript.]
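The stall behaviour described on the data hazard slides can be summarised in one small function. This is a sketch under stated assumptions, not the lecture's own formulation: a five-stage pipeline, a register file that writes in the first half of a cycle and reads in the second half, and full forwarding into the ALU inputs when forwarding is enabled. The name `stall_cycles` is hypothetical.

```python
def stall_cycles(producer_is_load, distance, forwarding):
    """Stall cycles a dependent instruction needs in a five-stage pipeline.

    distance is how many instructions the consumer is behind the producer
    (1 means it is the very next instruction).
    """
    if distance < 1:
        raise ValueError("distance must be at least 1")
    if forwarding:
        # An ALU result is ready at the end of Exec and can be forwarded in
        # time. Loaded data is ready only at the end of Mem, one cycle too
        # late for the instruction immediately after the load.
        return 1 if (producer_is_load and distance == 1) else 0
    # Without forwarding the consumer's register read (stage 2) must not come
    # before the producer's write back (stage 5); the same cycle is fine.
    return max(0, 3 - distance)


if __name__ == "__main__":
    print("add then use, no forwarding:", stall_cycles(False, 1, False))
    print("add then use, forwarding:   ", stall_cycles(False, 1, True))
    print("lw then use, forwarding:    ", stall_cycles(True, 1, True))
```

The last case is the load-use hazard: forwarding removes the stalls for ALU results but still leaves one stall cycle after a load.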

Branch Instructions Cause Control Hazards
- Dependencies backward in time cause hazards
[Figure: beq followed by lw, Inst 3 and Inst 4 in instruction order, each drawn as IM, Reg, ALU, DM, Reg.]
- Don't know the correct value of PC…

One Way to “Fix” a Control Hazard
[Figure: beq followed by stall cycles before lw and Inst 3 enter the pipeline.]
- Fix branch hazard by waiting (stall), but affects CPI

Another Way to “Fix” a Control Hazard
[Figure: beq followed by a single stall, then lw.]
- Extra hardware to test registers and calculate branch address…

Branch Prediction to “Fix” a Control Hazard
[Figure: beq followed directly by lw, each drawn as IM, Reg, ALU, DM, Reg.]
- Predict all branches are not taken.

Branch Prediction to “Fix” a Control Hazard
[Figure: beq followed by a stall and then addi.]
- Extra hardware to test registers and calculate branch address.
- Only need to stall pipeline if branch is taken.
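The effect on CPI of stalling on every branch versus stalling only on taken branches can be estimated with a short calculation. The branch frequency and taken fraction below are assumptions chosen for illustration, the one-cycle penalty follows the single stall shown on the slide, and the function names are hypothetical.

```python
def cpi_always_stall(branch_fraction, penalty_cycles=1, base_cpi=1.0):
    """CPI when every branch costs penalty_cycles stall cycles."""
    return base_cpi + branch_fraction * penalty_cycles


def cpi_predict_not_taken(branch_fraction, taken_fraction,
                          penalty_cycles=1, base_cpi=1.0):
    """CPI when only taken (mispredicted) branches cost stall cycles."""
    return base_cpi + branch_fraction * taken_fraction * penalty_cycles


if __name__ == "__main__":
    # Assumed workload: 20% branches, half of them taken.
    print(f"always stall:      {cpi_always_stall(0.20):.2f}")
    print(f"predict not taken: {cpi_predict_not_taken(0.20, 0.50):.2f}")
```

With these assumed numbers the CPI drops from 1.2 to 1.1, because the penalty is paid only for the branches that turn out to be taken.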

Isn't there a smarter way to “guess”? Dynamic Branch Prediction...
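The slide only names dynamic branch prediction and does not say which scheme the lecture has in mind. As one common textbook example, the sketch below models a table of 2-bit saturating counters indexed by the branch address; the class name, table size and initial counter value are all assumptions made for illustration.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters (0..3); 2 or 3 predicts taken."""

    def __init__(self, table_size=16):
        self.table_size = table_size
        # Start every entry at 1 ("weakly not taken"); an arbitrary choice.
        self.counters = [1] * table_size

    def _index(self, pc):
        # MIPS instructions are word aligned, so drop the two low bits.
        return (pc >> 2) % self.table_size

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(predictor, pc, outcomes):
    """Run one branch through a sequence of outcomes and count the misses."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


if __name__ == "__main__":
    predictor = TwoBitPredictor()
    loop_branch = [True] * 9 + [False]  # taken nine times, then falls through
    print(count_mispredictions(predictor, 0x400, loop_branch))
    print(count_mispredictions(predictor, 0x400, loop_branch))
```

For a loop branch like this one, the predictor misses on the first iteration and on the loop exit the first time round, and only on the loop exit when the loop is run again.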

MIPS Pipeline Datapath Modifications
- What do we need to add/modify in our MIPS datapath?
  - State registers between each pipeline stage to isolate them
[Figure: pipelined MIPS datapath in five stages (IF: IFetch, ID: Dec, EX: Execute, MEM: MemAccess, WB: WriteBack), separated by the state registers IFetch/Dec, Dec/Exec, Exec/Mem and Mem/WB and driven by the System Clock.]
Have we missed something?

Corrected Datapath to Save RegWrite Addr
- Need to preserve the destination register address in the pipeline state registers
[Figure: pipelined MIPS datapath with the state registers IF/ID, ID/EX, EX/MEM and MEM/WB.]

MIPS Pipeline Control Path Modifications
- All control signals can be determined during Decode
  - and held in the state registers between pipeline stages
[Figure: pipelined MIPS datapath with the state registers IF/ID, ID/EX, EX/MEM and MEM/WB and a Control unit.]

Other Pipeline Structures Are Possible
- What about the (slow) multiply operation?
  - make the clock twice as slow or …
  - let it take two cycles (since it doesn't use the DM stage)
[Figure: pipeline stages IM, Reg, ALU, DM, Reg with MUL marked.]
- What if the data memory access is twice as slow as the instruction memory?
  - make the clock twice as slow or …
  - let data memory access take two cycles (and keep the same clock rate)
[Figure: pipeline stages IM, Reg, ALU, DM1, DM2, Reg.]

Summary
- All modern day processors use pipelining
- Pipelining doesn't help latency of single task, it helps throughput of entire workload
- Potential speedup: a CPI of 1 and a fast CC
- Pipeline rate limited by slowest pipeline stage
  - Unbalanced pipe stages make for inefficiencies
  - The time to “fill” the pipeline (“start up”) and the time to “drain” it (“wind down”) can impact speedup for deep pipelines and short code runs
- Must detect and resolve hazards
  - Stalling negatively affects CPI (makes CPI greater than the ideal of 1): “bubbles”
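The point about fill and drain time can be quantified with a simple model. The sketch assumes an ideal k-stage pipeline without stalls and compares it with an unpipelined design that spends k cycles of the same length on every instruction; both the model and the function names are assumptions for illustration.

```python
def pipelined_cycles(n_instructions, n_stages):
    """Cycles for an ideal pipeline: fill once, then one instruction per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


def speedup(n_instructions, n_stages):
    """Speedup over a design that takes n_stages cycles per instruction."""
    unpipelined = n_instructions * n_stages
    return unpipelined / pipelined_cycles(n_instructions, n_stages)


if __name__ == "__main__":
    for n in (1, 3, 100, 1_000_000):
        print(f"{n:>9} instructions: speedup {speedup(n, 5):.3f}")
```

For very short runs the speedup stays close to 1, and it approaches the number of stages only as the instruction count grows, which is why start-up and wind-down matter most for deep pipelines and short code runs.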