18-447 Computer Architecture, Lecture 5: Single-Cycle Microarchitectures


18-447 Computer Architecture
Lecture 5: Single-Cycle Microarchitectures
Prof. Onur Mutlu
Carnegie Mellon University
Spring 2014, 1/24/2014

Assignments
  • Lab 1 due today
  • Lab 2 out (start early)
  • HW 1 due next week
  • HW 0 issues
      - Make sure your forms are correctly filled in and readable
      - Extended deadline to resubmit: Sunday night (January 26)

A Single-Cycle Microarchitecture: A Closer Look

Remember…
  • Single-cycle machine
[Figure labels: AS, Combinational Logic, ASNext, Sequential Logic (State)]

Let’s Start with the State Elements
  • Data and control inputs
[Figure: state elements. Based on original figure from P&H CO&D, Copyright 2004 Elsevier. All rights reserved.]

For Now, We Will Assume
  • “Magic” memory and register file
  • Combinational read
      - Output of the read data port is a combinational function of the register file contents and the corresponding read select port
  • Synchronous write
      - The selected register is updated on the positive-edge clock transition when write enable is asserted
          · Cannot affect read output in between clock edges
  • Single-cycle, synchronous memory
      - Contrast this with memory that tells when the data is ready
      - i.e., Ready bit: indicating the read or write is done
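The read/write assumptions above can be sketched in software. This is a minimal illustrative model, not the lecture's hardware: the class name RegisterFile and its methods (read, set_write, posedge) are hypothetical, and it additionally assumes register 0 is hardwired to zero, as in MIPS.

```python
class RegisterFile:
    """Toy model of a 32-entry register file (hypothetical helper)."""

    def __init__(self, num_regs=32):
        self.regs = [0] * num_regs
        # Values driven on the write port; committed only on a clock edge.
        self._pending = None

    def read(self, select):
        # Combinational read: depends only on current contents and the select input.
        return self.regs[select]

    def set_write(self, select, data, write_enable):
        # Drive the write port inputs; state does not change until posedge().
        self._pending = (select, data & 0xFFFFFFFF) if write_enable else None

    def posedge(self):
        # Synchronous write: commit on the positive clock edge if enabled.
        if self._pending is not None:
            select, data = self._pending
            if select != 0:  # assumption: register 0 always reads as zero
                self.regs[select] = data
            self._pending = None


rf = RegisterFile()
rf.set_write(select=5, data=42, write_enable=True)
before_edge = rf.read(5)  # old value: a write cannot affect reads between edges
rf.posedge()
after_edge = rf.read(5)   # new value is visible after the positive edge
```

The pending-write slot is what separates "inputs are driven" from "state is updated", which is the property the slide relies on.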

Instruction Processing
  • 5 generic steps (P&H)
      - Instruction fetch (IF)
      - Instruction decode and register operand fetch (ID/RF)
      - Execute/Evaluate memory address (EX/AG)
      - Memory operand fetch (MEM)
      - Store/writeback result (WB)
[Figure: datapath annotated with IF, ID/RF, EX/AG, MEM, WB. Based on original figure from P&H CO&D, Copyright 2004 Elsevier. All rights reserved.]

What Is To Come: The Full MIPS Datapath
[Figure: full MIPS datapath; labels: PCSrc1=Jump, PCSrc2=Br Taken, bcond, ALU operation. JAL, JR, JALR omitted. Based on original figure from P&H CO&D, Copyright 2004 Elsevier. All rights reserved.]

Single-Cycle Datapath for Arithmetic and Logical Instructions

R-Type ALU Instructions
  • Assembly (e.g., register-register signed addition)
        ADD rd_reg rs_reg rt_reg
  • Machine encoding (R-type)
        0       rs      rt      rd      0       ADD
        6-bit   5-bit   5-bit   5-bit   5-bit   6-bit
  • Semantics
        if MEM[PC] == ADD rd rs rt
            GPR[rd] ← GPR[rs] + GPR[rt]
            PC ← PC + 4
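As a rough software sketch of the R-type encoding and ADD semantics above: the helper names decode_r_type and step_add are hypothetical, the funct value 0x20 is the standard MIPS code for ADD, and the overflow trap of the real ADD instruction is deliberately ignored.

```python
FUNCT_ADD = 0x20  # MIPS funct code for ADD


def decode_r_type(inst):
    """Split a 32-bit R-type instruction word into its fields."""
    return {
        "opcode": (inst >> 26) & 0x3F,  # bits 31:26
        "rs": (inst >> 21) & 0x1F,      # bits 25:21
        "rt": (inst >> 16) & 0x1F,      # bits 20:16
        "rd": (inst >> 11) & 0x1F,      # bits 15:11
        "shamt": (inst >> 6) & 0x1F,    # bits 10:6
        "funct": inst & 0x3F,           # bits 5:0
    }


def step_add(inst, gpr, pc):
    """Apply GPR[rd] <- GPR[rs] + GPR[rt]; return PC + 4 (overflow trap ignored)."""
    f = decode_r_type(inst)
    if f["opcode"] != 0 or f["funct"] != FUNCT_ADD:
        raise ValueError("not an ADD instruction")
    gpr[f["rd"]] = (gpr[f["rs"]] + gpr[f["rt"]]) & 0xFFFFFFFF
    return pc + 4


# ADD $3, $1, $2: opcode=0, rs=1, rt=2, rd=3, shamt=0, funct=0x20
inst = (0 << 26) | (1 << 21) | (2 << 16) | (3 << 11) | (0 << 6) | FUNCT_ADD
gpr = [0] * 32
gpr[1], gpr[2] = 7, 35
next_pc = step_add(inst, gpr, pc=0x00400000)
```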

ALU Datapath
        if MEM[PC] == ADD rd rs rt
            GPR[rd] ← GPR[rs] + GPR[rt]
            PC ← PC + 4
[Figure: ALU datapath; register file ports driven by instruction bits 25:21, 20:16, 15:11; stages IF, ID, EX, MEM, WB; combinational state update logic. Based on original figure from P&H CO&D, Copyright 2004 Elsevier. All rights reserved.]

I-Type ALU Instructions
  • Assembly (e.g., register-immediate signed additions)
        ADDI rt_reg rs_reg immediate16
  • Machine encoding (I-type)
        ADDI    rs      rt      immediate
        6-bit   5-bit   5-bit   16-bit
  • Semantics
        if MEM[PC] == ADDI rt rs immediate
            GPR[rt] ← GPR[rs] + sign-extend(immediate)
            PC ← PC + 4
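The sign-extension step is the part of ADDI that is easiest to get wrong, so here is a small sketch of it. The function names sign_extend16 and step_addi are hypothetical; 0x08 is the standard MIPS opcode for ADDI, and the overflow trap is again ignored.

```python
OP_ADDI = 0x08  # MIPS opcode for ADDI


def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def step_addi(inst, gpr, pc):
    """Apply GPR[rt] <- GPR[rs] + sign-extend(immediate); return PC + 4."""
    rs = (inst >> 21) & 0x1F
    rt = (inst >> 16) & 0x1F
    imm = inst & 0xFFFF
    gpr[rt] = (gpr[rs] + sign_extend16(imm)) & 0xFFFFFFFF
    return pc + 4


# ADDI $5, $4, -1: the immediate field holds 0xFFFF
inst = (OP_ADDI << 26) | (4 << 21) | (5 << 16) | 0xFFFF
gpr = [0] * 32
gpr[4] = 10
next_pc = step_addi(inst, gpr, pc=0)
```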

Datapath for R and I-Type ALU Insts.
        if MEM[PC] == ADDI rt rs immediate
            GPR[rt] ← GPR[rs] + sign-extend(immediate)
            PC ← PC + 4
[Figure: datapath for R-type and I-type ALU instructions; instruction bits 25:21, 20:16, 15:11; RegDest driven by isItype, ALUSrc driven by isItype; stages IF, ID, EX, MEM, WB; combinational state update logic. Based on original figure from P&H CO&D, Copyright 2004 Elsevier. All rights reserved.]

Single-Cycle Datapath for Data Movement Instructions

Load Instructions
  • Assembly (e.g., load 4-byte word)
        LW rt_reg offset16(base_reg)
  • Machine encoding (I-type)
        LW      base    rt      offset
        6-bit   5-bit   5-bit   16-bit
  • Semantics
        if MEM[PC] == LW rt offset16(base)
            EA = sign-extend(offset) + GPR[base]
            GPR[rt] ← MEM[ translate(EA) ]
            PC ← PC + 4

LW Datapath
        if MEM[PC] == LW rt offset16(base)
            EA = sign-extend(offset) + GPR[base]
            GPR[rt] ← MEM[ translate(EA) ]
            PC ← PC + 4
[Figure: LW datapath; ALU operation: add; RegDest driven by isItype, ALUSrc driven by isItype; stages IF, ID, EX, MEM, WB; combinational state update logic.]

Store Instructions
  • Assembly (e.g., store 4-byte word)
        SW rt_reg offset16(base_reg)
  • Machine encoding (I-type)
        SW      base    rt      offset
        6-bit   5-bit   5-bit   16-bit
  • Semantics
        if MEM[PC] == SW rt offset16(base)
            EA = sign-extend(offset) + GPR[base]
            MEM[ translate(EA) ] ← GPR[rt]
            PC ← PC + 4
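The LW and SW semantics share the same effective-address computation, which the following sketch makes explicit. Everything here is illustrative: step_lw and step_sw are hypothetical names, memory is modeled as a Python dict keyed by byte address, and translate is assumed to be the identity (no virtual memory).

```python
def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def translate(ea):
    # Assumption: identity address translation.
    return ea


def effective_address(offset, base_value):
    """EA = sign-extend(offset) + GPR[base], wrapped to 32 bits."""
    return (sign_extend16(offset) + base_value) & 0xFFFFFFFF


def step_lw(rt, base, offset, gpr, mem, pc):
    """GPR[rt] <- MEM[translate(EA)]; unwritten locations read as 0 in this model."""
    gpr[rt] = mem.get(translate(effective_address(offset, gpr[base])), 0)
    return pc + 4


def step_sw(rt, base, offset, gpr, mem, pc):
    """MEM[translate(EA)] <- GPR[rt]."""
    mem[translate(effective_address(offset, gpr[base]))] = gpr[rt]
    return pc + 4


gpr = [0] * 32
gpr[8] = 0x1000          # base register
gpr[9] = 0xDEADBEEF      # value to store
mem = {}
pc = step_sw(rt=9, base=8, offset=0xFFFC, gpr=gpr, mem=mem, pc=0)    # SW $9, -4($8)
pc = step_lw(rt=10, base=8, offset=0xFFFC, gpr=gpr, mem=mem, pc=pc)  # LW $10, -4($8)
```

Both instructions use the ALU for the same addition, which is why the datapath can share one adder and one sign-extension unit between them.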

SW Datapath
        if MEM[PC] == SW rt offset16(base)
            EA = sign-extend(offset) + GPR[base]
            MEM[ translate(EA) ] ← GPR[rt]
            PC ← PC + 4
[Figure: SW datapath; ALU operation: add; RegDest driven by isItype, ALUSrc driven by isItype; stages IF, ID, EX, MEM, WB; combinational state update logic.]

Load-Store Datapath
[Figure: combined load-store datapath; ALU operation: add; signal labels: isStore, RegDest (isItype), !isStore, ALUSrc (isItype), isLoad. Based on original figure from P&H CO&D, Copyright 2004 Elsevier. All rights reserved.]

Datapath for Non-Control-Flow Insts.
[Figure: datapath for all non-control-flow instructions; signal labels: isStore, RegDest (isItype), !isStore, ALUSrc (isItype), isLoad, MemtoReg (isLoad). Based on original figure from P&H CO&D, Copyright 2004 Elsevier. All rights reserved.]

Single-Cycle Datapath for Control Flow Instructions

Unconditional Jump Instructions
  • Assembly
        J immediate26
  • Machine encoding (J-type)
        J       immediate
        6-bit   26-bit
  • Semantics
        if MEM[PC] == J immediate26
            target = { PC[31:28], immediate26, 2’b00 }
            PC ← target
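The concatenation { PC[31:28], immediate26, 2'b00 } can be checked with a few lines of bit manipulation. The function name jump_target is hypothetical; the sketch follows the slide's semantics and takes the upper four bits from the current PC.

```python
def jump_target(pc, immediate26):
    """Build { PC[31:28], immediate26, 2'b00 } as a 32-bit address."""
    upper = pc & 0xF0000000                     # PC[31:28] kept in place
    middle = (immediate26 & 0x03FFFFFF) << 2    # 26-bit field followed by two zero bits
    return upper | middle


target = jump_target(pc=0x10000008, immediate26=0x0000040)
```

Because only 26 bits come from the instruction, a J instruction can only reach targets within the same 256 MB region selected by PC[31:28].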

Unconditional Jump Datapath
        if MEM[PC] == J immediate26
            PC = { PC[31:28], immediate26, 2’b00 }
[Figure: unconditional jump datapath; labels: isJ, PCSrc, concat, ALUSrc, with don't-care (X) control values. Based on original figure from P&H CO&D, Copyright 2004 Elsevier. All rights reserved.]
What about JR, JALR?

Conditional Branch Instructions
  • Assembly (e.g., branch if equal)
        BEQ rs_reg rt_reg immediate16
  • Machine encoding (I-type)
        BEQ     rs      rt      immediate
        6-bit   5-bit   5-bit   16-bit
  • Semantics (assuming no branch delay slot)
        if MEM[PC] == BEQ rs rt immediate16
            target = PC + 4 + sign-extend(immediate) x 4
            if GPR[rs] == GPR[rt] then PC ← target
            else PC ← PC + 4
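A short sketch of the BEQ next-PC computation under the same "no branch delay slot" assumption as the slide. The names beq_next_pc and sign_extend16 are hypothetical helpers, not part of the lecture.

```python
def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def beq_next_pc(pc, rs_value, rt_value, immediate16):
    """Return the next PC for BEQ, assuming no branch delay slot."""
    # target = PC + 4 + sign-extend(immediate) x 4
    target = (pc + 4 + sign_extend16(immediate16) * 4) & 0xFFFFFFFF
    fall_through = (pc + 4) & 0xFFFFFFFF
    return target if rs_value == rt_value else fall_through


taken = beq_next_pc(pc=0x100, rs_value=5, rt_value=5, immediate16=3)
not_taken = beq_next_pc(pc=0x100, rs_value=5, rt_value=6, immediate16=3)
backward = beq_next_pc(pc=0x100, rs_value=0, rt_value=0, immediate16=0xFFFF)
```

An immediate of -1 (0xFFFF) branches back to the branch instruction itself, since the offset is relative to PC + 4.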

Conditional Branch Datapath (For You to Fix)
[Figure: conditional branch datapath, marked "watch out"; labels: PCSrc, sub, bcond, concat. Based on original figure from P&H CO&D, Copyright 2004 Elsevier. All rights reserved.]
How to uphold the delayed branch semantics?

Putting It All Together
[Figure: full single-cycle datapath; labels: PCSrc1=Jump, PCSrc2=Br Taken, bcond, ALU operation. JAL, JR, JALR omitted. Based on original figure from P&H CO&D, Copyright 2004 Elsevier. All rights reserved.]

Single-Cycle Control Logic

Single-Cycle Hardwired Control
  • As combinational function of Inst = MEM[PC]

        R-type:  opcode [31:26]  rs [25:21]  rt [20:16]  rd [15:11]  shamt [10:6]  funct [5:0]
                 6-bit           5-bit       5-bit       5-bit       5-bit         6-bit
        I-type:  opcode [31:26]  rs [25:21]  rt [20:16]  immediate [15:0]
                 6-bit           5-bit       5-bit       16-bit
        J-type:  opcode [31:26]  immediate [25:0]
                 6-bit           26-bit

  • Consider
      - All R-type and I-type ALU instructions
      - LW and SW
      - BEQ, BNE, BLEZ, BGTZ
      - J, JR, JALR

Single-Bit Control Signals

  RegDest
      When de-asserted: GPR write select according to rt, i.e., inst[20:16]
      When asserted:    GPR write select according to rd, i.e., inst[15:11]
      Equation:         opcode==0
  ALUSrc
      When de-asserted: 2nd ALU input from 2nd GPR read port
      When asserted:    2nd ALU input from sign-extended 16-bit immediate
      Equation:         (opcode!=0) && (opcode!=BEQ) && (opcode!=BNE)
  MemtoReg
      When de-asserted: steer ALU result to GPR write port
      When asserted:    steer memory load to GPR write port
      Equation:         opcode==LW
  RegWrite
      When de-asserted: GPR write disabled
      When asserted:    GPR write enabled
      Equation:         (opcode!=SW) && (opcode!=Bxx) && (opcode!=J) && (opcode!=JR)

JAL and JALR require additional RegDest and MemtoReg options

Single-Bit Control Signals

  MemRead
      When de-asserted: memory read disabled
      When asserted:    memory read port returns load value
      Equation:         opcode==LW
  MemWrite
      When de-asserted: memory write disabled
      When asserted:    memory write enabled
      Equation:         opcode==SW
  PCSrc1
      When de-asserted: according to PCSrc2
      When asserted:    next PC is based on 26-bit immediate jump target
      Equation:         (opcode==J) || (opcode==JAL)
  PCSrc2
      When de-asserted: next PC = PC + 4
      When asserted:    next PC is based on 16-bit immediate branch target
      Equation:         (opcode==Bxx) && “bcond is satisfied”

JR and JALR require additional PCSrc options
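The single-bit control equations can be written down as one combinational function of the opcode. This is a sketch under stated assumptions: the function name control_signals is hypothetical, the numeric opcodes are the standard MIPS values, RegWrite is also de-asserted for J (a jump writes no register), and JR/JALR (which share opcode 0 and need extra RegDest, MemtoReg, and PCSrc options) are left out.

```python
# Standard MIPS opcode values.
OP_RTYPE, OP_J, OP_JAL = 0x00, 0x02, 0x03
OP_BEQ, OP_BNE, OP_BLEZ, OP_BGTZ = 0x04, 0x05, 0x06, 0x07
OP_ADDI, OP_LW, OP_SW = 0x08, 0x23, 0x2B
BRANCH_OPCODES = {OP_BEQ, OP_BNE, OP_BLEZ, OP_BGTZ}  # "Bxx" in the tables


def control_signals(opcode, bcond=False):
    """Return the single-bit control signals as booleans (True = asserted)."""
    is_branch = opcode in BRANCH_OPCODES
    return {
        "RegDest": opcode == OP_RTYPE,
        "ALUSrc": opcode != OP_RTYPE and opcode not in (OP_BEQ, OP_BNE),
        "MemtoReg": opcode == OP_LW,
        # Assumption: J is excluded as well, since it writes no register.
        "RegWrite": opcode != OP_SW and not is_branch and opcode != OP_J,
        "MemRead": opcode == OP_LW,
        "MemWrite": opcode == OP_SW,
        "PCSrc1": opcode in (OP_J, OP_JAL),
        # Depends on data: bcond comes out of the ALU in the same cycle.
        "PCSrc2": is_branch and bcond,
    }


lw = control_signals(OP_LW)
sw = control_signals(OP_SW)
beq_taken = control_signals(OP_BEQ, bcond=True)
r_type = control_signals(OP_RTYPE)
```

Signals that are don't-cares for an instruction (for example ALUSrc for J) still get some value here; any value is acceptable as long as no state is written.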

ALU Control
  • case opcode
        ‘0’     select operation according to funct
        ‘ALUi’  select operation according to opcode
        ‘LW’    select addition
        ‘SW’    select addition
        ‘Bxx’   select bcond generation function
        __      don’t care
  • Example ALU operations
      - ADD, SUB, AND, OR, XOR, NOR, etc.
      - bcond on equal, not equal, LE zero, GT zero, etc.
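The case statement above maps naturally onto a small lookup function. The sketch below is only illustrative: alu_control and the operation names it returns are hypothetical, the funct and opcode numbers are the standard MIPS values, and only a handful of instructions are listed.

```python
# funct -> operation for R-type instructions (opcode 0)
FUNCT_TO_OP = {0x20: "ADD", 0x22: "SUB", 0x24: "AND", 0x25: "OR", 0x26: "XOR", 0x27: "NOR"}
# opcode -> operation for immediate ALU instructions ("ALUi")
ALUI_TO_OP = {0x08: "ADD", 0x0C: "AND", 0x0D: "OR", 0x0E: "XOR"}
# opcode -> condition for branches ("Bxx")
BRANCH_TO_BCOND = {0x04: "EQ", 0x05: "NE", 0x06: "LE_ZERO", 0x07: "GT_ZERO"}
MEMORY_OPCODES = {0x23, 0x2B}  # LW, SW


def alu_control(opcode, funct=0):
    """Pick the ALU operation; None stands for don't care."""
    if opcode == 0x00:
        return FUNCT_TO_OP.get(funct)      # select according to funct
    if opcode in ALUI_TO_OP:
        return ALUI_TO_OP[opcode]          # select according to opcode
    if opcode in MEMORY_OPCODES:
        return "ADD"                       # address generation
    if opcode in BRANCH_TO_BCOND:
        return "BCOND_" + BRANCH_TO_BCOND[opcode]
    return None                            # e.g. J: ALU result is unused


ops = [alu_control(0x00, 0x22), alu_control(0x08), alu_control(0x23),
       alu_control(0x2B), alu_control(0x04), alu_control(0x02)]
```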

R-Type ALU
[Figure: full datapath with control signal values for R-type ALU instructions; labels: PCSrc1=Jump, PCSrc2=Br Taken, bcond; ALU operation selected by funct. Based on original figure from P&H CO&D, Copyright 2004 Elsevier. All rights reserved.]

I-Type ALU
[Figure: full datapath with control signal values for I-type ALU instructions; labels: PCSrc1=Jump, PCSrc2=Br Taken, bcond; ALU operation selected by opcode. Based on original figure from P&H CO&D, Copyright 2004 Elsevier. All rights reserved.]

LW
[Figure: full datapath with control signal values for LW; labels: PCSrc1=Jump, PCSrc2=Br Taken, bcond; ALU operation: Add. Based on original figure from P&H CO&D, Copyright 2004 Elsevier. All rights reserved.]

SW
[Figure: full datapath with control signal values for SW (* marks don't-care signals); labels: PCSrc1=Jump, PCSrc2=Br Taken, bcond; ALU operation: Add. Based on original figure from P&H CO&D, Copyright 2004 Elsevier. All rights reserved.]

Branch Not Taken
Some control signals are dependent on the processing of data
[Figure: full datapath with control signal values for a not-taken branch (* marks don't-care signals); labels: PCSrc1=Jump, PCSrc2=Br Taken, bcond; ALU operation: bcond. Based on original figure from P&H CO&D, Copyright 2004 Elsevier. All rights reserved.]

Branch Taken
Some control signals are dependent on the processing of data
[Figure: full datapath with control signal values for a taken branch (* marks don't-care signals); labels: PCSrc1=Jump, PCSrc2=Br Taken, bcond; ALU operation: bcond. Based on original figure from P&H CO&D, Copyright 2004 Elsevier. All rights reserved.]

Jump
[Figure: full datapath with control signal values for a jump (* marks don't-care signals, including the ALU operation); labels: PCSrc1=Jump, PCSrc2=Br Taken, bcond. Based on original figure from P&H CO&D, Copyright 2004 Elsevier. All rights reserved.]

What is in That Control Box?
  • Combinational Logic: Hardwired Control
      - Idea: Control signals generated combinationally based on instruction
      - Necessary in a single-cycle microarchitecture…
  • Sequential Logic: Sequential/Microprogrammed Control
      - Idea: A memory structure contains the control signals associated with an instruction
      - Control Store

Evaluating the Single-Cycle Microarchitecture

A Single-Cycle Microarchitecture
  • Is this a good idea/design?
  • When is this a good design?
  • When is this a bad design?
  • How can we design a better microarchitecture?

A Single-Cycle Microarchitecture: Analysis
  • Every instruction takes 1 cycle to execute
      - CPI (cycles per instruction) is strictly 1
  • How long each instruction takes is determined by how long the slowest instruction takes to execute
      - Even though many instructions do not need that long to execute
  • Clock cycle time of the microarchitecture is determined by how long it takes to complete the slowest instruction
      - Critical path of the design is determined by the processing time of the slowest instruction

What is the Slowest Instruction to Process?
  • Let’s go back to the basics
  • All six phases of the instruction processing cycle take a single machine clock cycle to complete
      - Six phases: Fetch, Decode, Evaluate Address, Fetch Operands, Execute, Store Result
      - Five steps:
          1. Instruction fetch (IF)
          2. Instruction decode and register operand fetch (ID/RF)
          3. Execute/Evaluate memory address (EX/AG)
          4. Memory operand fetch (MEM)
          5. Store/writeback result (WB)
  • Does each of the above phases take the same time (latency) for all instructions?

Single-Cycle Datapath Analysis
  • Assume
      - memory units (read or write): 200 ps
      - ALU and adders: 100 ps
      - register file (read or write): 50 ps
      - other combinational logic: 0 ps

        steps       IF    ID    EX    MEM   WB    Delay
        resources   mem   RF    ALU   mem   RF    (ps)
        R-type      200   50    100         50    400
        I-type      200   50    100         50    400
        LW          200   50    100   200   50    600
        SW          200   50    100   200         550
        Branch      200   50    100               350
        Jump        200                           200
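The delay column is just the sum of the resource delays each instruction class uses, and the single-cycle clock period is the maximum of those sums. A small sketch that recomputes the table from the stated assumptions (the variable names are illustrative):

```python
# Resource delays in picoseconds, as assumed on the slide.
DELAY_PS = {"mem": 200, "ALU": 100, "RF": 50}

# Resources used by each instruction class, in IF/ID/EX/MEM/WB order.
RESOURCES_USED = {
    "R-type": ["mem", "RF", "ALU", "RF"],
    "I-type": ["mem", "RF", "ALU", "RF"],
    "LW":     ["mem", "RF", "ALU", "mem", "RF"],
    "SW":     ["mem", "RF", "ALU", "mem"],
    "Branch": ["mem", "RF", "ALU"],
    "Jump":   ["mem"],
}

# Latency of each class = sum of the delays along its path.
latency_ps = {name: sum(DELAY_PS[r] for r in used)
              for name, used in RESOURCES_USED.items()}

# In a single-cycle design the clock must cover the slowest instruction.
slowest = max(latency_ps, key=latency_ps.get)
cycle_time_ps = latency_ps[slowest]

for name, ps in latency_ps.items():
    print(f"{name:7s} {ps:4d} ps  (idle for {cycle_time_ps - ps} ps each cycle)")
```

With these numbers LW sets the 600 ps cycle time, so a Jump that needs only 200 ps leaves the datapath idle for two thirds of every cycle it occupies.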

Let’s Find the Critical Path
[Figure: full single-cycle datapath; labels: PCSrc1=Jump, PCSrc2=Br Taken, bcond, ALU operation. Based on original figure from P&H CO&D, Copyright 2004 Elsevier. All rights reserved.]

R-Type and I-Type ALU
[Figure: datapath annotated with signal arrival times for R-type and I-type ALU instructions: 100 ps, 250 ps, 350 ps, 400 ps. Based on original figure from P&H CO&D, Copyright 2004 Elsevier. All rights reserved.]

LW
[Figure: datapath annotated with signal arrival times for LW: 100 ps, 250 ps, 350 ps, 550 ps, 600 ps. Based on original figure from P&H CO&D, Copyright 2004 Elsevier. All rights reserved.]

SW
[Figure: datapath annotated with signal arrival times for SW: 100 ps, 250 ps, 350 ps, 550 ps. Based on original figure from P&H CO&D, Copyright 2004 Elsevier. All rights reserved.]

Branch Taken
[Figure: datapath annotated with signal arrival times for a taken branch: 100 ps, 200 ps, 250 ps, 350 ps. Based on original figure from P&H CO&D, Copyright 2004 Elsevier. All rights reserved.]

Jump
[Figure: datapath annotated with signal arrival times for a jump: 100 ps, 200 ps. Based on original figure from P&H CO&D, Copyright 2004 Elsevier. All rights reserved.]

What About Control Logic?
  • How does that affect the critical path?
  • Food for thought for you:
      - Can control logic be on the critical path?
      - A note on CDC 5600: control store access too long…

What is the Slowest Instruction to Process?
  • Memory is not magic
  • What if memory sometimes takes 100 ms to access?
  • Does it make sense for a simple register-to-register add or a jump to take {100 ms + all else needed to do a memory operation}?
  • And, what if you need to access memory more than once to process an instruction?
      - Which instructions need this?
      - Do you provide multiple ports to memory?

Single-Cycle uArch: Complexity
  • Contrived
      - All instructions run as slow as the slowest instruction
  • Inefficient
      - All instructions run as slow as the slowest instruction
      - Must provide worst-case combinational resources in parallel as required by any instruction
      - Need to replicate a resource if it is needed more than once by an instruction during different parts of the instruction processing cycle
  • Not necessarily the simplest way to implement an ISA
      - Single-cycle implementation of REP MOVS, INDEX, POLY?
  • Not easy to optimize/improve performance
      - Optimizing the common case does not work (e.g. common instructions)
      - Need to optimize the worst case all the time

Microarchitecture Design Principles
  • Critical path design
      - Find the maximum combinational logic delay and decrease it
  • Bread and butter (common case) design
      - Spend time and resources on where it matters
          · i.e., improve what the machine is really designed to do
      - Common case vs. uncommon case
  • Balanced design
      - Balance instruction/data flow through hardware components
      - Balance the hardware needed to accomplish the work
  • How does a single-cycle microarchitecture fare in light of these principles?

Multi-Cycle Microarchitectures

Multi-Cycle Microarchitectures
  • Goal: Let each instruction take (close to) only as much time as it really needs
  • Idea
      - Determine clock cycle time independently of instruction processing time
      - Each instruction takes as many clock cycles as it needs to take
          · Multiple state transitions per instruction
          · The states followed by each instruction are different

Remember: The “Process instruction” Step
  • ISA specifies abstractly what A’ should be, given an instruction and A
      - It defines an abstract finite state machine where
          · State = programmer-visible state
          · Next-state logic = instruction execution specification
      - From ISA point of view, there are no “intermediate states” between A and A’ during instruction execution
          · One state transition per instruction
  • Microarchitecture implements how A is transformed to A’
      - There are many choices in implementation
      - We can have programmer-invisible state to optimize the speed of instruction execution: multiple state transitions per instruction
          · Choice 1: AS → AS’ (transform A to A’ in a single clock cycle)
          · Choice 2: AS → AS+MS1 → AS+MS2 → AS+MS3 → AS’ (take multiple clock cycles to transform AS to AS’)

Multi-Cycle Microarchitecture
    AS = Architectural (programmer visible) state at the beginning of an instruction
    Step 1: Process part of instruction in one clock cycle
    Step 2: Process part of instruction in the next clock cycle
    …
    AS’ = Architectural (programmer visible) state at the end of a clock cycle

Benefits of Multi-Cycle Design
  • Critical path design
      - Can keep reducing the critical path independently of the worst-case processing time of any instruction
  • Bread and butter (common case) design
      - Can optimize the number of states it takes to execute “important” instructions that make up much of the execution time
  • Balanced design
      - No need to provide more capability or resources than really needed
          · An instruction that needs resource X multiple times does not require multiple X’s to be implemented
          · Leads to more efficient hardware: Can reuse hardware components needed multiple times for an instruction

Remember: Performance Analysis
  • Execution time of an instruction
      - {CPI} x {clock cycle time}
  • Execution time of a program
      - Sum over all instructions [{CPI} x {clock cycle time}]
      - {# of instructions} x {Average CPI} x {clock cycle time}
  • Single-cycle microarchitecture performance
      - CPI = 1
      - Clock cycle time = long
  • Multi-cycle microarchitecture performance
      - CPI = different for each instruction
          · Average CPI: hopefully small
      - Clock cycle time = short
  • Now, we have two degrees of freedom to optimize independently
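To see how the two degrees of freedom interact, the formula {# of instructions} x {Average CPI} x {clock cycle time} can be evaluated for both designs. All numbers below are made up for illustration and are not from the lecture: the instruction mix, the per-class cycle counts of the multi-cycle design, and its 200 ps clock are assumptions; only the 600 ps single-cycle clock follows from the datapath analysis earlier in the lecture.

```python
def execution_time_ps(instruction_counts, cpi, cycle_time_ps):
    """Sum over all instructions of {CPI} x {clock cycle time}."""
    total_cycles = sum(count * cpi[kind] for kind, count in instruction_counts.items())
    return total_cycles * cycle_time_ps


# Hypothetical program of 100 instructions.
counts = {"ALU": 45, "LW": 25, "SW": 10, "Branch": 15, "Jump": 5}

# Single-cycle: CPI is strictly 1, clock set by the slowest instruction (600 ps).
single_cpi = {kind: 1 for kind in counts}
single_time = execution_time_ps(counts, single_cpi, cycle_time_ps=600)

# Multi-cycle (assumed): one cycle per step used, clock set by the slowest step (200 ps).
multi_cpi = {"ALU": 4, "LW": 5, "SW": 4, "Branch": 3, "Jump": 1}
multi_time = execution_time_ps(counts, multi_cpi, cycle_time_ps=200)

total_instructions = sum(counts.values())
average_cpi = sum(counts[k] * multi_cpi[k] for k in counts) / total_instructions

print(f"single-cycle: {single_time} ps, multi-cycle: {multi_time} ps, "
      f"multi-cycle average CPI: {average_cpi}")
```

With these particular assumptions the multi-cycle design comes out slower (79,000 ps versus 60,000 ps), which shows that a shorter clock only pays off when the average CPI stays small enough; changing either the cycle time or the per-class cycle counts changes the outcome.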