CS 35101 Computer Architecture Week 10 Single Cycle


CS 35101 - Computer Architecture
Week 10: Single Cycle Implementation
Paul Durand (www.cs.kent.edu/~durand)
[Adapted from M Irwin (www.cse.psu.edu/~mji)]
[Adapted from COD, Patterson & Hennessy, © 2005, UCB]

Heads Up
• This week's material
  - Building a MIPS datapath
  - Single cycle datapath implementation & control
    - Reading assignment: PH 5.1-5.4
• Next week's material
  - Multiple cycle datapath implementation & control
    - Reading assignment: PH 5.6-5.7

Review: Design Principles
• Simplicity favors regularity
  - fixed size instructions (32 bits)
  - only three instruction formats
• Good design demands good compromises
  - three instruction formats
• Smaller is faster
  - limited instruction set
  - limited number of registers in register file
  - limited number of addressing modes
• Make the common case fast
  - arithmetic operands from the register file (load-store machine)
  - allow instructions to contain immediate operands

The Processor: Datapath & Control
• We're ready to look at an implementation of the MIPS
• Simplified to contain only:
  - memory-reference instructions: lw, sw
  - arithmetic-logical instructions: add, sub, and, or, slt
  - control flow instructions: beq, j
• Generic implementation:
  - use the program counter (PC) to supply the instruction address and fetch the instruction from memory (and update the PC)
  - decode the instruction (and read registers)
  - execute the instruction
  [Figure: cycle of Fetch (PC = PC+4), Decode, Exec]
• All instructions (except j) use the ALU after reading the registers
  - Why? memory-reference? arithmetic? control flow?

Abstract Implementation View
• Two types of functional units:
  - elements that operate on data values (combinational)
  - elements that contain state (sequential)
[Figure: PC, Instruction Memory (Address, Instruction), Register File (three Reg Addr inputs, Write Data, two Read Data outputs), ALU, Data Memory (Address, Write Data, Read Data)]
• Single cycle operation
• Split memory (Harvard) model: one memory for instructions and one for data

Single Cycle Implementation – where we are headed


Clocking Methodologies
• Clocking methodology defines when signals can be read and when they can be written
  [Figure: clock waveform marking the falling (negative) edge, the rising (positive) edge, and the cycle time]
  clock rate = 1/(cycle time)
  e.g., 10 nsec cycle time = 100 MHz clock rate
         1 nsec cycle time = 1 GHz clock rate
• State element design choices
  - level sensitive latch
  - master-slave and edge-triggered flipflops
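The clock rate relation can be checked with a couple of lines of Python. This is only a sketch; the function name `clock_rate_hz` is made up for illustration and is not part of the slides.

```python
def clock_rate_hz(cycle_time_s: float) -> float:
    """Clock rate is the reciprocal of the cycle time (seconds -> hertz)."""
    return 1.0 / cycle_time_s


# The two examples from the slide: 10 nsec and 1 nsec cycle times.
rate_10ns = clock_rate_hz(10e-9)
rate_1ns = clock_rate_hz(1e-9)

print(f"{rate_10ns / 1e6:.0f} MHz")  # 100 MHz
print(f"{rate_1ns / 1e9:.0f} GHz")   # 1 GHz
```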

Review: State Elements
• Set-reset latch
  R  S | Q(t+1)  !Q(t+1)
  1  0 |   0        1
  0  1 |   1        0
  0  0 |  Q(t)    !Q(t)
  1  1 |   0        0
• Level sensitive D latch
  [Figure: D latch with inputs D and clock and outputs Q and !Q; timing diagram of clock, D and Q]
  - latch is transparent when clock is high (copies input to output)
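A level sensitive D latch can be modeled behaviorally as follows. This is a rough sketch, not a gate-level model; the class name `DLatch` and its method names are hypothetical.

```python
class DLatch:
    """Behavioral model of a level sensitive D latch (illustrative only)."""

    def __init__(self, q: int = 0):
        self.q = q

    def update(self, clock: int, d: int) -> int:
        # Transparent while the clock is high: the input is copied to Q.
        # While the clock is low, the latch holds its previous value.
        if clock:
            self.q = d
        return self.q

    @property
    def q_bar(self) -> int:
        return 1 - self.q


latch = DLatch()
# (clock, D) pairs: D only reaches Q while the clock is high.
stimulus = [(1, 1), (0, 0), (0, 1), (1, 0)]
trace = [latch.update(clock, d) for clock, d in stimulus]
print(trace)  # [1, 1, 1, 0]
```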

Review: State Elements, cont'd
• Race problem with latch based design
  [Figure: D-latch 0 and D-latch 1, each with D, clock, Q and !Q, driven by a common clock]
• Consider the case when D-latch 0 holds a 0 and D-latch 1 holds a 1 and you want to transfer the contents of D-latch 0 to D-latch 1 and vice versa
  - must have the clock high long enough for the transfer to take place
  - must not leave the clock high so long that the transferred data is copied back into the original latch
• Two-sided clock constraint

Review: State Elements, cont'd
• Solution is to use flipflops that change state (Q) only on clock edge (master-slave)
  [Figure: two D-latches in series with outputs Q and !Q; timing diagram of clock, D and Q]
  - master (first D-latch) copies the input when the clock is high (the slave (second D-latch) is locked in its memory state and the output does not change)
  - slave copies the master when the clock goes low (the master is now locked in its memory state so changes at the input are not loaded into the master D-latch)
• One-sided clock constraint
  - must have the clock cycle time long enough to accommodate the worst case delay path
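The master-slave behavior described above can be sketched behaviorally in Python. The class name `MasterSlaveFlipFlop` is an assumption for illustration; the model only captures which latch is transparent at each clock level.

```python
class MasterSlaveFlipFlop:
    """Behavioral sketch of a master-slave D flipflop (illustrative only)."""

    def __init__(self):
        self.master = 0  # state of the first D-latch
        self.q = 0       # state of the second D-latch (the visible output)

    def update(self, clock: int, d: int) -> int:
        if clock:
            # Clock high: master copies the input, slave holds its state.
            self.master = d
        else:
            # Clock low: slave copies the master, master holds its state.
            self.q = self.master
        return self.q


ff = MasterSlaveFlipFlop()
# D toggles while the clock is high, but Q only changes once the clock goes low.
stimulus = [(1, 1), (1, 0), (1, 1), (0, 0), (0, 1), (1, 0), (0, 0)]
outputs = [ff.update(clock, d) for clock, d in stimulus]
print(outputs)  # [0, 0, 0, 1, 1, 1, 0]
```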

Our Implementation
• An edge-triggered methodology
• Typical execution
  - read contents of some state elements
  - send values through some combinational logic
  - write results to one or more state elements
  [Figure: State element 1, Combinational logic, State element 2, with a shared clock; one clock cycle marked]
• Assumes state elements are written on every clock cycle; if not, need explicit write control signal
  - write occurs only when both the write control is asserted and the clock edge occurs

Fetching Instructions
• Fetching instructions involves
  - reading the instruction from the Instruction Memory
  - updating the PC to hold the address of the next instruction
  [Figure: PC drives the Read Address of the Instruction Memory, which outputs the Instruction; an adder computes PC + 4]
  - PC is updated every cycle, so it does not need an explicit write control signal
  - Instruction Memory is read every cycle, so it doesn't need an explicit read control signal
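The fetch step can be sketched in Python as below. The dictionary standing in for the Instruction Memory, the starting address and the function name `fetch` are all assumptions made for illustration.

```python
def fetch(instruction_memory: dict, pc: int):
    """Read the instruction at PC and compute the next PC (PC + 4)."""
    instruction = instruction_memory[pc]
    next_pc = (pc + 4) & 0xFFFFFFFF  # keep the PC within 32 bits
    return instruction, next_pc


# Hypothetical memory contents: two 32-bit instruction words.
instruction_memory = {
    0x00400000: 0x8C890004,
    0x00400004: 0x01285020,
}

pc = 0x00400000
first, pc = fetch(instruction_memory, pc)
second, pc = fetch(instruction_memory, pc)
print(hex(first), hex(second), hex(pc))
```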

Decoding Instructions
• Decoding instructions involves
  - sending the fetched instruction's opcode and function field bits to the control unit
  - reading two values from the Register File
    - Register File addresses are contained in the instruction
  [Figure: the Instruction feeds the Control Unit and the Register File (Read Addr 1, Read Addr 2, Write Addr, Write Data; outputs Read Data 1, Read Data 2)]

Executing R Format Operations
• R format operations (add, sub, slt, and, or)
  R-type:  op [31-26] | rs [25-21] | rt [20-16] | rd [15-11] | shamt [10-6] | funct [5-0]
  - perform the indicated (by op and funct) operation on values in rs and rt
  - store the result back into the Register File (into location rd)
  [Figure: Instruction, Register File (Read Addr 1, Read Addr 2, Write Addr, Write Data, Read Data 1, Read Data 2) and ALU (ALU control, overflow, zero); control signal RegWrite]
  - Note that Register File is not written every cycle (e.g. sw), so we need an explicit write control signal for the Register File
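Extracting the R-type fields from a 32-bit instruction word amounts to shifts and masks. The following Python sketch uses a hypothetical helper name `decode_r_type`; the example word encodes an add with rs = 9, rt = 8 and rd = 10.

```python
def decode_r_type(instr: int) -> dict:
    """Split a 32-bit R-type instruction word into its six fields."""
    return {
        "op": (instr >> 26) & 0x3F,     # bits 31-26
        "rs": (instr >> 21) & 0x1F,     # bits 25-21
        "rt": (instr >> 16) & 0x1F,     # bits 20-16
        "rd": (instr >> 11) & 0x1F,     # bits 15-11
        "shamt": (instr >> 6) & 0x1F,   # bits 10-6
        "funct": instr & 0x3F,          # bits 5-0
    }


fields = decode_r_type(0x01285020)
print(fields)
# {'op': 0, 'rs': 9, 'rt': 8, 'rd': 10, 'shamt': 0, 'funct': 32}
```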

Executing Load and Store Operations
• Load and store operations
  I-Type:  op [31-26] | rs [25-21] | rt [20-16] | address offset [15-0]
  - compute a memory address by adding the base register (in rs) to the 16-bit signed offset field in the instruction
    - base register was read from the Register File during decode
    - offset value in the low order 16 bits of the instruction must be sign extended to create a 32-bit signed value
  - store value, read from the Register File during decode, must be written to the Data Memory
  - load value, read from the Data Memory, must be stored in the Register File
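The address computation for lw and sw can be sketched in Python as follows. The helper names `sign_extend_16` and `effective_address`, and the base register value, are assumptions for illustration.

```python
def sign_extend_16(value: int) -> int:
    """Interpret the low 16 bits of value as a signed two's complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def effective_address(base: int, instr: int) -> int:
    """Add the base register value to the sign extended 16-bit offset."""
    offset = sign_extend_16(instr & 0xFFFF)
    return (base + offset) & 0xFFFFFFFF


base = 0x10010000                                    # hypothetical value read from rs
addr_pos = effective_address(base, 0x8C890004)       # offset field = +4
addr_neg = effective_address(base, 0x8C89FFFC)       # offset field = -4
print(hex(addr_pos), hex(addr_neg))  # 0x10010004 0x1000fffc
```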

Executing Load and Store Operations, cont'd
[Figure: load/store datapath with Instruction, Register File (Read Addr 1, Read Addr 2, Write Addr, Write Data, Read Data 1, Read Data 2), Sign Extend (16 to 32 bits), ALU (ALU control, overflow, zero) and Data Memory (Address, Write Data, Read Data); control signals RegWrite, MemWrite, MemRead]

Executing Branch Operations
• Branch operations have to
  I-Type:  op [31-26] | rs [25-21] | rt [20-16] | address offset [15-0]
  - compare the operands read from the Register File during decode (rs and rt values) for equality (zero ALU output)
  - compute the branch target address by adding the updated PC to the sign extended 16-bit signed offset field in the instruction
    - "base register" is the updated PC
    - offset value in the low order 16 bits of the instruction must be sign extended to create a 32-bit signed value and then shifted left 2 bits to turn the word offset into a byte offset
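The branch steps can be sketched in Python as below. The function names and the example PC are hypothetical, and the equality test is modeled as a subtraction whose result is checked for zero, mirroring the ALU's zero output.

```python
def sign_extend_16(value: int) -> int:
    """Interpret the low 16 bits of value as a signed two's complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc: int, instr: int) -> int:
    """Updated PC (PC + 4) plus the sign extended offset shifted left 2 bits."""
    updated_pc = (pc + 4) & 0xFFFFFFFF
    offset = sign_extend_16(instr & 0xFFFF)
    return (updated_pc + (offset << 2)) & 0xFFFFFFFF


def next_pc_for_beq(pc: int, instr: int, rs_value: int, rt_value: int) -> int:
    """Take the branch only when rs - rt is zero."""
    zero = ((rs_value - rt_value) & 0xFFFFFFFF) == 0
    return branch_target(pc, instr) if zero else (pc + 4) & 0xFFFFFFFF


pc = 0x00400000
forward = branch_target(pc, 0x11090003)    # offset field = +3 words
backward = branch_target(pc, 0x1109FFFE)   # offset field = -2 words
print(hex(forward), hex(backward))  # 0x400010 0x3ffffc
```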

Executing Branch Operations, cont'd
[Figure: branch datapath with PC, Add (PC + 4), Sign Extend (16 to 32 bits), Shift left 2, a second Add producing the branch target address, Register File, and ALU (ALU control) whose zero output goes to the branch control logic]

Executing Jump Operations
• Jump operations have to
  J-Type:  op [31-26] | jump target address [25-0]
  - replace the lower 28 bits of the PC with the lower 26 bits of the fetched instruction shifted left by 2 bits
  [Figure: PC, Instruction Memory (Read Address, Instruction), Add (PC + 4), and Shift left 2 turning the 26-bit field into the 28-bit Jump address]
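A possible Python sketch of the jump address computation follows. The function name `jump_target` is hypothetical, and the upper 4 bits are taken from PC + 4, as in the full datapath later in the deck.

```python
def jump_target(pc: int, instr: int) -> int:
    """Combine the upper 4 bits of PC + 4 with the 26-bit field shifted left 2."""
    updated_pc = (pc + 4) & 0xFFFFFFFF
    low_28 = (instr & 0x03FFFFFF) << 2       # 26-bit field becomes 28 bits
    return (updated_pc & 0xF0000000) | low_28


# Example word: opcode 000010 (j) with a target field of 0x0100008.
target = jump_target(0x00400000, 0x08100008)
print(hex(target))  # 0x400020
```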

Our Simple Control Structure
• We wait for everything to settle down
  - ALU might not produce "right answer" right away
  - we use write signals along with the clock edge to determine when to write (to the Register File and the Data Memory)
• Cycle time determined by length of the longest path
We are ignoring some details like register setup and hold times

Creating a Single Datapath from the Parts
• Assemble the datapath segments discussed earlier, add control lines as needed, and design the control path
• Fetch, decode and execute each instruction in one clock cycle (single cycle design)
  - no datapath resource can be used more than once per instruction, so some must be duplicated (e.g., why we have a separate Instruction Memory and Data Memory)
  - to share datapath elements between two different instruction classes will need multiplexors at the input of the shared elements with control lines to do the selection
• Cycle time is determined by length of the longest path

Fetch, R, and Memory Access Portions
[Figure: fetch portion (PC, Instruction Memory, Add 4) joined with the Register File, ALU (ALU control, ovf, zero), Sign Extend (16 to 32 bits) and Data Memory; control signals RegWrite, MemWrite, MemRead]

Multiplexor Insertion
[Figure: the same datapath with multiplexors added; control signals RegWrite, ALUSrc, ALU control, MemtoReg, MemWrite, MemRead]

Clock Distribution
[Figure: the same datapath with the System Clock and one clock cycle marked; control signals MemWrite, RegWrite, MemtoReg, ALUSrc, ALU control, MemRead]

Adding the Branch Portion
[Figure: the datapath extended with Shift left 2 and a second Add for the branch target; control signals RegWrite, MemWrite, MemtoReg, PCSrc, ALU control, MemRead]

Adding the Control
• Selecting the operations to perform (ALU, Register File and Memory read/write)
• Controlling the flow of data (multiplexor inputs)
• Information comes from the 32 bits of the instruction
  R-type:  op [31-26] | rs [25-21] | rt [20-16] | rd [15-11] | shamt [10-6] | funct [5-0]
  I-Type:  op [31-26] | rs [25-21] | rt [20-16] | address offset [15-0]
• Observations
  - op field always in bits 31-26
  - addr of two registers to be read are always specified by the rs and rt fields (bits 25-21 and 20-16)
  - addr of register to be written is in one of two places: in rt (bits 20-16) for lw; in rd (bits 15-11) for R-type instructions
  - base register for lw and sw always in rs (bits 25-21)
  - offset for beq, lw, and sw always in bits 15-0

(Almost) Complete Single Cycle Datapath
[Figure: complete datapath with PC, Instruction Memory (Instr[31-0]), Register File fed by Instr[25-21], Instr[20-16] and Instr[15-11], Sign Extend of Instr[15-0], ALU with ALU control driven by Instr[5-0] and ALUOp, Data Memory, and branch adder with Shift left 2; control signals RegDst, RegWrite, ALUSrc, PCSrc, MemRead, MemWrite, MemtoReg]

ALU Control
• ALU's operation based on instruction type and function code
  ALU control input | Function
  000               | and
  001               | or
  010               | add
  110               | subtract
  111               | set on less than
• Why is the code for subtract 110 and not 011?
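The five ALU functions in the table can be sketched in Python as follows. The function names are hypothetical, and the overflow output shown in the datapath figures is deliberately left out of this sketch.

```python
MASK32 = 0xFFFFFFFF


def to_signed(value: int) -> int:
    """Interpret a 32-bit pattern as a signed two's complement number."""
    return value - (1 << 32) if value & 0x80000000 else value


def alu(control: int, a: int, b: int):
    """Return (result, zero) for the 3-bit ALU control input."""
    if control == 0b000:
        result = a & b
    elif control == 0b001:
        result = a | b
    elif control == 0b010:
        result = (a + b) & MASK32
    elif control == 0b110:
        result = (a - b) & MASK32
    elif control == 0b111:
        result = 1 if to_signed(a) < to_signed(b) else 0
    else:
        raise ValueError(f"unused ALU control input: {control:03b}")
    return result, result == 0


print(alu(0b110, 5, 5))           # (0, True): equal operands set the zero output
print(alu(0b111, 0xFFFFFFFF, 1))  # (1, False): -1 is less than 1
```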

ALU Control, cont'd
• Controlling the ALU makes use of multiple levels of decoding
  - main control unit generates the ALUOp bits
  - ALU control unit generates ALU control inputs
  Instr op | funct  | ALUOp | desired action | ALU control input
  lw       | xxxxxx | 00    | add            | 010
  sw       | xxxxxx | 00    | add            | 010
  beq      | xxxxxx | 01    | subtract       | 110
  add      | 100000 | 10    | add            | 010
  subt     | 100010 | 10    | subtract       | 110
  and      | 100100 | 10    | and            | 000
  or       | 100101 | 10    | or             | 001
  slt      | 101010 | 10    | slt            | 111
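The second level of decoding in the table above can be written as a small lookup. This is a sketch only; the names `alu_control` and `FUNCT_TO_CONTROL` are made up for illustration.

```python
# funct field -> ALU control input, used only when ALUOp is 10 (R-type).
FUNCT_TO_CONTROL = {
    0b100000: 0b010,  # add
    0b100010: 0b110,  # subtract
    0b100100: 0b000,  # and
    0b100101: 0b001,  # or
    0b101010: 0b111,  # set on less than
}


def alu_control(alu_op: int, funct: int) -> int:
    """Map the 2-bit ALUOp and 6-bit funct field to the 3-bit ALU control input."""
    if alu_op == 0b00:   # lw, sw: add to form the address
        return 0b010
    if alu_op == 0b01:   # beq: subtract to compare
        return 0b110
    if alu_op == 0b10:   # R-type: the funct field decides
        return FUNCT_TO_CONTROL[funct]
    raise ValueError("ALUOp encoding 11 is not used")


print(f"{alu_control(0b10, 0b101010):03b}")  # 111
```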

ALU Control Truth Table
  F5 F4 F3 F2 F1 F0 | ALUOp1 ALUOp0 | Op2 Op1 Op0
  X  X  X  X  X  X  |   0      0    |  0   1   0
  X  X  X  X  X  X  |   X      1    |  1   1   0
  X  X  0  0  0  0  |   1      X    |  0   1   0
  X  X  0  0  1  0  |   1      X    |  1   1   0
  X  X  0  1  0  0  |   1      X    |  0   0   0
  X  X  0  1  0  1  |   1      X    |  0   0   1
  X  X  1  0  1  0  |   1      X    |  1   1   1
• Can make use of more don't cares
  - since ALUOp does not use the encoding 11
  - since F5 and F4 are always 10
• Logic comes from the K-maps …

ALU Control Combinational Logic
• From the truth table can design the ALU Control logic

(Almost) Complete Datapath with Control Unit
[Figure: the complete datapath with a Control Unit driven by Instr[31-26]; control signals RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite, with PCSrc derived from Branch and the ALU zero output; ALU control driven by Instr[5-0] and ALUOp]

Main Control Unit
  Instr            RegDst ALUSrc MemReg RegWr MemRd MemWr Branch ALUOp1 ALUOp0
  R-type (000000)
  lw     (100011)
  sw     (101011)
  beq    (000100)
• Completely determined by the instruction opcode field
  - Note that a multiplexor whose control input is 0 has a definite action, even if it is not used in performing the operation

R-type Instruction Data/Control Flow
[Figure: the complete datapath with Control Unit, showing the data and control flow for an R-type instruction]

Store Word Instruction Data/Control Flow
[Figure: the complete datapath with Control Unit, showing the data and control flow for a store word instruction]

Load Word Instruction Data/Control Flow
[Figure: the complete datapath with Control Unit, showing the data and control flow for a load word instruction]

Branch Instruction Data/Control Flow
[Figure: the complete datapath with Control Unit, showing the data and control flow for a branch instruction]

Main Control Unit
  Instr            RegDst ALUSrc MemReg RegWr MemRd MemWr Branch ALUOp1 ALUOp0
  R-type (000000)    1      0      0      1     X     0     0      1      0
  lw     (100011)    0      1      1      1     1     0     0      0      0
  sw     (101011)    X      1      X      0     X     1     0      0      0
  beq    (000100)    X      0      X      0     X     0     1      0      1
• Completely determined by the instruction opcode field
  - Note that a multiplexor whose control input is 0 has a definite action, even if it is not used in performing the operation
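Since the control signals depend only on the opcode, the main control unit can be sketched as a lookup table. The settings below are an assumption based on the usual textbook single-cycle control table, with None standing for a don't care (X); the names `CONTROL_TABLE` and `main_control` are hypothetical.

```python
X = None  # don't care

SIGNALS = ("RegDst", "ALUSrc", "MemtoReg", "RegWrite", "MemRead",
           "MemWrite", "Branch", "ALUOp1", "ALUOp0")

# opcode -> control settings (assumed values, see the note above)
CONTROL_TABLE = {
    0b000000: (1, 0, 0, 1, X, 0, 0, 1, 0),  # R-type
    0b100011: (0, 1, 1, 1, 1, 0, 0, 0, 0),  # lw
    0b101011: (X, 1, X, 0, X, 1, 0, 0, 0),  # sw
    0b000100: (X, 0, X, 0, X, 0, 1, 0, 1),  # beq
}


def main_control(instr: int) -> dict:
    """Look up the control signals using only the opcode, Instr[31-26]."""
    opcode = (instr >> 26) & 0x3F
    return dict(zip(SIGNALS, CONTROL_TABLE[opcode]))


lw_signals = main_control(0x8C890004)  # opcode 100011
sw_signals = main_control(0xAC890004)  # opcode 101011
print(lw_signals["MemRead"], sw_signals["MemWrite"], sw_signals["RegWrite"])  # 1 1 0
```

Note that the write enables (RegWrite, MemWrite) are never left as don't cares, since an unintended write would change state.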

Control Unit Logic
• From the truth table can design the Main Control logic
  [Figure: logic with inputs Instr[31] through Instr[26], decoded into R-type, lw, sw and beq, and outputs RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, ALUOp0]

Review: Handling Jump Operations
• Jump operations have to
  J-Type:  op [31-26] | jump target address [25-0]
  - replace the lower 28 bits of the PC with the lower 26 bits of the fetched instruction shifted left by 2 bits
  [Figure: PC, Instruction Memory (Read Address, Instruction), Add (PC + 4), and Shift left 2 turning the 26-bit field into the 28-bit Jump address]

Adding the Jump Operation
[Figure: the complete datapath with Control Unit, extended for jumps: Instr[25-0] passes through Shift left 2 (26 to 28 bits) and is combined with PC+4[31-28] to form the 32-bit jump address; new control signal Jump]

Single Cycle Implementation Cycle Time
• Unfortunately, though simple, the single cycle approach is not used because it is inefficient
• Clock cycle must have the same length for every instruction
• What is the longest path (slowest instruction)?

Instruction Critical Paths
• Calculate cycle time assuming negligible delays (for muxes, control unit, sign extend, PC access, shift left 2, wires) except:
  - Instruction and Data Memory (2 ns)
  - ALU and adders (2 ns)
  - Register File access (reads or writes) (1 ns)
  Instr.  | I Mem | Reg Rd | ALU Op | D Mem | Reg Wr | Total
  R-type  |   2   |   1    |   2    |       |   1    |   6
  load    |   2   |   1    |   2    |   2   |   1    |   8
  store   |   2   |   1    |   2    |   2   |        |   7
  beq     |   2   |   1    |   2    |       |        |   5
  jump    |   2   |        |        |       |        |   2
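The totals follow from summing the delays of the units each instruction class uses; the single cycle clock must then cover the largest total. A short Python sketch of that arithmetic (the dictionary names are made up for illustration):

```python
# Unit delays in ns, as given on the slide.
DELAY_NS = {"I Mem": 2, "Reg Rd": 1, "ALU Op": 2, "D Mem": 2, "Reg Wr": 1}

# Units on the critical path of each instruction class.
PATHS = {
    "R-type": ["I Mem", "Reg Rd", "ALU Op", "Reg Wr"],
    "load":   ["I Mem", "Reg Rd", "ALU Op", "D Mem", "Reg Wr"],
    "store":  ["I Mem", "Reg Rd", "ALU Op", "D Mem"],
    "beq":    ["I Mem", "Reg Rd", "ALU Op"],
    "jump":   ["I Mem"],
}

totals = {instr: sum(DELAY_NS[unit] for unit in units)
          for instr, units in PATHS.items()}
cycle_time_ns = max(totals.values())  # set by the slowest instruction

print(totals)         # {'R-type': 6, 'load': 8, 'store': 7, 'beq': 5, 'jump': 2}
print(cycle_time_ns)  # 8
```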

Where We are Headed
• Problems with single cycle datapath design
  - uses clock cycle inefficiently
  - and what if we had a more complicated instruction like floating point multiply?
  - wasteful of area
• Another approach
  - use a "smaller" cycle time
  - have different instructions take different numbers of cycles
  - a "multicycle" datapath:
  [Figure: multicycle datapath with PC, a single Memory (Address, Read Data (Instr. or Data), Write Data), IR, MDR, Register File (Read Addr 1, Read Addr 2, Write Addr, Write Data, Read Data 1, Read Data 2), A, B, ALU and ALUout]