CPU Design for Multiple Clock Cycles per Instruction (CPI > 1)

Be able to explain how an instruction is executed and the concept of datapaths and control.

Where we are headed
• Single-cycle problems:
  – the clock cycle is determined by the instruction that takes the longest time
  – what if we had a more complicated instruction, like floating point?
  – wasteful of area
    • each functional unit is used only once per cycle (per instruction)
    • it may be possible to reduce the number of functional units
• One solution:
  – use a "smaller" cycle time
  – have different instructions take different numbers of cycles
  – a "multicycle" datapath

Multiple cycle data path
• Divide the instruction into components
• Choose the clock cycle to accommodate the longest component
  – We need some balance between the components to reduce wastage
  – The same approach carries over to pipelined architectures
• Every instruction is divided into
  – Instruction fetch
  – Instruction decode and operand fetch
  – Execution, memory address computation, or branch completion
  – Memory access or R-type instruction completion
  – Memory read completion
• Work towards fewer components
  – Must we have both a PC adder and an ALU? What if they are used in different cycles?
  – Single memory for instructions and data
• May need buffers (registers) for temporary storage, e.g., the Instruction Register
• Controls will be more complicated
  – determined by cycle and opcode

Review: finite state machines
• Finite state machines:
  – a set of states
  – a next-state function (determined by current state and the input)
  – an output function (determined by current state and possibly input)
  – We'll use a Moore machine (output based only on current state)
© 1998 Morgan Kaufmann Publishers

Multicycle Approach
• Break up the instructions into steps; each step takes a cycle
  – balance the amount of work to be done
  – all cycles are of equal length; how do we fix the length of a cycle?
  – restrict each cycle to use only one major functional unit
• At the end of a cycle
  – store values for use in later cycles (easiest thing to do)
  – introduce additional "internal" registers

Multiple Cycle Approach
• Single memory for instructions and data
• One ALU, instead of an ALU + 2 adders
  – Eliminating adders saves cost; what are we adding?
• Add temporary storage registers at the functional unit outputs
  – save values for use in a later cycle
  – MDR and IR are added; both values are needed in the same clock cycle
  – A, B, and ALUOut are added
• Registers and multiplexers are less costly than adders
• Overall savings!

Jump and Branch
• The PC update requires special consideration. There are 3 possibilities:
  – PC = PC + 4 during IF, stored directly into the PC.
  – ALUOut contains the branch target once it is computed.
  – Pseudodirect address for the jump instruction.
[Figure: multicycle datapath with PC, memory, instruction register, memory data register, register file, A, B, sign extend, shift left 2, ALU, and ALUOut]

Control Signals

Controller and control lines

Five Execution Steps
• Instruction Fetch
• Instruction Decode and Register Fetch
• Execution, Memory Address Computation, or Branch Completion
• Memory Access or R-type Instruction Completion
• Write-back Step
Instructions take from 3 to 5 cycles!

Step 1: Instruction Fetch
• Use the PC to get the instruction and put it in the Instruction Register.
• Increment the PC by 4 and put the result back in the PC.
• Can be described succinctly using RTL ("Register-Transfer Language"):
    IR <= Memory[PC];
    PC <= PC + 4;
• Can we figure out the values of the control signals?
  – Assert MemRead and IRWrite; IorD = 0 (PC is the source address)
  – PC + 4: ALUSrcA = 0; ALUSrcB = 01; ALUOp = 00
  – Storing in PC: PCSource = 00; assert PCWrite
• What is the advantage of updating the PC now?
  – The value is computed in parallel with the instruction access.

Step 2: Instruction Decode and Register Fetch
• Read registers rs and rt in case we need them
• Compute the branch address in case the instruction is a branch (potentially wasted effort vs. saving cycles)
• RTL:
    A <= Reg[IR[25:21]];
    B <= Reg[IR[20:16]];
    ALUOut <= PC + (sign-extend(IR[15:0]) << 2);
• We aren't setting any control lines based on the instruction type
  – the instruction is being decoded in the control logic
  – we do not know the type before ID & RF starts
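The Step 2 register transfers can be sketched in Python as a rough illustration. The function names, the register-file list, and the example PC value are assumptions made for this sketch; only the bit fields and the RTL come from the slide. The `pc` argument is taken to be the already-incremented PC from Step 1.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def decode_step(ir, pc, regs):
    """Step 2 RTL: read rs and rt, and speculatively compute the branch target."""
    rs = (ir >> 21) & 0x1F              # IR[25:21]
    rt = (ir >> 16) & 0x1F              # IR[20:16]
    a = regs[rs]                        # A <= Reg[IR[25:21]]
    b = regs[rt]                        # B <= Reg[IR[20:16]]
    # ALUOut <= PC + (sign-extend(IR[15:0]) << 2), kept to 32 bits
    alu_out = (pc + (sign_extend16(ir) << 2)) & 0xFFFFFFFF
    return a, b, alu_out


# Hypothetical example: beq $t2, $t3 with a word offset of -2.
# beq has opcode 4; $t2 is register 10 and $t3 is register 11.
regs = [0] * 32
regs[10], regs[11] = 7, 7
ir = (4 << 26) | (10 << 21) | (11 << 16) | (-2 & 0xFFFF)
a, b, alu_out = decode_step(ir, pc=0x00400010, regs=regs)
print(a, b, hex(alu_out))
```

The target is computed before the control knows whether the instruction is a branch, which is exactly the "potentially wasted effort" the slide mentions.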

Step 3: Execution, Compute Effective Address or Branch Completion
• The ALU performs one of three functions, based on instruction type:
  – Memory reference: ALUOut <= A + sign-extend(IR[15:0]);
  – R-type: ALUOut <= A op B;
  – Branch: if (A == B) PC <= ALUOut;
• A jump also completes in this step, without using the ALU result:
  – Jump: PC <= PC[31:28] || (IR[25:0] << 2);
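As a minimal sketch of Step 3, the four cases can be written as one Python function. The `kind` strings, the function name, and the example values are hypothetical; the R-type case is shown only for `add`, standing in for "A op B".

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def execute_step(kind, a, b, ir, pc, alu_out):
    """Return the new (pc, alu_out) after Step 3 for one instruction class."""
    mask = 0xFFFFFFFF
    if kind == "mem":        # ALUOut <= A + sign-extend(IR[15:0])
        alu_out = (a + sign_extend16(ir)) & mask
    elif kind == "add":      # R-type with op = add: ALUOut <= A op B
        alu_out = (a + b) & mask
    elif kind == "beq":      # if (A == B) PC <= ALUOut (target from Step 2)
        if a == b:
            pc = alu_out
    elif kind == "jump":     # PC <= PC[31:28] || (IR[25:0] << 2)
        pc = (pc & 0xF0000000) | ((ir & 0x03FFFFFF) << 2)
    else:
        raise ValueError(f"unknown instruction class: {kind}")
    return pc, alu_out


# Hypothetical jump: opcode 2 with a 26-bit word address of 0x0100004.
jump_ir = (2 << 26) | 0x0100004
new_pc, _ = execute_step("jump", 0, 0, jump_ir, pc=0x00400004, alu_out=0)
print(hex(new_pc))
```

Note that the jump keeps the top four bits of the (already incremented) PC and replaces the rest, which is what "pseudodirect" addressing means here.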

Step 4: R-type or memory-access
• Loads and stores access memory:
    MDR <= Memory[ALUOut];
  or
    Memory[ALUOut] <= B;
• R-type instructions finish:
    Reg[IR[15:11]] <= ALUOut;
• The write takes place at the end of the cycle, on the falling edge.

Step 5: Write-back step
• Load:
    Reg[IR[20:16]] <= MDR;
• What about all the other instructions?

Summary:
• Instruction fetch (all instructions)
  – IR <= Memory[PC]
  – PC <= PC + 4
• Instruction decode/register fetch (all instructions)
  – A <= Reg[IR[25:21]]
  – B <= Reg[IR[20:16]]
  – ALUOut <= PC + (sign-extend(IR[15:0]) << 2)
• Execution, address computation, branch/jump completion
  – R-type: ALUOut <= A op B
  – Memory reference: ALUOut <= A + sign-extend(IR[15:0])
  – Branch: if (A == B) then PC <= ALUOut
  – Jump: PC <= PC[31:28] || (IR[25:0] << 2)
• Memory access or R-type completion
  – R-type: Reg[IR[15:11]] <= ALUOut
  – Load: MDR <= Memory[ALUOut]
  – Store: Memory[ALUOut] <= B
• Memory read completion
  – Load: Reg[IR[20:16]] <= MDR

Simple Questions
• How many cycles will it take to execute this code?
    lw  $t2, 0($t3)
    lw  $t3, 4($t3)
    beq $t2, $t3, Label   # assume not taken
    add $t5, $t2, $t3
    sw  $t5, 8($t3)
  Label: ...
• What is going on during the 8th cycle of execution?
• In what cycle does the actual addition of $t2 and $t3 take place?
• Cycle counts: lw: 5 cycles; beq: 3 cycles; add: 4 cycles; sw: 4 cycles
• lw steps: IF, ID, memory address computation, memory access, WB
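One way to check the answers is to lay the five instructions out cycle by cycle. The sketch below assumes instructions run back to back with no overlap (as in the multicycle design) and that the branch is not taken; the step labels and the `schedule` helper are names made up for this illustration.

```python
# Steps per instruction class, following the five execution steps.
STEPS = {
    "lw":  ["IF", "ID", "MemAddr", "MemAccess", "WB"],
    "sw":  ["IF", "ID", "MemAddr", "MemAccess"],
    "add": ["IF", "ID", "Execute", "RtypeCompletion"],
    "beq": ["IF", "ID", "BranchCompletion"],
}

program = ["lw", "lw", "beq", "add", "sw"]   # beq assumed not taken


def schedule(program):
    """Return (cycle, instruction index, mnemonic, step) tuples, cycles 1-based."""
    timeline = []
    cycle = 0
    for index, mnemonic in enumerate(program):
        for step in STEPS[mnemonic]:
            cycle += 1
            timeline.append((cycle, index, mnemonic, step))
    return timeline


timeline = schedule(program)
total_cycles = len(timeline)
eighth = timeline[7]
add_cycle = next(c for c, _, m, s in timeline if m == "add" and s == "Execute")

print(total_cycles)   # 5 + 5 + 3 + 4 + 4 = 21
print(eighth)         # (8, 1, 'lw', 'MemAddr')
print(add_cycle)      # 16
```

So the code takes 21 cycles, the 8th cycle is the address computation of the second lw (ALUOut <= $t3 + 4), and the addition of $t2 and $t3 happens in cycle 16.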

Implementing the Control
• The value of the control signals depends on:
  – what instruction is being executed
  – which step is being performed (clock cycle related)
• Use the information we've accumulated to specify a finite state machine
  – specify the finite state machine graphically, or
  – use microprogramming
• The implementation can be derived from the specification

High Level View of Control System (Fig. 5.31, Third Edition)

IF & ID cycles (Fig. 5.32, Third Edition)

FSM for Memory-Reference Instruction (Fig. 5.33, Third Edition)

R-type Instruction (Fig. 5.34, Third Edition)

Branch & Jump (Fig. 5.36, Third Edition)

Graphical Specification of FSM
• State 0, Instruction fetch (Start): MemRead; ALUSrcA = 0; IorD = 0; IRWrite; ALUSrcB = 01; ALUOp = 00; PCWrite; PCSource = 00. Next: state 1.
• State 1, Instruction decode/register fetch: ALUSrcA = 0; ALUSrcB = 11; ALUOp = 00. Next: state 2 if (Op = 'LW') or (Op = 'SW'); state 6 if (Op = R-type); state 8 if (Op = 'BEQ'); state 9 if (Op = 'J').
• State 2, Memory address computation: ALUSrcA = 1; ALUSrcB = 10; ALUOp = 00. Next: state 3 if (Op = 'LW'); state 5 if (Op = 'SW').
• State 3, Memory access (load): MemRead; IorD = 1. Next: state 4.
• State 4, Write-back step: RegDst = 0; RegWrite; MemtoReg = 1. Next: state 0.
• State 5, Memory access (store): MemWrite; IorD = 1. Next: state 0.
• State 6, Execution: ALUSrcA = 1; ALUSrcB = 00; ALUOp = 10. Next: state 7.
• State 7, R-type completion: RegDst = 1; RegWrite; MemtoReg = 0. Next: state 0.
• State 8, Branch completion: ALUSrcA = 1; ALUSrcB = 00; ALUOp = 01; PCWriteCond; PCSource = 01. Next: state 0.
• State 9, Jump completion: PCWrite; PCSource = 10. Next: state 0.
• How many state bits will we need?
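The state diagram can be captured as two lookup tables, which is a rough sketch of how a Moore-style controller works: outputs depend only on the state, and the next state depends on the state and the opcode. The state numbering follows a reading of the diagram; the dictionary layout, the opcode-class strings, and the `trace` helper are assumptions for this illustration.

```python
# Control signals asserted in each state (unlisted signals are deasserted or don't-care).
OUTPUTS = {
    0: {"MemRead": 1, "ALUSrcA": 0, "IorD": 0, "IRWrite": 1,
        "ALUSrcB": 0b01, "ALUOp": 0b00, "PCWrite": 1, "PCSource": 0b00},
    1: {"ALUSrcA": 0, "ALUSrcB": 0b11, "ALUOp": 0b00},
    2: {"ALUSrcA": 1, "ALUSrcB": 0b10, "ALUOp": 0b00},
    3: {"MemRead": 1, "IorD": 1},
    4: {"RegDst": 0, "RegWrite": 1, "MemtoReg": 1},
    5: {"MemWrite": 1, "IorD": 1},
    6: {"ALUSrcA": 1, "ALUSrcB": 0b00, "ALUOp": 0b10},
    7: {"RegDst": 1, "RegWrite": 1, "MemtoReg": 0},
    8: {"ALUSrcA": 1, "ALUSrcB": 0b00, "ALUOp": 0b01,
        "PCWriteCond": 1, "PCSource": 0b01},
    9: {"PCWrite": 1, "PCSource": 0b10},
}

DISPATCH_FROM_DECODE = {"lw": 2, "sw": 2, "rtype": 6, "beq": 8, "j": 9}
DISPATCH_FROM_ADDRESS = {"lw": 3, "sw": 5}


def next_state(state, op):
    """Next-state function: depends on the current state and the opcode class."""
    if state == 0:
        return 1
    if state == 1:
        return DISPATCH_FROM_DECODE[op]
    if state == 2:
        return DISPATCH_FROM_ADDRESS[op]
    if state == 3:
        return 4
    if state == 6:
        return 7
    return 0            # states 4, 5, 7, 8, 9 return to instruction fetch


def trace(op):
    """List the states visited by one instruction, starting at fetch."""
    states = [0]
    while True:
        nxt = next_state(states[-1], op)
        if nxt == 0:
            return states
        states.append(nxt)


state_bits = (len(OUTPUTS) - 1).bit_length()   # bits needed to encode states 0..9

print(trace("lw"), trace("beq"), state_bits)
```

With ten states numbered 0 to 9, four state bits are enough, and the length of each trace matches the 3 to 5 cycles per instruction.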

Finite State Machine for Control
• Implementation:

PLA Implementation
• If I picked a horizontal or vertical line, could you explain it?

ROM Implementation
• ROM = "Read Only Memory"
  – values of memory locations are fixed ahead of time
• A ROM can be used to implement a truth table
  – if the address is m bits, we can address 2^m entries in the ROM
  – our outputs are the bits of data that the address points to
[Figure: example truth table stored in a ROM, with m address bits and n output bits]
• m sets the "height" (2^m entries), and n is the "width"

ROM Implementation
• How many inputs are there?
  6 bits for opcode + 4 bits for state = 10 address lines (i.e., 2^10 = 1024 different addresses)
• How many outputs are there?
  16 datapath-control outputs + 4 state bits = 20 outputs
• ROM is 2^10 x 20 = 20 Kbits (and a rather unusual size)
• Rather wasteful, since for lots of the entries, the outputs are the same
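The ROM size follows directly from the input and output counts. A quick check in Python (the variable names are chosen for this sketch):

```python
opcode_bits = 6
state_bits = 4
datapath_outputs = 16

address_lines = opcode_bits + state_bits        # 10 address lines
entries = 2 ** address_lines                    # 1024 addresses
outputs = datapath_outputs + state_bits         # 20 bits per entry
rom_bits = entries * outputs                    # total ROM size in bits

print(address_lines, entries, outputs, rom_bits, rom_bits // 1024)
```

This gives 20480 bits, i.e., 20 Kbits with 1 K = 1024.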

ROM vs PLA
• Break up the table into two parts
  – 4 state bits tell you the 16 outputs: 2^4 x 16 bits of ROM
  – 10 bits tell you the 4 next-state bits: 2^10 x 4 bits of ROM
  – Total: 4.3 Kbits of ROM
• PLA is much smaller
  – can share product terms
  – only need entries that produce an active output
  – can take into account don't cares
• Size is (#inputs x #product-terms) + (#outputs x #product-terms)
  For this example: (10 x 17) + (20 x 17) = 510 PLA cells
• A PLA cell is usually about the size of a ROM cell (slightly bigger)
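The same arithmetic can be worked through for the split ROM and for the PLA. This is only a check of the slide's formulas, with variable names invented for the sketch; the count of 17 product terms is taken from the slide.

```python
state_bits = 4
opcode_bits = 6
datapath_outputs = 16

# Split ROM: outputs depend only on the state (Moore machine),
# next state depends on state and opcode.
output_rom_bits = 2 ** state_bits * datapath_outputs               # 2^4 x 16
next_state_rom_bits = 2 ** (state_bits + opcode_bits) * state_bits  # 2^10 x 4
split_rom_bits = output_rom_bits + next_state_rom_bits

# PLA: (#inputs x #product-terms) + (#outputs x #product-terms)
inputs = opcode_bits + state_bits
outputs = datapath_outputs + state_bits
product_terms = 17
pla_cells = inputs * product_terms + outputs * product_terms

print(output_rom_bits, next_state_rom_bits, split_rom_bits, pla_cells)
```

The split ROM comes to 256 + 4096 = 4352 bits (about 4.3 Kbits), and (10 x 17) + (20 x 17) evaluates to 510 PLA cells, still far smaller than either ROM organization.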

Processor design review
• Single clock cycle instructions (CPI = 1)
  – Clock cycle is longer
  – Designing for the worst case
• Multiple clock cycles per instruction
  – Divided each instruction into components
  – Tried for balance among the functions
  – Better performance?
    • Some instructions take a little longer.
    • Some instructions take fewer cycles than others.
    • On average, we have improved performance.
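Whether the multicycle design wins "on average" depends on the instruction mix and on how much shorter the clock becomes. The sketch below uses a purely hypothetical instruction mix (not from the slides) together with the per-class cycle counts given earlier, to show how the average CPI would be computed.

```python
# Cycles per instruction class, from the five execution steps.
CYCLES = {"lw": 5, "sw": 4, "rtype": 4, "beq": 3, "j": 3}

# HYPOTHETICAL instruction mix in percent, chosen only for illustration.
MIX_PERCENT = {"lw": 25, "sw": 10, "rtype": 45, "beq": 15, "j": 5}

assert sum(MIX_PERCENT.values()) == 100

# Average CPI is the mix-weighted sum of the per-class cycle counts.
average_cpi = sum(MIX_PERCENT[k] * CYCLES[k] for k in CYCLES) / 100

print(average_cpi)
```

For this assumed mix the average CPI is about 4.05, so the multicycle clock period would need to be shorter than roughly 1/4.05 of the single-cycle clock period for the multicycle design to come out ahead.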

Design process
• Determine datapath requirements
  – Pick an instruction (sometimes one instruction can represent an entire class, e.g., R-type)
  – Determine the datapath required for execution of the instruction
  – Determine the controls required for the instruction
  – Find the datapath required for all the instructions
  – Find the shared path requirements
    • One approach: develop an input-output matrix
    • Find destinations that have more than one input
    • Insert multiplexers where necessary
• Determine control requirements
  – CPI = 1
    • Controls are determined by the opcode
  – CPI > 1
    • Controls are determined by the opcode and system state
    • Finite state machine
  – Hardwired (PLA) or software (microprogrammed) implementation

Interrupts, exceptions
• Interrupt vs. exception
  – Interrupt: external (I/O device request)
  – Exception: internal (OS calls, arithmetic overflow)
• Interrupts are external hardware events
  – Raise an interrupt (hardware)
  – Wait to complete the current instruction
  – Determine the source of the interrupt
  – Save the return address
  – Transfer to the relevant Interrupt Service Routine
    • Save the registers that may change
    • Execute the program (can this be interrupted?)
    • Restore the registers
    • Return to execution of the program

Exceptions
• Exceptions are software driven
  – Overflow in an arithmetic instruction
  – Memory access yields an undefined instruction
• MIPS exception handling
  – Registers
    • Store the address of the problem instruction in the EPC (Exception PC)
    • Store the cause of the exception in the Cause register
      – Cause low-order bit = 0 (undefined instruction)
      – Cause low-order bit = 1 (arithmetic overflow)
  – Additional control signals
    • IntCause, EPCWrite, and CauseWrite
  – Transfer control to a specified location in the OS
• OS terminates the program or continues processing
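A minimal sketch of the register updates on an exception follows. Two things here are assumptions rather than slide content: the handler address constant (the slides only say "specified location in OS"), and the use of PC - 4 for the EPC, which assumes the PC was already incremented during instruction fetch so that PC - 4 is the address of the problem instruction. The dictionary-based machine state is also just an illustration.

```python
UNDEFINED_INSTRUCTION = 0     # Cause low-order bit = 0
ARITHMETIC_OVERFLOW = 1       # Cause low-order bit = 1

# ASSUMED OS entry point for this sketch; not given in the slides.
HANDLER_ADDRESS = 0x80000180


def raise_exception(state, cause):
    """Record the exception and transfer control to the OS handler."""
    # EPCWrite: save the address of the problem instruction.
    # PC already holds PC + 4 from instruction fetch (assumption noted above).
    state["EPC"] = (state["PC"] - 4) & 0xFFFFFFFF
    # CauseWrite, with IntCause selecting the value written.
    state["Cause"] = cause
    # PCWrite: continue at the handler.
    state["PC"] = HANDLER_ADDRESS
    return state


# Hypothetical case: an add at 0x00400010 overflows.
state = {"PC": 0x00400014, "EPC": 0, "Cause": 0}
raise_exception(state, ARITHMETIC_OVERFLOW)
print(hex(state["EPC"]), state["Cause"], hex(state["PC"]))
```

The OS routine at the handler address can then read the Cause register and the EPC to decide whether to terminate the program or continue processing.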

Multicycle Datapath with Exception Handling (Fig. 5.39, Third Edition)

Exception handling (Fig. 5.40, Third Edition)