Multi Cycle CPU
• Previously: built a Single Cycle CPU.
• Today:
  – Exceptions
  – Multi-cycle CPU
  – Microprogramming
CS 141 - L4 - Tarun Soni, Summer ‘03

Mid-term Review Discussion Session
• Peterson Hall 104
• Tue: 2-3 pm
• Tue: 3-4 pm

The Story so far:
• Instruction Set Architectures
• Performance issues
• 2's complement, Addition, Subtraction
• Multiplication, Division, Floating Point numbers
• ALUs
• Single Cycle CPU
• Exceptions
• Multicycle CPU: datapath; control
• Microprogramming

Alternative Architectures
• Design alternative:
  – provide more powerful operations
  – goal is to reduce the number of instructions executed
  – danger is a slower cycle time and/or a higher CPI
• Sometimes referred to as “RISC vs. CISC”
  – virtually all new instruction sets since 1982 have been RISC
  – VAX: minimize code size, make assembly language easy; instructions from 1 to 54 bytes long!
• We’ll look at Pentium, UltraSPARC and JVM

Pentium

Java VM
• Most instructions are one byte
  – ADD
  – POP
• One-byte argument
  – ILOAD IND8
  – BIPUSH CON8
• Two-byte argument
  – SIPUSH CON16
  – IF_ICMPEQ OFFSET16
• Type = int, signed int etc.

UltraSPARC

Exceptions or Oops!

Exceptions
• There are two sources of non-sequential control flow in a processor
  – explicit branch and jump instructions
  – exceptions
• Branches are synchronous and deterministic
• Exceptions are typically asynchronous and non-deterministic
• Guess which is more difficult to handle?
• Examples:
  – arithmetic overflow
  – divide by zero
  – I/O device signals completion to CPU
  – user program invokes the OS
  – memory parity error
  – illegal instruction
  – timer signal
• Terminology:
  – exceptions as any unexpected change in control flow
  – interrupts as any externally-caused exception
  – the literature is not consistent

Exceptions
• The machine we’ve been designing in class can generate two types of exceptions:
  – arithmetic overflow
  – illegal instruction
• On an exception, we need to
  – save the PC (invisible to user code)
  – record the nature of the exception/interrupt
  – transfer control to the OS
[Figure: user program -> exception -> system exception handler -> return from exception]

Exceptions
• The MIPS architecture defines an instruction as having no effect if the instruction causes an exception.
• When we get to virtual memory we will see that certain classes of exceptions must prevent the instruction from changing the machine state.
• This aspect of handling exceptions becomes complex and potentially limits performance => why it is hard
• Interrupts
  – caused by external events
  – asynchronous to program execution
  – may be handled between instructions
  – simply suspend and resume the user program
• Traps/Exceptions
  – caused by internal events
    • exceptional conditions (overflow)
    • errors (parity)
    • faults (non-resident page)
  – synchronous to program execution
  – condition must be remedied by the handler
  – instruction may be retried or simulated and the program continued, or the program may be aborted

Exceptions: Addressing the Exception Handler
• Traditional approach: Interrupt Vector
  – PC <- MEM[IV_base + cause || 00]
  – 370, 68000, VAX, 80x86, ...
• RISC Handler Table
  – PC <- IT_base + cause || 0000
  – saves state and jumps
  – SPARC, PA, M88K, ...
• MIPS approach: fixed entry
  – PC <- EXC_addr
  – actually a very small table
    • RESET entry
    • TLB
    • other
[Figure: iv_base and cause select a handler address that points to handler code (vector), or select handler entry code directly (table)]
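The three addressing schemes above can be sketched in a few lines. This is only an illustration: the base addresses and the dictionary standing in for memory are hypothetical values chosen for the example, not addresses used by any of the machines named on the slide.

```python
# Hypothetical addresses, chosen only for illustration.
IV_BASE = 0x0100    # base of an interrupt vector (a table of handler addresses)
IT_BASE = 0x1000    # base of a RISC-style handler table (a table of handler code)
EXC_ADDR = 0x2000   # single fixed entry point

def vectored_pc(memory, cause):
    """Traditional: PC <- MEM[IV_base + cause || 00]; the table holds addresses."""
    return memory[IV_BASE + (cause << 2)]

def handler_table_pc(cause):
    """RISC handler table: PC <- IT_base + cause || 0000; 16 bytes of code per cause."""
    return IT_BASE + (cause << 4)

def fixed_entry_pc(cause):
    """Fixed entry: PC <- EXC_addr; the handler reads the cause itself and dispatches."""
    return EXC_ADDR
```

Appending two zero bits multiplies the cause by 4 (one word per table entry); appending four zero bits multiplies it by 16, leaving room for a few instructions of entry code per cause.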

Exceptions: Saving State
• Push it onto the stack
  – VAX, 68k, 80x86
• Save it in special registers
  – MIPS: EPC, BadVaddr, Status, Cause
• Shadow registers
  – M88k
  – save state in a shadow of the internal pipeline registers
• Significant component of “interrupt response time”

Exceptions
• For our MIPS-subset architecture, we will add two registers:
  – EPC: a 32-bit register to hold the user’s PC
  – Cause: a register to record the cause of the exception
    • we’ll assume undefined inst = 0, overflow = 1
• We will also add three control signals:
  – EPCWrite (will need to be able to subtract 4 from PC)
  – CauseWrite
  – IntCause
• We will extend the PCSource multiplexor to be able to latch the interrupt handler address into the PC.
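A minimal sketch of what these additions do when an exception is taken, assuming the PC has already been incremented by 4 when the exception is detected (which is why EPCWrite needs the subtract-4 path). The handler address and the dictionary holding the registers are hypothetical stand-ins.

```python
UNDEFINED_INSTRUCTION = 0   # Cause encoding assumed on the slide
OVERFLOW = 1
HANDLER_ADDR = 0x2000       # hypothetical interrupt handler address

def take_exception(state, cause):
    """Return the register state after an exception.

    state is a dict with keys 'PC', 'EPC' and 'Cause'; PC is assumed to
    already point at the instruction after the faulting one.
    """
    new_state = dict(state)
    new_state["EPC"] = state["PC"] - 4   # EPCWrite, through the subtract-4 path
    new_state["Cause"] = cause           # CauseWrite, value chosen by IntCause
    new_state["PC"] = HANDLER_ADDR       # PCSource selects the handler address
    return new_state
```

For example, an overflow detected with PC = 0x104 leaves EPC = 0x100, Cause = 1 and PC at the handler address.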

Exceptions
[Figure: single-cycle datapath extended for exceptions. Added elements: EPC register (written from PC through a sub-4 unit under EPCWrite), Cause register (CauseWrite, IntCause), and an Interrupt Handler Address input to the PC mux selected by PCSource (PCWrite).]

Exceptions: Creating a “Control line”
• Regs:
  – EPC
  – Cause
• Control signals:
  – EPCWrite (subtract 4 from PC)
  – CauseWrite
  – IntCause
[Figure: instruction memory supplies the Op, Fun, Rs, Rt, Rd, Imm16 and Adr fields; Control drives nPC_sel, RegWr, RegDst, ExtOp, ALUSrc, ALUctr, MemWr, MemtoReg and the exception signals into the datapath; Equal returns from the datapath.]

Exceptions: Creating the data path
• Regs:
  – EPC
  – Cause
• Control signals:
  – EPCWrite (subtract 4 from PC)
  – CauseWrite
  – IntCause
• Extend the PCSource MUX to include the jump address from the int-table
[Figure: abstract datapath with PC and next-address logic, ideal instruction memory, 32 32-bit registers (Rw, Ra, Rb), ALU, and ideal data memory]

CPU Multi Cycle CPU

CPU: The Big Picture: Where are We Now?
• The Five Classic Components of a Computer: Processor (Control, Datapath), Memory, Input, Output
• Datapath Design, then Control Design

Recap: Processor Design is a Process
• Bottom-up
  – assemble components in target technology to establish critical timing
• Top-down
  – specify component behavior from high-level requirements
• Iterative refinement
  – establish partial solution, expand and improve
[Figure: Instruction Set Architecture => processor (datapath, control) => Reg. File, Mux, ALU, Reg, Mem, Decoder, Sequencer => Cells, Gates]

CPU: The single cycle
[Figure: instruction cycle loop: Instruction Fetch -> Instruction Decode -> Operand Fetch -> Execute -> Result Store -> Next Instruction]
• Execute an entire instruction
• Design hardware for each of these steps!!!

CPU: Clocking
[Figure: clock waveform showing setup and hold times around the clock edge; data is “don’t care” elsewhere]
• All storage elements are clocked by the same clock edge

CPU: Main Control
• PLA implementation of the main control
[Figure: opcode bits op<5>..op<0> are decoded into R-type, ori, lw, sw, beq and jump terms, which drive RegWrite, ALUSrc, RegDst, MemtoReg, MemWrite, Branch, Jump, ExtOp, ALUop<2>, ALUop<1>, ALUop<0>]

CPU: Main Control
[Figure: Instruction -> Decode (OPcode) and Conditions -> Control Logic / Store (PLA, ROM) -> microinstruction -> Control Points -> Datapath]
• In our single-cycle processor, each instruction is realized by exactly one control command or “microinstruction”
  – in general, the controller is a finite state machine
  – a microinstruction can also control sequencing (see later)

CPU: Abstract View of a single cycle processor
[Figure: Main Control (input op) and ALU control (input fun) drive the stages Next PC, Instruction Fetch, Register Fetch, ALU, Mem Access (Data Mem) and Result Store through nPC_sel, ExtOp, ALUSrc, ALUctr, MemRd, MemWr, RegDst, RegWrt; Equal feeds back to control]
• looks like a FSM with PC as state

CPU: Why is a CPI=1 processor bad?
[Figure: timing paths per instruction class (Arithmetic & Logical, Load, Store, Branch) through PC, Inst Memory, Reg File, mux, ALU, Data Mem and setup; Load is the critical path]
• Long cycle time
• All instructions take as much time as the slowest
• Real memory is not as nice as our idealized memory
  – cannot always get the job done in one (short) cycle

CPU: Why is a CPI=1 processor bad?
• Goal: balance the amount of work done each cycle.

          I cache  Decode, R-Read  ALU  PC update  D cache  R-Write  Total
R-type    1        1               .9   -          -        .8       3.7
Load      1        1               .9   -          1        .8       4.7
Store     1        1               .9   -          1        -        3.9
beq       1        1               .9   .1         -        -        3.0

• Load needs 5 cycles
• Store and R-type need 4
• beq needs 3
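The table can be turned into a small calculation. The sketch below assumes that a multicycle clock period equals the longest single step (1 time unit here); that choice is an assumption made for the example, not something the slide states. Latencies are stored in tenths of a unit so the sums stay exact.

```python
# Step latencies from the table, in tenths of a time unit.
# Column order: I cache, Decode/R-Read, ALU, PC update, D cache, R-Write.
STEP_LATENCY = {
    "R-type": [10, 10, 9, 0, 0, 8],
    "Load":   [10, 10, 9, 0, 10, 8],
    "Store":  [10, 10, 9, 0, 10, 0],
    "beq":    [10, 10, 9, 1, 0, 0],
}
# Cycle counts as stated on the slide.
MULTICYCLE_CYCLES = {"R-type": 4, "Load": 5, "Store": 4, "beq": 3}

def total_latency(kind):
    """Sum of all step latencies for one instruction class (the Total column)."""
    return sum(STEP_LATENCY[kind])

def single_cycle_clock():
    """A CPI=1 design must make its clock as long as the slowest instruction."""
    return max(total_latency(kind) for kind in STEP_LATENCY)

def multicycle_clock():
    """Assumed here: the multicycle clock is as long as the longest single step."""
    return max(max(steps) for steps in STEP_LATENCY.values())

def multicycle_time(kind):
    """Time for one instruction in the multicycle design, in tenths of a unit."""
    return MULTICYCLE_CYCLES[kind] * multicycle_clock()
```

Under these assumptions every instruction costs 4.7 units in the single-cycle design, while the multicycle design spends 3.0 units on a beq and 5.0 on a load, so the benefit depends on how rare loads are in the instruction mix.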

CPU: Reducing Cycle Time
• Cut the combinational dependency graph and insert a register / latch
• Do the same work in two fast cycles, rather than one slow one
[Figure: storage element -> acyclic combinational logic -> storage element becomes storage element -> acyclic combinational logic (A) -> storage element -> acyclic combinational logic (B) -> storage element]

CPU: Building blocks
• Adder: inputs A, B (32 bits each) and CarryIn; outputs Sum (32 bits) and Carry
• MUX: inputs A, B (32 bits each) and Select; output Y (32 bits)
• ALU: inputs A, B (32 bits each) and OP; output Result (32 bits)

CPU: Building blocks
• Building a 64-bit adder from 2 x 32-bit adders
[Figure: one adder takes A[31..0], B[31..0] and CarryIn and produces Sum[31..0] and Carry; a second adder takes A[63..32], B[63..32] and a CarryIn and produces Sum[63..32] and Carry]
• Speed of addition?
  – For one ADD?
  – For consecutive ADDs?
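One way to read the construction is as two 32-bit adders chained through the carry, which is the assumption made in this sketch; the slide's figure may intend a different arrangement. The function names are made up for the example.

```python
MASK32 = (1 << 32) - 1

def add32(a, b, carry_in=0):
    """32-bit adder building block: returns (sum, carry_out)."""
    total = (a & MASK32) + (b & MASK32) + carry_in
    return total & MASK32, total >> 32

def add64(a, b, carry_in=0):
    """64-bit adder from two 32-bit adders chained through the carry."""
    low, carry = add32(a, b, carry_in)                 # bits 31..0
    high, carry_out = add32(a >> 32, b >> 32, carry)   # bits 63..32
    return (high << 32) | low, carry_out
```

Because the upper adder has to wait for the lower adder's carry, a single ADD takes roughly two adder delays in this arrangement, which is what the "speed of addition" questions are probing.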

Multicycle CPU: Individual operations
• Next address logic
  – PC <= branch ? PC + offset : PC + 4
• Instruction Fetch
  – InstructionReg <= Mem[PC]
• Register Access
  – A <= R[rs]
• ALU operation
  – R <= A + B
[Figure: stages Next PC, Instruction Fetch, Operand Fetch, Exec, Mem Access (Data Mem), Result Store (Reg. File), with control signals nPC_sel, ExtOp, ALUSrc, ALUctr, MemRd, MemWr, RegDst, RegWr]

Multicycle CPU: Partitioning Time
• Five execution steps (some instructions use fewer)
  – IF: Instruction Fetch
  – ID: Instruction Decode (& register fetch & add PC+immed)
  – EX: Execute
  – Mem: Memory access
  – WB: Write-Back into registers

          I cache  Decode, R-Read  ALU  PC update  D cache  R-Write  Total
R-type    1        1               .9   -          -        .8       3.7
Load      1        1               .9   -          1        .8       4.7
Store     1        1               .9   -          1        -        3.9
beq       1        1               .9   .1         -        -        3.0

Multicycle CPU: Steps
• Note: Reuse of ALU
[Figure: datapath partitioned into IF, ID, Ex, Mem, WB steps]

Multicycle CPU: Partitioning the CPI=1 Datapath
• Add registers between smallest steps
[Figure: stages Next PC, Instruction Fetch, Operand Fetch, Exec, Mem Access (Data Mem), Result Store (Reg. File), with control signals nPC_sel, ExtOp, ALUSrc, ALUctr, MemRd, MemWr, RegDst, RegWr]

Multicycle CPU
[Figure: timing comparison. Single cycle implementation: Load in cycle 1, Store in cycle 2 with part of the cycle wasted. Multiple cycle implementation over cycles 1-10: Load (Ifetch, Reg, Exec, Mem, Wr), Store (Ifetch, Reg, Exec, Mem), R-type (Ifetch, ...)]

Multicycle CPU: Instruction Types

Multicycle CPU: Sharing Hardware
[Figure: register-transfer flow per instruction]
• All instructions: IR <- Mem[PC]; A <- R[rs]; B <- R[rt]
• R-type: S <- A + B; R[rd] <- S; PC <- PC+4
• ORi: S <- A or ZX; R[rt] <- S; PC <- PC+4
• LW: S <- A + SX; M <- Mem[S]; R[rt] <- M; PC <- PC+4
• SW: S <- A + SX; Mem[S] <- B; PC <- PC+4
• BEQ: PC <- PC+SX
• Example: memory is used twice, at different times
  – Ave mem access per inst = 1 + Flw + Fsw ~ 1.3
  – if CPI is 4.8, imem utilization = 1/4.8, dmem = 0.3/4.8
• We could reduce HW without hurting performance
  – extra control

Multicycle CPU: Sharing Functional Units
• Reuse:
  – ALU
  – Memory
• Need more:
  – Muxing
  – Control
[Figure: datapath with a single ALU and a common data and instruction memory]

Multicycle CPU: Adding State Elements
• Since we reuse logic (e.g. ALU), we need to store results between states
• Need extra registers when:
  – a signal is computed in one clock cycle and used in another, AND
  – the inputs to the combinational circuit can change before the signal is written into a state element.

Multicycle CPU: Adding State Elements
[Figure: datapath with the added state elements, partitioned into IF, ID, Ex, Mem, WB]

Multicycle CPU: The Full Multi-Cycle Implementation

Cycle 1: Instruction Fetch
• Datapath: IR = Memory[PC], PC = PC + 4 (may be revised later)
• Control: IorD=0, MemRead=1, MemWr=0, IRwrite=1, ALUsrcA=0, etc

Cycle 2: Instruction Decode
• A = Register[IR[25-21]]
• B = Register[IR[20-16]]
• ALUout = PC + (sign-extend(IR[15-0]) << 2)

Cycle 2: Instruction Decode & Reg. Fetch
• A = Reg[IR[25-21]]
• B = Reg[IR[20-16]]
• ALUout = PC + (sign-extend(IR[15-0]) << 2)
• We compute the target address even though we don’t know if it will be used
  – Operation may not be a branch
  – Even if it is, the branch may not be taken
• Why? Everything up to this point must be instruction-independent, because we haven’t decoded the instruction.
• The ALU, the (incremented) PC, and the immed field are now all available

Cycle 3 for beq: EXecute
[Figure: datapath with A, B and ALUout highlighted]
• In cycle 1, PC was incremented by 4
• In cycle 2, ALUout was set to the branch target
• This cycle, we conditionally reset PC: if (A==B) PC=ALUout

Cycle 3: R-type Instruction
• Cycle 3 (EXecute): ALUout = A op B
• Cycle 4 (WriteBack): Reg[IR[15-11]] = ALUout
• R-type instruction is finished

Cycle 3: R-type Instruction
[Figure: datapath with A and B highlighted]
• Cycle 3: ALUout = A op B
• Cycle 4: Reg[IR[15-11]] = ALUout

Cycle 4: R-type Instruction
[Figure: datapath with A, B and ALUout highlighted]
• Cycle 3: ALUout = A op B
• Cycle 4: Reg[IR[15-11]] = ALUout

Multicycle CPU: The datapath
[Figure: multicycle datapath with stages Next PC, Instruction Fetch, Operand Fetch (Reg File), Exec (Ext, ALU), Mem Access (Data Mem), Result Store (Reg. File); control signals nPC_sel, ExtOp, ALUSrc, ALUctr, MemRd, MemWr, MemToReg, RegDst, RegWr; Equal condition]
• Extra registers:
  – IR
  – A, B
  – R (sometimes called S or ALUout)
  – M

Multicycle CPU: The datapath
• Logical Register Transfers
    inst   Logical Register Transfers
    ADDU   R[rd] <- R[rs] + R[rt]; PC <- PC + 4
• Physical Register Transfers
    inst   Physical Register Transfers
           IR <- MEM[pc]
    ADDU   A <- R[rs]; B <- R[rt]
           S <- A + B
           R[rd] <- S; PC <- PC + 4
[Figure: datapath with registers PC, IR, A, B, S, M between Inst. Mem, Reg File, Exec, Mem Access (Data Mem) and Reg. File]

Multicycle CPU: The datapath
• Logical Register Transfers
    inst   Logical Register Transfers
    ORI    R[rt] <- R[rs] OR zx(Im16); PC <- PC + 4
• Physical Register Transfers
    inst   Physical Register Transfers
           IR <- MEM[pc]
    ORI    A <- R[rs]; B <- R[rt]
           S <- A or ZeroExt(Im16)
           R[rt] <- S; PC <- PC + 4
[Figure: datapath with registers PC, IR, A, B, S, M between Inst. Mem, Reg File, Exec, Mem Access (Data Mem) and Reg. File]

Multicycle CPU: The datapath
• Logical Register Transfers
    inst   Logical Register Transfers
    LW     R[rt] <- MEM(R[rs] + sx(Im16)); PC <- PC + 4
• Physical Register Transfers
    inst   Physical Register Transfers
           IR <- MEM[pc]
    LW     A <- R[rs]; B <- R[rt]
           S <- A + SignEx(Im16)
           M <- MEM[S]
           R[rt] <- M; PC <- PC + 4
[Figure: datapath with registers PC, IR, A, B, S, M between Inst. Mem, Reg File, Exec, Mem Access (Data Mem) and Reg. File]

Multicycle CPU: The datapath
• Logical Register Transfers
    inst   Logical Register Transfers
    SW     MEM(R[rs] + sx(Im16)) <- R[rt]; PC <- PC + 4
• Physical Register Transfers
    inst   Physical Register Transfers
           IR <- MEM[pc]
    SW     A <- R[rs]; B <- R[rt]
           S <- A + SignEx(Im16)
           MEM[S] <- B; PC <- PC + 4
[Figure: datapath with registers PC, IR, A, B, S, M between Inst. Mem, Reg File, Exec, Mem Access (Data Mem) and Reg. File]

Multicycle CPU: The datapath
• Logical Register Transfers
    inst   Logical Register Transfers
    BEQ    if R[rs] == R[rt] then PC <= PC + sx(Im16) || 00
           else PC <= PC + 4
• Physical Register Transfers
    inst        Physical Register Transfers
                IR <- MEM[pc]
    BEQ & Eq    PC <- PC + sx(Im16) || 00
    BEQ & ~Eq   PC <- PC + 4
[Figure: datapath with registers PC, IR, A, B, S, M between Inst. Mem, Reg File, Exec, Mem Access (Data Mem) and Reg. File]

Multicycle CPU: Summary

Multicycle CPU: Mid-term alert !!
• How many cycles will it take to execute this code?
    lw  $t2, 0($t3)
    lw  $t3, 4($t3)
    beq $t2, $t3, Label   #assume not
    add $t5, $t2, $t3
    sw  $t5, 8($t3)
    Label: ...
• What is going on during the 8th cycle of execution?
• In what cycle does the actual addition of $t2 and $t3 take place?
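A small sketch for working through questions of this kind. It assumes the per-instruction step counts given earlier in the lecture (load 5, store and R-type 4, beq 3), that the branch falls through, and that instructions never overlap; the step names and the `timeline` helper are made up for the example.

```python
# Steps each instruction class goes through, one clock cycle per step.
STEPS = {
    "lw":  ["IF", "ID", "EX", "MEM", "WB"],
    "sw":  ["IF", "ID", "EX", "MEM"],
    "add": ["IF", "ID", "EX", "WB"],
    "beq": ["IF", "ID", "EX"],
}

# The instruction sequence from the slide, with the branch assumed not taken.
PROGRAM = ["lw", "lw", "beq", "add", "sw"]

def timeline(program):
    """Return one (cycle, instruction_index, opcode, step) tuple per clock cycle."""
    events = []
    cycle = 1
    for index, op in enumerate(program):
        for step in STEPS[op]:
            events.append((cycle, index, op, step))
            cycle += 1
    return events
```

Under these assumptions `len(timeline(PROGRAM))` is the total cycle count, `timeline(PROGRAM)[7]` describes the 8th cycle, and filtering the events for the `add` instruction's `EX` step gives the cycle in which the addition happens.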

Multicycle CPU: Sharing Hardware
• “Princeton” organization
[Figure: A-Bus, B-Bus and W-Bus connecting next PC, PC, IR, ZX, SX, Reg File, A, B, S and Mem]
• Single memory for instruction and data access
  – memory utilization -> 1.3/4.8
• In this case our state diagram does not change
  – several additional control signals
  – must ensure each bus is only driven by one source on each cycle

Multicycle CPU: Control Line Timing
[Figure: control line timing over cycles 1-10: Load (Ifetch, Reg, Exec, Mem, Wr), Store (Ifetch, Reg, Exec, Mem), R-type (Ifetch, ...), with the IRWrite waveform]

Review: Finite State Machines
• Finite state machines:
  – a set of states and
  – next state function (determined by current state and the input)
  – output function (determined by current state and possibly input)
  – We’ll use a Moore machine (output based only on current state)

Multicycle CPU: Control
    if (State == InstructionFetch) {
        IRWrite = 1;   // All other signals are 0
        State = OperandFetch;
    }
    if (State == Execute && InstructionOpCode == BEQ) {
        // Do your thing...
    }
• ControlOutput = f(State, OpCode)
• NextState = f(State, OpCode)

Multicycle CPU: Our basic FSM
[Figure: Instruction fetch -> Decode and Register Fetch -> one of: Memory instructions, R-type instructions, Branch instructions, Jump instruction]

Multicycle CPU: Control
[Figure: control FSM]
• “instruction fetch”: IR <= MEM[PC]
• “decode / operand fetch”: A <= R[rs]; B <= R[rt]
• R-type: Execute S <= A fun B; Write-back R[rd] <= S, PC <= PC + 4
• ORi: Execute S <= A or ZX; Write-back R[rt] <= S, PC <= PC + 4
• LW: Execute S <= A + SX; Memory M <= MEM[S]; Write-back R[rt] <= M, PC <= PC + 4
• SW: Execute S <= A + SX; Memory MEM[S] <= B, PC <= PC + 4
• BEQ & Equal: PC <= PC + SX || 00
• BEQ & ~Equal: PC <= PC + 4

Multicycle CPU: Control
• Number of states?
• Number of bits for state?

Multicycle CPU: Control: Assigning States
[Figure: control FSM with a 4-bit code on each state]
• 0000 “instruction fetch”: IR <= MEM[PC]
• 0001 “decode”: A <= R[rs]; B <= R[rt]
• R-type: 0100 (Execute) S <= A fun B; 0101 (Write-back) R[rd] <= S, PC <= PC + 4
• ORi: 0110 (Execute) S <= A or ZX; 0111 (Write-back) R[rt] <= S, PC <= PC + 4
• LW: 1000 (Execute) S <= A + SX; 1001 (Memory) M <= MEM[S]; 1010 (Write-back) R[rt] <= M, PC <= PC + 4
• SW: 1011 (Execute) S <= A + SX; 1100 (Memory) MEM[S] <= B, PC <= PC + 4
• BEQ: 0010 and 0011, one for PC <= PC + SX || 00 (Equal) and one for PC <= PC + 4 (~Equal)
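The state assignment can be written down as a next-state function. This is a sketch under one stated assumption: the flattened slide does not settle which of 0010 and 0011 is the taken-branch state, so the code picks 0010 for taken. Opcode names are plain strings chosen for the example.

```python
FETCH, DECODE = 0b0000, 0b0001

# First execute state for each opcode, chosen in the decode state.
DISPATCH = {"R-type": 0b0100, "ORi": 0b0110, "LW": 0b1000, "SW": 0b1011}

# Assumption: 0010 is the taken-branch state, 0011 the not-taken state.
BEQ_TAKEN, BEQ_NOT_TAKEN = 0b0010, 0b0011

# States whose successor does not depend on the opcode.
SEQUENTIAL = {
    FETCH: DECODE,
    0b0100: 0b0101,   # R-type execute -> write-back
    0b0110: 0b0111,   # ORi execute -> write-back
    0b1000: 0b1001,   # LW execute -> memory
    0b1001: 0b1010,   # LW memory -> write-back
    0b1011: 0b1100,   # SW execute -> memory
}

NUM_STATES = 13  # 0000 through 1100

def next_state(state, opcode=None, equal=False):
    """Next-state function: depends on the opcode and Equal only in decode."""
    if state == DECODE:
        if opcode == "BEQ":
            return BEQ_TAKEN if equal else BEQ_NOT_TAKEN
        return DISPATCH[opcode]
    return SEQUENTIAL.get(state, FETCH)   # final states return to fetch

def states_visited(opcode, equal=False):
    """States one instruction passes through, starting at fetch."""
    path = [FETCH]
    while True:
        nxt = next_state(path[-1], opcode, equal)
        if nxt == FETCH:
            return path
        path.append(nxt)

def state_bits(num_states=NUM_STATES):
    """Minimum number of bits needed to encode num_states distinct states."""
    return (num_states - 1).bit_length()
```

The length of `states_visited(op)` is the number of cycles that instruction takes, and `state_bits()` gives the width of the state register for the thirteen states listed.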

Multicycle CPU: Detailed control spec.
[Table: one row per (State, Op field, Eq) combination. Column groups: State; Op field; Eq; Next; IR en; PC en, sel; Ops A, B; Exec Ex, Sr, ALU, S; Mem R, W, M; Write-Back M-R, Wr, Dst. Rows: 0000 (fetch); 0001 (decode, with separate rows for BEQ, R-type, orI, LW, SW); 0010, 0011; R: 0100, 0101; ORi: 0110, 0111; LW: 1000, 1001, 1010; SW: 1011, 1100.]

Multicycle CPU: Implementation styles
• ROM = "Read Only Memory"
  – values of memory locations are fixed ahead of time
• A ROM can be used to implement a truth table
  – if the address is m bits, we can address 2^m entries in the ROM
  – our outputs are the bits of data that the address points to
  – 2^m is the "height", and n is the "width"
[Figure: example truth table with an m-bit address and n-bit data words]
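To illustrate a ROM used as a truth table, the sketch below fills a ROM from an arbitrary function and then "reads" it by indexing. The full-adder example is an illustration chosen here, not the table from the slide; `make_rom` is a hypothetical helper name.

```python
def make_rom(truth_function, address_bits):
    """Build a ROM with 2**address_bits entries, fixed ahead of time."""
    return [truth_function(address) for address in range(1 << address_bits)]

def full_adder_word(address):
    """Address bits are (a, b, carry_in); the 2-bit output is (carry_out, sum).

    The number of 1 bits in the address equals a + b + carry_in, and that
    count written as a 2-bit number is exactly (carry_out, sum).
    """
    return bin(address).count("1")

# m = 3 address bits give a "height" of 8 entries; each word is n = 2 bits "wide".
FULL_ADDER_ROM = make_rom(full_adder_word, 3)
```

Reading `FULL_ADDER_ROM[0b011]` looks up the inputs a=0, b=1, carry_in=1 and returns `0b10`: carry out 1, sum 0. No logic is evaluated at lookup time; the answer was stored when the ROM was built.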

Multicycle CPU: Implementation styles
• How many inputs are there?
  – 6 bits for opcode, 4 bits for state = 10 address lines (i.e., 2^10 = 1024 different addresses)
• How many outputs are there?
  – 16 datapath-control outputs, 4 state bits = 20 outputs
• ROM is 2^10 x 20 = 20K bits (and a rather unusual size)
• Rather wasteful, since for lots of the entries the outputs are the same, i.e., the opcode is often ignored

Multicycle CPU: Implementation styles
• Break up the table into two parts
  – 4 state bits tell you the 16 outputs: 2^4 x 16 bits of ROM
  – 10 bits tell you the 4 next state bits: 2^10 x 4 bits of ROM
  – Total: 4.3K bits of ROM
• PLA is much smaller
  – can share product terms
  – only need entries that produce an active output
  – can take into account don't cares
• Size is (#inputs x #product-terms) + (#outputs x #product-terms)
  – For this example = (10 x 17) + (20 x 17) = 510 PLA cells
• PLA cells are usually about the size of a ROM cell (slightly bigger)
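The size comparison is plain arithmetic and can be checked directly. The helper names below are made up for the example; the inputs (10 address bits, 16 control outputs, 4 state bits, 17 product terms) are the figures from the slides.

```python
def rom_bits(address_bits, output_bits):
    """A ROM stores one output word for every possible address."""
    return (1 << address_bits) * output_bits

def pla_cells(inputs, outputs, product_terms):
    """(#inputs x #product-terms) + (#outputs x #product-terms)."""
    return inputs * product_terms + outputs * product_terms

single_rom = rom_bits(10, 20)                  # opcode + state -> all 20 outputs
split_rom = rom_bits(4, 16) + rom_bits(10, 4)  # state -> 16 controls, plus next-state ROM
pla = pla_cells(10, 20, 17)                    # (10 + 20) * 17
```

Splitting the table works because the 16 datapath controls depend only on the 4 state bits, so only the small next-state ROM needs all 10 address lines.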

Multicycle CPU: Implementation styles
• PLA Implementation
  – IRWrite = (!S0 && !S1 && !S2 && !S3)
  – NS0 = (S[3..0] == 0000) || (S[3..0] == 0110) || (S[3..0] == 1001 && OP[5..0] == 000010) || (…)

Microprogramming
• Control is the hard part of processor design
  – Datapath is fairly regular and well-organized
  – Memory is highly regular
  – Control is irregular and global
• Consider the FSM in the case of 100s of instructions!!!
• FSMs get unmanageable quickly as they grow.
  – hard to specify
  – hard to manipulate
  – error prone
  – hard to visualize
• The state diagrams that define the controller for an instruction set processor are highly structured
• Use this structure to construct a simple “microsequencer”
• Control reduces to programming this very simple device
  – microprogramming

Microprogramming
[Figure: Control Logic (PLA or ROM) sends outputs to the multicycle datapath; a state register, a +1 adder and address select logic, fed by the opcode, choose the next state]
• Types of “branching”
  – Set state to 0
  – Dispatch (state 1)
  – Use incremented state number
• Common case: State += 1;
• Microprogramming: a particular strategy for implementing the control unit of a processor by "programming" at the level of register transfer operations
• Microarchitecture: logical structure and functional capabilities of the hardware as seen by the microprogrammer
• Historical note: the IBM 360 Series was the first to distinguish between architecture & organization
  – same instruction set across a wide range of implementations, each with different cost/performance
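The three kinds of "branching" amount to a three-way choice of the next microinstruction address. A minimal sketch, assuming the sequencing field takes the string values "fetch", "dispatch" and "seq" and that the dispatch table is a dictionary from opcode to address; both representations are choices made for this example.

```python
def next_microaddress(current, addr_ctl, opcode=None, dispatch_rom=None):
    """Address select logic of a simple microsequencer."""
    if addr_ctl == "fetch":       # set state to 0: start the next instruction
        return 0
    if addr_ctl == "dispatch":    # jump to the routine for this opcode
        return dispatch_rom[opcode]
    if addr_ctl == "seq":         # common case: use the incremented state number
        return current + 1
    raise ValueError(f"unknown address control: {addr_ctl}")
```

With a hypothetical dispatch table such as `{"R-type": 4, "LW": 8}`, a load would step 0 -> 1 by `seq`, 1 -> 8 by `dispatch`, continue with `seq` through its routine, and end with `fetch` to return to address 0.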

Macro-Micro programming?
[Figure: main memory holds the user program plus data (ADD, SUB, AND, ..., DATA); this can change! Inside the CPU, next to the execution unit, the control memory holds microsequences; one of the instructions (e.g. AND) is mapped into one of the microsequences]
• e.g., Fetch; Calc Operand Addr; Fetch Operand(s); Calculate; Save Answer(s)

Horizontal Microinstructions
• “Horizontal” microcode
  – control field for each control point in the machine
[Figure: microinstruction with fields µseq, µaddr, A-mux, B-mux, bus enables, register enables; Instruction -> Decode (OPcode) and Conditions -> Control Logic / Store (PLA, ROM) -> microinstruction -> Control Points -> Datapath]
• Depending on bus organization, many potential control combinations are simply wrong, i.e., they imply transfers that can never happen at the same time.
• Idea: encode fields to save ROM space
• Example: mem_to_reg and ALU_to_reg should never happen simultaneously => encode in a single bit which is decoded, rather than two separate bits

Vertical Microinstructions
• “Vertical” microcode
  – encoded control fields with local decode
[Figure: microinstruction with src, dst and other control fields, each passed through a local decoder (DEC); next-state inputs selected through a MUX]
• Some of these may have nothing to do with registers!

Design Microinstruction Sets
1) Start with a list of control signals
2) Group signals together that make sense (vs. random): called “fields”
3) Place fields in some logical order (e.g., ALU operation & ALU operands first and microinstruction sequencing last)
4) Create a symbolic legend for the microinstruction format, showing the name of field values and how they set the control signals
   – Use computers to design computers
5) To minimize the width, encode operations that will never be used at the same time

Microinstructions
• Start with list of control signals, grouped into fields

Single Bit Control
Signal name   Effect when deasserted         Effect when asserted
ALUSelA       1st ALU operand = PC           1st ALU operand = Reg[rs]
RegWrite      None                           Reg. is written
MemtoReg      Reg. write data input = ALU    Reg. write data input = memory
RegDst        Reg. dest. no. = rt            Reg. dest. no. = rd
TargetWrite   None                           Target reg. = ALU
MemRead       None                           Memory at address is read
MemWrite      None                           Memory at address is written
IorD          Memory address = PC            Memory address = ALU
IRWrite       None                           IR = Memory
PCWrite       None                           PC = PCSource
PCWriteCond   None                           IF ALUzero then PC = PCSource

Multiple Bit Control
Signal name   Value   Effect
ALUOp         00      ALU adds
              01      ALU subtracts
              10      ALU does function code
              11      ALU does logical OR
ALUSelB       000     2nd ALU input = Reg[rt]
              001     2nd ALU input = 4
              010     2nd ALU input = sign extended IR[15-0]
              011     2nd ALU input = sign extended, shift left 2 IR[15-0]
              100     2nd ALU input = zero extended IR[15-0]
PCSource      00      PC = ALU
              01      PC = Target
              10      PC = PC+4[29-26] : IR[25-0] << 2

Microinstructions

Field Name        Width (wide)  Width (narrow)  Control Signals Set
ALU Control       4             2               ALUOp
SRC1              2             1               ALUSelA
SRC2              5             3               ALUSelB
ALU Destination   6             4               RegWrite, MemtoReg, RegDst, TargetWr.
Memory            4             3               MemRead, MemWrite, IorD
Memory Register   1             1               IRWrite
PCWrite Control   5             4               PCWrite, PCWriteCond, PCSource
Sequencing        3             2               AddrCtl
Total width       30            20              bits

Microinstructions: MIPS field name and values

Field Name        Values for Field   Function of Field with Specific Value
ALU               Add                ALU adds
                  Subt.              ALU subtracts
                  Func code          ALU does function code
                  Or                 ALU does logical OR
SRC1              PC                 1st ALU input = PC
                  rs                 1st ALU input = Reg[rs]
SRC2              4                  2nd ALU input = 4
                  Extend             2nd ALU input = sign ext. IR[15-0]
                  Extend0            2nd ALU input = zero ext. IR[15-0]
                  Extshft            2nd ALU input = sign ex., sl IR[15-0]
                  rt                 2nd ALU input = Reg[rt]
ALU destination   Target             Target = ALUout
                  rd                 Reg[rd] = ALUout
Memory            Read PC            Read memory using PC
                  Read ALU           Read memory using ALU output
                  Write ALU          Write memory using ALU output
Memory register   IR                 IR = Mem
                  Write rt           Reg[rt] = Mem
                  Read rt            Mem = Reg[rt]
PC write          ALU                PC = ALU output
                  Target-cond.       IF ALU Zero then PC = Target
                  jump addr.         PC = PCSource
Sequencing        Seq                Go to sequential µinstruction
                  Fetch              Go to the first microinstruction
                  Dispatch           Dispatch using ROM.

Microinstructions: The datapath again

Field Name        Values for Field   Function of Field with Specific Value
SRC1              PC                 1st ALU input = PC
                  rs                 1st ALU input = Reg[rs]
SRC2              4                  2nd ALU input = 4
                  Extend             2nd ALU input = sign ext. IR[15-0]
                  Extend0            2nd ALU input = zero ext. IR[15-0]
                  Extshft            2nd ALU input = sign ex., sl IR[15-0]
                  rt                 2nd ALU input = Reg[rt]
ALU destination   Target             Target = ALUout
                  rd                 Reg[rd] = ALUout

Microinstructions: Pros-Cons
• Specification advantages:
  – Easy to design and write
  – Design architecture and microcode in parallel
• Implementation (off-chip ROM) advantages:
  – Easy to change since values are in memory
  – Can emulate other architectures and instruction sets
  – Can make use of internal registers
• Implementation disadvantages, SLOWER now that:
  – Control is implemented on the same chip as the processor
  – ROM is no longer faster than RAM
  – No need to go back and make changes

CPU Control: Methodology

Microprogramming: the last word?
• Summary: microprogramming was one inspiration for RISC
  – If simple instructions could execute at a very high clock rate…
  – If you could even write compilers to produce microinstructions…
  – If most programs use simple instructions and addressing modes…
  – If microcode is kept in RAM instead of ROM so as to fix bugs…
  – If the same memory used for control memory could be used instead as a cache for “macroinstructions”…
• Then why not skip instruction interpretation by a microprogram and simply compile directly into the lowest language of the machine? (microprogramming is overkill when the ISA matches the datapath 1-1)

Exceptions: Supporting exceptions in our FSM
[Figure: FSM states 0 and 1]
• Start -> state 0 (Instruction Fetch): MemRead, ALUSelA = 0, IorD = 0, IRWrite, ALUSelB = 01, ALUOp = 00, PCWrite, PCSource = 00
• State 1 (Instruction Decode / Register Fetch): ALUSelA = 0, ALUSelB = 11, ALUOp = 00, TargetWrite
• From state 1:
  – Opcode = LW or SW -> Memory Inst FSM
  – Opcode = R-type -> R-type Inst FSM
  – Opcode = BEQ -> Branch Inst FSM
  – Opcode = JMP -> Jump Inst FSM
  – Opcode = anything else -> to state 10

Exceptions: Supporting exceptions in our FSM
[Figure: R-type instruction states, entered from state 1]
• Execute: ALUSelA = 1, ALUSelB = 00, ALUOp = 10
• Completion: ALUSelA = 1, ALUSelB = 10, ALUOp = 10, RegDst = 1, RegWrite, MemtoReg = 0
  – normally: to state 0
  – on overflow: to state 11

Exceptions: Supporting exceptions in our FSM
[Figure: exception states and the datapath additions (Cause register with CauseWrite and IntCause; EPC register with EPCWrite and sub 4; Interrupt Handler Address into the PCSource mux, PCWrite)]
• State 10 (illegal instruction): IntCause=0, CauseWrite
• State 11 (arithmetic overflow): IntCause=1, CauseWrite
• State 12: ALUSelA = 0, ALUSelB = 01, ALUOp = 01, EPCWrite
• State 13: PCWrite, PCSource=11, then to state 0 (fetch)
• Steps:
  – Write Cause into register
  – Write PC into EPC
  – Load Exception Handler address to PC

Exceptions
[Figure: control FSM extended with exception states; the additional condition (overflow) comes from the datapath]
• Fetch: IR <= MEM[PC]; PC <= PC + 4
• Decode: A <= R[rs]; B <= R[rt]
• R-type: S <= A fun B; R[rd] <= S
  – on overflow: EPC <= PC - 4; PC <= exp_addr; cause <= 12 (Ovf)
• ORi: S <= A op ZX; R[rt] <= S
• LW: S <= A + SX; M <= MEM[S]; R[rt] <= M
• SW: S <= A + SX; MEM[S] <= B
• BEQ: S <= A - B (0010); on Equal, PC <= PC + SX || 00 (0011)
• other (undefined instruction): EPC <= PC - 4; PC <= exp_addr; cause <= 10 (RI)

Summary
• multicycle CPUs make things faster.
• control is harder.
• microprogramming can simplify (conceptually) CPU control generation
• a microprogram is a small program inside the CPU that executes the individual instructions of the “real” program.
• exception-handling is difficult in the CPU, because the interactions between the executing instructions and the interrupt are complex and unpredictable.

Mid-Term Review
• Technology trends: Design for the future
• Instruction Set Architectures:
  – types of ISAs: addressing modes, length of instruction etc.
  – MIPS instruction format: basic classes of instructions
  – Registers and load-store architectures
  – Data types, operands, memory organization/addressing
  – Basic MIPS instructions: arithmetic, logical, data transfer, branching, jumps
  – Issues in jump/branching distance and immediate addressing modes
  – Stacks and frames
  – E.g., swap(), leaf_procedure(), nested_procedure()
• Performance:
  – Relative (e.g., Boeing), Metrics, Benchmarking, SPEC marks
  – Execution Time = Instruction Count x Cycles/Instruction x Seconds/Cycle
  – Amdahl’s law: Execution Time After Improvement = Execution Time Unaffected + (Execution Time Affected / Amount of Improvement)
• Arithmetic:
  – 2's complement
  – Basic digital logic, 1-bit adder, full adder, 32-bit adder/subtractor
  – ALU: adder + mux + special conditions
  – Delays in combinational logic, clocking
  – Ripple carry vs. carry look-ahead adders
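The two performance formulas can be checked with a couple of one-line functions. The sketch reads the Amdahl's law expression as giving the execution time after the improvement; the function names and the numbers in the usage note are made up for illustration.

```python
def cpu_time(instruction_count, cycles_per_instruction, seconds_per_cycle):
    """Instruction Count x Cycles/Instruction x Seconds/Cycle."""
    return instruction_count * cycles_per_instruction * seconds_per_cycle

def time_after_improvement(time_unaffected, time_affected, improvement):
    """Amdahl's law: only the affected part of the run time shrinks."""
    return time_unaffected + time_affected / improvement
```

For example, if 80 of a program's 100 seconds are spent in an operation that is made 4 times faster, `time_after_improvement(20, 80, 4)` gives 40 seconds, an overall speedup of 2.5 rather than 4.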

Mid-Term Review
• Multiplication & Division:
  – grade school version
  – 3 incrementally better algorithms (data paths)
  – Basics of Booth arithmetic
• Floating point:
  – Floating point representation
  – Floating point operations (+, -, *, /)
  – Guard, round and sticky bits
• Single cycle CPU:
  – Building blocks: register files, memory etc.
  – Storage units, clocking methodology
  – PC arithmetic
  – Instruction fetch
  – Datapath on various operations: Load, Store, Branch, R-type, I-type
  – Control: basic control signals for the MIPS subset
  – Distributed control: main control + ALU control
  – PLA implementation
  – Timing diagrams

Mid-Term Review
• Multi-cycle CPU:
  – Datapath: registers/stages: Ifetch, A, B, Execute, Store etc.
  – Various instructions through the datapath
  – Control: sharing functional units
  – Finite state machine perspective for control: FSM for MIPS
  – Implementation styles: ROM, PLA
  – Microprogramming: horizontal, vertical, relationship to RISC
  – Exceptions: change in FSM, internal, external; need to save state.