Embedded Systems in Silicon TD 5102 MIPS design
Datapath and Control
Henk Corporaal
http://www.ics.ele.tue.nl/~heco/courses/EmbSystems
HC TD 5102
Technical University Eindhoven
DTI / NUS Singapore 2005/2006
Topics
- Building a datapath
  - support a subset of the MIPS-I instruction-set
- A single cycle processor datapath
  - all instruction actions in one (long) cycle
- A multi-cycle processor datapath
  - each instruction takes multiple (shorter) cycles
- Exception support
Datapath and Control
  [Figure: the processor split into Control (FSM or microprogramming) and Datapath (registers & memories, multiplexors, buses, ALUs)]
The Processor: Datapath & Control
- Simplified MIPS implementation to contain only:
  - memory-reference instructions: lw, sw
  - arithmetic-logical instructions: add, sub, and, or, slt
  - control flow instructions: beq, j
- Generic implementation:
  - use the program counter (PC) to supply instruction address
  - get the instruction from memory
  - read registers
  - use the instruction to decide exactly what to do
- All instructions use the ALU after reading the registers
  - Why? memory-reference?
More Implementation Details
- Abstract / simplified view:
  [Figure: PC supplies the address to the instruction memory; the instruction supplies register numbers to the registers; register data feeds the ALU; the ALU result is the address for the data memory; data returns to the registers]
- Two types of functional units:
  - elements that operate on data values (combinational)
  - elements that contain state (sequential)
State Elements
- Unclocked vs. clocked
- Clocks used in synchronous logic
  - when should an element that contains state be updated?
  [Figure: clock waveform showing the rising edge, the falling edge, and the cycle time]
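An unclocked state element
- The set-reset (SR) latch
  - output depends on present inputs and also on past inputs
  [Figure: SR latch with inputs R and S and outputs Q and its complement]
- Truth table:
  R  S  |  Q
  0  0  |  Q (previous value kept)
  0  1  |  1 (state change)
  1  0  |  0 (state change)
  1  1  |  ?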
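The truth table can be checked with a small simulation. This is a minimal sketch that assumes the latch is built from two cross-coupled NOR gates (a common construction; the slide itself does not name the gate type). The function names `nor` and `sr_latch` are illustrative.

```python
def nor(a, b):
    """Two-input NOR gate on 0/1 values."""
    return int(not (a or b))


def sr_latch(s, r, q):
    """Return the settled Q of a cross-coupled NOR latch.

    s, r: set and reset inputs; q: value stored before the inputs were applied.
    Assumption: NOR-based latch, Q = NOR(R, Q'), Q' = NOR(S, Q).
    """
    q_bar = int(not q)
    for _ in range(4):  # a few passes are enough for the feedback loop to settle
        q_new = nor(r, q_bar)
        q_bar_new = nor(s, q_new)
        if (q_new, q_bar_new) == (q, q_bar):
            break
        q, q_bar = q_new, q_bar_new
    return q


for stored in (0, 1):
    print("hold :", stored, "->", sr_latch(0, 0, stored))
    print("set  :", stored, "->", sr_latch(1, 0, stored))
    print("reset:", stored, "->", sr_latch(0, 1, stored))
```

The S = R = 1 row is deliberately left out of the loop, since the slide marks its result as undefined.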
Latches and Flip-flops
- Output is equal to the stored value inside the element (don't need to ask for permission to look at the value)
- Change of state (value) is based on the clock
  - Latches: state changes whenever the inputs change and the clock is asserted
  - Flip-flops: state changes only on a clock edge (edge-triggered methodology)
- A clocking methodology defines when signals can be read and written: we wouldn't want to read a signal at the same time it was being written
D-latch
- Two inputs:
  - the data value to be stored (D)
  - the clock signal (C) indicating when to read & store D
- Two outputs:
  - the value of the internal state (Q) and its complement
D flip-flop
- Output changes only on the clock edge
  [Figure: D flip-flop built from two D latches in series, with data input D, clock C, and outputs Q and its complement]
Our Implementation
- An edge triggered methodology
- Typical execution:
  - read contents of some state elements,
  - send values through some combinational logic,
  - write results to one or more state elements
  [Figure: State element 1 -> Combinational logic -> State element 2, within one clock cycle]
Register File
- 3-ported: one write port, two read ports
  [Figure: register file with inputs Read reg. #1, Read reg. #2, Write reg. #, Write data, and the Write control signal; outputs Read data 1 and Read data 2]
Register file: read ports
- Register file built using D flip-flops
  [Figure: implementation of the read ports; the registers (Register 0, Register 1, ..., Register n - 1, Register n) feed two multiplexors, selected by Read register number 1 and Read register number 2, that produce Read data 1 and Read data 2]
Register file: write port
- Note: we still use the real clock to determine when to write
  [Figure: implementation of the write port; the Write signal and an n-to-1 decoder on the Register number drive the C input of each register (Register 0, Register 1, ..., Register n - 1, Register n); Register data goes to every D input]
Simple Implementation
- Include the functional units we need for each instruction
  [Figure: functional units: instruction memory, program counter, and adder; data memory unit (MemWrite, MemRead) and sign-extension unit (16 to 32 bits); registers (three 5-bit register numbers, RegWrite) and ALU (3-bit ALU control, Zero and ALU result outputs)]
- Why do we need this stuff?
Building the Datapath
- Use multiplexors to stitch them together
  [Figure: single-cycle datapath with PC, instruction memory, registers, sign extend, shift left 2, adders, ALU, and data memory; multiplexors controlled by PCSrc, ALUSrc, and MemtoReg; other control signals RegWrite, ALU operation (3 bits), MemWrite, MemRead]
Our Simple Control Structure
- All of the logic is combinational
- We wait for everything to settle down, and the right thing to be done
  - ALU might not produce "right answer" right away
  - we use write signals along with clock to determine when to write
- Cycle time determined by length of the longest path
  [Figure: State element 1 -> Combinational logic -> State element 2, within one clock cycle]
- We are ignoring some details like setup and hold times!
Control
- Selecting the operations to perform (ALU, read/write, etc.)
- Controlling the flow of data (multiplexor inputs)
- Information comes from the 32 bits of the instruction
- Example: add $8, $17, $18
  Instruction format:
  000000  10001  10010  01000  00000  100000
  op      rs     rt     rd     shamt  funct
- ALU's operation based on instruction type and function code
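The field layout of the example can be reproduced by packing the six fields into one 32-bit word. This is an illustrative sketch; the helper name `encode_r_type` is an assumption and not part of the course material.

```python
def encode_r_type(rs, rt, rd, shamt=0, funct=0, op=0):
    """Pack R-format fields (op:6, rs:5, rt:5, rd:5, shamt:5, funct:6) into 32 bits."""
    fields = ((op, 6), (rs, 5), (rt, 5), (rd, 5), (shamt, 5), (funct, 6))
    word = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"field value {value} does not fit in {width} bits")
        word = (word << width) | value
    return word


# add $8, $17, $18  ->  rd = 8, rs = 17, rt = 18, funct = 100000 (binary)
word = encode_r_type(rs=17, rt=18, rd=8, funct=0b100000)
print(f"{word:032b}")
print(f"0x{word:08X}")
```

Note that the destination register $8 is written first in assembly but sits in the rd field, after rs and rt, in the encoded word.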
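Control: 2 level implementation
  [Figure: the opcode (bits 31-26, 6 bits) of the instruction register feeds Control 2, which produces the 2-bit ALUop; ALUop and the funct field (bits 5-0, 6 bits) feed Control 1, which produces the 3-bit ALUcontrol for the ALU]
- ALUop:
  00: lw, sw
  01: beq
  10: add, sub, and, or, slt
- ALUcontrol:
  000: and
  001: or
  010: add
  110: sub
  111: set on less than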
Datapath with Control
  [Figure: single-cycle datapath with control. Instruction [31-26] feeds the Control unit, which produces RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite. Instruction [25-21] and [20-16] select the read registers, [20-16] or [15-11] the write register, [15-0] goes to sign extend, and [5-0] to ALU control]
ALU Control 1
- What should the ALU do with this instruction?
- Example: lw $1, 100($2)
  35  2   1   100
  op  rs  rt  16 bit offset
- ALU control input:
  000  AND
  001  OR
  010  add
  110  subtract
  111  set-on-less-than
- Why is the code for subtract 110 and not 011?
ALU Control 1
- Must describe hardware to compute 3-bit ALU control input
  - given instruction type (ALUOp, the ALU operation class, computed from instruction type): 00 = lw, sw; 01 = beq; 10 = arithmetic
  - function code for arithmetic
- Describe it using a truth table (can turn into gates), with ALUOp and the function code as inputs and the ALU control bits as outputs
ALU Control 1
- Simple combinational logic (truth tables)
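The ALU control truth table can be written as a small function. This is a sketch under stated assumptions: the 2-bit ALUOp classes and the 3-bit ALU control codes are those listed on the slides, while the funct codes other than add (100000) are the standard MIPS values and do not appear in the slides. The names `FUNCT_TO_ALU` and `alu_control` are illustrative.

```python
# funct field -> 3-bit ALU control (standard MIPS funct codes, assumed here)
FUNCT_TO_ALU = {
    0b100000: 0b010,  # add
    0b100010: 0b110,  # sub
    0b100100: 0b000,  # and
    0b100101: 0b001,  # or
    0b101010: 0b111,  # slt (set on less than)
}


def alu_control(alu_op, funct=0):
    """Compute the 3-bit ALU control input from ALUOp and the funct field."""
    if alu_op == 0b00:  # lw, sw: add base register and offset
        return 0b010
    if alu_op == 0b01:  # beq: subtract and test the Zero output
        return 0b110
    if alu_op == 0b10:  # arithmetic: the funct field decides
        return FUNCT_TO_ALU[funct]
    raise ValueError(f"unsupported ALUOp {alu_op:02b}")


print(f"{alu_control(0b00):03b}")            # lw / sw
print(f"{alu_control(0b10, 0b101010):03b}")  # slt
```

For ALUOp 00 and 01 the funct field is a don't-care, which is why the truth table reduces to very little logic.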
Deriving Control 2 signals
- Input: 6 bits; output: 9 control signals
- Determine these control signals directly from the opcodes:
  R-format: 0
  lw: 35
  sw: 43
  beq: 4
Control 2
- PLA example implementation
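The mapping from opcode to the nine control bits can be sketched as a lookup table. The signal values below follow the usual textbook table for this datapath and are an assumption here, since the slides show the table only as a figure; `None` marks a don't-care. The names `Control` and `main_control` are illustrative.

```python
from typing import NamedTuple, Optional


class Control(NamedTuple):
    reg_dst: Optional[int]      # None = don't care
    alu_src: int
    mem_to_reg: Optional[int]   # None = don't care
    reg_write: int
    mem_read: int
    mem_write: int
    branch: int
    alu_op: int                 # 2 bits, so 9 control bits in total


MAIN_CONTROL = {
    0:  Control(1,    0, 0,    1, 0, 0, 0, 0b10),  # R-format
    35: Control(0,    1, 1,    1, 1, 0, 0, 0b00),  # lw
    43: Control(None, 1, None, 0, 0, 1, 0, 0b00),  # sw
    4:  Control(None, 0, None, 0, 0, 0, 1, 0b01),  # beq
}


def main_control(opcode):
    """Return the control signals for a 6-bit opcode."""
    return MAIN_CONTROL[opcode]


print(main_control(35))
```

A PLA implements the same table with one product term per opcode, which is why only four opcodes need to be recognized.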
Single Cycle Implementation
- Calculate cycle time assuming negligible delays except:
  - memory (2 ns), ALU and adders (2 ns), register file access (1 ns)
  [Figure: single-cycle datapath with control signals PCSrc, RegWrite, RegDst, ALUSrc, ALUOp, MemWrite, MemRead, and MemtoReg]
Single Cycle Implementation
- Memory (2 ns), ALU & adders (2 ns), reg. file access (1 ns)
- Fixed length clock: longest instruction is the 'lw', which requires 8 ns
- Variable clock length (not realistic, just as exercise):
  R-instr: 6 ns
  Load:    8 ns
  Store:   7 ns
  Branch:  5 ns
  Jump:    2 ns
- Average depends on instruction mix (see pg 374)
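The per-instruction times follow from summing the delays of the units on each instruction's path. The sketch below reproduces the numbers above; the path lists are the usual reading of the single-cycle datapath, and the instruction mix is purely hypothetical, chosen only to show how the average is formed.

```python
DELAY_NS = {"mem": 2, "alu": 2, "reg": 1}

# Functional units used, in order, by each instruction class
PATHS = {
    "R-type": ["mem", "reg", "alu", "reg"],         # fetch, read regs, ALU, write reg
    "load":   ["mem", "reg", "alu", "mem", "reg"],  # ... data memory, write reg
    "store":  ["mem", "reg", "alu", "mem"],
    "branch": ["mem", "reg", "alu"],
    "jump":   ["mem"],
}


def instruction_time(kind):
    """Critical-path delay in ns for one instruction class."""
    return sum(DELAY_NS[unit] for unit in PATHS[kind])


def fixed_clock():
    """A fixed clock must fit the slowest instruction."""
    return max(instruction_time(kind) for kind in PATHS)


def average_time(mix):
    """Average time per instruction for a variable-length clock."""
    return sum(fraction * instruction_time(kind) for kind, fraction in mix.items())


# Hypothetical instruction mix (fractions sum to 1); not taken from the slides
HYPOTHETICAL_MIX = {"R-type": 0.45, "load": 0.25, "store": 0.10,
                    "branch": 0.15, "jump": 0.05}

for kind in PATHS:
    print(kind, instruction_time(kind), "ns")
print("fixed clock:", fixed_clock(), "ns")
print("average with hypothetical mix:", average_time(HYPOTHETICAL_MIX), "ns")
```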
Where we are headed
- Single cycle problems:
  - what if we had a more complicated instruction like floating point?
  - wasteful of area: no sharing of hardware resources
- One solution:
  - use a "smaller" cycle time
  - have different instructions take different numbers of cycles
  - a "multicycle" datapath:
  [Figure: multicycle datapath with PC, one memory for instructions or data, instruction register (IR), memory data register (MDR), registers, A and B registers, ALU, and ALUOut]
Multicycle Approach
- We will be reusing functional units
  - ALU used to compute address and to increment PC
  - Memory used for instruction and data
- Add registers after every major functional unit
- Our control signals will not be determined solely by instruction
  - e.g., what should the ALU do for a "subtract" instruction?
- We'll use a finite state machine (FSM) or microcode for control
Review: finite state machines
- Finite state machines:
  - a set of states and
  - next state function (determined by current state and the input)
  - output function (determined by current state and possibly input)
  [Figure: current state and inputs feed the next-state function, which produces the next state (clocked), and the output function, which produces the outputs]
- We'll use a Moore machine (output based only on current state)
Multicycle Approach
- Break up the instructions into steps, each step takes a cycle
  - balance the amount of work to be done
  - restrict each cycle to use only one major functional unit
- At the end of a cycle
  - store values for use in later cycles (easiest thing to do)
  - introduce additional "internal" registers
- Notice: we distinguish
  - processor state: programmer visible registers
  - internal state: programmer invisible registers (like IR, MDR, A, B, and ALUOut)
Multicycle Approach
  [Figure: multicycle datapath: PC, memory (Address, MemData, Write data), instruction register with fields [25-21], [20-16], [15-11], and [15-0], memory data register, registers, A and B registers, sign extend, shift left 2, ALU (Zero, ALU result), ALUOut, multiplexors, and the constant 4 as an ALU input]
Multicycle Approach
- Note that previous picture does not include:
  - branch support
  - jump support
  - control lines and logic
- Tclock > max (ALU delay, memory access, regfile access)
- See book for complete picture
Five Execution Steps
- Instruction Fetch
- Instruction Decode and Register Fetch
- Execution, Memory Address Computation, or Branch Completion
- Memory Access or R-type instruction completion
- Write-back step
INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!
Step 1: Instruction Fetch
- Use PC to get instruction and put it in the Instruction Register
- Increment the PC by 4 and put the result back in the PC
- Can be described succinctly using RTL ("Register-Transfer Language"):
    IR = Memory[PC];
    PC = PC + 4;
- Can we figure out the values of the control signals?
- What is the advantage of updating the PC now?
Step 2: Instruction Decode and Register Fetch
- Read registers rs and rt in case we need them
- Compute the branch address in case the instruction is a branch
- Previous two actions are done optimistically!!
- RTL:
    A = Reg[IR[25-21]];
    B = Reg[IR[20-16]];
    ALUOut = PC + (sign-extend(IR[15-0]) << 2);
- We aren't setting any control lines based on the instruction type (we are busy "decoding" it in our control logic)
Step 3 (instruction dependent)
- ALU is performing one of four functions, based on instruction type
- Memory reference:
    ALUOut = A + sign-extend(IR[15-0]);
- R-type:
    ALUOut = A op B;
- Branch:
    if (A==B) PC = ALUOut;
- Jump:
    PC = PC[31-28] || (IR[25-0]<<2)
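The branch and jump target expressions in the RTL can be tried out on concrete words. This is a minimal sketch; the function names and the example instruction words are illustrative assumptions, and `pc` stands for the already incremented PC (PC + 4 from step 1).

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(value):
    """Sign-extend a 16-bit field to a Python int."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, ir):
    """ALUOut = PC + (sign-extend(IR[15-0]) << 2), with PC already incremented."""
    return (pc + (sign_extend16(ir) << 2)) & MASK32


def jump_target(pc, ir):
    """PC = PC[31-28] || (IR[25-0] << 2)."""
    return (pc & 0xF0000000) | ((ir & 0x03FFFFFF) << 2)


# Example words (hypothetical): a beq with offset -1 and a j with target field 0x100000
print(hex(branch_target(0x00400004, 0x1000FFFF)))
print(hex(jump_target(0x00400004, 0x08100000)))
```

The shift by 2 reflects that instructions are word aligned, so the two low address bits are always zero and need not be stored in the instruction.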
Step 4 (R-type or memory-access)
- Loads and stores access memory
    MDR = Memory[ALUOut];
  or
    Memory[ALUOut] = B;
- R-type instructions finish
    Reg[IR[15-11]] = ALUOut;
- The write actually takes place at the end of the cycle on the edge
Write-back step
- Memory read completion step
    Reg[IR[20-16]] = MDR;
- What about all the other instructions?
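Summary execution steps
Steps taken to execute any instruction class:
- Instruction fetch (all): IR = Memory[PC]; PC = PC + 4
- Instruction decode / register fetch (all): A = Reg[IR[25-21]]; B = Reg[IR[20-16]]; ALUOut = PC + (sign-extend(IR[15-0]) << 2)
- Execution, address computation, branch/jump completion:
  - R-type: ALUOut = A op B
  - Memory reference: ALUOut = A + sign-extend(IR[15-0])
  - Branch: if (A==B) PC = ALUOut
  - Jump: PC = PC[31-28] || (IR[25-0]<<2)
- Memory access or R-type completion:
  - R-type: Reg[IR[15-11]] = ALUOut
  - Load: MDR = Memory[ALUOut]; store: Memory[ALUOut] = B
- Write-back (load only): Reg[IR[20-16]] = MDR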
Simple Questions
- How many cycles will it take to execute this code?
    lw $t2, 0($t3)
    lw $t3, 4($t3)
    beq $t2, $t3, L1    #assume not taken
    add $t5, $t2, $t3
    sw $t5, 8($t3)
    L1: ...
- What is going on during the 8th cycle of execution?
- In what cycle does the actual addition of $t2 and $t3 take place?
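One way to check answers to these questions is to count steps per instruction class. The sketch assumes the cycle counts implied by the five execution steps (load 5, store 4, R-type 4, branch 3, jump 3); the names `CYCLES`, `PROGRAM`, `total_cycles`, and `locate` are illustrative.

```python
# Cycles per instruction class in the multicycle design (from the execution steps)
CYCLES = {"lw": 5, "sw": 4, "rtype": 4, "beq": 3, "j": 3}

# The code fragment from the slide, with the branch not taken
PROGRAM = ["lw", "lw", "beq", "rtype", "sw"]


def total_cycles(program):
    """Total number of cycles to execute the instruction sequence."""
    return sum(CYCLES[kind] for kind in program)


def locate(program, cycle):
    """Return (instruction index, step number) active in a given 1-based cycle."""
    start = 1
    for index, kind in enumerate(program):
        end = start + CYCLES[kind] - 1
        if start <= cycle <= end:
            return index, cycle - start + 1
        start = end + 1
    raise ValueError("cycle is past the end of the program")


print("total cycles:", total_cycles(PROGRAM))
print("cycle 8 is (instruction index, step):", locate(PROGRAM, 8))
```

For an R-type instruction the ALU operation itself happens in step 3, so `locate` can also be used to find the cycle in which the add executes.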
Implementing the Control
- Value of control signals is dependent upon:
  - what instruction is being executed
  - which step is being performed
- Use the information we have accumulated to specify a finite state machine (FSM)
  - specify the finite state machine graphically, or
  - use microprogramming
- Implementation can be derived from specification
FSM: high level view
  [Figure: Start/reset -> instruction fetch, decode and register fetch -> one of: memory access instructions, R-type instructions, branch instruction, jump instruction]
Graphical Specification of FSM
- State 0, instruction fetch (Start): MemRead, ALUSrcA = 0, IorD = 0, IRWrite, ALUSrcB = 01, ALUOp = 00, PCWrite, PCSource = 00
- State 1, instruction decode / register fetch: ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00
- State 2, memory address computation (Op = 'LW' or 'SW'): ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
- State 3, memory access (Op = 'LW'): MemRead, IorD = 1
- State 4, write-back step: RegDst = 0, RegWrite, MemtoReg = 1
- State 5, memory access (Op = 'SW'): MemWrite, IorD = 1
- State 6, execution (Op = R-type): ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10
- State 7, R-type completion: RegDst = 1, RegWrite, MemtoReg = 0
- State 8, branch completion (Op = 'BEQ'): ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCWriteCond, PCSource = 01
- State 9, jump completion (Op = 'J'): PCWrite, PCSource = 10
- How many state bits will we need?
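The state graph can be expressed as a next-state function and walked for each instruction class. This is a sketch that assumes the usual numbering of the ten states (0 = fetch, 1 = decode, 2 = address computation, 3/5 = memory access, 4 = write-back, 6/7 = R-type, 8 = branch, 9 = jump); the opcode is passed as a string for readability, and the names `next_state` and `trace` are illustrative.

```python
def next_state(state, op):
    """Next-state function of the control FSM (a Moore machine)."""
    if state == 0:                      # instruction fetch
        return 1
    if state == 1:                      # decode: dispatch on the opcode
        return {"lw": 2, "sw": 2, "rtype": 6, "beq": 8, "j": 9}[op]
    if state == 2:                      # address computed: load or store?
        return 3 if op == "lw" else 5
    if state == 3:                      # load: memory read done, write back next
        return 4
    if state == 6:                      # R-type: execution done, completion next
        return 7
    return 0                            # states 4, 5, 7, 8, 9 end the instruction


def trace(op):
    """List the states visited while executing one instruction."""
    states = [0]
    state = next_state(0, op)
    while state != 0:
        states.append(state)
        state = next_state(state, op)
    return states


for op in ("lw", "sw", "rtype", "beq", "j"):
    print(op, trace(op), len(trace(op)), "cycles")
```

Only states 1 and 2 look at the opcode, which is the observation the ROM and PLA size comparison later relies on.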
Finite State Machine for Control
- Implementation:
  [Figure: FSM control implementation]
PLA Implementation (see fig C.14)
- If I picked a horizontal or vertical line could you explain it?
- What type of FSM is used?
  [Figure: PLA with inputs opcode and current state, and outputs datapath control and next state]
ROM Implementation
- ROM = "Read Only Memory"
  - values of memory locations are fixed ahead of time
- A ROM can be used to implement a truth table
  - if the address is m bits, we can address 2^m entries in the ROM
  - our outputs are the bits of data that the address points to
  [Figure: ROM with an m-bit address input and an n-bit data output, with an example address/data table]
- m is the "height", and n is the "width"
ROM Implementation
- How many inputs are there? 6 bits for opcode, 4 bits for state = 10 address lines (i.e., 2^10 = 1024 different addresses)
- How many outputs are there? 16 datapath-control outputs, 4 state bits = 20 outputs
- ROM is 2^10 x 20 = 20K bits (very large and a rather unusual size)
- Rather wasteful, since for lots of the entries the outputs are the same, i.e., opcode is often ignored
ROM Implementation
Cheaper implementation:
- Exploit the fact that the FSM is a Moore machine ==>
  - Control outputs only depend on current state and not on other incoming control signals!
  - Next state depends on all inputs
- Break up the table into two parts
  - 4 state bits tell you the 16 outputs: 2^4 x 16 bits of ROM
  - 10 bits tell you the 4 next state bits: 2^10 x 4 bits of ROM
  - Total number of bits: 4.3K bits of ROM
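The ROM sizes quoted above are simple products of entries and word width. The sketch below redoes the arithmetic; the helper name `rom_bits` is illustrative.

```python
def rom_bits(address_bits, data_bits):
    """Total bits in a ROM with 2^address_bits entries of data_bits each."""
    return (1 << address_bits) * data_bits


single_rom = rom_bits(10, 20)      # opcode + state in, control + next state out
output_rom = rom_bits(4, 16)       # state only -> 16 control outputs (Moore machine)
next_state_rom = rom_bits(10, 4)   # opcode + state -> 4 next-state bits
split_total = output_rom + next_state_rom

print("single ROM:", single_rom, "bits =", single_rom / 1024, "Kbit")
print("split ROMs:", split_total, "bits =", split_total / 1024, "Kbit")
```

The split saves space because the wide part of the table (16 control outputs) no longer has to be repeated for every opcode value.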
ROM vs PLA
- PLA is much smaller
  - can share product terms (ROM has an entry (= address) for every product term)
  - only need entries that produce an active output
  - can take into account don't cares
- Size of PLA: (#inputs x #product-terms) + (#outputs x #product-terms)
  - For this example: (10 x 17) + (20 x 17) = 510
- PLA cells usually slightly bigger than the size of a ROM cell
Exceptions
- Unexpected events
- External: interrupt
  - e.g. I/O request
- Internal: exception
  - e.g. overflow, undefined instruction opcode, software trap, page fault
- How to handle exception?
  - Jump to general entry point (record exception type in status register)
  - Jump to vectored entry point
  - Address of faulting instruction has to be recorded!
Exceptions
Changes needed: see fig. 5.48 / 5.49 / 5.50
- Extend PC input mux with extra entry with fixed address: "C000000 hex"
- Add EPC register containing old PC (we'll use the ALU to decrement PC with 4)
  - extra input ALU src 2 needed with fixed value 4
- Cause register (one bit in our case) containing:
  - 0: undefined instruction
  - 1: ALU overflow
- Add 2 states to FSM
  - undefined instr. state #10
  - overflow state #11
Exceptions
Legend:
  IntCause = 0/1    type of exception
  CauseWrite        write Cause register
  ALUSrcA = 0       select PC
  ALUSrcB = 01      select constant 4
  ALUOp = 01        subtract operation
  EPCWrite          write EPC register with current PC
  PCWrite           write PC with exception address
  PCSource = 11     select exception address: C000000 hex
2 new states:
- #10 undefined instruction: IntCause = 0, CauseWrite, ALUSrcA = 0, ALUSrcB = 01, ALUOp = 01, EPCWrite, PCWrite, PCSource = 11
- #11 overflow: IntCause = 1, CauseWrite, ALUSrcA = 0, ALUSrcB = 01, ALUOp = 01, EPCWrite, PCWrite, PCSource = 11
- Both continue to state 0 (begin of next instruction)