CPE 626: Advanced VLSI Design, L02
Department of Electrical and Computer Engineering
University of Alabama in Huntsville
UAH-CPE 631

Outline
• Simple Processor
  – MU0
  – Datapath Design
  – Control Logic
  – ALU Design
• Pipeline Processor
  – DLX
  – ISA
    • Registers
    • Addressing Modes and Data Types
    • Instruction Format
    • Instruction Set
  – Non-pipeline Implementation
  – Pipeline Implementation
10/31/2020 UAH-CPE 631

MU0 – A Simple Processor
• Instruction format
• Instruction set

MU0 Logic Design
• Separate the design into two components
  – Datapath: all the components that carry, store, or process bits, including the accumulator, program counter, ALU, and instruction register
  – Control logic: everything that does not fit comfortably into the datapath
• Datapath design: there are many ways to do this
  – Assume that memory access is the limiting factor, and that each memory access takes exactly one clock cycle

MU0 Datapath Example
• Program Counter (PC)
• Instruction Register (IR)
• Accumulator (ACC)
• Instruction Decode and Control Logic
• Arithmetic-Logic Unit (ALU)
Follow the principle that memory is the limiting factor in the design: each instruction takes exactly as many clock cycles as the memory accesses it must make.
Note: there is no dedicated PC incrementer! Why?

MU0 Datapath Design
• Assume that each instruction starts when it has arrived in the IR
• Step 1: EX (execute)
  – LDA S: ACC <- Mem[S]
  – STO S: Mem[S] <- ACC
  – ADD S: ACC <- ACC + Mem[S]
  – SUB S: ACC <- ACC - Mem[S]
  – JMP S: PC <- S
  – JGE S: if (ACC >= 0) PC <- S
  – JNE S: if (ACC != 0) PC <- S
• Initialization
  – A Reset input starts execution from a known address; here it is 000 hex
    • provide zero at the ALU output and then load it into the PC register
• Step 2: IF (fetch the next instruction)
  – Either the PC or the address in the IR is issued to fetch the next instruction
  – The address is incremented in the ALU and the result is saved into the PC

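The execute-then-fetch cycle above can be sketched as a small interpreter. This is a minimal Python sketch, not the lecture's hardware design: it assumes 16-bit words with the 4-bit opcode in the top bits and the 12-bit address S below it, takes the opcode values from the per-instruction control slides (with STP as 0111), and ignores clock-level details such as the Ex/ft state bit. `enc` and `run_mu0` are hypothetical helper names.

```python
MASK16 = 0xFFFF
LDA, STO, ADD, SUB, JMP, JGE, JNE, STP = range(8)   # opcodes 0000 .. 0111

def enc(op, s=0):
    """Pack a 4-bit opcode and a 12-bit address S into one 16-bit instruction."""
    return (op << 12) | (s & 0x0FFF)

def run_mu0(mem, max_steps=10_000):
    """Interpret the MU0 program in mem (a list of 16-bit words); return ACC at STP."""
    acc = 0
    pc = 0                         # reset: start from address 000 hex
    ir = mem[pc]                   # first fetch: IR <- Mem[PC] ...
    pc = 1                         # ... and PC <- PC + 1
    for _ in range(max_steps):
        op, s = ir >> 12, ir & 0x0FFF
        # Step 1: EX (execute the instruction held in IR)
        if op == LDA:
            acc = mem[s]
        elif op == STO:
            mem[s] = acc
        elif op == ADD:
            acc = (acc + mem[s]) & MASK16
        elif op == SUB:
            acc = (acc - mem[s]) & MASK16
        elif op == JMP:
            pc = s
        elif op == JGE:
            if not acc & 0x8000:   # ACC >= 0 <=> sign bit ACC15 == 0
                pc = s
        elif op == JNE:
            if acc != 0:
                pc = s
        elif op == STP:
            return acc
        else:
            raise ValueError(f"undefined opcode {op:04b}")
        # Step 2: IF (fetch from PC, or from S after a taken jump, then increment)
        ir = mem[pc]
        pc = (pc + 1) & 0x0FFF
    raise RuntimeError("step limit reached without STP")

# Count a memory word down to zero: 0: LDA 20, 1: SUB 21, 2: STO 20, 3: JNE 1, 4: STP
mem = [0] * 4096
mem[0:5] = [enc(LDA, 20), enc(SUB, 21), enc(STO, 20), enc(JNE, 1), enc(STP)]
mem[20], mem[21] = 3, 1
print(run_mu0(mem), mem[20])   # 0 0
```

Because a jump simply sets the PC before the shared fetch step, jumps and sequential instructions reuse the same increment, which mirrors the datapath's use of the ALU (B+1) instead of a dedicated PC incrementer.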
MU0 RTL Organization
• Control Logic
  – Asel
  – Bsel
  – ACCce (ACC change enable)
  – PCce (PC change enable)
  – IRce (IR change enable)
  – ACCoe (ACC output enable)
  – ALUfs (ALU function select)
  – MEMrq (memory request)
  – RnW (read/write)
  – Ex/ft (execute/fetch)

MU0 Control Logic

LDA S (0000)
  – Ex/ft = 0: ALU function B
  – Ex/ft = 1: ALU function B+1
STO S (0001)
  – Ex/ft = 0: ALU function x (don't care)
  – Ex/ft = 1: ALU function B+1
ADD S (0010)
  – Ex/ft = 0: ALU function A+B
  – Ex/ft = 1: ALU function B+1
SUB S (0011)
  – Ex/ft = 0: ALU function A-B
  – Ex/ft = 1: ALU function B+1
JMP S (0100)
  – Ex/ft = 0: ALU function B+1
JGE S (0101)
  – Ex/ft = 0, ACC15 = 1 (ACC negative, jump not taken)
  – Ex/ft = 0, ACC15 = 0 (ACC >= 0, jump taken)
  – ALU function B+1
JNE S (0110)
  – Ex/ft = 0, ACCz = 1 (ACC zero, jump not taken)
  – Ex/ft = 0, ACCz = 0 (ACC nonzero, jump taken)
  – ALU function B+1
STP (0111)
  – Ex/ft = 0: ALU function x (don't care)
Reset
  – Ex/ft = 0: ALU function 0

MU0 ALU Design
• ALU functions: A+B, A-B, B, B+1, and 0 (the last used only when reset is active) => 4 functions in normal operation
• Aen (enable operand A)
• Binv (invert operand B)

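To illustrate how two control bits can cover the four normal-operation functions, here is a sketch of a 16-bit adder whose A input is gated by Aen and whose B input can be inverted by Binv. The carry-in control `cin` is an assumption that does not appear on the slide, and function 0 is modeled simply by forcing the output to zero during reset; `mu0_alu` and `alu_op` are hypothetical names.

```python
MASK16 = 0xFFFF

def mu0_alu(a, b, aen, binv, cin):
    """Adder-based ALU: (A if Aen else 0) + (~B if Binv else B) + Cin, kept to 16 bits."""
    x = a if aen else 0
    y = (~b & MASK16) if binv else b
    return (x + y + cin) & MASK16

# (Aen, Binv, Cin) for the four functions used in normal operation.
# Cin is an assumed carry-in control; it does not appear on the slide.
CONTROLS = {
    "A+B": (1, 0, 0),
    "A-B": (1, 1, 1),   # A + ~B + 1 = A - B in two's complement
    "B":   (0, 0, 0),
    "B+1": (0, 0, 1),
}

def alu_op(fn, a, b, reset=False):
    """Evaluate ALU function fn; while reset is active the output is forced to 0."""
    if reset:
        return 0
    return mu0_alu(a, b, *CONTROLS[fn])

print(alu_op("A-B", 5, 7), hex(alu_op("B+1", 0, 0x0FFF)))   # 65534 0x1000
```

Subtraction reuses the adder by inverting B and adding one, which is why a separate subtractor is not needed in this sketch.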
Another Example: DLX Architecture

DLX Registers
• GPRs with a load-store architecture
• GPRs: 32 32-bit registers named R0, R1, ..., R31; R0 = 0
• FPRs (floating-point registers):
  – single precision: 32 32-bit registers named F0, F1, ..., F31 (accessed independently)
  – double precision: 16 64-bit registers named F0, F2, ..., F30 (accessed in pairs)
• Instructions that support transfers between GPRs and FPRs
• Other status registers, e.g., the floating-point status register (holds information about the results of FP operations)

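A sketch of the register conventions above: writes to R0 are discarded, and a double-precision value occupies an even/odd pair of FP registers. The class and method names are hypothetical, and placing the most significant word in the even register follows the big-endian `Regs[F0] ## Regs[F1]` notation used for LD in the data-transfer examples.

```python
import struct

class DLXRegisterFile:
    """32 integer GPRs (R0 hardwired to 0) and 32 FPRs stored as raw 32-bit patterns."""

    def __init__(self):
        self.gpr = [0] * 32
        self.fpr = [0] * 32

    def read_gpr(self, i):
        return 0 if i == 0 else self.gpr[i]

    def write_gpr(self, i, value):
        if i != 0:                          # writes to R0 are ignored
            self.gpr[i] = value & 0xFFFFFFFF

    def write_single(self, i, x):
        """Store x as an IEEE single-precision pattern in Fi (any i)."""
        self.fpr[i] = struct.unpack(">I", struct.pack(">f", x))[0]

    def write_double(self, i, x):
        """Store x in the pair Fi ## Fi+1; i must be even (F0, F2, ..., F30)."""
        if i % 2:
            raise ValueError("double-precision registers are F0, F2, ..., F30")
        hi, lo = struct.unpack(">II", struct.pack(">d", x))
        self.fpr[i], self.fpr[i + 1] = hi, lo   # most significant word in Fi

    def read_double(self, i):
        return struct.unpack(">d", struct.pack(">II", self.fpr[i], self.fpr[i + 1]))[0]
```

For example, `write_double(2, 1.0)` leaves F2 = 0x3FF00000 and F3 = 0, so F2 and F3 cannot also hold independent single-precision values at the same time.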
Addressing Modes and Data Types
• Immediate, with a 16-bit value field
• Displacement, with a 16-bit displacement
  – register deferred is derived when disp = 0
  – absolute is derived from displacement with R0 as the base
• Byte addressable, big-endian, with 32-bit addresses
• All memory references are loads/stores through a GPR or FPR and must be aligned
• Data types
  – 8-bit bytes and 16-bit half words (loaded into registers with either zeros or the sign bit replicated to fill 32 bits)
  – 32-bit integers
  – 32-bit single-precision and 64-bit double-precision FP

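Displacement addressing and the alignment rule fit in a few lines. This sketch assumes natural alignment (an access of n bytes must start at a multiple of n) and models register values as plain Python integers; `sext16` and `effective_address` are illustrative names.

```python
def sext16(x):
    """Sign-extend a 16-bit displacement field."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def effective_address(base_value, disp16, size):
    """EA = Regs[base] + sign-extended displacement, checked for natural alignment.

    Pass base_value = 0 (the contents of R0) to get absolute addressing,
    or disp16 = 0 to get register deferred addressing.
    """
    ea = (base_value + sext16(disp16)) & 0xFFFFFFFF
    if ea % size:
        raise ValueError(f"misaligned {size}-byte access at {ea:#010x}")
    return ea

print(hex(effective_address(0x1000, 0xFFFC, 4)))   # displacement -4 -> 0xffc
```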
Instruction Formats
• I-type: load, store, arithmetic, logic, relational, shift, branch
• R-type: arithmetic, logic, relational
• J-type: jump, jump and link, trap, return from exception

I-type instruction: Opcode (6) | rs1 (5) | rd (5) | immediate (16)
  – Loads and stores of bytes, words, half words
  – All immediates (rd ← rs1 op immediate)
  – Conditional branch instructions (rs1 is the register, rd is unused)
  – Jump register, jump and link register (rd = 0, rs1 = destination, immediate = 0)

R-type instruction: Opcode (6) | rs1 (5) | rs2 (5) | rd (5) | func (11)
  – Reg-reg ALU operations: rd ← rs1 func rs2; func = {add, sub, ...}
  – Read/write special registers and moves

J-type instruction: Opcode (6) | offset added to PC (26)
  – Jump and jump and link
  – Trap and return from exception

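Because every field sits at a fixed bit position, decoding is just shifting and masking. The sketch below uses DLX's big-endian bit numbering (bit 0 is the most significant bit, so the opcode is IR0..5); the caller passes the format explicitly because the numeric opcode assignments are not given here. `field` and `decode` are hypothetical names.

```python
def field(word, first, last):
    """Return bits first..last of a 32-bit word, with bit 0 as the MSB (DLX numbering)."""
    width = last - first + 1
    return (word >> (31 - last)) & ((1 << width) - 1)

def decode(word, fmt):
    """Split a 32-bit DLX instruction into named fields for format "I", "R" or "J"."""
    opcode = field(word, 0, 5)
    if fmt == "I":
        return {"opcode": opcode, "rs1": field(word, 6, 10),
                "rd": field(word, 11, 15), "immediate": field(word, 16, 31)}
    if fmt == "R":
        return {"opcode": opcode, "rs1": field(word, 6, 10), "rs2": field(word, 11, 15),
                "rd": field(word, 16, 20), "func": field(word, 21, 31)}
    if fmt == "J":
        return {"opcode": opcode, "offset": field(word, 6, 31)}
    raise ValueError(f"unknown format {fmt!r}")
```

The same field positions reappear in the ID and WB steps (IR6..10, IR11..15, IR16..20, IR16..31), which is what lets decoding overlap with the register reads.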
Instructions for Data Transfers

| Instruction opcode | Instruction meaning |
|---|---|
| LB, LBU, SB | Load byte, load byte unsigned, store byte |
| LH, LHU, SH | Load half word, load half word unsigned, store half word |
| LW, SW | Load word, store word (to/from integer registers) |
| LF, LD, SF, SD | Load SP float, load DP float, store SP float, store DP float (SP = single precision, DP = double precision) |
| MOVI2S, MOVS2I | Move from/to GPR to/from a special register |
| MOVF, MOVD | Copy one floating-point register or a DP pair to another register or pair |
| MOVFP2I, MOVI2FP | Move 32 bits from/to FP registers to/from integer registers |

| Example instruction | Meaning |
|---|---|
| LW R1, 30(R2) | Regs[R1] ←32 Mem[30 + Regs[R2]] |
| LW R1, 1000(R0) | Regs[R1] ←32 Mem[1000 + 0] |
| LB R1, 40(R3) | Regs[R1] ←32 (Mem[40 + Regs[R3]]_0)^24 ## Mem[40 + Regs[R3]] |
| LBU R1, 40(R3) | Regs[R1] ←32 0^24 ## Mem[40 + Regs[R3]] |
| LH R1, 40(R3) | Regs[R1] ←32 (Mem[40 + Regs[R3]]_0)^16 ## Mem[40 + Regs[R3]] ## Mem[41 + Regs[R3]] |
| LF F0, 50(R3) | Regs[F0] ←32 Mem[50 + Regs[R3]] |
| LD F0, 50(R2) | Regs[F0] ## Regs[F1] ←64 Mem[50 + Regs[R2]] |

Notation: ←n transfers an n-bit quantity; x_0 is the most significant (sign) bit of x; y^n replicates y n times; ## concatenates.

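The sign- and zero-extension rules in the table can be checked against a byte-addressed, big-endian memory model. This is a sketch with a hypothetical `load` helper; it omits the alignment check and the FP loads.

```python
def load(mem, ea, kind):
    """Return the 32-bit register value produced by an integer load at address ea.

    mem is a bytes-like memory in big-endian order: Mem[ea] holds the most
    significant byte. kind is "LB", "LBU", "LH", "LHU" or "LW".
    """
    size = {"LB": 1, "LBU": 1, "LH": 2, "LHU": 2, "LW": 4}[kind]
    value = int.from_bytes(mem[ea:ea + size], "big")
    width = 8 * size
    if kind in ("LB", "LH") and value >> (width - 1):
        value -= 1 << width          # replicate the sign bit (Mem[ea]_0)
    return value & 0xFFFFFFFF        # LBU/LHU are simply zero-filled

mem = bytearray(64)
mem[40:42] = b"\x80\x01"
print(hex(load(mem, 40, "LB")), hex(load(mem, 40, "LBU")), hex(load(mem, 40, "LH")))
# 0xffffff80 0x80 0xffff8001
```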
Arithmetic/logical instructions
• All ALU instructions are register-register
  – add, sub, and, or, xor, shift
  – immediate forms are also available
  – LHI loads an immediate value into the most significant 16 bits
• R0 is used to synthesise other operations
  – loading a constant is an add immediate with R0 as one source
  – a register-register move is an add with R0 as one source
• Compare operations put 1 ("true") in the destination if the condition is met, and 0 otherwise

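To make the R0 idiom concrete, here is a hypothetical assembler macro expander. The pseudo-instruction names MOV and LI are assumptions for illustration only; the expansions are the add-with-R0 and add-immediate-with-R0 forms described above, and the range check assumes the add-immediate constant is a signed 16-bit value.

```python
def expand(line):
    """Expand hypothetical MOV/LI pseudo-instructions into real DLX instructions via R0."""
    mnemonic, _, rest = line.partition(" ")
    ops = [o.strip() for o in rest.split(",")]
    if mnemonic == "MOV":                     # MOV rd, rs  ->  ADD rd, rs, R0
        rd, rs = ops
        return [f"ADD {rd}, {rs}, R0"]
    if mnemonic == "LI":                      # LI rd, #k  ->  ADDI rd, R0, #k
        rd, k = ops
        value = int(k.lstrip("#"))
        if not -32768 <= value <= 32767:
            raise ValueError("constant does not fit in a 16-bit immediate")
        return [f"ADDI {rd}, R0, #{value}"]
    return [line]                             # real instructions pass through

print(expand("MOV R1, R2"))
```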
Arithmetic/logical instructions (cont'd)

| Instruction opcode | Instruction meaning |
|---|---|
| ADD, ADDI, ADDUI | Add, add immediate (all immediates are 16 bits); signed and unsigned |
| SUB, SUBI, SUBUI | Subtract, subtract immediate; signed and unsigned |
| MULT, MULTU, DIVU | Multiply and divide, signed and unsigned; operands must be floating-point registers; all operations take and yield 32-bit values |
| AND, ANDI | And, and immediate |
| OR, ORI, XORI | Or, or immediate, exclusive or immediate |
| LHI | Load high immediate: loads the upper half of a register with an immediate |
| SLL, SRA, SLLI, SRLI, SRAI | Shifts: both immediate (S__I) and variable form (S__); shifts are shift left logical, right arithmetic |
| S__, S__I | Set conditional: "__" may be LT, GT, LE, GE, EQ, NE |

| Example instruction | Meaning |
|---|---|
| ADD R1, R2, R3 | Regs[R1] ← Regs[R2] + Regs[R3] |
| ADDI R1, R2, #3 | Regs[R1] ← Regs[R2] + 3 |
| LHI R1, #42 | Regs[R1] ← 42 ## 0^16 |
| SLLI R1, R2, #5 | Regs[R1] ← Regs[R2] << 5 |
| SLT R1, R2, R3 | if (Regs[R2] < Regs[R3]) Regs[R1] ← 1 else Regs[R1] ← 0 |

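The example instructions in the table can be emulated on a 32-entry register list. This sketch assumes SLT compares signed values, that all results wrap to 32 bits, and that shift amounts use only their low 5 bits; `alu` is a hypothetical helper whose `src2` argument is a register number for ADD/SLT but an immediate for ADDI, LHI and SLLI.

```python
MASK32 = 0xFFFFFFFF

def signed32(x):
    """Reinterpret a 32-bit pattern as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def alu(op, regs, rd, rs1, src2):
    """Execute one of the example instructions; regs is a list of 32 register values."""
    if op == "ADD":
        result = regs[rs1] + regs[src2]
    elif op == "ADDI":
        result = regs[rs1] + src2                # src2 is the signed immediate
    elif op == "LHI":
        result = (src2 & 0xFFFF) << 16           # immediate ## 0^16 (rs1 unused)
    elif op == "SLLI":
        result = regs[rs1] << (src2 & 31)        # assumed 5-bit shift amount
    elif op == "SLT":
        result = int(signed32(regs[rs1]) < signed32(regs[src2]))
    else:
        raise NotImplementedError(op)
    if rd != 0:                                  # R0 stays zero
        regs[rd] = result & MASK32

regs = [0] * 32
regs[2], regs[3] = 10, 0xFFFFFFFF                # R3 holds -1
alu("SLT", regs, 1, 2, 3)
print(regs[1])                                   # 0, because 10 < -1 is false
```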
Control-flow instructions
• A jump can use a 26-bit signed offset from the PC or the contents of a register
• Jump-and-link saves the return address (PC + 4) in R31
• Conditional branches test a register source for zero/nonzero and use a 16-bit signed offset

| Instruction opcode | Instruction meaning |
|---|---|
| BEQZ, BNEZ | Branch if GPR equal/not equal to zero; 16-bit offset from PC |
| BFPT, BFPF | Test the comparison bit in the FP status register and branch; 16-bit offset from PC |
| J, JR | Jumps: 26-bit offset from PC (J) or target in register (JR) |
| TRAP | Transfer to the operating system at a vectored address |
| RFE | Return to user code from an exception; restore user mode |

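Offset-based targets are computed by sign-extending the offset field and adding it to the address of the next instruction, matching the NPC + Imm computation in the EX step of the simple DLX implementation. Applying the same PC + 4 base to the 26-bit J offset is an assumption in this sketch, and the function names are hypothetical.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def branch_target(pc, offset16):
    """BEQZ/BNEZ/BFPT/BFPF: NPC + sign-extended 16-bit offset, where NPC = PC + 4."""
    return (pc + 4 + sign_extend(offset16, 16)) & 0xFFFFFFFF

def jump_target(pc, offset26):
    """J: assumed to use the same PC + 4 base with a sign-extended 26-bit offset."""
    return (pc + 4 + sign_extend(offset26, 26)) & 0xFFFFFFFF

print(hex(branch_target(0x100, 0xFFF8)))   # offset -8 -> 0xfc
```

The wider 26-bit field gives J a reach of about ±32 MB, versus about ±32 KB for the 16-bit branch offset.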
Floating-point instructions in DLX
• Moves between single-precision (32-bit) and double-precision (64-bit) registers
• Operations: add, subtract, multiply, divide
• Also, integer multiply/divide on floating-point registers

| Instruction opcode | Instruction meaning |
|---|---|
| ADDD, ADDF | Add DP, SP numbers |
| SUBD, SUBF | Subtract DP, SP numbers |
| MULTD, MULTF | Multiply DP, SP floating point |
| DIVD, DIVF | Divide DP, SP floating point |
| CVTF2D, CVTF2I, CVTD2F, CVTD2I, CVTI2F, CVTI2D | Convert instructions: CVTx2y converts from type x to type y, where x and y are one of I (integer), D (double precision), or F (single precision). Both operands are in the FP registers. |
| __D, __F | DP and SP compares: "__" may be LT, GT, LE, GE, EQ, NE; set the comparison bit in the FP status register. |

A Simple Implementation of DLX

Instruction Execution
• The process of “instruction execution” is usually broken up into stages (“divide and conquer”)
  – smaller stages are easier to design
  – it is easy to optimize (change) one stage without touching the others
• 5 main stages for DLX; each stage takes one clock cycle
  – Instruction Fetch (IF)
  – Instruction Decode / Register fetch cycle (ID)
  – Execution / Effective address cycle (EX)
  – Memory access / Branch completion cycle (MEM)
  – Write-back cycle (WB)

Instruction Fetch (IF)
• Send out the PC and fetch the instruction from memory into the instruction register (IR)
  – IR is used to hold the instruction
• Increment the PC by 4 to address the next sequential instruction
  – NPC is used to hold the next sequential address
IR ← Mem[PC]
NPC ← PC + 4

Instruction Decode (ID)
• Decode the instruction to determine the instruction type (opcode field: the 6 most significant bits of the instruction)
• Read in data from all necessary registers
  – temporary registers A and B hold the GPR outputs
  – Imm holds the sign-extended lower 16 bits of the IR
  – decoding is done in parallel with reading the registers, since these fields are at fixed locations
  – a register may be read even if it is not used
A ← Regs[IR6..10]
B ← Regs[IR11..15]
Imm ← (IR16)^16 ## IR16..31

Execution [EX] (1/2)
• Register-register ALU instruction
  – the ALU performs the operation specified by the opcode on the values in registers A and B; the result is placed in the temporary register ALUOutput
  ALUOutput ← A op B
• Register-immediate ALU instruction
  – the ALU performs the operation specified by the opcode on the value in register A and the value in register Imm; the result is placed in the temporary register ALUOutput
  ALUOutput ← A op Imm

Execution [EX] (2/2)
• Memory reference
  – the ALU adds the operands to form the effective address and places the result into the temporary register ALUOutput
  ALUOutput ← A + Imm
• Branch
  – the ALU adds NPC to Imm to compute the address of the branch target
  – register A is checked to determine whether the branch is taken (for BEQZ, op is ==; for BNEZ, op is !=)
  – Cond is a 1-bit register (1: branch is taken, 0: not taken)
  ALUOutput ← NPC + Imm
  Cond ← (A op 0)

Memory access (MEM)
• Memory reference
  – load: LMD ← Mem[ALUOutput]
  – store: Mem[ALUOutput] ← B
• Branch
  – if the instruction branches, the PC is replaced with the branch destination; otherwise, it is replaced with NPC
  if (Cond) PC ← ALUOutput else PC ← NPC

Write-back (WB)
• Register-register ALU: Regs[IR16..20] ← ALUOutput
• Register-immediate ALU: Regs[IR11..15] ← ALUOutput
• Load instruction: Regs[IR11..15] ← LMD

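The five steps above can be strung together into an unpipelined interpreter for a small subset (ADD, ADDI, LW, SW, BEQZ, BNEZ). This is a sketch rather than a faithful DLX model: instructions are pre-decoded `Instr` records (a hypothetical type) instead of 32-bit words, data memory is a dict of words keyed by byte address, and PC ← NPC is applied for every non-branch instruction.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    op: str           # "ADD", "ADDI", "LW", "SW", "BEQZ" or "BNEZ"
    rs1: int = 0      # IR6..10
    r2: int = 0       # IR11..15: rs2 for ADD, destination/source for I-type
    rd: int = 0       # IR16..20: destination of ADD
    imm: int = 0      # sign-extended IR16..31 (byte offset for branches)

def run_unpipelined(program, regs, dmem, max_steps=10_000):
    """Run the five steps per instruction until PC leaves the program; return final PC."""
    pc = 0
    for _ in range(max_steps):
        if pc // 4 >= len(program):
            return pc
        ir = program[pc // 4]                 # IF: IR <- Mem[PC]
        npc = pc + 4                          #     NPC <- PC + 4
        a, b, imm = regs[ir.rs1], regs[ir.r2], ir.imm   # ID
        cond = False                          # EX
        if ir.op == "ADD":
            alu_out = a + b
        elif ir.op in ("ADDI", "LW", "SW"):
            alu_out = a + imm                 # result or effective address
        elif ir.op in ("BEQZ", "BNEZ"):
            alu_out = npc + imm               # branch target
            cond = (a == 0) if ir.op == "BEQZ" else (a != 0)
        else:
            raise NotImplementedError(ir.op)
        alu_out &= 0xFFFFFFFF
        if ir.op == "LW":                     # MEM
            lmd = dmem.get(alu_out, 0)
        elif ir.op == "SW":
            dmem[alu_out] = b
        pc = alu_out if cond else npc
        if ir.op == "ADD" and ir.rd:          # WB (R0 is never written)
            regs[ir.rd] = alu_out
        elif ir.op in ("ADDI", "LW") and ir.r2:
            regs[ir.r2] = alu_out if ir.op == "ADDI" else lmd
    raise RuntimeError("step limit reached")
```

For example, a loop of `ADD R2, R2, R1`, `ADDI R1, R1, #-1` and `BNEZ R1` (offset -12) with R1 initialized to 3 leaves 3 + 2 + 1 = 6 in R2.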
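Datapath
[Figure: non-pipelined DLX datapath, divided into Instruction Fetch, Instruction Decode/Register Fetch, Execute/Address Calculation, Memory Access, and Write Back. Components: PC, instruction memory, a +4 adder producing NPC (next sequential PC), IR, register file (RS1, RS2, RD), sign extender producing Imm, registers A and B, a Zero? test, the ALU and ALUoutput, data memory, LMD, and multiplexers selecting the next PC, the ALU inputs, and the write-back data.]
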
Sequential Execution
[Figure: instructions Ii, Ii+1, Ii+2 each pass through IF, ID, Ex, Mem, WB, one instruction after another; horizontal axis: time in clock cycles.]
Sequential execution of these 3 instructions (Ii, Ii+1, Ii+2) takes 15 clock cycles.

Pipelined Execution
[Figure: instructions Ii through Ii+4 overlapped in the pipe stages (segments) IF, ID, Ex, Mem, WB, a new instruction starting every clock cycle; horizontal axis: time in clock cycles.]
• Analogy with an automobile assembly line: there are many steps, each contributing something to the construction of the car, and each step operates in parallel with the other steps, though on a different car
Pipelined execution of instructions Ii, Ii+1, and Ii+2 takes 7 clock cycles.

Pipelining Lessons
[Figure: instructions Ii, Ii+1, Ii+2 overlapped in IF, ID, Ex, Mem, WB; horizontal axis: time in clock cycles.]
• Pipelining does not help the latency of a single instruction; it helps the throughput of the entire workload
• Multiple instructions operate simultaneously, using different resources
• Potential speedup = number of pipe stages
• The time to “fill” the pipeline and the time to “drain” it reduce speedup: about 2.14X (15 vs. 7 cycles) vs. 5X in this example
Latency & Throughput:
• Latency: how long it takes to execute an instruction
• Throughput: how often an instruction exits the pipeline

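The 15- versus 7-cycle figures follow from a simple count: with k stages and n instructions, sequential execution needs n·k cycles, while an ideal pipeline (one stage per clock, no stalls) needs k cycles to fill plus one cycle per additional instruction. A quick check, as a sketch under those idealized assumptions:

```python
from fractions import Fraction

def cycles(n, k=5, pipelined=True):
    """Clock cycles for n instructions on a k-stage machine (one stage per clock, no stalls)."""
    return k + (n - 1) if pipelined else n * k

def speedup(n, k=5):
    """Exact ratio of sequential to pipelined cycle counts."""
    return Fraction(cycles(n, k, pipelined=False), cycles(n, k))

for n in (3, 100, 10_000):
    print(n, float(speedup(n)))
```

For n = 3 the ratio is 15/7 ≈ 2.14; as n grows, the fill and drain cycles are amortized and the speedup approaches k = 5.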
Pipelining Lessons (cont'd)
[Figure: instructions Ii, Ii+1, Ii+2 overlapped in IF, ID, Ex, Mem, WB; horizontal axis: time in clock cycles.]
• Pipeline stages are hooked together => all stages must be ready to proceed at the same time
• Machine cycle: the time required to move an instruction one step down the pipeline (usually one clock cycle)
• The length of a machine cycle is determined by the time required for the slowest stage
• Unbalanced lengths of pipe stages also reduce speedup

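A back-of-the-envelope sketch of why the slowest stage matters. The stage delays below are invented for illustration, pipeline-latch overhead is ignored, and the unpipelined baseline assumes each instruction runs all stage delays back to back.

```python
def pipelined_time(stage_ps, n):
    """n instructions; every stage gets one machine cycle = the slowest stage's delay."""
    return (len(stage_ps) + n - 1) * max(stage_ps)

def unpipelined_time(stage_ps, n):
    """n instructions, each running all stage delays back to back."""
    return n * sum(stage_ps)

unbalanced = [200, 100, 200, 200, 100]   # hypothetical stage delays in ps
balanced = [160] * 5                     # same 800 ps total, evenly split
n = 1000
print(unpipelined_time(unbalanced, n) / pipelined_time(unbalanced, n))  # ~3.98
print(unpipelined_time(balanced, n) / pipelined_time(balanced, n))      # ~4.98
```

With these numbers the asymptotic speedup is 800/200 = 4 for the unbalanced split instead of 5, because the 100 ps stages sit idle for half of every 200 ps machine cycle.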
Visualizing Pipeline
[Figure: four instructions in program order, each flowing through IM, Reg, ALU, DM, Reg in successive clock cycles (CC1, CC2, ...), with a new instruction entering every cycle.]

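The staircase pattern in such a diagram can be generated programmatically. This sketch prints one row per instruction, with instruction i entering IF in clock cycle i + 1 and no stalls; `pipeline_chart` is a hypothetical helper that uses the stage names IF, ID, EX, MEM, WB.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_chart(names):
    """Return text rows: one header row, then one row per instruction."""
    total = len(STAGES) + len(names) - 1          # k + n - 1 clock cycles
    rows = ["instr".ljust(6) + "".join(f"CC{c:<4}" for c in range(1, total + 1))]
    for i, name in enumerate(names):
        cells = [""] * total
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage                  # stage s of instruction i in cycle i+s+1
        rows.append(name.ljust(6) + "".join(cell.ljust(6) for cell in cells))
    return rows

for row in pipeline_chart(["I1", "I2", "I3", "I4"]):
    print(row)
```

Reading a column top to bottom shows which stage each instruction occupies in that clock cycle; in the fully overlapped columns every stage's hardware is busy with a different instruction.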
Pipeline Datapath
[Figure: the DLX datapath split by the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB into Instruction Fetch, Instruction Decode/Register Fetch, Execute/Address Calculation, Memory Access, and Write Back; the register file is read using IR6..10 and IR11..15, and the write-back destination is MEM/WB.IR11..15 or MEM/WB.IR16..20.]

Instruction Flow through Pipeline Regs
[Figure: Add R1, R2, R3; Lw R4, 0(R2); Sub R6, R5, R7; and Xor R9, R8, R1 enter the pipeline (IM, Reg, ALU, DM) one per clock cycle over CC1 to CC4; each pipeline register carries its instruction along, and stages not yet reached hold Nop.]

DLX Pipeline Definition: IF, ID
• Stage IF
  – IF/ID.IR ← Mem[PC];
  – if EX/MEM.cond {IF/ID.NPC, PC ← EX/MEM.ALUOUT} else {IF/ID.NPC, PC ← PC + 4};
• Stage ID
  – ID/EX.A ← Regs[IF/ID.IR6..10]; ID/EX.B ← Regs[IF/ID.IR11..15];
  – ID/EX.Imm ← (IF/ID.IR16)^16 ## IF/ID.IR16..31;
  – ID/EX.NPC ← IF/ID.NPC; ID/EX.IR ← IF/ID.IR;

DLX Pipeline Definition: EX
• ALU
  – EX/MEM.IR ← ID/EX.IR;
  – EX/MEM.ALUOUT ← ID/EX.A func ID/EX.B; or EX/MEM.ALUOUT ← ID/EX.A func ID/EX.Imm;
  – EX/MEM.cond ← 0;
• load/store
  – EX/MEM.IR ← ID/EX.IR; EX/MEM.B ← ID/EX.B;
  – EX/MEM.ALUOUT ← ID/EX.A + ID/EX.Imm;
  – EX/MEM.cond ← 0;
• branch
  – EX/MEM.ALUOUT ← ID/EX.NPC + ID/EX.Imm;
  – EX/MEM.cond ← (ID/EX.A op 0);

DLX Pipeline Definition: MEM, WB
• Stage MEM
  – ALU
    • MEM/WB.IR ← EX/MEM.IR;
    • MEM/WB.ALUOUT ← EX/MEM.ALUOUT;
  – load/store
    • MEM/WB.IR ← EX/MEM.IR;
    • MEM/WB.LMD ← Mem[EX/MEM.ALUOUT]; or Mem[EX/MEM.ALUOUT] ← EX/MEM.B;
• Stage WB
  – ALU
    • Regs[MEM/WB.IR16..20] ← MEM/WB.ALUOUT; or Regs[MEM/WB.IR11..15] ← MEM/WB.ALUOUT;
  – load
    • Regs[MEM/WB.IR11..15] ← MEM/WB.LMD;
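The latch-by-latch definitions above can be exercised with a cycle-level model. This sketch assumes a program with no data hazards (a consumer at least three instructions after its producer, given that WB writes the register file before ID reads it in the same cycle) and no branches, since it has no forwarding, stall, or flush logic. `Instr` and `run_pipeline` are hypothetical names, instructions are pre-decoded records, and data memory is a dict of words.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    op: str = "NOP"   # "NOP", "ADD", "ADDI", "LW" or "SW"
    rs1: int = 0      # IR6..10
    r2: int = 0       # IR11..15 (rs2 for ADD, rd/source for I-type)
    rd: int = 0       # IR16..20 (destination of ADD)
    imm: int = 0      # already sign-extended IR16..31

NOP = Instr()         # bubble used to fill empty latches

def run_pipeline(program, regs, dmem):
    """Advance the four pipeline latches one clock at a time; return the cycle count."""
    pc = cycles = 0
    if_id = {"IR": NOP, "NPC": 0}
    id_ex = {"IR": NOP, "NPC": 0, "A": 0, "B": 0, "Imm": 0}
    ex_mem = {"IR": NOP, "ALUOUT": 0, "B": 0}
    mem_wb = {"IR": NOP, "ALUOUT": 0, "LMD": 0}
    while True:
        # WB first: the register file is written before ID reads it in the same cycle
        ir = mem_wb["IR"]
        if ir.op == "ADD" and ir.rd:
            regs[ir.rd] = mem_wb["ALUOUT"]
        elif ir.op == "ADDI" and ir.r2:
            regs[ir.r2] = mem_wb["ALUOUT"]
        elif ir.op == "LW" and ir.r2:
            regs[ir.r2] = mem_wb["LMD"]
        # MEM
        ir = ex_mem["IR"]
        lmd = dmem.get(ex_mem["ALUOUT"], 0) if ir.op == "LW" else 0
        if ir.op == "SW":
            dmem[ex_mem["ALUOUT"]] = ex_mem["B"]
        mem_wb = {"IR": ir, "ALUOUT": ex_mem["ALUOUT"], "LMD": lmd}
        # EX: ADD uses B; ADDI, LW and SW use Imm (a NOP's result is never used)
        ir = id_ex["IR"]
        b_or_imm = id_ex["B"] if ir.op == "ADD" else id_ex["Imm"]
        ex_mem = {"IR": ir, "ALUOUT": (id_ex["A"] + b_or_imm) & 0xFFFFFFFF, "B": id_ex["B"]}
        # ID
        ir = if_id["IR"]
        id_ex = {"IR": ir, "NPC": if_id["NPC"], "A": regs[ir.rs1],
                 "B": regs[ir.r2], "Imm": ir.imm}
        # IF (a NOP is fetched once the program is exhausted)
        fetched = program[pc // 4] if pc // 4 < len(program) else NOP
        if_id = {"IR": fetched, "NPC": pc + 4}
        pc += 4
        cycles += 1
        if all(latch["IR"] is NOP for latch in (if_id, id_ex, ex_mem, mem_wb)):
            return cycles
```

Processing the stages from WB back to IF lets each stage read its input latch before the previous stage overwrites it, which stands in for all latches being clocked at the same edge. For n instructions the model finishes in 5 + n - 1 cycles, matching the pipeline fill count.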