EECC 550 Final Review (Shaaban, Spring 2000)

Hardware Components of Any Computer
Five classic components of all computers: 1. Control Unit; 2. Datapath; 3. Memory; 4. Input; 5. Output. The Control Unit and Datapath together form the processor, or Central Processing Unit (CPU), the active part of the computer. Memory is passive: it is where programs and data live while they are running. Input and output devices (keyboard, mouse, disk, display, printer, etc.) connect the computer to the outside world.

CPU Organization
• Datapath Design: the capabilities and performance characteristics of the principal functional units (FUs), e.g. registers, ALU, shifters, logic units; the ways in which these components are interconnected (buses, multiplexors, etc.); and how information flows between components.
• Control Unit Design: the logic and means by which such information flow is controlled; control and coordination of FU operation to realize the targeted Instruction Set Architecture (can be implemented using either a finite state machine or a microprogram).
• Hardware description with a suitable language, possibly using Register Transfer Notation (RTN).

Hierarchy of Computer Architecture
(Layered diagram.) Software at the top: applications and the operating system written as high-level language programs, translated by a compiler into assembly language programs and machine language programs. The Instruction Set Architecture (instruction set processor and I/O system) forms the software/hardware boundary and is described with Register Transfer Notation (RTN). Below it lies the hardware: datapath and control (microprogram/firmware), digital design (logic diagrams), and circuit design (circuit diagrams and layout).

A Hierarchy of Computer Design
Low level (hardware):
• Level 1 Electronics: modules are gates and FFs; primitives are transistors, resistors, etc.; descriptive medium is circuit diagrams.
• Level 2 Logic: modules are registers, ALUs, ...; primitives are gates and FFs; descriptive medium is logic diagrams.
• Level 3 Organization: modules are processors and memories; primitives are registers, ALUs, ...; descriptive medium is Register Transfer Notation (RTN).
Firmware:
• Level 4 Microprogramming: primitives are microinstructions; descriptive medium is the microprogram.
High level (software):
• Level 5 Assembly language programming: primitives are assembly language instructions; descriptive medium is assembly language programs.
• Level 6 Procedural programming: OS routines and procedural constructs; descriptive medium is high-level language programs.
• Level 7 Application: applications and drivers; descriptive medium is problem-oriented programs.

Computer Performance Measures: Program Execution Time
• For a specific program compiled to run on a specific machine "A", the following parameters are provided:
  – The total instruction count of the program.
  – The average number of cycles per instruction (average CPI).
  – The clock cycle time of machine "A".
• How can one measure the performance of this machine running this program?
  – Intuitively, the machine is said to be faster, or to have better performance running this program, if the total execution time is shorter.
  – Thus the inverse of the total measured program execution time is a possible performance measure or metric: PerformanceA = 1 / Execution TimeA
• How do we compare the performance of different machines? What factors affect performance? How can performance be improved?

CPU Execution Time: The CPU Equation
• A program is comprised of a number of instructions, measured in instructions/program.
• The average instruction takes a number of cycles per instruction (CPI) to be completed, measured in cycles/instruction.
• The CPU has a fixed clock cycle time = 1/clock rate, measured in seconds/cycle.
• CPU execution time is the product of the above three parameters:
  CPU time (seconds/program) = (instructions/program) x (cycles/instruction) x (seconds/cycle)
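The relationship is easy to check numerically; a minimal Python sketch (the instruction count, CPI, and clock rate below are made-up illustration values, not from the slides):

```python
def cpu_time(instruction_count, avg_cpi, clock_rate_hz):
    """CPU time = (instructions/program) x (cycles/instruction) x (seconds/cycle)."""
    return instruction_count * avg_cpi * (1.0 / clock_rate_hz)

# Example: 1 million instructions, average CPI of 2, 200 MHz clock.
print(cpu_time(1_000_000, 2.0, 200e6))   # -> 0.01 seconds
```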

Factors Affecting CPU Performance
CPU time = (instructions/program) x (cycles/instruction) x (seconds/cycle). Which factors affect each term:
• Program: affects instruction count and CPI.
• Compiler: affects instruction count and CPI.
• Instruction Set Architecture (ISA): affects instruction count and CPI.
• Organization: affects CPI and clock rate.
• Technology: affects clock rate.

Aspects of CPU Execution Time
CPU Time = Instruction count x CPI x Clock cycle
• Instruction count depends on: program used, compiler, ISA.
• CPI depends on: program used, compiler, ISA, CPU organization.
• Clock cycle depends on: CPU organization, technology.

Instruction Types & CPI: An Example
• An instruction set has three instruction classes:
  Class A: CPI = 1    Class B: CPI = 2    Class C: CPI = 3
• Two code sequences have the following instruction counts per class:
  Sequence 1: A = 2, B = 1, C = 2    Sequence 2: A = 4, B = 1, C = 1
• CPU cycles for sequence 1 = 2 x 1 + 1 x 2 + 2 x 3 = 10 cycles
  CPI for sequence 1 = clock cycles / instruction count = 10 / 5 = 2
• CPU cycles for sequence 2 = 4 x 1 + 1 x 2 + 1 x 3 = 9 cycles
  CPI for sequence 2 = 9 / 6 = 1.5

Instruction Frequency & CPI
• Given a program with n types or classes of instructions with the following characteristics:
  Ci = count of instructions of type i
  CPIi = average cycles per instruction of type i
  Fi = frequency of instruction type i = Ci / total instruction count
• Then: total CPU clock cycles = sum over i of (CPIi x Ci), and CPI = sum over i of (CPIi x Fi).

Instruction Type Frequency & CPI: A RISC Example
Base machine (Reg / Reg), typical mix:
• ALU:    frequency 50%, cycles 1, CPI(i) = 0.5, 23% of time
• Load:   frequency 20%, cycles 5, CPI(i) = 1.0, 45% of time
• Store:  frequency 10%, cycles 3, CPI(i) = 0.3, 14% of time
• Branch: frequency 20%, cycles 2, CPI(i) = 0.4, 18% of time
CPI = 0.5 x 1 + 0.2 x 5 + 0.1 x 3 + 0.2 x 2 = 2.2
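A small Python sketch of CPI = sum(Fi x CPIi) using the mix from this slide; the per-class time fractions printed at the end should match the % Time column:

```python
# Instruction mix from the slide: (frequency, cycles per instruction) per class.
mix = {
    "ALU":    (0.5, 1),
    "Load":   (0.2, 5),
    "Store":  (0.1, 3),
    "Branch": (0.2, 2),
}

cpi = sum(freq * cycles for freq, cycles in mix.values())
print(cpi)                                      # -> 2.2

for op, (freq, cycles) in mix.items():
    # Fraction of execution time spent in each instruction class.
    print(op, round(freq * cycles / cpi, 2))    # -> 0.23, 0.45, 0.14, 0.18
```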

Metrics of Computer Performance
(Each level of the system has its own measures.) At the application level: execution time on a target workload, SPEC 95, etc. At the programming language / compiler / ISA level: millions of instructions per second (MIPS) and millions of floating-point operations per second (MFLOP/s). At the datapath / control / function unit level: megabytes per second and cycles per second (clock rate). Each metric has a purpose, and each can be misused.

Performance Enhancement Calculations: Amdahl's Law
• The performance enhancement possible due to a given design improvement is limited by the amount that the improved feature is used.
• Amdahl's Law: performance improvement or speedup due to enhancement E:
  Speedup(E) = (Execution time without E) / (Execution time with E) = (Performance with E) / (Performance without E)
• Suppose that enhancement E accelerates a fraction F of the execution time by a factor S and the remainder of the time is unaffected. Then:
  Execution time with E = ((1 - F) + F/S) x Execution time without E
  Hence the speedup is given by:
  Speedup(E) = (Execution time without E) / (((1 - F) + F/S) x Execution time without E) = 1 / ((1 - F) + F/S)
• Note: all fractions here refer to the original execution time.
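A minimal Python sketch of this speedup formula (the fraction and factor in the example call are arbitrary illustration values):

```python
def speedup(fraction_enhanced, speedup_enhanced):
    """Amdahl's Law: overall speedup when a fraction F of the original
    execution time is accelerated by a factor S."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

print(round(speedup(0.5, 10), 2))   # half the time sped up 10x -> 1.82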

Pictorial Depiction of Amdahl's Law
Enhancement E accelerates fraction F of execution time by a factor of S.
• Before (execution time without enhancement E): unaffected fraction (1 - F) plus affected fraction F.
• After (execution time with enhancement E): the unaffected fraction (1 - F) is unchanged, while the affected fraction shrinks to F/S.
Speedup(E) = (Execution time without E) / (Execution time with E) = 1 / ((1 - F) + F/S)

Amdahl's Law With Multiple Enhancements: Example
• Three CPU performance enhancements are proposed with the following speedups and percentages of the code execution time affected:
  Speedup1 = S1 = 10, affecting F1 = 20% of the execution time
  Speedup2 = S2 = 15, affecting F2 = 15% of the execution time
  Speedup3 = S3 = 30, affecting F3 = 10% of the execution time
• While all three enhancements are in place in the new design, each enhancement affects a different portion of the code and only one enhancement can be used at a time. What is the resulting overall speedup?
  Speedup = 1 / [(1 - .2 - .15 - .1) + .2/10 + .15/15 + .1/30] = 1 / [.55 + .0333] = 1 / .5833 = 1.71
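The same law extended to several non-overlapping enhancements, sketched in Python with the numbers from this example:

```python
def overall_speedup(fractions, speedups):
    """Amdahl's Law with several non-overlapping enhancements:
    1 / ((1 - sum(F_i)) + sum(F_i / S_i))."""
    unaffected = 1.0 - sum(fractions)
    return 1.0 / (unaffected + sum(f / s for f, s in zip(fractions, speedups)))

# The three enhancements from this slide.
print(round(overall_speedup([0.20, 0.15, 0.10], [10, 15, 30]), 2))  # -> 1.71
```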

MIPS Instruction Formats
• R-type: op (bits 31-26, 6 bits), rs (5 bits), rt (5 bits), rd (5 bits), shamt (5 bits), funct (bits 5-0, 6 bits).
• I-type (ALU immediate, load/store, branch): op (6 bits), rs (5 bits), rt (5 bits), immediate (16 bits).
• J-type (jumps): op (6 bits), target address (26 bits).
Field meanings:
• op: opcode, the operation of the instruction.
• rs, rt, rd: the source and destination register specifiers.
• shamt: shift amount.
• funct: selects the variant of the operation in the "op" field.
• address / immediate: address offset or immediate value.
• target address: target address of the jump instruction.
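A hedged Python sketch of pulling these fields out of a 32-bit instruction word; the sample word encodes add $8, $9, $10 and is supplied only for illustration:

```python
def decode_mips(word):
    """Split a 32-bit MIPS instruction word into its fields.
    R-type uses rd/shamt/funct, I-type uses the 16-bit immediate,
    J-type uses the 26-bit target."""
    return {
        "op":     (word >> 26) & 0x3F,
        "rs":     (word >> 21) & 0x1F,
        "rt":     (word >> 16) & 0x1F,
        "rd":     (word >> 11) & 0x1F,
        "shamt":  (word >>  6) & 0x1F,
        "funct":   word        & 0x3F,
        "imm":     word        & 0xFFFF,
        "target":  word        & 0x03FFFFFF,
    }

# add $8, $9, $10  ->  op=0, rs=9, rt=10, rd=8, shamt=0, funct=0x20
print(decode_mips(0x012A4020))
```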

A Single Cycle MIPS Datapath
(Datapath diagram.) The PC, instruction memory, register file, immediate extender, ALU, and data memory are connected through adders and multiplexors; the control signals shown are nPC_sel, RegDst, RegWr, ExtOp, ALUSrc, ALUctr, MemWr, MemtoReg, and Equal.

(Control unit and datapath diagram.) The instruction fields (op, func, rs, rt, rd, imm16, jump target) from instruction memory feed the Control Unit, which generates the datapath control signals: nPC_sel, RegWr, RegDst, ExtOp, ALUSrc, ALUctr, MemWr, MemtoReg, Jump, and Equal.

Control Signal Generation (see Appendix A)
(Table of control signal values, one row per instruction: add, sub, ori, lw, sw, beq, jump, identified by their op/func fields. For each instruction the table gives RegDst, ALUSrc, MemtoReg, RegWrite, MemWrite, nPCsel, Jump, ExtOp, and ALUctr<2:0>; the ALU operations are Add, Subtract, Or, Add, Add, Subtract, and don't-care respectively.)

The Truth Table For The Main Control
(Truth table, one row per opcode: R-type 000000, ori 001101, lw 100011, sw 101011, beq 000100, jump 000010. The columns give RegDst, ALUSrc, MemtoReg, RegWrite, MemWrite, Branch, Jump, ExtOp, and the 3-bit ALUop for each instruction; the symbolic ALUop values are "R-type" 100, Or 010, Add 000, and Subtract 001.)

PLA Implementation of the Main Control
(PLA diagram.) Product terms decode the opcode bits op<5>..op<0> into one line per instruction class (R-type, ori, lw, sw, beq, jump); these lines drive the output signals RegWrite, ALUSrc, RegDst, MemtoReg, MemWrite, Branch, Jump, ExtOp, and ALUop<2:0>.

Worst Case Timing (Load)
(Timing diagram for a load, the slowest instruction.) After the clock edge the critical path accumulates: clock-to-Q delay of the PC, instruction memory access time, register file access time (overlapping the delay through the control logic), delay through the extender and ALUSrc mux, ALU delay to form the address, data memory access time, and finally the setup of busW, with the register write occurring at the next clock edge.

MIPS Single Cycle Instruction Timing Comparison
(Per-instruction critical paths.)
• Arithmetic & logical: PC -> instruction memory -> register file -> mux -> ALU -> mux -> setup.
• Load (the critical path): PC -> instruction memory -> register file -> mux -> ALU -> data memory -> mux -> setup.
• Store: PC -> instruction memory -> register file -> mux -> ALU -> data memory.
• Branch: PC -> instruction memory -> register file -> cmp -> mux.

CPU Design Steps
1. Analyze instruction set operations using independent RTN => datapath requirements.
2. Select a set of datapath components and establish the clock methodology.
3. Assemble a datapath meeting the requirements.
4. Analyze the implementation of each instruction to determine the setting of the control points that effects the register transfer.
5. Assemble the control logic.

CPU Design & Implementation Process
• Bottom-up design: assemble components in the target technology to establish critical timing.
• Top-down design: specify component behavior from high-level requirements.
• Iterative refinement: establish a partial solution, then expand and improve it.
(The Instruction Set Architecture maps onto a processor datapath built from register files, muxes, ALUs, memories, and register cells, with control built from decoders, sequencers, and gates.)

Drawback of Single Cycle Processor
• Long cycle time: all instructions must take as much time as the slowest; the cycle time needed for a load is longer than needed for all other instructions.
• Real memory is not as well-behaved as idealized memory: data accesses cannot always be completed in one (short) cycle.

Reducing Cycle Time: Multi-Cycle Design
• Cut the combinational dependency graph by inserting registers / latches.
• The same work is done in two or more fast cycles, rather than one slow cycle: a single block of acyclic combinational logic between storage elements is split into blocks (A) and (B) with a storage element between them.

Instruction Processing Cycles
Common steps for all instructions:
• Instruction Fetch: obtain the instruction from program storage.
• Next: update the program counter to the address of the next instruction.
• Instruction Decode: determine the instruction type.
• Operand Fetch: obtain the operands from the registers.
• Execute: compute the result value or status.
• Result Store: store the result in register/memory if needed (usually called Write Back).

Partitioning The Single Cycle Datapath
(Diagram.) The single-cycle datapath is divided into stages: Instruction Fetch (PC and next-PC logic, nPC_sel), Operand Fetch (register file, ExtOp), Exec (ALUctr, ALUSrc), Mem Access (MemRd, MemWr), and Result Store (RegDst, RegWr, MemToReg). Registers are added between the smallest steps.

Example Multi-cycle Datapath
(Diagram.) Registers added between the stages (Instruction Fetch, Operand Fetch, Exec, Mem Access, Result Store):
• IR: instruction register.
• A, B: two registers to hold the operands read from the register file.
• R (or ALUOut): holds the output of the ALU.
• M (or memory data register, MDR): holds data read from data memory.

Operations In Each Cycle (example multi-cycle datapath)
• Instruction Fetch (all types): IR ← Mem[PC]
• Instruction Decode (all types): A ← R[rs], B ← R[rt]
• Execution:
  – R-type: R ← A fun B
  – Logic immediate: R ← A OR ZeroExt[imm16]
  – Load / Store: R ← A + SignExt(imm16)
  – Branch: if Equal = 1 then PC ← PC + 4 + (SignExt(imm16) x 4), else PC ← PC + 4
• Memory:
  – Load: M ← Mem[R]
  – Store: Mem[R] ← B, PC ← PC + 4
• Write Back:
  – R-type: R[rd] ← R, PC ← PC + 4
  – Logic immediate: R[rt] ← R, PC ← PC + 4
  – Load: R[rt] ← M, PC ← PC + 4

Finite State Machine (FSM) Control Model
• A state specifies the control points for the register transfers.
• The transfer occurs upon exiting the state (on the same falling edge).
(Model diagram: the inputs (conditions) and the state register feed the next-state logic; the output logic generates the outputs (control points), which depend on the state and possibly on the inputs.)

Control Specification For Multi-cycle CPU Finite State Machine (FSM)
(State diagram.) From "instruction fetch" (IR ← MEM[PC]) control passes to "decode / operand fetch" (A ← R[rs], B ← R[rt]), then branches by instruction class:
• R-type: Execute R ← A fun B; Write-back R[rd] ← R, PC ← PC + 4; back to instruction fetch.
• ORi: Execute R ← A or ZX; Write-back R[rt] ← R, PC ← PC + 4; back to instruction fetch.
• LW: Execute R ← A + SX; Memory M ← MEM[R]; Write-back R[rt] ← M, PC ← PC + 4; back to instruction fetch.
• SW: Execute R ← A + SX; Memory MEM[R] ← B, PC ← PC + 4; back to instruction fetch.
• BEQ & Equal: PC ← PC + SX || 00; BEQ & ~Equal: PC ← PC + 4; back to instruction fetch.

Traditional FSM Controller
Datapath + state diagram => control:
• Translate the RTN statements into control points.
• Assign states.
• Implement the controller.

Mapping RTNs To Control Points: Examples & State Assignments
(Annotated state diagram.) The fetch state 0000 (IR ← MEM[PC]) asserts imem_rd and IRen; the decode / operand-fetch state 0001 (A ← R[rs], B ← R[rt]) asserts Aen and Ben; the execute states assert ALUfun and Sen (R-type 0100, ORi 0110, LW 1000, SW 1011); the memory states are 1001 (M ← MEM[S]) and 1100 (MEM[S] ← B, PC ← PC + 4); the write-back states assert RegDst, RegWr, and PCen (R-type 0101, ORi 0111, LW 1010); the branch states are 0011 (PC ← PC + 4) and 0010 (PC ← PC + SX || 00). All paths return to the instruction fetch state 0000.

Detailed Control Specification
(Control table, one row per state 0000-1100 and opcode field value (R-type, ORi, LW, SW, BEQ), giving for each state the next state, the IR and PC enables and PC select, the A/B register enables, the execute controls (extender source, ALU function, S enable), the memory controls (read, write, memory-to-register), and the write-back controls (RegWr, register destination).)

Alternative Multiple Cycle Datapath (In Textbook)
• A shared instruction/data memory unit.
• A single ALU shared among instructions.
• Shared units require additional or widened multiplexors.
• Temporary registers hold data between the clock cycles of an instruction. Additional registers: Instruction Register (IR), Memory Data Register (MDR), A, B, ALUOut.

Operations In Each Cycle (textbook multi-cycle datapath)
• Instruction Fetch (all types): IR ← Mem[PC]; PC ← PC + 4
• Instruction Decode (all types): A ← R[rs]; B ← R[rt]; ALUout ← PC + (SignExt(imm16) x 4)
• Execution:
  – R-type: ALUout ← A fun B
  – Logic immediate: ALUout ← A OR ZeroExt[imm16]
  – Load / Store: ALUout ← A + SignExt(imm16)
  – Branch: if Equal = 1 then PC ← ALUout
• Memory:
  – Load: MDR ← Mem[ALUout]
  – Store: Mem[ALUout] ← B
• Write Back:
  – R-type: R[rd] ← ALUout
  – Logic immediate: R[rt] ← ALUout
  – Load: R[rt] ← MDR

Finite State Machine (FSM) Specification (textbook datapath)
(State diagram.) State 0000 "instruction fetch": IR ← MEM[PC], PC ← PC + 4. State 0001 "decode": A ← R[rs], B ← R[rt], ALUout ← PC + SX. Execute states: R-type 0100 ALUout ← A fun B; ORi 0110 ALUout ← A op ZX; LW 1000 and SW 1011 ALUout ← A + SX; BEQ 0010 if A = B then PC ← ALUout. Memory states: LW 1001 M ← MEM[ALUout]; SW 1100 MEM[ALUout] ← B. Write-back states: R-type 0101 R[rd] ← ALUout; ORi 0111 R[rt] ← ALUout; LW 1010 R[rt] ← M. All paths return to instruction fetch.

MIPS Multi-cycle Datapath Performance Evaluation
• What is the average CPI?
  – The state diagram gives the CPI for each instruction type.
  – The workload below gives the frequency of each type.
  Arith/Logic: CPI 4, frequency 40%, contribution 1.6
  Load:        CPI 5, frequency 30%, contribution 1.5
  Store:       CPI 4, frequency 10%, contribution 0.4
  Branch:      CPI 3, frequency 20%, contribution 0.6
  Average CPI = 4.1
• Better than CPI = 5, which would be the case if all instructions took the same number of clock cycles (5).

Control Implementation Alternatives
Control may be designed using one of several initial representations. The choice of sequencing control and of how the logic is represented can then be determined independently; the control can then be implemented with one of several methods using a structured logic technique.
• "Hardwired control": the initial representation is a finite state diagram, sequencing control uses an explicit next-state function, the logic is represented as logic equations, and the implementation technique is a PLA.
• "Microprogrammed control": the initial representation is a microprogram, sequencing control uses a microprogram counter plus dispatch ROMs, the logic is represented as truth tables, and the implementation technique is a ROM.

Microprogrammed Control
• Finite state machine control for a full set of instructions is very complex and may involve a very large number of states; slight microoperation changes require a new FSM controller.
• Microprogramming: designing the control as a program that implements the machine instructions.
• A microprogram for a given machine instruction is a symbolic representation of the control involved in executing the instruction, and is comprised of a sequence of microinstructions.
• Each microinstruction defines the set of datapath control signals that must be asserted (active) in a given state or cycle.
• The format of the microinstructions is defined by a number of fields, each responsible for asserting a set of control signals.
• Microarchitecture: the logical structure and functional capabilities of the hardware as seen by the microprogrammer.

Microprogrammed Control Unit
(Block diagram.) Microprogram storage (ROM/PLA) holds the microinstructions; the control-signal fields drive the multicycle datapath, while the sequencing-control field, together with the address select logic, the microprogram counter (MicroPC), and the opcode, determines the next microinstruction address. Types of "branching":
• Set state to 0 (fetch).
• Dispatch (state 1), using the opcode.
• Use the incremented address (sequential).

List of Control Signals Grouped Into Fields
Single-bit control signals (effect when deasserted / asserted):
• ALUSelA: 1st ALU operand = PC / 1st ALU operand = Reg[rs]
• RegWrite: none / register is written
• MemtoReg: register write data input = ALU / register write data input = memory
• RegDst: register destination number = rt / register destination number = rd
• MemRead: none / memory at the address is read, MDR ← Mem[addr]
• MemWrite: none / memory at the address is written
• IorD: memory address = PC / memory address = S
• IRWrite: none / IR ← memory
• PCWrite: none / PC ← PCSource
• PCWriteCond: none / if ALU zero then PC ← PCSource (= ALUout)
Multiple-bit control signals:
• ALUOp: 00 ALU adds, 01 ALU subtracts, 10 ALU does the function code, 11 ALU does logical OR
• ALUSelB: 000 2nd ALU input = Reg[rt], 001 = 4, 010 = sign-extended IR[15-0], 011 = sign-extended IR[15-0] shifted left 2, 100 = zero-extended IR[15-0]

Microinstruction Field Values
Field name, values for the field, and function of the field with each value:
• ALU: Add, Subt., Func code, Or (the ALU adds, subtracts, performs the function code, or does a logical OR).
• SRC1: PC, rs (1st ALU input = PC or Reg[rs]).
• SRC2: 4, Extend, Extend0, Extshft, rt (2nd ALU input = 4, sign-extended IR[15-0], zero-extended IR[15-0], sign-extended IR[15-0] shifted left 2, or Reg[rt]).
• Destination: rd ALU (Reg[rd] ← ALUout), rt ALU (Reg[rt] ← ALUout), rt MEM (Reg[rt] ← Mem).
• Memory: Read PC (read memory using the PC), Read ALU (read memory using the ALU output), Write ALU (write memory using the ALU output, value B).
• Memory register: IR (IR ← Mem).
• PC write: ALU (PC ← ALU), ALUoutCond (if ALU zero then PC ← ALUout).
• Sequencing: Seq (go to the sequential microinstruction), Fetch (go to the first microinstruction), Dispatch (dispatch using the ROM).

Microprogram for The Control Unit
(Microprogram table with fields Label, ALU, SRC1, SRC2, Dest., Memory, Mem. Reg., PC Write, and Sequencing.) Fetch takes two microinstructions: add PC + 4 while reading memory at the PC into the IR (Seq), then add PC + Extshft and Dispatch on the opcode. Lw: add rs + Extend (Seq), read memory at the ALU output (Seq), write rt from memory, return to Fetch. Sw: add rs + Extend (Seq), write memory at the ALU output, Fetch. Rtype: apply the function code to rs and rt (Seq), write rd from the ALU, Fetch. Beq: subtract rs and rt with PC write ALUoutCond, Fetch. Ori: OR rs with Extend0 (Seq), write rt from the ALU, Fetch.

MIPS Integer ALU Requirements
(1) Functional specification:
• Inputs: two 32-bit operands A, B; a 4-bit mode (function select).
• Outputs: 32-bit result S; 1-bit carry; 1-bit overflow; 1-bit zero.
• Operations: add, addU, sub, subU, and, or, xor, nor, slt, sltU. 10 operations, thus 4 control bits.
(2) Block diagram: A and B enter the ALU, which produces S plus the carry, overflow, and zero flags; the 4-bit function m encodes 00 add, 01 addU, 02 sub, 03 subU, 04 and, 05 or, 06 xor, 07 nor, 12 slt, 13 sltU.

Building Block: 1-bit ALU
(Diagram.) A 1-bit ALU performs AND, OR, and addition on A and B, or on A and B inverted: the invertB line optionally complements B, a 1-bit full adder produces the sum and CarryOut from CarryIn, and a mux driven by the Operation control selects among the and, or, and add results.

32-Bit ALU Using 32 1-Bit ALUs
(Diagram: a 32-bit ripple-carry arrangement.) Each 1-bit ALU i takes Ai, Bi, and CarryIn i and produces Result i and CarryOut i, which feeds CarryIn i+1 (the operation/invertB lines are not shown).
Addition/subtraction performance: assume gate delay = T. Total delay = 32 x (1-bit ALU delay) = 32 x 2 gate delays = 64 T.

MIPS ALU With SLT Support Added
(Diagram.) The 32-bit ripple ALU is extended with a Less input on each 1-bit ALU (tied to 0 for bits 1-31), and the design also produces Zero and Overflow outputs.

Improving ALU Performance: Carry Look Ahead (CLA)
(Diagram.) For each bit position the inputs A and B either kill the carry (A = B = 0), propagate it (A different from B), or generate one (A = B = 1). Define generate G = A and B, and propagate P = A xor B. The carries can then be computed directly from the G and P signals:
C1 = G0 + C0 P0
C2 = G1 + G0 P1 + C0 P0 P1
C3 = G2 + G1 P2 + G0 P1 P2 + C0 P0 P1 P2
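A small Python sketch of these carry look-ahead equations on 4-bit operands (bit lists are LSB-first; the test vectors are made up for illustration):

```python
def cla_carries(a_bits, b_bits, c0):
    """a_bits, b_bits: lists of 4 bits (LSB first). Returns carries C1..C4."""
    g = [a & b for a, b in zip(a_bits, b_bits)]   # generate: G = A and B
    p = [a ^ b for a, b in zip(a_bits, b_bits)]   # propagate: P = A xor B
    c1 = g[0] | (c0 & p[0])
    c2 = g[1] | (g[0] & p[1]) | (c0 & p[0] & p[1])
    c3 = g[2] | (g[1] & p[2]) | (g[0] & p[1] & p[2]) | (c0 & p[0] & p[1] & p[2])
    c4 = (g[3] | (g[2] & p[3]) | (g[1] & p[2] & p[3])
          | (g[0] & p[1] & p[2] & p[3]) | (c0 & p[0] & p[1] & p[2] & p[3]))
    return c1, c2, c3, c4

# 0111 + 0001: carries ripple out of the low three positions but not the top one.
print(cla_carries([1, 1, 1, 0], [1, 0, 0, 0], 0))   # -> (1, 1, 1, 0)
```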

Cascaded Carry Look-ahead: 16-Bit Example
(Diagram.) Four 4-bit adders are cascaded through a carry look-ahead (CLA) block that computes C1 = G0 + C0 P0, C2 = G1 + G0 P1 + C0 P0 P1, C3 = G2 + G1 P2 + G0 P1 P2 + C0 P0 P1 P2, C4 = ... from the group generate (G) and propagate (P) outputs of each 4-bit adder. Assuming all gates have an equal delay T, the carry delay is 5 gate delays = 5 T.

Unsigned Multiplication Example
• Paper and pencil example (unsigned):
    Multiplicand    1000
    Multiplier    x 1001
                    1000
                   0000
                  0000
                 1000
    Product     01001000
• m bits x n bits = m + n bit product; here m = 32, n = 32, giving a 64-bit product.
• The binary number system simplifies multiplication: a 0 multiplier bit means place 0 (0 x multiplicand); a 1 means place a copy (1 x multiplicand).
• We will examine 4 versions of the multiplication hardware & algorithm, as a successive refinement of the design.

Operation of Combinational Multiplier
(Array multiplier diagram for 4-bit operands A and B producing P7..P0.)
• At each stage, shift A left (x 2).
• Use the next bit of B to determine whether to add in the shifted multiplicand.
• Accumulate the 2n-bit partial product at each stage.

MULTIPLY HARDWARE Version 3
• Combine the Multiplier register and the Product register:
  – 32-bit Multiplicand register.
  – 32-bit ALU.
  – 64-bit Product register holding the multiplier in its right half (0-bit Multiplier register); the product is shifted right and the control decides when to write.

Multiply Algorithm Version 3
Start (multiplier loaded into the right half of the Product register):
1. Test Product0. If Product0 = 1, add the multiplicand to the left half of the product and place the result in the left half of the Product register; if Product0 = 0, do nothing.
2. Shift the Product register right 1 bit.
Repeat until the 32nd repetition, then done.
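A Python sketch of this algorithm on small operands (the 4-bit test case is an assumption for illustration; the hardware uses 32-bit operands and a 64-bit product register):

```python
def multiply_v3(multiplicand, multiplier, bits=32):
    """Multiply hardware version 3: a 2n-bit product register whose right half
    initially holds the multiplier; test bit 0, conditionally add the
    multiplicand into the left half, then shift the whole register right."""
    mask = (1 << (2 * bits)) - 1
    product = multiplier & ((1 << bits) - 1)      # multiplier in the right half
    for _ in range(bits):
        if product & 1:                           # test Product0
            product = (product + (multiplicand << bits)) & mask
        product >>= 1                             # shift the Product register right
    return product

print(multiply_v3(8, 9, bits=4))    # 1000 x 1001 -> 72 (0100 1000)
```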

Booth's Algorithm
Current bit, bit to the right, explanation (example 0001111000), operation:
• 1 0: begins a run of 1s -> subtract.
• 1 1: middle of a run of 1s -> none.
• 0 1: end of a run of 1s -> add.
• 0 0: middle of a run of 0s -> none.
• Originally designed for speed (when shift was faster than add).
• Replace a string of 1s in the multiplier with an initial subtract when we first see a one, and then later an add for the bit after the last one.

Booth's Example (2 x 7)
Operation, Multiplicand, Product (8-bit product plus the extra bit), next step:
0.  Initial value:               0010   0000 0111 0   bits 10 -> subtract
1a. P = P - m (add 1110):        0010   1110 0111 0   shift P right (sign extend)
1b.                              0010   1111 0011 1   bits 11 -> nop, shift
2.                               0010   1111 1001 1   bits 11 -> nop, shift
3.                               0010   1111 1100 1   bits 01 -> add (0010)
4a.                              0010   0001 1100 1   shift
4b.                              0010   0000 1110 0   done (product = 0000 1110 = 14)
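A Python sketch of Booth's algorithm on narrow two's-complement operands, checked against the 2 x 7 example above (the 4-bit width and the helper name are assumptions for illustration):

```python
def booth_multiply(multiplicand, multiplier, bits=4):
    """Booth's algorithm on 'bits'-bit two's-complement operands: examine the
    current multiplier bit and the bit to its right (initially 0); 10 ->
    subtract the multiplicand, 01 -> add it, 00/11 -> do nothing, then
    arithmetically shift the (product, extra bit) pair right."""
    prod = (multiplier & ((1 << bits) - 1)) << 1   # product register + extra bit
    for _ in range(bits):
        pair = prod & 0b11
        if pair == 0b10:                           # begin run of 1s: subtract
            prod -= multiplicand << (bits + 1)
        elif pair == 0b01:                         # end of run of 1s: add
            prod += multiplicand << (bits + 1)
        prod >>= 1                                 # arithmetic shift (Python keeps sign)
    prod >>= 1                                     # drop the extra bit
    prod &= (1 << (2 * bits)) - 1                  # interpret as 2*bits-bit two's complement
    if prod >= 1 << (2 * bits - 1):
        prod -= 1 << (2 * bits)
    return prod

print(booth_multiply(2, 7))     # -> 14, matching the worked example
print(booth_multiply(2, -3))    # -> -6
```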

Combinational Shifter from MUXes
(Diagram.) The basic building block is a 2-to-1 mux selecting between inputs A and B. An 8-bit right shifter is built from three levels of such muxes controlled by the shift amount bits S2, S1, S0, producing R7..R0 from A7..A0.
• What comes in at the MSBs?
• How many levels are needed for a 32-bit shifter?

Division
• Paper and pencil example: dividend 1001010 divided by divisor 1000 gives quotient 1001 and remainder 10 (the remainder is the modulo result).
• See how big a number can be subtracted, creating a quotient bit on each step; in binary each quotient bit selects 1 x divisor or 0 x divisor.
• Dividend = Quotient x Divisor + Remainder, so |Dividend| = |Quotient| + |Divisor| (operand sizes in bits).
• We will look at 3 versions of divide, as a successive refinement.

DIVIDE HARDWARE Version 3
• 32-bit Divisor register.
• 32-bit ALU.
• 64-bit Remainder register ("HI" and "LO" halves), shifted left, with a 0-bit Quotient register; the control decides when to write.

Divide Algorithm Version 3
Start: place the Dividend in the Remainder register.
1. Shift the Remainder register left 1 bit.
2. Subtract the Divisor register from the left half of the Remainder register and place the result in the left half of the Remainder register.
3a. If the remainder >= 0: shift the Remainder register to the left, setting the new rightmost bit to 1.
3b. If the remainder < 0: restore the original value by adding the Divisor register to the left half of the Remainder register and placing the sum in the left half; also shift the Remainder register to the left, setting the new least significant bit to 0.
Repeat steps 2-3 until the nth repetition (n = 4 here).
Done: shift the left half of the Remainder right 1 bit.
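A Python sketch of this restoring division algorithm (the 4-bit width and test values are illustration assumptions; the loop is taken to repeat steps 2-3, which is what the final correcting shift implies):

```python
def divide_v3(dividend, divisor, n=4):
    """Divide hardware version 3 (unsigned, restoring): a 2n-bit remainder
    register initially holds the dividend; after one initial left shift, each
    step subtracts the divisor from the left half and shifts in a quotient bit."""
    rem = dividend                   # Start: place the dividend in the Remainder
    rem <<= 1                        # 1. shift the Remainder register left 1 bit
    for _ in range(n):               # n repetitions of steps 2-3
        rem -= divisor << n          # 2. subtract the Divisor from the left half
        if rem >= 0:
            rem = (rem << 1) | 1     # 3a. shift left, new rightmost bit = 1
        else:
            rem += divisor << n      # 3b. restore, then
            rem <<= 1                #     shift left, new rightmost bit = 0
    quotient = rem & ((1 << n) - 1)  # low half holds the quotient
    remainder = rem >> (n + 1)       # Done: shift the left half right 1 bit
    return quotient, remainder

print(divide_v3(0b0111, 0b0010))     # 7 / 2 -> (3, 1)
```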

Representation of Floating Point Numbers in Single Precision IEEE 754 Standard
Value = N = (-1)^S x 2^(E-127) x (1.M), with 0 < E < 255; the actual exponent is e = E - 127.
Fields: 1-bit sign S; 8-bit exponent E (excess-127: a bias of 127 is added to the binary integer exponent); 23-bit mantissa M (sign + magnitude, normalized binary significand with a hidden integer bit: 1.M).
Examples: 0 = 0 00000000 00...0;  -1.5 = 1 01111111 10...0
The magnitude of numbers that can be represented is in the range 2^-126 (1.0) to 2^127 (2 - 2^-23), which is approximately 1.8 x 10^-38 to 3.40 x 10^38.


Representation of Floating Point Numbers in Double Precision IEEE 754 Standard
Value = N = (-1)^S x 2^(E-1023) x (1.M), with 0 < E < 2047; the actual exponent is e = E - 1023.
Fields: 1-bit sign S; 11-bit exponent E (excess-1023 binary integer); 52-bit mantissa M (sign + magnitude, normalized binary significand with a hidden integer bit: 1.M).
Examples: 0 = 0 00000000000 00...0;  -1.5 = 1 01111111111 10...0
The magnitude of numbers that can be represented is in the range 2^-1022 (1.0) to 2^1023 (2 - 2^-52), which is approximately 2.23 x 10^-308 to 1.8 x 10^308.

Floating Point Conversion Example
• The decimal number -2345.125 (base 10) is to be represented in the IEEE 754 32-bit single precision format:
  -2345.125 = -100100101001.001 (converted to binary) = -1.00100101001001 x 2^11 (normalized binary, with a hidden 1)
• The number is negative, so the sign is S = 1.
• The biased exponent E is given by E = e + 127, so E = 11 + 127 = 138 = 10001010 (binary).
• Fractional part of the mantissa: M = 00100101001001000000000 (in 23 bits).
• The IEEE 754 single precision representation is therefore:
  1   10001010   00100101001001000000000
  S   E (8 bits) M (23 bits)
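The encoding can be cross-checked with Python's struct module; a minimal sketch (the field extraction below is written here for illustration):

```python
import struct

# Pack -2345.125 as IEEE 754 single precision and pull the fields back out.
bits = struct.unpack(">I", struct.pack(">f", -2345.125))[0]
sign     = bits >> 31
exponent = (bits >> 23) & 0xFF
mantissa = bits & 0x7FFFFF

print(sign)                 # -> 1
print(exponent)             # -> 138  (= 11 + 127)
print(f"{mantissa:023b}")   # -> 00100101001001000000000
```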

Floating Point Addition Flowchart
Start:
(1) Compare the exponents of the two numbers; shift the smaller number to the right until its exponent matches the larger exponent.
(2) Add the significands (mantissas).
(3) Normalize the sum, either shifting right and incrementing the exponent or shifting left and decrementing the exponent (if the mantissa = 0, set the exponent to 0).
(4) If overflow or underflow occurs, generate an exception or return an error.
(5) Round the significand to the appropriate number of bits; if it is no longer normalized, go back to step 3. Otherwise, done.

Floating Point Addition Hardware
(Block diagram of the floating point addition hardware.)

Floating Point Multiplication Flowchart
Start:
(1) If one or both operands are 0, set the result to zero (exponent = 0) and finish.
(2) Compute the exponent: biased exp(X) + biased exp(Y) - bias.
(3) Compute the sign of the result: Xs XOR Ys.
(4) Multiply the mantissas.
(5) Normalize the mantissa if needed; if overflow or underflow occurs, generate an exception or return an error.
(6) Round or truncate the result mantissa; if it is no longer normalized, go back to step 5.
(7) Done.

Extra Bits for Rounding
Extra bits are used to prevent or minimize rounding errors. How many extra bits? IEEE: as if the result were computed exactly and then rounded.
Addition cases: 1.xxxxx + 1.xxxxx -> 1x.xxxxy (post-normalization); 1.xxxxx + 0.001xxxxx -> 1.xxxxxyyy (pre-normalization); 1.xxxxx + 0.01xxxxx -> 1x.xxxxyyy (pre- and post-normalization).
• Guard digits: digits to the right of the first p digits of the significand, used to guard against loss of digits; they can later be shifted left into the first p places during normalization.
• Addition: carry-out shifted in. Subtraction: borrow digit and guard. Multiplication: carry and guard. Division requires a guard.

Rounding Digits
A normalized result with some non-zero digits to the right of the significand must be rounded. E.g., B = 10, p = 3:
  1.6900 x 10^(2-bias) - 0.0785 x 10^(2-bias) (7.85 x 10^(0-bias) after alignment) = 1.6115 x 10^(2-bias), which rounds to 1.61 x 10^(2-bias)
One round digit must be carried to the right of the guard digit so that after a normalizing left shift, the result can be rounded according to the value of the round digit.
IEEE Standard: four rounding modes: round to nearest (the default), round towards plus infinity, round towards minus infinity, and round towards 0.
Round to nearest: if the round digit < B/2, truncate; if > B/2, round up (add 1 to the ULP, the unit in the last place); if = B/2, round to the nearest even digit. It can be shown that this strategy minimizes the mean error introduced by rounding.

Sticky Bit
An additional bit to the right of the round digit, to better fine-tune rounding:
  d0 . d1 d2 d3 ... dp-1   0 0 X ... X   X X S
The sticky bit S is set to 1 if any 1 bits fall off the end past the round digit (distinguishing, for example, a pattern ending ...XX1, which generates a borrow, from one ending ...XX0).
Rounding summary:
• Radix 2 minimizes wobble in precision.
• Normal operations in +, -, *, / require one carry/borrow bit plus one guard digit.
• One round digit is needed for correct rounding.
• The sticky bit is needed when the round digit is B/2, for maximum accuracy.
• Rounding to nearest has a mean error of 0 if a uniform distribution of digits is assumed.

Pipelining: Design Goals
• The length of the machine clock cycle is determined by the time required for the slowest pipeline stage.
• An important pipeline design consideration is to balance the length of each pipeline stage.
• If all stages are perfectly balanced, then the time per instruction on a pipelined machine (assuming ideal conditions with no stalls) is:
  (time per instruction on the unpipelined machine) / (number of pipe stages)
• Under these ideal conditions:
  – Speedup from pipelining = the number of pipeline stages = k.
  – One instruction is completed every cycle: CPI = 1.

Pipelined Instruction Processing Representation
(Pipeline diagram: instructions I through I+4 versus clock cycles 1-9. Instruction I enters IF in cycle 1 and completes WB in cycle 5; each following instruction starts one cycle later, so the pipeline fills over the first cycles and thereafter one instruction completes per cycle, with I+4 finishing in cycle 9.)
Pipeline stages: IF = Instruction Fetch, ID = Instruction Decode, EX = Execution, MEM = Memory Access, WB = Write Back.

Single Cycle, Multi-Cycle, Pipeline: Performance Comparison Example
For 1000 instructions, execution time:
• Single cycle machine: 8 ns/cycle x 1 CPI x 1000 inst = 8000 ns
• Multicycle machine: 2 ns/cycle x 4.6 CPI (due to the instruction mix) x 1000 inst = 9200 ns
• Ideal pipelined machine, 5 stages: 2 ns/cycle x (1 CPI x 1000 inst + 4 cycles fill) = 2008 ns
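The three execution times can be recomputed with a few lines of Python (values taken from this slide):

```python
n = 1000                              # instructions
single   = 8 * 1 * n                  # 8 ns cycle, CPI = 1
multi    = 2 * 4.6 * n                # 2 ns cycle, CPI = 4.6 from the instruction mix
pipeline = 2 * (1 * n + 4)            # 2 ns cycle, CPI = 1 plus 4 cycles to fill
print(single, multi, pipeline)        # -> 8000 9200.0 2008 (ns)
```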

Single Cycle Vs. Multi-Cycle Vs. Pipeline
(Timing diagram.) The single cycle implementation uses one long 8 ns cycle per instruction, so a store wastes the part of the cycle it does not need. The multiple cycle implementation uses a 2 ns clock: a load takes 5 cycles (IF ID EX MEM WB), a store 4, and an R-type instruction 4. The pipeline implementation also uses a 2 ns clock, with the load, store, and R-type instructions overlapped, each starting one cycle after the previous one.

MIPS: A Pipelined Datapath
(Datapath diagram with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB separating the five stages: IF instruction fetch, ID instruction decode, EX execution, MEM memory access, WB write back.)

Pipeline Control
• Pass the needed control signals along from one stage to the next as the instruction travels through the pipeline, just like the data: the EX, MEM, and WB control values are generated in ID and carried forward in the ID/EX, EX/MEM, and MEM/WB pipeline registers.

Basic Performance Issues In Pipelining
• Pipelining increases the CPU instruction throughput: the number of instructions completed per unit time. Under ideal conditions the instruction throughput is one instruction per machine cycle, or CPI = 1.
• Pipelining does not reduce the execution time of an individual instruction: the time needed to complete all processing steps of an instruction (also called instruction completion latency).
• It usually slightly increases the execution time of each instruction over unpipelined implementations, due to the increased control overhead of the pipeline and the delays of the pipeline stage registers.

Pipelining Performance Example
• Example: for an unpipelined machine:
  – Clock cycle = 10 ns; 4 cycles for ALU operations and branches and 5 cycles for memory operations, with instruction frequencies of 40%, 20%, and 40%, respectively.
  – If pipelining adds 1 ns to the machine clock cycle, then the speedup in instruction execution from pipelining is:
  Non-pipelined average instruction execution time = clock cycle x average CPI = 10 ns x ((40% + 20%) x 4 + 40% x 5) = 10 ns x 4.4 = 44 ns
  In the pipelined implementation, five stages are used, with an average instruction execution time of 10 ns + 1 ns = 11 ns
  Speedup from pipelining = (instruction time unpipelined) / (instruction time pipelined) = 44 ns / 11 ns = 4 times

Pipeline Hazards
• Hazards are situations in pipelining which prevent the next instruction in the instruction stream from executing during its designated clock cycle, resulting in one or more stall cycles.
• Hazards reduce the ideal speedup gained from pipelining and are classified into three classes:
  – Structural hazards: arise from hardware resource conflicts when the available hardware cannot support all possible combinations of instructions.
  – Data hazards: arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.
  – Control hazards: arise from the pipelining of conditional branches and other instructions that change the PC.

Structural Hazard Example: Single Memory For Instructions & Data
(Pipeline diagram: a load followed by instructions 1-4.) With a single memory for instructions and data, a later instruction's instruction fetch and the load's data access need the memory in the same cycle. Detection is easy in this case (in the diagram, a highlighted right half of a stage box means read, a highlighted left half means write).

Data Hazards Example
• Problem with starting the next instruction before the first is finished: data dependencies that "go backward in time" create data hazards in the sequence below.
  sub $2, $1, $3
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)
(Pipeline diagram: $2 is written by sub at the end of cycle 5, changing from 10 to -20, but the following instructions read $2 in cycles 3-5, before the new value is available.)

Data Hazard Resolution: Stall Cycles
Stall the pipeline by a number of cycles. The control unit must detect the need to insert stall cycles; in this case two stall cycles are needed so that sub writes $2 before the dependent instructions read it.
(Pipeline diagram of the same code sequence with two stall cycles inserted after sub.)


Performance of Pipelines with Stalls
• Hazards in pipelines may make it necessary to stall the pipeline by one or more cycles, degrading performance from the ideal CPI of 1.
  CPI pipelined = Ideal CPI + pipeline stall clock cycles per instruction
• If pipelining overhead is ignored and we assume that the stages are perfectly balanced, then:
  Speedup = CPI unpipelined / (1 + pipeline stall cycles per instruction)
• When all instructions take the same number of cycles, equal to the number of pipeline stages, then:
  Speedup = pipeline depth / (1 + pipeline stall cycles per instruction)

Data Hazard Resolution: Forwarding
• Register file forwarding to handle a read and a write to the same register in the same cycle.
• ALU forwarding: ALU results are fed back from the pipeline registers to the ALU inputs of later dependent instructions.

Data Hazard Example With Forwarding
(Pipeline diagram of the same code sequence as before: the sub result -20 becomes available in the EX/MEM register in cycle 3 and in MEM/WB in cycle 4, and is forwarded to the ALU inputs of the dependent and, or, and add instructions, so no stall cycles are needed.)

A Data Hazard Requiring A Stall
A load followed by an R-type instruction that uses the loaded value: even with forwarding in place, a stall cycle is needed. This condition must be detected by hardware.

Compiler Scheduling Example
• Reorder the instructions to avoid as many pipeline stalls as possible:
  lw $15, 0($2)
  lw $16, 4($2)
  sw $16, 0($2)
  sw $15, 4($2)
• The data hazard occurs on register $16 between the second lw and the first sw, resulting in a stall cycle.
• With forwarding we need to find only one independent instruction to place between them; swapping the lw instructions works:
  lw $16, 4($2)
  lw $15, 0($2)
  sw $16, 0($2)
  sw $15, 4($2)
• Without forwarding we need three independent instructions to place between them, so in addition two nops are added:
  lw $16, 4($2)
  lw $15, 0($2)
  nop
  nop
  sw $16, 0($2)
  sw $15, 4($2)

Control Hazards: Example
• Three other instructions are in the pipeline before the branch target decision is made, when BEQ is in the MEM stage.
• In the diagram, we are predicting "branch not taken"; hardware must be added to flush the three following instructions if the prediction is wrong, losing three cycles.

Reducing Delay of Taken Branches
• The next PC of a branch is known in the MEM stage: this costs three lost cycles if the branch is taken.
• If the next PC is known in the EX stage instead, one cycle is saved.
• The branch address calculation can be moved to the ID stage using a register comparator, costing only one cycle if the branch is taken.
(Datapath diagram with IF.Flush, the hazard detection unit, the register comparator in ID, and the forwarding unit.)

Pipeline Performance Example
• Assume the following MIPS instruction mix:
  – Arith/Logic: 40%
  – Load: 30%, of which 25% are followed immediately by an instruction using the loaded value
  – Store: 10%
  – Branch: 20%, of which 45% are taken
• What is the resulting CPI for the pipelined MIPS with forwarding and branch address calculation in the ID stage?
  CPI = ideal CPI + pipeline stall clock cycles per instruction
      = 1 + stalls by loads + stalls by branches
      = 1 + .3 x .25 x 1 + .2 x .45 x 1
      = 1 + .075 + .09 = 1.165
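A short Python sketch of this CPI calculation (frequencies and one-cycle stall penalties taken from this slide):

```python
load_freq, load_use_frac = 0.30, 0.25   # 25% of loads feed the very next instruction
branch_freq, taken_frac  = 0.20, 0.45   # 45% of branches are taken

stalls = load_freq * load_use_frac * 1 + branch_freq * taken_frac * 1
cpi = 1 + stalls
print(round(cpi, 3))   # -> 1.165
```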

Memory Hierarchy: Motivation
Processor-Memory (DRAM) Performance Gap
(Plot of relative performance versus year, 1980-2000, on a log scale: CPU (µProc) performance improves about 60% per year while DRAM performance improves about 7% per year, so the processor-memory performance gap grows about 50% per year.)

Processor-DRAM Performance Gap Impact: Example
• To illustrate the performance impact, assume a pipelined RISC CPU with CPI = 1 using non-ideal memory.
• Over a 10-year period, ignoring other factors, the cost of a full memory access in terms of the number of wasted instructions:
  1986:   8 MHz, 125 ns CPU cycle, 190 ns memory access -> 190/125  = 1.5 instructions
  1988:  33 MHz,  30 ns,           175 ns               -> 175/30   = 5.8
  1991:  75 MHz,  13.3 ns,         155 ns               -> 155/13.3 = 11.65
  1994: 200 MHz,   5 ns,           130 ns               -> 130/5    = 26
  1996: 300 MHz,   3.33 ns,        110 ns               -> 110/3.33 = 33

Memory Hierarchy: Motivation
The Principle Of Locality
• Programs usually access a relatively small portion of their address space (instructions/data) at any instant of time (the program working set).
• Two types of locality:
  – Temporal locality: if an item is referenced, it will tend to be referenced again soon.
  – Spatial locality: if an item is referenced, items whose addresses are close by will tend to be referenced soon.
• The presence of locality in program behavior makes it possible to satisfy a large percentage of program access needs using memory levels with much less capacity than the program address space.

Levels of The Memory Hierarchy
Moving farther away from the CPU: lower cost/bit, higher capacity, increased access time/latency, lower throughput.
• Registers: part of the on-chip CPU datapath, 16-256 registers.
• Cache (static RAM), one or more levels: level 1 on-chip 16-64 KB; level 2 on- or off-chip 128-512 KB; level 3 off-chip 128 KB-8 MB.
• Main memory: dynamic RAM (DRAM), 16 MB-16 GB.
• Magnetic disk: 4 GB-100 GB; interfaces: SCSI, RAID, IDE, 1394.
• Optical disk or magnetic tape.

A Typical Memory Hierarchy (With Two Levels of Cache)
[Figure: Processor (control, registers, datapath) with an on-chip Level One Cache (L1), backed by a Second Level Cache (SRAM, L2), Main Memory (DRAM), Virtual Memory / Secondary Storage (Disk), and Tertiary Storage (Tape). Moving away from the processor, capacity grows and speed drops: access times range from about 1 ns (registers) and 100s of ns (DRAM) up to 10s of ms (disk) and 10s of seconds (tape); sizes range from 100s of bytes (registers) through KBs (cache), MBs (DRAM), and GBs (disk) to TBs (tape).]
EECC 550 - Shaaban #98 Final Review Spring 2000 5-9-2000

Cache Design & Operation Issues
• Q1: Where can a block be placed in cache? (Block placement strategy & cache organization)
  – Fully Associative, Set Associative, Direct Mapped
• Q2: How is a block found if it is in cache? (Block identification)
  – Tag/Block
• Q3: Which block should be replaced on a miss? (Block replacement)
  – Random, LRU
• Q4: What happens on a write? (Cache write policy)
  – Write through, write back
EECC 550 - Shaaban #99 Final Review Spring 2000 5-9-2000

Cache Organization Example EECC 550 - Shaaban #100 Final Review Spring 2000 5 -9 -2000

Locating A Data Block in Cache • Each block frame in cache has an address tag. • The tags of every cache block that might contain the required data are checked or searched in parallel. • A valid bit is added to the tag to indicate whether this entry contains a valid address. • The address from the CPU to cache is divided into: – A block address, further divided into: • An index field to choose a block set in cache. (no index field when fully associative). • A tag field to search and match addresses in the selected set. – A block offset to select the data from the block. Block Address Tag Index Block Offset EECC 550 - Shaaban #101 Final Review Spring 2000 5 -9 -2000

Address Field Sizes
Physical address generated by the CPU = Block Address (Tag + Index) followed by the Block Offset.
• Block offset size = log2(block size)
• Index size = log2(total number of blocks / associativity)
• Tag size = address size - index size - offset size
EECC 550 - Shaaban #102 Final Review Spring 2000 5-9-2000
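The three field-size formulas translate directly into code. Below is a minimal Python sketch of our own (the helper name address_fields is hypothetical), assuming all sizes are powers of two.

import math

def address_fields(address_bits, cache_size_bytes, block_size_bytes, associativity):
    """Return (tag, index, offset) sizes in bits."""
    offset = int(math.log2(block_size_bytes))
    num_blocks = cache_size_bytes // block_size_bytes
    index = int(math.log2(num_blocks // associativity))
    tag = address_bits - index - offset
    return tag, index, offset

# 16-bit addresses, 2 KB cache (128 x 16-byte block frames), direct mapped:
print(address_fields(16, 2048, 16, 1))   # -> (5, 7, 4)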

Four-Way Set Associative Cache: MIPS Implementation Example
[Figure: a 4-way set associative cache with 256 sets (1024 one-word block frames). The 32-bit address is split into a 22-bit tag field (bits 31-10), an 8-bit index field (bits 9-2), and a byte offset. The four valid bits and tags of the indexed set are compared in parallel against the tag field, and a 4-to-1 multiplexor selects the data of the hitting way.]
EECC 550 - Shaaban #103 Final Review Spring 2000 5-9-2000

Cache Organization/Addressing Example • Given the following: – A single-level L 1 cache with 128 cache block frames • Each block frame contains four words (16 bytes) – 16 -bit memory addresses to be cached (64 K bytes main memory or 4096 memory blocks) • Show the cache organization/mapping and cache address fields for: • Fully Associative cache. • Direct mapped cache. • 2 -way set-associative cache. EECC 550 - Shaaban #104 Final Review Spring 2000 5 -9 -2000

Cache Example: Fully Associative Case
• All 128 tags must be checked in parallel by hardware to locate a data block.
• Block Address = 12 bits: Tag = 12 bits (no index field). Block offset = 4 bits.
[Figure: 128 block frames, each with a valid bit, a 12-bit tag, and a 16-byte data block]
EECC 550 - Shaaban #105 Final Review Spring 2000 5-9-2000

Cache Example: Direct Mapped Case
• Only a single tag (in the indexed block frame) must be checked to locate a data block.
• Block Address = 12 bits: Tag = 5 bits, Index = 7 bits. Block offset = 4 bits.
[Figure: main memory blocks map to cache block frames by index; each frame holds a valid bit, a 5-bit tag, and a 16-byte data block]
EECC 550 - Shaaban #106 Final Review Spring 2000 5-9-2000

Cache Example: 2-Way Set-Associative
• The two tags in the indexed set must be checked in parallel to locate a data block.
• Block Address = 12 bits: Tag = 6 bits, Index = 6 bits. Block offset = 4 bits.
[Figure: main memory blocks map to one of 64 sets of two block frames each; valid bits not shown]
EECC 550 - Shaaban #107 Final Review Spring 2000 5-9-2000
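A short self-contained check (our own sketch, not from the slides) that reproduces the address splits of the three preceding example slides for the 128-frame, 16-byte-block cache with 16-bit addresses:

import math

ADDRESS_BITS, NUM_FRAMES, BLOCK_SIZE = 16, 128, 16
offset = int(math.log2(BLOCK_SIZE))                      # 4 bits

for name, assoc in [("fully associative", NUM_FRAMES),
                    ("direct mapped", 1),
                    ("2-way set associative", 2)]:
    index = int(math.log2(NUM_FRAMES // assoc))          # 0, 7, 6 bits
    tag = ADDRESS_BITS - index - offset                  # 12, 5, 6 bits
    print(f"{name}: tag = {tag}, index = {index}, offset = {offset}")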

Cache Replacement Policy
• When a cache miss occurs, the cache controller may have to select a block of cache data to be removed from a cache block frame and replaced with the requested data. Such a block is selected by one of two methods:
  – Random:
    • Any block is randomly selected for replacement, providing uniform allocation.
    • Simple to build in hardware.
    • The most widely used cache replacement strategy.
  – Least-recently used (LRU):
    • Accesses to blocks are recorded, and the block replaced is the one that was unused for the longest period of time.
    • LRU is expensive to implement as the number of blocks to be tracked increases, and is usually approximated.
EECC 550 - Shaaban #108 Final Review Spring 2000 5-9-2000

Miss Rates for Caches with Different Size, Associativity & Replacement Algorithm
Sample Data:

    Cache Size   2-way LRU   2-way Random   4-way LRU   4-way Random   8-way LRU   8-way Random
    16 KB        5.18%       5.69%          4.67%       5.29%          4.39%       4.96%
    64 KB        1.88%       2.01%          1.54%       1.66%          1.39%       1.53%
    256 KB       1.15%       1.17%          1.13%       1.13%          1.12%       1.12%

EECC 550 - Shaaban #109 Final Review Spring 2000 5-9-2000

Single Level Cache Performance
For a CPU with a single level (L1) of cache and no stalls for cache hits:
  CPU time = (CPU execution clock cycles + Memory stall clock cycles) x clock cycle time
  (CPU execution clock cycles are the cycles with ideal memory, i.e., no misses.)
  Memory stall clock cycles = (Reads x Read miss rate x Read miss penalty) + (Writes x Write miss rate x Write miss penalty)
If write and read miss penalties are the same:
  Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty
EECC 550 - Shaaban #110 Final Review Spring 2000 5-9-2000

Single Level Cache Performance
  CPUtime = IC x CPI x C
  CPIexecution = CPI with ideal memory
  CPI = CPIexecution + Mem Stall cycles per instruction
  CPUtime = IC x (CPIexecution + Mem Stall cycles per instruction) x C
  Mem Stall cycles per instruction = Mem accesses per instruction x Miss rate x Miss penalty
  CPUtime = IC x (CPIexecution + Mem accesses per instruction x Miss rate x Miss penalty) x C
  Misses per instruction = Memory accesses per instruction x Miss rate
  CPUtime = IC x (CPIexecution + Misses per instruction x Miss penalty) x C
EECC 550 - Shaaban #111 Final Review Spring 2000 5-9-2000
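These formulas can be captured in two small helpers. The sketch below is our own (the names cpi_with_cache and cpu_time are hypothetical), mirroring the equations above.

def cpi_with_cache(cpi_execution, mem_accesses_per_instr, miss_rate, miss_penalty):
    """CPI including memory stall cycles for a single cache level."""
    mem_stalls_per_instr = mem_accesses_per_instr * miss_rate * miss_penalty
    return cpi_execution + mem_stalls_per_instr

def cpu_time(instruction_count, cpi, clock_cycle_time):
    """CPUtime = IC x CPI x C."""
    return instruction_count * cpi * clock_cycle_time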

Cache Performance Example
• Suppose a CPU executes at Clock Rate = 200 MHz (5 ns per cycle) with a single level of cache.
• CPIexecution = 1.1
• Instruction mix: 50% arith/logic, 30% load/store, 20% control
• Assume a cache miss rate of 1.5% and a miss penalty of 50 cycles.
  CPI = CPIexecution + Mem stalls per instruction
  Mem stalls per instruction = Mem accesses per instruction x Miss rate x Miss penalty
  Mem accesses per instruction = 1 (instruction fetch) + 0.3 (load/store) = 1.3
  Mem stalls per instruction = 1.3 x 0.015 x 50 = 0.975
  CPI = 1.1 + 0.975 = 2.075
• The ideal CPU with no misses would be 2.075 / 1.1 = 1.88 times faster.
EECC 550 - Shaaban #112 Final Review Spring 2000 5-9-2000

Cache Performance Example
• Suppose for the previous example we double the clock rate to 400 MHz. How much faster is this machine, assuming the same miss rate and instruction mix?
• Since memory speed is not changed, the miss penalty takes more CPU cycles:
  Miss penalty = 50 x 2 = 100 cycles
  CPI = 1.1 + 1.3 x 0.015 x 100 = 1.1 + 1.95 = 3.05
  Speedup = (CPIold x Cold) / (CPInew x Cnew) = (2.075 x 2) / 3.05 = 1.36
• The new machine is only 1.36 times faster rather than 2 times faster due to the increased effect of cache misses.
  => CPUs with higher clock rates have more cycles per cache miss and a larger memory impact on CPI.
EECC 550 - Shaaban #113 Final Review Spring 2000 5-9-2000
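Both examples can be reproduced in a few lines. The sketch below is ours (variable names are hypothetical); it recomputes the two CPIs and the 1.36 speedup from the same assumptions.

cpi_exec, accesses_per_instr, miss_rate = 1.1, 1.3, 0.015

cpi_200 = cpi_exec + accesses_per_instr * miss_rate * 50    # 50-cycle miss penalty at 200 MHz
cpi_400 = cpi_exec + accesses_per_instr * miss_rate * 100   # 100-cycle miss penalty at 400 MHz
print(round(cpi_200, 3), round(cpi_400, 3))                 # -> 2.075 3.05

# Same instruction count; the 400 MHz machine has half the cycle time.
speedup = (cpi_200 * (1 / 200e6)) / (cpi_400 * (1 / 400e6))
print(round(speedup, 2))                                    # -> 1.36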

3 Levels of Cache
• CPU → L1 Cache: Hit rate = H1, Hit time = 1 cycle
• L2 Cache: Hit rate = H2, Hit time = T2 cycles
• L3 Cache: Hit rate = H3, Hit time = T3 cycles
• Main Memory: access penalty = M cycles
EECC 550 - Shaaban #114 Final Review Spring 2000 5-9-2000

3-Level Cache Performance
  CPUtime = IC x (CPIexecution + Mem Stall cycles per instruction) x C
  Mem Stall cycles per instruction = Mem accesses per instruction x Stall cycles per access
• For a system with 3 levels of cache, assuming no penalty when found in the L1 cache:
  Stall cycles per memory access =
    [Miss rate L1] x [ Hit rate L2 x Hit time L2 + Miss rate L2 x (Hit rate L3 x Hit time L3 + Miss rate L3 x Memory access penalty) ]
  = (1 - H1) x [ H2 x T2 + (1 - H2) x (H3 x (T2 + T3) + (1 - H3) x M) ]
EECC 550 - Shaaban #115 Final Review Spring 2000 5-9-2000
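The stall formula maps directly onto a one-line function. The sketch below is ours (the name stall_cycles_per_access is hypothetical), a direct transcription of the expression above.

def stall_cycles_per_access(h1, h2, h3, t2, t3, m):
    """Average stall cycles per memory access; an L1 hit adds no penalty."""
    return (1 - h1) * (h2 * t2 + (1 - h2) * (h3 * (t2 + t3) + (1 - h3) * m))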

Three Level Cache Example
• CPU with CPIexecution = 1.1 running at clock rate = 500 MHz
• 1.3 memory accesses per instruction
• L1 cache operates at 500 MHz with a miss rate of 5%
• L2 cache operates at 250 MHz with a miss rate of 3% (T2 = 2 cycles)
• L3 cache operates at 100 MHz with a miss rate of 1.5% (T3 = 5 cycles)
• Memory access penalty M = 100 cycles. Find the CPI.
• With a single L1 cache only, CPI = 1.1 + 1.3 x 0.05 x 100 = 7.6
  CPI = CPIexecution + Mem stall cycles per instruction
  Mem stall cycles per instruction = Mem accesses per instruction x Stall cycles per access
  Stall cycles per memory access = (1 - H1) x [ H2 x T2 + (1 - H2) x (H3 x (T2 + T3) + (1 - H3) x M) ]
    = 0.05 x [ 0.97 x 2 + 0.03 x (0.985 x (2 + 5) + 0.015 x 100) ]
    = 0.05 x [ 1.94 + 0.03 x (6.895 + 1.5) ]
    = 0.05 x [ 1.94 + 0.252 ] = 0.11
• CPI = 1.1 + 1.3 x 0.11 = 1.24
EECC 550 - Shaaban #116 Final Review Spring 2000 5-9-2000
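The example can be recomputed directly. In the sketch below (ours, not from the slide), the per-level hit rates are taken as 1 minus the miss rates given above.

h1, h2, h3 = 0.95, 0.97, 0.985         # hit rates = 1 - miss rates (5%, 3%, 1.5%)
t2, t3, m = 2, 5, 100                  # L2/L3 hit times and memory penalty in cycles

stalls = (1 - h1) * (h2 * t2 + (1 - h2) * (h3 * (t2 + t3) + (1 - h3) * m))
cpi = 1.1 + 1.3 * stalls
print(round(stalls, 2), round(cpi, 2))   # -> 0.11 1.24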

Memory Bandwidth Improvement Techniques
• Wider Main Memory: memory width is increased to a number of words (usually the size of a second-level cache block).
  => Memory bandwidth is proportional to memory width; e.g., doubling the width of cache and memory doubles memory bandwidth.
• Simple Interleaved Memory: memory is organized as a number of banks, each one word wide.
  – Simultaneous multiple-word memory reads or writes are accomplished by sending memory addresses to several memory banks at once.
  – Interleaving factor: refers to the mapping of memory addresses to memory banks; e.g., using 4 banks, bank 0 holds all words whose address satisfies (word address) mod 4 = 0.
EECC 550 - Shaaban #117 Final Review Spring 2000 5-9-2000
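A tiny sketch (ours) of the interleaving factor just described: with 4 banks, consecutive word addresses fall in different banks, so they can be accessed in parallel.

NUM_BANKS = 4
for word_address in range(8):
    print(f"word {word_address} -> bank {word_address % NUM_BANKS}")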

Three examples of bus width, memory width, and memory interleaving to achieve higher memory bandwidth:
• Simplest design: everything is the width of one word.
• Wider memory, bus, and cache.
• Narrow bus and cache with interleaved memory.
[Figure: the three memory organizations]
EECC 550 - Shaaban #118 Final Review Spring 2000 5-9-2000

Memory Interleaving EECC 550 - Shaaban #119 Final Review Spring 2000 5 -9 -2000

Memory Width, Interleaving: An Example
Given a base system with the following parameters:
  Cache block size = 1 word, Memory bus width = 1 word, Miss rate = 3%
  Miss penalty = 32 cycles, broken down as: 4 cycles to send the address, 24 cycles access time per word, 4 cycles to send a word
  Memory accesses per instruction = 1.2, Ideal execution CPI (ignoring cache misses) = 2
  Miss rate (block size = 2 words) = 2%, Miss rate (block size = 4 words) = 1%
• The CPI of the base machine with 1-word blocks = 2 + (1.2 x 0.03 x 32) = 3.15
• Increasing the block size to two words gives the following CPI:
  – 32-bit bus and memory, no interleaving = 2 + (1.2 x 0.02 x 2 x 32) = 3.54
  – 32-bit bus and memory, interleaved     = 2 + (1.2 x 0.02 x (4 + 24 + 8)) = 2.86
  – 64-bit bus and memory, no interleaving = 2 + (1.2 x 0.02 x 1 x 32) = 2.77
• Increasing the block size to four words; resulting CPI:
  – 32-bit bus and memory, no interleaving = 2 + (1.2 x 0.01 x 4 x 32) = 3.54
  – 32-bit bus and memory, interleaved     = 2 + (1.2 x 0.01 x (4 + 24 + 16)) = 2.53
  – 64-bit bus and memory, no interleaving = 2 + (1.2 x 0.01 x 2 x 32) = 2.77
EECC 550 - Shaaban #120 Final Review Spring 2000 5-9-2000
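The CPI figures above can be recomputed with a short script. The sketch below is ours; each miss penalty is built from the 4/24/4-cycle breakdown given above (4 to send the address, 24 access time per word, 4 to send a word over a one-word bus).

IDEAL_CPI, ACCESSES_PER_INSTR = 2.0, 1.2
SEND_ADDR, ACCESS_WORD, SEND_WORD = 4, 24, 4              # cycles, as given above

def cpi(miss_rate, miss_penalty_cycles):
    return IDEAL_CPI + ACCESSES_PER_INSTR * miss_rate * miss_penalty_cycles

print(round(cpi(0.03, 32), 2))                                       # 1-word blocks             -> 3.15
# 2-word blocks, 2% miss rate:
print(round(cpi(0.02, 2 * 32), 2))                                   # 32-bit bus, no interleave -> 3.54
print(round(cpi(0.02, SEND_ADDR + ACCESS_WORD + 2 * SEND_WORD), 2))  # 32-bit bus, interleaved   -> 2.86
print(round(cpi(0.02, 1 * 32), 2))                                   # 64-bit bus, no interleave -> 2.77
# 4-word blocks, 1% miss rate:
print(round(cpi(0.01, 4 * 32), 2))                                   # 32-bit bus, no interleave -> 3.54
print(round(cpi(0.01, SEND_ADDR + ACCESS_WORD + 4 * SEND_WORD), 2))  # 32-bit bus, interleaved   -> 2.53
print(round(cpi(0.01, 2 * 32), 2))                                   # 64-bit bus, no interleave -> 2.77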

Virtual Memory Benefits
– Illusion of having more physical main memory
– Allows program relocation
– Protection from illegal memory access
[Figure: address translation — a 32-bit virtual address (20-bit virtual page number, bits 31-12, plus a 12-bit page offset, bits 11-0) is translated into a physical address in which the physical page number replaces the virtual page number and the page offset is unchanged]
EECC 550 - Shaaban #121 Final Review Spring 2000 5-9-2000

Page Table
[Figure: the page table register points to the page table in main memory; the 20-bit virtual page number indexes the table, and each entry holds a valid bit plus an 18-bit physical page number (if the valid bit is 0, the page is not present in memory); the physical page number is combined with the 12-bit page offset to form the physical address]
• Two memory accesses are needed: the first to the page table, the second to the item itself.
EECC 550 - Shaaban #122 Final Review Spring 2000 5-9-2000
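A toy Python sketch (ours, with a hypothetical single-entry page table) of the translation shown above: split off the 12-bit offset, look up the virtual page number in the page table, and rejoin the physical page number with the offset.

PAGE_OFFSET_BITS = 12

def translate(virtual_address, page_table):
    vpn = virtual_address >> PAGE_OFFSET_BITS
    offset = virtual_address & ((1 << PAGE_OFFSET_BITS) - 1)
    valid, ppn = page_table[vpn]          # each entry: (valid bit, physical page number)
    if not valid:
        raise Exception("page fault: page not present in memory")
    return (ppn << PAGE_OFFSET_BITS) | offset

page_table = {0x00004: (1, 0x00123)}      # hypothetical single entry
print(hex(translate(0x00004ABC, page_table)))   # -> 0x123abc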

Virtual Memory Issues/Strategies
• Main memory block placement: fully associative placement is used to lower the miss rate.
• Block replacement: the least recently used (LRU) block is replaced when a new block is brought into main memory from disk.
• Write strategy: write back is used, and only those pages changed in main memory are written to disk (a dirty bit scheme is used).
• To locate blocks in main memory, a page table is utilized. The page table is indexed by the virtual page number and contains the physical address of the block.
  – In paging: the offset is concatenated to this physical page address.
  – In segmentation: the offset is added to the physical segment address.
• To limit the size of the page table to the number of physical pages in main memory, a hashing scheme is used.
• Utilizing address locality, a translation look-aside buffer (TLB) is usually used to cache recent address translations and avoid a second memory access to read the page table.
EECC 550 - Shaaban #123 Final Review Spring 2000 5-9-2000

Speeding Up Address Translation: Translation Lookaside Buffer (TLB)
• TLB: a small on-chip fully-associative cache (typically 128-256 entries) used for address translations.
• If a virtual address is found in the TLB (a TLB hit), the page table in main memory is not accessed.
[Figure: each TLB entry holds a valid bit, a tag (virtual page number), and a physical page address; on a TLB miss, the page table in main memory supplies either a physical page number or a disk address]
EECC 550 - Shaaban #124 Final Review Spring 2000 5-9-2000
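A minimal sketch (ours) of how the TLB short-circuits the page-table access described above. A real TLB is a small fixed-size hardware structure with its own replacement policy; this toy version simply caches translations in a dictionary.

PAGE_OFFSET_BITS = 12
tlb = {}                                   # vpn -> physical page number (recent translations)

def translate_with_tlb(virtual_address, page_table):
    vpn = virtual_address >> PAGE_OFFSET_BITS
    offset = virtual_address & ((1 << PAGE_OFFSET_BITS) - 1)
    if vpn in tlb:                         # TLB hit: no page-table (memory) access needed
        ppn = tlb[vpn]
    else:                                  # TLB miss: read the page table, then cache the mapping
        valid, ppn = page_table[vpn]
        if not valid:
            raise Exception("page fault")
        tlb[vpn] = ppn
    return (ppn << PAGE_OFFSET_BITS) | offset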

TLB & Cache Operation
The cache is physically addressed, so the TLB is accessed before the cache:
1. The virtual address is presented to the TLB. On a TLB miss, the page table is used to obtain the translation; on a TLB hit, the physical address is produced directly.
2. For a read, the cache is accessed with the physical address: a cache miss stalls the pipeline, while a cache hit delivers the data to the CPU.
3. For a write, the write access bit is checked; if it is off, a write protection exception is raised, otherwise the data is written into the cache, the tag is updated, and the data and address are placed into the write buffer.
[Figure: flowchart of combined TLB and cache operation]
EECC 550 - Shaaban #125 Final Review Spring 2000 5-9-2000