EECS 252 Graduate Computer Architecture, Lecture 2

EECS 252 Graduate Computer Architecture, Lecture 2: Review of Instruction Sets, Pipelines, and Caches

John Kubiatowicz
Electrical Engineering and Computer Sciences, University of California, Berkeley
http://www.eecs.berkeley.edu/~kubitron/cs252
http://www-inst.eecs.berkeley.edu/~cs252

Review: Amdahl’s Law

Speedup_overall = ExTime_old / ExTime_new = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

Best you could ever hope to do (let Speedup_enhanced go to infinity):

Speedup_maximum = 1 / (1 - Fraction_enhanced)
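
A quick Python sanity check of Amdahl's Law (a minimal sketch; the function names are mine, for illustration):

    def amdahl_speedup(fraction_enhanced, speedup_enhanced):
        # Overall speedup when only a fraction of execution time is enhanced
        return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

    def amdahl_limit(fraction_enhanced):
        # Best you could ever hope to do: the enhanced part takes zero time
        return 1.0 / (1.0 - fraction_enhanced)

    # Speeding up 80% of a program by 10x yields ~3.57x, bounded above by 5x
    print(amdahl_speedup(0.8, 10.0), amdahl_limit(0.8))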

Review: Computer Performance

CPU time = Seconds / Program
         = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)
           [inst count]               [CPI]                    [cycle time]

                Inst Count   CPI   Clock Rate
Program             X
Compiler            X        (X)
Inst. Set           X         X
Organization                  X        X
Technology                             X

Today: Quick review of everything you should have learned
(a countably infinite set of computer architecture concepts)

Cycles Per Instruction (Throughput)

“Average Cycles per Instruction”:
  CPI = (CPU Time x Clock Rate) / Instruction Count
      = Cycles / Instruction Count

  CPU time = Cycle Time x Sum over i of (CPI_i x I_i)

“Instruction Frequency”:
  CPI = Sum over i of (CPI_i x F_i),  where F_i = I_i / Instruction Count

Example: Calculating CPI bottom up

Run benchmark and collect workload characterization (simulate, machine counters, or sampling).

Base Machine (Reg/Reg):
Op       Freq   Cycles   CPI(i)   (% Time)
ALU      50%    1        0.5      (33%)
Load     20%    2        0.4      (27%)
Store    10%    2        0.2      (13%)
Branch   20%    2        0.4      (27%)
Total CPI                1.5

Typical mix of instruction types in program.

Design guideline: Make the common case fast.
MIPS 1% rule: only consider adding an instruction if it is shown to add 1% performance improvement on reasonable benchmarks.
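
The same calculation in a few lines of Python (numbers taken from the table above):

    # (frequency, cycles) per instruction class, from the table above
    mix = {"ALU": (0.50, 1), "Load": (0.20, 2), "Store": (0.10, 2), "Branch": (0.20, 2)}

    cpi = sum(freq * cycles for freq, cycles in mix.values())
    print(f"CPI = {cpi:.2f}")  # 1.50

    # Fraction of execution time per class: freq * cycles / CPI
    for op, (freq, cycles) in mix.items():
        print(f"{op}: {freq * cycles / cpi:.0%} of time")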

Definition: Performance

• Performance is in units of things per second – bigger is better
• If we are primarily concerned with response time:

  performance(x) = 1 / execution_time(x)

• “X is n times faster than Y” means:

  n = Performance(X) / Performance(Y) = Execution_time(Y) / Execution_time(X)

ISA Implementation Review

A "Typical" RISC ISA • • 32 bit fixed format instruction (3 formats) 32

A "Typical" RISC ISA • • 32 bit fixed format instruction (3 formats) 32 32 bit GPR (R 0 contains zero, DP take pair) 3 address, reg arithmetic instruction Single address mode for load/store: base + displacement – no indirection • Simple branch conditions • Delayed branch see: SPARC, MIPS, HP PA-Risc, DEC Alpha, IBM Power. PC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3 10/19/2021 CS 252 S 07, Lecture 02 9

Example: MIPS Instruction Formats

Register-Register:
  31-26: Op | 25-21: Rs1 | 20-16: Rs2 | 15-11: Rd | 10-6: (unused) | 5-0: Opx

Register-Immediate:
  31-26: Op | 25-21: Rs1 | 20-16: Rd | 15-0: immediate

Branch:
  31-26: Op | 25-21: Rs1 | 20-16: Rs2/Opx | 15-0: immediate

Jump / Call:
  31-26: Op | 25-0: target
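
A minimal Python sketch of pulling these fields out of a 32-bit register-register instruction word (illustrative only, not a full decoder):

    def decode_rr(insn):
        # Register-Register format: Op | Rs1 | Rs2 | Rd | (unused) | Opx
        return {
            "op":  (insn >> 26) & 0x3F,  # bits 31-26
            "rs1": (insn >> 21) & 0x1F,  # bits 25-21
            "rs2": (insn >> 16) & 0x1F,  # bits 20-16
            "rd":  (insn >> 11) & 0x1F,  # bits 15-11
            "opx": insn & 0x3F,          # bits 5-0
        }

    # MIPS "add r1, r2, r3" encodes as op=0, rs1=2, rs2=3, rd=1, opx=0x20
    print(decode_rr(0x00430820))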

Datapath vs Control

(Figure: a controller drives the datapath through control points; the datapath returns signals to the controller.)

• Datapath: Storage, FUs, interconnect sufficient to perform the desired functions
  – Inputs are Control Points
  – Outputs are signals
• Controller: State machine to orchestrate operation on the datapath
  – Based on desired function and signals

5 Steps of MIPS Datapath (Figure A.2, Page A-8)

Stages: Instruction Fetch -> Instr. Decode / Reg. Fetch -> Execute / Addr. Calc -> Memory Access -> Write Back

(Figure: single-cycle datapath: PC and next-PC mux with adder (+4), instruction memory, register file (RS1, RS2, RD), sign extend for the immediate, ALU with zero test, data memory (LMD), and the write-back mux.)

IR <= mem[PC];  PC <= PC + 4
Reg[IRrd] <= Reg[IRrs] op(IRop) Reg[IRrt]

Simple Pipelining Review

5 Steps of MIPS Datapath (Figure A.3, Page A-9)

Same five stages, now separated by pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB.

(Figure: pipelined datapath; the per-stage register transfers are:)

IF:  IR <= mem[PC];  PC <= PC + 4
ID:  A <= Reg[IRrs];  B <= Reg[IRrt]
EX:  rslt <= A op(IRop) B
MEM: WB <= rslt
WB:  Reg[IRrd] <= WB

• Data stationary control
  – local decode for each instruction phase / pipeline stage

Visualizing Pipelining (Figure A.2, Page A-8)

(Figure: pipeline timing diagram. Time in clock cycles (Cycle 1 through Cycle 7) runs across; instructions in program order run down. Each instruction passes through Ifetch, Reg, ALU, DMem, Reg in successive cycles, so the stages of consecutive instructions overlap.)

Pipelining is not quite that easy!

• Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
  – Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
  – Data hazards: instruction depends on result of prior instruction still in the pipeline (missing sock)
  – Control hazards: caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

One Memory Port / Structural Hazards (Figure A.4, Page A-14)

(Figure: pipeline diagram for Load followed by Instr 1-4. In cycle 4, Load's DMem access and Instr 3's Ifetch both need the single memory port: a structural hazard.)

One Memory Port / Structural Hazards (similar to Figure A.5, Page A-15)

(Figure: the same sequence, resolved by stalling: Instr 3's fetch is delayed one cycle, inserting a bubble that travels down the pipeline so no two instructions need the memory port in the same cycle.)

How do you “bubble” the pipe?

Speed Up Equation for Pipelining

Speedup = (Ideal CPI x Pipeline depth) / (Ideal CPI + Pipeline stall CPI) x (Cycle Time_unpipelined / Cycle Time_pipelined)

For simple RISC pipeline, Ideal CPI = 1:

Speedup = Pipeline depth / (1 + Pipeline stall CPI) x (Cycle Time_unpipelined / Cycle Time_pipelined)

Example: Dual port vs. Single port

• Machine A: Dual ported memory (“Harvard Architecture”)
• Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate
• Ideal CPI = 1 for both
• Loads are 40% of instructions executed

SpeedupA = Pipeline Depth / (1 + 0) x (clock_unpipe / clock_pipe)
         = Pipeline Depth

SpeedupB = Pipeline Depth / (1 + 0.4 x 1) x (clock_unpipe / (clock_unpipe / 1.05))
         = (Pipeline Depth / 1.4) x 1.05
         = 0.75 x Pipeline Depth

SpeedupA / SpeedupB = Pipeline Depth / (0.75 x Pipeline Depth) = 1.33

• Machine A is 1.33 times faster
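
A quick Python check of this arithmetic (variable names are mine):

    def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0):
        # Speedup = depth / (1 + stall CPI) x (unpipelined clock / pipelined clock)
        return depth / (1.0 + stall_cpi) * clock_ratio

    depth = 5  # any depth works; the ratio is depth-independent
    a = pipeline_speedup(depth, stall_cpi=0.0)                    # dual ported: no stalls
    b = pipeline_speedup(depth, stall_cpi=0.4, clock_ratio=1.05)  # stalls behind 40% loads
    print(a / b)  # ~1.33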

CS 252 Administrivia

• Sign up! Web site is (doesn’t quite work!): http://www.cs.berkeley.edu/~kubitron/cs252
• In-class exam on Wednesday January 24th
  – Improves the 252 experience if we recapture common background
  – Bring 1 sheet of paper with notes on both sides
  – Doesn’t affect grade, only admission into class
  – 2 grades: Admitted, or audit/take CS 152 1st (before class Friday)
• Review: Chapter 1, Appendix A, CS 152 home page, maybe “Computer Organization and Design (COD) 2/e”
  – If you did take such a class, be sure COD Chapters 2, 5, 6, 7 are familiar
  – Copies in Bechtel Library on 2-hour reserve

CS 252 Administrivia

• Resources for course on web site:
  – Check out the ISCA (International Symposium on Computer Architecture) 25th-year retrospective on the web site. Look for “Additional reading” below the text book description
  – Pointers to previous CS 152 exams and resources
  – Lots of old CS 252 material
  – Interesting links. Check out the WWW Computer Architecture Home Page
• Size of class seems OK:
  – I asked Michael David to put everyone on the waitlist into the class
  – Check to make sure

Data Hazard on R1 (Figure A.6, Page A-17)

(Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB) for
  add r1, r2, r3
  sub r4, r1, r3
  and r6, r1, r7
  or  r8, r1, r9
  xor r10, r1, r11
add does not write r1 until WB, but sub, and, and or try to read r1 in earlier cycles.)

Three Generic Data Hazards

• Read After Write (RAW)
  InstrJ tries to read operand before InstrI writes it

    I: add r1, r2, r3
    J: sub r4, r1, r3

• Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.

Three Generic Data Hazards

• Write After Read (WAR)
  InstrJ writes operand before InstrI reads it

    I: sub r4, r1, r3
    J: add r1, r2, r3
    K: mul r6, r1, r7

• Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
• Can’t happen in MIPS 5-stage pipeline because:
  – All instructions take 5 stages, and
  – Reads are always in stage 2, and
  – Writes are always in stage 5

Three Generic Data Hazards

• Write After Write (WAW)
  InstrJ writes operand before InstrI writes it

    I: sub r1, r4, r3
    J: add r1, r2, r3
    K: mul r6, r1, r7

• Called an “output dependence” by compiler writers. This also results from the reuse of name “r1”.
• Can’t happen in MIPS 5-stage pipeline because:
  – All instructions take 5 stages, and
  – Writes are always in stage 5
• Will see WAR and WAW in more complicated pipes
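
To make the three definitions concrete, here is a small Python classifier over an earlier instruction I and a later instruction J, each described by its destination and source registers (the representation is mine, not from the lecture):

    def hazards(i_dst, i_srcs, j_dst, j_srcs):
        # I is earlier than J in program order
        found = []
        if i_dst is not None and i_dst in j_srcs:
            found.append("RAW")   # J reads what I writes: true dependence
        if j_dst is not None and j_dst in i_srcs:
            found.append("WAR")   # J writes what I reads: anti-dependence
        if None not in (i_dst, j_dst) and i_dst == j_dst:
            found.append("WAW")   # J writes what I writes: output dependence
        return found

    # add r1,r2,r3 then sub r4,r1,r3 -> RAW on r1
    print(hazards("r1", {"r2", "r3"}, "r4", {"r1", "r3"}))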

Forwarding to Avoid Data Hazard (Figure A.7, Page A-19)

(Figure: the same add/sub/and/or/xor sequence; forwarding paths feed add's ALU result from the EX/MEM and MEM/WB registers straight back to the ALU inputs of the following instructions, so no stall is needed.)

HW Change for Forwarding (Figure A.23, Page A-37)

(Figure: muxes in front of the ALU inputs now select among the ID/EX register operands, the EX/MEM ALU result, the MEM/WB value, and the immediate; results loop back from the later pipeline registers through these muxes.)

What circuit detects and resolves this hazard?
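
One answer, sketched in Python for readability: the standard EX-stage forwarding comparisons from Appendix A. This is a hedged sketch with my own signal names; register 0 is hardwired to zero, so it is never forwarded:

    def forward_select(ex_mem_regwrite, ex_mem_rd,
                       mem_wb_regwrite, mem_wb_rd, id_ex_rs):
        # Prefer the most recent producer: EX/MEM beats MEM/WB
        if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs:
            return "EX/MEM"   # forward ALU result of the previous instruction
        if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_rs:
            return "MEM/WB"   # forward from two instructions back
        return "REGFILE"      # no hazard: use the value read in ID

    # sub r4,r1,r3 right behind add r1,...: rs=r1 matches EX/MEM.rd=r1
    print(forward_select(True, 1, False, 0, 1))  # -> EX/MEM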

Forwarding to Avoid LW-SW Data Hazard (Figure A.8, Page A-20)

(Figure: for
  add r1, r2, r3
  lw  r4, 0(r1)
  sw  r4, 12(r1)
  or  r8, r6, r9
  xor r10, r9, r11
forwarding delivers r1 to the lw and sw address calculations, and the loaded r4 from MEM/WB to sw's data input.)

Data Hazard Even with Forwarding (Figure A.9, Page A-21)

(Figure: lw r1, 0(r2) produces r1 only at the end of its MEM stage, but sub r4, r1, r6 needs it at the start of its EX stage in the same cycle; forwarding cannot go backward in time. and r6, r1, r7 and or r8, r1, r9 follow.)

Data Hazard Even with Forwarding (similar to Figure A.10, Page A-21)

(Figure: the load-use hazard is resolved by stalling sub and everything behind it for one cycle; after the bubble, the MEM/WB value of r1 can be forwarded to sub's EX stage.)
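
The interlock that inserts this bubble is typically a one-line check in the decode stage. A sketch under the usual Appendix A assumptions (signal names are descriptive, not the book's):

    def load_use_stall(id_ex_memread, id_ex_rd, if_id_rs, if_id_rt):
        # Stall if the instruction in EX is a load whose destination is a
        # source of the instruction currently in ID
        return id_ex_memread and id_ex_rd in (if_id_rs, if_id_rt)

    # lw r1, 0(r2) in EX while sub r4, r1, r6 is in ID -> stall one cycle
    print(load_use_stall(True, 1, 1, 6))  # True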

Software Scheduling to Avoid Load Hazards

Try producing fast code for

  a = b + c;
  d = e - f;

assuming a, b, c, d, e, and f are in memory.

Slow code:          Fast code:
  LW  Rb, b           LW  Rb, b
  LW  Rc, c           LW  Rc, c
  ADD Ra, Rb, Rc      LW  Re, e
  SW  a, Ra           ADD Ra, Rb, Rc
  LW  Re, e           LW  Rf, f
  LW  Rf, f           SW  a, Ra
  SUB Rd, Re, Rf      SUB Rd, Re, Rf
  SW  d, Rd           SW  d, Rd

Control Hazard on Branches: Three Stage Stall

(Figure: pipeline diagram for
  10: beq r1, r3, 36
  14: and r2, r3, r5
  18: or  r6, r1, r7
  22: add r8, r1, r9
  36: xor r10, r1, r11
the three instructions after the branch are already fetched before the branch resolves.)

What do you do with the 3 instructions in between? How do you do it? Where is the “commit”?

Branch Stall Impact

• If CPI = 1 and 30% branches with a 3-cycle stall => new CPI = 1.9!
• Two-part solution:
  – Determine branch taken or not sooner, AND
  – Compute taken branch address earlier
• MIPS branch tests if register = 0 or != 0
• MIPS Solution:
  – Move zero test to ID/RF stage
  – Adder to calculate new PC in ID/RF stage
  – 1 clock cycle penalty for branch versus 3

Pipelined MIPS Datapath (Figure A.24, Page A-38)

(Figure: the pipelined datapath with the zero test and the branch-target adder moved into the Instr. Decode / Reg. Fetch stage, so the next PC can be selected one cycle earlier.)

• Interplay of instruction set design and cycle time.

Four Branch Hazard Alternatives

#1: Stall until branch direction is clear

#2: Predict Branch Not Taken
  – Execute successor instructions in sequence
  – “Squash” instructions in pipeline if branch actually taken
  – Advantage of late pipeline state update
  – 47% MIPS branches not taken on average
  – PC+4 already calculated, so use it to get next instruction

#3: Predict Branch Taken
  – 53% MIPS branches taken on average
  – But haven’t calculated branch target address in MIPS
    » MIPS still incurs 1 cycle branch penalty
    » Other machines: branch target known before outcome

Four Branch Hazard Alternatives

#4: Delayed Branch
  – Define branch to take place AFTER a following instruction

      branch instruction
      sequential successor 1
      sequential successor 2
      . . .
      sequential successor n
      branch target if taken

    (branch delay of length n)

  – 1 slot delay allows proper decision and branch target address in 5-stage pipeline
  – MIPS uses this

Scheduling Branch Delay Slots (Fig A.14)

A. From before branch:
     add $1, $2, $3
     if $2 = 0 then
       [delay slot]
   becomes:
     if $2 = 0 then
       add $1, $2, $3

B. From branch target:
     sub $4, $5, $6
     ...
     add $1, $2, $3
     if $1 = 0 then
       [delay slot]
   becomes:
     add $1, $2, $3
     if $1 = 0 then
       sub $4, $5, $6

C. From fall through:
     add $1, $2, $3
     if $1 = 0 then
       [delay slot]
     sub $4, $5, $6
   becomes:
     add $1, $2, $3
     if $1 = 0 then
       sub $4, $5, $6

• A is the best choice: fills delay slot & reduces instruction count (IC)
• In B, the sub instruction may need to be copied, increasing IC
• In B and C, must be okay to execute sub when branch fails

Delayed Branch

• Compiler effectiveness for single branch delay slot:
  – Fills about 60% of branch delay slots
  – About 80% of instructions executed in branch delay slots useful in computation
  – About 50% (60% x 80%) of slots usefully filled
• Delayed Branch downside: as processors go to deeper pipelines and multiple issue, the branch delay grows and needs more than one delay slot
  – Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches
  – Growth in available transistors has made dynamic approaches relatively cheaper

Evaluating Branch Alternatives

Assume 4% unconditional branch, 6% conditional branch untaken, 10% conditional branch taken.

Scheduling scheme    Branch penalty   CPI    speedup v. unpipelined   speedup v. stall
Stall pipeline       3                1.60   3.1                      1.0
Predict taken        1                1.20   4.2                      1.33
Predict not taken    1                1.14   4.4                      1.40
Delayed branch       0.5              1.10   4.5                      1.45
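
These CPIs follow directly from CPI = 1 + (branch frequency x penalty); a short Python check using the table's penalties:

    uncond, cond_untaken, cond_taken = 0.04, 0.06, 0.10
    branches = uncond + cond_untaken + cond_taken  # 20% of instructions

    schemes = {
        "Stall pipeline":    1 + branches * 3,
        "Predict taken":     1 + branches * 1,
        # untaken conditionals are free; taken + unconditional pay 1 cycle
        "Predict not taken": 1 + (uncond + cond_taken) * 1,
        "Delayed branch":    1 + branches * 0.5,
    }
    stall_cpi = schemes["Stall pipeline"]
    for name, cpi in schemes.items():
        print(f"{name}: CPI = {cpi:.2f}, speedup v. stall = {stall_cpi / cpi:.2f}")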

Problems with Pipelining

• Exception: an unusual event happens to an instruction during its execution
  – Examples: divide by zero, undefined opcode
• Interrupt: hardware signal to switch the processor to a new instruction stream
  – Example: a sound card interrupts when it needs more audio output samples (an audio “click” happens if it is left waiting)
• Problem: the exception or interrupt must appear to occur between 2 instructions (Ii and Ii+1)
  – The effect of all instructions up to and including Ii is totally complete
  – No effect of any instruction after Ii can have taken place
• The interrupt (exception) handler either aborts the program or restarts at instruction Ii+1

Precise Exceptions in Static Pipelines

Key observation: architected state only changes in the memory and register write-back stages.

Memory Hierarchy Review

Since 1980, CPU has outpaced DRAM...

(Figure: performance (1/latency) vs. year, 1980-2000, log scale: CPU improves 60% per year (2x in 1.5 years), DRAM improves 9% per year (2x in 10 years); the gap grew 50% per year.)

• How do architects address this gap?
  – Put small, fast “cache” memories between CPU and DRAM.
  – Create a “memory hierarchy”.

1977: DRAM faster than microprocessors

Apple ][ (1977): CPU: 1000 ns, DRAM: 400 ns

(Photo: Steve Jobs and Steve Wozniak.)

Memory Hierarchy of a Modern Computer

• Take advantage of the principle of locality to:
  – Present as much memory as in the cheapest technology
  – Provide access at speed offered by the fastest technology

(Figure: Processor (datapath, registers, on-chip cache) -> Second Level Cache (SRAM) -> Main Memory (DRAM) -> Secondary Storage (Disk) -> Tertiary Storage (Tape). Speeds run from ~1 ns at the registers through 10s-100s ns for the caches and DRAM to 10s ms for disk and 10s sec for tape; sizes run from 100s of bytes (registers) through Ks-Ms (caches), Ms (DRAM), Gs (disk), to Ts (tape).)

The Principle of Locality

• The Principle of Locality:
  – Programs access a relatively small portion of the address space at any instant of time.
• Two Different Types of Locality:
  – Temporal Locality (Locality in Time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  – Spatial Locality (Locality in Space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
• Last 15 years, HW relied on locality for speed
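
Both kinds of locality show up in an ordinary array traversal. A small Python sketch of the access pattern (Python lists are not contiguous the way C arrays are, so treat this as an illustration, not a benchmark):

    N = 256
    grid = [[1] * N for _ in range(N)]

    def row_major():
        # Consecutive addresses within a row: good spatial locality
        return sum(grid[i][j] for i in range(N) for j in range(N))

    def col_major():
        # Strides a whole row between accesses: poor spatial locality
        return sum(grid[i][j] for j in range(N) for i in range(N))

    # Same answer either way; on cached hardware with contiguous arrays,
    # the row-major order is the one the cache rewards. The loop variables
    # and the running sum are the temporal-locality part.
    assert row_major() == col_major() == N * N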

Programs with locality cache well...

(Figure: memory address, one dot per access, plotted against time; dense horizontal bands show temporal locality, diagonal runs show spatial locality, and scattered regions show bad locality behavior.)

Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)

Memory Hierarchy: Apple iMac G5 (1.6 GHz)

                 Reg      L1 Inst   L1 Data   L2       DRAM     Disk
Size             1K       64K       32K       512K     256M     80G
Latency          1,       3,        3,        11,      88,      ~10^7,
(cycles, time)   0.6 ns   1.9 ns    1.9 ns    6.9 ns   55 ns    12 ms

Managed by compiler: Reg. Managed by hardware: caches. Managed by OS, hardware, application: DRAM, disk.

Goal: Illusion of large, fast, cheap memory. Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access.

iMac’s PowerPC 970: All caches on-chip

(Die photo: Registers (1K), L1 (64K Instruction), L1 (32K Data), 512K L2.)

Memory Hierarchy: Terminology

• Hit: data appears in some block in the upper level (example: Block X)
  – Hit Rate: the fraction of memory accesses found in the upper level
  – Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
• Miss: data needs to be retrieved from a block in the lower level (Block Y)
  – Miss Rate = 1 - (Hit Rate)
  – Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
• Hit Time << Miss Penalty (500 instructions on 21264!)

(Figure: the processor reads Block X from the upper-level memory; on a miss, Block Y comes from the lower-level memory.)
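
These terms combine into the usual average memory access time formula, AMAT = Hit time + Miss rate x Miss penalty, which is implicit in this slide. A one-function sketch:

    def amat(hit_time, miss_rate, miss_penalty):
        # Every access pays the hit time; misses also pay the penalty
        return hit_time + miss_rate * miss_penalty

    # e.g., 1-cycle hit, 5% miss rate, 100-cycle penalty -> 6 cycles average
    print(amat(1, 0.05, 100))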

4 Questions for Memory Hierarchy

• Q1: Where can a block be placed in the upper level? (Block placement)
• Q2: How is a block found if it is in the upper level? (Block identification)
• Q3: Which block should be replaced on a miss? (Block replacement)
• Q4: What happens on a write? (Write strategy)

Q1: Where can a block be placed in the upper level?

• Block 12 placed in an 8-block cache:
  – Fully associative, direct mapped, 2-way set associative
  – S.A. Mapping = Block Number Modulo Number of Sets

(Figure: cache blocks 0-7 above memory blocks 0-31.
  Fully associative (full mapped): block 12 can go in any cache block.
  Direct mapped: block 12 can go only in block (12 mod 8) = 4.
  2-way set associative: block 12 can go anywhere in set (12 mod 4) = 0.)
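
A few lines of Python make the three placement rules concrete (8-block cache, as in the figure; the set-contiguous block numbering is my convention):

    def candidate_blocks(block_addr, num_blocks, assoc):
        # assoc = 1 -> direct mapped; assoc = num_blocks -> fully associative
        num_sets = num_blocks // assoc
        s = block_addr % num_sets
        return list(range(s * assoc, (s + 1) * assoc))

    print(candidate_blocks(12, 8, 1))  # direct mapped: [4]
    print(candidate_blocks(12, 8, 2))  # 2-way: set 0 -> [0, 1]
    print(candidate_blocks(12, 8, 8))  # fully associative: [0, ..., 7]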

A Summary on Sources of Cache Misses

• Compulsory (cold start or process migration, first reference): first access to a block
  – “Cold” fact of life: not a whole lot you can do about it
  – Note: if you are going to run “billions” of instructions, compulsory misses are insignificant
• Capacity:
  – Cache cannot contain all blocks accessed by the program
  – Solution: increase cache size
• Conflict (collision):
  – Multiple memory locations mapped to the same cache location
  – Solution 1: increase cache size
  – Solution 2: increase associativity
• Coherence (Invalidation): other process (e.g., I/O) updates memory

Q2: How is a block found if it is in the upper level?

Block Address = Tag | Index, followed by a Block offset (data select) field.

• Index used to look up candidates in cache
  – Index identifies the set
• Tag used to identify actual copy
  – If no candidates match, then declare cache miss
• Block is minimum quantum of caching
  – Data select field used to select data within block
  – Many caching applications don’t have data select field

Direct Mapped Cache

• Direct Mapped 2^N byte cache:
  – The uppermost (32 - N) bits are always the Cache Tag
  – The lowest M bits are the Byte Select (Block Size = 2^M)
• Example: 1 KB Direct Mapped Cache with 32 B Blocks
  – Index chooses potential block
  – Tag checked to verify block
  – Byte select chooses byte within block

(Figure: 32-bit address split into Cache Tag (bits 31-10, ex: 0x50), Cache Index (bits 9-5, ex: 0x01), Byte Select (bits 4-0, ex: 0x00). The index selects one line; its valid bit and stored tag (0x50) are checked, and the 32-byte data block (Byte 0 ... Byte 31, Byte 32 ... Byte 63, ..., up to Byte 1023) supplies the data.)
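
The same address split in Python for the 1 KB / 32 B example (N = 10, M = 5):

    CACHE_BITS = 10  # log2(1 KB)
    BLOCK_BITS = 5   # log2(32 B)

    def split_address(addr):
        byte_select = addr & ((1 << BLOCK_BITS) - 1)                            # bits 4-0
        index = (addr >> BLOCK_BITS) & ((1 << (CACHE_BITS - BLOCK_BITS)) - 1)   # bits 9-5
        tag = addr >> CACHE_BITS                                                # bits 31-10
        return tag, index, byte_select

    # The slide's example address: tag 0x50, index 0x01, byte select 0x00
    print(split_address((0x50 << 10) | (0x01 << 5) | 0x00))  # (80, 1, 0)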

Set Associative Cache

• N-way set associative: N entries per Cache Index
  – N direct mapped caches operate in parallel
• Example: Two-way set associative cache
  – Cache Index selects a “set” from the cache
  – Two tags in the set are compared to input in parallel
  – Data is selected based on the tag result

(Figure: address split into Cache Tag (bits 31-9), Cache Index (bits 8-5), Byte Select (bits 4-0). The index reads one entry from each way; both stored tags are compared with the incoming tag in parallel, the compare results are ORed into Hit, and Sel0/Sel1 drive the mux that picks the cache block.)

Fully Associative Cache

• Fully Associative: every block can hold any line
  – Address does not include a cache index
  – Compare Cache Tags of all cache entries in parallel
• Example: Block Size = 32 B
  – We need N 27-bit comparators
  – Still have byte select to choose from within block

(Figure: address split into Cache Tag (bits 31-5, 27 bits long) and Byte Select (bits 4-0, ex: 0x01); every entry's valid bit and tag are compared simultaneously against the incoming tag, over data Byte 0 ... Byte 31, Byte 32 ... Byte 63, ...)

Q3: Which block should be replaced on a miss?

• Easy for Direct Mapped
• Set Associative or Fully Associative:
  – LRU (Least Recently Used): appealing, but hard to implement for high associativity
  – Random: easy, but how well does it work?

Miss rates (LRU vs. Random):

         2-way           4-way           8-way
Size     LRU    Random   LRU    Random   LRU    Random
16K      5.2%   5.7%     4.7%   5.3%     4.4%   5.0%
64K      1.9%   2.0%     1.5%   1.7%     1.4%   1.5%
256K     1.15%  1.17%    1.13%  1.13%    1.12%  1.12%

Q4: What happens on a write?

                                   Write Through                  Write Back
Policy                             Data written to cache block    Write data only to the cache;
                                   is also written to             update lower level when a
                                   lower-level memory             block falls out of the cache
Debug                              Easy                           Hard
Do read misses produce writes?     No                             Yes
Do repeated writes make it to      Yes                            No
lower level?

Additional option: let writes to an un-cached address allocate a new cache line (“write allocate”).

Write Buffers for Write-Through Caches

(Figure: Processor -> Cache -> Lower Level Memory, with a Write Buffer alongside holding data awaiting write-through to lower-level memory.)

Q. Why a write buffer?  A. So CPU doesn’t stall.
Q. Why a buffer, why not just one register?  A. Bursts of writes are common.
Q. Are Read After Write (RAW) hazards an issue for the write buffer?  A. Yes! Drain buffer before next read, or check write buffer for match on reads.

5 Basic Cache Optimizations

• Reducing Miss Rate
  1. Larger Block size (compulsory misses)
  2. Larger Cache size (capacity misses)
  3. Higher Associativity (conflict misses)
• Reducing Miss Penalty
  4. Multilevel Caches
• Reducing Hit Time
  5. Giving Reads Priority over Writes
     • E.g., read completes before earlier writes in write buffer

Virtual Memory

What is virtual memory?

(Figure: a virtual address (virtual page number + offset) indexes, via the Page Table Base Reg, into a page table located in physical memory; each entry holds a valid bit (V), access rights, and a physical page number, which joins the offset to form the physical address.)

• Virtual memory => treat memory as a cache for the disk
• Terminology: blocks in this cache are called “Pages”
  – Typical size of a page: 1K — 8K
• Page table maps virtual page numbers to physical frames
  – “PTE” = Page Table Entry
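
A minimal sketch of that translation in Python, with a dict standing in for the in-memory page table (the sizes and the fault behavior are illustrative):

    PAGE_SIZE = 4096  # 4K pages -> 12 offset bits

    # page table: virtual page number -> (valid, physical frame number)
    page_table = {0: (True, 7), 1: (True, 3), 2: (False, None)}

    def translate(vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        valid, pfn = page_table.get(vpn, (False, None))
        if not valid:
            raise RuntimeError("page fault")  # the OS would handle this
        return pfn * PAGE_SIZE + offset

    print(hex(translate(0x1234)))  # VPN 1 -> frame 3: 0x3234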

Three Advantages of Virtual Memory

• Translation:
  – Program can be given consistent view of memory, even though physical memory is scrambled
  – Makes multithreading reasonable (now used a lot!)
  – Only the most important part of program (“Working Set”) must be in physical memory
  – Contiguous structures (like stacks) use only as much physical memory as necessary yet still grow later
• Protection:
  – Different threads (or processes) protected from each other
  – Different pages can be given special behavior
    » (Read Only, Invisible to user programs, etc.)
  – Kernel data protected from User programs
  – Very important for protection from malicious programs
• Sharing:
  – Can map same physical page to multiple users (“Shared memory”)

Large Address Space Support

(Figure: virtual address = P1 index (10 bits) | P2 index (10 bits) | offset (12 bits). The PageTablePtr selects a first-level table whose 4-byte entries point at second-level tables; their 4-byte entries supply the physical page # that joins the offset to form the physical address (4KB pages).)

• Single-Level Page Table is Large
  – 4KB pages for a 32-bit address => 1M entries
  – Each process needs its own page table!
• Multi-Level Page Table
  – Can allow sparseness of page table
  – Portions of table can be swapped to disk

VM and Disk: Page replacement policy

(Figure: clock algorithm over the set of all pages in memory. Each page table entry has a dirty bit (page written) and a used bit (set to 1 on any reference). The tail pointer clears the used bit in the page table; the head pointer places pages on the free list if the used bit is still clear, and schedules pages with the dirty bit set to be written to disk first. Freed pages go on the freelist of free pages.)

Architect’s role: support setting dirty and used bits.

Translation Look-Aside Buffers

• Translation Look-Aside Buffer (TLB)
  – Cache on translations
  – Fully Associative, Set Associative, or Direct Mapped

(Figure: CPU issues a VA; on a TLB hit the PA goes straight to the cache; on a miss the translation unit walks the page table, and data returns from the cache or main memory.)

• TLBs are:
  – Small – typically not more than 128 – 256 entries
  – Fully Associative
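
Continuing the page-table sketch from the “What is virtual memory?” slide, a TLB is just a small cache in front of the page walk (illustrative Python; PAGE_SIZE and translate are the hypothetical names from that sketch):

    tlb = {}  # VPN -> PFN; a real TLB caps this at ~128-256 entries

    def translate_with_tlb(vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in tlb:                       # TLB hit: no page walk
            return tlb[vpn] * PAGE_SIZE + offset
        paddr = translate(vaddr)             # TLB miss: walk the page table
        tlb[vpn] = paddr // PAGE_SIZE        # fill the TLB for next time
        return paddr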

What Actually Happens on a TLB Miss?

• Hardware-traversed page tables:
  – On TLB miss, hardware in MMU looks at current page table to fill TLB (may walk multiple levels)
    » If PTE valid, hardware fills TLB and processor never knows
    » If PTE marked as invalid, causes Page Fault, after which kernel decides what to do
• Software-traversed page tables (like MIPS):
  – On TLB miss, processor receives TLB fault
  – Kernel traverses page table to find PTE
    » If PTE valid, fills TLB and returns from fault
    » If PTE marked as invalid, internally calls Page Fault handler
• Most chip sets provide hardware traversal
  – Modern operating systems tend to have more TLB faults since they use translation for many things
  – Examples:
    » shared segments
    » user-level portions of an operating system

Example: MIPS R3000 pipeline

Pipeline: Inst Fetch (TLB, I-Cache) -> Dcd/Reg (RF) -> ALU / E.A. -> Memory (E.A. TLB, D-Cache) -> Write Reg (WB)

TLB: 64 entries, on-chip, fully associative; software TLB fault handler.

Virtual Address Space: ASID (6 bits) | V. Page Number (20 bits) | Offset (12 bits)
  0xx  User segment (caching based on PT/TLB entry)
  100  Kernel physical space, cached
  101  Kernel physical space, uncached
  11x  Kernel virtual space

Allows context switching among 64 user processes without TLB flush.

Reducing translation time further

• As described, TLB lookup is in series with cache lookup:

(Figure: the virtual address (V page no. + offset) must pass through the TLB lookup — valid bit, access rights, P page no. — before the physical address can be formed and presented to the cache.)

• Machines with TLBs go one step further: they overlap TLB lookup with cache access.
  – Works because offset available early

Overlapping TLB & Cache Access

• Here is how this might work with a 4K cache:

(Figure: the 20-bit page # goes to the associative TLB lookup while the 10-bit index plus 2-bit disp ('00') reads the 4K cache (1K lines of 4 bytes); the TLB's frame number (FN) is then compared against the cache tag, producing Hit/Miss in parallel with data readout.)

• What if cache size is increased to 8KB?
  – Overlap not complete
  – Need to do something else. See CS 152/252
• Another option: Virtual Caches
  – Tags in cache are virtual addresses
  – Translation only happens on cache misses

Problems With Overlapped TLB Access

• Overlapped access requires that the address bits used to index into the cache do not change as a result of VA translation
  – This usually limits things to small caches, large page sizes, or high-n-way set associative caches if you want a large cache
• Example: suppose everything the same except that the cache is increased to 8K bytes instead of 4K:

(Figure: the cache index is now 11 bits + 2-bit disp, but the page offset is still 12 bits; index bit 13 is changed by VA translation, but is needed for cache lookup.)

Solutions: go to 8K-byte page sizes; go to a 2-way set associative cache (1K x 4 x 2); or SW guarantee VA[13] = PA[13]
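
The constraint generalizes to: cache size / associativity must not exceed the page size if all index bits are to come from the untranslated offset. A small Python check (my formulation of the rule on this slide):

    def overlap_ok(cache_bytes, associativity, page_bytes):
        # Index + byte-select bits must fit inside the page offset
        return cache_bytes // associativity <= page_bytes

    print(overlap_ok(4 * 1024, 1, 4 * 1024))   # True: the 4K example works
    print(overlap_ok(8 * 1024, 1, 4 * 1024))   # False: bit 13 is translated
    print(overlap_ok(8 * 1024, 2, 4 * 1024))   # True: 2-way fixes it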

Summary: Control and Pipelining

• Next time: Read Appendix A
• Control via State Machines and Microprogramming
• Just overlap tasks; easy if tasks are independent
• Speedup <= Pipeline Depth; if ideal CPI is 1, then:

  Speedup = Pipeline depth / (1 + Pipeline stall CPI) x (Cycle Time_unpipelined / Cycle Time_pipelined)

• Hazards limit performance on computers:
  – Structural: need more HW resources
  – Data (RAW, WAR, WAW): need forwarding, compiler scheduling
  – Control: delayed branch, prediction
• Exceptions, Interrupts add complexity
• Next time: Read Appendix C, record bugs online!

Summary #1/3: The Cache Design Space

• Several interacting dimensions
  – cache size
  – block size
  – associativity
  – replacement policy
  – write-through vs write-back
  – write allocation
• The optimal choice is a compromise
  – depends on access characteristics
    » workload
    » use (I-cache, D-cache, TLB)
  – depends on technology / cost
• Simplicity often wins

(Figure: the design space sketched over cache size, associativity, and block size, with a “Good” region between “Bad” extremes: less of Factor A vs. more of Factor B.)

Summary #2/3: Caches

• The Principle of Locality:
  – Programs access a relatively small portion of the address space at any instant of time.
    » Temporal Locality: Locality in Time
    » Spatial Locality: Locality in Space
• Three Major Categories of Cache Misses:
  – Compulsory Misses: sad facts of life. Example: cold start misses.
  – Capacity Misses: increase cache size
  – Conflict Misses: increase cache size and/or associativity. Nightmare Scenario: ping-pong effect!
• Write Policy: Write Through vs. Write Back
• Today CPU time is a function of (ops, cache misses) vs. just f(ops): affects Compilers, Data structures, and Algorithms

Summary #3/3: TLB, Virtual Memory

• Page tables map virtual address to physical address
• TLBs are important for fast translation
• TLB misses are significant in processor performance
  – funny times, as most systems can’t access all of 2nd-level cache without TLB misses!
• Caches, TLBs, Virtual Memory are all understood by examining how they deal with 4 questions:
  1) Where can a block be placed?
  2) How is a block found?
  3) Which block is replaced on a miss?
  4) How are writes handled?
• Today VM allows many processes to share a single memory without having to swap all processes to disk; today VM protection is more important than the memory hierarchy benefits, but computers remain insecure
• Prepare for debate + quiz on Wednesday