Advanced Computer Architectures Module-3: Memory Hierarchy Design and Optimizations

Introduction
● Even a sophisticated processor may perform well below an ordinary one unless it is supported by a memory system of matching performance.
● The focus of this module: study how memory system performance has been enhanced through various innovations and optimizations.

Typical Memory Hierarchy
● Proc/Regs → L1 Cache → L2 Cache → L3 Cache (optional) → Memory → Disk, Tape, etc. (each level is bigger, but slower, than the one above it)
● Here we focus on the L1/L2/L3 caches, virtual memory, and main memory.

What is the Role of a Cache?
● A small, fast storage used to improve the average access time to a slow memory.
● Improves memory system performance by exploiting locality, both spatial and temporal.

Four Basic Questions
● Q1: Where can a block be placed in the cache? (Block placement) – Fully Associative, Set Associative, Direct Mapped
● Q2: How is a block found if it is in the cache? (Block identification) – Tag/Block
● Q3: Which block should be replaced on a miss? (Block replacement) – Random, LRU
● Q4: What happens on a write? (Write strategy) – Write Back or Write Through (with Write Buffer)

Block Placement
● If a block has only one possible place in the cache: direct mapped.
● If a block can be placed anywhere: fully associative.
● If a block can be placed in a restricted subset of the possible places: set associative.
  – If there are n blocks in each subset: n-way set associative.
● Note that direct-mapped = 1-way set associative.

Trade-offs
● n-way set associative becomes increasingly difficult (costly) to implement for large n:
  – Most caches today are either 1-way (direct mapped), 2-way, or 4-way set associative.
● The larger n is, the lower the likelihood of thrashing:
  – e.g., two blocks competing for the same block frame and being accessed in sequence over and over.

Block Identification (cont…)
● Given an address, how do we find where it goes in the cache?
● This is done by first breaking the address into three parts:
  – tag: used for identifying a match
  – set index: selects the set in the cache
  – block offset: the offset of the address within the cache block
● The block address consists of the tag and set index; the block offset is not part of it.
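
As a concrete illustration, the sketch below splits a 32-bit byte address into these three fields. The cache geometry (a 32 KB direct-mapped cache with 64-byte blocks) is only an assumed example, not a parameter from the slides.

```python
CACHE_SIZE = 32 * 1024                          # bytes (assumed example geometry)
BLOCK_SIZE = 64                                 # bytes per block
NUM_SETS   = CACHE_SIZE // BLOCK_SIZE           # 512 sets, 1 block each (direct mapped)

OFFSET_BITS = (BLOCK_SIZE - 1).bit_length()     # log2(64)  = 6
INDEX_BITS  = (NUM_SETS - 1).bit_length()       # log2(512) = 9

def split_address(addr: int):
    """Return (tag, set_index, block_offset) for a byte address."""
    block_offset = addr & (BLOCK_SIZE - 1)
    set_index    = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag          = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, set_index, block_offset

tag, idx, off = split_address(0x1234ABCD)
print(f"tag=0x{tag:x}  set index={idx}  block offset={off}")
```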

Cache Write Policies
● Write-through: information is written to both the block in the cache and the block in memory.
● Write-back: information is written back to memory only when a block frame is replaced.
  – Uses a "dirty" bit to indicate whether a block was actually written to.
  – Saves unnecessary writes to memory when a block is "clean".

Trade-offs
● Write back:
  – Faster because writes occur at the speed of the cache, not the memory.
  – Faster because multiple writes to the same block are written back to memory only once; uses less memory bandwidth.
● Write through:
  – Easier to implement.

Write Allocate, No-write Allocate
● On a read miss, a block has to be brought in from a lower-level memory.
● What happens on a write miss? Two options:
  – Write allocate: a block is allocated in the cache.
  – No-write allocate: no block is allocated; the data is just written to main memory.

Write Allocate, No-write Allocate (cont…)
● In no-write allocate, only blocks that are read from can be in the cache.
  – Write-only blocks are never in the cache.
● But typically:
  – Write allocate is used with write-back.
  – No-write allocate is used with write-through.

Memory System Performance
● Memory system performance is largely captured by three parameters: latency, bandwidth, and average memory access time (AMAT).
● Latency: the time from the issue of a memory request until the data is available at the processor.
● Bandwidth: the rate at which data can be pumped to the processor by the memory system.

Average Memory Access Time (AMAT)
● AMAT: the average time it takes for the processor to get a data item it requests.
● The time it takes to get requested data to the processor can vary, due to the memory hierarchy.
● AMAT can be expressed as:
  AMAT = Hit time + Miss rate × Miss penalty
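
For reference, a minimal sketch of this single-level formula (the numbers are purely illustrative):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time for a single-level cache, in cycles."""
    return hit_time + miss_rate * miss_penalty

# Illustrative values: 1-cycle hit, 2% miss rate, 100-cycle miss penalty.
print(amat(hit_time=1, miss_rate=0.02, miss_penalty=100))   # -> 3.0 cycles
```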

Modern Computer Architectures – Lecture 21: Memory System Basics

Cache Performance Parameters
● Performance of a cache is largely determined by:
  – Cache miss rate: the number of cache misses divided by the number of accesses.
  – Cache hit time: the time between sending the address and the data returning from the cache.
  – Cache miss penalty: the extra processor stall cycles caused by an access to the next-level cache.

Impact of Memory System on Processor Performance
● CPI with memory stalls = CPI without stalls + Memory stall CPI
● Memory stall CPI = Misses per instruction × Miss penalty
                   = Memory accesses per instruction × Miss rate × Miss penalty
● Example: assume 0.2 data memory accesses per instruction, a 2% miss rate, and a 400-cycle miss penalty. How much is the memory stall CPI?
  Memory stall CPI = 0.2 × 0.02 × 400 = 1.6 cycles
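
A short sketch that reproduces this arithmetic (the base CPI of 1.0 in the last line is only an assumed value for illustration):

```python
def memory_stall_cpi(accesses_per_instr, miss_rate, miss_penalty):
    """Extra CPI contributed by cache misses."""
    return accesses_per_instr * miss_rate * miss_penalty

stall = memory_stall_cpi(accesses_per_instr=0.2, miss_rate=0.02, miss_penalty=400)
print(f"memory stall CPI = {stall}")                 # 1.6 cycles, as on the slide
print(f"total CPI (assuming base CPI = 1.0) = {1.0 + stall}")
```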

CPU performance with memory stalls: CPI = CPI without stalls + Memory stall CPI

Performance Example 1
● Suppose:
  – Clock rate = 200 MHz (5 ns per cycle), ideal (no misses) CPI = 1.1
  – 50% arith/logic, 30% load/store, 20% control instructions
  – 10% of data memory operations incur a 50-cycle miss penalty
  – 1% of instruction memory operations also incur a 50-cycle miss penalty
● Compute the AMAT.

Performance Example 1 (cont…)
● CPI = ideal CPI + average stalls per instruction
      = 1.1 (cycles/ins) + [0.30 (DataMops/ins) × 0.10 (miss/DataMop) × 50 (cycles/miss)] + [1 (InstMop/ins) × 0.01 (miss/InstMop) × 50 (cycles/miss)]
      = (1.1 + 1.5 + 0.5) cycles/ins = 3.1
● AMAT = (1/1.3) × [1.1 + 0.01 × 50] + (0.3/1.3) × [1.1 + 0.1 × 50] = 2.63 cycles

Example 2
● Assume 20% load/store instructions.
● Assume the CPI without memory stalls is 1.
● Cache hit time = 1 cycle.
● Cache miss penalty = 100 cycles.
● Miss rate = 1%.
● What are:
  – the stall cycles per instruction?
  – the average memory access time?
  – the CPI with and without the cache?

Example 2: Answer
● Average memory accesses per instruction = 1.2
● AMAT = 1 + 0.01 × 100 = 2 cycles
● Stall cycles per instruction = 1.2 × 0.01 × 100 = 1.2
● CPI with cache = 1 + 1.2 = 2.2
● CPI without cache = 1 + 1.2 × 100 = 121

Example 3
● Which has a lower miss rate: a split cache (16 KB instruction cache + 16 KB data cache) or a 32 KB unified cache? Compute the respective AMATs as well.
● 40% load/store instructions
● Hit time = 1 cycle
● Miss penalty = 100 cycles
● A simulator showed:
  – 40 misses per thousand instructions for the data cache
  – 4 misses per thousand instructions for the instruction cache
  – 44 misses per thousand instructions for the unified cache

Example 3: Answer
● Miss rate = (misses/instruction) / (memory accesses/instruction)
● Instruction cache miss rate = 4/1000 = 0.004
● Data cache miss rate = (40/1000)/0.4 = 0.1
● Unified cache miss rate = (44/1000)/1.4 ≈ 0.031
● Overall miss rate for the split cache ≈ 0.7 × 0.004 + 0.3 × 0.1 ≈ 0.033 (weighting by the fraction of accesses that are instruction fetches vs. data)

Example 3: Answer (cont…)
● AMAT (split cache) = 0.7 × (1 + 0.004 × 100) + 0.3 × (1 + 0.1 × 100) ≈ 4.3
● AMAT (unified) = 0.7 × (1 + 0.031 × 100) + 0.3 × (1 + 1 + 0.031 × 100) ≈ 4.4
  (the extra cycle on data accesses models the structural hazard of the single-ported unified cache)
● So the unified cache has a slightly lower miss rate, but the split cache has the lower AMAT.
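
The sketch below redoes Example 3 end to end, using the slide's rounded access weights of 0.7 (instruction fetches) and 0.3 (data accesses):

```python
# Example 3: split (16 KB I + 16 KB D) vs a 32 KB unified cache.
misses_per_1000_instr = {"icache": 4, "dcache": 40, "unified": 44}
data_accesses_per_instr = 0.4                 # 40% load/store
accesses_per_instr = 1 + data_accesses_per_instr
hit_time, miss_penalty = 1, 100
w_instr, w_data = 0.7, 0.3                    # rounded access weights from the slide

# Convert misses per 1000 instructions into per-access miss rates.
mr_i = misses_per_1000_instr["icache"] / 1000
mr_d = misses_per_1000_instr["dcache"] / 1000 / data_accesses_per_instr
mr_u = misses_per_1000_instr["unified"] / 1000 / accesses_per_instr
print(f"miss rates: I={mr_i:.3f}  D={mr_d:.3f}  unified={mr_u:.3f}")

amat_split = w_instr * (hit_time + mr_i * miss_penalty) \
           + w_data * (hit_time + mr_d * miss_penalty)
# The single-ported unified cache makes a data access wait one extra cycle
# behind the instruction fetch (the structural hazard).
amat_unified = w_instr * (hit_time + mr_u * miss_penalty) \
             + w_data * (hit_time + 1 + mr_u * miss_penalty)
print(f"AMAT split   = {amat_split:.2f} cycles")    # ~4.3
print(f"AMAT unified = {amat_unified:.2f} cycles")  # ~4.4
```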

Unified vs Split Caches
● A load or store instruction requires two memory accesses:
  – One for the instruction itself.
  – One for the data.
● Therefore, a unified cache causes a structural hazard!
● Modern processors use separate data and instruction L1 caches (as opposed to "unified" or "mixed" caches):
  – The CPU sends the instruction and data addresses simultaneously to the two ports.
  – Both caches can be configured differently: size, associativity, etc.

Modern Computer Architectures – Lecture 22: Memory Hierarchy Optimizations

Unified vs Split Caches
● Separate instruction and data caches:
  – Avoid the structural hazard.
  – Each cache can also be tailored to its specific need.
● [Diagram: unified organization: Processor → Unified Cache-1 → Unified Cache-2; split organization: Processor → I-Cache-1 and D-Cache-1 → Unified Cache-2.]

Example 4
● Assume 16 KB instruction and data caches:
  – Instruction miss rate = 0.64%, data miss rate = 6.47%
  – 32 KB unified cache: aggregate miss rate = 1.99%
  – Assume 33% data operations, so 75% of accesses are instruction fetches (1.0/1.33)
  – Hit time = 1, miss penalty = 50
  – A data hit incurs 1 additional stall cycle for the unified cache
  – Which is better (ignoring the L2 cache)?
● AMAT (split) = 75% × (1 + 0.64% × 50) + 25% × (1 + 6.47% × 50) = 2.05
● AMAT (unified) = 75% × (1 + 1.99% × 50) + 25% × (1 + 1 + 1.99% × 50) = 2.24

Example 5 ● ● What is the impact of 2 different cache organizations on

Example 5 ● ● What is the impact of 2 different cache organizations on the performance of CPU? Clock cycle time 1 nsec 50% load/store instructions Size of both caches 64 KB: – – ● ● ● Both caches have block size 64 KB one is direct mapped the other is 2 -way sa. Cache miss penalty=75 ns for both caches Miss rate DM= 1. 4% Miss rate SA=1% CPU cycle time must be stretched 25% to accommodate the multiplexor for the SA 30

Example 5: Solution
● AMAT (DM) = 1 + (0.014 × 75) = 2.05 ns
● AMAT (SA) = 1 × 1.25 + (0.01 × 75) = 2.0 ns
● CPU time = IC × (CPI × clock cycle time + memory accesses per instruction × miss rate × miss penalty)
● CPU time (DM) = IC × (2 × 1.0 + 1.5 × 0.014 × 75) = 3.58 × IC ns
● CPU time (SA) = IC × (2 × 1.25 + 1.5 × 0.01 × 75) = 3.63 × IC ns
● So the set-associative cache has the better AMAT, but the direct-mapped cache gives the better overall CPU time because of its faster clock.
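
A sketch of the same comparison (1.5 memory accesses per instruction: one fetch plus 0.5 data accesses for 50% load/store):

```python
def cpu_time_per_instr(cpi, cycle_time_ns, accesses_per_instr, miss_rate, miss_penalty_ns):
    """CPU time per instruction in ns: execution time plus memory-stall time."""
    return cpi * cycle_time_ns + accesses_per_instr * miss_rate * miss_penalty_ns

base_cpi, accesses = 2.0, 1.5
dm = cpu_time_per_instr(base_cpi, 1.00, accesses, 0.014, 75)  # direct mapped
sa = cpu_time_per_instr(base_cpi, 1.25, accesses, 0.010, 75)  # 2-way, 25% slower clock
print(f"direct mapped   : {dm:.2f} ns per instruction")       # ~3.58
print(f"2-way set assoc.: {sa:.2f} ns per instruction")       # ~3.63
```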

How to Improve Cache Performance?
1. Reduce the miss rate.
2. Reduce the miss penalty.
3. Reduce the miss penalty or miss rate via parallelism.
4. Reduce the hit time.

1. Reducing Miss Rates
● Techniques:
  – Larger block size
  – Larger cache size
  – Higher associativity
  – Way prediction
  – Pseudo-associativity
  – Compiler optimizations

2. Reducing Miss Penalty
● Techniques:
  – Multilevel caches
  – Victim caches
  – Read miss first
  – Critical word first

3. Reducing Miss Penalty or Miss Rates via Parallelism
● Techniques:
  – Non-blocking caches
  – Hardware prefetching
  – Compiler prefetching

4. Reducing Cache Hit Time
● Techniques:
  – Small and simple caches
  – Avoiding address translation
  – Pipelined cache access
  – Trace caches

Modern Computer Architectures – Lecture 23: Cache Optimizations

Reducing Miss Penalty
● Techniques:
  – Multilevel caches
  – Victim caches
  – Read miss first
  – Critical word first
  – Non-blocking caches

Reducing Miss Penalty (1): Multi-Level Cache
● Add a second-level cache.
● L2 equations:
  AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1
  Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2
  so  AMAT = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)
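
The two-level formula translates directly into a small helper (the example values below are only illustrative, though they match Example 6 later on):

```python
def amat_two_level(hit_l1, miss_rate_l1, hit_l2, local_miss_rate_l2, miss_penalty_l2):
    """AMAT with an L2 cache: an L1 miss costs an L2 access, which may itself miss."""
    miss_penalty_l1 = hit_l2 + local_miss_rate_l2 * miss_penalty_l2
    return hit_l1 + miss_rate_l1 * miss_penalty_l1

# Illustrative values: 1-cycle L1 hit, 4% L1 miss rate, 10-cycle L2 hit,
# 50% L2 local miss rate, 100-cycle L2 miss penalty.
print(amat_two_level(1, 0.04, 10, 0.5, 100))   # -> 3.4 cycles
```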

Multi-Level Cache: Some Definitions
● Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2 above).
● Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU.
  – Global Miss Rate_L2 = Local Miss Rate_L1 × Local Miss Rate_L2
  – The L1 global miss rate = the L1 local miss rate.

Global vs. Local Miss Rates
● At lower-level caches (L2 or L3), global miss rates provide the more useful information:
  – They indicate how effective the cache is in reducing AMAT.
  – Who cares if the miss rate of L3 is 50%, as long as only 1% of processor memory accesses ever reach it?

Performance Improvement Due to L2 Cache: Example 6
● Assume:
  – For every 1000 memory references: 40 misses in L1 and 20 misses in L2
  – L1 hit time: 1 cycle; L2 hit time: 10 cycles; L2 miss penalty = 100 cycles
  – 1.5 memory references per instruction
  – Ideal CPI = 1.0
● Find: the local miss rate, AMAT, and stall cycles per instruction, with and without the L2 cache.

Example 6: Solution
● With the L2 cache:
  – L2 local miss rate = 20/40 = 50%
  – AMAT = 1 + 4% × (10 + 50% × 100) = 3.4
  – Average memory stalls per instruction = (3.4 − 1.0) × 1.5 = 3.6
● Without the L2 cache:
  – AMAT = 1 + 4% × 100 = 5
  – Average memory stalls per instruction = (5 − 1.0) × 1.5 = 6
● Performance improvement with L2 = (6 + 1)/(3.6 + 1) ≈ 1.52, i.e., about 52%
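
The whole example can be checked in a few lines:

```python
# Example 6: per-1000-memory-reference miss counts -> AMAT, stalls, speedup.
refs, l1_misses, l2_misses = 1000, 40, 20
l1_hit, l2_hit, l2_penalty = 1, 10, 100
refs_per_instr, ideal_cpi = 1.5, 1.0

l1_miss_rate   = l1_misses / refs             # 4% (global = local for L1)
l2_local_rate  = l2_misses / l1_misses        # 50%
l2_global_rate = l2_misses / refs             # 2%

amat_l2    = l1_hit + l1_miss_rate * (l2_hit + l2_local_rate * l2_penalty)  # 3.4
amat_no_l2 = l1_hit + l1_miss_rate * l2_penalty                             # 5.0
stalls_l2    = (amat_l2 - l1_hit) * refs_per_instr                          # 3.6
stalls_no_l2 = (amat_no_l2 - l1_hit) * refs_per_instr                       # 6.0

speedup = (ideal_cpi + stalls_no_l2) / (ideal_cpi + stalls_l2)
print(f"L2 local miss rate {l2_local_rate:.0%}, global {l2_global_rate:.0%}")
print(f"AMAT with L2 {amat_l2}, without L2 {amat_no_l2}")
print(f"speedup from adding L2 ≈ {speedup:.2f}x")                           # ~1.52x
```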

Multilevel Cache
● The speed (hit time) of the L1 cache affects the clock rate of the CPU:
  – The speed of the L2 cache only affects the miss penalty of L1.
● Inclusion policy:
  – Many designers keep the L1 and L2 block sizes the same.
  – Otherwise, on an L2 miss, several L1 blocks may have to be invalidated.

Reducing Miss Penalty (2): Victim Cache
● How do we combine the fast hit time of a direct-mapped cache yet still avoid conflict misses?
● Add a small fully associative buffer (victim cache) to keep data discarded from the cache.
● Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of the conflicts for a 4 KB direct-mapped data cache.
● Used in Alpha and HP machines.
● AMD uses an 8-entry victim buffer.
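
To make the mechanism concrete, here is a small, purely illustrative simulation (not any vendor's design) of a direct-mapped cache backed by a fully associative LRU victim buffer. Two addresses that map to the same set would thrash the direct-mapped array, but after the first round trip the victim buffer catches every conflict:

```python
from collections import OrderedDict

class VictimCacheSim:
    """Direct-mapped cache plus a small fully associative victim buffer (sketch)."""
    def __init__(self, num_sets, block_size, victim_entries=4):
        self.num_sets, self.block_size = num_sets, block_size
        self.lines = {}                 # set index -> tag currently resident
        self.victim = OrderedDict()     # block number -> True, kept in LRU order
        self.victim_entries = victim_entries
        self.hits = self.victim_hits = self.misses = 0

    def access(self, addr):
        block = addr // self.block_size
        index, tag = block % self.num_sets, block // self.num_sets
        if self.lines.get(index) == tag:
            self.hits += 1
            return "hit"
        if block in self.victim:            # conflict absorbed by the victim buffer
            self.victim_hits += 1
            del self.victim[block]
            outcome = "victim hit"
        else:                               # ordinary miss, fetched from memory
            self.misses += 1
            outcome = "miss"
        evicted_tag = self.lines.get(index)
        if evicted_tag is not None:         # displaced block moves to the victim buffer
            self._insert_victim(evicted_tag * self.num_sets + index)
        self.lines[index] = tag
        return outcome

    def _insert_victim(self, block):
        self.victim[block] = True
        self.victim.move_to_end(block)
        if len(self.victim) > self.victim_entries:
            self.victim.popitem(last=False)  # evict the LRU victim entry

sim = VictimCacheSim(num_sets=64, block_size=64)
for _ in range(4):
    print(sim.access(0x0000), sim.access(0x1000))    # same set, different tags
print("hits:", sim.hits, "victim hits:", sim.victim_hits, "misses:", sim.misses)
```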

Reducing Miss Penalty (3): Read Priority over Write on Miss
● In a write-back scheme:
  – Normally a dirty block is stored in a write buffer temporarily.
  – Usual approach: write all the blocks from the write buffer to memory, and then do the read.
  – Instead: check the write buffer first; if the block is not found there, initiate the read right away.
  – This reduces CPU stall cycles.

Reducing Miss Penalty (3): Read Priority over Write on Miss (cont…)
● A write buffer with write-through:
  – Allows cache writes to occur at the speed of the cache.
● Write buffers, however, complicate memory accesses:
  – They may hold the updated value of a location needed on a read miss.

Reducing Miss Penalty (3): Read Priority over Write on Miss (cont…)
● Write-through with write buffers:
  – Read priority over write: check the write buffer's contents before a read; if there are no conflicts, let the memory access continue.
  – Write priority over read: waiting for the write buffer to empty first can increase the read miss penalty.

Reducing Miss Penalty (4): Early Restart and Critical Word First
● Simple idea: don't wait for the full block to be loaded before restarting the CPU; the CPU needs only one word:
  – Early restart: as soon as the requested word of the block arrives, send it to the CPU.
  – Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives.
● Generally useful for large blocks.

Example 7
● The AMD Athlon has 64-byte cache blocks.
● Its L2 cache takes 11 cycles to deliver the critical 8 bytes.
● To fetch the rest of the block: 2 clock cycles per 8 bytes.

Solution
● With critical word first, the requested 8 bytes reach the CPU after 11 cycles; reading the full cache block takes 11 + (8 − 1) × 2 = 25 clock cycles.
● Without critical word first (and without early restart), the CPU would have to wait the full 25 cycles for the block before receiving the requested word.
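
A small sketch of the timing, which also shows where early restart alone sits between the two extremes (the chunk positions are just illustrative):

```python
# Example 7 timing: a 64-byte block fetched as eight 8-byte chunks,
# 11 cycles for the first chunk and 2 cycles for each one after that.
first_chunk, per_chunk, chunks = 11, 2, 8

full_block = first_chunk + (chunks - 1) * per_chunk
print("cycles to read the full block          :", full_block)    # 25
print("cycles until the CPU restarts (CWF)    :", first_chunk)   # 11
# With early restart but no critical-word-first, the block arrives in order,
# so chunk i reaches the CPU after 11 + i*2 cycles.
for i in (0, 3, 7):
    print(f"early restart only, requested chunk {i}:", first_chunk + i * per_chunk, "cycles")
```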

Reducing Miss Penalty (5): Non-blocking Caches
● Non-blocking cache:
  – Allows the data cache to continue to serve other requests during a miss.
  – Meaningful only with an out-of-order execution processor.
  – Requires multi-bank memories.
  – The Pentium Pro allows 4 outstanding memory misses.

Non-blocking Caches to Reduce Stalls on Misses
● "Hit under miss" reduces the effective miss penalty by letting the cache keep working during a miss.
● "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty:
  – By overlapping multiple misses.
  – Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses.
  – Requires multiple memory banks.

Non-Blocking Caches (cont…)
● Multiple memory controllers:
  – Allow memory banks to operate almost independently.
  – Each bank needs separate address lines and data lines.
● An input device may be served by one controller:
  – At the same time, other controllers may serve cache read/write requests.

Hit Time Reduction (1): Small and Simple L1 Cache
● Small and simple (direct-mapped) caches have a lower hit time (why?).
● This is possibly the reason why modern processors keep the L1 cache small and simple:
  – In the Pentium 4, the L1 cache size was reduced from 16 KB to 8 KB in later versions.
● The L2 cache can be larger and set-associative.

Hit Time Reduction (2): Simultaneous Tag Comparison and Data Reading
● After indexing:
  – The tag can be compared and, at the same time, the data block can be fetched.
  – If it's a miss, no harm is done; the miss is simply handled as usual.
● [Diagram: the tag and data arrays are read in parallel; the tag comparator selects one cache line of data, or the request goes to the next lower level in the hierarchy on a miss.]

Reducing Hit Time (3): Write Strategy
● There are many more reads than writes:
  – All instructions must be read!
  – Consider 37% load instructions and 10% store instructions.
  – The fraction of memory accesses that are writes: 0.10 / (1.0 + 0.37 + 0.10) ≈ 7%
  – The fraction of data memory accesses that are writes: 0.10 / (0.37 + 0.10) ≈ 21%
● Remember the fundamental principle: make the common case fast.

Write Strategy
● Cache designers have spent most of their effort on making reads fast:
  – Not so much on writes.
● But if writes are extremely slow, Amdahl's law tells us that overall performance will be poor:
  – Writes also need to be made faster.

Making Writes Fast
● Several strategies exist for making reads fast:
  – Simultaneous tag comparison and block reading, etc.
● Unfortunately, writes cannot be made fast the same way:
  – Tag comparison cannot be done simultaneously with a block write:
  – One must be sure not to overwrite a block frame that isn't a hit!

Write Policy 1: Write-Through vs Write-Back
● Write-through: all writes update the cache and the underlying memory/cache.
  – Cached data can always be discarded; the most up-to-date data is in memory.
  – Cache control bit: only a valid bit.
● Write-back: all writes update only the cache.
  – Memory is written only during block replacement.
  – Cache control bits: both valid and dirty bits.

Write-Through vs Write-Back
● Relative advantages:
  – Write-through:
    ● Memory always has the latest data.
    ● Simpler management of the cache.
  – Write-back:
    ● Requires much lower bandwidth, since data is often overwritten multiple times before being written back.
    ● Better tolerance of long-latency memory.

Write Policy 2: Write Allocate vs Non-Allocate
● Write allocate: allocate a new block on a write miss:
  – Usually means doing a "read miss" to fill in the rest of the cache line!
● Write non-allocate (or "write-around"):
  – Simply send the write data through to the underlying memory/cache; don't allocate a new cache line!

Reducing Hit Time (4): Way Prediction
● Extra bits are associated with each set:
  – They predict the way (block) to be accessed next.
  – The multiplexor can be set early to select the predicted block.
● The Alpha 21264 uses way prediction in its 2-way set associative instruction cache:
  – If the prediction is correct, the access takes 1 cycle.
  – Experiments with SPEC95 suggest better than 85% prediction accuracy.

Pipelined Cache Access
● L1 cache access can take multiple cycles:
  – Giving a fast cycle time and slower hits, but greater cache bandwidth.
● The Pentium takes 1 cycle to access its cache:
  – The Pentium III takes 2 cycles.
  – The Pentium 4 takes 4 cycles.

Trace Caches

Modern Computer Architectures – Lecture 25: Some More Cache Optimizations

Reducing Miss Rate: An Anatomy of Cache Misses
● To be able to reduce the miss rate, we should be able to classify misses by cause (the 3 Cs):
  – Compulsory: the first access to a block must bring it into the cache.
    ● Also called cold-start misses or first-reference misses; they occur even in an infinite cache.
  – Capacity: the cache is not large enough, so some blocks are discarded and later retrieved.
    ● These misses occur even in a fully associative cache.
  – Conflict: blocks can be discarded and later retrieved if too many blocks map to the same set.
    ● Also called collision misses or interference misses. (Misses in an N-way associative cache of size X.)

Classifying Cache Misses
● Later we shall discuss a 4th "C":
  – Coherence: misses caused by cache coherence; to be discussed in the multiprocessors part.

Compiler Optimizations

Reducing Miss Rate (1): Hardware Prefetching
● Prefetch both data and instructions:
  – Instruction prefetching is done in almost every processor.
  – Processors usually fetch two blocks on a miss: the requested block and the next one.
● The UltraSPARC III computes strides in data accesses:
  – It prefetches data based on the detected stride.

Reducing Miss Rate (2): Software Prefetching
● Prefetch data:
  – Register prefetch: load data into a register (HP PA-RISC loads).
  – Cache prefetch: load into a special prefetch cache (MIPS IV, PowerPC, SPARC v9).
● Special prefetching instructions cannot cause faults; they are a form of speculative fetch.

Compiler-Controlled Prefetching
● The compiler inserts instructions to prefetch data before it is needed.
● Two flavors:
  – Binding prefetch: a request to load directly into a register.
    ● Must be the correct address and register!
  – Non-binding prefetch: load into the cache.
    ● Can be incorrect. What if faults occur?
● Prefetch instructions incur instruction execution overhead:
  – Is the cost of issuing prefetches less than the savings from reduced misses?

Example 8
● By how much would the AMAT increase if the prefetch buffer were removed?
  – Assume a 64 KB data cache.
  – 30 data misses per 1000 instructions.
  – L1 miss penalty: 15 cycles.
  – 25% of instructions are loads/stores.
  – Prefetching reduces the miss rate by 20%.
  – 1 extra clock cycle is incurred if an access misses in the cache but is found in the prefetch buffer.

Solution
● Miss rate (data) = (30/1000) / (250/1000) = 0.12 (30 misses and 250 data accesses per 1000 instructions)
● AMAT (prefetch) = 1 + (0.12 × 20% × 1) + (0.12 × (1 − 20%) × 15)
                  = 1 + 0.024 + 1.44 ≈ 2.46 cycles
● AMAT (no prefetch) = 1 + 0.12 × 15 = 1 + 1.8 = 2.8 cycles
● Removing the prefetch buffer therefore increases the AMAT by about 14%.
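
A quick check of this arithmetic:

```python
# Example 8: effect of removing the prefetch buffer on AMAT.
hit_time, miss_penalty = 1, 15
miss_rate = (30 / 1000) / 0.25        # 30 misses over 250 data accesses per 1000 instructions
covered = 0.20                        # fraction of misses satisfied by the prefetch buffer
prefetch_hit_extra = 1                # extra cycle when found in the prefetch buffer

amat_prefetch = hit_time + miss_rate * covered * prefetch_hit_extra \
              + miss_rate * (1 - covered) * miss_penalty
amat_no_prefetch = hit_time + miss_rate * miss_penalty

print(f"AMAT with prefetch buffer   : {amat_prefetch:.3f} cycles")      # ~2.464
print(f"AMAT without prefetch buffer: {amat_no_prefetch:.3f} cycles")   # 2.8
print(f"AMAT increase               : {amat_no_prefetch / amat_prefetch - 1:.1%}")  # ~14%
```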