CS 152 Computer Architecture and Engineering Lecture 20

The Big Picture: Where are We Now? ° The Five Classic Components of a

The Art of Memory System Design Workload or Benchmark programs Processor reference stream <op,

Example: 1 KB Direct Mapped Cache with 32 B Blocks ° For a 2

Set Associative Cache ° N-way set associative: N entries for each Cache Index •

Disadvantage of Set Associative Cache ° N-way Set Associative Cache versus Direct Mapped Cache:

Example: Fully Associative ° Fully Associative Cache • Forget about the Cache Index •

A Summary on Sources of Cache Misses ° Compulsory (cold start or process migration,

Design options at constant cost Direct Mapped Cache Size Compulsory Miss Big Same N-way

Recap: Four Questions for Caches and Memory Hierarchy ° Q 1: Where can a

Q 1: Where can a block be placed in the upper level? ° Block

Q 2: How is a block found if it is in the upper level?

Q 3: Which block should be replaced on a miss? ° Easy for Direct

Q 4: What happens on a write? ° Write through—The information is written to

Write Buffer for Write Through Processor Cache DRAM Write Buffer ° A Write Buffer

Write Buffer Saturation Processor Cache DRAM Write Buffer ° Store frequency (w. r. t.

RAW Hazards from Write Buffer! ° Write-Buffer Issues: Could introduce RAW Hazard with memory!

Write-miss Policy: Write Allocate versus Not Allocate ° Assume: a 16 -bit write to

Impact of Memory Hierarchy on Algorithms ° Today CPU time is a function of

Quicksort vs. Radix as vary number keys: Instructions Radix sort Quick sort Instructions/key Job

Quicksort vs. Radix as vary number keys: Instrs & Time Radix sort Time Quick

Quicksort vs. Radix as vary number keys: Cache misses Radix sort Cache misses Quick

Summary #1/ 2: ° The Principle of Locality: • Program likely to access a

Summary #2 / 2: The Cache Design Space ° Several interacting dimensions • •

Slides: 24

Download presentation

CS 152 Computer Architecture and Engineering Lecture 20 Caches Lec 20. 1

The Big Picture: Where are We Now? ° The Five Classic Components of a Computer Processor Input Control Memory Datapath Output ° Today’s Topics: • • Recap last lecture Simple caching techniques Many ways to improve cache performance Virtual memory? Lec 20. 2

The Art of Memory System Design Workload or Benchmark programs Processor reference stream <op, addr>, . . . op: i-fetch, read, write Memory $ MEM Optimize the memory system organization to minimize the average memory access time for typical workloads Lec 20. 3

Example: 1 KB Direct Mapped Cache with 32 B Blocks ° For a 2 ** N byte cache: • The uppermost (32 - N) bits are always the Cache Tag • The lowest M bits are the Byte Select (Block Size = 2 M) • One cache miss, pull in complete “Cache Block” (or “Cache Line”) Cache Tag Example: 0 x 50 Stored as part of the cache “state” Valid Bit Cache Tag 0 x 50 : 9 Cache Index Ex: 0 x 01 Cache Data Byte 31 Byte 63 4 0 Byte Select Ex: 0 x 00 Byte 1 Byte 0 0 Byte 33 Byte 32 1 2 3 : : Byte 1023 : 31 : : Block address Byte 992 31 Lec 20. 4

Set Associative Cache ° N-way set associative: N entries for each Cache Index • N direct mapped caches operates in parallel ° Example: Two-way set associative cache • Cache Index selects a “set” from the cache • The two tags in the set are compared to the input in parallel • Data is selected based on the tag result Valid Cache Tag : : Adr Tag Compare Cache Index Cache Data Cache Block 0 : : Sel 1 1 Mux 0 Sel 0 Cache Tag Valid : : Compare OR Hit Cache Block Lec 20. 5

Disadvantage of Set Associative Cache ° N-way Set Associative Cache versus Direct Mapped Cache: • N comparators vs. 1 • Extra MUX delay for the data • Data comes AFTER Hit/Miss decision and set selection ° In a direct mapped cache, Cache Block is available BEFORE Hit/Miss: • Possible to assume a hit and continue. Recover later if miss. Valid Cache Tag : : Adr Tag Compare Cache Index Cache Data Cache Block 0 : : Sel 1 1 Mux 0 Sel 0 Cache Tag Valid : : Compare OR Hit Cache Block Lec 20. 6

Example: Fully Associative ° Fully Associative Cache • Forget about the Cache Index • Compare the Cache Tags of all cache entries in parallel • Example: Block Size = 32 B blocks, we need N 27 -bit comparators ° By definition: Conflict Miss = 0 for a fully associative cache 31 4 0 Byte Select Ex: 0 x 01 Cache Tag (27 bits long) Cache Tag Valid Bit Cache Data Byte 31 Byte 0 Byte 63 Byte 32 : : = = = : : : Lec 20. 7

A Summary on Sources of Cache Misses ° Compulsory (cold start or process migration, first reference): first access to a block • “Cold” fact of life: not a whole lot you can do about it • Note: If you are going to run “billions” of instruction, Compulsory Misses are insignificant ° Capacity: • Cache cannot contain all blocks access by the program • Solution: increase cache size ° Conflict (collision): • Multiple memory locations mapped to the same cache location • Solution 1: increase cache size • Solution 2: increase associativity ° Coherence (Invalidation): other process (e. g. , I/O) updates memory Lec 20. 8

Design options at constant cost Direct Mapped Cache Size Compulsory Miss Big Same N-way Set Associative Medium Same Fully Associative Small Same Conflict Miss High Medium Zero Capacity Miss Low Medium High Coherence Miss Same Note: If you are going to run “billions” of instruction, Compulsory Misses are insignificant (except for streaming media types of programs). Lec 20. 9

Recap: Four Questions for Caches and Memory Hierarchy ° Q 1: Where can a block be placed in the upper level? (Block placement) ° Q 2: How is a block found if it is in the upper level? (Block identification) ° Q 3: Which block should be replaced on a miss? (Block replacement) ° Q 4: What happens on a write? (Write strategy) Lec 20. 10

Q 1: Where can a block be placed in the upper level? ° Block 12 placed in 8 block cache: • Fully associative, direct mapped, 2 -way set associative • S. A. Mapping = Block Number Modulo Number Sets Direct mapped: block 12 can go only into block 4 (12 mod 8) Fully associative: block 12 can go anywhere Block no. 01234567 Set associative: block 12 can go anywhere in set 0 (12 mod 4) Block no. Block-frame address Block no. 01234567 Set Set 0 1 2 3 111112222233 0123456789012345678901 Lec 20. 11

Q 2: How is a block found if it is in the upper level? Block Address Tag Block offset Index Set Select Data Select ° Direct indexing (using index and block offset), tag compares, or combination ° Increasing associativity shrinks index, expands tag Lec 20. 12

Q 3: Which block should be replaced on a miss? ° Easy for Direct Mapped ° Set Associative or Fully Associative: • Random • LRU (Least Recently Used) Associativity: -way 2 -way Size LRU Random 4 -way 8 LRU 16 KB 5. 2% 5. 7% 5. 0% 4. 7% 5. 3% 4. 4% 64 KB 1. 9% 2. 0% 1. 5% 1. 7% 1. 4% 256 KB 1. 12% 1. 15% 1. 17% 1. 13% 1. 12% 1. 13% Lec 20. 13

Q 4: What happens on a write? ° Write through—The information is written to both the block in the cache and to the block in the lowerlevel memory. ° Write back—The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced. • is block clean or dirty? ° Pros and Cons of each? • WT: read misses cannot result in writes • WB: no writes of repeated writes ° WT always combined with write buffers so that don’t wait for lower level memory Lec 20. 14

Write Buffer for Write Through Processor Cache DRAM Write Buffer ° A Write Buffer is needed between the Cache and Memory • Processor: writes data into the cache and the write buffer • Memory controller: write contents of the buffer to memory ° Write buffer is just a FIFO: • Typical number of entries: 4 • Must handle bursts of writes • Works fine if: Store frequency (w. r. t. time) << 1 / DRAM write cycle Lec 20. 15

Write Buffer Saturation Processor Cache DRAM Write Buffer ° Store frequency (w. r. t. time) > 1 / DRAM write cycle • If this condition exist for a long period of time (CPU cycle time too quick and/or too many store instructions in a row): - Store buffer will overflow no matter how big you make it - The CPU Cycle Time <= DRAM Write Cycle Time ° Solution for write buffer saturation: • Use a write back cache • Install a second level (L 2) cache: (does this always work? ) Processor Cache L 2 Cache DRAM Write Buffer Lec 20. 16

RAW Hazards from Write Buffer! ° Write-Buffer Issues: Could introduce RAW Hazard with memory! • Write buffer may contain only copy of valid data Reads to memory may get wrong result if we ignore write buffer ° Solutions: • Simply wait for write buffer to empty before servicing reads: - Might increase read miss penalty (old MIPS 1000 by 50% ) • Check write buffer contents before read (“fully associative”); - If no conflicts, let the memory access continue - Else grab data from buffer ° Can Write Buffer help with Write Back? • Read miss replacing dirty block - Copy dirty block to write buffer while starting read to memory • CPU stall less since restarts as soon as do read Lec 20. 17

Write-miss Policy: Write Allocate versus Not Allocate ° Assume: a 16 -bit write to memory location 0 x 0 and causes a miss • Do we allocate space in cache and possibly read in the block? - Yes: Write Allocate - No: Not Write Allocate Valid Bit Example: 0 x 00 Cache Tag 0 x 50 : Cache Data Byte 31 Byte 63 4 0 Byte Select Ex: 0 x 00 Byte 1 Byte 0 0 Byte 33 Byte 32 1 2 3 : : Byte 1023 : Cache Tag 9 Cache Index Ex: 0 x 00 : : 31 Byte 992 31 Lec 20. 18

Impact of Memory Hierarchy on Algorithms ° Today CPU time is a function of (ops, cache misses) ° What does this mean to Compilers, Data structures, Algorithms? • Quicksort: fastest comparison based sorting algorithm when keys fit in memory • Radix sort: also called “linear time” sort For keys of fixed length and fixed radix a constant number of passes over the data is sufficient independent of the number of keys ° “The Influence of Caches on the Performance of Sorting” by A. La. Marca and R. E. Ladner. Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, January, 1997, 370 -379. • For Alphastation 250, 32 byte blocks, direct mapped L 2 2 MB cache, 8 byte keys, from 4000 to 4000000 Lec 20. 19

Quicksort vs. Radix as vary number keys: Instructions Radix sort Quick sort Instructions/key Job size in keys Lec 20. 20

Quicksort vs. Radix as vary number keys: Instrs & Time Radix sort Time Quick sort Instructions Job size in keys Lec 20. 21

Quicksort vs. Radix as vary number keys: Cache misses Radix sort Cache misses Quick sort Job size in keys What is proper approach to fast algorithms? Lec 20. 22

Summary #1/ 2: ° The Principle of Locality: • Program likely to access a relatively small portion of the address space at any instant of time. - Temporal Locality: Locality in Time - Spatial Locality: Locality in Space ° Three (+1) Major Categories of Cache Misses: • Compulsory Misses: sad facts of life. Example: cold start misses. • Conflict Misses: increase cache size and/or associativity. Nightmare Scenario: ping pong effect! • Capacity Misses: increase cache size • Coherence Misses: Caused by external processors or I/O devices ° Cache Design Space • • total size, block size, associativity replacement policy write-hit policy (write-through, write-back) write-miss policy Lec 20. 23

Summary #2 / 2: The Cache Design Space ° Several interacting dimensions • • • Cache Size cache size block size associativity replacement policy Associativity write-through vs write-back write allocation Block Size ° The optimal choice is a compromise • depends on access characteristics - workload - use (I-cache, D-cache, TLB) • depends on technology / cost ° Simplicity often wins Bad Good Factor A Less Factor B More Lec 20. 24