Platform-based Design Data Management, Part d: Data Layout for Caches
5KK70, TU/e
Henk Corporaal, Bart Mesman
H. C. Platform-based Design 5 KK 70
Data layout for caches
• Caches are hardware controlled
• Therefore: no explicit reuse copy code needed!
• What can we still do to improve performance?
• Topics:
– Cache principles
– The 3 C's: Compulsory, Capacity and Conflict misses
– Data layout examples reducing misses
Cache operation (direct mapped cache)
[figure: blocks from main memory (lower level) are mapped onto cache lines (higher level); each cache line (block) holds a tag plus the data]
Why does a cache work?
• Principle of Locality
– Temporal locality
• an accessed item has a high probability of being accessed again in the near future
– Spatial locality
• items close in space to a recently accessed item have a high probability of being accessed next
• Check yourself why there is temporal and spatial locality for instruction accesses and for data accesses
– Regular programs have high instruction and data locality
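Both kinds of locality show up in even the simplest loop; a minimal sketch (the function and array are illustrative, not from the slides):

```c
#include <assert.h>

/* Summing an array: the accesses a[0], a[1], ... touch adjacent memory
 * locations (spatial locality), while the loop code and the variable
 * `sum` are reused on every iteration (temporal locality). A cache that
 * fetches whole blocks therefore serves most of these accesses quickly. */
int sum_array(const int *a, int n)
{
    int sum = 0;                  /* reused every iteration: temporal locality */
    for (int i = 0; i < n; i++)
        sum += a[i];              /* consecutive addresses: spatial locality */
    return sum;
}
```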
Direct mapped cache
[figure: a 32-bit address split into a 20-bit tag (bits 31..12), a 10-bit index (bits 11..2) and a 2-bit byte offset; the index selects one of 1024 entries (0..1023), each holding a valid bit, a tag and a 32-bit data word; a hit is signalled when the stored tag matches the address tag]
Direct mapped cache: larger blocks
• Taking advantage of spatial locality
[figure: address breakdown (bit positions) for a cache with multi-word blocks]
Performance
• Increasing the block size tends to decrease the miss rate
[figure: miss rate as a function of block size]
Cache principles
[figure: a p-bit address from the CPU is split into a tag (p-k-m bits), an index (k bits) and a byte address (m bits); the cache holds 2^k lines (blocks) of 2^m bytes each; the stored tag of the indexed line is compared with the address tag to decide a hit; on a miss the block comes from main memory]
Cache Architecture Fundamentals
• Block placement
– Where in the cache will a new block be placed?
• Block identification
– How is a block found in the cache?
• Block replacement policy
– Which block is evicted from the cache?
• Updating policy
– How is a block written from cache to memory?
Block placement policies
[figure: memory blocks 0..15 mapped onto a cache of 8 lines (0..7):
– Direct mapped (one-to-one): each memory block can go to exactly one cache line ("here only!")
– Fully associative (one-to-many): a memory block can go anywhere in the cache]
4-way associative cache
[figure: organization of a 4-way set-associative cache]
Performance
[figure: miss rate versus associativity for 1 KB, 2 KB and 8 KB caches]
Cache Basics
• Cache_size = Nsets x Associativity x Block_size
• Block_address = Byte_address DIV Block_size (in bytes)
• Index = Block_address MOD Nsets
• Because the block size and the number of sets are (usually) powers of two, DIV and MOD can be performed efficiently

Address layout (bit 31 down to bit 0): | tag | index | block offset |
(tag and index together form the block address)
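Since block size and Nsets are powers of two, the DIV and MOD above become shifts and masks; a sketch of the address split (the field widths used in the test are example values, not from the slides):

```c
#include <stdint.h>

/* Split a byte address into block offset, index and tag.
 * block_bits = log2(block size in bytes), index_bits = log2(Nsets). */
typedef struct { uint32_t tag, index, offset; } addr_fields;

addr_fields split_address(uint32_t addr, unsigned block_bits, unsigned index_bits)
{
    addr_fields f;
    f.offset = addr & ((1u << block_bits) - 1);       /* addr MOD block size  */
    uint32_t block_addr = addr >> block_bits;         /* addr DIV block size  */
    f.index = block_addr & ((1u << index_bits) - 1);  /* block addr MOD Nsets */
    f.tag   = block_addr >> index_bits;               /* remaining high bits  */
    return f;
}
```

For example, with 16-byte blocks (4 offset bits) and 256 sets (8 index bits), address 0x12345678 splits into offset 0x8, index 0x67 and tag 0x12345.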
Example 1
• Assume
– Cache of 4K blocks, with 4-word block size
– 32-bit addresses
• Direct mapped (associativity = 1):
– 16 bytes per block = 2^4: 4 (2+2) bits for byte and word offsets
– 32-bit address: 32 - 4 = 28 bits for index and tag
– #sets = #blocks / associativity: log2(4K) = 12 bits for index
– Total number of tag bits: (28 - 12) * 4K = 64 Kbits
• 2-way associative
– #sets = #blocks / associativity: 2K sets
– 1 bit less for indexing, 1 bit more for tag (compared to direct mapped)
– Tag bits: (28 - 11) * 2 * 2K = 68 Kbits
• 4-way associative
– #sets = #blocks / associativity: 1K sets
– 2 bits less for indexing, 2 bits more for tag (compared to direct mapped)
– Tag bits: (28 - 10) * 4 * 1K = 72 Kbits
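The tag-storage numbers above can be checked mechanically; a small sketch using the parameters of this example (4K blocks, 28 address bits left after the block offset):

```c
/* Total tag storage in bits for a cache of `nblocks` blocks with the given
 * associativity. `addr_bits_left` is what remains of the address after the
 * block offset is removed (28 in the example: 32 - 4). */
unsigned tag_bits_total(unsigned nblocks, unsigned assoc, unsigned addr_bits_left)
{
    unsigned nsets = nblocks / assoc;
    unsigned index_bits = 0;
    while ((1u << index_bits) < nsets)      /* log2(nsets); nsets is a power of two */
        index_bits++;
    unsigned tag_bits = addr_bits_left - index_bits;
    return tag_bits * nblocks;              /* one tag per block */
}
```

Higher associativity means fewer sets, hence fewer index bits and more tag bits per block: 64, 68 and 72 Kbits for 1-, 2- and 4-way.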
Example 2
Three caches, each consisting of 4 one-word blocks:
• Cache 1: fully associative
• Cache 2: two-way set associative
• Cache 3: direct mapped
Suppose the following sequence of block addresses: 0, 8, 0, 6, 8
Example 2: Direct Mapped

Block address → cache block:
0 → 0 mod 4 = 0
6 → 6 mod 4 = 2
8 → 8 mod 4 = 0

Address | Hit or miss | Location 0 | Location 1 | Location 2 | Location 3
0       | miss        | Mem[0]     |            |            |
8       | miss        | Mem[8]     |            |            |
0       | miss        | Mem[0]     |            |            |
6       | miss        | Mem[0]     |            | Mem[6]     |
8       | miss        | Mem[8]     |            | Mem[6]     |

(coloured = new entry = miss; 5 misses in total)
Example 2: 2-way Set Associative (4/2 = 2 sets)

Block address → set:
0 → 0 mod 2 = 0
6 → 6 mod 2 = 0
8 → 8 mod 2 = 0
(so all map to set 0)

Address | Hit or miss | Set 0, entry 0 | Set 0, entry 1 | Set 1, entry 0 | Set 1, entry 1
0       | miss        | Mem[0]         |                |                |
8       | miss        | Mem[0]         | Mem[8]         |                |
0       | hit         | Mem[0]         | Mem[8]         |                |
6       | miss        | Mem[0]         | Mem[6]         |                |
8       | miss        | Mem[8]         | Mem[6]         |                |

(the least recently used block is replaced; 4 misses, 1 hit)
Example 2: Fully Associative (4-way assoc., 4/4 = 1 set)

Address | Hit or miss | Block 0 | Block 1 | Block 2 | Block 3
0       | miss        | Mem[0]  |         |         |
8       | miss        | Mem[0]  | Mem[8]  |         |
0       | hit         | Mem[0]  | Mem[8]  |         |
6       | miss        | Mem[0]  | Mem[8]  | Mem[6]  |
8       | hit         | Mem[0]  | Mem[8]  | Mem[6]  |

(3 misses, 2 hits)
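All three tables can be reproduced with a small set-associative simulator (one-word blocks, LRU replacement, as in the slides); a sketch:

```c
/* Count misses for a sequence of one-word block addresses on a cache of
 * `nblocks` blocks organised as nblocks/assoc sets, with LRU replacement.
 * Assumes nblocks <= 64 and that nsets is a power-of-two divisor. */
int count_misses(const int *seq, int n, int nblocks, int assoc)
{
    int nsets = nblocks / assoc;
    int tags[64], last_use[64];
    for (int b = 0; b < nblocks; b++) { tags[b] = -1; last_use[b] = -1; }

    int misses = 0;
    for (int t = 0; t < n; t++) {
        int set = seq[t] % nsets;
        int *wt = &tags[set * assoc];       /* ways of this set */
        int *wu = &last_use[set * assoc];
        int hit = -1, victim = 0;
        for (int w = 0; w < assoc; w++) {
            if (wt[w] == seq[t]) hit = w;
            if (wu[w] < wu[victim]) victim = w;   /* least recently used (or empty) */
        }
        if (hit < 0) { misses++; hit = victim; wt[hit] = seq[t]; }
        wu[hit] = t;
    }
    return misses;
}
```

Replaying the sequence 0, 8, 0, 6, 8 on 4 blocks gives 5 misses direct mapped, 4 misses 2-way, and 3 misses fully associative, matching the tables.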
Cache Fundamentals: The "Three C's"
• Compulsory misses
– 1st access to a block: never in the cache
• Capacity misses
– Cache cannot contain all the blocks
– Blocks are discarded and retrieved later
– Avoided by increasing cache size
• Conflict misses
– Too many blocks mapped to the same set
– Avoided by increasing associativity
• Some add a 4th C: Coherence misses
Compulsory miss example

for(i=0; i<10; i++)
  A[i] = f(B[i]);

Cache (@ i=2): B[0] A[0] B[1] A[1] B[2] A[2] -----

Cache (@ i=3):
• B[3] and A[3] are required
• B[3] was never loaded before → loaded into the cache (compulsory read miss)
• A[3] was never loaded before → allocates a new line (compulsory write miss)
Capacity miss example
Cache size: 8 blocks of 1 word, fully associative

for(i=0; i<N; i++)
  A[i] = B[i+3]+B[i];

[table: cache contents per iteration i = 0..7: each iteration loads B[i+3] and B[i] and allocates A[i]; B[i] has been evicted again by the time it is re-read, three iterations after being loaded as B[i+3]]

• 11 compulsory misses (+8 write misses)
• 5 capacity misses
Conflict miss example

Memory layout (word addresses):
0..3: A[0] A[1] A[2] A[3]
4..7: B[0][0] B[1][0] B[2][0] B[3][0] (column 0)
8..11: B[0][1] B[1][1] B[2][1] B[3][1] (column 1)
12..15: B[0][2] B[1][2] B[2][2] B[3][2] (column 2)
... columns stored consecutively, up to B[3][9]

Cache: direct mapped, 8 words (cache addresses 0..7)

for(j=0; j<10; j++)
  for(i=0; i<4; i++)
    A[i] = A[i]+B[i][j];

• A[i] is read 10 times, so A[0..3] are loaded again and again
• A[i] maps to cache line i; B[i][j] maps to line (4 + 4j + i) mod 8
• For odd j, B[i][j] lands in the same line as A[i]: A[i] is flushed in favour of B[i][j] → conflict miss; for even j there is no conflict
"Three C's" vs Cache size [Gee 93]
[figure: miss-rate breakdown into compulsory, capacity and conflict misses as a function of cache size]
Data layout may reduce cache misses
Example 1: Capacity & Compulsory miss reduction

for(i=0; i<N; i++)
  A[i] = B[i+3]+B[i];

[table: cache contents per iteration i = 0..7, as on the capacity miss example slide: B[i] has been evicted again by the time it is re-read, three iterations after being loaded as B[i+3]]

• 11 compulsory misses (+8 write misses)
• 5 capacity misses
Fit data in cache with in-place mapping

for(i=0; i<12; i++)
  A[i] = B[i+3]+B[i];

• Traditional analysis: max = 27 words (A[] and B[] stored separately)
• Detailed analysis: max = 15 words: each A[i] can be stored in place of B[i], giving one combined array AB
[figure: occupied words of B[] (15 words) and A[] (12 words) over i = 0, 6, 12; the combined array fits a cache of 16 words]
Remove capacity / compulsory misses with in-place mapping

for(i=0; i<N; i++)
  AB[i] = AB[i+3]+AB[i];

[table: cache contents per iteration i = 0..7: AB[i] is still resident when it is re-read three iterations after being loaded as AB[i+3]]

• 11 compulsory misses
• 5 cache hits (+8 write hits)
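The miss counts of this slide and of the earlier two-array version can be checked by replaying both access traces against a fully associative LRU cache of 8 one-word blocks; a sketch (write-allocate is assumed, and the base addresses 100 and 200 are arbitrary placeholders for where A/B/AB happen to live):

```c
#define NBLOCKS 8   /* fully associative cache, one word per block, LRU */

static int tags[NBLOCKS], stamp[NBLOCKS], clk;

static void reset_cache(void)
{
    for (int b = 0; b < NBLOCKS; b++) { tags[b] = -1; stamp[b] = -1; }
    clk = 0;
}

/* Access one word; returns 1 on a miss. Reads and writes both allocate. */
static int cache_access(int addr)
{
    int victim = 0;
    clk++;
    for (int b = 0; b < NBLOCKS; b++) {
        if (tags[b] == addr) { stamp[b] = clk; return 0; }   /* hit */
        if (stamp[b] < stamp[victim]) victim = b;            /* LRU (or empty) */
    }
    tags[victim] = addr;
    stamp[victim] = clk;
    return 1;
}

/* Original loop: A[i] = B[i+3] + B[i], with A and B as separate arrays. */
int misses_two_arrays(void)
{
    reset_cache();
    int m = 0;
    for (int i = 0; i < 8; i++) {
        m += cache_access(100 + i + 3);   /* read B[i+3] */
        m += cache_access(100 + i);       /* read B[i]   */
        m += cache_access(200 + i);       /* write A[i]  */
    }
    return m;
}

/* In-place version: AB[i] = AB[i+3] + AB[i], with one shared array. */
int misses_in_place(void)
{
    reset_cache();
    int m = 0;
    for (int i = 0; i < 8; i++) {
        m += cache_access(100 + i + 3);   /* read AB[i+3] */
        m += cache_access(100 + i);       /* read AB[i]   */
        m += cache_access(100 + i);       /* write AB[i]  */
    }
    return m;
}
```

Under this model the two-array loop misses on all 24 accesses (11 compulsory + 5 capacity reads + 8 write misses), while the in-place version misses only on the 11 compulsory loads.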
Example 2: Conflict miss reduction

Memory layout (word addresses):
0..3: A[0] A[1] A[2] A[3]
4..7: B[0][0] B[1][0] B[2][0] B[3][0] (column 0)
8..11: B[0][1] B[1][1] B[2][1] B[3][1] (column 1)
12..15: B[0][2] B[1][2] B[2][2] B[3][2] (column 2)
... columns stored consecutively, up to B[3][9]

Cache: direct mapped, 8 words (cache addresses 0..7)

for(j=0; j<10; j++)
  for(i=0; i<4; i++)
    A[i] = A[i]+B[i][j];

• A[i] is read 10 times, so A[0..3] are loaded again and again
• A[i] maps to cache line i; B[i][j] maps to line (4 + 4j + i) mod 8
• For odd j, B[i][j] lands in the same line as A[i]: A[i] is flushed in favour of B[i][j] → conflict miss; for even j there is no conflict
Avoid conflict misses with main-memory data layout

Leave a 4-word gap after every column of B:
0..3: A[0] A[1] A[2] A[3]
4..7: B[0][0] B[1][0] B[2][0] B[3][0] (column 0, cache lines 4..7)
8..11: gap
12..15: B[0][1] B[1][1] B[2][1] B[3][1] (column 1, cache lines 4..7)
... gap before every following column, up to B[3][9]

for(j=0; j<10; j++)
  for(i=0; i<4; i++)
    A[i] = A[i]+B[i][j];

• A[i] is read 10 times, but every column of B now maps to cache lines 4..7, so B[0][j] never conflicts with A[0] in line 0: no conflict misses for any j

© imec 2001
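The conflict example can be replayed the same way against a direct-mapped cache of 8 one-word blocks, comparing the packed layout with the padded one; a sketch (write-allocate assumed; the miss totals are outputs of this simple model, not stated on the slides):

```c
#define NSETS 8   /* direct-mapped cache of 8 one-word blocks */

static int tags[NSETS];

/* Access one word; returns 1 on a miss. Reads and writes both allocate. */
static int cache_access(int addr)
{
    int set = addr % NSETS;
    if (tags[set] == addr) return 0;   /* hit */
    tags[set] = addr;                  /* miss: load block into its line */
    return 1;
}

/* Replay A[i] = A[i] + B[i][j] for a given address map of B.
 * A[i] lives at word address i. */
static int run(int (*b_addr)(int, int))
{
    int m = 0;
    for (int s = 0; s < NSETS; s++) tags[s] = -1;
    for (int j = 0; j < 10; j++)
        for (int i = 0; i < 4; i++) {
            m += cache_access(i);             /* read A[i]    */
            m += cache_access(b_addr(i, j));  /* read B[i][j] */
            m += cache_access(i);             /* write A[i]   */
        }
    return m;
}

/* Packed layout: B stored column by column directly after A[0..3]. */
static int packed(int i, int j) { return 4 + 4 * j + i; }

/* Padded layout: a 4-word gap after every column, so each column maps
 * to cache lines 4..7 and never collides with A[0..3] in lines 0..3. */
static int padded(int i, int j) { return 4 + 8 * j + i; }
```

With the packed layout, every odd column evicts A[0..3] and the writes to A miss as well; with the gaps only the compulsory misses remain (4 for A plus 40 for B).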
Data Layout Organization for Direct Mapped Caches
Conclusion on Data Management
• In multimedia applications, data transfer and storage issues should be explored at the source-code level
• DMM method:
– Reduces the number of external memory accesses
– Reduces the external memory size
– Trades off internal memory complexity against speed
– Platform-independent high-level transformations
– Platform-dependent transformations exploit platform characteristics (efficient use of memory, cache, ...)
– Substantial energy reduction
• Although caches are hardware controlled, data layout can strongly influence the miss rate