Memory Hierarchy: Cache and Virtual Memory Design (5.1, 5.2, 5.4, 4th ed.)

The Memory Hierarchy: Egyptian Pyramid (figure). Hard drives can be magnetic or solid state (NAND flash).

Why Have a Memory Hierarchy? Capacity, Access Time, Cost
• Registers: 100s of bytes, <0.5 ns (upper level: fastest, most expensive)
• Cache: KBytes, 0.3-10 ns, 1-0.1 cents/bit
• Main memory: a few GBytes, 50-70 ns, 0.0001-0.00001 cents/bit
• Disk: terabytes, ~5 ms, 10^-5-10^-6 cents/bit (lower level: larger, cheaper, slower)
Staging / transfer units between levels:
• Registers <-> cache: instructions and operands, 4-8 bytes (program/compiler)
• Cache <-> memory: blocks (cache lines), 8-256 bytes
• Memory <-> disk: pages (OS pages), 512 B-4 KB
• Disk: files, MBytes (user/operator)
A laptop CPU has 3 levels of cache.

Why a Memory Hierarchy?
• The processor is much faster than DRAM
• Tradeoff: speed vs. size vs. cost
• Principle of locality: programs access only a small portion of their address space
 – Temporal locality: if an item is referenced, it will be referenced again soon (e.g., loops)
 – Spatial locality: nearby items will be referenced soon (e.g., straight-line code, arrays)
• Goal of the memory hierarchy:
 – Speed of the highest level (cache)
 – Size of the lowest level (disk)
 – Balanced cost

Cache Concepts (READ)
• Block (aka line): unit of copying
 – May be multiple words
• If accessed data is present in the upper level
 – Hit: access satisfied by the upper level
  » Hit ratio: hits / accesses
• If accessed data is absent
 – Miss: block copied from the lower level
  » Time taken: miss penalty
  » Miss ratio: misses / accesses = 1 – hit ratio
 – Then the accessed data is supplied from the upper level

Cache Organization (figure): the cache sits between the processor and main memory; each cache line/block holds an address tag (e.g., 100, 304, 6848, 416) plus a data block that is a copy of the corresponding main memory locations.

Example II: 2 KB Direct-Mapped Cache with 32 B Blocks
• 32-bit address split: top 21 bits = cache tag; next 6 bits = cache index (64 lines); lowest 5 bits = byte select (block offset, block size = 2^5 = 32 B)
• 2 KB cache = 2^6 lines × 2^5 bytes/line (64 lines × 32 bytes/line)
• On a cache miss, the whole "cache block" ("cache line") is read from memory
• Each line stores a valid bit, the 21-bit cache tag, and 32 bytes of cache data (bytes 0-31 in line 0, 32-63 in line 1, ..., up to byte 2047 in line 63)
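A minimal C sketch of the address breakdown above (assuming a 32-bit byte address and the 2 KB / 32 B-block geometry; the constants simply restate the slide's numbers):

#include <stdint.h>
#include <stdio.h>

#define BLOCK_BITS 5              /* 32-byte blocks -> 5 offset bits */
#define INDEX_BITS 6              /* 64 lines       -> 6 index bits  */

int main(void) {
    uint32_t addr = 0x12345678;   /* any 32-bit byte address */

    uint32_t offset = addr & ((1u << BLOCK_BITS) - 1);                 /* bits 4..0   */
    uint32_t index  = (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1); /* bits 10..5  */
    uint32_t tag    = addr >> (BLOCK_BITS + INDEX_BITS);               /* bits 31..11 */

    printf("tag=%u index=%u offset=%u\n", tag, index, offset);
    return 0;
}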

Three Cache Mapping Organizations (figure): where can memory block 12 go in an 8-block cache?
• Fully associative: anywhere
• 2-way set associative: either block of set (12 mod 4) = set 0
• Direct mapped: only cache block (12 mod 8) = block 4
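The placements above are simple modular arithmetic; a small C sketch (the numbers come straight from the slide):

#include <stdio.h>

int main(void) {
    int block = 12;               /* memory block number from the slide */
    int cache_blocks = 8;         /* 8-block cache                      */
    int ways = 2;                 /* 2-way set associative              */
    int sets = cache_blocks / ways;

    printf("direct mapped  : cache block %d\n", block % cache_blocks); /* 12 mod 8 = 4 */
    printf("2-way set assoc: set %d (either way)\n", block % sets);    /* 12 mod 4 = 0 */
    printf("fully assoc    : any of the %d blocks\n", cache_blocks);
    return 0;
}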

Set-Associative Cache Example: 2 KB two-way set-associative cache, 32 B line
 – Two 1 KB direct-mapped caches operate in parallel, 32 lines × 32 bytes each
 – Address split: cache tag = 22 bits, cache index = 5 bits, byte select = 5 bits
 – The cache index selects a "set"; the two tags in the set are compared with the address tag in parallel (two 22-bit comparators)
 – Data is selected based on the tag comparison (a mux picks the hitting way; Hit = OR of the two compares)

Set-Associative Example 2 (figure)

Fully Associative Cache
• Example: 2 KB cache, 32-byte line
 – No cache index (still 64 lines)
 – The cache tags of all entries are compared in parallel
 – With 32 B blocks the tag is 27 bits, so we need 64 × 27-bit comparators
• Conflict misses = 0 for a fully associative cache

Cache Hits vs. Misses Summary
• Read hits: this is what we want!
• Read misses
 – Stall the CPU, fetch the block/line from memory, deliver it to the cache and the CPU
• Write hits: two policies
 – Write-through: update the data in both the cache and memory
 – Write-back: write the data only into the cache (memory is updated later)
• Write misses:
 – Read the entire block/line into the cache, then write the word

Write-Through
• Data-write hit: update the block in the cache
 – Problem: cache and memory would be inconsistent
• Write-through: also update memory
• Makes writes take longer
 – e.g., if base CPI = 1, 10% of instructions are stores, and a write to memory takes 100 cycles
  » Effective CPI = 1 + 0.1 × 100 = 11
• Solution: write buffer
 – Holds data waiting to be written to memory
 – The CPU continues immediately
  » Only stalls on a write if the write buffer is already full
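The effective-CPI arithmetic from the slide, as a small C sketch (the 1-cycle base CPI, 10% store fraction, and 100-cycle write are the slide's example numbers, not measurements):

#include <stdio.h>

int main(void) {
    double base_cpi       = 1.0;    /* base CPI                       */
    double store_fraction = 0.10;   /* 10% of instructions are stores */
    double write_cycles   = 100.0;  /* cycles per write to memory     */

    double effective_cpi = base_cpi + store_fraction * write_cycles;
    printf("Effective CPI = %.0f\n", effective_cpi);  /* 1 + 0.1 * 100 = 11 */
    return 0;
}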

Cache: Write-Back
• On a data-write hit, just update the block in the cache
 – Keep track of whether each block is dirty
• When a dirty block is replaced
 – Write it back to memory
 – A write buffer can hold it temporarily, allowing the replacing block to be read first

Cache Mapping Summary
• Direct mapping
 – Each word in main memory maps to only one fixed location in the cache
  » Many-to-one mapping
 – On replacement, the block occupying that location is evicted
 – Cache index = lower-order bits of the memory address, excluding the byte/block offset
 – Cheap and simple to implement
 – Higher miss ratio
• Other mapping variations
 – Fully associative mapping: expensive, used for specialty caches
 – Set-associative mapping: more practical, popular

Summary: How a Block/Line Is Found in the Cache
• Block address = Tag | Index | Block offset (the index is the set select, the block offset is the data select)
• Direct indexing (using index and block offset), tag compares, or a combination
• Increasing associativity shrinks the index and expands the tag

Modern Memory Hierarchy, 1 Core (figure)
• Per core: a multiported register file, split L1 instruction and data caches, and a large unified L2 cache
• LLC (last-level cache): a large unified secondary cache shared by all cores
• Multiple interleaved memory banks (off-chip DRAM)

Intel Nehalem i7 Cache Hierarchy (figure): 3 cache levels L1-L3, with separate L1 instruction and data caches (IL1, DL1)

Improving Cache Performance
• Define AMAT = HitTime + MissRate × MissPenalty
 – MissRate = number of references that miss the cache / total number of references
 – AMAT = average memory access time (cache and memory)
• Example: CPU with a 1 ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5%
 – AMAT = 1 + 0.05 × 20 = 2 ns, i.e., 2 cycles per access
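A minimal C sketch of the AMAT formula, plugging in the slide's example numbers:

#include <stdio.h>

/* AMAT = hit time + miss rate * miss penalty (all times in cycles) */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    double cycles   = amat(1.0, 0.05, 20.0);  /* 1-cycle hit, 5% misses, 20-cycle penalty */
    double clock_ns = 1.0;                    /* 1 ns clock */
    printf("AMAT = %.1f cycles = %.1f ns\n", cycles, cycles * clock_ns);
    return 0;
}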

Improving Cache Performance: AMAT = HitTime + MissRate × MissPenalty
1. Reduce the miss rate
2. Reduce the miss penalty
3. Reduce the cache hit time

Miss Rate Reduction
1. Larger cache
2. Larger block size
3. Higher associativity
4. HW prefetching of instructions and data
5. SW prefetching of data
6. Compiler optimizations

Cache Miss Classification: the 3 Cs
 – Compulsory: the first access to a block cannot be in the cache, so the block must be brought in [cold-start misses, first-reference misses] (these occur even with an infinite cache)
 – Capacity: the cache cannot contain all the blocks needed [capacity misses]; blocks are discarded and later retrieved (the misses of a fully associative cache of size X)
 – Conflict: in set-associative or direct-mapped caches, a block is discarded and later retrieved because another block maps to the same index [collision or interference misses] (misses in an N-way associative cache of size X)
• A 4th "C":
 – Coherence: misses caused by cache coherence

3 Cs Absolute Miss Rate, SPEC92 (figure; compare with your cache-simulation homework): miss rate vs. cache size, broken into compulsory, capacity, and conflict components.

Larger Block Size Effect, at fixed cache size and associativity (figure): larger blocks reduce compulsory misses, but very large blocks increase conflict misses because the cache has fewer lines.

Associativity and Conflict Misses (figure)

Miss Rate Reduction: reduce misses by
1. Larger cache
2. Larger block size
3. Higher associativity
4. HW prefetching of instructions and data
5. SW prefetching of data
6. Compiler optimizations

Hardware Prefetching of Instructions & Data
• Instruction prefetching
 – IBM POWER5 fetches 2-5 blocks per miss, depending on the level (L2, L3, ...)
 – Extra blocks are placed in a "stream buffer"
 – The stream buffer is checked on a miss
• Works with the data cache as well
• Needs extra memory bandwidth

Reduce Miss Penalty: Multilevel Cache (L2, L3)
• L2 equations
 – AMAT = HitTime_L1 + MissRate_L1 × MissPenalty_L1
 – MissPenalty_L1 = HitTime_L2 + MissRate_L2 × MissPenalty_L2
 – AMAT = HitTime_L1 + MissRate_L1 × (HitTime_L2 + MissRate_L2 × MissPenalty_L2)
• Definitions
 – Local miss rate: misses in this cache divided by the total number of memory accesses to this cache
 – Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU
 – Using the above, the global miss rate of L1 equals its local miss rate
 – For the L2 cache, global miss rate = MissRate_L1 × MissRate_L2
 – The global miss rate is more relevant; the local L2 miss rate looks high because L2 is only accessed occasionally
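A small C sketch of the two-level AMAT and global-miss-rate formulas above; the L1/L2/memory numbers are made-up illustrative values, not figures from the slides:

#include <stdio.h>

int main(void) {
    double hit_l1 = 1.0,  miss_rate_l1 = 0.05;  /* assumed: 1-cycle L1 hit, 5% L1 misses  */
    double hit_l2 = 10.0, miss_rate_l2 = 0.20;  /* assumed: 10-cycle L2 hit, 20% local L2 */
    double mem_penalty = 100.0;                 /* assumed main-memory access time        */

    double miss_penalty_l1 = hit_l2 + miss_rate_l2 * mem_penalty;
    double amat            = hit_l1 + miss_rate_l1 * miss_penalty_l1;
    double global_miss_l2  = miss_rate_l1 * miss_rate_l2;

    printf("AMAT = %.2f cycles\n", amat);                   /* 1 + 0.05*(10 + 0.2*100) = 2.5 */
    printf("Global L2 miss rate = %.3f\n", global_miss_l2); /* 0.05 * 0.20 = 0.01            */
    return 0;
}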

Error Correction: improves reliability in memory & buses
• Memory systems generate errors (accidentally flipped bits)
 – DRAM cells store very little charge
 – "Soft" errors: cells struck by alpha particles / upsets
 – "Hard" errors occur when chips permanently fail
 – The problem gets worse as memories get denser
• Where is "perfect" memory required?
 – Servers, spacecraft/military computers, Mars missions, ...
• ECC (Error Correction Code)
 – Extra bits are added to each data word
 – Detects / corrects faults in the memory system
 – Each data-word value is mapped to a unique "code word"; a fault changes a valid code word into an invalid one, which can be detected

Increasing Memory Bandwidth
• 4-word-wide memory
 – Miss penalty = 1 + 15 + 1 = 17 bus cycles
 – Bandwidth = 16 bytes / 17 cycles ≈ 0.94 B/cycle
• 4-bank interleaved memory
 – Miss penalty = 1 + 15 + 4 × 1 = 20 bus cycles
 – Bandwidth = 16 bytes / 20 cycles = 0.8 B/cycle
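A small C sketch of where those cycle counts come from, assuming the usual textbook timing of 1 bus cycle to send the address, 15 cycles per DRAM access, and 1 bus cycle per word transferred (these assumptions are implied by, not stated on, the slide):

#include <stdio.h>

int main(void) {
    int addr = 1, dram = 15, xfer = 1, words = 4;   /* assumed timing, 4-word block */

    int wide   = addr + dram + xfer;            /* 4-word-wide bus: one access, one transfer   */
    int interl = addr + dram + words * xfer;    /* 4 banks, 1-word bus: word transfers in turn */

    printf("4-word wide : %d cycles, %.2f B/cycle\n", wide,   16.0 / wide);
    printf("interleaved : %d cycles, %.2f B/cycle\n", interl, 16.0 / interl);
    return 0;
}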

Increasing Bandwidth with Interleaving (figure): without interleaving, the CPU starts the access for D2 only after D1 is available; with 4-way interleaving, accesses to banks 0-3 are started back to back and overlap, and bank 0 can be accessed again as soon as its access completes.

Double-Data-Rate (DDR2) DRAM (figure): with a 200 MHz clock and data transferred on both clock edges, the data rate is 400 Mb/s per pin; the timing diagram shows Row, Column, Precharge, Row' and Data phases [Micron, 256 Mb DDR2 SDRAM datasheet].

Virtual Memory: Disk Hierarchy (figure)

The Memory Hierarchy: Pyramid (figure)

Operating System Basics
• A program is brought from disk into memory and placed within a process to be run
• The CPU can only directly access main memory and registers
• Memory protection is required to ensure correct operation

Operating System & User Application (figure): compile, load code, execute code

Operating System Basics: Process vs. Program
• Program: static code and static data
• Process: dynamic instance of code and data
• Process state: anything the code can affect or be affected by, e.g.
 – General-purpose registers, floating point, status, program counter, stack pointer
 – Address space: what the process can address

Memory Management: Logical vs. Physical Address Space
• Binding a logical address space to a separate physical address space is central to memory management
 – Logical address: generated by the CPU; a.k.a. virtual address
 – Physical address: the address seen by the memory unit
• Logical and physical addresses are the same in compile-time and load-time address-binding schemes; logical (virtual) and physical addresses differ in the execution-time address-binding scheme

OS Generates System & Application Processes and Threads (figure): a process can generate several threads, and a process can be in several states

OS Runs Multiple Processes on the CPU (figure): the processor runs one process at a time; each process (e.g., PowerPoint, Word) has its own context (CPU registers, stack, page table) and its own pages in memory.

How Swapping Happens (figure)

Context Switching Between Processes
• Time-slicing
 – The OS kernel's scheduler controls the time-slice and performs the context switch
• Preemption
 – A higher-priority task preempts the currently running task
(Figure: time-slices and context switches among PowerPoint, Word, IE, and embedded apps)

Virtual Memory Concept
• Main memory is used as a "cache" for the disk
• Managed by the CPU and the OS
• Programs share main memory
 – Each process gets its own private virtual address space (4 GB in MIPS / IA-32)
 – Protected from other programs
• The CPU and OS translate virtual addresses to physical addresses (mapping)
 – A VM "block" is called a page, e.g., 4 KB
 – A VM translation "miss" is called a page fault

Modern Virtual Memory Systems: the illusion of a large, private, uniform store
• Protection & privacy: several apps, each with private and shared address spaces, plus the OS
• Demand paging: can run programs larger than primary memory, swapping pages to a backing store
• Hides differences in machine configurations
• Price: address translation on each memory reference (VA to PA mapping, via the TLB)

Paging, How It Works (figure): virtual address to physical address

PTE Contents (field widths: M bit = 1, R bit = 1, V bit = 1, protection bits = 1-2, page frame number ≈ 20 bits)
• Modify bit M: whether the page has been modified, a.k.a. "dirty"
 – Updated each time a WRITE to the page occurs
• Reference bit R: whether the page has been accessed
 – Updated each time a READ or a WRITE occurs
• V bit: whether or not the PTE can be used, a.k.a. Valid
 – Checked each time the virtual address is used
• Protection bits: what operations are allowed on this page
 – READ, WRITE, EXECUTE
• Page frame number: where the page is in memory
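A C sketch of a PTE laid out with the field widths above; the bit ordering, the 2-bit protection field, and the padding are illustrative assumptions, not any particular processor's format:

#include <stdint.h>
#include <stdio.h>

struct pte {
    uint32_t valid      : 1;   /* V: the PTE can be used                  */
    uint32_t referenced : 1;   /* R: the page has been read or written    */
    uint32_t modified   : 1;   /* M: the page has been written ("dirty")  */
    uint32_t protect    : 2;   /* allowed operations (read/write/execute) */
    uint32_t frame      : 20;  /* page frame number (physical page)       */
    uint32_t unused     : 7;   /* padding to 32 bits                      */
};

int main(void) {
    struct pte e = { .valid = 1, .protect = 0x1, .frame = 0x12345 };
    printf("valid=%u frame=0x%x\n", e.valid, e.frame);
    return 0;
}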

Each Process Has Its Own Private Address Space of 2^32 bytes (figure: a virtual address is split into page number and offset; PowerPoint, Safari, and Chat each have their own page table mapping their virtual pages into physical memory, alongside OS pages and free frames)
• Each process has its own page table
• The page table contains an entry for each user page

Address Translation Mechanism (figure)

Page Table Example (figure): virtual to physical address

Page Fault Penalty & Demand Paging: load pages from disk on demand
• Page fault (page not present): the page is fetched from disk
 – Takes millions of clock cycles
 – If no free page frame is left, a page is replaced (LRU)
 – The page table is updated to point to the new location of the page on disk
 – Handled by OS code
• To minimize the page fault rate
 – Fully associative placement
 – Smart replacement algorithms (LRU)

Page Fault: Page Not Present (figure)

Page Table Summary
• Stores placement information
 – An array of page table entries (PTEs), indexed by virtual page number
 – A page table register in the CPU points to the page table in physical memory
• If the page is present in memory
 – The PTE stores the physical page number
 – Plus other status bits (referenced R, dirty D, ...)
• If the page is not present: page fault
 – The PTE can refer to a location in swap space on disk
• Each PTE contains a physical page pointer, D (dirty), P (present), A (accessed), R/W (read/write), PCD (cache disable)

Page Tables Are in Physical Memory (figure: each process's page table, e.g., PT for PPT and PT for Word, lives in memory alongside its pages)
• A memory access needs one reference to retrieve the page base address from the page table and another to access the data
• This doubles the number of memory references!

Linear Page Table
• A page table entry (PTE) contains:
 – A bit to indicate whether the page exists
 – PPN (physical page number) for a memory-resident page
 – DPN (disk page number) for a page on the disk
 – Status bits for protection and usage
• The OS sets the page table base register whenever the active user process changes
(Figure: the virtual address is split into VPN and offset; the VPN indexes the page table, whose PTEs hold PPNs or DPNs, and the PPN plus the offset select the data word in a data page)

The Size of a Linear (Single-Level) Page Table Is Huge
• With 32-bit addresses, 4 KB pages, and 4-byte PTEs: 2^20 PTEs, i.e., a 4 MB page table per user process
• Larger pages are possible, but:
 – Internal fragmentation (not all memory in a page is used)
 – Larger page fault penalty (more time to read from disk)
• What about a 64-bit virtual address space?
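The 4 MB figure is straightforward arithmetic; a C sketch using the slide's parameters:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* 32-bit virtual addresses, 4 KB pages, 4-byte PTEs */
    uint64_t va_bits = 32, page_bits = 12, pte_bytes = 4;

    uint64_t num_ptes    = 1ULL << (va_bits - page_bits);  /* 2^20 entries */
    uint64_t table_bytes = num_ptes * pte_bytes;           /* 4 MB         */

    printf("%llu PTEs, %llu MB per process\n",
           (unsigned long long)num_ptes,
           (unsigned long long)(table_bytes >> 20));

    /* A flat table covering a full 64-bit address space would be astronomically
       larger, which is why hierarchical page tables are used. */
    return 0;
}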

Hierarchical (2-Level) Page Table, as used in Intel processors (figure)
• Virtual address split: bits 31-22 = p1 (10-bit L1 index), bits 21-12 = p2 (L2 index), bits 11-0 = offset
• A processor register holds the root of the current page table; p1 indexes the level-1 page table, p2 indexes a level-2 page table, which points to the data page
• A PTE may refer to a page in primary memory, a page in secondary memory, or a nonexistent page
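A C sketch of a software walk of such a 10/10/12 two-level table; pte_t, the PRESENT bit, the frame mask, and returning 0 on a fault are all illustrative assumptions, not Intel's actual hardware interface:

#include <stdint.h>

typedef uint32_t pte_t;
#define PTE_PRESENT 0x1u
#define FRAME_MASK  0xFFFFF000u

/* Translate a 32-bit virtual address; returns 0 to signal a page fault. */
uint32_t translate(const pte_t *l1_table, uint32_t va) {
    uint32_t p1     = (va >> 22) & 0x3FF;   /* bits 31..22: level-1 index */
    uint32_t p2     = (va >> 12) & 0x3FF;   /* bits 21..12: level-2 index */
    uint32_t offset =  va        & 0xFFF;   /* bits 11..0 : page offset   */

    pte_t l1 = l1_table[p1];
    if (!(l1 & PTE_PRESENT)) return 0;      /* level-2 table not present  */

    const pte_t *l2_table = (const pte_t *)(uintptr_t)(l1 & FRAME_MASK);
    pte_t l2 = l2_table[p2];
    if (!(l2 & PTE_PRESENT)) return 0;      /* page not present           */

    return (l2 & FRAME_MASK) | offset;      /* physical address           */
}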

Disk Writes & Page Replacement
• Least-recently-used (LRU) replacement
 – The reference bit in the PTE is set to 1 on each access to the page
 – It is periodically cleared to 0 by the OS
 – A page with reference bit = 0 has not been used recently
• Disk writes take millions of cycles
 – A complete page (4 KB) is written, not individual locations
 – Use write-back; writing through to memory and disk is impractical
 – The dirty bit in the PTE is set when the page is written
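A C sketch of the reference-bit approximation of LRU described above; the pte struct, the array-based scan, and tie-breaking on the dirty bit are illustrative assumptions about how an OS might do this:

#include <stdint.h>

struct pte { uint8_t valid, referenced, dirty; uint32_t frame; };

/* Called periodically by the OS: clear every reference bit. */
void clear_reference_bits(struct pte *table, int n) {
    for (int i = 0; i < n; i++)
        table[i].referenced = 0;
}

/* Pick a victim: prefer a page not referenced recently, and among those
   prefer a clean one so no write-back to disk is needed. */
int pick_victim(const struct pte *table, int n) {
    for (int i = 0; i < n; i++)
        if (table[i].valid && !table[i].referenced && !table[i].dirty)
            return i;
    for (int i = 0; i < n; i++)
        if (table[i].valid && !table[i].referenced)
            return i;                        /* dirty: must be written back */
    return 0;                                /* everything recently used    */
}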

Fast Address Translation Using a Translation Lookaside Buffer (TLB)
• TLB:
 – A fast on-chip cache of PTEs
 – 64-512 PTEs typical
 – Misses handled by hardware
 – Could be fully associative
• On a TLB hit, address translation adds no extra memory references:
 – The PTE comes from the TLB, not from memory
 – Only the actual page (memory) access remains
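A C sketch of a small fully associative TLB lookup; the entry count, field names, and sequential scan (real hardware compares all entries in parallel) are illustrative assumptions:

#include <stdint.h>

#define TLB_ENTRIES 64

struct tlb_entry { uint8_t valid; uint32_t vpn; uint32_t pfn; };
static struct tlb_entry tlb[TLB_ENTRIES];

/* Return 1 on a TLB hit and fill *pfn; return 0 on a miss
   (the miss is then handled by hardware or an OS handler). */
int tlb_lookup(uint32_t vpn, uint32_t *pfn) {
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pfn = tlb[i].pfn;
            return 1;
        }
    }
    return 0;
}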

Fast Translation Using a TLB (figure)

TLB Miss Handling: Page in Memory
• If the page is in memory
 – Load the PTE from the page table in memory into the TLB and retry
 – Handled by hardware
 – Or by software: a special exception with an optimized handler loads the TLB with the valid PTE from the page table in memory

TLB Miss Handling: Page Not in Memory
• If the page is not in memory (page fault, not present)
 – The OS fetches the page from disk and updates the page table (see above)
 – The faulting instruction is then restarted

TLB and Cache Interaction (figure)

Memory Protection
• Hardware support for OS protection
 – Protection bits in the PTE (U/S, R/W, ...)
 – Privileged supervisor mode, U/S (aka kernel mode)
 – Privileged instructions
 – Page table information only accessible in supervisor mode
• Different tasks can share parts of their virtual address spaces
 – Need protection against errant access
 – OS assisted

Summary: Three Advantages of Virtual Memory
• Translation:
 – A program can be given a consistent view of memory, even though physical memory is scrambled
 – Makes multithreading reasonable
 – Only the most important part of the program (the "working set") must be in physical memory
 – Contiguous structures (like stacks) use only as much physical memory as necessary, yet can still grow later
• Protection:
 – Different threads (or processes) are protected from each other
 – Different pages can be given special behavior
  » (read only, invisible to user programs, etc.)
 – Kernel data is protected from user programs
 – Very important for protection from malicious programs (hence far more "viruses" under Microsoft Windows)
• Sharing:
 – The same physical page can be mapped to multiple users ("shared memory")

Nehalem Core (figure): IA instructions are decoded into RISC-like micro-ops and tracked in program order

Nehalem Microarchitecture (figure)

Skylake Core Microarchitecture (figure)

IBM POWER4 (figure)

POWER4 Chip
• The POWER4 chip has two processors
• Unified L2 cache, accessed through the Core Interface Unit (CIU)
• Each L2 cache controller can feed 32 bytes of data per cycle to each core
• The CIU connects each of the three L2 controllers to each core's data and instruction caches

IBM POWER4 (now up to POWER8)
• Dual processor cores
• 8-way superscalar, out-of-order execution
 – 2 load/store units
 – 2 fixed-point units
 – 2 floating-point units
 – Logical operations on the condition register
 – Branch execution unit
• More than 200 instructions in flight
• Hardware instruction and data prefetch

POWER4 Core (figure)

POWER4 Pipeline (figure): IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping (rename), ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data cache, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, CP = group commit.

Multiprocessing Memory Hierarchy (figure)
• Conventional hierarchy: processor, L1 cache, L2 cache, memory
• Multiprocessor hierarchy: each processor has private L1 and L2 caches, connected by an interconnect to a shared L3 cache and shared memory
• Collectively, the multiprocessors have a large, fast cache

SMP: Shared-Memory Multiprocessor
 – Hardware provides a single physical address space for all processors
 – Shared variables are synchronized using locks
 – Memory access time: UMA (uniform) vs. NUMA (nonuniform)

Distributed Memory: Message Passing
• Each processor has a private physical address space
• Communication by sending/receiving messages between processors

Multicore Systems (figures): 2 × quad-core Intel Xeon e5345 (Clovertown); 2 × quad-core AMD Opteron X4 2356 (Barcelona)

Multicore Examples (figures): 2 × oct-core Sun UltraSPARC T2 5140 (Niagara 2); 2 × oct-core IBM Cell QS20

And Their Rooflines (figure)
• Kernels: SpMV (left), LBMHD (right)
• Some optimizations change arithmetic intensity
• x86 systems have higher peak GFLOPs
 – But these are harder to achieve, given memory bandwidth

Example: NVIDIA Tesla (figure): a streaming multiprocessor with 8 × streaming processors

Parallel Programming Is Difficult
• Parallel software is the problem
• Need to get significant performance improvement
 – Otherwise, just use a faster uniprocessor, since it's easier!
• Difficulties
 – Partitioning
 – Coordination
 – Communications overhead

Loosely Coupled Clusters
• A network of independent computers
 – Each has private memory and its own OS
 – Connected using the I/O system
  » e.g., Ethernet/switch, Internet
• Suitable for applications with independent tasks
 – Web servers, databases, simulations, ...
• High availability, scalable, affordable
• Problems
 – Administration cost (prefer virtual machines)
 – Low interconnect bandwidth
  » cf. processor/memory bandwidth on an SMP

REVIEW Example: Direct-Mapped Cache with 8 Entries, 32-word memory (figure: cache indices 000-111; memory addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, 11101 all map to cache index 001)
• Lower-order 3 bits of the memory address = cache index; higher-order 2 bits = cache tag
• Block size = 1 word in this example (line size = 1)
• The processor accesses data with a main memory address
 – Is the requested data in the cache?
 – Where is it in the cache?

Cache Example
• 8 blocks, 1 word/block, direct mapped
• Initial state: all entries invalid (V = N)

Index  V  Tag  Data
000    N
001    N
010    N
011    N
100    N
101    N
110    N
111    N

Cache Example (continued)

Word addr  Binary addr  Hit/miss  Cache block
22         10 110       Miss      110

Index  V  Tag  Data
000    N
001    N
010    N
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N

Cache Example (continued)

Word addr  Binary addr  Hit/miss  Cache block
26         11 010       Miss      010

Index  V  Tag  Data
000    N
001    N
010    Y  11   Mem[11010]
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N

Cache Example (continued)

Word addr  Binary addr  Hit/miss  Cache block
22         10 110       Hit       110
26         11 010       Hit       010

Index  V  Tag  Data
000    N
001    N
010    Y  11   Mem[11010]
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N

Cache Example (continued)

Word addr  Binary addr  Hit/miss  Cache block
16         10 000       Miss      000
3          00 011       Miss      011
16         10 000       Hit       000

Index  V  Tag  Data
000    Y  10   Mem[10000]
001    N
010    Y  11   Mem[11010]
011    Y  00   Mem[00011]
100    N
101    N
110    Y  10   Mem[10110]
111    N

Cache Example (continued)

Word addr  Binary addr  Hit/miss  Cache block
18         10 010       Miss      010

Index  V  Tag  Data
000    Y  10   Mem[10000]
001    N
010    Y  10   Mem[10010]
011    Y  00   Mem[00011]
100    N
101    N
110    Y  10   Mem[10110]
111    N

Location 18 replaces location 26.
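A C sketch that replays the access sequence above on an 8-entry, 1-word-per-block, direct-mapped cache and prints hit or miss for each access; the hit/miss pattern matches the tables on the preceding slides:

#include <stdio.h>
#include <stdint.h>

struct line { int valid; uint32_t tag; };

int main(void) {
    struct line cache[8] = {0};
    uint32_t trace[] = {22, 26, 22, 26, 16, 3, 16, 18};   /* word addresses */

    for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++) {
        uint32_t addr  = trace[i];
        uint32_t index = addr & 0x7;   /* low 3 bits  = cache index */
        uint32_t tag   = addr >> 3;    /* high 2 bits = tag         */

        int hit = cache[index].valid && cache[index].tag == tag;
        if (!hit) {                    /* allocate / replace on miss */
            cache[index].valid = 1;
            cache[index].tag   = tag;
        }
        printf("addr %2u -> index %u: %s\n", addr, index, hit ? "hit" : "miss");
    }
    return 0;
}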