CSCE 513 Computer Architecture Lec 09 Memory Hierarchy

CSCE 513 Computer Architecture
Lec 09 Memory Hierarchy yet again

Topics
  • Memory Hierarchy Review
      • Terminology review
      • Basic Equations
      • 6 Basic Optimizations
  • Memory Hierarchy – Chapter 2

Readings: Appendix B, Chapter 2

September 21, 2016, CSCE 513 Fall 2016

AMAT Equations

Terminology (abbreviations)
  • AMAT: average memory access time
  • HT: hit time
  • MR: miss rate
  • MP: miss penalty

AMAT – weighted average

AMAT – weighted average (continued)
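The equations on these two slides did not survive in the transcript. As a sketch, the standard textbook form is AMAT = HT + MR × MP, which is the same as weighting the hit case by (1 − MR) and the miss case by MR. The function names and the sample numbers below are illustrative assumptions, not values from the lecture.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time: AMAT = HT + MR * MP."""
    return hit_time + miss_rate * miss_penalty


def amat_weighted(hit_time, miss_rate, miss_penalty):
    """The same quantity written as a weighted average of two cases."""
    hit_case = (1.0 - miss_rate) * hit_time                 # access hits
    miss_case = miss_rate * (hit_time + miss_penalty)       # access misses
    return hit_case + miss_case


def amat_two_level(ht_l1, mr_l1, ht_l2, local_mr_l2, mp_l2):
    """Two-level hierarchy: the L1 miss penalty is the AMAT of L2."""
    return amat(ht_l1, mr_l1, amat(ht_l2, local_mr_l2, mp_l2))


if __name__ == "__main__":
    # Assumed numbers: 1-cycle hit, 5% miss rate, 100-cycle penalty.
    print(amat(1.0, 0.05, 100.0))
    print(amat_weighted(1.0, 0.05, 100.0))
    print(amat_two_level(1.0, 0.05, 10.0, 0.2, 100.0))
```

Both single-level forms give the same result; the two-level form simply substitutes the L2 AMAT for the L1 miss penalty.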

2.2 - 10 Advanced Cache Optimizations

Five Categories
  1. Reducing hit time: small and simple first-level caches and way prediction. Both techniques also generally decrease power consumption.
  2. Increasing cache bandwidth: pipelined caches, multibanked caches, and nonblocking caches. These techniques have varying impacts on power consumption.
  3. Reducing the miss penalty: critical word first and merging write buffers. These optimizations have little impact on power.
  4. Reducing the miss rate: compiler optimizations.
  5. Reducing the miss penalty or miss rate via parallelism: hardware prefetching and compiler prefetching.

Advanced Optimizations: Way Prediction

  • To improve hit time, predict the way to pre-set the mux
      • Misprediction gives longer hit time
      • Prediction accuracy
          • > 90% for two-way
          • > 80% for four-way
          • I-cache has better accuracy than D-cache
      • First used on MIPS R10000 in mid-90s
      • Used on ARM Cortex-A8
  • Extend to predict block as well
      • "Way selection"
      • Increases misprediction penalty

Copyright © 2012, Elsevier Inc. All rights reserved.
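To see how prediction accuracy feeds into hit time, here is a minimal sketch that averages the fast (correctly predicted) and slow (mispredicted) hit latencies. The 1-cycle and 2-cycle latencies are assumptions chosen for illustration; only the 90% and 80% accuracies come from the slide.

```python
def average_hit_time(accuracy, predicted_cycles, mispredicted_cycles):
    """Expected hit time when a fraction `accuracy` of hits are predicted correctly."""
    return accuracy * predicted_cycles + (1.0 - accuracy) * mispredicted_cycles


if __name__ == "__main__":
    # Assumed latencies: 1 cycle on a correct prediction, 2 cycles otherwise.
    print(average_hit_time(0.90, 1, 2))  # two-way accuracy from the slide
    print(average_hit_time(0.80, 1, 2))  # four-way accuracy from the slide
```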

Advanced Optimizations: Pipelining Cache

  • Pipeline cache access to improve bandwidth
      • Examples:
          • Pentium: 1 cycle
          • Pentium Pro – Pentium III: 2 cycles
          • Pentium 4 – Core i7: 4 cycles
  • Increases branch misprediction penalty
  • Makes it easier to increase associativity

Advanced Optimizations: Nonblocking Caches

  • Allow hits before previous misses complete
      • "Hit under miss"
      • "Hit under multiple miss"
  • L2 must support this
  • In general, processors can hide L1 miss penalty but not L2 miss penalty

Advanced Optimizations: Multibanked Caches

  • Organize cache as independent banks to support simultaneous access
      • ARM Cortex-A8 supports 1-4 banks for L2
      • Intel i7 supports 4 banks for L1 and 8 banks for L2
  • Interleave banks according to block address
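Interleaving by block address can be sketched as a modulo mapping: consecutive blocks land in consecutive banks, so a sequential stream spreads across all banks. The 64-byte block size is an assumption, and the 4-bank default simply mirrors the i7 L1 figure quoted on the slide.

```python
def bank_of(byte_address, block_size=64, num_banks=4):
    """Bank selected by sequential interleaving on the block address."""
    block_address = byte_address // block_size
    return block_address % num_banks


def conflict(addr_a, addr_b, block_size=64, num_banks=4):
    """True when two accesses target the same bank and cannot proceed in parallel."""
    return bank_of(addr_a, block_size, num_banks) == bank_of(addr_b, block_size, num_banks)


if __name__ == "__main__":
    print([bank_of(a) for a in range(0, 5 * 64, 64)])  # [0, 1, 2, 3, 0]
    print(conflict(0, 64))    # neighbouring blocks: different banks
    print(conflict(0, 256))   # four blocks apart: same bank
```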

Advanced Optimizations: Critical Word First, Early Restart

  • Critical word first
      • Request missed word from memory first
      • Send it to the processor as soon as it arrives
  • Early restart
      • Request words in normal order
      • Send missed word to the processor as soon as it arrives
  • Effectiveness of these strategies depends on block size and likelihood of another access to the portion of the block that has not yet been fetched
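The difference between the two strategies is the order in which words arrive and how many must arrive before the processor resumes. The sketch below assumes an 8-word block and a wrap-around fetch order for critical word first; both are illustrative assumptions rather than details from the slide.

```python
def normal_order(words_per_block):
    """Fetch order used by early restart: word 0 first."""
    return list(range(words_per_block))


def critical_word_first_order(words_per_block, missed_word):
    """Fetch order starting at the missed word and wrapping around (assumed)."""
    return [(missed_word + i) % words_per_block for i in range(words_per_block)]


def words_until_restart(order, missed_word):
    """Number of words that must arrive before the processor can continue."""
    return order.index(missed_word) + 1


if __name__ == "__main__":
    missed = 5
    cwf = critical_word_first_order(8, missed)
    print(cwf)                                            # [5, 6, 7, 0, 1, 2, 3, 4]
    print(words_until_restart(cwf, missed))               # 1
    print(words_until_restart(normal_order(8), missed))   # 6
```

Without either technique the processor would wait for all 8 words, which is why the benefit grows with block size.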

Advanced Optimizations: Merging Write Buffer

  • When storing to a block that is already pending in the write buffer, update write buffer
  • Reduces stalls due to full write buffer
  • Do not apply to I/O addresses

[Figure panels: No write buffering; Write buffering]
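A toy model makes the stall reduction concrete. This is a simplified sketch, not a description of any real design: the `WriteBuffer` class, the 4-entry capacity, and the 32-byte block size are assumptions, the buffer never drains to memory, and the I/O-address exception is not modeled.

```python
class WriteBuffer:
    """Toy write buffer; each entry holds one block's pending stores."""

    def __init__(self, num_entries=4, block_size=32, merging=True):
        self.num_entries = num_entries
        self.block_size = block_size
        self.merging = merging
        self.entries = []  # list of (block_address, {offset: value})

    def store(self, address, value):
        """Return True if the store is buffered, False if the buffer is full."""
        block, offset = divmod(address, self.block_size)
        if self.merging:
            for entry_block, data in self.entries:
                if entry_block == block:
                    data[offset] = value  # merge into the pending entry
                    return True
        if len(self.entries) == self.num_entries:
            return False  # the processor would stall here
        self.entries.append((block, {offset: value}))
        return True


def count_stalls(addresses, merging):
    """Count stores rejected by a never-draining buffer (worst case)."""
    buf = WriteBuffer(merging=merging)
    return sum(0 if buf.store(a, 0) else 1 for a in addresses)


if __name__ == "__main__":
    stores = [100, 108, 116, 124, 132]  # first four fall in one 32-byte block
    print(count_stalls(stores, merging=False))  # 1
    print(count_stalls(stores, merging=True))   # 0
```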

Advanced Optimizations: Compiler Optimizations

  • Loop Interchange
      • Swap nested loops to access memory in sequential order
  • Blocking
      • Instead of accessing entire rows or columns, subdivide matrices into blocks
      • Requires more memory accesses but improves locality of accesses
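The effect of loop interchange can be checked with a small simulation instead of a timing run. The sketch below counts misses in an assumed direct-mapped cache (64 lines of 64 bytes) for a row-major 128 × 128 array of 8-byte elements, walked once with the column index in the inner loop and once with the loops swapped. All sizes are illustrative assumptions, and only interchange is shown, not blocking.

```python
def count_misses(addresses, num_lines=64, block_size=64):
    """Count misses in a direct-mapped cache for a sequence of byte addresses."""
    tags = [None] * num_lines
    misses = 0
    for addr in addresses:
        block = addr // block_size
        index = block % num_lines
        if tags[index] != block:
            tags[index] = block
            misses += 1
    return misses


def traversal(rows, cols, row_major, elem_size=8):
    """Byte addresses touched when walking a row-major rows x cols array."""
    if row_major:
        # for i: for j:  sequential addresses
        return [(i * cols + j) * elem_size for i in range(rows) for j in range(cols)]
    # for j: for i:  stride of one full row between accesses
    return [(i * cols + j) * elem_size for j in range(cols) for i in range(rows)]


if __name__ == "__main__":
    print(count_misses(traversal(128, 128, row_major=True)))   # 2048
    print(count_misses(traversal(128, 128, row_major=False)))  # 16384
```

With these sizes the sequential walk misses once per 64-byte block, while the strided walk misses on every access because rows that are four apart collide in the same cache line.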

Advanced Optimizations: Hardware Prefetching

  • Fetch two blocks on miss (include next sequential block)

[Figure: Pentium 4 pre-fetching]
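A minimal sketch of "fetch two blocks on miss": on every demand miss the next sequential block is also brought in. As simplifying assumptions, the prefetched block is placed straight into a direct-mapped cache rather than a separate stream buffer, and the trace is given as block numbers.

```python
def count_demand_misses(block_trace, num_lines=64, prefetch_next=False):
    """Count demand misses, optionally prefetching block + 1 on each miss."""
    lines = [None] * num_lines
    misses = 0
    for block in block_trace:
        if lines[block % num_lines] != block:
            misses += 1
            lines[block % num_lines] = block
            if prefetch_next:
                nxt = block + 1
                lines[nxt % num_lines] = nxt  # may evict a useful block
    return misses


if __name__ == "__main__":
    sequential = list(range(16))
    print(count_demand_misses(sequential))                      # 16
    print(count_demand_misses(sequential, prefetch_next=True))  # 8
```

For a purely sequential stream the demand misses halve; for irregular access patterns the prefetch can instead evict useful data, which the `lines[nxt % num_lines] = nxt` line makes visible.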

Advanced Optimizations: Compiler Prefetching

  • Insert prefetch instructions before data is needed
  • Non-faulting: prefetch doesn't cause exceptions
  • Register prefetch
      • Loads data into register
  • Cache prefetch
      • Loads data into cache
  • Combine with loop unrolling and software pipelining

Advanced Optimizations: Summary

Memory Technology

  • Performance metrics
      • Latency is concern of cache
      • Bandwidth is concern of multiprocessors and I/O
      • Access time
          • Time between read request and when desired word arrives
      • Cycle time
          • Minimum time between unrelated requests to memory
  • DRAM used for main memory, SRAM used for cache

Memory Technology

  • SRAM
      • Requires low power to retain bit
      • Requires 6 transistors/bit
  • DRAM
      • Must be re-written after being read
      • Must also be periodically refreshed
          • Every ~8 ms
          • Each row can be refreshed simultaneously
      • One transistor/bit
      • Address lines are multiplexed:
          • Upper half of address: row access strobe (RAS)
          • Lower half of address: column access strobe (CAS)
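The multiplexed addressing can be sketched as a simple bit split: the upper half of the address is sent with RAS and the lower half with CAS over the same pins. The 20-bit address width and the function name are assumptions for illustration, and the sketch assumes an even number of address bits.

```python
def split_dram_address(address, address_bits=20):
    """Split an address into (row, column) halves for RAS and CAS."""
    half = address_bits // 2
    row = address >> half                    # upper half, sent with RAS
    column = address & ((1 << half) - 1)     # lower half, sent with CAS
    return row, column


if __name__ == "__main__":
    row, col = split_dram_address(0xABCDE)
    print(hex(row), hex(col))  # 0x2af 0xde
```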

Memory Technology

  • Amdahl:
      • Memory capacity should grow linearly with processor speed
      • Unfortunately, memory capacity and speed have not kept pace with processors
  • Some optimizations:
      • Multiple accesses to same row
      • Synchronous DRAM
          • Added clock to DRAM interface
          • Burst mode with critical word first
      • Wider interfaces
      • Double data rate (DDR)
      • Multiple banks on each DRAM device

Memory Technology: Memory Optimizations [two figure-only slides]

Memory Technology: Memory Optimizations

  • DDR:
      • DDR2
          • Lower power (2.5 V -> 1.8 V)
          • Higher clock rates (266 MHz, 333 MHz, 400 MHz)
      • DDR3
          • 1.5 V
          • 800 MHz
      • DDR4
          • 1-1.2 V
          • 1600 MHz
  • GDDR5 is graphics memory based on DDR3
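The clock rates above translate into peak bandwidth once the two transfers per clock and the interface width are factored in. The sketch below assumes a 64-bit (8-byte) DIMM data path, which is not stated on the slide; the clock rates are the ones listed.

```python
def peak_bandwidth_mb_per_s(clock_mhz, bus_width_bits=64, transfers_per_clock=2):
    """Peak transfer rate in MB/s: clock x transfers per clock x bytes per transfer."""
    return clock_mhz * transfers_per_clock * bus_width_bits // 8


if __name__ == "__main__":
    for name, clock in [("DDR2", 400), ("DDR3", 800), ("DDR4", 1600)]:
        print(name, peak_bandwidth_mb_per_s(clock), "MB/s")
```

Under that assumption, an 800 MHz DDR3 interface peaks at 12800 MB/s; sustained bandwidth is lower because of row activations, refresh, and bus turnarounds.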

Memory Technology: Memory Optimizations

  • Graphics memory:
      • Achieve 2-5X bandwidth per DRAM vs. DDR3
          • Wider interfaces (32 vs. 16 bit)
          • Higher clock rate
              • Possible because they are attached via soldering instead of socketed DIMM modules
  • Reducing power in SDRAMs:
      • Lower voltage
      • Low power mode (ignores clock, continues to refresh)

Memory Technology: Memory Power Consumption

Memory Technology: Flash Memory

  • Type of EEPROM
  • Must be erased (in blocks) before being overwritten
  • Non-volatile
  • Limited number of write cycles
  • Cheaper than SDRAM, more expensive than disk
  • Slower than SRAM, faster than disk

Understand ReadyBoost and whether it will Speed Up your System

Windows 7 supports Windows ReadyBoost.
  • This feature uses external USB flash drives as a hard disk cache to improve disk read performance.
  • Supported external storage types include USB thumb drives, SD cards, and CF cards.
  • Since ReadyBoost will not provide a performance gain when the primary disk is an SSD, Windows 7 disables ReadyBoost when reading from an SSD drive.

External storage must meet the following requirements:
  • Capacity of at least 256 MB, with at least 64 kilobytes (KB) of free space. The 4-GB limit of Windows Vista has been removed.
  • At least a 2.5 MB/sec throughput for 4-KB random reads
  • At least a 1.75 MB/sec throughput for 1-MB random writes

http://technet.microsoft.com/en-us/magazine/ff356869.aspx

Memory Technology: Memory Dependability

  • Memory is susceptible to cosmic rays
  • Soft errors: dynamic errors
      • Detected and fixed by error correcting codes (ECC)
  • Hard errors: permanent errors
      • Use spare rows to replace defective rows
  • Chipkill: a RAID-like error recovery technique
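To illustrate how ECC detects and fixes a soft error, here is a sketch using the small Hamming(7,4) code. This is a stand-in chosen for brevity: real ECC memory typically protects much wider words with a single-error-correct, double-error-detect code, and the function names below are illustrative.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits as 7 bits laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (corrected codeword, 1-based error position or 0 if clean)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # syndrome names the flipped bit
    if position:
        c[position - 1] ^= 1
    return c, position


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    corrupted = list(word)
    corrupted[4] ^= 1  # simulate a single-bit soft error at position 5
    fixed, position = hamming74_correct(corrupted)
    print(position, fixed == word)  # 5 True
```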

Solid State Drives

http://en.wikipedia.org/wiki/Solid-state_drive
http://www.tomshardware.com/charts/hard-drives-and-ssds,3.html

  • Hard Drives: 34 dimensions, e.g. Desktop performance
  • SSD

Windows Experience Index

Control Panel\All Control Panel Items\Performance Information and Tools