Memory Access Cycle and the Measurement of Memory Systems

Xian-He Sun, Dawei Wang
November 2011

Memory Wall Problem
[Figure: Processor-DRAM Memory Gap. Labels: µProc 1.20/yr.; "Moore's Law"; µProc 1.52/yr. (2X/1.5 yr); DRAM 7%/yr. (2X/10 yrs); Processor-Memory Performance Gap (grows 50% / year)]
  • 1980: no cache in the microprocessor; 2010: 3-level cache on chip, 4-level cache off chip
  • 1989: the first Intel processor with on-chip L1 cache was the Intel 486, 8 KB size
  • 1995: the first Intel processor with on-chip L2 cache was the Intel Pentium Pro, 256 KB size
  • 2003: the first Intel processor with on-chip L3 cache was the Intel Itanium 2, 6 MB size
Source: Computer Architecture: A Quantitative Approach

Extremely Unbalanced Operation Latency
[Figure: bar chart of operation latency in cycles (y-axis 0 to 450) for ALU Inst, FP Cmp, FP Mul, L1 Access, FP Div, L2 Access, L3 Access, and MM Access. Recoverable data labels: 1, 2, 4, 4, 10, 20. Annotation: IO Access 5~15 M cycles]

Data Access becomes THE Bottleneck
  • Applications become data intensive
      o Animation and visualization applications
      o Data mining, information retrieval
      o Geographic information systems, etc.
      o Scientific and engineering simulation
  • Need a better understanding of memory system performance
  • Need a new performance metric for memory systems
[Images. Sources: Gromacs, MPQC, Multi-grid solver, NaSt3DGP]

Complexity of Memory Hierarchy
(Upper levels are faster; lower levels are larger.)

Level           Capacity                 Access Time     Bandwidth
CPU Registers   <8 KB                    <0.2~0.5 ns     500~800 GB/s/core
Cache           <50 MB                   1-10 ns         50~150 GB/s/core
Main Memory     Giga Bytes               50 ns-100 ns    5~10 GB/s/channel
Disk            Tera Bytes               5 ms            100~300 MB/s
Tape            Peta Bytes or infinite   sec-min

Staging Xfer Unit between levels
Registers / Cache      Instr. Operands   1-8 bytes      managed by prog./compiler
Cache / Main Memory    Cache Blocks      32-128 bytes   managed by cache cntl
Main Memory / Disk     Pages             4K-4M bytes    managed by OS
Disk / Tape            Files             Mbytes         managed by user/operator

Complexity of Data Access
  • The complexity of CPU design
      o Out-of-order execution
      o Multithreading technology
      o Speculation mechanisms
  • The complexity of memory design
      o Advanced cache technologies
      o Tens or hundreds of cache accesses are allowed to overlap with each other
      o The processor continues executing instructions under multiple cache misses

Existing Memory Metrics
  • Miss Rate (MR)
      o {the number of missed memory accesses} over {the total number of memory accesses}
  • Misses Per Kilo-Instructions (MPKI)
      o {the number of missed memory accesses × 1000} over {the total number of committed instructions}
  • Average Miss Penalty (AMP)
      o {the sum of the individual miss latencies} over {the number of missed memory accesses}
  • Average Memory Access Time (AMAT)
      o AMAT = Hit time + MR × AMP
  • Flaw of existing metrics
      o They focus on a single component, or
      o on a single memory access
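As a minimal sketch of how these four metrics are computed from raw counts, the Python below uses hypothetical function and parameter names (none of them come from the slides) and takes MPKI in its usual sense of misses per thousand committed instructions.

```python
# Illustrative helpers only; all names here are assumptions, not part of the slides.

def miss_rate(misses, accesses):
    """MR: missed accesses divided by total accesses."""
    return misses / accesses

def mpki(misses, instructions):
    """MPKI: misses per thousand committed instructions."""
    return misses * 1000 / instructions

def average_miss_penalty(miss_latencies):
    """AMP: sum of the individual miss latencies divided by the number of misses."""
    return sum(miss_latencies) / len(miss_latencies)

def amat(hit_time, mr, amp):
    """AMAT = Hit time + MR x AMP (all times in cycles)."""
    return hit_time + mr * amp

if __name__ == "__main__":
    mr = miss_rate(misses=50, accesses=1000)
    amp = average_miss_penalty([100, 200, 300])
    print(mr, mpki(50, 10_000), amp, amat(2, mr, amp))
```

Note that each value is built from per-access counts and latencies; nothing in these formulas accounts for accesses that overlap in time.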

Measure Memory Performance: The Requirements
  • Separate from, but closely related to, CPU performance
      o Not Flop or IPC, but a major factor
  • Provide the total performance of the memory system as well as the performance of each tier of the memory hierarchy
  • Cover the complexity of modern memory systems
  • Simple, easy to use, and easy to understand

The Introduction of APC
  • Access Per Cycle (APC)
  • APC is measured as the number of memory accesses per cycle
      o Measures the overall memory system performance
      o Each memory level has its own APC value
      o Dominates overall CPU performance
  • Benefits of APC
      o Separates memory evaluation from CPU evaluation
      o A better understanding of the memory system as a whole
      o A better understanding of the match between computing capacity and memory system performance

APC in Detail
  • APC is the total number of memory accesses requested at a certain memory level (i.e. L1, L2, L3, Main Memory) divided by the total number of memory access cycles at that level
      o APC = M/T
      o Each level has its own APC
          » APCD: L1 Data Cache
          » APCI: L1 Instruction Cache
          » APCM: Main Memory
  • APC performance is hierarchical

APC Measurement
  • The difficulty is measuring the total cycles T
      o Hundreds of memory accesses co-exist in the memory system
  • Measure T based on the overlapping mode
      o When several memory accesses co-exist during the same clock cycle, T increases by only one
      o Measure the concurrence at each level
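The overlapping mode can be illustrated with a short sketch. It assumes each access is described by a hypothetical half-open interval [start, end) in cycles; the names `active_cycles` and `apc` are illustrative and do not come from the slides.

```python
# Sketch under stated assumptions: accesses are (start, end) cycle intervals,
# half-open, so an access with start=0 and end=10 occupies cycles 0..9.

def active_cycles(intervals):
    """Return T: the number of cycles in which at least one access is in flight."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Gap before this access: close the previous merged interval.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent access: extend the merged interval,
            # so shared cycles are counted only once.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total

def apc(intervals):
    """APC = M / T, with M the number of accesses and T the overlapped cycle count."""
    t = active_cycles(intervals)
    return len(intervals) / t if t else 0.0

if __name__ == "__main__":
    accesses = [(0, 10), (5, 15), (20, 25)]
    print(active_cycles(accesses), apc(accesses))
```

For the three accesses in the example, the summed latency is 25 cycles, but T is 20 because cycles 5 to 9 are shared by two accesses, giving APC = 3/20 = 0.15.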

APC Measure Logic (AML)
  • Detects memory access activity from the MSHR, the cache, and the CPU
  • If any one of them is active, Cycle++
  • Hardware cost analysis
      o CPU/cache interface detecting logic <= bit-width of the command data buses
      o Cache detecting logic = length of the pipeline stage of cache access
      o MSHR table empty status: 1 bit
      o Total: less than 1 K bits

APCM Measurement
  • Last Level Cache (LLC) measurement
      o DRAM Accesses Count
      o LLC MSHR Cycles
      o APCM = DRAM Accesses Count / LLC MSHR Cycles
  • Hardware cost
      o DRAM Accesses Count is usually provided by CPU performance counters
      o LLC MSHR Cycles needs only 1 bit to detect whether the MSHR is empty or not
      o Available on some microprocessors
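A minimal sketch of the APCM calculation follows. It assumes a hypothetical per-cycle trace of LLC MSHR occupancy (number of valid entries in each cycle); on real hardware only the single "MSHR not empty" bit would be sampled, and the function names are illustrative.

```python
# Sketch only: the occupancy trace and the function names are assumptions.

def llc_mshr_busy_cycles(occupancy_per_cycle):
    """Count cycles in which the LLC MSHR holds at least one outstanding miss."""
    return sum(1 for entries in occupancy_per_cycle if entries > 0)

def apc_m(dram_accesses, occupancy_per_cycle):
    """APCM = DRAM Accesses Count / LLC MSHR Cycles."""
    busy = llc_mshr_busy_cycles(occupancy_per_cycle)
    return dram_accesses / busy if busy else 0.0

if __name__ == "__main__":
    trace = [0, 1, 2, 2, 1, 0, 0, 1]  # valid MSHR entries in each cycle
    print(llc_mshr_busy_cycles(trace), apc_m(3, trace))
```

Only the empty/non-empty status matters here: a cycle with two outstanding misses counts once, which is how concurrency raises APCM.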

Validation Testing Methodology
  • System performance is the ultimate interest
  • A good memory metric should influence system performance directly
  • Use IPC (Instructions Per Cycle) as the measure of system performance
  • Use the correlation coefficient to measure the correlation
      o Better correlation, better metric

Correlation Coefficient
  • The correlation coefficient (CC) describes, from a statistical viewpoint, how closely the changing trends of two variables follow each other.
  • It measures how well two variables match each other

Range     Relation
1, -1     Perfect match
≥ 0.9     Dominant relation
≥ 0.8     Strong relation
≥ 0.5     Weak relation
0         No relation
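The sketch below computes a Pearson correlation coefficient and maps it onto the labels in the table. Two points are assumptions rather than statements from the slides: the labels are applied to the magnitude of CC (so a strongly negative value such as AMAT's counts as a strong relation), and nonzero values below 0.5 get a placeholder label because the table does not name them. Function names are illustrative.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)

def relation(cc):
    """Label a CC value using the table's thresholds on |CC| (assumption)."""
    magnitude = abs(cc)
    if magnitude >= 0.9:
        return "dominant relation"
    if magnitude >= 0.8:
        return "strong relation"
    if magnitude >= 0.5:
        return "weak relation"
    if magnitude == 0:
        return "no relation"
    return "below weak (not labeled in the table)"

if __name__ == "__main__":
    print(pearson([1, 2, 3], [2, 4, 6]), relation(0.871), relation(-0.9393))
```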

Experiment Environment
  • Detailed out-of-order Alpha 21264-like CPU model in the M5 simulator
      o Superscalar: out-of-order, speculation, 8-issue
      o Private split L1 caches + shared L2 cache
      o Non-blocking cache, pipelined cache, cache prefetching
      o Single core & multi-core
  • Simulate a series of configurations, changing one or two memory parameters at a time
  • SPEC CPU2006, 26 benchmarks, 1 B instructions
  • Test on different configurations & benchmarks

Default Simulation Configuration

Parameter            Value
Processor            1 core, 2 GHz, 8-issue width
Function units       6 Int.ALU 1 cycle, 1 Int.Mul 3 cycles, 2 FPAdd 2 cycles, 1 FPCmp 2 cycles, 1 FPCvt 2 cycles, 1 FPMul 4 cycles, 1 FPDiv 12 cycles
ROB, LSQ size        ROB 192, LQ 32, SQ 32
L1 caches            32 KB Inst/32 KB Data, 2-way, 64 B line, hit latency: 2 cycle Inst/2 cycle Data, ICache 10 MSHR Entry, DCache 10 MSHR Entry
L2 cache             2 MB, 8-way, 64 B line, 12-cycle hit latency, 20 MSHR Entry
DRAM latency/Width   200-cycle access latency/64 bits

A set of Simulation Configurations

ID   Description                                                      Changed Parameter/s
C1   L1: 32 KB, 2 way; L2: 2 MB, 8 way; Mem 100 ns                    Default Config
C2   L1: 32 KB, 4 way; L2: 2 MB, 8 way; Mem 100 ns                    L1 Cache Assoc.
C3   L1: 32 KB, 8 way; L2: 2 MB, 8 way; Mem 100 ns                    L1 Cache Assoc.
C4   L1: 64 KB, 2 way; L2: 2 MB, 8 way; Mem 100 ns                    L1 Cache Size
C5   L1: 64 KB, 4 way; L2: 2 MB, 8 way; Mem 100 ns                    L1 Cache Size & Assoc.
C6   L1: 64 KB, 8 way; L2: 2 MB, 8 way; Mem 100 ns                    L1 Cache Size & Assoc.
C7   L1: I$32 KB, 2 way, D$64 KB, 2 way; L2: 2 MB, 8 way; Mem 100 ns  Only DCache Size
C8   L1: I$64 KB, 2 way, D$32 KB, 2 way; L2: 2 MB, 8 way; Mem 100 ns  Only ICache Size & Assoc.
C9   L1: I$64 KB, 4 way, D$32 KB, 2 way; L2: 2 MB, 8 way; Mem 100 ns  Only ICache Size & Assoc.
C10  L1: I$64 KB, 8 way, D$32 KB, 2 way; L2: 2 MB, 8 way; Mem 100 ns  Only ICache Size & Assoc.
C11  L1: 32 KB, 2 way; L2: 4 MB, 8 way; Mem 100 ns                    L2 Cache Size
C12  L1: 32 KB, 2 way; L2: 8 MB, 8 way; Mem 100 ns                    L2 Cache Size
C13  L1: 32 KB, 2 way; L2: 2 MB, 16 way; Mem 100 ns                   L2 Cache Assoc.
C14  L1: 32 KB, 2 way; L2: 4 MB, 16 way; Mem 100 ns                   L2 Cache Size & Assoc.
C15  L1: 32 KB, 2 way; L2: 8 MB, 16 way; Mem 100 ns                   L2 Cache Size & Assoc.
C16  L1: 32 KB, 2 way; L2: 2 MB, 8 way; Mem 30 ns                     Main memory latency
C17  L1: 32 KB, 2 way; L2: 2 MB, 8 way; Mem 60 ns                     Main memory latency
C18  L1: 32 KB, 2 way, MSHR 1; L2: 2 MB, 8 way; Mem 100 ns            MSHR Entry
C19  L1: 32 KB, 2 way, MSHR 2; L2: 2 MB, 8 way; Mem 100 ns            MSHR Entry
C20  L1: 32 KB, 2 way, MSHR 16; L2: 2 MB, 8 way; Mem 100 ns           MSHR Entry

APC and IPC with Different Applications
  • APC has the strongest relation with IPC (CC = 0.871)
  • AMAT is the second best, with an average CC value of -0.670
  • APC improves the correlation value by 30.0%
  • HR has almost the same correlation value as AMAT

APC & IPC with Different Configurations

Experiment Results
  • APC has the highest correlation coefficient with IPC; the average value over all applications is 0.9632
      o APC and IPC have a directly dominant relationship
  • AMAT has the second highest correlation with IPC, with an average value of -0.9393
      o AMAT is a fairly good metric for reflecting memory performance variation when non-blocking cache optimization is not considered
  • The other metrics give some misleading indications

APC & IPC: Changing Cache Parallelism
  • Changing the number of MSHR entries (1, 2, 10, 16)
  • APC still has the dominant correlation, with an average value of 0.9656
  • AMAT does not correlate with IPC for most applications
      o APC records the CPU blocked cycles through MSHR cycles
      o AMAT cannot record blocked cycles; it only measures the issued memory requests

Exhaustive Testing
  • With different benchmarks and with different configurations
  • With advanced cache technologies
      o Non-blocking cache
      o Pipelined cache
      o Multi-port cache
      o Hardware prefetcher
  • With single core or multicore
  • APC always has the highest CC values among all the memory metrics

APC Applications
  • Find the lowest level that has a dominating correlation with IPC
  • Find the contribution of concurrence
  • Quantitatively define data intensiveness
  • Provide a means to study the matching between memory organization and microprocessor architecture
  • Provide a means to study the matching between memory organization and a given application

A Definition of Data Intensiveness
  • The IPC and APC correlation value provides a quantitative definition of data intensiveness
  • Use the correlation value of APCM to quantify the degree of data intensiveness
      o Do not count data re-use as part of data intensiveness unless the data has to be read from main memory again
      o Assumes the "memory-wall" problem is actually due to the slow speed of main memory
      o Could be defined differently for small kernel applications or off-core applications
  • Definition: coe(APCM, IPC) ≥ 0.9

Data-intensive Definition
  • The correlation values of APCM are divided into three intervals: (-1, 0.3), [0.3, 0.9), and [0.9, 1)
  • Reason for picking 0.9 as the threshold
      o According to the mathematical definition of the correlation coefficient, when CC >= 0.9 the two variables have a dominant relation

Related Work
  • Traditional memory metrics
      o Miss Rate (MR), Misses Per Kilo-Instructions (MPKI)
      o Average Miss Penalty (AMP), Average Memory Access Time (AMAT)
  • Memory Level Parallelism (MLP)
      o The average number of outstanding long-latency main memory accesses when there is at least one such outstanding access
      o Assuming each off-chip memory access has a constant latency, say m cycles, APCM = MLP/m
      o That means APCM is directly proportional to MLP
      o APC is a superset of MLP
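The relation APCM = MLP/m can be checked with a short derivation. The symbols M, T, and n(t) are introduced here for illustration: M is the number of main memory accesses, T is the number of cycles with at least one outstanding access, and n(t) is the number of accesses outstanding in cycle t. The constant latency m is the slide's own assumption.

```latex
% Each of the M accesses is outstanding for exactly m cycles, so summing
% the per-cycle occupancy n(t) over the T active cycles gives M * m.
\begin{align*}
\mathrm{MLP} &= \frac{1}{T}\sum_{t \,:\, n(t) \ge 1} n(t) = \frac{M \cdot m}{T}, \\
\mathrm{APC}_M &= \frac{M}{T} = \frac{\mathrm{MLP}}{m}.
\end{align*}
```

When the latency is not constant, the sum no longer factors into M times m, which is one way to read the claim that APC covers more cases than MLP.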

Conclusion
  • Contribution
      o Proposed a new memory metric, APC
      o APC links memory performance to CPU performance
      o APC links the performance of each tier of a memory hierarchy together
  • Future work
      o Extend to file systems: APCIO
      o Extend to network environments: APCNet
      o Measure APCM, APCIO, and APCNet
      o Use APC to analyze the bottleneck of data-centric algorithms