ASR: Adaptive Selective Replication for CMP Caches
Brad Beckmann†, Mike Marty, and David Wood
Multifacet Project, University of Wisconsin-Madison
12/13/06
† currently at Microsoft
Introduction: Shared Cache
  (Figure: CPUs 0-7, each with L1 I$ and L1 D$, share a banked L2; block A resides in one L2 bank; access labeled 40+ cycles.)
  • Maximize Cache Capacity
  • Slow Access Latency
Introduction: Private Caches
  (Figure: CPUs 0-7, each with L1 I$, L1 D$, and a private L2; block A is held in more than one private L2.)
  • Fast Access Latency
  • Lower Effective Capacity
  Desire both Fast Access & High Capacity
Introduction
  • Previous hybrid proposals
    – Victim Replication, CMP-NuRapid, Cooperative Caching
    – Achieve fast access and high capacity
      • Under certain workloads & system configurations
    – Utilize static rules
      • Non-adaptive
  • Adaptive Selective Replication: ASR
    – Dynamically monitor workload behavior
    – Adapt the L2 cache to workload demand
    – Up to 12% improvement vs. previous proposals
Outline
  • Introduction
  • Understanding L2 Replication
    – Benefit
    – Cost
    – Key Observation
    – Solution
  • ASR: Adaptive Selective Replication
  • Evaluation
Understanding L2 Replication
  • Three L2 block sharing types
    1. Single requestor
       – All requests by a single processor
    2. Shared read-only
       – Read-only requests by multiple processors
    3. Shared read-write
       – Read and write requests by multiple processors
  • Profile L2 blocks during their on-chip lifetime
    – 8-processor CMP
    – 16 MB shared L2 cache
    – 64-byte block size
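The three sharing types above can be expressed as a small classifier over a block's request history. This is only a sketch: the `(cpu_id, is_write)` trace format and the function name `classify_block` are assumptions made for illustration, not part of the talk's profiling infrastructure.

```python
from typing import Iterable, Tuple


def classify_block(requests: Iterable[Tuple[int, bool]]) -> str:
    """Classify one L2 block from its (cpu_id, is_write) requests.

    The trace format is hypothetical; the categories follow the slide.
    """
    cpus = set()
    saw_write = False
    for cpu_id, is_write in requests:
        cpus.add(cpu_id)
        saw_write = saw_write or is_write
    if not cpus:
        raise ValueError("block has no requests")
    if len(cpus) == 1:
        # All requests came from one processor, reads or writes alike.
        return "single requestor"
    # Multiple processors: any write makes the block read-write shared.
    return "shared read-write" if saw_write else "shared read-only"
```

For example, a block read by CPUs 0 and 3 and never written falls into the shared read-only class, which is the class the later slides target for replication.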
Understanding L2 Replication (Figure: Apache, Jbb, Oltp, and Zeus broken down into Shared Read-only, Shared Read-write, and Single Requestor, annotated High Locality, Mid Locality, and Low Locality.)
Understanding L2 Replication: Benefit (Figure: L2 Hit Cycles vs. Replication Capacity.)
Understanding L2 Replication: Cost (Figure: L2 Miss Cycles vs. Replication Capacity.)
Understanding L2 Replication: Key Observation
  (Figure: L2 Hit Cycles vs. Replication Capacity.)
  • Top 3% of Shared Read-only blocks satisfy 70% of Shared Read-only requests
  • Replicate Frequently Requested Blocks First
Understanding L2 Replication: Solution
  (Figure: Total Cycles vs. Replication Capacity, showing the total cycle curve and its optimal point.)
  • Total cycle curve is a property of the workload and cache interaction
  • Not fixed: must adapt
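The idea of an optimal point on the total cycle curve can be sketched as minimizing the sum of L2 hit cycles and L2 miss cycles over candidate replication levels. The numbers below are invented purely for illustration and are not measurements from the talk; `best_replication_level` is a hypothetical helper name.

```python
def best_replication_level(hit_cycles, miss_cycles):
    """Return (index of the minimum total, list of totals)."""
    if len(hit_cycles) != len(miss_cycles):
        raise ValueError("curves must cover the same replication levels")
    totals = [h + m for h, m in zip(hit_cycles, miss_cycles)]
    best = min(range(len(totals)), key=totals.__getitem__)
    return best, totals


# Illustrative shapes only: hit cycles fall as replication grows,
# miss cycles rise as replicas consume capacity.
hit = [100, 70, 55, 48, 45, 44]
miss = [20, 22, 27, 36, 50, 70]
level, totals = best_replication_level(hit, miss)
```

With a different workload or cache size the two curves change shape, so the minimizing level moves, which is why a fixed replication policy cannot be optimal everywhere.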
Outline
  • Wires and CMP caches
  • Understanding L2 Replication
  • ASR: Adaptive Selective Replication
    – SPR: Selective Probabilistic Replication
    – Monitoring and adapting to workload behavior
  • Evaluation
SPR: Selective Probabilistic Replication
  • Mechanism for Selective Replication
    – Relax L2 inclusion property
      • L2 evictions do not force L1 evictions
      • Non-exclusive cache hierarchy
    – Ring Writebacks
      • L1 writebacks passed clockwise between private L2 caches
      • Merge with other existing L2 copies
  • Probabilistically choose between
    – Local writeback: allows replication
    – Ring writeback: disallows replication
  • Replicates frequently requested blocks
SPR: Selective Probabilistic Replication (Figure: CPUs 0-7, each with L1 I$, L1 D$, and a private L2.)
SPR: Selective Probabilistic Replication
  Replication level:      0    1     2     3    4    5
  Prob. of replication:   0    1/64  1/16  1/4  1/2  1
  (Figure: Replication Capacity vs. Replication Levels 0-5, with the current level marked.)
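The probabilistic choice between a local writeback (which allows a replica) and a ring writeback (which does not) can be sketched with the six replication probabilities listed above. The function name `choose_writeback` and the use of a software random number generator are assumptions for illustration; the talk does not say how the hardware draws its random decision.

```python
import random

# Replication probability per level, taken from the slide's table.
REPLICATION_PROBABILITY = [0.0, 1 / 64, 1 / 16, 1 / 4, 1 / 2, 1.0]


def choose_writeback(level, rng=None):
    """Return "local" (replicate) or "ring" (do not replicate)."""
    rng = rng or random
    p = REPLICATION_PROBABILITY[level]
    # rng.random() lies in [0, 1), so level 0 never replicates
    # and level 5 always does.
    return "local" if rng.random() < p else "ring"
```

Because each L1 writeback of a block gets a fresh draw, blocks that are written back more often get more chances to be kept locally, which is one way to read the slide's claim that the scheme replicates frequently requested blocks.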
Monitoring and Adapting to Workload Behavior
  (Figure: Replication Benefit Curve, L2 Hit Cycles vs. Replication Capacity, marking the lower, current, and higher levels.)
  1. Decrease in Replication Benefit
     – Bit marks replicas of the current, but not lower, level
  2. Increase in Replication Benefit
     – Store 8-bit partial tags of next higher level replications
Monitoring and Adapting to Workload Behavior
  (Figure: Replication Cost Curve, L2 Miss Cycles vs. Replication Capacity, marking the lower, current, and higher levels.)
  3. Decrease in Replication Cost
     – Stores 16-bit partial tags of recently evicted blocks
  4. Increase in Replication Cost
     – Way and Set counters track soon-to-be-evicted blocks
Outline
  • Wires and CMP caches
  • Understanding L2 Replication
  • ASR: Adaptive Selective Replication
  • Evaluation
Methodology
  • Full system simulation
    – Simics
    – Wisconsin’s GEMS Timing Simulator
      • Out-of-order processor
      • Memory system
  • Workloads
    – Commercial
      • apache, jbb, oltp, zeus
    – Scientific (see paper)
      • SpecOMP: apsi & art
      • Splash: barnes & ocean
System Parameters [8-core CMP, 45 nm technology]
  Memory System
    L1 I & D caches: 64 KB, 4-way, 3 cycles
    Unified L2 cache: 16 MB, 16-way
    L1 / L2 prefetching: Unit & Non-unit strided prefetcher (similar to Power4)
    Memory latency: 500 cycles
    Memory bandwidth: 50 GB/s
    Memory size: 4 GB of DRAM
    Outstanding memory requests / CPU: 16
  Dynamically Scheduled Processor
    Clock frequency: 5.0 GHz
    Reorder buffer / scheduler: 128 / 64 entries
    Pipeline width: 4-wide fetch & issue
    Pipeline stages: 30
    Direct branch predictor: 3.5 KB YAGS
    Return address stack: 64 entries
    Indirect branch predictor: 256 entries (cascaded)
Replication Benefit, Cost, & Effectiveness Curves (Figures: Benefit, Cost.)
Replication Benefit, Cost, & Effectiveness Curves (Figure: Effectiveness.)
Comparison of Replication Policies
  • SPR: multiple possible policies
  • Evaluated 4 shared read-only replication policies
    1. VR: Victim Replication
       – Previously proposed [Zhang ISCA 05]
       – Disallow replicas to evict shared owner blocks
    2. NR: CMP-NuRapid
       – Previously proposed [Chishti ISCA 05]
       – Replicate upon the second request
    3. CC: Cooperative Caching
       – Previously proposed [Chang ISCA 06]
       – Replace replicas first
       – Spill singlets to remote caches
       – Tunable parameter: 100%, 70%, 30%, 0%
    (Policies 1-3 lack dynamic adaptation)
    4. ASR: Adaptive Selective Replication
       – Our proposal
       – Monitor and adjust to workload demand
ASR: Performance (Figure legend: S: CMP-Shared, P: CMP-Private, V: SPR-VR, N: SPR-NR, C: SPR-CC, A: SPR-ASR.)
Conclusions
  • CMP Cache Replication
    – No replication conserves capacity
    – Replicating all blocks reduces on-chip latency
    – Previous hybrid proposals
      • Work well for certain criteria
      • Non-adaptive
  • Adaptive Selective Replication
    – Probabilistic policy favors frequently requested blocks
    – Dynamically monitor replication benefit & cost
    – Replicate when benefit > cost
    – Improves performance up to 12% vs. previous schemes
Backup Slides
ASR: Memory Cycles (Figure legend: S: CMP-Shared, P: CMP-Private, V: SPR-VR, N: SPR-NR, C: SPR-CC, A: SPR-ASR.)
L 2 Cache Requests Breakdown
L 2 Cache Requests Breakdown: User & OS
Shared Read-write Requests Breakdown
Shared Read-write Block Breakdown
ASR: Decrease-in-replication Benefit (Figure: L2 Hit Cycles vs. Replication Capacity, marking the lower and current levels.)
ASR: Decrease-in-replication Benefit
  • Goal
    – Determine replication benefit decrease of the next lower level
  • Mechanism
    – Current Replica Bit
      • Per L2 cache block
      • Set for replications of the current level
      • Not set for replications of lower level
    – Current replica hits would be remote hits with next lower level
  • Overhead
    – 1 bit x 256 K L2 blocks = 32 KB
ASR: Increase-in-replication Benefit (Figure: L2 Hit Cycles vs. Replication Capacity, marking the current and higher levels.)
ASR: Increase-in-replication Benefit
  • Goal
    – Determine replication benefit increase of the next higher level
  • Mechanism
    – Next Level Hit Buffers (NLHBs)
      • 8-bit partial tag buffer
      • Store replicas of the next higher level
    – NLHB hits would be local L2 hits with next higher level
  • Overhead
    – 8 bits x 16 K entries x 8 processors = 128 KB
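A partial-tag buffer such as the NLHB can be sketched as a small structure that remembers only the low bits of a block address and counts lookups that match. The organization below (a single FIFO with no set indexing) and the class name `PartialTagBuffer` are simplifying assumptions; the slides give only the tag width and entry count, not the real layout.

```python
from collections import OrderedDict


class PartialTagBuffer:
    """Toy FIFO of partial tags; not the hardware organization."""

    def __init__(self, entries, tag_bits):
        self.entries = entries
        self.mask = (1 << tag_bits) - 1   # keep only the low tag_bits
        self.tags = OrderedDict()         # insertion order = age
        self.hits = 0

    def allocate(self, block_addr):
        tag = block_addr & self.mask
        self.tags.pop(tag, None)          # refresh if already present
        self.tags[tag] = True
        if len(self.tags) > self.entries:
            self.tags.popitem(last=False)  # drop the oldest tag

    def lookup(self, block_addr):
        hit = (block_addr & self.mask) in self.tags
        self.hits += hit                  # count hits for the estimate
        return hit
```

Partial tags trade accuracy for storage: two addresses that share the stored bits alias to the same entry, so the hit count is an estimate rather than an exact measurement.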
ASR: Decrease-in-replication Cost (Figure: L2 Miss Cycles vs. Replication Capacity, marking the lower and current levels.)
ASR: Decrease-in-replication Cost
  • Goal
    – Determine replication cost decrease of the next lower level
  • Mechanism
    – Victim Tag Buffers (VTBs)
      • 16-bit partial tags
      • Store recently evicted blocks of current replication level
    – VTB hits would be on-chip hits with next lower level
  • Overhead
    – 16 bits x 1 K entries x 8 processors = 16 KB
ASR: Increase-in-replication Cost (Figure: L2 Miss Cycles vs. Replication Capacity, marking the current and higher levels.)
ASR: Increase-in-replication Cost
  • Goal
    – Determine replication cost increase of the next higher level
  • Mechanism
    – Way and Set counters [Suh et al. HPCA 2002]
      • Identify soon-to-be-evicted blocks
      • 16-way pseudo LRU
      • 256 set groups
    – On-chip hits that would be off-chip with next higher level
  • Overhead
    – 255-bit pseudo LRU tree x 8 processors = 255 B
  • Overall storage overhead: 212 KB or 1.2% of total storage
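The itemized overheads on the four monitoring slides can be checked with a few lines of arithmetic. The variable names are illustrative; the inputs (16 MB L2, 64-byte blocks, 8 processors, and the per-structure sizes) come from the slides.

```python
L2_BYTES = 16 * 1024 * 1024
BLOCK_BYTES = 64
PROCESSORS = 8

l2_blocks = L2_BYTES // BLOCK_BYTES          # 256 K blocks

replica_bits = 1 * l2_blocks                 # current replica bit per block
nlhb_bits = 8 * 16 * 1024 * PROCESSORS       # 8-bit tags, 16 K entries each
vtb_bits = 16 * 1 * 1024 * PROCESSORS        # 16-bit tags, 1 K entries each
lru_bits = 255 * PROCESSORS                  # 255-bit pseudo-LRU tree each


def to_kb(bits):
    """Convert a bit count to kilobytes (1 KB = 1024 bytes)."""
    return bits / 8 / 1024


itemized_kb = sum(to_kb(b) for b in
                  (replica_bits, nlhb_bits, vtb_bits, lru_bits))
```

The four itemized structures come to roughly 176 KB. The overall figure quoted on the slide is 212 KB, so it presumably includes state that is not itemized in these bullets.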
ASR: Triggering a Cost-Benefit Analysis
  • Goal
    – Dynamically adapt to workload behavior
    – Avoid unnecessary replication level changes
  • Mechanism
    – Evaluation trigger
      • Local replications or NLHB allocations exceed 1 K
    – Replication change
      • Four consecutive evaluations in the same direction
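The trigger and hysteresis rules above can be sketched as a small controller. The class name, the `direction` encoding (+1, -1, 0), and the choice to reset the streak after a level change are assumptions for illustration; the cost-benefit comparison that produces the direction is not modeled here.

```python
class ReplicationLevelController:
    """Hypothetical controller for one cache's replication level."""

    def __init__(self, level=0, max_level=5, trigger=1024, streak_needed=4):
        self.level = level
        self.max_level = max_level
        self.trigger = trigger              # "1 K" events from the slide
        self.streak_needed = streak_needed  # four consecutive evaluations
        self.local_replications = 0
        self.nlhb_allocations = 0
        self.last_direction = 0
        self.streak = 0

    def evaluation_due(self):
        # Either counter exceeding the threshold triggers an evaluation.
        return (self.local_replications > self.trigger
                or self.nlhb_allocations > self.trigger)

    def evaluate(self, direction):
        """direction: +1 (more replication), -1 (less), 0 (no preference)."""
        self.local_replications = 0
        self.nlhb_allocations = 0
        if direction == 0:
            self.last_direction, self.streak = 0, 0
        elif direction == self.last_direction:
            self.streak += 1
        else:
            self.last_direction, self.streak = direction, 1
        if self.streak >= self.streak_needed:
            # Move one level, clamped to the valid range, then start over.
            self.level = min(self.max_level, max(0, self.level + direction))
            self.last_direction, self.streak = 0, 0
        return self.level
```

Requiring four agreeing evaluations before moving means a single noisy interval cannot flip the level, which matches the stated goal of avoiding unnecessary changes.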
ASR: Adaptive Algorithm
  (Decision table. One axis compares Decrease in Replication Cost with Increase in Replication Benefit; the other compares Decrease in Replication Benefit with Increase in Replication Cost. Cell outcomes: Go in direction with greater value, Increase Replication, Decrease Replication, Do Nothing.)
ASR: Adapting to Workload Behavior (Figure: Oltp, all CPUs.)
ASR: Adapting to Workload Behavior (Figure: Apache, all CPUs.)
ASR: Adapting to Workload Behavior (Figure: Apache, CPU 0.)
ASR: Adapting to Workload Behavior (Figure: Apache, CPUs 1-7.)
Replication Capacity (Figure.)
Replication Capacity (Figure; 4 MB, 150-cycle memory latency, in-order processors.)
Replication Benefit, Cost, & Effectiveness Curves (Figures: Benefit, Cost; 4 MB, 150-cycle memory latency, in-order processors.)
Replication Benefit, Cost, & Effectiveness Curves (Figure: Effectiveness; 4 MB, 150-cycle memory latency, in-order processors.)
Replication Benefit, Cost, & Effectiveness Curves (Figures: Benefit, Cost; 16 MB, 500-cycle memory latency, in-order processors.)
Replication Benefit, Cost, & Effectiveness Curves (Figure: Effectiveness; 16 MB, 500-cycle memory latency, in-order processors.)
Replication Analytic Model
  • Utilize workload characterization data
  • Goal: intuition, not accuracy
  • Optimal point of replication
    – Sensitive to cache size
    – Sensitive to memory latency
Replication Model: Selective Replication (Figure.)
ASR: Memory Cycles (Figure; 4 MB, 150-cycle memory latency, in-order processors. Legend: S: CMP-Shared, P: CMP-Private, V: SPR-VR, N: SPR-NR, C: SPR-CC, A: SPR-ASR.)
ASR: Performance (Figure; 4 MB, 150-cycle memory latency, in-order processors. Legend: S: CMP-Shared, P: CMP-Private, V: SPR-VR, N: SPR-NR, C: SPR-CC, A: SPR-ASR.)
ASR: Memory Cycles (Figure; 16 MB, 250-cycle memory latency, out-of-order processors. Legend: S: CMP-Shared, P: CMP-Private, V: SPR-VR, N: SPR-NR, C: SPR-CC, A: SPR-ASR.)
ASR: Performance (Figure; 16 MB, 250-cycle memory latency, out-of-order processors. Legend: S: CMP-Shared, P: CMP-Private, V: SPR-VR, N: SPR-NR, C: SPR-CC, A: SPR-ASR.)
ASR: Memory Cycles (Figure; 16 MB, 500-cycle memory latency, out-of-order processors. Legend: S: CMP-Shared, P: CMP-Private, V: SPR-VR, N: SPR-NR, C: SPR-CC, A: SPR-ASR.)
ASR: Performance (Figure; 16 MB, 500-cycle memory latency, out-of-order processors. Legend: S: CMP-Shared, P: CMP-Private, V: SPR-VR, N: SPR-NR, C: SPR-CC, A: SPR-ASR.)
Token Coherence
  • Proposed for SMPs [Martin 03], CMPs [Marty 05]
  • Provides a simple correctness substrate
    – One token to read
    – All tokens to write
  • Advantages
    – Permits a broadcast protocol on unordered network without acknowledgement messages
    – Supports multiple allocation policies
  • Disadvantages
    – All blocks must be written back (cannot destroy tokens)
    – Token counts at memory
    – Persistent request can be a performance bottleneck
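The two token rules (one token to read, all tokens to write) and the fact that tokens cannot be destroyed can be sketched as a tiny bookkeeping class. `TokenBlock` and its methods are hypothetical names for illustration; this models only token counting, not messages, persistent requests, or data transfer.

```python
class TokenBlock:
    """Token bookkeeping for a single memory block (illustrative only)."""

    def __init__(self, total_tokens, holders):
        if sum(holders.values()) != total_tokens:
            raise ValueError("holders must account for every token")
        self.total = total_tokens
        self.tokens = dict(holders)

    def can_read(self, node):
        # One token is enough to read.
        return self.tokens.get(node, 0) >= 1

    def can_write(self, node):
        # Writing requires holding every token for the block.
        return self.tokens.get(node, 0) == self.total

    def transfer(self, src, dst, count):
        # Tokens only move between holders; they are never created or lost.
        if count < 1 or count > self.tokens.get(src, 0):
            raise ValueError("cannot transfer tokens the source does not hold")
        self.tokens[src] -= count
        self.tokens[dst] = self.tokens.get(dst, 0) + count
```

Because the total is conserved, an evicting cache has to hand its tokens to some other holder, which is the bookkeeping view of the slide's point that all blocks must be written back.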