Cooperative Caching for Chip Multiprocessors
Jichuan Chang and Guri Sohi
University of Wisconsin-Madison
ISCA-33, June 2006
Motivation
• Chip multiprocessors (CMPs) both require and enable innovative on-chip cache designs
• Two important demands for CMP caching
– Capacity: reduce off-chip accesses
– Latency: reduce remote on-chip references
• Need to combine the strengths of both private and shared cache designs
CMP Cooperative Caching / ISCA 2006
Yet Another Hybrid CMP Cache - Why?
• Private cache based design
– Lower latency and per-cache associativity
– Lower cross-chip bandwidth requirement
– Self-contained for resource management
– Easier to support QoS, fairness, and priority
• Need a unified framework
– Manage the aggregate on-chip cache resources
– Can be adopted by different coherence protocols
CMP Cooperative Caching
• Form an aggregate global cache via cooperative private caches
– Use private caches to attract data for fast reuse
– Share capacity through cooperative policies
– Throttle cooperation to find an optimal sharing point
• Inspired by cooperative file/web caches
– Similar latency tradeoff
– Similar algorithms
[Figure: four cores, each with private L1 I/D and L2 caches, connected by an on-chip network]
Outline
• Introduction
• CMP Cooperative Caching
• Hardware Implementation
• Performance Evaluation
• Conclusion
Policies to Reduce Off-chip Accesses
• Cooperation policies for capacity sharing
– (1) Cache-to-cache transfers of clean data
– (2) Replication-aware replacement
– (3) Global replacement of inactive data
• Implemented by two unified techniques
– Policies enforced by cache replacement/placement
– Information/data exchange supported by modifying the coherence protocol
Policy (1) - Make use of all on-chip data
• Don't go off-chip if on-chip (clean) data exist
• Beneficial and practical for CMPs
– Peer cache is much closer than next-level storage
– Affordable implementations of "clean ownership"
• Important for all workloads
– Multi-threaded: (mostly) read-only shared data
– Single-threaded: spill into peer caches for later reuse
Policy (2) - Control replication
• Intuition - increase # of unique on-chip data ("singlets")
• Latency/capacity tradeoff
– Evict singlets only when no replications exist
– Modify the default cache replacement policy
• "Spill" an evicted singlet into a peer cache
– Can further reduce on-chip replication
– Randomly choose a recipient cache for spilling
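The replication-aware replacement on this slide can be sketched in a few lines of Python. This is a toy model, not the paper's hardware: the class and helper names (`PrivateL2`, `is_singlet`, `choose_victim`) are illustrative, and peer lookup is done by scanning, where real hardware would consult the directory.

```python
import random

class PrivateL2:
    """Toy model of one core's private L2 with replication-aware replacement.

    A "singlet" is the only on-chip copy of a block. Names are illustrative,
    not the paper's hardware interface.
    """

    def __init__(self, ways, peers=None):
        self.ways = ways
        self.blocks = []          # LRU order: index 0 = LRU, last = MRU
        self.peers = peers or []  # the other cores' private caches

    def is_singlet(self, block):
        # A block is a singlet if no peer cache holds a copy of it.
        return not any(block in p.blocks for p in self.peers)

    def choose_victim(self):
        # Modified replacement: scanning from the LRU end, prefer evicting a
        # replicated block; evict the LRU singlet only if all are singlets.
        for b in self.blocks:
            if not self.is_singlet(b):
                return b
        return self.blocks[0]

    def insert(self, block, spill_evicted_singlets=True):
        if block in self.blocks:
            self.blocks.remove(block)        # refresh to MRU below
        elif len(self.blocks) >= self.ways:
            victim = self.choose_victim()
            self.blocks.remove(victim)
            if spill_evicted_singlets and self.is_singlet(victim) and self.peers:
                # Spill the evicted singlet into a randomly chosen peer;
                # the flag stops a spill from triggering a further spill.
                random.choice(self.peers).insert(
                    victim, spill_evicted_singlets=False)
        self.blocks.append(block)            # place as MRU
```

Disallowing cascaded spills in the sketch is a simplification; the deck's policy (3) handles repeated spills with 1-chance forwarding instead.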
Policy (3) - Global cache management
• Approximate global-LRU replacement
– Combine global spill/reuse history with local LRU
• Identify and replace globally inactive data
– First become the LRU entry in the local cache
– Set as MRU if spilled into a peer cache
– Later become LRU entry again: evict globally
• 1-chance forwarding (1-Fwd)
– Blocks can only be spilled once if not reused
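The 1-Fwd rule above amounts to one extra state bit per block. A minimal sketch, assuming hypothetical hook names (`pick_recipient`, `insert_as_mru`) rather than the paper's actual mechanism:

```python
class Block:
    """Cache block with the extra state that 1-chance forwarding needs."""
    def __init__(self, tag):
        self.tag = tag
        self.spilled = False  # has this block already been forwarded once?

class PeerStub:
    """Minimal recipient cache used to exercise the hooks."""
    def __init__(self):
        self.blocks = []
    def insert_as_mru(self, block):
        self.blocks.append(block)

def on_local_lru_eviction(block, peers, pick_recipient):
    """Approximate global LRU via 1-chance forwarding (1-Fwd).

    Returns the recipient cache, or None when the block is judged globally
    inactive and leaves the chip. `pick_recipient` is a policy hook,
    e.g. a random choice among `peers`.
    """
    if block.spilled:
        # Reached an LRU position a second time without reuse:
        # globally inactive, evict from the chip.
        return None
    block.spilled = True
    recipient = pick_recipient(peers)
    recipient.insert_as_mru(block)   # spilled blocks start as MRU remotely
    return recipient

def on_reuse(block):
    # A hit clears the bit, so an actively used block can be spilled again.
    block.spilled = False
```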
Cooperation Throttling
• Why throttling?
– Further tradeoff between capacity/latency
• Two probabilities to help make decisions
– Cooperation probability: control replication
– Spill probability: throttle spilling
[Figure: spectrum from Shared (CC 100%) through Cooperative Caching to Private (CC 0%, i.e. Policy (1) only)]
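Both throttling knobs reduce to a Bernoulli trial per decision. A minimal sketch; the `rng` parameter is only an assumption added to make the function testable:

```python
import random

def cooperate(probability, rng=random.random):
    """Bernoulli throttle: with the given cooperation (or spill) probability,
    take the cooperative action; otherwise fall back to default
    private-cache behavior."""
    return rng() < probability

# Sweeping the probability moves the design along the slide's spectrum:
# 100% behaves like full cooperative caching, while 0% degenerates to
# private caches that still use policy (1) cache-to-cache transfers.
```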
Outline
• Introduction
• CMP Cooperative Caching
• Hardware Implementation
• Performance Evaluation
• Conclusion
Hardware Implementation
• Requirements
– Information: singlet, spill/reuse history
– Cache replacement policy
– Coherence protocol: clean owner and spilling
• Can modify an existing implementation
• Proposed implementation
– Central Coherence Engine (CCE)
– On-chip directory by duplicating tag arrays
Information and Data Exchange
• Singlet information
– Directory detects and notifies the block owner
• Sharing of clean data
– PUTS: notify directory of clean data replacement
– Directory sends forward request to the first sharer
• Spilling
– Currently implemented as a 2-step data transfer
– Can be implemented as recipient-issued prefetch
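The PUTS handling above can be sketched as a directory transition. The message names and the dictionary-of-entries data structure are assumptions for illustration; only the first-sharer forwarding choice comes from the slide:

```python
from types import SimpleNamespace

def handle_puts(directory, tag, evicting_core):
    """Sketch of a directory reacting to a PUTS message that announces the
    replacement of a clean block. `directory` maps tags to entries with
    `sharers` (a set of core ids) and `owner` fields (assumed layout)."""
    entry = directory[tag]
    entry.sharers.discard(evicting_core)
    if entry.sharers:
        # Another clean copy remains on chip: ask the first remaining sharer
        # to act as the new clean owner for future cache-to-cache transfers.
        new_owner = min(entry.sharers)
        entry.owner = new_owner
        return ("FORWARD", new_owner)
    # The last on-chip copy is gone; drop the directory entry.
    del directory[tag]
    return ("DROP", None)
```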
Outline
• Introduction
• CMP Cooperative Caching
• Hardware Implementation
• Performance Evaluation
• Conclusion
Performance Evaluation
• Full system simulator
– Modified GEMS Ruby to simulate the memory hierarchy
– Simics MAI-based OoO processor simulator
• Workloads
– Multithreaded commercial benchmarks (8-core)
• OLTP, Apache, JBB, Zeus
– Multiprogrammed SPEC2000 benchmarks (4-core)
• 4 heterogeneous, 2 homogeneous
• Private / shared / cooperative schemes
– Same total capacity/associativity
Multithreaded Workloads - Throughput
• CC throttling - 0%, 30%, 70% and 100%
• Ideal - shared cache with local bank latency
Multithreaded Workloads - Avg. Latency
• Low off-chip miss rate
• High hit ratio to local L2
• Lower bandwidth needed than a shared cache
Multiprogrammed Workloads
[Figure: average memory access latency broken down into Off-chip, Remote L2, Local L2, and L1 components]
Sensitivity - Varied Configurations
• In-order processor with blocking caches
[Figure: performance normalized to shared cache]
Comparison with Victim Replication
[Figure: results for single-threaded and SPECOMP workloads]
A Spectrum of Related Research
Shared Caches (best capacity) …… Private Caches (best latency)
• Hybrid Schemes - CMP-NUCA proposals, victim replication, CMP-NuRAPID, Fast & Fair caching, synergistic caching, etc.
• Speight et al. ISCA'05 - private caches, writeback into peer caches
• Cooperative Caching
• Configurable/malleable Caches - Liu et al. HPCA'04, Huh et al. ICS'05, cache partitioning proposals, etc.
Conclusion
• CMP cooperative caching
– Exploit benefits of private cache based design
– Capacity sharing through explicit cooperation
– Cache replacement/placement policies for replication control and global management
– Robust performance improvement
Thank you!
Backup Slides
Shared vs. Private Caches for CMP
• Shared cache - best capacity
– No replication, dynamic capacity sharing
• Private cache - best latency
– High local hit ratio, no interference from other cores
• Need to combine the strengths of both
[Figure: the desired design sits between shared and private, avoiding duplication]
CMP Caching Challenges/Opportunities
• Future hardware
– More cores, limited on-chip cache capacity
– On-chip wire-delay, costly off-chip accesses
• Future software
– Diverse sharing patterns, working set sizes
– Interference due to capacity sharing
• Opportunities
– Freedom in designing on-chip structures/interfaces
– High-bandwidth, low-latency on-chip communication
Multi-cluster CMPs (Scalability)
[Figure: four clusters of four cores each, with private L1/L2 caches and one CCE per cluster, connected by the on-chip network]
Central Coherence Engine (CCE)
[Figure: CCE block diagram - per-core network input/output queues, a spilling buffer, and a directory memory built from duplicated L1 and L2 tag arrays (P1 … P8) with 4-to-1 muxes and state-vector gather logic]
Cache/Network Configurations
• Parameters
– 128-byte block size; 300-cycle memory latency
– L1 I/D split caches: 32KB, 2-way, 2-cycle
– L2 caches: sequential tag/data access, 15-cycle
– On-chip network: 5-cycle per-hop
• Configurations
– 4-core: 2x2 mesh network
– 8-core: 3x3 or 4x2 mesh network
– Same total capacity/associativity for private/shared/cooperative caching schemes