Cooperative Caching for Chip Multiprocessors
Jichuan Chang and Guri Sohi
University of Wisconsin-Madison
ISCA-33, June 2006

Motivation
• Chip multiprocessors (CMPs) both require and enable innovative on-chip cache designs
• Two important demands for CMP caching
  – Capacity: reduce off-chip accesses
  – Latency: reduce remote on-chip references
• Need to combine the strengths of both private and shared cache designs

Yet Another Hybrid CMP Cache - Why?
• Private cache based design
  – Lower latency and per-cache associativity
  – Lower cross-chip bandwidth requirement
  – Self-contained for resource management
  – Easier to support QoS, fairness, and priority
• Need a unified framework
  – Manage the aggregate on-chip cache resources
  – Can be adopted by different coherence protocols

CMP Cooperative Caching
• Form an aggregate global cache via cooperative private caches
  – Use private caches to attract data for fast reuse
  – Share capacity through cooperative policies
  – Throttle cooperation to find an optimal sharing point
• Inspired by cooperative file/web caches
  – Similar latency tradeoff
  – Similar algorithms
[Figure: four cores, each with a processor, split L1 I/D caches, and a private L2, connected by an on-chip network]

Outline
• Introduction
• CMP Cooperative Caching
• Hardware Implementation
• Performance Evaluation
• Conclusion

Policies to Reduce Off-chip Accesses
• Cooperation policies for capacity sharing
  – (1) Cache-to-cache transfers of clean data
  – (2) Replication-aware replacement
  – (3) Global replacement of inactive data
• Implemented by two unified techniques
  – Policies enforced by cache replacement/placement
  – Information/data exchange supported by modifying the coherence protocol

Policy (1) - Make use of all on-chip data
• Don’t go off-chip if on-chip (clean) data exist (see the sketch below)
• Beneficial and practical for CMPs
  – Peer cache is much closer than next-level storage
  – Affordable implementations of “clean ownership”
• Important for all workloads
  – Multi-threaded: (mostly) read-only shared data
  – Single-threaded: spill into peer caches for later reuse
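
To make the lookup order concrete, here is a minimal sketch, assuming a toy `PrivateCache` class and a plain dict standing in for off-chip memory; all names are illustrative, not the paper's implementation:

```python
from collections import OrderedDict

class PrivateCache:
    """Toy private cache: an LRU-ordered map from address to data."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()        # addr -> data, LRU first

    def lookup(self, addr):
        if addr in self.blocks:
            self.blocks.move_to_end(addr)  # touch: make MRU
            return self.blocks[addr]
        return None

def fetch(local, peers, addr, off_chip):
    """Policy (1): on a local miss, serve the request from any peer
    cache holding the block (clean data included) before paying the
    off-chip penalty."""
    data = local.lookup(addr)
    if data is not None:
        return data, "local hit"
    for peer in peers:                     # in hardware the directory (CCE)
        data = peer.lookup(addr)           # finds a clean owner in one lookup
        if data is not None:
            return data, "cache-to-cache transfer"
    return off_chip[addr], "off-chip access"
```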

Policy (2) – Control replication
• Intuition – increase # of unique on-chip data (“singlets”)
• Latency/capacity tradeoff
  – Evict singlets only when no replications exist
  – Modify the default cache replacement policy
• “Spill” an evicted singlet into a peer cache (see the sketch below)
  – Can further reduce on-chip replication
  – Randomly choose a recipient cache for spilling
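
A sketch of replication-aware victim selection with spilling, assuming caches modeled as LRU-ordered dicts and a `singlet` predicate standing in for the directory-supplied singlet information (illustrative, not the paper's code):

```python
import random

def choose_victim(cache, singlet):
    """Policy (2): scan the LRU stack oldest-first and evict the first
    replicated block; fall back to the plain LRU victim only when the
    cache holds nothing but singlets.
    cache: OrderedDict addr -> data, LRU first; singlet(addr) -> bool."""
    for addr in cache:
        if not singlet(addr):
            return addr                    # a replica: cheap to lose
    return next(iter(cache))               # all singlets: plain LRU

def evict_and_spill(cache, peers, singlet):
    """Evict one block; if it is a singlet, spill it into a randomly
    chosen peer cache instead of dropping it."""
    victim = choose_victim(cache, singlet)
    data = cache.pop(victim)
    if singlet(victim) and peers:
        recipient = random.choice(peers)   # random recipient, as on the slide
        recipient[victim] = data           # enters the peer as its MRU block
    return victim
```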

Policy (3) - Global cache management
• Approximate global-LRU replacement
  – Combine global spill/reuse history with local LRU
• Identify and replace globally inactive data
  – First become the LRU entry in the local cache
  – Set as MRU if spilled into a peer cache
  – Later become the LRU entry again: evict globally
• 1-chance forwarding (1-Fwd)
  – Blocks can only be spilled once if not reused (see the sketch below)
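
The 1-Fwd mechanism fits in one extra bit per block. The sketch below assumes, as one reading of the slide, that a reuse hit clears the bit and restores the block's forwarding chance; all names are hypothetical:

```python
import random
from dataclasses import dataclass

@dataclass
class Block:
    data: object
    spilled: bool = False      # 1-Fwd bit: has the block used its one chance?

def on_local_eviction(addr, block, peers):
    """Policy (3): a block falling to the LRU position is spilled once;
    if it becomes the LRU victim again without reuse, it is dropped
    from the chip entirely (globally inactive)."""
    if block.spilled:
        return None                        # second eviction: evict globally
    block.spilled = True
    recipient = random.choice(peers)       # peers: OrderedDicts addr -> Block
    recipient[addr] = block                # spilled block enters as MRU
    return recipient

def on_reuse(cache, addr):
    """A hit makes the block MRU again and refreshes its 1-Fwd chance."""
    cache.move_to_end(addr)
    cache[addr].spilled = False
```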

Cooperation Throttling
• Why throttling?
  – Further tradeoff between capacity/latency
• Two probabilities to help make decisions (see the sketch below)
  – Cooperation probability: control replication
  – Spill probability: throttle spilling
[Figure: throttling spectrum from Private (CC 0%, Policy (1) only) to Shared (CC 100%), with cooperative caching covering the points in between]
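
One way the two knobs could be wired in, sketched under the assumption that the cooperation probability gates replication-aware replacement: `coop_prob = 1.0` then approximates shared-cache capacity behavior and `0.0` a conventional private cache. An illustration of the idea, not the evaluated mechanism:

```python
import random

def choose_victim(lru_blocks, singlet, coop_prob):
    """Throttled Policy (2): lru_blocks is a list of addresses, oldest
    first; singlet(addr) says whether a block is the only on-chip copy."""
    if random.random() < coop_prob:     # cooperation probability
        for addr in lru_blocks:
            if not singlet(addr):
                return addr             # prefer evicting a replica
    return lru_blocks[0]                # plain LRU victim

def should_spill(spill_prob):
    """Spill probability: independently throttles how often an evicted
    singlet is forwarded to a peer rather than dropped."""
    return random.random() < spill_prob
```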

Outline
• Introduction
• CMP Cooperative Caching
• Hardware Implementation
• Performance Evaluation
• Conclusion

Hardware Implementation
• Requirements
  – Information: singlet, spill/reuse history
  – Cache replacement policy
  – Coherence protocol: clean owner and spilling
• Can modify an existing implementation
• Proposed implementation
  – Central Coherence Engine (CCE)
  – On-chip directory by duplicating tag arrays

Information and Data Exchange
• Singlet information
  – Directory detects and notifies the block owner
• Sharing of clean data (see the sketch below)
  – PUTS: notify directory of clean data replacement
  – Directory sends forward request to the first sharer
• Spilling
  – Currently implemented as a 2-step data transfer
  – Can be implemented as recipient-issued prefetch
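
A sketch of the directory-side bookkeeping this implies, assuming a dict-based sharer list; the message names are made up for illustration, not the actual protocol:

```python
def handle_puts(directory, addr, evicting_core):
    """Clean-data replacement notice (PUTS): drop the evicting core from
    the sharer list so the directory stays precise about on-chip copies."""
    sharers = directory.get(addr, [])      # directory: addr -> [core ids]
    if evicting_core in sharers:
        sharers.remove(evicting_core)
    if not sharers:
        directory.pop(addr, None)          # last on-chip copy is gone

def handle_miss(directory, addr, requester):
    """Forward a miss to the first sharer (the clean owner) when any
    on-chip copy exists; otherwise fall back to memory."""
    sharers = directory.get(addr, [])
    if sharers:
        return ("FORWARD", sharers[0])     # hypothetical message type
    return ("MEM_FETCH", None)
```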

Outline
• Introduction
• CMP Cooperative Caching
• Hardware Implementation
• Performance Evaluation
• Conclusion

Performance Evaluation
• Full system simulator
  – Modified GEMS Ruby to simulate the memory hierarchy
  – Simics MAI-based OoO processor simulator
• Workloads
  – Multithreaded commercial benchmarks (8-core)
    • OLTP, Apache, JBB, Zeus
  – Multiprogrammed SPEC2000 benchmarks (4-core)
    • 4 heterogeneous, 2 homogeneous
• Private / shared / cooperative schemes
  – Same total capacity/associativity

Multithreaded Workloads - Throughput
• CC throttling - 0%, 30%, 70% and 100%
• Ideal – shared cache with local bank latency

Multithreaded Workloads - Avg. Latency
• Low off-chip miss rate
• High hit ratio to local L2
• Lower bandwidth needed than a shared cache

Multiprogrammed Workloads
[Figure: two charts breaking accesses down into Off-chip / Remote L2 / Local L2 / L1 components]

Sensitivity - Varied Configurations
• In-order processor with blocking caches
[Figure: performance normalized to shared cache]

Comparison with Victim Replication
[Figure: results for single-threaded and SPECOMP workloads]

A Spectrum of Related Research
• Shared caches – best capacity
• Hybrid schemes – CMP-NUCA proposals, victim replication, CMP-NuRAPID, Fast & Fair caching, synergistic caching, etc.
• Private caches – best latency
  – Speight et al. ISCA’05 – writeback into peer caches
• Configurable/malleable caches – Liu et al. HPCA’04, Huh et al. ICS’05, cache partitioning proposals, etc.
• Cooperative caching spans this spectrum

Conclusion
• CMP cooperative caching
  – Exploit benefits of private cache based design
  – Capacity sharing through explicit cooperation
  – Cache replacement/placement policies for replication control and global management
  – Robust performance improvement
Thank you!

Backup Slides

Shared vs. Private Caches for CMP
• Shared cache – best capacity
  – No replication, dynamic capacity sharing
• Private cache – best latency
  – High local hit ratio, no interference from other cores
• Need to combine the strengths of both
[Figure: spectrum from Shared to Private, with the desired design between the two extremes]

CMP Caching Challenges/Opportunities
• Future hardware
  – More cores, limited on-chip cache capacity
  – On-chip wire delay, costly off-chip accesses
• Future software
  – Diverse sharing patterns, working set sizes
  – Interference due to capacity sharing
• Opportunities
  – Freedom in designing on-chip structures/interfaces
  – High-bandwidth, low-latency on-chip communication

Multi-cluster CMPs (Scalability)
[Figure: a multi-cluster CMP in which each cluster pairs a CCE with several cores and their private L1/L2 caches]

Central Coherence Engine (CCE)
[Figure: CCE block diagram (sketched in code below) – per-core network input/output queues, a spilling buffer, a directory built from duplicated L1 and L2 tag arrays for P1 through P8, 4-to-1 muxes, and state-vector gather logic]
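
A structural sketch of the CCE directory from the diagram, with Python sets standing in for the duplicated tag arrays; the class and method names are illustrative, not RTL:

```python
class CCE:
    """Central Coherence Engine sketch: an on-chip directory built by
    duplicating each core's L1 and L2 tag arrays."""
    def __init__(self, n_cores=8):
        self.l1_tags = [set() for _ in range(n_cores)]  # duplicated L1 tags
        self.l2_tags = [set() for _ in range(n_cores)]  # duplicated L2 tags

    def state_vector(self, addr):
        """Probe every duplicated tag array and gather, per core,
        whether the block is present in its L1 and/or L2."""
        return [(addr in l1, addr in l2)
                for l1, l2 in zip(self.l1_tags, self.l2_tags)]

    def sharers(self, addr):
        """Core IDs holding the block anywhere on chip."""
        return [i for i, (in_l1, in_l2) in enumerate(self.state_vector(addr))
                if in_l1 or in_l2]
```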

Cache/Network Configurations
• Parameters (used in the toy latency model below)
  – 128-byte block size; 300-cycle memory latency
  – L1 I/D split caches: 32KB, 2-way, 2-cycle
  – L2 caches: sequential tag/data access, 15-cycle
  – On-chip network: 5-cycle per hop
• Configurations
  – 4-core: 2x2 mesh network
  – 8-core: 3x3 or 4x2 mesh network
  – Same total capacity/associativity for private/shared/cooperative caching schemes
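
For intuition, the slide's cycle counts plug into a back-of-envelope model like the one below; the additive composition and hop accounting are assumptions for illustration, not the simulator's actual timing model:

```python
# Cycle counts from the slide.
L1_LAT, L2_LAT, HOP_LAT, MEM_LAT = 2, 15, 5, 300

def access_latency(where, hops=0):
    """Toy latency for one load: local L1 hit, local L2 hit, remote L2
    hit (request and reply each cross `hops` network hops), or off-chip."""
    if where == "L1":
        return L1_LAT
    if where == "local_L2":
        return L1_LAT + L2_LAT
    if where == "remote_L2":
        return L1_LAT + 2 * hops * HOP_LAT + L2_LAT
    return L1_LAT + L2_LAT + MEM_LAT     # off-chip access

# Example: a remote L2 hit two hops away on the mesh.
print(access_latency("remote_L2", hops=2))   # 2 + 20 + 15 = 37 cycles
```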