Improving Multiple-CMP Systems with Token Coherence

Mike Marty 1, Jesse Bingham 2, Mark Hill 1, Alan Hu 2, Milo Martin 3, and David Wood 1
1 University of Wisconsin-Madison
2 University of British Columbia
3 University of Pennsylvania
Thanks to Intel, NSERC, NSF, and Sun
(C) 2005 Multifacet Project

Summary
• Microprocessor → Chip Multiprocessor (CMP)
• Symmetric Multiprocessor (SMP) → Multiple CMPs
• Problem: Coherence with Multiple CMPs
• Old Solution: Hierarchical Protocol → Complex & Slow
• New Solution: Apply Token Coherence
  – Developed for glueless multiprocessor [ISCA 2003]
  – Keep: Flat for Correctness
  – Exploit: Hierarchical for performance
• Less Complex & Faster than Hierarchical Directory

Improving Multiple-CMP Systems using Token Coherence

Outline
• Motivation and Background
  – Coherence in Multiple-CMP Systems
  – Example: DirectoryCMP
• Token Coherence: Flat for Correctness
• Token Coherence: Hierarchical for Performance
• Evaluation

Coherence in Multiple-CMP Systems
• Chip Multiprocessors (CMPs) emerging
• Larger systems will be built with Multiple CMPs
[Figure: four CMPs (CMP 1 to CMP 4), each with processors (P), I and D caches, an on-chip interconnect, and an L2, linked by an interconnect.]

Problem: Hierarchical Coherence
• Intra-CMP protocol for coherence within CMP
• Inter-CMP protocol for coherence between CMPs
• Interactions between protocols increase complexity
  – explodes state space
[Figure: CMP 1 to CMP 4 on an interconnect, labeled with Intra-CMP Coherence and Inter-CMP Coherence.]

Improving Multiple-CMP Systems with Token Coherence
• Token Coherence allows Multiple-CMP systems to be...
  – Flat for correctness, but
  – Hierarchical for performance
[Figure: CMP 1 to CMP 4 on an interconnect. Labels: Low Complexity, Fast, Correctness Substrate, Performance Protocol.]

Example: DirectoryCMP
• 2-level MOESI Directory
• RACE CONDITIONS!
[Figure: CMP 0 and CMP 1 holding processors P0 to P7, each with L1 I&D caches, a Shared L2 / directory per chip, and a Memory/Directory. A Store B triggers getx, fwd, inv, ack, data/ack, and WB messages at both levels. Block B state annotation: [M [S O] I].]

Outline
• Motivation and Background
• Token Coherence: Flat for Correctness
  – Safety
  – Starvation Avoidance
• Token Coherence: Hierarchical for Performance
• Evaluation

Example: Token Coherence [ISCA 2003]
[Figure: processors P0 to P3, each with L1 I&D and L2 caches, connected by an interconnect to memories mem 0 to mem 3. A Load B and a Store B are issued.]
• Each memory block initialized with T tokens
• Tokens stored in memory, caches, & messages
• At least one token to read a block
• All tokens to write a block
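The four token rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `Holder`, the helper `transfer`, and the value `T = 8` are hypothetical (the slides only name the count T), and in-flight messages are not modeled as separate token holders.

```python
from dataclasses import dataclass

T = 8  # hypothetical token count per block; the slides only call it T


@dataclass
class Holder:
    """A cache or memory that can hold tokens for one block (illustrative)."""
    name: str
    tokens: int = 0

    def can_read(self) -> bool:
        # Rule from the slide: at least one token to read a block.
        return self.tokens >= 1

    def can_write(self) -> bool:
        # Rule from the slide: all T tokens to write a block.
        return self.tokens == T


def transfer(src: Holder, dst: Holder, count: int) -> None:
    """Move tokens between holders; tokens are never created or destroyed."""
    if not 0 < count <= src.tokens:
        raise ValueError("cannot send more tokens than the sender holds")
    src.tokens -= count
    dst.tokens += count


memory = Holder("mem0", tokens=T)  # the block starts with all T tokens in memory
p0 = Holder("P0")
p1 = Holder("P1")

transfer(memory, p0, 1)            # P0 loads B: one token is enough to read
transfer(memory, p1, T - 1)        # P1 wants to store B, but P0 still holds a token
assert p0.can_read() and not p1.can_write()

transfer(p0, p1, 1)                # P0 gives up its token, so it can no longer read
assert p1.can_write() and not p0.can_read()

# Conservation: the total number of tokens for the block never changes.
assert memory.tokens + p0.tokens + p1.tokens == T
```

Because a writer must hold every token, no reader can hold one at the same time, which is the safety property the counting is meant to give regardless of how the caches are arranged.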

Extending to Multiple-CMP System
[Figure: CMP 0 and CMP 1, each with processors, L1 I&D caches, an on-chip interconnect, and a Shared L2, connected by an interconnect to mem 0 and mem 1.]

Extending to Multiple-CMP System
[Figure: CMP 0 (P0, P1) and CMP 1 (P2, P3), each with L1 I&D caches, an on-chip interconnect, and a Shared L2, connected by an interconnect to mem 0 and mem 1. A Store B is issued.]
• Token counting remains flat
• Tokens to caches
  – Handles shared caches and other complex hierarchies

Starvation Avoidance
[Figure: the same two-CMP system. Two processors issue Store B at the same time, each sending a GETX.]
• Tokens move freely in the system
  – Transient requests can miss in-flight tokens
  – Incorrect speculation, filters, prediction, etc.

Starvation Avoidance
[Figure: the same two-CMP system with a Store B outstanding.]
• Solution: issue Persistent Request
  – Heavyweight request guaranteed to succeed
  – Methods: Centralized [2003] and Distributed (New)

Old Scheme: Central Arbiter [2003]
[Figure: P0, P1, and P2 each time out on a Store B. Arbiter 0 queues the requests as B: P0, B: P2, B: P1.]
  – Processors issue persistent requests

Old Scheme: Central Arbiter [2003]
[Figure: arbiter 0 broadcasts the activation B: P0 to the L1 caches, Shared L2s, and memories. B: P2 and B: P1 remain queued at the arbiter.]
  – Processors issue persistent requests
  – Arbiter orders and broadcasts activate

Old Scheme: Central Arbiter [2003]
[Figure: three numbered messages (1, 2, 3) replace the active entry B: P0 with B: P2 at every node. B: P1 remains queued at arbiter 0.]
  – Processor sends deactivate to arbiter
  – Arbiter broadcasts deactivate (and next activate)
  – Bottom Line: handoff is 3 message latencies
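The ordering role of the central arbiter can be sketched as a FIFO queue. This is a minimal model, not the authors' implementation: the class `CentralArbiter`, its method names, and the log format are hypothetical, and only the arbiter's bookkeeping is modeled. The token transfer that follows an activation is left out, so the sketch does not reproduce the full three-message handoff.

```python
from collections import deque


class CentralArbiter:
    """FIFO arbiter for persistent requests to one block (hypothetical model)."""

    def __init__(self):
        self.queue = deque()  # pending requesters, in arrival order
        self.active = None    # requester whose persistent request is active
        self.log = []         # broadcasts sent by the arbiter, in order

    def request(self, proc):
        """A processor sends a persistent request to the arbiter."""
        self.queue.append(proc)
        self._activate_next()

    def deactivate(self, proc):
        """The active processor tells the arbiter it is done."""
        if proc != self.active:
            raise ValueError(f"{proc} does not hold the active request")
        self.log.append(("deactivate", proc))  # arbiter broadcasts deactivate
        self.active = None
        self._activate_next()                  # ...and the next activate

    def _activate_next(self):
        if self.active is None and self.queue:
            self.active = self.queue.popleft()
            self.log.append(("activate", self.active))


arbiter = CentralArbiter()
for proc in ["P0", "P2", "P1"]:  # arrival order taken from the slide's queue
    arbiter.request(proc)

arbiter.deactivate("P0")
print(arbiter.active)  # P2
```

Every handoff in this model passes through the arbiter, which is where the extra message latency of the centralized scheme comes from.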

Improved Scheme: Distributed Arbitration [NEW]
[Figure: two Store B requests are outstanding. Every L1 cache, Shared L2, and memory keeps its own table with entries P0: B, P1: B, P2: B.]
  – Processors broadcast persistent requests

Improved Scheme: Distributed Arbitration [NEW]
[Figure: every node still holds the table entries P0: B, P1: B, P2: B.]
  – Processors broadcast persistent requests
  – Fixed priority (processor number)

Improved Scheme: Distributed Arbitration [NEW]
[Figure: every node holds the table entries P0: B, P1: B, P2: B. A single numbered message (1) is shown.]
  – Processors broadcast persistent requests
  – Fixed priority (processor number)
  – Processors broadcast deactivate

Improved Scheme: Distributed Arbitration [NEW]
[Figure: after the deactivate, every node holds the table entries P1: B, P2: B.]
  – Bottom line: Handoff is a single message latency
    • Subtle point: P0 and P1 must wait until next "wave"
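A small sketch can make the fixed-priority rule and the "wave" restriction concrete. It rests on assumptions that are not in the slides: the class `DistributedArbitration` and its methods are hypothetical, the per-node tables are modeled as a single shared set (as if every broadcast arrived everywhere at once), and the wave is approximated by blocking a processor until every request that was pending when it deactivated has also completed.

```python
class DistributedArbitration:
    """Fixed-priority persistent requests for one block (hypothetical model)."""

    def __init__(self):
        self.table = set()    # processors with an outstanding persistent request
        self.blocked_by = {}  # proc -> requests it must see deactivated first

    def active(self):
        # Fixed priority: the lowest processor number wins.
        return min(self.table) if self.table else None

    def can_issue(self, proc):
        return not self.blocked_by.get(proc)

    def broadcast_request(self, proc):
        if not self.can_issue(proc):
            raise RuntimeError(f"P{proc} must wait for the next wave")
        self.table.add(proc)

    def broadcast_deactivate(self, proc):
        if proc != self.active():
            raise RuntimeError(f"P{proc} is not the active requester")
        self.table.remove(proc)
        for pending in self.blocked_by.values():
            pending.discard(proc)
        # Assumed wave rule: remember who was still waiting when we finished.
        self.blocked_by[proc] = set(self.table)


arb = DistributedArbitration()
for proc in (2, 0, 1):  # arrival order does not matter, only processor number
    arb.broadcast_request(proc)

served = []
p0_blocked_mid_wave = False
while arb.active() is not None:
    winner = arb.active()
    served.append(winner)
    arb.broadcast_deactivate(winner)
    if winner == 0:
        p0_blocked_mid_wave = not arb.can_issue(0)

print(served)  # [0, 1, 2]
```

Without the wave rule, a low-numbered processor could re-request immediately and starve higher-numbered ones, since priority is fixed by processor number. No arbiter sits between the deactivate and the next winner, which is consistent with the single-message handoff stated on the slide.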

Outline
• Motivation and Background
• Token Coherence: Flat for Correctness
• Token Coherence: Hierarchical for Performance
• Evaluation

Hierarchical for Performance: TokenCMP
• Target System:
  – 2-8 CMPs
  – Private L1s, shared L2 per CMP
  – Any interconnect, but high-bandwidth
• Performance Policy Goals:
  – Aggressively acquire tokens
  – Exploit on-chip locality and bandwidth
  – Respect cache hierarchy
  – Detecting and handling missed tokens

Hierarchical for Performance: TokenCMP
• Approach:
  – On L1 miss, broadcast within own CMP
    • Local cache responds if possible
  – On L2 miss, broadcast to other CMPs
  – Appropriate L2 bank responds or broadcasts within its CMP
    • Optionally filter
  – Responses between CMPs carry extra tokens for future locality
• Handling missed tokens:
  – Timeout after average memory latency
  – Invoke persistent request (no retries)
• Larger systems can use filters, multicast, soft-state directories
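The escalation order described above (local broadcast, then broadcast to other CMPs, then a persistent request after a timeout) can be written as a small decision function. The names `Step` and `plan_miss`, and the cycle counts in the usage lines, are hypothetical and chosen only for illustration; the sketch ignores filtering and the extra tokens carried by inter-CMP responses.

```python
from enum import Enum, auto
from typing import List, Optional


class Step(Enum):
    LOCAL_BROADCAST = auto()     # on L1 miss: broadcast within own CMP
    GLOBAL_BROADCAST = auto()    # on L2 miss: broadcast to other CMPs
    PERSISTENT_REQUEST = auto()  # on timeout: heavyweight guaranteed request


def plan_miss(local_hit: bool,
              remote_reply_cycles: Optional[int],
              avg_memory_latency: int) -> List[Step]:
    """Return the steps a requester takes for one miss (illustrative only).

    remote_reply_cycles is None when the transient request missed the tokens
    and no sufficient reply ever arrives.
    """
    steps = [Step.LOCAL_BROADCAST]
    if local_hit:
        return steps  # a local cache responded with the data and tokens
    steps.append(Step.GLOBAL_BROADCAST)
    timed_out = (remote_reply_cycles is None
                 or remote_reply_cycles > avg_memory_latency)
    if timed_out:
        # No retries: go straight to a persistent request.
        steps.append(Step.PERSISTENT_REQUEST)
    return steps


print(plan_miss(True, None, 200))   # satisfied on chip
print(plan_miss(False, 150, 200))   # satisfied by another CMP or memory
print(plan_miss(False, None, 200))  # tokens missed, so persistent request
```

Escalating directly to a persistent request, rather than retrying the transient request, keeps the performance policy simple because the correctness substrate already guarantees the persistent request will succeed.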

Outline
• Motivation and Background
• Token Coherence: Flat for Correctness
• Token Coherence: Hierarchical for Performance
• Evaluation
  – Model checking
  – Performance w/ commercial workloads
  – Robustness

TokenCMP Evaluation
• Simple?
  – Model checking
• Fast?
  – Full-system simulation w/ commercial workloads
• Robust?
  – Micro-benchmarks to simulate high contention

Complexity Evaluation with Model Checking
• Methods:
  – TLA+ and TLC
  – DirectoryCMP omits all intra-CMP details
  – TokenCMP's correctness substrate modeled
• Result:
  – Complexity similar between TokenCMP and non-hierarchical DirectoryCMP
  – Correctness Substrate verified to be correct and deadlock-free
    • Small configuration, varied parameters
  – All possible performance protocols correct

Performance Evaluation
• Target System:
  – 4 CMPs, 4 procs/CMP
  – 2 GHz OoO SPARC, 8 MB shared L2 per chip
  – Directly connected interconnect
• Methods: Multifacet GEMS simulator
  – Simics augmented with timing models
  – Released soon: http://www.cs.wisc.edu/gems
  – ISCA 2005 Tutorial!
• Benchmarks:
  – Performance: Apache, Spec, OLTP
  – Robustness: Locking uBenchmark

Full-system Simulation: Runtime
[Figure: runtime results.]
  – TokenCMP performs 9-50% faster than DirectoryCMP

Full-system Simulation: Runtime
[Figure: runtime results. Labels: DRAM Directory, Perfect L2.]
  – TokenCMP performs 9-50% faster than DirectoryCMP

Full-system Simulation: Traffic
[Figure: traffic results.]
  – TokenCMP traffic is reasonable (or better)
    • DirectoryCMP control overhead greater than broadcast for small system

Performance Robustness
Locking micro-benchmark (correctness substrate only)
[Figure: results plotted along an axis running from more contention to less contention.]

Performance Robustness
Locking micro-benchmark (correctness substrate only)
[Figure: results plotted along an axis running from more contention to less contention.]

Performance Robustness
Locking micro-benchmark
[Figure: results plotted along an axis running from more contention to less contention.]

Summary
• Microprocessor → Chip Multiprocessor (CMP)
• Symmetric Multiprocessor (SMP) → Multiple CMPs
• Problem: Coherence with Multiple CMPs
• Old Solution: Hierarchical Protocol → Complex & Slow
• New Solution: Apply Token Coherence
  – Developed for glueless multiprocessor [2003]
  – Keep: Flat for Correctness
  – Exploit: Hierarchical for performance
• Less Complex & Faster than Hierarchical Directory


Full-system Simulation: Traffic
[Figure: traffic results.]

Full-system Simulation: Intra-CMP Traffic
[Figure: intra-CMP traffic results.]