Token Coherence: A Framework for Implementing Multiple-CMP Systems

Mike Marty (1), Jesse Bingham (2), Mark Hill (1), Alan Hu (2), Milo Martin (3), and David Wood (1)
(1) University of Wisconsin-Madison
(2) University of British Columbia
(3) University of Pennsylvania
February 17th, 2005
(C) 2005 Multifacet Project

Summary
• Microprocessor → Chip Multiprocessor (CMP)
• Symmetric Multiprocessor (SMP) → Multiple CMPs
• Problem: Coherence with Multiple CMPs
• Old Solution: Hierarchical Directory (complex and slow)
• New Solution: Apply Token Coherence
  – Developed for glueless multiprocessor [2003]
  – Keep: Flat for correctness
  – Exploit: Hierarchical for performance
• Less Complex & Faster than Hierarchical Directory

Improving Multiple-CMP Systems using Token Coherence

Outline
• Motivation and Background
  – Coherence in Multiple-CMP Systems
  – Example: DirectoryCMP
• Token Coherence: Flat for Correctness
• Token Coherence: Hierarchical for Performance
• Evaluation

Coherence in Multiple-CMP Systems
• Chip Multiprocessors (CMPs) emerging
• Larger systems will be built with Multiple CMPs
[Figure: four CMPs (CMP 1 to CMP 4), each with processors, I and D caches, and an L2 on an on-chip interconnect, joined by a system interconnect]

Problem: Hierarchical Coherence
• Intra-CMP protocol for coherence within a CMP
• Inter-CMP protocol for coherence between CMPs
• Interactions between protocols increase complexity
  – explodes state space
[Figure: CMP 1 to CMP 4 on an interconnect, labeled with Intra-CMP Coherence and Inter-CMP Coherence]

Improving Multiple-CMP Systems with Token Coherence
• Token Coherence allows Multiple-CMP systems to be...
  – Flat for correctness, but
  – Hierarchical for performance
[Figure: CMP 1 to CMP 4 on an interconnect; labels: Low Complexity, Fast, Correctness Substrate, Performance Protocol]

Example: DirectoryCMP
2-level MOESI Directory
RACE CONDITIONS!
[Figure: CMP 0 (P0 to P3) and CMP 1 (P4 to P7), each with L1 I&D caches and a shared L2/directory, backed by memory/directory. A Store B triggers getx, fwd, inv, ack, data/ack, and WB messages at both levels. Block B state: [M [S O] I]]

Token Coherence Summary
• Token Coherence separates performance from correctness
• Correctness Substrate: Enforces coherence invariant and prevents starvation
  1. Safety with Token Counting
  2. Starvation Avoidance with Persistent Requests
• Performance Policy: Makes the common case fast
  – Transient requests to seek tokens
    • Unordered, untracked, unacknowledged
  – Possible prediction, multicast, filters, etc.

Outline
• Motivation and Background
• Token Coherence: Flat for Correctness
  – Safety
  – Starvation Avoidance
• Token Coherence: Hierarchical for Performance
• Evaluation

Example: Token Coherence [ISCA 2003]
[Figure: processors P0 to P3, each with L1 I&D and L2 caches, connected by an interconnect to memories mem 0 to mem 3; requests shown: Load B, Store B]
• Each memory block initialized with T tokens
• Tokens stored in memory, caches, & messages
• At least one token to read a block
• All tokens to write a block
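
The four token-counting rules above can be sketched in a few lines of Python. This is only an illustration, not the protocol from the talk: the `Block` class, the holder names, and the choice of T = 4 are assumptions made for the example.

```python
from collections import Counter

T = 4  # assumed number of tokens per block (hypothetical value)


class Block:
    """Tracks which holder currently has how many of one block's T tokens."""

    def __init__(self, total=T):
        self.total = total
        # Rule 1: every block starts with all T tokens at memory.
        self.holders = Counter({"mem": total})

    def move(self, src, dst, n):
        """Transfer n tokens; tokens are never created or destroyed."""
        if n <= 0 or self.holders[src] < n:
            raise ValueError("source does not hold enough tokens")
        self.holders[src] -= n
        self.holders[dst] += n
        assert sum(self.holders.values()) == self.total

    def can_read(self, who):
        # Rule 3: at least one token to read.
        return self.holders[who] >= 1

    def can_write(self, who):
        # Rule 4: all tokens to write.
        return self.holders[who] == self.total
```

Because a writer must hold every token, no other holder can have a token, and therefore no other holder can read, at the same time. That is the single-writer or multiple-readers invariant.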

Extending to Multiple-CMP System
[Figure: CMP 0 and CMP 1, each with processors, L1 I&D caches, and a shared L2 on an on-chip interconnect; the chips connect through an interconnect to mem 0 and mem 1]

Extending to Multiple-CMP System
[Figure: the two-CMP system (P0 to P3, L1 I&D caches, shared L2s, mem 0, mem 1) with a Store B request]
• Token counting remains flat
• Tokens to caches
  – Handles shared caches and other complex hierarchies

Safety Recap
• Safety: Maintain coherence invariant
  – Only one writer, or multiple readers
• Tokens for Safety
  – T tokens associated with each memory block
  – # tokens encoded in 1 + log2(T) bits
  – Processor acquires all tokens to write, a single token to read
• Tokens passed to nodes in glueless multiprocessor scheme
  – But CMPs have private and shared caches
• Tokens passed to caches in Multiple-CMP system
  – Arbitrary cache hierarchy easily handled
  – Flat for correctness

Some Token Counting Implications
• Memory must store tokens
  – Separate RAM
  – Use extra ECC bits
  – Token cache
• T sized to # caches to allow read-only copies in all caches
• Replacements cannot be silent
  – Tokens must not be lost or dropped
• Targeted for invalidate-based protocols
  – Not a solution for write-through or update protocols
• Tokens must be identified by block address
  – Address must be in all token-carrying messages
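
To give a feel for what storing tokens costs, the sketch below computes the width of a token-count field and its size relative to a data block. The 1 + log2(T) figure comes from the slides; the function names, the choice of T = 16, and the 64-byte block size are assumptions made for this example.

```python
def token_field_bits(num_tokens):
    """Bits needed to encode any count from 0 to num_tokens inclusive.

    For a power-of-two T this equals 1 + log2(T).
    """
    return num_tokens.bit_length()


def storage_overhead(num_tokens, block_bytes):
    """Token-field size as a fraction of the data block size."""
    return token_field_bits(num_tokens) / (block_bytes * 8)


# Hypothetical sizing: T = 16 tokens and 64-byte blocks.
bits = token_field_bits(16)
overhead = storage_overhead(16, 64)
print(f"{bits} bits per block, {overhead:.2%} overhead")
```

Under these assumed numbers the field is 5 bits wide, a little under 1% of the block.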

Starvation Avoidance
• Request messages can miss tokens
  – In-flight tokens
• Transient Requests are not tracked throughout system
  – Incorrect filtering, multicast, destination-set prediction, etc.
• Possible Solution: Retries
  – Retry w/ optional randomized backoff is effective for races
• Guaranteed Solution: Persistent Requests
  – Heavyweight request guaranteed to succeed
  – Should be rare (uses more bandwidth)
  – Locates all tokens in the system
  – Orders competing requests
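
The two options on this slide can be combined as "retry a few times, then escalate". The sketch below is one possible reading, not the mechanism from the talk: `send_transient` and `send_persistent` are hypothetical callables, and the retry count, the doubling backoff window, and the cycle units are all assumptions.

```python
import random


def acquire(send_transient, send_persistent, max_retries=3,
            base_window=64, rng=None):
    """Try transient requests with randomized backoff, then escalate.

    send_transient() returns True if enough tokens arrived.
    send_persistent() is assumed to always succeed eventually.
    Returns (how the request was satisfied, total backoff cycles waited).
    """
    rng = rng or random.Random()
    waited = 0
    for attempt in range(max_retries + 1):
        if send_transient():
            return "transient", waited
        if attempt < max_retries:
            # Random wait drawn from a window that doubles each attempt,
            # so racing requesters are unlikely to collide again.
            waited += rng.randint(0, base_window * 2 ** attempt)
    send_persistent()
    return "persistent", waited
```

Setting `max_retries=0` gives the simpler policy of one transient request followed directly by a persistent request.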

Starvation Avoidance
[Figure: the two-CMP system; two processors issue Store B and send GETX requests]
• Tokens move freely in the system
  – Transient requests can miss in-flight tokens
  – Incorrect speculation, filters, prediction, etc.

Starvation Avoidance
[Figure: the two-CMP system with a Store B request]
• Solution: issue Persistent Request
  – Heavyweight request guaranteed to succeed
  – Methods: Centralized [2003] and Distributed (New)

Old Scheme: Central Arbiter [2003]
[Figure: the two-CMP system; three Store B requests time out and persistent requests reach arbiter 0, which queues B: P0, B: P2, B: P1]
– Processors issue persistent requests

Old Scheme: Central Arbiter [2003]
[Figure: arbiter 0 broadcasts the activation B: P0 to the caches and memories; B: P2 and B: P1 remain queued at the arbiter]
– Processors issue persistent requests
– Arbiter orders and broadcasts activate

Old Scheme: Central Arbiter [2003]
[Figure: handoff from P0 to P2 in three numbered steps (1, 2, 3); the activated entry at each cache and memory changes from B: P0 to B: P2]
– Processor sends deactivate to arbiter
– Arbiter broadcasts deactivate (and next activate)
– Bottom Line: handoff is 3 message latencies

Improved Scheme: Distributed Arbitration [NEW]
[Figure: the two-CMP system; every L1, shared L2, and memory holds a table listing P0: B, P1: B, P2: B]
– Processors broadcast persistent requests

Improved Scheme: Distributed Arbitration [NEW]
[Figure: the two-CMP system; every L1, shared L2, and memory holds a table listing P0: B, P1: B, P2: B]
– Processors broadcast persistent requests
– Fixed priority (processor number)

Improved Scheme: Distributed Arbitration [NEW]
[Figure: the two-CMP system; every L1, shared L2, and memory holds a table listing P0: B, P1: B, P2: B; step 1 is marked]
– Processors broadcast persistent requests
– Fixed priority (processor number)
– Processors broadcast deactivate

Improved Scheme: Distributed Arbitration [NEW]
[Figure: the two-CMP system after P0's entry is removed; every table lists P1: B, P2: B]
– Bottom line: Handoff is a single message latency
  • Subtle point: P0 and P1 must wait until next “wave”

Implementing Distributed Persistent Requests
• Table at each cache
  – Sized to N entries for each processor (we use N=1)
  – Indexed by processor ID
  – Content-addressable by Address
• Each incoming message must access table
  – Not on the critical path, so it can be a slow CAM
• Activate/deactivate reordering cannot be allowed
  – Persistent request virtual channel must be point-to-point ordered
  – Or, other solution such as sequence numbers or acks
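
The table described above can be modeled as a small array with one slot per processor. This is a minimal sketch under stated assumptions: the class and method names are invented for the example, N = 1 as on the slide, "fixed priority" is taken to mean that the lowest processor number wins, and the "wave" rule that stops a processor from immediately re-requesting is not modeled.

```python
class PersistentRequestTable:
    """Per-cache table of active persistent requests (N = 1 per processor)."""

    def __init__(self, num_procs):
        # Indexed by processor ID; None means no active request.
        self.entries = [None] * num_procs

    def activate(self, proc, addr):
        self.entries[proc] = addr

    def deactivate(self, proc):
        self.entries[proc] = None

    def winner(self, addr):
        """Search by address (the CAM lookup) and apply fixed priority.

        Assumption: the lowest-numbered processor has the highest priority.
        Returns the processor that should receive tokens for addr, or None.
        """
        for proc, active_addr in enumerate(self.entries):
            if active_addr == addr:
                return proc
        return None
```

Every cache applies the same rule to the same table contents, so all of them forward tokens for a block to the same processor without consulting an arbiter. When that processor broadcasts its deactivate, each cache can pick the next winner locally, which is why the handoff takes a single message latency.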

Implementing Distributed Persistent Requests
• Should reads be distinguished from writes?
  – Not necessary, but
  – Persistent Read request is helpful
• Implications of flat distributed arbitration
  – Simple flat for correctness
  – Global broadcast when used
    • Fortunately they are rare in typical workloads (0.3%)
    • Bad workload (very high contention) would burn bandwidth
  – Maximum # processors must be architected
• What about a hierarchical persistent request scheme?
  – Possible, but correctness is no longer flat
  – Make the common case fast

Reducing Unnecessary Traffic
• Problem: Which token-holding cache responds with data?
• Solution: Distinguish one token as the owner token
  – The owner includes data with token response
  – Clean vs. dirty owner distinction also useful for writebacks
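
A small sketch of why the owner token cuts traffic: when a writer collects tokens, every holder replies, but only one reply needs to carry the data. The `Holder` type, the reply dictionaries, and the holder names are assumptions made for this example, and the clean/dirty distinction is left out.

```python
from dataclasses import dataclass


@dataclass
class Holder:
    name: str
    tokens: int
    owner: bool = False  # True if this holder has the owner token


def collect_for_write(holders):
    """Every token holder gives up its tokens; only the owner attaches data."""
    replies = []
    for h in holders:
        if h.tokens == 0:
            continue  # nothing to send
        replies.append({"from": h.name, "tokens": h.tokens, "data": h.owner})
        h.tokens, h.owner = 0, False
    return replies
```

With three token holders, the writer receives three token messages but only one copy of the block.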

Outline
• Motivation and Background
• Token Coherence: Flat for Correctness
• Token Coherence: Hierarchical for Performance
  – TokenCMP
  – Another look at performance policies
• Evaluation

Hierarchical for Performance: TokenCMP
• Target System:
  – 2-8 CMPs
  – Private L1s, shared L2 per CMP
  – Any interconnect, but high-bandwidth
• Performance Policy Goals:
  – Aggressively acquire tokens
  – Exploit on-chip locality and bandwidth
  – Respect cache hierarchy
  – Detecting and handling missed tokens

Hierarchical for Performance: TokenCMP
• Approach:
  – On L1 miss, broadcast within own CMP
    • Local cache responds if possible
  – On L2 miss, broadcast to other CMPs
  – Appropriate L2 bank responds or broadcasts within its CMP
    • Optionally filter
  – Responses between CMPs carry extra tokens for future locality
• Handling missed tokens:
  – Timeout after average memory latency
  – Invoke persistent request (no retries)
• Larger systems can use filters, multicast, soft-state directories
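
The escalation path on this slide (on-chip broadcast, then off-chip broadcast, then persistent request on timeout) can be written as a short control flow. This is a sketch only: the three callables are hypothetical stand-ins for the hardware actions, and a timeout is modeled as a broadcast returning `None`.

```python
def service_miss(intra_cmp_broadcast, inter_cmp_broadcast, persistent_request):
    """Escalate one miss through the three steps described above.

    The two broadcast callables return the tokens they collected, or None
    when nothing usable arrives before the timeout.
    """
    tokens = intra_cmp_broadcast()   # L1 miss: ask caches on the same chip
    if tokens is not None:
        return "intra-CMP", tokens
    tokens = inter_cmp_broadcast()   # L2 miss: ask the other CMPs
    if tokens is not None:
        return "inter-CMP", tokens
    # Timed out: no retries, go straight to the heavyweight request.
    return "persistent", persistent_request()
```

The performance policy only decides where to look first. If both broadcasts miss, correctness still rests on the persistent request.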

Other Optimizations in TokenCMP
• Implementing E-state
  – Memory responds with all tokens on read request
  – Use clean/dirty owner distinction to eliminate writing back unwritten data
• Implementing Migratory Sharing
  – What is it?
    • A processor’s read request results in exclusive permission if responder has exclusive permission and wrote the block
  – In TokenCMP, simply return all tokens
• Non-speculative delay
  – Hold block for some # cycles so permission isn’t stolen prematurely
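
Both the E-state and migratory-sharing optimizations come down to a responder choosing how many tokens to return for a read. The function below is a sketch of that decision. Its name and parameters are invented for the example, and returning exactly one token in the remaining cases is an assumption (the slides note that inter-CMP responses may carry extra tokens).

```python
def tokens_for_read(held, total, is_memory=False, wrote_block=False):
    """Number of tokens a responder returns for a read request."""
    if held == 0:
        return 0  # nothing to give
    if held == total and is_memory:
        return total  # E-state: no other copy exists, so hand over everything
    if held == total and wrote_block:
        return total  # migratory sharing: the reader will likely write next
    return 1  # assumed default: a single token grants read permission
```

A requester that receives all tokens can later write the block without sending another request.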

Another Look at Performance Policies
• How to find tokens?
  – Broadcast w/ filters
  – Multicast (destination-set prediction)
  – Directories (soft or hard)
• Who responds with data?
  – Owner token
    • TokenCMP uses Owner token for Inter-CMP responses
  – Other heuristics
    • For TokenCMP intra-CMP responses, cache responds if it has extra tokens

Transient Requests May Reduce Complexity
• Processor holds the only required state about request
• L2 controller in TokenCMP very simple:
  – Re-broadcasts L1 request message on a miss
  – Re-broadcasts or filters external request messages
  – Possible states:
    • no tokens (I)
    • all tokens (M)
    • some tokens (S)
  – Bounce unexpected tokens to memory
• DirectoryCMP’s L2 controller is complex
  – Allocates MSHR on miss and forward
  – Issues invalidates and receives acks
  – Orders all intra-CMP requests and writebacks
  – 57 states in our L2 implementation!
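
The three L2 states listed above follow directly from the token count, with no transient states in between. A minimal sketch, with a hypothetical function name:

```python
def l2_state(tokens, total):
    """Map a token count to the I / S / M states named on the slide."""
    if not 0 <= tokens <= total:
        raise ValueError("token count out of range")
    if tokens == 0:
        return "I"  # no tokens
    if tokens == total:
        return "M"  # all tokens
    return "S"      # some tokens
```

The state is derived from a counter rather than from a history of messages, which is one way to read the contrast with the 57-state directory controller.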

Writebacks
• DirectoryCMP uses “3-phase writebacks”
  – L1 issues writeback request
  – L2 enters transient state or blocks request
  – L2 responds with writeback
  – L1 sends data
• TokenCMP uses “fire-and-forget” writebacks
  – Immediately send tokens and data
  – Heuristic: Only send data if # tokens > 1
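
A fire-and-forget writeback is a single message with no handshake. The sketch below builds that message and applies the slide's heuristic literally. The message layout and function name are assumptions, and owner-token handling (for example, whether a dirty owner must always include data) is deliberately not modeled.

```python
def writeback_message(addr, tokens, data):
    """Build the one message an evicting cache sends; no reply is awaited."""
    if tokens < 1:
        raise ValueError("nothing to write back")
    msg = {"addr": addr, "tokens": tokens}  # the address travels with tokens
    if tokens > 1:
        # Heuristic from the slide: include data only with more than one token.
        msg["data"] = data
    return msg
```

The sender can free the cache line as soon as the message leaves, since the tokens themselves carry the permission being given up.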

Outline
• Motivation and Background
• Token Coherence: Flat for Correctness
• Token Coherence: Hierarchical for Performance
• Evaluation
  – Model checking
  – Performance w/ commercial workloads
  – Robustness

TokenCMP Evaluation
• Simple?
  – Some anecdotal examples and comparisons
  – Model checking
• Fast?
  – Full-system simulation w/ commercial workloads
• Robust?
  – Micro-benchmarks to simulate high contention

Complexity Evaluation with Model Checking
This work performed by Jesse Bingham and Alan Hu of the University of British Columbia
• Methods:
  – TLA+ and TLC
  – DirectoryCMP omits all intra-CMP details
  – TokenCMP’s correctness substrate modeled
• Result:
  – Complexity similar between TokenCMP and non-hierarchical DirectoryCMP
  – Correctness Substrate verified to be correct and deadlock-free
  – All possible performance protocols correct

Performance Evaluation
• Target System:
  – 4 CMPs, 4 procs/CMP
  – 2 GHz OoO SPARC, 8 MB shared L2 per chip
  – Directly connected interconnect
• Methods: Multifacet GEMS simulator
  – Simics augmented with timing models
  – Released soon: http://www.cs.wisc.edu/gems
• Benchmarks:
  – Performance: Apache, Spec, OLTP
  – Robustness: Locking uBenchmark

Full-system Simulation: Runtime
[Figure: runtime chart]
– TokenCMP performs 9-50% faster than DirectoryCMP

Full-system Simulation: Runtime
[Figure: runtime chart; labels: DRAM Directory, Perfect L2]
– TokenCMP performs 9-50% faster than DirectoryCMP

Full-system Simulation: Inter-CMP Traffic
[Figure: inter-CMP traffic chart]
– TokenCMP traffic is reasonable (or better)
  • DirectoryCMP control overhead greater than broadcast for small system

Full-system Simulation: Intra-CMP Traffic
[Figure: intra-CMP traffic chart]

Performance Robustness
Locking micro-benchmark (correctness substrate only)
[Figure: chart with an axis running from more contention to less contention]

Performance Robustness
Locking micro-benchmark (correctness substrate only)
[Figure: chart with an axis running from more contention to less contention]

Performance Robustness
Locking micro-benchmark
[Figure: chart with an axis running from more contention to less contention]

Summary
• Microprocessor → Chip Multiprocessor (CMP)
• Symmetric Multiprocessor (SMP) → Multiple CMPs
• Problem: Coherence with Multiple CMPs
• Old Solution: Hierarchical Directory (complex and slow)
• New Solution: Apply Token Coherence
  – Developed for glueless multiprocessor [2003]
  – Keep: Flat for correctness
  – Exploit: Hierarchical for performance
• Less Complex & Faster than Hierarchical Directory