Token Coherence: A Framework for Implementing Multiple-CMP Systems
- Slides: 46
Token Coherence: A Framework for Implementing Multiple-CMP Systems
Mike Marty (1), Jesse Bingham (2), Mark Hill (1), Alan Hu (2), Milo Martin (3), and David Wood (1)
(1) University of Wisconsin-Madison, (2) University of British Columbia, (3) University of Pennsylvania
February 17th, 2005
(C) 2005 Multifacet Project

Summary
• Microprocessor → Chip Multiprocessor (CMP)
• Symmetric Multiprocessor (SMP) → Multiple CMPs
• Problem: Coherence with Multiple CMPs
• Old Solution: Hierarchical Directory → Complex & Slow
• New Solution: Apply Token Coherence
  – Developed for glueless multiprocessor [2003]
  – Keep: Flat for correctness
  – Exploit: Hierarchical for performance
• Less Complex & Faster than Hierarchical Directory
Improving Multiple-CMP Systems using Token Coherence

Outline
• Motivation and Background
  – Coherence in Multiple-CMP Systems
  – Example: DirectoryCMP
• Token Coherence: Flat for Correctness
• Token Coherence: Hierarchical for Performance
• Evaluation

Coherence in Multiple-CMP Systems
• Chip Multiprocessors (CMPs) emerging
• Larger systems will be built with Multiple CMPs
[Figure: CMP 1 to CMP 4 joined by an interconnect; each CMP contains processors (P) with L1 instruction (I) and data (D) caches, an on-chip interconnect, and an L2]

Problem: Hierarchical Coherence
• Intra-CMP protocol for coherence within CMP
• Inter-CMP protocol for coherence between CMPs
• Interactions between protocols increase complexity
  – explodes state space
[Figure: CMP 1 to CMP 4 joined by an interconnect; labels: Intra-CMP Coherence, Inter-CMP Coherence]

Improving Multiple-CMP Systems with Token Coherence
• Token Coherence allows Multiple-CMP systems to be...
  – Flat for correctness, but
  – Hierarchical for performance
[Figure: CMP 1 to CMP 4 joined by an interconnect; labels: Correctness Substrate, Performance Protocol, Low Complexity, Fast]

Example: DirectoryCMP
2-level MOESI Directory
RACE CONDITIONS!
[Figure: CMP 0 and CMP 1 with processors P0 to P7, L1 I&D caches, a shared L2/directory per CMP, and Memory/Directory; a Store B generates getx, fwd, inv, ack, data/ack, and WB messages at both levels; block B state annotated as [M [S O] I]]

Token Coherence Summary
• Token Coherence separates performance from correctness
• Correctness Substrate: Enforces coherence invariant and prevents starvation
  1. Safety with Token Counting
  2. Starvation Avoidance with Persistent Requests
• Performance Policy: Makes the common case fast
  – Transient requests to seek tokens
    • Unordered, untracked, unacknowledged
  – Possible prediction, multicast, filters, etc.

Outline
• Motivation and Background
• Token Coherence: Flat for Correctness
  – Safety
  – Starvation Avoidance
• Token Coherence: Hierarchical for Performance
• Evaluation

Example: Token Coherence [ISCA 2003]
[Figure: processors P0 to P3, each with L1 I&D and L2 caches, joined by an interconnect to mem 0 through mem 3; a Load B and a Store B are issued]
• Each memory block initialized with T tokens
• Tokens stored in memory, caches, & messages
• At least one token to read a block
• All tokens to write a block

Extending to Multiple-CMP System
[Figure: the four-processor system regrouped into CMP 0 and CMP 1, each with per-processor L1 I&D caches, an on-chip interconnect, and a Shared L2; the CMPs connect through an interconnect to mem 0 and mem 1]

Extending to Multiple-CMP System
[Figure: CMP 0 and CMP 1 with processors P0 to P3, L1 I&D caches, on-chip interconnects, Shared L2s, mem 0 and mem 1; a Store B is issued]
• Token counting remains flat
• Tokens to caches
  – Handles shared caches and other complex hierarchies

Safety Recap
• Safety: Maintain coherence invariant
  – Only one writer, or multiple readers
• Tokens for Safety
  – T tokens associated with each memory block
  – # tokens encoded in 1 + log2 T bits
  – Processor acquires all tokens to write, a single token to read
• Tokens passed to nodes in glueless multiprocessor scheme
  – But CMPs have private and shared caches
• Tokens passed to caches in Multiple-CMP system
  – Arbitrary cache hierarchy easily handled
  – Flat for correctness

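The token-counting rules above can be sketched as a small model. This is a minimal illustration, not the protocol implementation: the `Block` class, the holder names, and `token_count_bits` are hypothetical, and rounding log2 T up for non-power-of-two T is an assumption.

```python
class Block:
    """Token accounting for one memory block (hypothetical model)."""

    def __init__(self, total):
        self.total = total                 # T, fixed when the block is initialized
        self.held = {"memory": total}      # holder name -> tokens; all start at memory

    def move(self, src, dst, count):
        """Move tokens between holders; tokens are never created or destroyed."""
        if count <= 0 or self.held.get(src, 0) < count:
            raise ValueError("src does not hold that many tokens")
        self.held[src] -= count
        self.held[dst] = self.held.get(dst, 0) + count

    def can_read(self, holder):
        # At least one token to read a block.
        return self.held.get(holder, 0) >= 1

    def can_write(self, holder):
        # All tokens to write a block.
        return self.held.get(holder, 0) == self.total


def token_count_bits(total):
    """Bits per token count, following the slide's 1 + log2 T.

    Assumption: log2 T is rounded up when T is not a power of two.
    """
    return 1 + (total - 1).bit_length()
```

Because a writer must hold all T tokens, no other cache can hold even one at the same time, which is the single-writer or multiple-readers invariant. The holders are just names, so an L1, a shared L2, or memory are treated identically; that is the sense in which counting stays flat.
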
Some Token Counting Implications
• Memory must store tokens
  – Separate RAM
  – Use extra ECC bits
  – Token cache
• T sized to # caches to allow read-only copies in all caches
• Replacements cannot be silent
  – Tokens must not be lost or dropped
• Targeted for invalidate-based protocols
  – Not a solution for write-through or update protocols
• Tokens must be identified by block address
  – Address must be in all token-carrying messages

Starvation Avoidance
• Request messages can miss tokens
  – In-flight tokens
• Transient Requests are not tracked throughout system
  – Incorrect filtering, multicast, destination-set prediction, etc.
• Possible Solution: Retries
  – Retry w/ optional randomized backoff is effective for races
• Guaranteed Solution: Persistent Requests
  – Heavyweight request guaranteed to succeed
  – Should be rare (uses more bandwidth)
  – Locates all tokens in the system
  – Orders competing requests

Starvation Avoidance
[Figure: CMP 0 and CMP 1 with processors P0 to P3; two processors issue Store B and send competing GETX requests]
• Tokens move freely in the system
  – Transient requests can miss in-flight tokens
  – Incorrect speculation, filters, prediction, etc.

Starvation Avoidance
[Figure: CMP 0 and CMP 1 with processors P0 to P3; a Store B remains unsatisfied]
• Solution: issue Persistent Request
  – Heavyweight request guaranteed to succeed
  – Methods: Centralized [2003] and Distributed (New)

Old Scheme: Central Arbiter [2003]
[Figure: P0, P1, and P2 each issue Store B and time out; arbiter 0 (at mem 0) queues the requests as B: P0, B: P2, B: P1]
  – Processors issue persistent requests

Old Scheme: Central Arbiter [2003]
[Figure: arbiter 0 holds B: P0, B: P2, B: P1 and broadcasts the activation B: P0 to every L1 and Shared L2]
  – Processors issue persistent requests
  – Arbiter orders and broadcasts activate

Old Scheme: Central Arbiter [2003]
[Figure: handoff from P0 to P2 in three numbered steps; every cache's active entry for B changes from P0 to P2]
  – Processor sends deactivate to arbiter
  – Arbiter broadcasts deactivate (and next activate)
  – Bottom line: handoff is 3 message latencies

Improved Scheme: Distributed Arbitration [NEW]
[Figure: P0, P1, and P2 issue Store B; every L1 and Shared L2 holds a table with entries P0: B, P1: B, P2: B]
  – Processors broadcast persistent requests

Improved Scheme: Distributed Arbitration [NEW]
[Figure: every L1 and Shared L2 holds a table with entries P0: B, P1: B, P2: B]
  – Processors broadcast persistent requests
  – Fixed priority (processor number)

Improved Scheme: Distributed Arbitration [NEW]
[Figure: every L1 and Shared L2 holds a table with entries P0: B, P1: B, P2: B; step 1 marks the deactivate broadcast]
  – Processors broadcast persistent requests
  – Fixed priority (processor number)
  – Processors broadcast deactivate

Improved Scheme: Distributed Arbitration [NEW]
[Figure: after the deactivate, every table holds only P1: B and P2: B]
  – Bottom line: Handoff is a single message latency
    • Subtle point: P0 and P1 must wait until next "wave"

Implementing Distributed Persistent Requests
• Table at each cache
  – Sized to N entries for each processor (we use N=1)
  – Indexed by processor ID
  – Content-addressable by address
• Each incoming message must access table
  – Not on the critical path; can be a slow CAM
• Activate/deactivate reordering cannot be allowed
  – Persistent request virtual channel must be point-to-point ordered
  – Or, other solution such as sequence numbers or acks

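A minimal sketch of the per-cache table described above, assuming N=1 and that the lowest processor number has the highest priority (the slides say only "fixed priority (processor number)"). The class and method names are hypothetical, and the "wave" rule that delays re-issuing requests is not modeled.

```python
class PersistentRequestTable:
    """Per-cache table of active persistent requests (hypothetical sketch, N=1)."""

    def __init__(self, num_processors):
        # Indexed by processor ID; each slot holds a block address or None.
        self.entries = [None] * num_processors

    def activate(self, proc_id, address):
        """Record the persistent request broadcast by proc_id."""
        self.entries[proc_id] = address

    def deactivate(self, proc_id):
        """Clear proc_id's entry when its deactivate broadcast arrives."""
        self.entries[proc_id] = None

    def active_requester(self, address):
        """Software stand-in for the CAM lookup by address.

        Returns the highest-priority processor with a request for this
        address, or None. Assumption: lowest processor number wins.
        """
        for proc_id, entry in enumerate(self.entries):
            if entry == address:
                return proc_id
        return None
```

Every cache holds the same table contents, so each one can decide locally where tokens for a block should be forwarded. When the current winner deactivates, the next requester is already known everywhere, which is why the handoff costs a single message latency.
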
Implementing Distributed Persistent Requests
• Should reads be distinguished from writes?
  – Not necessary, but
  – Persistent Read request is helpful
• Implications of flat distributed arbitration
  – Simple: flat for correctness
  – Global broadcast when used
    • Fortunately they are rare in typical workloads (0.3%)
    • Bad workload (very high contention) would burn bandwidth
  – Maximum # processors must be architected
• What about a hierarchical persistent request scheme?
  – Possible, but correctness is no longer flat
  – Make the common case fast

Reducing Unnecessary Traffic
• Problem: Which token-holding cache responds with data?
• Solution: Distinguish one token as the owner token
  – The owner includes data with token response
  – Clean vs. dirty owner distinction also useful for writebacks

Outline
• Motivation and Background
• Token Coherence: Flat for Correctness
• Token Coherence: Hierarchical for Performance
  – TokenCMP
  – Another look at performance policies
• Evaluation

Hierarchical for Performance: TokenCMP
• Target System:
  – 2-8 CMPs
  – Private L1s, shared L2 per CMP
  – Any interconnect, but high-bandwidth
• Performance Policy Goals:
  – Aggressively acquire tokens
  – Exploit on-chip locality and bandwidth
  – Respect cache hierarchy
  – Detecting and handling missed tokens

Hierarchical for Performance: TokenCMP
• Approach:
  – On L1 miss, broadcast within own CMP
    • Local cache responds if possible
  – On L2 miss, broadcast to other CMPs
  – Appropriate L2 bank responds or broadcasts within its CMP
    • Optionally filter
  – Responses between CMPs carry extra tokens for future locality
• Handling missed tokens:
  – Timeout after average memory latency
  – Invoke persistent request (no retries)
• Larger systems can use filters, multicast, soft-state directories

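The missed-token handling above can be sketched as a timer that escalates directly from a transient request to a persistent request. The slides do not say how the average memory latency is tracked; the cumulative running mean below, and the `MissTimer` name, are assumptions made for illustration.

```python
class MissTimer:
    """Decides when an outstanding transient request should escalate (hypothetical)."""

    def __init__(self, initial_latency):
        # Assumption: seed the average with an initial latency estimate in cycles.
        self.average = float(initial_latency)
        self.samples = 1

    def record(self, latency):
        """Fold one observed memory latency into the running mean."""
        self.samples += 1
        self.average += (latency - self.average) / self.samples

    def next_action(self, cycles_waiting):
        """No retries: once the timeout passes, issue a persistent request."""
        if cycles_waiting >= self.average:
            return "persistent_request"
        return "wait"
```

For example, a timer seeded with 200 cycles that then observes a 400-cycle access has an average of 300 cycles, so a request that has waited 100 cycles keeps waiting and one that has waited 300 cycles escalates.
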
Other Optimizations in TokenCMP
• Implementing E-state
  – Memory responds with all tokens on read request
  – Use clean/dirty owner distinction to eliminate writing back unwritten data
• Implementing Migratory Sharing
  – What is it?
    • A processor's read request results in exclusive permission if responder has exclusive permission and wrote the block
  – In TokenCMP, simply return all tokens
• Non-speculative delay
  – Hold block for some # cycles so permission isn't stolen prematurely

Another Look at Performance Policies
• How to find tokens?
  – Broadcast w/ filters
  – Multicast (destination-set prediction)
  – Directories (soft or hard)
• Who responds with data?
  – Owner token
    • TokenCMP uses Owner token for Inter-CMP responses
  – Other heuristics
    • For TokenCMP intra-CMP responses, cache responds if it has extra tokens

Transient Requests May Reduce Complexity
• Processor holds the only required state about request
• L2 controller in TokenCMP very simple:
  – Re-broadcasts L1 request message on a miss
  – Re-broadcasts or filters external request messages
  – Possible states:
    • no tokens (I)
    • all tokens (M)
    • some tokens (S)
  – Bounce unexpected tokens to memory
• DirectoryCMP's L2 controller is complex
  – Allocates MSHR on miss and forward
  – Issues invalidates and receives acks
  – Orders all intra-CMP requests and writebacks
  – 57 states in our L2 implementation!

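The three L2 states listed above follow directly from the token count, which a short sketch makes concrete. The function name `l2_state` is hypothetical; the state letters are the ones on the slide.

```python
def l2_state(tokens, total):
    """Map a block's token count at the L2 to the slide's I / M / S states."""
    if not 0 <= tokens <= total:
        raise ValueError("token count must be between 0 and T")
    if tokens == 0:
        return "I"   # no tokens
    if tokens == total:
        return "M"   # all tokens
    return "S"       # some tokens
```

The state is a pure function of one counter, with no transient states for in-flight requests, which is consistent with the contrast the slide draws against the 57-state DirectoryCMP L2 controller.
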
Writebacks
• DirectoryCMP uses "3-phase writebacks"
  – L1 issues writeback request
  – L2 enters transient state or blocks request
  – L2 responds with writeback
  – L1 sends data
• TokenCMP uses "fire-and-forget" writebacks
  – Immediately send tokens and data
  – Heuristic: Only send data if # tokens > 1

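A minimal sketch of a fire-and-forget writeback with the data heuristic above. The message layout and the `make_writeback` name are hypothetical. The `dirty_owner` flag is an assumption that goes beyond this slide: it reflects the earlier clean vs. dirty owner distinction, on the reasoning that a dirty owner must send data because memory may be stale.

```python
def make_writeback(address, tokens, data, dirty_owner=False):
    """Build a one-shot writeback message for an evicted line (hypothetical layout).

    Tokens always travel with the block address so they are never lost;
    no request/ack handshake precedes the message.
    """
    if tokens < 1:
        raise ValueError("a line with no tokens has nothing to write back")
    message = {"address": address, "tokens": tokens}
    # Slide heuristic: only send data if # tokens > 1.
    # Assumption: a dirty owner token always carries data.
    if tokens > 1 or dirty_owner:
        message["data"] = data
    return message
```

A line holding a single non-owner token sends only the token and address, saving data bandwidth, while the message stays a single send rather than the three phases DirectoryCMP needs.
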
Outline
• Motivation and Background
• Token Coherence: Flat for Correctness
• Token Coherence: Hierarchical for Performance
• Evaluation
  – Model checking
  – Performance w/ commercial workloads
  – Robustness

TokenCMP Evaluation
• Simple?
  – Some anecdotal examples and comparisons
  – Model checking
• Fast?
  – Full-system simulation w/ commercial workloads
• Robust?
  – Micro-benchmarks to simulate high contention

Complexity Evaluation with Model Checking
This work performed by Jesse Bingham and Alan Hu of the University of British Columbia
• Methods:
  – TLA+ and TLC
  – DirectoryCMP omits all intra-CMP details
  – TokenCMP's correctness substrate modeled
• Result:
  – Complexity similar between TokenCMP and non-hierarchical DirectoryCMP
  – Correctness Substrate verified to be correct and deadlock-free
  – All possible performance protocols correct

Performance Evaluation
• Target System:
  – 4 CMPs, 4 procs/CMP
  – 2 GHz OoO SPARC, 8 MB shared L2 per chip
  – Directly connected interconnect
• Methods: Multifacet GEMS simulator
  – Simics augmented with timing models
  – Released soon: http://www.cs.wisc.edu/gems
• Benchmarks:
  – Performance: Apache, Spec, OLTP
  – Robustness: Locking uBenchmark

Full-system Simulation: Runtime
[Figure: runtime chart]
  – TokenCMP performs 9-50% faster than DirectoryCMP

Full-system Simulation: Runtime
[Figure: runtime chart; labels: DRAM Directory, Perfect L2]
  – TokenCMP performs 9-50% faster than DirectoryCMP

Full-system Simulation: Inter-CMP Traffic
[Figure: inter-CMP traffic chart]
  – TokenCMP traffic is reasonable (or better)
    • DirectoryCMP control overhead greater than broadcast for small system

Full-system Simulation: Intra-CMP Traffic
[Figure: intra-CMP traffic chart]

Performance Robustness
Locking micro-benchmark (correctness substrate only)
[Figure: chart with horizontal axis running from more contention to less contention]

Performance Robustness
Locking micro-benchmark (correctness substrate only)
[Figure: chart with horizontal axis running from more contention to less contention]

Performance Robustness
Locking micro-benchmark
[Figure: chart with horizontal axis running from more contention to less contention]

Summary
• Microprocessor → Chip Multiprocessor (CMP)
• Symmetric Multiprocessor (SMP) → Multiple CMPs
• Problem: Coherence with Multiple CMPs
• Old Solution: Hierarchical Directory → Complex & Slow
• New Solution: Apply Token Coherence
  – Developed for glueless multiprocessor [2003]
  – Keep: Flat for correctness
  – Exploit: Hierarchical for performance
• Less Complex & Faster than Hierarchical Directory