Cache Coherence in Scalable Machines COE 502 – Parallel Processing Architectures Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals
Generic Scalable Multiprocessor [Figure: nodes P1 … Pn, each with a processor (P), cache ($), Communication Assist (CA), and memory, connected by a scalable interconnection network carrying many parallel transactions] v Scalable distributed memory machine Ø P-C-M nodes connected by a scalable network Ø Scalable memory bandwidth at reasonable latency v Communication Assist (CA) Ø Interprets network transactions, forms an interface Ø Provides a shared address space in hardware Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 2
What Must a Coherent System do? v Provide set of states, transition diagram, and actions v Determine when to invoke coherence protocol Ø Done the same way on all systems ² State of the line is maintained in the cache ² Protocol is invoked if an “access fault” occurs on the line § Read miss, Write miss, Writing to a shared block v Manage coherence protocol 1. Find information about state of block in other caches ² Whether need to communicate with other cached copies 2. Locate the other cached copies 3. Communicate with those copies (invalidate / update) Ø Handled differently by different approaches Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 3
Bus-Based Cache Coherence v Functions 1, 2, and 3 are accomplished through … Ø Broadcast and snooping on bus Ø Processor initiating bus transaction sends out a “search” Ø Others respond and take necessary action v Could be done in a scalable network too Ø Broadcast to all processors, and let them respond v Conceptually simple, but broadcast doesn’t scale Ø On a bus, bus bandwidth doesn’t scale Ø On a scalable network ² Every access fault leads to at least p network transactions v Scalable Cache Coherence needs Ø Different mechanisms to manage protocol Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 4
Directory-Based Cache Coherence [Figure: nodes P1 … Pn, each with a cache ($), Communication Assist (CA), memory, and a per-memory directory, connected by a scalable interconnection network carrying many parallel transactions] v Scalable cache coherence is based on directories v Distributed directories for distributed memories v Each cache-line-sized block of a memory … Ø Has a directory entry that keeps track of … ² State and the nodes that are currently sharing the block Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 5
Simple Directory Scheme v Simple way to organize a directory is to associate … Ø Every memory block with a corresponding directory entry Ø To keep track of copies of cached blocks and their states v On a miss Ø Locate home node and the corresponding directory entry Ø Look it up to find its state and the nodes that have copies Ø Communicate with the nodes that have copies if necessary v On a read miss Ø Directory indicates from which node data may be obtained v On a write miss Ø Directory identifies shared copies to be invalidated/updated v Many alternatives for organizing directory Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 6
Directory Information [Figure: each node's directory holds, per memory block, a dirty bit and presence bits for all nodes] v A simple organization of a directory entry is to have Ø Bit vector of p presence bits for each of the p nodes ² Indicating which nodes are sharing that block Ø One or more state bits per block reflecting memory view ² One dirty bit is used indicating whether block is modified ² If dirty bit is 1 then block is in modified state in only one node Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 7
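A minimal C sketch of such a full-bit-vector directory entry, assuming at most 64 nodes (the type and field names are illustrative, not taken from any particular machine):

#include <stdint.h>
#include <stdbool.h>

#define NUM_NODES 64                 /* assumed machine size                  */

typedef struct {
    uint64_t presence;               /* bit i set => node i has a cached copy */
    bool     dirty;                  /* set => exactly one modified copy      */
} dir_entry_t;                       /* one entry per memory block            */

One such entry exists for every cache-line-sized block of main memory, which is what makes the storage grow as P × M (discussed later under directory storage reduction).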
Definitions v Home Node Ø Node in whose main memory the block is allocated v Dirty Node Ø Node that has copy of block in its cache in modified state v Owner Node Ø Node that currently holds the valid copy of a block Ø Can be either the home node or dirty node v Exclusive Node Ø Node that has block in its cache in an exclusive state Ø Either exclusive clean or exclusive modified v Local or Requesting Node Ø Node that issued the request for the block Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 8
Basic Operation on a Read Miss v Read miss by processor i Ø Requestor sends read request message to home node Ø Assist looks up directory entry; if dirty-bit is OFF ² Reads block from memory ² Sends data reply message containing the block to requestor ² Turns ON Presence[ i ] Ø If dirty-bit is ON ² Requestor sends read request message to dirty node ² Dirty node sends data reply message to requestor ² Dirty node also sends revision message to home node; cache block state is changed to shared ² Home node updates its main memory and directory entry, turns dirty-bit OFF, turns Presence[ i ] ON Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 9
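A hedged sketch (in C, illustrative only) of the home assist's handling of this read miss; dir_entry_t is from the earlier sketch and the message helpers below are assumed, not a real API:

extern void        send_data_reply(int node, const void *data);
extern void        send_owner_identity(int node, int owner);
extern const void *read_memory(int block);
extern int         first_sharer(uint64_t presence);  /* index of the single owner */

void home_handle_read(dir_entry_t *e, int block, int i)
{
    if (!e->dirty) {
        send_data_reply(i, read_memory(block));  /* block supplied from memory */
        e->presence |= 1ULL << i;                /* turn ON Presence[i]        */
    } else {
        /* Valid copy is at the dirty node: tell the requestor who owns it.
           The requestor then asks the owner; the owner's later revision
           message updates memory, clears the dirty bit, and Presence[i]
           is turned ON at that point. */
        send_owner_identity(i, first_sharer(e->presence));
    }
}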
Read Miss to a Block in Dirty State v The numbers 1, 2, and so on show serialization of transactions v Letters on the same number indicate overlapped transactions [Figure: 1. read request to home node (requestor → home); 2. response with owner identity (home → requestor); 3. read request to owner (requestor → dirty node); 4a. data reply (dirty node → requestor); 4b. revision message to home node (dirty node → home)] v Messages 4a and 4b are performed in parallel v Total of 5 transactions Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 10
Basic Operation on a Write Miss v Write miss caused by processor i Ø Requestor sends read exclusive message to home node Ø Assist looks up directory entry, if dirty-bit is OFF ² Home node sends a data reply message to processor i § Message contains data and presence bits identifying all sharers ² Requestor node sends invalidation messages to all sharers ² Sharer nodes invalidate their cached copies and § Send acknowledgement messages to requestor node Ø If dirty-bit is ON ² Home node sends a response message identifying dirty node ² Requestor sends a read exclusive message to dirty node ² Dirty node sends a data reply message to processor i § And changes its cache state to Invalid Ø In both cases, home node clears presence bits ² But turns ON Presence[ i ] and dirty-bit Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 11
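The matching home-side sketch for a write miss (read exclusive), reusing dir_entry_t and the helpers declared above; invalidations and acknowledgements are carried out by the requestor and sharers exactly as listed in the slide:

extern void send_data_and_sharers(int node, const void *data, uint64_t sharers);

void home_handle_read_exclusive(dir_entry_t *e, int block, int i)
{
    if (!e->dirty) {
        /* Reply carries the data plus the presence bits; the requestor
           itself sends the invalidations and collects the acks.         */
        send_data_and_sharers(i, read_memory(block), e->presence);
    } else {
        /* Valid copy is at the dirty node: the requestor re-sends its
           read-exclusive there, and the dirty node replies with the data
           and invalidates its own copy.                                  */
        send_owner_identity(i, first_sharer(e->presence));
    }
    e->presence = 1ULL << i;   /* clear presence bits, turn ON Presence[i] */
    e->dirty    = true;        /* requestor will hold the modified copy    */
}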
Write Miss to a Block with Sharers [Figure: 1. RdEx request to home node (requestor → home); 2. data reply with sharer's identity (home → requestor); 3a, 3b. invalidation requests to sharers (requestor → sharers); 4a, 4b. invalidation acknowledgements (sharers → requestor)] v Protocol is orchestrated by the assists Ø Called coherence or directory controllers Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 12
Data Sharing Patterns v Provide insight into directory requirements v If most misses involve O(P) transactions then Ø Broadcast might be a good solution v However generally, there are few sharers at a write v Important to understand two aspects of data sharing Ø Frequency of shared writes or invalidating writes ² On a write miss or when writing to a block in the shared state ² Called invalidation frequency in invalidation-based protocols Ø Distribution of sharers called the invalidation size distribution v Invalidation size distribution also provides … Ø Insight into how to organize and store directory information Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 13
Cache Invalidation Patterns v Simulation running on 64 processors v Figures show invalidation size distribution v MSI protocol is used [Figure: LU invalidation patterns (% of shared writes versus number of invalidations)] [Figure: Ocean invalidation patterns (% of shared writes versus number of invalidations)] Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 14
Cache Invalidation Patterns – cont’d v Infinite per-processor caches are used Ø To capture inherent sharing patterns [Figure: Barnes-Hut invalidation patterns (% of shared writes versus number of invalidations)] [Figure: Radiosity invalidation patterns (% of shared writes versus number of invalidations)] Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 15
Framework for Sharing Patterns v Shared-Data access patterns can be categorized: Ø Code and read-only data structures are never written ² Not an issue for directories Ø Producer-consumer ² One processor produces data and others consume it ² Invalidation size is determined by number of consumers Ø Migratory data ² Data migrates from one processor to another ² Example: computing a global sum, where sum migrates ² Invalidation size is small (typically 1) even as P scales Ø Irregular read-write ² Example: distributed task-queue (processes probe head ptr) ² Invalidations usually remain small, though frequent Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 16
Sharing Patterns Summary v Generally, few sharers at a write Ø Scales slowly with P v A write may send 0 invalidations in MSI protocol Ø Since block is loaded in shared state Ø This would not happen in MESI protocol v Infinite per-processor caches are used Ø To capture inherent sharing patterns Ø Finite caches send replacement hints on block replacement ² Which turn off presence bits and reduce # of invalidations ² However, traffic will not be reduced v Non-zero frequencies of very large invalidation sizes Ø Due to spinning on a synchronization variable by all Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 17
Alternatives for Directories Directory Schemes Centralized Finding source of directory information Distributed Flat Hierarchical Locating copies Memory-based Cache-based Directory information co-located with memory module in home node for each memory block Caches holding a copy of the memory block form a doubly linked list; Directory holds pointer to head of list Examples: Stanford DASH/FLASH, MIT Alewife, SGI Origin, HAL Examples: IEEE SCI, Sequent NUMA-Q Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 18
Flat Memory-based Schemes v Information about cached (shared) copies is Ø Stored with the block at the home node v Performance Scaling Ø Traffic on a shared write: proportional to number of sharers Ø Latency on shared write: can issue invalidations in parallel v Simplest Representation: one presence bit per node Ø Storage overhead is proportional to P × M ² For M memory blocks in memory, and ignoring state bits Ø Directory storage overhead scales poorly with P ² Given 64-byte cache block size ² For 64 nodes: 64 presence bits / 64 bytes = 12.5% overhead ² For 256 nodes: 256 presence bits / 64 bytes = 50% overhead ² For 1024 nodes: 1024 presence bits / 64 bytes = 200% overhead Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 19
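The percentages above are just the ratio of presence bits to data bits per block; a one-line helper (illustrative) makes the arithmetic explicit:

/* percent overhead of a full bit vector: P presence bits per block
   versus block_bytes * 8 data bits (state bits ignored, as above)   */
double full_vector_overhead(int num_nodes, int block_bytes)
{
    return 100.0 * num_nodes / (block_bytes * 8);
}
/* full_vector_overhead(64, 64)   = 12.5
   full_vector_overhead(256, 64)  = 50.0
   full_vector_overhead(1024, 64) = 200.0 */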
Reducing Directory Storage v Optimizations for full bit vector schemes Ø Increase cache block size ² Reduces storage overhead proportionally Ø Use more than one processor (SMP) per node ² Presence bit is per node, not per processor Ø But still scales as P × M ² Reasonable and simple enough for all but very large machines ² Example: 256 processors, 4-processor nodes, 128-byte lines: Overhead = (256/4) / (128*8) = 64 / 1024 = 6.25% (attractive) v Need to reduce “width” P Ø Addressing the P term v Need to reduce “height” M Ø Addressing the M term Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 20
Directory Storage Reductions v Width observation: Ø Most blocks cached (shared) by only a few nodes Ø Don’t need a bit per node Ø Sharing patterns indicate a few pointers should suffice ² Entry can contain a few (5 or so) pointers to sharing nodes Ø P = 1024 → 10-bit pointers ² 5 pointers need only 50 bits rather than 1024 bits Ø Need an overflow strategy when there are more sharers v Height observation: Ø Number of memory blocks >> Number of cache blocks Ø Most directory entries are useless at any given time Ø Organize directory as a cache ² Rather than having one entry per memory block Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 21
Limited Pointers v Dir_i : only i pointers are used per entry Ø Works in most cases when # sharers ≤ i v Overflow mechanism needed when # sharers > i Ø Overflow bit indicates that number of sharers exceeds i v Overflow methods include Ø Broadcast Ø No-broadcast Ø Coarse Vector Ø Software Overflow Ø Dynamic Pointers [Figure: directory entry with overflow bit = 0 and 2 pointers (P9, P7) in a 16-node system (P0 … P15)] Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 22
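A possible C layout for a Dir_i entry with i = 2, matching the figure above (field types and widths are illustrative):

#define NUM_PTRS 2                        /* i = 2 pointers per entry     */

typedef struct {
    bool     overflow;                    /* set when # sharers exceeds i */
    uint16_t ptr[NUM_PTRS];               /* node IDs of up to i sharers  */
} limited_ptr_entry_t;

For P = 1024 nodes each pointer needs only 10 bits, so even 5 pointers occupy about 50 bits instead of a 1024-bit presence vector.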
Overflow Methods v Broadcast (Dir_i B) Ø Broadcast bit turned on upon overflow Ø Invalidations broadcast to all nodes when block is written ² Regardless of whether or not they were sharing the block Ø Network bandwidth may be wasted for overflow cases Ø Latency increases if processor waits for acknowledgements v No-Broadcast (Dir_i NB) Ø Avoids broadcast by never allowing # of sharers to exceed i Ø When number of sharers is equal to i ² New sharer replaces (invalidates) one of the old ones ² And frees up its pointer in the directory entry Ø Drawback: does not deal with widely shared data Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 23
Coarse Vector Overflow Method v Coarse vector (Dir_i CV_r) Ø Uses i pointers in its initial representation Ø But on overflow, changes representation to a coarse vector Ø Each bit indicates a unique group of r nodes Ø On a write, invalidate all r nodes that a bit corresponds to [Figure: directory entry before overflow (overflow bit = 0, 2 pointers to P9 and P7); after overflow (overflow bit = 1, 8-bit coarse vector with one bit per group of 2 nodes, P0 … P15)] Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 24
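The overflow switch can be pictured as reinterpreting the pointer storage; the sketch below converts the two pointers of the previous example into an 8-bit coarse vector for 16 nodes with groups of r = 2, following the figure (constants are otherwise illustrative):

#define GROUP_SIZE 2                          /* r = 2 nodes per coarse bit */

uint8_t to_coarse_vector(const limited_ptr_entry_t *e)
{
    uint8_t cv = 0;
    for (int k = 0; k < NUM_PTRS; k++)
        cv |= (uint8_t)(1u << (e->ptr[k] / GROUP_SIZE));  /* mark sharer's group */
    return cv;   /* on a later write, every node in each marked group
                    is sent an invalidation                              */
}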
Robustness of Coarse Vector [Figure: normalized invalidations (full bit vector = 100) for Dir4B, Dir4NB, and Dir4CV4 on LocusRoute, Cholesky, and Barnes-Hut] v 64 processors (one per node), 4 pointers per entry v 16-bit coarse vector, 4 processors per group v Normalized to full-bit-vector (100 invalidations) v Conclusion: coarse vector scheme is quite robust Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 25
Software Overflow Schemes v Software (Dir_i SW) Ø On overflow, trap to system software Ø Overflow pointers are saved in local memory Ø Frees directory entry to handle i new sharers in hardware Ø Used by MIT Alewife: 5 pointers, plus one bit for local node Ø But large overhead for interrupt processing ² 84 cycles for 5 invalidations, but 707 cycles for 6 v Dynamic Pointers (Dir_i DP) is a variation of Dir_i SW Ø Each directory entry contains extra pointer to local memory Ø Extra pointer uses a free list in special area of memory Ø Free list is manipulated by hardware, not software Ø Example: Stanford FLASH multiprocessor Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 26
Reducing Directory Height v Sparse Directory: reducing M term in P × M v Observation: cached entries << main memory Ø Most directory entries are unused most of the time Ø Example: 2 MB cache and 512 MB local memory per node => 510 / 512 or 99.6% of directory entries are unused v Organize directory as a cache to save space Ø Dynamically allocate directory entries, as cache lines Ø Allow use of faster SRAMs, instead of slower DRAMs ² Reducing access time to directory information in critical path Ø When an entry is replaced, send invalidations to all sharers Ø Handles references from potentially all processors Ø Essential to be large enough and with enough associativity Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 27
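A sparse directory is managed like any other cache; a hedged sketch of the allocation path (the lookup and eviction helpers are assumptions, not a real interface):

extern dir_entry_t *dir_cache_lookup(uint64_t block_addr);
extern dir_entry_t *dir_cache_allocate(uint64_t block_addr);   /* may evict a victim */
extern void         invalidate_all_sharers(const dir_entry_t *victim);

dir_entry_t *sparse_dir_get(uint64_t block_addr)
{
    dir_entry_t *e = dir_cache_lookup(block_addr);
    if (e == NULL) {
        e = dir_cache_allocate(block_addr);   /* reuse a replaced entry        */
        invalidate_all_sharers(e);            /* victim's sharers invalidated  */
        e->presence = 0;                      /* fresh entry for the new block */
        e->dirty    = false;
    }
    return e;
}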
Alternatives for Directories Directory Schemes Centralized Finding source of directory information Distributed Flat Hierarchical Locating copies Memory-based Cache-based Directory information co-located with memory module in home node for each memory block Caches holding a copy of the memory block form a doubly linked list; Directory holds pointer to head of list Examples: Stanford DASH/FLASH, MIT Alewife, SGI Origin, HAL Examples: IEEE SCI, Sequent NUMA-Q Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 28
Flat Cache-based Schemes v How they work: Ø Home node only holds pointer to rest of directory info Ø Distributed doubly linked list, weaves through caches ² Cache line has two associated pointers ² Points to next and previous nodes with a copy Ø On read, add yourself to head of the list Ø On write, propagate chain of invalidations down the list v Scalable Coherent Interface (SCI) Ø IEEE Standard Ø Doubly linked list [Figure: home node directory points to the head of a doubly linked list of caches, Node A ↔ Node B ↔ Node C] Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 29
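A minimal sketch of the list structure itself (pointer manipulation only; real SCI involves messages and many more states):

#include <stddef.h>

typedef struct sharer {
    struct sharer *next;      /* next cache holding a copy        */
    struct sharer *prev;      /* previous cache holding a copy    */
} sharer_t;

typedef struct {
    sharer_t *head;           /* home directory: head of the list */
} home_entry_t;

/* on a read, the new sharer links itself at the head of the list */
void add_reader(home_entry_t *dir, sharer_t *me)
{
    me->prev = NULL;
    me->next = dir->head;
    if (dir->head != NULL)
        dir->head->prev = me;
    dir->head = me;
}

A write walks the same list from the head, invalidating one sharer at a time, which is why write latency grows with the number of sharers (next slide).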
Scaling Properties (Cache-based) v Traffic on write: proportional to number of sharers v Latency on write: proportional to number of sharers Ø Don’t know identity of next sharer until reach current one Ø Also assist processing at each node along the way Ø Even reads involve more than one assist ² Home and first sharer on list v Storage overhead Ø Quite good scaling along both axes Ø Only one head pointer per directory entry ² Rest is all proportional to cache size v But complex hardware implementation Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 30
Alternatives for Directories Directory Schemes Centralized Finding source of directory information Distributed Flat Hierarchical Locating copies Memory-based Cache-based Directory information co-located with memory module in home node for each memory block Caches holding a copy of the memory block form a doubly linked list Directory holds pointer to head of list Examples: Stanford DASH/FLASH, MIT Alewife, SGI Origin, HAL Examples: IEEE SCI, Sequent NUMA-Q Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 31
Finding Directory Information v Flat schemes Ø Directory distributed with memory Ø Information at home node Ø Location based on address: transaction sent to home v Hierarchical schemes Ø Directory organized as a hierarchical data structure Ø Leaves are processing nodes Ø Internal nodes have only directory information ² Directory entry says whether subtree caches a block Ø To find directory info, send “search” message up to parent ² Routes itself through directory lookups Ø Point-to-point messages between children and parents Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 32
Locating Shared Copies of a Block v Flat Schemes Ø Memory-based schemes ² Information about all copies stored at the home directory ² Examples: Stanford Dash/Flash, MIT Alewife , SGI Origin Ø Cache-based schemes ² Information about copies distributed among copies themselves § Inside caches, each copy points to next to form a linked list ² Scalable Coherent Interface (SCI: IEEE standard) v Hierarchical Schemes Ø Through the directory hierarchy Ø Each directory entry has presence bits ² To parent and children subtrees Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 33
Hierarchical Directories [Figure: a tree whose leaves are processing nodes. A level-1 directory tracks which of its children processing nodes have a copy of the memory block, and also tracks which local memory blocks are cached outside this subtree; inclusion is maintained between the level-1 directory and processor caches. A level-2 directory tracks which of its children level-1 directories have a copy of the memory block, and also tracks which local memory blocks are cached outside this subtree; inclusion is maintained between level-2 and level-1 directories] v Directory is a hierarchical data structure Ø Leaves are processing nodes, internal nodes are just directories Ø Logical hierarchy, not necessarily a physical one ² Can be embedded in a general network topology Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 34
Two-Level Hierarchy v Individual nodes are multiprocessors Ø Example: mesh of SMPs v Coherence across nodes is directory-based Ø Directory keeps track of nodes, not individual processors v Coherence within nodes is snooping or directory Ø Orthogonal, but needs a good interface of functionality v Examples: Ø Convex Exemplar: directory-directory Ø Sequent, Data General, HAL: directory-snoopy Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 35
Examples of Two-level Hierarchies Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 36
Hierarchical Approaches: Snooping v Extend snooping approach: hierarchy of broadcast media Ø Tree of buses or rings Ø Processors are in the bus- or ring-based multiprocessors at the leaves Ø Parents and children connected by two-way snoopy interfaces ² Snoop both buses and propagate relevant transactions Ø Main memory may be centralized at root or distributed among leaves v Handled similarly to bus, but not full broadcast Ø Faulting processor sends out “search” bus transaction on its bus Ø Propagates up and down hierarchy based on snoop results v Problems: Ø High latency: multiple levels, and snoop/lookup at every level Ø Bandwidth bottleneck at root v Not popular today Cache Coherence in Scalable Machines © Muhamed Mudawar, CSE 661 – Spring 2005, KFUPM Slide 37
Advantages of Multiprocessor Nodes v Potential for cost and performance advantages Ø Amortization of node fixed costs over multiple processors Ø Can use commodity SMPs and CMPs Ø Fewer nodes for directory to keep track of Ø Much communication may be contained within node Ø Processors within a node prefetch data for each other Ø Processors can share caches (overlapping of working sets) Ø Benefits depend on sharing pattern (and mapping) ² Good for widely read-shared data ² Good for nearest-neighbor, if properly mapped ² Not so good for all-to-all communication Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 38
Protocol Design Optimizations v Reduce Bandwidth demands Ø By reducing number of protocol transactions per operation v Reduce Latency Ø By reducing network transactions in critical path Ø Overlap network activities or make them faster v Reduce endpoint assist occupancy per transaction Ø Especially when the assists are programmable v Traffic, latency, and occupancy … Ø Should not scale up too quickly with number of nodes Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 39
Reducing Read Miss Latency (L: Local or Requesting node, H: Home node, D: Dirty node) v Strict Request-Response: Ø 1: request (L → H), 2: response (H → L), 3: intervention (L → D), 4a: revise (D → H), 4b: data reply (D → L) Ø 4 transactions in critical path Ø 5 transactions in all v Intervention Forwarding: Ø Home forwards intervention Ø Directed to owner’s cache Ø Home keeps track of requestor Ø 1: request (L → H), 2: intervention (H → D), 3: revise (D → H), 4: data reply (H → L) Ø 4 transactions in all v Reply Forwarding: Ø Owner replies to requestor Ø 1: request (L → H), 2: intervention (H → D), 3a: revise (D → H), 3b: data reply (D → L) Ø 3 transactions in critical path Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 40
Reducing Invalidation Latency v In Cache-Based Protocol v Invalidations sent from Home Ø To all sharer nodes Si Ø s = number of sharers v Strict Request-Response: Ø Home sends an invalidation to each sharer; each sharer acknowledges Home Ø 2s transactions in total Ø 2s transactions in critical path v Invalidation Forwarding: Ø Each sharer forwards invalidation to next sharer Ø s acknowledgements to Home Ø s+1 transactions in critical path v Single acknowledgement: Ø Sent by last sharer Ø s+1 transactions in total Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 41
Correctness v Ensure basics of coherence at state transition level Ø Relevant lines are invalidated, updated, and retrieved Ø Correct state transitions and actions happen v Ensure serialization and ordering constraints Ø Ensure serialization for coherence (on a single location) Ø Ensure atomicity for consistency (on multiple locations) v Avoid deadlocks, livelocks, and starvation v Problems: Ø Multiple copies and multiple paths through network Ø Large latency makes optimizations attractive ² But optimizations complicate correctness Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 42
Serialization for Coherence v Serialization means that writes to a given location Ø Are seen in the same order by all processors v In a bus-based system: Ø Multiple copies, but write serialization is imposed by bus v In a scalable multiprocessor with cache coherence Ø Home node can impose serialization on a given location ² All relevant operations go to the home node first Ø But home node cannot satisfy all requests ² Valid copy can be in a dirty node ² Requests may reach the home node in one order, … § But reach the dirty node in a different order ² Then writes to same location are not seen in same order by all Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 43
Ensuring Serialization v Use additional ‘busy’ or ‘pending’ directory states v Indicate that previous operation is in progress Ø Further operations on same location must be delayed v Ensure serialization using one of the following … Ø Buffer at the home node ² Until previous (in progress) request has completed Ø Buffer at the requestor nodes ² By constructing a distributed list of pending requests Ø NACK and retry ² Negative acknowledgement sent to the requestor ² Request retried later by the requestor’s assist Ø Forward to the dirty node ² Dirty node determines their serialization when block is dirty Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 44
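The busy-state/NACK option above can be pictured as a small admission check at the home (the state names and send_nack helper are illustrative):

#include <stdbool.h>

typedef enum { DIR_UNOWNED, DIR_SHARED, DIR_EXCLUSIVE, DIR_BUSY } dir_state_t;

extern void send_nack(int requestor);

/* Returns true if the request is accepted; otherwise the requestor's
   assist retries later, so nothing is buffered at the home.           */
bool home_admit_request(dir_state_t *state, int requestor)
{
    if (*state == DIR_BUSY) {
        send_nack(requestor);
        return false;
    }
    *state = DIR_BUSY;     /* serialize: hold off new requests until done */
    return true;
}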
Write Atomicity Problem Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 45
Ensuring Write Atomicity v Multiple copies and a distributed interconnect v In Invalidation-based scheme Ø Block can be shared by multiple nodes Ø Owner provides appearance of write atomicity by not allowing read access to the new value until all shared copies are invalidated and invalidations are acknowledged. v Much harder in update schemes! Ø Since the data is sent directly to the sharers and is accessible immediately Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 46
Deadlock, Livelock, & Starvation v Potential source for deadlock Ø A node may receive too many messages Ø If no sufficient buffer space then flow control causes deadlock v Possible solutions for buffer deadlock Ø Provide enough buffer space or use main memory Ø Use NACKs (negative acknowledgments) Ø Provide separate request and response networks with separate buffers v Potential source for Livelock Ø NACKs that cause potential livelock and traffic problems v Potential source for Starvation Ø Unfairness in handling requests Ø Solution is to buffer all requests in FIFO order Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 47
Summary of Directory Organizations v Flat Schemes: v Issue (a): finding source of directory data Ø Go to home, based on address v Issue (b): finding out where the copies are Ø Memory-based: all info is in directory at home Ø Cache-based: home has pointer to distributed linked list v Issue (c): communicating with those copies Ø Memory-based: point-to-point messages (coarser on overflow) ² Can be multicast or overlapped Ø Cache-based: point-to-point linked list traversal to find them v Hierarchical Schemes: Ø All three issues through sending messages up and down tree Ø No single explicit list of sharers Ø Only direct communication is between parents and children Cache Coherence in Scalable Machines © Muhamed Mudawar, COE 502 KFUPM Slide 48
SGI Origin Case Study [Figure: two nodes, each with two processors and 1–4 MB L2 caches sharing a SysAD bus, a Hub connected to an Xbow (I/O) and to main memory (1–4 GB) with its directory, and a Router to the interconnection network] v Each node consists of … Ø Two MIPS R10k processors with 1st and 2nd level caches Ø A fraction of the total main memory and directory Ø A Hub: communication assist or coherence controller SGI Origin © Muhamed Mudawar, COE 502, KFUPM Slide 49
SGI Origin System Overview v Directory is in DRAM and is accessed in parallel v Hub is the communication assist Ø Implements the network interface and coherence protocol Ø Sees all second-level cache misses Ø Connects to Xbow for I/O Ø Connects to memory Ø Connects to router chip v SysAD Bus Ø System Address/Data Ø Shared by 2 processors Ø No snooping coherence Ø Coherence thru directory [Figure: node block diagram with two processors and 1–4 MB L2 caches on a shared SysAD bus, Hub, directory, Xbow, main memory (1–4 GB), and interconnection network] SGI Origin © Muhamed Mudawar, COE 502, KFUPM Slide 50
Origin Network [Figure: hypercube configurations: 4-node, 8-node, 16-node, 32-node, and 64-node with meta-router] v Up to 512 nodes or 1024 processors SGI Origin © Muhamed Mudawar, COE 502, KFUPM Slide 51
Origin Network – cont’d v Interconnection network has a hypercube topology Ø With up to 32 nodes or 64 processors v Fat cube topology for beyond 64 processors Ø Up to 512 nodes or 1024 processors v Each router has six pairs of 1.56 GB/s links Ø Two to nodes, four to other routers Ø Latency: 41 ns pin to pin across a router v Flexible cables up to 3 ft long v Four virtual channels per physical link Ø Request, reply, other two for priority or I/O SGI Origin © Muhamed Mudawar, COE 502, KFUPM Slide 52
Origin Cache and Directory States v Cache states: MESI v Seven directory states Ø Unowned: no cache has a copy, memory copy is valid Ø Exclusive: one cache has block in M or E state ² Memory may or may not be up to date Ø Shared: multiple (can be zero) caches have a shared copy ² Shared copy is clean and memory has a valid copy Ø Three Busy states for three types of pending requests: ² Read, read exclusive (or upgrade), and uncached read ² Indicates home has received a previous request for the block ² Couldn’t satisfy it itself, sent it to another node and is waiting ² Cannot take another request for the block yet Ø Poisoned state, used for efficient page migration SGI Origin © Muhamed Mudawar, COE 502, KFUPM Slide 53
Origin Directory Structure v Flat Memory based: directory information at home v Three directory formats: 1) If exclusive, entry is pointer to specific processor (not node) 2) If shared, bit vector: each bit points to a node, not processor ² Invalidation sent to a Hub is broadcast to both processors ² 16-bit format (32 processors), kept in main memory DRAM ² 64-bit format (128 processors), uses extended directory memory ² 128-byte blocks 3) For larger machines, coarse bit vector ² Each bit corresponds to n/64 nodes (64-bit format) ² Invalidations are sent to all Hubs and processors in a group Ø Machine chooses format dynamically ² When application is confined to 64 nodes or less SGI Origin © Muhamed Mudawar, COE 502, KFUPM Slide 54
Handling a Read Miss v Hub examines address Ø If remote, sends request to home node Ø If local, looks up local directory entry and memory Ø Memory block is fetched in parallel with directory entry ² Called speculative data access ² Directory lookup returns one cycle earlier ² If directory is unowned or shared, it’s a win § Data already obtained by Hub Ø Directory may indicate one of many states v If directory is in Unowned State Ø Goes to Exclusive state (pointer format) Ø Replies with data block to requestor ² Exclusive data reply ² No network transactions if home is local [Figure: 1: read request (L → H), 2: exclusive data reply (H → L); cache state I→E, directory state U→E] SGI Origin © Muhamed Mudawar, COE 502, KFUPM Slide 55
Read Block in Shared / Busy State v If directory is in Shared state: Ø Directory sets presence bit of requesting node Ø Replies with data block to requestor ² Shared data reply ² Strict request-reply ² Block is cached in shared state [Figure: 1: read request (L → H), 2: shared data reply (H → L); cache state I→S, directory state S] v If directory is in a Busy state: Ø Directory not ready to handle request Ø NACK to requestor Ø Asking requestor to try again Ø So as not to buffer request at home Ø NACK is a response that does not carry data [Figure: 1: read request (L → H), 2: NACK (H → L); cache state I, directory state B] SGI Origin © Muhamed Mudawar, COE 502, KFUPM Slide 56
Read Block in Exclusive State v If directory is in Exclusive state (interesting case): Ø If home is not owner, need to get data from owner ² To home and to requestor Ø Uses reply forwarding for lowest latency and traffic Ø Directory state is set to busy-shared state Ø Home optimistically sends a speculative reply to requestor Ø At same time, request is forwarded to exclusive node (owner) Ø Owner sends revision message to home & reply to requestor [Figure: 1: read request (L → H); 2a: speculative reply (H → L); 2b: intervention (H → owner); 3a: shared reply (owner → L); 3b: revision (owner → H); requestor cache I→S, directory E→B→S, owner cache E, M→S] SGI Origin © Muhamed Mudawar, COE 502, KFUPM Slide 57
Read Block in Exclusive State (2) v At the home node: Ø Set directory to busy state to NACK subsequent requests Ø Set requestor presence bit Ø Assume block is clean and send speculative reply v At the owner node: Ø If block is dirty ² Send data reply to requestor and a sharing write back to home ² Both messages carry modified block data Ø If block is clean ² Similar to above, but messages don’t carry data ² Revision message to home is called a downgrade v Home changes state from busy-shared to shared Ø When it receives revision message SGI Origin © Muhamed Mudawar, COE 502, KFUPM Slide 58
Speculative Replies v Why speculative replies? Ø Requestor needs to wait for reply from owner anyway Ø No latency savings Ø Could just get data from owner always v Two reasons when block is in exclusive-clean state 1. L2 cache controller does not reply with data ² When data is in exclusive-clean state in cache ² So there is a need to get data from home (speculative reply) 2. Enables protocol optimization ² No need to send notification messages back to home § When exclusive-clean blocks are replaced in caches ² Home will supply data (speculatively) and then checks SGI Origin © Muhamed Mudawar, COE 502, KFUPM Slide 59
Handling a Write Miss v Request to home can be read exclusive or upgrade Ø Read-exclusive when block is not present in cache ² Request both data and ownership Ø Upgrade when writing a block in shared state ² Request ownership only Ø Copies in other caches must be invalidated ² Except when directory state is Unowned Ø Invalidations must be explicitly acknowledged Ø Home node updates directory and sends the invalidations Ø Includes the requestor identity in the invalidations ² So that acknowledgements are sent back to requestor directly SGI Origin © Muhamed Mudawar, COE 502, KFUPM Slide 60
Write to Block in Unowned State v If Read Exclusive: Ø Directory state is changed to Exclusive Ø Directory entry is set to point to requestor Ø Home node sends a data reply to requestor Ø Requestor cache changes state to Modified [Figure: 1: RdEx request (L → H), 2: exclusive data reply (H → L); cache state I→M, directory state U→E] v If Upgrade: Ø The expected directory state is Shared Ø But Exclusive means block has been invalidated ² In requestor’s cache and directory notified ² Can happen in Origin Protocol § Someone else has upgraded same block Ø NACK reply: request should be retried as Read Exclusive [Figure: 0: another node’s Upgr beats this one; 1: Upgr request (L → H), 2: NACK (H → L); requestor’s copy invalidated S→I, directory S→E] SGI Origin © Muhamed Mudawar, COE 502, KFUPM Slide 61
Write to Block in Shared State v At the home: Ø Set directory state to exclusive Ø Set directory pointer for requestor processor ² Ensures that subsequent requests are forwarded to requestor v Home sends invalidation requests to all sharers Ø Which will acknowledge requestor (Not home) [Figure: 1: RdEx/Upgr request (L → H); 2a: exclusive reply / Upgr ack (H → L); 2b, 2c: invalidation requests (H → S1, S2); 3a, 3b: invalidation acks (S1, S2 → L); requestor I, S→M, directory S→E, sharers S→I] SGI Origin © Muhamed Mudawar, COE 502, KFUPM Slide 62
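For illustration only (reusing the earlier full-bit-vector sketch, which is coarser than Origin's real directory formats), the home's action could look like this; each invalidation carries the requestor's identity so that the acknowledgements bypass the home:

extern void send_invalidation(int sharer, int block, int requestor);
extern void send_exclusive_reply(int requestor, const void *data, int npending);

void origin_write_to_shared(dir_entry_t *e, int block, int req)
{
    int npending = 0;
    for (int n = 0; n < NUM_NODES; n++) {
        if (((e->presence >> n) & 1ULL) && n != req) {
            send_invalidation(n, block, req);   /* ack goes straight to req  */
            npending++;
        }
    }
    /* the exclusive reply (or upgrade ack) tells the requestor how many
       invalidation acks to wait for before closing the write             */
    send_exclusive_reply(req, read_memory(block), npending);
    e->presence = 1ULL << req;                  /* directory points to req   */
    e->dirty    = true;                         /* exclusive at requestor    */
}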
Write to Block in Shared State (2) v If Read Exclusive, home sends Ø Exclusive reply with invalidations pending to requestor ² Reply contains data and number of sharers to be invalidated v If Upgrade, home sends Ø Upgrade Ack (No data) with invalidations pending v At requestor: Ø Wait for all acks to come back before “closing” the operation Ø Subsequent request for block to home ² Forwarded as intervention to requestor Ø For proper serialization ² Requestor does not handle intervention request until § All invalidation acks received for its outstanding request SGI Origin © Muhamed Mudawar, COE 502, KFUPM Slide 63
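On the requestor side, the bookkeeping reduces to a counter set from the reply and decremented per acknowledgement; the write is closed, and queued interventions may be serviced, only when it reaches zero (illustrative sketch):

#include <stdbool.h>

typedef struct {
    int pending_acks;     /* count delivered in the exclusive reply / upgrade ack */
} outstanding_write_t;

/* called for each invalidation acknowledgement; returns true when the
   outstanding write operation can finally be considered complete       */
bool invalidation_ack_received(outstanding_write_t *w)
{
    return --w->pending_acks == 0;
}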
Write to Block in Exclusive State v If Upgrade Ø Expected directory state is Shared Ø But another write has beaten this one Ø Request not valid so NACK response Ø Request should be retried as Read Exclusive [Figure: 0: another node’s RdEx beats this one; 1: Upgr request (L → H), 2: NACK (H → L); requestor’s copy invalidated S→I, directory S→E] v If Read Exclusive Ø Set dir state to busy-exclusive and send speculative reply Ø Send invalidation to owner with identity of requestor [Figure: 1: RdEx request (L → H); 2a: speculative reply (H → L); 2b: invalidation (H → owner); 3a: exclusive reply (owner → L); 3b: revision (owner → H); requestor I→M, directory E→B→E, owner E, M→I] SGI Origin © Muhamed Mudawar, COE 502, KFUPM Slide 64
Write Block in Exclusive State (2) v At owner: Ø Send ownership transfer revision message to home ² No data is sent to home Ø If block is dirty in cache ² Send exclusive reply with data to requestor ² Overrides speculative reply sent by home Ø If block is clean in exclusive state ² Send acknowledgment to requestor ² No data to requestor because got that from speculative reply v Write to a block in Busy state: NACK Ø Must try again to avoid buffering request at home node SGI Origin © Muhamed Mudawar, COE 502, KFUPM Slide 65
Handling Writeback Requests v Directory state cannot be shared or unowned Ø Requestor is the owner and has the block as modified v If a read request had come in to set state to shared Ø Request would have been forwarded to owner, and Ø State would have been Busy v If directory state is Exclusive Ø Directory state is set to unowned Ø Acknowledgment is returned [Figure: 1: write back (L → H), 2: acknowledgment (H → L); cache M→I, directory E→U] SGI Origin © Muhamed Mudawar, COE 502, KFUPM Slide 66
Handling Writeback Requests (2) v If directory state is Busy: interesting race condition Ø Busy because of request from another node Y Ø Intervention has been forwarded to node X doing writeback ² Intervention and writeback have crossed each other Ø Y’s operation has had its effect on directory Ø Cannot drop or NACK writeback (only valid copy) Ø Can’t even retry after Y’s ref completes [Figure: 1: request from Y (Y → H); 2a: write back (X → H); 2b: intervention (H → X); 2c: speculative reply (H → Y); 3a: data reply (H → Y); 3b: write back acknowledgment (H → X); Y I→S,M; X M→I; directory E→B→S,E] SGI Origin © Muhamed Mudawar, COE 502, KFUPM Slide 67
Solution to Writeback Race v Combine the two operations Ø Request coming from node Y and Writeback coming from X v When writeback reaches directory, state changes to Ø Shared if it was busy-shared (Y requested read copy) Ø Exclusive if busy-exclusive (Y requested exclusive copy) v Home forwards the writeback data to the requestor Y Ø Sends writeback acknowledgment to X v When X receives the intervention, it ignores it Ø Since it has an outstanding writeback for the cache block v Y’s operation completes when it gets the reply v X’s writeback completes when it gets the writeback acknowledgment SGI Origin © Muhamed Mudawar, COE 502, KFUPM Slide 68
Replacement of Shared Block v Could send a replacement hint to the directory Ø To clear the presence bit of the sharing node Ø Reduces occurrence in limited-pointer representation v Can eliminate an invalidation Ø Next time the block is written v But does not reduce traffic Ø Have to send a replacement hint Ø Incurs the traffic at a different time Ø If block is not written again then replacement hint is a waste v Origin protocol does not use replacement hints Ø Can have a directory in shared state while no copy exists SGI Origin © Muhamed Mudawar, COE 502, KFUPM Slide 69
Total Transaction Types v Coherent memory transactions Ø 9 request types Ø 6 invalidation/intervention types Ø 39 reply/response types v Non-coherent memory transactions Ø Uncached memory, I/O, and synchronization operations Ø 19 request transaction types Ø 14 reply types Ø NO invalidation/intervention types SGI Origin © Muhamed Mudawar, COE 502, KFUPM Slide 70
Serialization for Coherence v Serialization means that writes to a given location Ø Are seen in the same order by all processors v In a bus-based system: Ø Multiple copies, but write serialization is imposed by bus v In a scalable multiprocessor with cache coherence Ø Home node can impose serialization on a given location ² All relevant operations go to the home node first Ø But home node cannot satisfy all requests ² Valid copy may not be in main memory but in a dirty node ² Requests may reach the home node in one order, … § But reach the dirty node in a different order ² Then writes to same location are not seen in same order by all SGI Origin © Muhamed Mudawar, COE 502, KFUPM Slide 71
Possible Solutions for Serialization v Buffer (FIFO) requests at the home (MIT Alewife) Ø All requests go to home first (good for serialization) Ø Buffer until previous request has completed ² But buffering requests becomes acute at the home (full buffer) ² Let buffer overflow into main memory (MIT Alewife) v Buffer at the requestors (SCI Protocol) Ø Distributed linked list of pending requests Ø Used in cache-based approach v Use busy state and NACK to retry (Origin 2000) Ø Negative acknowledgement sent to requestor Ø Serialization order is for accepted requests (not NACKed) v Forward to owner (Stanford DASH) Ø Serialization determined by home when clean Ø Serialization determined by owner when exclusive SGI Origin © Muhamed Mudawar, COE 502, KFUPM Slide 72
Need Serialization at Home v Example shows need for serialization at home Ø Initial Condition: block A is in modified state in P1’s cache [Figure: transactions among Home (directory E→B→E), P1, and P2 (cache I→M→I)] 1. P2 sends read-exclusive request for A to home node. 2a. Home forwards request to P1 (dirty node). 2b. Home sends speculative reply to P2. 3a. P1 sends data reply to P2 (replaces 2b). 3b. P1 sends “Ownership transfer” revision message to Home. 4. P2, having received its reply, considers write complete. Proceeds, but incurs a replacement of the just dirtied block, causing it to be written back in transaction 4. v Writeback transaction is received at home before “Ownership-transfer” message (3b) and the block is written into memory. Then when the revision message arrives at the home, the directory is made to point to P2 as having the dirty copy. This corrupts coherence. v Origin prevents this from happening using busy states Ø Directory detects write-back from same node as request Ø Write-back is NACKed and must be retried SGI Origin © Muhamed Mudawar, COE 502, KFUPM Slide 72
Need Serialization at Requestor v Having home node determine order is not enough Ø Following example shows need for serialization at requestor [Figure: transactions among Home (directory U→E→B→E), P1 (cache I→E), and P2 (cache I→M)] 1. P1 sends read request to home node for block A. 2. P2 sends read-exclusive request to home for the write of A; home won’t process it until it is done with the read. 3. In response to (1), home sends reply to P1 (and sets directory pointer). Home now thinks read is complete. Unfortunately, the reply does not get to P1 right away. 4a. In response to (2), home sends speculative reply to P2. 4b. In response to (2), home sends invalidation to P1; it reaches P1 before (3) (no point-to-point order). 5. P1 receives and applies invalidate, sends acknowledgment 5a to home and 5b to requestor. Finally, transaction 3 (exclusive data reply) reaches P1: P1 has block in exclusive state, while P2 has it modified. v Origin: serialization is applied to all nodes Ø Any node does not begin a new transaction … ² Until previous transaction on same block is complete SGI Origin © Muhamed Mudawar, COE 502, KFUPM Slide 73
Starvation v NACKs can cause starvation v Possible solutions: Ø Do nothing: starvation shouldn’t happen often (DASH) Ø Random delay between retries Ø Priorities: increase priority of NACKed requests (Origin) SGI Origin © Muhamed Mudawar, COE 502, KFUPM Slide 75
Application Speedups SGI Origin © Muhamed Mudawar, COE 502, KFUPM Slide 76
Summary v In directory protocol Ø Substantial implementation complexity below state diagram Ø Directory versus cache states Ø Transient states Ø Race conditions Ø Conditional actions Ø Speculation v Origin philosophy: Ø Memory-less: a node reacts to incoming events using only local state Ø An operation does not hold shared resources while requesting others SGI Origin © Muhamed Mudawar, COE 502, KFUPM Slide 77