The Stanford Directory Architecture for Shared Memory (DASH)

The Stanford Directory Architecture for Shared Memory (DASH)*
Presented by: Michael Bauer
ECE 259/CPS 221, Spring Semester 2008, Dr. Lebeck
* Based on “The Stanford Dash Multiprocessor” in IEEE Computer, March 1992

Outline
1. Motivation
2. High Level System Overview
3. Cache Coherence Protocol
4. Memory Consistency Model: Release Consistency
5. Overcoming Long Latency Operations
6. Software Support
7. Performance Results
8. Conclusion: Where is it now?

Motivation
Goals:
1. Minimal impact on the programming model
2. Cost efficiency
3. Scalability!!!
Design Decisions:
1. Shared address space (no MPI)
2. Parallel architecture instead of the next sequential processor (no clock issues yet!)
3. Hardware-controlled, directory-based cache coherence

High Level System Overview
[Diagram: each cluster contains several processors with caches, plus a share of main memory and a directory; clusters are connected by an interconnect network.]
A shared address space without shared memory??*
* See http://www.uschess.org/beginners/read/ for the meaning of “??”

Cache Coherence Protocol
DASH’s Big Idea: Hierarchical Directory Protocol
- Locate cache blocks using a hierarchy of directories
- Like NUCA, except for directories (NUDA = Non-Uniform Directory Access?)
- Cache blocks in three possible states (sketched in the code below):
  - Dirty (M)
  - Shared (S)
  - Uncached (I)
The hierarchy:
- Processor Level: the processor’s own cache
- Local Cluster Level: other processor caches within the local cluster
- Home Cluster Level: the directory and main memory associated with a given address
- Remote Cluster Level: processor caches in remote clusters
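
To make the directory bookkeeping concrete, here is a minimal C sketch of what a per-block directory entry could look like: a state field plus one presence bit per cluster. The names, field widths, and cluster count are assumptions for illustration, not DASH’s actual hardware layout.

```c
/* Hypothetical per-block directory entry; names, widths, and the cluster
 * count are illustrative assumptions, not DASH's actual hardware layout. */
#include <stdint.h>

#define NUM_CLUSTERS 16                 /* assumed machine size */

typedef enum { UNCACHED, SHARED, DIRTY } dir_state_t;   /* I, S, M */

typedef struct {
    dir_state_t state;                  /* Uncached, Shared, or Dirty   */
    uint16_t    sharers;                /* one presence bit per cluster */
} dir_entry_t;

/* A cluster gains a shared (read-only) copy of the block. */
static void add_sharer(dir_entry_t *e, int cluster) {
    e->sharers |= (uint16_t)(1u << cluster);
    e->state = SHARED;
}

/* Exclusive ownership passes to a single cluster; all other copies are invalid. */
static void make_dirty(dir_entry_t *e, int owner) {
    e->sharers = (uint16_t)(1u << owner);
    e->state = DIRTY;
}
```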

Cache Coherency Example
[Diagram: a requesting cluster, the home cluster, and a remote cluster holding the block, each with processors, caches, memory, and a directory, joined by the interconnect network.]
1. Processor makes a request on the local bus
2. No response, so the directory broadcasts the request on the network
3. Home directory sees the request and sends a message to the remote cluster
4. Remote directory puts the request on its bus
5. Remote processor responds with the data
6. Remote directory forwards the data and updates the home directory
7. Data is delivered and the home directory is updated
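
The same seven steps written out as a rough C sketch. Everything here (the message names, the send_msg stub, the cluster numbering) is invented for illustration; it is not a real DASH interface.

```c
/* Rough sketch of the remote read-miss path above; message types and the
 * send_msg helper are invented for illustration, not real DASH interfaces. */
#include <stdio.h>

typedef enum { READ_REQ, FORWARD_REQ, DATA_REPLY, SHARING_UPDATE } msg_t;

/* Stub: "send" a protocol message to a cluster by printing it. */
static void send_msg(int cluster, msg_t m, long addr) {
    printf("cluster %d <- msg %d for block 0x%lx\n", cluster, m, addr);
}

/* Read miss on a block whose home directory records it dirty in a remote
 * cluster (the case walked through on the slide). */
static void remote_read_miss(int local, int home, int remote, long addr) {
    /* 1-2. The request misses on the local bus, so it goes onto the
     *      network toward the home cluster. */
    send_msg(home, READ_REQ, addr);
    /* 3. The home directory sees the block is dirty in a remote cluster
     *    and forwards the request there. */
    send_msg(remote, FORWARD_REQ, addr);
    /* 4-5. The remote directory puts the request on its bus and the
     *      owning processor supplies the dirty data. */
    /* 6. The remote directory forwards the data to the requester and
     *    tells the home directory the block is now shared. */
    send_msg(local, DATA_REPLY, addr);
    send_msg(home, SHARING_UPDATE, addr);
    /* 7. The data arrives at the requesting cluster; both directories
     *    are now up to date. */
}

int main(void) {
    remote_read_miss(/*local=*/0, /*home=*/1, /*remote=*/2, 0x1000);
    return 0;
}
```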

Implications of Cache Coherence Protocol
- What do hierarchical directories get us?
  - Very fast access on the local cluster
  - Moderately fast access to the home cluster
  - Minimized data movement (assumed temporal and spatial locality?)
- What problems still exist?
  - Broadcast in some circumstances can be a bottleneck to scalability
  - Complexity of the cache and directory controllers; hiding latency requires many outstanding requests -> power-hungry CAMs
  - Potential for long-latency events as shown in the example (more on this later)

Memory Consistency Model: Release Consistency
Review*:
1. W->R reordering allowed (to different blocks only)
2. W->W reordering allowed (to different blocks only)
3. R->W (to different blocks only) and R->R reordering allowed
Why Release Consistency?
1. Provides an acceptable programming model
2. Reordering events is essential for performance on a variable-latency system
3. Relaxed requirements for the interconnect network: no need for in-order delivery of messages
* Taken from “Shared Memory Consistency Models: A Tutorial”; we’ll read this later
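
DASH’s own synchronization operations are not shown here, but the flavor of release consistency can be sketched with modern C11 acquire/release atomics: ordinary accesses between synchronization points may be reordered among themselves, and only the release/acquire pair imposes ordering. This is an analogy, not DASH’s actual instruction set.

```c
/* A release-consistency-flavored sketch in C11 atomics (an analogy, not
 * DASH's actual synchronization primitives). */
#include <stdatomic.h>

int payload_a, payload_b;        /* ordinary data accesses              */
atomic_int flag = 0;             /* synchronization ("special") access  */

void producer(void) {
    payload_a = 1;               /* these two stores may complete in    */
    payload_b = 2;               /* either order relative to each other */
    atomic_store_explicit(&flag, 1, memory_order_release);   /* release */
}

void consumer(void) {
    /* acquire: spin until the producer's release is visible */
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;
    /* both payloads are now guaranteed visible, even though the two
     * reads below may themselves be reordered with each other */
    int a = payload_a;
    int b = payload_b;
    (void)a; (void)b;
}
```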

Overcoming Long Latency Operations
Prefetching:
- How is this beneficial to execution?
- What can go wrong with prefetching?
- Does this scale?
Update and Deliver Operations:
- What if we know data is going to be needed by many threads?
- Tell the system to broadcast the data to everyone using the Update-Write operation
- Does this scale well?
- What about embarrassingly parallel applications?
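
A loose modern analogue of software prefetching is the GCC/Clang __builtin_prefetch intrinsic, sketched below. The lookahead distance PREFETCH_AHEAD is an assumed tuning parameter, and this is an illustration rather than actual DASH prefetch code.

```c
/* Software prefetch sketch using the GCC/Clang __builtin_prefetch
 * intrinsic; PREFETCH_AHEAD is an assumed tuning parameter, and this is
 * a modern analogue rather than DASH code. */
#define PREFETCH_AHEAD 8   /* how many iterations ahead to fetch */

double sum_array(const double *a, long n) {
    double sum = 0.0;
    for (long i = 0; i < n; i++) {
        /* Issue a non-binding read prefetch for a future element so the
         * long-latency (possibly remote) fetch overlaps with useful work. */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0 /*read*/, 1 /*low locality*/);
        sum += a[i];
    }
    return sum;
}
```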

Software Support
- Parallel version of the Unix OS
- Handle prefetching in software (will this scale?)
- Parallelizing compiler (how well do you think this works?)
- Parallel language Jade (how easy is it to rewrite applications?)

Performance Results
Do these look like they scale well? What is going on here?!?

Conclusion: Where is it now?
- Novel architecture and cache coherence protocol
- Some level of scalability for diverse applications
- Why don’t we see DASH everywhere?
  - Parallel architectures not cost-effective for general-purpose computing until recently
  - Requires adaptation of sequential code to a parallel architecture
  - Power?
  - Any other reasons?
- For anyone interested: DASH -> FLASH -> SGI Origin (Server)
  http://www-flash.stanford.edu/architecture/papers/ISCA94/
  http://www.futuretech.blinkenlights.nl/origin/isca.pdf