18-742 Spring 2011 Parallel Computer Architecture
Lecture 25: Shared Resource Management
Prof. Onur Mutlu, Carnegie Mellon University

Announcements

- Schedule for the rest of the semester
  - April 20: Milestone II presentations
    - Same format as last time
  - April 27: Oral Exam
    - 30 minutes/person; in my office; closed book/notes
    - All content covered could be part of the exam
  - May 6: Project poster session
    - HH 1112, 2-6 pm
  - May 10: Project report due

Reviews and Reading List

- Due Today (April 11), before class
  - Kung, "Why Systolic Architectures?" IEEE Computer 1982.
- Upcoming Topics (we will not cover all of them)
  - Shared Resource Management
  - Memory Consistency
  - Synchronization
  - Main Memory
  - Architectural Support for Debugging
  - Parallel Architecture Case Studies

Readings: Shared Cache Management

- Required
  - Qureshi and Patt, "Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches," MICRO 2006.
  - Kim et al., "Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture," PACT 2004.
  - Qureshi, "Adaptive Spill-Receive for Robust High-Performance Caching in CMPs," HPCA 2009.
  - Hardavellas et al., "Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches," ISCA 2009.
- Recommended
  - Kim et al., "An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches," ASPLOS 2002.
  - Qureshi et al., "Adaptive Insertion Policies for High-Performance Caching," ISCA 2007.
  - Lin et al., "Gaining Insights into Multi-Core Cache Partitioning: Bridging the Gap between Simulation and Real Systems," HPCA 2008.

Readings: Shared Main Memory

- Required
  - Mutlu and Moscibroda, "Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors," MICRO 2007.
  - Mutlu and Moscibroda, "Parallelism-Aware Batch Scheduling: Enabling High-Performance and Fair Memory Controllers," ISCA 2008.
  - Kim et al., "ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers," HPCA 2010.
  - Kim et al., "Thread Cluster Memory Scheduling," MICRO 2010.
  - Ebrahimi et al., "Fairness via Source Throttling: A Configurable and High-Performance Fairness Substrate for Multi-Core Memory Systems," ASPLOS 2010.
- Recommended
  - Lee et al., "Prefetch-Aware DRAM Controllers," MICRO 2008.
  - Rixner et al., "Memory Access Scheduling," ISCA 2000.
  - Zheng et al., "Mini-Rank: Adaptive DRAM Architecture for Improving Memory Power Efficiency," MICRO 2008.

Resource Sharing Concept

- Idea: Instead of dedicating a hardware resource to a hardware context, allow multiple contexts to use it
  - Example resources: functional units, pipeline, caches, buses, memory
- Why?
  + Resource sharing improves utilization/efficiency → throughput
    - When a resource is left idle by one thread, another thread can use it; no need to replicate shared data
    - As we saw with (simultaneous) multithreading
  + Reduces communication latency
    - For example, shared data kept in the same cache in SMT
  + Compatible with the shared memory model

Resource Sharing Disadvantages

- Resource sharing results in contention for resources
  - When the resource is not idle, another thread cannot use it
  - If space is occupied by one thread, another thread needs to re-occupy it
- Sometimes reduces each or some thread's performance
  - Thread performance can be worse than when it is run alone
- Eliminates performance isolation → inconsistent performance across runs
  - Thread performance depends on co-executing threads
- Uncontrolled (free-for-all) sharing degrades QoS
  - Causes unfairness, starvation

Need for QoS and Shared Resource Mgmt.

- Why is unpredictable performance (or lack of QoS) bad?
- Makes programmer's life difficult
  - An optimized program can get low performance (and performance varies widely depending on co-runners)
- Causes discomfort to user
  - An important program can starve
  - Examples from shared software resources
- Makes system management difficult
  - How do we enforce a Service Level Agreement when hardware resource sharing is uncontrollable?

Resource Sharing vs. Partitioning

- Sharing improves throughput
  - Better utilization of space
- Partitioning provides performance isolation (predictable performance)
  - Dedicated space
- Can we get the benefits of both?
- Idea: Design shared resources in a controllable/partitionable way

Shared Hardware Resources

- Memory subsystem (in both MT and CMP)
  - Non-private caches
  - Interconnects
  - Memory controllers, buses, banks
- I/O subsystem (in both MT and CMP)
  - I/O, DMA controllers
  - Ethernet controllers
- Processor (in MT)
  - Pipeline resources
  - L1 caches

Resource Sharing Issues and Related Metrics

- System performance
- Fairness
- Per-application performance (QoS)
- Power
- Energy
- System cost
- Lifetime
- Reliability, effect of faults
- Security, information leakage

Partitioning: Isolation between apps/threads
Sharing (free-for-all): No isolation

Main Memory in the System

[Figure: four cores (CORE 0-3), each with a private L2 cache, sharing an L3 cache that connects through the DRAM interface and DRAM memory controller to the DRAM banks]

Modern Memory Systems (Multi-Core)

[Figure]

Memory System is the Major Shared Resource

- Threads' requests interfere

Multi-core Issues in Caching

- How does the cache hierarchy change in a multi-core system?
- Private cache: Cache belongs to one core (a shared block can be in multiple caches)
- Shared cache: Cache is shared by multiple cores

[Figure: four cores (CORE 0-3), each with a private L2 cache, vs. four cores sharing one L2 cache, both above a DRAM memory controller]

Shared Caches Between Cores

- Advantages:
  - High effective capacity
  - Dynamic partitioning of available cache space
    - No fragmentation due to static partitioning
  - Easier to maintain coherence (a cache block is in a single location)
  - Shared data and locks do not ping-pong between caches
- Disadvantages:
  - Slower access
  - Cores incur conflict misses due to other cores' accesses
    - Misses due to inter-core interference
    - Some cores can destroy the hit rate of other cores
  - Guaranteeing a minimum level of service (or fairness) to each core is harder (how much space, how much bandwidth?)

Shared Caches: How to Share?

- Free-for-all sharing
  - Placement/replacement policies are the same as in a single-core system (usually LRU or pseudo-LRU)
  - Not thread/application aware
  - An incoming block evicts a block regardless of which threads the blocks belong to
- Problems
  - A cache-unfriendly application can destroy the performance of a cache-friendly application
  - Not all applications benefit equally from the same amount of cache: free-for-all might prioritize those that do not benefit
  - Reduced performance, reduced fairness

Controlled Cache Sharing

- Utility based cache partitioning
  - Qureshi and Patt, "Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches," MICRO 2006.
  - Suh et al., "A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning," HPCA 2002.
- Fair cache partitioning
  - Kim et al., "Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture," PACT 2004.
- Shared/private mixed cache mechanisms
  - Qureshi, "Adaptive Spill-Receive for Robust High-Performance Caching in CMPs," HPCA 2009.
  - Hardavellas et al., "Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches," ISCA 2009.

Utility Based Shared Cache Partitioning

- Goal: Maximize system throughput
- Observation: Not all threads/applications benefit equally from caching → simple LRU replacement is not good for system throughput
- Idea: Allocate more cache space to applications that obtain the most benefit from more space
- The high-level idea can be applied to other shared resources as well
- Qureshi and Patt, "Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches," MICRO 2006.
- Suh et al., "A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning," HPCA 2002.

Marginal Utility of a Cache Way

- Utility U_a^b = Misses with a ways − Misses with b ways

[Figure: misses per 1000 instructions vs. number of ways from a 16-way 1 MB L2, showing low-utility, high-utility, and saturating-utility applications]
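To make the definition concrete, here is a minimal Python sketch of computing the utility of extra ways from an application's miss curve; the miss numbers are invented for illustration and are not from the paper.

```python
# Utility of growing an allocation from a to b ways, per the slide's definition:
#   U_a^b = Misses(a ways) - Misses(b ways)
# The miss curve below is hypothetical, purely for illustration.

misses = {0: 100, 1: 80, 2: 65, 4: 50, 8: 45, 16: 44}  # ways -> misses (e.g., MPKI)

def utility(a, b, miss_curve):
    """Miss reduction obtained by growing an allocation from a to b ways (b > a)."""
    return miss_curve[a] - miss_curve[b]

print(utility(1, 2, misses))   # high-utility region: 15 fewer misses
print(utility(8, 16, misses))  # saturating region: only 1 fewer miss
```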

Utility Based Shared Cache Partitioning: Motivation

- Improve performance by giving more cache to the application that benefits more from cache

[Figure: misses per 1000 instructions (MPKI) vs. number of ways from a 16-way 1 MB L2 for equake and vpr, comparing the LRU allocation with the utility-based (UTIL) allocation]

Utility Based Cache Partitioning (III)

Three components:
- Utility Monitors (UMON) per core
- Partitioning Algorithm (PA)
- Replacement support to enforce partitions

[Figure: two cores, each with I$ and D$ and a UMON, share an L2 cache backed by main memory; the partitioning algorithm (PA) takes input from both UMONs]

Utility Monitors

- For each core, simulate the LRU policy using an auxiliary tag directory (ATD)
- Hit counters in the ATD count hits per recency position
- LRU is a stack algorithm: hit counts → utility
  - E.g., hits(2 ways) = H0 + H1

[Figure: main tag directory (MTD) and ATD over sets A-H, with hit counters H0 (MRU) ... H15 (LRU)]
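A small Python sketch of how the stack property lets the UMON's recency-position hit counters answer "how many hits would this core get with k ways?"; the counter values are hypothetical.

```python
# LRU is a stack algorithm: a hit at recency position i (0 = MRU, 15 = LRU)
# would also be a hit in any allocation with more than i ways, so
#   hits(k ways) = H0 + H1 + ... + H(k-1).

H = [40, 25, 12, 8, 5, 3, 2, 1] + [0] * 8   # hypothetical counters for a 16-way ATD

def hits_with_ways(k, counters):
    """Hits this core would see if given k ways of the shared cache."""
    return sum(counters[:k])

print(hits_with_ways(2, H))   # H0 + H1, as in the slide's hits(2 ways) example
print(hits_with_ways(16, H))  # hits with the entire cache
```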

Dynamic Set Sampling

- Extra tags incur hardware and power overhead
- Dynamic Set Sampling (DSS) reduces the overhead [Qureshi+ ISCA'06]
  - 32 sets are sufficient (analytical bounds)
  - Storage < 2 kB per UMON

[Figure: the MTD keeps all sets; the ATD samples only a few sets (the UMON DSS sets), with hit counters H0 (MRU) ... H15 (LRU)]

Partitioning Algorithm

- Evaluate all possible partitions and select the best
- With a ways to core 1 and (16−a) ways to core 2:
  - Hits_core1 = (H0 + H1 + ... + H(a−1)) from UMON 1
  - Hits_core2 = (H0 + H1 + ... + H(16−a−1)) from UMON 2
- Select the a that maximizes (Hits_core1 + Hits_core2)
- Partitioning is done once every 5 million cycles
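A minimal sketch of this exhaustive two-core search, assuming each UMON exposes its hit counters as a list; the names and counter values are hypothetical.

```python
# Try every split of a 16-way cache between two cores and keep the one that
# maximizes total hits, using each core's UMON counters via the stack property.

def best_two_core_partition(umon1, umon2, total_ways=16):
    best_a, best_hits = 0, -1
    for a in range(total_ways + 1):                      # a ways to core 1
        hits = sum(umon1[:a]) + sum(umon2[:total_ways - a])
        if hits > best_hits:
            best_a, best_hits = a, hits
    return best_a, total_ways - best_a

# Hypothetical counters: core 1 saturates quickly, core 2 keeps benefiting.
umon1 = [50, 20, 5, 1] + [0] * 12
umon2 = [10] * 16
print(best_two_core_partition(umon1, umon2))  # -> (2, 14) for these counters
```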

Way Partitioning

Way partitioning support [Suh+ HPCA'02, Iyer ICS'04]:
1. Each line has core-id bits
2. On a miss, count ways_occupied in the set by the miss-causing app
   - If ways_occupied < ways_given: victim is the LRU line of some other app
   - Otherwise: victim is the LRU line of the miss-causing app
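A sketch of that victim-selection rule in Python; the set representation (core-id plus an LRU age per line) and the example values are assumptions for illustration.

```python
# Way-partitioning victim selection on a miss by `app`:
# if the app currently occupies fewer ways than it was given, evict another
# app's LRU line; otherwise evict the app's own LRU line.
# Each line is (core_id, lru_age), where a larger age means closer to LRU.

def choose_victim(cache_set, app, ways_given):
    ways_occupied = sum(1 for core_id, _ in cache_set if core_id == app)
    if ways_occupied < ways_given[app]:
        candidates = [line for line in cache_set if line[0] != app]  # other apps' lines
    else:
        candidates = [line for line in cache_set if line[0] == app]  # own lines
    return max(candidates, key=lambda line: line[1])                 # LRU among candidates

# Hypothetical 4-way set shared by apps 0 and 1; app 0 is given 3 ways.
cache_set = [(0, 2), (1, 0), (1, 3), (0, 1)]
print(choose_victim(cache_set, app=0, ways_given={0: 3, 1: 1}))  # evicts app 1's LRU line
```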

Performance Metrics

Three metrics for performance:
1. Weighted Speedup (default metric)
   - perf = IPC1/SingleIPC1 + IPC2/SingleIPC2
   - Correlates with reduction in execution time
2. Throughput
   - perf = IPC1 + IPC2
   - Can be unfair to a low-IPC application
3. Hmean-fairness
   - perf = hmean(IPC1/SingleIPC1, IPC2/SingleIPC2)
   - Balances fairness and performance
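The three metrics written out as small Python functions; the IPC values below are placeholders.

```python
from statistics import harmonic_mean

def weighted_speedup(ipcs, alone_ipcs):
    # Sum of IPCs normalized to each app's alone-run IPC (the slide's default metric).
    return sum(ipc / alone for ipc, alone in zip(ipcs, alone_ipcs))

def throughput(ipcs):
    # Raw IPC sum; can be unfair to low-IPC applications.
    return sum(ipcs)

def hmean_fairness(ipcs, alone_ipcs):
    # Harmonic mean of normalized IPCs; balances fairness and performance.
    return harmonic_mean([ipc / alone for ipc, alone in zip(ipcs, alone_ipcs)])

ipcs, alone = [0.8, 1.5], [1.0, 2.0]   # placeholder numbers
print(weighted_speedup(ipcs, alone), throughput(ipcs), hmean_fairness(ipcs, alone))
```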

Weighted Speedup Results for UCP

[Figure]

IPC Results for UCP

- UCP improves average throughput by 17%

[Figure]

Any Problems with UCP So Far?

- Scalability
- Non-convex curves?

- Time complexity of partitioning is low for two cores (number of possible partitions ≈ number of ways)
- Possible partitions increase exponentially with the number of cores
- For a 32-way cache, possible partitions:
  - 4 cores → 6545
  - 8 cores → 15.4 million
- Problem is NP-hard → need a scalable partitioning algorithm
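The partition counts above can be reproduced by counting allocations of W ways among N cores, allowing a core zero ways (a stars-and-bars count, C(W+N−1, N−1)); a quick Python check under that assumption:

```python
from math import comb

def num_partitions(ways, cores):
    # Number of ways to split `ways` identical cache ways among `cores` cores,
    # allowing zero ways per core (stars and bars): C(ways + cores - 1, cores - 1).
    return comb(ways + cores - 1, cores - 1)

print(num_partitions(32, 2))   # 33 -> roughly "number of ways" for two cores
print(num_partitions(32, 4))   # 6545, as on the slide
print(num_partitions(32, 8))   # 15,380,937 ~= 15.4 million, as on the slide
```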

Greedy Algorithm [Stone+ ToC '92]

- GA allocates 1 block to the app that has the maximum utility for one block; repeat till all blocks are allocated
- Optimal partitioning when utility curves are convex
- Pathological behavior for non-convex curves
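A minimal sketch of the greedy allocator: at each step, give one block to whichever app saves the most misses from that single extra block. The miss curves are invented; with convex curves like these, greedy is optimal.

```python
def greedy_partition(miss_curves, total_blocks):
    """miss_curves[app][k] = misses when app holds k blocks (k = 0..total_blocks)."""
    alloc = {app: 0 for app in miss_curves}
    for _ in range(total_blocks):
        # Utility of one more block for each app, given its current allocation.
        gain = {app: curve[alloc[app]] - curve[alloc[app] + 1]
                for app, curve in miss_curves.items()}
        winner = max(gain, key=gain.get)
        alloc[winner] += 1
    return alloc

# Hypothetical convex miss curves, purely for illustration.
curves = {"A": [100, 60, 40, 30, 25, 22, 21, 20, 20],
          "B": [100, 90, 85, 82, 80, 79, 78, 78, 78]}
print(greedy_partition(curves, total_blocks=8))
```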

Problem with Greedy Algorithm

- In each iteration, the utility of 1 more block: U(A) = 10 misses, U(B) = 0 misses
- All blocks are assigned to A, even if B has the same miss reduction with fewer blocks
- Problem: GA considers the benefit only of the immediate block, so it fails to exploit large gains from looking ahead

[Figure: misses vs. blocks assigned for apps A and B]

Lookahead Algorithm

- Marginal Utility (MU) = utility per unit of cache resource
  - MU_a^b = U_a^b / (b − a)
- GA considers the MU of 1 block; LA considers the MU of all possible allocations
- Select the app that has the maximum MU; allocate it as many blocks as required to get that maximum MU
- Repeat till all blocks are assigned
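A sketch of the lookahead allocator in Python: each round, compute MU_a^b over every feasible extension for every app and grant the winning app its whole extension. The miss curves below are invented to mirror the non-convex A/B example on the following slide.

```python
def lookahead_partition(miss_curves, total_blocks):
    """miss_curves[app][k] = misses when app holds k blocks (k = 0..total_blocks)."""
    alloc = {app: 0 for app in miss_curves}
    remaining = total_blocks
    while remaining > 0:
        best = None                                   # (marginal utility, app, extra blocks)
        for app, curve in miss_curves.items():
            a = alloc[app]
            for b in range(a + 1, a + remaining + 1):
                mu = (curve[a] - curve[b]) / (b - a)  # MU_a^b = U_a^b / (b - a)
                if best is None or mu > best[0]:
                    best = (mu, app, b - a)
        _, app, extra = best
        alloc[app] += extra
        remaining -= extra
    return alloc

# A saves 10 misses per block; B saves nothing until it has 3 blocks, then 80 at once.
curves = {"A": [50, 40, 30, 20, 10, 0, 0, 0, 0],
          "B": [80, 80, 80, 0, 0, 0, 0, 0, 0]}
print(lookahead_partition(curves, total_blocks=8))  # {'A': 5, 'B': 3}, the optimal split
```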

Lookahead Algorithm Example

- Iteration 1: MU(A) = 10 misses/1 block, MU(B) = 80 misses/3 blocks → B gets 3 blocks
- Next five iterations: MU(A) = 10 misses/1 block, MU(B) = 0 → A gets 1 block each iteration
- Result: A gets 5 blocks and B gets 3 blocks (optimal)
- Time complexity ≈ ways²/2 (512 operations for 32 ways)

[Figure: misses vs. blocks assigned for apps A and B]

UCP Results

- Four cores sharing a 2 MB 32-way L2
- Schemes compared: LRU, UCP (Greedy), UCP (Lookahead), UCP (EvalAll)
- Workload mixes: Mix1 (gap-applu-apsi-gzp), Mix2 (swm-glg-mesa-prl), Mix3 (mcf-applu-art-vrtx), Mix4 (mcf-art-eqk-wupw)
- LA performs similar to EvalAll, with low time complexity

[Figure: performance of the four mixes under each scheme]

Utility Based Cache Partitioning

- Advantages over LRU
  + Improves system throughput
  + Better utilizes the shared cache
- Disadvantages
  - Fairness, QoS?
- Limitations
  - Scalability: partitioning is limited to ways. What if you have numWays < numApps?
  - Scalability: how is utility computed in a distributed cache?
  - What if past behavior is not a good predictor of utility?