Adaptive Insertion Policies for Managing Shared Caches

Adaptive Insertion Policies for Managing Shared Caches
Aamer Jaleel, William Hasenplaugh, Moinuddin Qureshi, Julien Sebot, Simon Steely Jr., Joel Emer
Intel Corporation, VSSAD (Aamer.Jaleel@intel.com)
International Conference on Parallel Architectures and Compilation Techniques (PACT)

Paper Motivation
[Diagram: cache hierarchies. Single core (SMT): Core 0 with a first-level cache (FLC) and the last-level cache (LLC). Dual core (ST/SMT): Core 0 and Core 1, each with a private FLC, sharing the LLC. Quad core (ST/SMT): Cores 0-3 with private FLCs, mid-level caches (MLCs), and a shared LLC.]
Shared caches are common, and increasingly so as the number of cores grows. More concurrently executing applications means more contention for the shared cache, so achieving high performance requires managing the shared cache efficiently.

Problems with LRU-Managed Shared Caches
[Figure: misses per 1000 instructions (under LRU) for soplex and h264ref versus cache occupancy (0-100%) under LRU replacement in a 2 MB shared cache.]
• Applications that do not benefit from the cache cause destructive cache interference.
• The conventional LRU policy allocates resources based on rate of demand.

Addressing Shared Cache Performance
[Figure: the same soplex / h264ref occupancy plot as on the previous slide.]
• Applications that do not benefit from the cache cause destructive cache interference; the conventional LRU policy allocates resources based on rate of demand.
• Cache Partitioning: reserves cache resources based on application benefit rather than rate of demand. Drawbacks: requires hardware to detect cache benefit, changes the existing cache structure, and does not scale to a large number of applications.
Goal: Eliminate the Drawbacks of Cache Partitioning

Paper Contributions
• Problem: For shared caches, the conventional LRU policy allocates cache resources based on rate of demand rather than benefit.
• Goals: Design a dynamic hardware mechanism that:
1. Provides high performance by allocating cache on a benefit basis
2. Is robust across different concurrently executing applications
3. Scales to a large number of competing applications
4. Requires low design overhead
• Solution: Thread-Aware Dynamic Insertion Policy (TADIP), which improves average throughput by 12-18% for 2-, 4-, 8-, and 16-core systems with two bytes of storage per HW-thread.
TADIP, Unlike Cache Partitioning, DOES NOT Attempt to Reserve Cache Space

Review: Insertion Policies
“Adaptive Insertion Policies for High-Performance Caching”
Moinuddin Qureshi, Aamer Jaleel, Yale Patt, Simon Steely Jr., Joel Emer
Appeared in ISCA'07

Cache Replacement 101 – ISCA'07
Two components of cache replacement:
• Victim Selection: which line to replace for the incoming line? (e.g., LRU, random)
• Insertion Policy: with what priority is the new line placed in the replacement list? (e.g., insert the new line at the MRU position)
Simple changes to the insertion policy can minimize cache thrashing and improve cache performance for memory-intensive workloads.
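To make the two components concrete, here is a minimal C sketch of conventional LRU replacement on one cache set, with victim selection and insertion as separate steps. The 8-way set, the cache_set_t layout, and the recency encoding are illustrative assumptions, not the paper's hardware.

    #include <stdint.h>

    #define WAYS 8

    typedef struct {
        uint64_t tag[WAYS];
        uint8_t  lru[WAYS];   /* recency position: 0 = MRU ... WAYS-1 = LRU */
    } cache_set_t;

    /* Victim selection: evict the line at the LRU position. */
    static int select_victim(const cache_set_t *set) {
        for (int w = 0; w < WAYS; w++)
            if (set->lru[w] == WAYS - 1)
                return w;
        return 0;   /* unreachable if recency values stay consistent */
    }

    /* Conventional insertion: place the incoming line at MRU by aging
     * every line that was more recent than the victim. */
    static void insert_mru(cache_set_t *set, int way, uint64_t tag) {
        for (int w = 0; w < WAYS; w++)
            if (set->lru[w] < set->lru[way])
                set->lru[w]++;
        set->tag[way] = tag;
        set->lru[way] = 0;
    }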

Static Insertion Policies – ISCA'07
• Conventional (MRU Insertion) Policy: choose a victim, promote the incoming line to MRU.
  Reference to 'i' with the conventional LRU policy: MRU [a b c d e f g h] LRU becomes [i a b c d e f g].
• LRU Insertion Policy (LIP): choose a victim, but DO NOT promote the incoming line to MRU. Unless reused, lines stay at the LRU position, so LIP does not age older lines.
  Reference to 'i' with LIP: [a b c d e f g h] becomes [a b c d e f g i].
• Bimodal Insertion Policy (BIP): infrequently insert some misses at MRU, controlled by a bimodal throttle b (we used b ≈ 3%).
  Reference to 'i' with BIP: if( rand() < b ) insert at MRU position, else insert at LRU position.
Applications Prefer Either Conventional LRU or BIP…
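Building on the sketch above, LIP and BIP might look like the following in C. BIMODAL_THROTTLE and the rand()-based coin flip are illustrative stand-ins for the paper's b ≈ 3% throttle; real hardware would use a cheap pseudo-random source instead.

    #include <stdlib.h>

    #define BIMODAL_THROTTLE 0.03   /* b ~= 3%, per the slide */

    /* LIP: leave the incoming line at the LRU position. It is promoted
     * to MRU only if it is actually reused before eviction. */
    static void insert_lip(cache_set_t *set, int way, uint64_t tag) {
        set->tag[way] = tag;
        set->lru[way] = WAYS - 1;   /* the victim was already at LRU */
    }

    /* BIP: mostly LIP, but occasionally insert at MRU so that a new
     * working set can still be established. */
    static void insert_bip(cache_set_t *set, int way, uint64_t tag) {
        if ((double)rand() / RAND_MAX < BIMODAL_THROTTLE)
            insert_mru(set, way, tag);
        else
            insert_lip(set, way, tag);
    }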

Dynamic Insertion Policy (DIP) via “Set-Dueling” – ISCA'07
HW Required: 10 bits + combinational logic
• Set Dueling Monitors (SDMs): dedicated sets that estimate the performance of a predefined policy.
• Divide the cache in three: SDM-LRU (dedicated LRU sets), SDM-BIP (dedicated BIP sets), and follower sets.
• PSEL: an n-bit saturating counter. Misses to SDM-LRU: PSEL++; misses to SDM-BIP: PSEL--.
• Follower sets insertion policy: use LRU if PSEL MSB = 0; use BIP if PSEL MSB = 1.
Based on analytical and empirical studies: 32 sets per SDM, 10-bit PSEL counter.
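A minimal sketch of set dueling in the same C style, assuming a 2048-set cache so that the static SDM selection below yields 32 sets per SDM as on the slide. The classify_set mapping is an assumption; the paper does not prescribe this particular hash.

    #define PSEL_MAX 1023   /* 10-bit saturating counter */

    static unsigned psel = PSEL_MAX / 2;   /* start unbiased */

    typedef enum { SDM_LRU, SDM_BIP, FOLLOWER } set_kind_t;

    /* Statically dedicate a sparse subset of sets to each policy:
     * with 2048 sets, each branch below selects 32 sets. */
    static set_kind_t classify_set(unsigned set_idx) {
        if (set_idx % 64 == 0)  return SDM_LRU;
        if (set_idx % 64 == 33) return SDM_BIP;
        return FOLLOWER;
    }

    /* Called on a fill (after a miss): train PSEL if the set is an
     * SDM, then pick the insertion policy for the incoming line. */
    static void dip_fill(cache_set_t *set, unsigned set_idx,
                         int way, uint64_t tag) {
        set_kind_t kind = classify_set(set_idx);
        if (kind == SDM_LRU && psel < PSEL_MAX) psel++;  /* LRU set missed */
        if (kind == SDM_BIP && psel > 0)        psel--;  /* BIP set missed */

        int use_bip = (kind == SDM_BIP) ||
                      (kind == FOLLOWER && (psel >> 9) != 0); /* MSB(PSEL) */
        if (use_bip) insert_bip(set, way, tag);
        else         insert_mru(set, way, tag);
    }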

Extending DIP to Shared Caches
[Figure: misses per 1000 instructions (under LRU) for soplex and h264ref.]
• DIP uses a single policy (LRU or BIP) for all applications competing for the cache.
• DIP cannot distinguish between apps that benefit from the cache and those that do not.
• Example: soplex + h264ref with a 2 MB cache:
  – DIP learns LRU for both apps.
  – soplex causes destructive interference.
  – It is desirable that only h264ref follow LRU while soplex follows BIP.
Need a Thread-Aware Dynamic Insertion Policy (TADIP)

Thread-Aware Dynamic Insertion Policy (TADIP)
• Assume an N-core CMP running N apps. What is the best insertion policy for each app? (LRU = 0, BIP = 1)
• The insertion policy decision can be thought of as an N-bit binary string: < P0, P1, P2, …, PN-1 >
  – If Px = 1, application x uses BIP; otherwise it uses LRU.
  – e.g., 0000: always use conventional LRU; 1111: always use BIP.
• With an N-bit string there are 2^N possible combinations. How to find the best one?
  – Offline profiling: input-set and system dependent, and impractical for large N.
  – Brute-force search using SDMs: infeasible for large N (16 apps would already mean 2^16 = 65,536 combinations).
Need a PRACTICAL and SCALABLE Implementation of TADIP

Using Set-Dueling as a Practical Approach to TADIP
• It is unnecessary to exhaustively search all 2^N combinations.
• Some bits of the best binary insertion string can be learned independently.
  – Example: always use BIP for applications that create interference.
• Exponential search space → linear search space: learn the best policy (BIP or LRU) for each app in the presence of all other apps.
Use per-application SDMs to decide: in the presence of other apps, does an app cause destructive interference? If so, use BIP for this app; else use the LRU policy.

TADIP Using Set-Dueling Monitors (SDMs)
• Assume a cache shared by 4 applications: APP0, APP1, APP2, APP3.
• In the presence of the other apps, does APP0 doing LRU or BIP improve cache performance? (And likewise for APP1, APP2, APP3.)
• Set-level view of the cache: two SDMs per application plus follower sets, each SDM training its application's miss counter:
  < 0, P1, P2, P3 > and < 1, P1, P2, P3 >  →  misses train PSEL0
  < P0, 0, P2, P3 > and < P0, 1, P2, P3 >  →  misses train PSEL1
  < P0, P1, 0, P3 > and < P0, P1, 1, P3 >  →  misses train PSEL2
  < P0, P1, P2, 0 > and < P0, P1, P2, 1 >  →  misses train PSEL3
  < P0, P1, P2, P3 >  →  Follower Sets
• High-level view: Pc = MSB( PSELc ).

TADIP Using Set-Dueling Monitors (SDMs), continued
• Assume a cache shared by 4 applications: APP0, APP1, APP2, APP3.
• The cache holds an LRU SDM and a BIP SDM for each app (e.g., < 0, P1, P2, P3 > and < 1, P1, P2, P3 > for APP0), plus follower sets < P0, P1, P2, P3 >.
• Per-app PSEL saturating counters: misses to an app's LRU SDM: PSEL++; misses to its BIP SDM: PSEL--.
• Follower sets insertion policy:
  – SDMs of one thread are follower sets of another thread.
  – Let Px = MSB[ PSELx ].
  – Fill decision: < P0, P1, P2, P3 >.
HW Required: (10*T) bits + combinational logic; 32 sets per SDM, 10-bit PSEL. A sketch of this fill logic follows below.
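Putting the pieces together, a TADIP fill decision for T = 4 threads might look like the following C sketch, reusing the helpers above. The SDM placement (slots 2t and 2t+1 of every 64-set region, which again gives 32 sets per SDM in a 2048-set cache) is an illustrative assumption.

    #define T 4   /* threads sharing the cache */

    static unsigned psel_t[T] = { PSEL_MAX / 2, PSEL_MAX / 2,
                                  PSEL_MAX / 2, PSEL_MAX / 2 };

    static void tadip_fill(cache_set_t *set, unsigned set_idx,
                           int thread, int way, uint64_t tag) {
        unsigned slot = set_idx % 64;
        int owner     = (int)(slot / 2);   /* whose SDM this set is, if any */
        int is_sdm    = slot < 2 * T;
        int fixed_bip = (int)(slot % 2);   /* even slot: LRU SDM; odd: BIP */

        if (is_sdm) {
            /* Any miss in an SDM trains the owning thread's PSEL. */
            if (fixed_bip) { if (psel_t[owner] > 0)        psel_t[owner]--; }
            else           { if (psel_t[owner] < PSEL_MAX) psel_t[owner]++; }
            if (thread == owner) {
                /* Inside its own SDMs, a thread's policy is pinned. */
                if (fixed_bip) insert_bip(set, way, tag);
                else           insert_mru(set, way, tag);
                return;
            }
        }
        /* Everywhere else (including other threads' SDMs) this thread
         * is a follower and uses its learned bit P_t = MSB(PSEL_t). */
        if ((psel_t[thread] >> 9) & 1) insert_bip(set, way, tag);
        else                           insert_mru(set, way, tag);
    }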

Summarizing Insertion Policies

Policy       | Insertion Policy Search Space                       | # of SDMs | # of Counters
LRU          | < 0, 0, 0, …, 0 >                                   | 0         | 0
DIP          | < 0, 0, 0, …, 0 > and < 1, 1, 1, …, 1 >             | 2         | 1
Brute Force  | < 0, 0, 0, …, 0 > … < 1, 1, 1, …, 1 >               | 2^N       | 2^N
TADIP        | < P0, P1, P2, …, PN-1 > and Hamming distance of 1   | 2N        | N

TADIP is SCALABLE with Large N

Experimental Setup
• Simulator and Benchmarks:
  – CMP$im, a Pin-based multi-core performance simulator
  – 17 representative SPEC CPU2006 benchmarks
• Baseline Study:
  – 4-core CMP with in-order cores (assuming an L1-hit IPC of 1)
  – Three-level cache hierarchy: 32 KB L1, 256 KB L2, 4 MB L3
  – 15 workload mixes of four different SPEC CPU2006 benchmarks
• Scalability Study:
  – 2-core, 4-core, 8-core, and 16-core systems
  – 50 workload mixes of 2, 4, 8, and 16 different SPEC CPU2006 benchmarks

Case Study: soplex + h264ref Sharing a 2 MB Cache
[Figure: APKI (accesses per 1000 instructions), MPKI (misses per 1000 instructions), cache usage, and % MRU insertions for soplex and h264ref under the baseline LRU policy / DIP versus LRU and BIP.]
TADIP Improves Throughput by 27% over LRU and DIP

TADIP Results – Throughput
[Figure: throughput of DIP and TADIP normalized to LRU across workload mixes; several mixes show no gains from DIP.]
DIP and TADIP are ROBUST and Do Not Degrade Performance over LRU
Making Thread-Aware Decisions is 2x Better than DIP

TADIP Compared to Offline Best Static Policy
• “Best Static” is almost always better because the insertion string with the best IPC is chosen as the “Best Static,” while TADIP optimizes for fewer misses. TADIP could be used to optimize other metrics (e.g., IPC).
• Where TADIP wins, it is due to phase adaptation.
TADIP is Within 85% of the Best Offline-Determined Insertion Policy Decision

TADIP vs. UCP (MICRO'06)
Utility-Based Cache Partitioning (UCP)

Cost Per Thread (bytes):  UCP: 1920  |  TADIP: 2

TADIP Out-Performs UCP Without Requiring Any Cache Partitioning Hardware
Unlike Cache Partitioning Schemes, TADIP Does NOT Reserve Cache Space
TADIP Does Efficient CACHE MANAGEMENT by Changing the Insertion Policy

TADIP Results – Sensitivity to Cache Size
TADIP Provides Performance Equivalent to Doubling the Cache Size

TADIP Results – Scalability
[Figure: throughput normalized to the baseline system.]
TADIP Scales to a Large Number of Concurrently Executing Applications

Summary
• The Problem: For shared caches, the conventional LRU policy allocates cache resources based on rate of demand rather than benefit.
• Solution: Thread-Aware Dynamic Insertion Policy (TADIP)
1. Provides high performance by allocating cache on a benefit basis: up to 94%, 64%, 26%, and 16% performance gains on 2-, 4-, 8-, and 16-core CMPs.
2. Is robust across different workload mixes: does not significantly hurt performance when LRU works well.
3. Scales to a large number of competing applications: evaluated up to 16 cores in our study.
4. Requires low design overhead: < 2 bytes per HW-thread and NO CHANGES to the existing cache structure.

Q&A

Journal of Instruction-Level Parallelism
1st Data Prefetching Championship (DPC-1)
Sponsored by: Intel, JILP, IEEE TC-uARCH
In conjunction with: HPCA-15
Paper & Abstract Due: December 12th, 2008
Notification: January 16th, 2009
Final Version: January 30th, 2009
More information and the prefetch download kit at: http://www.jilp.org/dpc/

TADIP Results – Weighted Speedup
TADIP Provides More Than Two Times the Performance Gain of DIP
TADIP Improves Performance over LRU by 18%

TADIP Results – Fairness Metric
TADIP Improves Fairness

TADIP in the Presence of Prefetching on a 4-core CMP
TADIP Improves Performance Even in the Presence of HW Prefetching

Insertion Policy to Control Cache Occupancy (16 Cores)
[Figure: APKI, MPKI, cache usage, and % MRU insertions for a sixteen-core mix with a 16 MB LLC.]
• Changing the insertion policy directly controls the amount of cache resources provided to an application.
• The figure shows only the TADIP-selected insertion policy for xalancbmk and sphinx3.
• TADIP improves performance by 28%.
Insertion Policy Directly Controls Cache Occupancy

TADIP Using Set-Dueling Monitors (SDMs)
• Assume a cache shared by 2 applications: APP0 and APP1.
• In the presence of the other app, should APP0 do LRU or BIP? Should APP1 do LRU or BIP?
• Set-level view of the cache:
  < 0, P1 > and < 1, P1 >  →  misses train PSEL0
  < P0, 0 > and < P0, 1 >  →  misses train PSEL1
  < P0, P1 >  →  Follower Sets
• 32 sets per SDM; 9-bit PSEL. Pc = MSB( PSELc ).

TADIP Using Set-Dueling Monitors (SDMs), continued
• Assume a cache shared by 2 applications: APP0 and APP1.
• LRU and BIP SDMs for each app: < 0, P1 > and < 1, P1 > for APP0; < P0, 0 > and < P0, 1 > for APP1. The remaining sets are followers: < P0, P1 >.
• PSEL0, PSEL1: per-app PSEL counters. Misses to an app's LRU SDM: PSEL++; misses to its BIP SDM: PSEL--.
• Follower sets insertion policy:
  – SDMs of one thread are follower sets of the other thread.
  – Let Px = MSB[ PSELx ].
  – Fill decision: < P0, P1 >.
HW Required: (9*T) bits + combinational logic; 32 sets per SDM, 9-bit PSEL counter.