CRUISE: Cache Replacement and Utility-Aware Scheduling (Aamer Jaleel)


CRUISE: Cache Replacement and Utility-Aware Scheduling
Aamer Jaleel, Hashem H. Najaf-abadi, Samantika Subramaniam, Simon Steely Jr., Joel Emer
Intel Corporation, VSSAD (Aamer.Jaleel@intel.com)
Architectural Support for Programming Languages and Operating Systems (ASPLOS 2012)

Motivation

• A shared last-level cache (LLC) is common with an increasing number of cores
• More concurrent applications mean more contention for the shared cache
[Diagram: single-core (SMT), dual-core (ST/SMT), and quad-core (ST/SMT) systems; each core has private L1 (and L2) caches in front of a shared LLC]

Problems with LRU-Managed Shared Caches

• Applications that have no cache benefit cause destructive cache interference (e.g., soplex vs. h264ref)
• The conventional LRU policy allocates resources based on rate of demand
[Figure: misses per 1000 instructions under LRU for h264ref and soplex, and cache occupancy under LRU replacement in a 2 MB shared cache]
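The demand-driven allocation problem can be reproduced with a toy simulation: an h264ref-like application that reuses a small working set shares an LRU cache with a soplex-like application that streams through memory and never reuses a block. The round-robin interleaving, cache size, and access patterns below are illustrative assumptions, not the paper's methodology.

```python
from collections import OrderedDict

def simulate_shared_lru(capacity, streams):
    """Interleave block accesses from several apps (round-robin) into one
    shared LRU cache and report how many lines each app holds at the end.

    streams maps an app name to an iterator of block addresses."""
    cache = OrderedDict()                # (app, block) keys, ordered by recency
    iters = {app: iter(s) for app, s in streams.items()}
    alive = set(iters)
    while alive:
        for app in list(iters):
            if app not in alive:
                continue
            try:
                key = (app, next(iters[app]))
            except StopIteration:
                alive.discard(app)
                continue
            if key in cache:
                cache.move_to_end(key)        # hit: refresh recency
            else:
                if len(cache) >= capacity:
                    cache.popitem(last=False)  # evict least-recently-used line
                cache[key] = None
    counts = dict.fromkeys(streams, 0)
    for app, _block in cache:
        counts[app] += 1
    return counts

counts = simulate_shared_lru(
    64,
    {"h264ref": (i % 8 for i in range(400)),   # small reused working set
     "soplex": iter(range(400))})              # pure streaming, no reuse
```

In this run the streaming application ends up holding 56 of the 64 lines despite never hitting, while the reuse-friendly application keeps only its 8-line working set: exactly the rate-of-demand allocation the slide criticizes.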

Addressing Shared Cache Performance

• Applications that have no cache benefit cause destructive cache interference
• The conventional LRU policy allocates resources based on rate of demand
• State-of-the-art solutions:
  – Improve cache replacement (HW)
  – Modify memory allocation (SW)
  – Intelligent application scheduling (SW)
[Figure: same miss-rate and cache-occupancy data as the previous slide]

HW Techniques for Improving Shared Caches

• Modify the cache replacement policy
• Goal: allocate cache resources based on cache utility, NOT demand
[Diagram: cores C0 and C1 sharing an LLC, under LRU vs. under intelligent LLC replacement]

SW Techniques for Improving Shared Caches I

• Modify the OS memory allocation policy
• Goal: allocate pages to different cache sets to minimize interference
[Diagram: an intelligent memory allocator (OS) in front of an LRU-managed LLC shared by cores C0 and C1]

SW Techniques for Improving Shared Caches II

• Modify the scheduling policy in the operating system (OS) or hypervisor
• Goal: intelligently co-schedule applications to minimize contention
[Diagram: cores C0/C1 sharing LLC0 and cores C2/C3 sharing LLC1, both LRU-managed]

SW Techniques for Improving Shared Caches

• Four applications A, B, C, D on cores C0–C3, with two LLCs (LLC0, LLC1)
• Three possible schedules: A,B | C,D; A,C | B,D; A,D | B,C
• The gap between the optimal and worst schedule is ~30% for this example and ~9% on average
[Figure: throughput of the worst and optimal schedules on the baseline system (4-core CMP, 3-level hierarchy, LRU-managed LLC; values 4.9, 5.5, 6.3 shown), and the optimal/worst ratio across ~1500 workloads]
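The three possible schedules come from splitting the four applications into two unordered pairs, one pair per LLC. A few lines enumerate them (a hypothetical helper, not from the paper): fixing the first app and choosing its partner yields each partition exactly once.

```python
def pairings(apps):
    """All distinct ways to split four apps into two unordered pairs
    (one pair per LLC)."""
    first, rest = apps[0], apps[1:]
    schedules = []
    for partner in rest:
        pair = (first, partner)                       # first app + its partner
        other = tuple(a for a in rest if a != partner)  # the remaining pair
        schedules.append((pair, other))
    return schedules
```

For apps A, B, C, D this yields exactly the three schedules on the slide: A,B | C,D; A,C | B,D; A,D | B,C.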

Interactions Between Co-Scheduling and Replacement

• Existing co-scheduling proposals are evaluated on LRU-managed LLCs
• Question: is intelligent co-scheduling necessary with improved cache replacement policies, such as DRRIP cache replacement [Jaleel et al., ISCA'10]?
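DRRIP builds on re-reference interval prediction. A minimal sketch of its SRRIP component (one cache set, 2-bit re-reference prediction values) shows why a scan of never-reused blocks stops displacing reused data. This is a simplified model of the ISCA'10 policy; DRRIP's set dueling against BRRIP is omitted.

```python
class SRRIPSet:
    """One cache set under SRRIP with 2-bit re-reference prediction
    values (RRPVs). Simplified model of Jaleel et al., ISCA'10."""
    MAX_RRPV = 3

    def __init__(self, ways):
        self.ways = ways
        self.rrpv = {}                      # tag -> predicted re-reference

    def access(self, tag):
        """Returns True on hit, False on miss."""
        if tag in self.rrpv:
            self.rrpv[tag] = 0              # hit: near-immediate re-reference
            return True
        if len(self.rrpv) >= self.ways:
            # Age lines until some line is predicted "distant" (RRPV 3),
            # then evict the first such line.
            while self.MAX_RRPV not in self.rrpv.values():
                for t in self.rrpv:
                    self.rrpv[t] += 1
            victim = next(t for t, v in self.rrpv.items()
                          if v == self.MAX_RRPV)
            del self.rrpv[victim]
        self.rrpv[tag] = self.MAX_RRPV - 1  # insert with "long" prediction
        return False
```

Unlike LRU, scanned blocks are inserted with a long re-reference prediction and cycle through eviction among themselves, so a block that has actually been reused (RRPV 0) survives the scan.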

Interactions Between Optimal Co-Scheduling and Replacement

(4-core CMP, 3-level hierarchy, per-workload comparison of 1365 4-core multi-programmed workloads)
• Category I: no intelligent co-scheduling needed under either LRU or DRRIP
• Category II: intelligent co-scheduling needed only under LRU
• Category III: intelligent co-scheduling needed only under DRRIP
• Category IV: intelligent co-scheduling needed under both LRU and DRRIP
[Scatter plot: optimal/worst schedule ratio under DRRIP (y-axis, 1.00–1.28) vs. under LRU (x-axis, 1.00–1.28), one point per workload]

Interactions Between Optimal Co-Scheduling and Replacement

(Same scatter plot and categories as the previous slide.)
Observation: the need for intelligent co-scheduling is a function of the replacement policy

Interactions Between Optimal Co-Scheduling and Replacement

• Category II: intelligent co-scheduling needed only under LRU
[Diagram: a Category II workload on cores C0–C3 with LRU-managed LLC0 and LLC1; the scatter plot highlights the Category II region]


Interactions Between Optimal Co-Scheduling and Replacement

• Category II: intelligent co-scheduling needed only under LRU
• No re-scheduling is necessary for Category II workloads on DRRIP-managed LLCs
[Diagram: the same Category II workload on DRRIP-managed LLC0 and LLC1]

Opportunity for Intelligent Application Co-Scheduling

• Prior art: evaluated using inefficient cache policies (i.e., LRU replacement)
• Proposal: Cache Replacement and Utility-aware Scheduling
  – Understand how applications access the LLC (in isolation)
  – Schedule applications based on how they can impact each other
  – (keep the LLC replacement policy in mind)

Memory Diversity of Applications (In Isolation)

Four classes of isolated LLC behavior (assuming a 4 MB shared LLC):
• Core Cache Fitting, CCF (e.g., povray)
• LLC Friendly, LLCFR (e.g., bzip2)
• LLC Fitting, LLCF (e.g., sphinx3)
• LLC Thrashing, LLCT (e.g., bwaves)
[Diagram: quad-core systems with private L2s and a shared LLC, one per class]
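The four classes can be expressed as a decision over an application's isolated access and miss rates. The inputs mirror the full-capacity and half-capacity estimates that RICE provides later in the talk; the numeric thresholds are illustrative assumptions, not the paper's calibrated values.

```python
def classify(apki, mpki_full, mpki_half, low_apki=1.0, high_mpki=1.0):
    """Map an app's isolated LLC behavior to a CRUISE class.

    apki:      isolated LLC accesses per kilo-instruction
    mpki_full: isolated misses per kilo-instruction with the whole LLC
    mpki_half: isolated misses per kilo-instruction with half the LLC
    Thresholds are illustrative, not the paper's values."""
    if apki < low_apki:
        return "CCF"    # core caches absorb almost all accesses
    if mpki_full > high_mpki:
        return "LLCT"   # misses even with the full LLC: thrashing
    if mpki_half > high_mpki:
        return "LLCF"   # fits only when given most of the LLC
    return "LLCFR"      # benefits from the LLC and shares it well
```

The half-capacity miss rate is what separates LLCF from LLCFR: an LLCF app looks cache-friendly with the whole LLC but thrashes when restricted to half of it.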

Cache Replacement and Utility-aware Scheduling (CRUISE)

• Core Cache Fitting (CCF) apps:
  – Infrequently access the LLC
  – Do not rely on the LLC for performance
• Co-scheduling multiple CCF jobs on the same LLC "wastes" that LLC
• Best to spread CCF applications across the available LLCs

Cache Replacement and Utility-aware Scheduling (CRUISE)

• LLC Thrashing (LLCT) apps:
  – Frequently access the LLC
  – Do not benefit at all from the LLC
• Under LRU, LLCT apps degrade the performance of other applications
• Co-schedule LLCT apps with other LLCT apps

Cache Replacement and Utility-aware Scheduling (CRUISE)

• LLC Thrashing (LLCT) apps (continued):
• Under DRRIP, LLCT apps do not degrade the performance of co-scheduled apps
• Best to spread LLCT apps across the available LLCs to efficiently utilize cache resources

Cache Replacement and Utility-aware Scheduling (CRUISE)

• LLC Fitting (LLCF) apps:
  – Frequently access the LLC
  – Require the majority of the LLC
  – Behave like LLCT apps if they do not receive the majority of the LLC
• Best to co-schedule LLCF apps with CCF applications (if present)
• If no CCF app is present, schedule with LLCF/LLCT apps

Cache Replacement and Utility-aware Scheduling (CRUISE)

• LLC Friendly (LLCFR) apps:
  – Rely on the LLC for performance
  – Can share the LLC with similar apps
• Co-scheduling multiple LLCFR jobs on the same LLC does not result in suboptimal performance

CRUISE for LRU-managed Caches (CRUISE-L)

• Example applications (from the slide's diagram): LLCT, LLCF, LLCT, CCF on cores C0–C3
• Co-schedule apps as follows:
  – Co-schedule LLCT apps with LLCT apps
  – Spread CCF applications across LLCs
  – Co-schedule LLCF apps with CCF apps
  – Fill LLCFR apps onto free cores

CRUISE for DRRIP-managed Caches (CRUISE-D)

• Example applications (from the slide's diagram): LLCFR, LLCT, CCF, LLCT on cores C0–C3
• Co-schedule apps as follows:
  – Spread LLCT apps across LLCs
  – Spread CCF apps across LLCs
  – Co-schedule LLCF apps with CCF/LLCT apps
  – Fill LLCFR apps onto free cores
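The two rule lists above can be sketched as one scheduler parameterized by the replacement policy. The priority order follows the slides; the tie-breaking and helper structure are my own assumptions, not the paper's algorithm.

```python
def cruise_schedule(apps, policy, num_llcs=2, cores_per_llc=2):
    """Assign classified apps to LLCs following the CRUISE heuristics.

    apps:   dict mapping app name -> class in {"CCF","LLCT","LLCF","LLCFR"}
    policy: "LRU" selects CRUISE-L; anything else selects CRUISE-D."""
    llcs = [[] for _ in range(num_llcs)]

    def open_llcs():                  # LLCs that still have a free core
        return [l for l in llcs if len(l) < cores_per_llc]

    def spread(names):                # one app per LLC, emptiest first
        for a in names:
            min(open_llcs(), key=len).append(a)

    def pile(names):                  # pack apps onto as few LLCs as possible
        for a in names:
            max(open_llcs(), key=len).append(a)

    def pair_with(names, mates):      # place next to an already-placed mate
        for a in names:
            cands = [l for l in open_llcs()
                     if any(apps[x] in mates for x in l)]
            min(cands or open_llcs(), key=len).append(a)

    def group(cls):
        return sorted(a for a, c in apps.items() if c == cls)

    if policy == "LRU":               # CRUISE-L
        pile(group("LLCT"))           # co-schedule LLCT with LLCT
        spread(group("CCF"))          # spread CCF across LLCs
        pair_with(group("LLCF"), {"CCF"})
        spread(group("LLCFR"))        # fill remaining free cores
    else:                             # CRUISE-D (DRRIP-managed LLCs)
        spread(group("LLCT"))         # spread LLCT across LLCs
        spread(group("CCF"))
        pair_with(group("LLCF"), {"CCF", "LLCT"})
        spread(group("LLCFR"))
    return [sorted(l) for l in llcs]
```

For the CRUISE-L example mix (two LLCT apps, one LLCF, one CCF) this pairs the two thrashing apps on one LLC and puts the LLCF app next to the CCF app; under CRUISE-D the two LLCT apps land on different LLCs instead.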

Experimental Methodology

• System model:
  – 4-wide out-of-order processor (Core i7-like)
  – 3-level memory hierarchy (Core i7-like)
  – Application scheduler
• Workloads:
  – Multi-programmed combinations of SPEC CPU2006 applications
  – ~1400 4-core multi-programmed workloads (2 cores/LLC)
  – ~6400 8-core multi-programmed workloads (2 cores/LLC and 4 cores/LLC)

Experimental Methodology

(Same system model and workloads as the previous slide.)
[Diagram: baseline system with applications A, B, C, D on cores C0–C3; C0/C1 share LLC0 and C2/C3 share LLC1]

CRUISE Performance on Shared Caches

(4-core CMP, 3-level hierarchy, averaged across all 1365 multi-programmed workload mixes)
[Bar chart: performance relative to the worst schedule (1.00–1.10) for Random, Distributed Intensity (ASPLOS'10), CRUISE-L, CRUISE-D, and Optimal, on LRU-managed and DRRIP-managed LLCs]
• CRUISE provides near-optimal performance
• The optimal co-scheduling decision is a function of the LLC replacement policy

Classifying Application Cache Utility in Isolation

How do you know an application's classification at run time?
• Profiling: the application provides its memory intensity at run time ✗
• HW performance counters: assume isolated cache behavior is the same as shared-cache behavior ✗
• Periodically pause adjacent cores at runtime ✗
• Proposal: Runtime Isolated Cache Estimator (RICE)
  – Architecture support to estimate isolated cache behavior while still sharing the LLC

Runtime Isolated Cache Estimator (RICE)

• Assume a cache shared by two applications, APP0 and APP1
• Each app is assigned a few sampled monitor sets that observe its isolated cache behavior: only that app fills into its monitor sets; all other apps bypass them
• Per-app Access and Miss counters over the monitor sets give the isolated hit/miss rates (APKI, MPKI)
• All remaining sets are follower sets (policy vector <P0, P1, P2, P3>)
• Cost: 32 monitor sets per app, 15-bit hit/miss counters

Runtime Isolated Cache Estimator (RICE)

• A second group of monitor sets estimates isolated behavior if only half the cache were available: APP0 fills into only half the ways of these sets, while all other apps use them normally
• This half-capacity estimate is needed to classify LLCF applications
• Counters Access-F/Miss-F (full capacity) and Access-H/Miss-H (half capacity) give the isolated hit/miss rates (APKI, MPKI)
• Cost: 32 monitor sets per app, 15-bit hit/miss counters
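Once the monitor counters are in place, turning them into the isolated APKI/MPKI figures the classifier needs is simple per-kilo-instruction arithmetic. The counter naming below mirrors the slide's full-capacity and half-capacity monitors; it is a sketch of the bookkeeping, not the hardware's exact implementation.

```python
def rice_estimate(counters, instructions):
    """Convert RICE monitor-set counters into isolated per-kilo-instruction
    rates. F = full-capacity monitor sets, H = half-capacity monitor sets;
    in hardware these are 15-bit counters, here plain ints."""
    kilo_instr = instructions / 1000.0
    apki = counters["access_full"] / kilo_instr     # isolated accesses / 1K instr
    mpki_full = counters["miss_full"] / kilo_instr  # misses with the whole LLC
    mpki_half = counters["miss_half"] / kilo_instr  # misses with half the LLC
    return apki, mpki_full, mpki_half
```

A large gap between mpki_half and mpki_full flags an LLC Fitting application: it hits with the full cache but misses heavily when confined to half of it.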

Performance of CRUISE using the RICE Classifier

[Bar chart: performance relative to the worst schedule (0.95–1.30) for CRUISE, Distributed Intensity (ASPLOS'10), and Optimal]
• CRUISE with the dynamic RICE classifier is within 1–2% of optimal

Summary

• Optimal application co-scheduling is an important problem
  – Useful for future multi-core processors and virtualization technologies
• Co-scheduling decisions are a function of the replacement policy
• Our proposal:
  – Cache Replacement and Utility-aware Scheduling (CRUISE)
  – Architecture support for estimating isolated cache behavior (RICE)
• CRUISE is scalable and performs similarly to optimal co-scheduling
• RICE requires negligible hardware overhead

Q&A