Seminar in Computer Architecture Meeting 3 a Memory

  • Slides: 54
Download presentation
Seminar in Computer Architecture Meeting 3 a: Memory Channel Partitioning Prof. Onur Mutlu ETH

Seminar in Computer Architecture Meeting 3 a: Memory Channel Partitioning Prof. Onur Mutlu ETH Zürich Fall 2020 1 October 2020

Example Paper Presentation 2

Example Paper Presentation 2

We Will Briefly Review This Paper n Sai Prashanth Muralidhara, Lavanya Subramanian, Onur Mutlu,

We Will Briefly Review This Paper n Sai Prashanth Muralidhara, Lavanya Subramanian, Onur Mutlu, Mahmut Kandemir, and Thomas Moscibroda, "Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning" Proceedings of the 44 th International Symposium on Microarchitecture (MICRO), Porto Alegre, Brazil, December 2011. Slides (pptx) 3

Application-Aware Memory Channel Partitioning Sai Prashanth Muralidhara § Lavanya Subramanian † Onur Mutlu †

Application-Aware Memory Channel Partitioning Sai Prashanth Muralidhara § Lavanya Subramanian † Onur Mutlu † Mahmut Kandemir § Thomas Moscibroda ‡ § Pennsylvania State University † Carnegie Mellon University ‡ Microsoft Research

Background, Problem & Goal 5

Background, Problem & Goal 5

Main Memory is a Bottleneck Core Memory Controller Channel Main Memory Core n n

Main Memory is a Bottleneck Core Memory Controller Channel Main Memory Core n n n Main memory latency is long Core stalls, performance degrades Multiple applications share the main memory 6

Problem of Inter-Application Interference Core Memory Controller Req Channel Req Main Memory Core n

Problem of Inter-Application Interference Core Memory Controller Req Channel Req Main Memory Core n n n Applications’ requests interfere at the main memory This inter-application interference degrades system performance Problem further exacerbated due to q q Increasing number of cores Limited off-chip pin bandwidth 7

Outline Goal: Mitigate Inter-Application Interference Previous Approach: Application-Aware Memory Request Scheduling Our First Approach:

Outline Goal: Mitigate Inter-Application Interference Previous Approach: Application-Aware Memory Request Scheduling Our First Approach: Application-Aware Memory Channel Partitioning Our Second Approach: Integrated Memory Partitioning and Scheduling 8

Background: Main Memory Rows Columns Memory Controller n Bank 1 Bank 2 Bank 3

Background: Main Memory Rows Columns Memory Controller n Bank 1 Bank 2 Bank 3 Row Buffer Row Buffer Channel FR-FCFS memory scheduling policy q q n Bank 0 [Zuravleff et al. , US Patent ‘ 97; Rixner et al. , ISCA ‘ 00] Row-buffer hit first Oldest request first Unaware of inter-application interference 9

Novelty 10

Novelty 10

Previous Approach Goal: Mitigate Inter-Application Interference Previous Approach: Application-Aware Memory Request Scheduling Our First

Previous Approach Goal: Mitigate Inter-Application Interference Previous Approach: Application-Aware Memory Request Scheduling Our First Approach: Application-Aware Memory Channel Partitioning Our Second Approach: Integrated Memory Partitioning and Scheduling 11

Application-Aware Memory Request Scheduling n n n Monitor application memory access characteristics Rank applications

Application-Aware Memory Request Scheduling n n n Monitor application memory access characteristics Rank applications based on memory access characteristics Prioritize requests at the memory controller, based on ranking 12

An Example: Thread Cluster Memory Scheduling Memory-non-intensive thread Nonintensive cluster Throughput thread thread higher

An Example: Thread Cluster Memory Scheduling Memory-non-intensive thread Nonintensive cluster Throughput thread thread higher priority Prioritized thread higher priority Threads in the system Memory-intensive Intensive cluster Fairness Figure: Kim et al. , MICRO 2010 13

Application-Aware Memory Request Scheduling Advantages n n Reduces interference between applications by request reordering

Application-Aware Memory Request Scheduling Advantages n n Reduces interference between applications by request reordering Improves system performance Disadvantages n Requires modifications to memory scheduling logic for q q n Ranking Prioritization Cannot completely eliminate interference by request reordering 14

Key Approach and Ideas 15

Key Approach and Ideas 15

The Paper’s Approach Goal: Mitigate Inter-Application Interference Previous Approach: Application-Aware Memory Request Scheduling Our

The Paper’s Approach Goal: Mitigate Inter-Application Interference Previous Approach: Application-Aware Memory Request Scheduling Our First Approach: Application-Aware Memory Channel Partitioning Our Second Approach: Integrated Memory Partitioning and Scheduling 16

Observation: Modern Systems Have Multiple Channels Core Red App Memory Controller Channel 0 Memory

Observation: Modern Systems Have Multiple Channels Core Red App Memory Controller Channel 0 Memory Controller Channel 1 Memory Core Blue App A new degree of freedom Mapping data across multiple channels 17

Data Mapping in Current Systems Core Red App Page Memory Controller Channel 0 Memory

Data Mapping in Current Systems Core Red App Page Memory Controller Channel 0 Memory Controller Channel 1 Memory Core Blue App Causes interference between applications’ requests 18

Partitioning Channels Between Applications Core Red App Page Memory Controller Channel 0 Memory Controller

Partitioning Channels Between Applications Core Red App Page Memory Controller Channel 0 Memory Controller Channel 1 Memory Core Blue App Eliminates interference between applications’ requests 19

Overview: Memory Channel Partitioning (MCP) n Goal q n Basic Idea q n Eliminate

Overview: Memory Channel Partitioning (MCP) n Goal q n Basic Idea q n Eliminate harmful interference between applications Map the data of badly-interfering applications to different channels Key Principles q q Separate low and high memory-intensity applications Separate low and high row-buffer locality applications 20

Key Insight 1: Separate by Memory Intensity High memory-intensity applications interfere with low memory

Key Insight 1: Separate by Memory Intensity High memory-intensity applications interfere with low memory -intensity applications in shared memory channels Time Units Core Red App Core Blue App 5 4 3 2 1 Channel 0 Bank 1 Bank 0 Bank 1 Time Units Core Red App 5 4 3 2 1 Core Saved Cycles Blue App Saved Cycles Bank 0 Bank 1 Channel 1 Conventional Page Mapping Channel 0 Channel Partitioning Map data of low and high memory-intensity applications to different channels 21

Key Insight 2: Separate by Row-Buffer Request Buffer Channel 0 Channelapplications 0 Locality High.

Key Insight 2: Separate by Row-Buffer Request Buffer Channel 0 Channelapplications 0 Locality High. Request row-buffer locality interfere with low State Bank 0 R 1 row-buffer locality applications in shared memory channels R 0 Time units 6 5 R 3 R 2 R 0 Bank 1 R 4 Bank 0 R 1 R 4 Bank 0 Bank 1 R 3 R 2 Bank 1 Service Order 3 4 1 2 R 1 R 3 R 2 R 0 R 4 Channel 1 Channel 0 Bank 0 Time units 6 5 Service Order 3 4 Bank 1 Bank 0 Bank 1 R 1 2 1 R 0 R 4 Channel 1 Channel 0 Bank 1 Bank 0 Saved row-buffer Cycles Bank 1 R 3 R 2 Map data of low and high locality applications Channel 1 to different channels Conventional Page Mapping Channel Partitioning 22

Mechanisms (in some detail) 23

Mechanisms (in some detail) 23

Memory Channel Partitioning (MCP) Mechanism Hardware 1. 2. 3. 4. 5. Profile applications Classify

Memory Channel Partitioning (MCP) Mechanism Hardware 1. 2. 3. 4. 5. Profile applications Classify applications into groups Partition channels between application groups Assign a preferred channel to each application Allocate application pages to preferred channel System Software 24

1. Profile Applications n n Hardware counters collect application memory access characteristics Memory access

1. Profile Applications n n Hardware counters collect application memory access characteristics Memory access characteristics q Memory intensity: Last level cache Misses Per Kilo Instruction (MPKI) q Row-buffer locality: Row-buffer Hit Rate (RBH) - percentage of accesses that hit in the row buffer 25

2. Classify Applications Test MPKI Low Intensity High Intensity Test RBH Low High Intensity

2. Classify Applications Test MPKI Low Intensity High Intensity Test RBH Low High Intensity Low Row-Buffer Locality High Intensity High Row-Buffer Locality 26

3. Partition Channels Among Groups: Step 1 Channel 1 Low Intensity Channel 2 High

3. Partition Channels Among Groups: Step 1 Channel 1 Low Intensity Channel 2 High Intensity Low Row-Buffer Locality High Intensity High Row-Buffer Locality Assign number of channels proportional to number of applications in group Channel 3. . . Channel N-1 Channel N 27

3. Partition Channels Among Groups: Step 2 Channel 1 Low Intensity Channel 2 Channel

3. Partition Channels Among Groups: Step 2 Channel 1 Low Intensity Channel 2 Channel 3 High Intensity Low Row-Buffer Locality High Intensity High Row-Buffer Locality Assign number of channels proportional to bandwidth demand of group . . . Channel N-1. . Channel N 28

4. Assign Preferred Channel to Application n Assign each application a preferred channel from

4. Assign Preferred Channel to Application n Assign each application a preferred channel from n its group’s allocated channels Distribute applications to channels such that group’s bandwidth demand is balanced across its channels MPKI: 1 Channel 1 MPKI: 3 Low Intensity MPKI: 42 Channel 29

5. Allocate Page to Preferred Channel n Enforce channel preferences computed in the previous

5. Allocate Page to Preferred Channel n Enforce channel preferences computed in the previous step n On a page fault, the operating system q q q allocates page to preferred channel if free page available in preferred channel if free page not available, replacement policy tries to allocate page to preferred channel if it fails, allocate page to another channel 30

Interval Based Operation Current Interval Next Interval time 1. Profile applications 5. Enforce channel

Interval Based Operation Current Interval Next Interval time 1. Profile applications 5. Enforce channel preferences 2. Classify applications into groups 3. Partition channels between groups 4. Assign preferred channel to applications 31

Integrating Partitioning and Scheduling Goal: Mitigate Inter-Application Interference Previous Approach: Application-Aware Memory Request Scheduling

Integrating Partitioning and Scheduling Goal: Mitigate Inter-Application Interference Previous Approach: Application-Aware Memory Request Scheduling Our First Approach: Application-Aware Memory Channel Partitioning Our Second Approach: Integrated Memory Partitioning and Scheduling 32

Observations n n n Applications with very low memory-intensity rarely access memory Dedicating channels

Observations n n n Applications with very low memory-intensity rarely access memory Dedicating channels to them results in precious memory bandwidth waste They have the most potential to keep their cores busy We would really like to prioritize them They interfere minimally with other applications Prioritizing them does not hurt others 33

Integrated Memory Partitioning and Scheduling (IMPS) n n Always prioritize very low memory-intensity applications

Integrated Memory Partitioning and Scheduling (IMPS) n n Always prioritize very low memory-intensity applications in the memory scheduler Use memory channel partitioning to mitigate interference between other applications 34

Key Results: Methodology and Evaluation 35

Key Results: Methodology and Evaluation 35

Hardware Cost n n Memory Channel Partitioning (MCP) q Only profiling counters in hardware

Hardware Cost n n Memory Channel Partitioning (MCP) q Only profiling counters in hardware q No modifications to memory scheduling logic q 1. 5 KB storage cost for a 24 -core, 4 -channel system Integrated Memory Partitioning and Scheduling (IMPS) q A single bit per request q Scheduler prioritizes based on this single bit 36

Methodology n Simulation Model q q 24 cores, 4 channels, 4 banks/channel Core Model

Methodology n Simulation Model q q 24 cores, 4 channels, 4 banks/channel Core Model n n q n Memory Model – DDR 2 Workloads q n Out-of-order, 128 -entry instruction window 512 KB L 2 cache/core 240 SPEC CPU 2006 multiprogrammed workloads (categorized based on memory intensity) Metrics q System Performance 37

Previous Work on Memory Scheduling n FR-FCFS [Zuravleff et al. , US Patent 1997,

Previous Work on Memory Scheduling n FR-FCFS [Zuravleff et al. , US Patent 1997, Rixner et al. , ISCA 2000] q q n ATLAS [Kim et al. , HPCA 2010] q n Prioritizes row-buffer hits and older requests Application-unaware Prioritizes applications with low memory-intensity TCM [Kim et al. , MICRO 2010] q q Always prioritizes low memory-intensity applications Shuffles request priorities of high memory-intensity applications 38

Comparison to Previous Scheduling Policies Averaged over 240 workloads Normalized System Performance 1, 12

Comparison to Previous Scheduling Policies Averaged over 240 workloads Normalized System Performance 1, 12 1, 1 11% 5% FRFCFS 1, 08 7% 1% 1, 06 ATLAS 1, 04 TCM 1, 02 1 MCP 0, 98 IMPS 0, 96 0, 94 Better system performance than the best previous scheduler Significant performance improvement over baseline FRFCFS at lower hardware cost 39

Interaction with Memory Scheduling Averaged over 240 workloads Normalized System Performance 1, 12 1,

Interaction with Memory Scheduling Averaged over 240 workloads Normalized System Performance 1, 12 1, 1 1, 08 1, 06 1, 04 1, 02 1 0, 98 0, 96 0, 94 No No IMPS FRFCFS ATLAS TCM IMPS improves performance regardless of scheduling policy Highest improvement over FRFCFS as IMPS designed for FRFCFS 40

Summary 41

Summary 41

Summary n n Uncontrolled inter-application interference in main memory degrades system performance Application-aware memory

Summary n n Uncontrolled inter-application interference in main memory degrades system performance Application-aware memory channel partitioning (MCP) q n to Integrated memory partitioning and scheduling (IMPS) q q n Separates the data of badly-interfering applications different channels, eliminating interference Prioritizes very low memory-intensity applications in scheduler Handles other applications’ interference by partitioning MCP/IMPS provide better performance than applicationaware memory request scheduling at lower hardware cost 42

Strengths 43

Strengths 43

Strengths of the Paper n n n n Novel solution to a key problem

Strengths of the Paper n n n n Novel solution to a key problem in multi-core systems, memory interference; the importance of problem will increase over time Keeps the memory scheduling hardware simple Combines multiple interference reduction techniques Can provide performance isolation across applications mapped to different channels General idea of partitioning can be extended to smaller granularities in the memory hierarchy: banks, subarrays, etc. Well-written paper Thorough simulation-based evaluation 44

Weaknesses 45

Weaknesses 45

Weaknesses/Limitations of the Paper n n Mechanism may not work effectively if workload changes

Weaknesses/Limitations of the Paper n n Mechanism may not work effectively if workload changes behavior after profiling Overhead of moving pages between channels restricts mechanism’s benefits Small number of memory channels reduces the scope of partitioning Load imbalance across channels can reduce performance q n n The paper addresses this and compares to another mechanism Software-hardware cooperative solution might not always be easy to adopt Evaluation is done solely in simulation Evaluation does not consider multi-chip systems Are these the best workloads to evaluate? 46

Recall: Try to Avoid Rat Holes Source: P. Jarupunphol, “Using Buddhist Insights to Analyse

Recall: Try to Avoid Rat Holes Source: P. Jarupunphol, “Using Buddhist Insights to Analyse the Cause of System Project Failures, ” Ph. D. Thesis, 2013 47

Thoughts and Ideas 48

Thoughts and Ideas 48

Extensions (I) n Can this idea be extended to different granularities in memory? q

Extensions (I) n Can this idea be extended to different granularities in memory? q n n Can this idea be extended to provide performance predictability and performance isolation? How can MCP be combined effectively with other interference reduction techniques? q q n Partition banks, subarrays, mats across workloads E. g. , source throttling methods [Ebrahimi+, ASPLOS 2010] E. g. , thread scheduling methods Can this idea be evaluated on a real system? How? 49

Takeaways 50

Takeaways 50

Key Takeaways n A novel method to reduce memory interference n Simple and effective

Key Takeaways n A novel method to reduce memory interference n Simple and effective n Hardware/software cooperative n Good potential for work building on it to extend it q q q n To different structures To different metrics Multiple works have already built on the paper (see bank partitioning works in PACT 2012, HPCA 2012) Easy to read and understand paper 51

Open Discussion 52

Open Discussion 52

Discussion Starters n Thoughts on the previous ideas? n How practical is this? n

Discussion Starters n Thoughts on the previous ideas? n How practical is this? n n Will the problem become bigger and more important over time? Will the solution become more important over time? Are other solutions better? Is this solution clearly advantageous in some cases? 53

Seminar in Computer Architecture Meeting 3 a: Memory Channel Partitioning Prof. Onur Mutlu ETH

Seminar in Computer Architecture Meeting 3 a: Memory Channel Partitioning Prof. Onur Mutlu ETH Zürich Fall 2020 1 October 2020