Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior

Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior
Yoongu Kim, Michael Papamichael, Onur Mutlu, Mor Harchol-Balter

Motivation
• Memory is a shared resource; threads' requests contend for memory
  – Degradation in single-thread performance
  – Can even lead to starvation
• How to schedule memory requests to increase both system throughput and fairness?

Previous Scheduling Algorithms are Biased
[Scatter plot: Maximum Slowdown (lower = better fairness) vs. Weighted Speedup (higher = better system throughput) for FRFCFS, STFM, PAR-BS, ATLAS, and an ideal point; some algorithms are throughput-biased, others fairness-biased]
No previous memory scheduling algorithm provides both the best fairness and system throughput

Why do Previous Algorithms Fail?
• Throughput-biased approach: prioritize less memory-intensive threads
  – Good for throughput, but the least prioritized thread can starve → unfairness
• Fairness-biased approach: threads take turns accessing memory
  – Does not starve any thread, but memory-non-intensive threads are not prioritized → reduced throughput
A single policy for all threads is insufficient

Insight: Achieving the Best of Both Worlds
• For throughput: prioritize memory-non-intensive threads
• For fairness: unfairness is caused by memory-intensive threads being prioritized over each other
  – Shuffle threads
  – Memory-intensive threads have different vulnerability to interference
  – Shuffle asymmetrically

Outline
• Motivation & Insights
• Overview
• Algorithm
• Bringing it All Together
• Evaluation
• Conclusion

Overview: Thread Cluster Memory Scheduling
1. Group threads into two clusters: memory-non-intensive threads form the non-intensive cluster, memory-intensive threads the intensive cluster
2. Prioritize the non-intensive cluster over the intensive cluster
3. Apply a different policy to each cluster: throughput for the non-intensive cluster, fairness for the intensive cluster

TCM Outline
1. Clustering

Clustering Threads
• Step 1: Sort threads by MPKI (misses per kiloinstruction); higher MPKI = more memory-intensive
• Step 2: Memory bandwidth usage αT divides the clusters (T = total memory bandwidth usage; ClusterThreshold α < 10%)
  – The lowest-MPKI threads that together use at most αT of bandwidth form the non-intensive cluster; the rest form the intensive cluster
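The two clustering steps above can be sketched in Python. This is a minimal sketch, not the paper's implementation: the thread fields (`mpki`, `bandwidth`) and the greedy prefix cut at αT are assumed bookkeeping a memory controller would keep.

```python
def cluster_threads(threads, total_bandwidth, cluster_threshold=0.10):
    """Split threads into (non_intensive, intensive) clusters.

    Sketch of TCM's clustering step: sort threads by MPKI
    (ascending), then admit the lowest-MPKI prefix whose combined
    bandwidth stays within cluster_threshold * total_bandwidth
    (alpha * T); all remaining threads form the intensive cluster.
    """
    ordered = sorted(threads, key=lambda t: t["mpki"])
    non_intensive, used = [], 0.0
    for i, t in enumerate(ordered):
        if used + t["bandwidth"] > cluster_threshold * total_bandwidth:
            return non_intensive, ordered[i:]  # rest are intensive
        non_intensive.append(t)
        used += t["bandwidth"]
    return non_intensive, []
```

Because the sort is by MPKI, the non-intensive cluster is always the least memory-intensive prefix; α controls how much bandwidth that prefix may consume.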

TCM Outline
1. Clustering
2. Between Clusters

Prioritization Between Clusters
Prioritize the non-intensive cluster (non-intensive cluster > intensive cluster)
• Increases system throughput
  – Non-intensive threads have greater potential for making progress
• Does not degrade fairness
  – Non-intensive threads are “light” and rarely interfere with intensive threads

TCM Outline
1. Clustering
2. Between Clusters
3. Non-Intensive Cluster (Throughput)

Non-Intensive Cluster
Prioritize threads by MPKI: lowest MPKI gets the highest priority, highest MPKI the lowest
• Increases system throughput
  – The least intensive thread has the greatest potential for making progress in the processor

TCM Outline
1. Clustering
2. Between Clusters
3. Non-Intensive Cluster (Throughput)
4. Intensive Cluster (Fairness)

Intensive Cluster
Periodically shuffle the priority of threads
• Increases fairness
• Is treating all threads equally good enough?
  – BUT: equal turns ≠ same slowdown

Case Study: A Tale of Two Threads
Two intensive threads contending: 1. random-access, 2. streaming. Which is slowed down more easily?
• Prioritize random-access: random-access 1× (prioritized), streaming slowed down 7×
• Prioritize streaming: streaming 1× (prioritized), random-access slowed down 11×
The random-access thread is more easily slowed down

Why are Threads Different?
• random-access: all requests issued in parallel across banks → high bank-level parallelism; a single request stuck behind another thread's activated row stalls its progress → vulnerable to interference
• streaming: all requests go to the same (activated) row → high row-buffer locality

Niceness
How to quantify the difference between threads?
• High bank-level parallelism → vulnerable to interference → high niceness (+)
• High row-buffer locality → causes interference → low niceness (−)
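One way to turn these two properties into a single score is a rank difference, sketched below. The rank-difference formulation and the `blp`/`rbl` thread fields are assumptions for illustration, not necessarily the paper's exact metric.

```python
def niceness_scores(threads):
    """Rank threads by bank-level parallelism (BLP) and by row-buffer
    locality (RBL); a thread's niceness rises with its BLP rank
    (more vulnerable to interference) and falls with its RBL rank
    (causes more interference)."""
    blp_rank = {t["name"]: r for r, t in
                enumerate(sorted(threads, key=lambda t: t["blp"]))}
    rbl_rank = {t["name"]: r for r, t in
                enumerate(sorted(threads, key=lambda t: t["rbl"]))}
    return {name: blp_rank[name] - rbl_rank[name] for name in blp_rank}
```

Under this scoring, the random-access thread from the case study (high BLP, low RBL) comes out nicer than the streaming thread.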

Shuffling: Round-Robin vs. Niceness-Aware
1. Round-Robin shuffling: what can go wrong?
[Diagram: priority order over threads D (nicest) down to A (least nice), rotated every ShuffleInterval]
• GOOD: Each thread is prioritized once
• BAD: Nice threads receive lots of interference

Shuffling: Round-Robin vs. Niceness-Aware
2. Niceness-Aware shuffling
[Diagram: the least nice thread A stays near the bottom of the priority order across ShuffleIntervals]
• GOOD: Each thread is prioritized once
• GOOD: The least nice thread stays mostly deprioritized
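The two shuffling policies can be contrasted in a small sketch. TCM's actual insertion shuffle differs in detail; here the niceness-aware variant simply keeps the non-top slots ordered by descending niceness, which is enough to show the least nice thread staying mostly deprioritized.

```python
def round_robin_orders(threads, intervals):
    # Rotate the whole priority order each ShuffleInterval.
    return [threads[i:] + threads[:i] for i in range(intervals)]

def niceness_aware_orders(threads, niceness, intervals):
    # Each interval a different thread takes the top slot (so everyone
    # is prioritized once), but the remaining slots are filled in
    # descending niceness, keeping the least nice thread near the bottom.
    orders = []
    for i in range(intervals):
        top = threads[i % len(threads)]
        rest = sorted((t for t in threads if t != top),
                      key=lambda t: niceness[t], reverse=True)
        orders.append([top] + rest)
    return orders
```

Both policies give every thread one turn at the top; they differ in where the least nice thread spends the rest of its time.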

Quantum-Based Operation
• Time is divided into quanta (~1 M cycles each); within a quantum, ranks are shuffled every ShuffleInterval (~1 K cycles)
• During a quantum, monitor thread behavior:
  1. Memory intensity
  2. Bank-level parallelism
  3. Row-buffer locality
• At the beginning of each quantum:
  – Perform clustering
  – Compute niceness of intensive threads
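A skeleton of this quantum-based loop is below; `ControllerState` and its methods are hypothetical stand-ins for the controller's bookkeeping, and only the cycle arithmetic reflects the slide.

```python
QUANTUM_CYCLES = 1_000_000  # ~1 M cycles per quantum (from the slide)
SHUFFLE_CYCLES = 1_000      # ~1 K cycles per ShuffleInterval

class ControllerState:
    """Minimal recording stub so the skeleton below is runnable."""
    def __init__(self):
        self.events = []
    def cluster(self):           self.events.append("cluster")
    def compute_niceness(self):  self.events.append("niceness")
    def shuffle_intensive(self): self.events.append("shuffle")
    def monitor(self):           pass  # track intensity, BLP, RBL

def on_cycle(cycle, state):
    if cycle % QUANTUM_CYCLES == 0:
        # Beginning of quantum: re-cluster threads, recompute niceness.
        state.cluster()
        state.compute_niceness()
    if cycle % SHUFFLE_CYCLES == 0:
        # Shuffle ranks within the intensive cluster.
        state.shuffle_intensive()
    # During the quantum: keep monitoring thread behavior.
    state.monitor()
```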

TCM Scheduling Algorithm
1. Highest-rank: requests from higher-ranked threads are prioritized
  • Non-intensive cluster > intensive cluster
  • Non-intensive cluster: lower intensity → higher rank
  • Intensive cluster: rank shuffling
2. Row-hit: row-buffer hit requests are prioritized
3. Oldest: older requests are prioritized
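The three rules compose naturally into a single sort key, sketched here; the request fields (`rank`, `row_hit`, `age`) are assumed bookkeeping, not the controller's actual data layout.

```python
def request_key(req):
    """Priority key for TCM's three rules (lower tuple = served first):
    1. higher thread rank, 2. row-buffer hit, 3. older request.
    Python compares tuples element by element, so we negate the
    fields we want to maximize."""
    return (-req["rank"], not req["row_hit"], -req["age"])

def next_request(requests):
    # Pick the request the scheduler would service next.
    return min(requests, key=request_key)
```

Rule order matters: thread rank dominates, so a row-buffer hit from a low-ranked thread never overtakes a miss from a higher-ranked one.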

Implementation Costs
Required storage at the memory controller (24 cores):

Thread memory behavior   Storage
MPKI                     ~0.2 kbits
Bank-level parallelism   ~0.6 kbits
Row-buffer locality      ~2.9 kbits
Total                    < 4 kbits

• No computation is on the critical path

Metrics & Methodology
• Metrics: system throughput (weighted speedup), unfairness (maximum slowdown)
• Methodology
  – Core model: 4 GHz processor, 128-entry instruction window, 512 KB/core L2 cache
  – Memory model: DDR2
  – 96 multiprogrammed SPEC CPU2006 workloads

Previous Work
• FRFCFS [Rixner et al., ISCA 2000]: prioritizes row-buffer hits
  – Thread-oblivious → low throughput & low fairness
• STFM [Mutlu et al., MICRO 2007]: equalizes thread slowdowns
  – Non-intensive threads not prioritized → low throughput
• PAR-BS [Mutlu et al., ISCA 2008]: prioritizes the oldest batch of requests while preserving bank-level parallelism
  – Non-intensive threads not always prioritized → low throughput
• ATLAS [Kim et al., HPCA 2010]: prioritizes threads with less memory service
  – Most intensive thread starves → low fairness

Results: Fairness vs. Throughput
Averaged over 96 workloads.
[Scatter plot: Maximum Slowdown (lower = better fairness) vs. Weighted Speedup (higher = better system throughput) for FRFCFS, ATLAS, STFM, PAR-BS, and TCM, with improvement annotations of 5%, 39%, 5%, and 8%]
TCM provides the best fairness and system throughput

Results: Fairness-Throughput Tradeoff
When the configuration parameter is varied…
[Plot: Maximum Slowdown vs. Weighted Speedup curves for FRFCFS, ATLAS, STFM, PAR-BS, and TCM as ClusterThreshold is adjusted]
TCM allows a robust fairness-throughput tradeoff

Operating System Support
• ClusterThreshold is a tunable knob
  – The OS can trade off between fairness and throughput
• Enforcing thread weights
  – The OS assigns weights to threads
  – TCM enforces thread weights within each cluster

Conclusion
• No previous memory scheduling algorithm provides both high system throughput and fairness
  – Problem: they use a single policy for all threads
• TCM groups threads into two clusters
  1. Prioritize the non-intensive cluster → throughput
  2. Shuffle priorities in the intensive cluster → fairness
  3. Shuffling should favor nice threads → fairness
• TCM provides the best system throughput and fairness

THANK YOU

Thread Weight Support
• Even if the heaviest-weighted thread happens to be the most intensive thread…
  – It is not prioritized over the least intensive thread

[Backup chart: fairness vs. system throughput measured as Harmonic Speedup]

Shuffling Algorithm Comparison
• Niceness-Aware shuffling
  – Average of maximum slowdown is lower
  – Variance of maximum slowdown is lower

Shuffling Algorithm     Round-Robin   Niceness-Aware
E(Maximum Slowdown)     5.58          4.84
VAR(Maximum Slowdown)   1.61          0.85

Sensitivity Results

ShuffleInterval (cycles)   500    600    700    800
System Throughput          14.2   14.3   14.2   14.7
Maximum Slowdown           6.0    5.4    5.9    5.5

Number of Cores                    4     8      16     24     32
System Throughput (vs. ATLAS)      0%    3%     2%     1%     1%
Maximum Slowdown (vs. ATLAS)       -4%   -30%   -29%   -30%   -41%