Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior
Yoongu Kim, Michael Papamichael, Onur Mutlu, Mor Harchol-Balter
Motivation
• Memory is a shared resource: threads' requests contend for memory
  – Degrades single-thread performance
  – Can even lead to starvation
• How do we schedule memory requests to increase both system throughput and fairness?
Previous Scheduling Algorithms are Biased
[Chart: maximum slowdown (lower is better fairness) vs. weighted speedup (higher is better system throughput) for FRFCFS, STFM, PAR-BS, and ATLAS; some algorithms are biased toward fairness, others toward throughput, and none reaches the ideal corner.]
No previous memory scheduling algorithm provides both the best fairness and the best system throughput.
Why Do Previous Algorithms Fail?
• Throughput-biased approach: prioritize less memory-intensive threads
  – Good for throughput, but the most intensive thread is never prioritized and can starve → unfairness
• Fairness-biased approach: threads take turns accessing memory
  – Does not starve any thread, but less intensive threads are not prioritized → reduced throughput
A single policy for all threads is insufficient.
Insight: Achieving the Best of Both Worlds
• For throughput: prioritize memory-non-intensive threads
• For fairness: unfairness is caused by memory-intensive threads being prioritized over each other
  – Shuffle their priorities
  – Memory-intensive threads differ in their vulnerability to interference, so shuffle asymmetrically
Outline
• Motivation & Insights
• Overview
• Algorithm
• Bringing it All Together
• Evaluation
• Conclusion
Overview: Thread Cluster Memory Scheduling
1. Group threads into two clusters: memory-non-intensive and memory-intensive
2. Prioritize the non-intensive cluster over the intensive cluster
3. Apply a different policy to each cluster: throughput for the non-intensive cluster, fairness for the intensive cluster
TCM Outline
1. Clustering
2. Prioritization Between Clusters
3. Non-Intensive Cluster (Throughput)
4. Intensive Cluster (Fairness)
Clustering Threads
Let T be the total memory bandwidth usage.
• Step 1: Sort threads by MPKI (misses per kilo-instruction), from lowest to highest.
• Step 2: Walking up the sorted list, assign threads to the non-intensive cluster until their combined memory bandwidth usage reaches αT; the remaining (higher-MPKI) threads form the intensive cluster. The ClusterThreshold α is small (< 10%).
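The two steps can be sketched as follows; the per-thread statistics and the tie-breaking at the αT boundary are simplifying assumptions, not the paper's exact bookkeeping.

```python
def cluster_threads(threads, alpha=0.1):
    """Split threads into (non_intensive, intensive) clusters.

    threads: list of dicts with 'mpki' and 'bw' (bandwidth usage) keys.
    alpha:   ClusterThreshold; the non-intensive cluster is the set of
             lowest-MPKI threads whose combined bandwidth usage stays
             within alpha * T, where T is total bandwidth usage.
    """
    total_bw = sum(t['bw'] for t in threads)   # T
    budget = alpha * total_bw                  # alpha * T
    non_intensive, intensive = [], []
    used = 0.0
    # Step 1: sort by memory intensity (MPKI), least intensive first.
    for t in sorted(threads, key=lambda t: t['mpki']):
        # Step 2: least intensive threads join the non-intensive
        # cluster until their bandwidth usage would exceed alpha * T.
        if used + t['bw'] <= budget:
            non_intensive.append(t)
            used += t['bw']
        else:
            intensive.append(t)
    return non_intensive, intensive
```

A light thread that issues few requests lands in the non-intensive cluster even if many heavy threads dominate total bandwidth, which is exactly the behavior the clustering is meant to produce.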
Prioritization Between Clusters
Prioritize the non-intensive cluster over the intensive cluster.
• Increases system throughput: non-intensive threads have greater potential for making progress
• Does not degrade fairness: non-intensive threads are "light" and rarely interfere with intensive threads
Non-Intensive Cluster
Prioritize threads according to MPKI: lowest MPKI gets the highest priority.
• Increases system throughput: the least intensive thread has the greatest potential for making progress in the processor
Intensive Cluster
Periodically shuffle the priority of threads to increase fairness.
• But is treating all threads equally good enough? No: equal turns ≠ same slowdown
Case Study: A Tale of Two Threads
Two intensive threads contending: (1) a random-access thread and (2) a streaming thread. Which is slowed down more easily?
• Prioritize the random-access thread: it stays at ~1x slowdown, while the streaming thread is slowed down ~7x
• Prioritize the streaming thread: it stays at ~1x slowdown, while the random-access thread is slowed down ~11x
The random-access thread is more easily slowed down.
Why Are Threads Different?
• random-access: requests are spread across banks in parallel → high bank-level parallelism; one request getting stuck stalls the rest, so the thread is vulnerable to interference
• streaming: all requests go to the same (already activated) row → high row-buffer locality
Niceness
How do we quantify the difference between threads? Define a thread's niceness:
• High bank-level parallelism → vulnerable to interference → increases niceness (+)
• High row-buffer locality → causes interference → decreases niceness (−)
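One plausible way to turn these two arrows into a number is to rank the threads by each metric and subtract the ranks; the formula below is a sketch consistent with the slide's qualitative definition, not necessarily the paper's exact computation.

```python
def niceness(threads):
    """Niceness = BLP rank - RBL rank (higher = nicer).

    A thread high in bank-level parallelism is vulnerable to
    interference (raises niceness); a thread high in row-buffer
    locality causes interference (lowers niceness).
    threads: dict mapping name -> {'blp': ..., 'rbl': ...}
    """
    names = list(threads)
    # Rank 1 = lowest value of the metric, len(names) = highest.
    blp_rank = {n: r for r, n in enumerate(
        sorted(names, key=lambda n: threads[n]['blp']), start=1)}
    rbl_rank = {n: r for r, n in enumerate(
        sorted(names, key=lambda n: threads[n]['rbl']), start=1)}
    return {n: blp_rank[n] - rbl_rank[n] for n in names}
```

Under this sketch the random-access thread from the case study (high BLP, low RBL) comes out nicer than the streaming thread (low BLP, high RBL).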
Shuffling: Round-Robin vs. Niceness-Aware
1. Round-robin shuffling: rotate the priority order every ShuffleInterval
   – GOOD: each thread is prioritized once per round
   – BAD: nice threads receive lots of interference while the least nice thread spends long stretches near the top
2. Niceness-aware shuffling: shuffle so that nicer threads spend more time at high priority
   – GOOD: each thread is still prioritized once per round
   – GOOD: the least nice thread stays mostly deprioritized
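The contrast can be sketched in code. The niceness-aware order below (the top slot rotates while the rest of the order stays nicest-first) is one illustrative schedule, not the paper's exact shuffling pattern.

```python
def round_robin_orders(threads, rounds):
    """Rotate the whole priority order each shuffle interval: every
    thread is most-prioritized exactly once per full round."""
    orders = []
    order = list(threads)
    for _ in range(rounds):
        orders.append(list(order))
        order = order[1:] + order[:1]   # rotate left
    return orders

def niceness_aware_orders(threads_by_niceness, rounds):
    """Each interval a different thread takes the top slot, but the
    remaining slots keep nicer threads high, so the least nice thread
    stays mostly deprioritized."""
    base = list(threads_by_niceness)    # nicest first
    orders = []
    for i in range(rounds):
        top = base[i % len(base)]
        rest = [t for t in base if t != top]  # still nicest-first
        orders.append([top] + rest)
    return orders
```

With four threads D (nicest) through A (least nice), both schemes give each thread the top slot once per four intervals, but the niceness-aware scheme keeps A at the bottom in every interval where it is not on top.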
Quantum-Based Operation
• Time is divided into quanta (~1M cycles); within a quantum, intensive-cluster priorities are shuffled every ShuffleInterval (~1K cycles).
• During a quantum, monitor each thread's behavior: (1) memory intensity, (2) bank-level parallelism, (3) row-buffer locality.
• At the beginning of each quantum: perform clustering and compute the niceness of intensive threads.
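The quantum structure can be expressed as a control loop; `cluster`, `shuffle`, and `schedule_cycle` are placeholders for the clustering, rank-shuffling, and per-cycle scheduling steps described on the other slides.

```python
QUANTUM_CYCLES = 1_000_000   # ~1M cycles per quantum
SHUFFLE_CYCLES = 1_000       # ~1K cycles per shuffle interval

def run_quantum(threads, schedule_cycle, cluster, shuffle,
                quantum=QUANTUM_CYCLES, interval=SHUFFLE_CYCLES):
    """One TCM quantum: cluster once up front, then shuffle the
    intensive cluster's ranks every ShuffleInterval while serving
    requests (and, in hardware, monitoring MPKI / BLP / RBL for the
    next quantum's clustering)."""
    # Beginning of quantum: clustering (and niceness computation).
    non_intensive, intensive = cluster(threads)
    for cycle in range(quantum):
        if cycle % interval == 0:
            intensive = shuffle(intensive)         # periodic rank shuffle
        schedule_cycle(non_intensive, intensive)   # serve one cycle
```

With the default constants this yields roughly 1,000 shuffles per quantum, matching the ~1M-cycle quantum and ~1K-cycle shuffle interval above.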
TCM Scheduling Algorithm
1. Highest rank first: requests from higher-ranked threads are prioritized
   • Non-intensive cluster > intensive cluster
   • Within the non-intensive cluster: lower intensity → higher rank
   • Within the intensive cluster: rank determined by shuffling
2. Row hit first: row-buffer hit requests are prioritized
3. Oldest first: older requests are prioritized
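The three rules map naturally onto a lexicographic sort key for pending requests; the request fields here (`rank`, `row_hit`, `arrival`) are hypothetical names used for illustration.

```python
def tcm_priority_key(req):
    """Sort key implementing TCM's three rules: lower tuples are
    served first.  req needs: 'rank' (thread rank; higher = more
    important), 'row_hit' (bool), 'arrival' (cycle of arrival)."""
    return (
        -req['rank'],          # 1. highest-rank thread first
        not req['row_hit'],    # 2. row-buffer hits first (False < True)
        req['arrival'],        # 3. oldest request first
    )

def pick_next(requests):
    """Choose the next request the memory controller should service."""
    return min(requests, key=tcm_priority_key)
```

Rule 1 dominates, so a higher-ranked thread's row miss still beats a lower-ranked thread's row hit; row hits only break ties within a rank.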
Implementation Costs
Required storage at the memory controller (24 cores):

  Thread memory behavior    Storage
  MPKI                      ~0.2 kbit
  Bank-level parallelism    ~0.6 kbit
  Row-buffer locality       ~2.9 kbit
  Total                     < 4 kbit

• No computation is on the critical path.
Metrics & Methodology
• Metrics: system throughput (weighted speedup) and unfairness (maximum slowdown)
• Methodology
  – Core model: 4 GHz processor, 128-entry instruction window, 512 KB/core L2 cache
  – Memory model: DDR2
  – 96 multiprogrammed SPEC CPU2006 workloads
Previous Work
• FRFCFS [Rixner et al., ISCA '00]: prioritizes row-buffer hits. Thread-oblivious → low throughput & low fairness
• STFM [Mutlu et al., MICRO '07]: equalizes thread slowdowns. Non-intensive threads not prioritized → low throughput
• PAR-BS [Mutlu et al., ISCA '08]: prioritizes the oldest batch of requests while preserving bank-level parallelism. Non-intensive threads not always prioritized → low throughput
• ATLAS [Kim et al., HPCA '10]: prioritizes threads with less attained memory service. The most intensive thread starves → low fairness
Results: Fairness vs. Throughput
Averaged over 96 workloads:
[Chart: maximum slowdown (lower is better fairness) vs. weighted speedup (higher is better system throughput) for FRFCFS, ATLAS, STFM, PAR-BS, and TCM. Relative to ATLAS, TCM improves throughput by ~5% and fairness by ~39%; relative to PAR-BS, TCM improves throughput by ~8% and fairness by ~5%.]
TCM provides the best fairness and system throughput.
Results: Fairness–Throughput Tradeoff
When the configuration parameter is varied:
[Chart: adjusting ClusterThreshold traces a TCM curve that dominates FRFCFS, ATLAS, STFM, and PAR-BS in both maximum slowdown and weighted speedup.]
TCM allows a robust fairness–throughput tradeoff.
Operating System Support
• ClusterThreshold is a tunable knob: the OS can use it to trade off between fairness and throughput
• Enforcing thread weights: the OS assigns weights to threads, and TCM enforces them within each cluster
Conclusion
• No previous memory scheduling algorithm provides both high system throughput and fairness
  – Problem: they use a single policy for all threads
• TCM groups threads into two clusters
  1. Prioritize the non-intensive cluster → throughput
  2. Shuffle priorities within the intensive cluster → fairness
  3. Shuffling should favor nice threads → fairness
• TCM provides the best system throughput and fairness
THANK YOU
Thread Weight Support
• Even if the heaviest-weighted thread happens to be the most intensive thread, it is not prioritized over the least intensive thread.
[Chart: fairness vs. harmonic speedup; axes labeled "better fairness" (up) and "better system throughput" (right).]
Shuffling Algorithm Comparison
• Niceness-aware shuffling yields a lower average and a lower variance of maximum slowdown:

  Shuffling algorithm   E(max slowdown)   VAR(max slowdown)
  Round-robin           5.58              1.61
  Niceness-aware        4.84              0.85
Sensitivity Results

  ShuffleInterval (cycles)   500    600    700    800
  System throughput          14.2   14.3   14.2   14.7
  Maximum slowdown           6.0    5.4    5.9    5.5

  Number of cores                   4     8     16    24    32
  System throughput (vs. ATLAS)     0%    3%    2%    1%    1%
  Maximum slowdown (vs. ATLAS)     -4%   -30%  -29%  -30%  -41%