Optimal Power Allocation for Multiprogrammed Workloads on Single-chip Heterogeneous Processors
Euijin Kwon (1, 2), Jae Young Jang (2), Jae W. Lee (2), Nam Sung Kim (2, 3)

Single-chip heterogeneous processors
• Compared to systems based on discrete components
  - Lower communication overhead
  - Lower power consumption
  - Lower cost (less silicon)
  - Emerging application friendly (sequential + parallel processing)
• Examples: AMD’s Llano, Intel’s Sandy Bridge, Samsung’s Exynos
Sources: AMD, Intel, and Samsung

Challenges
• SCHP’s performance: limited by power budget
  - Total chip power budget
  - CPU/GPU power budget
• Multiprogrammed workload
  - Workload-aware power allocation
  - Considering characteristics and metrics
How can we optimize overall performance within a limited power budget?

Outline
• Motivation
• Target platform: SCHP + MW
• Workload-aware power allocation
  - Characteristics of programs
  - Evaluation metrics
• Methodology
  - Power configuration
  - Benchmark programs
• Evaluation
• Algorithm
• Conclusion

Target platform: SCHP + MW
• 4-core CPU + 16-SM GPU
  - Multiple V/F domains
  - DVFS
• 2 programs running
  - Hardware resources evenly divided
[Figure: block diagram of the target platform running a multiprogrammed workload (Program 1, Program 2): CPU cores 0 to 3 with per-core V/F domains, GPU 0 and GPU 1 each in its own V/F domain, and memory controllers (MCs) in a separate V/F domain]

Workload-aware power allocation
• Characteristics of programs
  - Non-uniform performance sensitivities
• Evaluation metrics
  - Throughput vs. energy efficiency
[Figure: normalized throughput (y-axis, 0.8 to 2.0) vs. power allocation using the same HW (x-axis: 28.6, 34.2, 39.8, 48.6, 59.0) for a compute-bound program (mri-q) and a memory-bound program (stream-copy); annotation: "Allocating more power to mri-q"]

Methodology: shared power budget
• Total chip power budget = 100 W
  - CPU power budget = 80 W
  - GPU power budget = 64 W
• Baseline configuration
  - Evenly divided (25 W for each CPU/GPU group)
• Can change the power budget for CPU 1, CPU 2, GPU 1, and GPU 2
• Output: throughput, energy efficiency
[Figure: power configuration diagram for CPU 1, CPU 2, GPU 1, and GPU 2; power values shown: 11.2, 16.8, 17.4, 22.4, 24.8, 31.2, 34.2, 41.6, 46.4, 62.8]
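The budget figures on this slide can be read as three upper bounds that any power configuration has to respect. The sketch below is a minimal illustration under that assumption (the slide does not state how the limits are enforced); `PowerConfig` and `is_feasible` are hypothetical names introduced here.

```python
from dataclasses import dataclass

# Budgets taken from the slide (watts).
TOTAL_BUDGET_W = 100.0
CPU_BUDGET_W = 80.0
GPU_BUDGET_W = 64.0


@dataclass(frozen=True)
class PowerConfig:
    """Power assigned to each CPU/GPU group, in watts (hypothetical type)."""
    cpu1: float
    cpu2: float
    gpu1: float
    gpu2: float

    def is_feasible(self) -> bool:
        # Assumption: each budget is an upper bound on the matching sum.
        cpu = self.cpu1 + self.cpu2
        gpu = self.gpu1 + self.gpu2
        return (cpu <= CPU_BUDGET_W
                and gpu <= GPU_BUDGET_W
                and cpu + gpu <= TOTAL_BUDGET_W)


# Baseline from the slide: 25 W for each CPU/GPU group.
BASELINE = PowerConfig(25.0, 25.0, 25.0, 25.0)
```

Under this reading the baseline uses exactly the 100 W chip budget, so giving more power to one group means taking it from another.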

Methodology: benchmark programs
• Used 6 benchmark programs
• Divided into 3 groups depending on characteristics

Benchmark                      Acronym  Source       Characteristics
Magnetic Resonance Imaging Q   MRQ      Parboil      Compute-bound
Stream Cluster                 SCL      Rodinia      Compute-bound
Hotspot                        HOT      Rodinia      Neutral
Sum of Absolute Difference     SAD      Parboil      Neutral
Stencil                        STN      Parboil      Memory-bound
Stream Copy                    SCP      CS Virginia  Memory-bound
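The grouping can be written down as a small lookup table, which also yields the two-letter pair tags used in the results chart (for example "MRQ vs. SCP (CM)"). This is only an illustrative sketch; `BENCHMARKS` and `pair_label` are hypothetical names.

```python
# Acronym -> group, from the benchmark table:
# C = compute-bound, N = neutral, M = memory-bound.
BENCHMARKS = {
    "MRQ": "C",
    "SCL": "C",
    "HOT": "N",
    "SAD": "N",
    "STN": "M",
    "SCP": "M",
}


def pair_label(p1: str, p2: str) -> str:
    """Return the two-letter tag for a program pair, e.g. 'CM'."""
    return BENCHMARKS[p1] + BENCHMARKS[p2]
```

For instance, `pair_label("HOT", "MRQ")` gives `"NC"`, matching the "HOT vs. MRQ (NC)" entry in the chart.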

Evaluation: case study 1 (compute- vs. memory-bound)
• 19% throughput improvement
• 32% energy efficiency improvement
• Allocating more power to the compute-bound program
• Optimal points vary depending on metrics

Evaluation: case study 2 (memory- vs. memory-bound)
• 10% throughput improvement
• 32% energy efficiency improvement
• Equally allocated power
• Again, optimal point depends on
  - Evaluation metric
  - Workload characteristics (compute- or memory-bound)

Evaluation: variation of optimal configuration
• Depending on programs’ characteristics and evaluation metrics
[Table (layout not recoverable): optimal power configuration for each program pair (P1, P2), with programs tagged C (compute-bound), M (memory-bound), or N (neutral): MRQ (C), SCL (C), HOT (N), SAD (N), STN (M), SCP (M). Column groups: "Metric 1: throughput" and "Metric 2: energy efficiency", each with P1 (Watt) and P2 (Watt) for CPU and GPU. Values under metric 1: 17.4, 31.2, 17.4, 41.6, 17.4, 22.4. Values under metric 2: 17.4, 16.8, 17.4, 11.2, 17.4, 16.8, 17.4, 22.4, 17.4, 16.8, 17.4, 11.2, 17.4, 22.4]

Evaluation: performance improvement from optimal power allocation
• Achieved significant improvement
  - 12% for throughput
  - 18% for energy efficiency
[Figure: bar chart of normalized IPS/W (y-axis, 0.9 to 1.4) for the program pairs MRQ vs. SCL (CC), SCP vs. STN (MM), SAD vs. HOT (NN), MRQ vs. SCP (CM), SCL vs. SCP (CM), HOT vs. MRQ (NC), MRQ vs. SAD (CN), SCL vs. SAD (CN), HOT vs. STN (NM), SAD vs. SCP (NM), and GEOMEAN]
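The GEOMEAN bar summarizes the per-pair normalized results with a geometric mean. A minimal sketch of that calculation follows; the helper name `geomean` and any input values passed to it are illustrative and are not taken from the slides.

```python
import math
from typing import Sequence


def geomean(values: Sequence[float]) -> float:
    """Geometric mean of positive normalized results (baseline = 1.0)."""
    if not values:
        raise ValueError("need at least one value")
    if any(v <= 0 for v in values):
        raise ValueError("values must be positive")
    # Averaging in log space avoids overflow from a long product.
    return math.exp(sum(math.log(v) for v in values) / len(values))
```

A geometric mean is the usual choice for ratios normalized to a baseline, since it treats a 2x gain and a 2x loss as cancelling out.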

Algorithm for throughput maximization
[Flowchart]
• calculate(slope)
• abs(sp1 - sp2) < threshold?
  - YES: alloc(equally)
  - NO: sp1 > sp2?
    - YES: alloc(p1_more)
    - NO: alloc(p2_more)
• wait(regular_time), then repeat
[Figure: normalized throughput (y-axis, 0.8 to 2.0) vs. power allocation (x-axis: 28.6, 34.2, 39.8, 48.6, 59.0) for compute-bound (mri-q) and memory-bound (stream-copy)]
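One decision step of the flowchart can be sketched as follows. The slides do not say how the slope is measured; here it is assumed to be the change in normalized throughput per unit of added power between two samples. The function names are hypothetical, and the returned strings simply mirror the `alloc(...)` labels in the flowchart.

```python
def slope(power_lo: float, thr_lo: float,
          power_hi: float, thr_hi: float) -> float:
    """Assumed definition: throughput gain per unit of extra power."""
    if power_hi == power_lo:
        raise ValueError("power samples must differ")
    return (thr_hi - thr_lo) / (power_hi - power_lo)


def decide_allocation(sp1: float, sp2: float, threshold: float) -> str:
    """One pass through the flowchart's decision diamonds."""
    # Similar sensitivities: neither program benefits more from extra power.
    if abs(sp1 - sp2) < threshold:
        return "equally"
    # Otherwise favor the program with the steeper throughput curve.
    return "p1_more" if sp1 > sp2 else "p2_more"
```

In a run-time loop, this decision would be re-evaluated after each `wait(regular_time)` interval, so the allocation can follow phase changes in either program.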

Algorithm for energy efficiency maximization
• Gradient search from the minimum power allocation
[Flowchart]
• final = min_power
• MAX = max( EE(final), EE(final, p1++), EE(final, p2++) )
• EE(final) == MAX?
  - YES: exit
  - NO: EE(final, p1++) > EE(final, p2++)?
    - YES: final = (final, p1++)
    - NO: final = (final, p2++)
• Repeat from the MAX step
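The gradient search can be sketched as a greedy hill climb over discrete power levels. This is an illustration under stated assumptions, not the authors' implementation: power levels are represented as indices starting at 0 (the minimum), `ee` is a hypothetical callback that measures energy efficiency at a given pair of levels, and the shared power budget is not checked.

```python
from typing import Callable, Tuple


def maximize_ee(ee: Callable[[int, int], float],
                n1: int, n2: int) -> Tuple[int, int]:
    """Greedy search from the minimum power level of both programs.

    ee(i, j): measured energy efficiency with program 1 at level i
              and program 2 at level j (hypothetical callback).
    n1, n2:   number of available power levels per program.
    """
    i, j = 0, 0  # final = min_power
    while True:
        here = ee(i, j)
        # Evaluate one step up for each program, if a higher level exists.
        up1 = ee(i + 1, j) if i + 1 < n1 else float("-inf")
        up2 = ee(i, j + 1) if j + 1 < n2 else float("-inf")
        if here >= max(up1, up2):  # EE(final) == MAX: stop
            return i, j
        if up1 > up2:
            i += 1  # final = (final, p1++)
        else:
            j += 1  # final = (final, p2++)
```

Each iteration raises exactly one level, so the loop ends after at most `n1 + n2 - 2` steps. As with any greedy search, it stops at the first local maximum, which is consistent with the slides' claim of finding a (near-)optimal allocation.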

Conclusion
• We propose a solution for optimal power allocation
  - Workload-aware power allocation
  - By using program characteristics and evaluation metrics
• Significant performance improvement achieved
  - 12% for throughput
  - 18% for energy efficiency
• Run-time algorithms effectively find (near-)optimal power allocation

Backup slides

Simulator
• Integrated CPU + GPU simulator
  - gem5 + GPGPU-Sim
  - H. Wang, V. Sathish, R. Singh, M. Schulte and N. Kim, "Workload and Power Budget Partitioning for Single-Chip Heterogeneous Processors," in PACT, 2012.
  - http://cpu-gpu-sim.ece.wisc.edu/
• Adaptive power allocation for multiprogrammed workload
  - Per-core V/F domains for CPU
  - 2 V/F domains for GPU