Energy-aware scheduling for asymmetric multi-cores
Daniel Mosse
Computer Science Department, University of Pittsburgh

Collaborators:
  • Vinicius Petrucci, Orlando Loques (Brazil)
  • Rami Melhem, Luca Lugini (Pitt)
  • Adriano Maron (ANSYS)
  • Neven Abou Gazala, Sameh Gobriel (Intel Labs)

Cores are the New Gates (Shekhar Borkar, Intel)
[Figure: number of cores per chip (1 to 512) versus year (1975 to 2010), with chips grouped as unicore, homogeneous multicore, and heterogeneous multicore. Courtesy: Gordon '06]

Non-homogeneous multi-core systems
  • Emerging and attractive alternative to homogeneous systems
      o improved performance and energy efficiency
  • Different core types (large/small) are used to
      o run each thread on the core type best suited to it
      o satisfy the time-varying demands (e.g., compute-intensive or memory-intensive) of a range of threads
  • Different hardware capabilities
      o Cache size
      o Frequency
      o Architecture
      o …
Mosse: HetCMP+energy

Introduction: Some background
Chip Multicore Processors (CMPs) have been commercially available for 10+ years:
  • High-performance CMPs: suitable for CPU-intensive computations
  • Low-power CMPs: simplified architecture, initially designed for entry-level laptops and mobile devices
Asymmetric Chip Multicore Processors (ACMPs) [Kumar et al. 2003]
  • Offer a new opportunity to reduce power consumption by running non-CPU-intensive applications on the low-power cores.
[Kumar et al. 2003] "Single-ISA heterogeneous multi-core architectures: the potential for processor power reduction," in MICRO-36, 2003.
10/29/2020 CS 3530 - Advanced Topics in Distributed and Real-time

Heterogeneous multi-core systems

Introduction: Some background
The major players…
Samsung Exynos 5 and 7 Octacore
  - big.LITTLE
  - big: Cortex-A15
  - LITTLE: Cortex-A7
  - Exynos 5 5410: only 4 CPU cores active at a time; 3 GPU cores at 533 MHz
  - Exynos 5 5422: all 8 CPU cores simultaneously active; 6 GPU cores at 533 MHz
Qualcomm Snapdragon 808 and 810
  - big.LITTLE
  - big: Cortex-A57
  - LITTLE: Cortex-A53
  - Snapdragon 808: 2 big + 4 LITTLE; Adreno 418
  - Snapdragon 810: 4 big + 4 LITTLE; Adreno 430

Challenges
  • Assignment: match threads and core/memory
  • Dynamic vs static scheduling
  • Real-time vs general purpose
  • Global vs partitioned scheduling
  • Cache partition vs cache sharing
  • Inclusive vs exclusive cache
  • Bus bandwidth partitioning vs sharing
  • Memory allocation
  • Memory bank distribution
  • …

Typical datacenter workload
[Figure: load fluctuation and power consumption of Web-search running on Google servers* (QPS = Queries Per Second)]
Energy consumption is not proportional to the amount of computation!
* Meisner et al., "Power management of online data-intensive services," ISCA 2011.

Typical server workload: Twitter
[Figure. Source: ASPLOS 14, Delimitrou]

Introduction: The opportunity
Deadlines are pessimistic and based on worst-case execution time.
[Figure: x264 video encoding on 4 big cores, frames over time against the deadline. Phases 1, 2, and 3 are marked, with core labels big, LITTLE, big. Annotation: "Opportunity to save energy!!!"]

Opportunity: ACMP
  • Power efficiency improvement
  • Real system evaluation on Intel QuickIA (Atom + Xeon)
Small cores can be 7-13x more power-efficient than big cores

Performance: latency
Tail latency: meet QoS for 90% of requests…
[Figure: Web-search running on Intel QuickIA]
Big brawny cores achieve lower latency at all load levels
But small wimpy cores still meet the QoS at low load using much less power!

Scheduling HetCMP
Insight: Exploit load fluctuation to improve energy efficiency and meet QoS
  • Low load: Wimpy cores to reduce power with satisfactory QoS
  • High load: Brawny cores to guarantee QoS

Challenges
  • Tension between responsiveness and stability
      o Responsiveness
          § A short task migration interval reacts quickly, capturing time-varying workload fluctuations
      o Stability
          § Avoid over-reaction to load fluctuations; it can cause oscillatory behavior
          § Consider system settling time (observe the effects of task migrations)

Responsiveness and stability
[Figure annotations: Fast reaction! Slow reaction… QoS violations! Over-reaction!!! QoS violations!]

Octopus-Man: Solution
  • Octopus-Man monitor
      o Application-level latency monitoring
  • Octopus-Man Mapper
      o Task-to-core management for QoS guarantee and energy efficiency

Octopus-Man Mapper: Designs
1) PID control system
      o pros: well-known control methodology
      o cons: parameter tuning via extensive offline app profiling
2) Deadzone-based control system
      o pros: simple online scheme based on QoS thresholds
      o cons: sensitive to threshold parameter selection
  • Can either effectively provide high QoS while maximizing energy efficiency?
  • Responsiveness and Stability

Design 1: PID control system
GOAL: To keep the controlled system running as close as possible to its specified QoS target
[Figure: feedback control loop. r(t) = QoS target (e.g., 90th-percentile latency); y(t) = monitored QoS; controller output: computing resources]
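The slides do not give the controller equations or gains, so the following is only a minimal sketch of a textbook discrete PID loop applied to this setting. The class and function names, the gain values, the 100 ms target, and the idea of turning the control signal into a count of big cores are all assumptions made for illustration, not the Octopus-Man implementation.

```python
class PIDController:
    """Discrete PID controller (textbook form); gains are illustrative, not tuned."""

    def __init__(self, kp, ki, kd, target):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.target = target        # r(t): QoS target, e.g. 90th-percentile latency
        self.integral = 0.0
        self.prev_error = None

    def update(self, measured, dt=1.0):
        """Return the control signal for one sampling interval of length dt."""
        error = self.target - measured      # e(t) = r(t) - y(t)
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0                # no history on the first sample
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def cores_from_signal(signal, current_big, total_cores):
    """Hypothetical actuator: latency above target yields a negative signal,
    so subtracting it raises the number of big cores in use."""
    wanted = round(current_big - signal)
    return max(0, min(total_cores, wanted))


# Example run with made-up gains and made-up monitored tail latencies (ms).
pid = PIDController(kp=0.02, ki=0.005, kd=0.0, target=100.0)
big_cores = 2
for latency_ms in [150.0, 130.0, 90.0]:
    big_cores = cores_from_signal(pid.update(latency_ms), big_cores, total_cores=4)
```

The gains decide how aggressively the loop reacts, which is where the offline profiling cost listed under "cons" comes from: each application would need its own tuned values.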

QoS Metric / Control Variable
x → p-quantile
Luciano Bertini, FeBID 2007, Munich, Germany, May 25th, 2007
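A p-quantile control variable such as the 90th-percentile latency can be computed over a sliding window of recent requests. The sketch below is one possible way to do it; the class name `LatencyMonitor`, the window size, and the nearest-rank quantile definition are assumptions, since the slides do not say how the quantile is estimated.

```python
import math
from collections import deque


class LatencyMonitor:
    """Sliding-window p-quantile of request latencies (hypothetical helper)."""

    def __init__(self, p=0.90, window=1000):
        if not 0.0 < p <= 1.0:
            raise ValueError("p must be in (0, 1]")
        self.p = p
        self.samples = deque(maxlen=window)   # oldest samples drop out automatically

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def quantile(self):
        """Nearest-rank p-quantile: the smallest sample with at least
        a fraction p of the window at or below it. None if the window is empty."""
        if not self.samples:
            return None
        ordered = sorted(self.samples)
        # Small epsilon guards against floating-point error in p * n.
        rank = max(1, math.ceil(self.p * len(ordered) - 1e-9))   # 1-based rank
        return ordered[rank - 1]


monitor = LatencyMonitor(p=0.90, window=10)
for latency in [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]:
    monitor.record(latency)
p90 = monitor.quantile()   # 9th smallest of the 10 samples
```

Sorting the whole window on every query is simple but costs O(n log n); a production monitor would more likely use a histogram or a streaming quantile estimator.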

Design 2: Deadzone State Machine
  • QoS alert: QoS variable > QoS target * UP_THR
  • QoS safe: QoS variable < QoS target * DOWN_THR
The deadzone thresholds impact the stability of the mapping algorithm!
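The two threshold rules can be sketched as a small state machine. The threshold values, the name of the in-between "hold" state, and the notion of a ladder of core configurations ordered by compute capacity are assumptions for illustration; the slides give only the alert and safe conditions.

```python
UP_THR = 0.8      # assumed value; the slides do not state the thresholds
DOWN_THR = 0.3    # assumed value


def classify(qos_value, qos_target, up_thr=UP_THR, down_thr=DOWN_THR):
    """Map a monitored QoS value (e.g. tail latency) to a deadzone state."""
    if qos_value > qos_target * up_thr:
        return "alert"      # too close to (or past) the target: add capacity
    if qos_value < qos_target * down_thr:
        return "safe"       # plenty of slack: capacity can be reduced
    return "hold"           # inside the deadzone: keep the current mapping


def next_level(level, state, max_level):
    """Move one step along a hypothetical ladder of core configurations,
    ordered from least (0) to most (max_level) compute capacity."""
    if state == "alert":
        return min(level + 1, max_level)
    if state == "safe":
        return max(level - 1, 0)
    return level


# Example: made-up tail latencies (ms) against a 100 ms target.
level = 1
for latency_ms in [90.0, 50.0, 20.0]:
    level = next_level(level, classify(latency_ms, 100.0), max_level=3)
```

A wide gap between the two thresholds makes the mapping change less often (more stable, slower to save energy), while a narrow gap reacts faster but risks oscillating between configurations.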

Experimental Platform: Intel QuickIA

All-brawny (Static) baseline: Websearch
[Figure panels: QoS, Core Mapping, Throughput. Annotation: "Latency slack!"]

PID vs Deadzone: websearch
[Figure panels: PID control, Deadzone control]

QoS results

Energy efficiency gains
[Figure panels: Memcached, Web-search]

Improving throughput with colocation
Web-search co-located with SPEC programs
Improvement of 34% (mean) and 50% (max) in batch throughput