Dynamic Workload Characterization for Power Efficient Scheduling on CMP Systems
- Slides: 15
Dynamic Workload Characterization for Power Efficient Scheduling on CMP Systems
Gaurav Dhiman (1), Vasileios Kontorinis (1), Dean Tullsen (1), Tajana Rosing (1), Eric Saxe (2), Jonathan Chew (2)
(1) UC San Diego, (2) Oracle Corp.
ISLPED 2010 (International Symposium on Low Power Electronics and Design), Austin

Introduction
• Chip multiprocessor (CMP)/multicore architectures are pervasive in modern systems
• Hierarchy of asymmetric resource sharing among cores:
  – Memory bandwidth
  – Last level cache
  – Pipeline
• Threads scheduled across these cores share these resources:
  – Resource requirements
  – Relative placement
• Both affect overall performance and power efficiency

Motivation
• Modern OSes capture the resource sharing asymmetries
• However, load balancing is based on thread count alone:
  – Resource usage?
  – Resource requirement?

Motivation
[Figure: schedules for a gzip + art + art workload]
• The difference between the best and worst schedule is as high as 70% on a ‘balanced’ system!
• Which threads share the last level cache makes a big difference
  – High contention deteriorates performance and power efficiency

Motivation
• Default scheduler exhibits high variance:
  – Due to ping-ponging between best and worst schedules
  – Frequent pre-emptions by high priority ‘transient threads’
• Not an OS specific problem:
  – Lack of information available to the OS scheduler

Contributions
• Highlight the inability of modern OSes to extract full power efficiency from parallel architectures
  – Lack of resource utilization knowledge accessible to the scheduler
• Identify characteristics of threads that affect resource sharing efficiency, and metrics to capture them
• Uncover and provide a solution to ‘transient threads’
  – Short running kernel threads that impede stable scheduling
• Extend the scheduler to incorporate this logic into the load balancing fabric
• Implement a prototype “Workload Characteristics Aware (WCA) Scheduler”

Transient Threads
• High priority, short running threads
  – Run on the order of microseconds (µs)
  – Examples: java, fsflush, nscd, etc.
• Have little impact on runtimes of long running threads
• Mislead the OS load balancer
[Figure labels: Artificial Load; Almost idle]

Transient Threads
• Identification:
  – Spend most of the time blocked vs running
  – Maintain ratio in the thread data structure
  – Flag as transient if ratio < 1%
• Resolution:
  – Load balance only non transient threads
[Figure labels: Artificial Load; Almost idle]
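
The identification and resolution steps above can be sketched as follows. This is a minimal illustration, not the OpenSolaris implementation: the class and field names are hypothetical, and the ratio is assumed to be the fraction of observed time a thread spends running, which the slide does not spell out.

```python
from dataclasses import dataclass

TRANSIENT_THRESHOLD = 0.01  # the slide's "ratio < 1%"


@dataclass
class ThreadStats:
    """Per-thread accounting; all field names are hypothetical."""
    name: str
    run_time_us: float = 0.0      # accumulated time spent running
    blocked_time_us: float = 0.0  # accumulated time spent blocked

    def is_transient(self) -> bool:
        # Assumed definition: running time as a fraction of observed time.
        total = self.run_time_us + self.blocked_time_us
        if total == 0:
            return False  # no history yet, so do not flag the thread
        return self.run_time_us / total < TRANSIENT_THRESHOLD


def balanced_load(threads):
    """Count only non-transient threads toward the load being balanced."""
    return sum(1 for t in threads if not t.is_transient())
```

With this accounting, a thread that runs for a few microseconds and then blocks for a long time no longer counts as load, so it cannot make an almost idle core look busy.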

Cache Sensitivity
• Two requirements:
  – Identify cache sensitive threads
  – Reconstruct the load balancer to balance them
• Identification metrics:
  – LLCRPI
    • Cache weight = 2
    • Highest degree of sensitivity
  – IPC
    • Cache weight = 1
    • Medium degree of sensitivity
  – Non sensitive
    • Cache weight = 0
    • No sensitivity
• Maintain cache weight of each thread dynamically
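
A minimal sketch of how the three cache weights might be assigned. The slide names the metrics but gives no thresholds or comparison directions, so both are assumptions here: a thread above a hypothetical LLCRPI threshold gets weight 2, otherwise a thread above a hypothetical IPC threshold gets weight 1, otherwise weight 0.

```python
# Hypothetical thresholds; the slides do not give numeric values.
LLCRPI_THRESHOLD = 0.01
IPC_THRESHOLD = 1.0


def cache_weight(llcrpi: float, ipc: float) -> int:
    """Classify a thread into cache weight 2, 1 or 0.

    The comparison directions are assumptions made for illustration.
    """
    if llcrpi >= LLCRPI_THRESHOLD:
        return 2  # highest degree of sensitivity
    if ipc >= IPC_THRESHOLD:
        return 1  # medium degree of sensitivity
    return 0      # no sensitivity
```

Because the weight is maintained dynamically, a function like this would be re-evaluated from fresh counter samples as a thread moves between phases.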

Cache Sensitivity
• Enhance the load representation structure
• Balance number of threads, cache weight (CW) and IPC
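
One way to picture the enhanced load representation is a per-cache-domain record holding thread count, total cache weight and total IPC, compared in that order. The greedy placement below is a sketch under that assumption, with hypothetical names throughout; it is not the WCA scheduler's actual balancing algorithm.

```python
from dataclasses import dataclass


@dataclass
class DomainLoad:
    """Load of one cache-sharing domain (hypothetical structure)."""
    n_threads: int = 0
    cache_weight: int = 0
    ipc: float = 0.0

    def key(self):
        # Compare thread count first, then cache weight, then IPC.
        return (self.n_threads, self.cache_weight, self.ipc)


def place(threads, n_domains):
    """Greedily assign (name, cache_weight, ipc) tuples to domains."""
    loads = [DomainLoad() for _ in range(n_domains)]
    placement = {}
    # Place the most cache-sensitive threads first so they spread out.
    for name, cw, ipc in sorted(threads, key=lambda t: (-t[1], -t[2])):
        target = min(range(n_domains), key=lambda d: loads[d].key())
        loads[target].n_threads += 1
        loads[target].cache_weight += cw
        loads[target].ipc += ipc
        placement[name] = target
    return placement
```

For a gzip + art + art style workload on two cache domains (with made-up weights and IPC values), this puts the two cache-heavy threads on different domains and pairs the insensitive thread with one of them, which is the kind of schedule plain thread counting cannot distinguish from a bad one.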

Methodology
• Implemented the system on OpenSolaris
  – Transient thread characterization
  – Cache sensitivity characterization
  – Load balancing algorithm
• Tested on an Intel Xeon E5430 based machine
• Workloads using 12 SPEC2K benchmarks
  – 3 thread combinations
  – Present the toughest case for the scheduler
• Compare results against the default scheduler:
  – Average weighted Perf/Watt:
    • Captures both system level power consumption and performance
  – Number of thread migrations in the system and execution time stability
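
The slides do not define "average weighted Perf/Watt". One plausible reading, shown here purely as an assumption, is the average of each thread's performance relative to running alone, divided by average system power. The function and parameter names are hypothetical.

```python
def weighted_perf_per_watt(ipc_shared, ipc_alone, avg_power_watts):
    """Average weighted performance per watt (assumed definition).

    ipc_shared: per-thread IPC when co-scheduled
    ipc_alone:  per-thread IPC when running alone
    """
    if len(ipc_shared) != len(ipc_alone) or not ipc_shared:
        raise ValueError("need one alone-IPC per co-scheduled thread")
    if avg_power_watts <= 0:
        raise ValueError("average power must be positive")
    # Weight each thread by how much of its standalone speed it keeps.
    relative = [s / a for s, a in zip(ipc_shared, ipc_alone)]
    avg_weighted_perf = sum(relative) / len(relative)
    return avg_weighted_perf / avg_power_watts
```

Under this reading, a schedule that raises per-thread performance at roughly the same system power raises the metric proportionally.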

Overall Results
• ~14% average improvement in Perf/Watt

Power Efficiency
• Significant speedup at roughly the same power budget
• Better opportunities for idle power savings

Stability Analysis
• 91% reduction in migration rate of threads
• Stable and predictable schedules and run-times
• 89% reduction in execution time standard deviation

Conclusions
• Identify limitations of modern OSes in extracting full power efficiency from modern CMP architectures
• Highlight characteristics of threads that affect cache sharing efficiency, and metrics to capture them
• Identify ‘transient threads’ as an impediment to stable scheduling
• Extend the scheduler to incorporate cache and transient thread management into the load balancing fabric
• Prototype scheduler implementation improves Perf/Watt by up to 30%