EECS 262 a Advanced Topics in Computer Systems

  • Slides: 46
Download presentation
EECS 262 a Advanced Topics in Computer Systems Lecture 12 Multiprocessor/Realtime Scheduling October 8

EECS 262 a Advanced Topics in Computer Systems Lecture 12 Multiprocessor/Realtime Scheduling October 8 th , 2012 John Kubiatowicz and Anthony D. Joseph Electrical Engineering and Computer Sciences University of California, Berkeley http: //www. eecs. berkeley. edu/~kubitron/cs 262

Today’s Papers • Implementing Constant-Bandwidth Servers upon Multiprocessor Platforms Sanjoy Baruah, Jo el Goossens,

Today’s Papers • Implementing Constant-Bandwidth Servers upon Multiprocessor Platforms Sanjoy Baruah, Jo el Goossens, and Giuseppe Lipari. Appears in Proceedings of Real-Time and Embedded Technology and Applications Symposium, (RTAS), 2002. • Composing Parallel Software Efficiently with Lithe Heidi Pan, Benjamin Hindman, Krste Asanovic. Appears in Conference on Programming Languages Design and Implementation (PLDI), 2010 • Thoughts? 10/10/2012 cs 262 a-S 12 Lecture-12 2

The Future is Parallel Software Challenge: How to build many different large parallel apps

The Future is Parallel Software Challenge: How to build many different large parallel apps that run well? v Can’t rely solely on compiler/hardware: limited parallelism & energy efficiency v Can’t rely solely on hand-tuning: limited programmer productivity 10/10/2012 cs 262 a-S 12 Lecture-12 3

Composability is Essential App 1 App 2 BLAS MKL BLAS code reuse same library

Composability is Essential App 1 App 2 BLAS MKL BLAS code reuse same library implementation, different apps Goto BLAS modularity same app, different library implementations Composability is key to building large, complex apps. 10/10/2012 cs 262 a-S 12 Lecture-12 4

Motivational Example Sparse QR Factorization (Tim Davis, Univ of Florida) Column Elimination Tree SPQR

Motivational Example Sparse QR Factorization (Tim Davis, Univ of Florida) Column Elimination Tree SPQR Frontal Matrix Factorization TBB MKL Open. MP OS Hardware Software Architecture 10/10/2012 System Stack cs 262 a-S 12 Lecture-12 5

TBB, MKL, Open. MP • Intel’s Threading Building Blocks (TBB) – Library that allows

TBB, MKL, Open. MP • Intel’s Threading Building Blocks (TBB) – Library that allows programmers to express parallelism using a higher-level, task-based, abstraction – Uses work-stealing internally (i. e. Cilk) – Open-source • Intel’s Math Kernel Library (MKL) – Uses Open. MP for parallelism • Open. MP – Allows programmers to express parallelism in the SPMD-style using a combination of compiler directives and a runtime library – Creates SPMD teams internally (i. e. UPC) – Open-source implementation of Open. MP from GNU (libgomp) 10/10/2012 cs 262 a-S 12 Lecture-12 6

Suboptimal Performance Speedup over Sequential Performance of SPQR on 16 -core AMD Opteron System

Suboptimal Performance Speedup over Sequential Performance of SPQR on 16 -core AMD Opteron System Matrix 10/10/2012 cs 262 a-S 12 Lecture-12 7

Out-of-the-Box Configurations Core 0 1 2 3 TBB Open. MP OS virtualized kernel threads

Out-of-the-Box Configurations Core 0 1 2 3 TBB Open. MP OS virtualized kernel threads Hardware 10/10/2012 cs 262 a-S 12 Lecture-12 8

Providing Performance Isolation Using Intel MKL with Threaded Applications http: //www. intel. com/support/performancetools/libraries/mkl/sb/CS-017177. htm

Providing Performance Isolation Using Intel MKL with Threaded Applications http: //www. intel. com/support/performancetools/libraries/mkl/sb/CS-017177. htm 10/10/2012 cs 262 a-S 12 Lecture-12 9

“Tuning” the Code Speedup over Sequential Performance of SPQR on 16 -core AMD Opteron

“Tuning” the Code Speedup over Sequential Performance of SPQR on 16 -core AMD Opteron System Matrix 10/10/2012 cs 262 a-S 12 Lecture-12 10

Partition Resources Core 0 1 2 3 TBB Open. MP OS Hardware Tim Davis’

Partition Resources Core 0 1 2 3 TBB Open. MP OS Hardware Tim Davis’ “tuned” SPQR by manually partitioning the resources. 10/10/2012 cs 262 a-S 12 Lecture-12 11

“Tuning” the Code (continued) Speedup over Sequential Performance of SPQR on 16 -core AMD

“Tuning” the Code (continued) Speedup over Sequential Performance of SPQR on 16 -core AMD Opteron System Matrix 10/10/2012 cs 262 a-S 12 Lecture-12 12

Harts: Hardware Threads virtualized kernel threads harts OS OS Core 0 Core 1 Core

Harts: Hardware Threads virtualized kernel threads harts OS OS Core 0 Core 1 Core 2 Core 3 v Expose true hardware resources • Applications requests harts from OS • Application “schedules” the harts itself (two-level scheduling) • Can both space-multiplex and timemultiplex harts … but never timemultiplex harts of the same application 10/10/2012 cs 262 a-S 12 Lecture-12 13

Sharing Harts (Dynamically) TBB time Open. MP OS Hardware 10/10/2012 cs 262 a-S 12

Sharing Harts (Dynamically) TBB time Open. MP OS Hardware 10/10/2012 cs 262 a-S 12 Lecture-12 14

How to Share Harts? call graph scheduler hierarchy CLR TBB OMP CLR Cilk TBB

How to Share Harts? call graph scheduler hierarchy CLR TBB OMP CLR Cilk TBB Open. MP v Hierarchically: Caller gives resources to callee to execute v Cooperatively: Callee gives resources back to caller when done 10/10/2012 Cilk cs 262 a-S 12 Lecture-12 15

A Day in the Life of a Hart TBB Sched: next? CLR TBB execute.

A Day in the Life of a Hart TBB Sched: next? CLR TBB execute. TBB task Cilk TBB Sched: next? Open. MP TBB Sched. Q execute TBB task • Non-preemptive scheduling. TBB Sched: next? nothing left to do, give hart back to parent CLR Sched: next? Cilk Sched: next? time 10/10/2012 cs 262 a-S 12 Lecture-12 16

Lithe (ABI) Parent Cilk TBB Scheduler Caller enter yield request register unregister call interface

Lithe (ABI) Parent Cilk TBB Scheduler Caller enter yield request register unregister call interface for exchanging values interface for sharing harts enter yield request register unregister call Child TBB Open. MP Scheduler Callee v 10/10/2012 return Analogous to function call ABI for enabling interoperable codes. cs 262 a-S 12 Lecture-12 17

A Few Details … • A hart is only managed by one scheduler at

A Few Details … • A hart is only managed by one scheduler at a time • The Lithe runtime manages the hierarchy of schedulers and the interaction between schedulers • Lithe ABI only a mechanism to share harts, not policy 10/10/2012 cs 262 a-S 12 Lecture-12 18

Putting It All Together Lithe-TBB Sched. Q func(){ register(TBB); request(2); unregister(TBB); Lithe-TBB Sched. Q

Putting It All Together Lithe-TBB Sched. Q func(){ register(TBB); request(2); unregister(TBB); Lithe-TBB Sched. Q enter(TBB); yield(); } time 10/10/2012 cs 262 a-S 12 Lecture-12 19

Synchronization • Can’t block a hart on a synchronization object • Synchronization objects are

Synchronization • Can’t block a hart on a synchronization object • Synchronization objects are implemented by saving the current “context” and having the hart re-enter the current scheduler #pragma omp barrier (block context) enter TBB Scheduler enter yield request register unregister yield #pragma omp barrier (unblock context) request(1) enter Open. MP Scheduler (resume context) enter yield request register unregister enter time 10/10/2012 cs 262 a-S 12 Lecture-12 20

Lithe Contexts • Includes notion of a stack • Includes context-local storage • There

Lithe Contexts • Includes notion of a stack • Includes context-local storage • There is a special transition context for each hart that allows it to transition between schedulers easily (i. e. on an enter, yield) 10/10/2012 cs 262 a-S 12 Lecture-12 21 21

Lithe-compliant Schedulers • TBB – Worker model – ~180 lines added, ~5 removed, ~70

Lithe-compliant Schedulers • TBB – Worker model – ~180 lines added, ~5 removed, ~70 modified (~1, 500 / ~8, 000 total) • Open. MP – Team model – ~220 lines added, ~35 removed, ~150 modified (~1, 000 / ~6, 000 total) 10/10/2012 cs 262 a-S 12 Lecture-12 22 22

Overheads? • TBB – Example micro-benchmarks that Intel includes with releases • Open. MP

Overheads? • TBB – Example micro-benchmarks that Intel includes with releases • Open. MP – NAS benchmarks (conjugate gradient, LU solver, and multigrid) 10/10/2012 cs 262 a-S 12 Lecture-12 23 23

Flickr Application Server • Graphics. Magick parallelized using Open. MP • Server component parallelized

Flickr Application Server • Graphics. Magick parallelized using Open. MP • Server component parallelized using threads (or libprocesses) • Spectrum of possible implementations: – Process one image upload at a time, pass all resources to Open. MP (via Graphics. Magick) + Easy implementation - Can’t overlap communication with computation, some network links are slow, images are different sizes, diminishing returns on resize operations – Process as many images as possible at a time, run Graphics. Magick sequentially + Also easy implementation - Really bad latency when low-load on server, 32 core machine underwhelmed – All points in between … + Account for changing load, different image sizes, different link bandwidth/latency - Hard to program 10/10/2012 cs 262 a-S 12 Lecture-12 24

Flickr-Like App Server Libprocess Graphics Magick Open. MPLithe (Lithe) Tradeoff between throughput saturation point

Flickr-Like App Server Libprocess Graphics Magick Open. MPLithe (Lithe) Tradeoff between throughput saturation point and laten 25 10/10/2012 cs 262 a-S 12 Lecture-12 25

Case Study: Sparse QR Factorization • Different matrix sizes • delta. X creates ~30,

Case Study: Sparse QR Factorization • Different matrix sizes • delta. X creates ~30, 000 Open. MP schedulers • … • Rucci creates ~180, 000 Open. MP schedulers • Platform: Dual-socket 2. 66 GHz Intel Xeon (Clovertown) with 4 cores per socket (8 total cores) 10/10/2012 cs 262 a-S 12 Lecture-12 26

Case Study: Sparse QR Factorization ESOC Rucci Tuned: 70. 8 Tuned: 360. 0 Out-of-the-box:

Case Study: Sparse QR Factorization ESOC Rucci Tuned: 70. 8 Tuned: 360. 0 Out-of-the-box: 111. 8 Out-of-the-box: 576. 9 Sequential: 172. 1 Sequential: 970. 5 Lithe: 354. 7 Lithe: 66. 7 10/10/2012 cs 262 a-S 12 Lecture-12 27

Case Study: Sparse QR Factorization delta. X landmark Tuned: 14. 5 Tuned: 2. 5

Case Study: Sparse QR Factorization delta. X landmark Tuned: 14. 5 Tuned: 2. 5 Out-of-the-box: 26. 8 Out-of-the-box: 4. 1 Sequential: 37. 9 Sequential: 3. 4 Lithe: 2. 3 Lithe: 13. 6 10/10/2012 cs 262 a-S 12 Lecture-12 28

Is this a good paper? • What were the authors’ goals? • What about

Is this a good paper? • What were the authors’ goals? • What about the evaluation/metrics? • Did they convince you that this was a good system/approach? • Were there any red-flags? • What mistakes did they make? • Does the system/approach meet the “Test of Time” challenge? • How would you review this paper today? 10/10/2012 cs 262 a-S 12 Lecture-12 29

Characteristics of a RTS Slides adapted from Frank Drew • Extreme reliability and safety

Characteristics of a RTS Slides adapted from Frank Drew • Extreme reliability and safety – Embedded systems typically control the environment in which they operate – Failure to control can result in loss of life, damage to environment or economic loss • Guaranteed response times – We need to be able to predict with confidence the worst case response times for systems – Efficiency is important but predictability is essential » In RTS, performance guarantees are: • Task- and/or class centric • Often ensured a priori » In conventional systems, performance is: • System oriented and often throughput oriented • Post-processing (… wait and see …) • Soft Real-Time – Attempt to meet deadlines with high probability – Important for multimedia applications 10/10/2012 cs 262 a-S 12 Lecture-12 30

Terminology • Scheduling: – Define a policy of how to order tasks such that

Terminology • Scheduling: – Define a policy of how to order tasks such that a metric is maximized/minimized – Real-time: guarantee hard deadlines, minimize the number of missed deadlines, minimize lateness • Dispatching: – Carry out the execution according to the schedule – Preemption, context switching, monitoring, etc. • Admission Control: – Filter tasks coming into they systems and thereby make sure the admitted workload is manageable • Allocation: – Designate tasks to CPUs and (possibly) nodes. Precedes scheduling 10/10/2012 cs 262 a-S 12 Lecture-12 31

Non-Real-Time Scheduling • Primary Goal: maximize performance • Secondary Goal: ensure fairness • Typical

Non-Real-Time Scheduling • Primary Goal: maximize performance • Secondary Goal: ensure fairness • Typical metrics: – Minimize response time – Maximize throughput – E. g. , FCFS (First-Come-First-Served), RR (Round-Robin) 10/10/2012 cs 262 a-S 12 Lecture-12 32

Example: Workload Characteristics • Tasks are preemptable, independent with arbitrary arrival (=release) times •

Example: Workload Characteristics • Tasks are preemptable, independent with arbitrary arrival (=release) times • Times have deadlines (D) and known computation times (C) • Tasks execute on a uniprocessor system • Example Setup: 10/10/2012 cs 262 a-S 12 Lecture-12 33

Example: Non-preemptive FCFS Scheduling 10/10/2012 cs 262 a-S 12 Lecture-12 34

Example: Non-preemptive FCFS Scheduling 10/10/2012 cs 262 a-S 12 Lecture-12 34

Example: Round-Robin Scheduling 10/10/2012 cs 262 a-S 12 Lecture-12 35

Example: Round-Robin Scheduling 10/10/2012 cs 262 a-S 12 Lecture-12 35

Real-Time Scheduling • Primary goal: ensure predictability • Secondary goal: ensure predictability • Typical

Real-Time Scheduling • Primary goal: ensure predictability • Secondary goal: ensure predictability • Typical metrics: – Guarantee miss ration = 0 (hard real-time) – Guarantee Probability(missed deadline) < X% (firm realtime) – Minimize miss ration / maximize completion ration (firm real-time) – Minimize overall tardiness; maximize overall usefulness (soft real-time) • E. g. , EDF (Earliest Deadline First), LLF (Least Laxity First), RMS (Rate-Monotonic Scheduling), DM (Deadline Monotonic Scheduling) • Real-time is about enforcing predictability, and does not equal to fast computing!!! 10/10/2012 cs 262 a-S 12 Lecture-12 36

Scheduling: Problem Space • Uni-processor / multiprocessor / distributed system • Periodic / sporadic

Scheduling: Problem Space • Uni-processor / multiprocessor / distributed system • Periodic / sporadic /aperiodic tasks • Independent / interdependant tasks • • Preemptive / non-preemptive Tick scheduling / event-driven scheduling Static (at design time) / dynamic (at run-time) Off-line (pre-computed schedule), on-line (scheduling decision at runtime) • Handle transient overloads • Support Fault tolerance 10/10/2012 cs 262 a-S 12 Lecture-12 37

Task Assignment and Scheduling • Cyclic executive scheduling ( later) • Cooperative scheduling –

Task Assignment and Scheduling • Cyclic executive scheduling ( later) • Cooperative scheduling – scheduler relies on the current process to give up the CPU before it can start the execution of another process • A static priority-driven scheduler can preempt the current process to start a new process. Priorities are set pre-execution – E. g. , Rate-monotonic scheduling (RMS), Deadline Monotonic scheduling (DM) • A dynamic priority-driven scheduler can assign, and possibly also redefine, process priorities at run-time. – Earliest Deadline First (EDF), Least Laxity First (LLF) 10/10/2012 cs 262 a-S 12 Lecture-12 38

Simple Process Model • • Fixed set of processes (tasks) Processes are periodic, with

Simple Process Model • • Fixed set of processes (tasks) Processes are periodic, with known periods Processes are independent of each other System overheads, context switches etc, are ignored (zero cost) • Processes have a deadline equal to their period – i. e. , each process must complete before its next release • Processes have fixed worst-case execution time (WCET) 10/10/2012 cs 262 a-S 12 Lecture-12 39

Performance Metrics • • • 10/10/2012 Completion ratio / miss ration Maximize total usefulness

Performance Metrics • • • 10/10/2012 Completion ratio / miss ration Maximize total usefulness value (weighted sum) Maximize value of a task Minimize lateness Minimize error (imprecise tasks) Feasibility (all tasks meet their deadlines) cs 262 a-S 12 Lecture-12 40

Scheduling Approaches (Hard RTS) • Off-line scheduling / analysis (static analysis + static scheduling)

Scheduling Approaches (Hard RTS) • Off-line scheduling / analysis (static analysis + static scheduling) – – – All tasks, times and priorities given a priori (before system startup) Time-driven; schedule computed and hardcoded (before system startup) E. g. , Cyclic Executives Inflexible May be combined with static or dynamic scheduling approaches • Fixed priority scheduling (static analysis + dynamic scheduling) – All tasks, times and priorities given a priori (before system startup) – Priority-driven, dynamic(!) scheduling » The schedule is constructed by the OS scheduler at run time – For hard / safety critical systems – E. g. , RMA/RMS (Rate Monotonic Analysis / Rate Monotonic Scheduling) • Dynamic priority schededuling – – Tasks times may or may not be known Assigns priorities based on the current state of the system For hard / best effort systems E. g. , Least Completion Time (LCT), Earliest Deadline First (EDF), Least Slack Time (LST) 10/10/2012 cs 262 a-S 12 Lecture-12 41

Cyclic Executive Approach • Clock-driven (time-driven) scheduling algorithm • Off-line algorithm • Minor Cycle

Cyclic Executive Approach • Clock-driven (time-driven) scheduling algorithm • Off-line algorithm • Minor Cycle (e. g. 25 ms) gcd of all periods • Major Cycle (e. g. 100 ms) lcm of all periods - Period Comp. Time A 25 10 B 25 8 C 50 5 D 50 4 E 100 2 - Construction of a cyclic executive is equivalent to bin packing 10/10/2012 Process cs 262 a-S 12 Lecture-12 42

Cyclic Executive (cont. ) Frank Drews 10/10/2012 Real-Time Systems cs 262 a-S 12 Lecture-12

Cyclic Executive (cont. ) Frank Drews 10/10/2012 Real-Time Systems cs 262 a-S 12 Lecture-12 43

Cyclic Executive: Observations • No actual processes exist at run-time – Each minor cycle

Cyclic Executive: Observations • No actual processes exist at run-time – Each minor cycle is just a sequence of procedure calls • The procedures share a common address space and can thus pass data between themselves. – This data does not need to be protected (via semaphores, mutexes, for example) because concurrent access is not possible • All ‘task’ periods must be a multiple of the minor cycle time 10/10/2012 cs 262 a-S 12 Lecture-12 44

Cyclic Executive: Disadvantages • With the approach it is difficult to: • incorporate sporadic

Cyclic Executive: Disadvantages • With the approach it is difficult to: • incorporate sporadic processes; • incorporate processes with long periods; – Major cycle time is the maximum period that can be accommodated without secondary schedules (=procedure in major cycle that will call a secondary procedure every N major cycles) • construct the cyclic executive, and • handle processes with sizeable computation times. – Any ‘task’ with a sizeable computation time will need to be split into a fixed number of fixed sized procedures. 10/10/2012 cs 262 a-S 12 Lecture-12 45

Next Time: Online Scheduling 10/10/2012 cs 262 a-S 12 Lecture-12 46

Next Time: Online Scheduling 10/10/2012 cs 262 a-S 12 Lecture-12 46