Outline OS schedulers Unix scheduling Linux 2 4

  • Slides: 81
Download presentation

Outline • • • OS schedulers Unix scheduling Linux 2. 4 scheduler Linux 2.

Outline • • • OS schedulers Unix scheduling Linux 2. 4 scheduler Linux 2. 6 scheduler – O(1) scheduler – O(2) scheduler 2

Introduction preemptive & cooperative multitasking • A multitasking operating system is one that can

Introduction preemptive & cooperative multitasking • A multitasking operating system is one that can simultaneously interleave execution of more than one process. • Multitasking operating systems come in two flavors: cooperative multitasking and preemptive multitasking. – Linux provides preemptive multitasking – MAC OS 9 and earlier being the most notable cooperative multitasking. 3

Linux scheduler – Scheduling Policy • Scheduling policy determines what runs when – fast

Linux scheduler – Scheduling Policy • Scheduling policy determines what runs when – fast process response time (low latency) – maximal system utilization (high throughput) • Processes classification: – I/O-bound processes: spends much of its time submitting and waiting on I/O requests – Processor-bound processes: spend much of their time executing code • Unix variants tends to favor I/O-bound processes, thus providing good process response time 4

Linux scheduler – Process Priority • Linux’s priority-based scheduling – Rank processes based on

Linux scheduler – Process Priority • Linux’s priority-based scheduling – Rank processes based on their worth and need for processor time. – Both the user and the system may set a process's priority to influence the scheduling behavior of the system. • Dynamic priority-based scheduling – Begins with an initial base priority – Then enables the scheduler to increase or decrease the priority dynamically to fulfill scheduling objectives. – E. g. , a process that is spending more time waiting on I/O will receive an elevated dynamic priority. 5

Linux scheduler – Priority Ranges • Two separate priority ranges. – nice value, from

Linux scheduler – Priority Ranges • Two separate priority ranges. – nice value, from -20 to +19 with a default of 0. • Larger nice values correspond to a lower priority. (you are being nice to the other processes on the system). – real-time priority, by default range from 0 to 99. • All real-time processes are at a higher priority than normal processes. • Linux implements real-time priorities in accordance with POSIX standards on the matter. 6

2. 4 scheduler • Non-preemptible kernel – Set p->need_resched if schedule() should be invoked

2. 4 scheduler • Non-preemptible kernel – Set p->need_resched if schedule() should be invoked at the ‘next opportunity‘ (kernel => user mode). • Round-robin – task_struct->counter: number of clock ticks left to run in this scheduling slice, decremented by a timer. 7

2. 4 scheduler 1. Check if schedule() was invoked from interrupt handler (due to

2. 4 scheduler 1. Check if schedule() was invoked from interrupt handler (due to a bug) and panic if so. 2. Use spin_lock_irq() to lock ‘runqueue_lock’ 3. Check if a task is ‘runnable’ – in TASK_RUNNING state – in TASK_INTERRUPTIBLE state and a signal is pending 4. Examine the ‘goodness’ of each process 5. Context switch 8

2. 4 scheduler – ‘goodness’ • ‘goodness’: identifying the best candidate among all processes

2. 4 scheduler – ‘goodness’ • ‘goodness’: identifying the best candidate among all processes in the runqueue list. – ‘goodness’ = 0: the entity has exhausted its quantum. – 0 < ‘goodness’ < 1000: the entity is a conventional process/thread that has not exhausted its quantum; a higher value denotes a higher level of goodness. 9

2. 4 scheduler – ‘goodness’ if (p->mm == prev->mm) return p->counter + p->priority +

2. 4 scheduler – ‘goodness’ if (p->mm == prev->mm) return p->counter + p->priority + 1; else return p->counter + p->priority; • A small bonus is given to the task p if it shares the address space with the previous task. 10

2. 4 scheduler - SMP run queue 11

2. 4 scheduler - SMP run queue 11

2. 4 scheduler - SMP Examine the processor field of the processes and gives

2. 4 scheduler - SMP Examine the processor field of the processes and gives a consistent bonus (that is PROC_CHANGE_PENALTY, usually 15) to the process that was last executed on the ‘this_cpu’ CPU. 12

2. 4 scheduler - performance • The algorithm does not scale well – It

2. 4 scheduler - performance • The algorithm does not scale well – It is inefficient to re-compute all dynamic priorities at once. • The predefined quantum is too large for high system loads (for example: a server) • I/O-bound process boosting strategy is not optimal – a good strategy to ensure a short response time for interactive programs, but… – some batch programs with almost no user interaction are I/O-bound. 13

Recalculating Timeslices (kernel 2. 4) • Problems: – Can take a long time. Worse,

Recalculating Timeslices (kernel 2. 4) • Problems: – Can take a long time. Worse, it scales O(n) for n tasks on the system. – Recalculation must occur under some sort of lock protecting the task list and the individual process descriptors. This results in high lock contention. – Nondeterminism is a problem with deterministic realtime programs. 14

Processes classification • Definition: – I/O-bound processes: spends much of its time submitting and

Processes classification • Definition: – I/O-bound processes: spends much of its time submitting and waiting on I/O requests – Processor-bound processes: spend much of their time executing code • Linux tends to favor I/O-bound processes, thus providing good process response time • How to classify processes? 15

16

16

0 tq≠ 0 tq=? ? ? ity ior pr ≠ um nt nd ua

0 tq≠ 0 tq=? ? ? ity ior pr ≠ um nt nd ua ou e q /O b I Tim Hig h Tim (C e PU qu bo ant un um d) =0 ks tas tq=0 17

Scheduling policy time_quantumnew = time_quantumold/2 + time_quantum_table[static_priority] & dynamic_priority ≈ time_quantumnew 18

Scheduling policy time_quantumnew = time_quantumold/2 + time_quantum_table[static_priority] & dynamic_priority ≈ time_quantumnew 18

≠ um nt nd ua ou e q /O b I Tim (C e

≠ um nt nd ua ou e q /O b I Tim (C e PU qu bo ant un um d) =0 tq=0 tq=? 0 tq≠ 0 tq=? ? ? 19

≠ um nt nd ua ou e q /O b I Tim (C e

≠ um nt nd ua ou e q /O b I Tim (C e PU qu bo ant un um d) =0 tq=0 tq=? 0 tq≠ 0 tq=? ? ? 20

≠ um nt nd ua ou e q /O b I Tim (C e

≠ um nt nd ua ou e q /O b I Tim (C e PU qu bo ant un um d) =0 tq≠ 0 tq=0 tq=? 0 tq≠ 0 tq=? ? ? 21

2. 6 scheduler run queue task migration (put + pull) run queue 22

2. 6 scheduler run queue task migration (put + pull) run queue 22

2. 6 scheduler – User Preemption • User preemption can occur – When returning

2. 6 scheduler – User Preemption • User preemption can occur – When returning to user-space from a system call – When returning to user-space from an interrupt handler 23

2. 6 scheduler – Kernel Preemption • The Linux kernel is a fully preemptive

2. 6 scheduler – Kernel Preemption • The Linux kernel is a fully preemptive kernel. – It is possible to preempt a task at any point, so long as the kernel is in a state in which it is safe to reschedule. – “safe to reschedule”: kernel does not hold a lock • The Linux design: – additing of a preemption counter, preempt_count, to each process's thread_info – This count increments once for each lock that is acquired and decrements once for each lock that is released • Kernel preemption can also occur explicitly, when a task in the kernel blocks or explicitly calls schedule(). – no additional logic is required to ensure that the kernel is in a state that is safe to preempt! 24

Kernel Preemption • Kernel preemption can occur – When an interrupt handler exits, before

Kernel Preemption • Kernel preemption can occur – When an interrupt handler exits, before returning to kernel-space – When kernel code becomes preemptible again – If a task in the kernel explicitly calls schedule() – If a task in the kernel blocks (which results in a call to schedule()) 25

O(1) & CFS scheduler • 2. 5 ~ 2. 6. 22: O(1) scheduler –

O(1) & CFS scheduler • 2. 5 ~ 2. 6. 22: O(1) scheduler – Time complexity: O(1) – Using “run queue” (an active Q and an expired Q) to realize the ready queue • 2. 6. 23~present: Completely Fair Scheduler (CFS) – Time complexity: O(log n) – the ready queue is implemented as a red-black tree 26

O(1) scheduler • Implement fully O(1) scheduling. – Every algorithm in the new scheduler

O(1) scheduler • Implement fully O(1) scheduling. – Every algorithm in the new scheduler completes in constant-time, regardless of the number of running processes. (Since the 2. 5 kernel). • Implement perfect SMP scalability. – Each processor has its own locking and individual runqueue. • Implement improved SMP affinity. – Attempt to group tasks to a specific CPU and continue to run them there. – Only migrate tasks from one CPU to another to resolve imbalances in runqueue sizes. • Provide good interactive performance. – Even during considerable system load, the system should react and schedule interactive tasks immediately. • Provide fairness. – No process should find itself starved of timeslice for any reasonable amount of time. Likewise, no process should receive an unfairly high amount of timeslice. • Optimize for the common case of only one or two runnable processes, yet scale well to multiple processors, each with many processes. 27

The Priority Arrays • Each runqueuecontains two priority arrays (defined in kernel/sched. c as

The Priority Arrays • Each runqueuecontains two priority arrays (defined in kernel/sched. c as struct prio_array) – Active array: all tasks with timeslice left. – Expired array: all tasks that have exhausted their timeslice. • Priority arrays provide O(1) scheduling. – Each priority array contains one queue of runnable processors per priority level. – The priority arrays also contain a priority bitmap used to efficiently discover the highest-priority runnable task in the system. 28

The Linux O(1) scheduler algorithm 29

The Linux O(1) scheduler algorithm 29

The Priority Arrays • Each runqueuecontains two priority arrays (defined in kernel/sched. cas struct

The Priority Arrays • Each runqueuecontains two priority arrays (defined in kernel/sched. cas struct prio_array) – Active array: all tasks with timesliceleft. – Expired array: all tasks that have exhausted their timeslice. • Priority arrays provide O(1) scheduling. – Each priority array contains one queue of runnable processors per priority level. – The priority arrays also contain a priority bitmap used to efficiently discover the highest-priority runnable task in the system. 30

Each runqueue contains two priority arrays – active and expired. � Each of these

Each runqueue contains two priority arrays – active and expired. � Each of these priority arrays contains a list of tasks indexed according to priority � runqueue Priority queue (0 -139) expired active 31

� runqueue Linux assigns higher-priority tasks longer time-slice Time quantum ≈ 1/priority tsk 1

� runqueue Linux assigns higher-priority tasks longer time-slice Time quantum ≈ 1/priority tsk 1 tsk 2 tsk 3 active expired 32

� runqueue Linux chooses the task with the highest priority from the active array

� runqueue Linux chooses the task with the highest priority from the active array for execution. tsk 1 tsk 2 tsk 3 active expired 33

runqueue tsk 1 Round-robin tsk 2 tsk 3 active expired 34

runqueue tsk 1 Round-robin tsk 2 tsk 3 active expired 34

runqueue tsk 1 Round-robin tsk 3 tsk 2 active expired 35

runqueue tsk 1 Round-robin tsk 3 tsk 2 active expired 35

runqueue tsk 1 tsk 2 tsk 3 active expired 36

runqueue tsk 1 tsk 2 tsk 3 active expired 36

Most tasks have dynamic priorities that are based on their “nice” value (static priority)

Most tasks have dynamic priorities that are based on their “nice” value (static priority) plus or minus 5 � Interactivity of a task ≈ 1/sleep_time � runqueue dyn. Prio = static. Prio + bonus = -5 ~ +5 bonus ≈ 1/sleep_time tsk 1 tsk 3 tsk 2 tsk 3 I/O bound active expired 37

� runqueue When all tasks have exhausted their time slices, the two priority arrays

� runqueue When all tasks have exhausted their time slices, the two priority arrays are exchanged! tsk 1 tsk 3 tsk 2 active expired 38

The O(1) scheduling algorithm sched_find_first_bit() 1 1 1 tsk 3 tsk 2 39

The O(1) scheduling algorithm sched_find_first_bit() 1 1 1 tsk 3 tsk 2 39

The O(1) scheduling algorithm Insert O(1) 1 1 1 Remove O(1) find first set

The O(1) scheduling algorithm Insert O(1) 1 1 1 Remove O(1) find first set bit O(1) 40

find first set bit O(1) word >>= 8; static inline unsigned long __ffs (unsigned

find first set bit O(1) word >>= 8; static inline unsigned long __ffs (unsigned long word) { int num = 0; #if BITS_PER_LONG == 64 if ((word & 0 xffff) == 0) { num += 32; word >>= 32; } #endif if ((word & 0 xffff) == 0) { num += 16; word >>= 16; } if ((word & 0 xff) == 0) { num += 8; } } if ((word & 0 xf) == 0) { num += 4; word >>= 4; } if ((word & 0 x 3) == 0) { num += 2; word >>= 2; } if ((word & 0 x 1) == 0) num += 1; return num; 41

2. 6 scheduler – CFS • Classical schedulers compute time slices for each process

2. 6 scheduler – CFS • Classical schedulers compute time slices for each process in the system and allow them to run until their time slice/quantum is used up. – After that, all process need to be recalculated. • CFS considers only the wait time of a process – The task with the most need for CPU time is scheduled. 42

2. 6 scheduler – CFS • 43

2. 6 scheduler – CFS • 43

2. 6 scheduler – CFS (motivation) • Traditional Linux scheduling policy HP LP HP:

2. 6 scheduler – CFS (motivation) • Traditional Linux scheduling policy HP LP HP: high priority LP: low priority 44

2. 6 scheduler – CFS (motivation) • Traditional Unix scheduling policy WT WT WT:

2. 6 scheduler – CFS (motivation) • Traditional Unix scheduling policy WT WT WT: waiting time 45

2. 6 scheduler – CFS (the idea case) task task virtualization High priority Low

2. 6 scheduler – CFS (the idea case) task task virtualization High priority Low priority 46

2. 6 scheduler – CFS (the idea case) task task virtualization fastly slowly 47

2. 6 scheduler – CFS (the idea case) task task virtualization fastly slowly 47

2. 6 scheduler – CFS (the idea case) task 8 : task 3 : task 3

2. 6 scheduler – CFS (the idea case) task 8 : task 3 : task 3 speed 48

time 2. 6 scheduler – CFS (the idea case) 8 8 8 3 3

time 2. 6 scheduler – CFS (the idea case) 8 8 8 3 3 3 49

time 2. 6 scheduler – CFS (the idea case) 8 8 8 3 3

time 2. 6 scheduler – CFS (the idea case) 8 8 8 3 3 3 50

time 2. 6 scheduler – CFS (the idea case) 8 8 8 3 3

time 2. 6 scheduler – CFS (the idea case) 8 8 8 3 3 3 51

time 2. 6 scheduler – CFS (the idea case) 8 8 8 3 3

time 2. 6 scheduler – CFS (the idea case) 8 8 8 3 3 3 52

time 2. 6 scheduler – CFS (the idea case) 8 8 8 3 3

time 2. 6 scheduler – CFS (the idea case) 8 8 8 3 3 3 53

time 2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3

time 2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3 54

time 2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3

time 2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3 55

2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3 56

2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3 56

2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3 57

2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3 57

2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3 58

2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3 58

2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3 59

2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3 59

2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3 60

2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3 60

2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3 61

2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3 61

2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3 62

2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3 62

2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3 63

2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3 63

2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3 64

2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3 64

2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3 65

2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3 65

2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3 66

2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3 66

2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3 67

2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3 67

2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3 68

2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3 68

2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3 69

2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3 69

2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3 70

2. 6 scheduler – CFS (the implementation) 8 8 8 3 3 3 70

2. 6 scheduler – CFS 71

2. 6 scheduler – CFS 71

2. 6 scheduler – the RB-Tree • To sort tasks on the red-black tree,

2. 6 scheduler – the RB-Tree • To sort tasks on the red-black tree, the kernel uses the difference fair_clock -wait_runtime. – While fair_clock is a measure for the CPU time a task would have gotten if scheduling were completely fair, – wait_runtime is a direct measure for the unfairness caused by the imperfection of real systems. 72

2. 6 scheduler – issues • Different priority levels for tasks (i. e. ,

2. 6 scheduler – issues • Different priority levels for tasks (i. e. , nice values) must be taken into account • Tasks must not be switched too often because a context switch has a certain overhead. 73

2. 6 scheduler – fields in the task_struct 74

2. 6 scheduler – fields in the task_struct 74

2. 6 scheduler – fields in the task_struct • prio and normal_prio indicate the

2. 6 scheduler – fields in the task_struct • prio and normal_prio indicate the dynamic priorities, static_prio the static priority of a process. – The static priority is the priority assigned to the process when it was started. – normal_priority denotes a priority that is computed based on the static priority and the scheduling policy of the process. 75

2. 6 scheduler – fields in the task_struct • The scheduler is not limited

2. 6 scheduler – fields in the task_struct • The scheduler is not limited to schedule processes, but can also work with larger entities. This allows for implementing group scheduling. 76

2. 6 scheduler – fields in the task_struct • cpus_allowed is a bit field

2. 6 scheduler – fields in the task_struct • cpus_allowed is a bit field used on multiprocessor systems to restrict the CPUs on which a process may run. – setaffinity() – getaffinity() 77

2. 6 scheduler – priority 78

2. 6 scheduler – priority 78

2. 6 scheduler – priority kernel/sched. c static const int prio_to_weight[40] = { /*

2. 6 scheduler – priority kernel/sched. c static const int prio_to_weight[40] = { /* -20 */ 88761, 71755, 56483, 46273, 36291, /* -15 */ 29154, 23254, 18705, 14949, 11916, /* -10 */ 9548, 7620, 6100, 4904, 3906, /* -5 */ 3121, 2501, 1991, 1586, 1277, /* 0 */ 1024, 820, 655, 526, 423, /* 5 */ 335, 272, 215, 172, 137, /* 10 */ 110, 87, 70, 56, 45, /* 15 */ 36, 29, 23, 18, 15, }; 79

2. 6 scheduler – priority 80

2. 6 scheduler – priority 80

Summary • The concept of OS schedulers • Maximize throughput. – This is what

Summary • The concept of OS schedulers • Maximize throughput. – This is what system administrators care about. – How to maximize throughput (CPU & I/O). • What is the major drawback of Linux 2. 4 scheduler • Pros and cons of Linux 2. 6 schedulers – O(1) – CFS 81