
Adaptive Partition Scheduling
Part 1: Why we did it

Cool Stuff from QNX
A. Danko
12/30/2021

Yet another thread scheduler. Why?

- The story begins with a customer:
- “We can use QNX! We need ARINC 653!!!!!! HELP!”

Why? Shiny New Toy

- Partition scheduler (ARINC 653)
  - Very popular in fixed military systems
  - Each partition is guaranteed a percentage of CPU
  - Priorities are only meaningful within a partition
- Shortcomings include
  - Detailed RMA (rate-monotonic analysis) required to verify system
  - Overload of IPC FIFO input queue
    - Failures include denial of service and CPU quota exhaustion
  - Monolithic design within one partition
  - Hard to retrofit to existing 1-CPU applications
  - Inefficient use of total CPU. Runs idle when tasks are ready.
  - Increased interrupt latency
  - Does not address shared entities such as a file system
  - Restrictive programming model. No DMA.

[Figure: ARINC 653 Partition Scheduler and “special” IPC; partitions labeled OTHER, JAVA, POSIX; budget labels 50%, 20%, 30%]

Why? Real-world examples of partitioning for QNX customers

- Selling a portion of throughput (router)
  - Partitions: Customer 1 TCP/IP protocol, Customer 2 TCP/IP protocol, router application
  - Customer 2’s network load cannot hurt customer 1
- Security: untrusted applications (car)
  - Partitions: NAV etc., radio, 3rd party (malware?)
  - Downloaded applications from the web cannot hurt the system
- Locked system recovery
  - HOG app 90%, bash 10%
  - Emergency recovery shell
- Hard-wall scheduler not required.
- Do we need any new scheduler?

[Figure: three partitioned systems; budget labels 80%/20%, 50%/50%, and 90%/10%]

Why? Evolution of schedulers

Timeline                                          Yes, but:
Priority pre-emptive (SCHED_FIFO)                 System locks up
Timeslicing (SCHED_RR)                            Backhoes and Mother’s Day
Time-varying priority                             Untuneable for more than 1 application
SCHED_SPORADIC (really clever time-varying)       US Military Satcom
Fair-share scheduling                             Hard to manage share interactions
Adaptive configuration                            Not invented, until now

Why? Evolution: Lessons learned

- Numerical priorities are chosen by applications, but system scheduling behavior must be designed globally
- Degradation and overload: priorities are not constants. Importance of work depends on circumstances.
  - Modes: normal operation, restart, emergency maintenance
- Scheduling strategy needs to be based on unit of work, but what we have is communicating threads.
- Must measure real-time behavior.
  - 0.1% accuracy
- Want to specify shares as global percentages
  - Applications don’t get to pick their importance or shares. System engineers do.
- Need to throttle CPU usage without losing real-time latencies.

Design: What is Partitioning?

General answer
- Separation of work
- To isolate:
  - CPU usage
  - Memory usage
  - System resource usage
  - Failures

QNX answer: Adaptive Partition Scheduling
POSIX-compatible design which can be applied to existing systems with little or no recoding
- A global hard real-time scheduler with overload protection and CPU guarantees
  - Separation of work based on “working for common purpose”
- Runtime typed memory and kernel object guarantees and limits
  - With full inheritance and accounting for all children
- Persistent storage (file system) guarantees and limits
- Process model for fault isolation
- Dynamic configuration

Design Principles

- Scheduler must not trigger an overload
  - Overhead may not increase with # of threads
- Real-time during underload
  - Same behavior as today
- Real-time during overload
  - At least for interrupt handling
- Must also be a fair-share scheduler
  - Global scheduler algorithm
  - Globally configured
- Must mesh with current QNX architecture
  - Preemptive priority, individual thread scheduling
  - Heavy use of message passing
- Easy to drop onto existing applications. Can’t be a “bag on the side”.
- Simple enough for customers to use
  - Engineerable
  - Reconfigure on the fly

[Figure: throughput versus offered load]
[Insert picture of juggling watermelons here]

Overconstrained problem?

Nope:
- Implemented in QNX 6.3.2
- Actually works

See “How it Works” in Part 2.

Design: Adaptive Partition Scheduling

Part 2: How it works.
- What it does:
  - Counting time
  - Who’s got time
  - Real time
  - Out of time
  - Free time
  - Borrowed time
  - Equal time
- How it does it
- API
- Why is it secure?
- Why is it cool?

Design: Counting time

- What does 14% CPU mean?
  - CPU usage is calculated over a sliding window.
  - [Figure: window running from T = -100 ms to T = now]
- Accuracy:
  - Counting ticks is not enough. “Micro-billing” is used to track actual CPU utilization even when threads don’t use their whole timeslice.
  - Micro- and nanosecond resolution
  - Threads are billed based on real usage, not statistics
- “windowsize” is configurable as an argument to the kernel at boot
  - Trades off maximum READY-state latency against accuracy of CPU budgeting
    - 100 ms window -> 1% accuracy or better
  - Internal arithmetic accurate to 0.5% or better
- Partition usage
  - Nanoseconds of CPU time executed during the last sliding window, expressed as a percentage
- Partition budget
  - Guaranteed percentage of CPU time, balanced over the sliding window
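To make "14% CPU over a sliding window" concrete, here is a minimal Python model of micro-billing. It is only a sketch: the class name, method names, and interval bookkeeping are hypothetical, and it uses floating point for readability, which the real scheduler is said to avoid.

```python
from collections import deque


class SlidingWindowUsage:
    """Hypothetical model: CPU time billed to one partition over the last window."""

    def __init__(self, window_ns=100_000_000):  # 100 ms, the window quoted on the slide
        self.window_ns = window_ns
        self.runs = deque()  # (start_ns, end_ns) intervals of actual execution

    def bill(self, start_ns, end_ns):
        """Record real execution time, even if it is only part of a timeslice."""
        self.runs.append((start_ns, end_ns))

    def usage_percent(self, now_ns):
        """CPU time executed inside [now - window, now] as a percentage of the window."""
        window_start = now_ns - self.window_ns
        # Forget intervals that ended before the window began.
        while self.runs and self.runs[0][1] <= window_start:
            self.runs.popleft()
        # Clip each remaining interval to the window before summing.
        used = sum(min(end, now_ns) - max(start, window_start)
                   for start, end in self.runs if start < now_ns)
        return 100.0 * used / self.window_ns
```

A partition that ran for 14 ms of the last 100 ms reports 14%. Ten milliseconds later, only the 4 ms still inside the window count.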

Who’s got time: Partition Membership

- QNX Scheduler Partition
  - Set of threads working for a common purpose
    - Set of initial processes/threads designated by customer
      - Plus all subsequent children
    - Guest members
      - Server’s CPU time billed to client
      - Resmgr threads temporarily join partition of sender thread
  - Not locked to a static set of code.
  - OS services are part of whatever partition they need to be.
- Hence the name “adaptive partition”

Design: Who’s got time: Partition Inheritance

[Figure: a File System Process with receive threads exchanges messages with threads in Adaptive Partition 1 (Multi-media) and Adaptive Partition 2 (Java application); both partitions show CPU budget available]

- Resource manager threads work on behalf of the sender
- Priority and adaptive partition are inherited on receive
  - Execution time in server billed to client’s partition
- This allows proper accounting for shared resources
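The inheritance rule can be modelled in a few lines of Python. This is an illustrative sketch, not the kernel's data structures: `Thread`, `receive_from`, `reply_done`, and `bill` are hypothetical names, and restoring the server's own priority and partition after the reply is an assumption.

```python
class Thread:
    """Hypothetical thread record: a priority plus the partition it is billed to."""

    def __init__(self, name, priority, partition):
        self.name = name
        self.priority = priority
        self.partition = partition
        self.home = (priority, partition)  # what the thread returns to afterwards

    def receive_from(self, sender):
        """On message receive, inherit the sender's priority and partition."""
        self.priority = sender.priority
        self.partition = sender.partition

    def reply_done(self):
        """Assumption: the server thread reverts once it has replied."""
        self.priority, self.partition = self.home


def bill(usage_ns, thread, ns):
    """Charge execution time to whatever partition the thread is in right now."""
    usage_ns[thread.partition] = usage_ns.get(thread.partition, 0) + ns
```

While a file system thread serves a multimedia client, its execution time lands in the multimedia partition's account, so the shared server needs no budget of its own.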

Design: Real time: Behavior under normal load

[Figure: Blocked, Ready, and Running threads in Adaptive Partition 1 (Multi-media) and Adaptive Partition 2 (Java application); CPU budget available]

- Hard real-time scheduler under normal load
- Running thread selected as highest priority READY thread
- No delay on scheduling if adaptive partition has budget

Design: Out of time: Behavior under overload

[Figure: Blocked, Ready, and Running threads in Adaptive Partition 1 (Multi-media) and Adaptive Partition 2 (Java application); one partition has exceeded its CPU budget, the other has budget available]

- Highest priority READY thread in a partition with budget runs
- No delay on scheduling if adaptive partition has budget

Design: Free Time: Behavior with unused CPU

[Figure: Blocked and Running threads in Adaptive Partition 1 (Multi-media), Adaptive Partition 2 (Java application), and Adaptive Partition 3; labels: CPU budget exceeded, CPU budget available]

- If no partitions with remaining budget have READY threads, the highest priority READY thread is selected to run from other partitions
- This allows “free” time to be given based upon priority
  - “Free” time is still accounted and may have to be paid back (for example, if partition 3 becomes ready within 1 averaging window)
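The three behaviours (normal load, overload, free time) reduce to one selection rule, sketched below in Python. The function name matches the `select_thread()` entry point that appears later in the deck, but the signature and data layout are hypothetical, and critical threads are deliberately left out.

```python
def select_thread(ready, has_budget):
    """Pick the next thread to run.

    ready:      list of (priority, partition, name) tuples for READY threads
    has_budget: dict mapping partition name -> True if it still has CPU budget
    """
    if not ready:
        return None
    # Normal load and overload: only partitions with budget compete.
    funded = [t for t in ready if has_budget[t[1]]]
    # Free time: if no funded partition has a READY thread, everyone competes.
    candidates = funded if funded else ready
    return max(candidates, key=lambda t: t[0])  # highest priority wins
```

With budget everywhere this is plain priority scheduling. Once Partition 1 exceeds its budget, its priority-10 thread loses to a priority-9 thread in Partition 2, and it wins again only when Partition 2 has nothing ready.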

Design: Borrowed Time: Critical Threads

[Figure: Blocked, Ready, and Running threads in Adaptive Partition 1 (Multi-media) and Adaptive Partition 2 (Air Bag Control), including a critical thread; labels: CPU budget exceeded, CPU budget available]

- Critical threads still run (based on priority) even if partition has no budget
- Critical threads provide deterministic scheduling even in overload
- Critical threads are given critical budget and can go into short-term debt
  - Critical time is accounted and has to be repaid
  - Exceeding critical budget is considered an error and causes notification/action
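A rough Python model of critical budgets follows. All names are hypothetical, and the accounting rule (critical time is whatever a critical thread runs beyond the normal budget) is an assumption chosen to match the bullets, not a description of the kernel's bookkeeping.

```python
class Partition:
    """Hypothetical partition account with a normal and a critical budget."""

    def __init__(self, budget_ns, critical_budget_ns):
        self.budget_ns = budget_ns
        self.critical_budget_ns = critical_budget_ns
        self.used_ns = 0
        self.critical_used_ns = 0
        self.overrun = False  # stands in for the notification/action

    def may_run(self, critical):
        """Critical threads may run even when the normal budget is gone."""
        return self.used_ns < self.budget_ns or critical

    def debt_ns(self):
        """Time used beyond the normal budget, which has to be repaid."""
        return max(0, self.used_ns - self.budget_ns)

    def bill(self, ns, critical):
        before = self.debt_ns()
        self.used_ns += ns
        if critical:
            # Only time run past the normal budget counts as critical time.
            self.critical_used_ns += self.debt_ns() - before
            if self.critical_used_ns > self.critical_budget_ns:
                self.overrun = True
```

An air-bag partition that has used its whole budget can still run its critical thread, accumulating debt until the critical budget is exceeded, at which point the overrun flag is raised.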

Design: Equal time

- How to choose between partitions of equal priority
  - Unimportant?
  - Many threads run at default priority, therefore equal priority
- Possible algorithms:
  - Round robin
  - Favor partition with most free time
  - Favor longest waiter
- Requirement:
  - Minimize latencies during underload
  - WBN: divide free time by % CPU share
- Solution:
  - Interleave partitions by ratio of partition shares
  - We found a clever way to do that, so it’s in the patent.
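The patented interleaving method is not described in these slides. To show what "interleave by ratio of shares" means, the Python sketch below uses a well-known stand-in, smooth weighted round-robin. It is not the QNX algorithm, and the function name is hypothetical.

```python
def interleave(shares, steps):
    """Smooth weighted round-robin over partitions (a stand-in technique).

    shares: dict mapping partition name -> share (for example, percent CPU)
    Returns the order in which partitions would be picked.
    """
    total = sum(shares.values())
    credit = {name: 0 for name in shares}
    order = []
    for _ in range(steps):
        for name, share in shares.items():
            credit[name] += share            # everyone earns their share
        pick = max(credit, key=credit.get)   # the most "owed" partition runs
        credit[pick] -= total                # and pays for the slot it used
        order.append(pick)
    return order
```

With shares of 50, 30, and 20, ten picks come out as A B C A A B A C B A. Each partition gets its ratio, and no partition waits for a long burst from another.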

How it does it

[Figure: microkernel components (process creation, messaging, scheduler, clock interrupt handler) and libmod_aps.a (per-partition ready queues), with entry points ready(), block(), and select_thread()]

For all partitions p, define the merit:
    m(p) -> (bud(p) || crit(p), prio(p), run_t/wsize/bud(p))
Then schedule partition p_s, defined by:
    p_s -> rdy(p_s) and (m(p_s) < m(p_i)) for all i != s
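One possible reading of the merit function is a tuple compared lexicographically, sketched here in Python. The slide does not spell out how each component is ordered, so the choices below are assumptions: has-budget-or-critical comes first, then higher priority, then lower usage relative to budget. The dictionary field names are hypothetical.

```python
def merit(p, window_ns):
    """Sort key for a partition; a smaller tuple is a better candidate (assumed ordering)."""
    budget_ns = p["budget_pct"] * window_ns / 100
    eligible = p["used_ns"] < budget_ns or p["critical"]   # bud(p) || crit(p)
    relative_use = p["used_ns"] / window_ns / (p["budget_pct"] / 100)  # run_t/wsize/bud(p)
    # prio is taken to be the priority of the partition's best READY thread.
    return (not eligible, -p["prio"], relative_use)


def select_partition(partitions, window_ns=100_000_000):
    """Choose the READY partition with the best merit, or None if none is READY."""
    ready = [p for p in partitions if p["ready"]]
    if not ready:
        return None
    return min(ready, key=lambda p: merit(p, window_ns))
```

Under this reading, an over-budget partition loses to any partition with budget, priority decides among eligible partitions, and at equal priority the partition that has used the smallest fraction of its budget runs next.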

Algorithm summary

- A partition sees real-time behaviour when under budget
  - Only limited when another partition must get its guarantee
- Fair-share scheduling at or over budget
- Equal-priority partitions are interleaved
  - Budgets balanced in much less than windowsize
- Free time (above budget) is given out:
  - By default: in real-time mode
  - Optionally: by ratio of budgets
- Critical threads run even if out of budget
  - Criticality is inherited

Overhead: Fancy, but is it fast?

- Scheduling overhead increases with:
  - Number of partitions
  - Number of messages/sec
  - Number of clock interrupts/sec, i.e. ClockPeriod()
  - *Does not increase with number of threads*
- Free or almost free operations:
  - Inheriting partition as part of message receive
  - Joining a thread to a partition
  - Dynamically changing budgets
- Computational requirements
  - 32-bit multiply, 64-bit add
  - *No floating point* *No divides* *No address space swapping*
- *Short-circuit calculation of merit function* *No inter-CPU messaging on SMP* *History-less algorithm*
- Overhead typically 1% of total CPU
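One common way to compare usage-to-budget ratios with multiplies only is cross-multiplication, shown below in Python. Whether the kernel does exactly this is an assumption, and the function name is hypothetical. The point is only that "no divides, no floating point" is achievable for this kind of comparison.

```python
def less_used(used_a, budget_a, used_b, budget_b):
    """True if partition A has consumed a smaller fraction of its budget than B.

    Equivalent to used_a / budget_a < used_b / budget_b for positive budgets,
    rearranged so that only integer multiplies are needed. With 32-bit inputs
    each product fits in 64 bits.
    """
    return used_a * budget_b < used_b * budget_a
```

For example, 10 units used against a 50% budget is a smaller fraction than 10 units against a 30% budget, while 25 of 50 and 15 of 30 compare as equal.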

Design: APIs

- Control of the Adaptive Partitioning Scheduler is done through a kernel API
- API allows associating a thread with a partition
  - Used to launch processes within a partition
  - Children inherit parent’s partition
- Dynamic capabilities are part of the design
  - Budgets may be changed at run time, with instant effect
  - Threads may join/unjoin partitions freely
- APIs to attach an event triggered on critical budget overrun
- Selectable security
  - API is restricted to privileged processes (root)
  - Must be called from within default (system) partition
  - Partitions are created with budget (normal and possibly critical)
- API provided to “lock down” partition configuration
  - Prevents creation of new partitions or modification of budgets

API 2: Launching applications

- 1. Build file
    schedaps MyPartition 20
    [schedaps=MyPartition] /bin/myApp
- 2. Command line
    aps create -b 20 MyPartition
    on -Xaps=MyPartition /bin/myApp
- 3. Momentics IDE 4
  - Drag and drop
- 4. #include <sys/sched_aps.h>
  - Full programmatic interface: configure, get stats, launch, secure

Why is AP Secure?

- AP enforces budgets every clock interrupt
- Root can be required to do configuration changes
- Partition creation by subdivision of parent
  - It’s not possible to create a sub-partition greater than a parent
  - Not even root can violate this rule
- Configuration can be locked
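The subdivision rule is easy to model in Python: a child's budget is carved out of its parent, so the total can never exceed 100%. The class and method names are hypothetical, and starting from a single 100% "System" partition is an assumption based on the default (system) partition mentioned earlier.

```python
class PartitionTable:
    """Hypothetical model of budget subdivision and configuration locking."""

    def __init__(self):
        self.budgets = {"System": 100}  # assumed starting point: one 100% partition
        self.locked = False

    def create(self, parent, name, percent):
        """Carve `percent` out of `parent` to create a new partition."""
        if self.locked:
            raise PermissionError("partition configuration is locked")
        if percent > self.budgets[parent]:
            raise ValueError("sub-partition cannot exceed its parent's budget")
        self.budgets[parent] -= percent
        self.budgets[name] = percent

    def lock(self):
        """After this, no new partitions can be created."""
        self.locked = True
```

Creating a 20% partition leaves the parent with 80%, and asking for more than the parent holds fails regardless of who asks.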

Design: Why is this cool?

- Engineerable
  - Identifying units of work: Partition Inheritance
    - Identify code that starts up applications
    - Inheritance figures out the rest
    - Filesystems etc. do not require a separately engineered CPU share
    - Customer need not analyze budgets for OS components
  - Global share management: % CPU
    - CPU shares defined in units customers are used to: percentage
    - Gets us off the hook for accounting for different clock speeds
  - Realtime when you need it: Critical Threads
    - Interrupts and important events still get handled on time
- Secure
  - Budgets, especially critical budgets, are set globally by root, not by applications
    - “to err is human, but …”

Adaptive Partition Scheduling

- Part 3: The Slick Demo