
What Analytical Performance Modeling Teaches Us About Computer Systems Design

Mor Harchol-Balter
Computer Science Dept, CMU

Defn: Queueing Theory: the study of queues, congestion, resource management, and stochastic (probabilistic) modeling.

CMU CS Dept: 15-359, 15-857

Our Goal: Computer Systems Design

TERMINOLOGY WARMUP

[Figure: incoming jobs enter a single-server queue; response time E[T]]

λ: Avg. rate jobs arrive (jobs/sec)
μ: Avg. rate jobs served (server speed)
Load: ρ = λ/μ < 1

Examples of single-server queues:
• Router
• Supercomputing center
• A database lock queue
• Web server

Q1: TERMINOLOGY WARMUP

[Figure: incoming jobs enter a single-server queue; response time E[T]]

λ: Avg. rate jobs arrive (jobs/sec)
μ: Avg. rate jobs served (server speed)

QUESTION 1: Suppose λ → 2λ. We want to keep E[T] unchanged. Should we:
(a) Double the service rate
(b) More than double the service rate
(c) Less than double the service rate

Impact: Be careful not to overprovision!
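The slide does not name a queueing model, so the following is only a sketch under an added assumption: an M/M/1 queue, for which E[T] = 1/(service rate − arrival rate). The function names and the numbers 8 and 10 are illustrative, not from the talk. Under that assumption, the service rate needed after the arrival rate doubles comes out below twice the original, which is consistent with the overprovisioning warning.

```python
def mm1_response_time(lam, mu):
    """Mean response time E[T] = 1 / (mu - lam) for an M/M/1 queue (assumed model)."""
    if lam >= mu:
        raise ValueError("unstable queue: need lam < mu")
    return 1.0 / (mu - lam)


def service_rate_for_target(lam, target_et):
    """Service rate that yields E[T] = target_et at arrival rate lam (M/M/1)."""
    return lam + 1.0 / target_et


# Illustrative numbers only: 8 jobs/sec arriving, server speed 10 jobs/sec.
lam, mu = 8.0, 10.0
et_before = mm1_response_time(lam, mu)

# Double the arrival rate and ask what speed keeps E[T] unchanged.
new_mu = service_rate_for_target(2 * lam, et_before)

print(f"E[T] before: {et_before}")
print(f"needed service rate: {new_mu} (doubling would give {2 * mu})")
```

In this model the required rate is the old rate plus the old arrival rate, so it is always less than double as long as the original queue was stable.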

Workload Distribution Warmup

[Figure: a heavy-tailed workload (huge variability) compared with an exponential workload]

Heavy tail: the top 1% of jobs comprise half the load.

Heavy tails are everywhere in CS:
• CPU lifetimes of UNIX jobs [Harchol, Downey 96]
• Supercomputing job sizes [Harchol-Balter, Schroeder 00]
• Web file sizes [Crovella, Bestavros 98], [Barford, Crovella 98]
• Internet node degree [Faloutsos, Faloutsos 99]
• IP flow durations [Rexford 99]
• Self-similar arrival processes [Willinger 93]
• many, many more ...

Q2: Exponential Distribution

QUESTION 2: Under exponentially-distributed job demands, which scheduling policy wins for E[T]: FCFS or PS?

[Figure: E[T] versus load ρ; the curve is labeled PS = FCFS]

Q3: Heavy-tailed workload

QUESTION 3: Under heavy-tailed job demands, which scheduling policy wins for E[T]: FCFS or PS?

[Figure: E[T] versus load ρ, with separate curves for FCFS and PS]

Impact: Know your workload before choosing a scheduling policy.
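Questions 2 and 3 can be explored numerically with two standard M/G/1 results: the Pollaczek-Khinchine formula for FCFS and the insensitivity result E[T] = E[S]/(1 − load) for PS. The sketch below assumes Poisson arrivals, and it stands in for "heavy-tailed" with a large but finite squared coefficient of variation (the value 100 is a made-up illustration); a true heavy tail with infinite variance would make the FCFS mean unbounded.

```python
def mg1_fcfs_response_time(lam, mean_s, scv):
    """E[T] for M/G/1 FCFS via Pollaczek-Khinchine.

    scv is the squared coefficient of variation Var(S) / E[S]^2.
    """
    rho = lam * mean_s
    if rho >= 1.0:
        raise ValueError("unstable queue: need load < 1")
    second_moment = (scv + 1.0) * mean_s ** 2  # E[S^2]
    return mean_s + lam * second_moment / (2.0 * (1.0 - rho))


def mg1_ps_response_time(lam, mean_s):
    """E[T] for M/G/1 PS; depends on the job size distribution only through its mean."""
    rho = lam * mean_s
    if rho >= 1.0:
        raise ValueError("unstable queue: need load < 1")
    return mean_s / (1.0 - rho)


lam, mean_s = 0.8, 1.0  # illustrative: load 0.8

# Exponential job sizes have scv = 1: FCFS and PS coincide.
print(mg1_fcfs_response_time(lam, mean_s, scv=1.0), mg1_ps_response_time(lam, mean_s))

# Highly variable job sizes (scv = 100, hypothetical): FCFS degrades, PS does not.
print(mg1_fcfs_response_time(lam, mean_s, scv=100.0), mg1_ps_response_time(lam, mean_s))
```

The PS value is unchanged between the two calls because it ignores variability, while the FCFS value grows linearly in the second moment of job size.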

Q4: Scheduling to minimize E[T]

QUESTION 4: Under heavy-tailed job demands, in M/G/1, order these scheduling policies from low E[T] to high E[T]:
• FCFS
• PS
• SJF
• SRPT
• RANDOM

Scheduling to minimize E[T]

Answer: Under heavy-tailed job demands, in M/G/1, from low E[T] to high E[T]:

SRPT < PS < SJF < RANDOM = FCFS

No "Starvation!" Even the biggest jobs prefer SRPT to PS [Bansal, Harchol-Balter 01], [Wierman, Harchol-Balter 03]:

THM: E[T(x)]^SRPT < E[T(x)]^PS for all x, when load < ½.

• single-server questions
• multi-server questions

Growing trend towards server farms …

[Figure: jobs arrive at rate λ at a dispatcher, which routes them to the servers]

Server farms:
+ cheap
+ easy to scale

Growing trend towards server farms …

Supercomputing/Manufacturing (router in front of FCFS hosts):
• Jobs non-preemptible
• Run-to-completion
• Served in FCFS order
• Often variable job size

Web server farm (router in front of PS hosts):
• HTTP requests fully preemptible
• Commodity PS servers
• Highly variable job size

Examples: Cisco Local Director, IBM Network Dispatcher, Microsoft SharePoint, etc.

Q5: 1 Fast versus Many Slow?

QUESTION 5: Which has lower E[T]? (for heavy-tailed workload)

Under RANDOM Dispatch:

[Figure: one FCFS server of speed μ with arrival rate λ, versus a dispatcher that sends ½λ to each of several FCFS servers of speed ½μ]
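For the RANDOM dispatch case, a rough calculation is possible if one assumes Poisson arrivals: random splitting of a Poisson stream gives each slow server its own Poisson stream, so every server is an independent M/G/1 FCFS queue and the Pollaczek-Khinchine formula applies. The sketch below uses that assumption; the function names and the numbers (load 0.8, second moment 101) are illustrative only and do not come from the slides.

```python
def mg1_fcfs_response_time(lam, mean_s, second_moment_s):
    """E[T] for an M/G/1 FCFS queue via Pollaczek-Khinchine."""
    rho = lam * mean_s
    if rho >= 1.0:
        raise ValueError("unstable queue: need load < 1")
    return mean_s + lam * second_moment_s / (2.0 * (1.0 - rho))


def one_fast_server(lam, mean_s, second_moment_s):
    """Single server of full speed; job sizes are measured at that speed."""
    return mg1_fcfs_response_time(lam, mean_s, second_moment_s)


def k_slow_servers_random(lam, mean_s, second_moment_s, k):
    """k servers of speed 1/k each, fed by RANDOM dispatch.

    Each server sees Poisson arrivals at rate lam / k, and every job
    takes k times longer, so E[S] scales by k and E[S^2] by k^2.
    """
    return mg1_fcfs_response_time(lam / k, k * mean_s, k * k * second_moment_s)


lam, mean_s, second_moment_s = 0.8, 1.0, 101.0  # hypothetical, highly variable sizes
fast = one_fast_server(lam, mean_s, second_moment_s)
slow = k_slow_servers_random(lam, mean_s, second_moment_s, k=2)
print(f"one fast: {fast}, two slow under RANDOM: {slow}, ratio: {slow / fast}")
```

Under these assumptions the ratio is exactly k, so RANDOM dispatch does nothing to shield small jobs from large ones; any advantage for many slow servers has to come from a smarter dispatch rule.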

Q5: 1 Fast versus Many Slow?

QUESTION 5: Which has lower E[T]? (for heavy-tailed workload)

Under Least-Work-Left Dispatch [Wierman, Osogami, Harchol-Balter, Scheller-Wolf, Perf. Eval. 06]:

[Figure: one FCFS server of speed μ versus FCFS servers of speed ½μ behind a smart dispatcher; plot of OPT # servers (1, 2, 3) against load ρ and variability]

Multiple servers way better under variable workload

Q5: 1 Fast versus Many Slow?

QUESTION 5: Which has lower E[T]? (for heavy-tailed workload)

Under Size-based Dispatch [Harchol-Balter, Crovella, Murta, Jour. Par. Dist. Comp. 99]:

[Figure: one FCFS server of speed μ versus a size-based dispatcher that sends small jobs to one FCFS server of speed ½μ and big jobs to another]

Multiple servers way better under high variability workload

Q5: 1 Fast versus Many Slow?

QUESTION 5: Which has lower E[T]? (for heavy-tailed workload)

Under Size-based Dispatch with Unknown Size [Harchol-Balter, Journ. ACM 02]:

[Figure: one FCFS server of speed μ versus a dispatcher that separates small jobs and big jobs across FCFS servers of speed ½μ]

Multiple servers way better under variable workload

Q5: 1 Fast versus Many Slow?

QUESTION 5: Which has lower E[T]? (for heavy-tailed workload)

[Figure: one FCFS server of speed μ versus FCFS servers of speed ½μ behind a smart dispatcher]

Impact: Best architecture can be cheaper …

Q6: Which routing policy is best?

Arrivals: Poisson Process

A: Supercomputing/Manufacturing (router in front of FCFS hosts); heavy-tailed, highly variable jobs
B: Web Server Farm (router in front of PS hosts); heavy-tailed, highly variable jobs

Routing policies:
• Random: equal probability.
• Least-Work-Left: go to host with least total work.
• Join-Shortest-Queue: go to host with fewest # jobs.
• Size-Based Splitting: jobs split up by size.
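The four routing rules can be written as small host-selection functions. This is a minimal sketch and not code from the talk: the representation of a host as a list of remaining job sizes, the function names, and the cutoff values are all assumptions made for illustration.

```python
import bisect
import random


def dispatch_random(hosts, rng=random):
    """Random: every host is equally likely."""
    return rng.randrange(len(hosts))


def dispatch_join_shortest_queue(hosts):
    """Join-Shortest-Queue: host with the fewest jobs (ties go to the lowest index)."""
    return min(range(len(hosts)), key=lambda i: len(hosts[i]))


def dispatch_least_work_left(hosts):
    """Least-Work-Left: host with the least total remaining work."""
    return min(range(len(hosts)), key=lambda i: sum(hosts[i]))


def dispatch_size_based(job_size, cutoffs):
    """Size-based splitting: host i takes sizes in (cutoffs[i-1], cutoffs[i]].

    cutoffs must be sorted ascending; n cutoffs define n + 1 hosts.
    """
    return bisect.bisect_left(cutoffs, job_size)


# Hypothetical state: each host is a list of remaining job sizes.
hosts = [[5.0, 1.0], [20.0], [0.5, 0.5, 0.5]]
cutoffs = [10.0, 100.0]  # hypothetical size cutoffs for three hosts

print(dispatch_join_shortest_queue(hosts))  # host with one job
print(dispatch_least_work_left(hosts))      # host with 1.5 units of work
print(dispatch_size_based(50.0, cutoffs))   # middle size class
```

Note that Least-Work-Left and size-based splitting both need job sizes, which is natural for run-to-completion FCFS hosts, while Join-Shortest-Queue only needs queue lengths.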

Q6: Which routing policy is best?

A: Supercomputing/Manufacturing (router in front of FCFS hosts); Poisson arrival process; heavy-tailed, highly variable jobs
B: Web Server Farm (router in front of PS hosts)

Answer to A, from high E[T] to low E[T]:
1. Random
2. Join-Shortest-Queue
3. Least-Work-Left
4. Size-Based (best!) [Harchol-Balter, Crovella, Murta, JPDC 99]

Answer to B, from high E[T] to low E[T]:
1. Random = Size-Based
2. Least-Work-Left
3. Join-Shortest-Queue (best!) [Gupta, Harchol-Balter, Sigman, Whitt 06+]

• single-server questions
• multiple servers with dependencies between servers

N-sharing model

[Figure: Beneficiary (Betty) and Donor (Dan) queues, with labels λ_D, λ_B, μ_BD, μ_D]

Cycle-stealing: Donor helps Beneficiary with her work when he’s free. But can do better with threshold policies…

Studied by: S. Bell, R. Williams, M. Harrison, M. Lopez, M. Squillante, C. Xia, D. Yao, L. Zhang, R. Schumsky, L. Green, S. Meyn, A. Ahn, D. Stanford, W. Grassman, …

Q9: Who gets control: man or woman?

Beneficiary Side Control (threshold T_B): "Help me when I have > T_B jobs, or you’re free."
Donor Side Control (threshold T_D): "I’ll help you when I have < T_D jobs."

Question 9: Who should have control? Dan (donor) or Betty (beneficiary)?
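The two quoted rules translate directly into predicates on the two queue lengths. The sketch below is one reading of the slide text, not the authors' model: the function names are hypothetical, and the guard that the donor can only help when the beneficiary actually has work is an added assumption.

```python
def donor_helps_beneficiary_control(n_b, n_d, t_b):
    """Beneficiary-side control with threshold t_b.

    The donor helps when the beneficiary has more than t_b jobs,
    or when the donor is free (has no jobs of his own).
    """
    if n_b == 0:  # assumption: nothing to help with
        return False
    return n_b > t_b or n_d == 0


def donor_helps_donor_control(n_b, n_d, t_d):
    """Donor-side control with threshold t_d.

    The donor helps when he has fewer than t_d jobs of his own.
    """
    if n_b == 0:  # assumption: nothing to help with
        return False
    return n_d < t_d


# Illustrative states: (beneficiary jobs, donor jobs)
print(donor_helps_beneficiary_control(n_b=7, n_d=3, t_b=6))  # beneficiary over threshold
print(donor_helps_beneficiary_control(n_b=5, n_d=0, t_b=6))  # donor is free
print(donor_helps_donor_control(n_b=1, n_d=2, t_d=3))        # donor under threshold
```

Setting t_d = 1 in the donor-side rule, or a very large t_b in the beneficiary-side rule, reduces either policy to plain cycle stealing, where the donor helps only when he is free.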

Q9: Who gets control: man or woman?

Difficulty of analysis due to 2D-infinite chain. We introduce Markov-based Dimensionality Reduction. [Harchol-Balter, Osogami, Scheller-Wolf SPAA 03, Sigmetrics 03, Allerton 04, QUESTA 05, Perf. Eval. 06]

[Figure: state space with axes # Beneficiary jobs and # Donor jobs, marked with threshold T_B ("Help me when I have > T_B jobs, or you’re free.") and threshold T_D ("I’ll help you when I have < T_D jobs.")]

Q9: Who gets control: man or woman?

Beneficiary Side Control (threshold T_B): "Help me when I have > T_B jobs, or you’re free."
Donor Side Control (threshold T_D): "I’ll help you when I have < T_D jobs."

Answer: Mean response time E[T] minimized when woman controls!

Q10: Which policy is more robust?

Beneficiary Side Control (threshold T_B): "Help me when I have > T_B jobs, or you’re free."
Donor Side Control (threshold T_D): "I’ll help you when I have < T_D jobs."

Q10: Want policy robust against mis-estimation of load…

Q10: Which policy is more robust?

Beneficiary Side Control (threshold T_B): "Help me when I have > T_B jobs, or you’re free."
Donor Side Control (threshold T_D): "I’ll help you when I have < T_D jobs."

Answer: Donor control helps, but even better is to let the Beneficiary have 2 thresholds, where the Donor controls which threshold is used.

Results: Adaptive Dual Threshold (ADT) policy

[Figure: mean response time versus Dan’s load, with curves for T_B = 6 (opt), T_B = 20 (robust), and ADT]

ADT: meets both goals.

Impact: Robustness equally important to efficiency

Conclusion

We’ve covered many themes in system design:
• Capacity provisioning
• Heavy-tailed workloads & scheduling
• Architectures: 1 fast or many slow
• Load UNbalancing
• Server farms: routing policies
• Threshold-based resource sharing/control
• Robustness

If you want to know more …

Come take my class: 15-857 Performance Modeling & Design

** Highly recommended for CS theory, Math, TEPPER, and ACO doctoral students

Queueing theory is an old area of mathematics which has recently become very hot. The goal of queueing theory has always been to improve the design/performance of systems, e.g. networks, servers, memory, disks, distributed systems, etc., by finding smarter schemes for allocating resources to jobs.

In this class we will study the beautiful mathematical techniques used in queueing theory, including stochastic analysis, discrete-time and continuous-time Markov chains, renewal theory, product-forms, transforms, supplementary random variables, fluid theory, scheduling theory, matrix-analytic methods, and more. Throughout we will emphasize realistic workloads, in particular heavy-tailed workloads.

This course is packed with open problems: problems which, if solved, are not just interesting theoretically, but which have huge applicability to the design of computer systems today.

Instructor: Mor Harchol-Balter
Prerequisite: Strong probability background
Mondays & Wednesdays 3-4:30

BACKUP

Q7: To balance or not to balance?

[Figure: x × f(x) versus job size x, divided by size-based cutoffs into S, M, L, and XL ranges]

Question 7: How to choose the size cutoffs?

To Balance or Not to Balance?

[Figure: x × f(x) versus job size x; a size-based dispatcher sends small jobs (S) to one FCFS host and large jobs (L) to another]

Answer: Recent research on heavy-tailed workloads, where Pr{Job size > x} ~ x^(-α):
• α < 1: UNBALANCE, favor smalls
• α = 1: BALANCE LOAD
• α > 1: UNBALANCE, favor larges

[Harchol-Balter, Vesilo, 06+], [Glynn, Harchol-Balter, Ramanan, 06+]

Impact: May want to rethink all those load balancing policies …
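As a baseline for thinking about cutoffs, the load-balancing cutoff for two hosts can be computed in closed form if one assumes a bounded Pareto job size distribution on [k, p] with density proportional to x^(-α-1). That distribution, the function names, and the range 1 to 10^6 are assumptions for illustration; the slide only specifies the tail behavior. The sketch computes the cutoff that equalizes the load integral of x × f(x) on both sides. It does not compute the unbalanced cutoffs the slide recommends when α differs from 1.

```python
import math


def load_between(a, b, alpha):
    """Unnormalized load carried by job sizes in [a, b]: integral of x * x^(-alpha-1)."""
    if alpha == 1.0:
        return math.log(b / a)
    return (b ** (1.0 - alpha) - a ** (1.0 - alpha)) / (1.0 - alpha)


def balanced_cutoff(k, p, alpha):
    """Size cutoff c that puts equal load on [k, c] and [c, p].

    Assumes a bounded Pareto density on [k, p]; the normalizing
    constant cancels, so it is omitted.
    """
    if alpha == 1.0:
        return math.sqrt(k * p)
    e = 1.0 - alpha
    return ((k ** e + p ** e) / 2.0) ** (1.0 / e)


k, p = 1.0, 1e6  # hypothetical smallest and largest job sizes
for alpha in (0.5, 1.0, 1.5):
    c = balanced_cutoff(k, p, alpha)
    print(f"alpha={alpha}: balanced cutoff {c:.4g}, "
          f"loads {load_between(k, c, alpha):.4g} / {load_between(c, p, alpha):.4g}")
```

The balanced cutoff moves by orders of magnitude as α changes, which gives a sense of how sensitive size-based splitting is to the tail parameter before any deliberate unbalancing is applied.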