Packet Scheduling/Arbitration in Virtual Output Queues and Others CSIT 560 by M. Hamdi 1

Key Characteristics in Designing Internet Switches and Routers
1. Scalability in terms of line rates
2. Scalability in terms of number of interfaces (port numbers)

Switch/Router Architecture Comparison
http://www.lightreading.com/document.asp?doc_id=47959

Head-of-Line Blocking
[Figure: head-of-line blocking; the packet behind the head of the queue is blocked.]

Crossbar Switches: Virtual Output Queues
• Virtual Output Queues:
– At each input port, there are N queues, each associated with an output port
– Only one packet can go from an input port at a time
– Only one packet can be received by an output port at a time
• It retains the scalability of FIFO input-queued switches (no memory bandwidth problem)
• It eliminates the HoL problem with FIFO input queues
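The VOQ organization described above can be sketched as a small data structure. This is only an illustrative sketch: the class name `VOQInput` and its methods are hypothetical, not taken from the slides.

```python
from collections import deque


class VOQInput:
    """One input port holding N virtual output queues, one per output."""

    def __init__(self, num_outputs):
        self.voqs = [deque() for _ in range(num_outputs)]

    def enqueue(self, cell, output):
        # A cell waits only behind cells destined to the same output.
        self.voqs[output].append(cell)

    def requests(self):
        """Outputs for which this input has at least one queued cell."""
        return [j for j, q in enumerate(self.voqs) if q]

    def dequeue(self, output):
        # Called when the scheduler matches this input to `output`.
        return self.voqs[output].popleft()


port = VOQInput(num_outputs=3)
port.enqueue("cell-A", output=2)
port.enqueue("cell-B", output=0)  # in a single FIFO this would wait behind cell-A
print(port.requests())            # [0, 2]
```

Because each destination has its own queue, a busy output 2 does not prevent `cell-B` from being offered to output 0.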

Virtual Output Queues
[Figure: virtual output queues.]

VOQs: How Packets Move
[Figure: VOQs and scheduler.]

Crossbar Scheduler in VOQ Architecture
[Figure: VOQ inputs, crossbar, and scheduler. Memory b/w = 2R.]
The scheduler can be quite complex!

Question: do more lanes help?
• Answer: it depends on the scheduling
[Figure panels: Head-of-Line Blocking; VOQs with Bad Scheduling; Good Scheduling?]
Ayalon: depends on traffic matrix…

Crossbar Scheduler in VOQ Architecture
The scheduler decides which packets can be sent during each configuration of the crossbar.

Switch core architecture
[Figure: Port #1 through Port #256, each with optics, LCS protocol, and a port processor, exchange cell data, requests, and grants/credits with the crossbar and the scheduler (like the processor of a computer).]

Basic Switch Model
[Figure: N×N switch with arrivals $A_{11}(n), \ldots, A_{NN}(n)$, queue occupancies $L_{11}(n), \ldots, L_{NN}(n)$, service matrix $S(n)$, and departures $D_1(n), \ldots, D_N(n)$.]

Some definitions
3. Queue occupancies: $L_{11}(n), \ldots, L_{NN}(n)$

Some possible performance goals
When traffic is admissible

VOQ Switch Scheduling
• The VOQ switch scheduling can be represented by a bipartite graph
– The left-hand side nodes of the bipartite graph are the input ports
– The right-hand side nodes of the bipartite graph are the output ports
– The edges between the nodes are requests for packet transmission between input ports and output ports.
[Figure: bipartite request graph with nodes A–F on the left and 1–6 on the right.]

Maximum size bipartite match
• Intuition: maximizes instantaneous throughput
[Figure: "Request" graph, with an edge wherever $L_{ij}(n) > 0$, and the resulting maximum size bipartite match.]

Network flows and bipartite matching
[Figure: source s, nodes A–F, nodes 1–6, sink t.]
Finding a maximum size bipartite matching is equivalent to solving a network flow problem with capacities and flows of size “1”.

Network Flows
[Figure: flow network with source s, nodes a, b, c, d, sink t, and edge capacities of 10 and 1.]
• Let G=[V, E] be a directed graph with capacity cap(v, w) on edge [v, w].
• A flow is an (integer) function, f, that is chosen for each edge so that f(v, w) <= cap(v, w).
• We wish to maximize the flow allocation.

A maximum network flow example
[Figure: by inspection, Step 1 gives a flow of size 10.]
[Figure: Step 2 gives a flow of size 10+1 = 11.]
[Figure: maximum flow (not obvious): flow of size 10+2 = 12.]

Ford-Fulkerson method of augmenting paths
1. Set f(v, w) = -f(w, v) on all edges.
2. Define a Residual Graph, R, in which res(v, w) = cap(v, w) – f(v, w)
3. Find paths from s to t for which there is positive residue.
4. Increase the flow along the paths to augment them by the minimum residue along the path.
5. Keep augmenting paths until there are no more to augment.
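The augmenting-path steps above can be sketched in a few lines. This is a minimal sketch, not the slides' own code: it assumes the shortest-augmenting-path (breadth-first) variant, and the function name `max_flow` and the dictionary graph format are illustrative choices.

```python
from collections import deque


def max_flow(cap, s, t):
    """cap: {u: {v: capacity}}. Returns the value of a maximum s-t flow."""
    # Residual capacities res(v, w); reverse edges start at 0.
    res = {s: {}, t: {}}
    for u, edges in cap.items():
        for v, c in edges.items():
            res.setdefault(u, {})
            res.setdefault(v, {})
            res[u][v] = res[u].get(v, 0) + c
            res[v].setdefault(u, 0)
    total = 0
    while True:
        # Step 3: breadth-first search for a path with positive residue.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, r in res[u].items():
                if r > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return total  # Step 5: no augmenting path is left.
        # Step 4: augment by the minimum residue along the path.
        bottleneck = float("inf")
        v = t
        while parent[v] is not None:
            bottleneck = min(bottleneck, res[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            res[u][v] -= bottleneck
            res[v][u] += bottleneck  # allows later paths to undo this flow
            v = u
        total += bottleneck


# Hypothetical 4-node network, not one of the slide figures.
example = {"s": {"a": 3, "b": 2}, "a": {"b": 1, "t": 2}, "b": {"t": 3}}
print(max_flow(example, "s", "t"))  # 5
```

The reverse residual edges are what let a later augmenting path reroute flow that an earlier path placed badly.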

Example of Residual Graph
Residual Graph, R: res(v, w) = cap(v, w) – f(v, w)
[Figure: flow of size 10, its residual graph, and an augmenting path.]
[Figure: Step 2: flow of size 10+1 = 11, its residual graph, and an augmenting path.]
[Figure: Step 3: flow of size 10+2 = 12 and its residual graph.]

Another Example: Ford-Fulkerson method
Repeatedly find an augmenting path p in the residual graph Gf and augment the flow in G:
• f = 0, then f = 4
• f = 4, then f = 4+12 = 16
• f = 16, then f = 16+7 = 23
• f = 23: no more augmenting path. Maximum flow is 23.
[Figure sequence: graph G with nodes s, a, b, c, d, t, shown with flow/capacity labels next to the residual graph Gf at each step.]

An example for Flow: Obvious solution
[Figure: input graph G, residual graph Gr, flow graph Gf.]
Total flow = 10, sub-optimal solution!

Flow algorithm – Optimal version
[Figure: input graph G, residual graph Gr, flow graph Gf.]
Total flow = 10 + 9 = 19 units!

Complexity of network flow problems
• In general, it is possible to find a solution by considering at most V·E paths, by picking the shortest augmenting path first.
• There are many variations, such as picking the path with the largest augmentation first.
• The complexity of the algorithm is lower when the graph is bipartite.
• There are techniques other than the Ford-Fulkerson method.

Network flows and bipartite matching: Ford-Fulkerson Algorithm
Finding a maximum size bipartite matching is equivalent to solving a network flow problem with capacities and flows of size “1”.
[Figure sequence, steps 1–8: source, nodes a–f, nodes 1–6, sink.]
• Steps 2–6: increasing the flow by 1 at each step.
• Step 7: augmenting flow along the augmenting path.
• Step 8: maximum flow found! Thus maximum matching found.
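With unit capacities, the flow computation reduces to repeatedly searching for an augmenting path from an unmatched input. The following is a minimal sketch of that idea; the function name and the list-of-lists request format are assumptions made for illustration.

```python
def max_bipartite_matching(requests, num_outputs):
    """requests[i] lists the outputs that input i has cells for.

    Returns (size, match_of_output), where match_of_output[j] is the input
    matched to output j, or None.
    """
    match_of_output = [None] * num_outputs

    def try_augment(i, visited):
        # Look for an augmenting path that starts at input i.
        for j in requests[i]:
            if j in visited:
                continue
            visited.add(j)
            owner = match_of_output[j]
            # Take a free output, or re-route its current owner elsewhere.
            if owner is None or try_augment(owner, visited):
                match_of_output[j] = i
                return True
        return False

    size = sum(try_augment(i, set()) for i in range(len(requests)))
    return size, match_of_output


# Hypothetical 3x3 request graph.
print(max_bipartite_matching([[0, 1], [0], [1, 2]], 3))  # (3, [1, 0, 2])
```

In the example, input 1 can only use output 0, so the search moves input 0 from output 0 to output 1; this re-routing of an earlier choice is exactly what an augmenting path does.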

Complexity of Maximum Matchings
• Maximum Size/Cardinality Matchings:
– Algorithm by Dinic $O(N^{5/2})$
• Maximum Weight Matchings
– Algorithm by Kuhn $O(N^3 \log N)$
• ftp://dimacs.rutgers.edu/pub/netflow/matching/ (contains code for maximum size/weight matching algorithms)
• In general:
– Hard to implement in hardware
– Slooooow.

Maximum size bipartite match
• Intuition: maximizes instantaneous throughput
• Gives 100% throughput for uniform traffic.
[Figure: "Request" graph, with an edge wherever $L_{ij}(n) > 0$, and the resulting maximum size bipartite match.]

Why doesn’t maximizing instantaneous throughput give 100% throughput for non-uniform traffic?
Three possible matches, S(n):

Maximum weight matching
• Weight could be length of queue or age of packet
• Achieves 100% throughput under all traffic patterns
[Figure: switch model with arrivals $A_{ij}(n)$, occupancies $L_{ij}(n)$, departures $D_j(n)$, and service matrix $S^*(n)$; "Request" graph with edge weights $L_{ij}(n)$ and the resulting maximum weight bipartite match.]
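For a very small switch, a maximum weight match can be found by brute force over all N! matchings. The sketch below assumes queue length as the weight; the function name and the occupancy matrix are hypothetical.

```python
from itertools import permutations


def max_weight_match(L):
    """L[i][j] = weight of VOQ(i, j). Returns (match, weight), match[i] = output."""
    n = len(L)

    def total(p):
        return sum(L[i][p[i]] for i in range(n))

    best = max(permutations(range(n)), key=total)
    return list(best), total(best)


# Hypothetical occupancies: one long queue at VOQ(0, 0), short queues elsewhere.
L = [[10, 1, 0],
     [1, 0, 0],
     [0, 0, 1]]
print(max_weight_match(L))  # ([0, 1, 2], 11)
```

Here a maximum size match would serve three non-empty queues (0→1, 1→0, 2→2) for a total weight of 3, while the maximum weight match serves the long queue instead. Brute force is only usable for tiny N.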

Packet Scheduling/Arbitration in Virtual Output Queues: Maximal Matching Algorithms

Maximum Matching in VOQ Architecture
[Figure: the same 4×4 weighted request graph matched by maximum size matching and by maximum weight matching.]

Complexity of Maximum Matchings
• Maximum Size/Cardinality Matchings:
– Algorithm by Dinic $O(N^{5/2})$
• Maximum Weight Matchings
– Algorithm by Kuhn $O(N^3 \log N)$
• In general:
– Hard to implement in hardware
– Slooooow.

Maximal Matching
• A maximal matching is a matching in which each edge is added one at a time, and is not later removed from the matching.
• i.e., no augmenting paths allowed (they remove edges added earlier), like matching by inspection.
• No input and output are left unnecessarily idle.

Example of Maximal Size Matching
[Figure: request graph with nodes A–F and 1–6, shown with a maximal matching and with a maximum matching.]

Comments on Maximal Matchings
• In general, maximal matching is much simpler to implement, and has a much faster running time.
• A maximal size matching is at least half the size of a maximum size matching.
• A maximal weight matching is defined in the obvious way.
• A maximal weight matching is at least half the weight of a maximum weight matching.
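A maximal size matching can be built greedily, adding edges one at a time and never removing one. The sketch below is illustrative only (the function name and edge-list format are assumptions); the example shows the factor-of-two gap being reached.

```python
def maximal_match(edges):
    """edges: (input, output) requests in the order they are considered."""
    used_inputs, used_outputs, match = set(), set(), []
    for i, j in edges:
        # Add the edge only if both endpoints are still free; never undo it.
        if i not in used_inputs and j not in used_outputs:
            match.append((i, j))
            used_inputs.add(i)
            used_outputs.add(j)
    return match


requests = [(0, 0), (0, 1), (1, 0)]
print(maximal_match(requests))                    # [(0, 0)]
print(maximal_match([(0, 1), (1, 0), (0, 0)]))    # [(0, 1), (1, 0)]
```

Both results are maximal (no further edge can be added), but only the second is maximum. The first has size 1 against a maximum of 2, which is the worst case allowed by the half-size bound.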

PIM Maximal Size Matching Algorithm: Performance and Properties
• It is among the very first practical schedulers proposed for VOQ architectures (used by DEC).
• It is based on having arbiters at the inputs and outputs
• It iterates the following steps until no more requests can be accepted (or for a given number of iterations):
1. Request: Each unmatched input sends a request to every output for which it has a queued cell.
2. Grant (outputs): If an unmatched output receives any requests, it grants one by selecting uniformly at random among them.
3. Accept (inputs): If an unmatched input receives any grants, it accepts one by selecting randomly among the outputs that granted it.
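The three PIM steps can be sketched in software as follows. This is a sequential sketch of what the hardware arbiters do in parallel; the function name `pim`, its parameters, and the request format are assumptions for illustration.

```python
import random


def pim(requests, n, iterations=None, rng=random):
    """requests[i] = set of outputs input i has queued cells for.

    Returns {input: output}. With iterations=None it runs until no grant
    can be made, which yields a maximal matching.
    """
    match = {}
    matched_outputs = set()
    rounds = 0
    while iterations is None or rounds < iterations:
        rounds += 1
        grants = {}  # input -> outputs that granted it in this iteration
        for j in range(n):
            if j in matched_outputs:
                continue
            # Request: unmatched inputs with a cell queued for output j.
            reqs = [i for i in range(n) if i not in match and j in requests[i]]
            if reqs:
                # Grant: the output picks one request uniformly at random.
                grants.setdefault(rng.choice(reqs), []).append(j)
        if not grants:
            break
        for i, outs in grants.items():
            # Accept: the input picks one grant uniformly at random.
            j = rng.choice(outs)
            match[i] = j
            matched_outputs.add(j)
    return match


reqs = [{0, 1}, {0}, {1, 2}, set()]
print(pim(reqs, 4))                 # result depends on the random choices
print(pim(reqs, 4, iterations=1))   # a single iteration may leave ports idle
```

Every iteration that issues a grant adds at least one match, so the loop terminates after at most N iterations.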

Implementation of the parallel maximal matching algorithms
[Figure: state of input queues ($N^2$ bits), N grant arbiters, N request arbiters, and a decision register.]

Implementation of the parallel maximal matching algorithms (another similar way)

PIM Maximal Size Matching Algorithm: Performance and Properties
PIM: 1st Iteration
[Figure: 4×4 switch. Step 1: Request. Step 2: Grant (random selection). Step 3: Accept (random selection).]

PIM: 2nd Iteration
[Figure: 4×4 switch. Step 1: Request. Step 2: Grant. Step 3: Accept.]

Traffic Types to Evaluate Algorithms
• Uniform traffic
• Unbalanced traffic
• Hotspot traffic

Parallel Iterative Matching
[Plot: PIM with a single iteration.]
[Plot: PIM with 4 iterations.]

Parallel Iterative Matching: Analytical Results
Number of iterations to converge:

PIM Maximal Size Matching Algorithm: Performance and Properties
• It is a fair algorithm in servicing inputs
• Can have 100% throughput under uniform traffic
• It converges in log N iterations to a maximal size matching
• It performs very poorly (63% throughput) with 1 iteration, because several outputs can grant the same input and nothing desynchronizes the output arbiters
• It is not easy to build random arbiters in hardware
• The best iterative maximal size matching algorithm takes $O(N^2 \log N)$ serial or $O(\log N)$ parallel time steps.
• If the number of iterations is constant, then it can be implemented in constant time (that is why it is practical); however, the hardware design is not trivial.

RRM Maximal Size Matching Algorithm: Performance and Properties
• Round Robin Matching (RRM) is easier to implement than PIM (in terms of designing the I/O arbiters).
• The pointers of the arbiters move in a straightforward way
• It iterates the following steps until no more requests can be accepted (or for a given number of iterations):
• Request. Each input sends a request to every output for which it has a queued cell.
• Grant. If an output receives any requests, it chooses the one that appears next in a fixed, round-robin schedule starting from the highest priority element. The output notifies each input whether or not its request was granted. The pointer $g_i$ to the highest priority element of the round-robin schedule is incremented (modulo N) to one location beyond the granted input. If no request is received, the pointer stays unchanged.
• Accept. If an input receives a grant, it accepts the one that appears next in a fixed, round-robin schedule starting from the highest priority element. The pointer $a_i$ to the highest priority element of the round-robin schedule is incremented (modulo N) to one location beyond the accepted output. If no grant is received, the pointer stays unchanged.

RRM Maximal Matching Algorithm
[Figure sequence on a 4×4 switch with ports 0–3 and round-robin pointers: (1) Step 1: Request; (2) Step 2: Grant; (3) Step 3: Accept.]

Poor performance of RRM Maximal Matching Algorithm
[Figure: 2×2 switch example with the pointer positions over successive time slots.]
50% Throughput

iSLIP Maximal Size Matching Algorithm: Performance and Properties
• It is a scheduler used in most VOQ switches (e.g., Cisco).
• It is exactly like the RRM algorithm with the following change:
• Grant. If an output receives any requests, it chooses the one that appears next in a fixed, round-robin schedule starting from the highest priority element. The output notifies each input whether or not its request was granted. The pointer $g_i$ to the highest priority element of the round-robin schedule is incremented (modulo N) to one location beyond the granted input if and only if the grant is accepted in the Accept phase.

iSLIP Maximal Size Matching Algorithm
iSLIP: 1st Iteration
[Figure: 4×4 switch showing the original pointer, the selected one, and the updated pointer. Step 1: Request. Step 2: Grant. Step 3: Accept.]

iSLIP: 2nd Iteration
[Figure: 4×4 switch. Step 1: Request. Step 2: Grant. Step 3: Accept. No change.]

Simple Iterative Algorithms: iSLIP
[Figure sequence on a 4×4 switch with ports 0–3 and round-robin pointers: Step 1: Request; Step 2: Grant; Step 3: Accept.]

iSLIP Implementation
[Figure: programmable priority encoders. N grant arbiters and N accept arbiters, each with N inputs and $\log_2 N$ outputs, plus state and decision registers.]

Hardware Design
[Figure: layout of the 256-bit priority encoder.]
[Figure: layout of the 256-bit grant arbiter.]

FIRM Maximal Size Matching Algorithm: Performance and Properties
• It is exactly like iSLIP with a very small yet significant modification.
• Grant (outputs): If an unmatched output receives a request, it grants the one that appears next in a fixed, round-robin schedule starting from the highest priority element. The output notifies each input whether or not its request is granted. If the grant is accepted, the pointer to the highest priority element of the round-robin schedule is incremented to one location beyond the granted input. If the input does not accept, the pointer is set to the granted input.

Simple Iterative Algorithms: FIRM
[Figure: 4×4 switch with ports 0–3 and round-robin pointers. Step 3: Accept.]

Pointer Synchronization
• Why this is good: this small change prevents the output arbiters from moving in lock-step (being synchronized, i.e., pointing to the same input), leading to a dramatic improvement in performance.
• If several outputs grant the same input, no matter how this input chooses, only one match can be made, and the other outputs will be idle.
• To get as many matches as possible, it's better that each output grants a different input.
• Since each output will select the highest priority input if a request is received from this input, it's better to keep the output pointers desynchronized (pointing to different locations).

iSLIP Maximal Matching Algorithm
[Figure: 2×2 switch example with the pointer positions over successive time slots.]
100% Throughput

Pointer Synchronization: Differences between RRM, iSLIP & FIRM

Pointer  Event               RRM                         iSLIP                       FIRM
Input    No grant            unchanged                   unchanged                   unchanged
Input    Granted             one location beyond the accepted one (all three)
Output   No request          unchanged                   unchanged                   unchanged
Output   Grant accepted      one location beyond the granted one (all three)
Output   Grant not accepted  one location beyond the     unchanged                   the granted one
                             previously granted one
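The effect of the three grant-pointer rules can be checked with a small simulation. This is a sketch under stated assumptions: one iteration per time slot, every VOQ permanently backlogged (so each output always grants the input its pointer indicates), and all pointers starting at 0. The function name `simulate` is hypothetical.

```python
def simulate(policy, n=2, slots=1000):
    """policy is "rrm", "islip" or "firm". Returns matches per port per slot."""
    g = [0] * n  # grant pointer of each output
    a = [0] * n  # accept pointer of each input
    served = 0
    for _ in range(slots):
        # Grant: with all VOQs backlogged, output j grants input g[j].
        granted_by = {}
        for j in range(n):
            granted_by.setdefault(g[j], []).append(j)
        for i, outs in granted_by.items():
            # Accept: first granting output at or after the accept pointer.
            chosen = min(outs, key=lambda j: (j - a[i]) % n)
            a[i] = (chosen + 1) % n
            served += 1
            for j in outs:
                if j == chosen or policy == "rrm":
                    g[j] = (i + 1) % n  # one location beyond the granted input
                elif policy == "firm":
                    g[j] = i            # grant not accepted: stay on granted input
                # "islip": grant not accepted, pointer unchanged
    return served / (slots * n)


for policy in ("rrm", "islip", "firm"):
    print(policy, simulate(policy))
```

Under these assumptions RRM keeps both grant pointers in lock-step and serves one cell per slot on the 2×2 switch, while iSLIP and FIRM desynchronize after the first slot. In this fully backlogged case the FIRM and iSLIP rules coincide, because the granted input is always the one under the pointer.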

General remarks
• Since all of these algorithms try to approximate maximum size matching, they can be unstable under non-uniform traffic
• They can achieve 100% throughput under uniform traffic
• Under a large number of iterations, their performance is similar
• They have similar implementation complexity

Input Queueing: Longest Queue First or Oldest Cell First
Weight = { Queue Length, Waiting Time }
[Figure: 4×4 request graph with edge weights of 1 and 10 and its maximum weight match.]
100% throughput

Input Queueing: Why is serving long/old queues better than serving the maximum number of queues?
• When traffic is uniformly distributed, servicing the maximum number of queues leads to 100% throughput.
• When traffic is non-uniform, some queues become longer than others.
• A good algorithm keeps the queue lengths matched, and services a large number of queues.
[Plots: average occupancy versus VOQ # for uniform traffic and for non-uniform traffic.]

Maximum/Maximal Weight Matching
• 100% throughput for admissible traffic (uniform or non-uniform)
• Maximum Weight Matching
– OCF (Oldest Cell First): w = cell waiting time
– LQF (Longest Queue First): w = input queue occupancy
– LPF (Longest Port First): w = QL of the source port + sum of QL from the source port to the destination port
• Maximal Weight Matching (practical algorithms)
– iOCF
– iLQF
– iLPF (comparators in the critical path of iLQF are removed)

Maximal Weight Matching Algorithms: iLQF
• Request. Each unmatched input sends a multi-bit request word to each output for which it has a queued cell, indicating the number of cells that it has queued to that output.
• Grant. If an unmatched output receives any requests, it chooses the largest valued request. Ties are broken randomly.
• Accept. If an unmatched input receives one or more grants, it accepts the one to which it made the largest valued request. Ties are broken randomly.
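One request/grant/accept iteration of iLQF can be sketched as below. The function name, the occupancy matrix `L`, and the way already-matched ports are passed in are illustrative assumptions rather than the slides' notation.

```python
import random


def ilqf_iteration(L, matched_inputs, matched_outputs, rng=random):
    """L[i][j] = number of cells queued at input i for output j.

    Returns the new {input: output} matches made in this iteration.
    """
    n = len(L)
    grants = {}  # input -> outputs that granted it
    for j in range(n):
        if j in matched_outputs:
            continue
        # Request: unmatched inputs with cells for output j send L[i][j].
        reqs = [i for i in range(n) if i not in matched_inputs and L[i][j] > 0]
        if reqs:
            # Grant: the largest valued request wins; ties broken randomly.
            top = max(L[i][j] for i in reqs)
            winner = rng.choice([i for i in reqs if L[i][j] == top])
            grants.setdefault(winner, []).append(j)
    new_matches = {}
    for i, outs in grants.items():
        # Accept: the grant for the input's longest queue; ties broken randomly.
        top = max(L[i][j] for j in outs)
        new_matches[i] = rng.choice([j for j in outs if L[i][j] == top])
    return new_matches


L = [[5, 2],
     [3, 0]]
first = ilqf_iteration(L, set(), set())
print(first)  # {0: 0}
```

In the example both outputs grant input 0, which accepts output 0 because that queue is longer. Input 1 has no cell for the remaining output 1, so a second iteration adds nothing.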

Maximal Weight Matching Algorithms: iLQF
• The iLQF algorithm has the following properties:
• Property 1. Independent of the number of iterations, the longest input queue is always served.
• Property 2. As with iSLIP, the algorithm converges in at most log N iterations.
• Property 3. For an inadmissible offered load, an input queue may be starved.

Maximal Weight Matching Algorithms: iOCF
• The iOCF algorithm works in similar fashion to iLQF, and has the following properties:
• Property 1. Independent of the number of iterations, the cell that has been waiting the longest time in the input queues is always served (it must be at the head of its queue).
• Property 2. As with iLQF, the algorithm converges in at most log N iterations.
• Property 3. No input queue can be starved indefinitely.
• Property 4. It is difficult to keep time stamps on the cells.

iLQF - Implementation
[Figure: iLQF hardware implementation.]
Complicated hardware

Other research efforts
• Packet-based arbitration
• Exhaustive-based arbitration
• Numerous other efforts

Packet Scheduling/Arbitration in Virtual Output Queues: Randomized Algorithms and Others

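Input-Queued Packet Switch
[Figure: inputs 1..N and outputs 1..N joined by a crossbar under a scheduler, with arrival rates $\lambda_{1,1}, \ldots, \lambda_{i,j}, \ldots, \lambda_{N,N}$ and queue lengths $X_{i,j}$.]
Admissible traffic: $\sum_i \lambda_{i,j} < 1$; $\sum_j \lambda_{i,j} < 1$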

Bipartite Graph and Matrix
[Figure: bipartite graph between inputs 1–3 and outputs 1–3, shown next to the equivalent 0/1 matrix.]

Stability of Scheduling
Definition: Let $X_{i,j}(t)$ be the number of packets queued at input i for output j at time-slot t. Then an algorithm is stable iff: $\sup_t E\left[\sum_{i,j} X_{i,j}(t)\right] < \infty$

Motivation
• Networking problems suffer from the “curse of dimensionality”
– algorithmic solutions do not scale well
• Typical causes
– size: large number of users or large number of I/O
– time: very high speeds of operation
• A good deterministic algorithm exists (Max Flow), but …
– it needs state information, and “state” is too big
– it “starts from scratch” in each iteration

Randomization
• Randomized algorithms have frequently been used in many situations where the state space is very large (e.g., the N! different possible connections between inputs and outputs)
• Randomized algorithms
– are a powerful way of approximating the optimal solution
– it is often possible to randomize deterministic algorithms
– this simplifies the implementation while retaining a (surprisingly) high level of performance
• The main idea is
– to simplify the decision-making process
– by basing decisions upon a small, randomly chosen sample of the state
– rather than upon the complete state

Randomizing Iterative Schemes (e.g., iSLIP)
• Often, we want to perform some operation iteratively
• Example: find the heaviest matching in a switch in every time slot
• Since, in each time slot
– at most one packet can arrive at each input
– and, at most one packet can depart from each output
⇒ the size of the queues, or the “state” of the switch, doesn’t change by much between successive time slots
⇒ so, a matching that was heavy at time t will quite likely continue to be heavy at time t+1
• This suggests that
– knowing a heavy matching at time t should help in determining a heavy matching at time t+1
⇒ there is no need to start from scratch in each time slot

Summarizing Randomized Algorithms
• Randomized algorithms can help simplify the implementation
– by reducing the amount of work in each iteration
• If the state of the system doesn’t change by much between iterations, then
– we can reduce the work even further by carrying information between iterations
• The big pay-off is
⇒ that, even though it is an approximation, the performance of a randomized scheme can be surprisingly good

Randomized Scheduling Algorithms: Example
• Consider a 3 x 3 input-queued switch
– input traffic is Bernoulli IID with $\lambda_{ij} = \alpha/3$ for all i, j, and $\alpha < 1$
– this is admissible
– note: there are a total of 6 (= 3!) possible service matrices

Random Scheduling Algorithms
• In time slot n, let S(n) be equal to one of the 6 possible matchings, chosen independently and uniformly at random (u.a.r.)
• Stability of Random
– Consider $L_{11}(n)$, the number of packets in VOQ 11
– Arrivals to VOQ 11 occur according to $A_{11}(n)$, which is Bernoulli IID with input rate $\lambda_{11} = \alpha/3$
– This queue gets served whenever the service matrix connects input 1 to output 1
– There are 2 service matrices that connect input 1 to output 1
– Since Random chooses service matrices u.a.r., input 1 is connected to output 1 for a fraction of time = 2/6 = 1/3, which is the service rate between input 1 and output 1
– $E(L_{11}(n)) < \infty$ iff $\lambda_{11} < 1/3$, i.e., $\alpha < 1$
• This random algorithm is stable for this uniform traffic.

Random Scheduling Algorithms
• Instability of Random
• Now suppose $\lambda_{ii} = \alpha$ for all i and $\lambda_{ij} = 0$ for $i \neq j$
– clearly, this is admissible traffic for all $\alpha < 1$
– but, under Random, the service rate at VOQ 11 is 1/3 at best
– hence VOQ 11 and the switch will be unstable as soon as $\alpha > 1/3$
• Stability (or 100% throughput) means it is stable under all admissible traffic!
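The service rate that Random gives a single VOQ can be checked by enumerating all N! service matrices. This is a small sketch; the function name `service_rate` is hypothetical, and ports are numbered from 0 rather than 1.

```python
from fractions import Fraction
from itertools import permutations


def service_rate(n, i=0, j=0):
    """Fraction of the n! matchings that connect input i to output j."""
    matchings = list(permutations(range(n)))
    hits = sum(1 for p in matchings if p[i] == j)
    return Fraction(hits, len(matchings))


print(service_rate(3))                   # 1/3
print(Fraction(9, 10) / 3 < service_rate(3))   # uniform traffic, alpha = 0.9: True
print(Fraction(1, 2) < service_rate(3))        # diagonal traffic, alpha = 0.5: False
```

With uniform traffic the arrival rate $\alpha/3$ stays below the service rate 1/3 for every $\alpha < 1$, whereas with diagonal traffic the arrival rate $\alpha$ exceeds 1/3 even though the traffic is admissible.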

Obvious Randomized Schemes
• Choose a matching at random and use it as the schedule
⇒ doesn’t give 100% throughput (already shown)
• Choose 2 matchings at random and use the heavier one as the schedule
• Choose N matchings at random and use the heaviest one as the schedule
⇒ None of these can give 100% throughput!

Iterative Randomized Scheme (Tassiulas)
• Say M is the matching used at time t
• Let R be a new matching chosen uniformly at random (u.a.r.) among the N! different matchings
• At time t+1, use the heavier of M and R
• Complexity is very low: O(1) iterations
• This gives 100% throughput!
⇒ note the boost in throughput is due to memory (saving previous matchings)
• But, delays are very large
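One time slot of this scheme can be sketched as follows. The sketch assumes queue length as the weight and represents a matching as a permutation list; the names `weight` and `tassiulas_step` and the fixed occupancy matrix are illustrative only.

```python
import random


def weight(match, L):
    """Total queue length served by a matching, where match[i] = output of input i."""
    return sum(L[i][j] for i, j in enumerate(match))


def tassiulas_step(M, L, rng=random):
    """Keep the heavier of the previous matching M and a fresh random matching R."""
    R = list(range(len(L)))
    rng.shuffle(R)  # uniform over the N! matchings
    return R if weight(R, L) > weight(M, L) else M


# Hypothetical occupancies, held fixed here only to show the effect of memory.
L = [[3, 0, 1],
     [0, 2, 0],
     [4, 0, 5]]
M = [1, 2, 0]
for _ in range(50):
    M = tassiulas_step(M, L)
print(M, weight(M, L))
```

Because the previous matching is retained unless the random one is heavier, the weight of the schedule never decreases while the queue state is unchanged. In a real switch the occupancies change by at most one cell per queue per slot, which is why yesterday's heavy matching stays useful.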

Finer Observations
• Let M be the schedule used at time t
• Choose a “good” random matching R
• M’ = Merge(M, R)
• M’ includes the best edges from M and R
• Use M’ as the schedule at time t+1
• The above procedure yields an algorithm called LAURA
• There are many other small variations to this algorithm.

Merging Procedure
[Figure: merging example on weighted matchings labeled X (W(X)=12), R (W(R)=10), and M (W(M)=13), with per-component weight differences 3-1+2-2 = 2 and 2-1+2-4 = -1.]

Can we avoid having schedulers altogether?

Recap: Two Successive Scaling Problems
OQ routers:
+ work-conserving (QoS)
- memory bandwidth = (N+1)R
IQ routers:
+ memory bandwidth = 2R
- arbitration complexity
[Figure: OQ and IQ routers with line rate R; the IQ router requires a bipartite matching.]

IQ Arbitration Complexity
Today: 64 ports at 10 Gbps, 64-byte cells.
• Arbitration time = 64 bytes / 10 Gbps = 51.2 ns
• Request/grant communication BW = 17.5 Gbps
Scaling to 160 Gbps:
• Arbitration time = 3.2 ns
• Request/grant communication BW = 280 Gbps
Two main alternatives for scaling:
1. Increase cell size
2. Eliminate arbitration
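The arbitration times quoted above follow directly from the cell size and the line rate, since one matching must be computed per cell time. A short check (the function name is hypothetical):

```python
def arbitration_time_ns(cell_bytes, line_rate_gbps):
    """Time to transmit one cell, which bounds the time available per matching."""
    return cell_bytes * 8 / line_rate_gbps  # bits / (Gbit/s) gives nanoseconds


print(arbitration_time_ns(64, 10))    # 51.2
print(arbitration_time_ns(64, 160))   # 3.2
```

The same formula shows why increasing the cell size is one of the two scaling options: the time budget grows linearly with the cell length.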

Desirable Characteristics for Router Architecture
Ideal: OQ
• 100% throughput
• Minimum delay
• Maintains packet order
Necessary: able to regularly connect any input to any output
What if the world was perfect? Assume Bernoulli iid uniform arrival traffic...

Round-Robin Scheduling
• Uniform & non-bursty traffic => 100% throughput
• Problem: traffic is non-uniform & bursty

Two-Stage Switch (I)
[Figure: external inputs 1..N, a first round-robin stage (load balancing), internal inputs 1..N, a second round-robin stage, and external outputs 1..N.]

Two-Stage Switch Characteristics
[Figure: external inputs, internal inputs, and external outputs 1..N connected by cyclic shifts.]
• 100% throughput
• Problem: unbounded mis-sequencing

Two-Stage Switch (II)
New: $N^3$ instead of $N^2$

Expanding VOQ Structure
Solution: expand the VOQ structure by distinguishing among switch inputs
[Figure: expanded VOQ structure.]

What is being done in practice (Cisco for example)
• They want schedulers that achieve 100% throughput and very low delay (like MWM)
• They want it to be as simple as iSLIP in terms of hardware implementation
• Is there any solution to this?

Typical Performance of iSLIP-like Algorithms
[Plot: PIM with 4 iterations.]

What is being done in practice (Cisco for example)

Company    Switching Capacity      Switch Architecture   Fabric Overspeed
Agere      40 Gbit/s-2.5 Tbit/s    Arbitrated crossbar   2x
AMCC       20-160 Gbit/s           Shared memory         1.0x
AMCC       40 Gbit/s-1.2 Tbit/s    Arbitrated crossbar   1-2x
Broadcom   40-640 Gbit/s           Buffered crossbar     1-4x
Cisco      40-320 Gbit/s           Arbitrated crossbar   2x