High Performance Switches and Routers Theory and Practice

High Performance Switches and Routers: Theory and Practice
Sigcomm 99, August 30, 1999, Harvard University
Nick McKeown (nickm@stanford.edu) and Balaji Prabhakar (balaji@isl.stanford.edu)
Departments of Electrical Engineering and Computer Science
Copyright 1999. All Rights Reserved.

Tutorial Outline
• Introduction: What is a Packet Switch?
• Packet Lookup and Classification: Where does a packet go next?
• Switching Fabrics: How does the packet get there?
• Output Scheduling: When should the packet leave?

Introduction: What is a Packet Switch?
• Basic Architectural Components
• Some Example Packet Switches
• The Evolution of IP Routers

Basic Architectural Components
Figure: the building blocks (Admission Control, Policing, Congestion Control, Routing, Switching, Reservation, Output Scheduling) split between the control plane and the per-packet datapath.

Basic Architectural Components
1. Datapath: per-packet processing (forwarding table and forwarding decision at each input)
2. Interconnect
3. Output Scheduling

Where high performance packet switches are used
Figure: carrier-class core routers in the Internet core; ATM switches, Frame Relay switches and edge routers at the edge; enterprise WAN access and enterprise campus switches.

Introduction: What is a Packet Switch?
• Basic Architectural Components
• Some Example Packet Switches
• The Evolution of IP Routers

ATM Switch
• Lookup cell VCI/VPI in VC table.
• Replace old VCI/VPI with new.
• Forward cell to outgoing interface.
• Transmit cell onto link.
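The per-cell processing above amounts to a single table indexed by the incoming VC identifier. A minimal sketch, with an illustrative table layout and field names (not from the tutorial):

# Minimal sketch of the per-cell ATM switching step described above.
# The VC table layout and field names are illustrative assumptions.

# VC table: (in_port, in_vci) -> (out_port, out_vci)
vc_table = {
    (1, 42): (3, 17),
    (2, 99): (1, 64),
}

def switch_cell(in_port, cell):
    """Look up the cell's VCI, rewrite it, and return the outgoing port."""
    key = (in_port, cell["vci"])
    if key not in vc_table:
        return None                # no VC set up: drop the cell
    out_port, out_vci = vc_table[key]
    cell["vci"] = out_vci          # label swap
    return out_port                # forward to this interface

cell = {"vci": 42, "payload": b"..."}
print(switch_cell(1, cell), cell["vci"])   # -> 3 17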

Ethernet Switch
• Lookup frame DA in forwarding table.
  – If known, forward to correct port.
  – If unknown, broadcast to all ports.
• Learn SA of incoming frame.
• Forward frame to outgoing interface.
• Transmit frame onto link.

IP Router
• Lookup packet DA in forwarding table.
  – If known, forward to correct port.
  – If unknown, drop packet.
• Decrement TTL, update header checksum.
• Forward packet to outgoing interface.
• Transmit packet onto link.
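A minimal sketch of the per-packet steps just listed, with the lookup stubbed out and the header checksum updated incrementally in the style of RFC 1624; the dict-based packet and field names are illustrative:

# Sketch of the per-packet IP steps above: lookup (stubbed), TTL decrement,
# and an incremental header-checksum update (RFC 1624 style).

def incr_checksum(old_cksum, old_word, new_word):
    """RFC 1624: HC' = ~(~HC + ~m + m'), 16-bit ones-complement arithmetic."""
    s = (~old_cksum & 0xFFFF) + (~old_word & 0xFFFF) + (new_word & 0xFFFF)
    s = (s & 0xFFFF) + (s >> 16)   # fold carries back in
    s = (s & 0xFFFF) + (s >> 16)
    return ~s & 0xFFFF

def forward(pkt, lookup):
    out_port = lookup(pkt["dst"])          # stand-in for the real LPM lookup
    if out_port is None or pkt["ttl"] <= 1:
        return None                        # unknown destination or TTL expiry: drop
    old_word = (pkt["ttl"] << 8) | pkt["proto"]   # TTL shares a 16-bit word with protocol
    pkt["ttl"] -= 1
    new_word = (pkt["ttl"] << 8) | pkt["proto"]
    pkt["cksum"] = incr_checksum(pkt["cksum"], old_word, new_word)
    return out_port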

Introduction: What is a Packet Switch?
• Basic Architectural Components
• Some Example Packet Switches
• The Evolution of IP Routers

First Generation IP Routers
Figure: shared backplane; a central CPU with buffer memory serves the line interfaces (MAC, DMA) over the bus.

Second Generation IP Routers
Figure: shared bus with a central CPU and buffer memory; each line card now has its own DMA, local buffer memory and MAC.

Third Generation Switches/Routers
Figure: a switched backplane interconnects line cards (each with local buffer memory and MAC) and a CPU card.

Fourth Generation Switches/Routers
Clustering and Multistage
Figure: 32 line cards interconnected through a multistage cluster of switches.

Packet Switches: References
• J. Giacopelli, M. Littlewood, W. D. Sincoskie, "Sunshine: A high performance self routing broadband packet switch architecture", ISS '90.
• J. S. Turner, "Design of a Broadcast packet switching network", IEEE Trans. Comm., June 1988, pp. 734-743.
• C. Partridge et al., "A Fifty Gigabit per second IP Router", IEEE Trans. Networking, 1998.
• N. McKeown, M. Izzard, A. Mekkittikul, W. Ellersick, M. Horowitz, "The Tiny Tera: A Packet Switch Core", IEEE Micro Magazine, Jan-Feb 1997.

Tutorial Outline
• Introduction: What is a Packet Switch?
• Packet Lookup and Classification: Where does a packet go next?
• Switching Fabrics: How does the packet get there?
• Output Scheduling: When should the packet leave?

Basic Architectural Components
1. Datapath: per-packet processing (forwarding table and forwarding decision at each input)
2. Interconnect
3. Output Scheduling

Forwarding Decisions
• ATM and MPLS switches
  – Direct Lookup
• Bridges and Ethernet switches
  – Associative Lookup
  – Hashing
  – Trees and tries
• IP Routers
  – Caching
  – CIDR
  – Patricia trees/tries
  – Other methods
• Packet Classification

ATM and MPLS Switches: Direct Lookup
Figure: the incoming VCI is used directly as the memory address; the data read out gives the outgoing (Port, VCI).

Forwarding Decisions
• ATM and MPLS switches
  – Direct Lookup
• Bridges and Ethernet switches
  – Associative Lookup
  – Hashing
  – Trees and tries
• IP Routers
  – Caching
  – CIDR
  – Patricia trees/tries
  – Other methods
• Packet Classification

Bridges and Ethernet Switches: Associative Lookups
Figure: an associative memory (CAM) is searched with the 48-bit address; the matching location (log2 N bits) selects the associated data, plus a Hit signal.
Advantages:
• Simple
Disadvantages:
• Slow
• High power
• Small
• Expensive

Bridges and Ethernet Switches: Hashing
Figure: the 48-bit search data is hashed to a 16-bit memory address; the entry holds the associated data, and a comparison yields the Hit signal.

Lookups Using Hashing: an example
Figure: the 48-bit search data is hashed with CRC-16 to a 16-bit memory address; colliding entries (#1, #2, #3, ...) are kept in linked lists with their associated data.
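A sketch of the hashed MAC-address lookup described above, with chaining for collisions. The hash (CRC-32 truncated to 16 bits) and the table size are stand-ins for the slide's CRC-16:

# Hashed bridge table: a 16-bit hash selects a bucket, collisions chain.
import binascii

NUM_BUCKETS = 1 << 16

def bucket_of(mac: bytes) -> int:
    # CRC-32 truncated to 16 bits stands in for the CRC-16 on the slide.
    return binascii.crc32(mac) & (NUM_BUCKETS - 1)

table = [[] for _ in range(NUM_BUCKETS)]   # each bucket: list of (mac, port)

def learn(mac: bytes, port: int) -> None:
    bucket = table[bucket_of(mac)]
    for i, (m, _) in enumerate(bucket):
        if m == mac:
            bucket[i] = (mac, port)        # refresh existing entry
            return
    bucket.append((mac, port))

def lookup(mac: bytes):
    for m, port in table[bucket_of(mac)]:
        if m == mac:
            return port                    # hit
    return None                            # miss: a real bridge would flood

learn(bytes.fromhex("00163e112233"), 5)
print(lookup(bytes.fromhex("00163e112233")))   # -> 5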

Lookups Using Hashing
Performance of the simple example (plot not recovered).

Lookups Using Hashing
Advantages:
• Simple
• Expected lookup time can be small
Disadvantages:
• Non-deterministic lookup time
• Inefficient use of memory

Trees and Tries
Figure: a binary search tree over N entries (depth log2 N, branching on <, >) compared with a binary search trie branching on address bits 0/1 (example keys 010 and 111).

Trees and Tries: Multiway Tries
Figure: a 16-ary search trie; each node holds (value, pointer) slots for 4-bit strides, with example keys 000011110000 and 111111...

Trees and Tries: Multiway Tries
Table produced from 2^15 randomly generated 48-bit addresses (table not recovered).

Forwarding Decisions
• ATM and MPLS switches
  – Direct Lookup
• Bridges and Ethernet switches
  – Associative Lookup
  – Hashing
  – Trees and tries
• IP Routers
  – Caching
  – CIDR
  – Patricia trees/tries
  – Other methods
• Packet Classification

Caching Addresses
Figure: a slow path through the central CPU and buffer memory, and a fast path handled entirely on the line card (DMA, local buffer memory, MAC) for cached addresses.

Caching Addresses
• LAN: average flow < 40 packets
• WAN: huge number of flows
Figure: cache hit rate with a cache sized at 10% of the full table (plot not recovered).

IP Routers: Class-based Addresses
Figure: the IP address space divided into Class A, B, C and D blocks; a destination such as 212.17.9.4 is resolved by exact match in the routing table, e.g. 212.17.9.0 -> Port 4.

IP Routers: CIDR
Figure: the address line from 0 to 2^32-1. Class-based: fixed A, B, C, D regions. Classless: arbitrary prefixes such as 65/8, 128.9/16 (containing 128.9.0.0 and 128.9.16.14) and 142.12/19.

IP Routers: CIDR
Figure: nested prefixes 128.9/16, 128.9.16/20, 128.9.176/20, 128.9.19/24 and 128.9.25/24 on the 0 to 2^32-1 address line; the address 128.9.16.14 falls within several of them.
Most specific route = "longest matching prefix".

IP Routers: Metrics for Lookups
Example lookup: 128.9.16.14

Prefix          Port
65/8            3
128.9/16        5
128.9.16/20     2
128.9.19/24     7
128.9.25/24     10
128.9.176/20    1
142.12/19       3

Metrics:
• Lookup time
• Storage space
• Update time
• Preprocessing time
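For the example table above, a brute-force longest-prefix match makes the "most specific route" rule concrete; real routers use the faster structures discussed in the following slides:

# Brute-force longest-prefix match over the example table above.
import ipaddress

table = {
    "65.0.0.0/8": 3, "128.9.0.0/16": 5, "128.9.16.0/20": 2,
    "128.9.19.0/24": 7, "128.9.25.0/24": 10, "128.9.176.0/20": 1,
    "142.12.0.0/19": 3,
}
prefixes = [(ipaddress.ip_network(p), port) for p, port in table.items()]

def lookup(addr: str):
    a = ipaddress.ip_address(addr)
    best = None
    for net, port in prefixes:
        if a in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, port)             # keep the most specific match so far
    return best[1] if best else None

print(lookup("128.9.16.14"))   # -> 2 (128.9.16/20 is the longest match)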

IP Router Lookup
Figure: the destination address is extracted from the incoming packet's header; the forwarding engine performs a next-hop computation against the forwarding table (destination -> next hop).
IPv4 unicast destination-address-based lookup.

Need more than IPv4 unicast lookups
• Multicast
  – PIM-SM: longest prefix matching on the source and group address; try (S, G) followed by (*, *, RP); check incoming interface
  – DVMRP: incoming interface check followed by (S, G) lookup
• IPv6
  – 128-bit destination address field
  – Exact address architecture not yet known

Lookup Performance Required
Line    Rate       40 B packets   240 B packets
T1      1.5 Mbps   4.68 Kpps      0.78 Kpps
OC3     155 Mbps   480 Kpps       -
OC12    622 Mbps   1.94 Mpps      323 Kpps
OC48    2.5 Gbps   7.81 Mpps      1.3 Mpps
OC192   10 Gbps    31.25 Mpps     5.21 Mpps
Gigabit Ethernet (84 B packets): 1.49 Mpps

Size of the Routing Table
Figure: growth of the routing table size over time (plot not recovered).
Source: http://www.telstra.net/ops/bgptable.html

Ternary CAMs
Figure: an associative memory of (value, mask) entries for rules R1-R4, e.g. value 10.0 with mask 255.0.0.0, 10.1.1.0, 10.1.3.1; a priority encoder selects the highest-priority match, which supplies the next hop.

Binary Tries
Figure: a binary trie (branch on bit 0/1 at each level) storing the prefixes below at nodes a-j.
Example prefixes:
a) 00001    b) 00010    c) 00011    d) 001
e) 0101     f) 011      g) 100      h) 1010
i) 1100     j) 11110000
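A small binary trie over the example prefixes a-j, with a longest-prefix lookup that remembers the deepest prefix seen along the path; a minimal sketch:

# Binary trie over the slide's example prefixes, with longest-prefix lookup.
class Node:
    __slots__ = ("children", "label")
    def __init__(self):
        self.children = {}   # "0"/"1" -> Node
        self.label = None    # set if a prefix ends here

root = Node()

def insert(prefix: str, label: str) -> None:
    n = root
    for bit in prefix:
        n = n.children.setdefault(bit, Node())
    n.label = label

def longest_prefix_match(addr_bits: str):
    n, best = root, None
    for bit in addr_bits:
        if bit not in n.children:
            break
        n = n.children[bit]
        if n.label is not None:
            best = n.label       # remember the deepest prefix seen so far
    return best

prefixes = {"00001": "a", "00010": "b", "00011": "c", "001": "d",
            "0101": "e", "011": "f", "100": "g", "1010": "h",
            "1100": "i", "11110000": "j"}
for p, name in prefixes.items():
    insert(p, name)

print(longest_prefix_match("00101100"))   # -> 'd' (prefix 001)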

Patricia Tree
Figure: a path-compressed (Patricia) trie over the same prefixes; one-way chains are collapsed, e.g. a Skip=5 edge on the path to j.
Example prefixes:
a) 00001    b) 00010    c) 00011    d) 001
e) 0101     f) 011      g) 100      h) 1010
i) 1100     j) 11110000

Patricia Tree
Disadvantages:
• Many memory accesses
• May need backtracking
• Pointers take up a lot of space
Advantages:
• General solution
• Extensible to wider fields
Avoid backtracking by storing the intermediate best matched prefix (Dynamic Prefix Tries).
40K entries: 2 MB data structure with 0.3-0.5 Mpps [O(W)]

Binary search on trie levels
Figure: the trie is probed at selected levels (level 0, level 8, level 29) to home in on the longest match for prefix P.

Binary search on trie levels
Store a hash table for each prefix length to aid search at a particular trie level.

Length   Hash table contents
8        10
12
16       10.1, 10.2
24       10.1.1, 10.1.2, 10.2.3

Example prefixes: 10.0/8, 10.1.0.0/16, 10.1.1.0/24, 10.1.2.0/24, 10.2.3.0/24
Example addresses: 10.1.1.4, 10.4.4.3, 10.2.3.9, 10.2.4.8
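A hedged sketch of the scheme: one hash table per prefix length, marker entries at shorter lengths, and a binary search over the lengths. For simplicity markers are inserted at every shorter table length and every entry stores its best matching prefix; the real scheme is more economical:

# Binary search on prefix lengths (Waldvogel et al. style), simplified.
import ipaddress

prefixes = {"10.0.0.0/8": "P1", "10.1.0.0/16": "P2", "10.1.1.0/24": "P3",
            "10.1.2.0/24": "P4", "10.2.3.0/24": "P5"}

def addr_bits(addr: str, n: int) -> str:
    return format(int(ipaddress.ip_address(addr)), "032b")[:n]

lengths = sorted({int(p.split("/")[1]) for p in prefixes})
tables = {l: {} for l in lengths}   # length -> {bit string: best matching prefix label}

def best_match(bitstr: str):
    """Longest real prefix that is a prefix of bitstr (brute force, build time only)."""
    best, best_len = None, -1
    for p, label in prefixes.items():
        net, plen = p.split("/")
        plen = int(plen)
        if plen <= len(bitstr) and bitstr.startswith(addr_bits(net, plen)) and plen > best_len:
            best, best_len = label, plen
    return best

for p in prefixes:
    net, plen = p.split("/")
    plen = int(plen)
    for l in lengths:
        if l <= plen:                       # real entry at l == plen, markers below it
            s = addr_bits(net, l)
            tables[l][s] = best_match(s)

def lookup(addr: str):
    lo, hi, best = 0, len(lengths) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        hit = tables[lengths[mid]].get(addr_bits(addr, lengths[mid]))
        if hit is None:
            hi = mid - 1                    # miss: only shorter prefixes can match
        else:
            best = hit                      # hit (entry or marker): remember its BMP
            lo = mid + 1                    # and try to find something longer
    return best

print(lookup("10.1.1.4"), lookup("10.4.4.3"), lookup("10.2.4.8"))   # P3 P1 P1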

Binary search on trie levels
Disadvantages:
• Multiple hashed memory accesses
• Updates are complex
Advantages:
• Scalable to IPv6
33K entries: 1.4 MB data structure with 1.2-2.2 Mpps [O(log W)]

Compacting Forwarding Tables
Figure: the trie is encoded as a bit vector (1 0 0 0 1 0 ... 1 1 1 0 0 0 1 1), one bit per node, so the table can be stored very compactly.

Compacting Forwarding Tables
Figure: the bit vector (10001010 11100010 10000010 10110100 11000000 ...) is divided into chunks; a codeword array ((R1,0), (R2,3), (R3,7), (R4,9), (R5,0)) and a base index array (0, 13, ...) locate the entries.

Compacting Forwarding Tables
Disadvantages:
• Scalability to larger tables?
• Updates are complex
Advantages:
• Extremely small data structure, can fit in cache
33K entries: 160 KB data structure with average 2 Mpps [O(W/k)]

Multi-bit Tries
Figure: the 16-ary search trie again; each node covers a 4-bit stride.

Compressed Tries
Only 3 memory accesses.
Figure: a compressed trie with levels L8, L16 and L24.

Routing Lookups in Hardware
Figure: histogram of the number of prefixes versus prefix length.
Most prefixes are 24 bits or shorter.

Routing Lookups in Hardware: prefixes up to 24 bits
Figure: a table of 2^24 = 16M entries indexed directly by the first 24 bits of the destination address; e.g. 142.19.6.14 indexes entry 142.19.6, which holds the next hop.

Routing Lookups in Hardware: prefixes above 24 bits
Figure: for longer prefixes the first-level entry (e.g. 128.3.72) holds a pointer (base) into a second-level table; the remaining 8 bits of the address (the 44 in 128.3.72.44) form the offset that selects the next hop.
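A sketch of the two-level lookup above, using dicts in place of the two RAMs; nesting a long prefix inside a shorter one is not handled here (the real scheme pre-fills the second-level block with the shorter route's next hop):

# Two-level hardware-style lookup: a 2^24-entry first level indexed by the
# top 24 bits, plus 256-entry second-level blocks indexed by the last 8 bits.
import ipaddress

tbl24 = {}        # top-24-bits -> ("hop", next_hop) or ("ptr", block_id)
tbl_long = {}     # (block_id, last_8_bits) -> next_hop
next_block = 0

def add_route(prefix: str, next_hop: int) -> None:
    global next_block
    net = ipaddress.ip_network(prefix)
    base = int(net.network_address)
    if net.prefixlen <= 24:
        # expand the prefix into every /24 entry it covers
        for i in range(1 << (24 - net.prefixlen)):
            tbl24[(base >> 8) + i] = ("hop", next_hop)
    else:
        top = base >> 8
        if tbl24.get(top, ("hop", None))[0] != "ptr":
            tbl24[top] = ("ptr", next_block)
            next_block += 1
        block = tbl24[top][1]
        for i in range(1 << (32 - net.prefixlen)):
            tbl_long[(block, (base & 0xFF) + i)] = next_hop

def lookup(addr: str):
    a = int(ipaddress.ip_address(addr))
    kind, val = tbl24.get(a >> 8, ("hop", None))
    if kind == "hop":
        return val                        # one memory access
    return tbl_long.get((val, a & 0xFF)) # second access for long prefixes

add_route("142.19.6.0/24", 7)
add_route("128.3.72.0/26", 9)
print(lookup("142.19.6.14"), lookup("128.3.72.44"))   # -> 7 9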

Routing Lookups in Hardware: prefixes up to n bits
Figure: generalization of the scheme: a first-level table of 2^n entries, N-entry second-level blocks, and further tables for prefixes longer than N+M bits, each entry holding a next hop or a pointer to the next level.

Routing Lookups in Hardware
Disadvantages:
• Large memory required (9-33 MB)
• Depends on prefix length distribution
Advantages:
• 20 Mpps with 50 ns DRAM
• Easy to implement in hardware
Various compression schemes can be employed to decrease the storage requirements, e.g. carefully chosen variable-length strides, bitmap compression, etc.

IP Router Lookups: References
• A. Brodnik, S. Carlsson, M. Degermark, S. Pink, "Small Forwarding Tables for Fast Routing Lookups", Sigcomm 1997, pp. 3-14.
• B. Lampson, V. Srinivasan, G. Varghese, "IP lookups using multiway and multicolumn search", Infocom 1998, pp. 1248-56, vol. 3.
• M. Waldvogel, G. Varghese, J. Turner, B. Plattner, "Scalable high speed IP routing lookups", Sigcomm 1997, pp. 25-36.
• P. Gupta, S. Lin, N. McKeown, "Routing lookups in hardware at memory access speeds", Infocom 1998, pp. 1241-1248, vol. 3.
• S. Nilsson, G. Karlsson, "Fast address lookup for Internet routers", IFIP Intl Conf on Broadband Communications, Stuttgart, Germany, April 1-3, 1998.
• V. Srinivasan, G. Varghese, "Fast IP lookups using controlled prefix expansion", Sigmetrics, June 1998.

Forwarding Decisions
• ATM and MPLS switches
  – Direct Lookup
• Bridges and Ethernet switches
  – Associative Lookup
  – Hashing
  – Trees and tries
• IP Routers
  – Caching
  – CIDR
  – Patricia trees/tries
  – Other methods
• Packet Classification

Providing Value-Added Services: some examples
• Differentiated services
  – Regard traffic from Autonomous System #33 as 'platinum grade'
• Access Control Lists
  – Deny udp host 194.72.33 194.72.6.64 0.0.0.15 eq snmp
• Committed Access Rate
  – Rate-limit WWW traffic from sub-interface #739 to 10 Mbps
• Policy-based Routing
  – Route all voice traffic through the ATM network

Packet Classification
Figure: the incoming packet's header goes to the forwarding engine, which consults a classifier (policy database) of predicate -> action rules; packet classification determines the action applied to the packet.

Multi-field Packet Classification
Given a classifier with N rules, find the action associated with the highest-priority rule matching an incoming packet.
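The baseline solution is a linear scan in priority order; a minimal two-field sketch (the rules and fields are illustrative), which the schemes on the following slides aim to beat:

# Linear-search classifier: first (highest-priority) matching rule wins.
import ipaddress

# priority-ordered rules: (source prefix, destination prefix, action)
rules = [
    (ipaddress.ip_network("144.24.0.0/16"), ipaddress.ip_network("64.0.0.0/24"), "deny"),
    (ipaddress.ip_network("128.16.46.23/32"), ipaddress.ip_network("0.0.0.0/0"), "rate-limit"),
    (ipaddress.ip_network("0.0.0.0/0"), ipaddress.ip_network("0.0.0.0/0"), "permit"),
]

def classify(src: str, dst: str) -> str:
    s, d = ipaddress.ip_address(src), ipaddress.ip_address(dst)
    for src_net, dst_net, action in rules:
        if s in src_net and d in dst_net:
            return action          # first match = highest priority
    return "permit"                # default action

print(classify("144.24.1.1", "64.0.0.7"))    # -> deny
print(classify("10.0.0.1", "10.0.0.2"))      # -> permit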

Geometric Interpretation in 2D
Figure: rules R1-R7 drawn as rectangles in (Field #1, Field #2) space; a prefix pair such as (144.24/16, 64/24) is a rectangle and (128.16.46.23, *) is a line; packets P1 and P2 are points in the space.

Proposed Schemes
Comparison table of proposed classification schemes (spread over three slides; not recovered).

Grid of Tries
Figure: a trie over dimension 1 whose nodes point into tries over dimension 2; rules R1-R7 are stored at nodes of the dimension-2 tries.

Grid of Tries
Disadvantages:
• Static solution
• Not easy to extend to higher dimensions
Advantages:
• Good solution for two dimensions
20K entries: 2 MB data structure with 9 memory accesses [at most 2W]

Classification using Bit Parallelism
Figure: for each dimension the matching rules are represented as a bit vector over rules R1-R4; ANDing the per-dimension vectors gives the rules that match on all fields.

Classification using Bit Parallelism
Disadvantages:
• Large memory bandwidth
• Hardware optimized
Advantages:
• Good solution for multiple dimensions, for small classifiers
512 rules: 1 Mpps with a single FPGA and five 128 KB SRAM chips.

Classification Using Multiple Fields: Recursive Flow Classification
Figure: the packet header (2^S = 2^128 possible values) is split into fields F1...Fn; successive stages of memory lookups recombine the fields, reducing 2^128 through 2^64 and 2^24 down to 2^T = 2^12 possible actions.

Packet Classification: References
• T. V. Lakshman, D. Stiliadis, "High speed policy-based packet forwarding using efficient multi-dimensional range matching", Sigcomm 1998, pp. 191-202.
• V. Srinivasan, S. Suri, G. Varghese and M. Waldvogel, "Fast and scalable layer 4 switching", Sigcomm 1998, pp. 203-214.
• V. Srinivasan, G. Varghese, S. Suri, "Fast packet classification using tuple space search", to be presented at Sigcomm 1999.
• P. Gupta, N. McKeown, "Packet classification using hierarchical intelligent cuttings", Hot Interconnects VII, 1999.
• P. Gupta, N. McKeown, "Packet classification on multiple fields", Sigcomm 1999.

Tutorial Outline
• Introduction: What is a Packet Switch?
• Packet Lookup and Classification: Where does a packet go next?
• Switching Fabrics: How does the packet get there?
• Output Scheduling: When should the packet leave?

Switching Fabrics
• Output and Input Queueing
• Output Queueing
• Input Queueing
  – Scheduling algorithms
  – Combining input and output queues
  – Other non-blocking fabrics
  – Multicast traffic

Basic Architectural Components
1. Datapath: per-packet processing (forwarding table and forwarding decision at each input)
2. Interconnect
3. Output Scheduling

Interconnects: two basic techniques
• Input Queueing: usually a non-blocking switch fabric (e.g. crossbar)
• Output Queueing: usually a fast bus

Interconnects: Output Queueing
Figure: two implementations for N ports:
• Individual output queues: memory b/w = (N+1)R per queue
• Centralized shared memory: memory b/w = 2NR

Output Queueing: the "ideal"
Figure: arriving cells are transferred immediately to their output queues; two cells destined for the same output simply queue at that output.

Output Queueing
How fast can we make centralized shared memory?
• 5 ns SRAM, 200-byte-wide shared bus
• 5 ns per memory operation
• Two memory operations per packet
• Therefore, up to 160 Gb/s
• In practice, closer to 80 Gb/s
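A quick check of the 160 Gb/s figure, assuming one write and one read per packet, each moving a full 200-byte bus word:

# Shared-memory throughput check: 2 x 5 ns accesses move 200 bytes each way.
access_time_ns = 5
ops_per_packet = 2                 # write on arrival, read on departure
bus_bytes = 200

time_per_packet_ns = ops_per_packet * access_time_ns         # 10 ns
throughput_gbps = bus_bytes * 8 / time_per_packet_ns          # bits per ns = Gb/s
print(throughput_gbps)             # -> 160.0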

Switching Fabrics
• Output and Input Queueing
• Output Queueing
• Input Queueing
  – Scheduling algorithms
  – Other non-blocking fabrics
  – Combining input and output queues
  – Multicast traffic

Interconnects: Input Queueing with Crossbar
Figure: each input has its own queue (memory b/w = 2R); a scheduler computes the crossbar configuration that carries data from inputs to outputs.

Input Queueing: Head-of-Line Blocking
Figure: average delay versus load; with FIFO input queues the delay diverges at about 58.6% load, well before 100%.

Head-of-Line Blocking
Figures: cells blocked behind a head-of-line cell destined for a busy output (three slides, illustrations not recovered).

Input Queueing: Virtual Output Queues
Figure: each input keeps a separate queue per output (VOQ), so cells for idle outputs are not blocked behind a head-of-line cell.

Input Queues: Virtual Output Queues
Figure: average delay versus load; with virtual output queues (and a suitable scheduler) the load can approach 100%.

Input Queueing
Figure: input-queued crossbar with memory b/w = 2R per port; the scheduler that configures the crossbar can be quite complex!

Input Queueing: Scheduling

Input Queueing: Scheduling
Figure: VOQ occupancies at four inputs (e.g. 7, 2, 4, 2, 5, 2 cells) form a request graph to four outputs; the scheduler computes a bipartite matching (weight = 18 in the example).
Question: maximum weight or maximum size?

Input Queueing: Scheduling
• Maximum size matching
  – Maximizes instantaneous throughput
  – Does it maximize long-term throughput?
• Maximum weight matching
  – Can clear the most backlogged queues
  – But does it sacrifice long-term throughput?

Input Queueing: Scheduling
Figure: a 2x2 example (not recovered).

Input Queueing: Longest Queue First or Oldest Cell First
Weight = queue length (LQF) or waiting time (OCF).
Figure: a 4x4 example (weights 1, 1, 1, 10) where the maximum-weight matching achieves 100% throughput.

Input Queueing
Why is serving long/old queues better than serving the maximum number of queues?
• When traffic is uniformly distributed, servicing the maximum number of queues leads to 100% throughput.
• When traffic is non-uniform, some queues become longer than others.
• A good algorithm keeps the queue lengths matched, and services a large number of queues.
Figure: average VOQ occupancy per VOQ under uniform and non-uniform traffic.

Input Queueing: Practical Algorithms
• Maximal size algorithms
  – Wave Front Arbiter (WFA)
  – Parallel Iterative Matching (PIM)
  – iSLIP
• Maximal weight algorithms
  – Fair Access Round Robin (FARR)
  – Longest Port First (LPF)

Wave Front Arbiter
Figure: a 4x4 request matrix is resolved into a match by sweeping a diagonal wavefront across the array (requests on the left, match on the right).

Wave Front Arbiter
Figure: another requests/match example (not recovered).

Wave Front Arbiter: Implementation
Figure: a 4x4 array of combinational logic blocks, one per (input, output) pair (1,1) through (4,4), through which grants ripple diagonally.

Wave Front Arbiter: Wrapped WFA (WWFA)
N steps instead of 2N-1.
Figure: the wavefront wraps around the array, so all N diagonals are processed in N steps.

Input Queueing: Practical Algorithms
• Maximal size algorithms
  – Wave Front Arbiter (WFA)
  – Parallel Iterative Matching (PIM)
  – iSLIP
• Maximal weight algorithms
  – Fair Access Round Robin (FARR)
  – Longest Port First (LPF)

Parallel Iterative Matching
Random selection at each stage.
Figure: iteration #1 shows requests, grants and accepts/matches for a 4x4 example; iteration #2 matches among the ports left unmatched.
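A hedged sketch of a PIM-style scheduling pass for a 4x4 switch: random grants at the outputs, random accepts at the inputs, repeated for a few iterations:

# PIM-style matching: random grant, random accept, iterate.
import random

def pim(requests, n, iterations=3):
    match = {}                                     # input -> output
    for _ in range(iterations):
        free_in = set(range(n)) - set(match)
        free_out = set(range(n)) - set(match.values())
        # Grant phase: every free output picks one requesting free input at random.
        grants = {}                                # input -> list of granting outputs
        for out in free_out:
            reqs = [i for i in free_in if out in requests[i]]
            if reqs:
                grants.setdefault(random.choice(reqs), []).append(out)
        # Accept phase: every input that received grants accepts one at random.
        for inp, outs in grants.items():
            match[inp] = random.choice(outs)
    return match

requests = {0: {0, 1}, 1: {0}, 2: {2, 3}, 3: {3}}
print(pim(requests, n=4))    # e.g. {0: 1, 1: 0, 2: 2, 3: 3}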

Parallel Iterative Matching: maximal is not maximum
Figure: a request pattern for which the maximal match found is smaller than the maximum possible match.

Parallel Iterative Matching: analytical results
Number of iterations to converge: on average about log2 N for an N-port switch (the exact expression on the slide was not recovered).

Parallel Iterative Matching
Figures: worked examples of PIM iterations (three slides, not recovered).

Input Queueing: Practical Algorithms
• Maximal size algorithms
  – Wave Front Arbiter (WFA)
  – Parallel Iterative Matching (PIM)
  – iSLIP
• Maximal weight algorithms
  – Fair Access Round Robin (FARR)
  – Longest Port First (LPF)

iSLIP
Round-robin selection at each stage.
Figure: iteration #1 shows requests, grants and accepts/matches for a 4x4 example; iteration #2 matches among the ports left unmatched.

iSLIP Properties
• Random under low load
• TDM under high load
• Lowest priority to MRU
• 1 iteration: fair to outputs
• Converges in at most N iterations; on average <= log2 N
• Implementation: N priority encoders
• Up to 100% throughput for uniform traffic
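A single-iteration iSLIP-style sketch with round-robin grant and accept pointers that advance only on acceptance, the detail behind the desynchronization and the 100% uniform-traffic throughput noted above:

# One iSLIP-style iteration with rotating grant/accept pointers.
N = 4
grant_ptr = [0] * N     # one per output
accept_ptr = [0] * N    # one per input

def islip_iteration(requests):
    """requests[i] = set of outputs input i has cells for; returns input -> output."""
    # Grant phase: each output picks the first requesting input at/after its pointer.
    grants = {}                                    # input -> list of granting outputs
    for out in range(N):
        for k in range(N):
            inp = (grant_ptr[out] + k) % N
            if out in requests.get(inp, ()):
                grants.setdefault(inp, []).append(out)
                break
    # Accept phase: each input picks the first granting output at/after its pointer.
    match = {}
    for inp, outs in grants.items():
        for k in range(N):
            out = (accept_ptr[inp] + k) % N
            if out in outs:
                match[inp] = out
                # Pointers advance one beyond the matched port, only on acceptance.
                accept_ptr[inp] = (out + 1) % N
                grant_ptr[out] = (inp + 1) % N
                break
    return match

print(islip_iteration({0: {0, 1}, 1: {0}, 2: {2}, 3: {2, 3}}))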

iSLIP
Figures: worked examples of iSLIP iterations (two slides, not recovered).

iSLIP Implementation
Figure: N grant arbiters and N accept arbiters built from programmable priority encoders (log2 N bits each), plus per-arbiter state and a decision register.

Input Queueing: References
• M. Karol et al., "Input vs Output Queueing on a Space Division Packet Switch", IEEE Trans. Comm., Dec 1987, pp. 1347-1356.
• Y. Tamir, "Symmetric Crossbar arbiters for VLSI communication switches", IEEE Trans. Parallel and Dist. Sys., Jan 1993, pp. 13-27.
• T. Anderson et al., "High Speed Switch Scheduling for Local Area Networks", ACM Trans. Comp. Sys., Nov 1993, pp. 319-352.
• N. McKeown, "The iSLIP scheduling algorithm for Input-Queued Switches", IEEE Trans. Networking, April 1999, pp. 188-201.
• C. Lund et al., "Fair prioritized scheduling in an input-buffered switch", Proc. of IFIP-IEEE Conf., April 1996, pp. 358-69.
• A. Mekkittikul et al., "A Practical Scheduling Algorithm to Achieve 100% Throughput in Input-Queued Switches", IEEE Infocom 98, April 1998.

Switching Fabrics
• Output and Input Queueing
• Output Queueing
• Input Queueing
  – Scheduling algorithms
  – Other non-blocking fabrics
  – Combining input and output queues
  – Multicast traffic

Other Non-Blocking Fabrics: Clos Network
Figure: a three-stage Clos network (not recovered).

Other Non-Blocking Fabrics: Clos Network
Expansion factor required = 2 - 1/N (but still blocking for multicast).

Other Non-Blocking Fabrics: Self-Routing Networks
Figure: an 8-output banyan network with outputs labelled 000-111; each stage routes on one bit of the destination address.

Other Non-Blocking Fabrics: Self-Routing Networks
The non-blocking Batcher-Banyan network: a Batcher sorter in front of a self-routing (banyan) network.
Figure: an 8-port example showing cells sorted by destination address before entering the self-routing stage.
• Fabric can be used as scheduler.
• Batcher-Banyan network is blocking for multicast.

Switching Fabrics
• Output and Input Queueing
• Output Queueing
• Input Queueing
  – Scheduling algorithms
  – Other non-blocking fabrics
  – Combining input and output queues
  – Multicast traffic

Speedup
• Context
  – input queued switches
  – output queued switches
  – the speedup problem
• Early approaches
• Algorithms
• Implementation considerations

Speedup: Context
Figure: a generic switch; the placement of memory gives
• output queued switches
• input queued switches
• combined input and output queued (CIOQ) switches

Output queued switches
• Best delay and throughput performance
• Possible to erect "bandwidth firewalls" between sessions
Main problem:
• Requires high fabric speedup (S = N)
• Unsuitable for high-speed switching

Input queued switches
Big advantage:
• Speedup of one is sufficient
Main problem:
• Can't guarantee delay due to input contention
Overcoming input contention: use higher speedup.

A Comparison
Memory speeds for a 32x32 switch:

             Output queued              Input queued
Line Rate    Memory BW   Access Time    Memory BW   Access Time (per cell)
100 Mb/s     3.3 Gb/s    128 ns         200 Mb/s    2.12 us
1 Gb/s       33 Gb/s     12.8 ns        2 Gb/s      212 ns
2.5 Gb/s     82.5 Gb/s   5.12 ns        5 Gb/s      84.8 ns
10 Gb/s      330 Gb/s    1.28 ns        20 Gb/s     21.2 ns
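Where the table's numbers come from, assuming 53-byte cells (an assumption; the slide does not state the cell size): an output-queued port memory must absorb up to N writes plus one read per cell time ((N+1)R of bandwidth), while an input-queued port memory needs only one write and one read (2R):

# Reproducing the comparison table, assuming 53-byte (ATM) cells.
N = 32
cell_bits = 53 * 8

for line_rate in (100e6, 1e9, 2.5e9, 10e9):           # b/s
    cell_time = cell_bits / line_rate                  # seconds per cell
    oq_bw, iq_bw = (N + 1) * line_rate, 2 * line_rate
    oq_access, iq_access = cell_time / (N + 1), cell_time / 2
    print(f"{line_rate/1e9:>5.1f} Gb/s: OQ {oq_bw/1e9:.1f} Gb/s, {oq_access*1e9:.1f} ns; "
          f"IQ {iq_bw/1e9:.1f} Gb/s, {iq_access*1e9:.0f} ns")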

The Speedup Problem
Find a compromise: 1 < speedup << N
• to get the performance of an OQ switch
• close to the cost of an IQ switch
Essential for high-speed QoS switching.

Some Early Approaches
Probabilistic analyses:
• assume traffic models (Bernoulli, Markov-modulated, non-uniform loading, "friendly correlated")
• obtain mean throughput and delays, bounds on tails
• analyze different fabrics (crossbar, multistage, etc.)
Numerical methods:
• use actual and simulated traffic traces
• run different algorithms
• set the "speedup dial" at various values

The findings
Very tantalizing... under different settings (traffic, loading, algorithm, etc.) and even for varying switch sizes, a speedup of between 2 and 5 was sufficient!

Using Speedup
Figure: example of cell transfers with speedup (not recovered).

Intuition
Bernoulli IID inputs, speedup = 1:
• Fabric throughput = 0.58
Bernoulli IID inputs, speedup = 2:
• Fabric throughput = 1.16
• Input efficiency = 1/1.16
• Average input queue = 6.25

Intuition (continued)
Bernoulli IID inputs, speedup = 3:
• Fabric throughput = 1.74
• Input efficiency = 1/1.74
• Average input queue = 1.35
Bernoulli IID inputs, speedup = 4:
• Fabric throughput = 2.32
• Input efficiency = 1/2.32
• Average input queue = 0.75

Issues
• Need hard guarantees: exact, not average
• Robustness: realistic, even adversarial, traffic (not friendly Bernoulli IID)

The Ideal Solution
Figure: an output queued switch (speedup = N) compared with an unknown switch with speedup << N.
Question: can we find a simple and good algorithm that exactly mimics output queueing, regardless of switch size and traffic pattern?

What is exact mimicking?
• Apply the same inputs to an OQ and a CIOQ switch, packet by packet
• Obtain the same outputs, packet by packet

Algorithm MUCF
Key concept: urgency value
urgency = departure time - present time

MUCF: the algorithm
• Outputs try to get their most urgent packets.
• Inputs grant to the output whose packet is most urgent; ties are broken by port number.
• Losing outputs try for their next most urgent packet.
• The algorithm terminates when no more matchings are possible.
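A simplified sketch of the MUCF matching step, with urgency values supplied directly rather than derived from a shadow output-queued switch as in the real algorithm; the tie-breaking rule follows the slide:

# MUCF-style matching: outputs propose for their most urgent packet,
# inputs keep the most urgent proposal, losing outputs try again.

# urgency[(input, output)] = departure time - present time (lower = more urgent)
urgency = {(0, 0): 1, (1, 0): 3, (1, 1): 2, (2, 1): 5, (2, 2): 1}

def mucf_match(urgency, n):
    # Each output's candidate inputs, most urgent first.
    prefs = {o: sorted((u, i) for (i, oo), u in urgency.items() if oo == o)
             for o in range(n)}
    match_in, match_out = {}, {}
    free_outputs = [o for o in range(n) if prefs[o]]
    while free_outputs:
        out = free_outputs.pop(0)
        while prefs[out]:
            u, inp = prefs[out].pop(0)            # output's most urgent remaining packet
            current = match_in.get(inp)
            if current is None:
                match_in[inp], match_out[out] = out, inp
                break
            # Input already granted to another output: keep the more urgent
            # request (ties broken by output port number).
            cur_u = urgency[(inp, current)]
            if (u, out) < (cur_u, current):
                del match_out[current]
                free_outputs.append(current)       # displaced output tries again
                match_in[inp], match_out[out] = out, inp
                break
    return match_in

print(mucf_match(urgency, n=3))   # -> {0: 0, 1: 1, 2: 2}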

Stable Marriage Problem
Men = Outputs: Bill, John, Pedro
Women = Inputs: Hillary, Monica, Maria

An example
Observation: there are only two reasons a packet doesn't get to its output: input contention and output contention.
This is why a speedup of 2 works!!

What does this get us?
• A speedup of 4 is sufficient for exact emulation of FIFO OQ switches with MUCF.
• What about non-FIFO OQ switches? E.g. WFQ, strict priority.

Other results
To exactly emulate an NxN OQ switch:
• A speedup of 2 - 1/N is necessary and sufficient (hence a speedup of 2 is sufficient for all N).
• Input traffic patterns can be absolutely arbitrary.
• The emulated OQ switch may use any "monotone" scheduling policy, e.g. FIFO, LIFO, strict priority, WFQ, etc.

What gives?
Complexity of the algorithms:
• extra hardware for processing
• extra run time (time complexity)
What is the benefit?
• reduced memory bandwidth requirements
Tradeoff: memory for processing. Moore's Law supports this tradeoff.

Implementation: a closer look
Main sources of difficulty:
• Estimating urgency, etc.: the information is distributed (and must be communicated among inputs and outputs)
• The matching process: too many iterations?
Estimating urgency depends on what is being emulated (like taking a ticket to hold a place in a queue):
• FIFO, strict priorities: no problem
• WFQ, etc.: problems

Implementation (contd.)
The matching process is a variant of the stable marriage problem:
• Worst-case number of iterations for SMP = N^2
• Worst-case number of iterations in switching = N
• With high probability, and on average, approximately log(N)

Other Work
Relax the stringent requirement of exact emulation:
• Least Occupied Output First Algorithm (LOOFA)
  – keeps outputs always busy if there are packets
  – by time-stamping packets, it also exactly mimics
Disallow arbitrary inputs:
• e.g. leaky-bucket constrained
• obtain worst-case delay bounds

References for speedup
• Y. Oie et al., "Effect of speedup in nonblocking packet switch", ICC 89.
• A. L. Gupta, N. D. Georganas, "Analysis of a packet switch with input and output buffers and speed constraints", Infocom 91.
• S.-T. Chuang et al., "Matching output queueing with a combined input and output queued switch", IEEE JSAC, vol. 17, no. 6, 1999.
• B. Prabhakar, N. McKeown, "On the speedup required for combined input and output queued switching", Automatica, vol. 35, 1999.
• P. Krishna et al., "On the speedup required for work-conserving crossbar switches", IEEE JSAC, vol. 17, no. 6, 1999.
• A. Charny, "Providing QoS guarantees in input-buffered crossbar switches with speedup", PhD Thesis, MIT, 1998.

Switching Fabrics
• Output and Input Queueing
• Output Queueing
• Input Queueing
  – Scheduling algorithms
  – Other non-blocking fabrics
  – Combining input and output queues
  – Multicast traffic

Multicast Switching
• The problem
• Switching with crossbar fabrics
• Switching with other fabrics

Multicasting
Figure: a cell arriving at one input must be delivered to several outputs (ports 1-6 shown).

Crossbar fabrics: Method 1
Copy network + unicast switching.
Copy networks: increased hardware, increased input contention.

Method 2
Use the copying properties of the crossbar fabric itself.
• No fanout splitting: easy, but low throughput.
• Fanout splitting: higher throughput, but not as simple; leaves "residue".

The effect of fanout splitting
Figure: performance of an 8x8 switch with and without fanout splitting under uniform IID traffic (plot not recovered).

Placement of residue
Key question: how should outputs grant requests (and hence decide the placement of residue)?

Residue and throughput
Result: concentrating residue brings more new work forward, and hence leads to higher throughput. But there are fairness problems to deal with.
This and other problems can be looked at in a unified way by mapping the multicasting problem onto a variation of Tetris.

Multicasting and Tetris
Figure: residue from input ports 1-5 shown as blocks stacked over output ports 1-5, Tetris-style.

Multicasting and Tetris
Figure: the same example with the residue concentrated on as few inputs as possible.

Replication by recycling
Main idea: make two copies at a time using a binary tree, with the input at the root and all possible destination outputs at the leaves.
Figure: a binary replication tree with internal nodes x, y and leaves a-e.

Replication by recycling (cont'd)
Figure: receive, resequencing and transmit stages around the switching network, with an output table and a recycle path for additional copies.
• Scalable to large fanouts.
• Needs resequencing at outputs and introduces variable delays.

References for Multicasting
• J. Hayes et al., "Performance analysis of a multicast switch", IEEE Trans. on Communications, vol. 39, April 1991.
• B. Prabhakar et al., "Tetris models for multicast switches", Proc. of the 30th Annual Conference on Information Sciences and Systems, 1996.
• B. Prabhakar et al., "Multicast scheduling for input-queued switches", IEEE JSAC, 1997.
• J. Turner, "An optimal nonblocking multicast virtual circuit switch", INFOCOM, 1994.

Tutorial Outline
• Introduction: What is a Packet Switch?
• Packet Lookup and Classification: Where does a packet go next?
• Switching Fabrics: How does the packet get there?
• Output Scheduling: When should the packet leave?

Output Scheduling
• What is output scheduling?
• How is it done?
• Practical considerations

Output Scheduling
• Allocating output bandwidth
• Controlling packet delay
Figure: a scheduler serving multiple queues at an output port.

Output Scheduling
Figure: FIFO versus Fair Queueing at an output port (illustration not recovered).

Motivation
• FIFO is natural but gives poor QoS
  – bursty flows increase delays for others
  – hence cannot guarantee delays
• Need round-robin scheduling of packets
  – Fair Queueing
  – Weighted Fair Queueing, Generalized Processor Sharing

Fair queueing: main issues
• Level of granularity
  – packet by packet? (favors long packets)
  – bit by bit? (ideal, but very complicated)
• Packet Generalized Processor Sharing (PGPS)
  – serves packet by packet
  – and imitates the bit-by-bit schedule within a tolerance

How does WFQ work?
Figure: three flows with weights WR = 1, WG = 5, WP = 2 share an output link; each receives service in proportion to its weight.
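A hedged sketch of packet-by-packet WFQ for the weights on the slide, assuming all three flows stay backlogged (which sidesteps the virtual-time bookkeeping needed when flows go idle):

# Packet-by-packet WFQ for continuously backlogged flows: each packet gets
# a finish number F = F_prev(flow) + length/weight; send in increasing F order.
import heapq

weights = {"R": 1, "G": 5, "P": 2}
finish = {f: 0.0 for f in weights}          # last finish number per flow
heap = []                                   # (finish number, flow, packet id)

def enqueue(flow, length, pkt_id):
    finish[flow] += length / weights[flow]
    heapq.heappush(heap, (finish[flow], flow, pkt_id))

# Three equal-length packets from each flow, all backlogged at time 0.
for i in range(3):
    for flow in weights:
        enqueue(flow, length=1000, pkt_id=i)

order = []
while heap:
    _, flow, pkt = heapq.heappop(heap)
    order.append(f"{flow}{pkt}")
print(order)   # G (weight 5) is served most often, then P (2), then R (1)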

Delay guarantees
Theorem: if flows are leaky-bucket constrained and all nodes employ GPS (WFQ), then the network can guarantee worst-case delay bounds to sessions.

Practical considerations
For every packet, the scheduler needs to:
• classify it into the right flow queue and maintain a linked list for each flow
• schedule it for departure
The complexity of both is O(log [# of flows]):
• the first is hard to overcome
• the second can be overcome by DRR

Deficit Round Robin
Figure: per-flow queues holding packets of various sizes (700, 50, 250, 400, 200, 600, 500, 250, 750, 500, 100, 400 bytes) served with a quantum size of 500; each flow keeps a deficit counter.
Good approximation of FQ; much simpler to implement.
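A minimal DRR sketch matching the description above; the queue contents and the 500-byte quantum follow the flavor of the slide's example:

# Deficit Round Robin: each active queue earns a quantum of credit per round
# and sends packets as long as its deficit counter covers them.
from collections import deque

QUANTUM = 500
queues = {                       # flow -> packet sizes in bytes
    "A": deque([700, 50, 250]),
    "B": deque([400, 200, 600]),
    "C": deque([500, 250, 750]),
}
deficit = {f: 0 for f in queues}

def drr_round(order=("A", "B", "C")):
    sent = []
    for f in order:
        if not queues[f]:
            deficit[f] = 0               # idle flows do not keep credit
            continue
        deficit[f] += QUANTUM
        while queues[f] and queues[f][0] <= deficit[f]:
            pkt = queues[f].popleft()
            deficit[f] -= pkt
            sent.append((f, pkt))
    return sent

print(drr_round())   # round 1: [('B', 400), ('C', 500)]
print(drr_round())   # round 2: A can now afford its 700-byte packet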

But...
WFQ is still very hard to implement:
• classification is a problem
• it needs to maintain too much state information
• it doesn't scale well

Strict Priorities and Diff Serv
Classify flows into priority classes:
• maintain only per-class queues
• perform FIFO within each class
• avoid the "curse of dimensionality"

Diff Serv
A framework for providing differentiated QoS:
• set Type of Service (ToS) bits in packet headers
• this classifies packets into classes
• routers maintain per-class queues
• condition traffic at network edges to conform to class requirements
May still need queue management inside the network.

References for O/p Scheduling
• A. Demers et al., "Analysis and simulation of a fair queueing algorithm", ACM SIGCOMM 1989.
• A. Parekh, R. Gallager, "A generalized processor sharing approach to flow control in integrated services networks: the single node case", IEEE Trans. on Networking, June 1993.
• A. Parekh, R. Gallager, "A generalized processor sharing approach to flow control in integrated services networks: the multiple node case", IEEE Trans. on Networking, August 1993.
• M. Shreedhar, G. Varghese, "Efficient Fair Queueing using Deficit Round Robin", ACM SIGCOMM, 1995.
• K. Nichols, S. Blake (eds), "Differentiated Services: Operational Model and Definitions", Internet Draft, 1998.

Active Queue Management
• Problems with traditional queue management
  – tail drop
• Active queue management
  – goals
  – an example
  – effectiveness

Tail Drop Queue Management
Figure: lock-out behavior once the queue reaches its maximum length (illustration not recovered).

Tail Drop Queue Management
Drop packets only when the queue is full:
• long steady-state delay
• global synchronization
• bias against bursty traffic

Global Synchronization
Figure: many flows back off at the same time when the queue hits its maximum length (illustration not recovered).

Bias Against Bursty Traffic
Figure: a burst arriving at a full queue loses many packets in a row (illustration not recovered).

Alternative Queue Management Schemes
• Drop from front on full queue
• Drop at random on full queue
Both solve the lock-out problem; both still have the full-queues problem.

Active Queue Management Goals
• Solve the lock-out and full-queue problems
  – no lock-out behavior
  – no global synchronization
  – no bias against bursty flows
• Provide better QoS at a router
  – low steady-state delay
  – lower packet dropping

Active Queue Management
• Problems with traditional queue management
  – tail drop
• Active queue management
  – goals
  – an example
  – effectiveness

Random Early Detection (RED)
Figure: drop probability rises (P1, P2, ..., Pk) as the average queue size qavg moves between minth and maxth.
• if qavg < minth: admit every packet
• else if qavg <= maxth: drop an incoming packet with p = (qavg - minth)/(maxth - minth)
• else if qavg > maxth: drop every incoming packet
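The drop rule above transcribed directly into code; the probability ceiling maxp is taken as 1 because the slide's formula omits it, and the weighted averaging that produces qavg is not shown. The thresholds are illustrative values:

# RED drop decision following the slide's rule.
import random

MIN_TH, MAX_TH = 5, 15          # thresholds in packets (illustrative values)

def red_drop(qavg: float) -> bool:
    """Return True if the arriving packet should be dropped."""
    if qavg < MIN_TH:
        return False                                  # admit every packet
    if qavg <= MAX_TH:
        p = (qavg - MIN_TH) / (MAX_TH - MIN_TH)       # early-drop probability
        return random.random() < p
    return True                                       # qavg above maxth: drop all

for q in (3, 8, 12, 20):
    print(q, red_drop(q))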

Effectiveness of RED: Lock-Out
• Packets are randomly dropped
• Each flow has the same probability of being discarded

Effectiveness of RED: Full Queue
• Drop packets probabilistically in anticipation of congestion (not when the queue is full)
• Use qavg to decide the packet dropping probability: allow instantaneous bursts
• Randomness avoids global synchronization

What QoS does RED Provide?
• Lower buffer delay: good interactive service
  – qavg is controlled to be small
• Given responsive flows: packet dropping is reduced
  – early congestion indication allows traffic to throttle back before congestion
• Given responsive flows: fair bandwidth allocation

Unresponsive or aggressive flows
• Don't properly back off during congestion
• Take away bandwidth from TCP-compatible flows
• Monopolize buffer space

Control Unresponsive Flows
Some active queue management schemes:
• RED with penalty box
• Flow RED (FRED)
• Stabilized RED (SRED)
These identify and penalize unresponsive flows with a bit of extra work.

Active Queue Management References
• B. Braden et al., "Recommendations on queue management and congestion avoidance in the internet", RFC 2309, 1998.
• S. Floyd, V. Jacobson, "Random early detection gateways for congestion avoidance", IEEE/ACM Trans. on Networking, 1(4), Aug. 1993.
• D. Lin, R. Morris, "Dynamics of Random Early Detection", ACM SIGCOMM, 1997.
• T. Ott et al., "SRED: Stabilized RED", INFOCOM 1999.
• S. Floyd, K. Fall, "Router mechanisms to support end-to-end congestion control", LBL technical report, 1997.

Tutorial Outline
• Introduction: What is a Packet Switch?
• Packet Lookup and Classification: Where does a packet go next?
• Switching Fabrics: How does the packet get there?
• Output Scheduling: When should the packet leave?

Basic Architectural Components
Figure: the building blocks (Admission Control, Policing, Congestion Control, Routing, Switching, Reservation, Output Scheduling) split between the control plane and the per-packet datapath.

Basic Architectural Components
1. Datapath: per-packet processing (forwarding table and forwarding decision at each input)
2. Interconnect
3. Output Scheduling