High Performance Switches and Routers Theory and Practice

High Performance Switches and Routers: Theory and Practice
Sigcomm 99, August 30, 1999, Harvard University
Nick McKeown (nickm@stanford.edu) and Balaji Prabhakar (balaji@isl.stanford.edu)
Departments of Electrical Engineering and Computer Science
Copyright 1999. All Rights Reserved.

Tutorial Outline
• Introduction: What is a Packet Switch?
• Packet Lookup and Classification: Where does a packet go next?
• Switching Fabrics: How does the packet get there?
• Output Scheduling: When should the packet leave?

Introduction: What is a Packet Switch?
• Basic Architectural Components
• Some Example Packet Switches
• The Evolution of IP Routers

Basic Architectural Components
Figure: the building blocks (Admission Control, Policing, Congestion Control, Routing, Switching, Reservation, Output Scheduling) split between the control plane and the per-packet datapath.

Basic Architectural Components
1. Datapath: per-packet processing (forwarding table and forwarding decision at each input)
2. Interconnect
3. Output Scheduling

Where high performance packet switches are used
Figure: carrier-class core routers in the Internet core; ATM switches, Frame Relay switches and edge routers at the edge; enterprise WAN access and enterprise campus switches.

Introduction: What is a Packet Switch?
• Basic Architectural Components
• Some Example Packet Switches
• The Evolution of IP Routers

ATM Switch
• Lookup cell VCI/VPI in VC table.
• Replace old VCI/VPI with new.
• Forward cell to outgoing interface.
• Transmit cell onto link.
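The per-cell processing above amounts to a single table indexed by the incoming VC identifier. A minimal sketch, with an illustrative table layout and field names (not from the tutorial):

# Minimal sketch of the per-cell ATM switching step described above.
# The VC table layout and field names are illustrative assumptions.

# VC table: (in_port, in_vci) -> (out_port, out_vci)
vc_table = {
    (1, 42): (3, 17),
    (2, 99): (1, 64),
}

def switch_cell(in_port, cell):
    """Look up the cell's VCI, rewrite it, and return the outgoing port."""
    key = (in_port, cell["vci"])
    if key not in vc_table:
        return None                # no VC set up: drop the cell
    out_port, out_vci = vc_table[key]
    cell["vci"] = out_vci          # label swap
    return out_port                # forward to this interface

cell = {"vci": 42, "payload": b"..."}
print(switch_cell(1, cell), cell["vci"])   # -> 3 17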

Ethernet Switch
• Lookup frame DA in forwarding table.
  – If known, forward to correct port.
  – If unknown, broadcast to all ports.
• Learn SA of incoming frame.
• Forward frame to outgoing interface.
• Transmit frame onto link.

IP Router
• Lookup packet DA in forwarding table.
  – If known, forward to correct port.
  – If unknown, drop packet.
• Decrement TTL, update header checksum.
• Forward packet to outgoing interface.
• Transmit packet onto link.
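A minimal sketch of the per-packet steps just listed, with the lookup stubbed out and the header checksum updated incrementally in the style of RFC 1624; the dict-based packet and field names are illustrative:

# Sketch of the per-packet IP steps above: lookup (stubbed), TTL decrement,
# and an incremental header-checksum update (RFC 1624 style).

def incr_checksum(old_cksum, old_word, new_word):
    """RFC 1624: HC' = ~(~HC + ~m + m'), 16-bit ones-complement arithmetic."""
    s = (~old_cksum & 0xFFFF) + (~old_word & 0xFFFF) + (new_word & 0xFFFF)
    s = (s & 0xFFFF) + (s >> 16)   # fold carries back in
    s = (s & 0xFFFF) + (s >> 16)
    return ~s & 0xFFFF

def forward(pkt, lookup):
    out_port = lookup(pkt["dst"])          # stand-in for the real LPM lookup
    if out_port is None or pkt["ttl"] <= 1:
        return None                        # unknown destination or TTL expiry: drop
    old_word = (pkt["ttl"] << 8) | pkt["proto"]   # TTL shares a 16-bit word with protocol
    pkt["ttl"] -= 1
    new_word = (pkt["ttl"] << 8) | pkt["proto"]
    pkt["cksum"] = incr_checksum(pkt["cksum"], old_word, new_word)
    return out_port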

Introduction: What is a Packet Switch?
• Basic Architectural Components
• Some Example Packet Switches
• The Evolution of IP Routers

First Generation IP Routers
Figure: shared backplane; a central CPU with buffer memory serves the line interfaces (MAC, DMA) over the bus.

Second Generation IP Routers
Figure: shared bus with a central CPU and buffer memory; each line card now has its own DMA, local buffer memory and MAC.

Third Generation Switches/Routers
Figure: a switched backplane interconnects line cards (each with local buffer memory and MAC) and a CPU card.

Fourth Generation Switches/Routers
Clustering and Multistage
Figure: 32 line cards interconnected through a multistage cluster of switches.

Packet Switches: References
• J. Giacopelli, M. Littlewood, W. D. Sincoskie, "Sunshine: A high performance self routing broadband packet switch architecture", ISS '90.
• J. S. Turner, "Design of a Broadcast packet switching network", IEEE Trans. Comm., June 1988, pp. 734-743.
• C. Partridge et al., "A Fifty Gigabit per second IP Router", IEEE Trans. Networking, 1998.
• N. McKeown, M. Izzard, A. Mekkittikul, W. Ellersick, M. Horowitz, "The Tiny Tera: A Packet Switch Core", IEEE Micro Magazine, Jan-Feb 1997.

Tutorial Outline
• Introduction: What is a Packet Switch?
• Packet Lookup and Classification: Where does a packet go next?
• Switching Fabrics: How does the packet get there?
• Output Scheduling: When should the packet leave?

Basic Architectural Components
1. Datapath: per-packet processing (forwarding table and forwarding decision at each input)
2. Interconnect
3. Output Scheduling

Forwarding Decisions
• ATM and MPLS switches
  – Direct Lookup
• Bridges and Ethernet switches
  – Associative Lookup
  – Hashing
  – Trees and tries
• IP Routers
  – Caching
  – CIDR
  – Patricia trees/tries
  – Other methods
• Packet Classification

ATM and MPLS Switches: Direct Lookup
Figure: the incoming VCI is used directly as the memory address; the data read out gives the outgoing (Port, VCI).

Forwarding Decisions
• ATM and MPLS switches
  – Direct Lookup
• Bridges and Ethernet switches
  – Associative Lookup
  – Hashing
  – Trees and tries
• IP Routers
  – Caching
  – CIDR
  – Patricia trees/tries
  – Other methods
• Packet Classification

Bridges and Ethernet Switches: Associative Lookups
Figure: an associative memory (CAM) is searched with the 48-bit address; the matching location (log2 N bits) selects the associated data, plus a Hit signal.
Advantages:
• Simple
Disadvantages:
• Slow
• High power
• Small
• Expensive

Bridges and Ethernet Switches: Hashing
Figure: the 48-bit search data is hashed to a 16-bit memory address; the entry holds the associated data, and a comparison yields the Hit signal.

Lookups Using Hashing: an example
Figure: the 48-bit search data is hashed with CRC-16 to a 16-bit memory address; colliding entries (#1, #2, #3, ...) are kept in linked lists with their associated data.
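A sketch of the hashed MAC-address lookup described above, with chaining for collisions. The hash (CRC-32 truncated to 16 bits) and the table size are stand-ins for the slide's CRC-16:

# Hashed bridge table: a 16-bit hash selects a bucket, collisions chain.
import binascii

NUM_BUCKETS = 1 << 16

def bucket_of(mac: bytes) -> int:
    # CRC-32 truncated to 16 bits stands in for the CRC-16 on the slide.
    return binascii.crc32(mac) & (NUM_BUCKETS - 1)

table = [[] for _ in range(NUM_BUCKETS)]   # each bucket: list of (mac, port)

def learn(mac: bytes, port: int) -> None:
    bucket = table[bucket_of(mac)]
    for i, (m, _) in enumerate(bucket):
        if m == mac:
            bucket[i] = (mac, port)        # refresh existing entry
            return
    bucket.append((mac, port))

def lookup(mac: bytes):
    for m, port in table[bucket_of(mac)]:
        if m == mac:
            return port                    # hit
    return None                            # miss: a real bridge would flood

learn(bytes.fromhex("00163e112233"), 5)
print(lookup(bytes.fromhex("00163e112233")))   # -> 5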

Lookups Using Hashing
Performance of the simple example (plot not recovered).

Lookups Using Hashing
Advantages:
• Simple
• Expected lookup time can be small
Disadvantages:
• Non-deterministic lookup time
• Inefficient use of memory

Trees and Tries
Figure: a binary search tree over N entries (depth log2 N, branching on <, >) compared with a binary search trie branching on address bits 0/1 (example keys 010 and 111).

Trees and Tries: Multiway Tries
Figure: a 16-ary search trie; each node holds (value, pointer) slots for 4-bit strides, with example keys 000011110000 and 111111...

Trees and Tries: Multiway Tries
Table produced from 2^15 randomly generated 48-bit addresses (table not recovered).

Forwarding Decisions
• ATM and MPLS switches
  – Direct Lookup
• Bridges and Ethernet switches
  – Associative Lookup
  – Hashing
  – Trees and tries
• IP Routers
  – Caching
  – CIDR
  – Patricia trees/tries
  – Other methods
• Packet Classification

Caching Addresses
Figure: a slow path through the central CPU and buffer memory, and a fast path handled entirely on the line card (DMA, local buffer memory, MAC) for cached addresses.

Caching Addresses
• LAN: average flow < 40 packets
• WAN: huge number of flows
Figure: cache hit rate with a cache sized at 10% of the full table (plot not recovered).

IP Routers: Class-based Addresses
Figure: the IP address space divided into Class A, B, C and D blocks; a destination such as 212.17.9.4 is resolved by exact match in the routing table, e.g. 212.17.9.0 -> Port 4.

IP Routers: CIDR
Figure: the address line from 0 to 2^32-1. Class-based: fixed A, B, C, D regions. Classless: arbitrary prefixes such as 65/8, 128.9/16 (containing 128.9.0.0 and 128.9.16.14) and 142.12/19.

IP Routers: CIDR
Figure: nested prefixes 128.9/16, 128.9.16/20, 128.9.176/20, 128.9.19/24 and 128.9.25/24 on the 0 to 2^32-1 address line; the address 128.9.16.14 falls within several of them.
Most specific route = "longest matching prefix".

IP Routers: Metrics for Lookups
Example lookup: 128.9.16.14

Prefix          Port
65/8            3
128.9/16        5
128.9.16/20     2
128.9.19/24     7
128.9.25/24     10
128.9.176/20    1
142.12/19       3

Metrics:
• Lookup time
• Storage space
• Update time
• Preprocessing time
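For the example table above, a brute-force longest-prefix match makes the "most specific route" rule concrete; real routers use the faster structures discussed in the following slides:

# Brute-force longest-prefix match over the example table above.
import ipaddress

table = {
    "65.0.0.0/8": 3, "128.9.0.0/16": 5, "128.9.16.0/20": 2,
    "128.9.19.0/24": 7, "128.9.25.0/24": 10, "128.9.176.0/20": 1,
    "142.12.0.0/19": 3,
}
prefixes = [(ipaddress.ip_network(p), port) for p, port in table.items()]

def lookup(addr: str):
    a = ipaddress.ip_address(addr)
    best = None
    for net, port in prefixes:
        if a in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, port)             # keep the most specific match so far
    return best[1] if best else None

print(lookup("128.9.16.14"))   # -> 2 (128.9.16/20 is the longest match)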

IP Router Lookup
Figure: the destination address is extracted from the incoming packet's header; the forwarding engine performs a next-hop computation against the forwarding table (destination -> next hop).
IPv4 unicast destination-address-based lookup.

Need more than IPv4 unicast lookups
• Multicast
  – PIM-SM: longest prefix matching on the source and group address; try (S, G) followed by (*, *, RP); check incoming interface
  – DVMRP: incoming interface check followed by (S, G) lookup
• IPv6
  – 128-bit destination address field
  – Exact address architecture not yet known

Lookup Performance Required
Line    Rate       40 B packets   240 B packets
T1      1.5 Mbps   4.68 Kpps      0.78 Kpps
OC3     155 Mbps   480 Kpps       -
OC12    622 Mbps   1.94 Mpps      323 Kpps
OC48    2.5 Gbps   7.81 Mpps      1.3 Mpps
OC192   10 Gbps    31.25 Mpps     5.21 Mpps
Gigabit Ethernet (84 B packets): 1.49 Mpps

Size of the Routing Table
Figure: growth of the routing table size over time (plot not recovered).
Source: http://www.telstra.net/ops/bgptable.html

Ternary CAMs
Figure: an associative memory of (value, mask) entries for rules R1-R4, e.g. value 10.0 with mask 255.0.0.0, 10.1.1.0, 10.1.3.1; a priority encoder selects the highest-priority match, which supplies the next hop.

Binary Tries
Figure: a binary trie (branch on bit 0/1 at each level) storing the prefixes below at nodes a-j.
Example prefixes:
a) 00001    b) 00010    c) 00011    d) 001
e) 0101     f) 011      g) 100      h) 1010
i) 1100     j) 11110000
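A small binary trie over the example prefixes a-j, with a longest-prefix lookup that remembers the deepest prefix seen along the path; a minimal sketch:

# Binary trie over the slide's example prefixes, with longest-prefix lookup.
class Node:
    __slots__ = ("children", "label")
    def __init__(self):
        self.children = {}   # "0"/"1" -> Node
        self.label = None    # set if a prefix ends here

root = Node()

def insert(prefix: str, label: str) -> None:
    n = root
    for bit in prefix:
        n = n.children.setdefault(bit, Node())
    n.label = label

def longest_prefix_match(addr_bits: str):
    n, best = root, None
    for bit in addr_bits:
        if bit not in n.children:
            break
        n = n.children[bit]
        if n.label is not None:
            best = n.label       # remember the deepest prefix seen so far
    return best

prefixes = {"00001": "a", "00010": "b", "00011": "c", "001": "d",
            "0101": "e", "011": "f", "100": "g", "1010": "h",
            "1100": "i", "11110000": "j"}
for p, name in prefixes.items():
    insert(p, name)

print(longest_prefix_match("00101100"))   # -> 'd' (prefix 001)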

Patricia Tree
Figure: a path-compressed (Patricia) trie over the same prefixes; one-way chains are collapsed, e.g. a Skip=5 edge on the path to j.
Example prefixes:
a) 00001    b) 00010    c) 00011    d) 001
e) 0101     f) 011      g) 100      h) 1010
i) 1100     j) 11110000

Patricia Tree
Disadvantages:
• Many memory accesses
• May need backtracking
• Pointers take up a lot of space
Advantages:
• General solution
• Extensible to wider fields
Avoid backtracking by storing the intermediate best matched prefix (Dynamic Prefix Tries).
40K entries: 2 MB data structure with 0.3-0.5 Mpps [O(W)]

Binary search on trie levels
Figure: the trie is probed at selected levels (level 0, level 8, level 29) to home in on the longest match for prefix P.

Binary search on trie levels
Store a hash table for each prefix length to aid search at a particular trie level.

Length   Hash table contents
8        10
12
16       10.1, 10.2
24       10.1.1, 10.1.2, 10.2.3

Example prefixes: 10.0/8, 10.1.0.0/16, 10.1.1.0/24, 10.1.2.0/24, 10.2.3.0/24
Example addresses: 10.1.1.4, 10.4.4.3, 10.2.3.9, 10.2.4.8
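A hedged sketch of the scheme: one hash table per prefix length, marker entries at shorter lengths, and a binary search over the lengths. For simplicity markers are inserted at every shorter table length and every entry stores its best matching prefix; the real scheme is more economical:

# Binary search on prefix lengths (Waldvogel et al. style), simplified.
import ipaddress

prefixes = {"10.0.0.0/8": "P1", "10.1.0.0/16": "P2", "10.1.1.0/24": "P3",
            "10.1.2.0/24": "P4", "10.2.3.0/24": "P5"}

def addr_bits(addr: str, n: int) -> str:
    return format(int(ipaddress.ip_address(addr)), "032b")[:n]

lengths = sorted({int(p.split("/")[1]) for p in prefixes})
tables = {l: {} for l in lengths}   # length -> {bit string: best matching prefix label}

def best_match(bitstr: str):
    """Longest real prefix that is a prefix of bitstr (brute force, build time only)."""
    best, best_len = None, -1
    for p, label in prefixes.items():
        net, plen = p.split("/")
        plen = int(plen)
        if plen <= len(bitstr) and bitstr.startswith(addr_bits(net, plen)) and plen > best_len:
            best, best_len = label, plen
    return best

for p in prefixes:
    net, plen = p.split("/")
    plen = int(plen)
    for l in lengths:
        if l <= plen:                       # real entry at l == plen, markers below it
            s = addr_bits(net, l)
            tables[l][s] = best_match(s)

def lookup(addr: str):
    lo, hi, best = 0, len(lengths) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        hit = tables[lengths[mid]].get(addr_bits(addr, lengths[mid]))
        if hit is None:
            hi = mid - 1                    # miss: only shorter prefixes can match
        else:
            best = hit                      # hit (entry or marker): remember its BMP
            lo = mid + 1                    # and try to find something longer
    return best

print(lookup("10.1.1.4"), lookup("10.4.4.3"), lookup("10.2.4.8"))   # P3 P1 P1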

Binary search on trie levels
Disadvantages:
• Multiple hashed memory accesses
• Updates are complex
Advantages:
• Scalable to IPv6
33K entries: 1.4 MB data structure with 1.2-2.2 Mpps [O(log W)]

Compacting Forwarding Tables
Figure: the trie is encoded as a bit vector (1 0 0 0 1 0 ... 1 1 1 0 0 0 1 1), one bit per node, so the table can be stored very compactly.

Compacting Forwarding Tables
Figure: the bit vector (10001010 11100010 10000010 10110100 11000000 ...) is divided into chunks; a codeword array ((R1,0), (R2,3), (R3,7), (R4,9), (R5,0)) and a base index array (0, 13, ...) locate the entries.

Compacting Forwarding Tables
Disadvantages:
• Scalability to larger tables?
• Updates are complex
Advantages:
• Extremely small data structure, can fit in cache
33K entries: 160 KB data structure with average 2 Mpps [O(W/k)]

Multi-bit Tries
Figure: the 16-ary search trie again; each node covers a 4-bit stride.

Compressed Tries
Only 3 memory accesses.
Figure: a compressed trie with levels L8, L16 and L24.

Routing Lookups in Hardware
Figure: histogram of the number of prefixes versus prefix length.
Most prefixes are 24 bits or shorter.

Routing Lookups in Hardware: prefixes up to 24 bits
Figure: a table of 2^24 = 16M entries indexed directly by the first 24 bits of the destination address; e.g. 142.19.6.14 indexes entry 142.19.6, which holds the next hop.

Routing Lookups in Hardware: prefixes above 24 bits
Figure: for longer prefixes the first-level entry (e.g. 128.3.72) holds a pointer (base) into a second-level table; the remaining 8 bits of the address (the 44 in 128.3.72.44) form the offset that selects the next hop.
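A sketch of the two-level lookup above, using dicts in place of the two RAMs; nesting a long prefix inside a shorter one is not handled here (the real scheme pre-fills the second-level block with the shorter route's next hop):

# Two-level hardware-style lookup: a 2^24-entry first level indexed by the
# top 24 bits, plus 256-entry second-level blocks indexed by the last 8 bits.
import ipaddress

tbl24 = {}        # top-24-bits -> ("hop", next_hop) or ("ptr", block_id)
tbl_long = {}     # (block_id, last_8_bits) -> next_hop
next_block = 0

def add_route(prefix: str, next_hop: int) -> None:
    global next_block
    net = ipaddress.ip_network(prefix)
    base = int(net.network_address)
    if net.prefixlen <= 24:
        # expand the prefix into every /24 entry it covers
        for i in range(1 << (24 - net.prefixlen)):
            tbl24[(base >> 8) + i] = ("hop", next_hop)
    else:
        top = base >> 8
        if tbl24.get(top, ("hop", None))[0] != "ptr":
            tbl24[top] = ("ptr", next_block)
            next_block += 1
        block = tbl24[top][1]
        for i in range(1 << (32 - net.prefixlen)):
            tbl_long[(block, (base & 0xFF) + i)] = next_hop

def lookup(addr: str):
    a = int(ipaddress.ip_address(addr))
    kind, val = tbl24.get(a >> 8, ("hop", None))
    if kind == "hop":
        return val                        # one memory access
    return tbl_long.get((val, a & 0xFF)) # second access for long prefixes

add_route("142.19.6.0/24", 7)
add_route("128.3.72.0/26", 9)
print(lookup("142.19.6.14"), lookup("128.3.72.44"))   # -> 7 9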

Routing Lookups in Hardware: prefixes up to n bits
Figure: generalization of the scheme: a first-level table of 2^n entries, N-entry second-level blocks, and further tables for prefixes longer than N+M bits, each entry holding a next hop or a pointer to the next level.

Routing Lookups in Hardware
Disadvantages:
• Large memory required (9-33 MB)
• Depends on prefix length distribution
Advantages:
• 20 Mpps with 50 ns DRAM
• Easy to implement in hardware
Various compression schemes can be employed to decrease the storage requirements, e.g. carefully chosen variable-length strides, bitmap compression, etc.

IP Router Lookups: References
• A. Brodnik, S. Carlsson, M. Degermark, S. Pink, "Small Forwarding Tables for Fast Routing Lookups", Sigcomm 1997, pp. 3-14.
• B. Lampson, V. Srinivasan, G. Varghese, "IP lookups using multiway and multicolumn search", Infocom 1998, pp. 1248-56, vol. 3.
• M. Waldvogel, G. Varghese, J. Turner, B. Plattner, "Scalable high speed IP routing lookups", Sigcomm 1997, pp. 25-36.
• P. Gupta, S. Lin, N. McKeown, "Routing lookups in hardware at memory access speeds", Infocom 1998, pp. 1241-1248, vol. 3.
• S. Nilsson, G. Karlsson, "Fast address lookup for Internet routers", IFIP Intl Conf on Broadband Communications, Stuttgart, Germany, April 1-3, 1998.
• V. Srinivasan, G. Varghese, "Fast IP lookups using controlled prefix expansion", Sigmetrics, June 1998.

Forwarding Decisions
• ATM and MPLS switches
  – Direct Lookup
• Bridges and Ethernet switches
  – Associative Lookup
  – Hashing
  – Trees and tries
• IP Routers
  – Caching
  – CIDR
  – Patricia trees/tries
  – Other methods
• Packet Classification

Providing Value-Added Services: some examples
• Differentiated services
  – Regard traffic from Autonomous System #33 as 'platinum grade'
• Access Control Lists
  – Deny udp host 194.72.33 194.72.6.64 0.0.0.15 eq snmp
• Committed Access Rate
  – Rate-limit WWW traffic from sub-interface #739 to 10 Mbps
• Policy-based Routing
  – Route all voice traffic through the ATM network

Packet Classification
Figure: the incoming packet's header goes to the forwarding engine, which consults a classifier (policy database) of predicate -> action rules; packet classification determines the action applied to the packet.

Multi-field Packet Classification
Given a classifier with N rules, find the action associated with the highest-priority rule matching an incoming packet.
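The baseline solution is a linear scan in priority order; a minimal two-field sketch (the rules and fields are illustrative), which the schemes on the following slides aim to beat:

# Linear-search classifier: first (highest-priority) matching rule wins.
import ipaddress

# priority-ordered rules: (source prefix, destination prefix, action)
rules = [
    (ipaddress.ip_network("144.24.0.0/16"), ipaddress.ip_network("64.0.0.0/24"), "deny"),
    (ipaddress.ip_network("128.16.46.23/32"), ipaddress.ip_network("0.0.0.0/0"), "rate-limit"),
    (ipaddress.ip_network("0.0.0.0/0"), ipaddress.ip_network("0.0.0.0/0"), "permit"),
]

def classify(src: str, dst: str) -> str:
    s, d = ipaddress.ip_address(src), ipaddress.ip_address(dst)
    for src_net, dst_net, action in rules:
        if s in src_net and d in dst_net:
            return action          # first match = highest priority
    return "permit"                # default action

print(classify("144.24.1.1", "64.0.0.7"))    # -> deny
print(classify("10.0.0.1", "10.0.0.2"))      # -> permit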

Geometric Interpretation in 2D
Figure: rules R1-R7 drawn as rectangles in (Field #1, Field #2) space; a prefix pair such as (144.24/16, 64/24) is a rectangle and (128.16.46.23, *) is a line; packets P1 and P2 are points in the space.

Proposed Schemes
Comparison table of proposed classification schemes (spread over three slides; not recovered).

Grid of Tries
Figure: a trie over dimension 1 whose nodes point into tries over dimension 2; rules R1-R7 are stored at nodes of the dimension-2 tries.

Grid of Tries
Disadvantages:
• Static solution
• Not easy to extend to higher dimensions
Advantages:
• Good solution for two dimensions
20K entries: 2 MB data structure with 9 memory accesses [at most 2W]

Classification using Bit Parallelism
Figure: for each dimension the matching rules are represented as a bit vector over rules R1-R4; ANDing the per-dimension vectors gives the rules that match on all fields.

Classification using Bit Parallelism
Disadvantages:
• Large memory bandwidth
• Hardware optimized
Advantages:
• Good solution for multiple dimensions, for small classifiers
512 rules: 1 Mpps with a single FPGA and five 128 KB SRAM chips.

Classification Using Multiple Fields: Recursive Flow Classification
Figure: the packet header (2^S = 2^128 possible values) is split into fields F1...Fn; successive stages of memory lookups recombine the fields, reducing 2^128 through 2^64 and 2^24 down to 2^T = 2^12 possible actions.

Packet Classification: References
• T. V. Lakshman, D. Stiliadis, "High speed policy-based packet forwarding using efficient multi-dimensional range matching", Sigcomm 1998, pp. 191-202.
• V. Srinivasan, S. Suri, G. Varghese and M. Waldvogel, "Fast and scalable layer 4 switching", Sigcomm 1998, pp. 203-214.
• V. Srinivasan, G. Varghese, S. Suri, "Fast packet classification using tuple space search", to be presented at Sigcomm 1999.
• P. Gupta, N. McKeown, "Packet classification using hierarchical intelligent cuttings", Hot Interconnects VII, 1999.
• P. Gupta, N. McKeown, "Packet classification on multiple fields", Sigcomm 1999.

Tutorial Outline
• Introduction: What is a Packet Switch?
• Packet Lookup and Classification: Where does a packet go next?
• Switching Fabrics: How does the packet get there?
• Output Scheduling: When should the packet leave?

Switching Fabrics
• Output and Input Queueing
• Output Queueing
• Input Queueing
  – Scheduling algorithms
  – Combining input and output queues
  – Other non-blocking fabrics
  – Multicast traffic

Basic Architectural Components
1. Datapath: per-packet processing (forwarding table and forwarding decision at each input)
2. Interconnect
3. Output Scheduling

Interconnects: two basic techniques
• Input Queueing: usually a non-blocking switch fabric (e.g. crossbar)
• Output Queueing: usually a fast bus

Interconnects: Output Queueing
Figure: two implementations for N ports:
• Individual output queues: memory b/w = (N+1)R per queue
• Centralized shared memory: memory b/w = 2NR

Output Queueing: the "ideal"
Figure: arriving cells are transferred immediately to their output queues; two cells destined for the same output simply queue at that output.

Output Queueing
How fast can we make centralized shared memory?
• 5 ns SRAM, 200-byte-wide shared bus
• 5 ns per memory operation
• Two memory operations per packet
• Therefore, up to 160 Gb/s
• In practice, closer to 80 Gb/s
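A quick check of the 160 Gb/s figure, assuming one write and one read per packet, each moving a full 200-byte bus word:

# Shared-memory throughput check: 2 x 5 ns accesses move 200 bytes each way.
access_time_ns = 5
ops_per_packet = 2                 # write on arrival, read on departure
bus_bytes = 200

time_per_packet_ns = ops_per_packet * access_time_ns         # 10 ns
throughput_gbps = bus_bytes * 8 / time_per_packet_ns          # bits per ns = Gb/s
print(throughput_gbps)             # -> 160.0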

Switching Fabrics
• Output and Input Queueing
• Output Queueing
• Input Queueing
  – Scheduling algorithms
  – Other non-blocking fabrics
  – Combining input and output queues
  – Multicast traffic

Interconnects: Input Queueing with Crossbar
Figure: each input has its own queue (memory b/w = 2R); a scheduler computes the crossbar configuration that carries data from inputs to outputs.

Input Queueing: Head-of-Line Blocking
Figure: average delay versus load; with FIFO input queues the delay diverges at about 58.6% load, well before 100%.

Head-of-Line Blocking
Figures: cells blocked behind a head-of-line cell destined for a busy output (three slides, illustrations not recovered).

Input Queueing: Virtual Output Queues
Figure: each input keeps a separate queue per output (VOQ), so cells for idle outputs are not blocked behind a head-of-line cell.

Input Queues: Virtual Output Queues
Figure: average delay versus load; with virtual output queues (and a suitable scheduler) the load can approach 100%.

Input Queueing
Figure: input-queued crossbar with memory b/w = 2R per port; the scheduler that configures the crossbar can be quite complex!

Input Queueing: Scheduling

Input Queueing: Scheduling
Figure: VOQ occupancies at four inputs (e.g. 7, 2, 4, 2, 5, 2 cells) form a request graph to four outputs; the scheduler computes a bipartite matching (weight = 18 in the example).
Question: maximum weight or maximum size?

Input Queueing: Scheduling
• Maximum size matching
  – Maximizes instantaneous throughput
  – Does it maximize long-term throughput?
• Maximum weight matching
  – Can clear the most backlogged queues
  – But does it sacrifice long-term throughput?

Input Queueing: Scheduling
Figure: a 2x2 example (not recovered).

Input Queueing: Longest Queue First or Oldest Cell First
Weight = queue length (LQF) or waiting time (OCF).
Figure: a 4x4 example (weights 1, 1, 1, 10) where the maximum-weight matching achieves 100% throughput.

Input Queueing
Why is serving long/old queues better than serving the maximum number of queues?
• When traffic is uniformly distributed, servicing the maximum number of queues leads to 100% throughput.
• When traffic is non-uniform, some queues become longer than others.
• A good algorithm keeps the queue lengths matched, and services a large number of queues.
Figure: average VOQ occupancy per VOQ under uniform and non-uniform traffic.

Input Queueing: Practical Algorithms
• Maximal size algorithms
  – Wave Front Arbiter (WFA)
  – Parallel Iterative Matching (PIM)
  – iSLIP
• Maximal weight algorithms
  – Fair Access Round Robin (FARR)
  – Longest Port First (LPF)

Wave Front Arbiter
Figure: a 4x4 request matrix is resolved into a match by sweeping a diagonal wavefront across the array (requests on the left, match on the right).

Wave Front Arbiter
Figure: another requests/match example (not recovered).

Wave Front Arbiter: Implementation
Figure: a 4x4 array of combinational logic blocks, one per (input, output) pair (1,1) through (4,4), through which grants ripple diagonally.

Wave Front Arbiter: Wrapped WFA (WWFA)
N steps instead of 2N-1.
Figure: the wavefront wraps around the array, so all N diagonals are processed in N steps.

Input Queueing: Practical Algorithms
• Maximal size algorithms
  – Wave Front Arbiter (WFA)
  – Parallel Iterative Matching (PIM)
  – iSLIP
• Maximal weight algorithms
  – Fair Access Round Robin (FARR)
  – Longest Port First (LPF)

Parallel Iterative Matching
Random selection at each stage.
Figure: iteration #1 shows requests, grants and accepts/matches for a 4x4 example; iteration #2 matches among the ports left unmatched.
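A hedged sketch of a PIM-style scheduling pass for a 4x4 switch: random grants at the outputs, random accepts at the inputs, repeated for a few iterations:

# PIM-style matching: random grant, random accept, iterate.
import random

def pim(requests, n, iterations=3):
    match = {}                                     # input -> output
    for _ in range(iterations):
        free_in = set(range(n)) - set(match)
        free_out = set(range(n)) - set(match.values())
        # Grant phase: every free output picks one requesting free input at random.
        grants = {}                                # input -> list of granting outputs
        for out in free_out:
            reqs = [i for i in free_in if out in requests[i]]
            if reqs:
                grants.setdefault(random.choice(reqs), []).append(out)
        # Accept phase: every input that received grants accepts one at random.
        for inp, outs in grants.items():
            match[inp] = random.choice(outs)
    return match

requests = {0: {0, 1}, 1: {0}, 2: {2, 3}, 3: {3}}
print(pim(requests, n=4))    # e.g. {0: 1, 1: 0, 2: 2, 3: 3}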

Parallel Iterative Matching: maximal is not maximum
Figure: a request pattern for which the maximal match found is smaller than the maximum possible match.

Parallel Iterative Matching: analytical results
Number of iterations to converge: on average about log2 N for an N-port switch (the exact expression on the slide was not recovered).

Parallel Iterative Matching
Figures: worked examples of PIM iterations (three slides, not recovered).

Input Queueing: Practical Algorithms
• Maximal size algorithms
  – Wave Front Arbiter (WFA)
  – Parallel Iterative Matching (PIM)
  – iSLIP
• Maximal weight algorithms
  – Fair Access Round Robin (FARR)
  – Longest Port First (LPF)

iSLIP
Round-robin selection at each stage.
Figure: iteration #1 shows requests, grants and accepts/matches for a 4x4 example; iteration #2 matches among the ports left unmatched.

iSLIP Properties
• Random under low load
• TDM under high load
• Lowest priority to MRU
• 1 iteration: fair to outputs
• Converges in at most N iterations; on average <= log2 N
• Implementation: N priority encoders
• Up to 100% throughput for uniform traffic
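A single-iteration iSLIP-style sketch with round-robin grant and accept pointers that advance only on acceptance, the detail behind the desynchronization and the 100% uniform-traffic throughput noted above:

# One iSLIP-style iteration with rotating grant/accept pointers.
N = 4
grant_ptr = [0] * N     # one per output
accept_ptr = [0] * N    # one per input

def islip_iteration(requests):
    """requests[i] = set of outputs input i has cells for; returns input -> output."""
    # Grant phase: each output picks the first requesting input at/after its pointer.
    grants = {}                                    # input -> list of granting outputs
    for out in range(N):
        for k in range(N):
            inp = (grant_ptr[out] + k) % N
            if out in requests.get(inp, ()):
                grants.setdefault(inp, []).append(out)
                break
    # Accept phase: each input picks the first granting output at/after its pointer.
    match = {}
    for inp, outs in grants.items():
        for k in range(N):
            out = (accept_ptr[inp] + k) % N
            if out in outs:
                match[inp] = out
                # Pointers advance one beyond the matched port, only on acceptance.
                accept_ptr[inp] = (out + 1) % N
                grant_ptr[out] = (inp + 1) % N
                break
    return match

print(islip_iteration({0: {0, 1}, 1: {0}, 2: {2}, 3: {2, 3}}))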

iSLIP
Figures: worked examples of iSLIP iterations (two slides, not recovered).

iSLIP Implementation
Figure: N grant arbiters and N accept arbiters built from programmable priority encoders (log2 N bits each), plus per-arbiter state and a decision register.

Input Queueing: References
• M. Karol et al., "Input vs Output Queueing on a Space Division Packet Switch", IEEE Trans. Comm., Dec 1987, pp. 1347-1356.
• Y. Tamir, "Symmetric Crossbar arbiters for VLSI communication switches", IEEE Trans. Parallel and Dist. Sys., Jan 1993, pp. 13-27.
• T. Anderson et al., "High Speed Switch Scheduling for Local Area Networks", ACM Trans. Comp. Sys., Nov 1993, pp. 319-352.
• N. McKeown, "The iSLIP scheduling algorithm for Input-Queued Switches", IEEE Trans. Networking, April 1999, pp. 188-201.
• C. Lund et al., "Fair prioritized scheduling in an input-buffered switch", Proc. of IFIP-IEEE Conf., April 1996, pp. 358-69.
• A. Mekkittikul et al., "A Practical Scheduling Algorithm to Achieve 100% Throughput in Input-Queued Switches", IEEE Infocom 98, April 1998.

Switching Fabrics
• Output and Input Queueing
• Output Queueing
• Input Queueing
  – Scheduling algorithms
  – Other non-blocking fabrics
  – Combining input and output queues
  – Multicast traffic

Other Non-Blocking Fabrics: Clos Network
Figure: a three-stage Clos network (not recovered).

Other Non-Blocking Fabrics: Clos Network
Expansion factor required = 2 - 1/N (but still blocking for multicast).

Other Non-Blocking Fabrics: Self-Routing Networks
Figure: an 8-output banyan network with outputs labelled 000-111; each stage routes on one bit of the destination address.

Other Non-Blocking Fabrics: Self-Routing Networks
The non-blocking Batcher-Banyan network: a Batcher sorter in front of a self-routing (banyan) network.
Figure: an 8-port example showing cells sorted by destination address before entering the self-routing stage.
• Fabric can be used as scheduler.
• Batcher-Banyan network is blocking for multicast.

Switching Fabrics
• Output and Input Queueing
• Output Queueing
• Input Queueing
  – Scheduling algorithms
  – Other non-blocking fabrics
  – Combining input and output queues
  – Multicast traffic

Speedup
• Context
  – input queued switches
  – output queued switches
  – the speedup problem
• Early approaches
• Algorithms
• Implementation considerations

Speedup: Context
Figure: a generic switch; the placement of memory gives
• output queued switches
• input queued switches
• combined input and output queued (CIOQ) switches

Output queued switches
• Best delay and throughput performance
• Possible to erect "bandwidth firewalls" between sessions
Main problem:
• Requires high fabric speedup (S = N)
• Unsuitable for high-speed switching

Input queued switches
Big advantage:
• Speedup of one is sufficient
Main problem:
• Can't guarantee delay due to input contention
Overcoming input contention: use higher speedup.

A Comparison
Memory speeds for a 32x32 switch:

             Output queued              Input queued
Line Rate    Memory BW   Access Time    Memory BW   Access Time (per cell)
100 Mb/s     3.3 Gb/s    128 ns         200 Mb/s    2.12 us
1 Gb/s       33 Gb/s     12.8 ns        2 Gb/s      212 ns
2.5 Gb/s     82.5 Gb/s   5.12 ns        5 Gb/s      84.8 ns
10 Gb/s      330 Gb/s    1.28 ns        20 Gb/s     21.2 ns
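Where the table's numbers come from, assuming 53-byte cells (an assumption; the slide does not state the cell size): an output-queued port memory must absorb up to N writes plus one read per cell time ((N+1)R of bandwidth), while an input-queued port memory needs only one write and one read (2R):

# Reproducing the comparison table, assuming 53-byte (ATM) cells.
N = 32
cell_bits = 53 * 8

for line_rate in (100e6, 1e9, 2.5e9, 10e9):           # b/s
    cell_time = cell_bits / line_rate                  # seconds per cell
    oq_bw, iq_bw = (N + 1) * line_rate, 2 * line_rate
    oq_access, iq_access = cell_time / (N + 1), cell_time / 2
    print(f"{line_rate/1e9:>5.1f} Gb/s: OQ {oq_bw/1e9:.1f} Gb/s, {oq_access*1e9:.1f} ns; "
          f"IQ {iq_bw/1e9:.1f} Gb/s, {iq_access*1e9:.0f} ns")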

The Speedup Problem
Find a compromise: 1 < speedup << N
• to get the performance of an OQ switch
• close to the cost of an IQ switch
Essential for high-speed QoS switching.

Some Early Approaches
Probabilistic analyses:
• assume traffic models (Bernoulli, Markov-modulated, non-uniform loading, "friendly correlated")
• obtain mean throughput and delays, bounds on tails
• analyze different fabrics (crossbar, multistage, etc.)
Numerical methods:
• use actual and simulated traffic traces
• run different algorithms
• set the "speedup dial" at various values

The findings
Very tantalizing... under different settings (traffic, loading, algorithm, etc.) and even for varying switch sizes, a speedup of between 2 and 5 was sufficient!

Using Speedup
Figure: example of cell transfers with speedup (not recovered).

Intuition
Bernoulli IID inputs, speedup = 1:
• Fabric throughput = 0.58
Bernoulli IID inputs, speedup = 2:
• Fabric throughput = 1.16
• Input efficiency = 1/1.16
• Average input queue = 6.25

Intuition (continued)
Bernoulli IID inputs, speedup = 3:
• Fabric throughput = 1.74
• Input efficiency = 1/1.74
• Average input queue = 1.35
Bernoulli IID inputs, speedup = 4:
• Fabric throughput = 2.32
• Input efficiency = 1/2.32
• Average input queue = 0.75

Issues
• Need hard guarantees: exact, not average
• Robustness: realistic, even adversarial, traffic (not friendly Bernoulli IID)

The Ideal Solution
Figure: an output queued switch (speedup = N) compared with an unknown switch with speedup << N.
Question: can we find a simple and good algorithm that exactly mimics output queueing, regardless of switch size and traffic pattern?

What is exact mimicking?
• Apply the same inputs to an OQ and a CIOQ switch, packet by packet
• Obtain the same outputs, packet by packet

Algorithm MUCF
Key concept: urgency value
urgency = departure time - present time

MUCF: the algorithm
• Outputs try to get their most urgent packets.
• Inputs grant to the output whose packet is most urgent; ties are broken by port number.
• Losing outputs try for their next most urgent packet.
• The algorithm terminates when no more matchings are possible.
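A simplified sketch of the MUCF matching step, with urgency values supplied directly rather than derived from a shadow output-queued switch as in the real algorithm; the tie-breaking rule follows the slide:

# MUCF-style matching: outputs propose for their most urgent packet,
# inputs keep the most urgent proposal, losing outputs try again.

# urgency[(input, output)] = departure time - present time (lower = more urgent)
urgency = {(0, 0): 1, (1, 0): 3, (1, 1): 2, (2, 1): 5, (2, 2): 1}

def mucf_match(urgency, n):
    # Each output's candidate inputs, most urgent first.
    prefs = {o: sorted((u, i) for (i, oo), u in urgency.items() if oo == o)
             for o in range(n)}
    match_in, match_out = {}, {}
    free_outputs = [o for o in range(n) if prefs[o]]
    while free_outputs:
        out = free_outputs.pop(0)
        while prefs[out]:
            u, inp = prefs[out].pop(0)            # output's most urgent remaining packet
            current = match_in.get(inp)
            if current is None:
                match_in[inp], match_out[out] = out, inp
                break
            # Input already granted to another output: keep the more urgent
            # request (ties broken by output port number).
            cur_u = urgency[(inp, current)]
            if (u, out) < (cur_u, current):
                del match_out[current]
                free_outputs.append(current)       # displaced output tries again
                match_in[inp], match_out[out] = out, inp
                break
    return match_in

print(mucf_match(urgency, n=3))   # -> {0: 0, 1: 1, 2: 2}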

Stable Marriage Problem
Men = Outputs: Bill, John, Pedro
Women = Inputs: Hillary, Monica, Maria

An example
Observation: there are only two reasons a packet doesn't get to its output: input contention and output contention.
This is why a speedup of 2 works!!

What does this get us?
• A speedup of 4 is sufficient for exact emulation of FIFO OQ switches with MUCF.
• What about non-FIFO OQ switches? E.g. WFQ, strict priority.

Other results
To exactly emulate an NxN OQ switch:
• A speedup of 2 - 1/N is necessary and sufficient (hence a speedup of 2 is sufficient for all N).
• Input traffic patterns can be absolutely arbitrary.
• The emulated OQ switch may use any "monotone" scheduling policy, e.g. FIFO, LIFO, strict priority, WFQ, etc.

What gives?
Complexity of the algorithms:
• extra hardware for processing
• extra run time (time complexity)
What is the benefit?
• reduced memory bandwidth requirements
Tradeoff: memory for processing. Moore's Law supports this tradeoff.

Implementation: a closer look
Main sources of difficulty:
• Estimating urgency, etc.: the information is distributed (and must be communicated among inputs and outputs)
• The matching process: too many iterations?
Estimating urgency depends on what is being emulated (like taking a ticket to hold a place in a queue):
• FIFO, strict priorities: no problem
• WFQ, etc.: problems

Implementation (contd.)
The matching process is a variant of the stable marriage problem:
• Worst-case number of iterations for SMP = N^2
• Worst-case number of iterations in switching = N
• With high probability, and on average, approximately log(N)

Other Work
Relax the stringent requirement of exact emulation:
• Least Occupied Output First Algorithm (LOOFA)
  – keeps outputs always busy if there are packets
  – by time-stamping packets, it also exactly mimics
Disallow arbitrary inputs:
• e.g. leaky-bucket constrained
• obtain worst-case delay bounds

References for speedup
• Y. Oie et al., "Effect of speedup in nonblocking packet switch", ICC 89.
• A. L. Gupta, N. D. Georganas, "Analysis of a packet switch with input and output buffers and speed constraints", Infocom 91.
• S.-T. Chuang et al., "Matching output queueing with a combined input and output queued switch", IEEE JSAC, vol. 17, no. 6, 1999.
• B. Prabhakar, N. McKeown, "On the speedup required for combined input and output queued switching", Automatica, vol. 35, 1999.
• P. Krishna et al., "On the speedup required for work-conserving crossbar switches", IEEE JSAC, vol. 17, no. 6, 1999.
• A. Charny, "Providing QoS guarantees in input-buffered crossbar switches with speedup", PhD Thesis, MIT, 1998.

Switching Fabrics
• Output and Input Queueing
• Output Queueing
• Input Queueing
  – Scheduling algorithms
  – Other non-blocking fabrics
  – Combining input and output queues
  – Multicast traffic

Multicast Switching
• The problem
• Switching with crossbar fabrics
• Switching with other fabrics

Multicasting
Figure: a cell arriving at one input must be delivered to several outputs (ports 1-6 shown).

Crossbar fabrics: Method 1
Copy network + unicast switching.
Copy networks: increased hardware, increased input contention.

Method 2
Use the copying properties of the crossbar fabric itself.
• No fanout splitting: easy, but low throughput.
• Fanout splitting: higher throughput, but not as simple; leaves "residue".

The effect of fanout splitting
Figure: performance of an 8x8 switch with and without fanout splitting under uniform IID traffic (plot not recovered).

Placement of residue
Key question: how should outputs grant requests (and hence decide the placement of residue)?

Residue and throughput
Result: concentrating residue brings more new work forward, and hence leads to higher throughput. But there are fairness problems to deal with.
This and other problems can be looked at in a unified way by mapping the multicasting problem onto a variation of Tetris.

Multicasting and Tetris
Figure: residue from input ports 1-5 shown as blocks stacked over output ports 1-5, Tetris-style.

Multicasting and Tetris
Figure: the same example with the residue concentrated on as few inputs as possible.

Replication by recycling
Main idea: make two copies at a time using a binary tree, with the input at the root and all possible destination outputs at the leaves.
Figure: a binary replication tree with internal nodes x, y and leaves a-e.

Replication by recycling (cont'd)
Figure: receive, resequencing and transmit stages around the switching network, with an output table and a recycle path for additional copies.
• Scalable to large fanouts.
• Needs resequencing at outputs and introduces variable delays.

References for Multicasting
• J. Hayes et al., "Performance analysis of a multicast switch", IEEE Trans. on Communications, vol. 39, April 1991.
• B. Prabhakar et al., "Tetris models for multicast switches", Proc. of the 30th Annual Conference on Information Sciences and Systems, 1996.
• B. Prabhakar et al., "Multicast scheduling for input-queued switches", IEEE JSAC, 1997.
• J. Turner, "An optimal nonblocking multicast virtual circuit switch", INFOCOM, 1994.

Tutorial Outline
• Introduction: What is a Packet Switch?
• Packet Lookup and Classification: Where does a packet go next?
• Switching Fabrics: How does the packet get there?
• Output Scheduling: When should the packet leave?

Output Scheduling
• What is output scheduling?
• How is it done?
• Practical considerations

Output Scheduling
• Allocating output bandwidth
• Controlling packet delay
Figure: a scheduler serving multiple queues at an output port.

Output Scheduling
Figure: FIFO versus Fair Queueing at an output port (illustration not recovered).

Motivation
• FIFO is natural but gives poor QoS
  – bursty flows increase delays for others
  – hence cannot guarantee delays
• Need round-robin scheduling of packets
  – Fair Queueing
  – Weighted Fair Queueing, Generalized Processor Sharing

Fair queueing: main issues
• Level of granularity
  – packet by packet? (favors long packets)
  – bit by bit? (ideal, but very complicated)
• Packet Generalized Processor Sharing (PGPS)
  – serves packet by packet
  – and imitates the bit-by-bit schedule within a tolerance

How does WFQ work?
Figure: three flows with weights WR = 1, WG = 5, WP = 2 share an output link; each receives service in proportion to its weight.
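A hedged sketch of packet-by-packet WFQ for the weights on the slide, assuming all three flows stay backlogged (which sidesteps the virtual-time bookkeeping needed when flows go idle):

# Packet-by-packet WFQ for continuously backlogged flows: each packet gets
# a finish number F = F_prev(flow) + length/weight; send in increasing F order.
import heapq

weights = {"R": 1, "G": 5, "P": 2}
finish = {f: 0.0 for f in weights}          # last finish number per flow
heap = []                                   # (finish number, flow, packet id)

def enqueue(flow, length, pkt_id):
    finish[flow] += length / weights[flow]
    heapq.heappush(heap, (finish[flow], flow, pkt_id))

# Three equal-length packets from each flow, all backlogged at time 0.
for i in range(3):
    for flow in weights:
        enqueue(flow, length=1000, pkt_id=i)

order = []
while heap:
    _, flow, pkt = heapq.heappop(heap)
    order.append(f"{flow}{pkt}")
print(order)   # G (weight 5) is served most often, then P (2), then R (1)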

Delay guarantees
Theorem: if flows are leaky-bucket constrained and all nodes employ GPS (WFQ), then the network can guarantee worst-case delay bounds to sessions.

Practical considerations
For every packet, the scheduler needs to:
• classify it into the right flow queue and maintain a linked list for each flow
• schedule it for departure
The complexity of both is O(log [# of flows]):
• the first is hard to overcome
• the second can be overcome by DRR

Deficit Round Robin
Figure: per-flow queues holding packets of various sizes (700, 50, 250, 400, 200, 600, 500, 250, 750, 500, 100, 400 bytes) served with a quantum size of 500; each flow keeps a deficit counter.
Good approximation of FQ; much simpler to implement.
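A minimal DRR sketch matching the description above; the queue contents and the 500-byte quantum follow the flavor of the slide's example:

# Deficit Round Robin: each active queue earns a quantum of credit per round
# and sends packets as long as its deficit counter covers them.
from collections import deque

QUANTUM = 500
queues = {                       # flow -> packet sizes in bytes
    "A": deque([700, 50, 250]),
    "B": deque([400, 200, 600]),
    "C": deque([500, 250, 750]),
}
deficit = {f: 0 for f in queues}

def drr_round(order=("A", "B", "C")):
    sent = []
    for f in order:
        if not queues[f]:
            deficit[f] = 0               # idle flows do not keep credit
            continue
        deficit[f] += QUANTUM
        while queues[f] and queues[f][0] <= deficit[f]:
            pkt = queues[f].popleft()
            deficit[f] -= pkt
            sent.append((f, pkt))
    return sent

print(drr_round())   # round 1: [('B', 400), ('C', 500)]
print(drr_round())   # round 2: A can now afford its 700-byte packet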

But...
WFQ is still very hard to implement:
• classification is a problem
• it needs to maintain too much state information
• it doesn't scale well

Strict Priorities and Diff Serv
Classify flows into priority classes:
• maintain only per-class queues
• perform FIFO within each class
• avoid the "curse of dimensionality"

Diff Serv
A framework for providing differentiated QoS:
• set Type of Service (ToS) bits in packet headers
• this classifies packets into classes
• routers maintain per-class queues
• condition traffic at network edges to conform to class requirements
May still need queue management inside the network.

References for O/p Scheduling
• A. Demers et al., "Analysis and simulation of a fair queueing algorithm", ACM SIGCOMM 1989.
• A. Parekh, R. Gallager, "A generalized processor sharing approach to flow control in integrated services networks: the single node case", IEEE Trans. on Networking, June 1993.
• A. Parekh, R. Gallager, "A generalized processor sharing approach to flow control in integrated services networks: the multiple node case", IEEE Trans. on Networking, August 1993.
• M. Shreedhar, G. Varghese, "Efficient Fair Queueing using Deficit Round Robin", ACM SIGCOMM, 1995.
• K. Nichols, S. Blake (eds), "Differentiated Services: Operational Model and Definitions", Internet Draft, 1998.

Active Queue Management
• Problems with traditional queue management
  – tail drop
• Active queue management
  – goals
  – an example
  – effectiveness

Tail Drop Queue Management
Figure: lock-out behavior once the queue reaches its maximum length (illustration not recovered).

Tail Drop Queue Management
Drop packets only when the queue is full:
• long steady-state delay
• global synchronization
• bias against bursty traffic

Global Synchronization
Figure: many flows back off at the same time when the queue hits its maximum length (illustration not recovered).

Bias Against Bursty Traffic
Figure: a burst arriving at a full queue loses many packets in a row (illustration not recovered).

Alternative Queue Management Schemes
• Drop from front on full queue
• Drop at random on full queue
Both solve the lock-out problem; both still have the full-queues problem.

Active Queue Management Goals
• Solve the lock-out and full-queue problems
  – no lock-out behavior
  – no global synchronization
  – no bias against bursty flows
• Provide better QoS at a router
  – low steady-state delay
  – lower packet dropping

Active Queue Management
• Problems with traditional queue management
  – tail drop
• Active queue management
  – goals
  – an example
  – effectiveness

Random Early Detection (RED)
Figure: drop probability rises (P1, P2, ..., Pk) as the average queue size qavg moves between minth and maxth.
• if qavg < minth: admit every packet
• else if qavg <= maxth: drop an incoming packet with p = (qavg - minth)/(maxth - minth)
• else if qavg > maxth: drop every incoming packet
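The drop rule above transcribed directly into code; the probability ceiling maxp is taken as 1 because the slide's formula omits it, and the weighted averaging that produces qavg is not shown. The thresholds are illustrative values:

# RED drop decision following the slide's rule.
import random

MIN_TH, MAX_TH = 5, 15          # thresholds in packets (illustrative values)

def red_drop(qavg: float) -> bool:
    """Return True if the arriving packet should be dropped."""
    if qavg < MIN_TH:
        return False                                  # admit every packet
    if qavg <= MAX_TH:
        p = (qavg - MIN_TH) / (MAX_TH - MIN_TH)       # early-drop probability
        return random.random() < p
    return True                                       # qavg above maxth: drop all

for q in (3, 8, 12, 20):
    print(q, red_drop(q))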

Effectiveness of RED: Lock-Out
• Packets are randomly dropped
• Each flow has the same probability of being discarded

Effectiveness of RED: Full Queue
• Drop packets probabilistically in anticipation of congestion (not when the queue is full)
• Use qavg to decide the packet dropping probability: allow instantaneous bursts
• Randomness avoids global synchronization

What QoS does RED Provide?
• Lower buffer delay: good interactive service
  – qavg is controlled to be small
• Given responsive flows: packet dropping is reduced
  – early congestion indication allows traffic to throttle back before congestion
• Given responsive flows: fair bandwidth allocation

Unresponsive or aggressive flows
• Don't properly back off during congestion
• Take away bandwidth from TCP-compatible flows
• Monopolize buffer space

Control Unresponsive Flows
Some active queue management schemes:
• RED with penalty box
• Flow RED (FRED)
• Stabilized RED (SRED)
These identify and penalize unresponsive flows with a bit of extra work.

Active Queue Management References
• B. Braden et al., "Recommendations on queue management and congestion avoidance in the internet", RFC 2309, 1998.
• S. Floyd, V. Jacobson, "Random early detection gateways for congestion avoidance", IEEE/ACM Trans. on Networking, 1(4), Aug. 1993.
• D. Lin, R. Morris, "Dynamics of Random Early Detection", ACM SIGCOMM, 1997.
• T. Ott et al., "SRED: Stabilized RED", INFOCOM 1999.
• S. Floyd, K. Fall, "Router mechanisms to support end-to-end congestion control", LBL technical report, 1997.

Tutorial Outline
• Introduction: What is a Packet Switch?
• Packet Lookup and Classification: Where does a packet go next?
• Switching Fabrics: How does the packet get there?
• Output Scheduling: When should the packet leave?

Basic Architectural Components
Figure: the building blocks (Admission Control, Policing, Congestion Control, Routing, Switching, Reservation, Output Scheduling) split between the control plane and the per-packet datapath.

Basic Architectural Components
1. Datapath: per-packet processing (forwarding table and forwarding decision at each input)
2. Interconnect
3. Output Scheduling